..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

L3 Forwarding with Power Management Sample Application
======================================================

Introduction
------------

The L3 Forwarding with Power Management application is an example
of power-aware packet processing using the DPDK.
The application is based on the existing L3 forwarding sample application,
with power management algorithms added to control the P-states and
C-states of the Intel processor via a power management library.

Overview
--------

The application demonstrates the use of the Power libraries in the DPDK to implement packet forwarding.
The initialization and run-time paths are very similar to those of the :doc:`l3_forward`.
The main difference from the L3 Forwarding sample application is that this application introduces power-aware optimization algorithms
by leveraging the Power library to control the P-state and C-state of the processor based on packet load.

The DPDK includes poll-mode drivers (PMDs) to configure Intel NIC devices and their receive (Rx) and transmit (Tx) queues.
The design principle of a PMD is to access the Rx and Tx descriptors directly, without any interrupts, to quickly receive,
process and deliver packets in user space.

In general, the DPDK executes an endless packet processing loop on dedicated IA cores that includes the following steps:

*   Retrieve input packets through the PMD by polling the Rx queue

*   Process each received packet or provide received packets to other processing cores through software queues

*   Send pending output packets to the Tx queue through the PMD

In this way, the PMD achieves better performance than a traditional interrupt-mode driver,
at the cost of keeping cores active and running at the highest frequency,
hence consuming the maximum power all the time.
However, during periods of light network traffic,
which occur regularly in communication infrastructure systems due to the well-known "tidal effect",
the PMD still busy-waits for network packets, which wastes a lot of power.

Processor performance states (P-states) are the capability of an Intel processor
to switch between different supported operating frequencies and voltages.
If configured correctly, according to system workload, this feature provides power savings.
CPUFreq is the infrastructure provided by the Linux* kernel to control the processor performance state capability.
CPUFreq supports a user space governor that enables setting the frequency by manipulating a virtual file device from a user space application.
The Power library in the DPDK provides a set of APIs for manipulating this virtual file device, allowing a user space application
to set the CPUFreq governor and set the frequency of specific cores.
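
As a rough illustration of the virtual file interface described above, the sketch below
builds the per-core CPUFreq sysfs path and writes a value to it.
It is not the Power library's implementation: the helper names ``cpufreq_attr_path()`` and
``cpufreq_attr_write()`` are hypothetical, and the sketch assumes the standard Linux CPUFreq sysfs layout.

.. code-block:: c

    #include <stdio.h>

    /* Hypothetical helper: format the sysfs path of a CPUFreq attribute
     * (for example "scaling_governor" or "scaling_setspeed") for one core. */
    static int
    cpufreq_attr_path(char *buf, size_t len, unsigned int core, const char *attr)
    {
        return snprintf(buf, len,
                        "/sys/devices/system/cpu/cpu%u/cpufreq/%s", core, attr);
    }

    /* Hypothetical helper: write a string value to a CPUFreq attribute.
     * Returns 0 on success, -1 on failure (missing file, no permission). */
    static int
    cpufreq_attr_write(unsigned int core, const char *attr, const char *value)
    {
        char path[128];
        FILE *f;
        int ret;

        cpufreq_attr_path(path, sizeof(path), core, attr);
        f = fopen(path, "w");
        if (f == NULL)
            return -1;
        ret = (fputs(value, f) == EOF) ? -1 : 0;
        if (fclose(f) != 0)
            ret = -1;
        return ret;
    }

With these helpers, selecting the userspace governor on core 2 and then requesting 1.8 GHz would look like
``cpufreq_attr_write(2, "scaling_governor", "userspace")`` followed by
``cpufreq_attr_write(2, "scaling_setspeed", "1800000")`` (the value is in kHz); both calls normally require root privileges.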

This application includes a P-state power management algorithm to generate a frequency hint to be sent to CPUFreq.
The algorithm uses the number of received and available Rx packets on recent polls to make a heuristic decision to scale frequency up/down.
Specifically, thresholds are checked to see whether a core running a DPDK polling thread needs to increase its frequency
by one step, based on how close to full the polled Rx queues have recently been.
The algorithm also decreases the frequency by one step if the number of packets processed per loop is far below the expected threshold
or the thread's sleeping time exceeds a threshold.

C-states are also known as sleep states.
They allow software to put an Intel core into a low power idle state from which it is possible to exit via an event, such as an interrupt.
However, there is a tradeoff between the power consumed in the idle state and the time required to wake up from the idle state (exit latency).
Therefore, as you go into deeper C-states, the power consumed is lower but the exit latency is increased. Each C-state has a target residency.
It is essential that when entering into a C-state, the core remains in this C-state for at least as long as the target residency in order
to fully realize the benefits of entering the C-state.
CPUIdle is the infrastructure provided by the Linux kernel to control the processor C-state capability.
Unlike CPUFreq, CPUIdle does not provide a mechanism that allows the application to change C-state.
Instead, it has its own heuristic algorithms in kernel space that select the target C-state to enter by executing privileged instructions like HLT and MWAIT,
based on the speculative sleep duration of the core.
In this application, we introduce a heuristic algorithm that allows packet processing cores to sleep for a short period
if no Rx packets were received on recent polls.
In this way, CPUIdle automatically forces the corresponding cores to enter deeper C-states
instead of always running in the C0 state waiting for packets.
However, the user can set the CPU resume latency to control C-state selection.
Setting the CPU resume latency to 0
restricts the CPU to the C0 state to improve performance,
which may increase the power consumption of the platform.

.. note::

    To fully demonstrate the power saving capability of using C-states,
    it is recommended to enable deeper C3 and C6 states in the BIOS during system boot up.

Compiling the Application
-------------------------

To compile the sample application, see :doc:`compiling`.

The application is located in the ``l3fwd-power`` sub-directory.

Running the Application
-----------------------

The application has a number of command line options:

.. code-block:: console

    ./<build_dir>/examples/dpdk-l3fwd-power [EAL options] -- -p PORTMASK [-P] --config (port,queue,lcore)[,(port,queue,lcore)] [--max-pkt-len PKTLEN] [--no-numa]

where,

*   -p PORTMASK: Hexadecimal bitmask of ports to configure

*   -P: Sets all ports to promiscuous mode so that packets are accepted regardless of the packet's Ethernet MAC destination address.
    Without this option, only packets with the Ethernet MAC destination address set to the Ethernet address of the port are accepted.

*   -u: optional, sets uncore min/max frequency to minimum value.

*   -U: optional, sets uncore min/max frequency to maximum value.

*   -i (frequency index): optional, sets uncore frequency to frequency index value, by setting min and max values to be the same.

*   --config (port,queue,lcore)[,(port,queue,lcore)]: determines which queues from which ports are mapped to which cores.

*   --cpu-resume-latency LATENCY: sets the CPU resume latency to control C-state selection; a value of 0 allows only the C0 state.

*   --max-pkt-len: optional, maximum packet length in decimal (64-9600)

*   --no-numa: optional, disables NUMA awareness

*   --telemetry: Telemetry mode.

*   --pmd-mgmt: PMD power management mode.

*   --max-empty-polls: Number of empty polls to wait before entering sleep state. Applies to --pmd-mgmt mode only.

*   --pause-duration: Set the duration of the pause callback (microseconds). Applies to --pmd-mgmt mode only.

*   --scale-freq-min: Set minimum frequency for scaling. Applies to --pmd-mgmt mode only.

*   --scale-freq-max: Set maximum frequency for scaling. Applies to --pmd-mgmt mode only.

See :doc:`l3_forward` for details.
The L3fwd-power example reuses the L3fwd command line options.
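
To make the ``--config`` syntax concrete, the following sketch parses a string such as
``(0,0,2),(0,1,3)`` into (port, queue, lcore) tuples.
It is only an illustration of the argument format and is not taken from the application source;
the ``struct lcore_params`` layout and the ``parse_config()`` helper are assumptions made for this example.

.. code-block:: c

    #include <stdio.h>

    /* Illustrative tuple: one Rx queue of one port handled by one lcore. */
    struct lcore_params {
        unsigned int port_id;
        unsigned int queue_id;
        unsigned int lcore_id;
    };

    /* Parse "(p,q,c)[,(p,q,c)]..." into out[].
     * Returns the number of tuples parsed, or -1 on a syntax error
     * or when more than max tuples are supplied. */
    static int
    parse_config(const char *arg, struct lcore_params *out, int max)
    {
        int n = 0;

        while (*arg != '\0') {
            int used = 0;

            if (n >= max)
                return -1;
            /* %n records how many characters the tuple consumed;
             * it stays 0 if the closing parenthesis is missing. */
            if (sscanf(arg, "(%u,%u,%u)%n", &out[n].port_id,
                       &out[n].queue_id, &out[n].lcore_id, &used) != 3 ||
                used == 0)
                return -1;
            arg += used;
            n++;
            if (*arg == ',') {
                arg++;
                if (*arg != '(')
                    return -1;
            } else if (*arg != '\0') {
                return -1;
            }
        }
        return n;
    }

For ``--config="(0,0,2),(0,1,3)"`` this yields two tuples: queue 0 of port 0 is polled by lcore 2, and queue 1 of port 0 by lcore 3.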

Explanation
-----------

The following sections provide an explanation of the sample application code.
As mentioned in the overview section,
the initialization and run-time paths are very similar to those of the L3 forwarding application.
The following sections describe aspects that are specific to the L3 Forwarding with Power Management sample application.

Power Library Initialization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Power library is initialized in the main routine.
It changes the P-state governor to userspace for specific cores that are under control.
The Timer library is also initialized and several timers are created later on.
These timers are responsible for checking, based on CPU utilization statistics, whether the frequency needs to be scaled down at run time.

.. note::

    Only the power management related initialization is shown.

.. literalinclude:: ../../../examples/l3fwd-power/main.c
    :language: c
    :start-after: Power library initialized in the main routine. 8<
    :end-before: >8 End of power library initialization.

Monitoring Loads of Rx Queues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In general, the polling nature of the DPDK prevents the OS power management subsystem from knowing
if the network load is actually heavy or light.
In this sample, network load is sampled by monitoring the received and
available descriptors on NIC Rx queues in recent polls.
Based on the number of returned and available Rx descriptors,
this example implements algorithms to generate frequency scaling hints and speculative sleep durations,
and uses them to control the P-state and C-state of processors via the power management library.
Frequency (P-state) control and sleep state (C-state) control work individually for each logical core,
and their combination contributes to a power efficient packet processing solution when serving light network loads.

The rte_eth_rx_burst() function and the rte_eth_rx_queue_count() function are used in the endless packet processing loop
to return the number of received and available Rx descriptors.
These per-queue numbers are passed to the P-state and C-state heuristic algorithms
to generate hints based on recent network load trends.

.. note::

    Only power control related code is shown.

.. literalinclude:: ../../../examples/l3fwd-power/main.c
    :language: c
    :start-after: Main processing loop. 8<
    :end-before: >8 End of main processing loop.

P-State Heuristic Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The power_freq_scaleup_heuristic() function is responsible for generating a frequency hint for the specified logical core
according to the number of available descriptors returned from rte_eth_rx_queue_count().
On every poll for new packets, the number of available descriptors on an Rx queue is evaluated,
and the algorithm used for frequency hinting is as follows:

*   If the number of available descriptors exceeds 96, the maximum frequency is hinted.

*   If the number of available descriptors exceeds 64, a trend counter is incremented by 100.

*   If the number of available descriptors exceeds 32, the trend counter is incremented by 1.

*   When the trend counter reaches 10000, the frequency hint is changed to the next higher frequency.

.. note::

    The assumption is that the Rx queue size is 128. The thresholds specified above
    must be adjusted based on the actual hardware Rx queue size,
    which is configured via the rte_eth_rx_queue_setup() function.

In general, a thread needs to poll packets from multiple Rx queues.
Most likely, different queues have different loads, so they return different frequency hints.
The algorithm evaluates all the hints and then scales up frequency in an aggressive manner
by scaling up to the highest frequency as long as at least one Rx queue requires it.
In this way, we can minimize any negative performance impact.
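
The scale-up rules above can be sketched in plain C as follows.
This is a simplified model rather than the application's ``power_freq_scaleup_heuristic()``:
the type and function names are hypothetical, the thresholds assume the 128-entry Rx queue mentioned in the note,
and resetting the trend counter after a hint is issued is an assumption of this sketch.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical hint values, ordered from least to most aggressive. */
    enum freq_hint {
        FREQ_HINT_NONE = 0,  /* keep the current frequency */
        FREQ_HINT_UP   = 1,  /* move to the next higher frequency */
        FREQ_HINT_MAX  = 2   /* jump to the maximum frequency */
    };

    /* Hypothetical per-queue state holding the trend counter. */
    struct rx_queue_state {
        uint32_t trend;
    };

    /* Per-queue hint from the number of available Rx descriptors. */
    static enum freq_hint
    scaleup_hint(struct rx_queue_state *q, uint32_t avail_desc)
    {
        if (avail_desc > 96) {
            q->trend = 0;            /* assumption: restart the trend */
            return FREQ_HINT_MAX;
        }
        if (avail_desc > 64)
            q->trend += 100;
        else if (avail_desc > 32)
            q->trend += 1;

        if (q->trend >= 10000) {
            q->trend = 0;            /* assumption: restart the trend */
            return FREQ_HINT_UP;
        }
        return FREQ_HINT_NONE;
    }

    /* Per-lcore hint: the most aggressive hint of any polled queue wins. */
    static enum freq_hint
    lcore_scaleup_hint(struct rx_queue_state *queues, const uint32_t *avail,
                       unsigned int nb_queues)
    {
        enum freq_hint best = FREQ_HINT_NONE;
        unsigned int i;

        for (i = 0; i < nb_queues; i++) {
            enum freq_hint hint = scaleup_hint(&queues[i], avail[i]);

            if (hint > best)
                best = hint;
        }
        return best;
    }

With these weights, a queue that stays above 64 available descriptors triggers a one-step increase after 100 polls,
while a queue that only stays above 32 needs 10000 polls to do the same.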

On the other hand, frequency scaling down is controlled in the timer callback function.
Specifically, if the sleep times of a logical core indicate that it is sleeping more than 25% of the sampling period,
or if the average number of packets per iteration is less than expected, the frequency is decreased by one step.
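
A minimal sketch of this scale-down decision is shown below.
The function name and parameters are hypothetical, and the expected number of packets per iteration
(``min_pkts_per_iter``) is left as a caller-supplied value because the text does not state it.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Decide, at the end of a sampling period, whether to step the frequency down.
     * sleep_us:           time the lcore spent sleeping during the period
     * period_us:          length of the sampling period
     * nb_rx:              packets received during the period
     * nb_iter:            polling loop iterations during the period
     * min_pkts_per_iter:  expected packets per iteration (assumed input) */
    static bool
    should_scale_down(uint64_t sleep_us, uint64_t period_us,
                      uint64_t nb_rx, uint64_t nb_iter,
                      uint64_t min_pkts_per_iter)
    {
        /* Sleeping for more than 25% of the sampling period. */
        if (sleep_us * 4 > period_us)
            return true;
        /* Average packets per iteration below the expected value;
         * written as a multiplication to avoid integer division. */
        if (nb_iter > 0 && nb_rx < min_pkts_per_iter * nb_iter)
            return true;
        return false;
    }

For example, a core that slept for 30 ms of a 100 ms period would be stepped down regardless of its packet rate.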

C-State Heuristic Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Whenever recent rte_eth_rx_burst() polls return zero packets 5 consecutive times,
an idle counter begins incrementing for each successive zero poll.
At the same time, the function power_idle_heuristic() is called to generate a speculative sleep duration
in order to force the logical core to enter a deeper sleeping C-state.
There is no way to control the C-state directly, and the CPUIdle subsystem in the OS is intelligent enough
to select the C-state to enter based on the actual sleep period of the given logical core.
The algorithm has the following sleeping behavior depending on the idle counter:

*   If the idle count is less than 100, the counter value is used as a microsecond sleep value through rte_delay_us(),
    which executes pause instructions to avoid a costly context switch while still saving power.

*   If the idle count is between 100 and 999, a fixed sleep interval of 100 μs is used.
    A 100 μs sleep interval allows the core to enter the C1 state while keeping a fast response time in case new traffic arrives.

*   If the idle count is 1000 or greater, a fixed sleep value of 1 ms is used until the next timer expiration.
    This allows the core to enter the C3/C6 states.

.. note::

    The thresholds specified above need to be adjusted for different Intel processors and traffic profiles.

If a thread polls multiple Rx queues and different queues return different sleep duration values,
the algorithm controls the sleep time in a conservative manner by sleeping for the least possible time
in order to avoid a potential performance impact.
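
The sleep-duration rules described in this section can be modelled as shown below.
This is a sketch of the behavior as documented here, not a copy of ``power_idle_heuristic()``;
the function names are hypothetical, the idle counter is assumed to be maintained by the caller,
and an idle count of exactly 1000 is assumed to fall into the 1 ms band.

.. code-block:: c

    #include <stdint.h>

    /* Speculative sleep duration, in microseconds, for one Rx queue. */
    static uint32_t
    idle_sleep_us(uint32_t idle_count)
    {
        if (idle_count < 100)
            return idle_count;   /* short pause-based delay */
        if (idle_count < 1000)
            return 100;          /* long enough to reach C1 */
        return 1000;             /* long enough to reach C3/C6 */
    }

    /* Conservative per-lcore choice: the shortest sleep of all polled queues.
     * Returns 0 (no sleep) when the lcore polls no queues. */
    static uint32_t
    lcore_sleep_us(const uint32_t *idle_counts, unsigned int nb_queues)
    {
        uint32_t min_sleep = UINT32_MAX;
        unsigned int i;

        for (i = 0; i < nb_queues; i++) {
            uint32_t s = idle_sleep_us(idle_counts[i]);

            if (s < min_sleep)
                min_sleep = s;
        }
        return nb_queues == 0 ? 0 : min_sleep;
    }

Taking the minimum means that a single recently active queue keeps the whole lcore responsive, at the cost of some power savings.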

Telemetry Mode
--------------

The telemetry mode support for ``l3fwd-power`` is a standalone mode. In this mode,
``l3fwd-power`` does simple L3 forwarding along with calculating empty polls, full polls,
and busy percentage for each forwarding core. The aggregate of these
values across all cores is reported as application-level telemetry to the metrics
library every 500 ms from the main core.

The busy percentage is calculated by recording the poll count.
When the count reaches a defined value, the total number of
cycles taken is measured and compared with minimum and maximum
reference cycles, and the busy rate is set accordingly to 0%,
50% or 100%.
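
One possible reading of this three-level classification is sketched below.
The function name, the reference values passed in, and the handling of the boundary cases are assumptions of this example;
the sketch also assumes that taking more cycles to complete the fixed number of polls indicates a busier core.

.. code-block:: c

    #include <stdint.h>

    /* Classify a core as 0%, 50% or 100% busy from the cycles it needed
     * to complete a fixed number of polls. */
    static unsigned int
    busy_percent(uint64_t cycles, uint64_t min_ref_cycles,
                 uint64_t max_ref_cycles)
    {
        if (cycles <= min_ref_cycles)
            return 0;     /* polls completed quickly: mostly empty polls */
        if (cycles >= max_ref_cycles)
            return 100;   /* polls took long: core is saturated */
        return 50;        /* anything in between */
    }
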

.. code-block:: console

        ./<build_dir>/examples/dpdk-l3fwd-power --telemetry -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --telemetry

The new stats ``empty_poll``, ``full_poll`` and ``busy_percent`` can be viewed by running the script
``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.

PMD power management Mode
-------------------------

The PMD power management mode support for ``l3fwd-power`` is a standalone mode.
In this mode, ``l3fwd-power`` does simple L3 forwarding
along with enabling the power saving scheme on a specific port/queue/lcore.
The main purpose of this mode is to demonstrate
how to use the PMD power management API.

.. code-block:: console

        ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)"

PMD Power Management Mode
-------------------------

There is also a traffic-aware operating mode that,
instead of using explicit power management,
uses automatic PMD power management.
This mode is limited to one queue per core,
and offers the following power management schemes:

``baseline``
  This mode will not enable any power saving features.

``monitor``
  This will use the ``rte_power_monitor()`` function to enter
  a power-optimized state (subject to platform support).

``pause``
  This will use ``rte_power_pause()`` or ``rte_pause()``
  to avoid busy looping when there is no traffic.

``scale``
  This will use frequency scaling routines
  available in the ``librte_power`` library.
  The reaction time of the scale mode is longer
  than that of the pause and monitor modes.

See the :doc:`Power Management<../prog_guide/power_man>` chapter
in the DPDK Programmer's Guide for more details on PMD power management.

.. code-block:: console

        ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale
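
As a small illustration of how the value given to ``--pmd-mgmt`` might be mapped to one of the schemes listed above,
consider the sketch below. The enum and the ``parse_pmd_mgmt()`` helper are hypothetical and exist only for this example;
they are not part of the DPDK API or of the application source.

.. code-block:: c

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical scheme identifiers mirroring the list above. */
    enum pmd_mgmt_scheme {
        SCHEME_BASELINE = 0,
        SCHEME_MONITOR,
        SCHEME_PAUSE,
        SCHEME_SCALE,
        SCHEME_INVALID
    };

    /* Map a scheme name from the command line to its identifier. */
    static enum pmd_mgmt_scheme
    parse_pmd_mgmt(const char *name)
    {
        static const char *const names[] = {
            "baseline", "monitor", "pause", "scale"
        };
        size_t i;

        for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
            if (strcmp(name, names[i]) == 0)
                return (enum pmd_mgmt_scheme)i;
        }
        return SCHEME_INVALID;
    }

Rejecting unknown names up front lets an application report a usage error before any port or queue is configured.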

Setting Uncore Values
---------------------

Uncore frequency can be adjusted by manipulating the related sysfs entries
that set the minimum and maximum uncore values.
This will be set for each package and die on the SKU.
The driver for enabling this is available from kernel version 5.6 and above.
Three options are available for setting uncore frequency:

``-u``
  This will set uncore minimum and maximum frequencies to the minimum possible value.

``-U``
  This will set uncore minimum and maximum frequencies to the maximum possible value.

``-i``
  This will allow you to set a specific uncore frequency
  by setting the uncore frequency to the frequency pointed to by the index.
  Frequency indexes are spaced 100 MHz apart from maximum to minimum.
  Frequency index values are in descending order,
  i.e., index 0 is the maximum frequency index.

.. code-block:: console

   ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" -i 1
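
The relationship between a frequency index and the resulting uncore frequency can be sketched as follows.
The helper name is hypothetical, and expressing the limits in kHz is an assumption of this example;
the 100 MHz spacing and the descending order come from the description above.

.. code-block:: c

    #include <stdint.h>

    #define UNCORE_STEP_KHZ 100000u  /* 100 MHz between adjacent indexes */

    /* Translate a frequency index into a frequency in kHz.
     * Index 0 is the maximum frequency; each further index is 100 MHz lower.
     * Returns 0 if the index would fall below the minimum frequency. */
    static uint32_t
    uncore_index_to_khz(uint32_t max_khz, uint32_t min_khz, uint32_t index)
    {
        uint64_t offset = (uint64_t)index * UNCORE_STEP_KHZ;

        if (max_khz < min_khz || offset > max_khz - min_khz)
            return 0;
        return max_khz - (uint32_t)offset;
    }

On a platform whose uncore range is 1.2 GHz to 2.4 GHz, ``-i 1`` in the command above would therefore correspond to 2.3 GHz under these assumptions.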
339