..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

L3 Forwarding with Power Management Sample Application
======================================================

Introduction
------------

The L3 Forwarding with Power Management application is an example of power-aware packet processing using the DPDK.
The application is based on the existing L3 Forwarding sample application,
with power management algorithms added to control the P-states and
C-states of the Intel processor via a power management library.

Overview
--------

The application demonstrates the use of the Power libraries in the DPDK to implement packet forwarding.
The initialization and run-time paths are very similar to those of the :doc:`l3_forward`.
The main difference from the L3 Forwarding sample application is that this application introduces power-aware optimization algorithms
by leveraging the Power library to control the P-state and C-state of the processor based on packet load.

The DPDK includes poll-mode drivers (PMDs) to configure Intel NIC devices and their receive (Rx) and transmit (Tx) queues.
The design principle of a PMD is to access the Rx and Tx descriptors directly without any interrupts to quickly receive,
process and deliver packets in user space.

In general, the DPDK executes an endless packet processing loop on dedicated IA cores that includes the following steps:

* Retrieve input packets through the PMD to poll Rx queue

* Process each received packet or provide received packets to other processing cores through software queues

* Send pending output packets to Tx queue through the PMD

In this way, the PMD achieves better performance than a traditional interrupt-mode driver,
at the cost of keeping cores active and running at the highest frequency,
hence consuming the maximum power all the time.
However, during periods of light network traffic,
which occur regularly in communication infrastructure systems due to the well-known "tidal effect",
the PMD is still busy waiting for network packets, which wastes a lot of power.

Processor performance states (P-states) are the capability of an Intel processor
to switch between different supported operating frequencies and voltages.
If configured correctly, according to system workload, this feature provides power savings.
CPUFreq is the infrastructure provided by the Linux* kernel to control the processor performance state capability.
CPUFreq supports a user space governor that enables setting frequency via manipulating the virtual file device from a user space application.
The Power library in the DPDK provides a set of APIs for manipulating a virtual file device to allow a user space application
to set the CPUFreq governor and set the frequency of specific cores.

This application includes a P-state power management algorithm to generate a frequency hint to be sent to CPUFreq.
The algorithm uses the number of received and available Rx packets on recent polls to make a heuristic decision to scale frequency up/down.
Specifically, some thresholds are checked to see whether a specific core running a DPDK polling thread needs to step its frequency up,
based on a near-full trend of the polled Rx queues.
Also, it decreases frequency by one step if the number of packets processed per loop is far less than the expected threshold
or the thread's sleeping time exceeds a threshold.

C-States are also known as sleep states.
They allow software to put an Intel core into a low power idle state from which it is possible to exit via an event, such as an interrupt.
However, there is a tradeoff between the power consumed in the idle state and the time required to wake up from the idle state (exit latency).
Therefore, as you go into deeper C-states, the power consumed is lower but the exit latency is increased. Each C-state has a target residency.
It is essential that when entering into a C-state, the core remains in this C-state for at least as long as the target residency in order
to fully realize the benefits of entering the C-state.
CPUIdle is the infrastructure provided by the Linux kernel to control the processor C-state capability.
Unlike CPUFreq, CPUIdle does not provide a mechanism that allows the application to change C-state.
It actually has its own heuristic algorithms in kernel space to select the target C-state to enter by executing privileged instructions like HLT and MWAIT,
based on the speculative sleep duration of the core.
In this application, we introduce a heuristic algorithm that allows packet processing cores to sleep for a short period
if there is no Rx packet received on recent polls.
In this way, CPUIdle automatically forces the corresponding cores to enter deeper C-states
instead of always running in the C0 state waiting for packets.

.. note::

    To fully demonstrate the power saving capability of using C-states,
    it is recommended to enable deeper C3 and C6 states in the BIOS during system boot up.

Compiling the Application
-------------------------

To compile the sample application see :doc:`compiling`.

The application is located in the ``l3fwd-power`` sub-directory.

Running the Application
-----------------------

The application has a number of command line options:

.. code-block:: console

    ./<build_dir>/examples/dpdk-l3fwd-power [EAL options] -- -p PORTMASK [-P] --config (port,queue,lcore)[,(port,queue,lcore)] [--max-pkt-len PKTLEN] [--no-numa]

where,

* -p PORTMASK: Hexadecimal bitmask of ports to configure

* -P: Sets all ports to promiscuous mode so that packets are accepted regardless of the packet's Ethernet MAC destination address.
  Without this option, only packets with the Ethernet MAC destination address set to the Ethernet address of the port are accepted.

* -u: optional, sets uncore min/max frequency to minimum value.

* -U: optional, sets uncore min/max frequency to maximum value.

* -i (frequency index): optional, sets uncore frequency to frequency index value, by setting min and max values to be the same.

* --config (port,queue,lcore)[,(port,queue,lcore)]: determines which queues from which ports are mapped to which cores.

* --max-pkt-len: optional, maximum packet length in decimal (64-9600)

* --no-numa: optional, disables NUMA awareness

* --empty-poll: Traffic Aware power management. See below for details

* --telemetry: Telemetry mode.

* --pmd-mgmt: PMD power management mode.

* --max-empty-polls: Number of empty polls to wait before entering sleep state. Applies to --pmd-mgmt mode only.

* --pause-duration: Set the duration of the pause callback (microseconds). Applies to --pmd-mgmt mode only.

* --scale-freq-min: Set minimum frequency for scaling. Applies to --pmd-mgmt mode only.

* --scale-freq-max: Set maximum frequency for scaling. Applies to --pmd-mgmt mode only.

See :doc:`l3_forward` for details.
The L3fwd-power example reuses the L3fwd command line options.

Explanation
-----------

The following sections provide some explanation of the sample application code.
As mentioned in the overview section,
the initialization and run-time paths are very similar to those of the L3 forwarding application.
The following sections describe aspects that are specific to the L3 Forwarding with Power Management sample application.

Power Library Initialization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Power library is initialized in the main routine.
It changes the P-state governor to userspace for specific cores that are under control.
The Timer library is also initialized and several timers are created later on.
These timers are responsible for checking CPU utilization statistics at run time to decide whether the frequency needs to be scaled down.

.. note::

    Only the power management related initialization is shown.

.. literalinclude:: ../../../examples/l3fwd-power/main.c
    :language: c
    :start-after: Power library initialized in the main routine. 8<
    :end-before: >8 End of power library initialization.

Monitoring Loads of Rx Queues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In general, the polling nature of the DPDK prevents the OS power management subsystem from knowing
if the network load is actually heavy or light.
In this sample, network load is sampled by monitoring the received and
available descriptors on NIC Rx queues in recent polls.
Based on the number of returned and available Rx descriptors,
this example implements algorithms to generate frequency scaling hints and speculative sleep durations,
and uses them to control the P-state and C-state of processors via the power management library.
Frequency (P-state) control and sleep state (C-state) control work individually for each logical core,
and the combination of them contributes to a power efficient packet processing solution when serving light network loads.

The rte_eth_rx_burst() function and the newly-added rte_eth_rx_queue_count() function are used in the endless packet processing loop
to return the number of received and available Rx descriptors.
These numbers for a specific queue are passed to the P-state and C-state heuristic algorithms
to generate hints based on recent network load trends.

.. note::

    Only power control related code is shown.

.. literalinclude:: ../../../examples/l3fwd-power/main.c
    :language: c
    :start-after: Main processing loop. 8<
    :end-before: >8 End of main processing loop.

P-State Heuristic Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The power_freq_scaleup_heuristic() function is responsible for generating a frequency hint for the specified logical core
according to the number of available descriptors returned by rte_eth_rx_queue_count().
On every poll for new packets, the number of available descriptors on an Rx queue is evaluated,
and the algorithm used for frequency hinting is as follows:

* If the number of available descriptors exceeds 96, the maximum frequency is hinted.

* If the number of available descriptors exceeds 64, a trend counter is incremented by 100.

* If the number of available descriptors exceeds 32, the trend counter is incremented by 1.

* When the trend counter reaches 10000, the frequency hint is changed to the next higher frequency.

.. note::

    The assumption is that the Rx queue size is 128. The thresholds specified above
    must be adjusted according to the actual hardware Rx queue size,
    which is configured via the rte_eth_rx_queue_setup() function.

In general, a thread needs to poll packets from multiple Rx queues.
Most likely, different queues have different loads, so they would return different frequency hints.
The algorithm evaluates all the hints and then scales up frequency in an aggressive manner
by scaling up to the highest frequency as long as one Rx queue requires it.
In this way, we can minimize any negative performance impact.

On the other hand, frequency scaling down is controlled in the timer callback function.
Specifically, if the sleep times of a logical core indicate that it is sleeping more than 25% of the sampling period,
or if the average number of packets per iteration is less than expected, the frequency is decreased by one step.
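
The scale-up steps listed above can be sketched in plain C as follows.
This is a minimal illustration and not the code from ``main.c``:
the names ``scaleup_hint()``, ``combine_hints()``, ``struct trend_state`` and the ``FREQ_*`` enumerators are hypothetical,
the thresholds assume an Rx queue size of 128,
and resetting the trend counter after a step-up hint is an assumption.
The available-descriptor count is passed in as a plain integer, standing in for the value returned by rte_eth_rx_queue_count().

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical hint values, ordered so that a larger value is more aggressive. */
    enum freq_hint { FREQ_CURRENT = 0, FREQ_HIGHER = 1, FREQ_HIGHEST = 2 };

    /* Hypothetical per-queue state holding the trend counter. */
    struct trend_state {
        uint32_t trend;
    };

    /* Map the number of available Rx descriptors to a frequency hint. */
    static enum freq_hint
    scaleup_hint(struct trend_state *st, uint32_t avail_desc)
    {
        if (avail_desc > 96)
            return FREQ_HIGHEST;    /* queue nearly full: ask for maximum */

        if (avail_desc > 64)
            st->trend += 100;       /* strong upward trend */
        else if (avail_desc > 32)
            st->trend += 1;         /* mild upward trend */

        if (st->trend >= 10000) {
            st->trend = 0;          /* assumption: restart the trend */
            return FREQ_HIGHER;     /* ask for the next higher frequency */
        }
        return FREQ_CURRENT;
    }

    /* Aggressive combination: the most demanding queue decides. */
    static enum freq_hint
    combine_hints(const enum freq_hint *hints, unsigned int n)
    {
        enum freq_hint best = FREQ_CURRENT;
        unsigned int i;

        for (i = 0; i < n; i++)
            if (hints[i] > best)
                best = hints[i];
        return best;
    }

In this sketch, a thread polling several queues would call ``scaleup_hint()`` once per queue on each poll
and act on the result of ``combine_hints()``, which mirrors the "scale up as long as one Rx queue requires it" policy.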

C-State Heuristic Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Whenever recent rte_eth_rx_burst() polls return zero packets 5 consecutive times,
an idle counter begins incrementing for each successive zero poll.
At the same time, the function power_idle_heuristic() is called to generate a speculative sleep duration
in order to force the logical core to enter a deeper C-state.
There is no way to control the C-state directly, and the CPUIdle subsystem in the OS is intelligent enough
to select the C-state to enter based on the actual sleep period of the given logical core.
The algorithm has the following sleeping behavior depending on the idle counter:

* If the idle count is less than 100, the counter value is used as a microsecond sleep value through rte_delay_us(),
  which executes pause instructions to avoid a costly context switch while still saving power.

* If the idle count is between 100 and 999, a fixed sleep interval of 100 μs is used.
  A 100 μs sleep interval allows the core to enter the C1 state while keeping a fast response time in case new traffic arrives.

* If the idle count is 1000 or greater, a fixed sleep value of 1 ms is used until the next timer expiration.
  This allows the core to enter the C3/C6 states.

.. note::

    The thresholds specified above need to be adjusted for different Intel processors and traffic profiles.

If a thread polls multiple Rx queues and different queues return different sleep duration values,
the algorithm controls the sleep time in a conservative manner by sleeping for the least possible time
in order to avoid a potential performance impact.

Empty Poll Mode
---------------

Additionally, there is a traffic aware mode of operation called "Empty
Poll" where the number of empty polls can be monitored to keep track
of how busy the application is. Empty poll mode can be enabled by the
command line option --empty-poll.

See :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK Programmer's Guide for empty poll mode details.

.. code-block:: console

    ./<build_dir>/examples/dpdk-l3fwd-power -l xxx -n 4 -a 0000:xx:00.0 -a 0000:xx:00.1 \
    -- -p 0x3 -P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1

Where,

--empty-poll: Enable the empty poll mode instead of the original algorithm

--empty-poll="training_flag, med_threshold, high_threshold"

* ``training_flag`` : optional, enable/disable training mode. Default value is 0. If the training_flag is set to 1 (true), then the application will start in training mode and print out the trained threshold values. If the training_flag is set to 0 (false), the application will start in normal mode, and will use either the default thresholds or those supplied on the command line. The trained threshold values are specific to the user's system and may give a better power profile when compared to the default threshold values.

* ``med_threshold`` : optional, sets the empty poll threshold of a modestly busy system state. If this is not supplied, the application will apply the default value of 350000.

* ``high_threshold`` : optional, sets the empty poll threshold of a busy system state. If this is not supplied, the application will apply the default value of 580000.

* -l : optional, set up the LOW power state frequency index

* -m : optional, set up the MED power state frequency index

* -h : optional, set up the HIGH power state frequency index

Empty Poll Mode Example Usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To initially obtain the ideal thresholds for the system, the training
mode should be run first. This is achieved by running the l3fwd-power
app with the training flag set to "1", and the other parameters set to
0.

.. code-block:: console

    ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --empty-poll "1,0,0" -P

This will run the training algorithm for x seconds on each core (cores 2
and 3), and then print out the recommended threshold values for those
cores. The thresholds should be very similar for each core.

.. code-block:: console

    POWER: Bring up the Timer
    POWER: set the power freq to MED
    POWER: Low threshold is 230277
    POWER: MED threshold is 335071
    POWER: HIGH threshold is 523769
    POWER: Training is Complete for 2
    POWER: set the power freq to MED
    POWER: Low threshold is 236814
    POWER: MED threshold is 344567
    POWER: HIGH threshold is 538580
    POWER: Training is Complete for 3

Once the values have been measured for a particular system, the app can
then be started without the training mode so traffic can start immediately.

.. code-block:: console

    ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --empty-poll "0,340000,540000" -P

Telemetry Mode
--------------

The telemetry mode support for ``l3fwd-power`` is a standalone mode. In this mode,
``l3fwd-power`` performs simple L3 forwarding along with calculating empty polls, full polls,
and busy percentage for each forwarding core. The aggregation of these
values across all cores is reported as application level telemetry to the metrics
library every 500 ms from the main core.

The busy percentage is calculated by recording the poll_count.
When the count reaches a defined value, the total number of
cycles it took is measured and compared with minimum and maximum
reference cycles, and the busy rate is set accordingly to 0%,
50% or 100%.

.. code-block:: console

    ./<build_dir>/examples/dpdk-l3fwd-power --telemetry -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --telemetry

The new stats ``empty_poll``, ``full_poll`` and ``busy_percent`` can be viewed by running the script
``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.

PMD Power Management Mode
-------------------------

The PMD power management mode support for ``l3fwd-power`` is a standalone mode.
In this mode, ``l3fwd-power`` performs simple L3 forwarding
along with enabling the power saving scheme on a specific port/queue/lcore.
The main purpose of this mode is to demonstrate
how to use the PMD power management API.

.. code-block:: console

    ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)"

This is a traffic-aware operating mode that,
instead of using explicit power management,
uses automatic PMD power management.
This mode is limited to one queue per core,
and has three available power management schemes:

``monitor``
  This will use the ``rte_power_monitor()`` function to enter
  a power-optimized state (subject to platform support).

``pause``
  This will use ``rte_power_pause()`` or ``rte_pause()``
  to avoid busy looping when there is no traffic.

``scale``
  This will use frequency scaling routines
  available in the ``librte_power`` library.
  The reaction time of the scale mode is longer
  than that of the pause and monitor modes.

See :doc:`Power Management<../prog_guide/power_man>` chapter
in the DPDK Programmer's Guide for more details on PMD power management.

.. code-block:: console

    ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale

Setting Uncore Values
---------------------

Uncore frequency can be adjusted by manipulating the related sysfs entries
that set the minimum and maximum uncore values.
This will be set for each package and die on the SKU.
The driver for enabling this is available from kernel version 5.6 and above.
Three options are available for setting uncore frequency:

``-u``
  This will set uncore minimum and maximum frequencies to the minimum possible value.

``-U``
  This will set uncore minimum and maximum frequencies to the maximum possible value.

``-i``
  This will allow you to set the specific uncore frequency index that you want,
  by setting the uncore frequency to the frequency pointed to by the index.
  Frequency indexes are set 100 MHz apart from maximum to minimum.
  Frequency index values are in descending order,
  i.e., index 0 is the maximum frequency index.

.. code-block:: console

    dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" -i 1
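
The relationship between a frequency index and the resulting uncore frequency can be illustrated with a small sketch.
It only encodes the rule stated above (indexes are 100 MHz apart, in descending order, with index 0 at the maximum);
the helper ``uncore_index_to_mhz()`` is hypothetical and not part of the DPDK API,
and the minimum and maximum frequencies passed to it are assumed to have been read from the platform beforehand.

.. code-block:: c

    #include <stdint.h>

    #define UNCORE_STEP_MHZ 100u    /* spacing between adjacent indexes */

    /*
     * Hypothetical helper: translate a frequency index into MHz.
     * Returns 0 and stores the frequency on success,
     * or -1 if the range is invalid or the index is out of range.
     */
    static int
    uncore_index_to_mhz(uint32_t max_mhz, uint32_t min_mhz,
                        uint32_t index, uint32_t *freq_mhz)
    {
        uint32_t nb_indexes;

        if (max_mhz < min_mhz)
            return -1;

        /* Index 0 is the maximum; the last index is the minimum. */
        nb_indexes = (max_mhz - min_mhz) / UNCORE_STEP_MHZ + 1;
        if (index >= nb_indexes)
            return -1;

        *freq_mhz = max_mhz - index * UNCORE_STEP_MHZ;
        return 0;
    }

For example, on a platform whose uncore range is assumed to be 800 MHz to 2400 MHz,
``-i 1`` would correspond to 2300 MHz under this rule, and index 16 would be the minimum.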