..  BSD LICENSE
    Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
    * Neither the name of Intel Corporation nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

L3 Forwarding with Power Management Sample Application
======================================================

Introduction
------------

The L3 Forwarding with Power Management application is an example of power-aware packet processing using the DPDK.
The application is based on the existing L3 Forwarding sample application,
with power management algorithms added to control the P-states and
C-states of the Intel processor via a power management library.

Overview
--------

The application demonstrates the use of the Power libraries in the DPDK to implement packet forwarding.
The initialization and run-time paths are very similar to those of the L3 forwarding sample application
(see Chapter 10 "L3 Forwarding Sample Application" for more information).
The main difference from the L3 Forwarding sample application is that this application introduces power-aware optimization algorithms
by leveraging the Power library to control the P-state and C-state of the processor based on packet load.

The DPDK includes poll-mode drivers (PMDs) to configure Intel NIC devices and their receive (Rx) and transmit (Tx) queues.
The design principle of a PMD is to access the Rx and Tx descriptors directly without any interrupts to quickly receive,
process and deliver packets in user space.

In general, the DPDK executes an endless packet processing loop on dedicated IA cores that includes the following steps:

* Retrieve input packets through the PMD to poll Rx queue

* Process each received packet or provide received packets to other processing cores through software queues

* Send pending output packets to Tx queue through the PMD

In this way, the PMD achieves better performance than a traditional interrupt-mode driver,
at the cost of keeping cores active and running at the highest frequency,
hence consuming the maximum power all the time.
However, during periods of light network traffic,
which occur regularly in communication infrastructure systems due to the well-known "tidal effect",
the PMD is still busy waiting for network packets, which wastes a lot of power.


Processor performance states (P-states) are the capability of an Intel processor
to switch between different supported operating frequencies and voltages.
If configured correctly, according to system workload, this feature provides power savings.
CPUFreq is the infrastructure provided by the Linux* kernel to control the processor performance state capability.
CPUFreq supports a user space governor that enables setting frequency via manipulating the virtual file device from a user space application.
The Power library in the DPDK provides a set of APIs for manipulating a virtual file device to allow a user space application
to set the CPUFreq governor and set the frequency of specific cores.

This application includes a P-state power management algorithm to generate a frequency hint to be sent to CPUFreq.
The algorithm uses the number of received and available Rx packets on recent polls to make a heuristic decision to scale frequency up/down.
Specifically, thresholds are checked to see whether a specific core running a DPDK polling thread needs to increase frequency
by one step, based on how close to full the polled Rx queues have recently been.
Also, it decreases frequency by one step if the number of packets processed per loop is far less than the expected threshold
or the thread's sleeping time exceeds a threshold.

C-States are also known as sleep states.
They allow software to put an Intel core into a low power idle state from which it is possible to exit via an event, such as an interrupt.
However, there is a tradeoff between the power consumed in the idle state and the time required to wake up from the idle state (exit latency).
Therefore, as you go into deeper C-states, the power consumed is lower but the exit latency is increased. Each C-state has a target residency.
It is essential that when entering into a C-state, the core remains in this C-state for at least as long as the target residency in order
to fully realize the benefits of entering the C-state.
CPUIdle is the infrastructure provided by the Linux kernel to control the processor C-state capability.
Unlike CPUFreq, CPUIdle does not provide a mechanism that allows the application to change C-state.
It actually has its own heuristic algorithms in kernel space to select the target C-state to enter by executing privileged instructions like HLT and MWAIT,
based on the speculative sleep duration of the core.
In this application, we introduce a heuristic algorithm that allows packet processing cores to sleep for a short period
if there is no Rx packet received on recent polls.
In this way, CPUIdle automatically forces the corresponding cores to enter deeper C-states
instead of always running in the C0 state waiting for packets.

.. note::

    To fully demonstrate the power saving capability of using C-states,
    it is recommended to enable deeper C3 and C6 states in the BIOS during system boot up.

Compiling the Application
-------------------------

To compile the application:

#.  Go to the sample application directory:

    .. code-block:: console

        export RTE_SDK=/path/to/rte_sdk
        cd ${RTE_SDK}/examples/l3fwd-power

#.  Set the target (a default target is used if not specified). For example:

    .. code-block:: console

        export RTE_TARGET=x86_64-native-linuxapp-gcc

    See the *DPDK Getting Started Guide* for possible RTE_TARGET values.

#.  Build the application:

    .. code-block:: console

        make

Running the Application
-----------------------

The application has a number of command line options:

.. code-block:: console

    ./build/l3fwd_power [EAL options] -- -p PORTMASK [-P] --config (port,queue,lcore)[,(port,queue,lcore)] [--enable-jumbo [--max-pkt-len PKTLEN]] [--no-numa]

where,

*   -p PORTMASK: Hexadecimal bitmask of ports to configure

*   -P: Sets all ports to promiscuous mode so that packets are accepted regardless of the packet's Ethernet MAC destination address.
    Without this option, only packets with the Ethernet MAC destination address set to the Ethernet address of the port are accepted.

*   --config (port,queue,lcore)[,(port,queue,lcore)]: determines which queues from which ports are mapped to which cores.

*   --enable-jumbo: optional, enables jumbo frames

*   --max-pkt-len: optional, maximum packet length in decimal (64-9600)

*   --no-numa: optional, disables NUMA awareness

See Chapter 10 "L3 Forwarding Sample Application" for details.
The L3fwd-power example reuses the L3fwd command line options.

Explanation
-----------

The following sections provide some explanation of the sample application code.
As mentioned in the overview section,
the initialization and run-time paths are very similar to those of the L3 forwarding application.
The following sections describe aspects that are specific to the L3 Forwarding with Power Management sample application.

Power Library Initialization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Power library is initialized in the main routine.
It changes the P-state governor to userspace for specific cores that are under control.
The Timer library is also initialized and several timers are created later on.
These timers are responsible for checking at run time whether the frequency needs to be scaled down, based on CPU utilization statistics.

.. note::

    Only the power management related initialization is shown.

.. code-block:: c

    int main(int argc, char **argv)
    {
        struct lcore_conf *qconf;
        int ret;
        unsigned nb_ports;
        uint16_t queueid;
        unsigned lcore_id;
        uint64_t hz;
        uint32_t n_tx_queue, nb_lcores;
        uint8_t portid, nb_rx_queue, queue, socketid;

        // ...

        /* init RTE timer library to be used to initialize per-core timers */

        rte_timer_subsystem_init();

        // ...


        /* per-core initialization */

        for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
            if (rte_lcore_is_enabled(lcore_id) == 0)
                continue;

            /* init power management library for a specified core */

            ret = rte_power_init(lcore_id);
            if (ret)
                rte_exit(EXIT_FAILURE, "Power management library "
                    "initialization failed on core%u\n", lcore_id);

            /* init timer structures for each enabled lcore */

            rte_timer_init(&power_timers[lcore_id]);

            hz = rte_get_hpet_hz();

            rte_timer_reset(&power_timers[lcore_id], hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id, power_timer_cb, NULL);

            // ...
        }

        // ...
    }

Monitoring Loads of Rx Queues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In general, the polling nature of the DPDK prevents the OS power management subsystem from knowing
if the network load is actually heavy or light.
In this sample, the network load is sampled by monitoring received and
available descriptors on NIC Rx queues in recent polls.
Based on the number of returned and available Rx descriptors,
this example implements algorithms to generate frequency scaling hints and speculative sleep durations,
and uses them to control the P-state and C-state of processors via the power management library.
Frequency (P-state) control and sleep state (C-state) control work individually for each logical core,
and the combination of them contributes to a power efficient packet processing solution when serving light network loads.

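
As a rough illustration of how a queue occupancy sample can be turned into a frequency hint,
the following self-contained sketch uses the thresholds listed in the P-State Heuristic Algorithm section
(96, 64 and 32 available descriptors and a trend limit of 10000), which assume a 128-entry Rx ring.
The names ``freq_hint``, ``trend_state`` and ``scaleup_hint()`` are hypothetical and this is not the application's actual code;
resetting the trend counter whenever a hint is issued is likewise an assumption.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical hint values, ordered so that a larger value is more urgent. */
    enum freq_hint { HINT_CURRENT = 0, HINT_HIGHER, HINT_HIGHEST };

    /* Hypothetical per-lcore state holding the scale-up trend counter. */
    struct trend_state {
        uint32_t trend;
    };

    /*
     * Turn the number of available Rx descriptors into a scale-up hint.
     * Thresholds assume a 128-entry Rx ring, as stated in the text.
     */
    enum freq_hint
    scaleup_hint(struct trend_state *st, uint32_t rx_ring_length)
    {
        if (rx_ring_length > 96) {
            /* ring is nearly full: ask for the maximum frequency at once */
            st->trend = 0; /* assumption: restart the trend afterwards */
            return HINT_HIGHEST;
        }

        if (rx_ring_length > 64)
            st->trend += 100; /* fairly full: trend grows quickly */
        else if (rx_ring_length > 32)
            st->trend += 1;   /* moderately full: trend grows slowly */

        if (st->trend >= 10000) {
            /* sustained load: ask for one frequency step up */
            st->trend = 0; /* assumption: restart the trend afterwards */
            return HINT_HIGHER;
        }

        return HINT_CURRENT;
    }

Because the enumerators are ordered by urgency, a caller polling several queues could keep the largest hint returned for any of them.
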

The rte_eth_rx_burst() function and the newly-added rte_eth_rx_queue_count() function are used in the endless packet processing loop
to return the number of received and available Rx descriptors.
The counts for each queue are then passed to the P-state and C-state heuristic algorithms
to generate hints based on recent network load trends.

.. note::

    Only power control related code is shown.

.. code-block:: c

    static
    __attribute__((noreturn)) int main_loop(__attribute__((unused)) void *dummy)
    {
        // ...

        while (1) {
            // ...

            /**
             * Read packet from RX queues
             */

            lcore_scaleup_hint = FREQ_CURRENT;
            lcore_rx_idle_count = 0;

            for (i = 0; i < qconf->n_rx_queue; ++i)
            {
                rx_queue = &(qconf->rx_queue_list[i]);
                rx_queue->idle_hint = 0;
                portid = rx_queue->port_id;
                queueid = rx_queue->queue_id;

                nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst, MAX_PKT_BURST);
                stats[lcore_id].nb_rx_processed += nb_rx;

                if (unlikely(nb_rx == 0)) {
                    /**
                     * no packet received from rx queue, try to
                     * sleep for a while forcing CPU enter deeper
                     * C states.
                     */

                    rx_queue->zero_rx_packet_count++;

                    if (rx_queue->zero_rx_packet_count <= MIN_ZERO_POLL_COUNT)
                        continue;

                    rx_queue->idle_hint = power_idle_heuristic(rx_queue->zero_rx_packet_count);
                    lcore_rx_idle_count++;
                } else {
                    rx_ring_length = rte_eth_rx_queue_count(portid, queueid);

                    rx_queue->zero_rx_packet_count = 0;

                    /**
                     * do not scale up frequency immediately as
                     * user to kernel space communication is costly
                     * which might impact packet I/O for received
                     * packets.
                     */

                    rx_queue->freq_up_hint = power_freq_scaleup_heuristic(lcore_id, rx_ring_length);
                }

                /* Prefetch and forward packets */

                // ...
            }

            if (likely(lcore_rx_idle_count != qconf->n_rx_queue)) {
                /* at least one queue saw traffic: use the most urgent scale-up hint */
                for (i = 1, lcore_scaleup_hint = qconf->rx_queue_list[0].freq_up_hint; i < qconf->n_rx_queue; ++i) {
                    rx_queue = &(qconf->rx_queue_list[i]);

                    if (rx_queue->freq_up_hint > lcore_scaleup_hint)
                        lcore_scaleup_hint = rx_queue->freq_up_hint;
                }

                if (lcore_scaleup_hint == FREQ_HIGHEST)
                    rte_power_freq_max(lcore_id);
                else if (lcore_scaleup_hint == FREQ_HIGHER)
                    rte_power_freq_up(lcore_id);
            } else {
                /**
                 * All Rx queues empty in recent consecutive polls,
                 * sleep in a conservative manner, meaning sleep as
                 * little as possible.
                 */

                for (i = 1, lcore_idle_hint = qconf->rx_queue_list[0].idle_hint; i < qconf->n_rx_queue; ++i) {
                    rx_queue = &(qconf->rx_queue_list[i]);
                    if (rx_queue->idle_hint < lcore_idle_hint)
                        lcore_idle_hint = rx_queue->idle_hint;
                }

                if (lcore_idle_hint < SLEEP_GEAR1_THRESHOLD)
                    /**
                     * execute "pause" instruction to avoid context
                     * switch for short sleep.
                     */
                    rte_delay_us(lcore_idle_hint);
                else
                    /* long sleep forces the running thread to suspend */
                    usleep(lcore_idle_hint);

                stats[lcore_id].sleep_time += lcore_idle_hint;
            }
        }
    }

P-State Heuristic Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The power_freq_scaleup_heuristic() function is responsible for generating a frequency hint for the specified logical core
according to the number of available descriptors returned from rte_eth_rx_queue_count().
On every poll for new packets, the number of available descriptors on an Rx queue is evaluated,
and the algorithm used for frequency hinting is as follows:

* If the number of available descriptors exceeds 96, the maximum frequency is hinted.

* If the number of available descriptors exceeds 64, a trend counter is incremented by 100.

* If the number of available descriptors exceeds 32, the trend counter is incremented by 1.


* When the trend counter reaches 10000, the frequency hint is changed to the next higher frequency.

.. note::

    The assumption is that the Rx queue size is 128. The thresholds specified above
    must be adjusted according to the actual hardware Rx queue size,
    which is configured via the rte_eth_rx_queue_setup() function.

In general, a thread needs to poll packets from multiple Rx queues.
Most likely, different queues have different loads, so they return different frequency hints.
The algorithm evaluates all the hints and then scales up frequency in an aggressive manner
by scaling up to the highest frequency as long as one Rx queue requires it.
In this way, we can minimize any negative performance impact.

On the other hand, frequency scaling down is controlled in the timer callback function.
Specifically, if the sleep times of a logical core indicate that it is sleeping more than 25% of the sampling period,
or if the average number of packets per iteration is less than expected, the frequency is decreased by one step.

C-State Heuristic Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Whenever 5 consecutive rte_eth_rx_burst() polls return zero packets,
an idle counter begins incrementing for each successive zero poll.
At the same time, the function power_idle_heuristic() is called to generate a speculative sleep duration
in order to force the logical core to enter a deeper sleeping C-state.
There is no way to control the C-state directly; instead, the CPUIdle subsystem in the OS is intelligent enough
to select the C-state to enter based on the actual sleep period of the given logical core.
The algorithm has the following sleeping behavior depending on the idle counter:

* If the idle count is less than 100, the counter value is used as a microsecond sleep value through rte_delay_us(),
  which executes pause instructions to avoid a costly context switch while still saving power.


* If the idle count is between 100 and 999, a fixed sleep interval of 100 μs is used.
  A 100 μs sleep interval allows the core to enter the C1 state while keeping a fast response time in case new traffic arrives.

* If the idle count is 1000 or greater, a fixed sleep value of 1 ms is used until the next timer expiration.
  This allows the core to enter the C3/C6 states.

.. note::

    The thresholds specified above need to be adjusted for different Intel processors and traffic profiles.

If a thread polls multiple Rx queues and different queues return different sleep duration values,
the algorithm controls the sleep time in a conservative manner by sleeping for the least possible time
in order to avoid a potential performance impact.
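
The sleeping behavior described above can be sketched in a few lines of plain C.
This is a minimal illustration under stated assumptions, not the application's power_idle_heuristic() implementation:
the macro and function names are hypothetical, and treating an idle count of exactly 1000 as part of the 1 ms range is an assumption.
The second function shows the conservative choice for a thread polling several queues, which is to take the shortest of the per-queue sleep durations.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Thresholds taken from the text above; the names are hypothetical. */
    #define IDLE_GEAR1_COUNT    100U   /* idle count where the 100 us sleep starts */
    #define IDLE_GEAR2_COUNT    1000U  /* idle count where the 1 ms sleep starts */
    #define IDLE_GEAR1_SLEEP_US 100U
    #define IDLE_GEAR2_SLEEP_US 1000U

    /* Map one queue's idle (zero poll) count to a sleep duration in microseconds. */
    uint32_t
    idle_hint_us(uint32_t idle_count)
    {
        if (idle_count < IDLE_GEAR1_COUNT)
            return idle_count;           /* short busy-wait, one us per count */
        if (idle_count < IDLE_GEAR2_COUNT)
            return IDLE_GEAR1_SLEEP_US;  /* fixed 100 us sleep */
        return IDLE_GEAR2_SLEEP_US;      /* fixed 1 ms sleep */
    }

    /*
     * Conservative per-lcore hint: the shortest sleep requested by any queue.
     * The caller must pass at least one idle count (n >= 1).
     */
    uint32_t
    lcore_idle_hint_us(const uint32_t *idle_counts, size_t n)
    {
        uint32_t hint = idle_hint_us(idle_counts[0]);
        size_t i;

        for (i = 1; i < n; i++) {
            uint32_t h = idle_hint_us(idle_counts[i]);

            if (h < hint)
                hint = h;
        }
        return hint;
    }

For example, with per-queue idle counts of 1500, 40 and 250, the per-queue hints are 1000, 40 and 100 microseconds, so the thread would sleep for 40 microseconds.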