..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

L3 Forwarding with Power Management Sample Application
======================================================

Introduction
------------

The L3 Forwarding with Power Management application is an example of power-aware packet processing using the DPDK.
The application is based on the existing L3 Forwarding sample application,
with power management algorithms added to control the P-states and
C-states of the Intel processor via a power management library.

Overview
--------

The application demonstrates the use of the Power libraries in the DPDK to implement packet forwarding.
The initialization and run-time paths are very similar to those of the :doc:`l3_forward`.
The main difference from the L3 Forwarding sample application is that this application introduces power-aware optimization algorithms
by leveraging the Power library to control the P-state and C-state of the processor based on packet load.

The DPDK includes poll-mode drivers (PMDs) to configure Intel NIC devices and their receive (Rx) and transmit (Tx) queues.
The design principle of these PMDs is to access the Rx and Tx descriptors directly, without any interrupts, to quickly receive,
process and deliver packets in the user space.

In general, the DPDK executes an endless packet processing loop on dedicated IA cores that includes the following steps
(a minimal sketch of such a loop follows the list):

*   Retrieve input packets by polling the Rx queue through the PMD

*   Process each received packet or provide received packets to other processing cores through software queues

*   Send pending output packets to the Tx queue through the PMD

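The sketch below illustrates these steps under some simplifying assumptions: a single port and queue, an illustrative MAX_PKT_BURST of 32, and the packet processing itself reduced to a comment.

.. code-block:: c

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define MAX_PKT_BURST 32 /* illustrative burst size */

    /* Endless polling loop: receive a burst, process it, transmit it. */
    static void
    poll_loop(uint16_t portid, uint16_t queueid)
    {
        struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
        uint16_t nb_rx, nb_tx;

        for (;;) {
            /* Retrieve input packets through the PMD. */
            nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst, MAX_PKT_BURST);

            /* Process each received packet (application-specific work). */

            /* Send pending output packets through the PMD. */
            nb_tx = rte_eth_tx_burst(portid, queueid, pkts_burst, nb_rx);

            /* Free any packets the Tx queue could not accept. */
            while (nb_tx < nb_rx)
                rte_pktmbuf_free(pkts_burst[nb_tx++]);
        }
    }
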
In this way, the PMD achieves better performance than a traditional interrupt-mode driver,
at the cost of keeping cores active and running at the highest frequency,
hence consuming the maximum power all the time.
However, during periods of light network traffic,
which occur regularly in communication infrastructure systems due to the well-known "tidal effect",
the PMD is still busy waiting for network packets, which wastes a lot of power.

Processor performance states (P-states) are the capability of an Intel processor
to switch between different supported operating frequencies and voltages.
If configured correctly according to system workload, this feature provides power savings.
CPUFreq is the infrastructure provided by the Linux* kernel to control the processor performance state capability.
CPUFreq supports a user space governor that enables setting the frequency by manipulating a virtual file device from a user space application.
The Power library in the DPDK provides a set of APIs for manipulating this virtual file device, allowing a user space application
to set the CPUFreq governor and the frequency of specific cores.

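As a brief illustration, the sketch below shows how an application might drive a core's frequency with these APIs; the lcore number is an arbitrary example and error handling is reduced to a minimum.

.. code-block:: c

    #include <rte_power.h>

    static void
    power_control_example(void)
    {
        unsigned int lcore_id = 1; /* example lcore */

        /* Switch the core to the userspace CPUFreq governor and
         * initialize the frequency table for this lcore. */
        if (rte_power_init(lcore_id) != 0)
            return;

        /* Scale the core frequency one step up or down ... */
        rte_power_freq_up(lcore_id);
        rte_power_freq_down(lcore_id);

        /* ... or jump straight to the highest/lowest frequency. */
        rte_power_freq_max(lcore_id);
        rte_power_freq_min(lcore_id);

        /* Restore the original governor when done. */
        rte_power_exit(lcore_id);
    }
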
This application includes a P-state power management algorithm to generate a frequency hint to be sent to CPUFreq.
The algorithm uses the number of received and available Rx packets on recent polls to make a heuristic decision to scale frequency up or down.
Specifically, thresholds are checked to see whether a specific core running a DPDK polling thread needs to increase its frequency
by a step, based on how close to full the polled Rx queues are.
The algorithm also decreases the frequency by a step if the number of packets processed per loop is far lower than the expected threshold
or the thread's sleeping time exceeds a threshold.

C-states are also known as sleep states.
They allow software to put an Intel core into a low power idle state from which it is possible to exit via an event, such as an interrupt.
However, there is a tradeoff between the power consumed in the idle state and the time required to wake up from the idle state (exit latency).
Therefore, as you go into deeper C-states, the power consumed is lower but the exit latency is increased. Each C-state has a target residency.
It is essential that when entering into a C-state, the core remains in this C-state for at least as long as the target residency in order
to fully realize the benefits of entering the C-state.
CPUIdle is the infrastructure provided by the Linux kernel to control the processor C-state capability.
Unlike CPUFreq, CPUIdle does not provide a mechanism that allows the application to change C-state directly.
Instead, it has its own heuristic algorithms in kernel space that select the target C-state to enter by executing privileged instructions
like HLT and MWAIT, based on the speculative sleep duration of the core.
In this application, we introduce a heuristic algorithm that allows packet processing cores to sleep for a short period
if no Rx packets have been received on recent polls.
In this way, CPUIdle automatically forces the corresponding cores to enter deeper C-states
instead of always staying in the C0 state waiting for packets.

.. note::

    To fully demonstrate the power saving capability of using C-states,
    it is recommended to enable deeper C3 and C6 states in the BIOS during system boot up.

Compiling the Application
-------------------------

To compile the sample application see :doc:`compiling`.

The application is located in the ``l3fwd-power`` sub-directory.

Running the Application
-----------------------

The application has a number of command line options:

.. code-block:: console

    ./build/l3fwd_power [EAL options] -- -p PORTMASK [-P] --config (port,queue,lcore)[,(port,queue,lcore)] [--enable-jumbo [--max-pkt-len PKTLEN]] [--no-numa]

where,

*   -p PORTMASK: Hexadecimal bitmask of ports to configure

*   -P: Sets all ports to promiscuous mode so that packets are accepted regardless of the packet's Ethernet MAC destination address.
    Without this option, only packets with the Ethernet MAC destination address set to the Ethernet address of the port are accepted.

*   --config (port,queue,lcore)[,(port,queue,lcore)]: determines which queues from which ports are mapped to which cores.

*   --enable-jumbo: optional, enables jumbo frames

*   --max-pkt-len: optional, maximum packet length in decimal (64-9600)

*   --no-numa: optional, disables NUMA awareness

See :doc:`l3_forward` for details.
The L3fwd-power example reuses the L3fwd command line options.

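For example, a hypothetical invocation that forwards on ports 0 and 1, with one Rx queue per port handled by lcores 1 and 2 respectively, might look as follows:

.. code-block:: console

    ./build/l3fwd_power -l 1,2 -n 4 -- -p 0x3 -P --config="(0,0,1),(1,0,2)"
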
Explanation
-----------

The following sections provide some explanation of the sample application code.
As mentioned in the overview section,
the initialization and run-time paths are very similar to those of the L3 forwarding application.
The following sections describe aspects that are specific to the L3 Forwarding with Power Management sample application.

Power Library Initialization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Power library is initialized in the main routine.
It changes the P-state governor to userspace for the specific cores that are under control.
The Timer library is also initialized, and several timers are created later on,
responsible for checking whether the frequency needs to be scaled down at run time based on CPU utilization statistics.

.. note::

    Only the power management related initialization is shown.

.. code-block:: c

    int main(int argc, char **argv)
    {
        struct lcore_conf *qconf;
        int ret;
        unsigned nb_ports;
        uint16_t queueid, portid;
        unsigned lcore_id;
        uint64_t hz;
        uint32_t n_tx_queue, nb_lcores;
        uint8_t nb_rx_queue, queue, socketid;

        // ...

        /* init RTE timer library to be used to initialize per-core timers */

        rte_timer_subsystem_init();

        // ...

        /* per-core initialization */

        for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
            if (rte_lcore_is_enabled(lcore_id) == 0)
                continue;

            /* init power management library for a specified core */

            ret = rte_power_init(lcore_id);
            if (ret)
                rte_exit(EXIT_FAILURE, "Power management library "
                    "initialization failed on core%d\n", lcore_id);

            /* init timer structures for each enabled lcore */

            rte_timer_init(&power_timers[lcore_id]);

            hz = rte_get_hpet_hz();

            rte_timer_reset(&power_timers[lcore_id], hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id, power_timer_cb, NULL);

            // ...
        }

        // ...
    }

Monitoring Loads of Rx Queues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In general, the polling nature of the DPDK prevents the OS power management subsystem from knowing
whether the network load is actually heavy or light.
In this sample, the network load is sampled by monitoring the received and
available descriptors on NIC Rx queues in recent polls.
Based on the number of returned and available Rx descriptors,
this example implements algorithms to generate frequency scaling hints and a speculative sleep duration,
and uses them to control the P-state and C-state of processors via the power management library.
Frequency (P-state) control and sleep state (C-state) control work independently for each logical core,
and their combination contributes to a power efficient packet processing solution when serving light network loads.

The rte_eth_rx_burst() function and the newly-added rte_eth_rx_queue_count() function are used in the endless packet processing loop
to return the number of received and available Rx descriptors.
These numbers for a specific queue are passed to the P-state and C-state heuristic algorithms
to generate hints based on recent network load trends.

.. note::

    Only power control related code is shown.

.. code-block:: c

    static
    __attribute__((noreturn)) int main_loop(__attribute__((unused)) void *dummy)
    {
        // ...

        while (1) {
            // ...

            /**
             * Read packets from Rx queues
             */

            lcore_scaleup_hint = FREQ_CURRENT;
            lcore_rx_idle_count = 0;

            for (i = 0; i < qconf->n_rx_queue; ++i) {
                rx_queue = &(qconf->rx_queue_list[i]);
                rx_queue->idle_hint = 0;
                portid = rx_queue->port_id;
                queueid = rx_queue->queue_id;

                nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst, MAX_PKT_BURST);
                stats[lcore_id].nb_rx_processed += nb_rx;

                if (unlikely(nb_rx == 0)) {
                    /**
                     * No packet received from this Rx queue, try to
                     * sleep for a while, forcing the CPU to enter deeper
                     * C-states.
                     */

                    rx_queue->zero_rx_packet_count++;

                    if (rx_queue->zero_rx_packet_count <= MIN_ZERO_POLL_COUNT)
                        continue;

                    rx_queue->idle_hint = power_idle_heuristic(rx_queue->zero_rx_packet_count);
                    lcore_rx_idle_count++;
                } else {
                    rx_ring_length = rte_eth_rx_queue_count(portid, queueid);

                    rx_queue->zero_rx_packet_count = 0;

                    /**
                     * Do not scale up frequency immediately, as
                     * user to kernel space communication is costly,
                     * which might impact packet I/O for received
                     * packets.
                     */

                    rx_queue->freq_up_hint = power_freq_scaleup_heuristic(lcore_id, rx_ring_length);
                }

                /* Prefetch and forward packets */

                // ...
            }

            if (likely(lcore_rx_idle_count != qconf->n_rx_queue)) {
                for (i = 1, lcore_scaleup_hint = qconf->rx_queue_list[0].freq_up_hint; i < qconf->n_rx_queue; ++i) {
                    rx_queue = &(qconf->rx_queue_list[i]);

                    if (rx_queue->freq_up_hint > lcore_scaleup_hint)
                        lcore_scaleup_hint = rx_queue->freq_up_hint;
                }

                if (lcore_scaleup_hint == FREQ_HIGHEST)
                    rte_power_freq_max(lcore_id);
                else if (lcore_scaleup_hint == FREQ_HIGHER)
                    rte_power_freq_up(lcore_id);
            } else {
                /**
                 * All Rx queues empty in recent consecutive polls,
                 * sleep in a conservative manner, meaning sleep as
                 * little as possible.
                 */

                for (i = 1, lcore_idle_hint = qconf->rx_queue_list[0].idle_hint; i < qconf->n_rx_queue; ++i) {
                    rx_queue = &(qconf->rx_queue_list[i]);
                    if (rx_queue->idle_hint < lcore_idle_hint)
                        lcore_idle_hint = rx_queue->idle_hint;
                }

                if (lcore_idle_hint < SLEEP_GEAR1_THRESHOLD)
                    /**
                     * Execute the "pause" instruction to avoid a context
                     * switch during a short sleep.
                     */
                    rte_delay_us(lcore_idle_hint);
                else
                    /* a long sleep forces the running thread to suspend */
                    usleep(lcore_idle_hint);

                stats[lcore_id].sleep_time += lcore_idle_hint;
            }
        }
    }

P-State Heuristic Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The power_freq_scaleup_heuristic() function is responsible for generating a frequency hint for the specified logical core
according to the number of available descriptors returned by rte_eth_rx_queue_count().
On every poll for new packets, the number of available descriptors on the Rx queue is evaluated,
and the algorithm used for frequency hinting is as follows (a sketch of the heuristic appears after the list):

*   If the number of available descriptors exceeds 96, the maximum frequency is hinted.

*   If the number of available descriptors exceeds 64, a trend counter is incremented by 100.

*   If the length of the ring exceeds 32, the trend counter is incremented by 1.

*   When the trend counter reaches 10000, the frequency hint is changed to the next higher frequency.

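The following is a minimal sketch of such a heuristic. The enum values match those used in the main loop above, but the per-lcore trend counter, accumulator values and threshold names here are illustrative assumptions rather than the exact code of the sample:

.. code-block:: c

    #include <stdint.h>
    #include <rte_lcore.h>

    /* Frequency hints, as used in the main loop above. */
    enum freq_scale_hint_t {
        FREQ_LOWER = -1,
        FREQ_CURRENT = 0,
        FREQ_HIGHER = 1,
        FREQ_HIGHEST = 2
    };

    /* Illustrative accumulators and threshold, assuming an Rx queue size of 128. */
    #define FREQ_UP_TREND1_ACC 1     /* ring length > 32 */
    #define FREQ_UP_TREND2_ACC 100   /* ring length > 64 */
    #define FREQ_UP_THRESHOLD  10000 /* switch hint to the next higher frequency */

    static uint32_t trend[RTE_MAX_LCORE]; /* per-lcore trend counter */

    static enum freq_scale_hint_t
    power_freq_scaleup_heuristic(unsigned int lcore_id, uint16_t rx_ring_length)
    {
        /* A nearly full ring hints the maximum frequency immediately. */
        if (rx_ring_length > 96)
            return FREQ_HIGHEST;

        /* Otherwise accumulate a trend according to the ring occupancy. */
        if (rx_ring_length > 64)
            trend[lcore_id] += FREQ_UP_TREND2_ACC;
        else if (rx_ring_length > 32)
            trend[lcore_id] += FREQ_UP_TREND1_ACC;

        /* A sustained near-full trend triggers a one-step increase. */
        if (trend[lcore_id] >= FREQ_UP_THRESHOLD) {
            trend[lcore_id] = 0;
            return FREQ_HIGHER;
        }

        return FREQ_CURRENT;
    }
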
.. note::

    The assumption is that the Rx queue size is 128; the thresholds specified above
    must be adjusted accordingly based on the actual hardware Rx queue size,
    which is configured via the rte_eth_rx_queue_setup() function.

In general, a thread needs to poll packets from multiple Rx queues.
Most likely, different queues have different loads, so they return different frequency hints.
The algorithm evaluates all the hints and then scales up frequency in an aggressive manner,
scaling up to the highest frequency as long as any one Rx queue requires it.
In this way, we can minimize any negative performance impact.

On the other hand, frequency scaling down is controlled in the timer callback function.
Specifically, if the sleep times of a logical core indicate that it is sleeping more than 25% of the sampling period,
or if the average number of packets processed per iteration is less than expected, the frequency is decreased by one step.
A sketch of such a callback is shown below.

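The following is a minimal sketch of such a scale-down callback. The statistics fields, the sampling period and the expectation threshold are modeled on the description above and are assumptions, not the exact bookkeeping of the sample:

.. code-block:: c

    #include <stdint.h>
    #include <rte_lcore.h>
    #include <rte_power.h>
    #include <rte_timer.h>

    #define SAMPLING_PERIOD_US     100000 /* assumed 100 ms sampling period */
    #define SCALING_DOWN_PCT       25     /* sleep-time threshold, in percent */
    #define EXPECTED_PKTS_PER_LOOP 4      /* assumed load expectation */

    /* Hypothetical per-lcore statistics, reset every sampling period. */
    struct lcore_stats {
        uint64_t sleep_time;          /* microseconds slept this period */
        uint64_t nb_rx_processed;     /* packets received this period */
        uint64_t nb_iteration_looped; /* main loop iterations this period */
    };
    static struct lcore_stats stats[RTE_MAX_LCORE];

    /* Timer callback, fired TIMER_NUMBER_PER_SECOND times per second. */
    static void
    power_timer_cb(struct rte_timer *tim, void *arg)
    {
        unsigned int lcore_id = rte_lcore_id();
        struct lcore_stats *st = &stats[lcore_id];
        uint64_t avg_pkts = 0;

        (void)tim;
        (void)arg;

        if (st->nb_iteration_looped != 0)
            avg_pkts = st->nb_rx_processed / st->nb_iteration_looped;

        /* Scale one step down if the core slept more than 25% of the
         * sampling period or processed fewer packets than expected. */
        if (st->sleep_time * 100 > (uint64_t)SAMPLING_PERIOD_US * SCALING_DOWN_PCT ||
                avg_pkts < EXPECTED_PKTS_PER_LOOP)
            rte_power_freq_down(lcore_id);

        /* Reset statistics for the next sampling period. */
        st->sleep_time = 0;
        st->nb_rx_processed = 0;
        st->nb_iteration_looped = 0;
    }
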
C-State Heuristic Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Whenever recent rte_eth_rx_burst() polls return 5 consecutive zero packets,
an idle counter begins incrementing for each successive zero poll.
At the same time, the power_idle_heuristic() function is called to generate a speculative sleep duration
in order to force the logical core to enter a deeper sleeping C-state.
There is no way to control the C-state directly, but the CPUIdle subsystem in the OS is intelligent enough
to select the C-state to enter based on the actual sleep period of the given logical core.
The algorithm has the following sleeping behavior depending on the idle counter (a sketch of the heuristic follows the list):

*   If the idle count is less than 100, the counter value is used as a microsecond sleep value through rte_delay_us(),
    which executes pause instructions to avoid a costly context switch while still saving power.

*   If the idle count is between 100 and 999, a fixed sleep interval of 100 μs is used.
    A 100 μs sleep interval allows the core to enter the C1 state while keeping a fast response time in case new traffic arrives.

*   If the idle count is greater than 1000, a fixed sleep value of 1 ms is used until the next timer expiration.
    This allows the core to enter the C3/C6 states.

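A minimal sketch of such a heuristic is shown below; the gear threshold names match the ones used in the main loop above, while the values are taken from the description of the algorithm:

.. code-block:: c

    #include <stdint.h>

    /* Thresholds taken from the description above (in microseconds). */
    #define SLEEP_GEAR1_THRESHOLD 100
    #define SLEEP_GEAR2_THRESHOLD 1000

    /* Map the zero-poll counter to a speculative sleep duration (us). */
    static uint32_t
    power_idle_heuristic(uint32_t zero_rx_packet_count)
    {
        /* Short idle period: sleep for as many microseconds as we have
         * seen empty polls, later realized with rte_delay_us(). */
        if (zero_rx_packet_count < SLEEP_GEAR1_THRESHOLD)
            return zero_rx_packet_count;
        /* Medium idle period: 100 us lets the core reach C1 while
         * keeping a fast response time for new traffic. */
        else if (zero_rx_packet_count < SLEEP_GEAR2_THRESHOLD)
            return SLEEP_GEAR1_THRESHOLD;
        /* Long idle period: 1 ms allows the deeper C3/C6 states. */
        return SLEEP_GEAR2_THRESHOLD;
    }
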
.. note::

    The thresholds specified above need to be adjusted for different Intel processors and traffic profiles.

If a thread polls multiple Rx queues and different queues return different sleep duration values,
the algorithm controls the sleep time in a conservative manner by sleeping for the least possible time
in order to avoid a potential performance impact.
365