..  BSD LICENSE
    Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
    * Neither the name of Intel Corporation nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Writing Efficient Code
======================

This chapter provides some tips for developing efficient code using the DPDK.
For additional and more general information,
please refer to the *Intel® 64 and IA-32 Architectures Optimization Reference Manual*,
which is a valuable reference for writing efficient code.

Memory
------

This section describes some key memory considerations when developing applications in the DPDK environment.

Memory Copy: Do not Use libc in the Data Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many libc functions are available in the DPDK, via the Linux* application environment.
This can ease the porting of applications and the development of the configuration plane.
However, many of these functions are not designed for performance.
Functions such as memcpy() or strcpy() should not be used in the data plane.
To copy small structures, prefer a simpler technique that the compiler can optimize.
Refer to the *VTune™ Performance Analyzer Essentials* publication from Intel Press for recommendations.

For specific functions that are called often,
it is also a good idea to provide a custom optimized function, declared as static inline.

The DPDK API provides an optimized rte_memcpy() function.
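
As a minimal sketch of one such simpler technique, the helper below copies a small fixed-size structure with plain structure assignment.
Treating structure assignment as the intended technique is an assumption,
and the ``struct flow_key`` type and the ``flow_key_copy()`` name are hypothetical, not part of the DPDK API.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical 16-byte lookup key; not a DPDK type. */
    struct flow_key {
        uint32_t src_ip;
        uint32_t dst_ip;
        uint16_t src_port;
        uint16_t dst_port;
        uint32_t proto;
    };

    /* Copy by structure assignment: the size is known at compile time,
     * so the compiler is free to expand the copy inline. */
    static inline void
    flow_key_copy(struct flow_key *dst, const struct flow_key *src)
    {
        *dst = *src;
    }

Because the helper is static inline and the object size is a compile-time constant,
the compiler can usually reduce the copy to a few load and store instructions.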

Memory Allocation
~~~~~~~~~~~~~~~~~

Other functions of libc, such as malloc(), provide a flexible way to allocate and free memory.
In some cases, using dynamic allocation is necessary,
but using malloc-like functions in the data plane is strongly discouraged because
managing a fragmented heap can be costly and the allocator may not be optimized for parallel allocation.

If you really need dynamic allocation in the data plane, it is better to use a memory pool of fixed-size objects.
This API is provided by librte_mempool.
This data structure provides several services that increase performance, such as memory alignment of objects,
lockless access to objects, NUMA awareness, bulk get/put and per-lcore cache.
The rte_malloc() function uses a similar concept to mempools.
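
To illustrate why a pool of fixed-size objects is cheap, here is a minimal single-threaded free-list sketch in plain C.
It is not librte_mempool and has none of its features (no per-lcore cache, no NUMA awareness, no lockless multi-core access);
the ``obj_pool`` names and the sizes are hypothetical.

.. code-block:: c

    #include <stddef.h>

    #define POOL_OBJ_SIZE  64   /* hypothetical fixed object size, in bytes */
    #define POOL_OBJ_COUNT 8    /* hypothetical number of objects */

    /* A free object stores the link to the next free object in its own memory. */
    union pool_obj {
        union pool_obj *next;
        unsigned char data[POOL_OBJ_SIZE];
    };

    struct obj_pool {
        union pool_obj objs[POOL_OBJ_COUNT]; /* storage reserved up front */
        union pool_obj *free_list;           /* head of the free list */
    };

    /* Link every object into the free list. */
    static void
    obj_pool_init(struct obj_pool *p)
    {
        size_t i;

        p->free_list = NULL;
        for (i = 0; i < POOL_OBJ_COUNT; i++) {
            p->objs[i].next = p->free_list;
            p->free_list = &p->objs[i];
        }
    }

    /* O(1) allocation: pop the head of the free list, or NULL if the pool is empty. */
    static void *
    obj_pool_get(struct obj_pool *p)
    {
        union pool_obj *o = p->free_list;

        if (o != NULL)
            p->free_list = o->next;
        return o;
    }

    /* O(1) release: push the object back onto the free list. */
    static void
    obj_pool_put(struct obj_pool *p, void *obj)
    {
        union pool_obj *o = obj;

        o->next = p->free_list;
        p->free_list = o;
    }

Since every object has the same size, get and put never search or split a heap, so fragmentation cannot occur.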

Concurrent Access to the Same Memory Area
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read-Write (RW) access operations by several lcores to the same memory area can generate a lot of data cache misses,
which are very costly.
It is often possible to use per-lcore variables instead, for example, in the case of statistics.
There are at least two solutions for this:

*   Use RTE_PER_LCORE variables. Note that in this case, data on lcore X is not available to lcore Y.

*   Use a table of structures (one per lcore). In this case, each structure must be cache-aligned.

Read-mostly variables can be shared among lcores without performance losses if there are no RW variables in the same cache line.
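
The second solution, a table of cache-aligned structures, can be sketched in plain C11 as follows.
The 64-byte cache line size, the lcore count and all names are assumptions made for illustration;
DPDK code would normally rely on its own cache-alignment macro rather than ``alignas``.

.. code-block:: c

    #include <stdalign.h>
    #include <stdint.h>

    #define CACHE_LINE_SIZE 64  /* assumed cache line size, in bytes */
    #define MAX_LCORES      4   /* hypothetical number of lcores */

    /* Aligning the first member aligns the whole structure and pads its
     * size to a multiple of the cache line, so two slots never share a line. */
    struct lcore_stats {
        alignas(CACHE_LINE_SIZE) uint64_t rx_packets;
        uint64_t tx_packets;
    };

    /* One slot per lcore; each lcore writes only to its own slot. */
    static struct lcore_stats stats[MAX_LCORES];

    static inline void
    stats_count_rx(unsigned int lcore_id, uint64_t n)
    {
        stats[lcore_id].rx_packets += n;
    }

    /* Aggregate on demand; a reader on another core may see slightly stale values. */
    static uint64_t
    stats_total_rx(void)
    {
        uint64_t total = 0;
        unsigned int i;

        for (i = 0; i < MAX_LCORES; i++)
            total += stats[i].rx_packets;
        return total;
    }

Unlike RTE_PER_LCORE variables, the table remains readable from any lcore, which suits counters that a management thread sums occasionally.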

NUMA
~~~~

On a NUMA system, it is preferable to access local memory since remote memory access is slower.
In the DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to place the allocated memory on a specific socket.

Sometimes, it can be a good idea to duplicate data on each socket to optimize speed.
For read-mostly variables that are often accessed,
it should not be a problem to keep them in one socket only, since data will be present in cache.

Distribution Across Memory Channels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modern memory controllers have several memory channels that can load or store data in parallel.
Depending on the memory controller and its configuration,
the number of channels and the way the memory is distributed across the channels vary.
Each channel has a bandwidth limit,
meaning that if all memory access operations are done on the first channel only, there is a potential bottleneck.

By default, the :ref:`Mempool Library <Mempool_Library>` spreads the addresses of objects among memory channels.

Locking Memory Pages
~~~~~~~~~~~~~~~~~~~~

The underlying operating system is allowed to load/unload memory pages at its own discretion.
These page loads could impact performance, as the process is on hold while the kernel fetches them.

To avoid this, you can pre-load the pages and lock them into memory with the ``mlockall()`` call.

.. code-block:: c

    /* Needs <sys/mman.h>, <errno.h>, <string.h> and <rte_log.h>. */
    /* Lock all current and future pages; on failure, log and carry on. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
        RTE_LOG(NOTICE, USER1, "mlockall() failed with error \"%s\"\n",
                strerror(errno));
    }

Communication Between lcores
----------------------------

To provide message-based communication between lcores,
it is advised to use the DPDK ring API, which provides a lockless ring implementation.

The ring supports bulk and burst access,
meaning that it is possible to read several elements from the ring with only one costly atomic operation
(see :doc:`ring_lib`).
Performance is greatly improved when using bulk access operations.

The code that dequeues messages may look similar to the following:

.. code-block:: c

    #define MAX_BULK 32

    void *obj_table[MAX_BULK];
    unsigned int count;

    while (1) {
        /* Dequeue up to MAX_BULK elements; burst mode returns however many are available. */
        count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK, NULL);
        if (unlikely(count == 0))
            continue;

        /* Process everything that was dequeued in a single call. */
        my_process_bulk(obj_table, count);
    }

PMD Driver
----------

The DPDK Poll Mode Driver (PMD) is also able to work in bulk/burst mode,
allowing some per-call code in the send or receive function to be factored out and shared across the packets of a burst.

Avoid partial writes.
When PCI devices write to system memory through DMA,
it costs less if the write operation is on a full cache line as opposed to part of it.
In the PMD code, actions have been taken to avoid partial writes as much as possible.

Lower Packet Latency
~~~~~~~~~~~~~~~~~~~~

Traditionally, there is a trade-off between throughput and latency.
An application can be tuned to achieve a high throughput,
but the end-to-end latency of an average packet will typically increase as a result.
Similarly, the application can be tuned to have, on average,
a low end-to-end latency, at the cost of lower throughput.

In order to achieve higher throughput,
the DPDK attempts to amortize the cost of processing each packet individually by processing packets in bursts.

Using the testpmd application as an example,
the burst size can be set on the command line to a value of 16 (also the default value).
This allows the application to request 16 packets at a time from the PMD.
The testpmd application then immediately attempts to transmit all the packets that were received,
in this case, all 16 packets.

The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
This behavior is desirable when tuning for high throughput because
the cost of tail pointer updates to both the RX and TX queues can be spread across 16 packets,
effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.
However, this is not very desirable when tuning for low latency because
the first packet that was received must also wait for another 15 packets to be received.
It cannot be transmitted until the other 15 packets have also been processed because
the NIC will not know to transmit the packets until the TX tail pointer has been updated,
which is not done until all 16 packets have been processed for transmission.
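
The trade-off can be made concrete with a small model.
This is only an illustrative sketch: the function names are hypothetical and the cycle counts in the note below are made-up inputs, not measurements.

.. code-block:: c

    /* Model: each packet costs per_pkt cycles of processing, and each burst
     * pays tail_update cycles once for the tail pointer (MMIO) writes. */
    static inline double
    cycles_per_packet(double per_pkt, double tail_update, unsigned int burst)
    {
        return per_pkt + tail_update / burst;
    }

    /* Number of other packets in the same burst that must be processed
     * before the tail pointer update releases the first packet. */
    static inline unsigned int
    first_packet_wait(unsigned int burst)
    {
        return burst - 1;
    }

With hypothetical costs of 100 cycles per packet and 160 cycles per tail pointer update,
a burst of 16 gives 110 cycles per packet while the first packet waits for 15 others,
whereas a burst of 1 gives 260 cycles per packet with no wait.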

To consistently achieve low latency, even under heavy system load,
the application developer should avoid processing packets in bursts.
The testpmd application can be configured from the command line to use a burst value of 1.
This will allow a single packet to be processed at a time, providing lower latency,
but with the added cost of lower throughput.

Locks and Atomic Operations
---------------------------

Atomic operations imply a lock prefix before the instruction,
causing the processor's LOCK# signal to be asserted during execution of the following instruction.
This has a big impact on performance in a multicore environment.

Performance can be improved by avoiding lock mechanisms in the data plane.
They can often be replaced by other solutions like per-lcore variables.
Also, some locking techniques are more efficient than others.
For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.

Coding Considerations
---------------------

Inline Functions
~~~~~~~~~~~~~~~~

Small functions can be declared as static inline in the header file.
This avoids the cost of a call instruction (and the associated context saving).
However, this technique is not always efficient; it depends on many factors including the compiler.
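
A hypothetical header fragment showing the pattern is given below.
The file name ``my_util.h`` and the function are illustrative only and do not exist in the DPDK.

.. code-block:: c

    /* my_util.h (hypothetical header) */
    #ifndef MY_UTIL_H
    #define MY_UTIL_H

    #include <stdint.h>

    /* Wrap an index into a table whose size is a power of two.
     * The body is a single AND, so call overhead could easily dominate it. */
    static inline uint32_t
    my_wrap_index(uint32_t idx, uint32_t size_pow2)
    {
        return idx & (size_pow2 - 1);
    }

    #endif /* MY_UTIL_H */

Each source file that includes the header gets its own copy of the function, which the compiler may inline at every call site or not, at its discretion.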

Branch Prediction
~~~~~~~~~~~~~~~~~

The likely() and unlikely() helper macros, which wrap the ``__builtin_expect()`` built-in of gcc and the Intel® C/C++ Compiler (icc),
allow the developer to indicate if a code branch is likely to be taken or not.
For instance:

.. code-block:: c

    if (likely(x > 1))
        do_stuff();
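
Outside the DPDK tree, hints of this kind can be sketched on top of the ``__builtin_expect()`` built-in of gcc-compatible compilers.
The ``my_``-prefixed macros and the ``sum_bytes()`` function below are hypothetical names chosen for this example.

.. code-block:: c

    #include <stddef.h>

    /* Requires a gcc-compatible compiler that provides __builtin_expect(). */
    #define my_likely(x)   __builtin_expect(!!(x), 1)
    #define my_unlikely(x) __builtin_expect(!!(x), 0)

    /* Sum a buffer, treating a NULL pointer as the rare error case. */
    static long
    sum_bytes(const unsigned char *buf, unsigned int len)
    {
        long sum = 0;
        unsigned int i;

        if (my_unlikely(buf == NULL))
            return -1;

        for (i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }

The hint never changes the result of the condition; it only tells the compiler which path to favor when laying out the code.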

Setting the Target CPU Type
---------------------------

The DPDK supports CPU microarchitecture-specific optimizations by means of the CONFIG_RTE_MACHINE option
in the DPDK configuration file.
The degree of optimization depends on the compiler's ability to optimize for a specific microarchitecture,
therefore it is preferable to use the latest compiler versions whenever possible.

If the compiler version does not support the specific feature set (for example, the Intel® AVX instruction set),
the build process gracefully degrades to the latest feature set that the compiler does support.

Since the build and runtime targets may not be the same,
the resulting binary also contains a platform check that runs before the
main() function and checks if the current machine is suitable for running the binary.

Along with compiler optimizations,
a set of preprocessor defines is automatically added to the build process (regardless of the compiler version).
These defines correspond to the instruction sets that the target CPU should be able to support.
For example, a binary compiled for any SSE4.2-capable processor will have RTE_MACHINE_CPUFLAG_SSE4_2 defined,
thus enabling compile-time code path selection for different platforms.
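
As a sketch of such compile-time selection, the hypothetical helper below computes a CRC-32C checksum,
using the SSE4.2 CRC32 instruction when RTE_MACHINE_CPUFLAG_SSE4_2 is defined and a portable loop otherwise.
The ``my_crc32c()`` name is an assumption made for this example and is not a DPDK function.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    #ifdef RTE_MACHINE_CPUFLAG_SSE4_2
    #include <nmmintrin.h>  /* SSE4.2 intrinsics */
    #endif

    /* CRC-32C (Castagnoli) of a buffer; hypothetical helper. */
    static uint32_t
    my_crc32c(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t crc = 0xFFFFFFFFu;
        size_t i;

        for (i = 0; i < len; i++) {
    #ifdef RTE_MACHINE_CPUFLAG_SSE4_2
            /* Target CPU supports SSE4.2: use the hardware CRC32 instruction. */
            crc = _mm_crc32_u8(crc, p[i]);
    #else
            /* Portable fallback: bit by bit, reflected polynomial 0x82F63B78. */
            int bit;

            crc ^= p[i];
            for (bit = 0; bit < 8; bit++)
                crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    #endif
        }
        return crc ^ 0xFFFFFFFFu;
    }

Both paths compute the same checksum, so callers do not need to know which one was selected at build time.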
248