..  BSD LICENSE
    Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
    * Neither the name of Intel Corporation nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Writing Efficient Code
======================

This chapter provides some tips for developing efficient code using the Intel® DPDK.
For additional and more general information,
please refer to the *Intel® 64 and IA-32 Architectures Optimization Reference Manual*,
which is a valuable reference for writing efficient code.

Memory
------

This section describes some key memory considerations when developing applications in the Intel® DPDK environment.

Memory Copy: Do not Use libc in the Data Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many libc functions are available in the Intel® DPDK, via the Linux* application environment.
This can ease the porting of applications and the development of the configuration plane.
However, many of these functions are not designed for performance.
Functions such as memcpy() or strcpy() should not be used in the data plane.
To copy small structures, the preference is for a simpler technique, such as a plain structure assignment,
that can be optimized by the compiler.
Refer to the *VTune™ Performance Analyzer Essentials* publication from Intel Press for recommendations.

For specific functions that are called often,
it is also a good idea to provide a custom optimized function, which should be declared as static inline.

The Intel® DPDK API provides an optimized rte_memcpy() function.
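
As a minimal sketch of the advice on small structures, a fixed-size structure can be copied by plain assignment
inside a static inline helper, which gives the compiler the chance to expand the copy in place.
The flow_key structure and the flow_key_copy() function are hypothetical names used for illustration only;
they are not part of the Intel® DPDK API.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical 16-byte key; any small fixed-size structure works the same way. */
    struct flow_key {
        uint32_t src_ip;
        uint32_t dst_ip;
        uint16_t src_port;
        uint16_t dst_port;
        uint32_t proto;
    };

    /* The size of a structure assignment is known at compile time,
     * so the compiler can expand it into a few loads and stores. */
    static inline void
    flow_key_copy(struct flow_key *dst, const struct flow_key *src)
    {
        *dst = *src;
    }
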

Memory Allocation
~~~~~~~~~~~~~~~~~

Other functions of libc, such as malloc(), provide a flexible way to allocate and free memory.
In some cases, using dynamic allocation is necessary,
but using malloc-like functions in the data plane is strongly discouraged because
managing a fragmented heap can be costly and the allocator may not be optimized for parallel allocation.

If dynamic allocation is really needed in the data plane, it is better to use a memory pool of fixed-size objects.
This API is provided by librte_mempool.
This data structure provides several services that increase performance, such as memory alignment of objects,
lockless access to objects, NUMA awareness, bulk get/put and per-lcore cache.
The rte_malloc() function uses a similar concept to mempools.
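
The idea of a pool of fixed-size objects can be illustrated without the library.
The following single-threaded sketch is not librte_mempool and has none of its lockless,
NUMA-aware or per-lcore cache features; all names and sizes in it are hypothetical.
It only shows why get and put are cheap when every object has the same size:
both are a constant-time stack operation on storage that was reserved once.

.. code-block:: c

    #include <stddef.h>

    #define POOL_OBJS    4     /* number of objects, chosen for the example */
    #define POOL_OBJ_SZ  64    /* fixed object size in bytes */

    struct fixed_pool {
        /* Backing storage, reserved once at initialization time. */
        _Alignas(64) unsigned char storage[POOL_OBJS][POOL_OBJ_SZ];
        void *free_objs[POOL_OBJS];   /* stack of free object pointers */
        size_t nb_free;
    };

    static void
    fixed_pool_init(struct fixed_pool *p)
    {
        size_t i;

        for (i = 0; i < POOL_OBJS; i++)
            p->free_objs[i] = p->storage[i];
        p->nb_free = POOL_OBJS;
    }

    /* Returns an object, or NULL if the pool is empty. */
    static inline void *
    fixed_pool_get(struct fixed_pool *p)
    {
        if (p->nb_free == 0)
            return NULL;
        return p->free_objs[--p->nb_free];
    }

    /* Returns an object to the pool; obj must come from fixed_pool_get(). */
    static inline void
    fixed_pool_put(struct fixed_pool *p, void *obj)
    {
        p->free_objs[p->nb_free++] = obj;
    }
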

Concurrent Access to the Same Memory Area
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read-Write (RW) access operations by several lcores to the same memory area can generate a lot of data cache misses,
which are very costly.
It is often possible to use per-lcore variables instead, for example, in the case of statistics.
There are at least two solutions for this:

*   Use RTE_PER_LCORE variables. Note that in this case, data on lcore X is not available to lcore Y.

*   Use a table of structures (one per lcore). In this case, each structure must be cache-aligned.

Read-mostly variables can be shared among lcores without performance losses if there are no RW variables in the same cache line.
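
The second solution, a table of cache-aligned structures, might look like the sketch below.
It is written in plain C so that it stands alone: MAX_LCORES, CACHE_LINE_SIZE and the function names are assumptions
made for the example, and in an Intel® DPDK application the index would typically be the value returned by rte_lcore_id().

.. code-block:: c

    #include <stdint.h>

    #define MAX_LCORES       128   /* assumed upper bound on lcores */
    #define CACHE_LINE_SIZE  64    /* assumed cache line size in bytes */

    /* One entry per lcore. The alignment attribute pads each entry to a
     * full cache line, so two lcores never write to the same line. */
    struct lcore_stats {
        uint64_t rx_packets;
        uint64_t tx_packets;
    } __attribute__((aligned(CACHE_LINE_SIZE)));

    static struct lcore_stats stats[MAX_LCORES];

    /* Data plane: each lcore updates only its own entry, without atomics. */
    static inline void
    stats_count_rx(unsigned int lcore_id, uint64_t n)
    {
        stats[lcore_id].rx_packets += n;
    }

    /* Control plane: sum the per-lcore counters when a total is needed.
     * The result is a snapshot and may lag behind running lcores. */
    static uint64_t
    stats_total_rx(void)
    {
        uint64_t total = 0;
        unsigned int i;

        for (i = 0; i < MAX_LCORES; i++)
            total += stats[i].rx_packets;
        return total;
    }
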

NUMA
~~~~

On a NUMA system, it is preferable to access local memory since remote memory access is slower.
In the Intel® DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to allocate memory on a specific socket.

Sometimes, it can be a good idea to duplicate data to optimize speed.
For read-mostly variables that are often accessed,
it should not be a problem to keep them in one socket only, since data will be present in cache.

Distribution Across Memory Channels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modern memory controllers have several memory channels that can load or store data in parallel.
Depending on the memory controller and its configuration,
the number of channels and the way the memory is distributed across the channels vary.
Each channel has a bandwidth limit,
meaning that if all memory access operations are done on the first channel only, there is a potential bottleneck.

By default, the :ref:`Mempool Library <Mempool_Library>` spreads the addresses of objects among memory channels.
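
One way to obtain such a spread is to pad each object so that its size, counted in cache lines,
is coprime with the number of channels; consecutive objects then start on different channels.
The sketch below illustrates that idea under the simplifying assumption that consecutive cache lines
map to consecutive channels. It is not a description of what the Mempool Library does internally,
and the function names are hypothetical.

.. code-block:: c

    #include <stddef.h>

    #define CACHE_LINE_SIZE 64   /* assumed interleaving granularity in bytes */

    static unsigned int
    gcd(unsigned int a, unsigned int b)
    {
        while (b != 0) {
            unsigned int t = a % b;

            a = b;
            b = t;
        }
        return a;
    }

    /* Round obj_size up to whole cache lines, then add padding lines until
     * the line count is coprime with nb_channels (which must be >= 1). */
    static size_t
    spread_object_size(size_t obj_size, unsigned int nb_channels)
    {
        unsigned int lines;

        lines = (unsigned int)((obj_size + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE);
        while (gcd(lines, nb_channels) != 1)
            lines++;
        return (size_t)lines * CACHE_LINE_SIZE;
    }

    /* Channel holding the first cache line of object number idx. */
    static unsigned int
    object_channel(size_t padded_size, unsigned int idx, unsigned int nb_channels)
    {
        return (unsigned int)((idx * padded_size / CACHE_LINE_SIZE) % nb_channels);
    }

With four channels, 2048-byte objects (32 cache lines) would all start on the same channel,
whereas padding them to 33 cache lines makes successive objects start on successive channels.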

Communication Between lcores
----------------------------

To provide message-based communication between lcores,
it is advised to use the Intel® DPDK ring API, which provides a lockless ring implementation.

The ring supports bulk and burst access,
meaning that it is possible to read several elements from the ring with only one costly atomic operation
(see Chapter 5 "Ring Library").
Performance is greatly improved when using bulk access operations.

The code that dequeues messages may look similar to the following:

.. code-block:: c

    #define MAX_BULK 32

    /* ring is an rte_ring created elsewhere; my_process_bulk() is application code. */
    void *obj_table[MAX_BULK];
    unsigned int count;

    while (1) {
        /* Process as many elements as can be dequeued. */
        count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK);
        if (unlikely(count == 0))
            continue;

        my_process_bulk(obj_table, count);
    }
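
To show why burst access is cheap, the following sketch implements a simplified single-producer,
single-consumer ring with C11 atomics. It is not the Intel® DPDK ring implementation, and every name in it is hypothetical.
The point to note is that the dequeue function reads and writes the shared indexes once per call,
however many objects it returns.

.. code-block:: c

    #include <stdatomic.h>
    #include <stddef.h>

    #define RING_SIZE 8u                 /* must be a power of two */
    #define RING_MASK (RING_SIZE - 1u)

    struct spsc_ring {
        _Atomic unsigned int head;       /* next slot to write, owned by the producer */
        _Atomic unsigned int tail;       /* next slot to read, owned by the consumer */
        void *slots[RING_SIZE];
    };

    /* Producer side: enqueue one object; returns 0 on success, -1 if full. */
    static inline int
    spsc_enqueue(struct spsc_ring *r, void *obj)
    {
        unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
        unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head - tail == RING_SIZE)
            return -1;
        r->slots[head & RING_MASK] = obj;
        /* Release: the slot content becomes visible before the new head. */
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return 0;
    }

    /* Consumer side: dequeue up to max objects; returns the number dequeued. */
    static inline unsigned int
    spsc_dequeue_burst(struct spsc_ring *r, void **objs, unsigned int max)
    {
        unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);
        unsigned int n = head - tail;
        unsigned int i;

        if (n > max)
            n = max;
        for (i = 0; i < n; i++)
            objs[i] = r->slots[(tail + i) & RING_MASK];
        /* One index update frees all n slots for the producer. */
        if (n != 0)
            atomic_store_explicit(&r->tail, tail + n, memory_order_release);
        return n;
    }
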

PMD Driver
----------

The Intel® DPDK Poll Mode Driver (PMD) is also able to work in bulk/burst mode,
so that some of the per-call overhead in the send or receive function is shared by all packets in the burst.

Avoid partial writes.
When PCI devices write to system memory through DMA,
it costs less if the write operation is on a full cache line as opposed to part of it.
In the PMD code, actions have been taken to avoid partial writes as much as possible.

Lower Packet Latency
~~~~~~~~~~~~~~~~~~~~

Traditionally, there is a trade-off between throughput and latency.
An application can be tuned to achieve a high throughput,
but the end-to-end latency of an average packet will typically increase as a result.
Similarly, the application can be tuned to have, on average,
a low end-to-end latency, at the cost of lower throughput.

In order to achieve higher throughput,
the Intel® DPDK attempts to amortize the cost of processing each packet individually by processing packets in bursts.

Using the testpmd application as an example,
the burst size can be set on the command line to a value of 16 (also the default value).
This allows the application to request 16 packets at a time from the PMD.
The testpmd application then immediately attempts to transmit all the packets that were received,
in this case, all 16 packets.

The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
This behavior is desirable when tuning for high throughput because
the cost of tail pointer updates to both the RX and TX queues can be spread across 16 packets,
effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.
However, this is not very desirable when tuning for low latency because
the first packet that was received must also wait for another 15 packets to be received.
It cannot be transmitted until the other 15 packets have also been processed because
the NIC will not know to transmit the packets until the TX tail pointer has been updated,
which is not done until all 16 packets have been processed for transmission.

To consistently achieve low latency, even under heavy system load,
the application developer should avoid processing packets in bursts.
The testpmd application can be configured from the command line to use a burst value of 1.
This will allow a single packet to be processed at a time, providing lower latency,
but with the added cost of lower throughput.
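
The trade-off can be made concrete with a little arithmetic.
The helper functions below are hypothetical and model only the number of tail pointer writes
and the number of packets the first packet waits for; they say nothing about real MMIO timing.

.. code-block:: c

    /* Number of tail pointer writes needed to hand nb_pkts packets to the
     * NIC when one write is issued per burst of burst_size packets. */
    static inline unsigned int
    tail_updates(unsigned int nb_pkts, unsigned int burst_size)
    {
        return (nb_pkts + burst_size - 1) / burst_size;
    }

    /* Number of later packets that the first packet of a full burst waits
     * for before the tail pointer update makes it visible to the NIC. */
    static inline unsigned int
    first_packet_wait(unsigned int burst_size)
    {
        return burst_size - 1;
    }

For 1024 packets, a burst size of 16 needs 64 tail pointer writes and makes the first packet of each burst wait for 15 others,
whereas a burst size of 1 needs 1024 writes but never makes a packet wait for another.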

Locks and Atomic Operations
---------------------------

Atomic operations imply a lock prefix before the instruction,
causing the processor's LOCK# signal to be asserted during execution of the following instruction.
This has a big impact on performance in a multicore environment.

Performance can be improved by avoiding lock mechanisms in the data plane.
They can often be replaced by other solutions like per-lcore variables.
Also, some locking techniques are more efficient than others.
For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
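
The sketch below shows only the publish step of an RCU-style update, using C11 atomics:
readers take no lock and perform a single pointer load, while the writer prepares a new copy and swaps it in.
Detecting when the old copy may safely be freed (the grace period) is deliberately left out,
and the config structure and function names are hypothetical.

.. code-block:: c

    #include <stdatomic.h>
    #include <stddef.h>

    /* Hypothetical read-mostly configuration shared by all lcores. */
    struct config {
        unsigned int burst_size;
        unsigned int mtu;
    };

    /* Must be published once before any reader runs. */
    static _Atomic(struct config *) active_cfg;

    /* Reader (data plane): one acquire load, no lock. */
    static inline unsigned int
    cfg_burst_size(void)
    {
        const struct config *c;

        c = atomic_load_explicit(&active_cfg, memory_order_acquire);
        return c->burst_size;
    }

    /* Writer (control plane): new_cfg is a fully initialized copy.
     * Returns the previous copy, which the caller may free only once
     * no reader can still be using it. */
    static inline struct config *
    cfg_publish(struct config *new_cfg)
    {
        return atomic_exchange_explicit(&active_cfg, new_cfg, memory_order_acq_rel);
    }
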

Coding Considerations
---------------------

Inline Functions
~~~~~~~~~~~~~~~~

Small functions can be declared as static inline in the header file.
This avoids the cost of a call instruction (and the associated context saving).
However, this technique is not always efficient; it depends on many factors including the compiler.
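
A header following this advice might look like the sketch below.
The file name my_math.h and the align_up() function are hypothetical and serve only as an example of a small,
frequently called helper.

.. code-block:: c

    /* my_math.h: hypothetical header included by several source files. */
    #ifndef MY_MATH_H
    #define MY_MATH_H

    #include <stdint.h>

    /* Round v up to the next multiple of align, which must be a power of two.
     * Being static inline, the body is visible at each call site and the
     * compiler may expand it there instead of emitting a call. */
    static inline uint32_t
    align_up(uint32_t v, uint32_t align)
    {
        return (v + align - 1) & ~(align - 1);
    }

    #endif /* MY_MATH_H */
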

Branch Prediction
~~~~~~~~~~~~~~~~~

The likely() and unlikely() helper macros, built on the __builtin_expect() built-in of the Intel® C/C++ Compiler (icc) and gcc,
allow the developer to indicate if a code branch is likely to be taken or not.
For instance:

.. code-block:: c

    if (likely(x > 1))
        do_stuff();
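
For readers who want to try the hints outside the Intel® DPDK, the sketch below defines the two macros
in their usual form on top of __builtin_expect() and uses unlikely() to mark a rare error path.
The sum_lengths() function is a hypothetical example.

.. code-block:: c

    #include <stddef.h>

    /* Usual form of the macros; the guards avoid a clash if a header
     * has already defined them. */
    #ifndef likely
    #define likely(x)   __builtin_expect(!!(x), 1)
    #endif
    #ifndef unlikely
    #define unlikely(x) __builtin_expect(!!(x), 0)
    #endif

    /* Sum the entries of len; a negative entry marks the rare error case. */
    static inline long
    sum_lengths(const int *len, size_t n)
    {
        long total = 0;
        size_t i;

        for (i = 0; i < n; i++) {
            if (unlikely(len[i] < 0))
                return -1;          /* cold path: error */
            total += len[i];        /* hot path */
        }
        return total;
    }
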

Setting the Target CPU Type
---------------------------

The Intel® DPDK supports CPU microarchitecture-specific optimizations by means of the CONFIG_RTE_MACHINE option
in the Intel® DPDK configuration file.
The degree of optimization depends on the compiler's ability to optimize for a specific microarchitecture,
therefore it is preferable to use the latest compiler versions whenever possible.

If the compiler version does not support the specific feature set (for example, the Intel® AVX instruction set),
the build process gracefully degrades to the latest feature set that is supported by the compiler.

Since the build and runtime targets may not be the same,
the resulting binary also contains a platform check that runs before the
main() function and checks if the current machine is suitable for running the binary.

Along with compiler optimizations,
a set of preprocessor defines is automatically added to the build process (regardless of the compiler version).
These defines correspond to the instruction sets that the target CPU should be able to support.
For example, a binary compiled for any SSE4.2-capable processor will have RTE_MACHINE_CPUFLAG_SSE4_2 defined,
thus enabling compile-time code path selection for different platforms.
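
Compile-time code path selection of this kind could be sketched as follows.
The crc32c() and crc32c_byte() functions are hypothetical application code, not Intel® DPDK functions:
when RTE_MACHINE_CPUFLAG_SSE4_2 is defined the hardware CRC32 instruction is used through its intrinsic,
otherwise a portable bitwise fallback is compiled instead.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    #ifdef RTE_MACHINE_CPUFLAG_SSE4_2
    #include <nmmintrin.h>   /* SSE4.2 intrinsics */
    #endif

    /* One CRC32C (Castagnoli) update step for a single byte. */
    static inline uint32_t
    crc32c_byte(uint32_t crc, uint8_t data)
    {
    #ifdef RTE_MACHINE_CPUFLAG_SSE4_2
        /* Target CPU has SSE4.2: use the hardware CRC32 instruction. */
        return _mm_crc32_u8(crc, data);
    #else
        /* Portable fallback: bitwise update with the reflected polynomial. */
        int i;

        crc ^= data;
        for (i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        return crc;
    #endif
    }

    /* CRC32C of a buffer, with the conventional initial value and final inversion. */
    static inline uint32_t
    crc32c(const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        uint32_t crc = 0xFFFFFFFFu;
        size_t i;

        for (i = 0; i < len; i++)
            crc = crc32c_byte(crc, p[i]);
        return crc ^ 0xFFFFFFFFu;
    }
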