..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

Writing Efficient Code
======================

This chapter provides some tips for developing efficient code using the DPDK.
For additional and more general information,
please refer to the *Intel® 64 and IA-32 Architectures Optimization Reference Manual*,
which is a valuable reference for writing efficient code.

Memory
------

This section describes some key memory considerations when developing applications in the DPDK environment.

Memory Copy: Do not Use libc in the Data Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many libc functions are available in the DPDK via the Linux* application environment.
This can ease the porting of applications and the development of the configuration plane.
However, many of these functions are not designed for performance.
Functions such as memcpy() or strcpy() should not be used in the data plane.
To copy small structures, prefer a simpler technique that the compiler can optimize.
Refer to the *VTune™ Performance Analyzer Essentials* publication from Intel Press for recommendations.

For specific functions that are called often,
it is also a good idea to provide a hand-optimized function, which should be declared as static inline.

The DPDK API provides an optimized rte_memcpy() function.
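
As a minimal sketch (the structure and helper names below are illustrative, not part of any DPDK API),
a small fixed-size copy can be wrapped in a static inline helper so that the compiler can optimize it,
while larger copies can use rte_memcpy():

.. code-block:: c

    #include <stdint.h>
    #include <stddef.h>
    #include <rte_memcpy.h>

    struct flow_key {            /* illustrative structure */
        uint32_t src_ip;
        uint32_t dst_ip;
        uint16_t src_port;
        uint16_t dst_port;
    };

    /* Small, fixed-size copy: a plain assignment the compiler can optimize. */
    static inline void
    flow_key_copy(struct flow_key *dst, const struct flow_key *src)
    {
        *dst = *src;
    }

    /* Larger or variable-size copies in the data plane: use rte_memcpy(). */
    static inline void
    copy_payload(void *dst, const void *src, size_t len)
    {
        rte_memcpy(dst, src, len);
    }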

Memory Allocation
~~~~~~~~~~~~~~~~~

Other functions of libc, such as malloc(), provide a flexible way to allocate and free memory.
In some cases, using dynamic allocation is necessary,
but it is not advised to use malloc-like functions in the data plane because
managing a fragmented heap can be costly and the allocator may not be optimized for parallel allocation.

If you really need dynamic allocation in the data plane, it is better to use a memory pool of fixed-size objects.
This API is provided by librte_mempool.
This data structure provides several services that increase performance, such as memory alignment of objects,
lockless access to objects, NUMA awareness, bulk get/put and per-lcore cache.
The rte_malloc() function uses a similar concept to mempools.
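
The following is a minimal sketch of creating and using such a pool
(the pool name, element count, element size and cache size are illustrative values):

.. code-block:: c

    #include <rte_mempool.h>
    #include <rte_lcore.h>
    #include <rte_debug.h>

    #define NB_OBJS   8192   /* number of elements (illustrative) */
    #define OBJ_SIZE  256    /* element size in bytes (illustrative) */
    #define CACHE_SZ  32     /* per-lcore cache size (illustrative) */

    /* At initialization time: create the pool on the local socket. */
    struct rte_mempool *mp = rte_mempool_create("msg_pool", NB_OBJS, OBJ_SIZE,
                                                CACHE_SZ, 0,
                                                NULL, NULL, NULL, NULL,
                                                rte_socket_id(), 0);
    if (mp == NULL)
        rte_panic("Cannot create mempool\n");

    /* In the data plane: get and put fixed-size objects instead of
     * calling malloc()/free(). */
    void *obj;
    if (rte_mempool_get(mp, &obj) == 0) {
        /* ... use obj ... */
        rte_mempool_put(mp, obj);
    }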

Concurrent Access to the Same Memory Area
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read-Write (RW) access operations by several lcores to the same memory area can generate a lot of data cache misses,
which are very costly.
It is often possible to use per-lcore variables, for example, in the case of statistics.
There are at least two solutions for this:

*   Use RTE_PER_LCORE variables. Note that in this case, data on lcore X is not available to lcore Y.

*   Use a table of structures (one per lcore). In this case, each structure must be cache-aligned.

Read-mostly variables can be shared among lcores without performance losses if there are no RW variables in the same cache line.
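
A minimal sketch of both approaches is shown below
(the statistics structure and counter names are illustrative):

.. code-block:: c

    #include <stdint.h>
    #include <rte_per_lcore.h>
    #include <rte_lcore.h>
    #include <rte_memory.h>

    /* Option 1: an RTE_PER_LCORE variable, private to each lcore. */
    static RTE_DEFINE_PER_LCORE(uint64_t, rx_pkts);

    static inline void
    count_rx_private(void)
    {
        RTE_PER_LCORE(rx_pkts)++;
    }

    /* Option 2: a table with one cache-aligned structure per lcore,
     * readable by any other lcore (for example, to report statistics). */
    struct lcore_stats {
        uint64_t rx_pkts;
        uint64_t tx_pkts;
    } __rte_cache_aligned;

    static struct lcore_stats stats[RTE_MAX_LCORE];

    static inline void
    count_rx_shared(void)
    {
        stats[rte_lcore_id()].rx_pkts++;
    }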

NUMA
~~~~

On a NUMA system, it is preferable to access local memory since remote memory access is slower.
In the DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to create objects on a specific socket.

Sometimes, it can be a good idea to duplicate data to optimize speed.
For read-mostly variables that are often accessed,
it should not be a problem to keep them on one socket only, since the data will be present in cache.
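
For example, a minimal sketch of a NUMA-aware allocation with rte_malloc
(the table type and its contents are illustrative):

.. code-block:: c

    #include <stdint.h>
    #include <rte_malloc.h>
    #include <rte_lcore.h>
    #include <rte_memory.h>

    struct lookup_table {        /* illustrative structure */
        uint32_t entries[1024];
    };

    /* Allocate the table on the socket of the lcore that will use it. */
    struct lookup_table *tbl = rte_malloc_socket("lookup_table",
                                                 sizeof(*tbl),
                                                 RTE_CACHE_LINE_SIZE,
                                                 rte_socket_id());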

Distribution Across Memory Channels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modern memory controllers have several memory channels that can load or store data in parallel.
Depending on the memory controller and its configuration,
the number of channels and the way the memory is distributed across the channels varies.
Each channel has a bandwidth limit,
meaning that if all memory access operations are done on the first channel only, there is a potential bottleneck.

By default, the :ref:`Mempool Library <Mempool_Library>` spreads the addresses of objects among memory channels.

Locking memory pages
~~~~~~~~~~~~~~~~~~~~

The underlying operating system is allowed to load/unload memory pages at its own discretion.
These page loads could impact performance, as the process is on hold while the kernel fetches them.

To avoid these page loads, you can pre-load the pages and lock them into memory with the ``mlockall()`` call.

.. code-block:: c

    if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
        RTE_LOG(NOTICE, USER1, "mlockall() failed with error \"%s\"\n",
                strerror(errno));
    }

Communication Between lcores
----------------------------

To provide message-based communication between lcores,
it is advised to use the DPDK ring API, which provides a lockless ring implementation.

The ring supports bulk and burst access,
meaning that it is possible to read several elements from the ring with only one costly atomic operation
(see :doc:`ring_lib`).
Performance is greatly improved when using bulk access operations.

The code that dequeues messages may look similar to the following:

.. code-block:: c

    #define MAX_BULK 32

    while (1) {
        /* Process as many elements as can be dequeued. */
        count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK, NULL);
        if (unlikely(count == 0))
            continue;

        my_process_bulk(obj_table, count);
    }
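
The enqueue side can batch messages in the same way, so that several objects are pushed to the ring
with a single operation. The following is a minimal sketch under the same assumptions as above
(the ``pending`` buffer and ``send_msg()`` helper are illustrative; a real producer would also handle
the case where the ring is full and fewer objects are enqueued):

.. code-block:: c

    #include <rte_ring.h>

    static void *pending[MAX_BULK];
    static unsigned int nb_pending;

    /* Accumulate messages locally and enqueue them in one burst. */
    static inline void
    send_msg(struct rte_ring *ring, void *msg)
    {
        pending[nb_pending++] = msg;
        if (nb_pending == MAX_BULK) {
            /* The return value (number actually enqueued) is ignored here. */
            rte_ring_enqueue_burst(ring, pending, nb_pending, NULL);
            nb_pending = 0;
        }
    }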

PMD Driver
----------

The DPDK Poll Mode Driver (PMD) is also able to work in bulk/burst mode,
allowing some of the per-call processing in the send and receive functions to be factorized across a whole burst of packets.

Avoid partial writes.
When PCI devices write to system memory through DMA,
it costs less if the write operation is on a full cache line as opposed to part of it.
In the PMD code, actions have been taken to avoid partial writes as much as possible.
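
A minimal sketch of a burst-oriented forwarding step using the PMD receive and transmit functions
is shown below (the port and queue identifiers are illustrative):

.. code-block:: c

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    struct rte_mbuf *pkts[BURST_SIZE];
    uint16_t port_id = 0;        /* illustrative port */
    uint16_t nb_rx, nb_tx, i;

    /* Receive up to BURST_SIZE packets in one call, then transmit them in
     * one call, so that the per-call overhead is shared by the whole burst. */
    nb_rx = rte_eth_rx_burst(port_id, 0, pkts, BURST_SIZE);
    if (nb_rx > 0) {
        nb_tx = rte_eth_tx_burst(port_id, 0, pkts, nb_rx);
        /* Free any packets that could not be transmitted. */
        for (i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(pkts[i]);
    }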

Lower Packet Latency
~~~~~~~~~~~~~~~~~~~~

Traditionally, there is a trade-off between throughput and latency.
An application can be tuned to achieve a high throughput,
but the end-to-end latency of an average packet will typically increase as a result.
Similarly, the application can be tuned to have, on average,
a low end-to-end latency, at the cost of lower throughput.

In order to achieve higher throughput,
the DPDK attempts to amortize the cost of processing each packet individually by processing packets in bursts.

Using the testpmd application as an example,
the burst size can be set on the command line to a value of 16 (also the default value).
This allows the application to request 16 packets at a time from the PMD.
The testpmd application then immediately attempts to transmit all the packets that were received,
in this case, all 16 packets.

The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
This behavior is desirable when tuning for high throughput because
the cost of tail pointer updates to both the RX and TX queues can be spread across 16 packets,
effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.
However, this is not very desirable when tuning for low latency because
the first packet that was received must also wait for another 15 packets to be received.
It cannot be transmitted until the other 15 packets have also been processed because
the NIC will not know to transmit the packets until the TX tail pointer has been updated,
which is not done until all 16 packets have been processed for transmission.

To consistently achieve low latency, even under heavy system load,
the application developer should avoid processing packets in bunches.
The testpmd application can be configured from the command line to use a burst value of 1.
This will allow a single packet to be processed at a time, providing lower latency,
but with the added cost of lower throughput.
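
For example, testpmd can be started with different burst values to compare the two tunings
(the EAL core and memory channel options below are illustrative):

.. code-block:: console

    # Tune for throughput: process packets 16 at a time (the default).
    ./testpmd -l 0-3 -n 4 -- --burst=16

    # Tune for latency: process one packet at a time.
    ./testpmd -l 0-3 -n 4 -- --burst=1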

Locks and Atomic Operations
---------------------------

This section describes some key considerations when using locks and atomic
operations in the DPDK environment.

Locks
~~~~~

On x86, atomic operations imply a lock prefix before the instruction,
causing the processor's LOCK# signal to be asserted during execution of the accompanying instruction.
This has a big impact on performance in a multicore environment.

Performance can be improved by avoiding lock mechanisms in the data plane.
Locks can often be replaced by other solutions such as per-lcore variables.
Also, some locking techniques are more efficient than others.
For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.

Atomic Operations: Use C11 Atomic Builtins
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DPDK generic rte_atomic operations are implemented by __sync builtins. These
__sync builtins result in full barriers on aarch64, which are unnecessary
in many use cases. They can be replaced by __atomic builtins that conform to
the C11 memory model and provide finer memory order control.

So replacing the rte_atomic operations with __atomic builtins might improve
performance for aarch64 machines.

Some typical optimization cases are listed below:

Atomicity
^^^^^^^^^

Some use cases require only atomicity; the ordering of the memory operations
does not matter. For example, the packet statistics counters need to be
incremented atomically but do not need any particular memory ordering.
So, RELAXED memory ordering is sufficient.
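
A minimal sketch of such a counter using the __atomic builtins is shown below
(the counter name is illustrative):

.. code-block:: c

    #include <stdint.h>

    static uint64_t rx_pkts;    /* illustrative statistics counter */

    /* Only atomicity is required: no ordering with surrounding accesses. */
    static inline void
    count_rx(uint32_t n)
    {
        __atomic_fetch_add(&rx_pkts, n, __ATOMIC_RELAXED);
    }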

One-way Barrier
^^^^^^^^^^^^^^^

Some use cases allow for memory reordering in one way while requiring memory
ordering in the other direction.

For example, the memory operations before the spinlock lock are allowed to
move into the critical section, but the memory operations in the critical section
are not allowed to move above the lock. In this case, the full memory barrier
in the compare-and-swap operation can be replaced with ACQUIRE memory order.
On the other hand, the memory operations after the spinlock unlock are allowed
to move into the critical section, but the memory operations in the critical
section are not allowed to move below the unlock. So the full barrier in the
store operation can be replaced with RELEASE memory order.
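
The following sketch of a simple lock built on the __atomic builtins illustrates these one-way barriers
(it is an illustration only, not the DPDK rte_spinlock implementation):

.. code-block:: c

    /* 0 = unlocked, 1 = locked */
    static inline void
    my_lock(int *l)
    {
        int exp = 0;

        /* ACQUIRE: accesses in the critical section cannot move above the
         * lock, but earlier accesses may move into the critical section. */
        while (!__atomic_compare_exchange_n(l, &exp, 1, 0,
                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
            while (__atomic_load_n(l, __ATOMIC_RELAXED) != 0)
                ;
            exp = 0;
        }
    }

    static inline void
    my_unlock(int *l)
    {
        /* RELEASE: accesses in the critical section cannot move below the
         * unlock, but later accesses may move into the critical section. */
        __atomic_store_n(l, 0, __ATOMIC_RELEASE);
    }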

Reader-Writer Concurrency
^^^^^^^^^^^^^^^^^^^^^^^^^

Lock-free reader-writer concurrency is one of the common use cases in DPDK.

The payload, or the data that the writer wants to communicate to the reader,
can be written with RELAXED memory order. However, the guard variable should
be written with RELEASE memory order. This ensures that the store to the guard
variable is observable only after the store to the payload is observable.

Correspondingly, on the reader side, the guard variable should be read
with ACQUIRE memory order. The payload, or the data the writer communicated,
can be read with RELAXED memory order. This ensures that, if the store to the
guard variable is observable, the store to the payload is also observable.
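
A minimal sketch of the writer and reader sides is shown below
(the payload structure and guard variable are illustrative):

.. code-block:: c

    #include <stdint.h>
    #include <stdbool.h>

    struct msg {                /* illustrative payload */
        uint64_t a;
        uint64_t b;
    };

    static struct msg payload;
    static bool ready;          /* guard variable */

    /* Writer: payload stores may be RELAXED, the guard store must be RELEASE. */
    static inline void
    publish(uint64_t a, uint64_t b)
    {
        __atomic_store_n(&payload.a, a, __ATOMIC_RELAXED);
        __atomic_store_n(&payload.b, b, __ATOMIC_RELAXED);
        __atomic_store_n(&ready, true, __ATOMIC_RELEASE);
    }

    /* Reader: the guard load must be ACQUIRE, payload loads may be RELAXED. */
    static inline bool
    consume(struct msg *out)
    {
        if (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
            return false;
        out->a = __atomic_load_n(&payload.a, __ATOMIC_RELAXED);
        out->b = __atomic_load_n(&payload.b, __ATOMIC_RELAXED);
        return true;
    }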

Coding Considerations
---------------------

Inline Functions
~~~~~~~~~~~~~~~~

Small functions can be declared as static inline in the header file.
This avoids the cost of a call instruction (and the associated context saving).
However, this technique is not always efficient; it depends on many factors including the compiler.
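
For example, a small helper of this kind might look as follows
(a sketch; the function is illustrative and not part of any DPDK API):

.. code-block:: c

    #include <stdint.h>

    /* In a header file included by the data plane code. */
    static inline uint32_t
    round_up_to_8(uint32_t v)
    {
        /* Round v up to the next multiple of 8. */
        return (v + 7) & ~(uint32_t)7;
    }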

Branch Prediction
~~~~~~~~~~~~~~~~~

The built-in helper functions likely() and unlikely(), available with both the Intel® C/C++ Compiler (icc) and gcc,
allow the developer to indicate whether a code branch is likely to be taken or not.
For instance:

.. code-block:: c

    if (likely(x > 1))
        do_stuff();

Setting the Target CPU Type
---------------------------

The DPDK supports CPU microarchitecture-specific optimizations by means of the CONFIG_RTE_MACHINE option
in the DPDK configuration file.
The degree of optimization depends on the compiler's ability to optimize for a specific microarchitecture,
therefore it is preferable to use the latest compiler versions whenever possible.

If the compiler version does not support the specific feature set (for example, the Intel® AVX instruction set),
the build process gracefully degrades to the latest feature set that is supported by the compiler.

Since the build and runtime targets may not be the same,
the resulting binary also contains a platform check that runs before the
main() function and checks if the current machine is suitable for running the binary.

Along with compiler optimizations,
a set of preprocessor defines is automatically added to the build process (regardless of the compiler version).
These defines correspond to the instruction sets that the target CPU should be able to support.
For example, a binary compiled for any SSE4.2-capable processor will have RTE_MACHINE_CPUFLAG_SSE4_2 defined,
thus enabling compile-time code path selection for different platforms.
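
A minimal sketch of such compile-time code path selection is shown below
(the SSE4.2 CRC32 intrinsic and the fallback hash are illustrative):

.. code-block:: c

    #include <stdint.h>

    static inline uint32_t
    calc_hash(uint32_t init, uint32_t data)
    {
    #ifdef RTE_MACHINE_CPUFLAG_SSE4_2
        /* The target guarantees SSE4.2: use the hardware CRC32 instruction. */
        return __builtin_ia32_crc32si(init, data);
    #else
        /* Portable fallback (illustrative). */
        return init ^ (data * 2654435761u);
    #endif
    }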