Writing Efficient Code
======================

This chapter provides some tips for developing efficient code using the DPDK.
For additional and more general information,
please refer to the *Intel® 64 and IA-32 Architectures Optimization Reference Manual*,
which is a valuable reference for writing efficient code.

Memory
------

This section describes some key memory considerations when developing applications in the DPDK environment.

Memory Copy: Do not Use libc in the Data Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many libc functions are available in the DPDK via the Linux* application environment.
This can ease the porting of applications and the development of the configuration plane.
However, many of these functions are not designed for performance.
Functions such as memcpy() or strcpy() should not be used in the data plane.
To copy small structures, prefer a simpler technique that the compiler can optimize.
Refer to the *VTune™ Performance Analyzer Essentials* publication from Intel Press for recommendations.

For specific functions that are called often,
it is also a good idea to provide a self-made optimized function, which should be declared as static inline.

The DPDK API provides an optimized rte_memcpy() function.
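As an illustration, the following sketch (the structure and helper names are hypothetical) copies a small,
fixed-size structure with a plain assignment inside a static inline function,
which the compiler can typically reduce to a few register moves,
and uses rte_memcpy() for variable-length copies:

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>
    #include <rte_memcpy.h>

    /* Hypothetical small structure; copying it by assignment lets the
     * compiler inline the copy instead of calling out to libc memcpy().
     */
    struct flow_key {
        uint32_t src_ip;
        uint32_t dst_ip;
        uint16_t src_port;
        uint16_t dst_port;
    };

    static inline void
    flow_key_copy(struct flow_key *dst, const struct flow_key *src)
    {
        *dst = *src; /* inlined by the compiler, no function call */
    }

    static inline void
    copy_payload(void *dst, const void *src, size_t len)
    {
        rte_memcpy(dst, src, len); /* optimized variable-length copy */
    }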
Memory Allocation
~~~~~~~~~~~~~~~~~

Other functions of libc, such as malloc(), provide a flexible way to allocate and free memory.
In some cases, using dynamic allocation is necessary,
but it is not advised to use malloc-like functions in the data plane because
managing a fragmented heap can be costly and the allocator may not be optimized for parallel allocation.

If you really need dynamic allocation in the data plane, it is better to use a memory pool of fixed-size objects.
This API is provided by librte_mempool.
This data structure provides several services that increase performance, such as memory alignment of objects,
lockless access to objects, NUMA awareness, bulk get/put and per-lcore cache.
The rte_malloc() function uses a similar concept to mempools.
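A minimal sketch of this pattern follows; the pool name and sizes are hypothetical values for illustration.
The pool is created once at initialization time, and fixed-size objects are then taken from it and
returned to it in the fast path instead of calling malloc() and free():

.. code-block:: c

    #include <rte_mempool.h>
    #include <rte_lcore.h>

    #define POOL_SIZE  8191 /* number of objects; 2^q - 1 is optimal */
    #define OBJ_SIZE   64   /* fixed size of each object, in bytes */
    #define CACHE_SIZE 32   /* per-lcore object cache */

    struct rte_mempool *pool;
    void *obj;

    /* Done once at initialization time, on the local NUMA socket. */
    pool = rte_mempool_create("obj_pool", POOL_SIZE, OBJ_SIZE,
                              CACHE_SIZE, 0, NULL, NULL, NULL, NULL,
                              rte_socket_id(), 0);
    if (pool == NULL) {
        /* handle pool creation failure */
    }

    /* Fast path: constant-time get/put of fixed-size objects. */
    if (rte_mempool_get(pool, &obj) == 0) {
        /* ... use obj ... */
        rte_mempool_put(pool, obj);
    }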
Concurrent Access to the Same Memory Area
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read-Write (RW) access operations by several lcores to the same memory area can generate a lot of data cache misses,
which are very costly.
It is often possible to use per-lcore variables instead, for example, in the case of statistics.
There are at least two solutions for this (both are sketched below):

* Use RTE_PER_LCORE variables. Note that in this case, data on lcore X is not available to lcore Y.

* Use a table of structures (one per lcore). In this case, each structure must be cache-aligned.

Read-mostly variables can be shared among lcores without performance losses if there are no RW variables in the same cache line.
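The following sketch illustrates both approaches with a hypothetical per-lcore packet counter.
The first counter is private to each lcore; the second is a cache-aligned table that any lcore may read:

.. code-block:: c

    #include <stdint.h>
    #include <rte_per_lcore.h>
    #include <rte_lcore.h>
    #include <rte_memory.h> /* __rte_cache_aligned */

    /* Option 1: a per-lcore variable; lcore Y cannot see lcore X's copy. */
    RTE_DEFINE_PER_LCORE(uint64_t, rx_pkts);

    static inline void
    count_rx_private(void)
    {
        RTE_PER_LCORE(rx_pkts)++;
    }

    /* Option 2: one cache-aligned structure per lcore, so two lcores
     * never write to the same cache line, yet any lcore can read the
     * statistics of another one.
     */
    struct lcore_stats {
        uint64_t rx_pkts;
        uint64_t tx_pkts;
    } __rte_cache_aligned;

    static struct lcore_stats stats[RTE_MAX_LCORE];

    static inline void
    count_rx_shared(void)
    {
        stats[rte_lcore_id()].rx_pkts++;
    }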
NUMA
~~~~

On a NUMA system, it is preferable to access local memory since remote memory access is slower.
In the DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to create a pool on a specific socket.

Sometimes, it can be a good idea to duplicate data to optimize speed.
For read-mostly variables that are often accessed,
keeping them on one socket only should not be a problem, since the data will be present in cache.

Distribution Across Memory Channels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modern memory controllers have several memory channels that can load or store data in parallel.
Depending on the memory controller and its configuration,
the number of channels and the way the memory is distributed across the channels varies.
Each channel has a bandwidth limit,
meaning that if all memory access operations are done on the first channel only, there is a potential bottleneck.

By default, the :ref:`Mempool Library <Mempool_Library>` spreads the addresses of objects among memory channels.

Locking memory pages
~~~~~~~~~~~~~~~~~~~~

The underlying operating system is allowed to load/unload memory pages at its own discretion.
These page loads can impact performance, as the process is put on hold while the kernel fetches them.

To avoid this, pre-load the pages and lock them into memory with the ``mlockall()`` call:

.. code-block:: c

    #include <sys/mman.h>

    /* Lock all current and future pages into RAM to avoid page-in stalls. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
        RTE_LOG(NOTICE, USER1, "mlockall() failed with error \"%s\"\n",
                strerror(errno));
    }

Communication Between lcores
----------------------------

To provide message-based communication between lcores,
it is advised to use the DPDK ring API, which provides a lockless ring implementation.

The ring supports bulk and burst access,
meaning that it is possible to read several elements from the ring with only one costly atomic operation
(see :doc:`ring_lib`).
Performance is greatly improved when using bulk access operations.

The code that dequeues messages may be similar to the following:

.. code-block:: c

    #define MAX_BULK 32

    /* "ring" is a struct rte_ring * created at initialization time;
     * my_process_bulk() is an application-provided function. */
    void *obj_table[MAX_BULK];
    unsigned int count;

    while (1) {
        /* Process as many elements as can be dequeued. */
        count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK, NULL);
        if (unlikely(count == 0))
            continue;

        my_process_bulk(obj_table, count);
    }

PMD Driver
----------

The DPDK Poll Mode Driver (PMD) is also able to work in bulk/burst mode,
allowing code that would otherwise run for every packet to be factored out
and executed once per burst in the send or receive functions.

Avoid partial writes.
When PCI devices write to system memory through DMA,
it costs less if the write operation is on a full cache line as opposed to part of it.
In the PMD code, actions have been taken to avoid partial writes as much as possible.

Lower Packet Latency
~~~~~~~~~~~~~~~~~~~~

Traditionally, there is a trade-off between throughput and latency.
An application can be tuned to achieve a high throughput,
but the end-to-end latency of an average packet will typically increase as a result.
Similarly, the application can be tuned to have, on average,
a low end-to-end latency, at the cost of lower throughput.

In order to achieve higher throughput,
the DPDK amortizes the cost of processing each packet individually by processing packets in bursts.

Using the testpmd application as an example,
the burst size can be set on the command line to a value of 16 (also the default value).
This allows the application to request 16 packets at a time from the PMD.
The testpmd application then immediately attempts to transmit all the packets that were received,
in this case, all 16 packets.

The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
This behavior is desirable when tuning for high throughput because
the cost of tail pointer updates to both the RX and TX queues can be spread across 16 packets,
effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.
However, this is not very desirable when tuning for low latency, because
the first packet that was received must also wait for another 15 packets to be received.
It cannot be transmitted until the other 15 packets have also been processed, because
the NIC will not know to transmit the packets until the TX tail pointer has been updated,
which is not done until all 16 packets have been processed for transmission.

To consistently achieve low latency, even under heavy system load,
the application developer should avoid processing packets in bunches.
The testpmd application can be configured from the command line to use a burst value of 1.
This allows a single packet to be processed at a time, providing lower latency,
but with the added cost of lower throughput.

Locks and Atomic Operations
---------------------------

On x86, atomic operations imply a lock prefix before the instruction,
causing the processor's LOCK# signal to be asserted during execution of the following instruction.
This has a big impact on performance in a multicore environment.

Performance can be improved by avoiding lock mechanisms in the data plane.
Locks can often be replaced by other solutions, such as per-lcore variables.
Also, some locking techniques are more efficient than others.
For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.

Coding Considerations
---------------------

Inline Functions
~~~~~~~~~~~~~~~~

Small functions can be declared as static inline in the header file.
This avoids the cost of a call instruction (and the associated context saving).
However, this technique is not always efficient; it depends on many factors including the compiler.

Branch Prediction
~~~~~~~~~~~~~~~~~

The Intel® C/C++ Compiler (icc)/gcc built-in helper functions likely() and unlikely()
allow the developer to indicate if a code branch is likely to be taken or not.
For instance:

.. code-block:: c

    if (likely(x > 1))
        do_stuff();

Setting the Target CPU Type
---------------------------

The DPDK supports CPU microarchitecture-specific optimizations by means of the CONFIG_RTE_MACHINE option
in the DPDK configuration file.
The degree of optimization depends on the compiler's ability to optimize for a specific microarchitecture,
therefore it is preferable to use the latest compiler versions whenever possible.

If the compiler version does not support the specific feature set (for example, the Intel® AVX instruction set),
the build process gracefully degrades to the latest feature set that the compiler does support.

Since the build and runtime targets may not be the same,
the resulting binary also contains a platform check that runs before the
main() function and checks if the current machine is suitable for running the binary.

Along with compiler optimizations,
a set of preprocessor defines is automatically added to the build process (regardless of the compiler version).
These defines correspond to the instruction sets that the target CPU should be able to support.
For example, a binary compiled for any SSE4.2-capable processor will have RTE_MACHINE_CPUFLAG_SSE4_2 defined,
thus enabling compile-time code path selection for different platforms.
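As a minimal sketch of such compile-time selection (the two checksum helpers are hypothetical,
introduced here only for illustration), a define like RTE_MACHINE_CPUFLAG_SSE4_2 can guard an
instruction-set-specific code path:

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helpers, e.g. a CRC32-instruction-based checksum
     * and a portable fallback implementation. */
    uint32_t checksum_sse42(const void *data, size_t len);
    uint32_t checksum_generic(const void *data, size_t len);

    static inline uint32_t
    compute_checksum(const void *data, size_t len)
    {
    #ifdef RTE_MACHINE_CPUFLAG_SSE4_2
        /* The build target guarantees SSE4.2 support. */
        return checksum_sse42(data, len);
    #else
        /* Portable fallback for targets without SSE4.2. */
        return checksum_generic(data, len);
    #endif
    }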