..  BSD LICENSE
    Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
    * Neither the name of Intel Corporation nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED.
    IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Writing Efficient Code
======================

This chapter provides some tips for developing efficient code using the Intel® DPDK.
For additional and more general information,
please refer to the *Intel® 64 and IA-32 Architectures Optimization Reference Manual*
which is a valuable reference to writing efficient code.

Memory
------

This section describes some key memory considerations when developing applications in the Intel® DPDK environment.

Memory Copy: Do not Use libc in the Data Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many libc functions are available in the Intel® DPDK, via the Linux* application environment.
This can ease the porting of applications and the development of the configuration plane.
However, many of these functions are not designed for performance.
Functions such as memcpy() or strcpy() should not be used in the data plane.
To copy small structures, prefer a simpler technique that can be optimized by the compiler.
Refer to the *VTune™ Performance Analyzer Essentials* publication from Intel Press for recommendations.

For specific functions that are called often,
it is also a good idea to provide a self-made optimized function, which should be declared as static inline.

The Intel® DPDK API provides an optimized rte_memcpy() function.

Memory Allocation
~~~~~~~~~~~~~~~~~

Other functions of libc, such as malloc(), provide a flexible way to allocate and free memory.
In some cases, using dynamic allocation is necessary,
but it is not advisable to use malloc-like functions in the data plane because
managing a fragmented heap can be costly and the allocator may not be optimized for parallel allocation.

If you really need dynamic allocation in the data plane, it is better to use a memory pool of fixed-size objects.
This API is provided by librte_mempool.
This data structure provides several services that increase performance, such as memory alignment of objects,
lockless access to objects, NUMA awareness, bulk get/put and per-lcore cache.
The rte_malloc() function uses a similar concept to mempools.

Concurrent Access to the Same Memory Area
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read-Write (RW) access operations by several lcores to the same memory area can generate a lot of data cache misses,
which are very costly.
It is often possible to use per-lcore variables, for example, in the case of statistics.
There are at least two solutions for this:

*   Use RTE_PER_LCORE variables. Note that in this case, data on lcore X is not available to lcore Y.

*   Use a table of structures (one per lcore). In this case, each structure must be cache-aligned.

Read-mostly variables can be shared among lcores without performance losses if there are no RW variables in the same cache line.

NUMA
~~~~

On a NUMA system, it is preferable to access local memory since remote memory access is slower.
In the Intel® DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to create a pool on a specific socket.

Sometimes, it can be a good idea to duplicate data to optimize speed.
For read-mostly variables that are often accessed,
it should not be a problem to keep them in one socket only, since data will be present in cache.

Distribution Across Memory Channels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modern memory controllers have several memory channels that can load or store data in parallel.
Depending on the memory controller and its configuration,
the number of channels and the way the memory is distributed across the channels varies.
Each channel has a bandwidth limit,
meaning that if all memory access operations are done on the first channel only, there is a potential bottleneck.

By default, the :ref:`Mempool Library <Mempool_Library>` spreads the addresses of objects among memory channels.

Communication Between lcores
----------------------------

To provide a message-based communication between lcores,
it is advised to use the Intel® DPDK ring API, which provides a lockless ring implementation.

The ring supports bulk and burst access,
meaning that it is possible to read several elements from the ring with only one costly atomic operation
(see Chapter 5 "Ring Library").
Performance is greatly improved when using bulk access operations.

The code algorithm that dequeues messages may be something similar to the following:

.. code-block:: c

    #define MAX_BULK 32

    while (1) {
        /* Process as many elements as can be dequeued. */
        count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK);
        if (unlikely(count == 0))
            continue;

        my_process_bulk(obj_table, count);
    }

PMD Driver
----------

The Intel® DPDK Poll Mode Driver (PMD) is also able to work in bulk/burst mode,
allowing the factorization of some code for each call in the send or receive function.

Avoid partial writes.
When PCI devices write to system memory through DMA,
it costs less if the write operation is on a full cache line as opposed to part of it.
In the PMD code, actions have been taken to avoid partial writes as much as possible.

Lower Packet Latency
~~~~~~~~~~~~~~~~~~~~

Traditionally, there is a trade-off between throughput and latency.
An application can be tuned to achieve a high throughput,
but the end-to-end latency of an average packet will typically increase as a result.
Similarly, the application can be tuned to have, on average,
a low end-to-end latency, at the cost of lower throughput.

In order to achieve higher throughput,
the Intel® DPDK attempts to aggregate the cost of processing each packet individually by processing packets in bursts.

Using the testpmd application as an example,
the burst size can be set on the command line to a value of 16 (also the default value).
This allows the application to request 16 packets at a time from the PMD.
The testpmd application then immediately attempts to transmit all the packets that were received,
in this case, all 16 packets.

The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
This behavior is desirable when tuning for high throughput because
the cost of tail pointer updates to both the RX and TX queues can be spread across 16 packets,
effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.
However, this is not very desirable when tuning for low latency because
the first packet that was received must also wait for another 15 packets to be received.
It cannot be transmitted until the other 15 packets have also been processed because
the NIC will not know to transmit the packets until the TX tail pointer has been updated,
which is not done until all 16 packets have been processed for transmission.

To consistently achieve low latency, even under heavy system load,
the application developer should avoid processing packets in bursts.
The testpmd application can be configured from the command line to use a burst value of 1.
This will allow a single packet to be processed at a time, providing lower latency,
but with the added cost of lower throughput.

Locks and Atomic Operations
---------------------------

Atomic operations imply a lock prefix before the instruction,
causing the processor's LOCK# signal to be asserted during execution of the following instruction.
This has a big impact on performance in a multicore environment.

Performance can be improved by avoiding lock mechanisms in the data plane.
They can often be replaced by other solutions like per-lcore variables.
Also, some locking techniques are more efficient than others.
For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
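
As an illustration of replacing a shared, locked counter with per-lcore data,
the following is a minimal sketch in plain C11 rather than the Intel® DPDK API.
The names ``CACHE_LINE_SIZE``, ``MAX_CORES``, ``struct core_stats``, ``count_rx()`` and ``total_rx()`` are hypothetical,
and a 64-byte cache line is assumed.
Each core updates only its own cache-aligned entry, so the update needs no lock prefix and no two cores write to the same cache line.

.. code-block:: c

    #include <stdalign.h>
    #include <stdint.h>

    #define CACHE_LINE_SIZE 64 /* assumed cache line size */
    #define MAX_CORES 4        /* hypothetical number of worker cores */

    /* One counter block per core. Aligning the first member to the cache
     * line size aligns and pads the whole structure, so two cores never
     * share a cache line. */
    struct core_stats {
        alignas(CACHE_LINE_SIZE) uint64_t rx_packets;
        uint64_t tx_packets;
    };

    static struct core_stats stats[MAX_CORES];

    /* Called only by the core that owns core_id: a plain add, no lock. */
    static inline void
    count_rx(unsigned int core_id, uint64_t n)
    {
        stats[core_id].rx_packets += n;
    }

    /* Sum the per-core counters. Meant to be called while the workers are
     * idle; reading concurrently with the writers would need atomic loads. */
    static uint64_t
    total_rx(void)
    {
        uint64_t sum = 0;
        unsigned int i;

        for (i = 0; i < MAX_CORES; i++)
            sum += stats[i].rx_packets;
        return sum;
    }

The cost moves from the fast path to the reader, which has to walk the table to obtain a total.
This is usually a good trade for statistics, which are written far more often than they are read.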

Coding Considerations
---------------------

Inline Functions
~~~~~~~~~~~~~~~~

Small functions can be declared as static inline in the header file.
This avoids the cost of a call instruction (and the associated context saving).
However, this technique is not always efficient; it depends on many factors including the compiler.

Branch Prediction
~~~~~~~~~~~~~~~~~

The Intel® C/C++ Compiler (icc)/gcc built-in helper functions likely() and unlikely()
allow the developer to indicate if a code branch is likely to be taken or not.
For instance:

.. code-block:: c

    if (likely(x > 1))
        do_stuff();

Setting the Target CPU Type
---------------------------

The Intel® DPDK supports CPU microarchitecture-specific optimizations by means of the CONFIG_RTE_MACHINE option
in the Intel® DPDK configuration file.
The degree of optimization depends on the compiler's ability to optimize for a specific microarchitecture,
therefore it is preferable to use the latest compiler versions whenever possible.

If the compiler version does not support the specific feature set (for example, the Intel® AVX instruction set),
the build process gracefully degrades to whatever latest feature set is supported by the compiler.

Since the build and runtime targets may not be the same,
the resulting binary also contains a platform check that runs before the
main() function and checks if the current machine is suitable for running the binary.

Along with compiler optimizations,
a set of preprocessor defines is automatically added to the build process (regardless of the compiler version).
These defines correspond to the instruction sets that the target CPU should be able to support.
For example, a binary compiled for any SSE4.2-capable processor will have RTE_MACHINE_CPUFLAG_SSE4_2 defined,
thus enabling compile-time code path selection for different platforms.
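
Compile-time code path selection of this kind could look like the sketch below.
It is an illustration only: ``crc32c_update()`` and ``crc32c()`` are hypothetical helpers, not part of the Intel® DPDK API,
and the sketch assumes that a build which defines RTE_MACHINE_CPUFLAG_SSE4_2 also enables the SSE4.2 intrinsics in the compiler.
When the define is absent, a portable bitwise CRC-32C (Castagnoli) loop is compiled instead.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    #ifdef RTE_MACHINE_CPUFLAG_SSE4_2
    #include <nmmintrin.h> /* SSE4.2 intrinsics, including _mm_crc32_u8() */
    #endif

    /* Feed len bytes into a running CRC-32C value (hypothetical helper). */
    static uint32_t
    crc32c_update(uint32_t crc, const uint8_t *buf, size_t len)
    {
        size_t i;

        for (i = 0; i < len; i++) {
    #ifdef RTE_MACHINE_CPUFLAG_SSE4_2
            /* Hardware path: one CRC32 instruction per byte. */
            crc = _mm_crc32_u8(crc, buf[i]);
    #else
            /* Portable path: reflected polynomial 0x82F63B78, bit by bit. */
            int bit;

            crc ^= buf[i];
            for (bit = 0; bit < 8; bit++)
                crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    #endif
        }
        return crc;
    }

    /* CRC-32C with the conventional initial value and final inversion. */
    static uint32_t
    crc32c(const void *data, size_t len)
    {
        return crc32c_update(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
    }

Because the selection happens in the preprocessor, only one of the two paths ends up in the binary,
which is why the platform check before main() matters when the build and runtime machines differ.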