..  BSD LICENSE
    Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
    * Neither the name of Intel Corporation nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Writing Efficient Code
======================

This chapter provides some tips for developing efficient code using the Intel® DPDK.
For additional and more general information,
please refer to the *Intel® 64 and IA-32 Architectures Optimization Reference Manual*,
which is a valuable reference for writing efficient code.

Memory
------

This section describes some key memory considerations when developing applications in the Intel® DPDK environment.

Memory Copy: Do not Use libc in the Data Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many libc functions are available in the Intel® DPDK, via the Linux* application environment.
This can ease the porting of applications and the development of the configuration plane.
However, many of these functions are not designed for performance.
Functions such as memcpy() or strcpy() should not be used in the data plane.
To copy small structures, prefer a simpler technique that the compiler can optimize.
Refer to the *VTune™ Performance Analyzer Essentials* publication from Intel Press for recommendations.
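
One such technique is a plain structure assignment,
which lets the compiler choose the copy sequence because the size is known at compile time.
The following is a minimal sketch; the flow_key structure and the flow_key_copy() helper are hypothetical names
used only for illustration and are not part of the Intel® DPDK API:

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical flow key, small enough to be copied by value. */
    struct flow_key {
        uint32_t src_ip;
        uint32_t dst_ip;
        uint16_t src_port;
        uint16_t dst_port;
        uint8_t  proto;
    };

    /* Copy by assignment: the size is a compile-time constant,
     * so the compiler can emit a few moves instead of a library call. */
    static inline void
    flow_key_copy(struct flow_key *dst, const struct flow_key *src)
    {
        *dst = *src;
    }

Whether this is faster than a library call depends on the compiler and the structure size, so it is worth measuring.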

For specific functions that are called often,
it is also a good idea to provide a self-made optimized function, which should be declared as static inline.

The Intel® DPDK API provides an optimized rte_memcpy() function.

Memory Allocation
~~~~~~~~~~~~~~~~~

Other functions of libc, such as malloc(), provide a flexible way to allocate and free memory.
In some cases, using dynamic allocation is necessary,
but using malloc-like functions in the data plane is strongly discouraged, because
managing a fragmented heap can be costly and the allocator may not be optimized for parallel allocation.

If you really need dynamic allocation in the data plane, it is better to use a memory pool of fixed-size objects.
This API is provided by librte_mempool.
This data structure provides several services that increase performance, such as memory alignment of objects,
lockless access to objects, NUMA awareness, bulk get/put and per-lcore cache.
The rte_malloc() function uses a similar concept to mempools.

Concurrent Access to the Same Memory Area
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read-Write (RW) access operations by several lcores to the same memory area can generate a lot of data cache misses,
which are very costly.
It is often possible to use per-lcore variables instead, for example, in the case of statistics.
There are at least two ways to do this:

*   Use RTE_PER_LCORE variables. Note that in this case, data on lcore X is not available to lcore Y.

*   Use a table of structures (one per lcore). In this case, each structure must be cache-aligned.

Read-mostly variables can be shared among lcores without performance losses if there are no RW variables in the same cache line.
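
The table-of-structures approach can be sketched as follows.
This is an illustration only: it assumes a 64-byte cache line and a gcc-compatible alignment attribute,
and the names lcore_stats, MAX_LCORES, stats_count_rx() and stats_total_rx() are hypothetical:

.. code-block:: c

    #include <stdint.h>

    #define CACHE_LINE_SIZE 64   /* assumed cache line size in bytes */
    #define MAX_LCORES      8    /* hypothetical number of lcores */

    /* One statistics block per lcore, aligned (and therefore padded) to a cache line
     * so that two lcores never write to the same line. */
    struct lcore_stats {
        uint64_t rx_packets;
        uint64_t tx_packets;
    } __attribute__((aligned(CACHE_LINE_SIZE)));

    static struct lcore_stats stats[MAX_LCORES];

    /* Called only by the lcore that owns the slot, so no atomic operation is needed. */
    static inline void
    stats_count_rx(unsigned lcore_id, uint64_t n)
    {
        stats[lcore_id].rx_packets += n;
    }

    /* Called from the control plane to aggregate all slots;
     * the sum is only approximate while the lcores are still counting. */
    static uint64_t
    stats_total_rx(void)
    {
        uint64_t total = 0;
        unsigned i;

        for (i = 0; i < MAX_LCORES; i++)
            total += stats[i].rx_packets;
        return total;
    }

Unlike RTE_PER_LCORE variables, the table remains readable from any lcore, which is convenient for collecting totals.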

NUMA
~~~~

On a NUMA system, it is preferable to access local memory since remote memory access is slower.
In the Intel® DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to allocate memory on a specific socket.

Sometimes, it can be a good idea to duplicate data on each socket to optimize speed.
For read-mostly variables that are often accessed,
it should not be a problem to keep them in one socket only, since data will be present in cache.

Distribution Across Memory Channels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modern memory controllers have several memory channels that can load or store data in parallel.
Depending on the memory controller and its configuration,
the number of channels and the way the memory is distributed across the channels vary.
Each channel has a bandwidth limit,
meaning that if all memory access operations are done on the first channel only, there is a potential bottleneck.

By default, the :ref:`Mempool Library <Mempool_Library>` spreads the addresses of objects among memory channels.

Communication Between lcores
----------------------------

To provide message-based communication between lcores,
it is advised to use the Intel® DPDK ring API, which provides a lockless ring implementation.

The ring supports bulk and burst access,
meaning that it is possible to read several elements from the ring with only one costly atomic operation
(see Chapter 5 "Ring Library").
Performance is greatly improved when using bulk access operations.

Code that dequeues messages may look similar to the following:

.. code-block:: c

    #define MAX_BULK 32

    while (1) {
        /* Process as many elements as can be dequeued. */
        count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK);
        if (unlikely(count == 0))
            continue;

        my_process_bulk(obj_table, count);
    }

PMD Driver
----------

The Intel® DPDK Poll Mode Driver (PMD) is also able to work in bulk/burst mode,
which allows some of the per-call work in the send or receive function to be shared across all packets of a burst.

Avoid partial writes.
When PCI devices write to system memory through DMA,
it costs less if the write operation is on a full cache line as opposed to part of it.
In the PMD code, actions have been taken to avoid partial writes as much as possible.

Lower Packet Latency
~~~~~~~~~~~~~~~~~~~~

Traditionally, there is a trade-off between throughput and latency.
An application can be tuned to achieve a high throughput,
but the end-to-end latency of an average packet will typically increase as a result.
Similarly, the application can be tuned to have, on average,
a low end-to-end latency, at the cost of lower throughput.

In order to achieve higher throughput,
the Intel® DPDK attempts to amortize the cost of processing each packet individually by processing packets in bursts.

Using the testpmd application as an example,
the burst size can be set on the command line to a value of 16 (also the default value).
This allows the application to request 16 packets at a time from the PMD.
The testpmd application then immediately attempts to transmit all the packets that were received,
in this case, all 16 packets.

The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
This behavior is desirable when tuning for high throughput because
the cost of tail pointer updates to both the RX and TX queues can be spread across 16 packets,
effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.
However, this is not very desirable when tuning for low latency because
the first packet that was received must also wait for another 15 packets to be received.
It cannot be transmitted until the other 15 packets have also been processed because
the NIC will not know to transmit the packets until the TX tail pointer has been updated,
which is not done until all 16 packets have been processed for transmission.
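
The amortization can be made concrete with a simplified model that assumes exactly one tail pointer write per burst.
The tail_updates() helper below is hypothetical and only counts writes; it does not model any real driver:

.. code-block:: c

    /* Number of TX tail pointer (MMIO) writes needed to send n_packets
     * when one write is issued per burst of burst_size packets. */
    static inline unsigned
    tail_updates(unsigned n_packets, unsigned burst_size)
    {
        return (n_packets + burst_size - 1) / burst_size;
    }

Under this model, sending 1024 packets takes 64 tail pointer writes with a burst size of 16,
but 1024 writes with a burst size of 1.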

To consistently achieve low latency, even under heavy system load,
the application developer should avoid processing packets in bursts.
The testpmd application can be configured from the command line to use a burst value of 1.
This allows a single packet to be processed at a time, providing lower latency
at the cost of lower throughput.

Locks and Atomic Operations
---------------------------

Atomic operations imply a lock prefix before the instruction,
causing the processor's LOCK# signal to be asserted during execution of the following instruction.
This has a big impact on performance in a multicore environment.

Performance can be improved by avoiding lock mechanisms in the data plane.
They can often be replaced by other solutions, such as per-lcore variables.
Also, some locking techniques are more efficient than others.
For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.

Coding Considerations
---------------------

Inline Functions
~~~~~~~~~~~~~~~~

Small functions can be declared as static inline in the header file.
This avoids the cost of a call instruction (and the associated context saving).
However, this technique is not always efficient; it depends on many factors including the compiler.
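
As a minimal sketch, a header might define a small helper like the one below.
The header guard MY_IDX_WRAP_H and the function my_idx_wrap() are hypothetical names chosen for this example:

.. code-block:: c

    #ifndef MY_IDX_WRAP_H
    #define MY_IDX_WRAP_H

    #include <stdint.h>

    /* Wrap an index into a table whose size is a power of two.
     * Declared static inline so that each caller can expand it in place. */
    static inline uint32_t
    my_idx_wrap(uint32_t idx, uint32_t size)
    {
        return idx & (size - 1);
    }

    #endif /* MY_IDX_WRAP_H */

The inline keyword is only a hint, so the compiler may still decide to emit a regular call.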

Branch Prediction
~~~~~~~~~~~~~~~~~

The likely() and unlikely() helper macros, which rely on a built-in of the Intel® C/C++ Compiler (icc) and gcc,
allow the developer to indicate whether a code branch is likely to be taken.
For instance:

.. code-block:: c

    if (likely(x > 1))
        do_stuff();
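
For reference, such macros are commonly built on the __builtin_expect() built-in, as sketched below.
The definitions are guarded because the DPDK already supplies likely() and unlikely() in rte_branch_prediction.h;
the check_len() function is a hypothetical example and a gcc-compatible compiler is assumed:

.. code-block:: c

    /* Guarded definitions, in case the macros are already provided elsewhere. */
    #ifndef likely
    #define likely(x)   __builtin_expect(!!(x), 1)
    #endif
    #ifndef unlikely
    #define unlikely(x) __builtin_expect(!!(x), 0)
    #endif

    /* Hypothetical length check: the error branch is marked as the rare one. */
    static inline int
    check_len(unsigned len, unsigned max_len)
    {
        if (unlikely(len > max_len))
            return -1;  /* rare error path */
        return 0;       /* common path */
    }

The hint affects only how the compiler lays out the code; the result of the condition is unchanged.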

Setting the Target CPU Type
---------------------------

The Intel® DPDK supports CPU microarchitecture-specific optimizations by means of the CONFIG_RTE_MACHINE option
in the Intel® DPDK configuration file.
The degree of optimization depends on the compiler's ability to optimize for a specific microarchitecture,
so it is preferable to use the latest compiler versions whenever possible.

If the compiler version does not support the specific feature set (for example, the Intel® AVX instruction set),
the build process gracefully degrades to the latest feature set that the compiler does support.

Since the build and runtime targets may not be the same,
the resulting binary also contains a platform check that runs before the
main() function and checks if the current machine is suitable for running the binary.

Along with compiler optimizations,
a set of preprocessor defines is automatically added to the build process (regardless of the compiler version).
These defines correspond to the instruction sets that the target CPU should be able to support.
For example, a binary compiled for any SSE4.2-capable processor will have RTE_MACHINE_CPUFLAG_SSE4_2 defined,
thus enabling compile-time code path selection for different platforms.
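
Compile-time code path selection might look like the following sketch.
The function my_crc32c_u32() is a hypothetical name; the example assumes that, when RTE_MACHINE_CPUFLAG_SSE4_2 is defined,
the build also enables SSE4.2 code generation so that the intrinsic is available:

.. code-block:: c

    #include <stdint.h>

    #ifdef RTE_MACHINE_CPUFLAG_SSE4_2
    #include <nmmintrin.h>  /* SSE4.2 intrinsics */
    #endif

    /* Fold one 32-bit word into a running CRC32-C value. */
    static inline uint32_t
    my_crc32c_u32(uint32_t crc, uint32_t data)
    {
    #ifdef RTE_MACHINE_CPUFLAG_SSE4_2
        /* Hardware CRC32 instruction, selected at compile time. */
        return _mm_crc32_u32(crc, data);
    #else
        /* Portable bitwise fallback using the reflected CRC32-C polynomial. */
        int i;

        crc ^= data;
        for (i = 0; i < 32; i++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        return crc;
    #endif
    }

The fallback is intended to produce the same values as the hardware path, so callers do not need to know which one was built.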