..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

Writing Efficient Code
======================

This chapter provides some tips for developing efficient code using the DPDK.
For additional and more general information,
please refer to the *Intel® 64 and IA-32 Architectures Optimization Reference Manual*,
which is a valuable reference for writing efficient code.

Memory
------

This section describes some key memory considerations when developing applications in the DPDK environment.

Memory Copy: Do not Use libc in the Data Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many libc functions are available in the DPDK, via the Linux* application environment.
This can ease the porting of applications and the development of the configuration plane.
However, many of these functions are not designed for performance.
Functions such as memcpy() or strcpy() should not be used in the data plane.
To copy small structures, prefer a simpler technique that the compiler can optimize.
Refer to the *VTune™ Performance Analyzer Essentials* publication from Intel Press for recommendations.

For specific functions that are called often,
it is also a good idea to provide a hand-optimized function, which should be declared as static inline.

The DPDK API provides an optimized rte_memcpy() function.
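
As an illustration, the following is a minimal sketch of such helpers
(the structure and field names are hypothetical):

.. code-block:: c

    #include <stdint.h>
    #include <rte_memcpy.h>

    /* Hypothetical per-packet metadata. */
    struct pkt_meta {
        uint32_t flow_id;
        uint32_t rss_hash;
    };

    static inline void
    copy_meta(struct pkt_meta *dst, const struct pkt_meta *src)
    {
        /* For a small fixed-size structure, a plain assignment lets the
         * compiler generate the optimal copy. */
        *dst = *src;
    }

    static inline void
    copy_payload(void *dst, const void *src, size_t len)
    {
        /* For larger or variable-size copies in the data plane, prefer
         * rte_memcpy() over the libc memcpy(). */
        rte_memcpy(dst, src, len);
    }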

Memory Allocation
~~~~~~~~~~~~~~~~~

Other functions of libc, such as malloc(), provide a flexible way to allocate and free memory.
In some cases, using dynamic allocation is necessary,
but it is not advised to use malloc-like functions in the data plane because
managing a fragmented heap can be costly and the allocator may not be optimized for parallel allocation.

If you really need dynamic allocation in the data plane, it is better to use a memory pool of fixed-size objects.
This API is provided by librte_mempool.
This data structure provides several services that increase performance, such as memory alignment of objects,
lockless access to objects, NUMA awareness, bulk get/put and per-lcore cache.
The rte_malloc() function uses a similar concept to mempools.
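
As a rough sketch (the pool name and sizes are illustrative only), such a pool
can be created once at initialization time and then used in the data plane:

.. code-block:: c

    #include <rte_mempool.h>
    #include <rte_lcore.h>

    struct rte_mempool *mp;
    void *obj;

    /* Create a pool of 8192 fixed-size objects on the local NUMA socket,
     * with a per-lcore cache to reduce contention on the shared pool. */
    mp = rte_mempool_create("my_pool", 8192, 256, 256, 0,
                            NULL, NULL, NULL, NULL,
                            rte_socket_id(), 0);
    if (mp == NULL) {
        /* handle allocation failure */
    }

    /* Fast path: get and put objects, typically served from the
     * per-lcore cache without any atomic operation. */
    if (rte_mempool_get(mp, &obj) == 0) {
        /* ... use obj ... */
        rte_mempool_put(mp, obj);
    }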

Concurrent Access to the Same Memory Area
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read-Write (RW) access operations by several lcores to the same memory area can generate a lot of data cache misses,
which are very costly.
It is often possible to use per-lcore variables, for example, in the case of statistics.
There are at least two solutions for this:

*   Use RTE_PER_LCORE variables. Note that in this case, data on lcore X is not available to lcore Y.

*   Use a table of structures (one per lcore). In this case, each structure must be cache-aligned.

Read-mostly variables can be shared among lcores without performance losses if there are no RW variables in the same cache line.
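
A minimal sketch of the second approach (the statistics fields and the
``nb_rx`` parameter are illustrative):

.. code-block:: c

    #include <stdint.h>
    #include <rte_lcore.h>
    #include <rte_common.h>

    /* One cache-aligned entry per lcore, so that two lcores never write
     * to the same cache line. */
    struct lcore_stats {
        uint64_t rx_pkts;
        uint64_t tx_pkts;
    } __rte_cache_aligned;

    static struct lcore_stats stats[RTE_MAX_LCORE];

    static inline void
    count_rx(uint16_t nb_rx)
    {
        /* Each lcore only updates its own entry: no locking or atomic
         * operations are needed. */
        stats[rte_lcore_id()].rx_pkts += nb_rx;
    }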

NUMA
~~~~

On a NUMA system, it is preferable to access local memory since remote memory access is slower.
In the DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to create a pool on a specific socket.

Sometimes, it can be a good idea to duplicate data to optimize speed.
For read-mostly variables that are often accessed,
it should not be a problem to keep them on one socket only, since data will be present in cache.
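
As a sketch (the structure is hypothetical), memory for fast-path data can be
allocated on the NUMA socket of the lcore that will use it:

.. code-block:: c

    #include <stdint.h>
    #include <rte_malloc.h>
    #include <rte_lcore.h>

    /* Hypothetical per-port state used by the fast path. */
    struct port_data {
        uint64_t rx_bytes;
        uint64_t tx_bytes;
    };

    static struct port_data *
    port_data_create(void)
    {
        /* Allocate on the socket of the calling lcore so that the fast
         * path only touches local memory. */
        return rte_malloc_socket("port_data", sizeof(struct port_data),
                                 0, rte_socket_id());
    }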

Distribution Across Memory Channels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modern memory controllers have several memory channels that can load or store data in parallel.
Depending on the memory controller and its configuration,
the number of channels and the way the memory is distributed across the channels varies.
Each channel has a bandwidth limit,
meaning that if all memory access operations are done on the first channel only, there is a potential bottleneck.

By default, the :ref:`Mempool Library <Mempool_Library>` spreads the addresses of objects among memory channels.
80fc1f2750SBernard Iremonger
817a888932SEelco ChaudronLocking memory pages
827a888932SEelco Chaudron~~~~~~~~~~~~~~~~~~~~
837a888932SEelco Chaudron
847a888932SEelco ChaudronThe underlying operating system is allowed to load/unload memory pages at its own discretion.
857a888932SEelco ChaudronThese page loads could impact the performance, as the process is on hold when the kernel fetches them.
867a888932SEelco Chaudron
877a888932SEelco ChaudronTo avoid these you could pre-load, and lock them into memory with the ``mlockall()`` call.
887a888932SEelco Chaudron
897a888932SEelco Chaudron.. code-block:: c
907a888932SEelco Chaudron
917a888932SEelco Chaudron    if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
927a888932SEelco Chaudron        RTE_LOG(NOTICE, USER1, "mlockall() failed with error \"%s\"\n",
937a888932SEelco Chaudron                strerror(errno));
947a888932SEelco Chaudron    }
957a888932SEelco Chaudron
Communication Between lcores
----------------------------

To provide message-based communication between lcores,
it is advised to use the DPDK ring API, which provides a lockless ring implementation.

The ring supports bulk and burst access,
meaning that it is possible to read several elements from the ring with only one costly atomic operation
(see :doc:`ring_lib`).
Performance is greatly improved when using bulk access operations.

The code that dequeues messages may look similar to the following:

.. code-block:: c

    #define MAX_BULK 32

    while (1) {
        /* Process as many elements as can be dequeued. */
        count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK, NULL);
        if (unlikely(count == 0))
            continue;

        my_process_bulk(obj_table, count);
    }
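
On the producer side, the matching burst enqueue might look similar to the
following sketch (``nb_objs`` and ``my_handle_full_ring()`` are placeholders
for application code):

.. code-block:: c

    /* Enqueue a whole batch of messages with a single operation. */
    sent = rte_ring_enqueue_burst(ring, obj_table, nb_objs, NULL);
    if (unlikely(sent < nb_objs)) {
        /* The ring is full: decide what to do with the remaining objects,
         * for example drop them or retry later. */
        my_handle_full_ring(&obj_table[sent], nb_objs - sent);
    }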

PMD Driver
----------

The DPDK Poll Mode Driver (PMD) is also able to work in bulk/burst mode,
allowing some per-call processing code to be factorized across a whole burst of packets in the send or receive functions.

Avoid partial writes.
When PCI devices write to system memory through DMA,
it costs less if the write operation is on a full cache line as opposed to part of it.
In the PMD code, actions have been taken to avoid partial writes as much as possible.

Lower Packet Latency
~~~~~~~~~~~~~~~~~~~~

Traditionally, there is a trade-off between throughput and latency.
An application can be tuned to achieve a high throughput,
but the end-to-end latency of an average packet will typically increase as a result.
Similarly, the application can be tuned to have, on average,
a low end-to-end latency, at the cost of lower throughput.

In order to achieve higher throughput,
the DPDK attempts to amortize the cost of processing each packet individually by processing packets in bursts.

Using the testpmd application as an example,
the burst size can be set on the command line to a value of 16 (also the default value).
This allows the application to request 16 packets at a time from the PMD.
The testpmd application then immediately attempts to transmit all the packets that were received,
in this case, all 16 packets.

The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
This behavior is desirable when tuning for high throughput because
the cost of tail pointer updates to both the RX and TX queues can be spread across 16 packets,
effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.
However, this is not very desirable when tuning for low latency because
the first packet that was received must also wait for another 15 packets to be received.
It cannot be transmitted until the other 15 packets have also been processed because
the NIC will not know to transmit the packets until the TX tail pointer has been updated,
which is not done until all 16 packets have been processed for transmission.

To consistently achieve low latency, even under heavy system load,
the application developer should avoid processing packets in batches.
The testpmd application can be configured from the command line to use a burst value of 1.
This will allow a single packet to be processed at a time, providing lower latency,
but with the added cost of lower throughput.
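
For example, the testpmd burst size can be set directly on the command line
(the EAL core and memory-channel options shown below are placeholders for an
actual configuration):

.. code-block:: console

    # Tuned for throughput: request up to 16 packets per burst.
    ./testpmd -l 0-1 -n 4 -- --burst=16

    # Tuned for latency: process a single packet at a time.
    ./testpmd -l 0-1 -n 4 -- --burst=1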

Locks and Atomic Operations
---------------------------

This section describes some key considerations when using locks and atomic
operations in the DPDK environment.

Locks
~~~~~

On x86, atomic operations imply a lock prefix before the instruction,
causing the processor's LOCK# signal to be asserted during execution of the following instruction.
This has a big impact on performance in a multicore environment.

Performance can be improved by avoiding lock mechanisms in the data plane.
They can often be replaced by other solutions such as per-lcore variables.
Also, some locking techniques are more efficient than others.
For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.

Atomic Operations: Use C11 Atomic Builtins
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DPDK generic rte_atomic operations are implemented by __sync builtins. These
__sync builtins result in full barriers on aarch64, which are unnecessary
in many use cases. They can be replaced by __atomic builtins that conform to
the C11 memory model and provide finer memory order control.

Replacing the rte_atomic operations with __atomic builtins might therefore
improve performance on aarch64 machines.

Some typical optimization cases are listed below:

Atomicity
^^^^^^^^^

Some use cases require atomicity alone; the ordering of the memory operations
does not matter. For example, packet statistics counters need to be
incremented atomically but do not need any particular memory ordering.
So, RELAXED memory ordering is sufficient.
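
A minimal sketch (the counter and the ``nb_rx`` parameter are illustrative):

.. code-block:: c

    #include <stdint.h>

    static uint64_t rx_pkts;

    static inline void
    count_rx_pkts(uint64_t nb_rx)
    {
        /* Only atomicity is required; no ordering with respect to other
         * memory operations is needed, so RELAXED is sufficient. */
        __atomic_fetch_add(&rx_pkts, nb_rx, __ATOMIC_RELAXED);
    }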

One-way Barrier
^^^^^^^^^^^^^^^

Some use cases allow for memory reordering in one direction while requiring
memory ordering in the other direction.

For example, the memory operations before the spinlock lock are allowed to
move into the critical section, but the memory operations in the critical section
are not allowed to move above the lock. In this case, the full memory barrier
in the compare-and-swap operation can be replaced with ACQUIRE memory order.
On the other hand, the memory operations after the spinlock unlock are allowed
to move into the critical section, but the memory operations in the critical
section are not allowed to move below the unlock. So the full barrier in the
store operation can be replaced with RELEASE memory order.
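
The following sketch of a simple spinlock illustrates this (the lock variable
and function names are illustrative, not the DPDK spinlock implementation):

.. code-block:: c

    static inline void
    my_spinlock_lock(volatile int *lock)
    {
        int exp = 0;

        /* ACQUIRE: memory operations in the critical section cannot move
         * above the lock, but earlier operations may sink into it. */
        while (!__atomic_compare_exchange_n(lock, &exp, 1, 0,
                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
            /* Wait until the lock looks free before retrying the CAS. */
            while (__atomic_load_n(lock, __ATOMIC_RELAXED))
                ;
            exp = 0;
        }
    }

    static inline void
    my_spinlock_unlock(volatile int *lock)
    {
        /* RELEASE: memory operations in the critical section cannot move
         * below the unlock, but later operations may hoist into it. */
        __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
    }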

Reader-Writer Concurrency
^^^^^^^^^^^^^^^^^^^^^^^^^

Lock-free reader-writer concurrency is one of the common use cases in DPDK.

The payload, or the data that the writer wants to communicate to the reader,
can be written with RELAXED memory order. However, the guard variable should
be written with RELEASE memory order. This ensures that the store to the guard
variable is observable only after the store to the payload is observable.

Correspondingly, on the reader side, the guard variable should be read
with ACQUIRE memory order. The payload, or the data the writer communicated,
can be read with RELAXED memory order. This ensures that, if the store to the
guard variable is observable, the store to the payload is also observable.
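
A minimal sketch of this pattern (variable names are illustrative):

.. code-block:: c

    #include <stdint.h>

    static uint32_t payload;  /* data communicated to the reader */
    static uint32_t guard;    /* non-zero once the payload is valid */

    static inline void
    writer_publish(uint32_t value)
    {
        /* The payload can be written with RELAXED order... */
        __atomic_store_n(&payload, value, __ATOMIC_RELAXED);
        /* ...but the guard is written with RELEASE order, so its store
         * becomes observable only after the payload store. */
        __atomic_store_n(&guard, 1, __ATOMIC_RELEASE);
    }

    static inline int
    reader_consume(uint32_t *value)
    {
        /* The guard is read with ACQUIRE order... */
        if (__atomic_load_n(&guard, __ATOMIC_ACQUIRE) == 0)
            return 0; /* nothing published yet */
        /* ...so if the guard is seen, the payload store is visible too. */
        *value = __atomic_load_n(&payload, __ATOMIC_RELAXED);
        return 1;
    }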

Coding Considerations
---------------------

Inline Functions
~~~~~~~~~~~~~~~~

Small functions can be declared as static inline in the header file.
This avoids the cost of a call instruction (and the associated context saving).
However, this technique is not always efficient; it depends on many factors including the compiler.
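
For example, a small helper placed in a header file might look like this
(a sketch):

.. code-block:: c

    #include <stdint.h>

    /* Declared static inline so the compiler can expand it at the call
     * site and avoid the call overhead. */
    static inline uint32_t
    clamp_u32(uint32_t v, uint32_t lo, uint32_t hi)
    {
        if (v < lo)
            return lo;
        if (v > hi)
            return hi;
        return v;
    }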

Branch Prediction
~~~~~~~~~~~~~~~~~

The Intel® C/C++ Compiler (icc)/gcc built-in helper functions likely() and unlikely()
allow the developer to indicate if a code branch is likely to be taken or not.
For instance:

.. code-block:: c

    if (likely(x > 1))
        do_stuff();

Setting the Target CPU Type
---------------------------

The DPDK supports CPU microarchitecture-specific optimizations by means of the CONFIG_RTE_MACHINE option
in the DPDK configuration file.
The degree of optimization depends on the compiler's ability to optimize for a specific microarchitecture,
therefore it is preferable to use the latest compiler versions whenever possible.

If the compiler version does not support the specific feature set (for example, the Intel® AVX instruction set),
the build process gracefully degrades to whatever latest feature set is supported by the compiler.

Since the build and runtime targets may not be the same,
the resulting binary also contains a platform check that runs before the
main() function and checks if the current machine is suitable for running the binary.

Along with compiler optimizations,
a set of preprocessor defines are automatically added to the build process (regardless of the compiler version).
These defines correspond to the instruction sets that the target CPU should be able to support.
For example, a binary compiled for any SSE4.2-capable processor will have RTE_MACHINE_CPUFLAG_SSE4_2 defined,
thus enabling compile-time code path selection for different platforms.
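
As a sketch (the hash functions are hypothetical), such a define can be used
to select a code path at compile time:

.. code-block:: c

    #ifdef RTE_MACHINE_CPUFLAG_SSE4_2
        /* Compiled only when the target supports SSE4.2, for example to
         * use the CRC32 instruction for hashing. */
        hash = compute_hash_sse42(key, len);
    #else
        /* Generic fallback for targets without SSE4.2. */
        hash = compute_hash_generic(key, len);
    #endif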