.. SPDX-License-Identifier: BSD-3-Clause
   Copyright(c) 2010-2015 Intel Corporation.

Poll Mode Driver
================

The DPDK includes 1 Gigabit, 10 Gigabit, 40 Gigabit and paravirtualized virtio Poll Mode Drivers.

A Poll Mode Driver (PMD) consists of APIs, provided through the BSD driver running in user space,
to configure the devices and their respective queues.
In addition, a PMD accesses the RX and TX descriptors directly without any interrupts
(with the exception of Link Status Change interrupts) to quickly receive,
process and deliver packets in the user's application.
This section describes the requirements of the PMDs,
their global design principles and proposes a high-level architecture and a generic external API for the Ethernet PMDs.

Requirements and Assumptions
----------------------------

The DPDK environment for packet processing applications allows for two models, run-to-completion and pipe-line:

* In the *run-to-completion* model, a specific port's RX descriptor ring is polled for packets through an API.
  Packets are then processed on the same core and placed on a port's TX descriptor ring through an API for transmission.

* In the *pipe-line* model, one core polls one or more ports' RX descriptor rings through an API.
  Packets are received and passed to another core via a ring.
  The other core continues to process the packet, which may then be placed on a port's TX descriptor ring through an API for transmission.

In a synchronous run-to-completion model,
each logical core assigned to the DPDK executes a packet processing loop that includes the following steps:

* Retrieve input packets through the PMD receive API

* Process each received packet one at a time, up to its forwarding

* Send pending output packets through the PMD transmit API

Conversely, in an asynchronous pipe-line model, some logical cores may be dedicated to the retrieval of received packets and
other logical cores to the processing of previously received packets.
Received packets are exchanged between logical cores through rings.
The loop for packet retrieval includes the following steps:

* Retrieve input packets through the PMD receive API

* Provide received packets to processing lcores through packet queues

The loop for packet processing includes the following steps:

* Retrieve the received packet from the packet queue

* Process the received packet, up to its retransmission if forwarded

To avoid any unnecessary interrupt processing overhead, the execution environment must not use any asynchronous notification mechanisms.
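
The run-to-completion loop described above can be sketched in C. This is an illustrative model only, not DPDK code: ``fake_rx_burst()`` and ``fake_tx_burst()`` are hypothetical stand-ins for the real ``rte_eth_rx_burst()`` and ``rte_eth_tx_burst()`` PMD APIs, and per-packet processing is elided.

```c
#include <stdint.h>
#include <stddef.h>

#define BURST_SIZE 32

/* Stand-in for rte_eth_rx_burst(): hands back up to nb_max opaque packet
 * handles from a fake port that has `avail` packets waiting. */
static uint16_t fake_rx_burst(void **pkts, uint16_t nb_max, uint16_t avail)
{
    uint16_t n = avail < nb_max ? avail : nb_max;
    for (uint16_t i = 0; i < n; i++)
        pkts[i] = (void *)(uintptr_t)(i + 1); /* dummy packet handles */
    return n;
}

/* Stand-in for rte_eth_tx_burst(): pretends every packet was queued. */
static uint16_t fake_tx_burst(void **pkts, uint16_t nb)
{
    (void)pkts;
    return nb;
}

/* One iteration of the run-to-completion loop:
 * retrieve -> process each packet -> send pending output packets.
 * Returns the number of packets forwarded in this iteration. */
uint16_t run_to_completion_iter(uint16_t rx_available)
{
    void *burst[BURST_SIZE];
    uint16_t nb_rx = fake_rx_burst(burst, BURST_SIZE, rx_available);
    for (uint16_t i = 0; i < nb_rx; i++) {
        /* per-packet processing, up to the forwarding decision */
    }
    return fake_tx_burst(burst, nb_rx);
}
```

In a real application this iteration runs in an endless loop on each lcore, with the port and queue identifiers passed to the burst calls.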
Whenever needed and appropriate, asynchronous communication should be introduced as much as possible through the use of rings.

Avoiding lock contention is a key issue in a multi-core environment.
To address this issue, PMDs are designed to work with per-core private resources as much as possible.
For example, a PMD maintains a separate transmit queue per-core, per-port, if the PMD is not ``RTE_ETH_TX_OFFLOAD_MT_LOCKFREE`` capable.
In the same way, every receive queue of a port is assigned to and polled by a single logical core (lcore).

To comply with Non-Uniform Memory Access (NUMA), memory management is designed to assign to each logical core
a private buffer pool in local memory to minimize remote memory access.
The configuration of packet buffer pools should take into account the underlying physical memory architecture in terms of DIMMs,
channels and ranks.
The application must ensure that appropriate parameters are given at memory pool creation time.
See :doc:`../mempool_lib`.

Design Principles
-----------------

The API and architecture of the Ethernet* PMDs are designed with the following guidelines in mind.

PMDs must help global policy-oriented decisions to be enforced at the upper application level.
Conversely, NIC PMD functions should not impede the benefits expected by upper-level global policies,
or worse, prevent such policies from being applied.

For instance, both the receive and transmit functions of a PMD have a maximum number of packets/descriptors to poll.
This allows a run-to-completion processing stack to statically fix or
to dynamically adapt its overall behavior through different global loop policies, such as:

* Receive, process immediately and transmit packets one at a time in a piecemeal fashion.

* Receive as many packets as possible, then process all received packets, transmitting them immediately.

* Receive a given maximum number of packets, process the received packets, accumulate them and finally send all accumulated packets for transmission.

To achieve optimal performance, overall software design choices and pure software optimization techniques must be considered and
balanced against available low-level hardware-based optimization features (CPU cache properties, bus speed, NIC PCI bandwidth, and so on).
The case of packet transmission is an example of this software/hardware tradeoff issue when optimizing burst-oriented network packet processing engines.
In the initial case, the PMD could export only an rte_eth_tx_one function to transmit one packet at a time on a given queue.
On top of that, one can easily build an rte_eth_tx_burst function that loops invoking the rte_eth_tx_one function to transmit several packets at a time.
However, an rte_eth_tx_burst function is effectively implemented by the PMD to minimize the driver-level transmit cost per packet through the following optimizations:

* Share among multiple packets the un-amortized cost of invoking the rte_eth_tx_one function.

* Enable the rte_eth_tx_burst function to take advantage of burst-oriented hardware features (prefetch data in cache, use of NIC head/tail registers)
  to minimize the number of CPU cycles per packet, for example by avoiding unnecessary read memory accesses to ring transmit descriptors,
  or by systematically using arrays of pointers that exactly fit cache line boundaries and sizes.

* Apply burst-oriented software optimization techniques to remove operations that would otherwise be unavoidable, such as ring index wrap back management.

Burst-oriented functions are also introduced via the API for services that are intensively used by the PMD.
This applies in particular to the buffer allocators used to populate NIC rings, which provide functions to allocate/free several buffers at a time.
For example, an mbuf_multiple_alloc function that returns an array of pointers to rte_mbuf buffers speeds up the receive poll function of the PMD when
replenishing multiple descriptors of the receive ring.

Logical Cores, Memory and NIC Queues Relationships
--------------------------------------------------

The DPDK supports NUMA allowing for better performance when a processor's logical cores and interfaces utilize its local memory.
Therefore, mbufs associated with local PCIe* interfaces should be allocated from memory pools created in the local memory.
The buffers should, if possible, remain on the local processor to obtain the best performance results, and RX and TX buffer descriptors
should be populated with mbufs allocated from a mempool allocated from local memory.

The run-to-completion model also performs better if packet or data manipulation is in local memory instead of a remote processor's memory.
This is also true for the pipe-line model provided all logical cores used are located on the same processor.

Multiple logical cores should never share receive or transmit queues for interfaces since this would require global locks and hinder performance.

If the PMD is ``RTE_ETH_TX_OFFLOAD_MT_LOCKFREE`` capable, multiple threads can invoke ``rte_eth_tx_burst()``
concurrently on the same Tx queue without a SW lock. This PMD feature, found in some NICs, is useful in the following use cases:

* Removing the explicit spinlock in some applications where lcores are not mapped to Tx queues with a 1:1 relation.

* In the eventdev use case, avoiding a dedicated TX core for transmitting and thus
  enabling more scaling as all workers can send the packets.

See `Hardware Offload`_ for ``RTE_ETH_TX_OFFLOAD_MT_LOCKFREE`` capability probing details.
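
A minimal sketch of the capability check that gates this choice. The flag value below is a placeholder, not the real bit: an actual application reads ``dev_info.tx_offload_capa`` filled in by ``rte_eth_dev_info_get()`` and tests the real ``RTE_ETH_TX_OFFLOAD_MT_LOCKFREE`` constant from ``rte_ethdev.h``.

```c
#include <stdint.h>
#include <stdbool.h>

/* Placeholder for the real RTE_ETH_TX_OFFLOAD_MT_LOCKFREE flag; the actual
 * bit position is defined in rte_ethdev.h. */
#define TX_OFFLOAD_MT_LOCKFREE (1ULL << 14)

/* Decide whether worker lcores must take a software lock before calling
 * the transmit burst function concurrently on one Tx queue. */
bool tx_needs_sw_lock(uint64_t tx_offload_capa)
{
    return (tx_offload_capa & TX_OFFLOAD_MT_LOCKFREE) == 0;
}
```

Applications that find the capability absent must fall back to either a per-lcore Tx queue or an explicit spinlock around the shared queue.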

Device Identification, Ownership and Configuration
--------------------------------------------------

Device Identification
~~~~~~~~~~~~~~~~~~~~~

Each NIC port is uniquely designated by its (bus/bridge, device, function) PCI
identifiers assigned by the PCI probing/enumeration function executed at DPDK initialization.
Based on their PCI identifier, NIC ports are assigned two other identifiers:

* A port index used to designate the NIC port in all functions exported by the PMD API.

* A port name used to designate the port in console messages, for administration or debugging purposes.
  For ease of use, the port name includes the port index.

Port Ownership
~~~~~~~~~~~~~~

Ethernet device ports can be owned by a single DPDK entity (application, library, PMD, process, etc.).
The ownership mechanism is controlled by ethdev APIs, which allow DPDK entities to set, remove or get a port owner.
It prevents Ethernet ports from being managed by different entities.

.. note::

   It is the DPDK entity's responsibility to set the port owner before using it and to manage the port usage synchronization between different threads or processes.

It is recommended to set port ownership early,
such as during the probing notification ``RTE_ETH_EVENT_NEW``.
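
The ownership rule can be modeled with a small self-contained sketch. The registry below is hypothetical and stands in for the real ``rte_eth_dev_owner_set()``/``rte_eth_dev_owner_unset()`` APIs; it only illustrates the invariant that a port owned by one entity cannot be taken over by another.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_PORTS  8
#define OWNER_NONE 0 /* mirrors the idea of RTE_ETH_DEV_NO_OWNER */

static uint64_t port_owner[MAX_PORTS]; /* OWNER_NONE = unowned */

/* Take ownership of a port: succeeds only if the port is unowned or
 * already owned by the same entity. */
bool port_owner_set(uint16_t port_id, uint64_t owner_id)
{
    if (port_id >= MAX_PORTS || owner_id == OWNER_NONE)
        return false;
    if (port_owner[port_id] != OWNER_NONE &&
        port_owner[port_id] != owner_id)
        return false; /* already owned by another entity */
    port_owner[port_id] = owner_id;
    return true;
}

/* Release ownership: only the current owner may do so. */
bool port_owner_unset(uint16_t port_id, uint64_t owner_id)
{
    if (port_id >= MAX_PORTS || port_owner[port_id] != owner_id)
        return false;
    port_owner[port_id] = OWNER_NONE;
    return true;
}
```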

Device Configuration
~~~~~~~~~~~~~~~~~~~~

The configuration of each NIC port includes the following operations:

* Allocate PCI resources

* Reset the hardware (issue a Global Reset) to a well-known default state

* Set up the PHY and the link

* Initialize statistics counters

The PMD API must also export functions to start/stop the all-multicast feature of a port and functions to set/unset the port in promiscuous mode.

Some hardware offload features must be individually configured at port initialization through specific configuration parameters.
This is the case, for example, for the Receive Side Scaling (RSS) and Data Center Bridging (DCB) features.

On-the-Fly Configuration
~~~~~~~~~~~~~~~~~~~~~~~~

All device features that can be started or stopped "on the fly" (that is, without stopping the device) do not require the PMD API to export dedicated functions for this purpose.

All that is required is the mapping address of the device PCI registers to implement the configuration of these features in specific functions outside of the drivers.

For this purpose,
the PMD API exports a function that provides all the information associated with a device that can be used to set up a given device feature outside of the driver.
This includes the PCI vendor identifier, the PCI device identifier, the mapping address of the PCI device registers, and the name of the driver.

The main advantage of this approach is that it gives complete freedom on the choice of the API used to configure, to start, and to stop such features.

As an example, refer to the configuration of the IEEE1588 feature for the Intel® 82576 Gigabit Ethernet Controller and
the Intel® 82599 10 Gigabit Ethernet Controller in the testpmd application.

Other features such as the L3/L4 5-Tuple packet filtering feature of a port can be configured in the same way.
Ethernet* flow control (pause frame) can be configured on an individual port.
Refer to the testpmd source code for details.
Also, L4 (UDP/TCP/SCTP) checksum offload by the NIC can be enabled for an individual packet as long as the packet mbuf is set up correctly. See `Hardware Offload`_ for details.

Configuration of Transmit Queues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each transmit queue is independently configured with the following information:

* The number of descriptors of the transmit ring

* The socket identifier used to identify the appropriate DMA memory zone from which to allocate the transmit ring in NUMA architectures

* The values of the Prefetch, Host and Write-Back threshold registers of the transmit queue

* The *minimum* transmit packets to free threshold (tx_free_thresh).
  When the number of descriptors used to transmit packets exceeds this threshold, the network adapter should be checked to see if it has written back descriptors.
  A value of 0 can be passed during the TX queue configuration to indicate that the default value should be used.
  The default value for tx_free_thresh is 32.
  This ensures that the PMD does not search for completed descriptors until at least 32 have been processed by the NIC for this queue.

* The *minimum* RS bit threshold. The minimum number of transmit descriptors to use before setting the Report Status (RS) bit in the transmit descriptor.
  Note that this parameter may only be valid for Intel 10 GbE network adapters.
  The RS bit is set on the last descriptor used to transmit a packet if the number of descriptors used since the last RS bit setting,
  up to the first descriptor used to transmit the packet, exceeds the transmit RS bit threshold (tx_rs_thresh).
  In short, this parameter controls which transmit descriptors are written back to host memory by the network adapter.
  A value of 0 can be passed during the TX queue configuration to indicate that the default value should be used.
  The default value for tx_rs_thresh is 32.
  This ensures that at least 32 descriptors are used before the network adapter writes back the most recently used descriptor.
  This saves upstream PCIe* bandwidth resulting from TX descriptor write-backs.
  It is important to note that the TX Write-back threshold (TX wthresh) should be set to 0 when tx_rs_thresh is greater than 1.
  Refer to the Intel® 82599 10 Gigabit Ethernet Controller Datasheet for more details.

The following constraints must be satisfied for tx_free_thresh and tx_rs_thresh:

* tx_rs_thresh must be greater than 0.

* tx_rs_thresh must be less than the size of the ring minus 2.

* tx_rs_thresh must be less than or equal to tx_free_thresh.

* tx_free_thresh must be greater than 0.

* tx_free_thresh must be less than the size of the ring minus 3.

* For optimal performance, TX wthresh should be set to 0 when tx_rs_thresh is greater than 1.

One descriptor in the TX ring is used as a sentinel to avoid a hardware race condition, hence the maximum threshold constraints.

.. note::

   When configuring for DCB operation, at port initialization, both the number of transmit queues and the number of receive queues must be set to 128.

Free Tx mbuf on Demand
~~~~~~~~~~~~~~~~~~~~~~

Many of the drivers do not release the mbuf back to the mempool, or local cache,
immediately after the packet has been transmitted.
Instead, they leave the mbuf in their Tx ring and
either perform a bulk release when the ``tx_rs_thresh`` has been crossed
or free the mbuf when a slot in the Tx ring is needed.
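
The deferred-release behavior can be modeled with a self-contained sketch. The structure below is a hypothetical stand-in for a PMD's Tx ring bookkeeping: mbufs accumulate in the ring and are freed in bulk once the used-descriptor count crosses the threshold, while ``tx_done_cleanup_model()`` mirrors the on-demand release performed by ``rte_eth_tx_done_cleanup()``.

```c
#include <stdint.h>

/* Minimal model of deferred Tx mbuf release. */
struct tx_ring_model {
    uint16_t nb_used;   /* descriptors holding not-yet-freed mbufs */
    uint16_t rs_thresh; /* bulk-release threshold (tx_rs_thresh)   */
    uint32_t nb_freed;  /* total mbufs returned to the pool        */
};

/* "Transmit" one packet; returns the number of mbufs bulk-freed
 * (0 until the threshold is crossed). */
uint16_t tx_one(struct tx_ring_model *r)
{
    r->nb_used++;
    if (r->nb_used >= r->rs_thresh) {
        uint16_t freed = r->nb_used;
        r->nb_freed += freed;
        r->nb_used = 0;
        return freed;
    }
    return 0;
}

/* Model of rte_eth_tx_done_cleanup(): release used mbufs on demand,
 * regardless of whether the threshold has been crossed. */
uint16_t tx_done_cleanup_model(struct tx_ring_model *r)
{
    uint16_t freed = r->nb_used;
    r->nb_freed += freed;
    r->nb_used = 0;
    return freed;
}
```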

An application can request the driver to release used mbufs with the ``rte_eth_tx_done_cleanup()`` API.
This API requests the driver to release mbufs that are no longer in use,
independent of whether or not the ``tx_rs_thresh`` has been crossed.
There are two scenarios when an application may want the mbuf released immediately:

* When a given packet needs to be sent to multiple destination interfaces
  (either for Layer 2 flooding or Layer 3 multi-cast).
  One option is to make a copy of the packet or a copy of the header portion that needs to be manipulated.
  A second option is to transmit the packet and then poll the ``rte_eth_tx_done_cleanup()`` API
  until the reference count on the packet is decremented.
  Then the same packet can be transmitted to the next destination interface.
  The application is still responsible for managing any packet manipulations needed
  between the different destination interfaces, but a packet copy can be avoided.
  This API is independent of whether the packet was transmitted or dropped,
  only that the mbuf is no longer in use by the interface.

* Some applications are designed to make multiple runs, such as a packet generator.
  For performance reasons and consistency between runs,
  the application may want to reset back to an initial state
  between each run, where all mbufs are returned to the mempool.
  In this case, it can call the ``rte_eth_tx_done_cleanup()`` API
  for each destination interface it has been using
  to request the release of all its used mbufs.

To determine if a driver supports this API, check for the *Free Tx mbuf on demand* feature
in the *Network Interface Controller Drivers* document.

Hardware Offload
~~~~~~~~~~~~~~~~

Depending on the driver capabilities advertised by
``rte_eth_dev_info_get()``, the PMD may support hardware offloading
features such as checksumming, TCP segmentation, VLAN insertion or
lock-free multithreaded TX burst on the same TX queue.

The support of these offload features implies the addition of dedicated
status bit(s) and value field(s) into the rte_mbuf data structure, along
with their appropriate handling by the receive/transmit functions
exported by each PMD. The list of flags and their precise meaning is
described in the mbuf API documentation and in the :ref:`mbuf_meta` chapter.

Per-Port and Per-Queue Offloads
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the DPDK offload API, offloads are divided into per-port and per-queue offloads as follows:

* A per-queue offloading can be enabled on one queue and disabled on another queue at the same time.
* A pure per-port offload is one supported by the device but not on a per-queue basis.
* A pure per-port offloading can't be enabled on one queue and disabled on another queue at the same time.
* A pure per-port offloading must be enabled or disabled on all queues at the same time.
* Any offloading is either per-queue or pure per-port type, but can't be both types on the same device.
* Port capabilities = per-queue capabilities + pure per-port capabilities.
* Any supported offloading can be enabled on all queues.

The different offload capabilities can be queried using ``rte_eth_dev_info_get()``.
The ``dev_info->[rt]x_queue_offload_capa`` returned from ``rte_eth_dev_info_get()`` includes all per-queue offloading capabilities.
The ``dev_info->[rt]x_offload_capa`` returned from ``rte_eth_dev_info_get()`` includes all pure per-port and per-queue offloading capabilities.
Supported offloads can be either per-port or per-queue.

Offloads are enabled using the existing ``RTE_ETH_TX_OFFLOAD_*`` or ``RTE_ETH_RX_OFFLOAD_*`` flags.
Any offloading requested by an application must be within the device capabilities.
Any offloading is disabled by default if it is not set in the parameter
``dev_conf->[rt]xmode.offloads`` to ``rte_eth_dev_configure()`` and
``[rt]x_conf->offloads`` to ``rte_eth_[rt]x_queue_setup()``.

If any offloading is enabled in ``rte_eth_dev_configure()`` by an application,
it is enabled on all queues no matter whether it is per-queue or
per-port type and no matter whether it is set or cleared in
``[rt]x_conf->offloads`` to ``rte_eth_[rt]x_queue_setup()``.
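
The rules above can be condensed into a small validation sketch. The bit masks are hypothetical stand-ins for ``RTE_ETH_[RT]X_OFFLOAD_*`` flags, and the two functions model the checks applied at ``rte_eth_dev_configure()`` and ``rte_eth_[rt]x_queue_setup()`` time.

```c
#include <stdint.h>
#include <stdbool.h>

/* Capability masks as reported by rte_eth_dev_info_get() (modeled). */
struct offload_caps {
    uint64_t port_capa;  /* dev_info->[rt]x_offload_capa       */
    uint64_t queue_capa; /* dev_info->[rt]x_queue_offload_capa */
};

/* Device-level request (rte_eth_dev_configure()): every requested
 * offload must be within the port capabilities. */
bool dev_configure_ok(const struct offload_caps *c, uint64_t port_req)
{
    return (port_req & ~c->port_capa) == 0;
}

/* Queue-level request (rte_eth_[rt]x_queue_setup()): offloads newly
 * added at queue level must be of the per-queue type. */
bool queue_setup_ok(const struct offload_caps *c, uint64_t port_req,
                    uint64_t queue_req)
{
    uint64_t newly_added = queue_req & ~port_req;
    return (newly_added & ~c->queue_capa) == 0;
}
```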

If a per-queue offloading hasn't been enabled in ``rte_eth_dev_configure()``,
it can be enabled or disabled in ``rte_eth_[rt]x_queue_setup()`` for an individual queue.
An offload newly added in ``[rt]x_conf->offloads`` passed to ``rte_eth_[rt]x_queue_setup()`` by the application
is one which hasn't been enabled in ``rte_eth_dev_configure()`` and is requested to be enabled
in ``rte_eth_[rt]x_queue_setup()``. It must be of the per-queue type; otherwise an error is logged.

Poll Mode Driver API
--------------------

Generalities
~~~~~~~~~~~~

By default, all functions exported by a PMD are lock-free functions that are assumed
not to be invoked in parallel on different logical cores to work on the same target object.
For instance, a PMD receive function cannot be invoked in parallel on two logical cores to poll the same RX queue of the same port.
Of course, this function can be invoked in parallel by different logical cores on different RX queues.
It is the responsibility of the upper-level application to enforce this rule.

If needed, parallel accesses by multiple logical cores to shared queues can be explicitly protected by dedicated inline lock-aware functions
built on top of their corresponding lock-free functions of the PMD API.

Generic Packet Representation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A packet is represented by an rte_mbuf structure, which is a generic metadata structure containing all necessary housekeeping information.
This includes fields and status bits corresponding to offload hardware features, such as checksum computation of IP headers or VLAN tags.

The rte_mbuf data structure includes specific fields to represent, in a generic way, the offload features provided by network controllers.
For an input packet, most fields of the rte_mbuf structure are filled in by the PMD receive function with the information contained in the receive descriptor.
Conversely, for output packets, most fields of rte_mbuf structures are used by the PMD transmit function to initialize transmit descriptors.

See the :doc:`../mbuf_lib` chapter for more details.

Ethernet Device API
~~~~~~~~~~~~~~~~~~~

The Ethernet device API exported by the Ethernet PMDs is described in the *DPDK API Reference*.

.. _ethernet_device_standard_device_arguments:

Ethernet Device Standard Device Arguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Standard Ethernet device arguments allow for a set of commonly used arguments/
parameters, which are applicable to all Ethernet devices, to be available for
specifying a specific device and for passing common configuration
parameters to those ports.

* ``representor``: for a device which supports the creation of representor ports,
  this argument allows the user to specify which switch ports to enable port
  representors for::

     -a DBDF,representor=vf0
     -a DBDF,representor=vf[0,4,6,9]
     -a DBDF,representor=vf[0-31]
     -a DBDF,representor=vf[0,2-4,7,9-11]
     -a DBDF,representor=sf0
     -a DBDF,representor=sf[1,3,5]
     -a DBDF,representor=sf[0-1023]
     -a DBDF,representor=sf[0,2-4,7,9-11]
     -a DBDF,representor=pf1vf0
     -a DBDF,representor=pf[0-1]sf[0-127]
     -a DBDF,representor=pf1
     -a DBDF,representor=[pf[0-1],pf2vf[0-2],pf3[3,5-8]]

  (Multiple representors in one device argument can be represented as a list.)

.. note::

   PMDs are not required to support the standard device arguments, and users
   should consult the relevant PMD documentation to see which devargs are supported.

Extended Statistics API
~~~~~~~~~~~~~~~~~~~~~~~

The extended statistics API allows a PMD to expose all statistics that are
available to it, including statistics that are unique to the device.
Each statistic has three properties: ``name``, ``id`` and ``value``:

* ``name``: A human readable string formatted by the scheme detailed below.
* ``id``: An integer that represents only that statistic.
* ``value``: An unsigned 64-bit integer that is the value of the statistic.

Note that extended statistic identifiers are
driver-specific, and hence might not be the same for different ports.
The API consists of various ``rte_eth_xstats_*()`` functions, and allows an
application to be flexible in how it retrieves statistics.

Scheme for Human Readable Names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A naming scheme exists for the strings exposed to clients of the API. This is
to allow scraping of the API for statistics of interest. The naming scheme uses
strings split by a single underscore ``_``.
The scheme is as follows:

* direction
* detail 1
* detail 2
* detail n
* unit

Examples of common xstats strings, formatted to comply with the scheme
proposed above:

* ``rx_bytes``
* ``rx_crc_errors``
* ``tx_multicast_packets``

The scheme, although quite simple, allows flexibility in presenting and reading
information from the statistic strings. The following example illustrates the
naming scheme: ``rx_packets``. In this example, the string is split into two
components. The first component ``rx`` indicates that the statistic is
associated with the receive side of the NIC. The second component ``packets``
indicates that the unit of measure is packets.

A more complicated example: ``tx_size_128_to_255_packets``. In this example,
``tx`` indicates transmission, ``size`` is the first detail, ``128`` etc. are
further details, and ``packets`` indicates that this is a packet counter.

Some additions to the metadata scheme are as follows:

* If the first part does not match ``rx`` or ``tx``, the statistic does not
  have an affinity with either receive or transmit.

* If the first letter of the second part is ``q`` and this ``q`` is followed
  by a number, this statistic is part of a specific queue.

An example where queue numbers are used is as follows: ``tx_q7_bytes``, which
indicates this statistic applies to queue number 7, and represents the number
of transmitted bytes on that queue.

API Design
^^^^^^^^^^

The xstats API uses the ``name``, ``id``, and ``value`` to allow performant
lookup of specific statistics. Performant lookup means two things:

* No string comparisons with the ``name`` of the statistic in the fast-path
* Allow requesting only the statistics of interest

The API ensures these requirements are met by mapping the ``name`` of the
statistic to a unique ``id``, which is used as a key for lookup in the fast-path.
The API allows applications to request an array of ``id`` values, so that the
PMD only performs the required calculations. Expected usage is that the
application scans the ``name`` of each statistic, and caches the ``id``
if it has an interest in that statistic. On the fast-path, the integer can be
used to retrieve the actual ``value`` of the statistic that the ``id``
represents.

API Functions
^^^^^^^^^^^^^

The API is built out of a small number of functions, which can be used to
retrieve the number of statistics and the names, IDs and values of those
statistics.

* ``rte_eth_xstats_get_names_by_id()``: returns the names of the statistics.
  When given a ``NULL`` parameter the function returns the number of
  statistics that are available.

* ``rte_eth_xstats_get_id_by_name()``: searches for the statistic ID that
  matches ``xstat_name``. If found, the ``id`` integer is set.

* ``rte_eth_xstats_get_by_id()``: fills in an array of ``uint64_t`` values
  with the values of the statistics matching the provided ``ids`` array. If
  the ``ids`` array is ``NULL``, it returns all statistics that are available.


Application Usage
^^^^^^^^^^^^^^^^^

Imagine an application that wants to view the dropped packet count. If no
packets are dropped, the application does not read any other metrics for
performance reasons. If packets are dropped, the application has a particular
set of statistics that it requests. This "set" of statistics allows the app to
decide what next steps to perform. The following code snippets show how the
xstats API can be used to achieve this goal.

The first step is to get all statistics names and list them:

.. code-block:: c

    struct rte_eth_xstat_name *xstats_names;
    uint64_t *values;
    int len, i;

    /* Get number of stats */
    len = rte_eth_xstats_get_names_by_id(port_id, NULL, NULL, 0);
    if (len < 0) {
        printf("Cannot get xstats count\n");
        goto err;
    }

    xstats_names = malloc(sizeof(struct rte_eth_xstat_name) * len);
    if (xstats_names == NULL) {
        printf("Cannot allocate memory for xstat names\n");
        goto err;
    }

    /* Retrieve xstats names, passing NULL for IDs to return all statistics */
    if (len != rte_eth_xstats_get_names_by_id(port_id, xstats_names, NULL, len)) {
        printf("Cannot get xstat names\n");
        goto err;
    }

    values = malloc(sizeof(*values) * len);
    if (values == NULL) {
        printf("Cannot allocate memory for xstats\n");
        goto err;
    }

    /* Getting xstats values */
    if (len != rte_eth_xstats_get_by_id(port_id, NULL, values, len)) {
        printf("Cannot get xstat values\n");
        goto err;
    }

    /* Print all xstats names and values */
    for (i = 0; i < len; i++) {
        printf("%s: %"PRIu64"\n", xstats_names[i].name, values[i]);
    }

The application now has access to the names of all of the statistics that the
PMD exposes. The application can decide which statistics are of interest, and
cache the IDs of those statistics by looking up the name as follows:

.. code-block:: c

    uint64_t id;
    uint64_t value;
    const char *xstat_name = "rx_errors";

    if (!rte_eth_xstats_get_id_by_name(port_id, xstat_name, &id)) {
        rte_eth_xstats_get_by_id(port_id, &id, &value, 1);
        printf("%s: %"PRIu64"\n", xstat_name, value);
    } else {
        printf("Cannot find xstats with the given name\n");
        goto err;
    }

The API provides flexibility to the application so that it can look up multiple
statistics using an array containing multiple ``id`` numbers. This reduces the
function call overhead of retrieving statistics, and makes lookup of multiple
statistics simpler for the application.

.. code-block:: c

    #define APP_NUM_STATS 4
    /* application cached these ids previously; see above */
    uint64_t ids_array[APP_NUM_STATS] = {3, 4, 7, 21};
    uint64_t value_array[APP_NUM_STATS];

    /* Getting multiple xstats values from an array of IDs */
    rte_eth_xstats_get_by_id(port_id, ids_array, value_array, APP_NUM_STATS);

    uint32_t i;
    for (i = 0; i < APP_NUM_STATS; i++) {
        printf("%"PRIu64": %"PRIu64"\n", ids_array[i], value_array[i]);
    }


This array lookup API for xstats allows the application to create multiple
"groups" of statistics, and to look up the values of those IDs using a single
API call. As an end result, the application is able to achieve its goal of
monitoring a single statistic ("rx_errors" in this case), and if that shows
packets being dropped, it can easily retrieve a "set" of statistics using the
IDs array parameter to the ``rte_eth_xstats_get_by_id()`` function.

NIC Reset API
~~~~~~~~~~~~~

.. code-block:: c

    int rte_eth_dev_reset(uint16_t port_id);

Sometimes a port has to be reset passively. For example, when a PF is
reset, all its VFs should also be reset by the application to make them
consistent with the PF. A DPDK application can also call this function
to trigger a port reset.
Normally, a DPDK application would invoke this
function when an ``RTE_ETH_EVENT_INTR_RESET`` event is detected.

It is the duty of the PMD to trigger ``RTE_ETH_EVENT_INTR_RESET`` events, and
the application should register a callback function to handle these
events. When a PMD needs to trigger a reset, it can trigger an
``RTE_ETH_EVENT_INTR_RESET`` event. On receiving an
``RTE_ETH_EVENT_INTR_RESET`` event, an application can handle it as follows:
stop the working queues, stop calling the Rx and Tx functions, and then call
``rte_eth_dev_reset()``. For thread safety, all these operations should be
called from the same thread.

For example, when a PF is reset, the PF sends a message to notify the VFs of
this event and also triggers an interrupt to the VFs. Then, in the interrupt
service routine, the VFs detect this notification message and call
``rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_INTR_RESET, NULL)``.
This means that a PF reset triggers an ``RTE_ETH_EVENT_INTR_RESET``
event within the VFs. The function ``rte_eth_dev_callback_process()`` will
call the registered callback function. The callback function can trigger
the application to handle all operations the VF reset requires, including
stopping the Rx/Tx queues and calling ``rte_eth_dev_reset()``.

The ``rte_eth_dev_reset()`` function itself is a generic function which only
performs some hardware reset operations by calling ``dev_uninit()`` and
``dev_init()``. It does not handle synchronization, which is the
responsibility of the application.

The PMD itself should not call ``rte_eth_dev_reset()``. The PMD can trigger
the application to handle the reset event. It is the duty of the application
to handle all synchronization before it calls ``rte_eth_dev_reset()``.

The above error handling mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PASSIVE``.

Proactive Error Handling Mode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE``.
Unlike PASSIVE mode, where the application invokes the recovery,
in PROACTIVE mode the PMD recovers from errors automatically,
and only a small amount of work is required from the application.

During error detection and automatic recovery,
the PMD sets the data path pointers to dummy functions
(which prevent a crash),
and also makes sure the control path operations fail with the return code
``-EBUSY``.

Because the PMD recovers automatically,
the application can only sense that the data flow is disconnected for a while,
and that the control API returns an error during this period.

In order to sense the error happening/recovering,
as well as to restore some additional configuration,
three events are available:

``RTE_ETH_EVENT_ERR_RECOVERING``
   Notify the application that an error has been detected
   and that recovery is being started.
   Upon receiving the event, the application should not invoke
   any control path function until it receives the
   ``RTE_ETH_EVENT_RECOVERY_SUCCESS`` or ``RTE_ETH_EVENT_RECOVERY_FAILED`` event.

.. note::

   Before the PMD reports the recovery result,
   it may report the ``RTE_ETH_EVENT_ERR_RECOVERING`` event again,
   because a larger error may occur during the recovery.

``RTE_ETH_EVENT_RECOVERY_SUCCESS``
   Notify the application that the recovery from the error was successful.
   The PMD has already re-configured the port,
   and the effect is the same as a restart operation.

``RTE_ETH_EVENT_RECOVERY_FAILED``
   Notify the application that the recovery from the error failed.
   The port should not be used anymore,
   and the application should close it.

The error handling mode supported by the PMD can be reported through
``rte_eth_dev_info_get()``.