..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2015 Intel Corporation.

Poll Mode Driver
================

The DPDK includes 1 Gigabit, 10 Gigabit, 40 Gigabit and para-virtualized virtio Poll Mode Drivers.

A Poll Mode Driver (PMD) consists of APIs, provided through the BSD driver running in user space,
to configure the devices and their respective queues.
In addition, a PMD accesses the RX and TX descriptors directly without any interrupts
(with the exception of Link Status Change interrupts) to quickly receive,
process and deliver packets in the user's application.
This section describes the requirements of the PMDs,
their global design principles and proposes a high-level architecture and a generic external API for the Ethernet PMDs.

Requirements and Assumptions
----------------------------

The DPDK environment for packet processing applications allows for two models, run-to-completion and pipe-line:

*   In the *run-to-completion* model, a specific port's RX descriptor ring is polled for packets through an API.
    Packets are then processed on the same core and placed on a port's TX descriptor ring through an API for transmission.

*   In the *pipe-line* model, one core polls one or more ports' RX descriptor rings through an API.
    Packets are received and passed to another core via a ring.
    The other core continues to process the packet which then may be placed on a port's TX descriptor ring through an API for transmission.

In a synchronous run-to-completion model,
each logical core assigned to the DPDK executes a packet processing loop that includes the following steps:

*   Retrieve input packets through the PMD receive API

*   Process each received packet one at a time, up to its forwarding

*   Send pending output packets through the PMD transmit API

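The run-to-completion loop above can be sketched with the burst-oriented
receive/transmit APIs (a minimal illustration, not a complete application:
the ``BURST_SIZE`` value and the ``process_packet()`` helper are assumptions):

.. code-block:: c

   #include <rte_ethdev.h>
   #include <rte_mbuf.h>

   #define BURST_SIZE 32

   /* Hypothetical per-packet processing hook supplied by the application. */
   extern void process_packet(struct rte_mbuf *m);

   static void
   run_to_completion_loop(uint16_t port_id, uint16_t queue_id)
   {
       struct rte_mbuf *pkts[BURST_SIZE];

       for (;;) {
           /* Retrieve input packets through the PMD receive API. */
           uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id,
                                             pkts, BURST_SIZE);

           /* Process each received packet, up to its forwarding. */
           for (uint16_t i = 0; i < nb_rx; i++)
               process_packet(pkts[i]);

           /* Send pending output packets through the PMD transmit API. */
           uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id, pkts, nb_rx);

           /* Free any packets the transmit ring could not accept. */
           for (uint16_t i = nb_tx; i < nb_rx; i++)
               rte_pktmbuf_free(pkts[i]);
       }
   }
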
Conversely, in an asynchronous pipe-line model, some logical cores may be dedicated to the retrieval of received packets and
other logical cores to the processing of previously received packets.
Received packets are exchanged between logical cores through rings.
The loop for packet retrieval includes the following steps:

*   Retrieve input packets through the PMD receive API

*   Provide received packets to processing lcores through packet queues

The loop for packet processing includes the following steps:

*   Retrieve the received packet from the packet queue

*   Process the received packet, up to its retransmission if forwarded

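The two pipe-line loops can be sketched around an ``rte_ring`` used as the
packet queue (an illustrative sketch; the ``BURST_SIZE`` value and the
``process_packet()`` helper are assumptions):

.. code-block:: c

   #include <rte_ethdev.h>
   #include <rte_mbuf.h>
   #include <rte_ring.h>

   #define BURST_SIZE 32

   extern void process_packet(struct rte_mbuf *m);  /* hypothetical hook */

   /* Retrieval lcore: poll the port and hand packets to the ring. */
   static void
   rx_loop(uint16_t port_id, uint16_t queue_id, struct rte_ring *ring)
   {
       struct rte_mbuf *pkts[BURST_SIZE];

       for (;;) {
           uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id,
                                             pkts, BURST_SIZE);
           unsigned int sent = rte_ring_enqueue_burst(ring, (void **)pkts,
                                                      nb_rx, NULL);
           while (sent < nb_rx)          /* drop what the ring cannot hold */
               rte_pktmbuf_free(pkts[sent++]);
       }
   }

   /* Processing lcore: drain the ring, process and transmit. */
   static void
   worker_loop(uint16_t port_id, uint16_t queue_id, struct rte_ring *ring)
   {
       struct rte_mbuf *pkts[BURST_SIZE];

       for (;;) {
           unsigned int n = rte_ring_dequeue_burst(ring, (void **)pkts,
                                                   BURST_SIZE, NULL);
           for (unsigned int i = 0; i < n; i++)
               process_packet(pkts[i]);
           uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id, pkts, n);
           while (nb_tx < n)
               rte_pktmbuf_free(pkts[nb_tx++]);
       }
   }
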
To avoid any unnecessary interrupt processing overhead, the execution environment must not use any asynchronous notification mechanisms.
Whenever needed and appropriate, asynchronous communication should instead be introduced through the use of rings.

Avoiding lock contention is a key issue in a multi-core environment.
To address this issue, PMDs are designed to work with per-core private resources as much as possible.
For example, a PMD maintains a separate transmit queue per-core, per-port, if the PMD is not ``RTE_ETH_TX_OFFLOAD_MT_LOCKFREE`` capable.
In the same way, every receive queue of a port is assigned to and polled by a single logical core (lcore).

To comply with Non-Uniform Memory Access (NUMA), memory management is designed to assign to each logical core
a private buffer pool in local memory to minimize remote memory access.
The configuration of packet buffer pools should take into account the underlying physical memory architecture in terms of DIMMs,
channels and ranks.
The application must ensure that appropriate parameters are given at memory pool creation time.
See :doc:`../mempool_lib`.
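
Creating the buffer pool on the NUMA node local to the port can be sketched
as follows (illustrative values only; the pool name, pool size and cache size
are assumptions):

.. code-block:: c

   #include <rte_ethdev.h>
   #include <rte_mbuf.h>

   static struct rte_mempool *
   create_local_pool(uint16_t port_id)
   {
       /* Allocate the pool from the NUMA node the device is attached to. */
       int socket_id = rte_eth_dev_socket_id(port_id);

       return rte_pktmbuf_pool_create("rx_pool",   /* name (assumed) */
                                      8192,        /* number of mbufs */
                                      256,         /* per-lcore cache size */
                                      0,           /* private data size */
                                      RTE_MBUF_DEFAULT_BUF_SIZE,
                                      socket_id);
   }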

Design Principles
-----------------

The API and architecture of the Ethernet* PMDs are designed with the following guidelines in mind.

PMDs must help global policy-oriented decisions to be enforced at the upper application level.
Conversely, NIC PMD functions should not impede the benefits expected by upper-level global policies,
or worse prevent such policies from being applied.

For instance, both the receive and transmit functions of a PMD have a maximum number of packets/descriptors to poll.
This allows a run-to-completion processing stack to statically fix or
to dynamically adapt its overall behavior through different global loop policies, such as:

*   Receive, process immediately and transmit packets one at a time in a piecemeal fashion.

*   Receive as many packets as possible, then process all received packets, transmitting them immediately.

*   Receive a given maximum number of packets, process the received packets, accumulate them and finally send all accumulated packets to transmit.

To achieve optimal performance, overall software design choices and pure software optimization techniques must be considered and
balanced against available low-level hardware-based optimization features (CPU cache properties, bus speed, NIC PCI bandwidth, and so on).
The case of packet transmission is an example of this software/hardware tradeoff issue when optimizing burst-oriented network packet processing engines.
In the simplest case, the PMD could export only an rte_eth_tx_one function to transmit one packet at a time on a given queue.
On top of that, one can easily build an rte_eth_tx_burst function that loops invoking the rte_eth_tx_one function to transmit several packets at a time.
However, an rte_eth_tx_burst function is effectively implemented by the PMD to minimize the driver-level transmit cost per packet through the following optimizations:

*   Share among multiple packets the un-amortized cost of invoking the rte_eth_tx_one function.

*   Enable the rte_eth_tx_burst function to take advantage of burst-oriented hardware features (prefetch data in cache, use of NIC head/tail registers)
    to minimize the number of CPU cycles per packet, for example by avoiding unnecessary read memory accesses to ring transmit descriptors,
    or by systematically using arrays of pointers that exactly fit cache line boundaries and sizes.

*   Apply burst-oriented software optimization techniques to remove operations that would otherwise be unavoidable, such as ring index wrap back management.

Burst-oriented functions are also introduced via the API for services that are intensively used by the PMD.
This applies in particular to buffer allocators used to populate NIC rings, which provide functions to allocate/free several buffers at a time.
For example, an mbuf_multiple_alloc function that returns an array of pointers to rte_mbuf buffers speeds up the receive poll function of the PMD when
replenishing multiple descriptors of the receive ring.
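
In today's mbuf API this kind of burst allocator is exposed as
``rte_pktmbuf_alloc_bulk()``; its use can be sketched as follows (a minimal
sketch; the burst size is an arbitrary example and the pool is assumed to
exist):

.. code-block:: c

   #include <rte_mbuf.h>

   #define REFILL_BURST 32

   /* Allocate a burst of mbufs in one call instead of one at a time. */
   static int
   refill_burst(struct rte_mempool *pool, struct rte_mbuf *bufs[REFILL_BURST])
   {
       /* Returns 0 on success and a negative value on failure;
        * on failure, no mbuf is allocated. */
       return rte_pktmbuf_alloc_bulk(pool, bufs, REFILL_BURST);
   }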

Logical Cores, Memory and NIC Queues Relationships
--------------------------------------------------

The DPDK supports NUMA allowing for better performance when a processor's logical cores and interfaces utilize its local memory.
Therefore, mbuf allocation associated with local PCIe* interfaces should be allocated from memory pools created in the local memory.
The buffers should, if possible, remain on the local processor to obtain the best performance results, and RX and TX buffer descriptors
should be populated with mbufs allocated from a mempool allocated from local memory.

The run-to-completion model also performs better if packet or data manipulation is in local memory instead of a remote processor's memory.
This is also true for the pipe-line model provided all logical cores used are located on the same processor.

Multiple logical cores should never share receive or transmit queues for interfaces since this would require global locks and hinder performance.

If the PMD is ``RTE_ETH_TX_OFFLOAD_MT_LOCKFREE`` capable, multiple threads can invoke ``rte_eth_tx_burst()``
concurrently on the same Tx queue without a software lock. This PMD feature, found in some NICs, is useful in the following use cases:

*  Remove explicit spinlocks in some applications where lcores are not mapped to Tx queues in a 1:1 relation.

*  In the eventdev use case, avoid dedicating a separate Tx core for transmitting and thus
   enable more scaling as all workers can send the packets.

See `Hardware Offload`_ for ``RTE_ETH_TX_OFFLOAD_MT_LOCKFREE`` capability probing details.
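
Probing for this capability follows the usual offload-capability pattern
(a minimal sketch):

.. code-block:: c

   #include <rte_ethdev.h>

   static int
   port_supports_mt_lockfree(uint16_t port_id)
   {
       struct rte_eth_dev_info dev_info;

       if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
           return 0;

       /* Non-zero if concurrent rte_eth_tx_burst() calls on the same
        * queue are allowed without a software lock. */
       return (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MT_LOCKFREE) != 0;
   }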

Device Identification, Ownership and Configuration
--------------------------------------------------

Device Identification
~~~~~~~~~~~~~~~~~~~~~

Each NIC port is uniquely designated by its (bus/bridge, device, function) PCI
identifiers assigned by the PCI probing/enumeration function executed at DPDK initialization.
Based on their PCI identifier, NIC ports are assigned two other identifiers:

*   A port index used to designate the NIC port in all functions exported by the PMD API.

*   A port name used to designate the port in console messages, for administration or debugging purposes.
    For ease of use, the port name includes the port index.

Port Ownership
~~~~~~~~~~~~~~

An Ethernet device port can be owned by a single DPDK entity (application, library, PMD, process, etc.).
The ownership mechanism is controlled by ethdev APIs and allows DPDK entities to set, remove or get a port owner.
It prevents Ethernet ports from being managed by different entities.

.. note::

    It is the DPDK entity's responsibility to set the port owner before using it and to manage the port usage synchronization between different threads or processes.

It is recommended to set port ownership early,
such as during the probing notification ``RTE_ETH_EVENT_NEW``.
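
Taking ownership of a port can be sketched with the ``rte_eth_dev_owner_*``
APIs (a minimal sketch; the owner name ``"my_app"`` is an arbitrary example):

.. code-block:: c

   #include <stdio.h>
   #include <rte_ethdev.h>

   static int
   claim_port(uint16_t port_id)
   {
       struct rte_eth_dev_owner owner;
       int ret;

       /* Ask ethdev for a unique owner identifier. */
       ret = rte_eth_dev_owner_new(&owner.id);
       if (ret != 0)
           return ret;

       snprintf(owner.name, sizeof(owner.name), "my_app"); /* example name */

       /* Fails if another entity already owns the port. */
       return rte_eth_dev_owner_set(port_id, &owner);
   }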

Device Configuration
~~~~~~~~~~~~~~~~~~~~

The configuration of each NIC port includes the following operations:

*   Allocate PCI resources

*   Reset the hardware (issue a Global Reset) to a well-known default state

*   Set up the PHY and the link

*   Initialize statistics counters

The PMD API must also export functions to start/stop the all-multicast feature of a port and functions to set/unset the port in promiscuous mode.

Some hardware offload features must be individually configured at port initialization through specific configuration parameters.
This is the case for the Receive Side Scaling (RSS) and Data Center Bridging (DCB) features, for example.

On-the-Fly Configuration
~~~~~~~~~~~~~~~~~~~~~~~~

All device features that can be started or stopped "on the fly" (that is, without stopping the device) do not require the PMD API to export dedicated functions for this purpose.

All that is required is the mapping address of the device PCI registers to implement the configuration of these features in specific functions outside of the drivers.

For this purpose,
the PMD API exports a function that provides all the information associated with a device that can be used to set up a given device feature outside of the driver.
This includes the PCI vendor identifier, the PCI device identifier, the mapping address of the PCI device registers, and the name of the driver.

The main advantage of this approach is that it gives complete freedom on the choice of the API used to configure, start, and stop such features.

As an example, refer to the configuration of the IEEE1588 feature for the Intel® 82576 Gigabit Ethernet Controller and
the Intel® 82599 10 Gigabit Ethernet Controller in the testpmd application.

Other features such as the L3/L4 5-Tuple packet filtering feature of a port can be configured in the same way.
Ethernet* flow control (pause frame) can be configured on the individual port.
Refer to the testpmd source code for details.
Also, L4 (UDP/TCP/SCTP) checksum offload by the NIC can be enabled for an individual packet as long as the packet mbuf is set up correctly. See `Hardware Offload`_ for details.

Configuration of Transmit Queues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each transmit queue is independently configured with the following information:

*   The number of descriptors of the transmit ring

*   The socket identifier used to identify the appropriate DMA memory zone from which to allocate the transmit ring in NUMA architectures

*   The values of the Prefetch, Host and Write-Back threshold registers of the transmit queue

*   The *minimum* transmit packets to free threshold (tx_free_thresh).
    When the number of descriptors used to transmit packets exceeds this threshold, the network adapter should be checked to see if it has written back descriptors.
    A value of 0 can be passed during the TX queue configuration to indicate that the default value should be used.
    The default value for tx_free_thresh is 32.
    This ensures that the PMD does not search for completed descriptors until at least 32 have been processed by the NIC for this queue.

*   The *minimum* RS bit threshold (tx_rs_thresh): the minimum number of transmit descriptors to use before setting the Report Status (RS) bit in the transmit descriptor.
    Note that this parameter may only be valid for Intel 10 GbE network adapters.
    The RS bit is set on the last descriptor used to transmit a packet if the number of descriptors used since the last RS bit setting,
    up to the first descriptor used to transmit the packet, exceeds the transmit RS bit threshold (tx_rs_thresh).
    In short, this parameter controls which transmit descriptors are written back to host memory by the network adapter.
    A value of 0 can be passed during the TX queue configuration to indicate that the default value should be used.
    The default value for tx_rs_thresh is 32.
    This ensures that at least 32 descriptors are used before the network adapter writes back the most recently used descriptor.
    This saves upstream PCIe* bandwidth resulting from TX descriptor write-backs.
    It is important to note that the TX Write-back threshold (TX wthresh) should be set to 0 when tx_rs_thresh is greater than 1.
    Refer to the Intel® 82599 10 Gigabit Ethernet Controller Datasheet for more details.

The following constraints must be satisfied for tx_free_thresh and tx_rs_thresh:

*   tx_rs_thresh must be greater than 0.

*   tx_rs_thresh must be less than the size of the ring minus 2.

*   tx_rs_thresh must be less than or equal to tx_free_thresh.

*   tx_free_thresh must be greater than 0.

*   tx_free_thresh must be less than the size of the ring minus 3.

*   For optimal performance, TX wthresh should be set to 0 when tx_rs_thresh is greater than 1.

One descriptor in the TX ring is used as a sentinel to avoid a hardware race condition, hence the maximum threshold constraints.
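
The constraints above can be captured in a small validation helper
(a plain-C sketch, independent of any driver):

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Check the documented constraints between the ring size,
    * tx_rs_thresh and tx_free_thresh. */
   static bool
   tx_thresh_valid(uint16_t nb_desc, uint16_t tx_rs_thresh,
                   uint16_t tx_free_thresh)
   {
       if (tx_rs_thresh == 0 || tx_free_thresh == 0)
           return false;
       if (tx_rs_thresh >= (uint16_t)(nb_desc - 2))
           return false;
       if (tx_free_thresh >= (uint16_t)(nb_desc - 3))
           return false;
       if (tx_rs_thresh > tx_free_thresh)
           return false;
       return true;
   }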

.. note::

    When configuring for DCB operation, at port initialization, both the number of transmit queues and the number of receive queues must be set to 128.

Free Tx mbuf on Demand
~~~~~~~~~~~~~~~~~~~~~~

Many of the drivers do not release the mbuf back to the mempool, or local cache,
immediately after the packet has been transmitted.
Instead, they leave the mbuf in their Tx ring and
either perform a bulk release when the ``tx_rs_thresh`` has been crossed
or free the mbuf when a slot in the Tx ring is needed.

An application can request the driver to release used mbufs with the ``rte_eth_tx_done_cleanup()`` API.
This API requests the driver to release mbufs that are no longer in use,
independent of whether or not the ``tx_rs_thresh`` has been crossed.
There are two scenarios when an application may want the mbufs released immediately:

* When a given packet needs to be sent to multiple destination interfaces
  (either for Layer 2 flooding or Layer 3 multi-cast).
  One option is to make a copy of the packet or a copy of the header portion that needs to be manipulated.
  A second option is to transmit the packet and then poll the ``rte_eth_tx_done_cleanup()`` API
  until the reference count on the packet is decremented.
  Then the same packet can be transmitted to the next destination interface.
  The application is still responsible for managing any packet manipulations needed
  between the different destination interfaces, but a packet copy can be avoided.
  This API is independent of whether the packet was transmitted or dropped,
  only that the mbuf is no longer in use by the interface.

* Some applications are designed to make multiple runs, like a packet generator.
  For performance reasons and consistency between runs,
  the application may want to reset back to an initial state
  between each run, where all mbufs are returned to the mempool.
  In this case, it can call the ``rte_eth_tx_done_cleanup()`` API
  for each destination interface it has been using
  to request that it release all of its used mbufs.

To determine if a driver supports this API, check for the *Free Tx mbuf on demand* feature
in the *Network Interface Controller Drivers* document.
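
Requesting an immediate release of completed mbufs can be sketched as
follows (a minimal sketch):

.. code-block:: c

   #include <rte_ethdev.h>

   /* Ask the driver to free up to 'max' completed mbufs on one Tx queue;
    * a 'max' of 0 requests that all completed mbufs be freed.
    * Returns the number freed, or a negative errno (e.g. -ENOTSUP if the
    * driver does not implement the operation). */
   static int
   drain_tx_mbufs(uint16_t port_id, uint16_t queue_id, uint32_t max)
   {
       return rte_eth_tx_done_cleanup(port_id, queue_id, max);
   }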

Hardware Offload
~~~~~~~~~~~~~~~~

Depending on driver capabilities advertised by
``rte_eth_dev_info_get()``, the PMD may support hardware offloading
features like checksumming, TCP segmentation, VLAN insertion or
lockfree multithreaded TX burst on the same TX queue.

The support of these offload features implies the addition of dedicated
status bit(s) and value field(s) into the rte_mbuf data structure, along
with their appropriate handling by the receive/transmit functions
exported by each PMD. The list of flags and their precise meaning is
described in the mbuf API documentation and in the :ref:`mbuf_meta` chapter.

Per-Port and Per-Queue Offloads
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the DPDK offload API, offloads are divided into per-port and per-queue offloads as follows:

* A per-queue offload can be enabled on one queue and disabled on another queue at the same time.
* A pure per-port offload is one supported by the device but not of the per-queue type.
* A pure per-port offload can't be enabled on one queue and disabled on another queue at the same time.
* A pure per-port offload must be enabled or disabled on all queues at the same time.
* Any offload is either per-queue or pure per-port type, but can't be both types on the same device.
* Port capabilities = per-queue capabilities + pure per-port capabilities.
* Any supported offload can be enabled on all queues.

The different offload capabilities can be queried using ``rte_eth_dev_info_get()``.
The ``dev_info->[rt]x_queue_offload_capa`` returned from ``rte_eth_dev_info_get()`` includes all per-queue offloading capabilities.
The ``dev_info->[rt]x_offload_capa`` returned from ``rte_eth_dev_info_get()`` includes all pure per-port and per-queue offloading capabilities.
Supported offloads can be either per-port or per-queue.

Offloads are enabled using the existing ``RTE_ETH_TX_OFFLOAD_*`` or ``RTE_ETH_RX_OFFLOAD_*`` flags.
Any offload requested by an application must be within the device capabilities.
Any offload is disabled by default if it is not set in the parameter
``dev_conf->[rt]xmode.offloads`` to ``rte_eth_dev_configure()`` and
``[rt]x_conf->offloads`` to ``rte_eth_[rt]x_queue_setup()``.

If any offload is enabled in ``rte_eth_dev_configure()`` by an application,
it is enabled on all queues no matter whether it is per-queue or
per-port type and no matter whether it is set or cleared in
``[rt]x_conf->offloads`` to ``rte_eth_[rt]x_queue_setup()``.

If a per-queue offload hasn't been enabled in ``rte_eth_dev_configure()``,
it can be enabled or disabled in ``rte_eth_[rt]x_queue_setup()`` for an individual queue.
A newly added offload in ``[rt]x_conf->offloads`` to ``rte_eth_[rt]x_queue_setup()`` is one which
hasn't been enabled in ``rte_eth_dev_configure()`` and is requested to be enabled
in ``rte_eth_[rt]x_queue_setup()``. It must be of the per-queue type, otherwise an error log is triggered.
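
The port-level versus queue-level split can be sketched as follows
(a minimal sketch; the choice of Rx checksum offload, the single
queue and the descriptor counts are illustrative, and ``mb_pool`` is
assumed to have been created beforehand):

.. code-block:: c

   #include <rte_ethdev.h>

   static int
   setup_port(uint16_t port_id, struct rte_mempool *mb_pool)
   {
       struct rte_eth_dev_info dev_info;
       struct rte_eth_conf port_conf = {0};
       struct rte_eth_rxconf rxq_conf;
       int ret;

       ret = rte_eth_dev_info_get(port_id, &dev_info);
       if (ret != 0)
           return ret;

       /* Enable a port-level Rx offload only if the device supports it. */
       if (dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_CHECKSUM)
           port_conf.rxmode.offloads |= RTE_ETH_RX_OFFLOAD_CHECKSUM;

       ret = rte_eth_dev_configure(port_id, 1, 1, &port_conf);
       if (ret != 0)
           return ret;

       /* Per-queue offloads start from the port-level configuration;
        * an additional per-queue-type offload could be set here if it
        * appears in dev_info.rx_queue_offload_capa. */
       rxq_conf = dev_info.default_rxconf;
       rxq_conf.offloads = port_conf.rxmode.offloads;

       ret = rte_eth_rx_queue_setup(port_id, 0, 1024,
                                    rte_eth_dev_socket_id(port_id),
                                    &rxq_conf, mb_pool);
       if (ret != 0)
           return ret;

       ret = rte_eth_tx_queue_setup(port_id, 0, 1024,
                                    rte_eth_dev_socket_id(port_id), NULL);
       if (ret != 0)
           return ret;

       return rte_eth_dev_start(port_id);
   }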
331*41dd9a6bSDavid Young
332*41dd9a6bSDavid YoungPoll Mode Driver API
333*41dd9a6bSDavid Young--------------------
334*41dd9a6bSDavid Young
335*41dd9a6bSDavid YoungGeneralities
336*41dd9a6bSDavid Young~~~~~~~~~~~~
337*41dd9a6bSDavid Young
338*41dd9a6bSDavid YoungBy default, all functions exported by a PMD are lock-free functions that are assumed
339*41dd9a6bSDavid Youngnot to be invoked in parallel on different logical cores to work on the same target object.
340*41dd9a6bSDavid YoungFor instance, a PMD receive function cannot be invoked in parallel on two logical cores to poll the same RX queue of the same port.
341*41dd9a6bSDavid YoungOf course, this function can be invoked in parallel by different logical cores on different RX queues.
342*41dd9a6bSDavid YoungIt is the responsibility of the upper-level application to enforce this rule.
343*41dd9a6bSDavid Young
344*41dd9a6bSDavid YoungIf needed, parallel accesses by multiple logical cores to shared queues can be explicitly protected by dedicated inline lock-aware functions
345*41dd9a6bSDavid Youngbuilt on top of their corresponding lock-free functions of the PMD API.
346*41dd9a6bSDavid Young
347*41dd9a6bSDavid YoungGeneric Packet Representation
348*41dd9a6bSDavid Young~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
349*41dd9a6bSDavid Young
350*41dd9a6bSDavid YoungA packet is represented by an rte_mbuf structure, which is a generic metadata structure containing all necessary housekeeping information.
351*41dd9a6bSDavid YoungThis includes fields and status bits corresponding to offload hardware features, such as checksum computation of IP headers or VLAN tags.
352*41dd9a6bSDavid Young
353*41dd9a6bSDavid YoungThe rte_mbuf data structure includes specific fields to represent, in a generic way, the offload features provided by network controllers.
354*41dd9a6bSDavid YoungFor an input packet, most fields of the rte_mbuf structure are filled in by the PMD receive function with the information contained in the receive descriptor.
355*41dd9a6bSDavid YoungConversely, for output packets, most fields of rte_mbuf structures are used by the PMD transmit function to initialize transmit descriptors.
356*41dd9a6bSDavid Young
357*41dd9a6bSDavid YoungSee :doc:`../mbuf_lib` chapter for more details.
358*41dd9a6bSDavid Young
359*41dd9a6bSDavid YoungEthernet Device API
360*41dd9a6bSDavid Young~~~~~~~~~~~~~~~~~~~
361*41dd9a6bSDavid Young
362*41dd9a6bSDavid YoungThe Ethernet device API exported by the Ethernet PMDs is described in the *DPDK API Reference*.
363*41dd9a6bSDavid Young
364*41dd9a6bSDavid Young.. _ethernet_device_standard_device_arguments:
365*41dd9a6bSDavid Young
366*41dd9a6bSDavid YoungEthernet Device Standard Device Arguments
367*41dd9a6bSDavid Young~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
368*41dd9a6bSDavid Young
369*41dd9a6bSDavid YoungStandard Ethernet device arguments allow for a set of commonly used arguments/
370*41dd9a6bSDavid Youngparameters which are applicable to all Ethernet devices to be available to for
371*41dd9a6bSDavid Youngspecification of specific device and for passing common configuration
372*41dd9a6bSDavid Youngparameters to those ports.
373*41dd9a6bSDavid Young
374*41dd9a6bSDavid Young* ``representor`` for a device which supports the creation of representor ports
375*41dd9a6bSDavid Young  this argument allows user to specify which switch ports to enable port
376*41dd9a6bSDavid Young  representors for::
377*41dd9a6bSDavid Young
378*41dd9a6bSDavid Young   -a DBDF,representor=vf0
379*41dd9a6bSDavid Young   -a DBDF,representor=vf[0,4,6,9]
380*41dd9a6bSDavid Young   -a DBDF,representor=vf[0-31]
381*41dd9a6bSDavid Young   -a DBDF,representor=vf[0,2-4,7,9-11]
382*41dd9a6bSDavid Young   -a DBDF,representor=sf0
383*41dd9a6bSDavid Young   -a DBDF,representor=sf[1,3,5]
384*41dd9a6bSDavid Young   -a DBDF,representor=sf[0-1023]
385*41dd9a6bSDavid Young   -a DBDF,representor=sf[0,2-4,7,9-11]
386*41dd9a6bSDavid Young   -a DBDF,representor=pf1vf0
387*41dd9a6bSDavid Young   -a DBDF,representor=pf[0-1]sf[0-127]
388*41dd9a6bSDavid Young   -a DBDF,representor=pf1
389*41dd9a6bSDavid Young   -a DBDF,representor=[pf[0-1],pf2vf[0-2],pf3[3,5-8]]
390*41dd9a6bSDavid Young   (Multiple representors in one device argument can be represented as a list)
391*41dd9a6bSDavid Young
392*41dd9a6bSDavid YoungNote: PMDs are not required to support the standard device arguments and users
393*41dd9a6bSDavid Youngshould consult the relevant PMD documentation to see support devargs.
394*41dd9a6bSDavid Young
395*41dd9a6bSDavid YoungExtended Statistics API
396*41dd9a6bSDavid Young~~~~~~~~~~~~~~~~~~~~~~~
397*41dd9a6bSDavid Young
398*41dd9a6bSDavid YoungThe extended statistics API allows a PMD to expose all statistics that are
399*41dd9a6bSDavid Youngavailable to it, including statistics that are unique to the device.
400*41dd9a6bSDavid YoungEach statistic has three properties ``name``, ``id`` and ``value``:
401*41dd9a6bSDavid Young
* ``name``: A human-readable string formatted using the scheme detailed below.
* ``id``: An integer that uniquely identifies that statistic.
* ``value``: An unsigned 64-bit integer that is the value of the statistic.
405*41dd9a6bSDavid Young
406*41dd9a6bSDavid YoungNote that extended statistic identifiers are
407*41dd9a6bSDavid Youngdriver-specific, and hence might not be the same for different ports.
408*41dd9a6bSDavid YoungThe API consists of various ``rte_eth_xstats_*()`` functions, and allows an
409*41dd9a6bSDavid Youngapplication to be flexible in how it retrieves statistics.
410*41dd9a6bSDavid Young
411*41dd9a6bSDavid YoungScheme for Human Readable Names
412*41dd9a6bSDavid Young^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
413*41dd9a6bSDavid Young
414*41dd9a6bSDavid YoungA naming scheme exists for the strings exposed to clients of the API. This is
415*41dd9a6bSDavid Youngto allow scraping of the API for statistics of interest. The naming scheme uses
416*41dd9a6bSDavid Youngstrings split by a single underscore ``_``. The scheme is as follows:
417*41dd9a6bSDavid Young
418*41dd9a6bSDavid Young* direction
419*41dd9a6bSDavid Young* detail 1
420*41dd9a6bSDavid Young* detail 2
421*41dd9a6bSDavid Young* detail n
422*41dd9a6bSDavid Young* unit
423*41dd9a6bSDavid Young
Examples of common xstats strings, formatted to comply with the scheme
proposed above:
426*41dd9a6bSDavid Young
427*41dd9a6bSDavid Young* ``rx_bytes``
428*41dd9a6bSDavid Young* ``rx_crc_errors``
429*41dd9a6bSDavid Young* ``tx_multicast_packets``
430*41dd9a6bSDavid Young
431*41dd9a6bSDavid YoungThe scheme, although quite simple, allows flexibility in presenting and reading
432*41dd9a6bSDavid Younginformation from the statistic strings. The following example illustrates the
naming scheme: ``rx_packets``. In this example, the string is split into two
434*41dd9a6bSDavid Youngcomponents. The first component ``rx`` indicates that the statistic is
435*41dd9a6bSDavid Youngassociated with the receive side of the NIC.  The second component ``packets``
436*41dd9a6bSDavid Youngindicates that the unit of measure is packets.
437*41dd9a6bSDavid Young
A more complicated example: ``tx_size_128_to_255_packets``. In this example,
``tx`` indicates transmission, ``size`` is the first detail, ``128`` etc. are
further details, and ``packets`` indicates that this is a packet counter.
441*41dd9a6bSDavid Young
442*41dd9a6bSDavid YoungSome additions in the metadata scheme are as follows:
443*41dd9a6bSDavid Young
* If the first part does not match ``rx`` or ``tx``, the statistic does not
  have an affinity with either receive or transmit.
446*41dd9a6bSDavid Young
447*41dd9a6bSDavid Young* If the first letter of the second part is ``q`` and this ``q`` is followed
448*41dd9a6bSDavid Young  by a number, this statistic is part of a specific queue.
449*41dd9a6bSDavid Young
An example where queue numbers are used: ``tx_q7_bytes``, which
indicates that this statistic applies to queue number 7 and represents the
number of transmitted bytes on that queue.
453*41dd9a6bSDavid Young
454*41dd9a6bSDavid YoungAPI Design
455*41dd9a6bSDavid Young^^^^^^^^^^
456*41dd9a6bSDavid Young
457*41dd9a6bSDavid YoungThe xstats API uses the ``name``, ``id``, and ``value`` to allow performant
lookup of specific statistics. Performant lookup means two things:
459*41dd9a6bSDavid Young
460*41dd9a6bSDavid Young* No string comparisons with the ``name`` of the statistic in fast-path
461*41dd9a6bSDavid Young* Allow requesting of only the statistics of interest
462*41dd9a6bSDavid Young
463*41dd9a6bSDavid YoungThe API ensures these requirements are met by mapping the ``name`` of the
464*41dd9a6bSDavid Youngstatistic to a unique ``id``, which is used as a key for lookup in the fast-path.
465*41dd9a6bSDavid YoungThe API allows applications to request an array of ``id`` values, so that the
466*41dd9a6bSDavid YoungPMD only performs the required calculations. Expected usage is that the
467*41dd9a6bSDavid Youngapplication scans the ``name`` of each statistic, and caches the ``id``
468*41dd9a6bSDavid Youngif it has an interest in that statistic. On the fast-path, the integer can be used
469*41dd9a6bSDavid Youngto retrieve the actual ``value`` of the statistic that the ``id`` represents.
470*41dd9a6bSDavid Young
471*41dd9a6bSDavid YoungAPI Functions
472*41dd9a6bSDavid Young^^^^^^^^^^^^^
473*41dd9a6bSDavid Young
474*41dd9a6bSDavid YoungThe API is built out of a small number of functions, which can be used to
475*41dd9a6bSDavid Youngretrieve the number of statistics and the names, IDs and values of those
476*41dd9a6bSDavid Youngstatistics.
477*41dd9a6bSDavid Young
* ``rte_eth_xstats_get_names_by_id()``: returns the names of the statistics. When given a
  ``NULL`` array parameter, the function returns the number of statistics that are available.
480*41dd9a6bSDavid Young
481*41dd9a6bSDavid Young* ``rte_eth_xstats_get_id_by_name()``: Searches for the statistic ID that matches
482*41dd9a6bSDavid Young  ``xstat_name``. If found, the ``id`` integer is set.
483*41dd9a6bSDavid Young
* ``rte_eth_xstats_get_by_id()``: Fills in an array of ``uint64_t`` values
  matching the provided ``ids`` array. If the ``ids`` array is ``NULL``, it
  returns all statistics that are available.
487*41dd9a6bSDavid Young
488*41dd9a6bSDavid Young
489*41dd9a6bSDavid YoungApplication Usage
490*41dd9a6bSDavid Young^^^^^^^^^^^^^^^^^
491*41dd9a6bSDavid Young
492*41dd9a6bSDavid YoungImagine an application that wants to view the dropped packet count. If no
493*41dd9a6bSDavid Youngpackets are dropped, the application does not read any other metrics for
494*41dd9a6bSDavid Youngperformance reasons. If packets are dropped, the application has a particular
495*41dd9a6bSDavid Youngset of statistics that it requests. This "set" of statistics allows the app to
496*41dd9a6bSDavid Youngdecide what next steps to perform. The following code-snippets show how the
497*41dd9a6bSDavid Youngxstats API can be used to achieve this goal.
498*41dd9a6bSDavid Young
The first step is to get all statistics names and list them:
500*41dd9a6bSDavid Young
501*41dd9a6bSDavid Young.. code-block:: c
502*41dd9a6bSDavid Young
503*41dd9a6bSDavid Young    struct rte_eth_xstat_name *xstats_names;
504*41dd9a6bSDavid Young    uint64_t *values;
505*41dd9a6bSDavid Young    int len, i;
506*41dd9a6bSDavid Young
507*41dd9a6bSDavid Young    /* Get number of stats */
    len = rte_eth_xstats_get_names_by_id(port_id, NULL, 0, NULL);
509*41dd9a6bSDavid Young    if (len < 0) {
510*41dd9a6bSDavid Young        printf("Cannot get xstats count\n");
511*41dd9a6bSDavid Young        goto err;
512*41dd9a6bSDavid Young    }
513*41dd9a6bSDavid Young
514*41dd9a6bSDavid Young    xstats_names = malloc(sizeof(struct rte_eth_xstat_name) * len);
515*41dd9a6bSDavid Young    if (xstats_names == NULL) {
516*41dd9a6bSDavid Young        printf("Cannot allocate memory for xstat names\n");
517*41dd9a6bSDavid Young        goto err;
518*41dd9a6bSDavid Young    }
519*41dd9a6bSDavid Young
520*41dd9a6bSDavid Young    /* Retrieve xstats names, passing NULL for IDs to return all statistics */
    if (len != rte_eth_xstats_get_names_by_id(port_id, xstats_names, len, NULL)) {
522*41dd9a6bSDavid Young        printf("Cannot get xstat names\n");
523*41dd9a6bSDavid Young        goto err;
524*41dd9a6bSDavid Young    }
525*41dd9a6bSDavid Young
    values = malloc(sizeof(*values) * len);
527*41dd9a6bSDavid Young    if (values == NULL) {
528*41dd9a6bSDavid Young        printf("Cannot allocate memory for xstats\n");
529*41dd9a6bSDavid Young        goto err;
530*41dd9a6bSDavid Young    }
531*41dd9a6bSDavid Young
532*41dd9a6bSDavid Young    /* Getting xstats values */
533*41dd9a6bSDavid Young    if (len != rte_eth_xstats_get_by_id(port_id, NULL, values, len)) {
534*41dd9a6bSDavid Young        printf("Cannot get xstat values\n");
535*41dd9a6bSDavid Young        goto err;
536*41dd9a6bSDavid Young    }
537*41dd9a6bSDavid Young
538*41dd9a6bSDavid Young    /* Print all xstats names and values */
539*41dd9a6bSDavid Young    for (i = 0; i < len; i++) {
540*41dd9a6bSDavid Young        printf("%s: %"PRIu64"\n", xstats_names[i].name, values[i]);
541*41dd9a6bSDavid Young    }
542*41dd9a6bSDavid Young
543*41dd9a6bSDavid YoungThe application has access to the names of all of the statistics that the PMD
exposes. The application can decide which statistics are of interest, and cache
the IDs of those statistics by looking up their names as follows:
546*41dd9a6bSDavid Young
547*41dd9a6bSDavid Young.. code-block:: c
548*41dd9a6bSDavid Young
549*41dd9a6bSDavid Young    uint64_t id;
550*41dd9a6bSDavid Young    uint64_t value;
551*41dd9a6bSDavid Young    const char *xstat_name = "rx_errors";
552*41dd9a6bSDavid Young
    if (rte_eth_xstats_get_id_by_name(port_id, xstat_name, &id) == 0) {
        rte_eth_xstats_get_by_id(port_id, &id, &value, 1);
        printf("%s: %"PRIu64"\n", xstat_name, value);
    } else {
        printf("Cannot find xstat with the given name\n");
        goto err;
    }
561*41dd9a6bSDavid Young
562*41dd9a6bSDavid YoungThe API provides flexibility to the application so that it can look up multiple
563*41dd9a6bSDavid Youngstatistics using an array containing multiple ``id`` numbers. This reduces the
564*41dd9a6bSDavid Youngfunction call overhead of retrieving statistics, and makes lookup of multiple
565*41dd9a6bSDavid Youngstatistics simpler for the application.
566*41dd9a6bSDavid Young
567*41dd9a6bSDavid Young.. code-block:: c
568*41dd9a6bSDavid Young
569*41dd9a6bSDavid Young    #define APP_NUM_STATS 4
570*41dd9a6bSDavid Young    /* application cached these ids previously; see above */
571*41dd9a6bSDavid Young    uint64_t ids_array[APP_NUM_STATS] = {3,4,7,21};
572*41dd9a6bSDavid Young    uint64_t value_array[APP_NUM_STATS];
573*41dd9a6bSDavid Young
574*41dd9a6bSDavid Young    /* Getting multiple xstats values from array of IDs */
575*41dd9a6bSDavid Young    rte_eth_xstats_get_by_id(port_id, ids_array, value_array, APP_NUM_STATS);
576*41dd9a6bSDavid Young
577*41dd9a6bSDavid Young    uint32_t i;
    for (i = 0; i < APP_NUM_STATS; i++) {
        printf("%"PRIu64": %"PRIu64"\n", ids_array[i], value_array[i]);
    }
581*41dd9a6bSDavid Young
582*41dd9a6bSDavid Young
This array lookup API for xstats allows the application to create multiple
584*41dd9a6bSDavid Young"groups" of statistics, and look up the values of those IDs using a single API
585*41dd9a6bSDavid Youngcall. As an end result, the application is able to achieve its goal of
586*41dd9a6bSDavid Youngmonitoring a single statistic ("rx_errors" in this case), and if that shows
587*41dd9a6bSDavid Youngpackets being dropped, it can easily retrieve a "set" of statistics using the
IDs array parameter to the ``rte_eth_xstats_get_by_id()`` function.
589*41dd9a6bSDavid Young
590*41dd9a6bSDavid YoungNIC Reset API
591*41dd9a6bSDavid Young~~~~~~~~~~~~~
592*41dd9a6bSDavid Young
593*41dd9a6bSDavid Young.. code-block:: c
594*41dd9a6bSDavid Young
595*41dd9a6bSDavid Young    int rte_eth_dev_reset(uint16_t port_id);
596*41dd9a6bSDavid Young
Sometimes a port has to be reset passively. For example, when a PF is
reset, all its VFs should also be reset by the application to make them
consistent with the PF. A DPDK application can also call this function
to trigger a port reset. Normally, a DPDK application invokes this
function when an RTE_ETH_EVENT_INTR_RESET event is detected.
602*41dd9a6bSDavid Young
603*41dd9a6bSDavid YoungIt is the duty of the PMD to trigger RTE_ETH_EVENT_INTR_RESET events and
604*41dd9a6bSDavid Youngthe application should register a callback function to handle these
605*41dd9a6bSDavid Youngevents. When a PMD needs to trigger a reset, it can trigger an
606*41dd9a6bSDavid YoungRTE_ETH_EVENT_INTR_RESET event. On receiving an RTE_ETH_EVENT_INTR_RESET
607*41dd9a6bSDavid Youngevent, applications can handle it as follows: Stop working queues, stop
608*41dd9a6bSDavid Youngcalling Rx and Tx functions, and then call rte_eth_dev_reset(). For
609*41dd9a6bSDavid Youngthread safety all these operations should be called from the same thread.
610*41dd9a6bSDavid Young
For example, when a PF is reset, the PF sends a message to notify the VFs of
this event and also triggers an interrupt to the VFs. Then in the interrupt
service routine, the VFs detect this notification message and call
rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_INTR_RESET, NULL).
615*41dd9a6bSDavid YoungThis means that a PF reset triggers an RTE_ETH_EVENT_INTR_RESET
616*41dd9a6bSDavid Youngevent within VFs. The function rte_eth_dev_callback_process() will
617*41dd9a6bSDavid Youngcall the registered callback function. The callback function can trigger
618*41dd9a6bSDavid Youngthe application to handle all operations the VF reset requires including
619*41dd9a6bSDavid Youngstopping Rx/Tx queues and calling rte_eth_dev_reset().
620*41dd9a6bSDavid Young
The rte_eth_dev_reset() itself is a generic function which only does
some hardware reset operations through calling dev_uninit() and
dev_init(). It does not handle synchronization, which must be handled
by the application.
625*41dd9a6bSDavid Young
The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
the application to handle the reset event. It is the duty of the application to
handle all synchronization before it calls rte_eth_dev_reset().
629*41dd9a6bSDavid Young
630*41dd9a6bSDavid YoungThe above error handling mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PASSIVE``.
631*41dd9a6bSDavid Young
632*41dd9a6bSDavid YoungProactive Error Handling Mode
633*41dd9a6bSDavid Young~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
634*41dd9a6bSDavid Young
This mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE``.
Unlike the PASSIVE mode, in which the application invokes the recovery,
in PROACTIVE mode the PMD automatically recovers from errors,
and only a small amount of work is required from the application.
639*41dd9a6bSDavid Young
During error detection and automatic recovery,
the PMD sets the data path pointers to dummy functions
(which prevent a crash)
and also makes sure the control path operations fail with the return code ``-EBUSY``.
644*41dd9a6bSDavid Young
Because the PMD recovers automatically,
the application only observes that the data flow is disconnected for a while
and that the control API returns an error during this period.
648*41dd9a6bSDavid Young
649*41dd9a6bSDavid YoungIn order to sense the error happening/recovering,
650*41dd9a6bSDavid Youngas well as to restore some additional configuration,
651*41dd9a6bSDavid Youngthree events are available:
652*41dd9a6bSDavid Young
653*41dd9a6bSDavid Young``RTE_ETH_EVENT_ERR_RECOVERING``
654*41dd9a6bSDavid Young   Notify the application that an error is detected
655*41dd9a6bSDavid Young   and the recovery is being started.
656*41dd9a6bSDavid Young   Upon receiving the event, the application should not invoke
657*41dd9a6bSDavid Young   any control path function until receiving
658*41dd9a6bSDavid Young   ``RTE_ETH_EVENT_RECOVERY_SUCCESS`` or ``RTE_ETH_EVENT_RECOVERY_FAILED`` event.
659*41dd9a6bSDavid Young
660*41dd9a6bSDavid Young.. note::
661*41dd9a6bSDavid Young
662*41dd9a6bSDavid Young   Before the PMD reports the recovery result,
663*41dd9a6bSDavid Young   the PMD may report the ``RTE_ETH_EVENT_ERR_RECOVERING`` event again,
664*41dd9a6bSDavid Young   because a larger error may occur during the recovery.
665*41dd9a6bSDavid Young
666*41dd9a6bSDavid Young``RTE_ETH_EVENT_RECOVERY_SUCCESS``
667*41dd9a6bSDavid Young   Notify the application that the recovery from error is successful,
668*41dd9a6bSDavid Young   the PMD already re-configures the port,
669*41dd9a6bSDavid Young   and the effect is the same as a restart operation.
670*41dd9a6bSDavid Young
671*41dd9a6bSDavid Young``RTE_ETH_EVENT_RECOVERY_FAILED``
   Notify the application that the recovery from error failed
   and the port is not usable anymore.
   The application should close the port.
675*41dd9a6bSDavid Young
676*41dd9a6bSDavid YoungThe error handling mode supported by the PMD can be reported through
677*41dd9a6bSDavid Young``rte_eth_dev_info_get``.
678