1..  SPDX-License-Identifier: BSD-3-Clause
2    Copyright 2015 6WIND S.A.
3    Copyright 2015 Mellanox Technologies, Ltd
4
5.. include:: <isonum.txt>
6
7NVIDIA MLX5 Ethernet Driver
8===========================
9
10.. note::
11
12   NVIDIA acquired Mellanox Technologies in 2020.
13   The DPDK documentation and code might still include instances
14   of or references to Mellanox trademarks (like BlueField and ConnectX)
15   that are now NVIDIA trademarks.
16
17The mlx5 Ethernet poll mode driver library (**librte_net_mlx5**) provides support
18for **NVIDIA ConnectX-4**, **NVIDIA ConnectX-4 Lx** , **NVIDIA ConnectX-5**,
19**NVIDIA ConnectX-6**, **NVIDIA ConnectX-6 Dx**, **NVIDIA ConnectX-6 Lx**,
20**NVIDIA ConnectX-7**, **NVIDIA BlueField**, **NVIDIA BlueField-2** and
21**NVIDIA BlueField-3** families of 10/25/40/50/100/200/400 Gb/s adapters
22as well as their virtual functions (VF) in SR-IOV context.
23
24Supported NICs
25--------------
26
27The following NVIDIA device families are supported by the same mlx5 driver:
28
29  - ConnectX-4
30  - ConnectX-4 Lx
31  - ConnectX-5
32  - ConnectX-5 Ex
33  - ConnectX-6
34  - ConnectX-6 Dx
35  - ConnectX-6 Lx
36  - ConnectX-7
37  - BlueField
38  - BlueField-2
39  - BlueField-3
40
41Below are detailed device names:
42
43* NVIDIA\ |reg| ConnectX\ |reg|-4 10G MCX4111A-XCAT (1x10G)
44* NVIDIA\ |reg| ConnectX\ |reg|-4 10G MCX412A-XCAT (2x10G)
45* NVIDIA\ |reg| ConnectX\ |reg|-4 25G MCX4111A-ACAT (1x25G)
46* NVIDIA\ |reg| ConnectX\ |reg|-4 25G MCX412A-ACAT (2x25G)
47* NVIDIA\ |reg| ConnectX\ |reg|-4 40G MCX413A-BCAT (1x40G)
48* NVIDIA\ |reg| ConnectX\ |reg|-4 40G MCX4131A-BCAT (1x40G)
49* NVIDIA\ |reg| ConnectX\ |reg|-4 40G MCX415A-BCAT (1x40G)
50* NVIDIA\ |reg| ConnectX\ |reg|-4 50G MCX413A-GCAT (1x50G)
51* NVIDIA\ |reg| ConnectX\ |reg|-4 50G MCX4131A-GCAT (1x50G)
52* NVIDIA\ |reg| ConnectX\ |reg|-4 50G MCX414A-BCAT (2x50G)
53* NVIDIA\ |reg| ConnectX\ |reg|-4 50G MCX415A-GCAT (1x50G)
54* NVIDIA\ |reg| ConnectX\ |reg|-4 50G MCX416A-BCAT (2x50G)
55* NVIDIA\ |reg| ConnectX\ |reg|-4 50G MCX416A-GCAT (2x50G)
56* NVIDIA\ |reg| ConnectX\ |reg|-4 50G MCX415A-CCAT (1x100G)
57* NVIDIA\ |reg| ConnectX\ |reg|-4 100G MCX416A-CCAT (2x100G)
58* NVIDIA\ |reg| ConnectX\ |reg|-4 Lx 10G MCX4111A-XCAT (1x10G)
59* NVIDIA\ |reg| ConnectX\ |reg|-4 Lx 10G MCX4121A-XCAT (2x10G)
60* NVIDIA\ |reg| ConnectX\ |reg|-4 Lx 25G MCX4111A-ACAT (1x25G)
61* NVIDIA\ |reg| ConnectX\ |reg|-4 Lx 25G MCX4121A-ACAT (2x25G)
62* NVIDIA\ |reg| ConnectX\ |reg|-4 Lx 40G MCX4131A-BCAT (1x40G)
63* NVIDIA\ |reg| ConnectX\ |reg|-5 100G MCX556A-ECAT (2x100G)
64* NVIDIA\ |reg| ConnectX\ |reg|-5 Ex EN 100G MCX516A-CDAT (2x100G)
65* NVIDIA\ |reg| ConnectX\ |reg|-6 200G MCX654106A-HCAT (2x200G)
66* NVIDIA\ |reg| ConnectX\ |reg|-6 Dx EN 100G MCX623106AN-CDAT (2x100G)
67* NVIDIA\ |reg| ConnectX\ |reg|-6 Dx EN 200G MCX623105AN-VDAT (1x200G)
68* NVIDIA\ |reg| ConnectX\ |reg|-6 Lx EN 25G MCX631102AN-ADAT (2x25G)
69* NVIDIA\ |reg| ConnectX\ |reg|-7 200G CX713106AE-HEA_QP1_Ax (2x200G)
* NVIDIA\ |reg| BlueField\ |reg|-2 25G MBF2H332A-AEEOT_A1 (2x25G)
71* NVIDIA\ |reg| BlueField\ |reg|-3 200GbE 900-9D3B6-00CV-AA0 (2x200)
72* NVIDIA\ |reg| BlueField\ |reg|-3 200GbE 900-9D3B6-00SV-AA0 (2x200)
73* NVIDIA\ |reg| BlueField\ |reg|-3 400GbE 900-9D3B6-00CN-AB0 (2x400)
74* NVIDIA\ |reg| BlueField\ |reg|-3 100GbE 900-9D3B4-00CC-EA0 (2x100)
75* NVIDIA\ |reg| BlueField\ |reg|-3 100GbE 900-9D3B4-00SC-EA0 (2x100)
* NVIDIA\ |reg| BlueField\ |reg|-3 400GbE 900-9D3B4-00EN-EA0 (1x400)
77
78
79Design
80------
81
82Besides its dependency on libibverbs (that implies libmlx5 and associated
83kernel support), librte_net_mlx5 relies heavily on system calls for control
84operations such as querying/updating the MTU and flow control parameters.
85
86This capability allows the PMD to coexist with kernel network interfaces
87which remain functional, although they stop receiving unicast packets as
88long as they share the same MAC address.
This means legacy Linux control tools (for example: ethtool, ifconfig and
more) can operate on the same network interfaces that are owned by the DPDK
application.
92
93See :doc:`../../platform/mlx5` guide for more design details,
94including prerequisites installation.
95
96Features
97--------
98
99- Multi arch support: x86_64, POWER8, ARMv8, i686.
100- Multiple TX and RX queues.
101- Shared Rx queue.
102- Rx queue delay drop.
103- Rx queue available descriptor threshold event.
104- Host shaper support.
105- Support steering for external Rx queue created outside the PMD.
106- Support for scattered TX frames.
107- Advanced support for scattered Rx frames with tunable buffer attributes.
108- IPv4, IPv6, TCPv4, TCPv6, UDPv4 and UDPv6 RSS on any number of queues.
109- RSS using different combinations of fields: L3 only, L4 only or both,
110  and source only, destination only or both.
111- Several RSS hash keys, one for each flow type.
112- Default RSS operation with no hash key specification.
113- Symmetric RSS function.
114- Configurable RETA table.
115- Link flow control (pause frame).
116- Support for multiple MAC addresses.
117- VLAN filtering.
118- RX VLAN stripping.
119- TX VLAN insertion.
120- RX CRC stripping configuration.
121- TX mbuf fast free offload.
122- Promiscuous mode on PF and VF.
123- Multicast promiscuous mode on PF and VF.
124- Hardware checksum offloads.
125- Flow director (RTE_FDIR_MODE_PERFECT, RTE_FDIR_MODE_PERFECT_MAC_VLAN and
126  RTE_ETH_FDIR_REJECT).
127- Flow API, including :ref:`flow_isolated_mode`.
128- Multiple process.
129- KVM and VMware ESX SR-IOV modes are supported.
130- RSS hash result is supported.
131- Hardware TSO for generic IP or UDP tunnel, including VXLAN and GRE.
132- Hardware checksum Tx offload for generic IP or UDP tunnel, including VXLAN and GRE.
133- RX interrupts.
134- Statistics query including Basic, Extended and per queue.
135- Rx HW timestamp.
136- Tunnel types: VXLAN, L3 VXLAN, VXLAN-GPE, GRE, MPLSoGRE, MPLSoUDP, IP-in-IP, Geneve, GTP.
137- Tunnel HW offloads: packet type, inner/outer RSS, IP and UDP checksum verification.
138- NIC HW offloads: encapsulation (vxlan, gre, mplsoudp, mplsogre), NAT, routing, TTL
139  increment/decrement, count, drop, mark. For details please see :ref:`mlx5_offloads_support`.
- Flow insertion rate of more than a million flows per second, when using Direct Rules.
141- Support for multiple rte_flow groups.
142- Per packet no-inline hint flag to disable packet data copying into Tx descriptors.
143- Hardware LRO.
144- Hairpin.
145- Multiple-thread flow insertion.
146- Matching on IPv4 Internet Header Length (IHL).
147- Matching on IPv6 routing extension header.
148- Matching on GTP extension header with raw encap/decap action.
149- Matching on Geneve TLV option header with raw encap/decap action.
150- Matching on ESP header SPI field.
151- Matching on InfiniBand BTH.
152- Matching on random value.
153- Modify IPv4/IPv6 ECN field.
154- Push or remove IPv6 routing extension.
155- NAT64.
156- RSS support in sample action.
157- E-Switch mirroring and jump.
158- E-Switch mirroring and modify.
159- Send to kernel.
160- 21844 flow priorities for ingress or egress flow groups greater than 0 and for any transfer
161  flow group.
162- Flow quota.
163- Flow metering, including meter policy API.
164- Flow meter hierarchy.
165- Flow meter mark.
166- Flow integrity offload API.
167- Connection tracking.
168- Sub-Function representors.
169- Sub-Function.
170- Matching on represented port.
171- Matching on aggregated affinity.
172- Matching on external Tx queue.
173- Matching on E-Switch manager.
174
175
176Limitations
177-----------
178
179- Windows support:
180
181  On Windows, the features are limited:
182
183  - Promiscuous mode is not supported
184  - The following rules are supported:
185
186    - IPv4/UDP with CVLAN filtering
187    - Unicast MAC filtering
188
189  - Additional rules are supported from WinOF2 version 2.70:
190
191    - IPv4/TCP with CVLAN filtering
192    - L4 steering rules for port RSS of UDP, TCP and IP
193
194- PCI Virtual Function MTU:
195
196  MTU settings on PCI Virtual Functions have no effect.
197  The maximum receivable packet size for a VF is determined by the MTU
198  configured on its associated Physical Function.
199  DPDK applications using VFs must be prepared to handle packets
200  up to the maximum size of this PF port.
201
202- For secondary process:
203
204  - Forked secondary process not supported.
205  - MPRQ is not supported. Callback to free externally attached MPRQ buffer is set
206    in a primary process, but has a different virtual address in a secondary process.
207    Calling a function at the wrong address leads to a segmentation fault.
208  - External memory unregistered in EAL memseg list cannot be used for DMA
209    unless such memory has been registered by ``mlx5_mr_update_ext_mp()`` in
210    primary process and remapped to the same virtual address in secondary
211    process. If the external memory is registered by primary process but has
212    different virtual address in secondary process, unexpected error may happen.
213
214- Shared Rx queue:
215
  - Counters of received packets and bytes are identical for all devices in the same share group.
  - Counters of received packets and bytes are identical for all queues with the same group and queue ID.
218
219- Available descriptor threshold event:
220
221  - Does not support shared Rx queue and hairpin Rx queue.
222
223- The symmetric RSS function is supported by swapping source and destination
224  addresses and ports.
225
226- Host shaper:
227
  - Supported on BlueField series NICs starting from BlueField-2.
229  - When configuring host shaper with ``RTE_PMD_MLX5_HOST_SHAPER_FLAG_AVAIL_THRESH_TRIGGERED`` flag,
230    only rates 0 and 100Mbps are supported.
231
232- HW steering:
233
234  - WQE based high scaling and safer flow insertion/destruction.
235  - Set ``dv_flow_en`` to 2 in order to enable HW steering.
236  - Async queue-based ``rte_flow_async`` APIs supported only.
237  - NIC ConnectX-5 and before are not supported.
238  - Reconfiguring flow API engine is not supported.
239    Any subsequent call to ``rte_flow_configure()`` with different configuration
240    than initially provided will be rejected with ``-ENOTSUP`` error code.
241  - Partial match with item template is not supported.
242  - IPv6 5-tuple matching is not supported.
243  - With E-Switch enabled, ports which share the E-Switch domain
244    should be started and stopped in a specific order:
245
246    - When starting ports, the transfer proxy port should be started first
247      and port representors should follow.
248    - When stopping ports, all of the port representors
249      should be stopped before stopping the transfer proxy port.
250
251    If ports are started/stopped in an incorrect order,
252    ``rte_eth_dev_start()``/``rte_eth_dev_stop()`` will return an appropriate error code:
253
254    - ``-EAGAIN`` for ``rte_eth_dev_start()``.
255    - ``-EBUSY`` for ``rte_eth_dev_stop()``.
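
    For illustration only, a minimal sketch of this ordering, assuming the application keeps
    the representor port IDs in a hypothetical ``repr_ports`` array and resolves the proxy
    with ``rte_flow_pick_transfer_proxy()``:

    .. code-block:: c

       #include <rte_ethdev.h>
       #include <rte_flow.h>

       /* repr_ports[] holds the representor port IDs (hypothetical application data). */
       static int
       start_eswitch_ports(const uint16_t *repr_ports, uint16_t nb_repr, uint16_t *proxy_id)
       {
               struct rte_flow_error err;
               uint16_t i;
               int ret;

               /* Resolve the transfer proxy port shared by the representors. */
               ret = rte_flow_pick_transfer_proxy(repr_ports[0], proxy_id, &err);
               if (ret != 0)
                       return ret;
               /* Start the transfer proxy first, then the representors. */
               ret = rte_eth_dev_start(*proxy_id);
               if (ret != 0)
                       return ret;
               for (i = 0; i < nb_repr; i++) {
                       ret = rte_eth_dev_start(repr_ports[i]);
                       if (ret != 0)
                               return ret;
               }
               return 0;
       }

       static void
       stop_eswitch_ports(const uint16_t *repr_ports, uint16_t nb_repr, uint16_t proxy_id)
       {
               uint16_t i;

               /* Stop all representors before stopping the transfer proxy. */
               for (i = 0; i < nb_repr; i++)
                       rte_eth_dev_stop(repr_ports[i]);
               rte_eth_dev_stop(proxy_id);
       }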
256
257  - Matching on ICMP6 following IPv6 routing extension header,
258    should match ``ipv6_routing_ext_next_hdr`` instead of ICMP6.
259    IPv6 routing extension matching is not supported in flow template relaxed
260    matching mode (see ``struct rte_flow_pattern_template_attr::relaxed_matching``).
261
262  - The supported actions order is as below::
263
264          MARK (a)
265          *_DECAP (b)
266          OF_POP_VLAN
267          COUNT | AGE
268          METER_MARK | CONNTRACK
269          OF_PUSH_VLAN
270          MODIFY_FIELD
271          *_ENCAP (c)
272          JUMP | DROP | RSS (a) | QUEUE (a) | REPRESENTED_PORT (d)
273
274    a. Only supported on ingress.
275    b. Any decapsulation action, including the combination of RAW_ENCAP and RAW_DECAP actions
276       which results in L3 decapsulation.
277       Not supported on egress.
278    c. Any encapsulation action, including the combination of RAW_ENCAP and RAW_DECAP actions
279       which results in L3 encap.
280    d. Only in transfer (switchdev) mode.
281
282- When using Verbs flow engine (``dv_flow_en`` = 0), flow pattern without any
283  specific VLAN will match for VLAN packets as well:
284
285  When VLAN spec is not specified in the pattern, the matching rule will be created with VLAN as a wild card.
286  Meaning, the flow rule::
287
288        flow create 0 ingress pattern eth / vlan vid is 3 / ipv4 / end ...
289
  Will only match VLAN packets with VID 3, and the flow rule::
291
292        flow create 0 ingress pattern eth / ipv4 / end ...
293
294  Will match any ipv4 packet (VLAN included).
295
- When using Verbs flow engine (``dv_flow_en`` = 0), multi-tagged (QinQ) match is not supported.
297
298- When using DV flow engine (``dv_flow_en`` = 1), flow pattern with any VLAN specification will match only single-tagged packets unless the ETH item ``type`` field is 0x88A8 or the VLAN item ``has_more_vlan`` field is 1.
299  The flow rule::
300
301        flow create 0 ingress pattern eth / ipv4 / end ...
302
303  Will match any ipv4 packet.
304  The flow rules::
305
306        flow create 0 ingress pattern eth / vlan / end ...
307        flow create 0 ingress pattern eth has_vlan is 1 / end ...
308        flow create 0 ingress pattern eth type is 0x8100 / end ...
309
310  Will match single-tagged packets only, with any VLAN ID value.
311  The flow rules::
312
313        flow create 0 ingress pattern eth type is 0x88A8 / end ...
314        flow create 0 ingress pattern eth / vlan has_more_vlan is 1 / end ...
315
316  Will match multi-tagged packets only, with any VLAN ID value.
317
318- A flow pattern with 2 sequential VLAN items is not supported.
319
320- VLAN pop offload command:
321
  - Flow rules that have a VLAN pop offload command as one of their actions and
    lack a match on VLAN as one of their items are not supported.
324  - The command is not supported on egress traffic in NIC mode.
325
326- VLAN push offload is not supported on ingress traffic in NIC mode.
327
328- VLAN set PCP offload is not supported on existing headers.
329
- A multi-segment packet must not have more segments than reported by ``dev_infos_get()``
  in the ``tx_desc_lim.nb_seg_max`` field. This value depends on the maximal supported Tx descriptor
  size and ``txq_inline_min`` settings and may range from 2 (worst case forced by maximal
  inline settings) to 58.
334
335- Match on VXLAN supports any bits in the tunnel header
336
337  - Flag 8-bits and first 24-bits reserved fields matching
338    is only supported when using DV flow engine (``dv_flow_en`` = 2).
339  - For ConnectX-5, the UDP destination port must be the standard one (4789).
340  - Default UDP destination is 4789 if not explicitly specified.
  - The behavior of flow group 0 may differ, depending on the FW version.
342
343- Matching on VXLAN-GPE header fields:
344
345     - ``rsvd0``/``rsvd1`` matching support depends on FW version
346       when using DV flow engine (``dv_flow_en`` = 1).
347     - ``protocol`` should be explicitly specified in HWS (``dv_flow_en`` = 2).
348
349- L3 VXLAN and VXLAN-GPE tunnels cannot be supported together with MPLSoGRE and MPLSoUDP.
350
351- MPLSoGRE is not supported in HW steering (``dv_flow_en`` = 2).
352
353- MPLSoUDP with multiple MPLS headers is only supported in HW steering (``dv_flow_en`` = 2).
354
355- Match on Geneve header supports the following fields only:
356
357     - VNI
358     - OAM
359     - protocol type
360     - options length
361
362- Match on Geneve TLV option is supported on the following fields:
363
364     - Class
365     - Type
366     - Length
367     - Data
368
369  Class/Type/Length fields must be specified as well as masks.
370  Class/Type/Length specified masks must be full.
371  Matching Geneve TLV option without specifying data is not supported.
372  Matching Geneve TLV option with ``data & mask == 0`` is not supported.
373
374  In SW steering (``dv_flow_en`` = 1):
375
376     - Only one Class/Type/Length Geneve TLV option is supported per shared device.
377     - Supported only with ``FLEX_PARSER_PROFILE_ENABLE`` = 0.
378
379  In HW steering (``dv_flow_en`` = 2):
380
381     - Multiple Class/Type/Length Geneve TLV options are supported per physical device.
     - Multiple instances of the same Geneve TLV option are not supported in the same pattern template.
383     - Supported only with ``FLEX_PARSER_PROFILE_ENABLE`` = 8.
384     - Supported also with ``FLEX_PARSER_PROFILE_ENABLE`` = 0 for single DW only.
385     - Supported for FW version **xx.37.0142** and above.
386
387  .. _geneve_parser_api:
388
389  - An API (``rte_pmd_mlx5_create_geneve_tlv_parser``)
390    is available for the flexible parser used in HW steering:
391
392    Each physical device has 7 DWs for GENEVE TLV options.
    Partial option configuration is supported;
    the mask for data is provided at parser creation,
    indicating which DWs configuration is requested.
    Only masked data DWs can later be matched as item fields using the flow API.
397
398    - Matching of ``type`` field is supported for each configured option.
399    - However, for matching ``class`` field,
400      the option should be configured with ``match_on_class_mode=2``.
401      One extra DW is consumed for it.
402    - Matching on ``length`` field is not supported.
403
404    - More limitations with ``FLEX_PARSER_PROFILE_ENABLE`` = 0:
405
406      - single DW
407      - ``sample_len`` must be equal to ``option_len`` and not bigger than 1.
408      - ``match_on_class_mode`` different than 1 is not supported.
409      - ``offset`` must be 0.
410
    Although the parser is created per physical device, this API is port oriented.
    Each port should call this API before using the GENEVE OPT item,
    but its configuration must use the same options list,
    in the same internal order, as configured by the first port.

    Calling this API for different ports under the same physical device doesn't consume
    more DWs; the first call creates the parser and subsequent calls reuse the same configuration.
418
419- VF: flow rules created on VF devices can only match traffic targeted at the
420  configured MAC addresses (see ``rte_eth_dev_mac_addr_add()``).
421
422- Match on GTP tunnel header item supports the following fields only:
423
424     - v_pt_rsv_flags: E flag, S flag, PN flag
425     - msg_type
426     - teid
427
428- Match on GTP extension header only for GTP PDU session container (next
429  extension header type = 0x85).
430- Match on GTP extension header is not supported in group 0.
431
432- When using DV/Verbs flow engine (``dv_flow_en`` = 1/0 respectively),
433  match on SPI field in ESP header for group 0 is supported from ConnectX-7.
434
435- Matching on SPI field in ESP header is supported over the PF only.
436
437- Flex item:
438
439  - Hardware support: **NVIDIA BlueField-2** and **NVIDIA BlueField-3**.
440  - Flex item is supported on PF only.
441  - Hardware limits ``header_length_mask_width`` up to 6 bits.
442  - Firmware supports 8 global sample fields.
443    Each flex item allocates non-shared sample fields from that pool.
444  - Supported flex item can have 1 input link - ``eth`` or ``udp``
445    and up to 3 output links - ``ipv4`` or ``ipv6``.
446  - Flex item fields (``next_header``, ``next_protocol``, ``samples``)
447    do not participate in RSS hash functions.
448  - In flex item configuration, ``next_header.field_base`` value
449    must be byte aligned (multiple of 8).
  - When modifying a field with a flex item, the offset must be byte aligned (multiple of 8).
451
452- Match on random value:
453
454  - Supported only with HW Steering enabled (``dv_flow_en`` = 2).
455  - Supported only in table with ``nb_flows=1``.
456  - NIC ingress/egress flow in group 0 is not supported.
457  - Supports matching only 16 bits (LSB).
458
459- Match with compare result item (``RTE_FLOW_ITEM_TYPE_COMPARE``):
460
  - Only supported in HW steering (``dv_flow_en`` = 2) mode.
  - Only a single flow rule is supported per flow table.
463  - Only single item is supported per pattern template.
464  - In switch mode, when the ``repr_matching_en`` flag is enabled in the devargs
465    (which is the default setting),
466    the match with compare result item is not supported for ``ingress`` rules.
467    This is because an implicit ``REPRESENTED_PORT`` needs to be added to the matcher,
468    which conflicts with the single item limitation.
  - Only 32-bit comparison is supported, or 16-bit for the random field.
470  - Only supported for ``RTE_FLOW_FIELD_META``, ``RTE_FLOW_FIELD_TAG``,
471    ``RTE_FLOW_FIELD_ESP_SEQ_NUM``,
472    ``RTE_FLOW_FIELD_RANDOM`` and ``RTE_FLOW_FIELD_VALUE``.
473  - The field type ``RTE_FLOW_FIELD_VALUE`` must be the base (``b``) field.
474  - The field type ``RTE_FLOW_FIELD_RANDOM`` can only be compared with
475    ``RTE_FLOW_FIELD_VALUE``.
476
- Tx metadata does not go to the E-Switch steering domain for flow group 0.
  Flows within group 0 that use a set metadata action are rejected by hardware.
479
480.. note::
481
482   MAC addresses not already present in the bridge table of the associated
483   kernel network device will be added and cleaned up by the PMD when closing
484   the device. In case of ungraceful program termination, some entries may
485   remain present and should be removed manually by other means.
486
487- Buffer split offload is supported with regular Rx burst routine only,
488  no MPRQ feature or vectorized code can be engaged.
489
- When Multi-Packet Rx queue is configured (``mprq_en``), an Rx packet can be
  externally attached to a user-provided mbuf with RTE_MBUF_F_EXTERNAL set in
  ol_flags. As the mempool for the external buffer is managed by the PMD, all the
493  Rx mbufs must be freed before the device is closed. Otherwise, the mempool of
494  the external buffers will be freed by PMD and the application which still
495  holds the external buffers may be corrupted.
496  User-managed mempools with external pinned data buffers
497  cannot be used in conjunction with MPRQ
498  since packets may be already attached to PMD-managed external buffers.
499
500- If Multi-Packet Rx queue is configured (``mprq_en``) and Rx CQE compression is
501  enabled (``rxq_cqe_comp_en``) at the same time, RSS hash result is not fully
502  supported. Some Rx packets may not have RTE_MBUF_F_RX_RSS_HASH.
503
- IPv6 multicast messages are not supported on a VM when promiscuous mode
  and allmulticast mode are both set to off.
  To receive IPv6 multicast messages on a VM, explicitly set the relevant
  MAC address using the ``rte_eth_dev_mac_addr_add()`` API, for example as sketched below.
508
509- To support a mixed traffic pattern (some buffers from local host memory, some
510  buffers from other devices) with high bandwidth, a mbuf flag is used.
511
  An application hints the PMD whether or not it should try to inline the
  given mbuf data buffer. The PMD makes a best effort to act upon this request.
514
515  The hint flag ``RTE_PMD_MLX5_FINE_GRANULARITY_INLINE`` is dynamic,
516  registered by application with rte_mbuf_dynflag_register(). This flag is
517  purely driver-specific and declared in PMD specific header ``rte_pmd_mlx5.h``,
518  which is intended to be used by the application.
519
520  To query the supported specific flags in runtime,
521  the function ``rte_pmd_mlx5_get_dyn_flag_names`` returns the array of
522  currently (over present hardware and configuration) supported specific flags.
  The operating flow of the "not inline hint" feature is the following:
524
525    - application starts
526    - probe the devices, ports are created
527    - query the port capabilities
528    - if port supporting the feature is found
529    - register dynamic flag ``RTE_PMD_MLX5_FINE_GRANULARITY_INLINE``
530    - application starts the ports
531    - on ``dev_start()`` PMD checks whether the feature flag is registered and
532      enables the feature support in datapath
    - application might set the registered flag bit in ``ol_flags`` field
      of the mbuf being sent and the PMD will handle it appropriately.
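
  A minimal sketch of the registration and per-packet usage
  (the helper names are illustrative; the flag name macro comes from ``rte_pmd_mlx5.h``):

  .. code-block:: c

     #include <rte_bitops.h>
     #include <rte_mbuf.h>
     #include <rte_mbuf_dyn.h>
     #include <rte_pmd_mlx5.h>

     static uint64_t no_inline_flag;

     /* Register the driver-specific "no inline hint" dynamic mbuf flag. */
     static int
     register_no_inline_hint(void)
     {
             const struct rte_mbuf_dynflag desc = {
                     .name = RTE_PMD_MLX5_FINE_GRANULARITY_INLINE,
             };
             int bitnum = rte_mbuf_dynflag_register(&desc);

             if (bitnum < 0)
                     return -1;
             no_inline_flag = RTE_BIT64(bitnum);
             return 0;
     }

     /* Ask the PMD not to inline the data of this particular mbuf. */
     static void
     hint_no_inline(struct rte_mbuf *m)
     {
             m->ol_flags |= no_inline_flag;
     }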
535
- The number of descriptors in a Tx queue may be limited by data inline settings.
  Inline data require more descriptor building blocks and the overall block
  amount may exceed the hardware supported limits. The application should
  reduce the requested Tx size or adjust the data inline settings with the
  ``txq_inline_max`` and ``txq_inline_mpw`` devargs keys.
541
542- To provide the packet send scheduling on mbuf timestamps the ``tx_pp``
543  parameter should be specified.
544  When PMD sees the RTE_MBUF_DYNFLAG_TX_TIMESTAMP_NAME set on the packet
545  being sent it tries to synchronize the time of packet appearing on
  the wire with the specified packet timestamp. If the specified timestamp
  is in the past, it is ignored; if it is in the distant future,
  it is capped to some reasonable value (in the range of seconds).
  These specific cases ("too late" and "distant future") can be optionally
  reported via device xstats to assist applications in detecting
  time-related problems.
552
553  The timestamp upper "too-distant-future" limit
554  at the moment of invoking the Tx burst routine
555  can be estimated as ``tx_pp`` option (in nanoseconds) multiplied by 2^23.
556  Please note, for the testpmd txonly mode,
557  the limit is deduced from the expression::
558
559        (n_tx_descriptors / burst_size + 1) * inter_burst_gap
560
  No packet reordering according to timestamps is performed, neither within
  a packet burst nor between packets; it is entirely the application's
  responsibility to generate packets and their timestamps in the desired order.
  The timestamp can be put only in the first packet of a burst,
  providing scheduling for the entire burst.
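
  A minimal sketch of registering the timestamp field and flag and scheduling a packet,
  assuming the generic mbuf dynamic field/flag API (the helper names are illustrative):

  .. code-block:: c

     #include <rte_bitops.h>
     #include <rte_mbuf.h>
     #include <rte_mbuf_dyn.h>

     static int ts_offset;      /* offset of the timestamp dynamic field */
     static uint64_t ts_flag;   /* "Tx timestamp is valid" dynamic flag */

     /* Register the Tx timestamp dynamic field and flag (once at startup). */
     static int
     tx_scheduling_setup(void)
     {
             static const struct rte_mbuf_dynfield field_desc = {
                     .name = RTE_MBUF_DYNFIELD_TIMESTAMP_NAME,
                     .size = sizeof(rte_mbuf_timestamp_t),
                     .align = __alignof__(rte_mbuf_timestamp_t),
             };
             static const struct rte_mbuf_dynflag flag_desc = {
                     .name = RTE_MBUF_DYNFLAG_TX_TIMESTAMP_NAME,
             };
             int bitnum;

             ts_offset = rte_mbuf_dynfield_register(&field_desc);
             bitnum = rte_mbuf_dynflag_register(&flag_desc);
             if (ts_offset < 0 || bitnum < 0)
                     return -1;
             ts_flag = RTE_BIT64(bitnum);
             return 0;
     }

     /* Request the wire time (in nanoseconds) for the first packet of a burst. */
     static void
     schedule_first_packet(struct rte_mbuf *m, uint64_t wire_time_ns)
     {
             *RTE_MBUF_DYNFIELD(m, ts_offset, rte_mbuf_timestamp_t *) = wire_time_ns;
             m->ol_flags |= ts_flag;
     }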
566
567- E-Switch decapsulation Flow:
568
569  - can be applied to PF port only.
570  - must specify VF port action (packet redirection from PF to VF).
571  - optionally may specify tunnel inner source and destination MAC addresses.
572
- E-Switch encapsulation Flow:
574
575  - can be applied to VF ports only.
576  - must specify PF port action (packet redirection from VF to PF).
577
578- E-Switch Manager matching:
579
580  - For BlueField with old FW
581    which doesn't expose the E-Switch Manager vport ID in the capability,
582    matching E-Switch Manager should be used only in BlueField embedded CPU mode.
583
584- Raw encapsulation:
585
586  - The input buffer, used as outer header, is not validated.
587
588- Raw decapsulation:
589
590  - The decapsulation is always done up to the outermost tunnel detected by the HW.
591  - The input buffer, providing the removal size, is not validated.
592  - The buffer size must match the length of the headers to be removed.
593
594- Outer UDP checksum calculation for encapsulation flow actions:
595
596  - Currently available NVIDIA NICs and DPUs do not have a capability to calculate
597    the UDP checksum in the header added using encapsulation flow actions.
598
    Applications are required to use 0 in the UDP checksum field in such flow actions.
    The resulting packet will have an outer UDP checksum equal to 0.
601
602- ICMP(code/type/identifier/sequence number) / ICMP6(code/type/identifier/sequence number) matching,
603  IP-in-IP and MPLS flow matching are all mutually exclusive features which cannot be supported together
604  (see :ref:`mlx5_firmware_config`).
605
606- LRO:
607
608  - Requires DevX and DV flow to be enabled.
609  - KEEP_CRC offload cannot be supported with LRO.
  - The first mbuf length, without head-room, must be big enough to include the
    TCP header (122B).
612  - Rx queue with LRO offload enabled, receiving a non-LRO packet, can forward
613    it with size limited to max LRO size, not to max RX packet length.
614  - The driver rounds down the port configuration value ``max_lro_pkt_size``
615    (from ``rte_eth_rxmode``) to a multiple of 256 due to hardware limitation.
616  - LRO can be used with outer header of TCP packets of the standard format:
617        eth (with or without vlan) / ipv4 or ipv6 / tcp / payload
618
619    Other TCP packets (e.g. with MPLS label) received on Rx queue with LRO enabled, will be received with bad checksum.
620  - LRO packet aggregation is performed by HW only for packet size larger than
621    ``lro_min_mss_size``. This value is reported on device start, when debug
622    mode is enabled.
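
  A minimal configuration sketch (the 8 KB cap and function name are illustrative;
  as noted above, the PMD rounds ``max_lro_pkt_size`` down to a multiple of 256):

  .. code-block:: c

     #include <rte_ethdev.h>

     /* Enable LRO on all Rx queues and cap the aggregated packet size. */
     static int
     configure_lro(uint16_t port_id, uint16_t nb_rxq, uint16_t nb_txq)
     {
             struct rte_eth_conf conf = {0};

             conf.rxmode.offloads |= RTE_ETH_RX_OFFLOAD_TCP_LRO;
             conf.rxmode.max_lro_pkt_size = 8192; /* rounded down to a multiple of 256 */
             return rte_eth_dev_configure(port_id, nb_rxq, nb_txq, &conf);
     }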
623
624- CRC:
625
626  - ``RTE_ETH_RX_OFFLOAD_KEEP_CRC`` cannot be supported with decapsulation
627    for some NICs (such as ConnectX-6 Dx, ConnectX-6 Lx, ConnectX-7, BlueField-2,
628    and BlueField-3).
629    The capability bit ``scatter_fcs_w_decap_disable`` shows NIC support.
630
631- TX mbuf fast free:
632
  - Fast free offload assumes that all mbufs being sent originate from the
    same memory pool and there are no extra references to the mbufs (the
    reference counter of each mbuf equals 1 on the tx_burst call). The latter
    means there should be no externally attached buffers in the mbufs. It is
    the application's responsibility to provide correct mbufs if the fast
    free offload is engaged. The mlx5 PMD implicitly produces mbufs with
    externally attached buffers if the MPRQ option is enabled, hence the fast
    free offload is neither supported nor advertised if MPRQ is enabled.
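
  A sketch of enabling the offload, under the assumption that all transmitted mbufs
  come from a single mempool, have a reference count of 1 and carry no external buffers
  (the function name is illustrative):

  .. code-block:: c

     #include <errno.h>
     #include <rte_ethdev.h>

     /* Enable the mbuf fast free Tx offload port-wide, if advertised. */
     static int
     configure_fast_free(uint16_t port_id, uint16_t nb_rxq, uint16_t nb_txq)
     {
             struct rte_eth_dev_info dev_info;
             struct rte_eth_conf conf = {0};
             int ret;

             ret = rte_eth_dev_info_get(port_id, &dev_info);
             if (ret != 0)
                     return ret;
             if (!(dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE))
                     return -ENOTSUP; /* e.g. not advertised when MPRQ is enabled */
             conf.txmode.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;
             return rte_eth_dev_configure(port_id, nb_rxq, nb_txq, &conf);
     }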
641
642- Sample flow:
643
644  - Supports ``RTE_FLOW_ACTION_TYPE_SAMPLE`` action only within NIC Rx and
645    E-Switch steering domain.
646  - In E-Switch steering domain, for sampling with sample ratio > 1 in a transfer rule,
647    additional actions are not supported in the sample actions list.
  - For ConnectX-5, ``RTE_FLOW_ACTION_TYPE_SAMPLE`` is typically used as the
    first action in an E-Switch egress flow when combined with header modify or
    encapsulation actions.
651  - For NIC Rx flow, supports only ``MARK``, ``COUNT``, ``QUEUE``, ``RSS`` in the
652    sample actions list.
653  - In E-Switch steering domain, for mirroring with sample ratio = 1 in a transfer rule,
654    supports only ``RAW_ENCAP``, ``PORT_ID``, ``REPRESENTED_PORT``, ``VXLAN_ENCAP``, ``NVGRE_ENCAP``
655    in the sample actions list.
656  - In E-Switch steering domain, for mirroring with sample ratio = 1 in a transfer rule,
657    the encapsulation actions (``RAW_ENCAP`` or ``VXLAN_ENCAP`` or ``NVGRE_ENCAP``)
658    support uplink port only.
659  - In E-Switch steering domain, for mirroring with sample ratio = 1 in a transfer rule,
660    the port actions (``PORT_ID`` or ``REPRESENTED_PORT``) with uplink port and ``JUMP`` action
661    are not supported without the encapsulation actions
662    (``RAW_ENCAP`` or ``VXLAN_ENCAP`` or ``NVGRE_ENCAP``) in the sample actions list.
663  - For ConnectX-5 trusted device, the application metadata with SET_TAG index 0
664    is not supported before ``RTE_FLOW_ACTION_TYPE_SAMPLE`` action.
665
666- Modify Field flow:
667
668  - Supports the 'set' and 'add' operations for ``RTE_FLOW_ACTION_TYPE_MODIFY_FIELD`` action.
669  - Modification of an arbitrary place in a packet via the special ``RTE_FLOW_FIELD_START`` Field ID is not supported.
670  - Modify field action using ``RTE_FLOW_FIELD_RANDOM`` is not supported.
671  - Modification of the 802.1Q tag is not supported.
672  - Modification of VXLAN network or GENEVE network ID is supported only for HW steering.
673  - Modification of the VXLAN header is supported with below limitations:
674
675    - Only for HW steering (``dv_flow_en=2``).
676    - Support VNI and the last reserved byte modifications for traffic
677      with default UDP destination port: 4789 for VXLAN and VXLAN-GBP, 4790 for VXLAN-GPE.
678
679  - Modification of GENEVE network ID is not supported when configured
680    ``FLEX_PARSER_PROFILE_ENABLE`` supports Geneve TLV options.
681    See :ref:`mlx5_firmware_config` for more flex parser information.
682  - Modification of GENEVE TLV option fields is supported only for HW steering.
683    Only DWs configured in :ref:`parser creation <geneve_parser_api>` can be modified,
684    'type' and 'class' fields can be modified when ``match_on_class_mode=2``.
685  - Modification of GENEVE TLV option data supports one DW per action.
686  - Offsets cannot skip past the boundary of a field.
687  - If the field type is ``RTE_FLOW_FIELD_MAC_TYPE``
688    and packet contains one or more VLAN headers,
689    the meaningful type field following the last VLAN header
690    is used as modify field operation argument.
691    The modify field action is not intended to modify VLAN headers type field,
692    dedicated VLAN push and pop actions should be used instead.
693  - For packet fields (e.g. MAC addresses, IPv4 addresses or L4 ports)
694    offset specifies the number of bits to skip from field's start,
695    starting from MSB in the first byte, in the network order.
696  - For flow metadata fields (e.g. META or TAG)
697    offset specifies the number of bits to skip from field's start,
698    starting from LSB in the least significant byte, in the host order.
699  - Modification of the MPLS header is supported with some limitations:
700
701    - Only in HW steering.
702    - Only in ``src`` field.
703    - Only for outermost tunnel header (``level=2``).
704      For ``RTE_FLOW_FIELD_MPLS``,
705      the default encapsulation level ``0`` describes the outermost tunnel header.
706
707      .. note::
708
709         The default encapsulation level ``0`` describes
710         the "outermost that match is supported",
711         currently it is the first tunnel,
712         but it can be changed to outer when it is supported.
713
714  - Default encapsulation level ``0`` describes outermost.
715  - Encapsulation level ``2`` is supported with some limitations:
716
717    - Only in HW steering.
718    - Only in ``src`` field.
719    - ``RTE_FLOW_FIELD_VLAN_ID`` is not supported.
720    - ``RTE_FLOW_FIELD_IPV4_PROTO`` is not supported.
721    - ``RTE_FLOW_FIELD_IPV6_PROTO/DSCP/ECN`` are not supported.
722    - ``RTE_FLOW_FIELD_ESP_PROTO/SPI/SEQ_NUM`` are not supported.
723    - ``RTE_FLOW_FIELD_TCP_SEQ/ACK_NUM`` are not supported.
724    - Second tunnel fields are not supported.
725
726  - Encapsulation levels greater than ``2`` are not supported.
727
728- Age action:
729
730  - with HW steering (``dv_flow_en=2``)
731
732    - Using the same indirect count action combined with multiple age actions
733      in different flows may cause a wrong age state for the age actions.
734    - Creating/destroying flow rules with indirect age action when it is active
735      (timeout != 0) may cause a wrong age state for the indirect age action.
736
737    - The driver reuses counters for aging action, so for optimization
738      the values in ``rte_flow_port_attr`` structure should describe:
739
740      - ``nb_counters`` is the number of flow rules using counter (with/without age)
741        in addition to flow rules using only age (without count action).
742      - ``nb_aging_objects`` is the number of flow rules containing age action.
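
  A minimal pre-allocation sketch for HW steering (the sizes and function name are
  illustrative), typically called after device configuration and before creating
  template tables:

  .. code-block:: c

     #include <rte_flow.h>

     /* Reserve counter and aging objects for the asynchronous flow engine. */
     static int
     configure_flow_resources(uint16_t port_id)
     {
             const struct rte_flow_port_attr port_attr = {
                     /* rules using COUNT (with or without AGE) plus rules using AGE only */
                     .nb_counters = 4096,
                     /* rules containing an AGE action */
                     .nb_aging_objects = 1024,
             };
             const struct rte_flow_queue_attr queue_attr = { .size = 256 };
             const struct rte_flow_queue_attr *queue_list[] = { &queue_attr };
             struct rte_flow_error error;

             return rte_flow_configure(port_id, &port_attr, 1, queue_list, &error);
     }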
743
744- IPv6 header item 'proto' field, indicating the next header protocol, should
745  not be set as extension header.
746  In case the next header is an extension header, it should not be specified in
747  IPv6 header item 'proto' field.
748  The last extension header item 'next header' field can specify the following
749  header protocol type.
750
751- Match on IPv6 routing extension header supports the following fields only:
752
753  - ``type``
754  - ``next_hdr``
755  - ``segments_left``
756
757  Only supports HW steering (``dv_flow_en=2``).
758
759- IPv6 routing extension push/remove:
760
761  - Supported only with HW Steering enabled (``dv_flow_en=2``).
762  - Supported in non-zero group
763    (no limits on transfer domain if ``fdb_def_rule_en=1`` which is default).
764  - Only supports TCP or UDP as next layer.
765  - IPv6 routing header must be the only present extension.
766  - Not supported on guest port.
767
768- NAT64 action:
769
770  - Supported only with HW Steering enabled (``dv_flow_en`` = 2).
771  - FW version: at least ``XX.39.1002``.
772  - Supported only on non-root table.
773  - Actions order limitation should follow the modify fields action.
774  - The last 2 TAG registers will be used implicitly in address backup mode.
  - Even if the action can be shared, new steering entries will be created per flow rule.
    It is recommended that a single rule with NAT64 be shared
    to reduce the duplication of entries.
    The default address and other fields conversion will be handled by the NAT64 action.
    To support other addresses, new rule(s) with modify field actions on the IP addresses should be created.
  - TOS / Traffic Class is not supported yet.
781
782- Hairpin:
783
  - Hairpin between two ports supports only manual binding and explicit Tx flow mode.
    For single port hairpin, all the combinations of auto/manual binding
    and explicit/implicit Tx flow mode are supported.
  - Hairpin in switchdev SR-IOV mode is not supported yet.
  - ``out_of_buffer`` statistics are not available on:

    - NICs older than ConnectX-7.
    - DPUs older than BlueField-3.
789
790- Quota:
791
  - Quota is implemented for HWS / template API.
  - Maximal value for quota SET and ADD operations is INT32_MAX (2GB).
  - Application cannot use 2 consecutive ADD updates.
    The next tokens update after ADD must always be SET.
  - Quota flow action cannot be used with Meter or CT flow actions in the same rule.
  - Quota flow action and item are supported in non-root HWS tables.
798  - Maximal number of HW quota and HW meter objects <= 16e6.
799
800- Meter:
801
802  - All the meter colors with drop action will be counted only by the global drop statistics.
803  - Yellow detection is only supported with ASO metering.
804  - Red color must be with drop action.
805  - Meter statistics are supported only for drop case.
  - A meter action created with a pre-defined policy must be the last action in the flow, except for the single case where the policy actions are:
807     - green: NULL or END.
808     - yellow: NULL or END.
809     - RED: DROP / END.
810  - The only supported meter policy actions:
811     - green: QUEUE, RSS, PORT_ID, REPRESENTED_PORT, JUMP, DROP, MODIFY_FIELD, MARK, METER and SET_TAG.
812     - yellow: QUEUE, RSS, PORT_ID, REPRESENTED_PORT, JUMP, DROP, MODIFY_FIELD, MARK, METER and SET_TAG.
813     - RED: must be DROP.
814  - Policy actions of RSS for green and yellow should have the same configuration except queues.
  - Policy with RSS/queue action is not supported when ``dv_xmeta_en`` is enabled.
816  - If green action is METER, yellow action must be the same METER action or NULL.
817  - meter profile packet mode is supported.
818  - meter profiles of RFC2697, RFC2698 and RFC4115 are supported.
819  - RFC4115 implementation is following MEF, meaning yellow traffic may reclaim unused green bandwidth when green token bucket is full.
820  - When using DV flow engine (``dv_flow_en`` = 1),
821    if meter has drop count
822    or meter hierarchy contains any meter that uses drop count,
823    it cannot be used by flow rule matching all ports.
824  - When using DV flow engine (``dv_flow_en`` = 1),
825    if meter hierarchy contains any meter that has MODIFY_FIELD/SET_TAG,
826    it cannot be used by flow matching all ports.
827  - When using HWS flow engine (``dv_flow_en`` = 2),
828    only meter mark action is supported.
829
830- Ptype:
831
832  - Only supports HW steering (``dv_flow_en=2``).
833  - The supported values are:
834    L2: ``RTE_PTYPE_L2_ETHER``, ``RTE_PTYPE_L2_ETHER_VLAN``, ``RTE_PTYPE_L2_ETHER_QINQ``
835    L3: ``RTE_PTYPE_L3_IPV4``, ``RTE_PTYPE_L3_IPV6``
836    L4: ``RTE_PTYPE_L4_TCP``, ``RTE_PTYPE_L4_UDP``, ``RTE_PTYPE_L4_ICMP``
837    and their ``RTE_PTYPE_INNER_XXX`` counterparts as well as ``RTE_PTYPE_TUNNEL_ESP``.
838    Any other values are not supported. Using them as a value will cause unexpected behavior.
839  - Matching on both outer and inner IP fragmented is supported
840    using ``RTE_PTYPE_L4_FRAG`` and ``RTE_PTYPE_INNER_L4_FRAG`` values.
841    They are not part of L4 types, so they should be provided explicitly
842    as a mask value during pattern template creation.
843    Providing ``RTE_PTYPE_L4_MASK`` during pattern template creation
844    and ``RTE_PTYPE_L4_FRAG`` during flow rule creation
845    will cause unexpected behavior.
846
847- Integrity:
848
849  - Verification bits provided by the hardware are ``l3_ok``, ``ipv4_csum_ok``, ``l4_ok``, ``l4_csum_ok``.
850  - ``level`` value 0 references outer headers.
851  - Negative integrity item verification is not supported.
852
853  - With SW steering (``dv_flow_en=1``)
854
855    - Integrity offload is enabled starting from **ConnectX-6 Dx**.
856    - Multiple integrity items not supported in a single flow rule.
857    - Flow rule items supplied by application must explicitly specify
858      network headers referred by integrity item.
859
860      For example, if integrity item mask sets ``l4_ok`` or ``l4_csum_ok`` bits,
861      reference to L4 network header, TCP or UDP, must be in the rule pattern as well::
862
863         flow create 0 ingress pattern integrity level is 0 value mask l3_ok value spec l3_ok / eth / ipv6 / end ...
864         flow create 0 ingress pattern integrity level is 0 value mask l4_ok value spec l4_ok / eth / ipv4 proto is udp / end ...
865
  - With HW steering (``dv_flow_en=2``)

    - The ``l3_ok`` field represents all L3 checks, but nothing about IPv4 checksum.
    - The ``l4_ok`` field represents all L4 checks including L4 checksum.
869
870- Connection tracking:
871
872  - Cannot co-exist with ASO meter, ASO age action in a single flow rule.
873  - Flow rules insertion rate and memory consumption need more optimization.
874  - 16 ports maximum (with ``dv_flow_en=1``).
875  - 32M connections maximum.
876
877- Multi-thread flow insertion:
878
  - In order to achieve the best insertion rate, the application should manage the flows per lcore.
  - It is better to disable memory reclaim by setting ``reclaim_mem_mode`` to 0
    to accelerate the flow object allocation and release with cache.
881
882- HW hashed bonding
883
  - TXQ affinity is subject to HW hash once enabled.
885
886- Bonding under socket direct mode
887
888  - Needs MLNX_OFED 5.4+.
889
890- Match on aggregated affinity:
891
892  - Supports NIC ingress flow in group 0.
893  - Supports E-Switch flow in group 0 and depends on
894    device-managed flow steering (DMFS) mode.
895
896- Timestamps:
897
898  - CQE timestamp field width is limited by hardware to 63 bits, MSB is zero.
899  - In the free-running mode the timestamp counter is reset on power on
900    and 63-bit value provides over 1800 years of uptime till overflow.
901  - In the real-time mode
902    (configurable with ``REAL_TIME_CLOCK_ENABLE`` firmware settings),
903    the timestamp presents the nanoseconds elapsed since 01-Jan-1970,
904    hardware timestamp overflow will happen on 19-Jan-2038
905    (0x80000000 seconds since 01-Jan-1970).
906  - The send scheduling is based on timestamps
907    from the reference "Clock Queue" completions,
908    the scheduled send timestamps should not be specified with non-zero MSB.
909
910- Match on GRE header supports the following fields:
911
912  - c_rsvd0_v: C bit, K bit, S bit
913  - protocol type
914  - checksum
915  - key
916  - sequence
917
918  Matching on checksum and sequence needs MLNX_OFED 5.6+.
919
920- Matching on NVGRE header:
921
922  - c_rc_k_s_rsvd0_ver
923  - protocol
924  - tni
925  - flow_id
926
927  In SW steering (``dv_flow_en`` = 1), only tni is supported.
928  In HW steering (``dv_flow_en`` = 2), all fields are supported.
929
930- The NIC egress flow rules on representor port are not supported.
931
932- In switch mode, flow rule matching ``RTE_FLOW_ACTION_TYPE_REPRESENTED_PORT`` item
933  with port ID ``UINT16_MAX`` means matching packets sent by E-Switch manager from software.
934  Need MLNX_OFED 24.04+.
935
936- A driver limitation for ``RTE_FLOW_ACTION_TYPE_PORT_REPRESENTOR`` action
937  restricts the ``port_id`` configuration to only accept the value ``0xffff``,
938  indicating the E-Switch manager.
939  If the ``repr_matching_en`` flag is enabled, the traffic will be directed
940  to the representor of the source virtual port (SF/VF), while if it is disabled,
941  the traffic will be routed based on the steering rules in the ingress domain.
942
943- Send to kernel action (``RTE_FLOW_ACTION_TYPE_SEND_TO_KERNEL``):
944
945  - Supported on non-root table.
946  - Supported in isolated mode.
  - In HW steering (``dv_flow_en`` = 2):

    - Not supported on guest port.
949
- During live migration, the new process sets its flow engine to standby mode,
  and the user should only program flow rules in group 0 (``fdb_def_rule_en=0``).
  Live migration is only supported under SWS (``dv_flow_en=1``).
  Flow group 0 is shared between DPDK processes,
  while the other flow groups are limited to the current process.
  The flow engine of a process cannot move from active to standby mode
  if preceding active application rules are still present, and vice versa.
957
958
959Statistics
960----------
961
962MLX5 supports various methods to report statistics:
963
Port statistics can be queried using ``rte_eth_stats_get()``. The received and sent statistics are counted in SW only and reflect the number of packets received or sent successfully by the PMD. The imissed counter is the number of packets that could not be delivered to SW because a queue was full. Packets not received due to congestion on the bus or on the NIC can be queried via the ``rx_discards_phy`` xstats counter.
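
For instance, a minimal sketch of reading the basic counters (the function name is illustrative):

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>
   #include <rte_ethdev.h>

   /* Print the SW packet counters and the queue-full drop counter. */
   static void
   print_basic_stats(uint16_t port_id)
   {
           struct rte_eth_stats stats;

           if (rte_eth_stats_get(port_id, &stats) != 0)
                   return;
           printf("rx: %" PRIu64 " tx: %" PRIu64 " imissed: %" PRIu64 "\n",
                  stats.ipackets, stats.opackets, stats.imissed);
   }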
965
Extended statistics can be queried using ``rte_eth_xstats_get()``. The extended statistics expose a wider set of counters counted by the device. The extended port statistics count the number of packets received or sent successfully by the port. As NVIDIA NICs are using the :ref:`Bifurcated Linux Driver <linux_gsg_linux_drivers>`, those counters also count packets received or sent by the Linux kernel. The counters with the ``_phy`` suffix count the total events on the physical port and are therefore not valid for VFs.
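
A sketch of listing all extended counters by name and value (the function name is illustrative):

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <rte_ethdev.h>

   /* Dump every extended statistics counter exposed by the port. */
   static void
   dump_xstats(uint16_t port_id)
   {
           int n = rte_eth_xstats_get_names(port_id, NULL, 0);
           if (n <= 0)
                   return;

           struct rte_eth_xstat_name *names = calloc(n, sizeof(*names));
           struct rte_eth_xstat *values = calloc(n, sizeof(*values));

           if (names == NULL || values == NULL)
                   goto out;
           if (rte_eth_xstats_get_names(port_id, names, n) != n ||
               rte_eth_xstats_get(port_id, values, n) != n)
                   goto out;
           for (int i = 0; i < n; i++)
                   printf("%s: %" PRIu64 "\n",
                          names[values[i].id].name, values[i].value);
   out:
           free(names);
           free(values);
   }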
967
Finally, per-flow statistics can be queried using ``rte_flow_query`` when attaching a count action to a specific flow. The flow counter counts the number of packets received successfully by the port that match the specific flow.
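
A sketch of querying such a counter (the ``flow`` handle is assumed to have been created
with a ``RTE_FLOW_ACTION_TYPE_COUNT`` action; the function name is illustrative):

.. code-block:: c

   #include <rte_flow.h>

   /* Read hit/byte counters of a rule created with a COUNT action. */
   static int
   query_flow_hits(uint16_t port_id, struct rte_flow *flow,
                   uint64_t *hits, uint64_t *bytes)
   {
           const struct rte_flow_action count_action = {
                   .type = RTE_FLOW_ACTION_TYPE_COUNT,
           };
           struct rte_flow_query_count counters = { .reset = 0 };
           struct rte_flow_error error;
           int ret;

           ret = rte_flow_query(port_id, flow, &count_action, &counters, &error);
           if (ret != 0)
                   return ret;
           *hits = counters.hits_set ? counters.hits : 0;
           *bytes = counters.bytes_set ? counters.bytes : 0;
           return 0;
   }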
969
970
971Extended Statistics Counters
972~~~~~~~~~~~~~~~~~~~~~~~~~~~~
973
974Send Scheduling Counters
975^^^^^^^^^^^^^^^^^^^^^^^^
976
977The mlx5 PMD provides a comprehensive set of counters designed for
978debugging and diagnostics related to packet scheduling during transmission.
979These counters are applicable only if the port was configured with the ``tx_pp`` devarg
980and reflect the status of the PMD scheduling infrastructure
based on Clock and Rearm Queues, used as a workaround on ConnectX-6 Dx NICs.
982
983``tx_pp_missed_interrupt_errors``
984  Indicates that the Rearm Queue interrupt was not serviced on time.
985  The EAL manages interrupts in a dedicated thread,
986  and it is possible that other time-consuming actions were being processed concurrently.
987
988``tx_pp_rearm_queue_errors``
989  Signifies hardware errors that occurred on the Rearm Queue,
990  typically caused by delays in servicing interrupts.
991
992``tx_pp_clock_queue_errors``
993  Reflects hardware errors on the Clock Queue,
994  which usually indicate configuration issues
995  or problems with the internal NIC hardware or firmware.
996
997``tx_pp_timestamp_past_errors``
998  Tracks the application attempted to send packets with timestamps set in the past.
999  It is useful for debugging application code
1000  and does not indicate a malfunction of the PMD.
1001
1002``tx_pp_timestamp_future_errors``
1003  Records attempts by the application to send packets
1004  with timestamps set too far into the future,
1005  exceeding the hardware’s scheduling capabilities.
1006  Like the previous counter, it aids in application debugging
1007  without suggesting a PMD malfunction.
1008
1009``tx_pp_jitter``
1010  Measures the internal NIC real-time clock jitter estimation
1011  between two consecutive Clock Queue completions, expressed in nanoseconds.
1012  Significant jitter may signal potential clock synchronization issues,
1013  possibly due to inappropriate adjustments
1014  made by a system PTP (Precision Time Protocol) agent.
1015
1016``tx_pp_wander``
1017  Indicates the long-term stability of the internal NIC real-time clock
1018  over 2^24 completions, measured in nanoseconds.
1019  Significant wander may also suggest clock synchronization problems.
1020
1021``tx_pp_sync_lost``
1022  A general operational indicator;
1023  a non-zero value indicates that the driver has lost synchronization with the Clock Queue,
1024  resulting in improper scheduling operations.
1025  To restore correct scheduling functionality, it is necessary to restart the port.
1026
1027The following counters are particularly valuable for verifying and debugging application code.
1028They do not indicate driver or hardware malfunctions
1029and are applicable to newer hardware with direct on-time scheduling capabilities
1030(such as ConnectX-7 and above):
1031
1032``tx_pp_timestamp_order_errors``
1033  Indicates attempts by the application to send packets
1034  with timestamps that are not in strictly ascending order.
1035  Since the PMD does not reorder packets within hardware queues,
1036  violations of timestamp order can lead to packets being sent at incorrect times.
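
A sketch of polling one of these counters by name, assuming the port exposes it
(i.e. ``tx_pp`` was configured; the function name is illustrative):

.. code-block:: c

   #include <rte_ethdev.h>

   /* Return the current value of a tx_pp_* xstat, or 0 if it is not exposed. */
   static uint64_t
   read_tx_pp_counter(uint16_t port_id, const char *name)
   {
           uint64_t id;
           uint64_t value;

           if (rte_eth_xstats_get_id_by_name(port_id, name, &id) != 0)
                   return 0;
           if (rte_eth_xstats_get_by_id(port_id, &id, &value, 1) != 1)
                   return 0;
           return value;
   }

   /* Example: read_tx_pp_counter(port_id, "tx_pp_timestamp_past_errors"); */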
1037
1038
1039Compilation
1040-----------
1041
1042See :ref:`mlx5 common compilation <mlx5_common_compilation>`.
1043
1044
1045Configuration
1046-------------
1047
1048Environment Configuration
1049~~~~~~~~~~~~~~~~~~~~~~~~~
1050
1051See :ref:`mlx5 common configuration <mlx5_common_env>`.
1052
1053Firmware configuration
1054~~~~~~~~~~~~~~~~~~~~~~
1055
1056See :ref:`mlx5_firmware_config` guide.
1057
1058Runtime Configuration
1059~~~~~~~~~~~~~~~~~~~~~
1060
1061Please refer to :ref:`mlx5 common options <mlx5_common_driver_options>`
1062for an additional list of options shared with other mlx5 drivers.
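
The keys below are passed as device arguments, for example through the EAL allowlist
option. A minimal sketch (the PCI address and chosen keys are illustrative):

.. code-block:: c

   #include <rte_common.h>
   #include <rte_eal.h>

   int
   main(int argc, char **argv)
   {
           /* Equivalent to: dpdk-app -a 0000:03:00.2,rxq_cqe_comp_en=4,mprq_en=1 */
           char *eal_args[] = {
                   argv[0],
                   "-a", "0000:03:00.2,rxq_cqe_comp_en=4,mprq_en=1",
           };

           (void)argc;
           if (rte_eal_init(RTE_DIM(eal_args), eal_args) < 0)
                   return -1;
           /* ... application code ... */
           return rte_eal_cleanup();
   }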
1063
1064- ``rxq_cqe_comp_en`` parameter [int]
1065
  A nonzero value enables the compression of CQE on RX side. This feature
  allows saving PCI bandwidth and improves performance. Enabled by default.
1068  Different compression formats are supported in order to achieve the best
1069  performance for different traffic patterns. Default format depends on
1070  Multi-Packet Rx queue configuration: Hash RSS format is used in case
1071  MPRQ is disabled, Checksum format is used in case MPRQ is enabled.
1072
1073  The lower 3 bits define the CQE compression format:
1074  Specifying 2 in these bits of the ``rxq_cqe_comp_en`` parameter selects
1075  the flow tag format for better compression rate in case of flow mark traffic.
1076  Specifying 3 in these bits selects checksum format.
1077  Specifying 4 in these bits selects L3/L4 header format for
1078  better compression rate in case of mixed TCP/UDP and IPv4/IPv6 traffic.
1079  CQE compression format selection requires DevX to be enabled. If there is
1080  no DevX enabled/supported the value is reset to 1 by default.
1081
  The 8th bit defines the CQE compression layout.
1083  Setting this bit to 1 turns enhanced CQE compression layout on.
1084  Enhanced CQE compression is designed for better latency and SW utilization.
1085  This bit is ignored if only the basic CQE compression layout is supported.
1086
1087  Supported on:
1088
1089  - x86_64 with ConnectX-4, ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
1090    ConnectX-6 Lx, ConnectX-7, BlueField, BlueField-2, and BlueField-3.
1091  - POWER9 and ARMv8 with ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
    ConnectX-6 Lx, ConnectX-7, BlueField, BlueField-2, and BlueField-3.
1093
1094- ``rxq_pkt_pad_en`` parameter [int]
1095
1096  A nonzero value enables padding Rx packet to the size of cacheline on PCI
1097  transaction. This feature would waste PCI bandwidth but could improve
1098  performance by avoiding partial cacheline write which may cause costly
1099  read-modify-copy in memory transaction on some architectures. Disabled by
1100  default.
1101
1102  Supported on:
1103
1104  - x86_64 with ConnectX-4, ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
1105    ConnectX-6 Lx, ConnectX-7, BlueField, BlueField-2, and BlueField-3.
1106  - POWER8 and ARMv8 with ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
1107    ConnectX-6 Lx, ConnectX-7, BlueField, BlueField-2, and BlueField-3.
1108
1109- ``delay_drop`` parameter [int]
1110
1111  Bitmask value for the Rx queue delay drop attribute. Bit 0 is used for the
1112  standard Rx queue and bit 1 is used for the hairpin Rx queue. By default, the
1113  delay drop is disabled for all Rx queues. It will be ignored if the port does
1114  not support the attribute even if it is enabled explicitly.
1115
1116  The packets being received will not be dropped immediately when the WQEs are
1117  exhausted in a Rx queue with delay drop enabled.
1118
1119  A timeout value is set in the driver to control the waiting time before
1120  dropping a packet. Once the timer is expired, the delay drop will be
  deactivated for all the Rx queues with this feature enabled. To re-activate
1122  it, a rearming is needed and it is part of the kernel driver starting from
1123  MLNX_OFED 5.5.
1124
1125  To enable / disable the delay drop rearming, the private flag ``dropless_rq``
1126  can be set and queried via ethtool:
1127
1128  - ethtool --set-priv-flags <netdev> dropless_rq on (/ off)
1129  - ethtool --show-priv-flags <netdev>
1130
1131  The configuration flag is global per PF and can only be set on the PF, once
1132  it is on, all the VFs', SFs' and representors' Rx queues will share the timer
1133  and rearming.
1134
1135- ``mprq_en`` parameter [int]
1136
1137  A nonzero value enables configuring Multi-Packet Rx queues. Rx queue is
1138  configured as Multi-Packet RQ if the total number of Rx queues is
1139  ``rxqs_min_mprq`` or more. Disabled by default.
1140
1141  Multi-Packet Rx Queue (MPRQ a.k.a Striding RQ) can further save PCIe bandwidth
1142  by posting a single large buffer for multiple packets. Instead of posting a
  buffer per packet, one large buffer is posted in order to receive multiple
1144  packets on the buffer. A MPRQ buffer consists of multiple fixed-size strides
1145  and each stride receives one packet. MPRQ can improve throughput for
1146  small-packet traffic.
1147
1148  When MPRQ is enabled, MTU can be larger than the size of
  user-provided mbuf even if RTE_ETH_RX_OFFLOAD_SCATTER isn't enabled. The PMD will
  configure a stride size large enough to accommodate the MTU as long as the
  device allows. Note that this can waste system memory compared to enabling Rx
1152  scatter and multi-segment packet.
1153
1154- ``mprq_log_stride_num`` parameter [int]
1155
1156  Log 2 of the number of strides for Multi-Packet Rx queue. Configuring more
1157  strides can reduce PCIe traffic further. If configured value is not in the
1158  range of device capability, the default value will be set with a warning
  message. The default value is 4, which is 16 strides per buffer, valid only
1160  if ``mprq_en`` is set.
1161
1162  The size of Rx queue should be bigger than the number of strides.
1163
1164- ``mprq_log_stride_size`` parameter [int]
1165
1166  Log 2 of the size of a stride for Multi-Packet Rx queue. Configuring a smaller
1167  stride size can save some memory and reduce probability of a depletion of all
1168  available strides due to unreleased packets by an application. If configured
1169  value is not in the range of device capability, the default value will be set
  with a warning message. The default value is 11, which is 2048 bytes per
1171  stride, valid only if ``mprq_en`` is set. With ``mprq_log_stride_size`` set
1172  it is possible for a packet to span across multiple strides. This mode allows
1173  support of jumbo frames (9K) with MPRQ. The memcopy of some packets (or part
1174  of a packet if Rx scatter is configured) may be required in case there is no
1175  space left for a head room at the end of a stride which incurs some
1176  performance penalty.
1177
1178- ``mprq_max_memcpy_len`` parameter [int]
1179
  The maximum length of packet to memcpy in case of Multi-Packet Rx queue. An Rx
  packet is mem-copied to a user-provided mbuf if the size of the Rx packet is
  less than or equal to this parameter. Otherwise, the PMD will attach the Rx
  packet to the mbuf by external buffer attachment - ``rte_pktmbuf_attach_extbuf()``.
  A mempool for external buffers will be allocated and managed by the PMD. If an
  Rx packet is externally attached, the ``ol_flags`` field of the mbuf will have
  ``RTE_MBUF_F_EXTERNAL`` set and this flag must be preserved.
  ``RTE_MBUF_HAS_EXTBUF()`` checks the flag. The default value is 128,
  valid only if ``mprq_en`` is set.
1188
1189- ``rxqs_min_mprq`` parameter [int]
1190
  Configure Rx queues as Multi-Packet RQ if the total number of Rx queues is
  greater than or equal to this value. The default value is 12, valid only if
  ``mprq_en`` is set.
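
  As an illustration only (the PCI address and queue counts are placeholders),
  MPRQ could be enabled on all Rx queues with explicit stride settings
  by passing devargs such as::

    dpdk-testpmd -a <PCI_BDF>,mprq_en=1,rxqs_min_mprq=1,mprq_log_stride_num=6,mprq_log_stride_size=11 -- -i --rxq=4 --txq=4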
1194
1195- ``txq_inline`` parameter [int]
1196
1197  Amount of data to be inlined during TX operations. This parameter is
1198  deprecated and converted to the new parameter ``txq_inline_max`` providing
1199  partial compatibility.
1200
1201- ``txqs_min_inline`` parameter [int]
1202
  Enable inline data send only when the number of Tx queues is greater than or
  equal to this value.
1205
1206  This option should be used in combination with ``txq_inline_max`` and
1207  ``txq_inline_mpw`` below and does not affect ``txq_inline_min`` settings above.
1208
  If this option is not specified, the default value 16 is used for BlueField
  and 8 for other platforms.
1211
  Data inlining consumes CPU cycles, so this option is intended to enable
  inline data automatically when there are enough Tx queues, which implies
  there are enough CPU cores, PCI bandwidth is becoming the critical resource,
  and the CPU is no longer expected to be the bottleneck.

  Copying data into the WQE improves latency and can improve PPS performance
  when PCI back pressure is detected; it may be useful for scenarios involving
  heavy traffic on many queues.

  Because additional software logic is necessary to handle this mode, this
  option should be used with care, as it may lower performance when back
  pressure is not expected.

  If inline data is enabled, it may affect the maximal Tx queue size in
  descriptors, because the inline data increases the descriptor size and
  the queue size limits supported by the hardware may be exceeded.
1228
1229- ``txq_inline_min`` parameter [int]
1230
  Minimal amount of data to be inlined into the WQE during Tx operations. NICs
  may require this minimal data amount to operate correctly. The exact value
  may depend on the NIC operation mode, requested offloads, etc. It is strongly
  recommended to omit this parameter and use the default values. In any case,
  applications using this parameter should take into consideration that
  specifying an inconsistent value may prevent the NIC from sending packets.
1237
  If the ``txq_inline_min`` key is present, the specified value (possibly
  aligned by the driver in order not to exceed the limits and to provide better
  descriptor space utilization) will be used by the driver, and it is guaranteed
  that the requested amount of data bytes is inlined into the WQE besides other
  inline settings. This key may also update the ``txq_inline_max`` value
  (default or explicitly specified in devargs) to reserve space for the
  inline data.

  If the ``txq_inline_min`` key is not present, the value may be queried by the
  driver from the NIC via DevX if this feature is available. If DevX is not
  enabled/supported, the value 18 (assuming an L2 header including VLAN) is set
  for ConnectX-4 and ConnectX-4 Lx, and 0 is set by default for ConnectX-5
  and newer NICs. If a packet is shorter than the ``txq_inline_min`` value,
  the entire packet is inlined.
1251
  For the ConnectX-4 NIC, the driver does not allow specifying a value below 18
  (minimal L2 header, including VLAN); an error will be raised.

  For the ConnectX-4 Lx NIC, it is allowed to specify values below 18, but
  it is not recommended and may prevent the NIC from sending packets in
  some configurations.

  For ConnectX-4 and ConnectX-4 Lx NICs, the automatically configured value
  is insufficient for some traffic, because they require at least all L2 headers
  to be inlined. For example, Q-in-Q adds 4 bytes to the default 18 bytes
  of Ethernet and VLAN, thus ``txq_inline_min`` must be set to 22.
  MPLS would add 4 bytes per label. The final value must account for all
  possible L2 encapsulation headers used in the particular environment.

  Please note that this minimal data inlining disengages the eMPW feature
  (Enhanced Multi-Packet Write), because the latter does not support partial
  packet inlining. This is not very critical, since minimal data inlining is
  mostly required by ConnectX-4 and ConnectX-4 Lx, and these NICs do not
  support the eMPW feature.
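
  For example (an illustrative sketch, the PCI address is a placeholder),
  to cover Q-in-Q traffic on ConnectX-4 Lx the minimal inline length
  could be set explicitly via devargs::

    dpdk-testpmd -a <PCI_BDF>,txq_inline_min=22 -- -i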
1270
1271- ``txq_inline_max`` parameter [int]
1272
  Specifies the maximal packet length to be completely inlined into the WQE
  Ethernet Segment for the ordinary SEND method. If a packet is larger than the
  specified value, the packet data won't be copied by the driver at all and the
  data buffer is addressed with a pointer. If the packet length is less than or
  equal to this value, all packet data will be copied into the WQE. This may
  significantly improve PCI bandwidth utilization for short packets but
  requires extra CPU cycles.

  The data inline feature is controlled by the number of Tx queues: if the
  number of Tx queues is larger than the ``txqs_min_inline`` key parameter,
  the inline feature is engaged; if there are not enough Tx queues (which means
  not enough CPU cores and CPU resources are scarce), data inline is not
  performed by the driver. Assigning zero to ``txqs_min_inline`` always
  enables the data inline.

  The default ``txq_inline_max`` value is 290. The specified value may be
  adjusted by the driver in order not to exceed the limit (930 bytes) and to
  provide better WQE space filling without gaps; the adjustment is reflected in
  the debug log. Also, the default value (290) may be decreased at runtime if a
  large transmit queue size is requested and the hardware does not support a
  sufficient descriptor amount; in this case a warning is emitted. If the
  ``txq_inline_max`` key is specified and the requested inline settings cannot
  be satisfied, an error will be raised.
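
  For instance (illustrative values, the PCI address is a placeholder),
  to force data inlining regardless of the number of Tx queues and to inline
  packets up to 128 bytes for the ordinary SEND method::

    dpdk-testpmd -a <PCI_BDF>,txqs_min_inline=0,txq_inline_max=128 -- -i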
1294
1295- ``txq_inline_mpw`` parameter [int]
1296
  Specifies the maximal packet length to be completely inlined into the WQE for
  the Enhanced MPW method. If a packet is larger than the specified value, the
  packet data won't be copied, and the data buffer is addressed with a pointer.
  If the packet length is less than or equal to this value, all packet data will
  be copied into the WQE. This may significantly improve PCI bandwidth
  utilization for short packets but requires extra CPU cycles.

  The data inline feature is controlled by the number of Tx queues: if the
  number of Tx queues is larger than the ``txqs_min_inline`` key parameter,
  the inline feature is engaged; if there are not enough Tx queues (which means
  not enough CPU cores and CPU resources are scarce), data inline is not
  performed by the driver. Assigning zero to ``txqs_min_inline`` always
  enables the data inline.

  The default ``txq_inline_mpw`` value is 268. The specified value may be
  adjusted by the driver in order not to exceed the limit (930 bytes) and to
  provide better WQE space filling without gaps; the adjustment is reflected in
  the debug log. Since multiple packets may be included in the same WQE with the
  Enhanced Multi-Packet Write method and the overall WQE size is limited, it is
  not recommended to specify large values for ``txq_inline_mpw``. Also, the
  default value (268) may be decreased at runtime if a large transmit queue size
  is requested and the hardware does not support a sufficient descriptor amount;
  in this case a warning is emitted. If the ``txq_inline_mpw`` key is specified
  and the requested inline settings cannot be satisfied, an error will be raised.
1320
1321- ``txqs_max_vec`` parameter [int]
1322
  Enable vectorized Tx only when the number of Tx queues is less than or
  equal to this value. This parameter is deprecated and ignored, kept
  for compatibility so as not to prevent the driver from probing.
1326
1327- ``txq_mpw_hdr_dseg_en`` parameter [int]
1328
  A nonzero value enables including two pointers in the first block of the Tx
  descriptor. This parameter is deprecated and ignored, kept for compatibility.
1332
1333- ``txq_max_inline_len`` parameter [int]
1334
  Maximum size of packet to be inlined. This limits the size of a packet to
  be inlined. If the size of a packet is larger than the configured value, the
  packet isn't inlined even though there's enough space remaining in the
  descriptor. Instead, the packet is included by pointer. This parameter
  is deprecated and converted directly to ``txq_inline_mpw``, providing full
  compatibility. Valid only if the eMPW feature is engaged.
1341
1342- ``txq_mpw_en`` parameter [int]
1343
  A nonzero value enables Enhanced Multi-Packet Write (eMPW) for ConnectX-5,
  ConnectX-6, ConnectX-6 Dx, ConnectX-6 Lx, ConnectX-7, BlueField, BlueField-2
  and BlueField-3. eMPW allows the Tx burst function to pack up multiple packets
  in a single descriptor session in order to save PCI bandwidth
  and improve performance at the cost of a slightly higher CPU usage.
  When ``txq_inline_mpw`` is set along with ``txq_mpw_en``,
  the Tx burst function copies the entire packet data onto the Tx descriptor
  instead of including a pointer to the packet.

  The Enhanced Multi-Packet Write feature is enabled by default if the NIC
  supports it, and can be disabled by explicitly specifying 0 for the
  ``txq_mpw_en`` option. Also, if minimal data inlining is requested by a
  non-zero ``txq_inline_min`` option or reported by the NIC, the eMPW feature
  is disengaged.
1357
1358- ``tx_db_nc`` parameter [int]
1359
1360  This parameter name is deprecated and ignored.
1361  The new name for this parameter is ``sq_db_nc``.
1362  See :ref:`common driver options <mlx5_common_driver_options>`.
1363
1364- ``tx_pp`` parameter [int]
1365
  If a nonzero value is specified, the driver creates all necessary internal
  objects to provide accurate packet send scheduling on mbuf timestamps.
  A positive value specifies the scheduling granularity in nanoseconds;
  packet sending will be accurate up to the specified granularity. The allowed
  range is from 500 to 1 million nanoseconds. A negative value specifies the
  granularity modulus and engages a special test mode to check the scheduling
  rate. By default (if ``tx_pp`` is not specified) the send scheduling on
  timestamps feature is disabled.

  Starting with ConnectX-7, the capability to schedule traffic directly
  on the timestamp specified in the descriptor is provided,
  no extra objects are needed anymore, and the scheduling capability
  is advertised and handled regardless of the ``tx_pp`` parameter presence.
1379
1380- ``tx_skew`` parameter [int]
1381
1382  The parameter adjusts the send packet scheduling on timestamps and represents
1383  the average delay between beginning of the transmitting descriptor processing
1384  by the hardware and appearance of actual packet data on the wire. The value
1385  should be provided in nanoseconds and is valid only if ``tx_pp`` parameter is
1386  specified. The default value is zero.
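
  For example (illustrative values, the PCI address is a placeholder),
  send scheduling could be enabled with a 500 ns granularity while
  compensating for a 20 ns average descriptor-to-wire delay::

    dpdk-testpmd -a <PCI_BDF>,tx_pp=500,tx_skew=20 -- -i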
1387
1388- ``tx_vec_en`` parameter [int]
1389
1390  A nonzero value enables Tx vector on ConnectX-5, ConnectX-6, ConnectX-6 Dx,
1391  ConnectX-6 Lx, ConnectX-7, BlueField, BlueField-2, and BlueField-3 NICs
1392  if the number of global Tx queues on the port is less than ``txqs_max_vec``.
1393  The parameter is deprecated and ignored.
1394
1395- ``rx_vec_en`` parameter [int]
1396
  A nonzero value enables the vectorized Rx burst functions if the port is not
  configured in multi-segment mode; otherwise this parameter is ignored.
1399
1400  Enabled by default.
1401
1402- ``vf_nl_en`` parameter [int]
1403
  A nonzero value enables Netlink requests from the VF to add/remove MAC
  addresses and/or enable/disable promiscuous/all-multicast mode on the
  netdevice. Otherwise the relevant configuration must be done with Linux
  iproute2 tools. This is a prerequisite to receive this kind of traffic.

  Enabled by default, valid only on VF devices, ignored otherwise.
1410
1411- ``l3_vxlan_en`` parameter [int]
1412
  A nonzero value allows L3 VXLAN and VXLAN-GPE flow creation. To enable
  L3 VXLAN or VXLAN-GPE, users have to configure the firmware and enable this
  parameter. This is a prerequisite to receive this kind of traffic.
1416
1417  Disabled by default.
1418
1419- ``dv_xmeta_en`` parameter [int]
1420
  A nonzero value enables extensive flow metadata support if the device is
  capable and the driver supports it. This can enable extensive support of
  the ``MARK`` and ``META`` items of ``rte_flow``. The newly introduced
  ``SET_TAG`` and ``SET_META`` actions do not depend on ``dv_xmeta_en``.

  There are several possible configurations, depending on the parameter value:
1427
  - 0, this is the default value, defines the legacy mode: the ``MARK`` and
    ``META`` related actions and items operate only within the NIC Tx and
    NIC Rx steering domains, no ``MARK`` and ``META`` information crosses
    the domain boundaries. The ``MARK`` item is 24 bits wide, the ``META``
    item is 32 bits wide and matching is supported on egress only
    when ``dv_flow_en`` = 1.
1434
  - 1, this engages extensive metadata mode: the ``MARK`` and ``META``
    related actions and items operate within all supported steering domains,
    including FDB, so ``MARK`` and ``META`` information may cross the domain
    boundaries. The ``MARK`` item is 24 bits wide, the ``META`` item width
    depends on kernel and firmware configurations and might be 0, 16 or
    32 bits. Within the NIC Tx domain the ``META`` data width is 32 bits for
    compatibility, the actual width of data transferred to the FDB domain
    depends on kernel configuration and may vary. The actual supported
    width can be retrieved at runtime by a series of ``rte_flow_validate()``
    trials.

  - 2, this engages extensive metadata mode: the ``MARK`` and ``META``
    related actions and items operate within all supported steering domains,
    including FDB, so ``MARK`` and ``META`` information may cross the domain
    boundaries. The ``META`` item is 32 bits wide, the ``MARK`` item width
    depends on kernel and firmware configurations and might be 0, 16 or
    24 bits. The actual supported width can be retrieved at runtime by a
    series of ``rte_flow_validate()`` trials.
1453
1454  - 3, this engages tunnel offload mode. In E-Switch configuration, that
1455    mode implicitly activates ``dv_xmeta_en=1``.
1456
  - 4, this mode is only supported in HWS (``dv_flow_en=2``).
    Copying Rx/Tx metadata with 32-bit width between the FDB and NIC domains
    is supported. The mark is only supported in NIC and no copy is supported.
1460
1461  +------+-----------+-----------+-------------+-------------+
1462  | Mode | ``MARK``  | ``META``  | ``META`` Tx | FDB/Through |
1463  +======+===========+===========+=============+=============+
1464  | 0    | 24 bits   | 32 bits   | 32 bits     | no          |
1465  +------+-----------+-----------+-------------+-------------+
1466  | 1    | 24 bits   | vary 0-32 | 32 bits     | yes         |
1467  +------+-----------+-----------+-------------+-------------+
1468  | 2    | vary 0-24 | 32 bits   | 32 bits     | yes         |
1469  +------+-----------+-----------+-------------+-------------+
1470
1471  If there is no E-Switch configuration the ``dv_xmeta_en`` parameter is
1472  ignored and the device is configured to operate in legacy mode (0).
1473
1474  Disabled by default (set to 0).
1475
  Direct Verbs/Rules (engaged with ``dv_flow_en`` = 1) supports all
  of the extensive metadata features. Legacy Verbs supports the FLAG and
  MARK metadata actions over the NIC Rx steering domain only.

  Setting the META value to zero in a flow action means there is no item
  provided and the receiving datapath will not report in the mbufs that
  metadata is present. Setting the MARK value to zero in a flow action means
  the zero FDIR ID value will be reported on packet reception.
1484
1485  For the MARK action the last 16 values in the full range are reserved for
1486  internal PMD purposes (to emulate FLAG action). The valid range for the
1487  MARK action values is 0-0xFFEF for the 16-bit mode and 0-0xFFFFEF
1488  for the 24-bit mode, the flows with the MARK action value outside
1489  the specified range will be rejected.
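
  As an illustrative example (the PCI address is a placeholder), extensive
  metadata mode 1 could be requested together with DV flow steering, and a rule
  marking matched packets could then be created from the testpmd CLI::

    dpdk-testpmd -a <PCI_BDF>,dv_flow_en=1,dv_xmeta_en=1 -- -i
    testpmd> flow create 0 ingress pattern eth / end actions mark id 42 / queue index 0 / end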
1490
1491- ``dv_flow_en`` parameter [int]
1492
1493  Value 0 means legacy Verbs flow offloading.
1494
1495  Value 1 enables the DV flow steering assuming it is supported by the
1496  driver (requires rdma-core 24 or higher).
1497
1498  Value 2 enables the WQE based hardware steering.
1499  In this mode, only queue-based flow management is supported.
1500
1501  It is configured by default to 1 (DV flow steering) if supported.
1502  Otherwise, the value is 0 which indicates legacy Verbs flow offloading.
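
  For example (the PCI address is a placeholder), WQE based hardware steering
  could be selected explicitly with::

    dpdk-testpmd -a <PCI_BDF>,dv_flow_en=2 -- -i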
1503
1504- ``dv_esw_en`` parameter [int]
1505
1506  A nonzero value enables E-Switch using Direct Rules.
1507
1508  Enabled by default if supported.
1509
1510- ``fdb_def_rule_en`` parameter [int]
1511
  A non-zero value enables creating a dedicated rule on the E-Switch root table.
  This dedicated rule forwards all incoming packets into table 1.
  Other rules will be created at the original E-Switch table level plus one,
  which improves the flow insertion rate by skipping the root table managed by
  firmware. If set to 0, all rules will be created on the original E-Switch
  table level.
1517
1518  By default, the PMD will set this value to 1.
1519
1520- ``lacp_by_user`` parameter [int]
1521
  A nonzero value enables the control of LACP traffic by the user application.
  When a bond exists in the driver, by default it should be managed by the
  kernel and therefore LACP traffic should be steered to the kernel.
  If this devarg is set to 1, the user application is allowed to manage the
  bond itself and LACP traffic is not steered to the kernel.
1527
1528  Disabled by default (set to 0).
1529
1530- ``representor`` parameter [list]
1531
1532  This parameter can be used to instantiate DPDK Ethernet devices from
1533  existing port (PF, VF or SF) representors configured on the device.
1534
1535  It is a standard parameter whose format is described in
1536  :ref:`ethernet_device_standard_device_arguments`.
1537
1538  For instance, to probe VF port representors 0 through 2::
1539
1540    <PCI_BDF>,representor=vf[0-2]
1541
1542  To probe SF port representors 0 through 2::
1543
1544    <PCI_BDF>,representor=sf[0-2]
1545
1546  To probe VF port representors 0 through 2 on both PFs of bonding device::
1547
1548    <Primary_PCI_BDF>,representor=pf[0,1]vf[0-2]
1549
1550- ``repr_matching_en`` parameter [int]
1551
  - 0. If representor matching is disabled, then there will be no implicit
    item added. As a result, ingress flow rules will match traffic
    coming to any port, not only the port on which the flow rule is created.
    Because of that, default flow rules for ingress traffic cannot be created
    and the port starts in isolated mode by default. The port cannot be
    switched back to non-isolated mode.

  - 1. If representor matching is enabled (default setting),
    then each ingress pattern template has an implicit REPRESENTED_PORT
    item added. Flow rules based on this pattern template will match
    the vport associated with the port on which the rule is created.
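
  For example (illustrative only, the PCI address and representor list are
  placeholders), representor matching could be disabled together with
  HW steering; the ports will then start in isolated mode::

    dpdk-testpmd -a <PCI_BDF>,dv_flow_en=2,representor=vf[0-1],repr_matching_en=0 -- -i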
1563
1564- ``max_dump_files_num`` parameter [int]
1565
  The maximum number of files per PMD entity that may be created for debug
  information. The files will be created in the /var/log directory or in the
  current directory.

  Set to 128 by default.
1570
1571- ``lro_timeout_usec`` parameter [int]
1572
  The maximum allowed duration of an LRO session, in microseconds.
  The PMD will set the nearest value supported by the HW, which is not bigger
  than the input ``lro_timeout_usec`` value.
  If this parameter is not specified, by default the PMD will set
  the smallest value supported by the HW.
1578
1579- ``hp_buf_log_sz`` parameter [int]
1580
  The total data buffer size of a hairpin queue (logarithmic form), in bytes.
  The PMD will set the data buffer size to 2 ** ``hp_buf_log_sz``, both for Rx
  and Tx. The allowed range of the value is specified by the firmware and
  initialization will fail if the value is out of that range.
  The range of the value is currently from 11 to 19, and the supported frame
  size of a single packet for hairpin is from 512 B to 128 KB. It might change
  if a different firmware release is used. Using a small value reduces memory
  consumption but does not work with large frames. If the value is too large,
  the memory consumption will be high and some potential performance
  degradation will be introduced.
  By default, the PMD will set this value to 16, which means that 9 KB jumbo
  frames will be supported.
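
  For example (illustrative values, the PCI address is a placeholder),
  a 128 KB hairpin data buffer per queue could be requested while creating
  hairpin queues in testpmd::

    dpdk-testpmd -a <PCI_BDF>,hp_buf_log_sz=17 -- -i --rxq=2 --txq=2 --hairpinq=1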
1593
1594- ``reclaim_mem_mode`` parameter [int]
1595
  Caching some resources on flow destroy helps make flow recreation more
  efficient, while some systems may require that all the resources be
  reclaimed after the flow is destroyed.
  The parameter ``reclaim_mem_mode`` provides the option for the user to
  configure whether the resource cache is needed or not.

  There are three options to choose from:

  - 0. The flow resources will be cached as usual. The caching is helpful
    for the flow insertion rate.

  - 1. Only the DPDK PMD level resource reclaim is enabled.

  - 2. Both the DPDK PMD level and the rdma-core low level are configured
    in reclaim mode.
1611
1612  By default, the PMD will set this value to 0.
1613
1614- ``decap_en`` parameter [int]
1615
1616  Some devices do not support FCS (frame checksum) scattering for
1617  tunnel-decapsulated packets.
1618  If set to 0, this option forces the FCS feature and rejects tunnel
1619  decapsulation in the flow engine for such devices.
1620
1621  By default, the PMD will set this value to 1.
1622
1623- ``allow_duplicate_pattern`` parameter [int]
1624
  There are two options to choose from:

  - 0. Prevent insertion of rules with the same pattern items on a non-root
    table. In this case, only the first rule is inserted and the following
    rules are rejected with error code EEXIST.

  - 1. Allow insertion of rules with the same pattern items.
    In this case, all rules are inserted but only the first rule takes effect;
    the next rule takes effect only when the previous rules are deleted.
1634
1635  By default, the PMD will set this value to 1.
1636
1637
1638Multiport E-Switch
1639------------------
1640
In standard deployments of NVIDIA ConnectX and BlueField HCAs, where the embedded switch is enabled,
each physical port is associated with a single switching domain.
Only PFs, VFs and SFs related to that physical port are connected to this domain
and offloaded flow rules are allowed to steer traffic only between the entities in the given domain.
1645
The following diagram gives a high-level overview of this architecture::
1647
1648       .---. .------. .------. .---. .------. .------.
1649       |PF0| |PF0VFi| |PF0SFi| |PF1| |PF1VFi| |PF1SFi|
1650       .-+-. .--+---. .--+---. .-+-. .--+---. .--+---.
1651         |      |        |       |      |        |
1652     .---|------|--------|-------|------|--------|---------.
1653     |   |      |        |       |      |        |      HCA|
1654     | .-+------+--------+---. .-+------+--------+---.     |
1655     | |                     | |                     |     |
1656     | |      E-Switch       | |     E-Switch        |     |
1657     | |         PF0         | |        PF1          |     |
1658     | |                     | |                     |     |
1659     | .---------+-----------. .--------+------------.     |
1660     |           |                      |                  |
1661     .--------+--+---+---------------+--+---+--------------.
1662              |      |               |      |
1663              | PHY0 |               | PHY1 |
1664              |      |               |      |
1665              .------.               .------.
1666
1667Multiport E-Switch is a deployment scenario where:
1668
1669- All physical ports, PFs, VFs and SFs share the same switching domain.
1670- Each physical port gets a separate representor port.
1671- Traffic can be matched or forwarded explicitly between any of the entities
1672  connected to the domain.
1673
The following diagram gives a high-level overview of this architecture::
1675
1676       .---. .------. .------. .---. .------. .------.
1677       |PF0| |PF0VFi| |PF0SFi| |PF1| |PF1VFi| |PF1SFi|
1678       .-+-. .--+---. .--+---. .-+-. .--+---. .--+---.
1679         |      |        |       |      |        |
1680     .---|------|--------|-------|------|--------|---------.
1681     |   |      |        |       |      |        |      HCA|
1682     | .-+------+--------+-------+------+--------+---.     |
1683     | |                                             |     |
1684     | |                   Shared                    |     |
1685     | |                  E-Switch                   |     |
1686     | |                                             |     |
1687     | .---------+----------------------+------------.     |
1688     |           |                      |                  |
1689     .--------+--+---+---------------+--+---+--------------.
1690              |      |               |      |
1691              | PHY0 |               | PHY1 |
1692              |      |               |      |
1693              .------.               .------.
1694
1695In this deployment a single application can control the switching and forwarding behavior for all
1696entities on the HCA.
1697
With this configuration, the mlx5 PMD supports:

- matching traffic coming from a physical port, PF, VF or SF using REPRESENTED_PORT items;
- matching traffic coming from the E-Switch manager
  using a REPRESENTED_PORT item with port ID ``UINT16_MAX``;
- forwarding traffic to a physical port, PF, VF or SF using REPRESENTED_PORT actions.
1704
1705Requirements
1706~~~~~~~~~~~~
1707
1708Supported HCAs:
1709
1710- ConnectX family: ConnectX-6 Dx and above.
1711- BlueField family: BlueField-2 and above.
1712- FW version: at least ``XX.37.1014``.
1713
1714Supported mlx5 kernel modules versions:
1715
1716- Upstream Linux - from version 6.3.
1717- Modules packaged in MLNX_OFED - from version v23.04-0.5.3.3.
1718
1719Configuration
1720~~~~~~~~~~~~~
1721
1722#. Apply required FW configuration::
1723
1724      sudo mlxconfig -d /dev/mst/mt4125_pciconf0 set LAG_RESOURCE_ALLOCATION=1
1725
1726#. Reset FW or cold reboot the host.
1727
1728#. Switch E-Switch mode on all of the PFs to ``switchdev`` mode::
1729
1730      sudo devlink dev eswitch set pci/0000:08:00.0 mode switchdev
1731      sudo devlink dev eswitch set pci/0000:08:00.1 mode switchdev
1732
1733#. Enable Multiport E-Switch on all of the PFs::
1734
1735      sudo devlink dev param set pci/0000:08:00.0 name esw_multiport value true cmode runtime
1736      sudo devlink dev param set pci/0000:08:00.1 name esw_multiport value true cmode runtime
1737
1738#. Configure required number of VFs/SFs::
1739
1740      echo 4 | sudo tee /sys/class/net/eth2/device/sriov_numvfs
1741      echo 4 | sudo tee /sys/class/net/eth3/device/sriov_numvfs
1742
1743#. Start testpmd and verify that all ports are visible::
1744
1745      $ sudo dpdk-testpmd -a 08:00.0,dv_flow_en=2,representor=pf0-1vf0-3 -- -i
1746      testpmd> show port summary all
1747      Number of available ports: 10
1748      Port MAC Address       Name         Driver         Status   Link
1749      0    E8:EB:D5:18:22:BC 08:00.0_p0   mlx5_pci       up       200 Gbps
1750      1    E8:EB:D5:18:22:BD 08:00.0_p1   mlx5_pci       up       200 Gbps
1751      2    D2:F6:43:0B:9E:19 08:00.0_representor_c0pf0vf0 mlx5_pci       up       200 Gbps
1752      3    E6:42:27:B7:68:BD 08:00.0_representor_c0pf0vf1 mlx5_pci       up       200 Gbps
1753      4    A6:5B:7F:8B:B8:47 08:00.0_representor_c0pf0vf2 mlx5_pci       up       200 Gbps
1754      5    12:93:50:45:89:02 08:00.0_representor_c0pf0vf3 mlx5_pci       up       200 Gbps
1755      6    06:D3:B2:79:FE:AC 08:00.0_representor_c0pf1vf0 mlx5_pci       up       200 Gbps
1756      7    12:FC:08:E4:C2:CA 08:00.0_representor_c0pf1vf1 mlx5_pci       up       200 Gbps
1757      8    8E:A9:9A:D0:35:4C 08:00.0_representor_c0pf1vf2 mlx5_pci       up       200 Gbps
1758      9    E6:35:83:1F:B0:A9 08:00.0_representor_c0pf1vf3 mlx5_pci       up       200 Gbps
1759
1760Limitations
1761~~~~~~~~~~~
1762
1763- Multiport E-Switch is not supported on Windows.
1764- Multiport E-Switch is supported only with HW Steering flow engine (``dv_flow_en=2``).
1765- Matching traffic coming from a physical port and forwarding it to a physical port
1766  (either the same or other one) is not supported.
1767
  In order to achieve such functionality, an application has to set up hairpin queues
  between physical port representors and forward the traffic using hairpin queues.
1770
1771
1772Sub-Function
1773------------
1774
1775See :ref:`mlx5_sub_function`.
1776
1777Sub-Function representor support
1778~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1779
An SF netdev supports E-Switch representation offload
similarly to PF and VF representors.
Use <sfnum> to probe the SF representor::
1783
1784   testpmd> port attach <PCI_BDF>,representor=sf<sfnum>,dv_flow_en=1
1785
1786
1787Performance tuning
1788------------------
1789
1790#. Configure aggressive CQE Zipping for maximum performance::
1791
1792        mlxconfig -d <mst device> s CQE_COMPRESSION=1
1793
1794   To set it back to the default CQE Zipping mode use::
1795
1796        mlxconfig -d <mst device> s CQE_COMPRESSION=0
1797
1798#. In case of virtualization:
1799
1800   - Make sure that hypervisor kernel is 3.16 or newer.
1801   - Configure boot with ``iommu=pt``.
1802   - Use 1G huge pages.
1803   - Make sure to allocate a VM on huge pages.
1804   - Make sure to set CPU pinning.
1805
#. For better performance, use CPUs from the local NUMA node to which
   the PCIe adapter is connected. For VMs, verify that the right CPU
   and NUMA node are pinned according to the above. Run::
1809
1810        lstopo-no-graphics --merge
1811
1812   to identify the NUMA node to which the PCIe adapter is connected.
1813
#. If more than one adapter is used, and root complex capabilities allow
   putting both adapters on the same NUMA node without PCI bandwidth
   degradation, it is recommended to locate both adapters on the same NUMA
   node, in order to forward packets from one to the other without a NUMA
   performance penalty.
1819
1820#. Disable pause frames::
1821
1822        ethtool -A <netdev> rx off tx off
1823
#. Verify IO non-posted prefetch is disabled by default. This can be checked
   via the BIOS configuration. Please contact your server provider for more
   information about the settings.
1827
1828   .. note::
1829
        On some machines, depending on the machine integrator, it is beneficial
        to set the PCI max read request parameter to 1K. This can be
        done in the following way:
1833
1834        To query the read request size use::
1835
1836                setpci -s <NIC PCI address> 68.w
1837
1838        If the output is different than 3XXX, set it by::
1839
1840                setpci -s <NIC PCI address> 68.w=3XXX
1841
1842        The XXX can be different on different systems. Make sure to configure
1843        according to the setpci output.
1844
#. To minimize the overhead of searching Memory Regions:

   - ``--socket-mem`` is recommended to pin memory by a predictable amount.
   - Configure a per-lcore cache when creating mempools for packet buffers.
   - Refrain from dynamically allocating/freeing memory at run-time.
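
   For example (an illustrative sketch; the core list, memory sizes and PCI
   address are placeholders), memory could be pinned on the local NUMA node
   and a per-lcore mempool cache configured via testpmd options::

        dpdk-testpmd -l 8-15 -n 4 --socket-mem=2048,0 -a <PCI_BDF> -- --mbcache=512 -i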
1850
1851Rx burst functions
1852------------------
1853
1854There are multiple Rx burst functions with different advantages and limitations.
1855
1856.. table:: Rx burst functions
1857
1858   +-------------------+------------------------+---------+-----------------+------+-------+
1859   || Function Name    || Enabler               || Scatter|| Error Recovery || CQE || Large|
1860   |                   |                        |         |                 || comp|| MTU  |
1861   +===================+========================+=========+=================+======+=======+
1862   | rx_burst          | rx_vec_en=0            |   Yes   | Yes             |  Yes |  Yes  |
1863   +-------------------+------------------------+---------+-----------------+------+-------+
1864   | rx_burst_vec      | rx_vec_en=1 (default)  |   No    | if CQE comp off |  Yes |  No   |
1865   +-------------------+------------------------+---------+-----------------+------+-------+
1866   | rx_burst_mprq     || mprq_en=1             |   No    | Yes             |  Yes |  Yes  |
1867   |                   || RxQs >= rxqs_min_mprq |         |                 |      |       |
1868   +-------------------+------------------------+---------+-----------------+------+-------+
1869   | rx_burst_mprq_vec || rx_vec_en=1 (default) |   No    | if CQE comp off |  Yes |  Yes  |
1870   |                   || mprq_en=1             |         |                 |      |       |
1871   |                   || RxQs >= rxqs_min_mprq |         |                 |      |       |
1872   +-------------------+------------------------+---------+-----------------+------+-------+
1873
1874.. _mlx5_offloads_support:
1875
1876Supported hardware offloads
1877---------------------------
1878
1879Below tables show offload support depending on hardware, firmware,
1880and Linux software support.
1881
1882The :ref:`Linux prerequisites <mlx5_linux_prerequisites>`
1883are Linux kernel and rdma-core libraries.
1884These dependencies are also packaged in MLNX_OFED or MLNX_EN,
1885shortened below as "OFED".
1886
1887.. table:: Minimal SW/HW versions for queue offloads
1888
1889   ============== ===== ===== ========= ===== ========== =============
1890   Offload        DPDK  Linux rdma-core OFED   firmware   hardware
1891   ============== ===== ===== ========= ===== ========== =============
1892   common base    17.11  4.14    16     4.2-1 12.21.1000 ConnectX-4
1893   checksums      17.11  4.14    16     4.2-1 12.21.1000 ConnectX-4
1894   Rx timestamp   17.11  4.14    16     4.2-1 12.21.1000 ConnectX-4
1895   TSO            17.11  4.14    16     4.2-1 12.21.1000 ConnectX-4
1896   LRO            19.08  N/A     N/A    4.6-4 16.25.6406 ConnectX-5
1897   Tx scheduling  20.08  N/A     N/A    5.1-2 22.28.2006 ConnectX-6 Dx
1898   Buffer Split   20.11  N/A     N/A    5.1-2 16.28.2006 ConnectX-5
1899   ============== ===== ===== ========= ===== ========== =============
1900
1901.. table:: Minimal SW/HW versions for rte_flow offloads
1902
1903   +-----------------------+-----------------+-----------------+
1904   | Offload               | with E-Switch   | with NIC        |
1905   +=======================+=================+=================+
1906   | Count                 | | DPDK 19.05    | | DPDK 19.02    |
1907   |                       | | OFED 4.6      | | OFED 4.6      |
1908   |                       | | rdma-core 24  | | rdma-core 23  |
1909   |                       | | ConnectX-5    | | ConnectX-5    |
1910   +-----------------------+-----------------+-----------------+
1911   | Drop                  | | DPDK 19.05    | | DPDK 18.11    |
1912   |                       | | OFED 4.6      | | OFED 4.5      |
1913   |                       | | rdma-core 24  | | rdma-core 23  |
1914   |                       | | ConnectX-5    | | ConnectX-4    |
1915   +-----------------------+-----------------+-----------------+
1916   | Queue / RSS           | |               | | DPDK 18.11    |
1917   |                       | |     N/A       | | OFED 4.5      |
1918   |                       | |               | | rdma-core 23  |
1919   |                       | |               | | ConnectX-4    |
1920   +-----------------------+-----------------+-----------------+
1921   | Shared action         | |               | |               |
1922   |                       | | :numref:`sact`| | :numref:`sact`|
1923   |                       | |               | |               |
1924   |                       | |               | |               |
1925   +-----------------------+-----------------+-----------------+
1926   | | VLAN                | | DPDK 19.11    | | DPDK 19.11    |
1927   | | (of_pop_vlan /      | | OFED 4.7-1    | | OFED 4.7-1    |
1928   | | of_push_vlan /      | | ConnectX-5    | | ConnectX-5    |
1929   | | of_set_vlan_pcp /   | |               | |               |
1930   | | of_set_vlan_vid)    | |               | |               |
1931   +-----------------------+-----------------+-----------------+
1932   | | VLAN                | | DPDK 21.05    | |               |
1933   | | ingress and /       | | OFED 5.3      | |    N/A        |
1934   | | of_push_vlan /      | | ConnectX-6 Dx | |               |
1935   +-----------------------+-----------------+-----------------+
1936   | | VLAN                | | DPDK 21.05    | |               |
1937   | | egress and /        | | OFED 5.3      | |    N/A        |
1938   | | of_pop_vlan /       | | ConnectX-6 Dx | |               |
1939   +-----------------------+-----------------+-----------------+
1940   | Encapsulation         | | DPDK 19.05    | | DPDK 19.02    |
1941   | (VXLAN / NVGRE / RAW) | | OFED 4.7-1    | | OFED 4.6      |
1942   |                       | | rdma-core 24  | | rdma-core 23  |
1943   |                       | | ConnectX-5    | | ConnectX-5    |
1944   +-----------------------+-----------------+-----------------+
1945   | Encapsulation         | | DPDK 19.11    | | DPDK 19.11    |
1946   | GENEVE                | | OFED 4.7-3    | | OFED 4.7-3    |
1947   |                       | | rdma-core 27  | | rdma-core 27  |
1948   |                       | | ConnectX-5    | | ConnectX-5    |
1949   +-----------------------+-----------------+-----------------+
1950   | Tunnel Offload        | |  DPDK 20.11   | | DPDK 20.11    |
1951   |                       | |  OFED 5.1-2   | | OFED 5.1-2    |
1952   |                       | |  rdma-core 32 | | N/A           |
1953   |                       | |  ConnectX-5   | | ConnectX-5    |
1954   +-----------------------+-----------------+-----------------+
1955   | | Header rewrite      | | DPDK 19.05    | | DPDK 19.02    |
1956   | | (set_ipv4_src /     | | OFED 4.7-1    | | OFED 4.7-1    |
1957   | | set_ipv4_dst /      | | rdma-core 24  | | rdma-core 24  |
1958   | | set_ipv6_src /      | | ConnectX-5    | | ConnectX-5    |
1959   | | set_ipv6_dst /      | |               | |               |
1960   | | set_tp_src /        | |               | |               |
1961   | | set_tp_dst /        | |               | |               |
1962   | | dec_ttl /           | |               | |               |
1963   | | set_ttl /           | |               | |               |
1964   | | set_mac_src /       | |               | |               |
1965   | | set_mac_dst)        | |               | |               |
1966   +-----------------------+-----------------+-----------------+
1967   | | Header rewrite      | | DPDK 20.02    | | DPDK 20.02    |
1968   | | (set_dscp)          | | OFED 5.0      | | OFED 5.0      |
1969   | |                     | | rdma-core 24  | | rdma-core 24  |
1970   | |                     | | ConnectX-5    | | ConnectX-5    |
1971   +-----------------------+-----------------+-----------------+
1972   | | Header rewrite      | | DPDK 22.07    | | DPDK 22.07    |
1973   | | (ipv4_ecn /         | | OFED 5.6-2    | | OFED 5.6-2    |
1974   | | ipv6_ecn)           | | rdma-core 41  | | rdma-core 41  |
1975   | |                     | | ConnectX-5    | | ConnectX-5    |
1976   +-----------------------+-----------------+-----------------+
1977   | Jump                  | | DPDK 19.05    | | DPDK 19.02    |
1978   |                       | | OFED 4.7-1    | | OFED 4.7-1    |
1979   |                       | | rdma-core 24  | | N/A           |
1980   |                       | | ConnectX-5    | | ConnectX-5    |
1981   +-----------------------+-----------------+-----------------+
1982   | Mark / Flag           | | DPDK 19.05    | | DPDK 18.11    |
1983   |                       | | OFED 4.6      | | OFED 4.5      |
1984   |                       | | rdma-core 24  | | rdma-core 23  |
1985   |                       | | ConnectX-5    | | ConnectX-4    |
1986   +-----------------------+-----------------+-----------------+
1987   | Meta data             | |  DPDK 19.11   | | DPDK 19.11    |
1988   |                       | |  OFED 4.7-3   | | OFED 4.7-3    |
1989   |                       | |  rdma-core 26 | | rdma-core 26  |
1990   |                       | |  ConnectX-5   | | ConnectX-5    |
1991   +-----------------------+-----------------+-----------------+
1992   | Port ID               | | DPDK 19.05    |     | N/A       |
1993   |                       | | OFED 4.7-1    |     | N/A       |
1994   |                       | | rdma-core 24  |     | N/A       |
1995   |                       | | ConnectX-5    |     | N/A       |
1996   +-----------------------+-----------------+-----------------+
1997   | Hairpin               | |               | | DPDK 19.11    |
1998   |                       | |     N/A       | | OFED 4.7-3    |
1999   |                       | |               | | rdma-core 26  |
2000   |                       | |               | | ConnectX-5    |
2001   +-----------------------+-----------------+-----------------+
2002   | 2-port Hairpin        | |               | | DPDK 20.11    |
2003   |                       | |     N/A       | | OFED 5.1-2    |
2004   |                       | |               | | N/A           |
2005   |                       | |               | | ConnectX-5    |
2006   +-----------------------+-----------------+-----------------+
2007   | Metering              | |  DPDK 19.11   | | DPDK 19.11    |
2008   |                       | |  OFED 4.7-3   | | OFED 4.7-3    |
2009   |                       | |  rdma-core 26 | | rdma-core 26  |
2010   |                       | |  ConnectX-5   | | ConnectX-5    |
2011   +-----------------------+-----------------+-----------------+
2012   | ASO Metering          | |  DPDK 21.05   | | DPDK 21.05    |
2013   |                       | |  OFED 5.3     | | OFED 5.3      |
2014   |                       | |  rdma-core 33 | | rdma-core 33  |
2015   |                       | |  ConnectX-6 Dx| | ConnectX-6 Dx |
2016   +-----------------------+-----------------+-----------------+
2017   | Metering Hierarchy    | |  DPDK 21.08   | | DPDK 21.08    |
2018   |                       | |  OFED 5.3     | | OFED 5.3      |
2019   |                       | |  N/A          | | N/A           |
2020   |                       | |  ConnectX-6 Dx| | ConnectX-6 Dx |
2021   +-----------------------+-----------------+-----------------+
2022   | Sampling              | |  DPDK 20.11   | | DPDK 20.11    |
2023   |                       | |  OFED 5.1-2   | | OFED 5.1-2    |
2024   |                       | |  rdma-core 32 | | N/A           |
2025   |                       | |  ConnectX-5   | | ConnectX-5    |
2026   +-----------------------+-----------------+-----------------+
2027   | Encapsulation         | |  DPDK 21.02   | | DPDK 21.02    |
2028   | GTP PSC               | |  OFED 5.2     | | OFED 5.2      |
2029   |                       | |  rdma-core 35 | | rdma-core 35  |
2030   |                       | |  ConnectX-6 Dx| | ConnectX-6 Dx |
2031   +-----------------------+-----------------+-----------------+
2032   | Encapsulation         | | DPDK 21.02    | | DPDK 21.02    |
2033   | GENEVE TLV option     | | OFED 5.2      | | OFED 5.2      |
2034   |                       | | rdma-core 34  | | rdma-core 34  |
2035   |                       | | ConnectX-6 Dx | | ConnectX-6 Dx |
2036   +-----------------------+-----------------+-----------------+
2037   | Modify Field          | | DPDK 21.02    | | DPDK 21.02    |
2038   |                       | | OFED 5.2      | | OFED 5.2      |
2039   |                       | | rdma-core 35  | | rdma-core 35  |
2040   |                       | | ConnectX-5    | | ConnectX-5    |
2041   +-----------------------+-----------------+-----------------+
2042   | Connection tracking   | |               | | DPDK 21.05    |
2043   |                       | |     N/A       | | OFED 5.3      |
2044   |                       | |               | | rdma-core 35  |
2045   |                       | |               | | ConnectX-6 Dx |
2046   +-----------------------+-----------------+-----------------+
2047
2048.. table:: Minimal SW/HW versions for shared action offload
2049   :name: sact
2050
2051   +-----------------------+-----------------+-----------------+
2052   | Shared Action         | with E-Switch   | with NIC        |
2053   +=======================+=================+=================+
2054   | RSS                   | |               | | DPDK 20.11    |
2055   |                       | |     N/A       | | OFED 5.2      |
2056   |                       | |               | | rdma-core 33  |
2057   |                       | |               | | ConnectX-5    |
2058   +-----------------------+-----------------+-----------------+
2059   | Age                   | | DPDK 20.11    | | DPDK 20.11    |
2060   |                       | | OFED 5.2      | | OFED 5.2      |
2061   |                       | | rdma-core 32  | | rdma-core 32  |
2062   |                       | | ConnectX-6 Dx | | ConnectX-6 Dx |
2063   +-----------------------+-----------------+-----------------+
2064   | Count                 | | DPDK 21.05    | | DPDK 21.05    |
2065   |                       | | OFED 4.6      | | OFED 4.6      |
2066   |                       | | rdma-core 24  | | rdma-core 23  |
2067   |                       | | ConnectX-5    | | ConnectX-5    |
2068   +-----------------------+-----------------+-----------------+
2069
2070.. table:: Minimal SW/HW versions for flow template API
2071
2072   +-----------------+--------------------+--------------------+
2073   | DPDK            | NIC                | Firmware           |
2074   +=================+====================+====================+
2075   | 22.11           | ConnectX-6 Dx      | xx.35.1012         |
2076   +-----------------+--------------------+--------------------+
2077
2078Notes for metadata
2079------------------
2080
MARK and META items are interrelated with the datapath - they might move
from/to the application in mbuf fields. Hence, the zero value for these items
has a special meaning - it means "no metadata is provided"; non-zero values are
treated by the application and the PMD as valid ones.

Moreover, in the flow engine domain the value zero is acceptable to match and
to set, so zero values should be allowed as rte_flow parameters for the META
and MARK items and actions. At the same time, a zero mask has no meaning and
should be rejected at the validation stage.
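
As an illustrative example only (testpmd CLI, assuming extensive metadata
support is enabled via ``dv_xmeta_en``), a rule matching a non-zero META value
and a rule setting a non-zero MARK could look like::

   testpmd> flow create 0 ingress pattern eth / meta data is 0x1 / end actions queue index 0 / end
   testpmd> flow create 0 ingress pattern eth / end actions mark id 0x10 / queue index 0 / end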
2090
2091Notes for rte_flow
2092------------------
2093
2094Flows are not cached in the driver.
2095When stopping a device port, all the flows created on this port from the
2096application will be flushed automatically in the background.
After stopping the device port, all flows on this port become invalid and
are no longer represented in the system.
All references to these flows held by the application should be discarded
directly, but neither destroyed nor flushed.
2101
2102The application should re-create the flows as required after the port restart.
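
For example, in testpmd terms (the rule contents are illustrative only),
a flow rule has to be created again after the port is restarted::

   testpmd> flow create 0 ingress pattern eth / end actions queue index 0 / end
   testpmd> port stop 0
   testpmd> port start 0
   testpmd> flow create 0 ingress pattern eth / end actions queue index 0 / end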
2103
2104
2105Notes for flow counters
2106-----------------------
2107
2108mlx5 PMD supports the ``COUNT`` flow action,
2109which provides an ability to count packets (and bytes)
2110matched against a given flow rule.
This section gives a high-level overview of how this support is implemented
and of its limitations.
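
As a quick illustration (testpmd CLI with the DV flow engine; the rule
contents are arbitrary), a counted flow rule can be created and queried
as follows::

   testpmd> flow create 0 ingress pattern eth / end actions count / queue index 0 / end
   testpmd> flow query 0 0 count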
2113
2114HW steering flow engine
2115~~~~~~~~~~~~~~~~~~~~~~~
2116
2117Flow counters are allocated from HW in bulks.
2118A set of bulks forms a flow counter pool managed by PMD.
2119When flow counters are queried from HW,
2120each counter is identified by an offset in a given bulk.
2121Querying HW flow counter requires sending a request to HW,
2122which will request a read of counter values for given offsets.
2123HW will asynchronously provide these values through a DMA write.
2124
2125In order to optimize HW to SW communication,
2126these requests are handled in a separate counter service thread
2127spawned by mlx5 PMD.
2128This service thread will refresh the counter values stored in memory,
2129in cycles, each spanning ``svc_cycle_time`` milliseconds.
2130By default, ``svc_cycle_time`` is set to 500.
2131When applications query the ``COUNT`` flow action,
2132PMD returns the values stored in host memory.
2133
2134mlx5 PMD manages 3 global rings of allocated counter offsets:
2135
2136- ``free`` ring - Counters which were not used at all.
2137- ``wait_reset`` ring - Counters which were used in some flow rules,
2138  but were recently freed (flow rule was destroyed
2139  or an indirect action was destroyed).
2140  Since the count value might have changed
2141  between the last counter service thread cycle and the moment it was freed,
2142  the value in host memory might be stale.
2143  During the next service thread cycle,
2144  such counters will be moved to ``reuse`` ring.
2145- ``reuse`` ring - Counters which were used at least once
2146  and can be reused in new flow rules.
2147
2148When counters are assigned to a flow rule (or allocated to indirect action),
2149the PMD first tries to fetch a counter from ``reuse`` ring.
2150If it's empty, the PMD fetches a counter from ``free`` ring.
2151
2152The counter service thread works as follows:
2153
2154#. Record counters stored in ``wait_reset`` ring.
2155#. Read values of all counters which were used at least once
2156   or are currently in use.
2157#. Move recorded counters from ``wait_reset`` to ``reuse`` ring.
#. Sleep for ``svc_cycle_time - (query time)`` milliseconds.
2159#. Repeat.
2160
2161Because freeing a counter (by destroying a flow rule or destroying indirect action)
2162does not immediately make it available for the application,
2163the PMD might return:
2164
2165- ``ENOENT`` if no counter is available in ``free``, ``reuse``
2166  or ``wait_reset`` rings.
2167  No counter will be available until the application releases some of them.
2168- ``EAGAIN`` if no counter is available in ``free`` and ``reuse`` rings,
2169  but there are counters in ``wait_reset`` ring.
2170  This means that after the next service thread cycle new counters will be available.
2171
The application has to be aware that flow rule creation or indirect action
creation might need to be retried.
2174
2175
2176Notes for hairpin
2177-----------------
2178
2179NVIDIA ConnectX and BlueField devices support
2180specifying memory placement for hairpin Rx and Tx queues.
2181This feature requires NVIDIA MLNX_OFED 5.8.
2182
2183By default, data buffers and packet descriptors for hairpin queues
2184are placed in device memory
2185which is shared with other resources (e.g. flow rules).
2186
2187Starting with DPDK 22.11 and NVIDIA MLNX_OFED 5.8,
2188applications are allowed to:
2189
2190#. Place data buffers and Rx packet descriptors in dedicated device memory.
2191   Application can request that configuration
2192   through ``use_locked_device_memory`` configuration option.
2193
2194   Placing data buffers and Rx packet descriptors in dedicated device memory
2195   can decrease latency on hairpinned traffic,
2196   since traffic processing for the hairpin queue will not be memory starved.
2197
2198   However, reserving device memory for hairpin Rx queues
2199   may decrease throughput under heavy load,
2200   since less resources will be available on device.
2201
2202   This option is supported only for Rx hairpin queues.
2203
2204#. Place Tx packet descriptors in host memory.
2205   Application can request that configuration
2206   through ``use_rte_memory`` configuration option.
2207
   Placing Tx packet descriptors in host memory can increase traffic throughput.
   This results in more resources being available on the device for other
   purposes, which reduces memory contention on the device.
   A side effect of this option is a visible increase in latency,
   since each packet incurs additional PCI transactions.
2213
2214   This option is supported only for Tx hairpin queues.
2215
2216
2217Notes for testpmd
2218-----------------
2219
2220Compared to librte_net_mlx4 that implements a single RSS configuration per
2221port, librte_net_mlx5 supports per-protocol RSS configuration.
2222
2223Since ``testpmd`` defaults to IP RSS mode and there is currently no
2224command-line parameter to enable additional protocols (UDP and TCP as well
2225as IP), the following commands must be entered from its CLI to get the same
2226behavior as librte_net_mlx4::
2227
2228   > port stop all
2229   > port config all rss all
2230   > port start all
2231
2232Usage example
2233-------------
2234
2235This section demonstrates how to launch **testpmd** with NVIDIA
2236ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices managed by librte_net_mlx5.
2237
2238#. Load the kernel modules::
2239
2240      modprobe -a ib_uverbs mlx5_core mlx5_ib
2241
2242   Alternatively if MLNX_OFED/MLNX_EN is fully installed, the following script
2243   can be run::
2244
2245      /etc/init.d/openibd restart
2246
2247   .. note::
2248
2249      User space I/O kernel modules (uio and igb_uio) are not used and do
2250      not have to be loaded.
2251
2252#. Make sure Ethernet interfaces are in working order and linked to kernel
2253   verbs. Related sysfs entries should be present::
2254
2255      ls -d /sys/class/net/*/device/infiniband_verbs/uverbs* | cut -d / -f 5
2256
2257   Example output::
2258
2259      eth30
2260      eth31
2261      eth32
2262      eth33
2263
#. Optionally, retrieve their PCI bus addresses to be used with the allow list::
2265
2266      {
2267          for intf in eth2 eth3 eth4 eth5;
2268          do
2269              (cd "/sys/class/net/${intf}/device/" && pwd -P);
2270          done;
2271      } |
2272      sed -n 's,.*/\(.*\),-a \1,p'
2273
2274   Example output::
2275
2276      -a 0000:05:00.1
2277      -a 0000:06:00.0
2278      -a 0000:06:00.1
2279      -a 0000:05:00.0
2280
2281#. Request huge pages::
2282
2283      dpdk-hugepages.py --setup 2G
2284
2285#. Start testpmd with basic parameters::
2286
2287      dpdk-testpmd -l 8-15 -n 4 -a 05:00.0 -a 05:00.1 -a 06:00.0 -a 06:00.1 -- --rxq=2 --txq=2 -i
2288
2289   Example output::
2290
2291      [...]
2292      EAL: PCI device 0000:05:00.0 on NUMA socket 0
2293      EAL:   probe driver: 15b3:1013 librte_net_mlx5
2294      PMD: librte_net_mlx5: PCI information matches, using device "mlx5_0" (VF: false)
2295      PMD: librte_net_mlx5: 1 port(s) detected
2296      PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fe
2297      EAL: PCI device 0000:05:00.1 on NUMA socket 0
2298      EAL:   probe driver: 15b3:1013 librte_net_mlx5
2299      PMD: librte_net_mlx5: PCI information matches, using device "mlx5_1" (VF: false)
2300      PMD: librte_net_mlx5: 1 port(s) detected
2301      PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:ff
2302      EAL: PCI device 0000:06:00.0 on NUMA socket 0
2303      EAL:   probe driver: 15b3:1013 librte_net_mlx5
2304      PMD: librte_net_mlx5: PCI information matches, using device "mlx5_2" (VF: false)
2305      PMD: librte_net_mlx5: 1 port(s) detected
2306      PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fa
2307      EAL: PCI device 0000:06:00.1 on NUMA socket 0
2308      EAL:   probe driver: 15b3:1013 librte_net_mlx5
2309      PMD: librte_net_mlx5: PCI information matches, using device "mlx5_3" (VF: false)
2310      PMD: librte_net_mlx5: 1 port(s) detected
2311      PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fb
2312      Interactive-mode selected
2313      Configuring Port 0 (socket 0)
2314      PMD: librte_net_mlx5: 0x8cba80: TX queues number update: 0 -> 2
2315      PMD: librte_net_mlx5: 0x8cba80: RX queues number update: 0 -> 2
2316      Port 0: E4:1D:2D:E7:0C:FE
2317      Configuring Port 1 (socket 0)
2318      PMD: librte_net_mlx5: 0x8ccac8: TX queues number update: 0 -> 2
2319      PMD: librte_net_mlx5: 0x8ccac8: RX queues number update: 0 -> 2
2320      Port 1: E4:1D:2D:E7:0C:FF
2321      Configuring Port 2 (socket 0)
2322      PMD: librte_net_mlx5: 0x8cdb10: TX queues number update: 0 -> 2
2323      PMD: librte_net_mlx5: 0x8cdb10: RX queues number update: 0 -> 2
2324      Port 2: E4:1D:2D:E7:0C:FA
2325      Configuring Port 3 (socket 0)
2326      PMD: librte_net_mlx5: 0x8ceb58: TX queues number update: 0 -> 2
2327      PMD: librte_net_mlx5: 0x8ceb58: RX queues number update: 0 -> 2
2328      Port 3: E4:1D:2D:E7:0C:FB
2329      Checking link statuses...
2330      Port 0 Link Up - speed 40000 Mbps - full-duplex
2331      Port 1 Link Up - speed 40000 Mbps - full-duplex
2332      Port 2 Link Up - speed 10000 Mbps - full-duplex
2333      Port 3 Link Up - speed 10000 Mbps - full-duplex
2334      Done
2335      testpmd>
2336
2337How to dump flows
2338-----------------
2339
2340This section demonstrates how to dump flows. Currently, it is possible to dump
2341all flows with the assistance of external tools.
2342
2343#. There are two ways to get the flow raw file:
2344
2345   - Using the testpmd CLI:
2346
2347   .. code-block:: console
2348
2349       To dump all flows:
2350       testpmd> flow dump <port> all <output_file>
2351       To dump one flow:
2352       testpmd> flow dump <port> rule <rule_id> <output_file>
2353
2354   - Calling the ``rte_flow_dev_dump()`` API (see the sketch after this list):
2355
2356   .. code-block:: c
2357
2358       rte_flow_dev_dump(port, flow, file, NULL);
2359
2360#. Dump human-readable flows from the raw file:
2361
2362   Get the flow parsing tool from: https://github.com/Mellanox/mlx_steering_dump
2363
2364   .. code-block:: console
2365
2366       mlx_steering_dump.py -f <output_file> -flowptr <flow_ptr>
2367
2368How to share a meter between ports in the same switch domain
2369------------------------------------------------------------
2370
2371This section demonstrates how to use a shared meter. A meter M can be created
2372on port X and shared with a port Y in the same switch domain as follows:
2373
2374.. code-block:: console
2375
2376   flow create X ingress transfer pattern eth / port_id id is Y / end actions meter mtr_id M / end
2377
2378How to use meter hierarchy
2379--------------------------
2380
2381This section demonstrates how to create and use a meter hierarchy.
2382A termination meter M can be the policy green action of another termination meter N.
2383The two meters are thus chained together. Using meter N in a flow will apply
2384both meters in the hierarchy to that flow.
2385
2386.. code-block:: console
2387
2388   add port meter policy 0 1 g_actions queue index 0 / end y_actions end r_actions drop / end
2389   create port meter 0 M 1 1 yes 0xffff 1 0
2390   add port meter policy 0 2 g_actions meter mtr_id M / end y_actions end r_actions drop / end
2391   create port meter 0 N 2 2 yes 0xffff 1 0
2392   flow create 0 ingress group 1 pattern eth / end actions meter mtr_id N / end
2393
2394How to configure a VF as trusted
2395--------------------------------
2396
2397This section demonstrates how to configure a virtual function (VF) interface as trusted.
2398A trusted VF is needed to offload rte_flow rules to a group greater than 0;
2399a minimal example is sketched at the end of this section.
The configuration is done in two parts: driver and firmware (FW).
2400
2401The procedure below is an example of using a ConnectX-5 adapter card (pf0) with 2 VFs:
2402
2403#. Create 2 VFs on the PF pf0 when in Legacy SR-IOV mode::
2404
2405   $ echo 2 > /sys/class/net/pf0/device/mlx5_num_vfs
2406
2407#. Verify the VFs are created:
2408
2409   .. code-block:: console
2410
2411      $ lspci | grep Mellanox
2412      82:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
2413      82:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
2414      82:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
2415      82:00.3 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
2416
2417#. Unbind all VFs. For each VF PCIe address, use the following command to unbind the driver::
2418
2419   $ echo "0000:82:00.2" >> /sys/bus/pci/drivers/mlx5_core/unbind
2420
2421#. Set the VFs to be trusted for the kernel by using one of the methods below:
2422
2423      - Using sysfs file::
2424
2425        $ echo ON | tee /sys/class/net/pf0/device/sriov/0/trust
2426        $ echo ON | tee /sys/class/net/pf0/device/sriov/1/trust
2427
2428      - Using the ``ip link`` command::
2429
2430        $ ip link set p0 vf 0 trust on
2431        $ ip link set p0 vf 1 trust on
2432
2433#. Configure all VFs using ``mlxreg``:
2434
2435   - For MFT >= 4.21::
2436
2437     $ mlxreg -d /dev/mst/mt4121_pciconf0 --reg_name VHCA_TRUST_LEVEL --yes --indexes 'all_vhca=0x1,vhca_id=0x0' --set 'trust_level=0x1'
2438
2439   - For MFT < 4.21::
2440
2441     $ mlxreg -d /dev/mst/mt4121_pciconf0 --reg_name VHCA_TRUST_LEVEL --yes --set "all_vhca=0x1,trust_level=0x1"
2442
2443   .. note::
2444
2445      The firmware version must be >= xx.29.1016 and MFT must be >= 4.18.
2446
2447#. For each VF PCIe address, use the following command to bind the driver back::
2448
2449   $ echo "0000:82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind
2450
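Once the VF is trusted, an application can offload rte_flow rules
to flow groups above 0.
A minimal sketch creating a simple rule in group 1
(the helper name is illustrative and an initialized port is assumed):

.. code-block:: c

   #include <rte_flow.h>

   /* Create a rule in group 1; this requires the VF to be trusted. */
   static struct rte_flow *
   create_group1_flow(uint16_t port_id, uint16_t rx_queue)
   {
       struct rte_flow_attr attr = { .group = 1, .ingress = 1 };
       struct rte_flow_item pattern[] = {
           { .type = RTE_FLOW_ITEM_TYPE_ETH },
           { .type = RTE_FLOW_ITEM_TYPE_END },
       };
       struct rte_flow_action_queue queue = { .index = rx_queue };
       struct rte_flow_action actions[] = {
           { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
           { .type = RTE_FLOW_ACTION_TYPE_END },
       };
       struct rte_flow_error error;

       return rte_flow_create(port_id, &attr, pattern, actions, &error);
   }
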
2451How to trace Tx datapath
2452------------------------
2453
2454The mlx5 PMD provides Tx datapath tracing capability with extra debug information:
2455when and how packets were scheduled,
2456and when the actual sending was completed by the NIC hardware.
2457
2458Steps to enable Tx datapath tracing:
2459
2460#. Build the DPDK application with datapath tracing enabled
2461
2462   The Meson option ``-Denable_trace_fp=true`` and
2463   the C flag ``ALLOW_EXPERIMENTAL_API`` should be specified.
2464
2465   .. code-block:: console
2466
2467      meson configure --buildtype=debug -Denable_trace_fp=true
2468         -Dc_args='-DRTE_LIBRTE_MLX5_DEBUG -DRTE_ENABLE_ASSERT -DALLOW_EXPERIMENTAL_API' build
2469
2470#. Configure the NIC
2471
2472   If the sending completion timings are important,
2473   the NIC should be configured to provide realtime timestamps.
2474   The non-volatile settings parameter ``REAL_TIME_CLOCK_ENABLE`` should be set to ``1``.
2475
2476   .. code-block:: console
2477
2478      mlxconfig -d /dev/mst/mt4125_pciconf0 s REAL_TIME_CLOCK_ENABLE=1
2479
2480   The ``mlxconfig`` utility is part of the MFT package.
2481
2482#. Run the application with the EAL parameter enabling tracing in the mlx5 Tx datapath
2483
2484   By default, all tracepoints are disabled.
2485   To analyze the Tx datapath and its timings, use ``--trace=pmd.net.mlx5.tx``.
2486
2487#. Commit the tracing data to storage with the ``rte_trace_save()`` API call (see the sketch after this list).
2488
2489#. Install or build the ``babeltrace2`` package
2490
2491   The Python script analyzing gathered trace data uses the ``babeltrace2`` library.
2492   The package should be either installed or built from source as shown below.
2493
2494   .. code-block:: console
2495
2496      git clone https://github.com/efficios/babeltrace.git
2497      cd babeltrace
2498      ./bootstrap
2499      ./configure -help
2500      ./configure --disable-api-doc --disable-man-pages
2501                  --disable-python-bindings-doc --enable-python-plugins
2502                  --enable-python-binding
2503
2504#. Run the analyzing script
2505
2506   ``mlx5_trace.py`` is used to combine related events (packet firing and completion)
2507   and to show the results in a human-readable view.
2508
2509   The analyzing script is located in the DPDK source tree: ``drivers/net/mlx5/tools``.
2510
2511   It requires Python 3.6 and the ``babeltrace2`` package.
2512
2513   The parameter of the script is the trace data folder.
2514
2515   The optional parameter ``-a`` forces dumping of incomplete bursts.
2516
2517   The optional parameter ``-v [level]`` forces dumping of raw record data
2518   for the specified level and below.
2519   Level 0 dumps bursts, level 1 dumps WQEs, level 2 dumps mbufs.
2520
2521   .. code-block:: console
2522
2523      mlx5_trace.py /var/log/rte-2023-01-23-AM-11-52-39
2524
2525#. Interpret the script output data
2526
2527   All the timings are given in nanoseconds.
2528   The list of Tx bursts per port/queue is presented in the output.
2529   Each list element contains the list of built WQEs with specific opcodes.
2530   Each WQE contains the list of the encompassed packets to send.
2531
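A minimal sketch of committing the trace data from the application shutdown path,
as mentioned in the list above
(the helper name is illustrative; ``rte_trace_save()`` writes the gathered events
to the trace directory reported by EAL at startup):

.. code-block:: c

   #include <stdio.h>

   #include <rte_eal.h>
   #include <rte_trace.h>

   /* Called on application shutdown, after the datapath has stopped. */
   static void
   commit_traces(void)
   {
       /* Write all gathered trace events to the trace directory. */
       if (rte_trace_save() != 0)
           printf("Failed to save the trace data\n");
       rte_eal_cleanup();
   }
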
2532Host shaper
2533-----------
2534
2535The host shaper register is a per-host-port register
2536which sets a shaper on the host port.
2537All VF/host PF representors belonging to one host port share one host shaper.
2538For example, if representor 0 and representor 1 belong to the same host port,
2539and a host shaper rate of 1Gbps is configured,
2540the shaper throttles both representors' traffic from the host.
2541
2542The host shaper has two modes for setting the shaper:
2543immediate, and deferred to an available descriptor threshold event trigger.
2544
2545In immediate mode, the rate limit is configured immediately on the host shaper.
2546
2547When deferring to the available descriptor threshold trigger,
2548the shaper is not set until an available descriptor threshold event
2549is received by any Rx queue in a VF representor belonging to the host port.
2550The only rate supported for deferred mode is 100Mbps
2551(there is no limit on the supported rates for immediate mode).
2552In deferred mode, the shaper is set on the host port by the firmware
2553upon receiving the available descriptor threshold event,
2554which allows throttling host traffic on available descriptor threshold events
2555at minimum latency, preventing excess drops in the Rx queue.
2556
2557Dependency on mstflint package
2558~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2559
2560In order to configure the host shaper register,
2561``librte_net_mlx5`` depends on ``libmtcr_ul``,
2562which can be installed from the MLNX_OFED mstflint package.
2563Meson detects the presence of ``libmtcr_ul`` at configure stage.
2564If the library is detected, the application must be linked with ``-lmtcr_ul``,
2565as done by the pkg-config file ``libdpdk.pc``.
2566
2567Available descriptor threshold and host shaper
2568~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2569
2570There is a command to configure the available descriptor threshold in testpmd.
2571Testpmd also contains sample logic to handle available descriptor threshold events.
2572The typical workflow is:
2573testpmd configures available descriptor threshold for Rx queues,
2574enables ``avail_thresh_triggered`` in host shaper and registers a callback.
2575When traffic from the host is too high
2576and Rx queue emptiness is below the available descriptor threshold,
2577the PMD receives an event
2578and the firmware configures a 100Mbps shaper on the host port automatically.
2579Then the PMD calls the previously registered callback,
2580which waits a while to let the Rx queue drain,
2581and then disables the host shaper.
2582
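An application can implement similar logic itself.
Below is a minimal sketch, assuming the experimental ethdev helpers
``rte_eth_rx_avail_thresh_set()`` and the ``RTE_ETH_EVENT_RX_AVAIL_THRESH`` event
are available in the DPDK version in use;
re-arming or disabling the shaper from the callback is only hinted at in a comment,
since it goes through the mlx5-specific host shaper API.

.. code-block:: c

   #include <stdio.h>

   #include <rte_ethdev.h>

   /*
    * Invoked when the emptiness of an Rx queue drops below the configured
    * available descriptor threshold (the firmware has already set the
    * 100Mbps shaper on the host port at this point).
    */
   static int
   avail_thresh_event_cb(uint16_t port_id, enum rte_eth_event_type event,
                         void *cb_arg, void *ret_param)
   {
       (void)event;
       (void)cb_arg;
       (void)ret_param;
       printf("Available descriptor threshold event on port %u\n", port_id);
       /*
        * Here the application would let the Rx queues drain and then
        * disable the host shaper again (testpmd does this through the
        * mlx5-specific host shaper API).
        */
       return 0;
   }

   /* Configure a 70% threshold on two Rx queues and register the callback. */
   static int
   setup_avail_thresh(uint16_t port_id)
   {
       int ret;

       ret = rte_eth_dev_callback_register(port_id,
               RTE_ETH_EVENT_RX_AVAIL_THRESH, avail_thresh_event_cb, NULL);
       if (ret != 0)
           return ret;
       ret = rte_eth_rx_avail_thresh_set(port_id, 0, 70);
       if (ret != 0)
           return ret;
       return rte_eth_rx_avail_thresh_set(port_id, 1, 70);
   }
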
2583Let's assume we have a simple BlueField-2 setup:
2584port 0 is the uplink, port 1 is a VF representor.
2585Each port has 2 Rx queues.
2586To control traffic from the host to the Arm device,
2587we can enable the available descriptor threshold in testpmd by:
2588
2589.. code-block:: console
2590
2591   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0
2592   testpmd> set port 1 rxq 0 avail_thresh 70
2593   testpmd> set port 1 rxq 1 avail_thresh 70
2594
2595The first command disables the current host shaper
2596and enables the available descriptor threshold triggered mode.
2597The other commands configure the available descriptor threshold
2598to 70% of Rx queue size for both Rx queues.
2599
2600When traffic from the host is too high,
2601the testpmd console prints a log message about the available descriptor threshold event,
2602and the host shaper is then disabled by the callback.
2603The traffic rate from the host is controlled and fewer drops happen in the Rx queues.
2604
2605The threshold event and shaper can be disabled like this:
2606
2607.. code-block:: console
2608
2609   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0
2610   testpmd> set port 1 rxq 0 avail_thresh 0
2611   testpmd> set port 1 rxq 1 avail_thresh 0
2612
2613It is recommended that an application disable the available descriptor threshold
2614and ``avail_thresh_triggered`` before exiting,
2615if it enabled them earlier.
2616
2617The shaper can also be configured with a value; the rate unit is 100Mbps.
2618The command below sets the current shaper to 5Gbps
2619and disables ``avail_thresh_triggered``.
2620
2621.. code-block:: console
2622
2623   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50
2624
2625
2626Testpmd driver specific commands
2627--------------------------------
2628
2629port attach with socket path
2630~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2631
2632It is possible to allocate a port with ``libibverbs`` from an external application.
2633To import the external port with extra device arguments,
2634there is a specific testpmd command
2635similar to the :ref:`port attach command <port_attach>`::
2636
2637   testpmd> mlx5 port attach (identifier) socket=(path)
2638
2639where:
2640
2641* ``identifier``: device identifier with optional parameters
2642  the same as in the :ref:`port attach command <port_attach>`.
2643* ``path``: path to IPC server socket created by the external application.
2644
2645This command performs:
2646
2647#. Open an IPC client socket using the given path and connect it.
2648
2649#. Import ibverbs context and ibverbs protection domain.
2650
2651#. Add two device arguments for context (``cmd_fd``)
2652   and protection domain (``pd_handle``) to the device identifier.
2653   See :ref:`mlx5 driver options <mlx5_common_driver_options>` for more
2654   information about these device arguments.
2655
2656#. Call the regular ``port attach`` function with updated identifier.
2657
2658For example, to attach a port whose PCI address is ``0000:0a:00.0``
2659and its socket path is ``/var/run/import_ipc_socket``:
2660
2661.. code-block:: console
2662
2663   testpmd> mlx5 port attach 0000:0a:00.0 socket=/var/run/import_ipc_socket
2664   testpmd: MLX5 socket path is /var/run/import_ipc_socket
2665   testpmd: Attach port with extra devargs 0000:0a:00.0,cmd_fd=40,pd_handle=1
2666   Attaching a new port...
2667   EAL: Probe PCI driver: mlx5_pci (15b3:101d) device: 0000:0a:00.0 (socket 0)
2668   Port 0 is attached. Now total ports is 1
2669   Done
2670
2671
2672port map external Rx queue
2673~~~~~~~~~~~~~~~~~~~~~~~~~~
2674
2675Manage the mapping of external Rx queue indexes.
2676
2677Map HW queue index (32-bit) to ethdev queue index (16-bit) for external Rx queue::
2678
2679   testpmd> mlx5 port (port_id) ext_rxq map (sw_queue_id) (hw_queue_id)
2680
2681Unmap external Rx queue::
2682
2683   testpmd> mlx5 port (port_id) ext_rxq unmap (sw_queue_id)
2684
2685where:
2686
2687* ``sw_queue_id``: queue index in range [64536, 65535].
2688  This range is the highest 1000 values of the 16-bit queue index space.
2689* ``hw_queue_id``: queue index given by HW in queue creation.
2690
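The same mapping can be done from the application.
A minimal sketch, assuming the experimental mlx5-specific helpers
``rte_pmd_mlx5_external_rx_queue_id_map()`` and
``rte_pmd_mlx5_external_rx_queue_id_unmap()`` from ``rte_pmd_mlx5.h``
are available in the DPDK version in use:

.. code-block:: c

   #include <rte_pmd_mlx5.h>

   /*
    * Map an externally created HW RQ (32-bit HW index) to an ethdev Rx queue
    * index taken from the external range, then unmap it when no longer needed.
    */
   static int
   map_external_rxq(uint16_t port_id, uint16_t sw_queue_id, uint32_t hw_queue_id)
   {
       int ret;

       /* sw_queue_id must be taken from the external range, e.g. 64536..65535. */
       ret = rte_pmd_mlx5_external_rx_queue_id_map(port_id, sw_queue_id,
                                                   hw_queue_id);
       if (ret != 0)
           return ret;
       /* ... the mapped index can now be used in rte_flow rules ... */
       return rte_pmd_mlx5_external_rx_queue_id_unmap(port_id, sw_queue_id);
   }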
2691
2692Dump RQ/SQ/CQ HW context for debug purposes
2693~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2694
2695Dump RQ/CQ HW context for a given port/queue to a file::
2696
2697   testpmd> mlx5 port (port_id) queue (queue_id) dump rq_context (file_name)
2698
2699Dump SQ/CQ HW context for a given port/queue to a file::
2700
2701   testpmd> mlx5 port (port_id) queue (queue_id) dump sq_context (file_name)
2702
2703
2704Set Flow Engine Mode
2705~~~~~~~~~~~~~~~~~~~~
2706
2707Set the flow engine to active or standby mode with specific flags (bitmap style).
2708See ``RTE_PMD_MLX5_FLOW_ENGINE_FLAG_*`` for the flag definitions.
2709
2710.. code-block:: console
2711
2712   testpmd> mlx5 set flow_engine <active|standby> [<flags>]
2713
2714This command is used for testing live migration,
2715and works for software steering only.
2716The default FDB jump should be disabled if switchdev is enabled.
2717The mode propagates to all the probed ports.
2718
2719
2720GENEVE TLV options parser
2721~~~~~~~~~~~~~~~~~~~~~~~~~
2722
2723See the :ref:`GENEVE parser API <geneve_parser_api>` for more information.
2724
2725Set
2726^^^
2727
2728Add a single option to the global option list::
2729
2730   testpmd> mlx5 set tlv_option class (class) type (type) len (length) \
2731            offset (sample_offset) sample_len (sample_len) \
2732            class_mode (ignore|fixed|matchable) data (0xffffffff|0x0 [0xffffffff|0x0]*)
2733
2734where:
2735
2736* ``class``: option class.
2737* ``type``: option type.
2738* ``length``: option data length, in 4-byte granularity.
2739* ``sample_offset``: offset of the data list relative to the option data start,
2740  in 4-byte granularity.
2741* ``sample_len``: length of the data list, in 4-byte granularity.
2742* ``ignore``: ignore the ``class`` field.
2743* ``fixed``: the option class is fixed and defines the option along with the type.
2744* ``matchable``: the ``class`` field is matchable.
2745* ``data``: list of masks indicating which DWs should be configured.
2746  The size of the list should be equal to ``sample_len``.
2747* ``0xffffffff``: this DW should be configured.
2748* ``0x0``: this DW shouldn't be configured.
2749
2750Flush
2751^^^^^
2752
2753Remove several options from the global option list::
2754
2755   testpmd> mlx5 flush tlv_options max (nb_option)
2756
2757where:
2758
2759* ``nb_option``: maximum number of options to remove from the list. The order is LIFO.
2760
2761List
2762^^^^
2763
2764Print all options which are set in the global option list so far::
2765
2766   testpmd> mlx5 list tlv_options
2767
2768The output contains the values of each option, one per line.
2769There is no output at all when no options are configured in the global list::
2770
2771   ID      Type    Class   Class_mode   Len     Offset  Sample_len   Data
2772   [...]   [...]   [...]   [...]        [...]   [...]   [...]        [...]
2773
2774Setting several options and listing them::
2775
2776   testpmd> mlx5 set tlv_option class 1 type 1 len 4 offset 1 sample_len 3
2777            class_mode fixed data 0xffffffff 0x0 0xffffffff
2778   testpmd: set new option in global list, now it has 1 options
2779   testpmd> mlx5 set tlv_option class 1 type 2 len 2 offset 0 sample_len 2
2780            class_mode fixed data 0xffffffff 0xffffffff
2781   testpmd: set new option in global list, now it has 2 options
2782   testpmd> mlx5 set tlv_option class 1 type 3 len 5 offset 4 sample_len 1
2783            class_mode fixed data 0xffffffff
2784   testpmd: set new option in global list, now it has 3 options
2785   testpmd> mlx5 list tlv_options
2786   ID      Type    Class   Class_mode   Len    Offset  Sample_len  Data
2787   0       1       1       fixed        4      1       3           0xffffffff 0x0 0xffffffff
2788   1       2       1       fixed        2      0       2           0xffffffff 0xffffffff
2789   2       3       1       fixed        5      4       1           0xffffffff
2790   testpmd>
2791
2792Apply
2793^^^^^
2794
2795Create a GENEVE TLV parser for a specific port using the option list set so far::
2796
2797   testpmd> mlx5 port (port_id) apply tlv_options
2798
2799The same global option list can be used by several ports.
2800
2801Destroy
2802^^^^^^^
2803
2804Destroy the GENEVE TLV parser for a specific port::
2805
2806   testpmd> mlx5 port (port_id) destroy tlv_options
2807
2808This command doesn't destroy the global list.
2809To release options, the ``flush`` command should be used.
2810