..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2015 6WIND S.A.
    Copyright 2015 Mellanox Technologies, Ltd

.. include:: <isonum.txt>

NVIDIA MLX5 Ethernet Driver
===========================

.. note::

   NVIDIA acquired Mellanox Technologies in 2020.
   The DPDK documentation and code might still include instances
   of or references to Mellanox trademarks (like BlueField and ConnectX)
   that are now NVIDIA trademarks.

The mlx5 Ethernet poll mode driver library (**librte_net_mlx5**) provides support
for **NVIDIA ConnectX-4**, **NVIDIA ConnectX-4 Lx**, **NVIDIA ConnectX-5**,
**NVIDIA ConnectX-6**, **NVIDIA ConnectX-6 Dx**, **NVIDIA ConnectX-6 Lx**,
**NVIDIA ConnectX-7**, **NVIDIA BlueField**, **NVIDIA BlueField-2** and
**NVIDIA BlueField-3** families of 10/25/40/50/100/200/400 Gb/s adapters
as well as their virtual functions (VF) in SR-IOV context.

Supported NICs
--------------

The following NVIDIA device families are supported by the same mlx5 driver:

- ConnectX-4
- ConnectX-4 Lx
- ConnectX-5
- ConnectX-5 Ex
- ConnectX-6
- ConnectX-6 Dx
- ConnectX-6 Lx
- ConnectX-7
- BlueField
- BlueField-2
- BlueField-3

Below are detailed device names:

* NVIDIA\ |reg| ConnectX\ |reg|-4 10G MCX4111A-XCAT (1x10G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 10G MCX412A-XCAT (2x10G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 25G MCX4111A-ACAT (1x25G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 25G MCX412A-ACAT (2x25G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 40G MCX413A-BCAT (1x40G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 40G MCX4131A-BCAT (1x40G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 40G MCX415A-BCAT (1x40G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 50G MCX413A-GCAT (1x50G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 50G MCX4131A-GCAT (1x50G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 50G MCX414A-BCAT (2x50G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 50G MCX415A-GCAT (1x50G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 50G MCX416A-BCAT (2x50G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 50G MCX416A-GCAT (2x50G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 50G MCX415A-CCAT (1x100G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 100G MCX416A-CCAT (2x100G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 Lx 10G MCX4111A-XCAT (1x10G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 Lx 10G MCX4121A-XCAT (2x10G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 Lx 25G MCX4111A-ACAT (1x25G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 Lx 25G MCX4121A-ACAT (2x25G)
* NVIDIA\ |reg| ConnectX\ |reg|-4 Lx 40G MCX4131A-BCAT (1x40G)
* NVIDIA\ |reg| ConnectX\ |reg|-5 100G MCX556A-ECAT (2x100G)
* NVIDIA\ |reg| ConnectX\ |reg|-5 Ex EN 100G MCX516A-CDAT (2x100G)
* NVIDIA\ |reg| ConnectX\ |reg|-6 200G MCX654106A-HCAT (2x200G)
* NVIDIA\ |reg| ConnectX\ |reg|-6 Dx EN 100G MCX623106AN-CDAT (2x100G)
* NVIDIA\ |reg| ConnectX\ |reg|-6 Dx EN 200G MCX623105AN-VDAT (1x200G)
* NVIDIA\ |reg| ConnectX\ |reg|-6 Lx EN 25G MCX631102AN-ADAT (2x25G)
* NVIDIA\ |reg| ConnectX\ |reg|-7 200G CX713106AE-HEA_QP1_Ax (2x200G)
* NVIDIA\ |reg| BlueField\ |reg|-2 25G MBF2H332A-AEEOT_A1 (2x25G)
* NVIDIA\ |reg| BlueField\ |reg|-3 200GbE 900-9D3B6-00CV-AA0 (2x200G)
* NVIDIA\ |reg| BlueField\ |reg|-3 200GbE 900-9D3B6-00SV-AA0 (2x200G)
* NVIDIA\ |reg| BlueField\ |reg|-3 400GbE 900-9D3B6-00CN-AB0 (2x400G)
* NVIDIA\ |reg| BlueField\ |reg|-3 100GbE 900-9D3B4-00CC-EA0 (2x100G)
* NVIDIA\ |reg| BlueField\ |reg|-3 100GbE 900-9D3B4-00SC-EA0 (2x100G)
* NVIDIA\ |reg| BlueField\ |reg|-3 400GbE 900-9D3B4-00EN-EA0 (1x400G)


Design
------

Besides its dependency on libibverbs (that implies libmlx5 and associated
kernel support), librte_net_mlx5 relies heavily on system calls for control
operations such as querying/updating the MTU and flow control parameters.

This capability allows the PMD to coexist with kernel network interfaces
which remain functional, although they stop receiving unicast packets as
long as they share the same MAC address.
This means legacy Linux control tools (for example: ethtool, ifconfig and
more) can operate on the same network interfaces that are owned by the DPDK
application.

See :doc:`../../platform/mlx5` guide for more design details,
including prerequisites installation.

Features
--------

- Multi arch support: x86_64, POWER8, ARMv8, i686.
- Multiple TX and RX queues.
- Shared Rx queue.
- Rx queue delay drop.
- Rx queue available descriptor threshold event.
- Host shaper support.
- Support steering for external Rx queue created outside the PMD.
- Support for scattered TX frames.
- Advanced support for scattered Rx frames with tunable buffer attributes.
- IPv4, IPv6, TCPv4, TCPv6, UDPv4 and UDPv6 RSS on any number of queues.
- RSS using different combinations of fields: L3 only, L4 only or both,
  and source only, destination only or both.
- Several RSS hash keys, one for each flow type.
- Default RSS operation with no hash key specification.
- Symmetric RSS function.
- Configurable RETA table.
- Link flow control (pause frame).
- Support for multiple MAC addresses.
- VLAN filtering.
- RX VLAN stripping.
- TX VLAN insertion.
- RX CRC stripping configuration.
- TX mbuf fast free offload.
- Promiscuous mode on PF and VF.
- Multicast promiscuous mode on PF and VF.
- Hardware checksum offloads.
- Flow director (RTE_FDIR_MODE_PERFECT, RTE_FDIR_MODE_PERFECT_MAC_VLAN and
  RTE_ETH_FDIR_REJECT).
- Flow API, including :ref:`flow_isolated_mode`.
- Multiple process.
- KVM and VMware ESX SR-IOV modes are supported.
- RSS hash result is supported.
- Hardware TSO for generic IP or UDP tunnel, including VXLAN and GRE.
- Hardware checksum Tx offload for generic IP or UDP tunnel, including VXLAN and GRE.
- RX interrupts.
- Statistics query including Basic, Extended and per queue.
- Rx HW timestamp.
- Tunnel types: VXLAN, L3 VXLAN, VXLAN-GPE, GRE, MPLSoGRE, MPLSoUDP, IP-in-IP, Geneve, GTP.
- Tunnel HW offloads: packet type, inner/outer RSS, IP and UDP checksum verification.
- NIC HW offloads: encapsulation (vxlan, gre, mplsoudp, mplsogre), NAT, routing, TTL
  increment/decrement, count, drop, mark. For details please see :ref:`mlx5_offloads_support`.
- Flow insertion rate of more than a million flows per second, when using Direct Rules.
- Support for multiple rte_flow groups.
- Per packet no-inline hint flag to disable packet data copying into Tx descriptors.
- Hardware LRO.
- Hairpin.
- Multiple-thread flow insertion.
- Matching on IPv4 Internet Header Length (IHL).
- Matching on IPv6 routing extension header.
- Matching on GTP extension header with raw encap/decap action.
- Matching on Geneve TLV option header with raw encap/decap action.
- Matching on ESP header SPI field.
- Matching on InfiniBand BTH.
- Matching on random value.
- Modify IPv4/IPv6 ECN field.
- Push or remove IPv6 routing extension.
- NAT64.
- RSS support in sample action.
- E-Switch mirroring and jump.
- E-Switch mirroring and modify.
- Send to kernel.
- 21844 flow priorities for ingress or egress flow groups greater than 0 and for any transfer
  flow group.
- Flow quota.
- Flow metering, including meter policy API.
- Flow meter hierarchy.
- Flow meter mark.
- Flow integrity offload API.
- Connection tracking.
- Sub-Function representors.
- Sub-Function.
- Matching on represented port.
- Matching on aggregated affinity.
- Matching on external Tx queue.
- Matching on E-Switch manager.


Limitations
-----------

- Windows support:

  On Windows, the features are limited:

  - Promiscuous mode is not supported
  - The following rules are supported:

    - IPv4/UDP with CVLAN filtering
    - Unicast MAC filtering

  - Additional rules are supported from WinOF2 version 2.70:

    - IPv4/TCP with CVLAN filtering
    - L4 steering rules for port RSS of UDP, TCP and IP

- PCI Virtual Function MTU:

  MTU settings on PCI Virtual Functions have no effect.
  The maximum receivable packet size for a VF is determined by the MTU
  configured on its associated Physical Function.
  DPDK applications using VFs must be prepared to handle packets
  up to the maximum size of this PF port.

- For secondary process:

  - Forked secondary process not supported.
  - MPRQ is not supported. Callback to free externally attached MPRQ buffer is set
    in a primary process, but has a different virtual address in a secondary process.
    Calling a function at the wrong address leads to a segmentation fault.
  - External memory unregistered in EAL memseg list cannot be used for DMA
    unless such memory has been registered by ``mlx5_mr_update_ext_mp()`` in
    the primary process and remapped to the same virtual address in the secondary
    process. If the external memory is registered by the primary process but has
    a different virtual address in the secondary process, unexpected errors may happen.

- Shared Rx queue:

  - Counters of received packets and bytes are identical for all devices in the same share group.
  - Counters of received packets and bytes are identical for all queues sharing the same group and queue ID.

- Available descriptor threshold event:

  - Does not support shared Rx queue and hairpin Rx queue.

- The symmetric RSS function is supported by swapping source and destination
  addresses and ports.

- Host shaper:

  - Support BlueField series NIC from BlueField-2.
  - When configuring host shaper with ``RTE_PMD_MLX5_HOST_SHAPER_FLAG_AVAIL_THRESH_TRIGGERED`` flag,
    only rates 0 and 100Mbps are supported.

- HW steering:

  - WQE based high scaling and safer flow insertion/destruction.
  - Set ``dv_flow_en`` to 2 in order to enable HW steering.
  - Async queue-based ``rte_flow_async`` APIs supported only.
  - NIC ConnectX-5 and before are not supported.
  - Reconfiguring the flow API engine is not supported.
    Any subsequent call to ``rte_flow_configure()`` with a configuration different
    from the one initially provided will be rejected with the ``-ENOTSUP`` error code.
  - Partial match with item template is not supported.
  - IPv6 5-tuple matching is not supported.
  - With E-Switch enabled, ports which share the E-Switch domain
    should be started and stopped in a specific order:

    - When starting ports, the transfer proxy port should be started first
      and port representors should follow.
    - When stopping ports, all of the port representors
      should be stopped before stopping the transfer proxy port.

    If ports are started/stopped in an incorrect order,
    ``rte_eth_dev_start()``/``rte_eth_dev_stop()`` will return an appropriate error code:

    - ``-EAGAIN`` for ``rte_eth_dev_start()``.
    - ``-EBUSY`` for ``rte_eth_dev_stop()``.

  - Matching on ICMP6 following IPv6 routing extension header
    should match ``ipv6_routing_ext_next_hdr`` instead of ICMP6.
    IPv6 routing extension matching is not supported in flow template relaxed
    matching mode (see ``struct rte_flow_pattern_template_attr::relaxed_matching``).

  - The supported actions order is as below::

       MARK (a)
       *_DECAP (b)
       OF_POP_VLAN
       COUNT | AGE
       METER_MARK | CONNTRACK
       OF_PUSH_VLAN
       MODIFY_FIELD
       *_ENCAP (c)
       JUMP | DROP | RSS (a) | QUEUE (a) | REPRESENTED_PORT (d)

    a. Only supported on ingress.
    b. Any decapsulation action, including the combination of RAW_ENCAP and RAW_DECAP actions
       which results in L3 decapsulation.
       Not supported on egress.
    c. Any encapsulation action, including the combination of RAW_ENCAP and RAW_DECAP actions
       which results in L3 encapsulation.
    d. Only in transfer (switchdev) mode.

- When using Verbs flow engine (``dv_flow_en`` = 0), a flow pattern without any
  specific VLAN will match VLAN packets as well:

  When a VLAN spec is not specified in the pattern, the matching rule will be created with VLAN as a wild card.
  Meaning, the flow rule::

     flow create 0 ingress pattern eth / vlan vid is 3 / ipv4 / end ...

  Will only match VLAN packets with vid=3, and the flow rule::

     flow create 0 ingress pattern eth / ipv4 / end ...

  Will match any IPv4 packet (VLAN included).

- When using Verbs flow engine (``dv_flow_en`` = 0), multi-tagged (QinQ) match is not supported.

- When using DV flow engine (``dv_flow_en`` = 1), a flow pattern with any VLAN specification
  will match only single-tagged packets unless the ETH item ``type`` field is 0x88A8
  or the VLAN item ``has_more_vlan`` field is 1.
  The flow rule::

     flow create 0 ingress pattern eth / ipv4 / end ...

  Will match any IPv4 packet.
  The flow rules::

     flow create 0 ingress pattern eth / vlan / end ...
     flow create 0 ingress pattern eth has_vlan is 1 / end ...
     flow create 0 ingress pattern eth type is 0x8100 / end ...

  Will match single-tagged packets only, with any VLAN ID value.
  The flow rules::

     flow create 0 ingress pattern eth type is 0x88A8 / end ...
     flow create 0 ingress pattern eth / vlan has_more_vlan is 1 / end ...

  Will match multi-tagged packets only, with any VLAN ID value.

- A flow pattern with 2 sequential VLAN items is not supported.

- VLAN pop offload command:

  - Flow rules having a VLAN pop offload command as one of their actions and
    lacking a match on VLAN as one of their items are not supported.
  - The command is not supported on egress traffic in NIC mode.

- VLAN push offload is not supported on ingress traffic in NIC mode.

- VLAN set PCP offload is not supported on existing headers.

- A multi-segment packet must have no more segments than reported by ``dev_infos_get()``
  in the ``tx_desc_lim.nb_seg_max`` field. This value depends on the maximal supported Tx
  descriptor size and ``txq_inline_min`` settings and may be from 2 (worst case forced by
  maximal inline settings) to 58.
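
  As a minimal sketch (not taken from the driver itself), an application may validate
  a multi-segment mbuf against this limit before handing it to ``rte_eth_tx_burst()``:

  .. code-block:: c

     #include <errno.h>
     #include <rte_ethdev.h>
     #include <rte_mbuf.h>

     /* Return 0 if the packet fits the per-descriptor segment limit. */
     static int
     check_tx_seg_limit(uint16_t port_id, const struct rte_mbuf *pkt)
     {
         struct rte_eth_dev_info dev_info;
         int ret = rte_eth_dev_info_get(port_id, &dev_info);

         if (ret != 0)
             return ret;
         /* nb_seg_max == 0 means the driver reports no limit. */
         if (dev_info.tx_desc_lim.nb_seg_max != 0 &&
             pkt->nb_segs > dev_info.tx_desc_lim.nb_seg_max)
             return -EINVAL; /* such a packet would be rejected on Tx */
         return 0;
     }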

- Match on VXLAN supports any bits in the tunnel header.

  - Flag 8-bits and first 24-bits reserved fields matching
    is only supported when using DV flow engine (``dv_flow_en`` = 2).
  - For ConnectX-5, the UDP destination port must be the standard one (4789).
  - Default UDP destination is 4789 if not explicitly specified.
  - Group zero's behavior may differ depending on FW.

- Matching on VXLAN-GPE header fields:

  - ``rsvd0``/``rsvd1`` matching support depends on FW version
    when using DV flow engine (``dv_flow_en`` = 1).
  - ``protocol`` should be explicitly specified in HWS (``dv_flow_en`` = 2).

- L3 VXLAN and VXLAN-GPE tunnels cannot be supported together with MPLSoGRE and MPLSoUDP.

- MPLSoGRE is not supported in HW steering (``dv_flow_en`` = 2).

- MPLSoUDP with multiple MPLS headers is only supported in HW steering (``dv_flow_en`` = 2).

- Match on Geneve header supports the following fields only:

  - VNI
  - OAM
  - protocol type
  - options length

- Match on Geneve TLV option is supported on the following fields:

  - Class
  - Type
  - Length
  - Data

  Class/Type/Length fields must be specified as well as masks.
  Class/Type/Length specified masks must be full.
  Matching Geneve TLV option without specifying data is not supported.
  Matching Geneve TLV option with ``data & mask == 0`` is not supported.

  In SW steering (``dv_flow_en`` = 1):

  - Only one Class/Type/Length Geneve TLV option is supported per shared device.
  - Supported only with ``FLEX_PARSER_PROFILE_ENABLE`` = 0.

  In HW steering (``dv_flow_en`` = 2):

  - Multiple Class/Type/Length Geneve TLV options are supported per physical device.
  - Multiple instances of the same Geneve TLV option are not supported in the same pattern template.
  - Supported only with ``FLEX_PARSER_PROFILE_ENABLE`` = 8.
  - Supported also with ``FLEX_PARSER_PROFILE_ENABLE`` = 0 for single DW only.
  - Supported for FW version **xx.37.0142** and above.

  .. _geneve_parser_api:

  - An API (``rte_pmd_mlx5_create_geneve_tlv_parser``)
    is available for the flexible parser used in HW steering:

    Each physical device has 7 DWs for GENEVE TLV options.
    Partial option configuration is supported;
    the mask for data is provided at parser creation,
    indicating which DWs configuration is requested.
    Only masked data DWs can be matched later as item fields using the flow API.

    - Matching of the ``type`` field is supported for each configured option.
    - However, for matching the ``class`` field,
      the option should be configured with ``match_on_class_mode=2``.
      One extra DW is consumed for it.
    - Matching on the ``length`` field is not supported.

    - More limitations with ``FLEX_PARSER_PROFILE_ENABLE`` = 0:

      - single DW
      - ``sample_len`` must be equal to ``option_len`` and not bigger than 1.
      - ``match_on_class_mode`` different than 1 is not supported.
      - ``offset`` must be 0.

    Although the parser is created per physical device, this API is port oriented.
    Each port should call this API before using the GENEVE OPT item,
    but its configuration must use the same options list
    with the same internal order as configured by the first port.

    Calling this API for different ports under the same physical device doesn't consume
    more DWs; the first call creates the parser and subsequent calls reuse the same configuration.

- VF: flow rules created on VF devices can only match traffic targeted at the
  configured MAC addresses (see ``rte_eth_dev_mac_addr_add()``).

- Match on GTP tunnel header item supports the following fields only:

  - v_pt_rsv_flags: E flag, S flag, PN flag
  - msg_type
  - teid

- Match on GTP extension header only for GTP PDU session container (next
  extension header type = 0x85).
- Match on GTP extension header is not supported in group 0.

- When using DV/Verbs flow engine (``dv_flow_en`` = 1/0 respectively),
  match on SPI field in ESP header for group 0 is supported from ConnectX-7.

- Matching on SPI field in ESP header is supported over the PF only.

- Flex item:

  - Hardware support: **NVIDIA BlueField-2** and **NVIDIA BlueField-3**.
  - Flex item is supported on PF only.
  - Hardware limits ``header_length_mask_width`` up to 6 bits.
  - Firmware supports 8 global sample fields.
    Each flex item allocates non-shared sample fields from that pool.
  - Supported flex item can have 1 input link - ``eth`` or ``udp``
    and up to 3 output links - ``ipv4`` or ``ipv6``.
  - Flex item fields (``next_header``, ``next_protocol``, ``samples``)
    do not participate in RSS hash functions.
  - In flex item configuration, ``next_header.field_base`` value
    must be byte aligned (multiple of 8).
  - When modifying a field with flex item, the offset must be byte aligned (multiple of 8).

- Match on random value:

  - Supported only with HW Steering enabled (``dv_flow_en`` = 2).
  - Supported only in table with ``nb_flows=1``.
  - NIC ingress/egress flow in group 0 is not supported.
  - Supports matching only 16 bits (LSB).

- Match with compare result item (``RTE_FLOW_ITEM_TYPE_COMPARE``):

  - Only supported in HW steering (``dv_flow_en`` = 2) mode.
  - Only a single flow rule is supported in the flow table.
  - Only a single item is supported per pattern template.
  - In switch mode, when the ``repr_matching_en`` flag is enabled in the devargs
    (which is the default setting),
    the match with compare result item is not supported for ``ingress`` rules.
    This is because an implicit ``REPRESENTED_PORT`` needs to be added to the matcher,
    which conflicts with the single item limitation.
  - Only 32-bit comparison is supported, or 16-bit for the random field.
  - Only supported for ``RTE_FLOW_FIELD_META``, ``RTE_FLOW_FIELD_TAG``,
    ``RTE_FLOW_FIELD_ESP_SEQ_NUM``,
    ``RTE_FLOW_FIELD_RANDOM`` and ``RTE_FLOW_FIELD_VALUE``.
  - The field type ``RTE_FLOW_FIELD_VALUE`` must be the base (``b``) field.
  - The field type ``RTE_FLOW_FIELD_RANDOM`` can only be compared with
    ``RTE_FLOW_FIELD_VALUE``.

- No Tx metadata goes to the E-Switch steering domain for flow group 0.
  Flows within group 0 that use the set metadata action are rejected by hardware.

.. note::

   MAC addresses not already present in the bridge table of the associated
   kernel network device will be added and cleaned up by the PMD when closing
   the device. In case of ungraceful program termination, some entries may
   remain present and should be removed manually by other means.

- Buffer split offload is supported with regular Rx burst routine only,
  no MPRQ feature or vectorized code can be engaged.

- When Multi-Packet Rx queue is configured (``mprq_en``), a Rx packet can be
  externally attached to a user-provided mbuf with RTE_MBUF_F_EXTERNAL set in
  ol_flags. As the mempool for the external buffer is managed by the PMD, all the
  Rx mbufs must be freed before the device is closed. Otherwise, the mempool of
  the external buffers will be freed by the PMD and the application which still
  holds the external buffers may be corrupted.
  User-managed mempools with external pinned data buffers
  cannot be used in conjunction with MPRQ
  since packets may be already attached to PMD-managed external buffers.

- If Multi-Packet Rx queue is configured (``mprq_en``) and Rx CQE compression is
  enabled (``rxq_cqe_comp_en``) at the same time, RSS hash result is not fully
  supported. Some Rx packets may not have RTE_MBUF_F_RX_RSS_HASH.

- IPv6 multicast messages are not supported on VM while promiscuous mode
  and allmulticast mode are both set to off.
  To receive IPv6 multicast messages on VM, explicitly set the relevant
  MAC address using the ``rte_eth_dev_mac_addr_add()`` API.
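
  As a minimal sketch (the address is illustrative), the relevant multicast MAC address
  can be installed before traffic is expected:

  .. code-block:: c

     #include <rte_ethdev.h>
     #include <rte_ether.h>

     /* Install an additional MAC address so that matching frames are delivered
      * to the port (e.g. an IPv6 multicast MAC such as 33:33:xx:xx:xx:xx). */
     static int
     add_extra_mac(uint16_t port_id)
     {
         struct rte_ether_addr mac = {
             .addr_bytes = { 0x33, 0x33, 0x00, 0x00, 0x00, 0x01 },
         };

         /* Pool 0: no VMDq pool is used in this example. */
         return rte_eth_dev_mac_addr_add(port_id, &mac, 0);
     }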

- To support a mixed traffic pattern (some buffers from local host memory, some
  buffers from other devices) with high bandwidth, an mbuf flag is used.

  An application hints the PMD whether or not it should try to inline the
  given mbuf data buffer. The PMD does its best effort to act upon this request.

  The hint flag ``RTE_PMD_MLX5_FINE_GRANULARITY_INLINE`` is dynamic,
  registered by the application with ``rte_mbuf_dynflag_register()``. This flag is
  purely driver-specific and declared in the PMD specific header ``rte_pmd_mlx5.h``,
  which is intended to be used by the application.

  To query the supported specific flags in runtime,
  the function ``rte_pmd_mlx5_get_dyn_flag_names`` returns the array of
  currently (over present hardware and configuration) supported specific flags.
  The "not inline hint" feature operating flow is the following one:

  - application starts
  - probe the devices, ports are created
  - query the port capabilities
  - if port supporting the feature is found
  - register dynamic flag ``RTE_PMD_MLX5_FINE_GRANULARITY_INLINE``
  - application starts the ports
  - on ``dev_start()`` PMD checks whether the feature flag is registered and
    enables the feature support in datapath
  - application might set the registered flag bit in the ``ol_flags`` field
    of the mbuf being sent and the PMD will handle them appropriately.
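
  As a minimal sketch (error handling trimmed), the flag registration and per-packet
  usage could look like:

  .. code-block:: c

     #include <rte_mbuf.h>
     #include <rte_mbuf_dyn.h>
     #include <rte_pmd_mlx5.h>

     static uint64_t no_inline_flag;

     /* Register the driver-specific "do not inline" hint flag. */
     static int
     register_no_inline_hint(void)
     {
         const struct rte_mbuf_dynflag desc = {
             .name = RTE_PMD_MLX5_FINE_GRANULARITY_INLINE,
         };
         int bit = rte_mbuf_dynflag_register(&desc);

         if (bit < 0)
             return bit;
         no_inline_flag = 1ULL << bit;
         return 0;
     }

     /* Hint the PMD not to copy this particular packet into the Tx descriptor. */
     static void
     mark_no_inline(struct rte_mbuf *m)
     {
         m->ol_flags |= no_inline_flag;
     }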

- The amount of descriptors in Tx queue may be limited by data inline settings.
  Inline data require more descriptor building blocks and the overall block
  amount may exceed the hardware supported limits. The application should
  reduce the requested Tx size or adjust the data inline settings with
  ``txq_inline_max`` and ``txq_inline_mpw`` devargs keys.

- To provide the packet send scheduling on mbuf timestamps the ``tx_pp``
  parameter should be specified.
  When the PMD sees RTE_MBUF_DYNFLAG_TX_TIMESTAMP_NAME set on the packet
  being sent it tries to synchronize the time of the packet appearing on
  the wire with the specified packet timestamp. If the specified timestamp
  is in the past it is ignored, if it is in the distant future
  it is capped with some reasonable value (in the range of seconds).
  These specific cases ("too late" and "distant future") can be optionally
  reported via device xstats to assist applications to detect the
  time-related problems.

  The timestamp upper "too-distant-future" limit
  at the moment of invoking the Tx burst routine
  can be estimated as the ``tx_pp`` option (in nanoseconds) multiplied by 2^23.
  Please note, for the testpmd txonly mode,
  the limit is deduced from the expression::

     (n_tx_descriptors / burst_size + 1) * inter_burst_gap

  No packet reordering according to timestamps is performed,
  neither within a packet burst nor between packets; it is entirely
  the application's responsibility to generate packets and their timestamps
  in the desired order. The timestamp can be put only in the first packet
  of a burst to schedule the entire burst.
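
  As a minimal sketch (not the only possible approach), the application registers
  the timestamp dynamic field and flag once, then stamps packets before ``rte_eth_tx_burst()``:

  .. code-block:: c

     #include <stdalign.h>
     #include <rte_mbuf.h>
     #include <rte_mbuf_dyn.h>

     static int ts_offset = -1; /* offset of the timestamp dynamic field */
     static uint64_t ts_flag;   /* Tx timestamp dynamic flag mask */

     static int
     register_tx_timestamp(void)
     {
         const struct rte_mbuf_dynfield field_desc = {
             .name = RTE_MBUF_DYNFIELD_TIMESTAMP_NAME,
             .size = sizeof(rte_mbuf_timestamp_t),
             .align = alignof(rte_mbuf_timestamp_t),
         };
         const struct rte_mbuf_dynflag flag_desc = {
             .name = RTE_MBUF_DYNFLAG_TX_TIMESTAMP_NAME,
         };
         int flag_bit;

         ts_offset = rte_mbuf_dynfield_register(&field_desc);
         flag_bit = rte_mbuf_dynflag_register(&flag_desc);
         if (ts_offset < 0 || flag_bit < 0)
             return -1;
         ts_flag = 1ULL << flag_bit;
         return 0;
     }

     /* Request the packet to appear on the wire at 'when' (device clock, nanoseconds). */
     static void
     schedule_tx(struct rte_mbuf *m, uint64_t when)
     {
         *RTE_MBUF_DYNFIELD(m, ts_offset, rte_mbuf_timestamp_t *) = when;
         m->ol_flags |= ts_flag;
     }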

- E-Switch decapsulation Flow:

  - can be applied to PF port only.
  - must specify VF port action (packet redirection from PF to VF).
  - optionally may specify tunnel inner source and destination MAC addresses.

- E-Switch encapsulation Flow:

  - can be applied to VF ports only.
  - must specify PF port action (packet redirection from VF to PF).

- E-Switch Manager matching:

  - For BlueField with old FW
    which doesn't expose the E-Switch Manager vport ID in the capability,
    matching E-Switch Manager should be used only in BlueField embedded CPU mode.

- Raw encapsulation:

  - The input buffer, used as outer header, is not validated.

- Raw decapsulation:

  - The decapsulation is always done up to the outermost tunnel detected by the HW.
  - The input buffer, providing the removal size, is not validated.
  - The buffer size must match the length of the headers to be removed.

- Outer UDP checksum calculation for encapsulation flow actions:

  - Currently available NVIDIA NICs and DPUs do not have a capability to calculate
    the UDP checksum in the header added using encapsulation flow actions.

    Applications are required to use 0 in the UDP checksum field in such flow actions.
    The resulting packet will have the outer UDP checksum equal to 0.

- ICMP (code/type/identifier/sequence number) / ICMP6 (code/type/identifier/sequence number) matching,
  IP-in-IP and MPLS flow matching are all mutually exclusive features which cannot be supported together
  (see :ref:`mlx5_firmware_config`).

- LRO:

  - Requires DevX and DV flow to be enabled.
  - KEEP_CRC offload cannot be supported with LRO.
  - The first mbuf length, without head-room, must be big enough to include the
    TCP header (122B).
  - Rx queue with LRO offload enabled, receiving a non-LRO packet, can forward
    it with size limited to max LRO size, not to max RX packet length.
  - The driver rounds down the port configuration value ``max_lro_pkt_size``
    (from ``rte_eth_rxmode``) to a multiple of 256 due to hardware limitation.
  - LRO can be used with outer header of TCP packets of the standard format:
    eth (with or without vlan) / ipv4 or ipv6 / tcp / payload

    Other TCP packets (e.g. with MPLS label) received on Rx queue with LRO enabled, will be received with bad checksum.
  - LRO packet aggregation is performed by HW only for packet size larger than
    ``lro_min_mss_size``. This value is reported on device start, when debug
    mode is enabled.

- CRC:

  - ``RTE_ETH_RX_OFFLOAD_KEEP_CRC`` cannot be supported with decapsulation
    for some NICs (such as ConnectX-6 Dx, ConnectX-6 Lx, ConnectX-7, BlueField-2,
    and BlueField-3).
    The capability bit ``scatter_fcs_w_decap_disable`` shows NIC support.

- TX mbuf fast free:

  - fast free offload assumes that all mbufs being sent originate from the
    same memory pool and there are no extra references to the mbufs (the
    reference counter for each mbuf equals 1 on the tx_burst call). The latter
    means there should be no externally attached buffers in the mbufs. It is
    an application responsibility to provide the correct mbufs if the fast
    free offload is engaged. The mlx5 PMD implicitly produces the mbufs with
    externally attached buffers if the MPRQ option is enabled, hence, the fast
    free offload is neither supported nor advertised if MPRQ is enabled.
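
  As a minimal sketch (assuming the preconditions above hold for the application's mbufs),
  the offload can be requested at device configuration time when the capability is reported:

  .. code-block:: c

     #include <rte_ethdev.h>

     /* Request fast free for all Tx queues if the device advertises it. */
     static int
     configure_with_fast_free(uint16_t port_id, uint16_t nb_rxq, uint16_t nb_txq)
     {
         struct rte_eth_dev_info dev_info;
         struct rte_eth_conf conf = { 0 };
         int ret = rte_eth_dev_info_get(port_id, &dev_info);

         if (ret != 0)
             return ret;
         if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE)
             conf.txmode.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;
         return rte_eth_dev_configure(port_id, nb_rxq, nb_txq, &conf);
     }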

- Sample flow:

  - Supports ``RTE_FLOW_ACTION_TYPE_SAMPLE`` action only within NIC Rx and
    E-Switch steering domain.
  - In E-Switch steering domain, for sampling with sample ratio > 1 in a transfer rule,
    additional actions are not supported in the sample actions list.
  - For ConnectX-5, the ``RTE_FLOW_ACTION_TYPE_SAMPLE`` is typically used as the
    first action in the E-Switch egress flow if combined with header modify or
    encapsulation actions.
  - For NIC Rx flow, supports only ``MARK``, ``COUNT``, ``QUEUE``, ``RSS`` in the
    sample actions list.
  - In E-Switch steering domain, for mirroring with sample ratio = 1 in a transfer rule,
    supports only ``RAW_ENCAP``, ``PORT_ID``, ``REPRESENTED_PORT``, ``VXLAN_ENCAP``, ``NVGRE_ENCAP``
    in the sample actions list.
  - In E-Switch steering domain, for mirroring with sample ratio = 1 in a transfer rule,
    the encapsulation actions (``RAW_ENCAP`` or ``VXLAN_ENCAP`` or ``NVGRE_ENCAP``)
    support uplink port only.
  - In E-Switch steering domain, for mirroring with sample ratio = 1 in a transfer rule,
    the port actions (``PORT_ID`` or ``REPRESENTED_PORT``) with uplink port and ``JUMP`` action
    are not supported without the encapsulation actions
    (``RAW_ENCAP`` or ``VXLAN_ENCAP`` or ``NVGRE_ENCAP``) in the sample actions list.
  - For ConnectX-5 trusted device, the application metadata with SET_TAG index 0
    is not supported before the ``RTE_FLOW_ACTION_TYPE_SAMPLE`` action.

- Modify Field flow:

  - Supports the 'set' and 'add' operations for ``RTE_FLOW_ACTION_TYPE_MODIFY_FIELD`` action.
  - Modification of an arbitrary place in a packet via the special ``RTE_FLOW_FIELD_START`` Field ID is not supported.
  - Modify field action using ``RTE_FLOW_FIELD_RANDOM`` is not supported.
  - Modification of the 802.1Q tag is not supported.
  - Modification of VXLAN network or GENEVE network ID is supported only for HW steering.
  - Modification of the VXLAN header is supported with below limitations:

    - Only for HW steering (``dv_flow_en=2``).
    - Support VNI and the last reserved byte modifications for traffic
      with default UDP destination port: 4789 for VXLAN and VXLAN-GBP, 4790 for VXLAN-GPE.

  - Modification of GENEVE network ID is not supported when the configured
    ``FLEX_PARSER_PROFILE_ENABLE`` profile supports Geneve TLV options.
    See :ref:`mlx5_firmware_config` for more flex parser information.
  - Modification of GENEVE TLV option fields is supported only for HW steering.
    Only DWs configured in :ref:`parser creation <geneve_parser_api>` can be modified,
    'type' and 'class' fields can be modified when ``match_on_class_mode=2``.
  - Modification of GENEVE TLV option data supports one DW per action.
  - Offsets cannot skip past the boundary of a field.
  - If the field type is ``RTE_FLOW_FIELD_MAC_TYPE``
    and the packet contains one or more VLAN headers,
    the meaningful type field following the last VLAN header
    is used as the modify field operation argument.
    The modify field action is not intended to modify VLAN headers type field,
    dedicated VLAN push and pop actions should be used instead.
  - For packet fields (e.g. MAC addresses, IPv4 addresses or L4 ports)
    offset specifies the number of bits to skip from field's start,
    starting from MSB in the first byte, in the network order.
  - For flow metadata fields (e.g. META or TAG)
    offset specifies the number of bits to skip from field's start,
    starting from LSB in the least significant byte, in the host order.
  - Modification of the MPLS header is supported with some limitations:

    - Only in HW steering.
    - Only in ``src`` field.
    - Only for outermost tunnel header (``level=2``).
      For ``RTE_FLOW_FIELD_MPLS``,
      the default encapsulation level ``0`` describes the outermost tunnel header.

    .. note::

       The default encapsulation level ``0`` describes
       the "outermost that match is supported",
       currently it is the first tunnel,
       but it can be changed to outer when it is supported.

  - Default encapsulation level ``0`` describes outermost.
  - Encapsulation level ``2`` is supported with some limitations:

    - Only in HW steering.
    - Only in ``src`` field.
    - ``RTE_FLOW_FIELD_VLAN_ID`` is not supported.
    - ``RTE_FLOW_FIELD_IPV4_PROTO`` is not supported.
    - ``RTE_FLOW_FIELD_IPV6_PROTO/DSCP/ECN`` are not supported.
    - ``RTE_FLOW_FIELD_ESP_PROTO/SPI/SEQ_NUM`` are not supported.
    - ``RTE_FLOW_FIELD_TCP_SEQ/ACK_NUM`` are not supported.
    - Second tunnel fields are not supported.

  - Encapsulation levels greater than ``2`` are not supported.

- Age action:

  - with HW steering (``dv_flow_en=2``)

    - Using the same indirect count action combined with multiple age actions
      in different flows may cause a wrong age state for the age actions.
    - Creating/destroying flow rules with indirect age action when it is active
      (timeout != 0) may cause a wrong age state for the indirect age action.

    - The driver reuses counters for the aging action, so for optimization
      the values in the ``rte_flow_port_attr`` structure should describe:

      - ``nb_counters`` is the number of flow rules using counter (with/without age)
        in addition to flow rules using only age (without count action).
      - ``nb_aging_objects`` is the number of flow rules containing age action.
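
    As a minimal sketch (illustrative numbers only), these hints are passed
    to ``rte_flow_configure()`` before template tables are created:

    .. code-block:: c

       #include <rte_flow.h>

       static int
       size_flow_engine(uint16_t port_id)
       {
           struct rte_flow_error error;
           const struct rte_flow_port_attr port_attr = {
               .nb_counters = 1 << 20,      /* rules using COUNT and/or AGE */
               .nb_aging_objects = 1 << 16, /* rules using AGE */
           };
           const struct rte_flow_queue_attr queue_attr = { .size = 1024 };
           const struct rte_flow_queue_attr *queue_attrs[] = { &queue_attr };

           return rte_flow_configure(port_id, &port_attr, 1, queue_attrs, &error);
       }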

- IPv6 header item 'proto' field, indicating the next header protocol, should
  not be set as extension header.
  In case the next header is an extension header, it should not be specified in
  the IPv6 header item 'proto' field.
  The last extension header item 'next header' field can specify the following
  header protocol type.

- Match on IPv6 routing extension header supports the following fields only:

  - ``type``
  - ``next_hdr``
  - ``segments_left``

  Only supports HW steering (``dv_flow_en=2``).

- IPv6 routing extension push/remove:

  - Supported only with HW Steering enabled (``dv_flow_en=2``).
  - Supported in non-zero group
    (no limits on transfer domain if ``fdb_def_rule_en=1`` which is default).
  - Only supports TCP or UDP as next layer.
  - IPv6 routing header must be the only present extension.
  - Not supported on guest port.

- NAT64 action:

  - Supported only with HW Steering enabled (``dv_flow_en`` = 2).
  - FW version: at least ``XX.39.1002``.
  - Supported only on non-root table.
  - Actions order limitation should follow the modify fields action.
  - The last 2 TAG registers will be used implicitly in address backup mode.
  - Even if the action can be shared, new steering entries will be created per flow rule.
    It is recommended to share a single rule with NAT64
    to reduce the duplication of entries.
    The default address and other fields conversion will be handled with the NAT64 action.
    To support other addresses, new rule(s) with modify fields on the IP addresses should be created.
  - TOS / Traffic Class is currently not supported.

- Hairpin:

  - Hairpin between two ports supports only manual binding and explicit Tx flow mode.
    For single port hairpin, all the combinations of auto/manual binding
    and explicit/implicit Tx flow mode are supported.
  - Hairpin in switchdev SR-IOV mode is not supported so far.
  - ``out_of_buffer`` statistics are not available on:

    - NICs older than ConnectX-7.
    - DPUs older than BlueField-3.

- Quota:

  - Quota is implemented for HWS / template API.
  - Maximal value for quota SET and ADD operations is INT32_MAX (2GB).
  - Application cannot use 2 consecutive ADD updates.
    The next token update after an ADD must always be a SET.
  - Quota flow action cannot be used with Meter or CT flow actions in the same rule.
  - Quota flow action and item are supported in non-root HWS tables.
  - Maximal number of HW quota and HW meter objects <= 16e6.

- Meter:

  - All the meter colors with drop action will be counted only by the global drop statistics.
  - Yellow detection is only supported with ASO metering.
  - Red color must be with drop action.
  - Meter statistics are supported only for drop case.
  - A meter action created with a pre-defined policy must be the last action in the flow,
    except for the single case where the policy actions are:

    - green: NULL or END.
    - yellow: NULL or END.
    - RED: DROP / END.

  - The only supported meter policy actions:

    - green: QUEUE, RSS, PORT_ID, REPRESENTED_PORT, JUMP, DROP, MODIFY_FIELD, MARK, METER and SET_TAG.
    - yellow: QUEUE, RSS, PORT_ID, REPRESENTED_PORT, JUMP, DROP, MODIFY_FIELD, MARK, METER and SET_TAG.
    - RED: must be DROP.

  - Policy actions of RSS for green and yellow should have the same configuration except queues.
  - Policy with RSS/queue action is not supported when ``dv_xmeta_en`` is enabled.
  - If green action is METER, yellow action must be the same METER action or NULL.
  - Meter profile packet mode is supported.
  - Meter profiles of RFC2697, RFC2698 and RFC4115 are supported.
  - RFC4115 implementation is following MEF, meaning yellow traffic may reclaim unused green
    bandwidth when the green token bucket is full.
  - When using DV flow engine (``dv_flow_en`` = 1),
    if a meter has drop count
    or a meter hierarchy contains any meter that uses drop count,
    it cannot be used by a flow rule matching all ports.
  - When using DV flow engine (``dv_flow_en`` = 1),
    if a meter hierarchy contains any meter that has MODIFY_FIELD/SET_TAG,
    it cannot be used by a flow matching all ports.
  - When using HWS flow engine (``dv_flow_en`` = 2),
    only meter mark action is supported.

- Ptype:

  - Only supports HW steering (``dv_flow_en=2``).
  - The supported values are:

    - L2: ``RTE_PTYPE_L2_ETHER``, ``RTE_PTYPE_L2_ETHER_VLAN``, ``RTE_PTYPE_L2_ETHER_QINQ``
    - L3: ``RTE_PTYPE_L3_IPV4``, ``RTE_PTYPE_L3_IPV6``
    - L4: ``RTE_PTYPE_L4_TCP``, ``RTE_PTYPE_L4_UDP``, ``RTE_PTYPE_L4_ICMP``

    Their ``RTE_PTYPE_INNER_XXX`` counterparts are supported as well, and so is ``RTE_PTYPE_TUNNEL_ESP``.
    Any other values are not supported. Using them as a value will cause unexpected behavior.
  - Matching on both outer and inner IP fragmented is supported
    using ``RTE_PTYPE_L4_FRAG`` and ``RTE_PTYPE_INNER_L4_FRAG`` values.
    They are not part of L4 types, so they should be provided explicitly
    as a mask value during pattern template creation.
    Providing ``RTE_PTYPE_L4_MASK`` during pattern template creation
    and ``RTE_PTYPE_L4_FRAG`` during flow rule creation
    will cause unexpected behavior.

- Integrity:

  - Verification bits provided by the hardware are ``l3_ok``, ``ipv4_csum_ok``, ``l4_ok``, ``l4_csum_ok``.
  - ``level`` value 0 references outer headers.
  - Negative integrity item verification is not supported.

  - With SW steering (``dv_flow_en=1``):

    - Integrity offload is enabled starting from **ConnectX-6 Dx**.
    - Multiple integrity items are not supported in a single flow rule.
    - Flow rule items supplied by the application must explicitly specify
      the network headers referred to by the integrity item.

      For example, if the integrity item mask sets the ``l4_ok`` or ``l4_csum_ok`` bits,
      a reference to the L4 network header, TCP or UDP, must be in the rule pattern as well::

         flow create 0 ingress pattern integrity level is 0 value mask l3_ok value spec l3_ok / eth / ipv6 / end ...
         flow create 0 ingress pattern integrity level is 0 value mask l4_ok value spec l4_ok / eth / ipv4 proto is udp / end ...

  - With HW steering (``dv_flow_en=2``):

    - The ``l3_ok`` field represents all L3 checks, but nothing about IPv4 checksum.
    - The ``l4_ok`` field represents all L4 checks including L4 checksum.

- Connection tracking:

  - Cannot co-exist with ASO meter, ASO age action in a single flow rule.
  - Flow rules insertion rate and memory consumption need more optimization.
  - 16 ports maximum (with ``dv_flow_en=1``).
  - 32M connections maximum.

- Multi-thread flow insertion:

  - In order to achieve best insertion rate, the application should manage the flows per lcore.
  - Better to disable memory reclaim by setting ``reclaim_mem_mode`` to 0
    to accelerate the flow object allocation and release with cache.

- HW hashed bonding

  - TXQ affinity subjects to HW hash once enabled.

- Bonding under socket direct mode

  - Needs MLNX_OFED 5.4+.

- Match on aggregated affinity:

  - Supports NIC ingress flow in group 0.
  - Supports E-Switch flow in group 0 and depends on
    device-managed flow steering (DMFS) mode.

- Timestamps:

  - CQE timestamp field width is limited by hardware to 63 bits, MSB is zero.
  - In the free-running mode the timestamp counter is reset on power on
    and the 63-bit value provides over 1800 years of uptime till overflow.
  - In the real-time mode
    (configurable with ``REAL_TIME_CLOCK_ENABLE`` firmware settings),
    the timestamp presents the nanoseconds elapsed since 01-Jan-1970,
    hardware timestamp overflow will happen on 19-Jan-2038
    (0x80000000 seconds since 01-Jan-1970).
  - The send scheduling is based on timestamps
    from the reference "Clock Queue" completions,
    the scheduled send timestamps should not be specified with non-zero MSB.

- Match on GRE header supports the following fields:

  - c_rsvd0_v: C bit, K bit, S bit
  - protocol type
  - checksum
  - key
  - sequence

  Matching on checksum and sequence needs MLNX_OFED 5.6+.

- Matching on NVGRE header:

  - c_rc_k_s_rsvd0_ver
  - protocol
  - tni
  - flow_id

  In SW steering (``dv_flow_en`` = 1), only tni is supported.
  In HW steering (``dv_flow_en`` = 2), all fields are supported.

- NIC egress flow rules on representor port are not supported.

- In switch mode, a flow rule matching the ``RTE_FLOW_ITEM_TYPE_REPRESENTED_PORT`` item
  with port ID ``UINT16_MAX`` means matching packets sent by the E-Switch manager from software.
  Needs MLNX_OFED 24.04+.

- A driver limitation for the ``RTE_FLOW_ACTION_TYPE_PORT_REPRESENTOR`` action
  restricts the ``port_id`` configuration to only accept the value ``0xffff``,
  indicating the E-Switch manager.
  If the ``repr_matching_en`` flag is enabled, the traffic will be directed
  to the representor of the source virtual port (SF/VF), while if it is disabled,
  the traffic will be routed based on the steering rules in the ingress domain.

- Send to kernel action (``RTE_FLOW_ACTION_TYPE_SEND_TO_KERNEL``):

  - Supported on non-root table.
  - Supported in isolated mode.
  - In HW steering (``dv_flow_en`` = 2):

    - not supported on guest port.

- During live migration to a new process setting its flow engine to standby mode,
  the user should only program flow rules in group 0 (``fdb_def_rule_en=0``).
  Live migration is only supported under SWS (``dv_flow_en=1``).
  The flow group 0 is shared between DPDK processes
  while the other flow groups are limited to the current process.
  The flow engine of a process cannot move from active to standby mode
  if preceding active application rules are still present and vice versa.


Statistics
----------

MLX5 supports various methods to report statistics:

Port statistics can be queried using ``rte_eth_stats_get()``.
The received and sent statistics are collected by software only and count the number
of packets received or sent successfully by the PMD.
The imissed counter is the amount of packets that could not be delivered to SW
because a queue was full.
Packets not received due to congestion in the bus or on the NIC
can be queried via the rx_discards_phy xstats counter.

Extended statistics can be queried using ``rte_eth_xstats_get()``.
The extended statistics expose a wider set of counters counted by the device.
The extended port statistics count the number of packets received or sent successfully by the port.
As NVIDIA NICs are using the :ref:`Bifurcated Linux Driver <linux_gsg_linux_drivers>`,
those counters also count packets received or sent by the Linux kernel.
The counters with ``_phy`` suffix count the total events on the physical port,
and are therefore not valid for VF.

Finally, per-flow statistics can be queried using ``rte_flow_query`` when attaching
a count action to a specific flow. The flow counter counts the number of packets
received successfully by the port that match the specific flow.
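
As a minimal sketch (assuming ``flow`` was created with a ``COUNT`` action),
the per-flow counter can be read back as follows:

.. code-block:: c

   #include <rte_flow.h>

   /* Read hit/byte counters of a flow rule created with a COUNT action. */
   static int
   read_flow_count(uint16_t port_id, struct rte_flow *flow,
                   uint64_t *hits, uint64_t *bytes)
   {
       const struct rte_flow_action count_action = {
           .type = RTE_FLOW_ACTION_TYPE_COUNT,
       };
       struct rte_flow_query_count query = { .reset = 0 };
       struct rte_flow_error error;
       int ret = rte_flow_query(port_id, flow, &count_action, &query, &error);

       if (ret != 0)
           return ret;
       *hits = query.hits_set ? query.hits : 0;
       *bytes = query.bytes_set ? query.bytes : 0;
       return 0;
   }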


Extended Statistics Counters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Send Scheduling Counters
^^^^^^^^^^^^^^^^^^^^^^^^

The mlx5 PMD provides a comprehensive set of counters designed for
debugging and diagnostics related to packet scheduling during transmission.
These counters are applicable only if the port was configured with the ``tx_pp`` devarg
and reflect the status of the PMD scheduling infrastructure
based on Clock and Rearm Queues, used as a workaround on ConnectX-6 Dx NICs.

``tx_pp_missed_interrupt_errors``
   Indicates that the Rearm Queue interrupt was not serviced on time.
   The EAL manages interrupts in a dedicated thread,
   and it is possible that other time-consuming actions were being processed concurrently.

``tx_pp_rearm_queue_errors``
   Signifies hardware errors that occurred on the Rearm Queue,
   typically caused by delays in servicing interrupts.

``tx_pp_clock_queue_errors``
   Reflects hardware errors on the Clock Queue,
   which usually indicate configuration issues
   or problems with the internal NIC hardware or firmware.

``tx_pp_timestamp_past_errors``
   Tracks attempts by the application to send packets with timestamps set in the past.
   It is useful for debugging application code
   and does not indicate a malfunction of the PMD.

``tx_pp_timestamp_future_errors``
   Records attempts by the application to send packets
   with timestamps set too far into the future,
   exceeding the hardware's scheduling capabilities.
   Like the previous counter, it aids in application debugging
   without suggesting a PMD malfunction.

``tx_pp_jitter``
   Measures the internal NIC real-time clock jitter estimation
   between two consecutive Clock Queue completions, expressed in nanoseconds.
   Significant jitter may signal potential clock synchronization issues,
   possibly due to inappropriate adjustments
   made by a system PTP (Precision Time Protocol) agent.

``tx_pp_wander``
   Indicates the long-term stability of the internal NIC real-time clock
   over 2^24 completions, measured in nanoseconds.
   Significant wander may also suggest clock synchronization problems.

``tx_pp_sync_lost``
   A general operational indicator;
   a non-zero value indicates that the driver has lost synchronization with the Clock Queue,
   resulting in improper scheduling operations.
   To restore correct scheduling functionality, it is necessary to restart the port.

The following counters are particularly valuable for verifying and debugging application code.
They do not indicate driver or hardware malfunctions
and are applicable to newer hardware with direct on-time scheduling capabilities
(such as ConnectX-7 and above):

``tx_pp_timestamp_order_errors``
   Indicates attempts by the application to send packets
   with timestamps that are not in strictly ascending order.
   Since the PMD does not reorder packets within hardware queues,
   violations of timestamp order can lead to packets being sent at incorrect times.
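
These counters are retrieved through the standard xstats API.
As a minimal sketch (a fixed-size buffer is used for brevity),
an application can filter them by the ``tx_pp_`` prefix:

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>
   #include <string.h>
   #include <rte_ethdev.h>

   /* Print all send scheduling related extended statistics of a port. */
   static void
   dump_tx_pp_xstats(uint16_t port_id)
   {
       struct rte_eth_xstat_name names[512];
       struct rte_eth_xstat values[512];
       int nb = rte_eth_xstats_get_names(port_id, NULL, 0);

       if (nb <= 0 || nb > 512)
           return;
       if (rte_eth_xstats_get_names(port_id, names, nb) != nb ||
           rte_eth_xstats_get(port_id, values, nb) != nb)
           return;
       for (int i = 0; i < nb; i++)
           if (strncmp(names[values[i].id].name, "tx_pp_", 6) == 0)
               printf("%s: %" PRIu64 "\n",
                      names[values[i].id].name, values[i].value);
   }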


Compilation
-----------

See :ref:`mlx5 common compilation <mlx5_common_compilation>`.


Configuration
-------------

Environment Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~

See :ref:`mlx5 common configuration <mlx5_common_env>`.

Firmware configuration
~~~~~~~~~~~~~~~~~~~~~~

See :ref:`mlx5_firmware_config` guide.

Runtime Configuration
~~~~~~~~~~~~~~~~~~~~~

Please refer to :ref:`mlx5 common options <mlx5_common_driver_options>`
for an additional list of options shared with other mlx5 drivers.

- ``rxq_cqe_comp_en`` parameter [int]

  A nonzero value enables the compression of CQE on RX side. This feature
  allows saving PCI bandwidth and improves performance. Enabled by default.
  Different compression formats are supported in order to achieve the best
  performance for different traffic patterns. The default format depends on
  Multi-Packet Rx queue configuration: Hash RSS format is used in case
  MPRQ is disabled, Checksum format is used in case MPRQ is enabled.

  The lower 3 bits define the CQE compression format:
  Specifying 2 in these bits of the ``rxq_cqe_comp_en`` parameter selects
  the flow tag format for better compression rate in case of flow mark traffic.
  Specifying 3 in these bits selects checksum format.
  Specifying 4 in these bits selects L3/L4 header format for
  better compression rate in case of mixed TCP/UDP and IPv4/IPv6 traffic.
  CQE compression format selection requires DevX to be enabled. If there is
  no DevX enabled/supported the value is reset to 1 by default.

  The 8th bit defines the CQE compression layout.
  Setting this bit to 1 turns enhanced CQE compression layout on.
  Enhanced CQE compression is designed for better latency and SW utilization.
  This bit is ignored if only the basic CQE compression layout is supported.

  Supported on:

  - x86_64 with ConnectX-4, ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
    ConnectX-6 Lx, ConnectX-7, BlueField, BlueField-2, and BlueField-3.
  - POWER9 and ARMv8 with ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
    ConnectX-6 Lx, ConnectX-7, BlueField, BlueField-2, and BlueField-3.

- ``rxq_pkt_pad_en`` parameter [int]

  A nonzero value enables padding Rx packet to the size of cacheline on PCI
  transaction. This feature would waste PCI bandwidth but could improve
  performance by avoiding partial cacheline write which may cause costly
  read-modify-copy in memory transaction on some architectures. Disabled by
  default.

  Supported on:

  - x86_64 with ConnectX-4, ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
    ConnectX-6 Lx, ConnectX-7, BlueField, BlueField-2, and BlueField-3.
  - POWER8 and ARMv8 with ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
    ConnectX-6 Lx, ConnectX-7, BlueField, BlueField-2, and BlueField-3.

- ``delay_drop`` parameter [int]

  Bitmask value for the Rx queue delay drop attribute. Bit 0 is used for the
  standard Rx queue and bit 1 is used for the hairpin Rx queue. By default, the
  delay drop is disabled for all Rx queues. It will be ignored if the port does
  not support the attribute even if it is enabled explicitly.

  The packets being received will not be dropped immediately when the WQEs are
  exhausted in a Rx queue with delay drop enabled.

  A timeout value is set in the driver to control the waiting time before
  dropping a packet. Once the timer expires, the delay drop will be
  deactivated for all the Rx queues with this feature enabled. To re-activate
  it, a rearming is needed and it is part of the kernel driver starting from
  MLNX_OFED 5.5.

  To enable / disable the delay drop rearming, the private flag ``dropless_rq``
  can be set and queried via ethtool:

  - ethtool --set-priv-flags <netdev> dropless_rq on (/ off)
  - ethtool --show-priv-flags <netdev>

  The configuration flag is global per PF and can only be set on the PF, once
  it is on, all the VFs', SFs' and representors' Rx queues will share the timer
  and rearming.

- ``mprq_en`` parameter [int]

  A nonzero value enables configuring Multi-Packet Rx queues. Rx queue is
  configured as Multi-Packet RQ if the total number of Rx queues is
  ``rxqs_min_mprq`` or more. Disabled by default.

  Multi-Packet Rx Queue (MPRQ a.k.a Striding RQ) can further save PCIe bandwidth
  by posting a single large buffer for multiple packets. Instead of posting a
  buffer per packet, one large buffer is posted in order to receive multiple
  packets on the buffer. A MPRQ buffer consists of multiple fixed-size strides
  and each stride receives one packet. MPRQ can improve throughput for
  small-packet traffic.

  When MPRQ is enabled, MTU can be larger than the size of
  user-provided mbuf even if RTE_ETH_RX_OFFLOAD_SCATTER isn't enabled. PMD will
  configure a stride size large enough to accommodate the MTU as long as the
  device allows. Note that this can waste system memory compared to enabling Rx
  scatter and multi-segment packet.

- ``mprq_log_stride_num`` parameter [int]

  Log 2 of the number of strides for Multi-Packet Rx queue. Configuring more
  strides can reduce PCIe traffic further. If the configured value is not in the
  range of device capability, the default value will be set with a warning
  message. The default value is 4 which is 16 strides per buffer, valid only
  if ``mprq_en`` is set.

  The size of Rx queue should be bigger than the number of strides.

- ``mprq_log_stride_size`` parameter [int]

  Log 2 of the size of a stride for Multi-Packet Rx queue. Configuring a smaller
  stride size can save some memory and reduce probability of a depletion of all
  available strides due to unreleased packets by an application. If the configured
  value is not in the range of device capability, the default value will be set
  with a warning message. The default value is 11 which is 2048 bytes per
  stride, valid only if ``mprq_en`` is set. With ``mprq_log_stride_size`` set
  it is possible for a packet to span across multiple strides. This mode allows
  support of jumbo frames (9K) with MPRQ. The memcopy of some packets (or part
  of a packet if Rx scatter is configured) may be required in case there is no
  space left for a head room at the end of a stride which incurs some
  performance penalty.

- ``mprq_max_memcpy_len`` parameter [int]

  The maximum length of packet to memcpy in case of Multi-Packet Rx queue. Rx
  packet is mem-copied to a user-provided mbuf if the size of Rx packet is less
  than or equal to this parameter. Otherwise, PMD will attach the Rx packet to
  the mbuf by external buffer attachment - ``rte_pktmbuf_attach_extbuf()``.
  A mempool for external buffers will be allocated and managed by PMD. If Rx
  packet is externally attached, ol_flags field of the mbuf will have
  RTE_MBUF_F_EXTERNAL and this flag must be preserved. ``RTE_MBUF_HAS_EXTBUF()``
  checks the flag. The default value is 128, valid only if ``mprq_en`` is set.

- ``rxqs_min_mprq`` parameter [int]

  Configure Rx queues as Multi-Packet RQ if the total number of Rx queues is
  greater or equal to this value. The default value is 12, valid only if
  ``mprq_en`` is set.

- ``txq_inline`` parameter [int]

  Amount of data to be inlined during TX operations. This parameter is
  deprecated and converted to the new parameter ``txq_inline_max`` providing
  partial compatibility.

- ``txqs_min_inline`` parameter [int]

  Enable inline data send only when the number of TX queues is greater or equal
  to this value.

  This option should be used in combination with ``txq_inline_max`` and
  ``txq_inline_mpw`` below and does not affect ``txq_inline_min`` settings above.

  If this option is not specified the default value 16 is used for BlueField
  and 8 for other platforms.

  The data inlining consumes the CPU cycles, so this option is intended to
  auto enable inline data if we have enough Tx queues, which means we have
  enough CPU cores and PCI bandwidth is getting more critical and CPU
  is not supposed to be the bottleneck anymore.

  Copying data into the WQE improves latency and can improve PPS performance
  when PCI back pressure is detected and may be useful for scenarios involving
  heavy traffic on many queues.

  Because additional software logic is necessary to handle this mode, this
  option should be used with care, as it may lower performance when back
  pressure is not expected.

  If inline data are enabled it may affect the maximal size of Tx queue in
  descriptors because the inline data increase the descriptor size and
  queue size limits supported by hardware may be exceeded.

- ``txq_inline_min`` parameter [int]

  Minimal amount of data to be inlined into WQE during Tx operations. NICs
  may require this minimal data amount to operate correctly. The exact value
  may depend on NIC operation mode, requested offloads, etc. It is strongly
  recommended to omit this parameter and use the default values. Anyway,
  applications using this parameter should take into consideration that
  specifying an inconsistent value may prevent the NIC from sending packets.

  If ``txq_inline_min`` key is present the specified value (may be aligned
  by the driver in order not to exceed the limits and provide better descriptor
  space utilization) will be used by the driver and it is guaranteed that the
  requested amount of data bytes are inlined into the WQE beside other inline
  settings. This key also may update ``txq_inline_max`` value (default
  or specified explicitly in devargs) to reserve the space for inline data.

  If ``txq_inline_min`` key is not present, the value may be queried by the
  driver from the NIC via DevX if this feature is available. If there is no DevX
  enabled/supported the value 18 (supposing L2 header including VLAN) is set
  for ConnectX-4 and ConnectX-4 Lx, and 0 is set by default for ConnectX-5
  and newer NICs. If the packet is shorter than the ``txq_inline_min`` value,
  the entire packet is inlined.

  For ConnectX-4 NIC, the driver does not allow specifying a value below 18
  (minimal L2 header, including VLAN), an error will be raised.

  For ConnectX-4 Lx NIC, it is allowed to specify values below 18, but
  it is not recommended and may prevent the NIC from sending packets over
  some configurations.

  For ConnectX-4 and ConnectX-4 Lx NICs, the automatically configured value
  is insufficient for some traffic, because they require at least all L2 headers
  to be inlined. For example, Q-in-Q adds 4 bytes to the default 18 bytes
  of Ethernet and VLAN, thus ``txq_inline_min`` must be set to 22.
  MPLS would add 4 bytes per label. The final value must account for all
  possible L2 encapsulation headers used in the particular environment.

  Please note that this minimal data inlining disengages the eMPW feature
  (Enhanced Multi-Packet Write), because the latter does not support partial
  packet inlining. This is not very critical, since minimal data inlining is
  mostly required by ConnectX-4 and ConnectX-4 Lx, and these NICs do not support
  the eMPW feature anyway.

- ``txq_inline_max`` parameter [int]

  Specifies the maximal packet length to be completely inlined into the WQE
  Ethernet Segment for the ordinary SEND method. If a packet is larger than the
  specified value, the packet data is not copied by the driver at all and the
  data buffer is addressed with a pointer. If the packet length is less than or
  equal to this value, all packet data are copied into the WQE. This may improve
  PCI bandwidth utilization for short packets significantly, at the cost of
  extra CPU cycles.

  The data inline feature is controlled by the number of Tx queues: if the
  number of Tx queues is larger than the ``txqs_min_inline`` key parameter, the
  inline feature is engaged; if there are not enough Tx queues (which means not
  enough CPU cores, so CPU resources are scarce), data inlining is not performed
  by the driver. Assigning zero to ``txqs_min_inline`` always enables data
  inlining.

  The default ``txq_inline_max`` value is 290. The specified value may be
  adjusted by the driver in order not to exceed the limit (930 bytes) and to
  provide better WQE space filling without gaps; the adjustment is reflected in
  the debug log. Also, the default value (290) may be decreased at run time if a
  large transmit queue size is requested and the hardware does not support a
  sufficient descriptor amount; in this case a warning is emitted. If the
  ``txq_inline_max`` key is specified and the requested inline settings cannot
  be satisfied, an error is raised.

- ``txq_inline_mpw`` parameter [int]

  Specifies the maximal packet length to be completely inlined into the WQE for
  the Enhanced MPW method. If a packet is larger than the specified value, the
  packet data is not copied and the data buffer is addressed with a pointer. If
  the packet length is less than or equal to this value, all packet data are
  copied into the WQE. This may improve PCI bandwidth utilization for short
  packets significantly, at the cost of extra CPU cycles.

  The data inline feature is controlled by the number of Tx queues: if the
  number of Tx queues is larger than the ``txqs_min_inline`` key parameter, the
  inline feature is engaged; if there are not enough Tx queues (which means not
  enough CPU cores, so CPU resources are scarce), data inlining is not performed
  by the driver. Assigning zero to ``txqs_min_inline`` always enables data
  inlining.

  The default ``txq_inline_mpw`` value is 268. The specified value may be
  adjusted by the driver in order not to exceed the limit (930 bytes) and to
  provide better WQE space filling without gaps; the adjustment is reflected in
  the debug log.
  Since multiple packets may be included in the same WQE with the Enhanced
  Multi-Packet Write method and the overall WQE size is limited, it is not
  recommended to specify large values for ``txq_inline_mpw``. Also, the default
  value (268) may be decreased at run time if a large transmit queue size is
  requested and the hardware does not support a sufficient descriptor amount;
  in this case a warning is emitted. If the ``txq_inline_mpw`` key is specified
  and the requested inline settings cannot be satisfied, an error is raised.

- ``txqs_max_vec`` parameter [int]

  Enable vectorized Tx only when the number of Tx queues is less than or
  equal to this value. This parameter is deprecated and ignored, and is kept
  only for compatibility, so as not to prevent the driver from probing.

- ``txq_mpw_hdr_dseg_en`` parameter [int]

  A nonzero value enables including two pointers in the first block of the Tx
  descriptor. This parameter is deprecated and ignored, and is kept only for
  compatibility.

- ``txq_max_inline_len`` parameter [int]

  Maximum size of packet to be inlined. This limits the size of packets to
  be inlined. If the size of a packet is larger than the configured value, the
  packet isn't inlined even though there's enough space remaining in the
  descriptor. Instead, the packet is included by pointer. This parameter
  is deprecated and converted directly to ``txq_inline_mpw``, providing full
  compatibility. Valid only if the eMPW feature is engaged.

- ``txq_mpw_en`` parameter [int]

  A nonzero value enables Enhanced Multi-Packet Write (eMPW) for ConnectX-5,
  ConnectX-6, ConnectX-6 Dx, ConnectX-6 Lx, ConnectX-7, BlueField, BlueField-2,
  and BlueField-3. eMPW allows the Tx burst function to pack up multiple packets
  in a single descriptor session in order to save PCI bandwidth
  and improve performance at the cost of slightly higher CPU usage.
  When ``txq_inline_mpw`` is set along with ``txq_mpw_en``,
  the Tx burst function copies the entire packet data into the Tx descriptor
  instead of including a pointer to the packet.

  The Enhanced Multi-Packet Write feature is enabled by default if the NIC
  supports it, and can be disabled by explicitly specifying 0 for the
  ``txq_mpw_en`` option. Also, if minimal data inlining is requested by a
  non-zero ``txq_inline_min`` option or reported by the NIC, the eMPW feature
  is disengaged.

- ``tx_db_nc`` parameter [int]

  This parameter name is deprecated and ignored.
  The new name for this parameter is ``sq_db_nc``.
  See :ref:`common driver options <mlx5_common_driver_options>`.

- ``tx_pp`` parameter [int]

  If a nonzero value is specified, the driver creates all necessary internal
  objects to provide accurate packet send scheduling on mbuf timestamps.
  A positive value specifies the scheduling granularity in nanoseconds;
  packet sending is accurate up to the specified granularity. The allowed range
  is from 500 to 1 million nanoseconds. A negative value specifies the
  granularity by its absolute value and engages a special test mode to check
  the schedule rate. By default (if ``tx_pp`` is not specified), the send
  scheduling on timestamps feature is disabled.
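
  For illustration only (the PCI address is a placeholder), send scheduling
  with 500 ns granularity could be requested with a devargs string such as::

     dpdk-testpmd -a <PCI_BDF>,tx_pp=500 -- -i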
1374 1375 Starting with ConnectX-7 the capability to schedule traffic directly 1376 on timestamp specified in descriptor is provided, 1377 no extra objects are needed anymore and scheduling capability 1378 is advertised and handled regardless ``tx_pp`` parameter presence. 1379 1380- ``tx_skew`` parameter [int] 1381 1382 The parameter adjusts the send packet scheduling on timestamps and represents 1383 the average delay between beginning of the transmitting descriptor processing 1384 by the hardware and appearance of actual packet data on the wire. The value 1385 should be provided in nanoseconds and is valid only if ``tx_pp`` parameter is 1386 specified. The default value is zero. 1387 1388- ``tx_vec_en`` parameter [int] 1389 1390 A nonzero value enables Tx vector on ConnectX-5, ConnectX-6, ConnectX-6 Dx, 1391 ConnectX-6 Lx, ConnectX-7, BlueField, BlueField-2, and BlueField-3 NICs 1392 if the number of global Tx queues on the port is less than ``txqs_max_vec``. 1393 The parameter is deprecated and ignored. 1394 1395- ``rx_vec_en`` parameter [int] 1396 1397 A nonzero value enables Rx vector if the port is not configured in 1398 multi-segment otherwise this parameter is ignored. 1399 1400 Enabled by default. 1401 1402- ``vf_nl_en`` parameter [int] 1403 1404 A nonzero value enables Netlink requests from the VF to add/remove MAC 1405 addresses or/and enable/disable promiscuous/all multicast on the Netdevice. 1406 Otherwise the relevant configuration must be run with Linux iproute2 tools. 1407 This is a prerequisite to receive this kind of traffic. 1408 1409 Enabled by default, valid only on VF devices ignored otherwise. 1410 1411- ``l3_vxlan_en`` parameter [int] 1412 1413 A nonzero value allows L3 VXLAN and VXLAN-GPE flow creation. To enable 1414 L3 VXLAN or VXLAN-GPE, users has to configure firmware and enable this 1415 parameter. This is a prerequisite to receive this kind of traffic. 1416 1417 Disabled by default. 1418 1419- ``dv_xmeta_en`` parameter [int] 1420 1421 A nonzero value enables extensive flow metadata support if device is 1422 capable and driver supports it. This can enable extensive support of 1423 ``MARK`` and ``META`` item of ``rte_flow``. The newly introduced 1424 ``SET_TAG`` and ``SET_META`` actions do not depend on ``dv_xmeta_en``. 1425 1426 There are some possible configurations, depending on parameter value: 1427 1428 - 0, this is default value, defines the legacy mode, the ``MARK`` and 1429 ``META`` related actions and items operate only within NIC Tx and 1430 NIC Rx steering domains, no ``MARK`` and ``META`` information crosses 1431 the domain boundaries. The ``MARK`` item is 24 bits wide, the ``META`` 1432 item is 32 bits wide and match supported on egress only 1433 when ``dv_flow_en`` = 1. 1434 1435 - 1, this engages extensive metadata mode, the ``MARK`` and ``META`` 1436 related actions and items operate within all supported steering domains, 1437 including FDB, ``MARK`` and ``META`` information may cross the domain 1438 boundaries. The ``MARK`` item is 24 bits wide, the ``META`` item width 1439 depends on kernel and firmware configurations and might be 0, 16 or 1440 32 bits. Within NIC Tx domain ``META`` data width is 32 bits for 1441 compatibility, the actual width of data transferred to the FDB domain 1442 depends on kernel configuration and may be vary. The actual supported 1443 width can be retrieved in runtime by series of rte_flow_validate() 1444 trials. 
1445 1446 - 2, this engages extensive metadata mode, the ``MARK`` and ``META`` 1447 related actions and items operate within all supported steering domains, 1448 including FDB, ``MARK`` and ``META`` information may cross the domain 1449 boundaries. The ``META`` item is 32 bits wide, the ``MARK`` item width 1450 depends on kernel and firmware configurations and might be 0, 16 or 1451 24 bits. The actual supported width can be retrieved in runtime by 1452 series of rte_flow_validate() trials. 1453 1454 - 3, this engages tunnel offload mode. In E-Switch configuration, that 1455 mode implicitly activates ``dv_xmeta_en=1``. 1456 1457 - 4, this mode is only supported in HWS (``dv_flow_en=2``). 1458 The Rx/Tx metadata with 32b width copy between FDB and NIC is supported. 1459 The mark is only supported in NIC and there is no copy supported. 1460 1461 +------+-----------+-----------+-------------+-------------+ 1462 | Mode | ``MARK`` | ``META`` | ``META`` Tx | FDB/Through | 1463 +======+===========+===========+=============+=============+ 1464 | 0 | 24 bits | 32 bits | 32 bits | no | 1465 +------+-----------+-----------+-------------+-------------+ 1466 | 1 | 24 bits | vary 0-32 | 32 bits | yes | 1467 +------+-----------+-----------+-------------+-------------+ 1468 | 2 | vary 0-24 | 32 bits | 32 bits | yes | 1469 +------+-----------+-----------+-------------+-------------+ 1470 1471 If there is no E-Switch configuration the ``dv_xmeta_en`` parameter is 1472 ignored and the device is configured to operate in legacy mode (0). 1473 1474 Disabled by default (set to 0). 1475 1476 The Direct Verbs/Rules (engaged with ``dv_flow_en`` = 1) supports all 1477 of the extensive metadata features. The legacy Verbs supports FLAG and 1478 MARK metadata actions over NIC Rx steering domain only. 1479 1480 Setting META value to zero in flow action means there is no item provided 1481 and receiving datapath will not report in mbufs the metadata are present. 1482 Setting MARK value to zero in flow action means the zero FDIR ID value 1483 will be reported on packet receiving. 1484 1485 For the MARK action the last 16 values in the full range are reserved for 1486 internal PMD purposes (to emulate FLAG action). The valid range for the 1487 MARK action values is 0-0xFFEF for the 16-bit mode and 0-0xFFFFEF 1488 for the 24-bit mode, the flows with the MARK action value outside 1489 the specified range will be rejected. 1490 1491- ``dv_flow_en`` parameter [int] 1492 1493 Value 0 means legacy Verbs flow offloading. 1494 1495 Value 1 enables the DV flow steering assuming it is supported by the 1496 driver (requires rdma-core 24 or higher). 1497 1498 Value 2 enables the WQE based hardware steering. 1499 In this mode, only queue-based flow management is supported. 1500 1501 It is configured by default to 1 (DV flow steering) if supported. 1502 Otherwise, the value is 0 which indicates legacy Verbs flow offloading. 1503 1504- ``dv_esw_en`` parameter [int] 1505 1506 A nonzero value enables E-Switch using Direct Rules. 1507 1508 Enabled by default if supported. 1509 1510- ``fdb_def_rule_en`` parameter [int] 1511 1512 A non-zero value enables to create a dedicated rule on E-Switch root table. 1513 This dedicated rule forwards all incoming packets into table 1. 1514 Other rules will be created in E-Switch table original table level plus one, 1515 to improve the flow insertion rate due to skipping root table managed by firmware. 1516 If set to 0, all rules will be created on the original E-Switch table level. 
1517 1518 By default, the PMD will set this value to 1. 1519 1520- ``lacp_by_user`` parameter [int] 1521 1522 A nonzero value enables the control of LACP traffic by the user application. 1523 When a bond exists in the driver, by default it should be managed by the 1524 kernel and therefore LACP traffic should be steered to the kernel. 1525 If this devarg is set to 1 it will allow the user to manage the bond by 1526 itself and not steer LACP traffic to the kernel. 1527 1528 Disabled by default (set to 0). 1529 1530- ``representor`` parameter [list] 1531 1532 This parameter can be used to instantiate DPDK Ethernet devices from 1533 existing port (PF, VF or SF) representors configured on the device. 1534 1535 It is a standard parameter whose format is described in 1536 :ref:`ethernet_device_standard_device_arguments`. 1537 1538 For instance, to probe VF port representors 0 through 2:: 1539 1540 <PCI_BDF>,representor=vf[0-2] 1541 1542 To probe SF port representors 0 through 2:: 1543 1544 <PCI_BDF>,representor=sf[0-2] 1545 1546 To probe VF port representors 0 through 2 on both PFs of bonding device:: 1547 1548 <Primary_PCI_BDF>,representor=pf[0,1]vf[0-2] 1549 1550- ``repr_matching_en`` parameter [int] 1551 1552 - 0. If representor matching is disabled, then there will be no implicit 1553 item added. As a result, ingress flow rules will match traffic 1554 coming to any port, not only the port on which flow rule is created. 1555 Because of that, default flow rules for ingress traffic cannot be created 1556 and port starts in isolated mode by default. Port cannot be switched back 1557 to non-isolated mode. 1558 1559 - 1. If representor matching is enabled (default setting), 1560 then each ingress pattern template has an implicit REPRESENTED_PORT 1561 item added. Flow rules based on this pattern template will match 1562 the vport associated with port on which rule is created. 1563 1564- ``max_dump_files_num`` parameter [int] 1565 1566 The maximum number of files per PMD entity that may be created for debug information. 1567 The files will be created in /var/log directory or in current directory. 1568 1569 set to 128 by default. 1570 1571- ``lro_timeout_usec`` parameter [int] 1572 1573 The maximum allowed duration of an LRO session, in micro-seconds. 1574 PMD will set the nearest value supported by HW, which is not bigger than 1575 the input ``lro_timeout_usec`` value. 1576 If this parameter is not specified, by default PMD will set 1577 the smallest value supported by HW. 1578 1579- ``hp_buf_log_sz`` parameter [int] 1580 1581 The total data buffer size of a hairpin queue (logarithmic form), in bytes. 1582 PMD will set the data buffer size to 2 ** ``hp_buf_log_sz``, both for RX & TX. 1583 The capacity of the value is specified by the firmware and the initialization 1584 will get a failure if it is out of scope. 1585 The range of the value is from 11 to 19 right now, and the supported frame 1586 size of a single packet for hairpin is from 512B to 128KB. It might change if 1587 different firmware release is being used. By using a small value, it could 1588 reduce memory consumption but not work with a large frame. If the value is 1589 too large, the memory consumption will be high and some potential performance 1590 degradation will be introduced. 1591 By default, the PMD will set this value to 16, which means that 9KB jumbo 1592 frames will be supported. 1593 1594- ``reclaim_mem_mode`` parameter [int] 1595 1596 Cache some resources in flow destroy will help flow recreation more efficient. 
1597 While some systems may require the all the resources can be reclaimed after 1598 flow destroyed. 1599 The parameter ``reclaim_mem_mode`` provides the option for user to configure 1600 if the resource cache is needed or not. 1601 1602 There are three options to choose: 1603 1604 - 0. It means the flow resources will be cached as usual. The resources will 1605 be cached, helpful with flow insertion rate. 1606 1607 - 1. It will only enable the DPDK PMD level resources reclaim. 1608 1609 - 2. Both DPDK PMD level and rdma-core low level will be configured as 1610 reclaimed mode. 1611 1612 By default, the PMD will set this value to 0. 1613 1614- ``decap_en`` parameter [int] 1615 1616 Some devices do not support FCS (frame checksum) scattering for 1617 tunnel-decapsulated packets. 1618 If set to 0, this option forces the FCS feature and rejects tunnel 1619 decapsulation in the flow engine for such devices. 1620 1621 By default, the PMD will set this value to 1. 1622 1623- ``allow_duplicate_pattern`` parameter [int] 1624 1625 There are two options to choose: 1626 1627 - 0. Prevent insertion of rules with the same pattern items on non-root table. 1628 In this case, only the first rule is inserted and the following rules are 1629 rejected and error code EEXIST is returned. 1630 1631 - 1. Allow insertion of rules with the same pattern items. 1632 In this case, all rules are inserted but only the first rule takes effect, 1633 the next rule takes effect only if the previous rules are deleted. 1634 1635 By default, the PMD will set this value to 1. 1636 1637 1638Multiport E-Switch 1639------------------ 1640 1641In standard deployments of NVIDIA ConnectX and BlueField HCAs, where embedded switch is enabled, 1642each physical port is associated with a single switching domain. 1643Only PFs, VFs and SFs related to that physical port are connected to this domain 1644and offloaded flow rules are allowed to steer traffic only between the entities in the given domain. 1645 1646The following diagram pictures the high level overview of this architecture:: 1647 1648 .---. .------. .------. .---. .------. .------. 1649 |PF0| |PF0VFi| |PF0SFi| |PF1| |PF1VFi| |PF1SFi| 1650 .-+-. .--+---. .--+---. .-+-. .--+---. .--+---. 1651 | | | | | | 1652 .---|------|--------|-------|------|--------|---------. 1653 | | | | | | | HCA| 1654 | .-+------+--------+---. .-+------+--------+---. | 1655 | | | | | | 1656 | | E-Switch | | E-Switch | | 1657 | | PF0 | | PF1 | | 1658 | | | | | | 1659 | .---------+-----------. .--------+------------. | 1660 | | | | 1661 .--------+--+---+---------------+--+---+--------------. 1662 | | | | 1663 | PHY0 | | PHY1 | 1664 | | | | 1665 .------. .------. 1666 1667Multiport E-Switch is a deployment scenario where: 1668 1669- All physical ports, PFs, VFs and SFs share the same switching domain. 1670- Each physical port gets a separate representor port. 1671- Traffic can be matched or forwarded explicitly between any of the entities 1672 connected to the domain. 1673 1674The following diagram pictures the high level overview of this architecture:: 1675 1676 .---. .------. .------. .---. .------. .------. 1677 |PF0| |PF0VFi| |PF0SFi| |PF1| |PF1VFi| |PF1SFi| 1678 .-+-. .--+---. .--+---. .-+-. .--+---. .--+---. 1679 | | | | | | 1680 .---|------|--------|-------|------|--------|---------. 1681 | | | | | | | HCA| 1682 | .-+------+--------+-------+------+--------+---. | 1683 | | | | 1684 | | Shared | | 1685 | | E-Switch | | 1686 | | | | 1687 | .---------+----------------------+------------. 
| 1688 | | | | 1689 .--------+--+---+---------------+--+---+--------------. 1690 | | | | 1691 | PHY0 | | PHY1 | 1692 | | | | 1693 .------. .------. 1694 1695In this deployment a single application can control the switching and forwarding behavior for all 1696entities on the HCA. 1697 1698With this configuration, mlx5 PMD supports: 1699 1700- matching traffic coming from physical port, PF, VF or SF using REPRESENTED_PORT items; 1701- matching traffic coming from E-Switch manager 1702 using REPRESENTED_PORT item with port ID ``UINT16_MAX``; 1703- forwarding traffic to physical port, PF, VF or SF using REPRESENTED_PORT actions; 1704 1705Requirements 1706~~~~~~~~~~~~ 1707 1708Supported HCAs: 1709 1710- ConnectX family: ConnectX-6 Dx and above. 1711- BlueField family: BlueField-2 and above. 1712- FW version: at least ``XX.37.1014``. 1713 1714Supported mlx5 kernel modules versions: 1715 1716- Upstream Linux - from version 6.3. 1717- Modules packaged in MLNX_OFED - from version v23.04-0.5.3.3. 1718 1719Configuration 1720~~~~~~~~~~~~~ 1721 1722#. Apply required FW configuration:: 1723 1724 sudo mlxconfig -d /dev/mst/mt4125_pciconf0 set LAG_RESOURCE_ALLOCATION=1 1725 1726#. Reset FW or cold reboot the host. 1727 1728#. Switch E-Switch mode on all of the PFs to ``switchdev`` mode:: 1729 1730 sudo devlink dev eswitch set pci/0000:08:00.0 mode switchdev 1731 sudo devlink dev eswitch set pci/0000:08:00.1 mode switchdev 1732 1733#. Enable Multiport E-Switch on all of the PFs:: 1734 1735 sudo devlink dev param set pci/0000:08:00.0 name esw_multiport value true cmode runtime 1736 sudo devlink dev param set pci/0000:08:00.1 name esw_multiport value true cmode runtime 1737 1738#. Configure required number of VFs/SFs:: 1739 1740 echo 4 | sudo tee /sys/class/net/eth2/device/sriov_numvfs 1741 echo 4 | sudo tee /sys/class/net/eth3/device/sriov_numvfs 1742 1743#. Start testpmd and verify that all ports are visible:: 1744 1745 $ sudo dpdk-testpmd -a 08:00.0,dv_flow_en=2,representor=pf0-1vf0-3 -- -i 1746 testpmd> show port summary all 1747 Number of available ports: 10 1748 Port MAC Address Name Driver Status Link 1749 0 E8:EB:D5:18:22:BC 08:00.0_p0 mlx5_pci up 200 Gbps 1750 1 E8:EB:D5:18:22:BD 08:00.0_p1 mlx5_pci up 200 Gbps 1751 2 D2:F6:43:0B:9E:19 08:00.0_representor_c0pf0vf0 mlx5_pci up 200 Gbps 1752 3 E6:42:27:B7:68:BD 08:00.0_representor_c0pf0vf1 mlx5_pci up 200 Gbps 1753 4 A6:5B:7F:8B:B8:47 08:00.0_representor_c0pf0vf2 mlx5_pci up 200 Gbps 1754 5 12:93:50:45:89:02 08:00.0_representor_c0pf0vf3 mlx5_pci up 200 Gbps 1755 6 06:D3:B2:79:FE:AC 08:00.0_representor_c0pf1vf0 mlx5_pci up 200 Gbps 1756 7 12:FC:08:E4:C2:CA 08:00.0_representor_c0pf1vf1 mlx5_pci up 200 Gbps 1757 8 8E:A9:9A:D0:35:4C 08:00.0_representor_c0pf1vf2 mlx5_pci up 200 Gbps 1758 9 E6:35:83:1F:B0:A9 08:00.0_representor_c0pf1vf3 mlx5_pci up 200 Gbps 1759 1760Limitations 1761~~~~~~~~~~~ 1762 1763- Multiport E-Switch is not supported on Windows. 1764- Multiport E-Switch is supported only with HW Steering flow engine (``dv_flow_en=2``). 1765- Matching traffic coming from a physical port and forwarding it to a physical port 1766 (either the same or other one) is not supported. 1767 1768 In order to achieve such a functionality, an application has to setup hairpin queues 1769 between physical port representors and forward the traffic using hairpin queues. 1770 1771 1772Sub-Function 1773------------ 1774 1775See :ref:`mlx5_sub_function`. 
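
As an illustration only (the SF device name is a placeholder and depends on how
the SF was created; see the referenced section for details), an SF is probed
through the auxiliary bus, for example::

   dpdk-testpmd -a auxiliary:mlx5_core.sf.<sfnum> -- -i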
1776 1777Sub-Function representor support 1778~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1779 1780A SF netdev supports E-Switch representation offload 1781similar to PF and VF representors. 1782Use <sfnum> to probe SF representor:: 1783 1784 testpmd> port attach <PCI_BDF>,representor=sf<sfnum>,dv_flow_en=1 1785 1786 1787Performance tuning 1788------------------ 1789 1790#. Configure aggressive CQE Zipping for maximum performance:: 1791 1792 mlxconfig -d <mst device> s CQE_COMPRESSION=1 1793 1794 To set it back to the default CQE Zipping mode use:: 1795 1796 mlxconfig -d <mst device> s CQE_COMPRESSION=0 1797 1798#. In case of virtualization: 1799 1800 - Make sure that hypervisor kernel is 3.16 or newer. 1801 - Configure boot with ``iommu=pt``. 1802 - Use 1G huge pages. 1803 - Make sure to allocate a VM on huge pages. 1804 - Make sure to set CPU pinning. 1805 1806#. Use the CPU near local NUMA node to which the PCIe adapter is connected, 1807 for better performance. For VMs, verify that the right CPU 1808 and NUMA node are pinned according to the above. Run:: 1809 1810 lstopo-no-graphics --merge 1811 1812 to identify the NUMA node to which the PCIe adapter is connected. 1813 1814#. If more than one adapter is used, and root complex capabilities allow 1815 to put both adapters on the same NUMA node without PCI bandwidth degradation, 1816 it is recommended to locate both adapters on the same NUMA node. 1817 This in order to forward packets from one to the other without 1818 NUMA performance penalty. 1819 1820#. Disable pause frames:: 1821 1822 ethtool -A <netdev> rx off tx off 1823 1824#. Verify IO non-posted prefetch is disabled by default. This can be checked 1825 via the BIOS configuration. Please contact you server provider for more 1826 information about the settings. 1827 1828 .. note:: 1829 1830 On some machines, depends on the machine integrator, it is beneficial 1831 to set the PCI max read request parameter to 1K. This can be 1832 done in the following way: 1833 1834 To query the read request size use:: 1835 1836 setpci -s <NIC PCI address> 68.w 1837 1838 If the output is different than 3XXX, set it by:: 1839 1840 setpci -s <NIC PCI address> 68.w=3XXX 1841 1842 The XXX can be different on different systems. Make sure to configure 1843 according to the setpci output. 1844 1845#. To minimize overhead of searching Memory Regions: 1846 1847 - '--socket-mem' is recommended to pin memory by predictable amount. 1848 - Configure per-lcore cache when creating Mempools for packet buffer. 1849 - Refrain from dynamically allocating/freeing memory in run-time. 1850 1851Rx burst functions 1852------------------ 1853 1854There are multiple Rx burst functions with different advantages and limitations. 1855 1856.. 
table:: Rx burst functions 1857 1858 +-------------------+------------------------+---------+-----------------+------+-------+ 1859 || Function Name || Enabler || Scatter|| Error Recovery || CQE || Large| 1860 | | | | || comp|| MTU | 1861 +===================+========================+=========+=================+======+=======+ 1862 | rx_burst | rx_vec_en=0 | Yes | Yes | Yes | Yes | 1863 +-------------------+------------------------+---------+-----------------+------+-------+ 1864 | rx_burst_vec | rx_vec_en=1 (default) | No | if CQE comp off | Yes | No | 1865 +-------------------+------------------------+---------+-----------------+------+-------+ 1866 | rx_burst_mprq || mprq_en=1 | No | Yes | Yes | Yes | 1867 | || RxQs >= rxqs_min_mprq | | | | | 1868 +-------------------+------------------------+---------+-----------------+------+-------+ 1869 | rx_burst_mprq_vec || rx_vec_en=1 (default) | No | if CQE comp off | Yes | Yes | 1870 | || mprq_en=1 | | | | | 1871 | || RxQs >= rxqs_min_mprq | | | | | 1872 +-------------------+------------------------+---------+-----------------+------+-------+ 1873 1874.. _mlx5_offloads_support: 1875 1876Supported hardware offloads 1877--------------------------- 1878 1879Below tables show offload support depending on hardware, firmware, 1880and Linux software support. 1881 1882The :ref:`Linux prerequisites <mlx5_linux_prerequisites>` 1883are Linux kernel and rdma-core libraries. 1884These dependencies are also packaged in MLNX_OFED or MLNX_EN, 1885shortened below as "OFED". 1886 1887.. table:: Minimal SW/HW versions for queue offloads 1888 1889 ============== ===== ===== ========= ===== ========== ============= 1890 Offload DPDK Linux rdma-core OFED firmware hardware 1891 ============== ===== ===== ========= ===== ========== ============= 1892 common base 17.11 4.14 16 4.2-1 12.21.1000 ConnectX-4 1893 checksums 17.11 4.14 16 4.2-1 12.21.1000 ConnectX-4 1894 Rx timestamp 17.11 4.14 16 4.2-1 12.21.1000 ConnectX-4 1895 TSO 17.11 4.14 16 4.2-1 12.21.1000 ConnectX-4 1896 LRO 19.08 N/A N/A 4.6-4 16.25.6406 ConnectX-5 1897 Tx scheduling 20.08 N/A N/A 5.1-2 22.28.2006 ConnectX-6 Dx 1898 Buffer Split 20.11 N/A N/A 5.1-2 16.28.2006 ConnectX-5 1899 ============== ===== ===== ========= ===== ========== ============= 1900 1901.. 
table:: Minimal SW/HW versions for rte_flow offloads 1902 1903 +-----------------------+-----------------+-----------------+ 1904 | Offload | with E-Switch | with NIC | 1905 +=======================+=================+=================+ 1906 | Count | | DPDK 19.05 | | DPDK 19.02 | 1907 | | | OFED 4.6 | | OFED 4.6 | 1908 | | | rdma-core 24 | | rdma-core 23 | 1909 | | | ConnectX-5 | | ConnectX-5 | 1910 +-----------------------+-----------------+-----------------+ 1911 | Drop | | DPDK 19.05 | | DPDK 18.11 | 1912 | | | OFED 4.6 | | OFED 4.5 | 1913 | | | rdma-core 24 | | rdma-core 23 | 1914 | | | ConnectX-5 | | ConnectX-4 | 1915 +-----------------------+-----------------+-----------------+ 1916 | Queue / RSS | | | | DPDK 18.11 | 1917 | | | N/A | | OFED 4.5 | 1918 | | | | | rdma-core 23 | 1919 | | | | | ConnectX-4 | 1920 +-----------------------+-----------------+-----------------+ 1921 | Shared action | | | | | 1922 | | | :numref:`sact`| | :numref:`sact`| 1923 | | | | | | 1924 | | | | | | 1925 +-----------------------+-----------------+-----------------+ 1926 | | VLAN | | DPDK 19.11 | | DPDK 19.11 | 1927 | | (of_pop_vlan / | | OFED 4.7-1 | | OFED 4.7-1 | 1928 | | of_push_vlan / | | ConnectX-5 | | ConnectX-5 | 1929 | | of_set_vlan_pcp / | | | | | 1930 | | of_set_vlan_vid) | | | | | 1931 +-----------------------+-----------------+-----------------+ 1932 | | VLAN | | DPDK 21.05 | | | 1933 | | ingress and / | | OFED 5.3 | | N/A | 1934 | | of_push_vlan / | | ConnectX-6 Dx | | | 1935 +-----------------------+-----------------+-----------------+ 1936 | | VLAN | | DPDK 21.05 | | | 1937 | | egress and / | | OFED 5.3 | | N/A | 1938 | | of_pop_vlan / | | ConnectX-6 Dx | | | 1939 +-----------------------+-----------------+-----------------+ 1940 | Encapsulation | | DPDK 19.05 | | DPDK 19.02 | 1941 | (VXLAN / NVGRE / RAW) | | OFED 4.7-1 | | OFED 4.6 | 1942 | | | rdma-core 24 | | rdma-core 23 | 1943 | | | ConnectX-5 | | ConnectX-5 | 1944 +-----------------------+-----------------+-----------------+ 1945 | Encapsulation | | DPDK 19.11 | | DPDK 19.11 | 1946 | GENEVE | | OFED 4.7-3 | | OFED 4.7-3 | 1947 | | | rdma-core 27 | | rdma-core 27 | 1948 | | | ConnectX-5 | | ConnectX-5 | 1949 +-----------------------+-----------------+-----------------+ 1950 | Tunnel Offload | | DPDK 20.11 | | DPDK 20.11 | 1951 | | | OFED 5.1-2 | | OFED 5.1-2 | 1952 | | | rdma-core 32 | | N/A | 1953 | | | ConnectX-5 | | ConnectX-5 | 1954 +-----------------------+-----------------+-----------------+ 1955 | | Header rewrite | | DPDK 19.05 | | DPDK 19.02 | 1956 | | (set_ipv4_src / | | OFED 4.7-1 | | OFED 4.7-1 | 1957 | | set_ipv4_dst / | | rdma-core 24 | | rdma-core 24 | 1958 | | set_ipv6_src / | | ConnectX-5 | | ConnectX-5 | 1959 | | set_ipv6_dst / | | | | | 1960 | | set_tp_src / | | | | | 1961 | | set_tp_dst / | | | | | 1962 | | dec_ttl / | | | | | 1963 | | set_ttl / | | | | | 1964 | | set_mac_src / | | | | | 1965 | | set_mac_dst) | | | | | 1966 +-----------------------+-----------------+-----------------+ 1967 | | Header rewrite | | DPDK 20.02 | | DPDK 20.02 | 1968 | | (set_dscp) | | OFED 5.0 | | OFED 5.0 | 1969 | | | | rdma-core 24 | | rdma-core 24 | 1970 | | | | ConnectX-5 | | ConnectX-5 | 1971 +-----------------------+-----------------+-----------------+ 1972 | | Header rewrite | | DPDK 22.07 | | DPDK 22.07 | 1973 | | (ipv4_ecn / | | OFED 5.6-2 | | OFED 5.6-2 | 1974 | | ipv6_ecn) | | rdma-core 41 | | rdma-core 41 | 1975 | | | | ConnectX-5 | | ConnectX-5 | 1976 +-----------------------+-----------------+-----------------+ 1977 | Jump | 
| DPDK 19.05 | | DPDK 19.02 | 1978 | | | OFED 4.7-1 | | OFED 4.7-1 | 1979 | | | rdma-core 24 | | N/A | 1980 | | | ConnectX-5 | | ConnectX-5 | 1981 +-----------------------+-----------------+-----------------+ 1982 | Mark / Flag | | DPDK 19.05 | | DPDK 18.11 | 1983 | | | OFED 4.6 | | OFED 4.5 | 1984 | | | rdma-core 24 | | rdma-core 23 | 1985 | | | ConnectX-5 | | ConnectX-4 | 1986 +-----------------------+-----------------+-----------------+ 1987 | Meta data | | DPDK 19.11 | | DPDK 19.11 | 1988 | | | OFED 4.7-3 | | OFED 4.7-3 | 1989 | | | rdma-core 26 | | rdma-core 26 | 1990 | | | ConnectX-5 | | ConnectX-5 | 1991 +-----------------------+-----------------+-----------------+ 1992 | Port ID | | DPDK 19.05 | | N/A | 1993 | | | OFED 4.7-1 | | N/A | 1994 | | | rdma-core 24 | | N/A | 1995 | | | ConnectX-5 | | N/A | 1996 +-----------------------+-----------------+-----------------+ 1997 | Hairpin | | | | DPDK 19.11 | 1998 | | | N/A | | OFED 4.7-3 | 1999 | | | | | rdma-core 26 | 2000 | | | | | ConnectX-5 | 2001 +-----------------------+-----------------+-----------------+ 2002 | 2-port Hairpin | | | | DPDK 20.11 | 2003 | | | N/A | | OFED 5.1-2 | 2004 | | | | | N/A | 2005 | | | | | ConnectX-5 | 2006 +-----------------------+-----------------+-----------------+ 2007 | Metering | | DPDK 19.11 | | DPDK 19.11 | 2008 | | | OFED 4.7-3 | | OFED 4.7-3 | 2009 | | | rdma-core 26 | | rdma-core 26 | 2010 | | | ConnectX-5 | | ConnectX-5 | 2011 +-----------------------+-----------------+-----------------+ 2012 | ASO Metering | | DPDK 21.05 | | DPDK 21.05 | 2013 | | | OFED 5.3 | | OFED 5.3 | 2014 | | | rdma-core 33 | | rdma-core 33 | 2015 | | | ConnectX-6 Dx| | ConnectX-6 Dx | 2016 +-----------------------+-----------------+-----------------+ 2017 | Metering Hierarchy | | DPDK 21.08 | | DPDK 21.08 | 2018 | | | OFED 5.3 | | OFED 5.3 | 2019 | | | N/A | | N/A | 2020 | | | ConnectX-6 Dx| | ConnectX-6 Dx | 2021 +-----------------------+-----------------+-----------------+ 2022 | Sampling | | DPDK 20.11 | | DPDK 20.11 | 2023 | | | OFED 5.1-2 | | OFED 5.1-2 | 2024 | | | rdma-core 32 | | N/A | 2025 | | | ConnectX-5 | | ConnectX-5 | 2026 +-----------------------+-----------------+-----------------+ 2027 | Encapsulation | | DPDK 21.02 | | DPDK 21.02 | 2028 | GTP PSC | | OFED 5.2 | | OFED 5.2 | 2029 | | | rdma-core 35 | | rdma-core 35 | 2030 | | | ConnectX-6 Dx| | ConnectX-6 Dx | 2031 +-----------------------+-----------------+-----------------+ 2032 | Encapsulation | | DPDK 21.02 | | DPDK 21.02 | 2033 | GENEVE TLV option | | OFED 5.2 | | OFED 5.2 | 2034 | | | rdma-core 34 | | rdma-core 34 | 2035 | | | ConnectX-6 Dx | | ConnectX-6 Dx | 2036 +-----------------------+-----------------+-----------------+ 2037 | Modify Field | | DPDK 21.02 | | DPDK 21.02 | 2038 | | | OFED 5.2 | | OFED 5.2 | 2039 | | | rdma-core 35 | | rdma-core 35 | 2040 | | | ConnectX-5 | | ConnectX-5 | 2041 +-----------------------+-----------------+-----------------+ 2042 | Connection tracking | | | | DPDK 21.05 | 2043 | | | N/A | | OFED 5.3 | 2044 | | | | | rdma-core 35 | 2045 | | | | | ConnectX-6 Dx | 2046 +-----------------------+-----------------+-----------------+ 2047 2048.. 
table:: Minimal SW/HW versions for shared action offload 2049 :name: sact 2050 2051 +-----------------------+-----------------+-----------------+ 2052 | Shared Action | with E-Switch | with NIC | 2053 +=======================+=================+=================+ 2054 | RSS | | | | DPDK 20.11 | 2055 | | | N/A | | OFED 5.2 | 2056 | | | | | rdma-core 33 | 2057 | | | | | ConnectX-5 | 2058 +-----------------------+-----------------+-----------------+ 2059 | Age | | DPDK 20.11 | | DPDK 20.11 | 2060 | | | OFED 5.2 | | OFED 5.2 | 2061 | | | rdma-core 32 | | rdma-core 32 | 2062 | | | ConnectX-6 Dx | | ConnectX-6 Dx | 2063 +-----------------------+-----------------+-----------------+ 2064 | Count | | DPDK 21.05 | | DPDK 21.05 | 2065 | | | OFED 4.6 | | OFED 4.6 | 2066 | | | rdma-core 24 | | rdma-core 23 | 2067 | | | ConnectX-5 | | ConnectX-5 | 2068 +-----------------------+-----------------+-----------------+ 2069 2070.. table:: Minimal SW/HW versions for flow template API 2071 2072 +-----------------+--------------------+--------------------+ 2073 | DPDK | NIC | Firmware | 2074 +=================+====================+====================+ 2075 | 22.11 | ConnectX-6 Dx | xx.35.1012 | 2076 +-----------------+--------------------+--------------------+ 2077 2078Notes for metadata 2079------------------ 2080 2081MARK and META items are interrelated with datapath - they might move from/to 2082the applications in mbuf fields. Hence, zero value for these items has the 2083special meaning - it means "no metadata are provided", not zero values are 2084treated by applications and PMD as valid ones. 2085 2086Moreover in the flow engine domain the value zero is acceptable to match and 2087set, and we should allow to specify zero values as rte_flow parameters for the 2088META and MARK items and actions. In the same time zero mask has no meaning and 2089should be rejected on validation stage. 2090 2091Notes for rte_flow 2092------------------ 2093 2094Flows are not cached in the driver. 2095When stopping a device port, all the flows created on this port from the 2096application will be flushed automatically in the background. 2097After stopping the device port, all flows on this port become invalid and 2098not represented in the system. 2099All references to these flows held by the application should be discarded 2100directly but neither destroyed nor flushed. 2101 2102The application should re-create the flows as required after the port restart. 2103 2104 2105Notes for flow counters 2106----------------------- 2107 2108mlx5 PMD supports the ``COUNT`` flow action, 2109which provides an ability to count packets (and bytes) 2110matched against a given flow rule. 2111This section describes the high level overview of 2112how this support is implemented and limitations. 2113 2114HW steering flow engine 2115~~~~~~~~~~~~~~~~~~~~~~~ 2116 2117Flow counters are allocated from HW in bulks. 2118A set of bulks forms a flow counter pool managed by PMD. 2119When flow counters are queried from HW, 2120each counter is identified by an offset in a given bulk. 2121Querying HW flow counter requires sending a request to HW, 2122which will request a read of counter values for given offsets. 2123HW will asynchronously provide these values through a DMA write. 2124 2125In order to optimize HW to SW communication, 2126these requests are handled in a separate counter service thread 2127spawned by mlx5 PMD. 2128This service thread will refresh the counter values stored in memory, 2129in cycles, each spanning ``svc_cycle_time`` milliseconds. 
2130By default, ``svc_cycle_time`` is set to 500. 2131When applications query the ``COUNT`` flow action, 2132PMD returns the values stored in host memory. 2133 2134mlx5 PMD manages 3 global rings of allocated counter offsets: 2135 2136- ``free`` ring - Counters which were not used at all. 2137- ``wait_reset`` ring - Counters which were used in some flow rules, 2138 but were recently freed (flow rule was destroyed 2139 or an indirect action was destroyed). 2140 Since the count value might have changed 2141 between the last counter service thread cycle and the moment it was freed, 2142 the value in host memory might be stale. 2143 During the next service thread cycle, 2144 such counters will be moved to ``reuse`` ring. 2145- ``reuse`` ring - Counters which were used at least once 2146 and can be reused in new flow rules. 2147 2148When counters are assigned to a flow rule (or allocated to indirect action), 2149the PMD first tries to fetch a counter from ``reuse`` ring. 2150If it's empty, the PMD fetches a counter from ``free`` ring. 2151 2152The counter service thread works as follows: 2153 2154#. Record counters stored in ``wait_reset`` ring. 2155#. Read values of all counters which were used at least once 2156 or are currently in use. 2157#. Move recorded counters from ``wait_reset`` to ``reuse`` ring. 2158#. Sleep for ``(query time) - svc_cycle_time`` milliseconds 2159#. Repeat. 2160 2161Because freeing a counter (by destroying a flow rule or destroying indirect action) 2162does not immediately make it available for the application, 2163the PMD might return: 2164 2165- ``ENOENT`` if no counter is available in ``free``, ``reuse`` 2166 or ``wait_reset`` rings. 2167 No counter will be available until the application releases some of them. 2168- ``EAGAIN`` if no counter is available in ``free`` and ``reuse`` rings, 2169 but there are counters in ``wait_reset`` ring. 2170 This means that after the next service thread cycle new counters will be available. 2171 2172The application has to be aware that flow rule create or indirect action create 2173might need be retried. 2174 2175 2176Notes for hairpin 2177----------------- 2178 2179NVIDIA ConnectX and BlueField devices support 2180specifying memory placement for hairpin Rx and Tx queues. 2181This feature requires NVIDIA MLNX_OFED 5.8. 2182 2183By default, data buffers and packet descriptors for hairpin queues 2184are placed in device memory 2185which is shared with other resources (e.g. flow rules). 2186 2187Starting with DPDK 22.11 and NVIDIA MLNX_OFED 5.8, 2188applications are allowed to: 2189 2190#. Place data buffers and Rx packet descriptors in dedicated device memory. 2191 Application can request that configuration 2192 through ``use_locked_device_memory`` configuration option. 2193 2194 Placing data buffers and Rx packet descriptors in dedicated device memory 2195 can decrease latency on hairpinned traffic, 2196 since traffic processing for the hairpin queue will not be memory starved. 2197 2198 However, reserving device memory for hairpin Rx queues 2199 may decrease throughput under heavy load, 2200 since less resources will be available on device. 2201 2202 This option is supported only for Rx hairpin queues. 2203 2204#. Place Tx packet descriptors in host memory. 2205 Application can request that configuration 2206 through ``use_rte_memory`` configuration option. 2207 2208 Placing Tx packet descritors in host memory can increase traffic throughput. 
2209 This results in more resources available on the device for other purposes, 2210 which reduces memory contention on device. 2211 Side effect of this option is visible increase in latency, 2212 since each packet incurs additional PCI transactions. 2213 2214 This option is supported only for Tx hairpin queues. 2215 2216 2217Notes for testpmd 2218----------------- 2219 2220Compared to librte_net_mlx4 that implements a single RSS configuration per 2221port, librte_net_mlx5 supports per-protocol RSS configuration. 2222 2223Since ``testpmd`` defaults to IP RSS mode and there is currently no 2224command-line parameter to enable additional protocols (UDP and TCP as well 2225as IP), the following commands must be entered from its CLI to get the same 2226behavior as librte_net_mlx4:: 2227 2228 > port stop all 2229 > port config all rss all 2230 > port start all 2231 2232Usage example 2233------------- 2234 2235This section demonstrates how to launch **testpmd** with NVIDIA 2236ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices managed by librte_net_mlx5. 2237 2238#. Load the kernel modules:: 2239 2240 modprobe -a ib_uverbs mlx5_core mlx5_ib 2241 2242 Alternatively if MLNX_OFED/MLNX_EN is fully installed, the following script 2243 can be run:: 2244 2245 /etc/init.d/openibd restart 2246 2247 .. note:: 2248 2249 User space I/O kernel modules (uio and igb_uio) are not used and do 2250 not have to be loaded. 2251 2252#. Make sure Ethernet interfaces are in working order and linked to kernel 2253 verbs. Related sysfs entries should be present:: 2254 2255 ls -d /sys/class/net/*/device/infiniband_verbs/uverbs* | cut -d / -f 5 2256 2257 Example output:: 2258 2259 eth30 2260 eth31 2261 eth32 2262 eth33 2263 2264#. Optionally, retrieve their PCI bus addresses for to be used with the allow list:: 2265 2266 { 2267 for intf in eth2 eth3 eth4 eth5; 2268 do 2269 (cd "/sys/class/net/${intf}/device/" && pwd -P); 2270 done; 2271 } | 2272 sed -n 's,.*/\(.*\),-a \1,p' 2273 2274 Example output:: 2275 2276 -a 0000:05:00.1 2277 -a 0000:06:00.0 2278 -a 0000:06:00.1 2279 -a 0000:05:00.0 2280 2281#. Request huge pages:: 2282 2283 dpdk-hugepages.py --setup 2G 2284 2285#. Start testpmd with basic parameters:: 2286 2287 dpdk-testpmd -l 8-15 -n 4 -a 05:00.0 -a 05:00.1 -a 06:00.0 -a 06:00.1 -- --rxq=2 --txq=2 -i 2288 2289 Example output:: 2290 2291 [...] 
2292 EAL: PCI device 0000:05:00.0 on NUMA socket 0 2293 EAL: probe driver: 15b3:1013 librte_net_mlx5 2294 PMD: librte_net_mlx5: PCI information matches, using device "mlx5_0" (VF: false) 2295 PMD: librte_net_mlx5: 1 port(s) detected 2296 PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fe 2297 EAL: PCI device 0000:05:00.1 on NUMA socket 0 2298 EAL: probe driver: 15b3:1013 librte_net_mlx5 2299 PMD: librte_net_mlx5: PCI information matches, using device "mlx5_1" (VF: false) 2300 PMD: librte_net_mlx5: 1 port(s) detected 2301 PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:ff 2302 EAL: PCI device 0000:06:00.0 on NUMA socket 0 2303 EAL: probe driver: 15b3:1013 librte_net_mlx5 2304 PMD: librte_net_mlx5: PCI information matches, using device "mlx5_2" (VF: false) 2305 PMD: librte_net_mlx5: 1 port(s) detected 2306 PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fa 2307 EAL: PCI device 0000:06:00.1 on NUMA socket 0 2308 EAL: probe driver: 15b3:1013 librte_net_mlx5 2309 PMD: librte_net_mlx5: PCI information matches, using device "mlx5_3" (VF: false) 2310 PMD: librte_net_mlx5: 1 port(s) detected 2311 PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fb 2312 Interactive-mode selected 2313 Configuring Port 0 (socket 0) 2314 PMD: librte_net_mlx5: 0x8cba80: TX queues number update: 0 -> 2 2315 PMD: librte_net_mlx5: 0x8cba80: RX queues number update: 0 -> 2 2316 Port 0: E4:1D:2D:E7:0C:FE 2317 Configuring Port 1 (socket 0) 2318 PMD: librte_net_mlx5: 0x8ccac8: TX queues number update: 0 -> 2 2319 PMD: librte_net_mlx5: 0x8ccac8: RX queues number update: 0 -> 2 2320 Port 1: E4:1D:2D:E7:0C:FF 2321 Configuring Port 2 (socket 0) 2322 PMD: librte_net_mlx5: 0x8cdb10: TX queues number update: 0 -> 2 2323 PMD: librte_net_mlx5: 0x8cdb10: RX queues number update: 0 -> 2 2324 Port 2: E4:1D:2D:E7:0C:FA 2325 Configuring Port 3 (socket 0) 2326 PMD: librte_net_mlx5: 0x8ceb58: TX queues number update: 0 -> 2 2327 PMD: librte_net_mlx5: 0x8ceb58: RX queues number update: 0 -> 2 2328 Port 3: E4:1D:2D:E7:0C:FB 2329 Checking link statuses... 2330 Port 0 Link Up - speed 40000 Mbps - full-duplex 2331 Port 1 Link Up - speed 40000 Mbps - full-duplex 2332 Port 2 Link Up - speed 10000 Mbps - full-duplex 2333 Port 3 Link Up - speed 10000 Mbps - full-duplex 2334 Done 2335 testpmd> 2336 2337How to dump flows 2338----------------- 2339 2340This section demonstrates how to dump flows. Currently, it's possible to dump 2341all flows with assistance of external tools. 2342 2343#. 2 ways to get flow raw file: 2344 2345 - Using testpmd CLI: 2346 2347 .. code-block:: console 2348 2349 To dump all flows: 2350 testpmd> flow dump <port> all <output_file> 2351 and dump one flow: 2352 testpmd> flow dump <port> rule <rule_id> <output_file> 2353 2354 - call rte_flow_dev_dump api: 2355 2356 .. code-block:: console 2357 2358 rte_flow_dev_dump(port, flow, file, NULL); 2359 2360#. Dump human-readable flows from raw file: 2361 2362 Get flow parsing tool from: https://github.com/Mellanox/mlx_steering_dump 2363 2364 .. code-block:: console 2365 2366 mlx_steering_dump.py -f <output_file> -flowptr <flow_ptr> 2367 2368How to share a meter between ports in the same switch domain 2369------------------------------------------------------------ 2370 2371This section demonstrates how to use the shared meter. A meter M can be created 2372on port X and to be shared with a port Y on the same switch domain by the next way: 2373 2374.. 
code-block:: console 2375 2376 flow create X ingress transfer pattern eth / port_id id is Y / end actions meter mtr_id M / end 2377 2378How to use meter hierarchy 2379-------------------------- 2380 2381This section demonstrates how to create and use a meter hierarchy. 2382A termination meter M can be the policy green action of another termination meter N. 2383The two meters are chained together as a chain. Using meter N in a flow will apply 2384both the meters in hierarchy on that flow. 2385 2386.. code-block:: console 2387 2388 add port meter policy 0 1 g_actions queue index 0 / end y_actions end r_actions drop / end 2389 create port meter 0 M 1 1 yes 0xffff 1 0 2390 add port meter policy 0 2 g_actions meter mtr_id M / end y_actions end r_actions drop / end 2391 create port meter 0 N 2 2 yes 0xffff 1 0 2392 flow create 0 ingress group 1 pattern eth / end actions meter mtr_id N / end 2393 2394How to configure a VF as trusted 2395-------------------------------- 2396 2397This section demonstrates how to configure a virtual function (VF) interface as trusted. 2398Trusted VF is needed to offload rules with rte_flow to a group that is bigger than 0. 2399The configuration is done in two parts: driver and FW. 2400 2401The procedure below is an example of using a ConnectX-5 adapter card (pf0) with 2 VFs: 2402 2403#. Create 2 VFs on the PF pf0 when in Legacy SR-IOV mode:: 2404 2405 $ echo 2 > /sys/class/net/pf0/device/mlx5_num_vfs 2406 2407#. Verify the VFs are created: 2408 2409 .. code-block:: console 2410 2411 $ lspci | grep Mellanox 2412 82:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] 2413 82:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] 2414 82:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] 2415 82:00.3 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] 2416 2417#. Unbind all VFs. For each VF PCIe, using the following command to unbind the driver:: 2418 2419 $ echo "0000:82:00.2" >> /sys/bus/pci/drivers/mlx5_core/unbind 2420 2421#. Set the VFs to be trusted for the kernel by using one of the methods below: 2422 2423 - Using sysfs file:: 2424 2425 $ echo ON | tee /sys/class/net/pf0/device/sriov/0/trust 2426 $ echo ON | tee /sys/class/net/pf0/device/sriov/1/trust 2427 2428 - Using “ip link” command:: 2429 2430 $ ip link set p0 vf 0 trust on 2431 $ ip link set p0 vf 1 trust on 2432 2433#. Configure all VFs using ``mlxreg``: 2434 2435 - For MFT >= 4.21:: 2436 2437 $ mlxreg -d /dev/mst/mt4121_pciconf0 --reg_name VHCA_TRUST_LEVEL --yes --indexes 'all_vhca=0x1,vhca_id=0x0' --set 'trust_level=0x1' 2438 2439 - For MFT < 4.21:: 2440 2441 $ mlxreg -d /dev/mst/mt4121_pciconf0 --reg_name VHCA_TRUST_LEVEL --yes --set "all_vhca=0x1,trust_level=0x1" 2442 2443 .. note:: 2444 2445 Firmware version used must be >= xx.29.1016 and MFT >= 4.18 2446 2447#. For each VF PCIe, using the following command to bind the driver:: 2448 2449 $ echo "0000:82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind 2450 2451How to trace Tx datapath 2452------------------------ 2453 2454The mlx5 PMD provides Tx datapath tracing capability with extra debug information: 2455when and how packets were scheduled, 2456and when the actual sending was completed by the NIC hardware. 2457 2458Steps to enable Tx datapath tracing: 2459 2460#. 
Build DPDK application with enabled datapath tracing 2461 2462 The Meson option ``--enable_trace_fp=true`` and 2463 the C flag ``ALLOW_EXPERIMENTAL_API`` should be specified. 2464 2465 .. code-block:: console 2466 2467 meson configure --buildtype=debug -Denable_trace_fp=true 2468 -Dc_args='-DRTE_LIBRTE_MLX5_DEBUG -DRTE_ENABLE_ASSERT -DALLOW_EXPERIMENTAL_API' build 2469 2470#. Configure the NIC 2471 2472 If the sending completion timings are important, 2473 the NIC should be configured to provide realtime timestamps. 2474 The non-volatile settings parameter ``REAL_TIME_CLOCK_ENABLE`` should be configured as ``1``. 2475 2476 .. code-block:: console 2477 2478 mlxconfig -d /dev/mst/mt4125_pciconf0 s REAL_TIME_CLOCK_ENABLE=1 2479 2480 The ``mlxconfig`` utility is part of the MFT package. 2481 2482#. Run application with EAL parameter enabling tracing in mlx5 Tx datapath 2483 2484 By default all tracepoints are disabled. 2485 To analyze Tx datapath and its timings: ``--trace=pmd.net.mlx5.tx``. 2486 2487#. Commit the tracing data to the storage (with ``rte_trace_save()`` API call). 2488 2489#. Install or build the ``babeltrace2`` package 2490 2491 The Python script analyzing gathered trace data uses the ``babeltrace2`` library. 2492 The package should be either installed or built from source as shown below. 2493 2494 .. code-block:: console 2495 2496 git clone https://github.com/efficios/babeltrace.git 2497 cd babeltrace 2498 ./bootstrap 2499 ./configure -help 2500 ./configure --disable-api-doc --disable-man-pages 2501 --disable-python-bindings-doc --enable-python-plugins 2502 --enable-python-binding 2503 2504#. Run analyzing script 2505 2506 ``mlx5_trace.py`` is used to combine related events (packet firing and completion) 2507 and to show the results in human-readable view. 2508 2509 The analyzing script is located in the DPDK source tree: ``drivers/net/mlx5/tools``. 2510 2511 It requires Python 3.6 and ``babeltrace2`` package. 2512 2513 The parameter of the script is the trace data folder. 2514 2515 The optional parameter ``-a`` forces to dump incomplete bursts. 2516 2517 The optional parameter ``-v [level]`` forces to dump raw records data 2518 for the specified level and below. 2519 Level 0 dumps bursts, level 1 dumps WQEs, level 2 dumps mbufs. 2520 2521 .. code-block:: console 2522 2523 mlx5_trace.py /var/log/rte-2023-01-23-AM-11-52-39 2524 2525#. Interpreting the script output data 2526 2527 All the timings are given in nanoseconds. 2528 The list of Tx bursts per port/queue is presented in the output. 2529 Each list element contains the list of built WQEs with specific opcodes. 2530 Each WQE contains the list of the encompassed packets to send. 2531 2532Host shaper 2533----------- 2534 2535Host shaper register is per host port register 2536which sets a shaper on the host port. 2537All VF/host PF representors belonging to one host port share one host shaper. 2538For example, if representor 0 and representor 1 belong to the same host port, 2539and a host shaper rate of 1Gbps is configured, 2540the shaper throttles both representors traffic from the host. 2541 2542Host shaper has two modes for setting the shaper, 2543immediate and deferred to available descriptor threshold event trigger. 2544 2545In immediate mode, the rate limit is configured immediately to host shaper. 2546 2547When deferring to the available descriptor threshold trigger, 2548the shaper is not set until an available descriptor threshold event 2549is received by any Rx queue in a VF representor belonging to the host port. 
The only rate supported in deferred mode is 100Mbps
(there is no limit on the supported rates in immediate mode).
In deferred mode, the shaper is set on the host port by the firmware
upon receiving the available descriptor threshold event,
which allows throttling host traffic on available descriptor threshold events
at minimum latency, preventing excess drops in the Rx queue.

Dependency on mstflint package
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to configure the host shaper register,
``librte_net_mlx5`` depends on ``libmtcr_ul``,
which can be installed from the MLNX_OFED mstflint package.
Meson detects ``libmtcr_ul`` existence at configure stage.
If the library is detected, the application must link with ``-lmtcr_ul``,
as done by the pkg-config file libdpdk.pc.

Available descriptor threshold and host shaper
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is a command to configure the available descriptor threshold in testpmd.
Testpmd also contains sample logic to handle available descriptor threshold events.
The typical workflow is:
testpmd configures the available descriptor threshold for Rx queues,
enables ``avail_thresh_triggered`` in the host shaper and registers a callback.
When traffic from the host is too high
and Rx queue emptiness is below the available descriptor threshold,
the PMD receives an event
and the firmware configures a 100Mbps shaper on the host port automatically.
The PMD then calls the previously registered callback,
which waits a while to let the Rx queues drain
and then disables the host shaper.

Let's assume we have a simple BlueField-2 setup:
port 0 is the uplink, port 1 is a VF representor.
Each port has 2 Rx queues.
To control traffic from the host to the Arm device,
we can enable the available descriptor threshold in testpmd by:

.. code-block:: console

   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0
   testpmd> set port 1 rxq 0 avail_thresh 70
   testpmd> set port 1 rxq 1 avail_thresh 70

The first command disables the current host shaper
and enables the available descriptor threshold triggered mode.
The other commands configure the available descriptor threshold
to 70% of the Rx queue size for both Rx queues.

When traffic from the host is too high,
the testpmd console prints a log message about the available descriptor
threshold event, and the host shaper is later disabled again by the callback.
The traffic rate from the host is controlled and fewer drops happen in the Rx queues.

The threshold event and shaper can be disabled like this:

.. code-block:: console

   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0
   testpmd> set port 1 rxq 0 avail_thresh 0
   testpmd> set port 1 rxq 1 avail_thresh 0

It is recommended that an application disables the available descriptor threshold
and ``avail_thresh_triggered`` before exit, if it enabled them before.

The shaper can also be configured with a value; the rate unit is 100Mbps.
Below, the command sets the current shaper to 5Gbps
and disables ``avail_thresh_triggered``.

..
Testpmd driver specific commands
--------------------------------

port attach with socket path
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to allocate a port with ``libibverbs`` from an external application.
To import the external port with extra device arguments,
there is a specific testpmd command
similar to the :ref:`port attach command <port_attach>`::

   testpmd> mlx5 port attach (identifier) socket=(path)

where:

* ``identifier``: device identifier with optional parameters,
  the same as in the :ref:`port attach command <port_attach>`.
* ``path``: path to the IPC server socket created by the external application.

This command performs:

#. Open an IPC client socket using the given path, and connect it.

#. Import the ibverbs context and ibverbs protection domain.

#. Add two device arguments for the context (``cmd_fd``)
   and protection domain (``pd_handle``) to the device identifier.
   See :ref:`mlx5 driver options <mlx5_common_driver_options>` for more
   information about these device arguments.

#. Call the regular ``port attach`` function with the updated identifier.

For example, to attach a port whose PCI address is ``0000:0a:00.0``
and whose socket path is ``/var/run/import_ipc_socket``:

.. code-block:: console

   testpmd> mlx5 port attach 0000:0a:00.0 socket=/var/run/import_ipc_socket
   testpmd: MLX5 socket path is /var/run/import_ipc_socket
   testpmd: Attach port with extra devargs 0000:0a:00.0,cmd_fd=40,pd_handle=1
   Attaching a new port...
   EAL: Probe PCI driver: mlx5_pci (15b3:101d) device: 0000:0a:00.0 (socket 0)
   Port 0 is attached. Now total ports is 1
   Done


port map external Rx queue
~~~~~~~~~~~~~~~~~~~~~~~~~~

External Rx queue indexes mapping management.

Map a HW queue index (32-bit) to an ethdev queue index (16-bit) for an external Rx queue::

   testpmd> mlx5 port (port_id) ext_rxq map (sw_queue_id) (hw_queue_id)

Unmap an external Rx queue::

   testpmd> mlx5 port (port_id) ext_rxq unmap (sw_queue_id)

where:

* ``sw_queue_id``: queue index in the range [64536, 65535].
  This range is the highest 1000 numbers of the 16-bit index space.
* ``hw_queue_id``: queue index given by the HW in queue creation.


Dump RQ/SQ/CQ HW context for debug purposes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dump the RQ/CQ HW context for a given port/queue to a file::

   testpmd> mlx5 port (port_id) queue (queue_id) dump rq_context (file_name)

Dump the SQ/CQ HW context for a given port/queue to a file::

   testpmd> mlx5 port (port_id) queue (queue_id) dump sq_context (file_name)


Set Flow Engine Mode
~~~~~~~~~~~~~~~~~~~~

Set the flow engine to active or standby mode with specific flags (bitmap style).
See ``RTE_PMD_MLX5_FLOW_ENGINE_FLAG_*`` for the flag definitions.

.. code-block:: console

   testpmd> mlx5 set flow_engine <active|standby> [<flags>]

This command is used for testing live migration
and works for software steering only.
The default FDB jump should be disabled if switchdev is enabled.
The mode propagates to all the probed ports.
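
The testpmd command corresponds to a driver-specific C API.
Below is a minimal sketch, assuming the experimental
``rte_pmd_mlx5_flow_engine_set_mode()`` helper and the
``RTE_PMD_MLX5_FLOW_ENGINE_MODE_*`` values declared in ``rte_pmd_mlx5.h``;
verify the exact names and available flags against the installed header.

.. code-block:: c

   #include <rte_pmd_mlx5.h>

   /* Sketch: put the flow engine into standby mode before live migration
    * and switch it back to active afterwards. No extra flags are passed;
    * see the RTE_PMD_MLX5_FLOW_ENGINE_FLAG_* definitions for the options. */
   static int
   flow_engine_enter_standby(void)
   {
           return rte_pmd_mlx5_flow_engine_set_mode(RTE_PMD_MLX5_FLOW_ENGINE_MODE_STANDBY, 0);
   }

   static int
   flow_engine_back_to_active(void)
   {
           return rte_pmd_mlx5_flow_engine_set_mode(RTE_PMD_MLX5_FLOW_ENGINE_MODE_ACTIVE, 0);
   }
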
GENEVE TLV options parser
~~~~~~~~~~~~~~~~~~~~~~~~~

See the :ref:`GENEVE parser API <geneve_parser_api>` for more information.

Set
^^^

Add a single option to the global option list::

   testpmd> mlx5 set tlv_option class (class) type (type) len (length) \
            offset (sample_offset) sample_len (sample_len) \
            class_mode (ignore|fixed|matchable) data (0xffffffff|0x0 [0xffffffff|0x0]*)

where:

* ``class``: option class.
* ``type``: option type.
* ``length``: option data length in 4-byte granularity.
* ``sample_offset``: offset of the data list relative to the start of the option data,
  in 4-byte granularity.
* ``sample_len``: length of the data list in 4-byte granularity.
* ``ignore``: ignore the ``class`` field.
* ``fixed``: the option class is fixed and defines the option along with the type.
* ``matchable``: the ``class`` field is matchable.
* ``data``: list of masks indicating which DWs should be configured.
  The size of the list should equal ``sample_len``.
* ``0xffffffff``: this DW should be configured.
* ``0x0``: this DW should not be configured.

Flush
^^^^^

Remove several options from the global option list::

   testpmd> mlx5 flush tlv_options max (nb_option)

where:

* ``nb_option``: maximum number of options to remove from the list. The order is LIFO.

List
^^^^

Print all options which have been set in the global option list so far::

   testpmd> mlx5 list tlv_options

The output contains the values of each option, one per line.
There is no output at all when no options are configured on the global list::

   ID      Type    Class   Class_mode   Len     Offset  Sample_len   Data
   [...]   [...]   [...]   [...]        [...]   [...]   [...]        [...]

Setting several options and listing them::

   testpmd> mlx5 set tlv_option class 1 type 1 len 4 offset 1 sample_len 3
            class_mode fixed data 0xffffffff 0x0 0xffffffff
   testpmd: set new option in global list, now it has 1 options
   testpmd> mlx5 set tlv_option class 1 type 2 len 2 offset 0 sample_len 2
            class_mode fixed data 0xffffffff 0xffffffff
   testpmd: set new option in global list, now it has 2 options
   testpmd> mlx5 set tlv_option class 1 type 3 len 5 offset 4 sample_len 1
            class_mode fixed data 0xffffffff
   testpmd: set new option in global list, now it has 3 options
   testpmd> mlx5 list tlv_options
   ID      Type    Class   Class_mode   Len     Offset  Sample_len   Data
   0       1       1       fixed        4       1       3            0xffffffff 0x0 0xffffffff
   1       2       1       fixed        2       0       2            0xffffffff 0xffffffff
   2       3       1       fixed        5       4       1            0xffffffff
   testpmd>

Apply
^^^^^

Create a GENEVE TLV parser for a specific port using the option list set so far::

   testpmd> mlx5 port (port_id) apply tlv_options

The same global option list can be used by several ports.

Destroy
^^^^^^^

Destroy the GENEVE TLV parser for a specific port::

   testpmd> mlx5 port (port_id) destroy tlv_options

This command doesn't destroy the global list;
to release the options, use the ``flush`` command.
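
Once a parser has been applied to a port, the configured options can be matched
in flow rules with the ``RTE_FLOW_ITEM_TYPE_GENEVE_OPT`` pattern item.
Below is a minimal sketch of such an item;
the class, type, length and data values are purely illustrative
and must correspond to an option configured in the applied parser.

.. code-block:: c

   #include <rte_byteorder.h>
   #include <rte_flow.h>

   /* Option payload words to match and their masks (illustrative values).
    * Lengths are expressed in 4-byte words, as in the parser commands above. */
   static uint32_t geneve_opt_data[2];
   static uint32_t geneve_opt_mask_data[2] = { 0xffffffff, 0xffffffff };

   /* Match a GENEVE option with class 1 and type 1. */
   static const struct rte_flow_item_geneve_opt geneve_opt_spec = {
           .option_class = RTE_BE16(1),
           .option_type = 1,
           .option_len = 2,
           .data = geneve_opt_data,
   };

   static const struct rte_flow_item_geneve_opt geneve_opt_mask = {
           .option_class = RTE_BE16(0xffff),
           .option_type = 0xff,
           .option_len = 0xff,
           .data = geneve_opt_mask_data,
   };

   /* Pattern item to place in a flow rule's item array, after the GENEVE item. */
   static const struct rte_flow_item geneve_opt_item = {
           .type = RTE_FLOW_ITEM_TYPE_GENEVE_OPT,
           .spec = &geneve_opt_spec,
           .mask = &geneve_opt_mask,
   };
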