..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2015 6WIND S.A.
    Copyright 2015 Mellanox Technologies, Ltd

.. include:: <isonum.txt>

MLX5 poll mode driver
=====================

The MLX5 poll mode driver library (**librte_net_mlx5**) provides support
for **Mellanox ConnectX-4**, **Mellanox ConnectX-4 Lx**, **Mellanox
ConnectX-5**, **Mellanox ConnectX-6**, **Mellanox ConnectX-6 Dx**, **Mellanox
ConnectX-6 Lx**, **Mellanox BlueField** and **Mellanox BlueField-2** families
of 10/25/40/50/100/200 Gb/s adapters as well as their virtual functions (VF)
in SR-IOV context.

Information and documentation about these adapters can be found on the
`Mellanox website <http://www.mellanox.com>`__. Help is also provided by the
`Mellanox community <http://community.mellanox.com/welcome>`__.

There is also a `section dedicated to this poll mode driver
<http://www.mellanox.com/page/products_dyn?product_family=209&mtag=pmd_for_dpdk>`__.


Design
------

Besides its dependency on libibverbs (that implies libmlx5 and associated
kernel support), librte_net_mlx5 relies heavily on system calls for control
operations such as querying/updating the MTU and flow control parameters.

For security reasons and robustness, this driver only deals with virtual
memory addresses. The way resource allocations are handled by the kernel,
combined with hardware specifications that allow handling virtual memory
addresses directly, ensures that DPDK applications cannot access random
physical memory (or memory that does not belong to the current process).

This capability allows the PMD to coexist with kernel network interfaces
which remain functional, although they stop receiving unicast packets as
long as they share the same MAC address.
This means legacy Linux control tools (for example: ethtool, ifconfig and
more) can operate on the same network interfaces that are owned by the DPDK
application.

The PMD can use libibverbs and libmlx5 to access the device firmware
or directly the hardware components.
There are different levels of objects and bypassing abilities
to get the best performance:

- Verbs is a complete high-level generic API
- Direct Verbs is a device-specific API
- DevX allows access to firmware objects
- Direct Rules manages flow steering at the low-level hardware layer

Enabling librte_net_mlx5 causes DPDK applications to be linked against
libibverbs.

Features
--------

- Multi arch support: x86_64, POWER8, ARMv8, i686.
- Multiple TX and RX queues.
- Support for scattered TX frames.
- Advanced support for scattered Rx frames with tunable buffer attributes.
- IPv4, IPv6, TCPv4, TCPv6, UDPv4 and UDPv6 RSS on any number of queues.
- RSS using different combinations of fields: L3 only, L4 only or both,
  and source only, destination only or both.
- Several RSS hash keys, one for each flow type.
- Default RSS operation with no hash key specification.
- Configurable RETA table.
- Link flow control (pause frame).
- Support for multiple MAC addresses.
- VLAN filtering.
- RX VLAN stripping.
- TX VLAN insertion.
- RX CRC stripping configuration.
- TX mbuf fast free offload.
- Promiscuous mode on PF and VF.
- Multicast promiscuous mode on PF and VF.
- Hardware checksum offloads.
- Flow director (RTE_FDIR_MODE_PERFECT, RTE_FDIR_MODE_PERFECT_MAC_VLAN and
  RTE_ETH_FDIR_REJECT).
- Flow API, including :ref:`flow_isolated_mode`.
- Multiple process.
- KVM and VMware ESX SR-IOV modes are supported.
- RSS hash result is supported.
- Hardware TSO for generic IP or UDP tunnel, including VXLAN and GRE.
- Hardware checksum Tx offload for generic IP or UDP tunnel, including VXLAN and GRE.
- RX interrupts.
- Statistics query including Basic, Extended and per queue.
- Rx HW timestamp.
- Tunnel types: VXLAN, L3 VXLAN, VXLAN-GPE, GRE, MPLSoGRE, MPLSoUDP, IP-in-IP, Geneve, GTP.
- Tunnel HW offloads: packet type, inner/outer RSS, IP and UDP checksum verification.
- NIC HW offloads: encapsulation (vxlan, gre, mplsoudp, mplsogre), NAT, routing, TTL
  increment/decrement, count, drop, mark. For details please see :ref:`mlx5_offloads_support`.
- Flow insertion rate of more than a million flows per second, when using Direct Rules.
- Support for multiple rte_flow groups.
- Per packet no-inline hint flag to disable packet data copying into Tx descriptors.
- Hardware LRO.
- Hairpin.
- Multiple-thread flow insertion.
- Matching on GTP extension header with raw encap/decap action.
- Matching on Geneve TLV option header with raw encap/decap action.
- RSS support in sample action.
- E-Switch mirroring and jump.
- E-Switch mirroring and modify.
- 21844 flow priorities for ingress or egress flow groups greater than 0 and for any transfer
  flow group.

Limitations
-----------

- Windows support:

  On Windows, the features are limited:

  - Promiscuous mode is not supported
  - The following rules are supported:

    - IPv4/UDP with CVLAN filtering
    - Unicast MAC filtering

- For secondary process:

  - Forked secondary process not supported.
  - External memory unregistered in EAL memseg list cannot be used for DMA
    unless such memory has been registered by ``mlx5_mr_update_ext_mp()`` in
    the primary process and remapped to the same virtual address in the secondary
    process. If the external memory is registered by the primary process but has
    a different virtual address in the secondary process, unexpected errors may happen.

- When using Verbs flow engine (``dv_flow_en`` = 0), flow patterns without any
  specific VLAN will match VLAN packets as well:

  When VLAN spec is not specified in the pattern, the matching rule will be created with VLAN as a wild card.
  Meaning, the flow rule::

     flow create 0 ingress pattern eth / vlan vid is 3 / ipv4 / end ...

  will only match VLAN packets with vid=3, and the flow rule::

     flow create 0 ingress pattern eth / ipv4 / end ...

  will match any IPv4 packet (VLAN included).

- When using Verbs flow engine (``dv_flow_en`` = 0), multi-tagged (QinQ) match is not supported.

- When using DV flow engine (``dv_flow_en`` = 1), flow pattern with any VLAN specification will match only single-tagged packets unless the ETH item ``type`` field is 0x88A8 or the VLAN item ``has_more_vlan`` field is 1.
  The flow rule::

     flow create 0 ingress pattern eth / ipv4 / end ...

  will match any IPv4 packet.
  The flow rules::

     flow create 0 ingress pattern eth / vlan / end ...
     flow create 0 ingress pattern eth has_vlan is 1 / end ...
     flow create 0 ingress pattern eth type is 0x8100 / end ...

  will match single-tagged packets only, with any VLAN ID value.
  The flow rules::

     flow create 0 ingress pattern eth type is 0x88A8 / end ...
     flow create 0 ingress pattern eth / vlan has_more_vlan is 1 / end ...

  will match multi-tagged packets only, with any VLAN ID value.

- A flow pattern with 2 sequential VLAN items is not supported.

- VLAN pop offload command:

  - Flow rules that have a VLAN pop offload command as one of their actions and
    lack a match on VLAN as one of their items are not supported.
  - The command is not supported on egress traffic.

- VLAN push offload is not supported on ingress traffic.

- VLAN set PCP offload is not supported on existing headers.

- A multi segment packet must not have more segments than reported by dev_infos_get()
  in the tx_desc_lim.nb_seg_max field. This value depends on the maximal supported Tx
  descriptor size and ``txq_inline_min`` settings and may be from 2 (worst case forced by
  maximal inline settings) to 58.

- Flows with a VXLAN Network Identifier equal (or ending up equal) to 0 are not supported.

- L3 VXLAN and VXLAN-GPE tunnels cannot be supported together with MPLSoGRE and MPLSoUDP.

- Match on Geneve header supports the following fields only:

  - VNI
  - OAM
  - protocol type
  - options length

- Match on Geneve TLV option is supported on the following fields:

  - Class
  - Type
  - Length
  - Data

  Only one Class/Type/Length Geneve TLV option is supported per shared device.
  Class/Type/Length fields must be specified as well as masks.
  Class/Type/Length specified masks must be full.
  Matching Geneve TLV option without specifying data is not supported.
  Matching Geneve TLV option with ``data & mask == 0`` is not supported.

- VF: flow rules created on VF devices can only match traffic targeted at the
  configured MAC addresses (see ``rte_eth_dev_mac_addr_add()``).

- Match on GTP tunnel header item supports the following fields only:

  - v_pt_rsv_flags: E flag, S flag, PN flag
  - msg_type
  - teid

- Match on GTP extension header only for GTP PDU session container (next
  extension header type = 0x85).
- Match on GTP extension header is not supported in group 0.

- No Tx metadata goes to the E-Switch steering domain for flow group 0.
  Flows within group 0 that use the set metadata action are rejected by hardware.

.. note::

   MAC addresses not already present in the bridge table of the associated
   kernel network device will be added and cleaned up by the PMD when closing
   the device. In case of ungraceful program termination, some entries may
   remain present and should be removed manually by other means.

- Buffer split offload is supported with regular Rx burst routine only,
  no MPRQ feature or vectorized code can be engaged.

- When Multi-Packet Rx queue is configured (``mprq_en``), an Rx packet can be
  externally attached to a user-provided mbuf with EXT_ATTACHED_MBUF set in
  ol_flags. As the mempool for the external buffer is managed by the PMD, all the
  Rx mbufs must be freed before the device is closed. Otherwise, the mempool of
  the external buffers will be freed by the PMD and the application which still
  holds the external buffers may be corrupted.

- If Multi-Packet Rx queue is configured (``mprq_en``) and Rx CQE compression is
  enabled (``rxq_cqe_comp_en``) at the same time, RSS hash result is not fully
  supported. Some Rx packets may not have PKT_RX_RSS_HASH.

- IPv6 multicast messages are not supported on VMs while promiscuous mode
  and allmulticast mode are both set to off.
  To receive IPv6 multicast messages on a VM, explicitly set the relevant
  MAC address using the rte_eth_dev_mac_addr_add() API.
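
  For instance, a hedged sketch of doing this in testpmd; port 0 and the
  IPv6 all-nodes multicast MAC address are placeholders for the actual port
  and the group being joined::

     testpmd> mac_addr add 0 33:33:00:00:00:01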

- To support a mixed traffic pattern (some buffers from local host memory, some
  buffers from other devices) with high bandwidth, an mbuf flag is used.

  An application hints the PMD whether or not it should try to inline the
  given mbuf data buffer. The PMD makes a best effort to act upon this request.

  The hint flag ``RTE_PMD_MLX5_FINE_GRANULARITY_INLINE`` is dynamic,
  registered by the application with rte_mbuf_dynflag_register(). This flag is
  purely driver-specific and declared in the PMD specific header ``rte_pmd_mlx5.h``,
  which is intended to be used by the application.

  To query the supported specific flags at runtime,
  the function ``rte_pmd_mlx5_get_dyn_flag_names`` returns the array of
  currently (over present hardware and configuration) supported specific flags.
  The "not inline hint" feature operating flow is the following one:

  - application starts
  - probe the devices, ports are created
  - query the port capabilities
  - if port supporting the feature is found
  - register dynamic flag ``RTE_PMD_MLX5_FINE_GRANULARITY_INLINE``
  - application starts the ports
  - on ``dev_start()`` PMD checks whether the feature flag is registered and
    enables the feature support in datapath
  - application might set the registered flag bit in the ``ol_flags`` field
    of the mbuf being sent and the PMD will handle it appropriately.

- The number of descriptors in a Tx queue may be limited by data inline settings.
  Inline data require more descriptor building blocks and the overall block
  amount may exceed the hardware supported limits. The application should
  reduce the requested Tx size or adjust the data inline settings with the
  ``txq_inline_max`` and ``txq_inline_mpw`` devargs keys.

- To provide the packet send scheduling on mbuf timestamps the ``tx_pp``
  parameter should be specified.
  When the PMD sees the RTE_MBUF_DYNFLAG_TX_TIMESTAMP_NAME set on the packet
  being sent, it tries to synchronize the time of the packet appearing on
  the wire with the specified packet timestamp. If the specified timestamp
  is in the past, it is ignored; if it is in the distant future, it is capped
  to some reasonable value (in the range of seconds).
  These specific cases ("too late" and "distant future") can be optionally
  reported via device xstats to assist applications in detecting
  time-related problems.

  The timestamp upper "too-distant-future" limit
  at the moment of invoking the Tx burst routine
  can be estimated as the ``tx_pp`` option (in nanoseconds) multiplied by 2^23.
  Please note, for the testpmd txonly mode,
  the limit is deduced from the expression::

     (n_tx_descriptors / burst_size + 1) * inter_burst_gap

  No packet reordering according to timestamps is performed, neither within
  a packet burst nor between packets; it is entirely the application's
  responsibility to generate packets and their timestamps in the desired
  order. The timestamps can be put only in the first packet in the burst
  providing the entire burst scheduling.

- E-Switch decapsulation Flow:

  - can be applied to PF port only.
  - must specify VF port action (packet redirection from PF to VF).
  - optionally may specify tunnel inner source and destination MAC addresses.
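
  A minimal sketch of such a rule in testpmd flow syntax, assuming port 0 is
  the PF (E-Switch) port, port 1 is the VF representor, and VXLAN is the
  tunnel being decapsulated; the actual ports and tunnel type depend on the
  deployment::

     flow create 0 ingress transfer pattern eth / ipv4 / udp / vxlan / end actions vxlan_decap / port_id id 1 / end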

- E-Switch encapsulation Flow:

  - can be applied to VF ports only.
  - must specify PF port action (packet redirection from VF to PF).

- Raw encapsulation:

  - The input buffer, used as outer header, is not validated.

- Raw decapsulation:

  - The decapsulation is always done up to the outermost tunnel detected by the HW.
  - The input buffer, providing the removal size, is not validated.
  - The buffer size must match the length of the headers to be removed.

- ICMP (code/type/identifier/sequence number) / ICMP6 (code/type) matching, IP-in-IP and
  MPLS flow matching are all mutually exclusive features which cannot be supported together
  (see :ref:`mlx5_firmware_config`).

- LRO:

  - Requires DevX and DV flow to be enabled.
  - KEEP_CRC offload cannot be supported with LRO.
  - The first mbuf length, without head-room, must be big enough to include the
    TCP header (122B).
  - Rx queue with LRO offload enabled, receiving a non-LRO packet, can forward
    it with size limited to max LRO size, not to max RX packet length.
  - LRO can be used with outer header of TCP packets of the standard format:
    eth (with or without vlan) / ipv4 or ipv6 / tcp / payload

    Other TCP packets (e.g. with MPLS label) received on Rx queue with LRO enabled, will be received with bad checksum.
  - LRO packet aggregation is performed by HW only for packet sizes larger than
    ``lro_min_mss_size``. This value is reported on device start, when debug
    mode is enabled.

- CRC:

  - ``DEV_RX_OFFLOAD_KEEP_CRC`` cannot be supported with decapsulation
    for some NICs (such as ConnectX-6 Dx, ConnectX-6 Lx, and BlueField-2).
    The capability bit ``scatter_fcs_w_decap_disable`` shows NIC support.

- TX mbuf fast free:

  - The fast free offload assumes that all mbufs being sent originate from the
    same memory pool and that there are no extra references to the mbufs (the
    reference counter of each mbuf is equal to 1 on the tx_burst call). The latter
    means there should be no externally attached buffers in the mbufs. It is
    the application's responsibility to provide the correct mbufs if the fast
    free offload is engaged. The mlx5 PMD implicitly produces mbufs with
    externally attached buffers if the MPRQ option is enabled, hence the fast
    free offload is neither supported nor advertised if MPRQ is enabled.

- Sample flow:

  - Supports ``RTE_FLOW_ACTION_TYPE_SAMPLE`` action only within NIC Rx and
    E-Switch steering domain.
  - For E-Switch Sampling flow with sample ratio > 1, additional actions are not
    supported in the sample actions list.
  - For ConnectX-5, the ``RTE_FLOW_ACTION_TYPE_SAMPLE`` is typically used as
    the first action in the E-Switch egress flow if used with header modify or
    encapsulation actions.
  - For NIC Rx flow, supports ``MARK``, ``COUNT``, ``QUEUE``, ``RSS`` in the
    sample actions list.
  - For E-Switch mirroring flow, supports ``RAW ENCAP``, ``Port ID`` in the
    sample actions list.

- Modify Field flow:

  - Supports the 'set' operation only for the ``RTE_FLOW_ACTION_TYPE_MODIFY_FIELD`` action.
  - Modification of an arbitrary place in a packet via the special ``RTE_FLOW_FIELD_START`` Field ID is not supported.
  - Modification of the 802.1Q Tag, VXLAN Network or GENEVE Network IDs is not supported.
  - Encapsulation levels are not supported, can modify outermost header fields only.
  - Offsets must be 32-bit aligned, cannot skip past the boundary of a field.

- The IPv6 header item 'proto' field, indicating the next header protocol, should
  not be set as an extension header.
  In case the next header is an extension header, it should not be specified in
  the IPv6 header item 'proto' field.
  The last extension header item 'next header' field can specify the following
  header protocol type.

- Hairpin:

  - Hairpin between two ports supports only manual binding and explicit Tx flow mode.
    For single port hairpin, all combinations of auto/manual binding and
    explicit/implicit Tx flow mode are supported.
  - Hairpin in switchdev SR-IOV mode is not supported so far.

Statistics
----------

MLX5 supports various methods to report statistics:

Port statistics can be queried using ``rte_eth_stats_get()``. The received and sent statistics are collected in SW only and count the number of packets received or sent successfully by the PMD. The imissed counter is the number of packets that could not be delivered to SW because a queue was full. Packets not received due to congestion in the bus or on the NIC can be queried via the rx_discards_phy xstats counter.

Extended statistics can be queried using ``rte_eth_xstats_get()``. The extended statistics expose a wider set of counters counted by the device. The extended port statistics count the number of packets received or sent successfully by the port. As Mellanox NICs are using the :ref:`Bifurcated Linux Driver <linux_gsg_linux_drivers>`, those counters also count packets received or sent by the Linux kernel. The counters with the ``_phy`` suffix count the total events on the physical port and are therefore not valid for VF.

Finally, per-flow statistics can be queried using ``rte_flow_query`` when attaching a count action to a specific flow. The flow counter counts the number of packets received successfully by the port that match the specific flow.

Configuration
-------------

Compilation options
~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

- ``shared`` (default): the PMD depends on some .so files.

- ``dlopen``: Split the dependencies glue in a separate library
  loaded when needed by dlopen.
  It makes dependencies on libibverbs and libmlx5 optional,
  and has no performance impact.

- ``static``: Embed static flavor of the dependencies libibverbs and libmlx5
  in the PMD shared library or the executable static binary.

Environment variables
~~~~~~~~~~~~~~~~~~~~~

- ``MLX5_GLUE_PATH``

  A list of directories in which to search for the rdma-core "glue" plug-in,
  separated by colons or semi-colons.

- ``MLX5_SHUT_UP_BF``

  Configures HW Tx doorbell register as IO-mapped.

  By default, the HW Tx doorbell is configured as a write-combining register.
  The register would be flushed to HW usually when the write-combining buffer
  becomes full, but it depends on CPU design.

  Except for vectorized Tx burst routines, a write memory barrier is enforced
  after updating the register so that the update can be immediately visible to
  HW.

  When vectorized Tx burst is called, the barrier is set only if the burst size
  is not aligned to MLX5_VPMD_TX_MAX_BURST. However, setting this environment
  variable will bring better latency even though the maximum throughput can
  slightly decline.
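
  A hedged example of setting this variable when launching an application;
  the application name, PCI address and queue counts are placeholders::

     MLX5_SHUT_UP_BF=1 dpdk-testpmd -a <PCI_BDF> -- --rxq=2 --txq=2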

Run-time configuration
~~~~~~~~~~~~~~~~~~~~~~

- librte_net_mlx5 brings kernel network interfaces up during initialization
  because it is affected by their state. Forcing them down prevents packet
  reception.

- **ethtool** operations on related kernel interfaces also affect the PMD.

Run as non-root
^^^^^^^^^^^^^^^

In order to run as a non-root user,
some capabilities must be granted to the application::

   setcap cap_sys_admin,cap_net_admin,cap_net_raw,cap_ipc_lock+ep <dpdk-app>

Below are the reasons for needing each capability:

``cap_sys_admin``
   When using physical addresses (PA mode), with Linux >= 4.0,
   for access to ``/proc/self/pagemap``.

``cap_net_admin``
   For device configuration.

``cap_net_raw``
   For raw ethernet queue allocation through kernel driver.

``cap_ipc_lock``
   For DMA memory pinning.

Driver options
^^^^^^^^^^^^^^

- ``rxq_cqe_comp_en`` parameter [int]

  A nonzero value enables the compression of CQE on RX side. This feature
  saves PCI bandwidth and improves performance. Enabled by default.
  Different compression formats are supported in order to achieve the best
  performance for different traffic patterns. The default format depends on
  Multi-Packet Rx queue configuration: Hash RSS format is used in case
  MPRQ is disabled, Checksum format is used in case MPRQ is enabled.

  Specifying 2 as a ``rxq_cqe_comp_en`` value selects Flow Tag format for
  better compression rate in case of RTE Flow Mark traffic.
  Specifying 3 as a ``rxq_cqe_comp_en`` value selects Checksum format.
  Specifying 4 as a ``rxq_cqe_comp_en`` value selects L3/L4 Header format for
  better compression rate in case of mixed TCP/UDP and IPv4/IPv6 traffic.
  CQE compression format selection requires DevX to be enabled. If there is
  no DevX enabled/supported the value is reset to 1 by default.

  Supported on:

  - x86_64 with ConnectX-4, ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
    ConnectX-6 Lx, BlueField and BlueField-2.
  - POWER9 and ARMv8 with ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
    ConnectX-6 Lx, BlueField and BlueField-2.

- ``rxq_pkt_pad_en`` parameter [int]

  A nonzero value enables padding Rx packets to the size of a cacheline on the PCI
  transaction. This feature would waste PCI bandwidth but could improve
  performance by avoiding partial cacheline writes which may cause costly
  read-modify-write memory transactions on some architectures. Disabled by
  default.

  Supported on:

  - x86_64 with ConnectX-4, ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
    ConnectX-6 Lx, BlueField and BlueField-2.
  - POWER8 and ARMv8 with ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx,
    ConnectX-6 Lx, BlueField and BlueField-2.

- ``mprq_en`` parameter [int]

  A nonzero value enables configuring Multi-Packet Rx queues. An Rx queue is
  configured as Multi-Packet RQ if the total number of Rx queues is
  ``rxqs_min_mprq`` or more. Disabled by default.

  Multi-Packet Rx Queue (MPRQ a.k.a Striding RQ) can further save PCIe bandwidth
  by posting a single large buffer for multiple packets. Instead of posting one
  buffer per packet, one large buffer is posted in order to receive multiple
  packets on the buffer. A MPRQ buffer consists of multiple fixed-size strides
  and each stride receives one packet. MPRQ can improve throughput for
  small-packet traffic.

  When MPRQ is enabled, max_rx_pkt_len can be larger than the size of the
  user-provided mbuf even if DEV_RX_OFFLOAD_SCATTER isn't enabled. The PMD will
  configure a stride size large enough to accommodate max_rx_pkt_len as long as the
  device allows. Note that this can waste system memory compared to enabling Rx
  scatter and multi-segment packets.
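
  A hedged sketch of enabling MPRQ through devargs on the testpmd command line;
  the PCI address and queue counts are placeholders::

     dpdk-testpmd -a <PCI_BDF>,mprq_en=1,rxqs_min_mprq=4 -- --rxq=4 --txq=4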

- ``mprq_log_stride_num`` parameter [int]

  Log 2 of the number of strides for Multi-Packet Rx queue. Configuring more
  strides can reduce PCIe traffic further. If the configured value is not in the
  range of device capability, the default value will be set with a warning
  message. The default value is 4 which is 16 strides per buffer, valid only
  if ``mprq_en`` is set.

  The size of Rx queue should be bigger than the number of strides.

- ``mprq_log_stride_size`` parameter [int]

  Log 2 of the size of a stride for Multi-Packet Rx queue. Configuring a smaller
  stride size can save some memory and reduce probability of a depletion of all
  available strides due to unreleased packets by an application. If the configured
  value is not in the range of device capability, the default value will be set
  with a warning message. The default value is 11 which is 2048 bytes per
  stride, valid only if ``mprq_en`` is set. With ``mprq_log_stride_size`` set
  it is possible for a packet to span across multiple strides. This mode allows
  support of jumbo frames (9K) with MPRQ. The memcopy of some packets (or part
  of a packet if Rx scatter is configured) may be required in case there is no
  space left for a head room at the end of a stride which incurs some
  performance penalty.

- ``mprq_max_memcpy_len`` parameter [int]

  The maximum length of packet to memcpy in case of Multi-Packet Rx queue. An Rx
  packet is mem-copied to a user-provided mbuf if the size of the Rx packet is less
  than or equal to this parameter. Otherwise, the PMD will attach the Rx packet to
  the mbuf by external buffer attachment - ``rte_pktmbuf_attach_extbuf()``.
  A mempool for external buffers will be allocated and managed by the PMD. If an Rx
  packet is externally attached, the ol_flags field of the mbuf will have
  EXT_ATTACHED_MBUF and this flag must be preserved. ``RTE_MBUF_HAS_EXTBUF()``
  checks the flag. The default value is 128, valid only if ``mprq_en`` is set.

- ``rxqs_min_mprq`` parameter [int]

  Configure Rx queues as Multi-Packet RQ if the total number of Rx queues is
  greater than or equal to this value. The default value is 12, valid only if
  ``mprq_en`` is set.

- ``txq_inline`` parameter [int]

  Amount of data to be inlined during TX operations. This parameter is
  deprecated and converted to the new parameter ``txq_inline_max`` providing
  partial compatibility.

- ``txqs_min_inline`` parameter [int]

  Enable inline data send only when the number of TX queues is greater than or
  equal to this value.

  This option should be used in combination with ``txq_inline_max`` and
  ``txq_inline_mpw`` below and does not affect ``txq_inline_min`` settings above.

  If this option is not specified the default value 16 is used for BlueField
  and 8 for other platforms.

  Data inlining consumes CPU cycles, so this option is intended to enable
  inline data automatically when there are enough Tx queues, which means there
  are enough CPU cores, PCI bandwidth is getting more critical and the CPU
  is not supposed to be the bottleneck anymore.

  Copying data into the WQE improves latency and can improve PPS performance
  when PCI back pressure is detected and may be useful for scenarios involving
  heavy traffic on many queues.

  Because additional software logic is necessary to handle this mode, this
  option should be used with care, as it may lower performance when back
  pressure is not expected.

  If inline data are enabled it may affect the maximal size of the Tx queue in
  descriptors because the inline data increase the descriptor size and the
  queue size limits supported by hardware may be exceeded.

- ``txq_inline_min`` parameter [int]

  Minimal amount of data to be inlined into WQE during Tx operations. NICs
  may require this minimal data amount to operate correctly. The exact value
  may depend on NIC operation mode, requested offloads, etc. It is strongly
  recommended to omit this parameter and use the default values. Anyway,
  applications using this parameter should take into consideration that
  specifying an inconsistent value may prevent the NIC from sending packets.

  If the ``txq_inline_min`` key is present, the specified value (which may be aligned
  by the driver in order not to exceed the limits and provide better descriptor
  space utilization) will be used by the driver and it is guaranteed that the
  requested amount of data bytes is inlined into the WQE beside other inline
  settings. This key also may update the ``txq_inline_max`` value (default
  or specified explicitly in devargs) to reserve the space for inline data.

  If the ``txq_inline_min`` key is not present, the value may be queried by the
  driver from the NIC via DevX if this feature is available. If there is no DevX
  enabled/supported the value 18 (supposing L2 header including VLAN) is set
  for ConnectX-4 and ConnectX-4 Lx, and 0 is set by default for ConnectX-5
  and newer NICs. If a packet is shorter than the ``txq_inline_min`` value, the
  entire packet is inlined.

  For ConnectX-4 NICs, the driver does not allow specifying a value below 18
  (minimal L2 header, including VLAN); an error will be raised.

  For ConnectX-4 Lx NICs, it is allowed to specify values below 18, but
  it is not recommended and may prevent the NIC from sending packets over
  some configurations.

  Please note, this minimal data inlining disengages the eMPW feature (Enhanced
  Multi-Packet Write), because the latter does not support partial packet inlining.
  This is not very critical since minimal data inlining is mostly required
  by ConnectX-4 and ConnectX-4 Lx, and these NICs do not support the eMPW feature.

- ``txq_inline_max`` parameter [int]

  Specifies the maximal packet length to be completely inlined into the WQE
  Ethernet Segment for the ordinary SEND method. If a packet is larger than the
  specified value, the packet data won't be copied by the driver at all, the data
  buffer is addressed with a pointer. If the packet length is less or equal, all
  packet data will be copied into the WQE. This may improve PCI bandwidth
  utilization for short packets significantly but requires extra CPU cycles.

  The data inline feature is controlled by the number of Tx queues: if the number
  of Tx queues is larger than the ``txqs_min_inline`` key parameter, the inline
  feature is engaged; if there are not enough Tx queues (which means not enough
  CPU cores and CPU resources are scarce), data inline is not performed by the driver.
  Assigning ``txqs_min_inline`` with zero always enables the data inline.

  The default ``txq_inline_max`` value is 290. The specified value may be adjusted
  by the driver in order not to exceed the limit (930 bytes) and to provide better
  WQE space filling without gaps, the adjustment is reflected in the debug log.
  Also, the default value (290) may be decreased at runtime if a large transmit
  queue size is requested and the hardware does not support enough descriptors;
  in this case a warning is emitted. If the ``txq_inline_max`` key is
  specified and the requested inline settings can not be satisfied, an error
  will be raised.

- ``txq_inline_mpw`` parameter [int]

  Specifies the maximal packet length to be completely inlined into the WQE for
  the Enhanced MPW method. If a packet is larger than the specified value, the
  packet data won't be copied, and the data buffer is addressed with a pointer.
  If the packet length is less or equal, all packet data will be copied into the
  WQE. This may improve PCI bandwidth utilization for short packets significantly
  but requires extra CPU cycles.

  The data inline feature is controlled by the number of TX queues: if the number
  of Tx queues is larger than the ``txqs_min_inline`` key parameter, the inline
  feature is engaged; if there are not enough Tx queues (which means not enough
  CPU cores and CPU resources are scarce), data inline is not performed by the driver.
  Assigning ``txqs_min_inline`` with zero always enables the data inline.

  The default ``txq_inline_mpw`` value is 268. The specified value may be adjusted
  by the driver in order not to exceed the limit (930 bytes) and to provide better
  WQE space filling without gaps, the adjustment is reflected in the debug log.
  Because multiple packets may be included in the same WQE with the Enhanced Multi
  Packet Write method and the overall WQE size is limited, it is not recommended to
  specify large values for ``txq_inline_mpw``. Also, the default value (268)
  may be decreased at runtime if a large transmit queue size is requested
  and the hardware does not support enough descriptors; in this case a warning
  is emitted. If the ``txq_inline_mpw`` key is specified and the requested inline
  settings can not be satisfied, an error will be raised.

- ``txqs_max_vec`` parameter [int]

  Enable vectorized Tx only when the number of TX queues is less than or
  equal to this value. This parameter is deprecated and ignored, kept
  for compatibility so as not to prevent the driver from probing.

- ``txq_mpw_hdr_dseg_en`` parameter [int]

  A nonzero value enables including two pointers in the first block of the TX
  descriptor. The parameter is deprecated and ignored, kept for compatibility.

- ``txq_max_inline_len`` parameter [int]

  Maximum size of packet to be inlined. This limits the size of packet to
  be inlined. If the size of a packet is larger than the configured value, the
  packet isn't inlined even though there's enough space remaining in the
  descriptor. Instead, the packet is referenced by a pointer. This parameter
  is deprecated and converted directly to ``txq_inline_mpw`` providing full
  compatibility. Valid only if the eMPW feature is engaged.
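
  For reference, a hedged sketch combining the non-deprecated inline keys
  described above on the testpmd command line; the values are illustrative
  only and the PCI address is a placeholder::

     dpdk-testpmd -a <PCI_BDF>,txqs_min_inline=4,txq_inline_max=128,txq_inline_mpw=64 -- --txq=8 --rxq=8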

- ``txq_mpw_en`` parameter [int]

  A nonzero value enables Enhanced Multi-Packet Write (eMPW) for ConnectX-5,
  ConnectX-6, ConnectX-6 Dx, ConnectX-6 Lx, BlueField, BlueField-2.
  eMPW allows the Tx burst function to pack up multiple packets
  in a single descriptor session in order to save PCI bandwidth
  and improve performance at the cost of a slightly higher CPU usage.
  When ``txq_inline_mpw`` is set along with ``txq_mpw_en``,
  the Tx burst function copies entire packet data on to the Tx descriptor
  instead of including a pointer to the packet.

  The Enhanced Multi-Packet Write feature is enabled by default if the NIC
  supports it, and can be disabled by explicitly specifying 0 for the
  ``txq_mpw_en`` option. Also, if minimal data inlining is requested by a
  non-zero ``txq_inline_min`` option or reported by the NIC, the eMPW feature
  is disengaged.

- ``tx_db_nc`` parameter [int]

  The rdma core library can map the doorbell register in two ways, depending on the
  environment variable "MLX5_SHUT_UP_BF":

  - As regular cached memory (usually with write combining attribute), if the
    variable is either missing or set to zero.
  - As non-cached memory, if the variable is present and set to a non-"0" value.

  The type of mapping may slightly affect the Tx performance, the optimal choice
  strongly depends on the host architecture and should be deduced practically.

  If ``tx_db_nc`` is set to zero, the doorbell is forced to be mapped to regular
  memory (with write combining), the PMD will perform the extra write memory barrier
  after writing to the doorbell, it might increase the needed CPU clocks per packet
  to send, but latency might be improved.

  If ``tx_db_nc`` is set to one, the doorbell is forced to be mapped to non-cached
  memory, the PMD will not perform the extra write memory barrier
  after writing to the doorbell, on some architectures it might improve the
  performance.

  If ``tx_db_nc`` is set to two, the doorbell is forced to be mapped to regular
  memory, the PMD will use heuristics to decide whether a write memory barrier
  should be performed. For bursts with a size that is a multiple of the recommended
  one (64 pkts) it is assumed the next burst is coming and there is no need to issue
  the extra memory barrier (it is supposed to be issued in the next coming burst,
  at least after descriptor writing). It might increase latency (on some hosts,
  until the next packets are transmitted) and should be used with care.

  If ``tx_db_nc`` is omitted or set to zero, the preset (if any) environment
  variable "MLX5_SHUT_UP_BF" value is used. If there is no "MLX5_SHUT_UP_BF",
  the default ``tx_db_nc`` value is zero for ARM64 hosts and one for others.

- ``tx_pp`` parameter [int]

  If a nonzero value is specified the driver creates all the necessary internal
  objects to provide accurate packet send scheduling on mbuf timestamps.
  A positive value specifies the scheduling granularity in nanoseconds,
  packet send will be accurate up to the specified granularity. The allowed range is
  from 500 to 1 million nanoseconds. A negative value specifies the granularity by
  its absolute value and engages a special test mode to check the scheduling rate.
  By default (if ``tx_pp`` is not specified) the send scheduling on timestamps
  feature is disabled.
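
  A hedged sketch of enabling send scheduling with a 500 ns granularity on the
  testpmd command line, assuming the NIC and firmware support this feature; the
  PCI address is a placeholder::

     dpdk-testpmd -a <PCI_BDF>,tx_pp=500 -- --txq=2 --rxq=2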

- ``tx_skew`` parameter [int]

  The parameter adjusts the send packet scheduling on timestamps and represents
  the average delay between the beginning of the transmitting descriptor processing
  by the hardware and the appearance of the actual packet data on the wire. The value
  should be provided in nanoseconds and is valid only if the ``tx_pp`` parameter is
  specified. The default value is zero.

- ``tx_vec_en`` parameter [int]

  A nonzero value enables Tx vector on ConnectX-5, ConnectX-6, ConnectX-6 Dx,
  ConnectX-6 Lx, BlueField and BlueField-2 NICs
  if the number of global Tx queues on the port is less than ``txqs_max_vec``.
  The parameter is deprecated and ignored.

- ``rx_vec_en`` parameter [int]

  A nonzero value enables Rx vector if the port is not configured in
  multi-segment mode, otherwise this parameter is ignored.

  Enabled by default.

- ``vf_nl_en`` parameter [int]

  A nonzero value enables Netlink requests from the VF to add/remove MAC
  addresses and/or enable/disable promiscuous/all multicast on the Netdevice.
  Otherwise the relevant configuration must be run with Linux iproute2 tools.
  This is a prerequisite to receive this kind of traffic.

  Enabled by default, valid only on VF devices, ignored otherwise.

- ``l3_vxlan_en`` parameter [int]

  A nonzero value allows L3 VXLAN and VXLAN-GPE flow creation. To enable
  L3 VXLAN or VXLAN-GPE, users have to configure the firmware and enable this
  parameter. This is a prerequisite to receive this kind of traffic.

  Disabled by default.

- ``dv_xmeta_en`` parameter [int]

  A nonzero value enables extensive flow metadata support if the device is
  capable and the driver supports it. This can enable extensive support of
  the ``MARK`` and ``META`` items of ``rte_flow``. The newly introduced
  ``SET_TAG`` and ``SET_META`` actions do not depend on ``dv_xmeta_en``.

  There are some possible configurations, depending on the parameter value:

  - 0, this is the default value, defines the legacy mode, the ``MARK`` and
    ``META`` related actions and items operate only within NIC Tx and
    NIC Rx steering domains, no ``MARK`` and ``META`` information crosses
    the domain boundaries. The ``MARK`` item is 24 bits wide, the ``META``
    item is 32 bits wide and match is supported on egress only.

  - 1, this engages extensive metadata mode, the ``MARK`` and ``META``
    related actions and items operate within all supported steering domains,
    including FDB, ``MARK`` and ``META`` information may cross the domain
    boundaries. The ``MARK`` item is 24 bits wide, the ``META`` item width
    depends on kernel and firmware configurations and might be 0, 16 or
    32 bits. Within the NIC Tx domain the ``META`` data width is 32 bits for
    compatibility, the actual width of data transferred to the FDB domain
    depends on kernel configuration and may vary. The actual supported
    width can be retrieved at runtime by a series of rte_flow_validate()
    trials.

  - 2, this engages extensive metadata mode, the ``MARK`` and ``META``
    related actions and items operate within all supported steering domains,
    including FDB, ``MARK`` and ``META`` information may cross the domain
    boundaries. The ``META`` item is 32 bits wide, the ``MARK`` item width
    depends on kernel and firmware configurations and might be 0, 16 or
    24 bits. The actual supported width can be retrieved at runtime by a
    series of rte_flow_validate() trials.

  - 3, this engages tunnel offload mode. In E-Switch configuration, that
    mode implicitly activates ``dv_xmeta_en=1``.

  +------+-----------+-----------+-------------+-------------+
  | Mode | ``MARK``  | ``META``  | ``META`` Tx | FDB/Through |
  +======+===========+===========+=============+=============+
  | 0    | 24 bits   | 32 bits   | 32 bits     | no          |
  +------+-----------+-----------+-------------+-------------+
  | 1    | 24 bits   | vary 0-32 | 32 bits     | yes         |
  +------+-----------+-----------+-------------+-------------+
  | 2    | vary 0-24 | 32 bits   | 32 bits     | yes         |
  +------+-----------+-----------+-------------+-------------+

  If there is no E-Switch configuration the ``dv_xmeta_en`` parameter is
  ignored and the device is configured to operate in legacy mode (0).

  Disabled by default (set to 0).

  The Direct Verbs/Rules (engaged with ``dv_flow_en`` = 1) supports all
  of the extensive metadata features. The legacy Verbs supports FLAG and
  MARK metadata actions over the NIC Rx steering domain only.

  Setting the META value to zero in a flow action means there is no item provided
  and the receiving datapath will not report in mbufs that the metadata is present.
  Setting the MARK value to zero in a flow action means the zero FDIR ID value
  will be reported on packet receiving.

  For the MARK action the last 16 values in the full range are reserved for
  internal PMD purposes (to emulate the FLAG action). The valid range for the
  MARK action values is 0-0xFFEF for the 16-bit mode and 0-0xFFFFEF
  for the 24-bit mode, the flows with the MARK action value outside
  the specified range will be rejected.
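
  A hedged sketch of selecting the extensive metadata mode through devargs;
  the PCI address is a placeholder::

     dpdk-testpmd -a <PCI_BDF>,dv_flow_en=1,dv_xmeta_en=1 -- --rxq=2 --txq=2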

- ``dv_flow_en`` parameter [int]

  A nonzero value enables the DV flow steering assuming it is supported
  by the driver (the RDMA Core library version is rdma-core-24.0 or higher).

  Enabled by default if supported.

- ``dv_esw_en`` parameter [int]

  A nonzero value enables E-Switch using Direct Rules.

  Enabled by default if supported.

- ``lacp_by_user`` parameter [int]

  A nonzero value enables the control of LACP traffic by the user application.
  When a bond exists in the driver, by default it should be managed by the
  kernel and therefore LACP traffic should be steered to the kernel.
  If this devarg is set to 1, it allows the user to manage the bond
  and LACP traffic is not steered to the kernel.

  Disabled by default (set to 0).

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in the MR (Memory Region) lookup table on the
  datapath is minimized and it benefits performance. On the other hand, it worsens
  memory utilization because registered memory is pinned by the kernel driver.
  Even if a page in the extended chunk is freed, it doesn't become reusable until
  the entire memory is freed.

  Enabled by default.

- ``representor`` parameter [list]

  This parameter can be used to instantiate DPDK Ethernet devices from
  existing port (PF, VF or SF) representors configured on the device.

  It is a standard parameter whose format is described in
  :ref:`ethernet_device_standard_device_arguments`.

  For instance, to probe VF port representors 0 through 2::

     <PCI_BDF>,representor=vf[0-2]

  To probe SF port representors 0 through 2::

     <PCI_BDF>,representor=sf[0-2]

  To probe VF port representors 0 through 2 on both PFs of a bonding device::

     <Primary_PCI_BDF>,representor=pf[0,1]vf[0-2]

- ``max_dump_files_num`` parameter [int]

  The maximum number of files per PMD entity that may be created for debug information.
  The files will be created in the /var/log directory or in the current directory.

  Set to 128 by default.

- ``lro_timeout_usec`` parameter [int]

  The maximum allowed duration of an LRO session, in micro-seconds.
  The PMD will set the nearest value supported by HW, which is not bigger than
  the input ``lro_timeout_usec`` value.
  If this parameter is not specified, by default the PMD will set
  the smallest value supported by HW.

- ``hp_buf_log_sz`` parameter [int]

  The total data buffer size of a hairpin queue (logarithmic form), in bytes.
  The PMD will set the data buffer size to 2 ** ``hp_buf_log_sz``, both for RX & TX.
  The valid range is specified by the firmware and initialization will fail
  if the value is out of that range.
  The range of the value is from 11 to 19 right now, and the supported frame
  size of a single packet for hairpin is from 512B to 128KB. It might change if a
  different firmware release is being used. Using a small value reduces memory
  consumption but does not work with large frames. If the value is
  too large, the memory consumption will be high and some potential performance
  degradation will be introduced.
  By default, the PMD will set this value to 16, which means that 9KB jumbo
  frames will be supported.

- ``reclaim_mem_mode`` parameter [int]

  Caching some resources on flow destroy makes flow re-creation more efficient,
  while some systems may require that all the resources be reclaimed after
  the flow is destroyed.
  The parameter ``reclaim_mem_mode`` provides the option for the user to configure
  whether the resource cache is needed or not.

  There are three options to choose from:

  - 0. The flow resources will be cached as usual, which helps the flow
    insertion rate.

  - 1. It will only enable the DPDK PMD level resources reclaim.

  - 2. Both DPDK PMD level and rdma-core low level will be configured as
    reclaimed mode.

  By default, the PMD will set this value to 0.

- ``sys_mem_en`` parameter [int]

  A non-zero value enables the PMD memory management allocating memory
  from the system by default, without the explicit rte memory flag.

  By default, the PMD will set this value to 0.

- ``decap_en`` parameter [int]

  Some devices do not support FCS (frame checksum) scattering for
  tunnel-decapsulated packets.
  If set to 0, this option forces the FCS feature and rejects tunnel
  decapsulation in the flow engine for such devices.

  By default, the PMD will set this value to 1.
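
Driver options can be combined freely in the device arguments. A hedged
example of probing a device with a few of the options described above,
assuming SR-IOV and representors are already configured; the PCI address
and values are placeholders::

   dpdk-testpmd -a <PCI_BDF>,representor=vf[0-1],dv_flow_en=1,dv_esw_en=1,l3_vxlan_en=1 -- -i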

.. _mlx5_firmware_config:

Firmware configuration
~~~~~~~~~~~~~~~~~~~~~~

Firmware features can be configured as key/value pairs.

The command to set a value is::

   mlxconfig -d <device> set <key>=<value>

The command to query a value is::

   mlxconfig -d <device> query | grep <key>

The device name for the command ``mlxconfig`` can be either the PCI address,
or the mst device name found with::

   mst status

Below are some firmware configurations listed.

- link type::

   LINK_TYPE_P1
   LINK_TYPE_P2
   value: 1=Infiniband 2=Ethernet 3=VPI(auto-sense)

- enable SR-IOV::

   SRIOV_EN=1

- maximum number of SR-IOV virtual functions::

   NUM_OF_VFS=<max>

- enable DevX (required by Direct Rules and other features)::

   UCTX_EN=1

- aggressive CQE zipping::

   CQE_COMPRESSION=1

- L3 VXLAN and VXLAN-GPE destination UDP port::

   IP_OVER_VXLAN_EN=1
   IP_OVER_VXLAN_PORT=<udp dport>

- enable VXLAN-GPE tunnel flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0
   or
   FLEX_PARSER_PROFILE_ENABLE=2

- enable IP-in-IP tunnel flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0

- enable MPLS flow matching::

   FLEX_PARSER_PROFILE_ENABLE=1

- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::

   FLEX_PARSER_PROFILE_ENABLE=2

- enable Geneve flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0
   or
   FLEX_PARSER_PROFILE_ENABLE=1

- enable Geneve TLV option flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0

- enable GTP flow matching::

   FLEX_PARSER_PROFILE_ENABLE=3

- enable eCPRI flow matching::

   FLEX_PARSER_PROFILE_ENABLE=4
   PROG_PARSE_GRAPH=1

Linux Prerequisites
-------------------

This driver relies on external libraries and kernel drivers for resource
allocation and initialization. The following dependencies are not part of
DPDK and must be installed separately:

- **libibverbs**

  User space Verbs framework used by librte_net_mlx5. This library provides
  a generic interface between the kernel and low-level user space drivers
  such as libmlx5.

  It allows slow and privileged operations (context initialization, hardware
  resources allocations) to be managed by the kernel and fast operations to
  never leave user space.

- **libmlx5**

  Low-level user space driver library for Mellanox
  ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices, it is automatically loaded
  by libibverbs.

  This library basically implements send/receive calls to the hardware
  queues.

- **Kernel modules**

  They provide the kernel-side Verbs API and low level device drivers that
  manage actual hardware initialization and resources sharing with user
  space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - mlx5_core: hardware driver managing Mellanox
    ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices and related Ethernet kernel
    network devices.
  - mlx5_ib: InfiniBand device driver.
  - ib_uverbs: user space driver for Verbs (entry point for libibverbs).

- **Firmware update**

  Mellanox OFED/EN releases include firmware updates for
  ConnectX-4/ConnectX-5/ConnectX-6/BlueField adapters.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

.. note::

   Both libraries are BSD and GPL licensed. Linux kernel modules are GPL
   licensed.

Installation
~~~~~~~~~~~~

Either RDMA Core library with a recent enough Linux kernel release
(recommended) or Mellanox OFED/EN, which provides compatibility with older
releases.

RDMA Core with Linux Kernel
^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Minimal kernel version : v4.14 or the most recent 4.14-rc (see `Linux installation documentation`_)
- Minimal rdma-core version: v15+ commit 0c5f5765213a ("Merge pull request #227 from yishaih/tm")
  (see `RDMA Core installation documentation`_)
- When building for i686 use:

  - rdma-core version 18.0 or above built with 32bit support.
  - Kernel version 4.14.41 or above.

- Starting with rdma-core v21, static libraries can be built::

   cd build
   CFLAGS=-fPIC cmake -DIN_PLACE=1 -DENABLE_STATIC=1 -GNinja ..
   ninja

.. _`Linux installation documentation`: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/plain/Documentation/admin-guide/README.rst
.. _`RDMA Core installation documentation`: https://raw.githubusercontent.com/linux-rdma/rdma-core/master/README.md


Mellanox OFED/EN
^^^^^^^^^^^^^^^^

- Mellanox OFED version: **4.5** and above /
  Mellanox EN version: **4.5** and above
- firmware version:

  - ConnectX-4: **12.21.1000** and above.
  - ConnectX-4 Lx: **14.21.1000** and above.
  - ConnectX-5: **16.21.1000** and above.
  - ConnectX-5 Ex: **16.21.1000** and above.
  - ConnectX-6: **20.27.0090** and above.
  - ConnectX-6 Dx: **22.27.0090** and above.
  - BlueField: **18.25.1010** and above.

While these libraries and kernel modules are available on OpenFabrics
Alliance's `website <https://www.openfabrics.org/>`__ and provided by package
managers on most distributions, this PMD requires Ethernet extensions that
may not be supported at the moment (this is a work in progress).

`Mellanox OFED
<http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux>`__ and
`Mellanox EN
<http://www.mellanox.com/page/products_dyn?product_family=27&mtag=linux>`__
include the necessary support and should be used in the meantime. For DPDK,
only libibverbs, libmlx5, mlnx-ofed-kernel packages and firmware updates are
required from that distribution.

.. note::

   Several versions of Mellanox OFED/EN are available. Installing the version
   this DPDK release was developed and tested against is strongly
   recommended. Please check the `linux prerequisites`_.

Windows Prerequisites
---------------------

This driver relies on external libraries and kernel drivers for resource
allocation and initialization. The dependencies in the following sub-sections
are not part of DPDK, and must be installed separately.

Compilation Prerequisites
~~~~~~~~~~~~~~~~~~~~~~~~~

DevX SDK installation
^^^^^^^^^^^^^^^^^^^^^

The DevX SDK must be installed on the machine building the Windows PMD.
Additional information can be found at
`How to Integrate Windows DevX in Your Development Environment
<https://docs.mellanox.com/display/winof2v250/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`__.

Runtime Prerequisites
~~~~~~~~~~~~~~~~~~~~~

WinOF2 version 2.60 or higher must be installed on the machine.

WinOF2 installation
^^^^^^^^^^^^^^^^^^^

The driver can be downloaded from the following site:
`WINOF2
<https://www.mellanox.com/products/adapter-software/ethernet/windows/winof-2>`__

DevX Enablement
^^^^^^^^^^^^^^^

DevX for Windows must be enabled in the Windows registry.
The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
Additional information can be found in the WinOF2 user manual.

Supported NICs
--------------

The following Mellanox device families are supported by the same mlx5 driver:

  - ConnectX-4
  - ConnectX-4 Lx
  - ConnectX-5
  - ConnectX-5 Ex
  - ConnectX-6
  - ConnectX-6 Dx
  - ConnectX-6 Lx
  - BlueField
  - BlueField-2

Below are detailed device names:

* Mellanox\ |reg| ConnectX\ |reg|-4 10G MCX4111A-XCAT (1x10G)
* Mellanox\ |reg| ConnectX\ |reg|-4 10G MCX412A-XCAT (2x10G)
* Mellanox\ |reg| ConnectX\ |reg|-4 25G MCX4111A-ACAT (1x25G)
* Mellanox\ |reg| ConnectX\ |reg|-4 25G MCX412A-ACAT (2x25G)
* Mellanox\ |reg| ConnectX\ |reg|-4 40G MCX413A-BCAT (1x40G)
* Mellanox\ |reg| ConnectX\ |reg|-4 40G MCX4131A-BCAT (1x40G)
* Mellanox\ |reg| ConnectX\ |reg|-4 40G MCX415A-BCAT (1x40G)
* Mellanox\ |reg| ConnectX\ |reg|-4 50G MCX413A-GCAT (1x50G)
* Mellanox\ |reg| ConnectX\ |reg|-4 50G MCX4131A-GCAT (1x50G)
* Mellanox\ |reg| ConnectX\ |reg|-4 50G MCX414A-BCAT (2x50G)
* Mellanox\ |reg| ConnectX\ |reg|-4 50G MCX415A-GCAT (1x50G)
* Mellanox\ |reg| ConnectX\ |reg|-4 50G MCX416A-BCAT (2x50G)
* Mellanox\ |reg| ConnectX\ |reg|-4 50G MCX416A-GCAT (2x50G)
* Mellanox\ |reg| ConnectX\ |reg|-4 50G MCX415A-CCAT (1x100G)
* Mellanox\ |reg| ConnectX\ |reg|-4 100G MCX416A-CCAT (2x100G)
* Mellanox\ |reg| ConnectX\ |reg|-4 Lx 10G MCX4111A-XCAT (1x10G)
* Mellanox\ |reg| ConnectX\ |reg|-4 Lx 10G MCX4121A-XCAT (2x10G)
* Mellanox\ |reg| ConnectX\ |reg|-4 Lx 25G MCX4111A-ACAT (1x25G)
* Mellanox\ |reg| ConnectX\ |reg|-4 Lx 25G MCX4121A-ACAT (2x25G)
* Mellanox\ |reg| ConnectX\ |reg|-4 Lx 40G MCX4131A-BCAT (1x40G)
* Mellanox\ |reg| ConnectX\ |reg|-5 100G MCX556A-ECAT (2x100G)
* Mellanox\ |reg| ConnectX\ |reg|-5 Ex EN 100G MCX516A-CDAT (2x100G)
* Mellanox\ |reg| ConnectX\ |reg|-6 200G MCX654106A-HCAT (2x200G)
* Mellanox\ |reg| ConnectX\ |reg|-6 Dx EN 100G MCX623106AN-CDAT (2x100G)
* Mellanox\ |reg| ConnectX\ |reg|-6 Dx EN 200G MCX623105AN-VDAT (1x200G)
* Mellanox\ |reg| ConnectX\ |reg|-6 Lx EN 25G MCX631102AN-ADAT (2x25G)

Quick Start Guide on OFED/EN
----------------------------

1. Download latest Mellanox OFED/EN. For more info check the `linux prerequisites`_.


2. Install the required libraries and kernel modules either by installing
   only the required set, or by installing the entire Mellanox OFED/EN::

      ./mlnxofedinstall --upstream-libs --dpdk

3. Verify the firmware is the correct one::

      ibv_devinfo
Verify that all port links are set to Ethernet::

      mlxconfig -d <mst device> query | grep LINK_TYPE
      LINK_TYPE_P1                        ETH(2)
      LINK_TYPE_P2                        ETH(2)

   Link types may have to be configured to Ethernet::

      mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3

      * LINK_TYPE_P1=<1|2|3>, 1=Infiniband, 2=Ethernet, 3=VPI (auto-sense)

   For hypervisors, verify SR-IOV is enabled on the NIC::

      mlxconfig -d <mst device> query | grep SRIOV_EN
      SRIOV_EN                    True(1)

   If needed, configure SR-IOV::

      mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
      mlxfwreset -d <mst device> reset

5. Restart the driver::

      /etc/init.d/openibd restart

   or::

      service openibd restart

   If the link type was changed, the firmware must be reset as well::

      mlxfwreset -d <mst device> reset

   For hypervisors, after the reset, write the required number of virtual
   functions for the PF to sysfs.

   To dynamically instantiate a given number of virtual functions (VFs)::

      echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs

6. Install DPDK and you are ready to go.
   See :doc:`compilation instructions <../linux_gsg/build_dpdk>`.

Enable switchdev mode
---------------------

Switchdev is an E-Switch mode that binds a representor to a VF or SF.
A representor is a DPDK port connected to a VF or SF in such a way that,
assuming there are no offload flows, each packet sent from the VF or SF is
received by the corresponding representor, while each packet sent to a
representor is received by the VF or SF.
This is very useful in SR-IOV mode: the first packet sent by a VF or SF is
received by the DPDK application, which decides whether the flow should be
offloaded to the E-Switch. Once the flow is offloaded, packets from the
VF or SF that match it are no longer received by the DPDK application.

1. Enable SR-IOV mode::

      mlxconfig -d <mst device> set SRIOV_EN=true

2. Configure the maximum number of VFs::

      mlxconfig -d <mst device> set NUM_OF_VFS=<num of vfs>

3. Reset the FW::

      mlxfwreset -d <mst device> reset

4. Configure the actual number of VFs::

      echo <num of vfs> > /sys/class/net/<net device>/device/sriov_numvfs

5. Unbind the device (it can be rebound after switchdev mode is set)::

      echo -n "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind

6. Enable switchdev mode::

      echo switchdev > /sys/class/net/<net device>/compat/devlink/mode

SubFunction representor support
-------------------------------

A SubFunction (SF) is a portion of the PCI device; an SF netdev has its own
dedicated queues (txq, rxq). An SF netdev supports E-Switch representation
offload similar to existing PF and VF representors. An SF shares PCI-level
resources with other SFs and/or with its parent PCI function.

1. Configure the SF feature::

      mlxconfig -d <mst device> set PF_BAR2_SIZE=<0/1/2/3> PF_BAR2_ENABLE=1

   Values of PF_BAR2_SIZE:

   - 0: 8 SFs
   - 1: 16 SFs
   - 2: 32 SFs
   - 3: 64 SFs

2. Reset the FW::

      mlxfwreset -d <mst device> reset

3. Enable switchdev mode::

      echo switchdev > /sys/class/net/<net device>/compat/devlink/mode
4. Create the SF::

      mlnx-sf -d <PCI_BDF> -a create

5. Probe the SF representor::

      testpmd> port attach <PCI_BDF>,representor=sf0,dv_flow_en=1

Performance tuning
------------------

1. Configure aggressive CQE Zipping for maximum performance::

      mlxconfig -d <mst device> s CQE_COMPRESSION=1

   To set it back to the default CQE Zipping mode, use::

      mlxconfig -d <mst device> s CQE_COMPRESSION=0

2. In case of virtualization:

   - Make sure the hypervisor kernel is 3.16 or newer.
   - Configure boot with ``iommu=pt``.
   - Use 1G huge pages.
   - Make sure to allocate a VM on huge pages.
   - Make sure to set CPU pinning.

3. Use a CPU from the NUMA node local to the PCIe adapter, for better
   performance. For VMs, verify that the right CPU and NUMA node are pinned
   according to the above. Run::

      lstopo-no-graphics

   to identify the NUMA node to which the PCIe adapter is connected.

4. If more than one adapter is used, and the root complex capabilities allow
   putting both adapters on the same NUMA node without PCI bandwidth
   degradation, it is recommended to locate both adapters on the same NUMA
   node, so that packets can be forwarded from one to the other without a
   NUMA performance penalty.

5. Disable pause frames::

      ethtool -A <netdev> rx off tx off

6. Verify that I/O non-posted prefetch is disabled by default. This can be
   checked via the BIOS configuration. Please contact your server provider
   for more information about the settings.

.. note::

   On some machines, depending on the machine integrator, it is beneficial
   to set the PCI max read request parameter to 1K. This can be done in the
   following way:

   To query the read request size, use::

      setpci -s <NIC PCI address> 68.w

   If the output differs from 3XXX, set it with::

      setpci -s <NIC PCI address> 68.w=3XXX

   The XXX value can differ between systems. Make sure to configure it
   according to the setpci output.

7. To minimize the overhead of searching Memory Regions:

   - ``--socket-mem`` is recommended, to reserve memory in predictable amounts.
   - Configure a per-lcore cache when creating mempools for packet buffers.
   - Refrain from dynamically allocating or freeing memory at runtime.

Rx burst functions
------------------

There are multiple Rx burst functions with different advantages and limitations.

..
table:: Rx burst functions 1502 1503 +-------------------+------------------------+---------+-----------------+------+-------+ 1504 || Function Name || Enabler || Scatter|| Error Recovery || CQE || Large| 1505 | | | | || comp|| MTU | 1506 +===================+========================+=========+=================+======+=======+ 1507 | rx_burst | rx_vec_en=0 | Yes | Yes | Yes | Yes | 1508 +-------------------+------------------------+---------+-----------------+------+-------+ 1509 | rx_burst_vec | rx_vec_en=1 (default) | No | if CQE comp off | Yes | No | 1510 +-------------------+------------------------+---------+-----------------+------+-------+ 1511 | rx_burst_mprq || mprq_en=1 | No | Yes | Yes | Yes | 1512 | || RxQs >= rxqs_min_mprq | | | | | 1513 +-------------------+------------------------+---------+-----------------+------+-------+ 1514 | rx_burst_mprq_vec || rx_vec_en=1 (default) | No | if CQE comp off | Yes | Yes | 1515 | || mprq_en=1 | | | | | 1516 | || RxQs >= rxqs_min_mprq | | | | | 1517 +-------------------+------------------------+---------+-----------------+------+-------+ 1518 1519.. _mlx5_offloads_support: 1520 1521Supported hardware offloads 1522--------------------------- 1523 1524.. table:: Minimal SW/HW versions for queue offloads 1525 1526 ============== ===== ===== ========= ===== ========== ============= 1527 Offload DPDK Linux rdma-core OFED firmware hardware 1528 ============== ===== ===== ========= ===== ========== ============= 1529 common base 17.11 4.14 16 4.2-1 12.21.1000 ConnectX-4 1530 checksums 17.11 4.14 16 4.2-1 12.21.1000 ConnectX-4 1531 Rx timestamp 17.11 4.14 16 4.2-1 12.21.1000 ConnectX-4 1532 TSO 17.11 4.14 16 4.2-1 12.21.1000 ConnectX-4 1533 LRO 19.08 N/A N/A 4.6-4 16.25.6406 ConnectX-5 1534 Tx scheduling 20.08 N/A N/A 5.1-2 22.28.2006 ConnectX-6 Dx 1535 Buffer Split 20.11 N/A N/A 5.1-2 16.28.2006 ConnectX-5 1536 ============== ===== ===== ========= ===== ========== ============= 1537 1538.. 
table:: Minimal SW/HW versions for rte_flow offloads 1539 1540 +-----------------------+-----------------+-----------------+ 1541 | Offload | with E-Switch | with NIC | 1542 +=======================+=================+=================+ 1543 | Count | | DPDK 19.05 | | DPDK 19.02 | 1544 | | | OFED 4.6 | | OFED 4.6 | 1545 | | | rdma-core 24 | | rdma-core 23 | 1546 | | | ConnectX-5 | | ConnectX-5 | 1547 +-----------------------+-----------------+-----------------+ 1548 | Drop | | DPDK 19.05 | | DPDK 18.11 | 1549 | | | OFED 4.6 | | OFED 4.5 | 1550 | | | rdma-core 24 | | rdma-core 23 | 1551 | | | ConnectX-5 | | ConnectX-4 | 1552 +-----------------------+-----------------+-----------------+ 1553 | Queue / RSS | | | | DPDK 18.11 | 1554 | | | N/A | | OFED 4.5 | 1555 | | | | | rdma-core 23 | 1556 | | | | | ConnectX-4 | 1557 +-----------------------+-----------------+-----------------+ 1558 | Shared action | | | | | 1559 | | | :numref:`sact`| | :numref:`sact`| 1560 | | | | | | 1561 | | | | | | 1562 +-----------------------+-----------------+-----------------+ 1563 | | VLAN | | DPDK 19.11 | | DPDK 19.11 | 1564 | | (of_pop_vlan / | | OFED 4.7-1 | | OFED 4.7-1 | 1565 | | of_push_vlan / | | ConnectX-5 | | ConnectX-5 | 1566 | | of_set_vlan_pcp / | | | | | 1567 | | of_set_vlan_vid) | | | | | 1568 +-----------------------+-----------------+-----------------+ 1569 | Encapsulation | | DPDK 19.05 | | DPDK 19.02 | 1570 | (VXLAN / NVGRE / RAW) | | OFED 4.7-1 | | OFED 4.6 | 1571 | | | rdma-core 24 | | rdma-core 23 | 1572 | | | ConnectX-5 | | ConnectX-5 | 1573 +-----------------------+-----------------+-----------------+ 1574 | Encapsulation | | DPDK 19.11 | | DPDK 19.11 | 1575 | GENEVE | | OFED 4.7-3 | | OFED 4.7-3 | 1576 | | | rdma-core 27 | | rdma-core 27 | 1577 | | | ConnectX-5 | | ConnectX-5 | 1578 +-----------------------+-----------------+-----------------+ 1579 | Tunnel Offload | | DPDK 20.11 | | DPDK 20.11 | 1580 | | | OFED 5.1-2 | | OFED 5.1-2 | 1581 | | | rdma-core 32 | | N/A | 1582 | | | ConnectX-5 | | ConnectX-5 | 1583 +-----------------------+-----------------+-----------------+ 1584 | | Header rewrite | | DPDK 19.05 | | DPDK 19.02 | 1585 | | (set_ipv4_src / | | OFED 4.7-1 | | OFED 4.7-1 | 1586 | | set_ipv4_dst / | | rdma-core 24 | | rdma-core 24 | 1587 | | set_ipv6_src / | | ConnectX-5 | | ConnectX-5 | 1588 | | set_ipv6_dst / | | | | | 1589 | | set_tp_src / | | | | | 1590 | | set_tp_dst / | | | | | 1591 | | dec_ttl / | | | | | 1592 | | set_ttl / | | | | | 1593 | | set_mac_src / | | | | | 1594 | | set_mac_dst) | | | | | 1595 +-----------------------+-----------------+-----------------+ 1596 | | Header rewrite | | DPDK 20.02 | | DPDK 20.02 | 1597 | | (set_dscp) | | OFED 5.0 | | OFED 5.0 | 1598 | | | | rdma-core 24 | | rdma-core 24 | 1599 | | | | ConnectX-5 | | ConnectX-5 | 1600 +-----------------------+-----------------+-----------------+ 1601 | Jump | | DPDK 19.05 | | DPDK 19.02 | 1602 | | | OFED 4.7-1 | | OFED 4.7-1 | 1603 | | | rdma-core 24 | | N/A | 1604 | | | ConnectX-5 | | ConnectX-5 | 1605 +-----------------------+-----------------+-----------------+ 1606 | Mark / Flag | | DPDK 19.05 | | DPDK 18.11 | 1607 | | | OFED 4.6 | | OFED 4.5 | 1608 | | | rdma-core 24 | | rdma-core 23 | 1609 | | | ConnectX-5 | | ConnectX-4 | 1610 +-----------------------+-----------------+-----------------+ 1611 | Meta data | | DPDK 19.11 | | DPDK 19.11 | 1612 | | | OFED 4.7-3 | | OFED 4.7-3 | 1613 | | | rdma-core 26 | | rdma-core 26 | 1614 | | | ConnectX-5 | | ConnectX-5 | 1615 
+-----------------------+-----------------+-----------------+ 1616 | Port ID | | DPDK 19.05 | | N/A | 1617 | | | OFED 4.7-1 | | N/A | 1618 | | | rdma-core 24 | | N/A | 1619 | | | ConnectX-5 | | N/A | 1620 +-----------------------+-----------------+-----------------+ 1621 | Hairpin | | | | DPDK 19.11 | 1622 | | | N/A | | OFED 4.7-3 | 1623 | | | | | rdma-core 26 | 1624 | | | | | ConnectX-5 | 1625 +-----------------------+-----------------+-----------------+ 1626 | 2-port Hairpin | | | | DPDK 20.11 | 1627 | | | N/A | | OFED 5.1-2 | 1628 | | | | | N/A | 1629 | | | | | ConnectX-5 | 1630 +-----------------------+-----------------+-----------------+ 1631 | Metering | | DPDK 19.11 | | DPDK 19.11 | 1632 | | | OFED 4.7-3 | | OFED 4.7-3 | 1633 | | | rdma-core 26 | | rdma-core 26 | 1634 | | | ConnectX-5 | | ConnectX-5 | 1635 +-----------------------+-----------------+-----------------+ 1636 | Sampling | | DPDK 20.11 | | DPDK 20.11 | 1637 | | | OFED 5.1-2 | | OFED 5.1-2 | 1638 | | | rdma-core 32 | | N/A | 1639 | | | ConnectX-5 | | ConnectX-5 | 1640 +-----------------------+-----------------+-----------------+ 1641 | Encapsulation | | DPDK 21.02 | | DPDK 21.02 | 1642 | GTP PSC | | OFED 5.2 | | OFED 5.2 | 1643 | | | rdma-core 35 | | rdma-core 35 | 1644 | | | ConnectX-6 Dx| | ConnectX-6 Dx | 1645 +-----------------------+-----------------+-----------------+ 1646 | Encapsulation | | DPDK 21.02 | | DPDK 21.02 | 1647 | GENEVE TLV option | | OFED 5.2 | | OFED 5.2 | 1648 | | | rdma-core 34 | | rdma-core 34 | 1649 | | | ConnectX-6 Dx | | ConnectX-6 Dx | 1650 +-----------------------+-----------------+-----------------+ 1651 | Modify Field | | DPDK 21.02 | | DPDK 21.02 | 1652 | | | OFED 5.2 | | OFED 5.2 | 1653 | | | rdma-core 35 | | rdma-core 35 | 1654 | | | ConnectX-5 | | ConnectX-5 | 1655 +-----------------------+-----------------+-----------------+ 1656 1657.. table:: Minimal SW/HW versions for shared action offload 1658 :name: sact 1659 1660 +-----------------------+-----------------+-----------------+ 1661 | Shared Action | with E-Switch | with NIC | 1662 +=======================+=================+=================+ 1663 | RSS | | | | DPDK 20.11 | 1664 | | | N/A | | OFED 5.2 | 1665 | | | | | rdma-core 33 | 1666 | | | | | ConnectX-5 | 1667 +-----------------------+-----------------+-----------------+ 1668 | Age | | DPDK 20.11 | | DPDK 20.11 | 1669 | | | OFED 5.2 | | OFED 5.2 | 1670 | | | rdma-core 32 | | rdma-core 32 | 1671 | | | ConnectX-6 Dx| | ConnectX-6 Dx | 1672 +-----------------------+-----------------+-----------------+ 1673 1674Notes for metadata 1675------------------ 1676 1677MARK and META items are interrelated with datapath - they might move from/to 1678the applications in mbuf fields. Hence, zero value for these items has the 1679special meaning - it means "no metadata are provided", not zero values are 1680treated by applications and PMD as valid ones. 1681 1682Moreover in the flow engine domain the value zero is acceptable to match and 1683set, and we should allow to specify zero values as rte_flow parameters for the 1684META and MARK items and actions. In the same time zero mask has no meaning and 1685should be rejected on validation stage. 1686 1687Notes for rte_flow 1688------------------ 1689 1690Flows are not cached in the driver. 1691When stopping a device port, all the flows created on this port from the 1692application will be flushed automatically in the background. 1693After stopping the device port, all flows on this port become invalid and 1694not represented in the system. 
Any references to these flows still held by the application should simply be
discarded; the flows must not be destroyed or flushed explicitly.

The application should re-create the flows as required after the port restart.

Notes for testpmd
-----------------

Compared to librte_net_mlx4, which implements a single RSS configuration per
port, librte_net_mlx5 supports per-protocol RSS configuration.

Since ``testpmd`` defaults to IP RSS mode and there is currently no
command-line parameter to enable additional protocols (UDP and TCP as well
as IP), the following commands must be entered from its CLI to get the same
behavior as librte_net_mlx4::

   > port stop all
   > port config all rss all
   > port start all

Usage example
-------------

This section demonstrates how to launch **testpmd** with Mellanox
ConnectX-4/ConnectX-5/ConnectX-6/BlueField devices managed by librte_net_mlx5.

#. Load the kernel modules::

      modprobe -a ib_uverbs mlx5_core mlx5_ib

   Alternatively, if MLNX_OFED/MLNX_EN is fully installed, the following
   script can be run::

      /etc/init.d/openibd restart

   .. note::

      User space I/O kernel modules (uio and igb_uio) are not used and do
      not have to be loaded.

#. Make sure Ethernet interfaces are in working order and linked to kernel
   verbs. Related sysfs entries should be present::

      ls -d /sys/class/net/*/device/infiniband_verbs/uverbs* | cut -d / -f 5

   Example output::

      eth30
      eth31
      eth32
      eth33

#. Optionally, retrieve their PCI bus addresses to be used with the allow list::

      {
          for intf in eth2 eth3 eth4 eth5;
          do
              (cd "/sys/class/net/${intf}/device/" && pwd -P);
          done;
      } |
      sed -n 's,.*/\(.*\),-a \1,p'

   Example output::

      -a 0000:05:00.1
      -a 0000:06:00.0
      -a 0000:06:00.1
      -a 0000:05:00.0

#. Request huge pages::

      dpdk-hugepages.py --setup 2G

#. Start testpmd with basic parameters::

      dpdk-testpmd -l 8-15 -n 4 -a 05:00.0 -a 05:00.1 -a 06:00.0 -a 06:00.1 -- --rxq=2 --txq=2 -i

   Example output::

      [...]
      EAL: PCI device 0000:05:00.0 on NUMA socket 0
      EAL: probe driver: 15b3:1013 librte_net_mlx5
      PMD: librte_net_mlx5: PCI information matches, using device "mlx5_0" (VF: false)
      PMD: librte_net_mlx5: 1 port(s) detected
      PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fe
      EAL: PCI device 0000:05:00.1 on NUMA socket 0
      EAL: probe driver: 15b3:1013 librte_net_mlx5
      PMD: librte_net_mlx5: PCI information matches, using device "mlx5_1" (VF: false)
      PMD: librte_net_mlx5: 1 port(s) detected
      PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:ff
      EAL: PCI device 0000:06:00.0 on NUMA socket 0
      EAL: probe driver: 15b3:1013 librte_net_mlx5
      PMD: librte_net_mlx5: PCI information matches, using device "mlx5_2" (VF: false)
      PMD: librte_net_mlx5: 1 port(s) detected
      PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fa
      EAL: PCI device 0000:06:00.1 on NUMA socket 0
      EAL: probe driver: 15b3:1013 librte_net_mlx5
      PMD: librte_net_mlx5: PCI information matches, using device "mlx5_3" (VF: false)
      PMD: librte_net_mlx5: 1 port(s) detected
      PMD: librte_net_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fb
      Interactive-mode selected
      Configuring Port 0 (socket 0)
      PMD: librte_net_mlx5: 0x8cba80: TX queues number update: 0 -> 2
      PMD: librte_net_mlx5: 0x8cba80: RX queues number update: 0 -> 2
      Port 0: E4:1D:2D:E7:0C:FE
      Configuring Port 1 (socket 0)
      PMD: librte_net_mlx5: 0x8ccac8: TX queues number update: 0 -> 2
      PMD: librte_net_mlx5: 0x8ccac8: RX queues number update: 0 -> 2
      Port 1: E4:1D:2D:E7:0C:FF
      Configuring Port 2 (socket 0)
      PMD: librte_net_mlx5: 0x8cdb10: TX queues number update: 0 -> 2
      PMD: librte_net_mlx5: 0x8cdb10: RX queues number update: 0 -> 2
      Port 2: E4:1D:2D:E7:0C:FA
      Configuring Port 3 (socket 0)
      PMD: librte_net_mlx5: 0x8ceb58: TX queues number update: 0 -> 2
      PMD: librte_net_mlx5: 0x8ceb58: RX queues number update: 0 -> 2
      Port 3: E4:1D:2D:E7:0C:FB
      Checking link statuses...
      Port 0 Link Up - speed 40000 Mbps - full-duplex
      Port 1 Link Up - speed 40000 Mbps - full-duplex
      Port 2 Link Up - speed 10000 Mbps - full-duplex
      Port 3 Link Up - speed 10000 Mbps - full-duplex
      Done
      testpmd>

How to dump flows
-----------------

This section demonstrates how to dump flows. Currently, it is possible to
dump all flows with the assistance of external tools.

#. There are two ways to get a raw flow dump file:

   - Using the testpmd CLI:

     .. code-block:: console

        testpmd> flow dump <port> <output_file>

   - Calling the ``rte_flow_dev_dump()`` API:

     .. code-block:: c

        rte_flow_dev_dump(port, file, NULL);

#. Dump human-readable flows from the raw file:

   Get the flow parsing tool from: https://github.com/Mellanox/mlx_steering_dump

   .. code-block:: console

      mlx_steering_dump.py -f <output_file>
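For applications that prefer to trigger the dump from code rather than from
the testpmd CLI, the snippet below is a minimal sketch built around the same
``rte_flow_dev_dump()`` call shown above (port ID, output ``FILE *``, optional
``struct rte_flow_error``). The helper name ``dump_port_flows`` and its error
handling are illustrative only, and the API may be marked experimental in some
releases; the resulting raw file is still meant to be parsed with the external
tool mentioned above.

.. code-block:: c

   #include <errno.h>
   #include <stdint.h>
   #include <stdio.h>

   #include <rte_flow.h>

   /* Illustrative helper: dump all flows of a port to a raw file that can
    * later be parsed with mlx_steering_dump.py. Returns 0 on success and a
    * negative error code otherwise. */
   static int
   dump_port_flows(uint16_t port_id, const char *path)
   {
           struct rte_flow_error error = { 0 };
           FILE *file;
           int ret;

           file = fopen(path, "w");
           if (file == NULL)
                   return -errno;
           /* Same call as in the list above: port, output file, error info. */
           ret = rte_flow_dev_dump(port_id, file, &error);
           if (ret != 0)
                   printf("Flow dump failed on port %u: %s\n", port_id,
                          error.message ? error.message : "(no message)");
           fclose(file);
           return ret;
   }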