.. SPDX-License-Identifier: BSD-3-Clause
   Copyright 2022 6WIND S.A.
   Copyright (c) 2022 NVIDIA Corporation & Affiliates

.. include:: <isonum.txt>

MLX5 Common Driver
==================

The mlx5 common driver library (**librte_common_mlx5**) provides support for
**NVIDIA ConnectX-4**, **NVIDIA ConnectX-4 Lx**, **NVIDIA ConnectX-5**,
**NVIDIA ConnectX-6**, **NVIDIA ConnectX-6 Dx**, **NVIDIA ConnectX-6 Lx**,
**NVIDIA ConnectX-7**, **NVIDIA BlueField**, and **NVIDIA BlueField-2** families of
10/25/40/50/100/200 Gb/s adapters.

Information and documentation for these adapters can be found on the
`NVIDIA website <https://www.nvidia.com/en-us/networking/>`_.
Help is also provided by the
`Mellanox community <http://community.mellanox.com/welcome>`_.
In addition, there is a `web section dedicated to the Poll Mode Driver
<https://developer.nvidia.com/networking/dpdk>`_.


Design
------

For security reasons and to enhance robustness,
this driver only handles virtual memory addresses.
The way resource allocations are handled by the kernel,
combined with hardware specifications that allow handling virtual memory addresses directly,
ensures that DPDK applications cannot access random physical memory
(or memory that does not belong to the current process).

There are different levels of objects and bypassing abilities
which are used to get the best performance:

- **Verbs** is a complete high-level generic API
- **Direct Verbs** is a device-specific API
- **DevX** allows accessing firmware objects
- **Direct Rules** manages flow steering at the low-level hardware layer

On Linux, the above interfaces are provided by linking with ``libibverbs`` and ``libmlx5``.
See :ref:`mlx5_linux_prerequisites` for installation.

On Windows, DevX is the only requirement from the above list.
See :ref:`mlx5_windows_prerequisites` for DevX SDK package installation.


.. _mlx5_classes:

Classes
-------

One mlx5 device can be probed by a number of different PMDs.
To select a specific PMD, its name should be specified as a device parameter
(e.g. ``0000:08:00.1,class=eth``).

In order to allow probing by multiple PMDs,
several classes may be listed separated by a colon.
For example: ``class=crypto:regex`` will probe both Crypto and RegEx PMDs.


Supported Classes
~~~~~~~~~~~~~~~~~

- ``class=compress`` for :doc:`../../compressdevs/mlx5`.
- ``class=crypto`` for :doc:`../../cryptodevs/mlx5`.
- ``class=eth`` for :doc:`../../nics/mlx5`.
- ``class=regex`` for :doc:`../../regexdevs/mlx5`.
- ``class=vdpa`` for :doc:`../../vdpadevs/mlx5`.

By default, the mlx5 device will be probed by the ``eth`` PMD.


Limitations
~~~~~~~~~~~

- ``eth`` and ``vdpa`` PMDs cannot be probed at the same time.
  All other combinations are possible.

- On Windows, only ``eth`` and ``crypto`` are supported.


.. _mlx5_common_compilation:

Compilation Prerequisites
-------------------------

.. _mlx5_linux_prerequisites:

Linux Prerequisites
~~~~~~~~~~~~~~~~~~~

This driver relies on external libraries and kernel drivers for resource
allocation and initialization.
The following dependencies are not part of DPDK and must be installed separately:

- **libibverbs**

  User space Verbs framework used by ``librte_common_mlx5``.
  This library provides a generic interface between the kernel
  and low-level user space drivers such as ``libmlx5``.

  It allows slow and privileged operations (context initialization,
  hardware resource allocations) to be managed by the kernel
  and fast operations to never leave user space.

- **libmlx5**

  Low-level user space driver library for Mellanox devices,
  it is automatically loaded by ``libibverbs``.

  This library basically implements send/receive calls to the hardware queues.


- **Kernel modules**

  They provide the kernel-side Verbs API and low-level device drivers
  that manage actual hardware initialization
  and resource sharing with user-space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - ``mlx5_core``: hardware driver managing Mellanox devices
    and related Ethernet kernel network devices.
  - ``mlx5_ib``: InfiniBand device driver.
  - ``ib_uverbs``: user space driver for Verbs (entry point for ``libibverbs``).

- **Firmware update**

  Mellanox OFED/EN releases include firmware updates.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

Libraries and kernel modules can be provided either by the Linux distribution,
or by installing Mellanox OFED/EN which provides compatibility with older kernels.


Upstream Dependencies
^^^^^^^^^^^^^^^^^^^^^

The mlx5 kernel modules are part of upstream Linux.
The minimal supported kernel version is 4.14.
For 32-bit, version 4.14.41 or above is required.

The libraries ``libibverbs`` and ``libmlx5`` are part of ``rdma-core``.
It is packaged by most Linux distributions.
The minimal supported rdma-core version is 16.
For 32-bit, version 18 or above is required.

The rdma-core sources can be downloaded at
https://github.com/linux-rdma/rdma-core

It is possible to build rdma-core as static libraries starting with version 21::

   cd build
   CFLAGS=-fPIC cmake -DIN_PLACE=1 -DENABLE_STATIC=1 -GNinja ..
   ninja


Mellanox OFED/EN
^^^^^^^^^^^^^^^^

The kernel modules and libraries are packaged with other tools
in Mellanox OFED or Mellanox EN.
The minimal supported versions are:

- Mellanox OFED version: **4.5** and above.
- Mellanox EN version: **4.5** and above.
- Firmware version:

  - ConnectX-4: **12.21.1000** and above.
  - ConnectX-4 Lx: **14.21.1000** and above.
  - ConnectX-5: **16.21.1000** and above.
  - ConnectX-5 Ex: **16.21.1000** and above.
  - ConnectX-6: **20.27.0090** and above.
  - ConnectX-6 Dx: **22.27.0090** and above.
  - ConnectX-6 Lx: **26.27.0090** and above.
  - ConnectX-7: **28.33.2028** and above.
  - BlueField: **18.25.1010** and above.
  - BlueField-2: **24.28.1002** and above.

The firmware, the libraries libibverbs, libmlx5, and mlnx-ofed-kernel modules
are packaged in `Mellanox OFED
<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_.
After downloading, it can be installed with this command::

   ./mlnxofedinstall --dpdk

`Mellanox EN
<https://network.nvidia.com/products/ethernet-drivers/linux/mlnx_en/>`_
is a smaller package including what is needed for DPDK.
After downloading, it can be installed with this command::

   ./install --dpdk

After installing, the firmware version can be checked::

   ibv_devinfo

.. note::

   Several versions of Mellanox OFED/EN are available. Installing the version
   this DPDK release was developed and tested against is strongly recommended.
   Please check the "Tested Platforms" section in the :doc:`../../rel_notes/index`.


.. _mlx5_windows_prerequisites:

Windows Prerequisites
~~~~~~~~~~~~~~~~~~~~~

The mlx5 PMDs rely on external libraries and kernel drivers
for resource allocation and initialization.


DevX SDK Installation
^^^^^^^^^^^^^^^^^^^^^

The DevX SDK must be installed on the machine building the Windows PMD.
Additional information can be found at
`How to Integrate Windows DevX in Your Development Environment
<https://docs.nvidia.com/networking/display/winof2v260/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`_.
The minimal supported WinOF2 version is 2.60.



Compilation Options
-------------------

Compilation on Linux
~~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

``shared`` (default)
  The PMD depends on some .so files.

``dlopen``
  Split the dependencies glue in a separate library
  loaded when needed by dlopen (see ``MLX5_GLUE_PATH``).
  It makes dependencies on libibverbs and libmlx5 optional,
  and has no performance impact.

``static``
  Embed static flavor of the dependencies libibverbs and libmlx5
  in the PMD shared library or the executable static binary.


Compilation on Windows
~~~~~~~~~~~~~~~~~~~~~~

The DevX SDK location must be set through two environment variables:

``DEVX_LIB_PATH``
  path to the DevX lib file.

``DEVX_INC_PATH``
  path to the DevX header files.


.. _mlx5_common_env:

Environment Configuration
-------------------------

Linux Environment
~~~~~~~~~~~~~~~~~

The kernel network interfaces are brought up during initialization.
Forcing them down prevents packet reception.

The ethtool operations on the kernel interfaces may also affect the PMD.

Some runtime behaviours may be configured through environment variables.

``MLX5_GLUE_PATH``
  If built with ``ibverbs_link=dlopen``,
  list of directories in which to search for the rdma-core "glue" plug-in,
  separated by colons or semi-colons.

``MLX5_SHUT_UP_BF``
  If Verbs is used (DevX disabled),
  HW queue doorbell register mapping.
  The value 0 means non-cached IO mapping,
  while 1 is a regular memory mapping.

  With regular memory mapping, the register is flushed to HW
  usually when the write-combining buffer becomes full,
  but it depends on CPU design.

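
As a minimal sketch, the two variables described above could be exported
in the shell that launches the application.
The glue directories are hypothetical placeholders,
and the ``MLX5_SHUT_UP_BF`` value is only an illustration, not a recommendation::

   # Hypothetical directories holding the rdma-core glue plug-in.
   # Only meaningful when DPDK was built with ibverbs_link=dlopen.
   GLUE_DIRS="/opt/dpdk/lib/glue:/usr/local/lib/dpdk/glue"
   export MLX5_GLUE_PATH="${GLUE_DIRS}"

   # Doorbell register mapping selector, read when Verbs is used (DevX disabled).
   # See the description above for the meaning of the values 0 and 1.
   export MLX5_SHUT_UP_BF=0

   # Print the resulting settings before starting the application.
   echo "MLX5_GLUE_PATH=${MLX5_GLUE_PATH} MLX5_SHUT_UP_BF=${MLX5_SHUT_UP_BF}"

Exporting the variables in the launching shell keeps the setting local
to that session instead of changing it system-wide.
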


Port Link with OFED/EN
^^^^^^^^^^^^^^^^^^^^^^

Port links must be set to Ethernet::

   mlxconfig -d <mst device> query | grep LINK_TYPE
   LINK_TYPE_P1                        ETH(2)
   LINK_TYPE_P2                        ETH(2)

   mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3

Link type values are:

* ``1`` InfiniBand
* ``2`` Ethernet
* ``3`` VPI (auto-sense)

If the link type was changed, the firmware must be reset as well::

   mlxfwreset -d <mst device> reset


.. _mlx5_vf:

SR-IOV Virtual Function with OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SR-IOV must be enabled on the NIC.
It can be checked with the following command::

   mlxconfig -d <mst device> query | grep SRIOV_EN
   SRIOV_EN                            True(1)

If needed, configure SR-IOV::

   mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
   mlxfwreset -d <mst device> reset

After making the change, restart the driver::

   /etc/init.d/openibd restart

or::

   service openibd restart

Then the virtual functions can be instantiated::

   echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs


.. _mlx5_sub_function:

Sub-Function with OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^

A Sub-Function (SF) is a portion of the PCI device
that has its own dedicated queues.
An SF shares PCI-level resources with other SFs and/or with its parent PCI function.

0. Requirement::

      OFED version >= 5.4-0.3.3.0

1. Configure SF feature::

      # Run mlxconfig on both PFs on host and ECPFs on BlueField.
      mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12

2. Enable switchdev mode::

      mlxdevm dev eswitch set pci/<DBDF> mode switchdev

3. Add SF port::

      mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>

      Get SFID from output: pci/<DBDF>/<SFID>

4. Modify MAC address::

      mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>

5. Activate SF port::

      mlxdevm port function set pci/<DBDF>/<ID> state active

6. Devargs to probe SF device::

      auxiliary:mlx5_core.sf.<num>,class=eth:regex


Enable Switchdev Mode
^^^^^^^^^^^^^^^^^^^^^

Switchdev mode is an E-Switch mode that binds a representor to a VF or SF.
A representor is a port in DPDK that is connected to a VF or SF in such a way
that, assuming there are no offload flows, each packet that is sent from the VF or SF
will be received by the corresponding representor.
Conversely, each packet that is sent to a representor will be received by the VF or SF.

After :ref:`configuring VF <mlx5_vf>`, the device must be unbound::

   printf "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind

Then switchdev mode is enabled::

   echo switchdev > /sys/class/net/<net device>/compat/devlink/mode

The device can be bound again at this point.


Run as Non-Root
^^^^^^^^^^^^^^^

Hugepage and resource limit setup are documented
in the :ref:`common Linux guide <Running_Without_Root_Privileges>`.
This PMD can operate without access to physical addresses,
therefore it does not require ``SYS_ADMIN`` to access ``/proc/self/pagemap``.
Note that this requirement may still come from other drivers.

Below are additional capabilities that must be granted to the application,
with the reason each capability is needed:

``NET_RAW``
  For raw Ethernet queue allocation through the kernel driver.

``NET_ADMIN``
  For device configuration, like setting link status or MTU.

``SYS_RAWIO``
  For using group 1 and above (software steering) in Flow API.


They can be manually granted for a specific executable file::

   setcap cap_net_raw,cap_net_admin,cap_sys_rawio+ep <executable>

Alternatively, a service manager or a container runtime
may configure the capabilities for a process.


Windows Environment
~~~~~~~~~~~~~~~~~~~

WinOF2 version 2.60 or higher must be installed on the machine.


WinOF2 Installation
^^^^^^^^^^^^^^^^^^^

The driver can be downloaded from the following site: `WINOF2
<https://network.nvidia.com/products/adapter-software/ethernet/windows/winof-2/>`_.


DevX Enablement
^^^^^^^^^^^^^^^

DevX for Windows must be enabled in the Windows registry.
The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
Additional information can be found in the WinOF2 user manual.


.. _mlx5_firmware_config:

Firmware Configuration
~~~~~~~~~~~~~~~~~~~~~~

Firmware features can be configured as key/value pairs.

The command to set a value is::

   mlxconfig -d <device> set <key>=<value>

The command to query a value is::

   mlxconfig -d <device> query <key>

The device name for the command ``mlxconfig`` can be either the PCI address,
or the mst device name found with::

   mst status

Some firmware configurations are listed below.

- link type::

     LINK_TYPE_P1
     LINK_TYPE_P2
     value: 1=Infiniband 2=Ethernet 3=VPI(auto-sense)

- enable SR-IOV::

     SRIOV_EN=1

- the maximum number of SR-IOV virtual functions::

     NUM_OF_VFS=<max>

- enable DevX (required by Direct Rules and other features)::

     UCTX_EN=1

- aggressive CQE zipping::

     CQE_COMPRESSION=1

- L3 VXLAN and VXLAN-GPE destination UDP port::

     IP_OVER_VXLAN_EN=1
     IP_OVER_VXLAN_PORT=<udp dport>

- enable VXLAN-GPE tunnel flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0
     or
     FLEX_PARSER_PROFILE_ENABLE=2

- enable IP-in-IP tunnel flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0

- enable MPLS flow matching::

     FLEX_PARSER_PROFILE_ENABLE=1

- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::

     FLEX_PARSER_PROFILE_ENABLE=2

- enable Geneve flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0
     or
     FLEX_PARSER_PROFILE_ENABLE=1

- enable Geneve TLV option flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0

- enable GTP flow matching::

     FLEX_PARSER_PROFILE_ENABLE=3

- enable eCPRI flow matching::

     FLEX_PARSER_PROFILE_ENABLE=4
     PROG_PARSE_GRAPH=1

- enable dynamic flex parser for flex item::

     FLEX_PARSER_PROFILE_ENABLE=4
     PROG_PARSE_GRAPH=1

- enable realtime timestamp format::

     REAL_TIME_CLOCK_ENABLE=1


.. _mlx5_common_driver_options:

Device Arguments
----------------

The driver can be configured per device.
A single argument list can be used for a device managed by multiple PMDs.
The parameters must be passed through the EAL option ``-a``,
as in the examples below:

- PCI device::

     -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0

- Auxiliary SF::

     -a auxiliary:mlx5_core.sf.2,class=compress,mr_ext_memseg_en=0

Each device class PMD has its own list of specific arguments,
and below are the arguments supported by the common mlx5 layer.

- ``class`` parameter [string]

  Select the classes of the drivers that should probe the device.
  See :ref:`mlx5_classes` for more explanation and details.

  The default value is ``eth``.

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in the MR (Memory Region) lookup table on the
  datapath is minimized, which benefits performance. On the other hand, it
  worsens memory utilization because registered memory is pinned by the kernel
  driver. Even if a page in the extended chunk is freed, it does not become
  reusable until the entire memory is freed.

  Enabled by default.

- ``mr_mempool_reg_en`` parameter [int]

  A nonzero value enables implicit registration of DMA memory of all mempools
  except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
  for mempools populated with non-contiguous objects or those without IOVA.
  The effect is that when a packet from a mempool is transmitted,
  its memory is already registered for DMA in the PMD and no registration
  will happen on the data path. The tradeoff is extra work on the creation
  of each mempool and increased HW resource use if some mempools
  are not used with MLX5 devices.

  Enabled by default.

- ``sys_mem_en`` parameter [int]

  A nonzero value enables the PMD memory management allocating memory
  from system by default, without explicit rte memory flag.

  By default, the PMD will set this value to 0.

- ``sq_db_nc`` parameter [int]

  The rdma-core library can map the doorbell register in two ways,
  depending on the environment variable ``MLX5_SHUT_UP_BF``:

  - As regular cached memory (usually with write combining attribute),
    if the variable is either missing or set to zero.
  - As non-cached memory, if the variable is present and set to a nonzero value.

  The same doorbell mapping approach is implemented directly by the PMD
  in UAR generation for queues created with DevX.

  The type of mapping may slightly affect the send queue performance.
  The optimal choice strongly depends on the host architecture
  and should be determined experimentally.

  If ``sq_db_nc`` is set to zero, the doorbell is forced to be mapped to
  regular memory (with write combining), and the PMD performs an extra write
  memory barrier after writing to the doorbell. This might increase the CPU
  clocks needed per packet to send, but latency might be improved.

  If ``sq_db_nc`` is set to one, the doorbell is forced to be mapped to
  non-cached memory, and the PMD does not perform the extra write memory
  barrier after writing to the doorbell.
  On some architectures this might improve performance.

  If ``sq_db_nc`` is set to two, the doorbell is forced to be mapped to
  regular memory, and the PMD uses heuristics to decide whether a write memory
  barrier should be performed. For bursts whose size is a multiple of the
  recommended one (64 packets), it is assumed that the next burst is coming,
  so the extra memory barrier is not issued (it is expected to be issued
  in the next burst, at least after descriptor writing). This might increase
  latency (on some hosts until the next packets are transmitted)
  and should be used with care.
  The PMD uses heuristics only for Tx queues; for other send queues, the doorbell
  is forced to be mapped to regular memory, the same as when ``sq_db_nc`` is set to 0.

  If ``sq_db_nc`` is omitted, the preset (if any) environment variable
  ``MLX5_SHUT_UP_BF`` value is used. If there is no ``MLX5_SHUT_UP_BF``, the
  default ``sq_db_nc`` value is zero for ARM64 hosts and one for others.

- ``cmd_fd`` parameter [int]

  File descriptor of ``ibv_context`` created outside the PMD.
  PMD will use this FD to import remote CTX. The ``cmd_fd`` is obtained from
  the ``ibv_context->cmd_fd`` member, which must be dup'd before being passed.
  This parameter is valid only if ``pd_handle`` parameter is specified.

  By default, the PMD will create a new ``ibv_context``.

  .. note::

     When the FD comes from another process, it is the user's responsibility
     to share the FD between the processes (e.g. by SCM_RIGHTS).

- ``pd_handle`` parameter [int]

  Protection domain handle of ``ibv_pd`` created outside the PMD.
  PMD will use this handle to import remote PD. The ``pd_handle`` can be
  obtained from the original PD by reading its ``ibv_pd->handle`` member value.
  This parameter is valid only if ``cmd_fd`` parameter is specified,
  and its value must be a valid kernel handle for a PD object
  in the context represented by the given ``cmd_fd``.

  By default, the PMD will allocate a new PD.

  .. note::

     The ``ibv_pd->handle`` member is different from the ``mlx5dv_pd->pdn`` member.

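
As a minimal sketch, the common arguments above can be combined
into a single device string for the EAL option ``-a``.
The PCI address, the file descriptor number and the PD handle are hypothetical values;
in a real setup the last two come from the application that created
the ``ibv_context`` and the ``ibv_pd``::

   # Hypothetical device address and externally created resources.
   PCI_ADDR="0000:03:00.2"
   CMD_FD=7        # dup'd copy of ibv_context->cmd_fd
   PD_HANDLE=3     # value of ibv_pd->handle (not mlx5dv_pd->pdn)

   # cmd_fd and pd_handle are only valid together, so both are appended.
   DEVARGS="${PCI_ADDR},class=eth,cmd_fd=${CMD_FD},pd_handle=${PD_HANDLE}"

   # Print the option as it would appear on the application command line.
   echo "-a ${DEVARGS}"

Building the string in one place keeps the two import arguments paired,
since each of them is rejected when the other one is missing.
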