.. SPDX-License-Identifier: BSD-3-Clause
   Copyright 2022 6WIND S.A.
   Copyright (c) 2022 NVIDIA Corporation & Affiliates

.. include:: <isonum.txt>

NVIDIA MLX5 Common Driver
=========================

.. note::

   NVIDIA acquired Mellanox Technologies in 2020.
   The DPDK documentation and code might still include instances
   of or references to Mellanox trademarks (like BlueField and ConnectX)
   that are now NVIDIA trademarks.

The mlx5 common driver library (**librte_common_mlx5**) provides support for
**NVIDIA ConnectX-4**, **NVIDIA ConnectX-4 Lx**, **NVIDIA ConnectX-5**,
**NVIDIA ConnectX-6**, **NVIDIA ConnectX-6 Dx**, **NVIDIA ConnectX-6 Lx**,
**NVIDIA ConnectX-7**, **NVIDIA BlueField**, and **NVIDIA BlueField-2** families of
10/25/40/50/100/200 Gb/s adapters.

Information and documentation for these adapters can be found on the
`NVIDIA website <https://www.nvidia.com/en-us/networking/>`_.
Help is also provided by the
`NVIDIA Networking forum <https://forums.developer.nvidia.com/c/infrastructure/369/>`_.
In addition, there is a `web section dedicated to DPDK
<https://developer.nvidia.com/networking/dpdk>`_.


Design
------

For security reasons and to enhance robustness,
this driver only handles virtual memory addresses.
The way resource allocations are handled by the kernel,
combined with hardware specifications that allow handling virtual memory addresses directly,
ensures that DPDK applications cannot access random physical memory
(or memory that does not belong to the current process).

There are different levels of objects and bypassing abilities
which are used to get the best performance:

- **Verbs** is a complete high-level generic API
- **Direct Verbs** is a device-specific API
- **DevX** allows accessing firmware objects
- **Direct Rules** manages flow steering at the low-level hardware layer

On Linux, the above interfaces are provided by linking with ``libibverbs`` and ``libmlx5``.
See :ref:`mlx5_linux_prerequisites` for installation.

On Windows, DevX is the only requirement from the above list.
See :ref:`mlx5_windows_prerequisites` for DevX SDK package installation.


.. _mlx5_classes:

Classes
-------

One mlx5 device can be probed by a number of different PMDs.
To select a specific PMD, its name should be specified as a device parameter
(e.g. ``0000:08:00.1,class=eth``).

In order to allow probing by multiple PMDs,
several classes may be listed separated by a colon.
For example: ``class=crypto:regex`` will probe both Crypto and RegEx PMDs.


Supported Classes
~~~~~~~~~~~~~~~~~

- ``class=compress`` for :doc:`../../compressdevs/mlx5`.
- ``class=crypto`` for :doc:`../../cryptodevs/mlx5`.
- ``class=eth`` for :doc:`../../nics/mlx5`.
- ``class=regex`` for :doc:`../../regexdevs/mlx5`.
- ``class=vdpa`` for :doc:`../../vdpadevs/mlx5`.

By default, the mlx5 device will be probed by the ``eth`` PMD.


Limitations
~~~~~~~~~~~

- ``eth`` and ``vdpa`` PMDs cannot be probed at the same time.
  All other combinations are possible.

- On Windows, only ``eth`` and ``crypto`` are supported.


.. _mlx5_common_compilation:

Compilation Prerequisites
-------------------------

.. _mlx5_linux_prerequisites:

Linux Prerequisites
~~~~~~~~~~~~~~~~~~~

This driver relies on external libraries and kernel drivers for resource
allocation and initialization.
The following dependencies are not part of DPDK and must be installed separately:

- **libibverbs**

  User space Verbs framework used by ``librte_common_mlx5``.
  This library provides a generic interface between the kernel
  and low-level user space drivers such as ``libmlx5``.

  It allows slow and privileged operations (context initialization,
  hardware resource allocation) to be managed by the kernel
  and fast operations to never leave user space.

- **libmlx5**

  Low-level user space driver library for NVIDIA devices.
  It is automatically loaded by ``libibverbs``.

  This library basically implements send/receive calls to the hardware queues.

- **Kernel modules**

  They provide the kernel-side Verbs API and low-level device drivers
  that manage actual hardware initialization
  and resource sharing with user-space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - ``mlx5_core``: hardware driver managing NVIDIA devices
    and related Ethernet kernel network devices.
  - ``mlx5_ib``: InfiniBand device driver.
  - ``ib_uverbs``: user space driver for Verbs (entry point for ``libibverbs``).

- **Firmware update**

  NVIDIA MLNX_OFED/EN releases include firmware updates.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

Libraries and kernel modules can be provided either by the Linux distribution,
or by installing NVIDIA MLNX_OFED/EN which provides compatibility with older kernels.


Upstream Dependencies
^^^^^^^^^^^^^^^^^^^^^

The mlx5 kernel modules are part of upstream Linux.
The minimal supported kernel version is 4.14.
For 32-bit, version 4.14.41 or above is required.

The libraries ``libibverbs`` and ``libmlx5`` are part of ``rdma-core``.
It is packaged by most Linux distributions.
The minimal supported rdma-core version is 16.
For 32-bit, version 18 or above is required.

The rdma-core sources can be downloaded at
https://github.com/linux-rdma/rdma-core

It is possible to build rdma-core as static libraries starting with version 21::

   cd build
   CFLAGS=-fPIC cmake -DIN_PLACE=1 -DENABLE_STATIC=1 -GNinja ..
   ninja


NVIDIA MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^

The kernel modules and libraries are packaged with other tools
in NVIDIA MLNX_OFED or NVIDIA MLNX_EN.
The minimal supported versions are:

- NVIDIA MLNX_OFED version: **4.5** and above.
- NVIDIA MLNX_EN version: **4.5** and above.
- Firmware version:

  - ConnectX-4: **12.21.1000** and above.
  - ConnectX-4 Lx: **14.21.1000** and above.
  - ConnectX-5: **16.21.1000** and above.
  - ConnectX-5 Ex: **16.21.1000** and above.
  - ConnectX-6: **20.27.0090** and above.
  - ConnectX-6 Dx: **22.27.0090** and above.
  - ConnectX-6 Lx: **26.27.0090** and above.
  - ConnectX-7: **28.33.2028** and above.
  - BlueField: **18.25.1010** and above.
  - BlueField-2: **24.28.1002** and above.

The firmware, the libraries libibverbs, libmlx5, and mlnx-ofed-kernel modules
are packaged in `NVIDIA MLNX_OFED
<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_.
After downloading, it can be installed with this command::

   ./mlnxofedinstall --dpdk

`NVIDIA MLNX_EN
<https://network.nvidia.com/products/ethernet-drivers/linux/mlnx_en/>`_
is a smaller package including what is needed for DPDK.
After downloading, it can be installed with this command::

   ./install --dpdk

After installing, the firmware version can be checked::

   ibv_devinfo

.. note::

   Several versions of NVIDIA MLNX_OFED/EN are available. Installing the version
   this DPDK release was developed and tested against is strongly recommended.
   Please check the "Tested Platforms" section in the :doc:`../../rel_notes/index`.


.. _mlx5_windows_prerequisites:

Windows Prerequisites
~~~~~~~~~~~~~~~~~~~~~

The mlx5 PMDs rely on external libraries and kernel drivers
for resource allocation and initialization.


DevX SDK Installation
^^^^^^^^^^^^^^^^^^^^^

The DevX SDK must be installed on the machine building the Windows PMD.
Additional information can be found at
`How to Integrate Windows DevX in Your Development Environment
<https://docs.nvidia.com/networking/display/winof2v260/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`_.
The minimal supported WinOF2 version is 2.60.


Compilation Options
-------------------

Compilation on Linux
~~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

``shared`` (default)
   The PMD depends on some .so files.

``dlopen``
   Split the dependencies glue in a separate library
   loaded when needed by dlopen (see ``MLX5_GLUE_PATH``).
   It makes dependencies on libibverbs and libmlx5 optional,
   and has no performance impact.

``static``
   Embed static flavor of the dependencies libibverbs and libmlx5
   in the PMD shared library or the executable static binary.


Compilation on Windows
~~~~~~~~~~~~~~~~~~~~~~

The DevX SDK location must be set through two environment variables:

``DEVX_LIB_PATH``
   path to the DevX lib file.

``DEVX_INC_PATH``
   path to the DevX header files.


.. _mlx5_common_env:

Environment Configuration
-------------------------

Linux Environment
~~~~~~~~~~~~~~~~~

The kernel network interfaces are brought up during initialization.
Forcing them down prevents packet reception.

The ethtool operations on the kernel interfaces may also affect the PMD.

Some runtime behaviours may be configured through environment variables.

``MLX5_GLUE_PATH``
   If built with ``ibverbs_link=dlopen``,
   list of directories in which to search for the rdma-core "glue" plug-in,
   separated by colons or semi-colons.

``MLX5_SHUT_UP_BF``
   If Verbs is used (DevX disabled),
   HW queue doorbell register mapping.
   The value 0 means non-cached IO mapping,
   while 1 is a regular memory mapping.

   With regular memory mapping, the register is flushed to HW
   usually when the write-combining buffer becomes full,
   but it depends on CPU design.


Port Link with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Port links must be set to Ethernet::

   mlxconfig -d <mst device> query | grep LINK_TYPE
   LINK_TYPE_P1                        ETH(2)
   LINK_TYPE_P2                        ETH(2)

   mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3

Link type values are:

* ``1`` Infiniband
* ``2`` Ethernet
* ``3`` VPI (auto-sense)

If the link type was changed, the firmware must be reset as well::

   mlxfwreset -d <mst device> reset


.. _mlx5_vf:

SR-IOV Virtual Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SR-IOV must be enabled on the NIC.
It can be checked with the following command::

   mlxconfig -d <mst device> query | grep SRIOV_EN
   SRIOV_EN                            True(1)

If needed, configure SR-IOV::

   mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
   mlxfwreset -d <mst device> reset

After making the change, restart the driver::

   /etc/init.d/openibd restart

or::

   service openibd restart

Then the virtual functions can be instantiated::

   echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs


.. _mlx5_sub_function:

Sub-Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A Sub-Function (SF) is a portion of the PCI device
with its own dedicated queues.
An SF shares PCI-level resources with other SFs and/or with its parent PCI function.

0. Requirement::

      MLNX_OFED version >= 5.4-0.3.3.0

1. Configure SF feature::

      # Run mlxconfig on both PFs on host and ECPFs on BlueField.
      mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12

2. Enable switchdev mode::

      mlxdevm dev eswitch set pci/<DBDF> mode switchdev

3. Add SF port::

      mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>

   Get the SFID from the output: ``pci/<DBDF>/<SFID>``

4. Modify MAC address::

      mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>

5. Activate SF port::

      mlxdevm port function set pci/<DBDF>/<SFID> state active

6. Devargs to probe SF device::

      auxiliary:mlx5_core.sf.<num>,class=eth:regex


Enable Switchdev Mode
^^^^^^^^^^^^^^^^^^^^^

Switchdev mode is an E-Switch mode that binds a representor to a VF or SF.
A representor is a port in DPDK that is connected to a VF or SF in such a way
that, assuming there are no offload flows, each packet sent from the VF or SF
is received by the corresponding representor,
while each packet sent to a representor is received by the VF or SF.

After :ref:`configuring VF <mlx5_vf>`, the device must be unbound::

   printf "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind

Then switchdev mode is enabled::

   echo switchdev > /sys/class/net/<net device>/compat/devlink/mode

The device can be bound again at this point.


Run as Non-Root
^^^^^^^^^^^^^^^

Hugepage and resource limit setup are documented
in the :ref:`common Linux guide <Running_Without_Root_Privileges>`.
This PMD can operate without access to physical addresses,
therefore it does not require ``SYS_ADMIN`` to access ``/proc/self/pagemap``.
Note that this requirement may still come from other drivers.

The following additional capabilities must be granted to the application,
each for the reason given:

``NET_RAW``
   For raw Ethernet queue allocation through the kernel driver.

``NET_ADMIN``
   For device configuration, like setting link status or MTU.

``SYS_RAWIO``
   For using group 1 and above (software steering) in Flow API.

They can be manually granted for a specific executable file::

   setcap cap_net_raw,cap_net_admin,cap_sys_rawio+ep <executable>

Alternatively, a service manager or a container runtime
may configure the capabilities for a process.


Windows Environment
~~~~~~~~~~~~~~~~~~~

WinOF2 version 2.60 or higher must be installed on the machine.


WinOF2 Installation
^^^^^^^^^^^^^^^^^^^

The driver can be downloaded from the following site: `WINOF2
<https://network.nvidia.com/products/adapter-software/ethernet/windows/winof-2/>`_.


DevX Enablement
^^^^^^^^^^^^^^^

DevX for Windows must be enabled in the Windows registry.
The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
Additional information can be found in the WinOF2 user manual.


.. _mlx5_firmware_config:

Firmware Configuration
~~~~~~~~~~~~~~~~~~~~~~

Firmware features can be configured as key/value pairs.

The command to set a value is::

   mlxconfig -d <device> set <key>=<value>

The command to query a value is::

   mlxconfig -d <device> query <key>

The device name for the command ``mlxconfig`` can be either the PCI address,
or the mst device name found with::

   mst status

Some firmware configurations are listed below.

- link type::

     LINK_TYPE_P1
     LINK_TYPE_P2
     value: 1=Infiniband 2=Ethernet 3=VPI(auto-sense)

- enable SR-IOV::

     SRIOV_EN=1

- the maximum number of SR-IOV virtual functions::

     NUM_OF_VFS=<max>

- enable DevX (required by Direct Rules and other features)::

     UCTX_EN=1

- aggressive CQE zipping::

     CQE_COMPRESSION=1

- L3 VXLAN and VXLAN-GPE destination UDP port::

     IP_OVER_VXLAN_EN=1
     IP_OVER_VXLAN_PORT=<udp dport>

- enable VXLAN-GPE tunnel flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0
     or
     FLEX_PARSER_PROFILE_ENABLE=2

- enable IP-in-IP tunnel flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0

- enable MPLS flow matching::

     FLEX_PARSER_PROFILE_ENABLE=1

- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::

     FLEX_PARSER_PROFILE_ENABLE=2

- enable Geneve flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0
     or
     FLEX_PARSER_PROFILE_ENABLE=1

- enable Geneve TLV option flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0

- enable GTP flow matching::

     FLEX_PARSER_PROFILE_ENABLE=3

- enable eCPRI flow matching::

     FLEX_PARSER_PROFILE_ENABLE=4
     PROG_PARSE_GRAPH=1

- enable dynamic flex parser for flex item::

     FLEX_PARSER_PROFILE_ENABLE=4
     PROG_PARSE_GRAPH=1

- enable realtime timestamp format::

     REAL_TIME_CLOCK_ENABLE=1

- allow locking hairpin RQ data buffer in device memory::

     HAIRPIN_DATA_BUFFER_LOCK=1
     MEMIC_SIZE_LIMIT=0


.. _mlx5_common_driver_options:

Device Arguments
----------------

The driver can be configured per device.
A single argument list can be used for a device managed by multiple PMDs.
The parameters must be passed through the EAL option ``-a``,
as in the examples below:

- PCI device::

     -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0

- Auxiliary SF::

     -a auxiliary:mlx5_core.sf.2,class=compress,mr_ext_memseg_en=0

Each device class PMD has its own list of specific arguments,
and below are the arguments supported by the common mlx5 layer.

- ``class`` parameter [string]

  Select the classes of the drivers that should probe the device.
  See :ref:`mlx5_classes` for more explanation and details.

  The default value is ``eth``.

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in the MR (Memory Region) lookup table on the
  datapath is minimized, which benefits performance. On the other hand, it worsens
  memory utilization because registered memory is pinned by the kernel driver.
  Even if a page in the extended chunk is freed, it does not become reusable
  until the entire memory is freed.

  Enabled by default.

- ``mr_mempool_reg_en`` parameter [int]

  A nonzero value enables implicit registration of DMA memory of all mempools
  except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
  for mempools populated with non-contiguous objects or those without IOVA.
  The effect is that when a packet from a mempool is transmitted,
  its memory is already registered for DMA in the PMD and no registration
  will happen on the data path. The tradeoff is extra work on the creation
  of each mempool and increased HW resource use if some mempools
  are not used with MLX5 devices.

  Enabled by default.

- ``sys_mem_en`` parameter [int]

  A nonzero value makes the PMD memory management allocate memory
  from the system by default, without an explicit rte memory flag.

  By default, the PMD will set this value to 0.

- ``sq_db_nc`` parameter [int]

  The rdma-core library can map the doorbell register in two ways,
  depending on the environment variable "MLX5_SHUT_UP_BF":

  - As regular cached memory (usually with write combining attribute),
    if the variable is either missing or set to zero.
  - As non-cached memory, if the variable is present and set to a value other than "0".

  The same doorbell mapping approach is implemented directly by the PMD
  in UAR generation for queues created with DevX.

  The type of mapping may slightly affect the send queue performance.
  The optimal choice depends strongly on the host architecture
  and should be determined experimentally.

  If ``sq_db_nc`` is set to zero, the doorbell is forced to be mapped to
  regular memory (with write combining), and the PMD performs an extra write
  memory barrier after writing to the doorbell. This might increase the CPU
  clocks needed per packet sent, but latency might be improved.

  If ``sq_db_nc`` is set to one, the doorbell is forced to be mapped to
  non-cached memory, and the PMD does not perform the extra write memory barrier
  after writing to the doorbell. On some architectures this might improve performance.

  If ``sq_db_nc`` is set to two, the doorbell is forced to be mapped to
  regular memory, and the PMD uses heuristics to decide whether a write memory
  barrier should be performed. For bursts whose size is a multiple of the
  recommended one (64 packets), the next burst is assumed to be coming, so the
  extra memory barrier is not issued (it is expected to be issued in the next
  burst, at least after descriptor writing). This might increase latency (on some
  hosts, until the next packets are transmitted) and should be used with care.
  The PMD uses heuristics only for Tx queues; for other send queues, the doorbell
  is forced to be mapped to regular memory, the same as when ``sq_db_nc`` is set to 0.

  If ``sq_db_nc`` is omitted, the preset (if any) environment variable
  "MLX5_SHUT_UP_BF" value is used. If there is no "MLX5_SHUT_UP_BF", the
  default ``sq_db_nc`` value is zero for ARM64 hosts and one for others.

- ``cmd_fd`` parameter [int]

  File descriptor of an ``ibv_context`` created outside the PMD.
  The PMD will use this FD to import the remote CTX. The ``cmd_fd`` is obtained from
  the ``ibv_context->cmd_fd`` member, which must be dup'd before being passed.
  This parameter is valid only if the ``pd_handle`` parameter is specified.

  By default, the PMD will create a new ``ibv_context``.

  .. note::

     When the FD comes from another process, it is the user's responsibility to
     share the FD between the processes (e.g. by SCM_RIGHTS).

- ``pd_handle`` parameter [int]

  Protection domain handle of an ``ibv_pd`` created outside the PMD.
  The PMD will use this handle to import the remote PD. The ``pd_handle`` can be
  obtained from the original PD by reading its ``ibv_pd->handle`` member value.
  This parameter is valid only if the ``cmd_fd`` parameter is specified,
  and its value must be a valid kernel handle for a PD object
  in the context represented by the given ``cmd_fd``.

  By default, the PMD will allocate a new PD.

  .. note::

     The ``ibv_pd->handle`` member is different from the ``mlx5dv_pd->pdn`` member.
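
As an illustration only, the device argument rules described in this guide
(a colon-separated ``class`` list, the ``eth``/``vdpa`` exclusion,
and the requirement that ``cmd_fd`` and ``pd_handle`` are given together)
can be sketched as a small POSIX shell helper that composes the EAL ``-a`` option.
The function name ``mlx5_allow_arg`` and its calling convention are hypothetical;
they are not part of DPDK, and the PMD performs its own validation at probe time.

```shell
#!/bin/sh
# Hypothetical helper, not part of DPDK: compose an EAL "-a" option
# for an mlx5 device and reject combinations this guide describes as invalid.
# Usage: mlx5_allow_arg <device> <class-list> [key=value ...]
mlx5_allow_arg() {
    dev=$1
    classes=$2
    shift 2

    # Classes are colon-separated; eth and vdpa cannot be probed together.
    case ":$classes:" in *:eth:*) has_eth=1 ;; *) has_eth=0 ;; esac
    case ":$classes:" in *:vdpa:*) has_vdpa=1 ;; *) has_vdpa=0 ;; esac
    if [ "$has_eth" -eq 1 ] && [ "$has_vdpa" -eq 1 ]; then
        echo "eth and vdpa cannot be probed at the same time" >&2
        return 1
    fi

    # cmd_fd and pd_handle are each valid only if the other is specified.
    has_fd=0
    has_pd=0
    for kv in "$@"; do
        case $kv in
            cmd_fd=*) has_fd=1 ;;
            pd_handle=*) has_pd=1 ;;
        esac
    done
    if [ "$has_fd" -ne "$has_pd" ]; then
        echo "cmd_fd and pd_handle must be specified together" >&2
        return 1
    fi

    # Join the device, the class list and the remaining key=value pairs.
    arg="$dev,class=$classes"
    for kv in "$@"; do
        arg="$arg,$kv"
    done
    printf '%s\n' "-a $arg"
}
```

For instance, ``mlx5_allow_arg 0000:03:00.2 eth:regex mr_mempool_reg_en=0``
reproduces the PCI device example shown at the start of this section,
while ``mlx5_allow_arg 0000:08:00.1 eth:vdpa`` returns an error.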