..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2022 6WIND S.A.
    Copyright (c) 2022 NVIDIA Corporation & Affiliates

.. include:: <isonum.txt>

NVIDIA MLX5 Common Driver
=========================

.. note::

   NVIDIA acquired Mellanox Technologies in 2020.
   The DPDK documentation and code might still include instances
   of or references to Mellanox trademarks (like BlueField and ConnectX)
   that are now NVIDIA trademarks.

The mlx5 common driver library (**librte_common_mlx5**) provides support for
**NVIDIA ConnectX-4**, **NVIDIA ConnectX-4 Lx**, **NVIDIA ConnectX-5**,
**NVIDIA ConnectX-6**, **NVIDIA ConnectX-6 Dx**, **NVIDIA ConnectX-6 Lx**,
**NVIDIA ConnectX-7**, **NVIDIA BlueField**, **NVIDIA BlueField-2** and
**NVIDIA BlueField-3** families of 10/25/40/50/100/200 Gb/s adapters.

Information and documentation for these adapters can be found on the
`NVIDIA website <https://www.nvidia.com/en-us/networking/>`_.
Help is also provided by the
`NVIDIA Networking forum <https://forums.developer.nvidia.com/c/infrastructure/369/>`_.
In addition, there is a `web section dedicated to DPDK
<https://developer.nvidia.com/networking/dpdk>`_.


Design
------

For security reasons and to enhance robustness,
this driver only handles virtual memory addresses.
The way resource allocations are handled by the kernel,
combined with hardware specifications that allow handling virtual memory addresses directly,
ensures that DPDK applications cannot access random physical memory
(or memory that does not belong to the current process).

There are different levels of objects and bypassing abilities
which are used to get the best performance:

- **Verbs** is a complete high-level generic API
- **Direct Verbs** is a device-specific API
- **DevX** allows accessing firmware objects
- **Direct Rules** manages flow steering at the low-level hardware layer

On Linux, the above interfaces are provided by linking with ``libibverbs`` and ``libmlx5``.
See :ref:`mlx5_linux_prerequisites` for installation.

On Windows, DevX is the only requirement from the above list.
See :ref:`mlx5_windows_prerequisites` for DevX SDK package installation.


.. _mlx5_classes:

Classes
-------

One mlx5 device can be probed by a number of different PMDs.
To select a specific PMD, its name should be specified as a device parameter
(e.g. ``0000:08:00.1,class=eth``).

In order to allow probing by multiple PMDs,
several classes may be listed separated by a colon.
For example: ``class=crypto:regex`` will probe both Crypto and RegEx PMDs.


Supported Classes
~~~~~~~~~~~~~~~~~

- ``class=compress`` for :doc:`../../compressdevs/mlx5`.
- ``class=crypto`` for :doc:`../../cryptodevs/mlx5`.
- ``class=eth`` for :doc:`../../nics/mlx5`.
- ``class=regex`` for :doc:`../../regexdevs/mlx5`.
- ``class=vdpa`` for :doc:`../../vdpadevs/mlx5`.

By default, the mlx5 device will be probed by the ``eth`` PMD.


Limitations
~~~~~~~~~~~

- ``eth`` and ``vdpa`` PMDs cannot be probed at the same time.
  All other combinations are possible.

- On Windows, only ``eth`` and ``crypto`` are supported.


.. _mlx5_common_compilation:

Compilation Prerequisites
-------------------------

.. _mlx5_linux_prerequisites:

Linux Prerequisites
~~~~~~~~~~~~~~~~~~~

This driver relies on external libraries and kernel drivers for resource
allocation and initialization.
The following dependencies are not part of DPDK and must be installed separately:

- **libibverbs**

  User space Verbs framework used by ``librte_common_mlx5``.
  This library provides a generic interface between the kernel
  and low-level user space drivers such as ``libmlx5``.

  It allows slow and privileged operations (context initialization,
  hardware resource allocation) to be managed by the kernel
  and fast operations to never leave user space.

- **libmlx5**

  Low-level user space driver library for NVIDIA devices.
  It is automatically loaded by ``libibverbs``.

  This library basically implements send/receive calls to the hardware queues.

- **Kernel modules**

  They provide the kernel-side Verbs API and low-level device drivers
  that manage actual hardware initialization
  and resource sharing with user-space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - ``mlx5_core``: hardware driver managing NVIDIA devices
    and related Ethernet kernel network devices.
  - ``mlx5_ib``: InfiniBand device driver.
  - ``ib_uverbs``: user space driver for Verbs (entry point for ``libibverbs``).

- **Firmware update**

  NVIDIA MLNX_OFED/EN releases include firmware updates.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

Libraries and kernel modules can be provided either by the Linux distribution,
or by installing NVIDIA MLNX_OFED/EN which provides compatibility with older kernels.


Upstream Dependencies
^^^^^^^^^^^^^^^^^^^^^

The mlx5 kernel modules are part of upstream Linux.
The minimal supported kernel version is 4.14.
For 32-bit, version 4.14.41 or above is required.

The libraries ``libibverbs`` and ``libmlx5`` are part of ``rdma-core``.
It is packaged by most Linux distributions.
The minimal supported rdma-core version is 16.
For 32-bit, version 18 or above is required.

The rdma-core sources can be downloaded at
https://github.com/linux-rdma/rdma-core

It is possible to build rdma-core as static libraries starting with version 21::

   cd build
   CFLAGS=-fPIC cmake -DENABLE_STATIC=1 -DNO_PYVERBS=1 -DNO_MAN_PAGES=1 -GNinja ..
   ninja
   ninja install


NVIDIA MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^

The kernel modules and libraries are packaged with other tools
in NVIDIA MLNX_OFED or NVIDIA MLNX_EN.
The minimal supported versions are:

- NVIDIA MLNX_OFED version: **4.5** and above.
- NVIDIA MLNX_EN version: **4.5** and above.
- Firmware version:

  - ConnectX-4: **12.21.1000** and above.
  - ConnectX-4 Lx: **14.21.1000** and above.
  - ConnectX-5: **16.21.1000** and above.
  - ConnectX-5 Ex: **16.21.1000** and above.
  - ConnectX-6: **20.27.0090** and above.
  - ConnectX-6 Dx: **22.27.0090** and above.
  - ConnectX-6 Lx: **26.27.0090** and above.
  - ConnectX-7: **28.33.2028** and above.
  - BlueField: **18.25.1010** and above.
  - BlueField-2: **24.28.1002** and above.
  - BlueField-3: **32.36.3126** and above.

The firmware, the libraries libibverbs, libmlx5, and mlnx-ofed-kernel modules
are packaged in `NVIDIA MLNX_OFED
<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_.
After downloading, it can be installed with this command::

   ./mlnxofedinstall --dpdk

`NVIDIA MLNX_EN
<https://network.nvidia.com/products/ethernet-drivers/linux/mlnx_en/>`_
is a smaller package including what is needed for DPDK.
After downloading, it can be installed with this command::

   ./install --dpdk

After installing, the firmware version can be checked::

   ibv_devinfo

.. note::

   Several versions of NVIDIA MLNX_OFED/EN are available.
   Installing the version
   this DPDK release was developed and tested against is strongly recommended.
   Please check the "Tested Platforms" section in the :doc:`../../rel_notes/index`.


.. _mlx5_windows_prerequisites:

Windows Prerequisites
~~~~~~~~~~~~~~~~~~~~~

The mlx5 PMDs rely on external libraries and kernel drivers
for resource allocation and initialization.


DevX SDK Installation
^^^^^^^^^^^^^^^^^^^^^

The DevX SDK must be installed on the machine building the Windows PMD.
Additional information can be found at
`How to Integrate Windows DevX in Your Development Environment
<https://docs.nvidia.com/networking/display/winof2v260/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`_.
The minimal supported WinOF2 version is 2.60.


Compilation Options
-------------------

Compilation on Linux
~~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

``shared`` (default)
   The PMD depends on some .so files.

``dlopen``
   Split the dependencies glue in a separate library
   loaded when needed by dlopen (see ``MLX5_GLUE_PATH``).
   It makes dependencies on libibverbs and libmlx5 optional,
   and has no performance impact.

``static``
   Embed static flavor of the dependencies libibverbs and libmlx5
   in the PMD shared library or the executable static binary.


Compilation on Windows
~~~~~~~~~~~~~~~~~~~~~~

The DevX SDK location must be set through CFLAGS/LDFLAGS,
either::

   meson.exe setup "-Dc_args=-I\"%DEVX_INC_PATH%\"" "-Dc_link_args=-L\"%DEVX_LIB_PATH%\"" ...

or::

   set CFLAGS=-I"%DEVX_INC_PATH%" && set LDFLAGS=-L"%DEVX_LIB_PATH%" && meson.exe setup ...


.. _mlx5_common_env:

Environment Configuration
-------------------------

Linux Environment
~~~~~~~~~~~~~~~~~

The kernel network interfaces are brought up during initialization.
Forcing them down prevents packet reception.

The ethtool operations on the kernel interfaces may also affect the PMD.

Some runtime behaviours may be configured through environment variables.

``MLX5_GLUE_PATH``
   If built with ``ibverbs_link=dlopen``,
   list of directories in which to search for the rdma-core "glue" plug-in,
   separated by colons or semi-colons.

``MLX5_SHUT_UP_BF``
   If Verbs is used (DevX disabled),
   HW queue doorbell register mapping.
   The value 0 means a regular memory mapping,
   while any other value means a non-cached IO mapping.

   With regular memory mapping, the register is flushed to HW
   usually when the write-combining buffer becomes full,
   but it depends on CPU design.


Port Link with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Port links must be set to Ethernet::

   mlxconfig -d <mst device> query | grep LINK_TYPE
   LINK_TYPE_P1                        ETH(2)
   LINK_TYPE_P2                        ETH(2)

   mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3

Link type values are:

* ``1`` InfiniBand
* ``2`` Ethernet
* ``3`` VPI (auto-sense)

If link type was changed, firmware must be reset as well::

   mlxfwreset -d <mst device> reset


.. _mlx5_vf:

SR-IOV Virtual Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SR-IOV must be enabled on the NIC.
It can be checked with the following command::

   mlxconfig -d <mst device> query | grep SRIOV_EN
   SRIOV_EN                            True(1)

If needed, configure SR-IOV::

   mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
   mlxfwreset -d <mst device> reset

After making the change, restart the driver::

   /etc/init.d/openibd restart

or::

   service openibd restart

Then the virtual functions can be instantiated::

   echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs


.. _mlx5_sub_function:

Sub-Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A Sub-Function (SF) is a portion of the PCI device
with its own dedicated queues.
An SF shares PCI-level resources with other SFs and/or with its parent PCI function.

#. Requirement::

      MLNX_OFED version >= 5.4-0.3.3.0

#. Configure SF feature::

      # Run mlxconfig on both PFs on host and ECPFs on BlueField.
      mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12

#. Enable switchdev mode::

      mlxdevm dev eswitch set pci/<DBDF> mode switchdev

#. Add SF port::

      mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>

      Get SFID from output: pci/<DBDF>/<SFID>

#. Modify MAC address::

      mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>

#. Activate SF port::

      mlxdevm port function set pci/<DBDF>/<SFID> state active

#. Devargs to probe SF device::

      auxiliary:mlx5_core.sf.<num>,class=eth:regex


Enable Switchdev Mode
^^^^^^^^^^^^^^^^^^^^^

Switchdev mode is an E-Switch mode that binds a representor to a VF or SF.
A representor is a port in DPDK that is connected to a VF or SF in such a way
that, assuming there are no offload flows, each packet that is sent from the VF or SF
will be received by the corresponding representor.
Conversely, each packet that is sent to a representor will be received by the VF or SF.

After :ref:`configuring VF <mlx5_vf>`, the device must be unbound::

   printf "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind

Then switchdev mode is enabled::

   echo switchdev > /sys/class/net/<net device>/compat/devlink/mode

The device can be bound again at this point.


Run as Non-Root
^^^^^^^^^^^^^^^

Hugepage and resource limit setup are documented
in the :ref:`common Linux guide <Running_Without_Root_Privileges>`.
This PMD can operate without access to physical addresses,
therefore it does not require ``SYS_ADMIN`` to access ``/proc/self/pagemap``.
Note that this requirement may still come from other drivers.

Below are additional capabilities that must be granted to the application,
along with the reason each one is needed:

``NET_RAW``
   For raw Ethernet queue allocation through the kernel driver.

``NET_ADMIN``
   For device configuration, like setting link status or MTU.

``SYS_RAWIO``
   For using group 1 and above (software steering) in Flow API.

They can be manually granted for a specific executable file::

   setcap cap_net_raw,cap_net_admin,cap_sys_rawio+ep <executable>

Alternatively, a service manager or a container runtime
may configure the capabilities for a process.


Windows Environment
~~~~~~~~~~~~~~~~~~~

WinOF2 version 2.60 or higher must be installed on the machine.


WinOF2 Installation
^^^^^^^^^^^^^^^^^^^

The driver can be downloaded from the following site: `WINOF2
<https://network.nvidia.com/products/adapter-software/ethernet/windows/winof-2/>`_.


DevX Enablement
^^^^^^^^^^^^^^^

DevX for Windows must be enabled in the Windows registry.
The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
Additional information can be found in the WinOF2 user manual.


.. _mlx5_firmware_config:

Firmware Configuration
~~~~~~~~~~~~~~~~~~~~~~

Firmware features can be configured as key/value pairs.

The command to set a value is::

   mlxconfig -d <device> set <key>=<value>

The command to query a value is::

   mlxconfig -d <device> query <key>

The device name for the command ``mlxconfig`` can be either the PCI address,
or the mst device name found with::

   mst status

Some firmware configurations are listed below.

- link type::

    LINK_TYPE_P1
    LINK_TYPE_P2
    value: 1=Infiniband 2=Ethernet 3=VPI(auto-sense)

- enable SR-IOV::

    SRIOV_EN=1

- the maximum number of SR-IOV virtual functions::

    NUM_OF_VFS=<max>

- enable DevX (required by Direct Rules and other features)::

    UCTX_EN=1

- aggressive CQE zipping::

    CQE_COMPRESSION=1

- L3 VXLAN and VXLAN-GPE destination UDP port::

    IP_OVER_VXLAN_EN=1
    IP_OVER_VXLAN_PORT=<udp dport>

- enable VXLAN-GPE tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0
    or
    FLEX_PARSER_PROFILE_ENABLE=2

- enable IP-in-IP tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0

- enable MPLS flow matching::

    FLEX_PARSER_PROFILE_ENABLE=1

- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::

    FLEX_PARSER_PROFILE_ENABLE=2

- enable Geneve flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0
    or
    FLEX_PARSER_PROFILE_ENABLE=1

- enable Geneve TLV option flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0
    or
    FLEX_PARSER_PROFILE_ENABLE=8

- enable GTP flow matching::

    FLEX_PARSER_PROFILE_ENABLE=3

- enable eCPRI flow matching::

    FLEX_PARSER_PROFILE_ENABLE=4
    PROG_PARSE_GRAPH=1

- enable dynamic flex parser for flex item::

    FLEX_PARSER_PROFILE_ENABLE=4
    PROG_PARSE_GRAPH=1

- enable realtime timestamp format::

    REAL_TIME_CLOCK_ENABLE=1

- allow locking hairpin RQ data buffer in device memory::

    HAIRPIN_DATA_BUFFER_LOCK=1
    MEMIC_SIZE_LIMIT=0


.. _mlx5_common_driver_options:

Device Arguments
----------------

The driver can be configured per device.
A single argument list can be used for a device managed by multiple PMDs.
The parameters must be passed through the EAL option ``-a``,
as in the examples below:

- PCI device::

    -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0

- Auxiliary SF::

    -a auxiliary:mlx5_core.sf.2,class=compress,mr_ext_memseg_en=0

Each device class PMD has its own list of specific arguments,
and below are the arguments supported by the common mlx5 layer.

- ``class`` parameter [string]

  Select the classes of the drivers that should probe the device.
  See :ref:`mlx5_classes` for more explanation and details.

  The default value is ``eth``.

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in MR (Memory Region) lookup table on datapath
  is minimized and it benefits performance. On the other hand, it worsens memory
  utilization because registered memory is pinned by kernel driver. Even if a
  page in the extended chunk is freed, it does not become reusable until the
  entire memory is freed.

  Enabled by default.

- ``mr_mempool_reg_en`` parameter [int]

  A nonzero value enables implicit registration of DMA memory of all mempools
  except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
  for mempools populated with non-contiguous objects or those without IOVA.
  The effect is that when a packet from a mempool is transmitted,
  its memory is already registered for DMA in the PMD and no registration
  will happen on the data path.
  The tradeoff is extra work on the creation
  of each mempool and increased HW resource use if some mempools
  are not used with MLX5 devices.

  Enabled by default.

- ``sys_mem_en`` parameter [int]

  A nonzero value enables the PMD memory management allocating memory
  from system by default, without explicit rte memory flag.

  By default, the PMD will set this value to 0.

- ``sq_db_nc`` parameter [int]

  The rdma-core library can map the doorbell register in two ways,
  depending on the environment variable "MLX5_SHUT_UP_BF":

  - As regular cached memory (usually with write combining attribute),
    if the variable is either missing or set to zero.
  - As non-cached memory, if the variable is present and set to a value
    other than "0".

  The same doorbell mapping approach is implemented directly by PMD
  in UAR generation for queues created with DevX.

  The type of mapping may slightly affect the send queue performance.
  The optimal choice strongly depends on the host architecture
  and should be determined empirically.

  If ``sq_db_nc`` is set to zero, the doorbell is forced to be mapped to
  regular memory (with write combining) and the PMD performs an extra write
  memory barrier after writing to the doorbell. This might increase the CPU
  clocks needed per packet sent, but latency might be improved.

  If ``sq_db_nc`` is set to one, the doorbell is forced to be mapped to
  non-cached memory and the PMD does not perform the extra write memory barrier
  after writing to the doorbell. On some architectures this might improve
  performance.

  If ``sq_db_nc`` is set to two, the doorbell is forced to be mapped to
  regular memory, and the PMD uses heuristics to decide whether a write memory
  barrier should be performed.
  For bursts whose size is a multiple of the recommended one
  (64 pkts), the next burst is assumed to be coming, so the extra memory
  barrier is not issued (it is expected to be issued in the next burst,
  at least after descriptor writing). It might increase latency (on some hosts
  until the next packets are transmitted) and should be used with care.
  The PMD uses heuristics only for Tx queues; for other send queues the doorbell
  is forced to be mapped to regular memory, the same as when ``sq_db_nc`` is set to 0.

  If ``sq_db_nc`` is omitted, the preset (if any) environment variable
  "MLX5_SHUT_UP_BF" value is used. If there is no "MLX5_SHUT_UP_BF", the
  default ``sq_db_nc`` value is zero for ARM64 hosts and one for others.

- ``cmd_fd`` parameter [int]

  File descriptor of ``ibv_context`` created outside the PMD.
  PMD will use this FD to import remote CTX. The ``cmd_fd`` is obtained from
  the ``ibv_context->cmd_fd`` member, which must be dup'd before being passed.
  This parameter is valid only if ``pd_handle`` parameter is specified.

  By default, the PMD will create a new ``ibv_context``.

  .. note::

     When FD comes from another process, it is the user's responsibility to
     share the FD between the processes (e.g. by SCM_RIGHTS).

- ``pd_handle`` parameter [int]

  Protection domain handle of ``ibv_pd`` created outside the PMD.
  PMD will use this handle to import remote PD. The ``pd_handle`` can be
  obtained from the original PD by reading its ``ibv_pd->handle`` member value.
  This parameter is valid only if ``cmd_fd`` parameter is specified,
  and its value must be a valid kernel handle for a PD object
  in the context represented by given ``cmd_fd``.

  By default, the PMD will allocate a new PD.

  .. note::

     The ``ibv_pd->handle`` member is different from ``mlx5dv_pd->pdn`` member.
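
The ``cmd_fd`` and ``pd_handle`` arguments described above are passed together
in one device argument string. The C sketch below shows one way an application
might format such a string. The helper name ``mlx5_format_import_devargs`` is
hypothetical and not part of DPDK. It assumes the caller has already duplicated
``ibv_context->cmd_fd`` with ``dup()`` and read ``ibv_pd->handle`` from a PD
allocated in the same context. Writing both values in decimal is also an
assumption of this sketch.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of DPDK.
 * Format a devargs string asking the mlx5 PMD to import an externally
 * created context and protection domain:
 *   dev       - device name, e.g. a PCI address
 *   classes   - value for the "class" argument, e.g. "eth" or "eth:regex"
 *   cmd_fd    - assumed to be a dup() of ibv_context->cmd_fd
 *   pd_handle - assumed to be ibv_pd->handle from the same context
 * Returns the string length on success, -1 on bad input or truncation.
 */
static int
mlx5_format_import_devargs(char *buf, size_t len, const char *dev,
			   const char *classes, int cmd_fd, uint32_t pd_handle)
{
	int n;

	if (buf == NULL || dev == NULL || classes == NULL ||
	    strlen(dev) == 0 || strlen(classes) == 0 || cmd_fd < 0)
		return -1;
	/* cmd_fd and pd_handle are only valid when given together. */
	n = snprintf(buf, len, "%s,class=%s,cmd_fd=%d,pd_handle=%" PRIu32,
		     dev, classes, cmd_fd, pd_handle);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

The resulting string, for instance ``0000:03:00.2,class=eth,cmd_fd=7,pd_handle=3``,
would then be passed through the EAL option ``-a`` like any other device argument list.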