..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2022 6WIND S.A.
    Copyright (c) 2022 NVIDIA Corporation & Affiliates

.. include:: <isonum.txt>

NVIDIA MLX5 Common Driver
=========================

.. note::

   NVIDIA acquired Mellanox Technologies in 2020.
   The DPDK documentation and code might still include instances
   of or references to Mellanox trademarks (like BlueField and ConnectX)
   that are now NVIDIA trademarks.

The mlx5 common driver library (**librte_common_mlx5**) provides support for
**NVIDIA ConnectX-4**, **NVIDIA ConnectX-4 Lx**, **NVIDIA ConnectX-5**,
**NVIDIA ConnectX-6**, **NVIDIA ConnectX-6 Dx**, **NVIDIA ConnectX-6 Lx**,
**NVIDIA ConnectX-7**, **NVIDIA BlueField**, **NVIDIA BlueField-2** and
**NVIDIA BlueField-3** families of 10/25/40/50/100/200 Gb/s adapters.

Information and documentation for these adapters can be found on the
`NVIDIA website <https://www.nvidia.com/en-us/networking/>`_.
Help is also provided by the
`NVIDIA Networking forum <https://forums.developer.nvidia.com/c/infrastructure/369/>`_.
In addition, there is a `web section dedicated to DPDK
<https://developer.nvidia.com/networking/dpdk>`_.


Design
------

For security reasons and to enhance robustness,
this driver only handles virtual memory addresses.
The way resource allocations are handled by the kernel,
combined with hardware specifications that allow handling virtual memory addresses directly,
ensures that DPDK applications cannot access random physical memory
(or memory that does not belong to the current process).

There are different levels of objects and bypassing abilities
which are used to get the best performance:

- **Verbs** is a complete high-level generic API
- **Direct Verbs** is a device-specific API
- **DevX** allows accessing firmware objects
- **Direct Rules** manages flow steering at the low-level hardware layer

On Linux, the above interfaces are provided by linking with `libibverbs` and `libmlx5`.
See :ref:`mlx5_linux_prerequisites` for installation.

On Windows, DevX is the only requirement from the above list.
See :ref:`mlx5_windows_prerequisites` for DevX SDK package installation.


.. _mlx5_classes:

Classes
-------

One mlx5 device can be probed by a number of different PMDs.
To select a specific PMD, its name should be specified as a device parameter
(e.g. ``0000:08:00.1,class=eth``).

In order to allow probing by multiple PMDs,
several classes may be listed separated by a colon.
For example: ``class=crypto:regex`` will probe both Crypto and RegEx PMDs.


Supported Classes
~~~~~~~~~~~~~~~~~

- ``class=compress`` for :doc:`../../compressdevs/mlx5`.
- ``class=crypto`` for :doc:`../../cryptodevs/mlx5`.
- ``class=eth`` for :doc:`../../nics/mlx5`.
- ``class=regex`` for :doc:`../../regexdevs/mlx5`.
- ``class=vdpa`` for :doc:`../../vdpadevs/mlx5`.

By default, the mlx5 device will be probed by the ``eth`` PMD.


Limitations
~~~~~~~~~~~

- ``eth`` and ``vdpa`` PMDs cannot be probed at the same time.
  All other combinations are possible.

- On Windows, only ``eth`` and ``crypto`` are supported.


.. _mlx5_common_compilation:

Compilation Prerequisites
-------------------------

.. _mlx5_linux_prerequisites:

Linux Prerequisites
~~~~~~~~~~~~~~~~~~~

This driver relies on external libraries and kernel drivers for resource
allocations and initialization.
The following dependencies are not part of DPDK and must be installed separately:

- **libibverbs**

  User space Verbs framework used by ``librte_common_mlx5``.
  This library provides a generic interface between the kernel
  and low-level user space drivers such as ``libmlx5``.

  It allows slow and privileged operations (context initialization,
  hardware resource allocations) to be managed by the kernel
  and fast operations to never leave user space.

- **libmlx5**

  Low-level user space driver library for NVIDIA devices,
  automatically loaded by ``libibverbs``.

  This library basically implements send/receive calls to the hardware queues.

- **Kernel modules**

  They provide the kernel-side Verbs API and low-level device drivers
  that manage actual hardware initialization
  and resource sharing with user-space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - ``mlx5_core``: hardware driver managing NVIDIA devices
    and related Ethernet kernel network devices.
  - ``mlx5_ib``: InfiniBand device driver.
  - ``ib_uverbs``: user space driver for Verbs (entry point for ``libibverbs``).

- **Firmware update**

  NVIDIA MLNX_OFED/EN releases include firmware updates.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

Libraries and kernel modules can be provided either by the Linux distribution,
or by installing NVIDIA MLNX_OFED/EN which provides compatibility with older kernels.


Upstream Dependencies
^^^^^^^^^^^^^^^^^^^^^

The mlx5 kernel modules are part of upstream Linux.
The minimal supported kernel version is 4.14.
For 32-bit, version 4.14.41 or above is required.

The libraries `libibverbs` and `libmlx5` are part of ``rdma-core``.
It is packaged by most Linux distributions.
The minimal supported rdma-core version is 16.
For 32-bit, version 18 or above is required.

The rdma-core sources can be downloaded at
https://github.com/linux-rdma/rdma-core

It is possible to build rdma-core as static libraries starting with version 21::

   cd build
   CFLAGS=-fPIC cmake -DIN_PLACE=1 -DENABLE_STATIC=1 -GNinja ..
   ninja


NVIDIA MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^

The kernel modules and libraries are packaged with other tools
in NVIDIA MLNX_OFED or NVIDIA MLNX_EN.
The minimal supported versions are:

- NVIDIA MLNX_OFED version: **4.5** and above.
- NVIDIA MLNX_EN version: **4.5** and above.
- Firmware version:

  - ConnectX-4: **12.21.1000** and above.
  - ConnectX-4 Lx: **14.21.1000** and above.
  - ConnectX-5: **16.21.1000** and above.
  - ConnectX-5 Ex: **16.21.1000** and above.
  - ConnectX-6: **20.27.0090** and above.
  - ConnectX-6 Dx: **22.27.0090** and above.
  - ConnectX-6 Lx: **26.27.0090** and above.
  - ConnectX-7: **28.33.2028** and above.
  - BlueField: **18.25.1010** and above.
  - BlueField-2: **24.28.1002** and above.
  - BlueField-3: **32.36.3126** and above.

The firmware, the libraries libibverbs, libmlx5, and mlnx-ofed-kernel modules
are packaged in `NVIDIA MLNX_OFED
<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_.
After downloading, it can be installed with this command::

   ./mlnxofedinstall --dpdk

`NVIDIA MLNX_EN
<https://network.nvidia.com/products/ethernet-drivers/linux/mlnx_en/>`_
is a smaller package including what is needed for DPDK.
After downloading, it can be installed with this command::

   ./install --dpdk

After installing, the firmware version can be checked::

   ibv_devinfo

.. note::

   Several versions of NVIDIA MLNX_OFED/EN are available.
   Installing the version
   this DPDK release was developed and tested against is strongly recommended.
   Please check the "Tested Platforms" section in the :doc:`../../rel_notes/index`.


.. _mlx5_windows_prerequisites:

Windows Prerequisites
~~~~~~~~~~~~~~~~~~~~~

The mlx5 PMDs rely on external libraries and kernel drivers
for resource allocation and initialization.


DevX SDK Installation
^^^^^^^^^^^^^^^^^^^^^

The DevX SDK must be installed on the machine building the Windows PMD.
Additional information can be found at
`How to Integrate Windows DevX in Your Development Environment
<https://docs.nvidia.com/networking/display/winof2v260/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`_.
The minimal supported WinOF2 version is 2.60.


Compilation Options
-------------------

Compilation on Linux
~~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

``shared`` (default)
   The PMD depends on some .so files.

``dlopen``
   Split the dependencies glue in a separate library
   loaded when needed by dlopen (see ``MLX5_GLUE_PATH``).
   It makes dependencies on libibverbs and libmlx5 optional,
   and has no performance impact.

``static``
   Embed static flavor of the dependencies libibverbs and libmlx5
   in the PMD shared library or the executable static binary.


Compilation on Windows
~~~~~~~~~~~~~~~~~~~~~~

The DevX SDK location must be set through two environment variables:

``DEVX_LIB_PATH``
   path to the DevX lib file.

``DEVX_INC_PATH``
   path to the DevX header files.


.. _mlx5_common_env:

Environment Configuration
-------------------------

Linux Environment
~~~~~~~~~~~~~~~~~

The kernel network interfaces are brought up during initialization.
Forcing them down prevents packet reception.

The ethtool operations on the kernel interfaces may also affect the PMD.

Some runtime behaviours may be configured through environment variables.

``MLX5_GLUE_PATH``
   If built with ``ibverbs_link=dlopen``,
   list of directories in which to search for the rdma-core "glue" plug-in,
   separated by colons or semi-colons.

``MLX5_SHUT_UP_BF``
   If Verbs is used (DevX disabled),
   HW queue doorbell register mapping.
   The value 0 means a regular memory mapping (usually write-combining),
   while 1 means a non-cached IO mapping,
   consistent with the ``sq_db_nc`` device argument description.

   With regular memory mapping, the register is flushed to HW
   usually when the write-combining buffer becomes full,
   but it depends on CPU design.


Port Link with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Port links must be set to Ethernet::

   mlxconfig -d <mst device> query | grep LINK_TYPE
   LINK_TYPE_P1                        ETH(2)
   LINK_TYPE_P2                        ETH(2)

   mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3

Link type values are:

* ``1`` Infiniband
* ``2`` Ethernet
* ``3`` VPI (auto-sense)

If the link type was changed, the firmware must be reset as well::

   mlxfwreset -d <mst device> reset


.. _mlx5_vf:

SR-IOV Virtual Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SR-IOV must be enabled on the NIC.
It can be checked with the following command::

   mlxconfig -d <mst device> query | grep SRIOV_EN
   SRIOV_EN                            True(1)

If needed, configure SR-IOV::

   mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
   mlxfwreset -d <mst device> reset

After making the change, restart the driver::

   /etc/init.d/openibd restart

or::

   service openibd restart

Then the virtual functions can be instantiated::

   echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs


.. _mlx5_sub_function:

Sub-Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A Sub-Function (SF) is a portion of the PCI device
with its own dedicated queues.
An SF shares PCI-level resources with other SFs and/or with its parent PCI function.

0. Requirement::

      MLNX_OFED version >= 5.4-0.3.3.0

1. Configure SF feature::

      # Run mlxconfig on both PFs on host and ECPFs on BlueField.
      mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12

2. Enable switchdev mode::

      mlxdevm dev eswitch set pci/<DBDF> mode switchdev

3. Add SF port::

      mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>

   Get SFID from output: pci/<DBDF>/<SFID>

4. Modify MAC address::

      mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>

5. Activate SF port::

      mlxdevm port function set pci/<DBDF>/<SFID> state active

6. Devargs to probe SF device::

      auxiliary:mlx5_core.sf.<num>,class=eth:regex


Enable Switchdev Mode
^^^^^^^^^^^^^^^^^^^^^

Switchdev mode is an E-Switch mode that binds a representor to a VF or SF.
A representor is a port in DPDK that is connected to a VF or SF in such a way
that, assuming there are no offload flows, each packet sent from the VF or SF
is received by the corresponding representor,
while each packet sent to a representor is received by the VF or SF.

After :ref:`configuring VF <mlx5_vf>`, the device must be unbound::

   printf "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind

Then switchdev mode is enabled::

   echo switchdev > /sys/class/net/<net device>/compat/devlink/mode

The device can be bound again at this point.


Run as Non-Root
^^^^^^^^^^^^^^^

Hugepage and resource limit setup are documented
in the :ref:`common Linux guide <Running_Without_Root_Privileges>`.
This PMD can operate without access to physical addresses,
therefore it does not require ``SYS_ADMIN`` to access ``/proc/self/pagemap``.
Note that this requirement may still come from other drivers.

Below are the additional capabilities that must be granted to the application,
with the reason each one is needed:

``NET_RAW``
   For raw Ethernet queue allocation through the kernel driver.

``NET_ADMIN``
   For device configuration, like setting link status or MTU.

``SYS_RAWIO``
   For using group 1 and above (software steering) in Flow API.

They can be manually granted for a specific executable file::

   setcap cap_net_raw,cap_net_admin,cap_sys_rawio+ep <executable>

Alternatively, a service manager or a container runtime
may configure the capabilities for a process.


Windows Environment
~~~~~~~~~~~~~~~~~~~

WinOF2 version 2.60 or higher must be installed on the machine.


WinOF2 Installation
^^^^^^^^^^^^^^^^^^^

The driver can be downloaded from the following site: `WINOF2
<https://network.nvidia.com/products/adapter-software/ethernet/windows/winof-2/>`_.


DevX Enablement
^^^^^^^^^^^^^^^

DevX for Windows must be enabled in the Windows registry.
The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
Additional information can be found in the WinOF2 user manual.


.. _mlx5_firmware_config:

Firmware Configuration
~~~~~~~~~~~~~~~~~~~~~~

Firmware features can be configured as key/value pairs.

The command to set a value is::

   mlxconfig -d <device> set <key>=<value>

The command to query a value is::

   mlxconfig -d <device> query <key>

The device name for the command ``mlxconfig`` can be either the PCI address,
or the mst device name found with::

   mst status

Some firmware configurations are listed below.

- link type::

    LINK_TYPE_P1
    LINK_TYPE_P2
    value: 1=Infiniband 2=Ethernet 3=VPI(auto-sense)

- enable SR-IOV::

    SRIOV_EN=1

- the maximum number of SR-IOV virtual functions::

    NUM_OF_VFS=<max>

- enable DevX (required by Direct Rules and other features)::

    UCTX_EN=1

- aggressive CQE zipping::

    CQE_COMPRESSION=1

- L3 VXLAN and VXLAN-GPE destination UDP port::

    IP_OVER_VXLAN_EN=1
    IP_OVER_VXLAN_PORT=<udp dport>

- enable VXLAN-GPE tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0
    or
    FLEX_PARSER_PROFILE_ENABLE=2

- enable IP-in-IP tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0

- enable MPLS flow matching::

    FLEX_PARSER_PROFILE_ENABLE=1

- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::

    FLEX_PARSER_PROFILE_ENABLE=2

- enable Geneve flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0
    or
    FLEX_PARSER_PROFILE_ENABLE=1

- enable Geneve TLV option flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0

- enable GTP flow matching::

    FLEX_PARSER_PROFILE_ENABLE=3

- enable eCPRI flow matching::

    FLEX_PARSER_PROFILE_ENABLE=4
    PROG_PARSE_GRAPH=1

- enable dynamic flex parser for flex item::

    FLEX_PARSER_PROFILE_ENABLE=4
    PROG_PARSE_GRAPH=1

- enable realtime timestamp format::

    REAL_TIME_CLOCK_ENABLE=1

- allow locking hairpin RQ data buffer in device memory::

    HAIRPIN_DATA_BUFFER_LOCK=1
    MEMIC_SIZE_LIMIT=0


.. _mlx5_common_driver_options:

Device Arguments
----------------

The driver can be configured per device.
A single argument list can be used for a device managed by multiple PMDs.
The parameters must be passed through the EAL option ``-a``,
as in the examples below:

- PCI device::

    -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0

- Auxiliary SF::

    -a auxiliary:mlx5_core.sf.2,class=compress,mr_ext_memseg_en=0

Each device class PMD has its own list of specific arguments,
and below are the arguments supported by the common mlx5 layer.

- ``class`` parameter [string]

  Select the classes of the drivers that should probe the device.
  See :ref:`mlx5_classes` for more explanation and details.

  The default value is ``eth``.

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in the MR (Memory Region) lookup table on the
  datapath is minimized, which benefits performance. On the other hand, it worsens
  memory utilization because registered memory is pinned by the kernel driver.
  Even if a page in the extended chunk is freed, it does not become reusable
  until the entire memory is freed.

  Enabled by default.

- ``mr_mempool_reg_en`` parameter [int]

  A nonzero value enables implicit registration of DMA memory of all mempools
  except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
  for mempools populated with non-contiguous objects or those without IOVA.
  The effect is that when a packet from a mempool is transmitted,
  its memory is already registered for DMA in the PMD and no registration
  will happen on the data path. The tradeoff is extra work on the creation
  of each mempool and increased HW resource use if some mempools
  are not used with MLX5 devices.

  Enabled by default.

- ``sys_mem_en`` parameter [int]

  A nonzero value makes the PMD memory management allocate memory
  from the system by default, without an explicit rte memory flag.

  By default, the PMD will set this value to 0.

- ``sq_db_nc`` parameter [int]

  The rdma-core library can map the doorbell register in two ways,
  depending on the environment variable "MLX5_SHUT_UP_BF":

  - As regular cached memory (usually with write combining attribute),
    if the variable is either missing or set to zero.
  - As non-cached memory, if the variable is present and set to a value other than "0".

  The same doorbell mapping approach is implemented directly by the PMD
  in UAR generation for queues created with DevX.

  The type of mapping may slightly affect the send queue performance;
  the optimal choice depends strongly on the host architecture
  and should be determined empirically.

  If ``sq_db_nc`` is set to zero, the doorbell is forced to be mapped to
  regular memory (with write combining), and the PMD performs an extra write
  memory barrier after writing to the doorbell. This might increase the CPU
  clocks needed per packet sent, but latency might be improved.

  If ``sq_db_nc`` is set to one, the doorbell is forced to be mapped to
  non-cached memory, and the PMD does not perform the extra write memory barrier
  after writing to the doorbell. On some architectures this might improve performance.

  If ``sq_db_nc`` is set to two, the doorbell is forced to be mapped to
  regular memory, and the PMD uses heuristics to decide whether a write memory
  barrier should be performed. For bursts whose size is a multiple of the
  recommended one (64 packets), the PMD assumes that the next burst is coming
  and does not issue the extra memory barrier (it is expected to be issued
  in the next burst, at least after descriptor writing). This might increase
  latency (on some hosts, until the next packets are transmitted)
  and should be used with care.
  The PMD uses heuristics only for Tx queues; for other send queues, the doorbell
  is forced to be mapped to regular memory, the same as when ``sq_db_nc`` is set to 0.

  If ``sq_db_nc`` is omitted, the preset (if any) environment variable
  "MLX5_SHUT_UP_BF" value is used. If there is no "MLX5_SHUT_UP_BF", the
  default ``sq_db_nc`` value is zero for ARM64 hosts and one for others.

- ``cmd_fd`` parameter [int]

  File descriptor of ``ibv_context`` created outside the PMD.
  The PMD will use this FD to import the remote CTX. The ``cmd_fd`` is obtained
  from the ``ibv_context->cmd_fd`` member, which must be dup'd before being passed.
  This parameter is valid only if the ``pd_handle`` parameter is specified.

  By default, the PMD will create a new ``ibv_context``.

  .. note::

     When the FD comes from another process, it is the user's responsibility to
     share the FD between the processes (e.g. by SCM_RIGHTS).

- ``pd_handle`` parameter [int]

  Protection domain handle of ``ibv_pd`` created outside the PMD.
  The PMD will use this handle to import the remote PD. The ``pd_handle`` can be
  obtained from the original PD by reading its ``ibv_pd->handle`` member value.
  This parameter is valid only if the ``cmd_fd`` parameter is specified,
  and its value must be a valid kernel handle for a PD object
  in the context represented by the given ``cmd_fd``.

  By default, the PMD will allocate a new PD.

  .. note::

     The ``ibv_pd->handle`` member is different from the ``mlx5dv_pd->pdn`` member.

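
The argument rules described in this guide (colon-separated classes,
the ``eth``/``vdpa`` exclusion, and ``cmd_fd`` and ``pd_handle`` being valid
only together) can be illustrated with a small sketch.
The helper ``build_mlx5_devargs`` below is hypothetical and not part of DPDK;
it only composes the string given to the EAL option ``-a``,
and the ``cmd_fd``/``pd_handle`` values in the usage lines are placeholders.

.. code-block:: python

   # Hypothetical helper, not a DPDK API: it only formats a devargs string.
   SUPPORTED_CLASSES = {"compress", "crypto", "eth", "regex", "vdpa"}


   def build_mlx5_devargs(device, classes=("eth",), cmd_fd=None,
                          pd_handle=None, **extra):
       """Return a devargs string for the EAL ``-a`` option."""
       if not classes:
           raise ValueError("at least one class is required")
       unknown = set(classes) - SUPPORTED_CLASSES
       if unknown:
           raise ValueError("unsupported classes: %s" % sorted(unknown))
       # eth and vdpa PMDs cannot be probed at the same time.
       if {"eth", "vdpa"} <= set(classes):
           raise ValueError("eth and vdpa cannot be probed together")
       # cmd_fd is valid only with pd_handle, and vice versa.
       if (cmd_fd is None) != (pd_handle is None):
           raise ValueError("cmd_fd and pd_handle must be given together")

       parts = [device, "class=" + ":".join(classes)]
       if cmd_fd is not None:
           parts += ["cmd_fd=%d" % cmd_fd, "pd_handle=%d" % pd_handle]
       # Remaining common or class-specific arguments, in call order.
       parts += ["%s=%s" % (key, value) for key, value in extra.items()]
       return ",".join(parts)


   # Reproduces the two examples given at the start of this section.
   print(build_mlx5_devargs("0000:03:00.2", ("eth", "regex"),
                            mr_mempool_reg_en=0))
   print(build_mlx5_devargs("auxiliary:mlx5_core.sf.2", ("compress",),
                            mr_ext_memseg_en=0))

The checks mirror only the constraints stated in this document;
limits specific to a platform (such as the Windows class restriction)
or to a PMD class are not validated.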