..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2022 6WIND S.A.
    Copyright (c) 2022 NVIDIA Corporation & Affiliates

.. include:: <isonum.txt>

NVIDIA MLX5 Common Driver
=========================

.. note::

   NVIDIA acquired Mellanox Technologies in 2020.
   The DPDK documentation and code might still include instances
   of or references to Mellanox trademarks (like BlueField and ConnectX)
   that are now NVIDIA trademarks.

The mlx5 common driver library (**librte_common_mlx5**) provides support for
**NVIDIA ConnectX-4**, **NVIDIA ConnectX-4 Lx**, **NVIDIA ConnectX-5**,
**NVIDIA ConnectX-6**, **NVIDIA ConnectX-6 Dx**, **NVIDIA ConnectX-6 Lx**,
**NVIDIA ConnectX-7**, **NVIDIA BlueField**, **NVIDIA BlueField-2** and
**NVIDIA BlueField-3** families of 10/25/40/50/100/200 Gb/s adapters.

Information and documentation for these adapters can be found on the
`NVIDIA website <https://www.nvidia.com/en-us/networking/>`_.
Help is also provided by the
`NVIDIA Networking forum <https://forums.developer.nvidia.com/c/infrastructure/369/>`_.
In addition, there is a `web section dedicated to DPDK
<https://developer.nvidia.com/networking/dpdk>`_.


Design
------

For security reasons and to enhance robustness,
this driver only handles virtual memory addresses.
The way resource allocations are handled by the kernel,
combined with hardware specifications that allow handling virtual memory addresses directly,
ensures that DPDK applications cannot access random physical memory
(or memory that does not belong to the current process).


There are different levels of objects and bypassing abilities
which are used to get the best performance:

- **Verbs** is a complete high-level generic API
- **Direct Verbs** is a device-specific API
- **DevX** allows accessing firmware objects
- **Direct Rules** manages flow steering at the low-level hardware layer

On Linux, the above interfaces are provided by linking with ``libibverbs`` and ``libmlx5``.
See :ref:`mlx5_linux_prerequisites` for installation.

On Windows, DevX is the only requirement from the above list.
See :ref:`mlx5_windows_prerequisites` for DevX SDK package installation.


.. _mlx5_classes:

Classes
-------

One mlx5 device can be probed by a number of different PMDs.
To select a specific PMD, its name should be specified as a device parameter
(e.g. ``0000:08:00.1,class=eth``).

In order to allow probing by multiple PMDs,
several classes may be listed separated by a colon.
For example: ``class=crypto:regex`` will probe both Crypto and RegEx PMDs.


Supported Classes
~~~~~~~~~~~~~~~~~

- ``class=compress`` for :doc:`../../compressdevs/mlx5`.
- ``class=crypto`` for :doc:`../../cryptodevs/mlx5`.
- ``class=eth`` for :doc:`../../nics/mlx5`.
- ``class=regex`` for :doc:`../../regexdevs/mlx5`.
- ``class=vdpa`` for :doc:`../../vdpadevs/mlx5`.

By default, the mlx5 device will be probed by the ``eth`` PMD.


Limitations
~~~~~~~~~~~

- ``eth`` and ``vdpa`` PMDs cannot be probed at the same time.
  All other combinations are possible.

- On Windows, only ``eth`` and ``crypto`` are supported.


.. _mlx5_common_compilation:

Compilation Prerequisites
-------------------------

.. _mlx5_linux_prerequisites:

Linux Prerequisites
~~~~~~~~~~~~~~~~~~~

This driver relies on external libraries and kernel drivers
for resource allocation and initialization.
The following dependencies are not part of DPDK and must be installed separately:

- **libibverbs**

  User space Verbs framework used by ``librte_common_mlx5``.
  This library provides a generic interface between the kernel
  and low-level user space drivers such as ``libmlx5``.

  It allows slow and privileged operations (context initialization,
  hardware resource allocation) to be managed by the kernel
  and fast operations to never leave user space.

- **libmlx5**

  Low-level user space driver library for NVIDIA devices.
  It is automatically loaded by ``libibverbs``.

  This library basically implements send/receive calls to the hardware queues.

- **Kernel modules**

  They provide the kernel-side Verbs API and low-level device drivers
  that manage actual hardware initialization
  and resource sharing with user-space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - ``mlx5_core``: hardware driver managing NVIDIA devices
    and related Ethernet kernel network devices.
  - ``mlx5_ib``: InfiniBand device driver.
  - ``ib_uverbs``: user space driver for Verbs (entry point for ``libibverbs``).

- **Firmware**

  Minimal supported firmware version:

  - ConnectX-4: **12.21.1000** and above.
  - ConnectX-4 Lx: **14.21.1000** and above.
  - ConnectX-5: **16.21.1000** and above.
  - ConnectX-5 Ex: **16.21.1000** and above.
  - ConnectX-6: **20.27.0090** and above.
  - ConnectX-6 Dx: **22.27.0090** and above.
  - ConnectX-6 Lx: **26.27.0090** and above.
  - ConnectX-7: **28.33.2028** and above.
  - BlueField: **18.25.1010** and above.
  - BlueField-2: **24.28.1002** and above.
  - BlueField-3: **32.36.3126** and above.

  New features may be added in more recent firmware versions.

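
A quick way to confirm that the three kernel modules listed above are loaded
is to look for them in ``/proc/modules``.
The following is a minimal sketch and not part of DPDK:
the function name ``check_mlx5_modules`` and its optional file argument are hypothetical,
and modules built into the kernel image do not appear in ``/proc/modules``,
so a "not loaded" report is only a hint.

.. code-block:: shell

   #!/bin/sh
   # Sketch: report which mlx5-related kernel modules are loaded.
   # The optional first argument (hypothetical) replaces /proc/modules,
   # which makes the function easy to try against a sample file.
   check_mlx5_modules() {
       modules_file="${1:-/proc/modules}"
       missing=0
       for m in mlx5_core mlx5_ib ib_uverbs; do
           # Each /proc/modules line starts with the module name and a space.
           if grep -q "^${m} " "$modules_file"; then
               echo "${m}: loaded"
           else
               echo "${m}: not loaded"
               missing=1
           fi
       done
       # Exit status is 0 only when all three modules were found.
       return "$missing"
   }

Calling ``check_mlx5_modules`` without an argument inspects the running kernel.
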

Libraries and kernel modules can be provided either by the Linux distribution,
or by installing NVIDIA MLNX_OFED/EN which provides compatibility with older kernels.


Upstream Dependencies
^^^^^^^^^^^^^^^^^^^^^

The mlx5 kernel modules are part of upstream Linux.
The minimal supported kernel version is 4.14.
For 32-bit, version 4.14.41 or above is required.

The libraries ``libibverbs`` and ``libmlx5`` are part of ``rdma-core``.
It is packaged by most Linux distributions.
The minimal supported rdma-core version is 16.
For 32-bit, version 18 or above is required.

The rdma-core sources can be downloaded at
https://github.com/linux-rdma/rdma-core

It is possible to build rdma-core as static libraries starting with version 21::

   cd build
   CFLAGS=-fPIC cmake -DENABLE_STATIC=1 -DNO_PYVERBS=1 -DNO_MAN_PAGES=1 -GNinja ..
   ninja
   ninja install

The firmware can be updated with `mlxup
<https://docs.nvidia.com/networking/display/mlxupfwutility>`_.
The latest firmware can be downloaded at
https://network.nvidia.com/support/firmware/firmware-downloads/


NVIDIA MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^

The kernel modules and libraries are packaged with other tools
in NVIDIA MLNX_OFED or NVIDIA MLNX_EN.
The minimal supported versions are:

- NVIDIA MLNX_OFED version: **4.5** and above.
- NVIDIA MLNX_EN version: **4.5** and above.

The firmware, the libraries libibverbs, libmlx5, and mlnx-ofed-kernel modules
are packaged in `NVIDIA MLNX_OFED
<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_.
After downloading, it can be installed with this command::

   ./mlnxofedinstall --dpdk

`NVIDIA MLNX_EN
<https://network.nvidia.com/products/ethernet-drivers/linux/mlnx_en/>`_
is a smaller package including what is needed for DPDK.
After downloading, it can be installed with this command::

   ./install --dpdk

After installing, the firmware version can be checked::

   ibv_devinfo

The firmware updates are included in NVIDIA MLNX_OFED/EN packages.
Because each release provides new features, these updates must be applied
to match the kernel modules and libraries they come with.

.. note::

   Several versions of NVIDIA MLNX_OFED/EN are available. Installing the version
   this DPDK release was developed and tested against is strongly recommended.
   Please check the "Tested Platforms" section in the :doc:`../../rel_notes/index`.


.. _mlx5_windows_prerequisites:

Windows Prerequisites
~~~~~~~~~~~~~~~~~~~~~

The mlx5 PMDs rely on external libraries and kernel drivers
for resource allocation and initialization.


DevX SDK Installation
^^^^^^^^^^^^^^^^^^^^^

The DevX SDK must be installed on the machine building the Windows PMD.
Additional information can be found at
`How to Integrate Windows DevX in Your Development Environment
<https://docs.nvidia.com/networking/display/winof2v290/devx+interface>`_.
The minimal supported WinOF2 version is 2.60.


Compilation Options
-------------------

Compilation on Linux
~~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

``shared`` (default)
   The PMD depends on some .so files.

``dlopen``
   Split the dependencies glue in a separate library
   loaded when needed by dlopen (see ``MLX5_GLUE_PATH``).
   It makes dependencies on libibverbs and libmlx5 optional,
   and has no performance impact.

``static``
   Embed static flavor of the dependencies libibverbs and libmlx5
   in the PMD shared library or the executable static binary.

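
As a sketch of how the ``dlopen`` mode might be used,
the build can be configured with ``ibverbs_link=dlopen``
and ``MLX5_GLUE_PATH`` can point to the glue plug-in at run time.
The build directory name ``build``, the glue directory and the choice of
``dpdk-testpmd`` as the application are assumptions made for illustration only:

.. code-block:: shell

   # Configure and build DPDK with the ibverbs glue split into a plug-in.
   meson setup build -Dibverbs_link=dlopen
   ninja -C build

   # Hypothetical directory holding the rdma-core glue plug-in.
   export MLX5_GLUE_PATH=/opt/dpdk/glue

   # Probe one mlx5 device with the default eth class.
   ./build/app/dpdk-testpmd -a 0000:08:00.1,class=eth -- -i
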


Compilation on Windows
~~~~~~~~~~~~~~~~~~~~~~

The DevX SDK location must be set through CFLAGS/LDFLAGS,
either::

   meson.exe setup "-Dc_args=-I\"%DEVX_INC_PATH%\"" "-Dc_link_args=-L\"%DEVX_LIB_PATH%\"" ...

or::

   set CFLAGS=-I"%DEVX_INC_PATH%" && set LDFLAGS=-L"%DEVX_LIB_PATH%" && meson.exe setup ...


.. _mlx5_common_env:

Environment Configuration
-------------------------

Linux Environment
~~~~~~~~~~~~~~~~~

The kernel network interfaces are brought up during initialization.
Forcing them down prevents packet reception.

The ethtool operations on the kernel interfaces may also affect the PMD.

Some runtime behaviours may be configured through environment variables.

``MLX5_GLUE_PATH``
   If built with ``ibverbs_link=dlopen``,
   list of directories in which to search for the rdma-core "glue" plug-in,
   separated by colons or semi-colons.

``MLX5_SHUT_UP_BF``
   If Verbs is used (DevX disabled),
   HW queue doorbell register mapping.
   The value 0 means non-cached IO mapping,
   while 1 is a regular memory mapping.

   With regular memory mapping, the register is flushed to HW
   usually when the write-combining buffer becomes full,
   but it depends on CPU design.


Port Link with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Port links must be set to Ethernet::

   mlxconfig -d <mst device> query | grep LINK_TYPE
   LINK_TYPE_P1                        ETH(2)
   LINK_TYPE_P2                        ETH(2)

   mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3

Link type values are:

* ``1`` InfiniBand
* ``2`` Ethernet
* ``3`` VPI (auto-sense)

If the link type was changed, the firmware must be reset as well::

   mlxfwreset -d <mst device> reset


.. _mlx5_vf:

SR-IOV Virtual Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SR-IOV must be enabled on the NIC.
It can be checked with the following command::

   mlxconfig -d <mst device> query | grep SRIOV_EN
   SRIOV_EN                            True(1)

If needed, configure SR-IOV::

   mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
   mlxfwreset -d <mst device> reset

After making the change, restart the driver::

   /etc/init.d/openibd restart

or::

   service openibd restart

Then the virtual functions can be instantiated::

   echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs


.. _mlx5_sub_function:

Sub-Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A Sub-Function (SF) is a portion of the PCI device
with its own dedicated queues.
An SF shares PCI-level resources with other SFs and/or with its parent PCI function.

#. Requirement::

      MLNX_OFED version >= 5.4-0.3.3.0

#. Configure SF feature::

      # Run mlxconfig on both PFs on host and ECPFs on BlueField.
      mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12

#. Enable switchdev mode::

      mlxdevm dev eswitch set pci/<DBDF> mode switchdev

#. Add SF port::

      mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>

      Get SFID from output: pci/<DBDF>/<SFID>

#. Modify MAC address::

      mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>

#. Activate SF port::

      mlxdevm port function set pci/<DBDF>/<ID> state active

#. Devargs to probe SF device::

      auxiliary:mlx5_core.sf.<num>,class=eth:regex


Enable Switchdev Mode
^^^^^^^^^^^^^^^^^^^^^

Switchdev mode is an E-Switch mode that binds a representor to a VF or SF.
A representor is a port in DPDK that is connected to a VF or SF in such a way
that, assuming there are no offload flows, each packet sent from the VF or SF
is received by the corresponding representor.
Conversely, each packet sent to a representor is received by the VF or SF.

After :ref:`configuring VF <mlx5_vf>`, the device must be unbound::

   printf "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind

Then switchdev mode is enabled::

   echo switchdev > /sys/class/net/<net device>/compat/devlink/mode

The device can be bound again at this point.


Run as Non-Root
^^^^^^^^^^^^^^^

Hugepage and resource limit setup are documented
in the :ref:`common Linux guide <Running_Without_Root_Privileges>`.
This PMD can operate without access to physical addresses,
therefore it does not require ``SYS_ADMIN`` to access ``/proc/self/pagemap``.
Note that this requirement may still come from other drivers.

The following additional capabilities must be granted to the application,
each listed with the reason it is needed:

``NET_RAW``
   For raw Ethernet queue allocation through the kernel driver.

``NET_ADMIN``
   For device configuration, like setting link status or MTU.

``SYS_RAWIO``
   For using group 1 and above (software steering) in Flow API.

They can be manually granted for a specific executable file::

   setcap cap_net_raw,cap_net_admin,cap_sys_rawio+ep <executable>

Alternatively, a service manager or a container runtime
may configure the capabilities for a process.


Windows Environment
~~~~~~~~~~~~~~~~~~~

WinOF2 version 2.60 or higher must be installed on the machine.


WinOF2 Installation
^^^^^^^^^^^^^^^^^^^

The driver can be downloaded from the following site: `WINOF2
<https://network.nvidia.com/products/adapter-software/ethernet/windows/winof-2/>`_.


DevX Enablement
^^^^^^^^^^^^^^^

DevX for Windows must be enabled in the Windows registry.
The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
Additional information can be found in the WinOF2 user manual.



.. _mlx5_firmware_config:

Firmware Configuration
~~~~~~~~~~~~~~~~~~~~~~

Firmware features can be configured as key/value pairs.

The command to set a value is::

   mlxconfig -d <device> set <key>=<value>

The command to query a value is::

   mlxconfig -d <device> query <key>

The device name for the command ``mlxconfig`` can be either the PCI address,
or the mst device name found with::

   mst status

Some firmware configuration keys are listed below.

- link type::

     LINK_TYPE_P1
     LINK_TYPE_P2
     value: 1=InfiniBand 2=Ethernet 3=VPI(auto-sense)

- enable SR-IOV::

     SRIOV_EN=1

- the maximum number of SR-IOV virtual functions::

     NUM_OF_VFS=<max>

- enable DevX (required by Direct Rules and other features)::

     UCTX_EN=1

- aggressive CQE zipping::

     CQE_COMPRESSION=1

- L3 VXLAN and VXLAN-GPE destination UDP port::

     IP_OVER_VXLAN_EN=1
     IP_OVER_VXLAN_PORT=<udp dport>

- enable VXLAN-GPE tunnel flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0
     or
     FLEX_PARSER_PROFILE_ENABLE=2

- enable IP-in-IP tunnel flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0

- enable MPLS flow matching::

     FLEX_PARSER_PROFILE_ENABLE=1

- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::

     FLEX_PARSER_PROFILE_ENABLE=2

- enable Geneve flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0
     or
     FLEX_PARSER_PROFILE_ENABLE=1

- enable Geneve TLV option flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0
     or
     FLEX_PARSER_PROFILE_ENABLE=8

- enable GTP flow matching::

     FLEX_PARSER_PROFILE_ENABLE=3

- enable eCPRI flow matching::

     FLEX_PARSER_PROFILE_ENABLE=4
     PROG_PARSE_GRAPH=1

- enable dynamic flex parser for flex item::

     FLEX_PARSER_PROFILE_ENABLE=4
     PROG_PARSE_GRAPH=1

- enable realtime timestamp format::

     REAL_TIME_CLOCK_ENABLE=1

- allow locking hairpin RQ data buffer in device memory::

     HAIRPIN_DATA_BUFFER_LOCK=1
     MEMIC_SIZE_LIMIT=0


.. _mlx5_common_driver_options:

Device Arguments
----------------

The driver can be configured per device.
A single argument list can be used for a device managed by multiple PMDs.
The parameters must be passed through the EAL option ``-a``,
as in the examples below:

- PCI device::

     -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0

- Auxiliary SF::

     -a auxiliary:mlx5_core.sf.2,class=compress,mr_ext_memseg_en=0

Each device class PMD has its own list of specific arguments.
The arguments supported by the common mlx5 layer are listed below.

- ``class`` parameter [string]

  Select the classes of the drivers that should probe the device.
  See :ref:`mlx5_classes` for more explanation and details.

  The default value is ``eth``.

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory.
  If enabled, the number of entries in the MR (Memory Region) lookup table
  on the datapath is minimized, which benefits performance.
  On the other hand, it worsens memory utilization
  because registered memory is pinned by the kernel driver.
  Even if a page in the extended chunk is freed,
  it does not become reusable until the entire memory is freed.

  Enabled by default.

- ``mr_mempool_reg_en`` parameter [int]

  A nonzero value enables implicit registration of DMA memory of all mempools
  except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
  for mempools populated with non-contiguous objects or those without IOVA.
  The effect is that when a packet from a mempool is transmitted,
  its memory is already registered for DMA in the PMD and no registration
  will happen on the data path. The tradeoff is extra work on the creation
  of each mempool and increased HW resource use if some mempools
  are not used with MLX5 devices.

  Enabled by default.

- ``sys_mem_en`` parameter [int]

  A nonzero value makes the PMD memory management allocate memory
  from the system by default, without an explicit rte memory flag.

  By default, the PMD will set this value to 0.

- ``sq_db_nc`` parameter [int]

  The rdma-core library can map the doorbell register in two ways,
  depending on the environment variable "MLX5_SHUT_UP_BF":

  - As regular cached memory (usually with write combining attribute),
    if the variable is either missing or set to zero.
  - As non-cached memory, if the variable is present
    and set to a value other than "0".

  The same doorbell mapping approach is implemented directly by the PMD
  in UAR generation for queues created with DevX.

  The type of mapping may slightly affect the send queue performance.
  The optimal choice depends strongly on the host architecture
  and should be determined empirically.

  If ``sq_db_nc`` is set to zero, the doorbell is forced to be mapped to
  regular memory (with write combining) and the PMD performs the extra write
  memory barrier after writing to the doorbell. This might increase the CPU
  clocks needed per packet sent, but latency might be improved.

  If ``sq_db_nc`` is set to one, the doorbell is forced to be mapped to
  non-cached memory and the PMD does not perform the extra write memory barrier
  after writing to the doorbell. On some architectures this might improve
  performance.

  If ``sq_db_nc`` is set to two, the doorbell is forced to be mapped to
  regular memory and the PMD uses heuristics to decide whether a write memory
  barrier should be performed. For bursts whose size is a multiple of the
  recommended one (64 packets), the next burst is assumed to be coming,
  so the extra memory barrier is not issued (it is expected to be issued
  by the next burst, at least after descriptor writing).
  This might increase latency (on some hosts until the next packets are
  transmitted) and should be used with care.
  The PMD uses heuristics only for Tx queues. For other send queues the doorbell
  is forced to be mapped to regular memory, the same as when ``sq_db_nc``
  is set to 0.

  If ``sq_db_nc`` is omitted, the preset (if any) environment variable
  "MLX5_SHUT_UP_BF" value is used. If there is no "MLX5_SHUT_UP_BF", the
  default ``sq_db_nc`` value is zero for ARM64 hosts and one for others.

- ``cmd_fd`` parameter [int]

  File descriptor of ``ibv_context`` created outside the PMD.
  The PMD will use this FD to import the remote CTX. The ``cmd_fd`` is obtained
  from the ``ibv_context->cmd_fd`` member, which must be dup'd before being passed.
  This parameter is valid only if the ``pd_handle`` parameter is specified.

  By default, the PMD will create a new ``ibv_context``.

  .. note::

     When the FD comes from another process, it is the user's responsibility to
     share the FD between the processes (e.g. by SCM_RIGHTS).

- ``pd_handle`` parameter [int]

  Protection domain handle of ``ibv_pd`` created outside the PMD.
  The PMD will use this handle to import the remote PD. The ``pd_handle`` can be
  obtained from the original PD by reading its ``ibv_pd->handle`` member value.
  This parameter is valid only if the ``cmd_fd`` parameter is specified,
  and its value must be a valid kernel handle for a PD object
  in the context represented by the given ``cmd_fd``.

  By default, the PMD will allocate a new PD.

  .. note::

     The ``ibv_pd->handle`` member is different from the ``mlx5dv_pd->pdn`` member.