..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2022 6WIND S.A.
    Copyright (c) 2022 NVIDIA Corporation & Affiliates

.. include:: <isonum.txt>

MLX5 Common Driver
==================

The mlx5 common driver library (**librte_common_mlx5**) provides support for
**Mellanox ConnectX-4**, **Mellanox ConnectX-4 Lx**, **Mellanox ConnectX-5**,
**Mellanox ConnectX-6**, **Mellanox ConnectX-6 Dx**, **Mellanox ConnectX-6 Lx**,
**Mellanox BlueField** and **Mellanox BlueField-2** families of
10/25/40/50/100/200 Gb/s adapters.

Information and documentation for these adapters can be found on the
`NVIDIA website <https://www.nvidia.com/en-us/networking/>`_.
Help is also provided by the
`Mellanox community <http://community.mellanox.com/welcome>`_.
In addition, there is a `web section dedicated to the Poll Mode Driver
<https://developer.nvidia.com/networking/dpdk>`_.


Design
------

For security reasons and to enhance robustness,
this driver only handles virtual memory addresses.
The way resource allocations are handled by the kernel,
combined with hardware specifications that allow handling virtual memory addresses directly,
ensures that DPDK applications cannot access random physical memory
(or memory that does not belong to the current process).

There are different levels of objects and bypassing abilities
which are used to get the best performance:

- **Verbs** is a complete high-level generic API
- **Direct Verbs** is a device-specific API
- **DevX** allows accessing firmware objects
- **Direct Rules** manages flow steering at the low-level hardware layer

On Linux, the above interfaces are provided by linking with ``libibverbs`` and ``libmlx5``.
See :ref:`mlx5_linux_prerequisites` for installation.

On Windows, DevX is the only requirement from the above list.
See :ref:`mlx5_windows_prerequisites` for DevX SDK package installation.


.. _mlx5_classes:

Classes
-------

One mlx5 device can be probed by a number of different PMDs.
To select a specific PMD, its name should be specified as a device parameter
(e.g. ``0000:08:00.1,class=eth``).

In order to allow probing by multiple PMDs,
several classes may be listed separated by a colon.
For example: ``class=crypto:regex`` will probe both Crypto and RegEx PMDs.


Supported Classes
~~~~~~~~~~~~~~~~~

- ``class=compress`` for :doc:`../../compressdevs/mlx5`.
- ``class=crypto`` for :doc:`../../cryptodevs/mlx5`.
- ``class=eth`` for :doc:`../../nics/mlx5`.
- ``class=regex`` for :doc:`../../regexdevs/mlx5`.
- ``class=vdpa`` for :doc:`../../vdpadevs/mlx5`.

By default, the mlx5 device will be probed by the ``eth`` PMD.


Limitations
~~~~~~~~~~~

- ``eth`` and ``vdpa`` PMDs cannot be probed at the same time.
  All other combinations are possible.

- On Windows, only ``eth`` and ``crypto`` are supported.


.. _mlx5_common_compilation:

Compilation Prerequisites
-------------------------

.. _mlx5_linux_prerequisites:

Linux Prerequisites
~~~~~~~~~~~~~~~~~~~

This driver relies on external libraries and kernel drivers for resource
allocation and initialization.
The following dependencies are not part of DPDK and must be installed separately:

- **libibverbs**

  User space Verbs framework used by ``librte_common_mlx5``.
  This library provides a generic interface between the kernel
  and low-level user space drivers such as ``libmlx5``.

  It allows slow and privileged operations (context initialization,
  hardware resource allocations) to be managed by the kernel
  and fast operations to never leave user space.

- **libmlx5**

  Low-level user space driver library for Mellanox devices.
  It is automatically loaded by ``libibverbs``.

  This library basically implements send/receive calls to the hardware queues.

- **Kernel modules**

  They provide the kernel-side Verbs API and low level device drivers
  that manage actual hardware initialization
  and resource sharing with user-space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - ``mlx5_core``: hardware driver managing Mellanox devices
    and related Ethernet kernel network devices.
  - ``mlx5_ib``: InfiniBand device driver.
  - ``ib_uverbs``: user space driver for Verbs (entry point for ``libibverbs``).

- **Firmware update**

  Mellanox OFED/EN releases include firmware updates.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

Libraries and kernel modules can be provided either by the Linux distribution,
or by installing Mellanox OFED/EN, which provides compatibility with older kernels.


Upstream Dependencies
^^^^^^^^^^^^^^^^^^^^^

The mlx5 kernel modules are part of upstream Linux.
The minimal supported kernel version is 4.14.
For 32-bit, version 4.14.41 or above is required.

The libraries ``libibverbs`` and ``libmlx5`` are part of ``rdma-core``.
It is packaged by most Linux distributions.
The minimal supported rdma-core version is 16.
For 32-bit, version 18 or above is required.

The rdma-core sources can be downloaded at
https://github.com/linux-rdma/rdma-core

It is possible to build rdma-core as static libraries starting with version 21::

    cd build
    CFLAGS=-fPIC cmake -DIN_PLACE=1 -DENABLE_STATIC=1 -GNinja ..
    ninja


Mellanox OFED/EN
^^^^^^^^^^^^^^^^

The kernel modules and libraries are packaged with other tools
in Mellanox OFED or Mellanox EN.
The minimal supported versions are:

- Mellanox OFED version: **4.5** and above.
- Mellanox EN version: **4.5** and above.
- Firmware version:

  - ConnectX-4: **12.21.1000** and above.
  - ConnectX-4 Lx: **14.21.1000** and above.
  - ConnectX-5: **16.21.1000** and above.
  - ConnectX-5 Ex: **16.21.1000** and above.
  - ConnectX-6: **20.27.0090** and above.
  - ConnectX-6 Dx: **22.27.0090** and above.
  - BlueField: **18.25.1010** and above.
  - BlueField-2: **24.28.1002** and above.

The firmware, the libraries libibverbs, libmlx5, and mlnx-ofed-kernel modules
are packaged in `Mellanox OFED
<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_.
After downloading, it can be installed with this command::

    ./mlnxofedinstall --dpdk

`Mellanox EN
<https://network.nvidia.com/products/ethernet-drivers/linux/mlnx_en/>`_
is a smaller package including what is needed for DPDK.
After downloading, it can be installed with this command::

    ./install --dpdk

After installing, the firmware version can be checked::

    ibv_devinfo

.. note::

   Several versions of Mellanox OFED/EN are available. Installing the version
   this DPDK release was developed and tested against is strongly recommended.
   Please check the "Tested Platforms" section in the :doc:`../../rel_notes/index`.


.. _mlx5_windows_prerequisites:

Windows Prerequisites
~~~~~~~~~~~~~~~~~~~~~

The mlx5 PMDs rely on external libraries and kernel drivers
for resource allocation and initialization.


DevX SDK Installation
^^^^^^^^^^^^^^^^^^^^^

The DevX SDK must be installed on the machine building the Windows PMD.
Additional information can be found at
`How to Integrate Windows DevX in Your Development Environment
<https://docs.nvidia.com/networking/display/winof2v260/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`_.
The minimal supported WinOF2 version is 2.60.


Compilation Options
-------------------

Compilation on Linux
~~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

``shared`` (default)
   The PMD depends on some .so files.

``dlopen``
   Split the dependency glue into a separate library
   loaded when needed by dlopen (see ``MLX5_GLUE_PATH``).
   It makes dependencies on libibverbs and libmlx5 optional,
   and has no performance impact.

``static``
   Embed the static flavor of the dependencies libibverbs and libmlx5
   in the PMD shared library or the executable static binary.


Compilation on Windows
~~~~~~~~~~~~~~~~~~~~~~

The DevX SDK location must be set through two environment variables:

``DEVX_LIB_PATH``
   path to the DevX lib file.

``DEVX_INC_PATH``
   path to the DevX header files.


.. _mlx5_common_env:

Environment Configuration
-------------------------

Linux Environment
~~~~~~~~~~~~~~~~~

The kernel network interfaces are brought up during initialization.
Forcing them down prevents packet reception.

The ethtool operations on the kernel interfaces may also affect the PMD.

Some runtime behaviours may be configured through environment variables.

``MLX5_GLUE_PATH``
   If built with ``ibverbs_link=dlopen``,
   list of directories in which to search for the rdma-core "glue" plug-in,
   separated by colons or semi-colons.

``MLX5_SHUT_UP_BF``
   If Verbs is used (DevX disabled),
   HW queue doorbell register mapping.
   The value 0 means non-cached IO mapping,
   while 1 is a regular memory mapping.

   With regular memory mapping, the register is flushed to HW
   usually when the write-combining buffer becomes full,
   but it depends on CPU design.
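
As an illustration of the two environment variables above,
a launcher script could prepare the environment before starting a DPDK application.
This is only a sketch under stated assumptions:
the helper name ``mlx5_env`` and the directories in the usage line are hypothetical,
and the helper passes the ``MLX5_SHUT_UP_BF`` value through without interpreting it.

.. code-block:: python

   import os


   def mlx5_env(glue_dirs, shut_up_bf=None, base=None):
       """Return a copy of the environment with the mlx5 variables set.

       mlx5_env is a hypothetical helper, not part of DPDK or rdma-core.
       """
       env = dict(os.environ if base is None else base)
       for directory in glue_dirs:
           # Colons and semi-colons are list separators,
           # so neither can be part of a single directory name.
           if ":" in directory or ";" in directory:
               raise ValueError(f"separator character in path: {directory!r}")
       # Only meaningful when DPDK was built with ibverbs_link=dlopen.
       env["MLX5_GLUE_PATH"] = ":".join(glue_dirs)
       if shut_up_bf is not None:
           env["MLX5_SHUT_UP_BF"] = str(int(shut_up_bf))
       return env


   # Hypothetical directories; the result can be passed as env= to subprocess.run().
   env = mlx5_env(["/opt/dpdk/glue", "/usr/local/lib/dpdk/glue"], shut_up_bf=0, base={})

Building a fresh dictionary rather than modifying ``os.environ`` in place
keeps the settings local to the launched application.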


Port Link with OFED/EN
^^^^^^^^^^^^^^^^^^^^^^

Port links must be set to Ethernet::

    mlxconfig -d <mst device> query | grep LINK_TYPE
    LINK_TYPE_P1                        ETH(2)
    LINK_TYPE_P2                        ETH(2)

    mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3

Link type values are:

* ``1`` InfiniBand
* ``2`` Ethernet
* ``3`` VPI (auto-sense)

If link type was changed, firmware must be reset as well::

    mlxfwreset -d <mst device> reset


.. _mlx5_vf:

SR-IOV Virtual Function with OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SR-IOV must be enabled on the NIC.
It can be checked with the following command::

    mlxconfig -d <mst device> query | grep SRIOV_EN
    SRIOV_EN                            True(1)

If needed, configure SR-IOV::

    mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
    mlxfwreset -d <mst device> reset

After making the change, restart the driver::

    /etc/init.d/openibd restart

or::

    service openibd restart

Then the virtual functions can be instantiated::

    echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs


.. _mlx5_sub_function:

Sub-Function with OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^

A Sub-Function (SF) is a portion of the PCI device
with its own dedicated queues.
An SF shares PCI-level resources with other SFs and/or with its parent PCI function.

0. Requirement::

      OFED version >= 5.4-0.3.3.0

1. Configure SF feature::

      # Run mlxconfig on both PFs on host and ECPFs on BlueField.
      mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12

2. Enable switchdev mode::

      mlxdevm dev eswitch set pci/<DBDF> mode switchdev

3. Add SF port::

      mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>

   Get SFID from output: pci/<DBDF>/<SFID>

4. Modify MAC address::

      mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>

5. Activate SF port::

      mlxdevm port function set pci/<DBDF>/<ID> state active

6. Devargs to probe SF device::

      auxiliary:mlx5_core.sf.<num>,class=eth:regex


Enable Switchdev Mode
^^^^^^^^^^^^^^^^^^^^^

Switchdev mode is an E-Switch mode that binds a representor to a VF or SF.
A representor is a port in DPDK that is connected to a VF or SF in such a way
that, assuming there are no offload flows, each packet sent from the VF or SF
is received by the corresponding representor,
while each packet sent to a representor is received by the VF or SF.

After :ref:`configuring VF <mlx5_vf>`, the device must be unbound::

    printf "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind

Then switchdev mode is enabled::

    echo switchdev > /sys/class/net/<net device>/compat/devlink/mode

The device can be bound again at this point.


Run as Non-Root
^^^^^^^^^^^^^^^

In order to run as a non-root user,
some capabilities must be granted to the application::

    setcap cap_sys_admin,cap_net_admin,cap_net_raw,cap_ipc_lock+ep <dpdk-app>

Each capability is needed for the following reason:

``cap_sys_admin``
   When using physical addresses (PA mode), with Linux >= 4.0,
   for access to ``/proc/self/pagemap``.

``cap_net_admin``
   For device configuration.

``cap_net_raw``
   For raw ethernet queue allocation through kernel driver.

``cap_ipc_lock``
   For DMA memory pinning.


Windows Environment
~~~~~~~~~~~~~~~~~~~

WinOF2 version 2.60 or higher must be installed on the machine.


WinOF2 Installation
^^^^^^^^^^^^^^^^^^^

The driver can be downloaded from the following site: `WINOF2
<https://network.nvidia.com/products/adapter-software/ethernet/windows/winof-2/>`_.


DevX Enablement
^^^^^^^^^^^^^^^

DevX for Windows must be enabled in the Windows registry.
The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
Additional information can be found in the WinOF2 user manual.


.. _mlx5_firmware_config:

Firmware Configuration
~~~~~~~~~~~~~~~~~~~~~~

Firmware features can be configured as key/value pairs.

The command to set a value is::

    mlxconfig -d <device> set <key>=<value>

The command to query a value is::

    mlxconfig -d <device> query <key>

The device name for the command ``mlxconfig`` can be either the PCI address,
or the mst device name found with::

    mst status

Some firmware configurations are listed below.

- link type::

     LINK_TYPE_P1
     LINK_TYPE_P2
     value: 1=Infiniband 2=Ethernet 3=VPI(auto-sense)

- enable SR-IOV::

     SRIOV_EN=1

- the maximum number of SR-IOV virtual functions::

     NUM_OF_VFS=<max>

- enable DevX (required by Direct Rules and other features)::

     UCTX_EN=1

- aggressive CQE zipping::

     CQE_COMPRESSION=1

- L3 VXLAN and VXLAN-GPE destination UDP port::

     IP_OVER_VXLAN_EN=1
     IP_OVER_VXLAN_PORT=<udp dport>

- enable VXLAN-GPE tunnel flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0
   or
     FLEX_PARSER_PROFILE_ENABLE=2

- enable IP-in-IP tunnel flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0

- enable MPLS flow matching::

     FLEX_PARSER_PROFILE_ENABLE=1

- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::

     FLEX_PARSER_PROFILE_ENABLE=2

- enable Geneve flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0
   or
     FLEX_PARSER_PROFILE_ENABLE=1

- enable Geneve TLV option flow matching::

     FLEX_PARSER_PROFILE_ENABLE=0

- enable GTP flow matching::

     FLEX_PARSER_PROFILE_ENABLE=3

- enable eCPRI flow matching::

     FLEX_PARSER_PROFILE_ENABLE=4
     PROG_PARSE_GRAPH=1

- enable dynamic flex parser for flex item::

     FLEX_PARSER_PROFILE_ENABLE=4
     PROG_PARSE_GRAPH=1

- enable realtime timestamp format::

     REAL_TIME_CLOCK_ENABLE=1


.. _mlx5_common_driver_options:

Device Arguments
----------------

The driver can be configured per device.
A single argument list can be used for a device managed by multiple PMDs.
The parameters must be passed through the EAL option ``-a``,
as in the examples below:

- PCI device::

     -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0

- Auxiliary SF::

     -a auxiliary:mlx5_core.sf.2,class=compress,mr_ext_memseg_en=0

Each device class PMD has its own list of specific arguments,
and below are the arguments supported by the common mlx5 layer.

- ``class`` parameter [string]

  Select the classes of the drivers that should probe the device.
  See :ref:`mlx5_classes` for more explanation and details.

  The default value is ``eth``.

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in the MR (Memory Region) lookup table on the
  datapath is minimized, which benefits performance. On the other hand, it worsens
  memory utilization because registered memory is pinned by the kernel driver. Even
  if a page in the extended chunk is freed, it does not become reusable until the
  entire memory is freed.

  Enabled by default.

- ``mr_mempool_reg_en`` parameter [int]

  A nonzero value enables implicit registration of DMA memory of all mempools
  except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
  for mempools populated with non-contiguous objects or those without IOVA.
  The effect is that when a packet from a mempool is transmitted,
  its memory is already registered for DMA in the PMD and no registration
  will happen on the data path. The tradeoff is extra work on the creation
  of each mempool and increased HW resource use if some mempools
  are not used with MLX5 devices.

  Enabled by default.

- ``sys_mem_en`` parameter [int]

  A nonzero value makes the PMD memory management allocate memory
  from the system by default, without an explicit rte memory flag.

  By default, the PMD will set this value to 0.

- ``sq_db_nc`` parameter [int]

  The rdma core library can map the doorbell register in two ways,
  depending on the environment variable "MLX5_SHUT_UP_BF":

  - As regular cached memory (usually with write combining attribute),
    if the variable is either missing or set to zero.
  - As non-cached memory, if the variable is present and set to a value other than "0".

  The same doorbell mapping approach is implemented directly by the PMD
  in UAR generation for queues created with DevX.

  The type of mapping may slightly affect the send queue performance.
  The optimal choice depends strongly on the host architecture
  and should be determined empirically.

  If ``sq_db_nc`` is set to zero, the doorbell is forced to be mapped to
  regular memory (with write combining), and the PMD performs an extra write
  memory barrier after writing to the doorbell. This might increase the CPU
  clocks needed per packet sent, but latency might be improved.

  If ``sq_db_nc`` is set to one, the doorbell is forced to be mapped to
  non-cached memory, and the PMD does not perform the extra write memory barrier
  after writing to the doorbell. On some architectures this might improve performance.

  If ``sq_db_nc`` is set to two, the doorbell is forced to be mapped to
  regular memory, and the PMD uses heuristics to decide whether a write memory
  barrier should be performed. For bursts whose size is a multiple of the
  recommended one (64 packets), the PMD assumes that another burst is coming
  and skips the extra memory barrier (it is expected to be issued in the next burst,
  at least after descriptor writing). This might increase latency (on some hosts
  until the next packets are transmitted) and should be used with care.
  The PMD uses heuristics only for Tx queues; for other send queues the doorbell
  is forced to be mapped to regular memory, as when ``sq_db_nc`` is set to 0.

  If ``sq_db_nc`` is omitted, the value of the environment variable
  "MLX5_SHUT_UP_BF" is used, if it is set. If there is no "MLX5_SHUT_UP_BF", the
  default ``sq_db_nc`` value is zero for ARM64 hosts and one for others.

- ``cmd_fd`` parameter [int]

  File descriptor of ``ibv_context`` created outside the PMD.
  PMD will use this FD to import remote CTX. The ``cmd_fd`` is obtained from
  the ``ibv_context->cmd_fd`` member, which must be dup'd before being passed.
  This parameter is valid only if ``pd_handle`` parameter is specified.

  By default, the PMD will create a new ``ibv_context``.

  .. note::

     When FD comes from another process, it is the user's responsibility to
     share the FD between the processes (e.g. by SCM_RIGHTS).

- ``pd_handle`` parameter [int]

  Protection domain handle of ``ibv_pd`` created outside the PMD.
  PMD will use this handle to import remote PD. The ``pd_handle`` can be
  obtained from the original PD by reading its ``ibv_pd->handle`` member value.
  This parameter is valid only if ``cmd_fd`` parameter is specified,
  and its value must be a valid kernel handle for a PD object
  in the context represented by given ``cmd_fd``.

  By default, the PMD will allocate a new PD.

  .. note::

     The ``ibv_pd->handle`` member is different from the ``mlx5dv_pd->pdn`` member.
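
The constraints stated in this guide for the ``-a`` device arguments
(colon-separated classes, no ``eth`` together with ``vdpa``,
``cmd_fd`` and ``pd_handle`` only valid together)
can be sketched as a small helper that builds the argument string.
The helper name ``mlx5_devargs`` is hypothetical and is not part of DPDK;
it checks only the rules quoted above, not the class-specific arguments.

.. code-block:: python

   SUPPORTED_CLASSES = ("compress", "crypto", "eth", "regex", "vdpa")


   def mlx5_devargs(device, classes=("eth",), **params):
       """Build the value of the EAL -a option for one mlx5 device.

       mlx5_devargs is a hypothetical helper used for illustration only.
       """
       classes = list(classes)
       unknown = [name for name in classes if name not in SUPPORTED_CLASSES]
       if unknown:
           raise ValueError(f"unsupported class: {unknown}")
       if "eth" in classes and "vdpa" in classes:
           raise ValueError("eth and vdpa cannot be probed at the same time")
       # cmd_fd and pd_handle are each valid only if the other is specified.
       if ("cmd_fd" in params) != ("pd_handle" in params):
           raise ValueError("cmd_fd and pd_handle must be given together")
       fields = [device, "class=" + ":".join(classes)]
       # Keyword arguments keep their call order, so the output is stable.
       fields.extend(f"{key}={value}" for key, value in params.items())
       return ",".join(fields)


   print(mlx5_devargs("0000:03:00.2", ["eth", "regex"], mr_mempool_reg_en=0))
   # prints 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0

The printed string matches the PCI device example given at the start of this section.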