..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2022 6WIND S.A.
    Copyright (c) 2022 NVIDIA Corporation & Affiliates

.. include:: <isonum.txt>

NVIDIA MLX5 Common Driver
=========================

.. note::

   NVIDIA acquired Mellanox Technologies in 2020.
   The DPDK documentation and code might still include instances
   of or references to Mellanox trademarks (like BlueField and ConnectX)
   that are now NVIDIA trademarks.

The mlx5 common driver library (**librte_common_mlx5**) provides support for
**NVIDIA ConnectX-4**, **NVIDIA ConnectX-4 Lx**, **NVIDIA ConnectX-5**,
**NVIDIA ConnectX-6**, **NVIDIA ConnectX-6 Dx**, **NVIDIA ConnectX-6 Lx**,
**NVIDIA ConnectX-7**, **NVIDIA BlueField**, **NVIDIA BlueField-2** and
**NVIDIA BlueField-3** families of 10/25/40/50/100/200 Gb/s adapters.

Information and documentation for these adapters can be found on the
`NVIDIA website <https://www.nvidia.com/en-us/networking/>`_.
Help is also provided by the
`NVIDIA Networking forum <https://forums.developer.nvidia.com/c/infrastructure/369/>`_.
In addition, there is a `web section dedicated to DPDK
<https://developer.nvidia.com/networking/dpdk>`_.


Design
------

For security reasons and to enhance robustness,
this driver only handles virtual memory addresses.
The way resource allocations are handled by the kernel,
combined with hardware specifications that allow handling virtual memory addresses directly,
ensures that DPDK applications cannot access random physical memory
(or memory that does not belong to the current process).

There are different levels of objects and bypassing abilities
which are used to get the best performance:

- **Verbs** is a complete high-level generic API
- **Direct Verbs** is a device-specific API
- **DevX** allows accessing firmware objects
- **Direct Rules** manages flow steering at the low-level hardware layer

On Linux, the above interfaces are provided by linking with ``libibverbs`` and ``libmlx5``.
See :ref:`mlx5_linux_prerequisites` for installation.

On Windows, DevX is the only requirement from the above list.
See :ref:`mlx5_windows_prerequisites` for DevX SDK package installation.


.. _mlx5_classes:

Classes
-------

One mlx5 device can be probed by a number of different PMDs.
To select a specific PMD, its name should be specified as a device parameter
(e.g. ``0000:08:00.1,class=eth``).

In order to allow probing by multiple PMDs,
several classes may be listed separated by a colon.
For example: ``class=crypto:regex`` will probe both Crypto and RegEx PMDs.


Supported Classes
~~~~~~~~~~~~~~~~~

- ``class=compress`` for :doc:`../../compressdevs/mlx5`.
- ``class=crypto`` for :doc:`../../cryptodevs/mlx5`.
- ``class=eth`` for :doc:`../../nics/mlx5`.
- ``class=regex`` for :doc:`../../regexdevs/mlx5`.
- ``class=vdpa`` for :doc:`../../vdpadevs/mlx5`.

By default, the mlx5 device will be probed by the ``eth`` PMD.


Limitations
~~~~~~~~~~~

- ``eth`` and ``vdpa`` PMDs cannot be probed at the same time.
  All other combinations are possible.

- On Windows, only ``eth`` and ``crypto`` are supported.

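The class rules above can be checked before launching an application.
The following is a minimal sketch, not part of DPDK:
the function name ``mlx5_class_devargs`` is hypothetical,
and it only encodes the class names and the ``eth``/``vdpa`` restriction listed in this section
(the Windows restriction is not checked).

.. code-block:: shell

   # Hypothetical helper: print "<device>,class=<a>:<b>" for the given classes.
   # Usage: mlx5_class_devargs <device> [class...]
   mlx5_class_devargs() {
       [ $# -ge 1 ] || return 1
       dev=$1
       shift
       classes=""
       has_eth=0
       has_vdpa=0
       for c in "$@"; do
           case $c in
               compress|crypto|regex) ;;
               eth) has_eth=1 ;;
               vdpa) has_vdpa=1 ;;
               *) echo "unknown class: $c" >&2; return 1 ;;
           esac
           # Classes are joined with a colon.
           classes="${classes:+$classes:}$c"
       done
       if [ "$has_eth" -eq 1 ] && [ "$has_vdpa" -eq 1 ]; then
           echo "eth and vdpa cannot be probed at the same time" >&2
           return 1
       fi
       if [ -z "$classes" ]; then
           # Without a class argument, the device is probed by the eth PMD.
           printf '%s\n' "$dev"
       else
           printf '%s,class=%s\n' "$dev" "$classes"
       fi
   }

For instance, ``mlx5_class_devargs 0000:08:00.1 crypto regex``
prints ``0000:08:00.1,class=crypto:regex``,
while requesting both ``eth`` and ``vdpa`` returns an error.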

.. _mlx5_common_compilation:

Compilation Prerequisites
-------------------------

.. _mlx5_linux_prerequisites:

Linux Prerequisites
~~~~~~~~~~~~~~~~~~~

This driver relies on external libraries and kernel drivers
for resource allocation and initialization.
The following dependencies are not part of DPDK and must be installed separately:

- **libibverbs**

  User space Verbs framework used by ``librte_common_mlx5``.
  This library provides a generic interface between the kernel
  and low-level user space drivers such as ``libmlx5``.

  It allows slow and privileged operations (context initialization,
  hardware resource allocations) to be managed by the kernel
  and fast operations to never leave user space.

- **libmlx5**

  Low-level user space driver library for NVIDIA devices.
  It is automatically loaded by ``libibverbs``.

  This library basically implements send/receive calls to the hardware queues.

- **Kernel modules**

  They provide the kernel-side Verbs API and low-level device drivers
  that manage actual hardware initialization
  and resource sharing with user-space processes.

  Unlike with most other PMDs, these modules must remain loaded and bound to
  their devices:

  - ``mlx5_core``: hardware driver managing NVIDIA devices
    and related Ethernet kernel network devices.
  - ``mlx5_ib``: InfiniBand device driver.
  - ``ib_uverbs``: user space driver for Verbs (entry point for ``libibverbs``).

- **Firmware update**

  NVIDIA MLNX_OFED/EN releases include firmware updates.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

Libraries and kernel modules can be provided either by the Linux distribution,
or by installing NVIDIA MLNX_OFED/EN which provides compatibility with older kernels.

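Since the kernel modules listed above must remain loaded,
it can be useful to verify them before starting an application.
The sketch below is only an illustration: the function name is hypothetical,
and its optional argument exists so that the check can be exercised on a saved module list.
Drivers built into the kernel image do not appear in ``/proc/modules``,
so a module reported as missing is a hint, not a proof.

.. code-block:: shell

   # Print the required kernel modules that are not listed as loaded.
   # $1 (optional): a file in /proc/modules format, default /proc/modules.
   mlx5_missing_modules() {
       list=${1:-/proc/modules}
       for m in mlx5_core mlx5_ib ib_uverbs; do
           # The module name is the first field of each line.
           awk -v m="$m" '$1 == m { found = 1 } END { exit !found }' "$list" \
               || echo "$m"
       done
   }

An empty output from ``mlx5_missing_modules`` means that all three modules were found.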

Upstream Dependencies
^^^^^^^^^^^^^^^^^^^^^

The mlx5 kernel modules are part of upstream Linux.
The minimal supported kernel version is 4.14.
For 32-bit, version 4.14.41 or above is required.

The libraries ``libibverbs`` and ``libmlx5`` are part of ``rdma-core``,
which is packaged by most Linux distributions.
The minimal supported rdma-core version is 16.
For 32-bit, version 18 or above is required.

The rdma-core sources can be downloaded at
https://github.com/linux-rdma/rdma-core

It is possible to build rdma-core as static libraries starting with version 21::

    cd build
    CFLAGS=-fPIC cmake -DENABLE_STATIC=1 -DNO_PYVERBS=1 -DNO_MAN_PAGES=1 -GNinja ..
    ninja
    ninja install

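The minimal kernel version given in this section can be compared
against the running kernel with a small version test.
This is a sketch under one assumption: ``sort -V`` (version sort, as in GNU coreutils) is available.
The function name ``version_ge`` is hypothetical.

.. code-block:: shell

   # Succeed if version $1 is greater than or equal to version $2.
   version_ge() {
       # The smaller of the two versions sorts first.
       [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
   }

   kernel=$(uname -r)
   if version_ge "$kernel" 4.14; then
       echo "kernel $kernel: OK"
   else
       echo "kernel $kernel: older than 4.14"
   fi

The same function can be applied to an rdma-core version string
reported by the distribution package manager.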

NVIDIA MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^

The kernel modules and libraries are packaged with other tools
in NVIDIA MLNX_OFED or NVIDIA MLNX_EN.
The minimal supported versions are:

- NVIDIA MLNX_OFED version: **4.5** and above.
- NVIDIA MLNX_EN version: **4.5** and above.
- Firmware version:

  - ConnectX-4: **12.21.1000** and above.
  - ConnectX-4 Lx: **14.21.1000** and above.
  - ConnectX-5: **16.21.1000** and above.
  - ConnectX-5 Ex: **16.21.1000** and above.
  - ConnectX-6: **20.27.0090** and above.
  - ConnectX-6 Dx: **22.27.0090** and above.
  - ConnectX-6 Lx: **26.27.0090** and above.
  - ConnectX-7: **28.33.2028** and above.
  - BlueField: **18.25.1010** and above.
  - BlueField-2: **24.28.1002** and above.
  - BlueField-3: **32.36.3126** and above.

The firmware, the libraries libibverbs, libmlx5, and mlnx-ofed-kernel modules
are packaged in `NVIDIA MLNX_OFED
<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_.
After downloading, it can be installed with this command::

   ./mlnxofedinstall --dpdk

`NVIDIA MLNX_EN
<https://network.nvidia.com/products/ethernet-drivers/linux/mlnx_en/>`_
is a smaller package including what is needed for DPDK.
After downloading, it can be installed with this command::

   ./install --dpdk

After installing, the firmware version can be checked::

   ibv_devinfo

.. note::

   Several versions of NVIDIA MLNX_OFED/EN are available. Installing the version
   this DPDK release was developed and tested against is strongly recommended.
   Please check the "Tested Platforms" section in the :doc:`../../rel_notes/index`.


.. _mlx5_windows_prerequisites:

Windows Prerequisites
~~~~~~~~~~~~~~~~~~~~~

The mlx5 PMDs rely on external libraries and kernel drivers
for resource allocation and initialization.


DevX SDK Installation
^^^^^^^^^^^^^^^^^^^^^

The DevX SDK must be installed on the machine building the Windows PMD.
Additional information can be found at
`How to Integrate Windows DevX in Your Development Environment
<https://docs.nvidia.com/networking/display/winof2v260/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`_.
The minimal supported WinOF2 version is 2.60.


Compilation Options
-------------------

Compilation on Linux
~~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

``shared`` (default)
   The PMD depends on some .so files.

``dlopen``
   Split the dependency glue into a separate library
   loaded when needed by dlopen (see ``MLX5_GLUE_PATH``).
   It makes dependencies on libibverbs and libmlx5 optional,
   and has no performance impact.

``static``
   Embed the static flavor of the dependencies libibverbs and libmlx5
   in the PMD shared library or the executable static binary.


Compilation on Windows
~~~~~~~~~~~~~~~~~~~~~~

The DevX SDK location must be set through CFLAGS/LDFLAGS,
either::

   meson.exe setup "-Dc_args=-I\"%DEVX_INC_PATH%\"" "-Dc_link_args=-L\"%DEVX_LIB_PATH%\"" ...

or::

   set CFLAGS=-I"%DEVX_INC_PATH%" && set LDFLAGS=-L"%DEVX_LIB_PATH%" && meson.exe setup ...


.. _mlx5_common_env:

Environment Configuration
-------------------------

Linux Environment
~~~~~~~~~~~~~~~~~

The kernel network interfaces are brought up during initialization.
Forcing them down prevents packet reception.

The ethtool operations on the kernel interfaces may also affect the PMD.

Some runtime behaviours may be configured through environment variables.

``MLX5_GLUE_PATH``
   If built with ``ibverbs_link=dlopen``,
   list of directories in which to search for the rdma-core "glue" plug-in,
   separated by colons or semi-colons.

``MLX5_SHUT_UP_BF``
   If Verbs is used (DevX disabled),
   HW queue doorbell register mapping.
   The value 0 means a regular memory mapping (usually write-combining),
   while 1 means a non-cached IO mapping.

   With regular memory mapping, the register is flushed to HW
   usually when the write-combining buffer becomes full,
   but it depends on CPU design.


Port Link with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Port links must be set to Ethernet::

   mlxconfig -d <mst device> query | grep LINK_TYPE
   LINK_TYPE_P1                        ETH(2)
   LINK_TYPE_P2                        ETH(2)

   mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3

Link type values are:

* ``1`` InfiniBand
* ``2`` Ethernet
* ``3`` VPI (auto-sense)

If the link type was changed, the firmware must be reset as well::

   mlxfwreset -d <mst device> reset

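The query output shown above can be checked automatically.
The sketch below assumes that the output has the two-column layout of the example
(setting name, then value such as ``ETH(2)``); the function name is hypothetical.

.. code-block:: shell

   # Read "mlxconfig ... query" output on standard input.
   # Succeed only if at least one LINK_TYPE_P<n> line is present
   # and all of them report ETH(2); print the ports that do not.
   # Usage: mlxconfig -d <mst device> query | mlx5_links_are_eth
   mlx5_links_are_eth() {
       awk '
           $1 ~ /^LINK_TYPE_P[0-9]+$/ {
               seen = 1
               if ($2 != "ETH(2)") {
                   print $1 " is " $2
                   bad = 1
               }
           }
           END { exit (bad || !seen) }
       '
   }

A non-zero exit status indicates that a port still needs ``LINK_TYPE_P<n>=2``
followed by a firmware reset.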

.. _mlx5_vf:

SR-IOV Virtual Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SR-IOV must be enabled on the NIC.
It can be checked with the following command::

   mlxconfig -d <mst device> query | grep SRIOV_EN
   SRIOV_EN                            True(1)

If needed, configure SR-IOV::

   mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
   mlxfwreset -d <mst device> reset

After making the change, restart the driver::

   /etc/init.d/openibd restart

or::

   service openibd restart

Then the virtual functions can be instantiated::

   echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs

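The last step can be guarded against requesting more VFs than the device allows.
The following sketch relies on the standard PCI sysfs attribute ``sriov_totalvfs``,
which sits next to ``sriov_numvfs``; the function name is hypothetical.

.. code-block:: shell

   # Instantiate VFs after checking the device limit.
   # $1: device directory, e.g. /sys/class/infiniband/mlx5_0/device
   # $2: number of VFs to create
   mlx5_set_vfs() {
       dev=$1
       want=$2
       total=$(cat "$dev/sriov_totalvfs") || return 1
       if [ "$want" -gt "$total" ]; then
           echo "requested $want VFs, device supports at most $total" >&2
           return 1
       fi
       echo "$want" > "$dev/sriov_numvfs"
   }

It would be called as ``mlx5_set_vfs /sys/class/infiniband/mlx5_0/device 4``
with sufficient privileges to write to sysfs.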

.. _mlx5_sub_function:

Sub-Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A Sub-Function (SF) is a portion of the PCI device
with its own dedicated queues.
An SF shares PCI-level resources with other SFs and/or with its parent PCI function.

#. Requirement::

      MLNX_OFED version >= 5.4-0.3.3.0

#. Configure SF feature::

      # Run mlxconfig on both PFs on host and ECPFs on BlueField.
      mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12

#. Enable switchdev mode::

      mlxdevm dev eswitch set pci/<DBDF> mode switchdev

#. Add SF port::

      mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>

      Get SFID from output: pci/<DBDF>/<SFID>

#. Modify MAC address::

      mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>

#. Activate SF port::

      mlxdevm port function set pci/<DBDF>/<SFID> state active

#. Devargs to probe SF device::

      auxiliary:mlx5_core.sf.<num>,class=eth:regex

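The ``<num>`` in the devargs of the last step is the index of the auxiliary device,
which is not necessarily the ``<sfnum>`` chosen when adding the port.
The sketch below looks up the auxiliary device for a given ``sfnum``.
It assumes that the kernel exposes an ``sfnum`` attribute
in each ``mlx5_core.sf.*`` directory under ``/sys/bus/auxiliary/devices``;
the function name and its optional second argument are hypothetical.

.. code-block:: shell

   # Print the devargs prefix of the SF created with user number $1.
   # $2 (optional): directory holding the auxiliary devices.
   mlx5_sf_devargs() {
       base=${2:-/sys/bus/auxiliary/devices}
       for d in "$base"/mlx5_core.sf.*; do
           # Skip entries without a readable sfnum (or an unmatched pattern).
           [ -r "$d/sfnum" ] || continue
           if [ "$(cat "$d/sfnum")" = "$1" ]; then
               printf 'auxiliary:%s\n' "${d##*/}"
               return 0
           fi
       done
       return 1
   }

The printed string can then be completed with a class list,
for example by appending ``,class=eth:regex``.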

Enable Switchdev Mode
^^^^^^^^^^^^^^^^^^^^^

Switchdev mode is an E-Switch mode that binds a representor to a VF or SF.
A representor is a port in DPDK that is connected to a VF or SF in such a way
that, assuming there are no offload flows, each packet sent from the VF or SF
is received by the corresponding representor,
while each packet sent to a representor is received by the VF or SF.

After :ref:`configuring VF <mlx5_vf>`, the device must be unbound::

   printf "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind

Then switchdev mode is enabled::

   echo switchdev > /sys/class/net/<net device>/compat/devlink/mode

The device can be bound again at this point.


Run as Non-Root
^^^^^^^^^^^^^^^

Hugepage and resource limit setup are documented
in the :ref:`common Linux guide <Running_Without_Root_Privileges>`.
This PMD can operate without access to physical addresses,
therefore it does not require ``SYS_ADMIN`` to access ``/proc/self/pagemap``.
Note that this requirement may still come from other drivers.

Below are the additional capabilities that must be granted to the application,
with the reason each one is needed:

``NET_RAW``
   For raw Ethernet queue allocation through the kernel driver.

``NET_ADMIN``
   For device configuration, like setting link status or MTU.

``SYS_RAWIO``
   For using group 1 and above (software steering) in Flow API.

They can be manually granted for a specific executable file::

   setcap cap_net_raw,cap_net_admin,cap_sys_rawio+ep <executable>

Alternatively, a service manager or a container runtime
may configure the capabilities for a process.

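When capabilities are configured by a service manager or a container runtime,
it can be useful to check what a process actually received.
The sketch below decodes the effective capability mask from ``/proc/self/status``.
The bit numbers are those of ``<linux/capability.h>``
(``CAP_NET_ADMIN`` 12, ``CAP_NET_RAW`` 13, ``CAP_SYS_RAWIO`` 17);
the function name ``cap_is_set`` is hypothetical.

.. code-block:: shell

   # Succeed if bit $2 is set in the hexadecimal capability mask $1.
   cap_is_set() {
       [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
   }

   # Effective capabilities as seen by a child of this shell (awk inherits them).
   mask=$(awk '$1 == "CapEff:" { print $2 }' /proc/self/status 2>/dev/null)
   mask=${mask:-0}

   for cap in NET_ADMIN:12 NET_RAW:13 SYS_RAWIO:17; do
       name=${cap%%:*}
       bit=${cap##*:}
       if cap_is_set "$mask" "$bit"; then
           echo "CAP_$name: present"
       else
           echo "CAP_$name: missing"
       fi
   done

This reports the capabilities of the running shell only.
File capabilities granted with ``setcap`` can be listed with ``getcap <executable>``.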

Windows Environment
~~~~~~~~~~~~~~~~~~~

WinOF2 version 2.60 or higher must be installed on the machine.


WinOF2 Installation
^^^^^^^^^^^^^^^^^^^

The driver can be downloaded from the following site: `WINOF2
<https://network.nvidia.com/products/adapter-software/ethernet/windows/winof-2/>`_.


DevX Enablement
^^^^^^^^^^^^^^^

DevX for Windows must be enabled in the Windows registry.
The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
Additional information can be found in the WinOF2 user manual.


.. _mlx5_firmware_config:

Firmware Configuration
~~~~~~~~~~~~~~~~~~~~~~

Firmware features can be configured as key/value pairs.

The command to set a value is::

  mlxconfig -d <device> set <key>=<value>

The command to query a value is::

  mlxconfig -d <device> query <key>

The device name for the command ``mlxconfig`` can be either the PCI address,
or the mst device name found with::

  mst status

Some firmware configurations are listed below.

- link type::

    LINK_TYPE_P1
    LINK_TYPE_P2
    value: 1=Infiniband 2=Ethernet 3=VPI(auto-sense)

- enable SR-IOV::

    SRIOV_EN=1

- the maximum number of SR-IOV virtual functions::

    NUM_OF_VFS=<max>

- enable DevX (required by Direct Rules and other features)::

    UCTX_EN=1

- aggressive CQE zipping::

    CQE_COMPRESSION=1

- L3 VXLAN and VXLAN-GPE destination UDP port::

    IP_OVER_VXLAN_EN=1
    IP_OVER_VXLAN_PORT=<udp dport>

- enable VXLAN-GPE tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0
    or
    FLEX_PARSER_PROFILE_ENABLE=2

- enable IP-in-IP tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0

- enable MPLS flow matching::

    FLEX_PARSER_PROFILE_ENABLE=1

- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::

    FLEX_PARSER_PROFILE_ENABLE=2

- enable Geneve flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0
    or
    FLEX_PARSER_PROFILE_ENABLE=1

- enable Geneve TLV option flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0
    or
    FLEX_PARSER_PROFILE_ENABLE=8

- enable GTP flow matching::

    FLEX_PARSER_PROFILE_ENABLE=3

- enable eCPRI flow matching::

    FLEX_PARSER_PROFILE_ENABLE=4
    PROG_PARSE_GRAPH=1

- enable dynamic flex parser for flex item::

    FLEX_PARSER_PROFILE_ENABLE=4
    PROG_PARSE_GRAPH=1

- enable realtime timestamp format::

    REAL_TIME_CLOCK_ENABLE=1

- allow locking hairpin RQ data buffer in device memory::

    HAIRPIN_DATA_BUFFER_LOCK=1
    MEMIC_SIZE_LIMIT=0

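Several of the features above depend on ``FLEX_PARSER_PROFILE_ENABLE``,
and only some profile values serve more than one feature.
The sketch below computes the values common to a set of features,
using only the values listed in this section.
The feature labels and function names are hypothetical,
and firmware releases may accept values that are not listed here.

.. code-block:: shell

   # Print the FLEX_PARSER_PROFILE_ENABLE values listed for one feature.
   flex_profiles() {
       case $1 in
           vxlan-gpe)  echo "0 2" ;;
           ip-in-ip)   echo "0" ;;
           mpls)       echo "1" ;;
           icmp)       echo "2" ;;
           geneve)     echo "0 1" ;;
           geneve-tlv) echo "0 8" ;;
           gtp)        echo "3" ;;
           ecpri|flex) echo "4" ;;  # also requires PROG_PARSE_GRAPH=1
           *) return 1 ;;
       esac
   }

   # Print the profile values that satisfy every feature given as argument.
   flex_common_profiles() {
       common=$(flex_profiles "$1") || return 1
       shift
       for f in "$@"; do
           allowed=$(flex_profiles "$f") || return 1
           next=""
           # Keep only the values also allowed for this feature.
           for p in $common; do
               case " $allowed " in
                   *" $p "*) next="${next:+$next }$p" ;;
               esac
           done
           common=$next
       done
       echo "$common"
   }

For example, ``flex_common_profiles vxlan-gpe icmp`` prints ``2``,
whereas ``flex_common_profiles gtp mpls`` prints an empty line
because no listed value enables both.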

.. _mlx5_common_driver_options:

Device Arguments
----------------

The driver can be configured per device.
A single argument list can be used for a device managed by multiple PMDs.
The parameters must be passed through the EAL option ``-a``,
as in the examples below:

- PCI device::

    -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0

- Auxiliary SF::

    -a auxiliary:mlx5_core.sf.2,class=compress,mr_ext_memseg_en=0

Each device class PMD has its own list of specific arguments,
and below are the arguments supported by the common mlx5 layer.

- ``class`` parameter [string]

  Select the classes of the drivers that should probe the device.
  See :ref:`mlx5_classes` for more explanation and details.

  The default value is ``eth``.

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory.
  If enabled, the number of entries in the MR (Memory Region) lookup table
  on the datapath is minimized, which benefits performance.
  On the other hand, it worsens memory utilization
  because registered memory is pinned by the kernel driver:
  even if a page in the extended chunk is freed,
  it does not become reusable until the entire memory is freed.

  Enabled by default.

- ``mr_mempool_reg_en`` parameter [int]

  A nonzero value enables implicit registration of DMA memory of all mempools
  except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
  for mempools populated with non-contiguous objects or those without IOVA.
  The effect is that when a packet from a mempool is transmitted,
  its memory is already registered for DMA in the PMD and no registration
  will happen on the data path. The tradeoff is extra work on the creation
  of each mempool and increased HW resource use if some mempools
  are not used with MLX5 devices.

  Enabled by default.

- ``sys_mem_en`` parameter [int]

  A nonzero value makes the PMD memory management allocate memory
  from the system by default, that is, when no explicit rte memory flag is given.

  By default, the PMD will set this value to 0.

- ``sq_db_nc`` parameter [int]

  The rdma-core library can map the doorbell register in two ways,
  depending on the environment variable ``MLX5_SHUT_UP_BF``:

  - As regular cached memory (usually with write combining attribute),
    if the variable is either missing or set to zero.
  - As non-cached memory, if the variable is present and set to a value other than "0".

  The same doorbell mapping approach is implemented directly by the PMD
  in UAR generation for queues created with DevX.

  The type of mapping may slightly affect the send queue performance.
  The optimal choice strongly depends on the host architecture
  and should be determined empirically.

  If ``sq_db_nc`` is set to zero, the doorbell is forced to be mapped to
  regular memory (with write combining), and the PMD performs an extra write
  memory barrier after writing to the doorbell. This might increase the CPU
  clocks needed per packet sent, but latency might be improved.

  If ``sq_db_nc`` is set to one, the doorbell is forced to be mapped to
  non-cached memory, and the PMD does not perform the extra write memory barrier
  after writing to the doorbell. On some architectures this might improve performance.

  If ``sq_db_nc`` is set to two, the doorbell is forced to be mapped to
  regular memory, and the PMD uses heuristics to decide whether a write memory
  barrier should be performed. For bursts whose size is a multiple of the
  recommended one (64 packets), the PMD assumes that another burst is coming
  and does not issue the extra memory barrier (it is expected to be issued
  in the next burst, at least after descriptor writing). This might increase
  latency (on some hosts, until the next packets are transmitted)
  and should be used with care.
  The PMD uses heuristics only for Tx queues; for other send queues, the doorbell
  is forced to be mapped to regular memory, the same as when ``sq_db_nc`` is set to 0.

  If ``sq_db_nc`` is omitted, the value of the environment variable
  ``MLX5_SHUT_UP_BF`` is used, if it is set. If ``MLX5_SHUT_UP_BF`` is not set,
  the default ``sq_db_nc`` value is zero for ARM64 hosts and one for others.

- ``cmd_fd`` parameter [int]

  File descriptor of ``ibv_context`` created outside the PMD.
  The PMD will use this FD to import the remote context. The ``cmd_fd`` is obtained from
  the ``ibv_context->cmd_fd`` member, which must be dup'd before being passed.
  This parameter is valid only if the ``pd_handle`` parameter is specified.

  By default, the PMD will create a new ``ibv_context``.

  .. note::

     When the FD comes from another process, it is the user's responsibility to
     share the FD between the processes (e.g. by ``SCM_RIGHTS``).

- ``pd_handle`` parameter [int]

  Protection domain handle of ``ibv_pd`` created outside the PMD.
  The PMD will use this handle to import the remote PD. The ``pd_handle`` can be
  obtained from the original PD by reading its ``ibv_pd->handle`` member.
  This parameter is valid only if the ``cmd_fd`` parameter is specified,
  and its value must be a valid kernel handle for a PD object
  in the context represented by the given ``cmd_fd``.

  By default, the PMD will allocate a new PD.

  .. note::

     The ``ibv_pd->handle`` member is different from the ``mlx5dv_pd->pdn`` member.
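
The precedence described for ``sq_db_nc`` can be summarized in a few lines of shell.
This is a sketch of the rules as written in this section, not of the driver code:
the function name is hypothetical, and the machine names ``aarch64`` and ``arm64``
are assumed to be the ones reported by ``uname -m`` on ARM64 hosts.

.. code-block:: shell

   # Print the doorbell mapping mode that results from the given inputs.
   # $1: sq_db_nc devarg value, or empty if the devarg is omitted
   # $2: MLX5_SHUT_UP_BF value, or empty if the variable is not set
   # $3: machine name as printed by "uname -m"
   mlx5_effective_sq_db_nc() {
       if [ -n "$1" ]; then
           # An explicit devarg takes precedence.
           echo "$1"
       elif [ -n "$2" ]; then
           # "0" selects regular (write-combining) mapping,
           # any other value selects non-cached mapping.
           if [ "$2" = "0" ]; then echo 0; else echo 1; fi
       else
           case $3 in
               aarch64|arm64) echo 0 ;;  # default on ARM64 hosts
               *)             echo 1 ;;  # default on other hosts
           esac
       fi
   }

For the current environment, with no devarg given, it could be called as
``mlx5_effective_sq_db_nc "" "${MLX5_SHUT_UP_BF-}" "$(uname -m)"``.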
695