..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2022 6WIND S.A.
    Copyright (c) 2022 NVIDIA Corporation & Affiliates

.. include:: <isonum.txt>

NVIDIA MLX5 Common Driver
=========================

.. note::

   NVIDIA acquired Mellanox Technologies in 2020.
   The DPDK documentation and code might still include instances
   of or references to Mellanox trademarks (like BlueField and ConnectX)
   that are now NVIDIA trademarks.

The mlx5 common driver library (**librte_common_mlx5**) provides support for
**NVIDIA ConnectX-4**, **NVIDIA ConnectX-4 Lx**, **NVIDIA ConnectX-5**,
**NVIDIA ConnectX-6**, **NVIDIA ConnectX-6 Dx**, **NVIDIA ConnectX-6 Lx**,
**NVIDIA ConnectX-7**, **NVIDIA BlueField**, **NVIDIA BlueField-2** and
**NVIDIA BlueField-3** families of 10/25/40/50/100/200 Gb/s adapters.

Information and documentation for these adapters can be found on the
`NVIDIA website <https://www.nvidia.com/en-us/networking/>`_.
Help is also provided by the
`NVIDIA Networking forum <https://forums.developer.nvidia.com/c/infrastructure/369/>`_.
In addition, there is a `web section dedicated to DPDK
<https://developer.nvidia.com/networking/dpdk>`_.


Design
------

For security reasons and to enhance robustness,
this driver only handles virtual memory addresses.
The way resource allocations are handled by the kernel,
combined with hardware specifications that allow handling virtual memory addresses directly,
ensures that DPDK applications cannot access random physical memory
(or memory that does not belong to the current process).

There are different levels of objects and bypassing abilities
which are used to get the best performance:

- **Verbs** is a complete high-level generic API
- **Direct Verbs** is a device-specific API
- **DevX** allows accessing firmware objects
- **Direct Rules** manages flow steering at the low-level hardware layer

On Linux, the above interfaces are provided by linking with ``libibverbs`` and ``libmlx5``.
See :ref:`mlx5_linux_prerequisites` for installation.

On Windows, DevX is the only requirement from the above list.
See :ref:`mlx5_windows_prerequisites` for DevX SDK package installation.


.. _mlx5_classes:

Classes
-------

One mlx5 device can be probed by a number of different PMDs.
To select a specific PMD, its name should be specified as a device parameter
(e.g. ``0000:08:00.1,class=eth``).

In order to allow probing by multiple PMDs,
several classes may be listed separated by a colon.
For example: ``class=crypto:regex`` will probe both Crypto and RegEx PMDs.


Supported Classes
~~~~~~~~~~~~~~~~~

- ``class=compress`` for :doc:`../../compressdevs/mlx5`.
- ``class=crypto`` for :doc:`../../cryptodevs/mlx5`.
- ``class=eth`` for :doc:`../../nics/mlx5`.
- ``class=regex`` for :doc:`../../regexdevs/mlx5`.
- ``class=vdpa`` for :doc:`../../vdpadevs/mlx5`.

By default, the mlx5 device will be probed by the ``eth`` PMD.


Limitations
~~~~~~~~~~~

- ``eth`` and ``vdpa`` PMDs cannot be probed at the same time.
  All other combinations are possible.

- On Windows, only ``eth`` and ``crypto`` are supported.
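
The class rules above can be checked before an application is launched.
The following is a minimal sketch in POSIX shell, not part of DPDK:
``mlx5_class_arg`` is a hypothetical helper name,
and it only encodes the class names and the ``eth``/``vdpa`` restriction
listed in this section (the Windows restriction is not handled).

.. code-block:: shell

   # Build the "class=" device parameter from a list of class names.
   # Hypothetical helper for illustration; DPDK does not provide it.
   mlx5_class_arg() {
       has_eth=0
       has_vdpa=0
       list=""
       for c in "$@"; do
           case "$c" in
               compress|crypto|regex) ;;
               eth) has_eth=1 ;;
               vdpa) has_vdpa=1 ;;
               *) echo "unknown class: $c" >&2; return 1 ;;
           esac
           # Several classes are separated by a colon.
           list="${list:+$list:}$c"
       done
       if [ "$has_eth" -eq 1 ] && [ "$has_vdpa" -eq 1 ]; then
           echo "eth and vdpa cannot be probed at the same time" >&2
           return 1
       fi
       # Without any class, the device is probed by the eth PMD.
       echo "class=${list:-eth}"
   }

For example, ``mlx5_class_arg crypto regex`` prints ``class=crypto:regex``,
which can be appended to a device address such as ``0000:08:00.1``.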


.. _mlx5_common_compilation:

Compilation Prerequisites
-------------------------

.. _mlx5_linux_prerequisites:

Linux Prerequisites
~~~~~~~~~~~~~~~~~~~

This driver relies on external libraries and kernel drivers for resource
allocation and initialization.
The following dependencies are not part of DPDK and must be installed separately:

- **libibverbs**

  User space Verbs framework used by ``librte_common_mlx5``.
  This library provides a generic interface between the kernel
  and low-level user space drivers such as ``libmlx5``.

  It allows slow and privileged operations (context initialization,
  hardware resource allocation) to be managed by the kernel
  and fast operations to never leave user space.

- **libmlx5**

  Low-level user space driver library for NVIDIA devices,
  automatically loaded by ``libibverbs``.

  This library basically implements send/receive calls to the hardware queues.

- **Kernel modules**

  They provide the kernel-side Verbs API and low-level device drivers
  that manage actual hardware initialization
  and resource sharing with user-space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - ``mlx5_core``: hardware driver managing NVIDIA devices
    and related Ethernet kernel network devices.
  - ``mlx5_ib``: InfiniBand device driver.
  - ``ib_uverbs``: user space driver for Verbs (entry point for ``libibverbs``).

- **Firmware update**

  NVIDIA MLNX_OFED/EN releases include firmware updates.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

Libraries and kernel modules can be provided either by the Linux distribution,
or by installing NVIDIA MLNX_OFED/EN, which provides compatibility with older kernels.


Upstream Dependencies
^^^^^^^^^^^^^^^^^^^^^

The mlx5 kernel modules are part of upstream Linux.
The minimal supported kernel version is 4.14.
For 32-bit, version 4.14.41 or above is required.

The libraries ``libibverbs`` and ``libmlx5`` are part of ``rdma-core``.
It is packaged by most Linux distributions.
The minimal supported rdma-core version is 16.
For 32-bit, version 18 or above is required.
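
The minimum versions above can be compared against an installed system.
The following is a sketch in POSIX shell under two assumptions:
``sort -V`` (version sort, as in GNU coreutils) is available,
and the helper names ``version_ge`` and ``check_kernel`` are hypothetical.

.. code-block:: shell

   # version_ge A B: succeed when version A is greater than or equal to B.
   version_ge() {
       # sort -V orders version strings; A >= B when B sorts first.
       [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
   }

   # check_kernel RELEASE BITS: apply the kernel minimums quoted above.
   # RELEASE is a string like the output of "uname -r", BITS is 32 or 64.
   check_kernel() {
       if [ "$2" = 32 ]; then min=4.14.41; else min=4.14; fi
       # Drop the distribution suffix, e.g. "5.15.0-91-generic" -> "5.15.0".
       version_ge "${1%%-*}" "$min"
   }

A typical call would be ``check_kernel "$(uname -r)" 64``.
The same ``version_ge`` comparison applies to the rdma-core version,
for example ``version_ge 18 16``.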

The rdma-core sources can be downloaded at
https://github.com/linux-rdma/rdma-core

It is possible to build rdma-core as static libraries starting with version 21::

    cd build
    CFLAGS=-fPIC cmake -DENABLE_STATIC=1 -DNO_PYVERBS=1 -DNO_MAN_PAGES=1 -GNinja ..
    ninja
    ninja install


NVIDIA MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^

The kernel modules and libraries are packaged with other tools
in NVIDIA MLNX_OFED or NVIDIA MLNX_EN.
The minimal supported versions are:

- NVIDIA MLNX_OFED version: **4.5** and above.
- NVIDIA MLNX_EN version: **4.5** and above.
- Firmware version:

  - ConnectX-4: **12.21.1000** and above.
  - ConnectX-4 Lx: **14.21.1000** and above.
  - ConnectX-5: **16.21.1000** and above.
  - ConnectX-5 Ex: **16.21.1000** and above.
  - ConnectX-6: **20.27.0090** and above.
  - ConnectX-6 Dx: **22.27.0090** and above.
  - ConnectX-6 Lx: **26.27.0090** and above.
  - ConnectX-7: **28.33.2028** and above.
  - BlueField: **18.25.1010** and above.
  - BlueField-2: **24.28.1002** and above.
  - BlueField-3: **32.36.3126** and above.

The firmware, the libraries libibverbs, libmlx5, and mlnx-ofed-kernel modules
are packaged in `NVIDIA MLNX_OFED
<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_.
After downloading, it can be installed with this command::

   ./mlnxofedinstall --dpdk

`NVIDIA MLNX_EN
<https://network.nvidia.com/products/ethernet-drivers/linux/mlnx_en/>`_
is a smaller package including what is needed for DPDK.
After downloading, it can be installed with this command::

   ./install --dpdk

After installing, the firmware version can be checked::

   ibv_devinfo
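
The firmware check can be scripted against the minimum versions listed above.
The sketch below is an illustration, not an official tool.
It assumes that the ``ibv_devinfo`` output contains ``hca_id:`` and ``fw_ver:`` fields,
and that ``sort -V`` is available;
``fw_check`` is a hypothetical helper name.

.. code-block:: shell

   # fw_check MIN: read ibv_devinfo output on stdin and report, per device,
   # whether the firmware version is at least MIN.
   fw_check() {
       min=$1
       # Remember the device name, then print it next to its firmware version.
       awk '$1 == "hca_id:" { dev = $2 }
            $1 == "fw_ver:" { print dev, $2 }' |
       while read -r dev ver; do
           # The version is sufficient when MIN sorts first (or is equal).
           lowest=$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n 1)
           if [ "$lowest" = "$min" ]; then
               echo "$dev $ver ok"
           else
               echo "$dev $ver too-old"
           fi
       done
   }

For a ConnectX-5 adapter, the check could be run as ``ibv_devinfo | fw_check 16.21.1000``.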

.. note::

   Several versions of NVIDIA MLNX_OFED/EN are available. Installing the version
   this DPDK release was developed and tested against is strongly recommended.
   Please check the "Tested Platforms" section in the :doc:`../../rel_notes/index`.


.. _mlx5_windows_prerequisites:

Windows Prerequisites
~~~~~~~~~~~~~~~~~~~~~

The mlx5 PMDs rely on external libraries and kernel drivers
for resource allocation and initialization.


DevX SDK Installation
^^^^^^^^^^^^^^^^^^^^^

The DevX SDK must be installed on the machine building the Windows PMD.
Additional information can be found at
`How to Integrate Windows DevX in Your Development Environment
<https://docs.nvidia.com/networking/display/winof2v260/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`_.
The minimal supported WinOF2 version is 2.60.


Compilation Options
-------------------

Compilation on Linux
~~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

``shared`` (default)
   The PMD depends on some .so files.

``dlopen``
   Split the dependency glue into a separate library,
   loaded when needed by ``dlopen()`` (see ``MLX5_GLUE_PATH``).
   It makes dependencies on libibverbs and libmlx5 optional,
   and has no performance impact.

``static``
   Embed the static flavor of the dependencies libibverbs and libmlx5
   in the PMD shared library or the executable static binary.
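
As an illustration, the build option can be passed to meson as ``-Dibverbs_link=<mode>``.
The small POSIX shell sketch below validates the mode against the three values listed above
before printing the option; ``ibverbs_link_option`` is a hypothetical helper name.

.. code-block:: shell

   # ibverbs_link_option MODE: print the meson option for a valid link mode.
   ibverbs_link_option() {
       case "$1" in
           shared|dlopen|static) echo "-Dibverbs_link=$1" ;;
           *) echo "invalid ibverbs_link mode: $1" >&2; return 1 ;;
       esac
   }

From a DPDK source tree, it could be used as
``meson setup build "$(ibverbs_link_option dlopen)"``.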


Compilation on Windows
~~~~~~~~~~~~~~~~~~~~~~

The DevX SDK location must be set through CFLAGS/LDFLAGS,
either::

   meson.exe setup "-Dc_args=-I\"%DEVX_INC_PATH%\"" "-Dc_link_args=-L\"%DEVX_LIB_PATH%\"" ...

or::

   set CFLAGS=-I"%DEVX_INC_PATH%" && set LDFLAGS=-L"%DEVX_LIB_PATH%" && meson.exe setup ...


.. _mlx5_common_env:

Environment Configuration
-------------------------

Linux Environment
~~~~~~~~~~~~~~~~~

The kernel network interfaces are brought up during initialization.
Forcing them down prevents packet reception.

The ethtool operations on the kernel interfaces may also affect the PMD.

Some runtime behaviours may be configured through environment variables.

``MLX5_GLUE_PATH``
   If built with ``ibverbs_link=dlopen``,
   list of directories in which to search for the rdma-core "glue" plug-in,
   separated by colons or semi-colons.

``MLX5_SHUT_UP_BF``
   If Verbs is used (DevX disabled),
   HW queue doorbell register mapping.
   The value 0 means a regular memory mapping
   (usually with the write-combining attribute),
   while 1 means a non-cached IO mapping.

   With regular memory mapping, the register is flushed to HW
   usually when the write-combining buffer becomes full,
   but it depends on CPU design.


Port Link with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Port links must be set to Ethernet::

   mlxconfig -d <mst device> query | grep LINK_TYPE
   LINK_TYPE_P1                        ETH(2)
   LINK_TYPE_P2                        ETH(2)

   mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3

Link type values are:

* ``1`` InfiniBand
* ``2`` Ethernet
* ``3`` VPI (auto-sense)

If the link type was changed, the firmware must be reset as well::

   mlxfwreset -d <mst device> reset


.. _mlx5_vf:

SR-IOV Virtual Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SR-IOV must be enabled on the NIC.
It can be checked with the following command::

   mlxconfig -d <mst device> query | grep SRIOV_EN
   SRIOV_EN                            True(1)

If needed, configure SR-IOV::

   mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
   mlxfwreset -d <mst device> reset

After making the change, restart the driver::

   /etc/init.d/openibd restart

or::

   service openibd restart

Then the virtual functions can be instantiated::

   echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
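
The last step can be wrapped in a small function that checks the requested count first.
This is a sketch in POSIX shell, not part of MLNX_OFED or DPDK.
It assumes the standard PCI sysfs attributes ``sriov_totalvfs`` and ``sriov_numvfs``
in the device directory; ``set_num_vfs`` is a hypothetical helper name.

.. code-block:: shell

   # set_num_vfs DEVICE_DIR COUNT: instantiate COUNT virtual functions.
   set_num_vfs() {
       dev=$1
       want=$2
       # sriov_totalvfs holds the maximum number of VFs the device exposes.
       max=$(cat "$dev/sriov_totalvfs") || return 1
       if [ "$want" -gt "$max" ]; then
           echo "requested $want VFs, device supports at most $max" >&2
           return 1
       fi
       # A nonzero VF count cannot be changed directly, so reset it to 0 first.
       echo 0 > "$dev/sriov_numvfs"
       echo "$want" > "$dev/sriov_numvfs"
   }

For the device used in this section, the call would be
``set_num_vfs /sys/class/infiniband/mlx5_0/device 4``, run with sufficient privileges.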


.. _mlx5_sub_function:

Sub-Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A Sub-Function (SF) is a portion of the PCI device
with its own dedicated queues.
An SF shares PCI-level resources with other SFs and/or with its parent PCI function.

#. Requirement::

      MLNX_OFED version >= 5.4-0.3.3.0

#. Configure SF feature::

      # Run mlxconfig on both PFs on host and ECPFs on BlueField.
      mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12

#. Enable switchdev mode::

      mlxdevm dev eswitch set pci/<DBDF> mode switchdev

#. Add SF port::

      mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>

      Get SFID from output: pci/<DBDF>/<SFID>

#. Modify MAC address::

      mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>

#. Activate SF port::

      mlxdevm port function set pci/<DBDF>/<SFID> state active

#. Devargs to probe SF device::

      auxiliary:mlx5_core.sf.<num>,class=eth:regex
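
The hand-off between the "Add SF port" step and the two following steps can be scripted.
The sketch below is in POSIX shell and rests on an assumption:
the ``mlxdevm port add`` output contains a line whose first field is
``pci/<DBDF>/<SFID>`` followed by a colon.
The helper names ``sf_port_handle`` and ``sf_activation_commands`` are hypothetical.

.. code-block:: shell

   # sf_port_handle: read "mlxdevm port add" output on stdin
   # and print the pci/<DBDF>/<SFID> handle.
   sf_port_handle() {
       # Take the first field of the first "pci/..." line, drop a trailing colon.
       awk '$1 ~ /^pci\// { sub(/:$/, "", $1); print $1; exit }'
   }

   # sf_activation_commands HANDLE MAC: print the commands from the steps above
   # that set the MAC address and activate the SF port.
   sf_activation_commands() {
       echo "mlxdevm port function set $1 hw_addr $2"
       echo "mlxdevm port function set $1 state active"
   }

The printed commands can be reviewed and then executed.
The handle is not the ``<num>`` used in the ``auxiliary:mlx5_core.sf.<num>`` devargs.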


Enable Switchdev Mode
^^^^^^^^^^^^^^^^^^^^^

Switchdev mode is an E-Switch mode that binds a representor to each VF or SF.
A representor is a DPDK port connected to a VF or SF in such a way
that, assuming there are no offload flows, each packet sent from the VF or SF
is received by the corresponding representor,
and each packet sent to a representor is received by the VF or SF.

After :ref:`configuring VF <mlx5_vf>`, the device must be unbound::

   printf "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind

Then switchdev mode is enabled::

   echo switchdev > /sys/class/net/<net device>/compat/devlink/mode

The device can be bound again at this point.


Run as Non-Root
^^^^^^^^^^^^^^^

Hugepage and resource limit setup are documented
in the :ref:`common Linux guide <Running_Without_Root_Privileges>`.
This PMD can operate without access to physical addresses,
therefore it does not require ``SYS_ADMIN`` to access ``/proc/self/pagemap``.
Note that this requirement may still come from other drivers.

The following additional capabilities must be granted to the application,
each listed with the reason it is needed:

``NET_RAW``
   For raw Ethernet queue allocation through the kernel driver.

``NET_ADMIN``
   For device configuration, like setting link status or MTU.

``SYS_RAWIO``
   For using group 1 and above (software steering) in Flow API.

They can be manually granted for a specific executable file::

   setcap cap_net_raw,cap_net_admin,cap_sys_rawio+ep <executable>

Alternatively, a service manager or a container runtime
may configure the capabilities for a process.
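
Whichever way the capabilities are granted, the result can be verified on a running process.
The sketch below is in POSIX shell and decodes the ``CapEff`` mask from ``/proc/<pid>/status``.
The bit numbers come from the Linux capability definitions
(``CAP_NET_ADMIN`` is 12, ``CAP_NET_RAW`` is 13, ``CAP_SYS_RAWIO`` is 17);
the helper names are hypothetical.

.. code-block:: shell

   # has_cap MASK_HEX BIT: succeed when capability number BIT is set in the mask.
   # MASK_HEX is the hexadecimal CapEff value without a "0x" prefix.
   has_cap() {
       [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
   }

   # missing_mlx5_caps MASK_HEX: print the capabilities listed above
   # that are absent from the mask, one per line.
   missing_mlx5_caps() {
       mask=$1
       for entry in net_admin:12 net_raw:13 sys_rawio:17; do
           name=${entry%%:*}
           bit=${entry##*:}
           has_cap "$mask" "$bit" || echo "cap_$name"
       done
   }

A possible call is
``missing_mlx5_caps "$(awk '/^CapEff:/ { print $2 }' /proc/<pid>/status)"``,
where an empty output means all three capabilities are effective.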


Windows Environment
~~~~~~~~~~~~~~~~~~~

WinOF2 version 2.60 or higher must be installed on the machine.


WinOF2 Installation
^^^^^^^^^^^^^^^^^^^

The driver can be downloaded from the following site: `WINOF2
<https://network.nvidia.com/products/adapter-software/ethernet/windows/winof-2/>`_.


DevX Enablement
^^^^^^^^^^^^^^^

DevX for Windows must be enabled in the Windows registry.
The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
Additional information can be found in the WinOF2 user manual.


.. _mlx5_firmware_config:

Firmware Configuration
~~~~~~~~~~~~~~~~~~~~~~

Firmware features can be configured as key/value pairs.

The command to set a value is::

  mlxconfig -d <device> set <key>=<value>

The command to query a value is::

  mlxconfig -d <device> query <key>

The device name for the command ``mlxconfig`` can be either the PCI address,
or the mst device name found with::

  mst status

Some firmware configurations are listed below.

- link type::

    LINK_TYPE_P1
    LINK_TYPE_P2
    value: 1=Infiniband 2=Ethernet 3=VPI(auto-sense)

- enable SR-IOV::

    SRIOV_EN=1

- the maximum number of SR-IOV virtual functions::

    NUM_OF_VFS=<max>

- enable DevX (required by Direct Rules and other features)::

    UCTX_EN=1

- aggressive CQE zipping::

    CQE_COMPRESSION=1

- L3 VXLAN and VXLAN-GPE destination UDP port::

    IP_OVER_VXLAN_EN=1
    IP_OVER_VXLAN_PORT=<udp dport>

- enable VXLAN-GPE tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0
    or
    FLEX_PARSER_PROFILE_ENABLE=2

- enable IP-in-IP tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0

- enable MPLS flow matching::

    FLEX_PARSER_PROFILE_ENABLE=1

- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::

    FLEX_PARSER_PROFILE_ENABLE=2

- enable Geneve flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0
   or
   FLEX_PARSER_PROFILE_ENABLE=1

- enable Geneve TLV option flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0

- enable GTP flow matching::

   FLEX_PARSER_PROFILE_ENABLE=3

- enable eCPRI flow matching::

   FLEX_PARSER_PROFILE_ENABLE=4
   PROG_PARSE_GRAPH=1

- enable dynamic flex parser for flex item::

   FLEX_PARSER_PROFILE_ENABLE=4
   PROG_PARSE_GRAPH=1

- enable realtime timestamp format::

   REAL_TIME_CLOCK_ENABLE=1

- allow locking hairpin RQ data buffer in device memory::

   HAIRPIN_DATA_BUFFER_LOCK=1
   MEMIC_SIZE_LIMIT=0
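
Since ``FLEX_PARSER_PROFILE_ENABLE`` holds a single value,
it can help to see which profile values cover a given set of matching features.
The sketch below, in POSIX shell, encodes only the profile values listed above.
The feature labels (``vxlan-gpe``, ``geneve-tlv`` and so on) and the helper names are hypothetical,
and the sketch does not cover other keys such as ``PROG_PARSE_GRAPH``.

.. code-block:: shell

   # flex_profiles FEATURE: print the FLEX_PARSER_PROFILE_ENABLE values
   # that enable the feature, as listed in this section.
   flex_profiles() {
       case "$1" in
           vxlan-gpe)       echo "0 2" ;;
           ip-in-ip)        echo "0" ;;
           mpls)            echo "1" ;;
           icmp)            echo "2" ;;
           geneve)          echo "0 1" ;;
           geneve-tlv)      echo "0" ;;
           gtp)             echo "3" ;;
           ecpri|flex-item) echo "4" ;;
           *) return 1 ;;
       esac
   }

   # common_flex_profiles FEATURE...: print each profile value
   # that enables all the listed features at once.
   common_flex_profiles() {
       for p in 0 1 2 3 4; do
           ok=1
           for f in "$@"; do
               case " $(flex_profiles "$f") " in
                   *" $p "*) ;;
                   *) ok=0 ;;
               esac
           done
           if [ "$ok" -eq 1 ]; then echo "$p"; fi
       done
       return 0
   }

For example, ``common_flex_profiles vxlan-gpe geneve`` prints ``0``,
while ``common_flex_profiles mpls gtp`` prints nothing,
because no single profile value in the list enables both.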


.. _mlx5_common_driver_options:

Device Arguments
----------------

The driver can be configured per device.
A single argument list can be used for a device managed by multiple PMDs.
The parameters must be passed through the EAL option ``-a``,
as in the examples below:

- PCI device::

     -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0

- Auxiliary SF::

     -a auxiliary:mlx5_core.sf.2,class=compress,mr_ext_memseg_en=0

Each device class PMD has its own list of specific arguments,
and below are the arguments supported by the common mlx5 layer.

- ``class`` parameter [string]

  Select the classes of the drivers that should probe the device.
  See :ref:`mlx5_classes` for more explanation and details.

  The default value is ``eth``.

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in the MR (Memory Region) lookup table on the
  datapath is minimized, which benefits performance. On the other hand, it worsens
  memory utilization because registered memory is pinned by the kernel driver.
  Even if a page in the extended chunk is freed, it does not become reusable
  until the entire memory is freed.

  Enabled by default.

- ``mr_mempool_reg_en`` parameter [int]

  A nonzero value enables implicit registration of DMA memory of all mempools
  except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
  for mempools populated with non-contiguous objects or those without IOVA.
  The effect is that when a packet from a mempool is transmitted,
  its memory is already registered for DMA in the PMD and no registration
  will happen on the data path. The tradeoff is extra work on the creation
  of each mempool and increased HW resource use if some mempools
  are not used with MLX5 devices.

  Enabled by default.

- ``sys_mem_en`` parameter [int]

  A nonzero value makes the PMD memory management allocate memory
  from the system by default, when the rte memory flag is not explicitly set.

  By default, the PMD will set this value to 0.

- ``sq_db_nc`` parameter [int]

  The rdma-core library can map the doorbell register in two ways,
  depending on the environment variable ``MLX5_SHUT_UP_BF``:

  - As regular cached memory (usually with the write-combining attribute),
    if the variable is either missing or set to zero.
  - As non-cached memory, if the variable is present and set to a value other than "0".

  The same doorbell mapping approach is implemented directly by the PMD
  in UAR generation for queues created with DevX.

  The type of mapping may slightly affect the send queue performance.
  The optimal choice strongly depends on the host architecture
  and should be determined experimentally.

  If ``sq_db_nc`` is set to zero, the doorbell is forced to be mapped to
  regular memory (with write combining), and the PMD performs an extra write
  memory barrier after writing to the doorbell. This might increase the CPU
  clocks needed per packet sent, but latency might be improved.

  If ``sq_db_nc`` is set to one, the doorbell is forced to be mapped to
  non-cached memory, and the PMD does not perform the extra write memory barrier
  after writing to the doorbell. On some architectures this might improve performance.

  If ``sq_db_nc`` is set to two, the doorbell is forced to be mapped to
  regular memory, and the PMD uses heuristics to decide whether a write memory
  barrier should be performed. For bursts whose size is a multiple of the
  recommended one (64 packets), the PMD assumes that the next burst is coming
  and skips the extra memory barrier (it is expected to be issued in the next burst,
  at least after descriptor writing). This might increase latency (on some hosts
  until the next packets are transmitted) and should be used with care.
  The PMD uses heuristics only for Tx queues; for other send queues the doorbell
  is forced to be mapped to regular memory, the same as when ``sq_db_nc`` is set to 0.

  If ``sq_db_nc`` is omitted, the value of the environment variable
  ``MLX5_SHUT_UP_BF`` is used, if it is set. If ``MLX5_SHUT_UP_BF`` is not set, the
  default ``sq_db_nc`` value is zero for ARM64 hosts and one for others.

- ``cmd_fd`` parameter [int]

  File descriptor of ``ibv_context`` created outside the PMD.
  The PMD will use this FD to import the remote context. The ``cmd_fd`` is obtained
  from the ``ibv_context->cmd_fd`` member, which must be dup'd before being passed.
  This parameter is valid only if the ``pd_handle`` parameter is specified.

  By default, the PMD will create a new ``ibv_context``.

  .. note::

     When the FD comes from another process, it is the user's responsibility to
     share the FD between the processes (e.g. by SCM_RIGHTS).

- ``pd_handle`` parameter [int]

  Protection domain handle of ``ibv_pd`` created outside the PMD.
  The PMD will use this handle to import the remote PD. The ``pd_handle`` can be
  obtained from the original PD by reading its ``ibv_pd->handle`` member value.
  This parameter is valid only if the ``cmd_fd`` parameter is specified,
  and its value must be a valid kernel handle for a PD object
  in the context represented by the given ``cmd_fd``.

  By default, the PMD will allocate a new PD.

  .. note::

     The ``ibv_pd->handle`` member is different from the ``mlx5dv_pd->pdn`` member.
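
The precedence rules in the ``sq_db_nc`` description above can be summarized as a small function.
This is a sketch in POSIX shell of the documented behavior, not the PMD code.
``resolve_sq_db_nc`` is a hypothetical name,
and as a simplification an empty ``MLX5_SHUT_UP_BF`` value is treated like an unset variable.

.. code-block:: shell

   # resolve_sq_db_nc DEVARG ENVVALUE ARCH: print the effective sq_db_nc value.
   #   DEVARG:   value of the sq_db_nc device argument, empty when omitted
   #   ENVVALUE: value of MLX5_SHUT_UP_BF, empty when not set
   #   ARCH:     machine name as printed by "uname -m"
   resolve_sq_db_nc() {
       if [ -n "$1" ]; then
           # An explicit device argument takes precedence.
           echo "$1"
       elif [ -n "$2" ]; then
           # "0" selects regular memory, any other value non-cached memory.
           if [ "$2" = 0 ]; then echo 0; else echo 1; fi
       else
           # Documented defaults: zero on ARM64 hosts, one elsewhere.
           case "$3" in
               aarch64|arm64) echo 0 ;;
               *) echo 1 ;;
           esac
       fi
   }

For the current host and environment, with no device argument given, the call would be
``resolve_sq_db_nc "" "$MLX5_SHUT_UP_BF" "$(uname -m)"``.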
693