..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2022 6WIND S.A.
    Copyright (c) 2022 NVIDIA Corporation & Affiliates

.. include:: <isonum.txt>

NVIDIA MLX5 Common Driver
=========================

.. note::

   NVIDIA acquired Mellanox Technologies in 2020.
   The DPDK documentation and code might still include instances
   of or references to Mellanox trademarks (like BlueField and ConnectX)
   that are now NVIDIA trademarks.

The mlx5 common driver library (**librte_common_mlx5**) provides support for
**NVIDIA ConnectX-4**, **NVIDIA ConnectX-4 Lx**, **NVIDIA ConnectX-5**,
**NVIDIA ConnectX-6**, **NVIDIA ConnectX-6 Dx**, **NVIDIA ConnectX-6 Lx**,
**NVIDIA ConnectX-7**, **NVIDIA BlueField**, **NVIDIA BlueField-2** and
**NVIDIA BlueField-3** families of 10/25/40/50/100/200 Gb/s adapters.

Information and documentation for these adapters can be found on the
`NVIDIA website <https://www.nvidia.com/en-us/networking/>`_.
Help is also provided by the
`NVIDIA Networking forum <https://forums.developer.nvidia.com/c/infrastructure/369/>`_.
In addition, there is a `web section dedicated to DPDK
<https://developer.nvidia.com/networking/dpdk>`_.


Design
------

For security reasons and to enhance robustness,
this driver only handles virtual memory addresses.
The way resource allocations are handled by the kernel,
combined with hardware specifications that allow handling virtual memory addresses directly,
ensures that DPDK applications cannot access random physical memory
(or memory that does not belong to the current process).

There are different levels of objects and bypassing abilities
which are used to get the best performance:

- **Verbs** is a complete high-level generic API
- **Direct Verbs** is a device-specific API
- **DevX** allows accessing firmware objects
- **Direct Rules** manages flow steering at the low-level hardware layer

On Linux, the above interfaces are provided by linking with `libibverbs` and `libmlx5`.
See :ref:`mlx5_linux_prerequisites` for installation.

On Windows, DevX is the only requirement from the above list.
See :ref:`mlx5_windows_prerequisites` for DevX SDK package installation.


.. _mlx5_classes:

Classes
-------

One mlx5 device can be probed by a number of different PMDs.
To select a specific PMD, its name should be specified as a device parameter
(e.g. ``0000:08:00.1,class=eth``).

In order to allow probing by multiple PMDs,
several classes may be listed separated by a colon.
For example: ``class=crypto:regex`` will probe both Crypto and RegEx PMDs.


Supported Classes
~~~~~~~~~~~~~~~~~

- ``class=compress`` for :doc:`../../compressdevs/mlx5`.
- ``class=crypto`` for :doc:`../../cryptodevs/mlx5`.
- ``class=eth`` for :doc:`../../nics/mlx5`.
- ``class=regex`` for :doc:`../../regexdevs/mlx5`.
- ``class=vdpa`` for :doc:`../../vdpadevs/mlx5`.

By default, the mlx5 device will be probed by the ``eth`` PMD.


Limitations
~~~~~~~~~~~

- ``eth`` and ``vdpa`` PMDs cannot be probed at the same time.
  All other combinations are possible.

- On Windows, only ``eth`` and ``crypto`` are supported.

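As a minimal sketch of the class rules above, a shell helper can assemble
the ``class=`` argument and reject the one forbidden combination.
The function name ``mlx5_class_arg`` is hypothetical and not part of DPDK;
it only builds a string and does not probe any device::

   # Hypothetical helper: join class names into a "class=" device argument.
   # Rejects eth together with vdpa, as described in the limitations.
   mlx5_class_arg() {
       has_eth=0
       has_vdpa=0
       list=""
       for c in "$@"; do
           case "$c" in
               eth) has_eth=1 ;;
               vdpa) has_vdpa=1 ;;
               compress|crypto|regex) ;;
               *) echo "unknown class: $c" >&2; return 1 ;;
           esac
           # Append with a colon separator, except for the first entry.
           list="${list:+$list:}$c"
       done
       if [ "$has_eth" -eq 1 ] && [ "$has_vdpa" -eq 1 ]; then
           echo "eth and vdpa cannot be probed at the same time" >&2
           return 1
       fi
       # With no class given, fall back to the documented default.
       printf 'class=%s\n' "${list:-eth}"
   }

   mlx5_class_arg crypto regex    # prints: class=crypto:regex

The resulting string is appended to the device identifier after a comma,
for example ``0000:08:00.1,class=crypto:regex``.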

.. _mlx5_common_compilation:

Compilation Prerequisites
-------------------------

.. _mlx5_linux_prerequisites:

Linux Prerequisites
~~~~~~~~~~~~~~~~~~~

This driver relies on external libraries and kernel drivers for resource
allocation and initialization.
The following dependencies are not part of DPDK and must be installed separately:

- **libibverbs**

  User space Verbs framework used by ``librte_common_mlx5``.
  This library provides a generic interface between the kernel
  and low-level user space drivers such as ``libmlx5``.

  It allows slow and privileged operations (context initialization,
  hardware resource allocation) to be managed by the kernel
  and fast operations to never leave user space.

- **libmlx5**

  Low-level user space driver library for NVIDIA devices;
  it is automatically loaded by ``libibverbs``.

  This library basically implements send/receive calls to the hardware queues.

- **Kernel modules**

  They provide the kernel-side Verbs API and low level device drivers
  that manage actual hardware initialization
  and resource sharing with user-space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - ``mlx5_core``: hardware driver managing NVIDIA devices
    and related Ethernet kernel network devices.
  - ``mlx5_ib``: InfiniBand device driver.
  - ``ib_uverbs``: user space driver for Verbs (entry point for ``libibverbs``).

- **Firmware update**

  NVIDIA MLNX_OFED/EN releases include firmware updates.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

Libraries and kernel modules can be provided either by the Linux distribution,
or by installing NVIDIA MLNX_OFED/EN which provides compatibility with older kernels.


Upstream Dependencies
^^^^^^^^^^^^^^^^^^^^^

The mlx5 kernel modules are part of upstream Linux.
The minimal supported kernel version is 4.14.
For 32-bit, version 4.14.41 or above is required.

The libraries `libibverbs` and `libmlx5` are part of ``rdma-core``.
It is packaged by most Linux distributions.
The minimal supported rdma-core version is 16.
For 32-bit, version 18 or above is required.

The rdma-core sources can be downloaded at
https://github.com/linux-rdma/rdma-core

It is possible to build rdma-core as static libraries starting with version 21::

    cd build
    CFLAGS=-fPIC cmake -DIN_PLACE=1 -DENABLE_STATIC=1 -GNinja ..
    ninja


NVIDIA MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^

The kernel modules and libraries are packaged with other tools
in NVIDIA MLNX_OFED or NVIDIA MLNX_EN.
The minimal supported versions are:

- NVIDIA MLNX_OFED version: **4.5** and above.
- NVIDIA MLNX_EN version: **4.5** and above.
- Firmware version:

  - ConnectX-4: **12.21.1000** and above.
  - ConnectX-4 Lx: **14.21.1000** and above.
  - ConnectX-5: **16.21.1000** and above.
  - ConnectX-5 Ex: **16.21.1000** and above.
  - ConnectX-6: **20.27.0090** and above.
  - ConnectX-6 Dx: **22.27.0090** and above.
  - ConnectX-6 Lx: **26.27.0090** and above.
  - ConnectX-7: **28.33.2028** and above.
  - BlueField: **18.25.1010** and above.
  - BlueField-2: **24.28.1002** and above.
  - BlueField-3: **32.36.3126** and above.

The firmware, the libraries libibverbs, libmlx5, and mlnx-ofed-kernel modules
are packaged in `NVIDIA MLNX_OFED
<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_.
After downloading, it can be installed with this command::

   ./mlnxofedinstall --dpdk

`NVIDIA MLNX_EN
<https://network.nvidia.com/products/ethernet-drivers/linux/mlnx_en/>`_
is a smaller package including what is needed for DPDK.
After downloading, it can be installed with this command::

   ./install --dpdk

After installing, the firmware version can be checked::

   ibv_devinfo

.. note::

   Several versions of NVIDIA MLNX_OFED/EN are available. Installing the version
   this DPDK release was developed and tested against is strongly recommended.
   Please check the "Tested Platforms" section in the :doc:`../../rel_notes/index`.


.. _mlx5_windows_prerequisites:

Windows Prerequisites
~~~~~~~~~~~~~~~~~~~~~

The mlx5 PMDs rely on external libraries and kernel drivers
for resource allocation and initialization.


DevX SDK Installation
^^^^^^^^^^^^^^^^^^^^^

The DevX SDK must be installed on the machine building the Windows PMD.
Additional information can be found at
`How to Integrate Windows DevX in Your Development Environment
<https://docs.nvidia.com/networking/display/winof2v260/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`_.
The minimal supported WinOF2 version is 2.60.


Compilation Options
-------------------

Compilation on Linux
~~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

``shared`` (default)
   The PMD depends on some .so files.

``dlopen``
   Split the dependencies glue in a separate library
   loaded when needed by dlopen (see ``MLX5_GLUE_PATH``).
   It makes dependencies on libibverbs and libmlx5 optional,
   and has no performance impact.

``static``
   Embed static flavor of the dependencies libibverbs and libmlx5
   in the PMD shared library or the executable static binary.

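The ``ibverbs_link`` option is passed to meson like any other DPDK build option.
The sketch below validates the requested mode and prints the configure command
rather than running it. The function name ``mlx5_meson_cmd`` is hypothetical,
and ``build`` is only an assumed build directory name::

   # Hypothetical wrapper: check the linking mode against the three
   # documented values, then print the meson command line.
   mlx5_meson_cmd() {
       case "$1" in
           shared|dlopen|static)
               printf 'meson setup build -Dibverbs_link=%s\n' "$1" ;;
           *)
               echo "unsupported ibverbs_link value: $1" >&2
               return 1 ;;
       esac
   }

   mlx5_meson_cmd dlopen    # prints: meson setup build -Dibverbs_link=dlopen

Running the printed command from the top of the DPDK source tree
configures the build with the selected linking mode.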

Compilation on Windows
~~~~~~~~~~~~~~~~~~~~~~

The DevX SDK location must be set through two environment variables:

``DEVX_LIB_PATH``
   path to the DevX lib file.

``DEVX_INC_PATH``
   path to the DevX header files.


.. _mlx5_common_env:

Environment Configuration
-------------------------

Linux Environment
~~~~~~~~~~~~~~~~~

The kernel network interfaces are brought up during initialization.
Forcing them down prevents packet reception.

The ethtool operations on the kernel interfaces may also affect the PMD.

Some runtime behaviours may be configured through environment variables.

``MLX5_GLUE_PATH``
   If built with ``ibverbs_link=dlopen``,
   list of directories in which to search for the rdma-core "glue" plug-in,
   separated by colons or semi-colons.

``MLX5_SHUT_UP_BF``
   If Verbs is used (DevX disabled),
   HW queue doorbell register mapping.
   The value 0 means non-cached IO mapping,
   while 1 is a regular memory mapping.

   With regular memory mapping, the register is flushed to HW
   usually when the write-combining buffer becomes full,
   but it depends on CPU design.

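To illustrate the ``MLX5_GLUE_PATH`` format described above,
the following sketch splits a value on both accepted separators
and prints one directory per line.
It is not the driver's own parsing code; the function name ``mlx5_glue_dirs``
and the directories used are made up for the example::

   # Illustrative only: split a MLX5_GLUE_PATH-style value on ":" and ";".
   mlx5_glue_dirs() {
       printf '%s\n' "$1" | tr ':;' '\n\n'
   }

   # Quote the value so that the shell does not treat ";" as a command separator.
   MLX5_GLUE_PATH="/opt/dpdk/glue:/usr/local/lib/glue;/tmp/glue"
   mlx5_glue_dirs "$MLX5_GLUE_PATH"

Exporting the variable before starting the application makes it visible
to a PMD built with ``ibverbs_link=dlopen``.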

Port Link with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Port links must be set to Ethernet::

   mlxconfig -d <mst device> query | grep LINK_TYPE
   LINK_TYPE_P1                        ETH(2)
   LINK_TYPE_P2                        ETH(2)

   mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3

Link type values are:

* ``1`` InfiniBand
* ``2`` Ethernet
* ``3`` VPI (auto-sense)

If link type was changed, firmware must be reset as well::

   mlxfwreset -d <mst device> reset


.. _mlx5_vf:

SR-IOV Virtual Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SR-IOV must be enabled on the NIC.
It can be checked with the following command::

   mlxconfig -d <mst device> query | grep SRIOV_EN
   SRIOV_EN                            True(1)

If needed, configure SR-IOV::

   mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
   mlxfwreset -d <mst device> reset

After making the change, restart the driver::

   /etc/init.d/openibd restart

or::

   service openibd restart

Then the virtual functions can be instantiated::

   echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs


.. _mlx5_sub_function:

Sub-Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A Sub-Function (SF) is a portion of the PCI device
with its own dedicated queues.
An SF shares PCI-level resources with other SFs and/or with its parent PCI function.

0. Requirement::

      MLNX_OFED version >= 5.4-0.3.3.0

1. Configure SF feature::

      # Run mlxconfig on both PFs on host and ECPFs on BlueField.
      mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12

2. Enable switchdev mode::

      mlxdevm dev eswitch set pci/<DBDF> mode switchdev

3. Add SF port::

      mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>

      Get SFID from output: pci/<DBDF>/<SFID>

4. Modify MAC address::

      mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>

5. Activate SF port::

      mlxdevm port function set pci/<DBDF>/<SFID> state active

6. Devargs to probe SF device::

      auxiliary:mlx5_core.sf.<num>,class=eth:regex


Enable Switchdev Mode
^^^^^^^^^^^^^^^^^^^^^

Switchdev mode is an E-Switch mode that binds a representor to a VF or SF.
A representor is a port in DPDK that is connected to a VF or SF in such a way
that, assuming there are no offload flows, each packet sent from the VF or SF
is received by the corresponding representor,
while each packet sent to a representor is received by the VF or SF.

After :ref:`configuring VF <mlx5_vf>`, the device must be unbound::

   printf "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind

Then switchdev mode is enabled::

   echo switchdev > /sys/class/net/<net device>/compat/devlink/mode

The device can be bound again at this point.


Run as Non-Root
^^^^^^^^^^^^^^^

Hugepage and resource limit setup are documented
in the :ref:`common Linux guide <Running_Without_Root_Privileges>`.
This PMD can operate without access to physical addresses,
therefore it does not require ``SYS_ADMIN`` to access ``/proc/self/pagemap``.
Note that this requirement may still come from other drivers.

Below are the additional capabilities that must be granted to the application,
with the reason each one is needed:

``NET_RAW``
   For raw Ethernet queue allocation through the kernel driver.

``NET_ADMIN``
   For device configuration, like setting link status or MTU.

``SYS_RAWIO``
   For using group 1 and above (software steering) in Flow API.

They can be manually granted for a specific executable file::

   setcap cap_net_raw,cap_net_admin,cap_sys_rawio+ep <executable>

Alternatively, a service manager or a container runtime
may configure the capabilities for a process.


Windows Environment
~~~~~~~~~~~~~~~~~~~

WinOF2 version 2.60 or higher must be installed on the machine.


WinOF2 Installation
^^^^^^^^^^^^^^^^^^^

The driver can be downloaded from the following site: `WINOF2
<https://network.nvidia.com/products/adapter-software/ethernet/windows/winof-2/>`_.


DevX Enablement
^^^^^^^^^^^^^^^

DevX for Windows must be enabled in the Windows registry.
The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
Additional information can be found in the WinOF2 user manual.


.. _mlx5_firmware_config:

Firmware Configuration
~~~~~~~~~~~~~~~~~~~~~~

Firmware features can be configured as key/value pairs.

The command to set a value is::

  mlxconfig -d <device> set <key>=<value>

The command to query a value is::

  mlxconfig -d <device> query <key>

The device name for the command ``mlxconfig`` can be either the PCI address,
or the mst device name found with::

  mst status

Some firmware configurations are listed below.

- link type::

    LINK_TYPE_P1
    LINK_TYPE_P2
    value: 1=Infiniband 2=Ethernet 3=VPI(auto-sense)

- enable SR-IOV::

    SRIOV_EN=1

- the maximum number of SR-IOV virtual functions::

    NUM_OF_VFS=<max>

- enable DevX (required by Direct Rules and other features)::

    UCTX_EN=1

- aggressive CQE zipping::

    CQE_COMPRESSION=1

- L3 VXLAN and VXLAN-GPE destination UDP port::

    IP_OVER_VXLAN_EN=1
    IP_OVER_VXLAN_PORT=<udp dport>

- enable VXLAN-GPE tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0
    or
    FLEX_PARSER_PROFILE_ENABLE=2

- enable IP-in-IP tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0

- enable MPLS flow matching::

    FLEX_PARSER_PROFILE_ENABLE=1

- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::

    FLEX_PARSER_PROFILE_ENABLE=2

- enable Geneve flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0
   or
   FLEX_PARSER_PROFILE_ENABLE=1

- enable Geneve TLV option flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0

- enable GTP flow matching::

   FLEX_PARSER_PROFILE_ENABLE=3

- enable eCPRI flow matching::

   FLEX_PARSER_PROFILE_ENABLE=4
   PROG_PARSE_GRAPH=1

- enable dynamic flex parser for flex item::

   FLEX_PARSER_PROFILE_ENABLE=4
   PROG_PARSE_GRAPH=1

- enable realtime timestamp format::

   REAL_TIME_CLOCK_ENABLE=1

- allow locking hairpin RQ data buffer in device memory::

   HAIRPIN_DATA_BUFFER_LOCK=1
   MEMIC_SIZE_LIMIT=0

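Entries that list several keys can be applied with a single ``set`` command,
in the same way as the SR-IOV example that sets ``SRIOV_EN`` and ``NUM_OF_VFS`` together.
The sketch below only prints the command it would run.
The function name ``mlx5_fw_set_cmd`` is hypothetical,
and the PCI address is a placeholder for a real device::

   # Hypothetical helper: print (not run) one mlxconfig "set" command
   # that applies several key=value pairs to a device.
   mlx5_fw_set_cmd() {
       if [ "$#" -lt 2 ]; then
           echo "usage: mlx5_fw_set_cmd DEVICE KEY=VALUE..." >&2
           return 1
       fi
       dev=$1
       shift
       printf 'mlxconfig -d %s set' "$dev"
       # printf repeats the format once per remaining argument.
       printf ' %s' "$@"
       printf '\n'
   }

   # The two settings listed for eCPRI flow matching.
   mlx5_fw_set_cmd 0000:03:00.0 FLEX_PARSER_PROFILE_ENABLE=4 PROG_PARSE_GRAPH=1

The values currently in effect can be checked afterwards
with the ``mlxconfig`` query command shown at the start of this section.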

.. _mlx5_common_driver_options:

Device Arguments
----------------

The driver can be configured per device.
A single argument list can be used for a device managed by multiple PMDs.
The parameters must be passed through the EAL option ``-a``,
as in the examples below:

- PCI device::

     -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0

- Auxiliary SF::

     -a auxiliary:mlx5_core.sf.2,class=compress,mr_ext_memseg_en=0

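As a sketch of how such an option is assembled,
the helper below joins a device identifier and its key/value arguments with commas.
The function name ``mlx5_allow_opt`` is hypothetical; it only formats a string
and does not check that the keys are valid device arguments::

   # Hypothetical helper: build an EAL "-a" option from a device identifier
   # followed by key=value device arguments.
   mlx5_allow_opt() {
       if [ "$#" -lt 1 ]; then
           echo "usage: mlx5_allow_opt DEVICE [KEY=VALUE...]" >&2
           return 1
       fi
       devargs=$1
       shift
       for kv in "$@"; do
           devargs="$devargs,$kv"
       done
       printf '%s\n' "-a $devargs"
   }

   mlx5_allow_opt 0000:03:00.2 class=eth:regex mr_mempool_reg_en=0
   # prints: -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0

The printed option can then be placed among the EAL arguments of a DPDK application.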
Each device class PMD has its own list of specific arguments,
and below are the arguments supported by the common mlx5 layer.

- ``class`` parameter [string]

  Select the classes of the drivers that should probe the device.
  See :ref:`mlx5_classes` for more explanation and details.

  The default value is ``eth``.

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in MR (Memory Region) lookup table on datapath
  is minimized and it benefits performance. On the other hand, it worsens memory
  utilization because registered memory is pinned by kernel driver. Even if a
  page in the extended chunk is freed, it does not become reusable until the
  entire memory is freed.

  Enabled by default.

- ``mr_mempool_reg_en`` parameter [int]

  A nonzero value enables implicit registration of DMA memory of all mempools
  except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
  for mempools populated with non-contiguous objects or those without IOVA.
  The effect is that when a packet from a mempool is transmitted,
  its memory is already registered for DMA in the PMD and no registration
  will happen on the data path. The tradeoff is extra work on the creation
  of each mempool and increased HW resource use if some mempools
  are not used with MLX5 devices.

  Enabled by default.

- ``sys_mem_en`` parameter [int]

  A nonzero value makes the PMD memory management allocate memory
  from the system by default, without an explicit rte memory flag.

  By default, the PMD will set this value to 0.

- ``sq_db_nc`` parameter [int]

  The rdma-core library can map the doorbell register in two ways,
  depending on the environment variable "MLX5_SHUT_UP_BF":

  - As regular cached memory (usually with write combining attribute),
    if the variable is either missing or set to zero.
  - As non-cached memory, if the variable is present and set to a value other than "0".

  The same doorbell mapping approach is implemented directly by the PMD
  in UAR generation for queues created with DevX.

  The type of mapping may slightly affect the send queue performance;
  the optimal choice depends strongly on the host architecture
  and should be determined empirically.

  If ``sq_db_nc`` is set to zero, the doorbell is forced to be mapped to
  regular memory (with write combining) and the PMD performs an extra write
  memory barrier after writing to the doorbell. This might increase the CPU
  clocks needed per packet sent, but latency might be improved.

  If ``sq_db_nc`` is set to one, the doorbell is forced to be mapped to
  non-cached memory and the PMD does not perform the extra write memory barrier
  after writing to the doorbell. On some architectures this might improve performance.

  If ``sq_db_nc`` is set to two, the doorbell is forced to be mapped to
  regular memory and the PMD uses heuristics to decide whether a write memory
  barrier should be performed. For bursts whose size is a multiple of the
  recommended one (64 packets), the PMD assumes that another burst is coming
  and skips the extra memory barrier (it is expected to be issued in the next burst,
  at least after descriptor writing). This might increase latency (on some hosts
  until the next packets are transmitted) and should be used with care.
  The PMD uses heuristics only for Tx queues; for other send queues the doorbell
  is forced to be mapped to regular memory, the same as when ``sq_db_nc`` is set to 0.

  If ``sq_db_nc`` is omitted, the value of the environment variable
  "MLX5_SHUT_UP_BF" is used, if it is set. If there is no "MLX5_SHUT_UP_BF", the
  default ``sq_db_nc`` value is zero for ARM64 hosts and one for others.

- ``cmd_fd`` parameter [int]

  File descriptor of ``ibv_context`` created outside the PMD.
  PMD will use this FD to import remote CTX. The ``cmd_fd`` is obtained from
  the ``ibv_context->cmd_fd`` member, which must be dup'd before being passed.
  This parameter is valid only if ``pd_handle`` parameter is specified.

  By default, the PMD will create a new ``ibv_context``.

  .. note::

     When the FD comes from another process, it is the user's responsibility to
     share the FD between the processes (e.g. by SCM_RIGHTS).

- ``pd_handle`` parameter [int]

  Protection domain handle of ``ibv_pd`` created outside the PMD.
  PMD will use this handle to import remote PD. The ``pd_handle`` can be
  obtained from the original PD by reading its ``ibv_pd->handle`` member value.
  This parameter is valid only if ``cmd_fd`` parameter is specified,
  and its value must be a valid kernel handle for a PD object
  in the context represented by given ``cmd_fd``.

  By default, the PMD will allocate a new PD.

  .. note::

     The ``ibv_pd->handle`` member is different from the ``mlx5dv_pd->pdn`` member.
