1..  SPDX-License-Identifier: BSD-3-Clause
2    Copyright 2022 6WIND S.A.
3    Copyright (c) 2022 NVIDIA Corporation & Affiliates
4
5.. include:: <isonum.txt>
6
7NVIDIA MLX5 Common Driver
8=========================
9
10.. note::
11
12   NVIDIA acquired Mellanox Technologies in 2020.
13   The DPDK documentation and code might still include instances
14   of or references to Mellanox trademarks (like BlueField and ConnectX)
15   that are now NVIDIA trademarks.
16
17The mlx5 common driver library (**librte_common_mlx5**) provides support for
18**NVIDIA ConnectX-4**, **NVIDIA ConnectX-4 Lx**, **NVIDIA ConnectX-5**,
19**NVIDIA ConnectX-6**, **NVIDIA ConnectX-6 Dx**, **NVIDIA ConnectX-6 Lx**,
20**NVIDIA ConnectX-7**, **NVIDIA BlueField**, **NVIDIA BlueField-2** and
21**NVIDIA BlueField-3** families of 10/25/40/50/100/200 Gb/s adapters.
22
23Information and documentation for these adapters can be found on the
24`NVIDIA website <https://www.nvidia.com/en-us/networking/>`_.
25Help is also provided by the
26`NVIDIA Networking forum <https://forums.developer.nvidia.com/c/infrastructure/369/>`_.
27In addition, there is a `web section dedicated to DPDK
28<https://developer.nvidia.com/networking/dpdk>`_.
29
30
31Design
32------
33
For security reasons and to enhance robustness,
this driver only handles virtual memory addresses.
The way resource allocations are handled by the kernel,
combined with hardware specifications that allow handling virtual memory addresses directly,
ensures that DPDK applications cannot access random physical memory
(or memory that does not belong to the current process).
40
41There are different levels of objects and bypassing abilities
42which are used to get the best performance:
43
44- **Verbs** is a complete high-level generic API
45- **Direct Verbs** is a device-specific API
46- **DevX** allows accessing firmware objects
47- **Direct Rules** manages flow steering at the low-level hardware layer
48
On Linux, the above interfaces are provided by linking with `libibverbs` and `libmlx5`.
See :ref:`mlx5_linux_prerequisites` for installation.
51
52On Windows, DevX is the only requirement from the above list.
53See :ref:`mlx5_windows_prerequisites` for DevX SDK package installation.
54
55
56.. _mlx5_classes:
57
58Classes
59-------
60
61One mlx5 device can be probed by a number of different PMDs.
62To select a specific PMD, its name should be specified as a device parameter
63(e.g. ``0000:08:00.1,class=eth``).
64
65In order to allow probing by multiple PMDs,
66several classes may be listed separated by a colon.
67For example: ``class=crypto:regex`` will probe both Crypto and RegEx PMDs.
68
69
70Supported Classes
71~~~~~~~~~~~~~~~~~
72
73- ``class=compress`` for :doc:`../../compressdevs/mlx5`.
74- ``class=crypto`` for :doc:`../../cryptodevs/mlx5`.
75- ``class=eth`` for :doc:`../../nics/mlx5`.
76- ``class=regex`` for :doc:`../../regexdevs/mlx5`.
77- ``class=vdpa`` for :doc:`../../vdpadevs/mlx5`.
78
79By default, the mlx5 device will be probed by the ``eth`` PMD.
80
81
82Limitations
83~~~~~~~~~~~
84
85- ``eth`` and ``vdpa`` PMDs cannot be probed at the same time.
86  All other combinations are possible.
87
88- On Windows, only ``eth`` and ``crypto`` are supported.
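
As a rough sketch of how a class list could be assembled into a device argument,
the helper below joins class names with colons
and rejects the ``eth``/``vdpa`` combination described above.
The function name ``mlx5_devargs`` is hypothetical and not part of DPDK.

.. code-block:: shell

   # Hypothetical helper (not part of DPDK): build "<device>,class=<a>:<b>".
   # Usage: mlx5_devargs <device> [class...]
   mlx5_devargs() {
       dev=$1
       shift
       classes=
       has_eth=0
       has_vdpa=0
       for c in "$@"; do
           case $c in
               compress|crypto|regex) ;;
               eth) has_eth=1 ;;
               vdpa) has_vdpa=1 ;;
               *) echo "unknown class: $c" >&2; return 1 ;;
           esac
           # Append the class, inserting ':' between entries.
           classes=${classes:+$classes:}$c
       done
       if [ "$has_eth" -eq 1 ] && [ "$has_vdpa" -eq 1 ]; then
           echo "eth and vdpa cannot be probed together" >&2
           return 1
       fi
       # With no class given, the device is probed by the eth PMD by default.
       if [ -z "$classes" ]; then
           printf '%s\n' "$dev"
       else
           printf '%s,class=%s\n' "$dev" "$classes"
       fi
   }

The result could then be passed to an application, for instance
``dpdk-testpmd -a "$(mlx5_devargs 0000:08:00.1 crypto regex)" -- -i``.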
89
90
91.. _mlx5_common_compilation:
92
93Compilation Prerequisites
94-------------------------
95
96.. _mlx5_linux_prerequisites:
97
98Linux Prerequisites
99~~~~~~~~~~~~~~~~~~~
100
This driver relies on external libraries and kernel drivers for resource
allocation and initialization.
The following dependencies are not part of DPDK and must be installed separately:
104
105- **libibverbs**
106
107  User space Verbs framework used by ``librte_common_mlx5``.
108  This library provides a generic interface between the kernel
109  and low-level user space drivers such as ``libmlx5``.
110
  It allows slow and privileged operations (context initialization,
  hardware resource allocation) to be managed by the kernel
  and fast operations to never leave user space.
114
115- **libmlx5**
116
  Low-level user space driver library for NVIDIA devices.
  It is automatically loaded by ``libibverbs``.
119
120  This library basically implements send/receive calls to the hardware queues.
121
122- **Kernel modules**
123
  They provide the kernel-side Verbs API and low-level device drivers
  that manage actual hardware initialization
  and resource sharing with user-space processes.
127
128  Unlike most other PMDs, these modules must remain loaded and bound to
129  their devices:
130
131  - ``mlx5_core``: hardware driver managing NVIDIA devices
132    and related Ethernet kernel network devices.
133  - ``mlx5_ib``: InfiniBand device driver.
134  - ``ib_uverbs``: user space driver for Verbs (entry point for ``libibverbs``).
135
136- **Firmware**
137
138  Minimal supported firmware version:
139
140  - ConnectX-4: **12.21.1000** and above.
141  - ConnectX-4 Lx: **14.21.1000** and above.
142  - ConnectX-5: **16.21.1000** and above.
143  - ConnectX-5 Ex: **16.21.1000** and above.
144  - ConnectX-6: **20.27.0090** and above.
145  - ConnectX-6 Dx: **22.27.0090** and above.
146  - ConnectX-6 Lx: **26.27.0090** and above.
147  - ConnectX-7: **28.33.2028** and above.
148  - BlueField: **18.25.1010** and above.
149  - BlueField-2: **24.28.1002** and above.
150  - BlueField-3: **32.36.3126** and above.
151
  New features may be added in more recent firmware versions.
153
154Libraries and kernel modules can be provided either by the Linux distribution,
155or by installing NVIDIA MLNX_OFED/EN which provides compatibility with older kernels.
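
A minimal sketch for checking that the three kernel modules listed above are loaded.
The helper name ``mlx5_missing_modules`` is hypothetical.
It reads a file in the ``/proc/modules`` format,
so drivers built into the kernel image rather than loaded as modules
would be reported as missing.

.. code-block:: shell

   # Print each required module that is absent from a /proc/modules-style file.
   # Usage: mlx5_missing_modules [file]   (defaults to /proc/modules)
   mlx5_missing_modules() {
       modfile=${1:-/proc/modules}
       for m in mlx5_core mlx5_ib ib_uverbs; do
           # The module name is the first whitespace-separated field of a line.
           if ! awk -v m="$m" '$1 == m { found = 1 } END { exit !found }' "$modfile"; then
               printf '%s\n' "$m"
           fi
       done
   }

An empty output means all three modules were found.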
156
157
158Upstream Dependencies
159^^^^^^^^^^^^^^^^^^^^^
160
161The mlx5 kernel modules are part of upstream Linux.
162The minimal supported kernel version is 4.14.
163For 32-bit, version 4.14.41 or above is required.
164
The libraries `libibverbs` and `libmlx5` are part of ``rdma-core``,
which is packaged by most Linux distributions.
The minimal supported rdma-core version is 16.
For 32-bit, version 18 or above is required.
169
170The rdma-core sources can be downloaded at
171https://github.com/linux-rdma/rdma-core
172
173It is possible to build rdma-core as static libraries starting with version 21::
174
175    cd build
176    CFLAGS=-fPIC cmake -DENABLE_STATIC=1 -DNO_PYVERBS=1 -DNO_MAN_PAGES=1 -GNinja ..
177    ninja
178    ninja install
179
The firmware can be updated with `mlxup
<https://docs.nvidia.com/networking/display/mlxupfwutility>`_.
The latest firmware can be downloaded at
https://network.nvidia.com/support/firmware/firmware-downloads/
184
185
186NVIDIA MLNX_OFED/EN
187^^^^^^^^^^^^^^^^^^^
188
189The kernel modules and libraries are packaged with other tools
190in NVIDIA MLNX_OFED or NVIDIA MLNX_EN.
191The minimal supported versions are:
192
193- NVIDIA MLNX_OFED version: **4.5** and above.
194- NVIDIA MLNX_EN version: **4.5** and above.
195
196The firmware, the libraries libibverbs, libmlx5, and mlnx-ofed-kernel modules
197are packaged in `NVIDIA MLNX_OFED
198<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_.
199After downloading, it can be installed with this command::
200
201   ./mlnxofedinstall --dpdk
202
203`NVIDIA MLNX_EN
204<https://network.nvidia.com/products/ethernet-drivers/linux/mlnx_en/>`_
205is a smaller package including what is needed for DPDK.
206After downloading, it can be installed with this command::
207
208   ./install --dpdk
209
210After installing, the firmware version can be checked::
211
212   ibv_devinfo
213
214The firmware updates are included in NVIDIA MLNX_OFED/EN packages.
215Because each release provides new features, these updates must be applied
216to match the kernel modules and libraries they come with.
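
The installed firmware can be compared against the minimal versions listed earlier.
The sketch below is one way to do it, assuming GNU ``sort -V`` is available
and that ``ibv_devinfo`` prints the version on a line starting with ``fw_ver:``.
Both helper names are hypothetical.

.. code-block:: shell

   # Return success when version $1 is greater than or equal to version $2.
   # Relies on "sort -V" (version sort) from GNU coreutils.
   fw_version_ge() {
       [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
   }

   # Extract the first "fw_ver:" value from ibv_devinfo-style text on stdin.
   fw_version_from_devinfo() {
       awk '$1 == "fw_ver:" { print $2; exit }'
   }

   # Example (ConnectX-5 minimum taken from the firmware list):
   #   fw_version_ge "$(ibv_devinfo | fw_version_from_devinfo)" 16.21.1000 \
   #       || echo "firmware too old"

Only the first device reported by ``ibv_devinfo`` is examined in this sketch.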
217
218.. note::
219
220   Several versions of NVIDIA MLNX_OFED/EN are available. Installing the version
221   this DPDK release was developed and tested against is strongly recommended.
222   Please check the "Tested Platforms" section in the :doc:`../../rel_notes/index`.
223
224
225.. _mlx5_windows_prerequisites:
226
227Windows Prerequisites
228~~~~~~~~~~~~~~~~~~~~~
229
230The mlx5 PMDs rely on external libraries and kernel drivers
231for resource allocation and initialization.
232
233
234DevX SDK Installation
235^^^^^^^^^^^^^^^^^^^^^
236
237The DevX SDK must be installed on the machine building the Windows PMD.
238Additional information can be found at
239`How to Integrate Windows DevX in Your Development Environment
240<https://docs.nvidia.com/networking/display/winof2v290/devx+interface>`_.
241The minimal supported WinOF2 version is 2.60.
242
243
244Compilation Options
245-------------------
246
247Compilation on Linux
248~~~~~~~~~~~~~~~~~~~~
249
250The ibverbs libraries can be linked with this PMD in a number of ways,
251configured by the ``ibverbs_link`` build option:
252
253``shared`` (default)
254   The PMD depends on some .so files.
255
256``dlopen``
   Split the dependency glue into a separate library
   loaded when needed by dlopen (see ``MLX5_GLUE_PATH``).
   It makes dependencies on libibverbs and libmlx5 optional,
   and has no performance impact.
261
262``static``
263   Embed static flavor of the dependencies libibverbs and libmlx5
264   in the PMD shared library or the executable static binary.
265
266
267Compilation on Windows
268~~~~~~~~~~~~~~~~~~~~~~
269
270The DevX SDK location must be set through CFLAGS/LDFLAGS,
271either::
272
273   meson.exe setup "-Dc_args=-I\"%DEVX_INC_PATH%\"" "-Dc_link_args=-L\"%DEVX_LIB_PATH%\"" ...
274
275or::
276
277   set CFLAGS=-I"%DEVX_INC_PATH%" && set LDFLAGS=-L"%DEVX_LIB_PATH%" && meson.exe setup ...
278
279
280.. _mlx5_common_env:
281
282Environment Configuration
283-------------------------
284
285Linux Environment
286~~~~~~~~~~~~~~~~~
287
The kernel network interfaces are brought up during initialization.
Forcing them down prevents packet reception.
290
291The ethtool operations on the kernel interfaces may also affect the PMD.
292
293Some runtime behaviours may be configured through environment variables.
294
295``MLX5_GLUE_PATH``
296   If built with ``ibverbs_link=dlopen``,
297   list of directories in which to search for the rdma-core "glue" plug-in,
298   separated by colons or semi-colons.
299
300``MLX5_SHUT_UP_BF``
   If Verbs is used (DevX disabled),
   HW queue doorbell register mapping.
   The value 0 means a regular memory mapping,
   while a nonzero value means a non-cached IO mapping.
305
306   With regular memory mapping, the register is flushed to HW
307   usually when the write-combining buffer becomes full,
308   but it depends on CPU design.
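
Since ``MLX5_GLUE_PATH`` accepts both colons and semicolons as separators,
a quick way to inspect the directories it names is to split on both.
This is only an illustrative sketch and the helper name ``mlx5_glue_dirs`` is hypothetical.

.. code-block:: shell

   # Print, one per line, the directories listed in a MLX5_GLUE_PATH-style
   # value, splitting on both ':' and ';'. Empty entries are skipped.
   mlx5_glue_dirs() {
       printf '%s\n' "$1" | tr ':;' '\n\n' | awk 'NF'
   }

   # Example: report entries that are not existing directories.
   #   mlx5_glue_dirs "$MLX5_GLUE_PATH" | while read -r d; do
   #       [ -d "$d" ] || echo "missing: $d"
   #   done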
309
310
311Port Link with MLNX_OFED/EN
312^^^^^^^^^^^^^^^^^^^^^^^^^^^
313
Port links must be set to Ethernet::
315
316   mlxconfig -d <mst device> query | grep LINK_TYPE
317   LINK_TYPE_P1                        ETH(2)
318   LINK_TYPE_P2                        ETH(2)
319
320   mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3
321
Link type values are:

* ``1`` InfiniBand
* ``2`` Ethernet
* ``3`` VPI (auto-sense)
327
328If link type was changed, firmware must be reset as well::
329
330   mlxfwreset -d <mst device> reset
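
The query output can also be filtered for ports that are not yet set to Ethernet.
The sketch below assumes the two-column layout shown above;
the ``IB(1)`` spelling used in the example comment is an assumption,
and the helper name is hypothetical.

.. code-block:: shell

   # Read "mlxconfig ... query" text on stdin and print every LINK_TYPE_P<n>
   # setting that is not Ethernet. Prints nothing when all ports are ETH(2).
   mlx5_non_eth_ports() {
       awk '$1 ~ /^LINK_TYPE_P[0-9]+$/ && $2 != "ETH(2)" { print $1, $2 }'
   }

   # Example:
   #   mlxconfig -d <mst device> query | mlx5_non_eth_ports
   # might print "LINK_TYPE_P2 IB(1)" for a port still set to InfiniBand.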
331
332
333.. _mlx5_vf:
334
335SR-IOV Virtual Function with MLNX_OFED/EN
336^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
337
SR-IOV must be enabled on the NIC.
It can be checked with the following command::
340
341   mlxconfig -d <mst device> query | grep SRIOV_EN
342   SRIOV_EN                            True(1)
343
344If needed, configure SR-IOV::
345
346   mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
347   mlxfwreset -d <mst device> reset
348
After the change, restart the driver::
350
351   /etc/init.d/openibd restart
352
353or::
354
355   service openibd restart
356
357Then the virtual functions can be instantiated::
358
359   echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
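
A slightly more defensive variant of the last step is sketched below.
It checks the requested count against the standard ``sriov_totalvfs`` PCI sysfs attribute
and resets the count to zero first,
because the kernel refuses to change a nonzero VF count directly.
The helper name is hypothetical.

.. code-block:: shell

   # Write the requested VF count to <device dir>/sriov_numvfs after checking
   # it against <device dir>/sriov_totalvfs.
   # Usage: mlx5_set_num_vfs <device dir> <count>
   mlx5_set_num_vfs() {
       devdir=$1
       want=$2
       max=$(cat "$devdir/sriov_totalvfs") || return 1
       if [ "$want" -gt "$max" ]; then
           echo "requested $want VFs, device allows at most $max" >&2
           return 1
       fi
       # Disable any existing VFs before setting the new count.
       echo 0 > "$devdir/sriov_numvfs" || return 1
       echo "$want" > "$devdir/sriov_numvfs"
   }

   # Example (as root):
   #   mlx5_set_num_vfs /sys/class/infiniband/mlx5_0/device 4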
360
361
362.. _mlx5_sub_function:
363
364Sub-Function with MLNX_OFED/EN
365^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
366
A Sub-Function (SF) is a portion of the PCI device
with its own dedicated queues.
An SF shares PCI-level resources with other SFs and/or with its parent PCI function.
370
371#. Requirement::
372
373      MLNX_OFED version >= 5.4-0.3.3.0
374
375#. Configure SF feature::
376
377      # Run mlxconfig on both PFs on host and ECPFs on BlueField.
378      mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12
379
380#. Enable switchdev mode::
381
382      mlxdevm dev eswitch set pci/<DBDF> mode switchdev
383
384#. Add SF port::
385
386      mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>
387
388      Get SFID from output: pci/<DBDF>/<SFID>
389
390#. Modify MAC address::
391
392      mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>
393
394#. Activate SF port::
395
      mlxdevm port function set pci/<DBDF>/<SFID> state active
397
398#. Devargs to probe SF device::
399
400      auxiliary:mlx5_core.sf.<num>,class=eth:regex
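
Two small helpers can tie the steps above together in a script.
This is a sketch only: it assumes the ``mlxdevm port add`` output
contains a line beginning with the ``pci/<DBDF>/<SFID>`` handle,
and both function names are hypothetical.

.. code-block:: shell

   # Extract the "pci/<DBDF>/<SFID>" handle from the first line of text on
   # stdin that starts with it (assumed shape of the command output).
   mlx5_sf_handle() {
       sed -n 's|^\(pci/[0-9a-fA-F:.]*/[0-9]*\).*|\1|p' | head -n 1
   }

   # Build the devargs string for an SF auxiliary device.
   # Usage: mlx5_sf_devargs <num> [class list]
   mlx5_sf_devargs() {
       printf 'auxiliary:mlx5_core.sf.%s%s\n' "$1" "${2:+,class=$2}"
   }

   # Example:
   #   handle=$(mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum> \
   #            | mlx5_sf_handle)
   #   mlxdevm port function set "$handle" hw_addr <MAC>
   #   mlxdevm port function set "$handle" state active

``mlx5_sf_devargs`` takes the auxiliary device number as given;
it does not derive it from the SF number.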
401
402
403Enable Switchdev Mode
404^^^^^^^^^^^^^^^^^^^^^
405
Switchdev is an E-Switch mode that binds each representor to a VF or SF.
A representor is a DPDK port connected to a VF or SF in such a way that,
assuming there are no offload flows, each packet sent from the VF or SF
is received by the corresponding representor,
and each packet sent to a representor is received by the VF or SF.
411
412After :ref:`configuring VF <mlx5_vf>`, the device must be unbound::
413
414   printf "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind
415
416Then switchdev mode is enabled::
417
418   echo switchdev > /sys/class/net/<net device>/compat/devlink/mode
419
420The device can be bound again at this point.
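
The unbind, mode change and rebind sequence could be wrapped as sketched below.
The helper name and the optional sysfs root argument (useful for dry runs) are hypothetical.
The PCI address and the net device are passed separately,
because the text above does not spell out how they relate.

.. code-block:: shell

   # Usage: mlx5_enable_switchdev <pci address> <net device> [sysfs root]
   mlx5_enable_switchdev() {
       pci=$1
       netdev=$2
       sys=${3:-/sys}
       drv=$sys/bus/pci/drivers/mlx5_core
       # Unbind the device from the kernel driver.
       printf '%s' "$pci" > "$drv/unbind" || return 1
       # Switch the E-Switch to switchdev mode.
       echo switchdev > "$sys/class/net/$netdev/compat/devlink/mode" || return 1
       # Bind the device again.
       printf '%s' "$pci" > "$drv/bind"
   }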
421
422
423Run as Non-Root
424^^^^^^^^^^^^^^^
425
Hugepage and resource limit setup are documented
in the :ref:`common Linux guide <Running_Without_Root_Privileges>`.
This PMD can operate without access to physical addresses,
therefore it does not require ``SYS_ADMIN`` to access ``/proc/self/pagemap``.
Note that this requirement may still come from other drivers.
431
The following additional capabilities must be granted to the application,
each for the reason given:
434
435``NET_RAW``
436   For raw Ethernet queue allocation through the kernel driver.
437
438``NET_ADMIN``
439   For device configuration, like setting link status or MTU.
440
441``SYS_RAWIO``
442   For using group 1 and above (software steering) in Flow API.
443
444They can be manually granted for a specific executable file::
445
446   setcap cap_net_raw,cap_net_admin,cap_sys_rawio+ep <executable>
447
448Alternatively, a service manager or a container runtime
449may configure the capabilities for a process.
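
To confirm that an executable carries the three capabilities,
the output of ``getcap`` can be scanned for their names.
The sketch below only looks for the names and does not inspect the flag set;
the helper name is hypothetical.

.. code-block:: shell

   # Read "getcap <file>" text on stdin and print each required capability
   # whose name does not appear in it.
   mlx5_missing_caps() {
       caps=$(cat)
       for c in cap_net_raw cap_net_admin cap_sys_rawio; do
           case $caps in
               *"$c"*) ;;
               *) printf '%s\n' "$c" ;;
           esac
       done
   }

   # Example:
   #   getcap <executable> | mlx5_missing_caps

With a container runtime such as Docker, the equivalent would be
``--cap-add`` options for ``NET_RAW``, ``NET_ADMIN`` and ``SYS_RAWIO``.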
450
451
452Windows Environment
453~~~~~~~~~~~~~~~~~~~
454
455WinOF2 version 2.60 or higher must be installed on the machine.
456
457
458WinOF2 Installation
459^^^^^^^^^^^^^^^^^^^
460
461The driver can be downloaded from the following site: `WINOF2
462<https://network.nvidia.com/products/adapter-software/ethernet/windows/winof-2/>`_.
463
464
465DevX Enablement
466^^^^^^^^^^^^^^^
467
468DevX for Windows must be enabled in the Windows registry.
469The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
470Additional information can be found in the WinOF2 user manual.
471
472
473.. _mlx5_firmware_config:
474
475Firmware Configuration
476~~~~~~~~~~~~~~~~~~~~~~
477
478Firmware features can be configured as key/value pairs.
479
480The command to set a value is::
481
482  mlxconfig -d <device> set <key>=<value>
483
484The command to query a value is::
485
486  mlxconfig -d <device> query <key>
487
488The device name for the command ``mlxconfig`` can be either the PCI address,
489or the mst device name found with::
490
491  mst status
492
Some firmware configurations are listed below.
494
495- link type::
496
497    LINK_TYPE_P1
498    LINK_TYPE_P2
499    value: 1=Infiniband 2=Ethernet 3=VPI(auto-sense)
500
501- enable SR-IOV::
502
503    SRIOV_EN=1
504
505- the maximum number of SR-IOV virtual functions::
506
507    NUM_OF_VFS=<max>
508
509- enable DevX (required by Direct Rules and other features)::
510
511    UCTX_EN=1
512
513- aggressive CQE zipping::
514
515    CQE_COMPRESSION=1
516
517- L3 VXLAN and VXLAN-GPE destination UDP port::
518
519    IP_OVER_VXLAN_EN=1
520    IP_OVER_VXLAN_PORT=<udp dport>
521
522- enable VXLAN-GPE tunnel flow matching::
523
524    FLEX_PARSER_PROFILE_ENABLE=0
525    or
526    FLEX_PARSER_PROFILE_ENABLE=2
527
528- enable IP-in-IP tunnel flow matching::
529
530    FLEX_PARSER_PROFILE_ENABLE=0
531
532- enable MPLS flow matching::
533
534    FLEX_PARSER_PROFILE_ENABLE=1
535
- enable ICMP (code/type/identifier/sequence number) / ICMP6 (code/type) fields matching::
537
538    FLEX_PARSER_PROFILE_ENABLE=2
539
540- enable Geneve flow matching::
541
542   FLEX_PARSER_PROFILE_ENABLE=0
543   or
544   FLEX_PARSER_PROFILE_ENABLE=1
545
546- enable Geneve TLV option flow matching::
547
548   FLEX_PARSER_PROFILE_ENABLE=0
549   or
550   FLEX_PARSER_PROFILE_ENABLE=8
551
552- enable GTP flow matching::
553
554   FLEX_PARSER_PROFILE_ENABLE=3
555
556- enable eCPRI flow matching::
557
558   FLEX_PARSER_PROFILE_ENABLE=4
559   PROG_PARSE_GRAPH=1
560
561- enable dynamic flex parser for flex item::
562
563   FLEX_PARSER_PROFILE_ENABLE=4
564   PROG_PARSE_GRAPH=1
565
566- enable realtime timestamp format::
567
568   REAL_TIME_CLOCK_ENABLE=1
569
570- allow locking hairpin RQ data buffer in device memory::
571
572   HAIRPIN_DATA_BUFFER_LOCK=1
573   MEMIC_SIZE_LIMIT=0
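
Because several features depend on ``FLEX_PARSER_PROFILE_ENABLE``,
it can help to look for a value that suits all the features needed.
The sketch below encodes the values from the list above
and intersects them, assuming a value listed for two features enables both at once
(the list suggests this but does not state it).
The feature keywords and function names are hypothetical.

.. code-block:: shell

   # Profiles accepted for each feature, copied from the list above.
   mlx5_flex_profiles() {
       case $1 in
           vxlan-gpe)       echo "0 2" ;;
           ip-in-ip)        echo "0" ;;
           mpls)            echo "1" ;;
           icmp)            echo "2" ;;
           geneve)          echo "0 1" ;;
           geneve-tlv)      echo "0 8" ;;
           gtp)             echo "3" ;;
           ecpri|flex-item) echo "4" ;;  # also needs PROG_PARSE_GRAPH=1
           *) return 1 ;;
       esac
   }

   # Print the FLEX_PARSER_PROFILE_ENABLE values that satisfy every feature
   # named on the command line (nothing if the features are incompatible).
   mlx5_common_flex_profile() {
       for p in 0 1 2 3 4 8; do
           ok=1
           for f in "$@"; do
               profiles=$(mlx5_flex_profiles "$f") || return 1
               case " $profiles " in
                   *" $p "*) ;;
                   *) ok=0 ;;
               esac
           done
           [ "$ok" -eq 1 ] && printf '%s\n' "$p"
       done
       return 0
   }

For example, ``mlx5_common_flex_profile vxlan-gpe icmp`` prints ``2``,
which could then be applied with
``mlxconfig -d <device> set FLEX_PARSER_PROFILE_ENABLE=2``.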
574
575
576.. _mlx5_common_driver_options:
577
578Device Arguments
579----------------
580
The driver can be configured per device.
A single argument list can be used for a device managed by multiple PMDs.
The parameters must be passed through the EAL option ``-a``,
as in the examples below:
585
586- PCI device::
587
588  -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0
589
590- Auxiliary SF::
591
592  -a auxiliary:mlx5_core.sf.2,class=compress,mr_ext_memseg_en=0
593
594Each device class PMD has its own list of specific arguments,
595and below are the arguments supported by the common mlx5 layer.
596
597- ``class`` parameter [string]
598
599  Select the classes of the drivers that should probe the device.
600  See :ref:`mlx5_classes` for more explanation and details.
601
602  The default value is ``eth``.
603
604- ``mr_ext_memseg_en`` parameter [int]
605
  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in the MR (Memory Region) lookup table on the
  datapath is minimized, which benefits performance. On the other hand, it worsens
  memory utilization because registered memory is pinned by the kernel driver. Even
  if a page in the extended chunk is freed, it does not become reusable until the
  entire memory is freed.
612
613  Enabled by default.
614
615- ``mr_mempool_reg_en`` parameter [int]
616
617  A nonzero value enables implicit registration of DMA memory of all mempools
618  except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
619  for mempools populated with non-contiguous objects or those without IOVA.
620  The effect is that when a packet from a mempool is transmitted,
621  its memory is already registered for DMA in the PMD and no registration
622  will happen on the data path. The tradeoff is extra work on the creation
623  of each mempool and increased HW resource use if some mempools
624  are not used with MLX5 devices.
625
626  Enabled by default.
627
628- ``sys_mem_en`` parameter [int]
629
  A nonzero value makes the PMD memory management allocate memory
  from the system by default, when no explicit rte memory flag is given.
632
633  By default, the PMD will set this value to 0.
634
635- ``sq_db_nc`` parameter [int]
636
  The rdma-core library can map the doorbell register in two ways,
  depending on the environment variable "MLX5_SHUT_UP_BF":

  - As regular cached memory (usually with write combining attribute),
    if the variable is either missing or set to zero.
  - As non-cached memory, if the variable is present and set to a value other than "0".

  The same doorbell mapping approach is implemented directly by the PMD
  in UAR generation for queues created with DevX.

  The type of mapping may slightly affect the send queue performance.
  The optimal choice depends strongly on the host architecture
  and should be determined empirically.

  If ``sq_db_nc`` is set to zero, the doorbell is forced to be mapped to
  regular memory (with write combining), and the PMD performs an extra write
  memory barrier after writing to the doorbell. This might increase the CPU
  clocks needed per packet sent, but latency might be improved.

  If ``sq_db_nc`` is set to one, the doorbell is forced to be mapped to
  non-cached memory, and the PMD does not perform the extra write memory barrier
  after writing to the doorbell. On some architectures this might improve performance.

  If ``sq_db_nc`` is set to two, the doorbell is forced to be mapped to
  regular memory, and the PMD uses heuristics to decide whether a write memory
  barrier should be performed. For bursts whose size is a multiple of the
  recommended one (64 packets), the next burst is assumed to be coming,
  so the extra memory barrier is skipped (it is expected to be issued
  in the next burst, at least after descriptor writing).
  This might increase latency (on some hosts until the next packets are transmitted)
  and should be used with care.
  The PMD uses heuristics only for Tx queues; for other send queues the doorbell
  is forced to be mapped to regular memory, as when ``sq_db_nc`` is set to 0.
669
670  If ``sq_db_nc`` is omitted, the preset (if any) environment variable
671  "MLX5_SHUT_UP_BF" value is used. If there is no "MLX5_SHUT_UP_BF", the
672  default ``sq_db_nc`` value is zero for ARM64 hosts and one for others.
673
674- ``cmd_fd`` parameter [int]
675
676  File descriptor of ``ibv_context`` created outside the PMD.
677  PMD will use this FD to import remote CTX. The ``cmd_fd`` is obtained from
678  the ``ibv_context->cmd_fd`` member, which must be dup'd before being passed.
679  This parameter is valid only if ``pd_handle`` parameter is specified.
680
681  By default, the PMD will create a new ``ibv_context``.
682
683  .. note::
684
     When the FD comes from another process, it is the user's responsibility to
     share the FD between the processes (e.g. by SCM_RIGHTS).
687
688- ``pd_handle`` parameter [int]
689
  Protection domain handle of ``ibv_pd`` created outside the PMD.
  PMD will use this handle to import remote PD. The ``pd_handle`` can be
  obtained from the original PD by reading its ``ibv_pd->handle`` member value.
  This parameter is valid only if ``cmd_fd`` parameter is specified,
  and its value must be a valid kernel handle for a PD object
  in the context represented by the given ``cmd_fd``.
696
697  By default, the PMD will allocate a new PD.
698
699  .. note::
700
     The ``ibv_pd->handle`` member is different from the ``mlx5dv_pd->pdn`` member.
702