..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2022 6WIND S.A.
    Copyright (c) 2022 NVIDIA Corporation & Affiliates

.. include:: <isonum.txt>

NVIDIA MLX5 Common Driver
=========================

.. note::

   NVIDIA acquired Mellanox Technologies in 2020.
   The DPDK documentation and code might still include instances
   of or references to Mellanox trademarks (like BlueField and ConnectX)
   that are now NVIDIA trademarks.

The mlx5 common driver library (**librte_common_mlx5**) provides support for
**NVIDIA ConnectX-4**, **NVIDIA ConnectX-4 Lx**, **NVIDIA ConnectX-5**,
**NVIDIA ConnectX-6**, **NVIDIA ConnectX-6 Dx**, **NVIDIA ConnectX-6 Lx**,
**NVIDIA ConnectX-7**, **NVIDIA BlueField**, and **NVIDIA BlueField-2** families of
10/25/40/50/100/200 Gb/s adapters.

Information and documentation for these adapters can be found on the
`NVIDIA website <https://www.nvidia.com/en-us/networking/>`_.
Help is also provided by the
`NVIDIA Networking forum <https://forums.developer.nvidia.com/c/infrastructure/369/>`_.
In addition, there is a `web section dedicated to DPDK
<https://developer.nvidia.com/networking/dpdk>`_.


Design
------

For security reasons and to enhance robustness,
this driver only handles virtual memory addresses.
The way the kernel handles resource allocations,
combined with hardware specifications that allow handling virtual memory addresses directly,
ensures that DPDK applications cannot access random physical memory
(or memory that does not belong to the current process).

There are different levels of objects and bypassing abilities
which are used to get the best performance:

- **Verbs** is a complete high-level generic API
- **Direct Verbs** is a device-specific API
- **DevX** allows accessing firmware objects
- **Direct Rules** manages flow steering at the low-level hardware layer

On Linux, the above interfaces are provided by linking with ``libibverbs`` and ``libmlx5``.
See :ref:`mlx5_linux_prerequisites` for installation.

On Windows, DevX is the only requirement from the above list.
See :ref:`mlx5_windows_prerequisites` for DevX SDK package installation.


.. _mlx5_classes:

Classes
-------

One mlx5 device can be probed by a number of different PMDs.
To select a specific PMD, its class must be specified as a device parameter
(e.g. ``0000:08:00.1,class=eth``).

To allow probing by multiple PMDs,
several classes may be listed, separated by colons.
For example, ``class=crypto:regex`` will probe both Crypto and RegEx PMDs.


Supported Classes
~~~~~~~~~~~~~~~~~

- ``class=compress`` for :doc:`../../compressdevs/mlx5`.
- ``class=crypto`` for :doc:`../../cryptodevs/mlx5`.
- ``class=eth`` for :doc:`../../nics/mlx5`.
- ``class=regex`` for :doc:`../../regexdevs/mlx5`.
- ``class=vdpa`` for :doc:`../../vdpadevs/mlx5`.

By default, the mlx5 device will be probed by the ``eth`` PMD.


Limitations
~~~~~~~~~~~

- ``eth`` and ``vdpa`` PMDs cannot be probed at the same time.
  All other combinations are possible.

- On Windows, only ``eth`` and ``crypto`` are supported.
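
As a minimal sketch of the rules above, a shell helper can assemble the ``class``
device parameter and reject the one combination that is not allowed.
The function name ``mlx5_class_devarg`` is hypothetical and not part of DPDK;
only the class names, the colon separator, the ``eth`` default
and the ``eth``/``vdpa`` restriction are taken from this guide.

.. code-block:: sh

   # Hypothetical helper: print "class=<a>:<b>..." for the given class names.
   # Fails for an unknown class or when eth and vdpa are requested together.
   mlx5_class_devarg() {
       has_eth=0
       has_vdpa=0
       list=
       for c in "$@"; do
           case "$c" in
               compress|crypto|regex) ;;
               eth) has_eth=1 ;;
               vdpa) has_vdpa=1 ;;
               *) echo "unknown class: $c" >&2; return 1 ;;
           esac
           # Join the names with ':' as described in the Classes section.
           list="${list:+$list:}$c"
       done
       if [ "$has_eth" -eq 1 ] && [ "$has_vdpa" -eq 1 ]; then
           echo "eth and vdpa cannot be probed at the same time" >&2
           return 1
       fi
       # With no class given, fall back to the documented default.
       if [ -z "$list" ]; then
           list=eth
       fi
       printf 'class=%s\n' "$list"
   }

For instance, ``mlx5_class_devarg crypto regex`` prints ``class=crypto:regex``,
which can be appended to the device address passed to the EAL option ``-a``.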


.. _mlx5_common_compilation:

Compilation Prerequisites
-------------------------

.. _mlx5_linux_prerequisites:

Linux Prerequisites
~~~~~~~~~~~~~~~~~~~

This driver relies on external libraries and kernel drivers
for resource allocation and initialization.
The following dependencies are not part of DPDK and must be installed separately:

- **libibverbs**

  User space Verbs framework used by ``librte_common_mlx5``.
  This library provides a generic interface between the kernel
  and low-level user space drivers such as ``libmlx5``.

  It allows slow and privileged operations (context initialization,
  hardware resource allocation) to be managed by the kernel
  and fast operations to never leave user space.

- **libmlx5**

  Low-level user space driver library for NVIDIA devices.
  It is automatically loaded by ``libibverbs``.

  This library basically implements send/receive calls to the hardware queues.

- **Kernel modules**

  They provide the kernel-side Verbs API and low-level device drivers
  that manage actual hardware initialization
  and resource sharing with user-space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - ``mlx5_core``: hardware driver managing NVIDIA devices
    and related Ethernet kernel network devices.
  - ``mlx5_ib``: InfiniBand device driver.
  - ``ib_uverbs``: user space driver for Verbs (entry point for ``libibverbs``).

- **Firmware update**

  NVIDIA MLNX_OFED/EN releases include firmware updates.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

Libraries and kernel modules can be provided either by the Linux distribution,
or by installing NVIDIA MLNX_OFED/EN, which provides compatibility with older kernels.
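
Whether the three kernel modules listed above are loaded can be checked
from the kernel module list. The sketch below is illustrative only:
``missing_mlx5_modules`` is a hypothetical helper, and it assumes the drivers
are built as loadable modules (drivers built into the kernel image
do not appear in ``/proc/modules``).

.. code-block:: sh

   # Print each required mlx5 kernel module that is absent from a module list.
   # $1: file in /proc/modules format (defaults to /proc/modules).
   missing_mlx5_modules() {
       modules_file=${1:-/proc/modules}
       for m in mlx5_core mlx5_ib ib_uverbs; do
           # Each line of the list starts with the module name and a space.
           if ! grep -q "^$m " "$modules_file"; then
               echo "$m"
           fi
       done
   }

Calling ``missing_mlx5_modules`` without argument prints nothing
when all three modules are loaded.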


Upstream Dependencies
^^^^^^^^^^^^^^^^^^^^^

The mlx5 kernel modules are part of upstream Linux.
The minimal supported kernel version is 4.14.
For 32-bit, version 4.14.41 or above is required.

The libraries ``libibverbs`` and ``libmlx5`` are part of ``rdma-core``,
which is packaged by most Linux distributions.
The minimal supported rdma-core version is 16.
For 32-bit, version 18 or above is required.

The rdma-core sources can be downloaded at
https://github.com/linux-rdma/rdma-core

It is possible to build rdma-core as static libraries starting with version 21::

    cd build
    CFLAGS=-fPIC cmake -DIN_PLACE=1 -DENABLE_STATIC=1 -GNinja ..
    ninja


NVIDIA MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^

The kernel modules and libraries are packaged with other tools
in NVIDIA MLNX_OFED or NVIDIA MLNX_EN.
The minimal supported versions are:

- NVIDIA MLNX_OFED version: **4.5** and above.
- NVIDIA MLNX_EN version: **4.5** and above.
- Firmware version:

  - ConnectX-4: **12.21.1000** and above.
  - ConnectX-4 Lx: **14.21.1000** and above.
  - ConnectX-5: **16.21.1000** and above.
  - ConnectX-5 Ex: **16.21.1000** and above.
  - ConnectX-6: **20.27.0090** and above.
  - ConnectX-6 Dx: **22.27.0090** and above.
  - ConnectX-6 Lx: **26.27.0090** and above.
  - ConnectX-7: **28.33.2028** and above.
  - BlueField: **18.25.1010** and above.
  - BlueField-2: **24.28.1002** and above.

The firmware, the libraries libibverbs, libmlx5, and mlnx-ofed-kernel modules
are packaged in `NVIDIA MLNX_OFED
<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_.
After downloading, it can be installed with this command::

   ./mlnxofedinstall --dpdk

`NVIDIA MLNX_EN
<https://network.nvidia.com/products/ethernet-drivers/linux/mlnx_en/>`_
is a smaller package including what is needed for DPDK.
After downloading, it can be installed with this command::

   ./install --dpdk

After installing, the firmware version can be checked::

   ibv_devinfo
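
The reported version can then be compared with the minimum listed above.
The following is a sketch rather than an official tool:
the helper names ``mlx5_min_fw`` and ``mlx5_fw_ok`` are hypothetical,
the table simply mirrors the list in this section,
and ``sort -V`` (version sort) is assumed to be available, as in GNU coreutils.

.. code-block:: sh

   # Print the minimum firmware version this guide lists for an adapter family.
   mlx5_min_fw() {
       case "$1" in
           "ConnectX-4")                 echo 12.21.1000 ;;
           "ConnectX-4 Lx")              echo 14.21.1000 ;;
           "ConnectX-5"|"ConnectX-5 Ex") echo 16.21.1000 ;;
           "ConnectX-6")                 echo 20.27.0090 ;;
           "ConnectX-6 Dx")              echo 22.27.0090 ;;
           "ConnectX-6 Lx")              echo 26.27.0090 ;;
           "ConnectX-7")                 echo 28.33.2028 ;;
           "BlueField")                  echo 18.25.1010 ;;
           "BlueField-2")                echo 24.28.1002 ;;
           *) return 1 ;;
       esac
   }

   # Succeed when firmware version $2 is at least the minimum for family $1.
   mlx5_fw_ok() {
       min=$(mlx5_min_fw "$1") || return 2
       # Version sort puts the older version first; it must be the minimum.
       [ "$(printf '%s\n%s\n' "$min" "$2" | sort -V | head -n 1)" = "$min" ]
   }

The installed version appears in the ``fw_ver`` field of the ``ibv_devinfo`` output,
so a call such as ``mlx5_fw_ok "ConnectX-6 Dx" <fw_ver>`` succeeds
only when the adapter firmware is recent enough.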

.. note::

   Several versions of NVIDIA MLNX_OFED/EN are available. Installing the version
   this DPDK release was developed and tested against is strongly recommended.
   Please check the "Tested Platforms" section in the :doc:`../../rel_notes/index`.


.. _mlx5_windows_prerequisites:

Windows Prerequisites
~~~~~~~~~~~~~~~~~~~~~

The mlx5 PMDs rely on external libraries and kernel drivers
for resource allocation and initialization.


DevX SDK Installation
^^^^^^^^^^^^^^^^^^^^^

The DevX SDK must be installed on the machine building the Windows PMD.
Additional information can be found at
`How to Integrate Windows DevX in Your Development Environment
<https://docs.nvidia.com/networking/display/winof2v260/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`_.
The minimal supported WinOF2 version is 2.60.


Compilation Options
-------------------

Compilation on Linux
~~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

``shared`` (default)
   The PMD depends on some .so files.

``dlopen``
   Split the dependency glue into a separate library
   loaded when needed by dlopen (see ``MLX5_GLUE_PATH``).
   It makes dependencies on libibverbs and libmlx5 optional,
   and has no performance impact.

``static``
   Embed the static flavor of the dependencies libibverbs and libmlx5
   in the PMD shared library or the executable static binary.
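
As an illustration, the option is passed like any other DPDK build option
when configuring with meson. The sketch assumes the commands are run
from the top of the DPDK source tree and that ``build`` is the chosen
build directory name; both are assumptions, not requirements.

.. code-block:: sh

   # Configure a build in which the glue library is loaded through dlopen.
   meson setup build -Dibverbs_link=dlopen

   # Compile with the selected linking mode.
   ninja -C build

Replacing ``dlopen`` with ``static`` or ``shared`` selects the other modes described above.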


Compilation on Windows
~~~~~~~~~~~~~~~~~~~~~~

The DevX SDK location must be set through two environment variables:

``DEVX_LIB_PATH``
   path to the DevX lib file.

``DEVX_INC_PATH``
   path to the DevX header files.


.. _mlx5_common_env:

Environment Configuration
-------------------------

Linux Environment
~~~~~~~~~~~~~~~~~

The kernel network interfaces are brought up during initialization.
Forcing them down prevents packet reception.

The ethtool operations on the kernel interfaces may also affect the PMD.

Some runtime behaviours may be configured through environment variables.

``MLX5_GLUE_PATH``
   If built with ``ibverbs_link=dlopen``,
   list of directories in which to search for the rdma-core "glue" plug-in,
   separated by colons or semi-colons.

``MLX5_SHUT_UP_BF``
   If Verbs is used (DevX disabled),
   HW queue doorbell register mapping.
   The value 0 means a regular memory mapping,
   while a nonzero value means a non-cached IO mapping.

   With regular memory mapping, the register is flushed to HW
   usually when the write-combining buffer becomes full,
   but it depends on CPU design.
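
Since ``MLX5_GLUE_PATH`` accepts both separators, a value containing semi-colons
has to be quoted when set from a shell. The sketch below shows this
and lists the resulting directories; the helper ``mlx5_glue_dirs``
and the two directory names are made up for illustration.

.. code-block:: sh

   # Print each directory of a MLX5_GLUE_PATH-style list on its own line.
   mlx5_glue_dirs() {
       printf '%s\n' "$1" | tr ':;' '\n\n'
   }

   # Quote the value: an unquoted ';' would be read as a command separator.
   export MLX5_GLUE_PATH='/opt/dpdk/glue;/usr/local/lib/dpdk-glue'
   mlx5_glue_dirs "$MLX5_GLUE_PATH"
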


Port Link with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Port links must be set to Ethernet::

   mlxconfig -d <mst device> query | grep LINK_TYPE
   LINK_TYPE_P1                        ETH(2)
   LINK_TYPE_P2                        ETH(2)

   mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3

Link type values are:

* ``1`` InfiniBand
* ``2`` Ethernet
* ``3`` VPI (auto-sense)

If the link type was changed, the firmware must be reset as well::

   mlxfwreset -d <mst device> reset


.. _mlx5_vf:

SR-IOV Virtual Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SR-IOV must be enabled on the NIC.
It can be checked with the following command::

   mlxconfig -d <mst device> query | grep SRIOV_EN
   SRIOV_EN                            True(1)

If needed, configure SR-IOV::

   mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
   mlxfwreset -d <mst device> reset

After making the change, restart the driver::

   /etc/init.d/openibd restart

or::

   service openibd restart

Then the virtual functions can be instantiated::

   echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
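
The number of requested VFs cannot exceed what the device exposes.
As a hedged sketch, the hypothetical helper below (``mlx5_set_num_vfs``
is not an existing tool) reads the standard PCI sysfs attribute
``sriov_totalvfs``, located next to ``sriov_numvfs``, and refuses larger requests.

.. code-block:: sh

   # Write the requested VF count after checking it against the device limit.
   # $1: sysfs device directory, $2: number of VFs to instantiate.
   # Example: mlx5_set_num_vfs /sys/class/infiniband/mlx5_0/device 4
   mlx5_set_num_vfs() {
       dev_dir=$1
       wanted=$2
       total=$(cat "$dev_dir/sriov_totalvfs") || return 1
       if [ "$wanted" -gt "$total" ]; then
           echo "requested $wanted VFs, device allows at most $total" >&2
           return 1
       fi
       echo "$wanted" > "$dev_dir/sriov_numvfs"
   }

Writing to the real sysfs file requires root privileges.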


.. _mlx5_sub_function:

Sub-Function with MLNX_OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A Sub-Function (SF) is a portion of the PCI device
that has its own dedicated queues.
An SF shares PCI-level resources with other SFs and/or with its parent PCI function.

0. Requirement::

      MLNX_OFED version >= 5.4-0.3.3.0

1. Configure SF feature::

      # Run mlxconfig on both PFs on host and ECPFs on BlueField.
      mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12

2. Enable switchdev mode::

      mlxdevm dev eswitch set pci/<DBDF> mode switchdev

3. Add SF port::

      mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>

      Get SFID from output: pci/<DBDF>/<SFID>

4. Modify MAC address::

      mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>

5. Activate SF port::

      mlxdevm port function set pci/<DBDF>/<SFID> state active

6. Devargs to probe SF device::

      auxiliary:mlx5_core.sf.<num>,class=eth:regex


Enable Switchdev Mode
^^^^^^^^^^^^^^^^^^^^^

Switchdev mode is an E-Switch mode that binds a representor to a VF or SF.
A representor is a port in DPDK that is connected to a VF or SF in such a way
that, assuming there are no offload flows, each packet sent from the VF or SF
is received by the corresponding representor,
while each packet sent to a representor is received by the VF or SF.

After :ref:`configuring VF <mlx5_vf>`, the device must be unbound::

   printf "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind

Then switchdev mode is enabled::

   echo switchdev > /sys/class/net/<net device>/compat/devlink/mode

The device can be bound again at this point.


Run as Non-Root
^^^^^^^^^^^^^^^

Hugepage and resource limit setup are documented
in the :ref:`common Linux guide <Running_Without_Root_Privileges>`.
This PMD can operate without access to physical addresses,
therefore it does not require ``SYS_ADMIN`` to access ``/proc/self/pagemap``.
Note that this requirement may still come from other drivers.

Below are the additional capabilities that must be granted to the application,
with the reason each one is needed:

``NET_RAW``
   For raw Ethernet queue allocation through the kernel driver.

``NET_ADMIN``
   For device configuration, like setting link status or MTU.

``SYS_RAWIO``
   For using group 1 and above (software steering) in Flow API.

They can be manually granted for a specific executable file::

   setcap cap_net_raw,cap_net_admin,cap_sys_rawio+ep <executable>

Alternatively, a service manager or a container runtime
may configure the capabilities for a process.
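
When the capabilities are granted on the file, the result can be displayed
with ``getcap`` from the same libcap tool set.
The path below is only an assumption (the usual location of ``dpdk-testpmd``
in a meson build directory named ``build``); any application binary can be used.

.. code-block:: sh

   # Path of the application binary; adjust to the executable actually deployed.
   app=./build/app/dpdk-testpmd

   # Grant the three capabilities (requires root), then display what is set.
   sudo setcap cap_net_raw,cap_net_admin,cap_sys_rawio+ep "$app"
   getcap "$app"

File capabilities are attached to the binary itself,
so they typically need to be granted again after the executable is rebuilt.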


Windows Environment
~~~~~~~~~~~~~~~~~~~

WinOF2 version 2.60 or higher must be installed on the machine.


WinOF2 Installation
^^^^^^^^^^^^^^^^^^^

The driver can be downloaded from the following site: `WINOF2
<https://network.nvidia.com/products/adapter-software/ethernet/windows/winof-2/>`_.


DevX Enablement
^^^^^^^^^^^^^^^

DevX for Windows must be enabled in the Windows registry.
The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
Additional information can be found in the WinOF2 user manual.


.. _mlx5_firmware_config:

Firmware Configuration
~~~~~~~~~~~~~~~~~~~~~~

Firmware features can be configured as key/value pairs.

The command to set a value is::

  mlxconfig -d <device> set <key>=<value>

The command to query a value is::

  mlxconfig -d <device> query <key>

The device name for the command ``mlxconfig`` can be either the PCI address,
or the mst device name found with::

  mst status

Some firmware configurations are listed below.

- link type::

    LINK_TYPE_P1
    LINK_TYPE_P2
    value: 1=InfiniBand 2=Ethernet 3=VPI(auto-sense)

- enable SR-IOV::

    SRIOV_EN=1

- the maximum number of SR-IOV virtual functions::

    NUM_OF_VFS=<max>

- enable DevX (required by Direct Rules and other features)::

    UCTX_EN=1

- aggressive CQE zipping::

    CQE_COMPRESSION=1

- L3 VXLAN and VXLAN-GPE destination UDP port::

    IP_OVER_VXLAN_EN=1
    IP_OVER_VXLAN_PORT=<udp dport>

- enable VXLAN-GPE tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0
    or
    FLEX_PARSER_PROFILE_ENABLE=2

- enable IP-in-IP tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0

- enable MPLS flow matching::

    FLEX_PARSER_PROFILE_ENABLE=1

- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::

    FLEX_PARSER_PROFILE_ENABLE=2

- enable Geneve flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0
    or
    FLEX_PARSER_PROFILE_ENABLE=1

- enable Geneve TLV option flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0

- enable GTP flow matching::

    FLEX_PARSER_PROFILE_ENABLE=3

- enable eCPRI flow matching::

    FLEX_PARSER_PROFILE_ENABLE=4
    PROG_PARSE_GRAPH=1

- enable dynamic flex parser for flex item::

    FLEX_PARSER_PROFILE_ENABLE=4
    PROG_PARSE_GRAPH=1

- enable realtime timestamp format::

    REAL_TIME_CLOCK_ENABLE=1

- allow locking hairpin RQ data buffer in device memory::

    HAIRPIN_DATA_BUFFER_LOCK=1
    MEMIC_SIZE_LIMIT=0
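
Because ``FLEX_PARSER_PROFILE_ENABLE`` holds a single value,
the matching features listed above are usable together only if they share a profile.
The sketch below encodes the list from this section and computes the common profiles;
the function names and the lower-case feature keywords are hypothetical,
chosen only for this illustration.

.. code-block:: sh

   # Print the FLEX_PARSER_PROFILE_ENABLE values listed for one feature.
   mlx5_flex_profiles_for() {
       case "$1" in
           vxlan-gpe)           echo "0 2" ;;
           ip-in-ip|geneve-tlv) echo "0" ;;
           mpls)                echo "1" ;;
           icmp)                echo "2" ;;
           geneve)              echo "0 1" ;;
           gtp)                 echo "3" ;;
           ecpri|flex-item)     echo "4" ;;
           *) echo "unknown feature: $1" >&2; return 1 ;;
       esac
   }

   # Print the profiles accepted by every feature given as argument.
   mlx5_common_flex_profiles() {
       result=
       for p in 0 1 2 3 4; do
           ok=1
           for f in "$@"; do
               allowed=$(mlx5_flex_profiles_for "$f") || return 1
               # Pad with spaces so that each value matches as a whole word.
               case " $allowed " in
                   *" $p "*) ;;
                   *) ok=0 ;;
               esac
           done
           if [ "$ok" -eq 1 ]; then
               result="${result:+$result }$p"
           fi
       done
       echo "$result"
   }

For example, ``mlx5_common_flex_profiles vxlan-gpe geneve`` prints ``0``,
while ``mlx5_common_flex_profiles mpls gtp`` prints an empty line
because no listed profile covers both.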


.. _mlx5_common_driver_options:

Device Arguments
----------------

The driver can be configured per device.
A single argument list can be used for a device managed by multiple PMDs.
The parameters must be passed through the EAL option ``-a``,
as in the examples below:

- PCI device::

     -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0

- Auxiliary SF::

     -a auxiliary:mlx5_core.sf.2,class=compress,mr_ext_memseg_en=0

Each device class PMD has its own list of specific arguments,
and below are the arguments supported by the common mlx5 layer.

- ``class`` parameter [string]

  Select the classes of the drivers that should probe the device.
  See :ref:`mlx5_classes` for more explanation and details.

  The default value is ``eth``.

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in the MR (Memory Region) lookup table on the
  datapath is minimized, which benefits performance. On the other hand, it worsens
  memory utilization because registered memory is pinned by the kernel driver.
  Even if a page in the extended chunk is freed, it does not become reusable
  until the entire memory is freed.

  Enabled by default.

- ``mr_mempool_reg_en`` parameter [int]

  A nonzero value enables implicit registration of DMA memory of all mempools
  except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
  for mempools populated with non-contiguous objects or those without IOVA.
  The effect is that when a packet from a mempool is transmitted,
  its memory is already registered for DMA in the PMD and no registration
  will happen on the data path. The tradeoff is extra work on the creation
  of each mempool and increased HW resource use if some mempools
  are not used with MLX5 devices.

  Enabled by default.

- ``sys_mem_en`` parameter [int]

  A nonzero value makes the PMD memory management allocate memory
  from the system by default when no explicit rte memory flag is given.

  By default, the PMD will set this value to 0.

- ``sq_db_nc`` parameter [int]

  The rdma-core library can map the doorbell register in two ways,
  depending on the environment variable "MLX5_SHUT_UP_BF":

  - As regular cached memory (usually with write combining attribute),
    if the variable is either missing or set to zero.
  - As non-cached memory, if the variable is present and set to a value other than "0".

  The same doorbell mapping approach is implemented directly by the PMD
  in UAR generation for queues created with DevX.

  The type of mapping may slightly affect the send queue performance.
  The optimal choice strongly depends on the host architecture
  and should be determined experimentally.

  If ``sq_db_nc`` is set to zero, the doorbell is forced to be mapped to
  regular memory (with write combining), and the PMD performs an extra write
  memory barrier after writing to the doorbell. This might increase the CPU
  clocks needed per packet sent, but latency might be improved.

  If ``sq_db_nc`` is set to one, the doorbell is forced to be mapped to
  non-cached memory, and the PMD does not perform the extra write memory barrier
  after writing to the doorbell. On some architectures this might improve performance.

  If ``sq_db_nc`` is set to two, the doorbell is forced to be mapped to
  regular memory, and the PMD uses heuristics to decide whether a write memory
  barrier should be performed. For bursts whose size is a multiple of the
  recommended one (64 packets), the next burst is assumed to be coming, so the
  extra memory barrier is not issued (it is expected to be issued in the next
  burst, at least after descriptor writing). This might increase latency
  (on some hosts until the next packets are transmitted) and should be used with care.
  The PMD uses heuristics only for Tx queues. For other send queues, the doorbell
  is forced to be mapped to regular memory, the same as when ``sq_db_nc`` is set to 0.

  If ``sq_db_nc`` is omitted, the value of the environment variable
  "MLX5_SHUT_UP_BF" is used, if it is set. If "MLX5_SHUT_UP_BF" is not set, the
  default ``sq_db_nc`` value is zero for ARM64 hosts and one for others.

- ``cmd_fd`` parameter [int]

  File descriptor of ``ibv_context`` created outside the PMD.
  The PMD will use this FD to import the remote CTX. The ``cmd_fd`` is obtained from
  the ``ibv_context->cmd_fd`` member, which must be dup'd before being passed.
  This parameter is valid only if the ``pd_handle`` parameter is specified.

  By default, the PMD will create a new ``ibv_context``.

  .. note::

     When the FD comes from another process, it is the user's responsibility to
     share the FD between the processes (e.g. by SCM_RIGHTS).

- ``pd_handle`` parameter [int]

  Protection domain handle of ``ibv_pd`` created outside the PMD.
  The PMD will use this handle to import the remote PD. The ``pd_handle`` can be
  obtained from the original PD by reading its ``ibv_pd->handle`` member.
  This parameter is valid only if the ``cmd_fd`` parameter is specified,
  and its value must be a valid kernel handle for a PD object
  in the context represented by the given ``cmd_fd``.

  By default, the PMD will allocate a new PD.

  .. note::

     The ``ibv_pd->handle`` member is different from the ``mlx5dv_pd->pdn`` member.