..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2022 6WIND S.A.
    Copyright (c) 2022 NVIDIA Corporation & Affiliates

.. include:: <isonum.txt>

MLX5 Common Driver
==================

The mlx5 common driver library (**librte_common_mlx5**) provides support for
**NVIDIA ConnectX-4**, **NVIDIA ConnectX-4 Lx**, **NVIDIA ConnectX-5**,
**NVIDIA ConnectX-6**, **NVIDIA ConnectX-6 Dx**, **NVIDIA ConnectX-6 Lx**,
**NVIDIA ConnectX-7**, **NVIDIA BlueField**, and **NVIDIA BlueField-2** families of
10/25/40/50/100/200 Gb/s adapters.

Information and documentation for these adapters can be found on the
`NVIDIA website <https://www.nvidia.com/en-us/networking/>`_.
Help is also provided by the
`Mellanox community <http://community.mellanox.com/welcome>`_.
In addition, there is a `web section dedicated to the Poll Mode Driver
<https://developer.nvidia.com/networking/dpdk>`_.


Design
------

For security reasons and to enhance robustness,
this driver only handles virtual memory addresses.
The way resource allocations are handled by the kernel,
combined with hardware specifications that allow handling virtual memory addresses directly,
ensures that DPDK applications cannot access random physical memory
(or memory that does not belong to the current process).

There are different levels of objects and bypassing abilities
which are used to get the best performance:

- **Verbs** is a complete high-level generic API
- **Direct Verbs** is a device-specific API
- **DevX** allows accessing firmware objects
- **Direct Rules** manages flow steering at the low-level hardware layer

On Linux, the above interfaces are provided by linking with ``libibverbs`` and ``libmlx5``.
See :ref:`mlx5_linux_prerequisites` for installation.

On Windows, DevX is the only requirement from the above list.
See :ref:`mlx5_windows_prerequisites` for DevX SDK package installation.


.. _mlx5_classes:

Classes
-------

One mlx5 device can be probed by a number of different PMDs.
To select a specific PMD, its name should be specified as a device parameter
(e.g. ``0000:08:00.1,class=eth``).

In order to allow probing by multiple PMDs,
several classes may be listed separated by a colon.
For example: ``class=crypto:regex`` will probe both Crypto and RegEx PMDs.


Supported Classes
~~~~~~~~~~~~~~~~~

- ``class=compress`` for :doc:`../../compressdevs/mlx5`.
- ``class=crypto`` for :doc:`../../cryptodevs/mlx5`.
- ``class=eth`` for :doc:`../../nics/mlx5`.
- ``class=regex`` for :doc:`../../regexdevs/mlx5`.
- ``class=vdpa`` for :doc:`../../vdpadevs/mlx5`.

By default, the mlx5 device will be probed by the ``eth`` PMD.


Limitations
~~~~~~~~~~~

- ``eth`` and ``vdpa`` PMDs cannot be probed at the same time.
  All other combinations are possible.

- On Windows, only ``eth`` and ``crypto`` are supported.
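
As a minimal sketch of the class rules above (Linux case only),
a class list can be checked before it is passed as a device parameter.
The helper name ``check_mlx5_classes`` is hypothetical and not part of DPDK;
the PMDs perform their own validation at probe time.

.. code-block:: shell

   # Hypothetical helper, not part of DPDK: validate a colon-separated
   # class list against the supported classes and the eth/vdpa restriction.
   check_mlx5_classes() {
       has_eth=0
       has_vdpa=0
       for c in $(printf '%s' "$1" | tr ':' ' '); do
           case $c in
               eth) has_eth=1 ;;
               vdpa) has_vdpa=1 ;;
               compress|crypto|regex) ;;
               *) echo "unknown class: $c" >&2; return 1 ;;
           esac
       done
       if [ "$has_eth" -eq 1 ] && [ "$has_vdpa" -eq 1 ]; then
           echo "eth and vdpa cannot be probed at the same time" >&2
           return 1
       fi
       return 0
   }

   check_mlx5_classes "crypto:regex" && echo "crypto:regex: accepted"
   check_mlx5_classes "eth:vdpa" || echo "eth:vdpa: rejected"

The check only mirrors the list and limitation stated in this section;
it does not cover the Windows restriction to ``eth`` and ``crypto``.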


.. _mlx5_common_compilation:

Compilation Prerequisites
-------------------------

.. _mlx5_linux_prerequisites:

Linux Prerequisites
~~~~~~~~~~~~~~~~~~~

This driver relies on external libraries and kernel drivers for resource
allocation and initialization.
The following dependencies are not part of DPDK and must be installed separately:

- **libibverbs**

  User space Verbs framework used by ``librte_common_mlx5``.
  This library provides a generic interface between the kernel
  and low-level user space drivers such as ``libmlx5``.

  It allows slow and privileged operations (context initialization,
  hardware resource allocation) to be managed by the kernel
  and fast operations to never leave user space.

- **libmlx5**

  Low-level user space driver library for Mellanox devices.
  It is automatically loaded by ``libibverbs``.

  This library basically implements send/receive calls to the hardware queues.

- **Kernel modules**

  They provide the kernel-side Verbs API and low-level device drivers
  that manage actual hardware initialization
  and resource sharing with user-space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - ``mlx5_core``: hardware driver managing Mellanox devices
    and related Ethernet kernel network devices.
  - ``mlx5_ib``: InfiniBand device driver.
  - ``ib_uverbs``: user space driver for Verbs (entry point for ``libibverbs``).

- **Firmware update**

  Mellanox OFED/EN releases include firmware updates.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

Libraries and kernel modules can be provided either by the Linux distribution,
or by installing Mellanox OFED/EN which provides compatibility with older kernels.


Upstream Dependencies
^^^^^^^^^^^^^^^^^^^^^

The mlx5 kernel modules are part of upstream Linux.
The minimal supported kernel version is 4.14.
For 32-bit, version 4.14.41 or above is required.

The libraries ``libibverbs`` and ``libmlx5`` are part of ``rdma-core``.
It is packaged by most Linux distributions.
The minimal supported rdma-core version is 16.
For 32-bit, version 18 or above is required.

The rdma-core sources can be downloaded at
https://github.com/linux-rdma/rdma-core

It is possible to build rdma-core as static libraries starting with version 21::

    cd build
    CFLAGS=-fPIC cmake -DIN_PLACE=1 -DENABLE_STATIC=1 -GNinja ..
    ninja
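
The minimal kernel version stated above can be checked on the build or target host.
The following is a sketch only: it assumes a ``sort`` with the ``-V``
(version sort) option, as in GNU coreutils,
and ``version_ge`` is a hypothetical helper name.

.. code-block:: shell

   # version_ge A B: succeed when version A is at least version B.
   # Assumption: "sort -V" (version sort) is available.
   version_ge() {
       [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
   }

   MIN_KERNEL=4.14          # use 4.14.41 on 32-bit hosts
   kernel=$(uname -r)
   if version_ge "$kernel" "$MIN_KERNEL"; then
       echo "kernel $kernel: meets the mlx5 minimum ($MIN_KERNEL)"
   else
       echo "kernel $kernel: older than the mlx5 minimum ($MIN_KERNEL)"
   fi

The same helper can compare any dotted version string,
such as a firmware version, against a minimum.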


Mellanox OFED/EN
^^^^^^^^^^^^^^^^

The kernel modules and libraries are packaged with other tools
in Mellanox OFED or Mellanox EN.
The minimal supported versions are:

- Mellanox OFED version: **4.5** and above.
- Mellanox EN version: **4.5** and above.
- Firmware version:

  - ConnectX-4: **12.21.1000** and above.
  - ConnectX-4 Lx: **14.21.1000** and above.
  - ConnectX-5: **16.21.1000** and above.
  - ConnectX-5 Ex: **16.21.1000** and above.
  - ConnectX-6: **20.27.0090** and above.
  - ConnectX-6 Dx: **22.27.0090** and above.
  - ConnectX-6 Lx: **26.27.0090** and above.
  - ConnectX-7: **28.33.2028** and above.
  - BlueField: **18.25.1010** and above.
  - BlueField-2: **24.28.1002** and above.

The firmware, the libraries libibverbs, libmlx5, and mlnx-ofed-kernel modules
are packaged in `Mellanox OFED
<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_.
After downloading, it can be installed with this command::

   ./mlnxofedinstall --dpdk

`Mellanox EN
<https://network.nvidia.com/products/ethernet-drivers/linux/mlnx_en/>`_
is a smaller package including what is needed for DPDK.
After downloading, it can be installed with this command::

   ./install --dpdk

After installing, the firmware version can be checked::

   ibv_devinfo

.. note::

   Several versions of Mellanox OFED/EN are available. Installing the version
   this DPDK release was developed and tested against is strongly recommended.
   Please check the "Tested Platforms" section in the :doc:`../../rel_notes/index`.


.. _mlx5_windows_prerequisites:

Windows Prerequisites
~~~~~~~~~~~~~~~~~~~~~

The mlx5 PMDs rely on external libraries and kernel drivers
for resource allocation and initialization.


DevX SDK Installation
^^^^^^^^^^^^^^^^^^^^^

The DevX SDK must be installed on the machine building the Windows PMD.
Additional information can be found at
`How to Integrate Windows DevX in Your Development Environment
<https://docs.nvidia.com/networking/display/winof2v260/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`_.
The minimal supported WinOF2 version is 2.60.


Compilation Options
-------------------

Compilation on Linux
~~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

``shared`` (default)
   The PMD depends on some .so files.

``dlopen``
   Split the dependency glue into a separate library
   loaded when needed by ``dlopen`` (see ``MLX5_GLUE_PATH``).
   This makes the dependencies on libibverbs and libmlx5 optional,
   and has no performance impact.

``static``
   Embed the static flavor of the dependencies libibverbs and libmlx5
   in the PMD shared library or the static executable binary.
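
As an illustration, the ``ibverbs_link`` option can be passed to meson
like any other build option.
The sketch below only prints the commands so they can be reviewed first.
It assumes a DPDK source tree with meson and ninja installed,
and the build directory name is a placeholder.

.. code-block:: shell

   # Assumption: the printed commands are run from the top of a DPDK
   # source tree, with meson and ninja installed.
   LINK_MODE=dlopen                # one of: shared (default), dlopen, static
   BUILD_DIR="build-$LINK_MODE"    # placeholder build directory name

   MESON_CMD="meson setup $BUILD_DIR -Dibverbs_link=$LINK_MODE"
   NINJA_CMD="ninja -C $BUILD_DIR"

   # Print the commands; run them once they look right.
   echo "$MESON_CMD"
   echo "$NINJA_CMD"

With ``dlopen``, the glue library location may then need to be provided
at run time through ``MLX5_GLUE_PATH``.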


Compilation on Windows
~~~~~~~~~~~~~~~~~~~~~~

The DevX SDK location must be set through two environment variables:

``DEVX_LIB_PATH``
   path to the DevX lib file.

``DEVX_INC_PATH``
   path to the DevX header files.


.. _mlx5_common_env:

Environment Configuration
-------------------------

Linux Environment
~~~~~~~~~~~~~~~~~

The kernel network interfaces are brought up during initialization.
Forcing them down prevents packet reception.

The ethtool operations on the kernel interfaces may also affect the PMD.

Some runtime behaviours may be configured through environment variables.

``MLX5_GLUE_PATH``
   If built with ``ibverbs_link=dlopen``,
   list of directories in which to search for the rdma-core "glue" plug-in,
   separated by colons or semi-colons.

``MLX5_SHUT_UP_BF``
   If Verbs is used (DevX disabled),
   HW queue doorbell register mapping.
   The value 0 means a regular memory mapping,
   while 1 means a non-cached IO mapping.

   With regular memory mapping, the register is flushed to HW
   usually when the write-combining buffer becomes full,
   but it depends on CPU design.
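
Before starting an application built with ``ibverbs_link=dlopen``,
the directories listed in ``MLX5_GLUE_PATH`` can be checked for existence.
This is a pre-flight sketch only: the helper name ``list_glue_dirs``
and the example directories are hypothetical,
and the code does not reproduce how the PMD itself searches the path.

.. code-block:: shell

   # Hypothetical helper: print each directory of a glue search path on
   # its own line, accepting both ":" and ";" as separators.
   list_glue_dirs() {
       printf '%s\n' "$1" | tr ':;' '\n\n'
   }

   # Placeholder directories, not paths installed by the build.
   MLX5_GLUE_PATH="/opt/dpdk/glue:/usr/local/lib/dpdk-glue"
   export MLX5_GLUE_PATH

   list_glue_dirs "$MLX5_GLUE_PATH" | while IFS= read -r dir; do
       if [ -d "$dir" ]; then
           echo "found:   $dir"
       else
           echo "missing: $dir"
       fi
   done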


Port Link with OFED/EN
^^^^^^^^^^^^^^^^^^^^^^

Port links must be set to Ethernet::

   mlxconfig -d <mst device> query | grep LINK_TYPE
   LINK_TYPE_P1                        ETH(2)
   LINK_TYPE_P2                        ETH(2)

   mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3

Link type values are:

* ``1`` InfiniBand
* ``2`` Ethernet
* ``3`` VPI (auto-sense)

If the link type was changed, the firmware must be reset as well::

   mlxfwreset -d <mst device> reset


.. _mlx5_vf:

SR-IOV Virtual Function with OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SR-IOV must be enabled on the NIC.
It can be checked with the following command::

   mlxconfig -d <mst device> query | grep SRIOV_EN
   SRIOV_EN                            True(1)

If needed, configure SR-IOV::

   mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
   mlxfwreset -d <mst device> reset

After making the change, restart the driver::

   /etc/init.d/openibd restart

or::

   service openibd restart

Then the virtual functions can be instantiated::

   echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs


.. _mlx5_sub_function:

Sub-Function with OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^

A Sub-Function (SF) is a portion of the PCI device
with its own dedicated queues.
An SF shares PCI-level resources with other SFs and/or with its parent PCI function.

0. Requirement::

      OFED version >= 5.4-0.3.3.0

1. Configure SF feature::

      # Run mlxconfig on both PFs on host and ECPFs on BlueField.
      mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12

2. Enable switchdev mode::

      mlxdevm dev eswitch set pci/<DBDF> mode switchdev

3. Add SF port::

      mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>

      Get SFID from output: pci/<DBDF>/<SFID>

4. Modify MAC address::

      mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>

5. Activate SF port::

      mlxdevm port function set pci/<DBDF>/<ID> state active

6. Devargs to probe SF device::

      auxiliary:mlx5_core.sf.<num>,class=eth:regex


Enable Switchdev Mode
^^^^^^^^^^^^^^^^^^^^^

Switchdev mode is an E-Switch mode that binds a representor to a VF or SF.
A representor is a port in DPDK that is connected to a VF or SF in such a way
that, assuming there are no offload flows, each packet sent from the VF or SF
is received by the corresponding representor,
and each packet sent to a representor is received by the VF or SF.

After :ref:`configuring VF <mlx5_vf>`, the device must be unbound::

   printf "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind

Then switchdev mode is enabled::

   echo switchdev > /sys/class/net/<net device>/compat/devlink/mode

The device can be bound again at this point.


Run as Non-Root
^^^^^^^^^^^^^^^

Hugepage and resource limit setup are documented
in the :ref:`common Linux guide <Running_Without_Root_Privileges>`.
This PMD can operate without access to physical addresses,
therefore it does not require ``SYS_ADMIN`` to access ``/proc/self/pagemap``.
Note that this requirement may still come from other drivers.

Below are additional capabilities that must be granted to the application,
along with the reason each one is needed:

``NET_RAW``
   For raw Ethernet queue allocation through the kernel driver.

``NET_ADMIN``
   For device configuration, like setting link status or MTU.

``SYS_RAWIO``
   For using group 1 and above (software steering) in Flow API.

They can be manually granted for a specific executable file::

   setcap cap_net_raw,cap_net_admin,cap_sys_rawio+ep <executable>

Alternatively, a service manager or a container runtime
may configure the capabilities for a process.


Windows Environment
~~~~~~~~~~~~~~~~~~~

WinOF2 version 2.60 or higher must be installed on the machine.


WinOF2 Installation
^^^^^^^^^^^^^^^^^^^

The driver can be downloaded from the following site: `WINOF2
<https://network.nvidia.com/products/adapter-software/ethernet/windows/winof-2/>`_.


DevX Enablement
^^^^^^^^^^^^^^^

DevX for Windows must be enabled in the Windows registry.
The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
Additional information can be found in the WinOF2 user manual.


.. _mlx5_firmware_config:

Firmware Configuration
~~~~~~~~~~~~~~~~~~~~~~

Firmware features can be configured as key/value pairs.

The command to set a value is::

  mlxconfig -d <device> set <key>=<value>

The command to query a value is::

  mlxconfig -d <device> query <key>

The device name for the command ``mlxconfig`` can be either the PCI address,
or the mst device name found with::

  mst status

Some firmware configurations are listed below.

- link type::

    LINK_TYPE_P1
    LINK_TYPE_P2
    value: 1=InfiniBand 2=Ethernet 3=VPI(auto-sense)

- enable SR-IOV::

    SRIOV_EN=1

- the maximum number of SR-IOV virtual functions::

    NUM_OF_VFS=<max>

- enable DevX (required by Direct Rules and other features)::

    UCTX_EN=1

- aggressive CQE zipping::

    CQE_COMPRESSION=1

- L3 VXLAN and VXLAN-GPE destination UDP port::

    IP_OVER_VXLAN_EN=1
    IP_OVER_VXLAN_PORT=<udp dport>

- enable VXLAN-GPE tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0
    or
    FLEX_PARSER_PROFILE_ENABLE=2

- enable IP-in-IP tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0

- enable MPLS flow matching::

    FLEX_PARSER_PROFILE_ENABLE=1

- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) field matching::

    FLEX_PARSER_PROFILE_ENABLE=2

- enable Geneve flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0
   or
   FLEX_PARSER_PROFILE_ENABLE=1

- enable Geneve TLV option flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0

- enable GTP flow matching::

   FLEX_PARSER_PROFILE_ENABLE=3

- enable eCPRI flow matching::

   FLEX_PARSER_PROFILE_ENABLE=4
   PROG_PARSE_GRAPH=1

- enable dynamic flex parser for flex item::

   FLEX_PARSER_PROFILE_ENABLE=4
   PROG_PARSE_GRAPH=1

- enable realtime timestamp format::

   REAL_TIME_CLOCK_ENABLE=1
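
Because several features depend on the same ``FLEX_PARSER_PROFILE_ENABLE`` key,
it can help to look up which profile values enable a given feature.
The sketch below encodes only the values listed above.
The helper name ``flex_profiles_for`` and the feature labels
are hypothetical and are not ``mlxconfig`` keywords.

.. code-block:: shell

   # Hypothetical helper: print the FLEX_PARSER_PROFILE_ENABLE values
   # that enable a matching feature, as listed in this section.
   flex_profiles_for() {
       case $1 in
           vxlan-gpe)       echo "0 2" ;;
           ip-in-ip)        echo "0" ;;
           mpls)            echo "1" ;;
           icmp)            echo "2" ;;
           geneve)          echo "0 1" ;;
           geneve-tlv)      echo "0" ;;
           gtp)             echo "3" ;;
           ecpri|flex-item) echo "4" ;;  # also requires PROG_PARSE_GRAPH=1
           *) echo "unknown feature: $1" >&2; return 1 ;;
       esac
   }

   # Find a profile value that enables both VXLAN-GPE and ICMP matching.
   for p in $(flex_profiles_for vxlan-gpe); do
       for q in $(flex_profiles_for icmp); do
           if [ "$p" = "$q" ]; then
               echo "common profile: $p"
           fi
       done
   done

For the two features in the example, the loop prints ``common profile: 2``.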


.. _mlx5_common_driver_options:

Device Arguments
----------------

The driver can be configured per device.
A single argument list can be used for a device managed by multiple PMDs.
The parameters must be passed through the EAL option ``-a``,
as in the examples below:

- PCI device::

    -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0

- Auxiliary SF::

    -a auxiliary:mlx5_core.sf.2,class=compress,mr_ext_memseg_en=0

Each device class PMD has its own list of specific arguments,
and below are the arguments supported by the common mlx5 layer.

- ``class`` parameter [string]

  Select the classes of the drivers that should probe the device.
  See :ref:`mlx5_classes` for more explanation and details.

  The default value is ``eth``.

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in the MR (Memory Region) lookup table on the
  datapath is minimized, which benefits performance. On the other hand, it
  worsens memory utilization because registered memory is pinned by the kernel
  driver. Even if a page in the extended chunk is freed, it does not become
  reusable until the entire memory is freed.

  Enabled by default.

- ``mr_mempool_reg_en`` parameter [int]

  A nonzero value enables implicit registration of DMA memory of all mempools
  except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
  for mempools populated with non-contiguous objects or those without IOVA.
  The effect is that when a packet from a mempool is transmitted,
  its memory is already registered for DMA in the PMD and no registration
  will happen on the data path. The tradeoff is extra work on the creation
  of each mempool and increased HW resource use if some mempools
  are not used with MLX5 devices.

  Enabled by default.

- ``sys_mem_en`` parameter [int]

  A nonzero value makes the PMD memory management allocate memory
  from the system by default, when no explicit rte memory flag is given.

  By default, the PMD will set this value to 0.

- ``sq_db_nc`` parameter [int]

  The rdma-core library can map the doorbell register in two ways,
  depending on the environment variable ``MLX5_SHUT_UP_BF``:

  - As regular cached memory (usually with write combining attribute),
    if the variable is either missing or set to zero.
  - As non-cached memory, if the variable is present and set to a value other than "0".

  The same doorbell mapping approach is implemented directly by the PMD
  in UAR generation for queues created with DevX.

  The type of mapping may slightly affect the send queue performance.
  The optimal choice depends strongly on the host architecture
  and should be determined empirically.

  If ``sq_db_nc`` is set to zero, the doorbell is forced to be mapped to
  regular memory (with write combining) and the PMD performs an extra write
  memory barrier after writing to the doorbell. This might increase the CPU
  clocks needed per packet sent, but latency might be improved.

  If ``sq_db_nc`` is set to one, the doorbell is forced to be mapped to
  non-cached memory and the PMD does not perform the extra write memory barrier
  after writing to the doorbell. On some architectures this might improve
  performance.

  If ``sq_db_nc`` is set to two, the doorbell is forced to be mapped to
  regular memory and the PMD uses heuristics to decide whether a write memory
  barrier should be performed. For bursts whose size is a multiple of the
  recommended one (64 packets), the next burst is assumed to be coming, so the
  extra memory barrier is not issued (it is expected to be issued in the next
  burst, at least after descriptor writing). This might increase latency (on
  some hosts until the next packets are transmitted) and should be used with care.
  The PMD uses heuristics only for Tx queues. For other send queues, the doorbell
  is forced to be mapped to regular memory, as if ``sq_db_nc`` were set to 0.

  If ``sq_db_nc`` is omitted, the value of the environment variable
  ``MLX5_SHUT_UP_BF`` is used, if it is set. If ``MLX5_SHUT_UP_BF`` is not set,
  the default ``sq_db_nc`` value is zero for ARM64 hosts and one for others.

- ``cmd_fd`` parameter [int]

  File descriptor of ``ibv_context`` created outside the PMD.
  The PMD will use this FD to import the remote CTX. The ``cmd_fd`` is obtained
  from the ``ibv_context->cmd_fd`` member, which must be dup'd before being passed.
  This parameter is valid only if the ``pd_handle`` parameter is specified.

  By default, the PMD will create a new ``ibv_context``.

  .. note::

     When the FD comes from another process, it is the user's responsibility to
     share the FD between the processes (e.g. by ``SCM_RIGHTS``).

- ``pd_handle`` parameter [int]

  Protection domain handle of ``ibv_pd`` created outside the PMD.
  The PMD will use this handle to import the remote PD. The ``pd_handle`` can be
  obtained from the original PD by reading its ``ibv_pd->handle`` member value.
  This parameter is valid only if the ``cmd_fd`` parameter is specified,
  and its value must be a valid kernel handle for a PD object
  in the context represented by the given ``cmd_fd``.

  By default, the PMD will allocate a new PD.

  .. note::

     The ``ibv_pd->handle`` member is different from the ``mlx5dv_pd->pdn`` member.
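
As a closing illustration, the arguments above are combined into
a single comma-separated string after the device address.
The helper name ``mlx5_devargs`` and the PCI address are placeholders,
and the sketch prints a ``dpdk-testpmd`` command line rather than running it.

.. code-block:: shell

   # Hypothetical helper: join a device address and key=value pairs
   # into the string expected by the EAL option "-a".
   mlx5_devargs() {
       args=$1
       shift
       for kv in "$@"; do
           args="$args,$kv"
       done
       printf '%s\n' "$args"
   }

   # Probe one PCI device with the eth and regex PMDs and force the
   # doorbell to be mapped to non-cached memory (sq_db_nc=1).
   DEVARGS=$(mlx5_devargs 0000:03:00.2 class=eth:regex sq_db_nc=1)
   echo "dpdk-testpmd -a $DEVARGS -- -i"

Arguments specific to one class PMD can be appended in the same way,
since a single argument list is shared by all PMDs probing the device.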
678