..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2022 6WIND S.A.
    Copyright (c) 2022 NVIDIA Corporation & Affiliates

.. include:: <isonum.txt>

MLX5 Common Driver
==================

The mlx5 common driver library (**librte_common_mlx5**) provides support for
**Mellanox ConnectX-4**, **Mellanox ConnectX-4 Lx**, **Mellanox ConnectX-5**,
**Mellanox ConnectX-6**, **Mellanox ConnectX-6 Dx**, **Mellanox ConnectX-6 Lx**,
**Mellanox BlueField** and **Mellanox BlueField-2** families of
10/25/40/50/100/200 Gb/s adapters.

Information and documentation for these adapters can be found on the
`NVIDIA website <https://www.nvidia.com/en-us/networking/>`_.
Help is also provided by the
`Mellanox community <http://community.mellanox.com/welcome>`_.
In addition, there is a `web section dedicated to the Poll Mode Driver
<https://developer.nvidia.com/networking/dpdk>`_.


Design
------

For security reasons and to enhance robustness,
this driver only handles virtual memory addresses.
The way resource allocations are handled by the kernel,
combined with hardware specifications that allow handling virtual memory addresses directly,
ensures that DPDK applications cannot access random physical memory
(or memory that does not belong to the current process).

There are different levels of objects and bypassing abilities
which are used to get the best performance:

- **Verbs** is a complete high-level generic API
- **Direct Verbs** is a device-specific API
- **DevX** allows accessing firmware objects
- **Direct Rules** manages flow steering at the low-level hardware layer

On Linux, the above interfaces are provided by linking with ``libibverbs`` and ``libmlx5``.
See :ref:`mlx5_linux_prerequisites` for installation.

On Windows, DevX is the only requirement from the above list.
See :ref:`mlx5_windows_prerequisites` for DevX SDK package installation.


.. _mlx5_classes:

Classes
-------

One mlx5 device can be probed by a number of different PMDs.
To select a specific PMD, its name should be specified as a device parameter
(e.g. ``0000:08:00.1,class=eth``).

In order to allow probing by multiple PMDs,
several classes may be listed separated by a colon.
For example: ``class=crypto:regex`` will probe both Crypto and RegEx PMDs.


Supported Classes
~~~~~~~~~~~~~~~~~

- ``class=compress`` for :doc:`../../compressdevs/mlx5`.
- ``class=crypto`` for :doc:`../../cryptodevs/mlx5`.
- ``class=eth`` for :doc:`../../nics/mlx5`.
- ``class=regex`` for :doc:`../../regexdevs/mlx5`.
- ``class=vdpa`` for :doc:`../../vdpadevs/mlx5`.

By default, the mlx5 device will be probed by the ``eth`` PMD.


Limitations
~~~~~~~~~~~

- ``eth`` and ``vdpa`` PMDs cannot be probed at the same time.
  All other combinations are possible.

- On Windows, only ``eth`` and ``crypto`` are supported.

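The restriction above can be checked before building the device arguments.
The following is a minimal sketch in POSIX shell, not part of DPDK:
the helper name ``mlx5_check_classes`` is hypothetical,
and it only encodes the ``eth``/``vdpa`` rule stated in this section,
without querying the device or the operating system.

.. code-block:: shell

   # Hypothetical helper: validate a colon-separated class list and
   # print the matching devarg. Only the eth/vdpa rule is checked.
   mlx5_check_classes() {
       classes=":$1:"
       case "$classes" in
       *:eth:*)
           case "$classes" in
           *:vdpa:*)
               echo "invalid: eth and vdpa cannot be probed together" >&2
               return 1
               ;;
           esac
           ;;
       esac
       echo "class=$1"
   }

   mlx5_check_classes crypto:regex    # prints class=crypto:regex
   mlx5_check_classes eth:vdpa || echo "rejected"

The printed string can then be appended to the device address,
as in ``0000:08:00.1,class=crypto:regex``.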

.. _mlx5_common_compilation:

Compilation Prerequisites
-------------------------

.. _mlx5_linux_prerequisites:

Linux Prerequisites
~~~~~~~~~~~~~~~~~~~

This driver relies on external libraries and kernel drivers for resource
allocation and initialization.
The following dependencies are not part of DPDK and must be installed separately:

- **libibverbs**

  User space Verbs framework used by ``librte_common_mlx5``.
  This library provides a generic interface between the kernel
  and low-level user space drivers such as ``libmlx5``.

  It allows slow and privileged operations (context initialization,
  hardware resource allocation) to be managed by the kernel
  and fast operations to never leave user space.

- **libmlx5**

  Low-level user space driver library for Mellanox devices.
  It is automatically loaded by ``libibverbs``.

  This library basically implements send/receive calls to the hardware queues.

- **Kernel modules**

  They provide the kernel-side Verbs API and low-level device drivers
  that manage actual hardware initialization
  and resource sharing with user-space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - ``mlx5_core``: hardware driver managing Mellanox devices
    and related Ethernet kernel network devices.
  - ``mlx5_ib``: InfiniBand device driver.
  - ``ib_uverbs``: user space driver for Verbs (entry point for ``libibverbs``).

- **Firmware update**

  Mellanox OFED/EN releases include firmware updates.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

Libraries and kernel modules can be provided either by the Linux distribution,
or by installing Mellanox OFED/EN which provides compatibility with older kernels.

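Since these modules must remain loaded, their presence can be verified
before starting an application. A possible check is sketched below,
assuming the usual ``lsmod`` output format (module name first on each line);
the function name ``mlx5_missing_modules`` is hypothetical,
and drivers built into the kernel are not reported by ``lsmod``.

.. code-block:: shell

   # Hypothetical helper: read "lsmod"-style text on stdin and print
   # each of the three modules listed above that is not present.
   mlx5_missing_modules() {
       loaded=$(cat)
       for mod in mlx5_core mlx5_ib ib_uverbs; do
           printf '%s\n' "$loaded" | grep -q "^$mod " || echo "$mod"
       done
   }

   # Example with sample text where ib_uverbs is absent.
   printf 'mlx5_core 1634304 1 mlx5_ib\nmlx5_ib 397312 0\n' | mlx5_missing_modules

On a real system the function would be fed with ``lsmod | mlx5_missing_modules``;
an empty output means all three modules are loaded.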

Upstream Dependencies
^^^^^^^^^^^^^^^^^^^^^

The mlx5 kernel modules are part of upstream Linux.
The minimal supported kernel version is 4.14.
For 32-bit, version 4.14.41 or above is required.

The libraries ``libibverbs`` and ``libmlx5`` are part of ``rdma-core``.
It is packaged by most Linux distributions.
The minimal supported rdma-core version is 16.
For 32-bit, version 18 or above is required.

The rdma-core sources can be downloaded at
https://github.com/linux-rdma/rdma-core

It is possible to build rdma-core as static libraries starting with version 21::

    cd build
    CFLAGS=-fPIC cmake -DIN_PLACE=1 -DENABLE_STATIC=1 -GNinja ..
    ninja

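The minimal versions given above can be compared with the running system.
The sketch below checks the kernel version only; it assumes a ``sort`` command
supporting ``-V`` (version sort, as in GNU coreutils),
and the function name ``version_ge`` is an illustrative choice.

.. code-block:: shell

   # Succeed when version $1 is greater than or equal to version $2.
   version_ge() {
       [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
   }

   # Strip the distribution suffix, e.g. 5.15.0-91-generic -> 5.15.0
   kernel=$(uname -r | cut -d- -f1)
   if version_ge "$kernel" 4.14; then
       echo "kernel $kernel meets the 4.14 minimum"
   else
       echo "kernel $kernel is older than 4.14"
   fi

The same helper could be applied to the rdma-core version
reported by the distribution package manager.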

Mellanox OFED/EN
^^^^^^^^^^^^^^^^

The kernel modules and libraries are packaged with other tools
in Mellanox OFED or Mellanox EN.
The minimal supported versions are:

- Mellanox OFED version: **4.5** and above.
- Mellanox EN version: **4.5** and above.
- Firmware version:

  - ConnectX-4: **12.21.1000** and above.
  - ConnectX-4 Lx: **14.21.1000** and above.
  - ConnectX-5: **16.21.1000** and above.
  - ConnectX-5 Ex: **16.21.1000** and above.
  - ConnectX-6: **20.27.0090** and above.
  - ConnectX-6 Dx: **22.27.0090** and above.
  - BlueField: **18.25.1010** and above.
  - BlueField-2: **24.28.1002** and above.

The firmware, the libraries libibverbs, libmlx5, and mlnx-ofed-kernel modules
are packaged in `Mellanox OFED
<https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_.
After downloading, it can be installed with this command::

   ./mlnxofedinstall --dpdk

`Mellanox EN
<https://network.nvidia.com/products/ethernet-drivers/linux/mlnx_en/>`_
is a smaller package including what is needed for DPDK.
After downloading, it can be installed with this command::

   ./install --dpdk

After installing, the firmware version can be checked::

   ibv_devinfo

.. note::

   Several versions of Mellanox OFED/EN are available. Installing the version
   this DPDK release was developed and tested against is strongly recommended.
   Please check the "Tested Platforms" section in the :doc:`../../rel_notes/index`.


.. _mlx5_windows_prerequisites:

Windows Prerequisites
~~~~~~~~~~~~~~~~~~~~~

The mlx5 PMDs rely on external libraries and kernel drivers
for resource allocation and initialization.


DevX SDK Installation
^^^^^^^^^^^^^^^^^^^^^

The DevX SDK must be installed on the machine building the Windows PMD.
Additional information can be found at
`How to Integrate Windows DevX in Your Development Environment
<https://docs.nvidia.com/networking/display/winof2v260/RShim+Drivers+and+Usage#RShimDriversandUsage-DevXInterface>`_.
The minimal supported WinOF2 version is 2.60.


Compilation Options
-------------------

Compilation on Linux
~~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

``shared`` (default)
   The PMD depends on some .so files.

``dlopen``
   Split the dependency glue into a separate library
   loaded when needed by dlopen (see ``MLX5_GLUE_PATH``).
   It makes dependencies on libibverbs and libmlx5 optional,
   and has no performance impact.

``static``
   Embed the static flavor of the dependencies libibverbs and libmlx5
   in the PMD shared library or the executable static binary.

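For example, a build using the ``dlopen`` mode could be configured as follows.
This is a sketch assuming the standard DPDK meson workflow,
where ``build`` is an arbitrary build directory name.

.. code-block:: shell

   # Select how libibverbs and libmlx5 are linked, then compile.
   meson setup build -Dibverbs_link=dlopen
   ninja -C build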

Compilation on Windows
~~~~~~~~~~~~~~~~~~~~~~

The DevX SDK location must be set through two environment variables:

``DEVX_LIB_PATH``
   Path to the DevX lib file.

``DEVX_INC_PATH``
   Path to the DevX header files.


.. _mlx5_common_env:

Environment Configuration
-------------------------

Linux Environment
~~~~~~~~~~~~~~~~~

The kernel network interfaces are brought up during initialization.
Forcing them down prevents packet reception.

The ethtool operations on the kernel interfaces may also affect the PMD.

Some runtime behaviours may be configured through environment variables.

``MLX5_GLUE_PATH``
   If built with ``ibverbs_link=dlopen``,
   list of directories in which to search for the rdma-core "glue" plug-in,
   separated by colons or semi-colons.

``MLX5_SHUT_UP_BF``
   If Verbs is used (DevX disabled),
   HW queue doorbell register mapping.
   The value 0 means regular memory mapping
   (usually with the write-combining attribute),
   while 1 means non-cached IO mapping.

   With regular memory mapping, the register is flushed to HW
   usually when the write-combining buffer becomes full,
   but it depends on CPU design.

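As an illustration, the glue search path could be exported
before starting an application built with ``ibverbs_link=dlopen``.
The directories and the PCI address below are placeholders
chosen for this sketch, and ``dpdk-testpmd`` stands for any DPDK application.

.. code-block:: shell

   # Placeholder directories, separated by a colon.
   export MLX5_GLUE_PATH=/opt/dpdk/glue:/usr/local/lib/dpdk-glue
   dpdk-testpmd -a 0000:08:00.1 -- -i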

Port Link with OFED/EN
^^^^^^^^^^^^^^^^^^^^^^

Port links must be set to Ethernet::

   mlxconfig -d <mst device> query | grep LINK_TYPE
   LINK_TYPE_P1                        ETH(2)
   LINK_TYPE_P2                        ETH(2)

   mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3

Link type values are:

* ``1`` InfiniBand
* ``2`` Ethernet
* ``3`` VPI (auto-sense)

If the link type was changed, the firmware must be reset as well::

   mlxfwreset -d <mst device> reset


.. _mlx5_vf:

SR-IOV Virtual Function with OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SR-IOV must be enabled on the NIC.
It can be checked with the following command::

   mlxconfig -d <mst device> query | grep SRIOV_EN
   SRIOV_EN                            True(1)

If needed, configure SR-IOV::

   mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
   mlxfwreset -d <mst device> reset

After making the change, restart the driver::

   /etc/init.d/openibd restart

or::

   service openibd restart

Then the virtual functions can be instantiated::

   echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs


.. _mlx5_sub_function:

Sub-Function with OFED/EN
^^^^^^^^^^^^^^^^^^^^^^^^^

A Sub-Function (SF) is a portion of the PCI device
with its own dedicated queues.
An SF shares PCI-level resources with other SFs and/or with its parent PCI function.

0. Requirement::

      OFED version >= 5.4-0.3.3.0

1. Configure SF feature::

      # Run mlxconfig on both PFs on host and ECPFs on BlueField.
      mlxconfig -d <mst device> set PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12

2. Enable switchdev mode::

      mlxdevm dev eswitch set pci/<DBDF> mode switchdev

3. Add SF port::

      mlxdevm port add pci/<DBDF> flavour pcisf pfnum 0 sfnum <sfnum>

      # Get SFID from output: pci/<DBDF>/<SFID>

4. Modify MAC address::

      mlxdevm port function set pci/<DBDF>/<SFID> hw_addr <MAC>

5. Activate SF port::

      mlxdevm port function set pci/<DBDF>/<SFID> state active

6. Devargs to probe SF device::

      auxiliary:mlx5_core.sf.<num>,class=eth:regex


Enable Switchdev Mode
^^^^^^^^^^^^^^^^^^^^^

Switchdev mode is an E-Switch mode that binds a representor to a VF or SF.
A representor is a DPDK port connected to a VF or SF in such a way that,
assuming there are no offload flows, each packet sent from the VF or SF
is received by the corresponding representor,
while each packet sent to a representor is received by the VF or SF.

After :ref:`configuring VF <mlx5_vf>`, the device must be unbound::

   printf "<device pci address>" > /sys/bus/pci/drivers/mlx5_core/unbind

Then switchdev mode is enabled::

   echo switchdev > /sys/class/net/<net device>/compat/devlink/mode

The device can be bound again at this point.


Run as Non-Root
^^^^^^^^^^^^^^^

In order to run as a non-root user,
some capabilities must be granted to the application::

   setcap cap_sys_admin,cap_net_admin,cap_net_raw,cap_ipc_lock+ep <dpdk-app>

Each capability is needed for the following reason:

``cap_sys_admin``
   When using physical addresses (PA mode), with Linux >= 4.0,
   for access to ``/proc/self/pagemap``.

``cap_net_admin``
   For device configuration.

``cap_net_raw``
   For raw Ethernet queue allocation through the kernel driver.

``cap_ipc_lock``
   For DMA memory pinning.

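The granted capabilities can be verified afterwards.
The sketch below assumes the ``getcap`` utility from libcap
and uses a hypothetical application path.

.. code-block:: shell

   APP=./build/app/dpdk-testpmd    # hypothetical path to the application
   sudo setcap cap_sys_admin,cap_net_admin,cap_net_raw,cap_ipc_lock+ep "$APP"
   # Print the file capabilities now attached to the binary.
   getcap "$APP"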

Windows Environment
~~~~~~~~~~~~~~~~~~~

WinOF2 version 2.60 or higher must be installed on the machine.


WinOF2 Installation
^^^^^^^^^^^^^^^^^^^

The driver can be downloaded from the following site: `WINOF2
<https://network.nvidia.com/products/adapter-software/ethernet/windows/winof-2/>`_.


DevX Enablement
^^^^^^^^^^^^^^^

DevX for Windows must be enabled in the Windows registry.
The keys ``DevxEnabled`` and ``DevxFsRules`` must be set.
Additional information can be found in the WinOF2 user manual.


.. _mlx5_firmware_config:

Firmware Configuration
~~~~~~~~~~~~~~~~~~~~~~

Firmware features can be configured as key/value pairs.

The command to set a value is::

  mlxconfig -d <device> set <key>=<value>

The command to query a value is::

  mlxconfig -d <device> query <key>

The device name for the command ``mlxconfig`` can be either the PCI address,
or the mst device name found with::

  mst status

Some firmware configurations are listed below.

- link type::

    LINK_TYPE_P1
    LINK_TYPE_P2
    value: 1=InfiniBand 2=Ethernet 3=VPI(auto-sense)

- enable SR-IOV::

    SRIOV_EN=1

- the maximum number of SR-IOV virtual functions::

    NUM_OF_VFS=<max>

- enable DevX (required by Direct Rules and other features)::

    UCTX_EN=1

- aggressive CQE zipping::

    CQE_COMPRESSION=1

- L3 VXLAN and VXLAN-GPE destination UDP port::

    IP_OVER_VXLAN_EN=1
    IP_OVER_VXLAN_PORT=<udp dport>

- enable VXLAN-GPE tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0
    or
    FLEX_PARSER_PROFILE_ENABLE=2

- enable IP-in-IP tunnel flow matching::

    FLEX_PARSER_PROFILE_ENABLE=0

- enable MPLS flow matching::

    FLEX_PARSER_PROFILE_ENABLE=1

- enable ICMP(code/type/identifier/sequence number) / ICMP6(code/type) fields matching::

    FLEX_PARSER_PROFILE_ENABLE=2

- enable Geneve flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0
   or
   FLEX_PARSER_PROFILE_ENABLE=1

- enable Geneve TLV option flow matching::

   FLEX_PARSER_PROFILE_ENABLE=0

- enable GTP flow matching::

   FLEX_PARSER_PROFILE_ENABLE=3

- enable eCPRI flow matching::

   FLEX_PARSER_PROFILE_ENABLE=4
   PROG_PARSE_GRAPH=1

- enable dynamic flex parser for flex item::

   FLEX_PARSER_PROFILE_ENABLE=4
   PROG_PARSE_GRAPH=1

- enable realtime timestamp format::

   REAL_TIME_CLOCK_ENABLE=1

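As an example, several of the keys above can be combined in one sequence.
This sketch uses a placeholder PCI address, and assumes that a firmware
reset is needed for the new values to take effect,
as in the link type and SR-IOV procedures.

.. code-block:: shell

   DEV=0000:08:00.0    # placeholder; a PCI address or an mst device name
   # Enable DevX and select the flex parser profile used for GTP matching.
   mlxconfig -d "$DEV" set UCTX_EN=1 FLEX_PARSER_PROFILE_ENABLE=3
   mlxfwreset -d "$DEV" reset
   # Read back one of the values.
   mlxconfig -d "$DEV" query FLEX_PARSER_PROFILE_ENABLE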

.. _mlx5_common_driver_options:

Device Arguments
----------------

The driver can be configured per device.
A single argument list can be used for a device managed by multiple PMDs.
The parameters must be passed through the EAL option ``-a``,
as in the examples below:

- PCI device::

     -a 0000:03:00.2,class=eth:regex,mr_mempool_reg_en=0

- Auxiliary SF::

     -a auxiliary:mlx5_core.sf.2,class=compress,mr_ext_memseg_en=0

Each device class PMD has its own list of specific arguments,
and below are the arguments supported by the common mlx5 layer.

- ``class`` parameter [string]

  Select the classes of the drivers that should probe the device.
  See :ref:`mlx5_classes` for more explanation and details.

  The default value is ``eth``.

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in the MR (Memory Region) lookup table on the
  datapath is minimized, which benefits performance. On the other hand, it
  worsens memory utilization because registered memory is pinned by the kernel
  driver. Even if a page in the extended chunk is freed, it does not become
  reusable until the entire memory is freed.

  Enabled by default.

- ``mr_mempool_reg_en`` parameter [int]

  A nonzero value enables implicit registration of DMA memory of all mempools
  except those having ``RTE_MEMPOOL_F_NON_IO``. This flag is set automatically
  for mempools populated with non-contiguous objects or those without IOVA.
  The effect is that when a packet from a mempool is transmitted,
  its memory is already registered for DMA in the PMD and no registration
  will happen on the data path. The tradeoff is extra work on the creation
  of each mempool and increased HW resource use if some mempools
  are not used with MLX5 devices.

  Enabled by default.

- ``sys_mem_en`` parameter [int]

  A nonzero value makes the PMD memory management allocate memory
  from the system by default, when no explicit rte memory flag is given.

  By default, the PMD will set this value to 0.

- ``sq_db_nc`` parameter [int]

  The rdma-core library can map the doorbell register in two ways,
  depending on the environment variable ``MLX5_SHUT_UP_BF``:

  - As regular cached memory (usually with write combining attribute),
    if the variable is either missing or set to zero.
  - As non-cached memory, if the variable is present and set to a value
    other than ``0``.

  The same doorbell mapping approach is implemented directly by the PMD
  in UAR generation for queues created with DevX.

  The type of mapping may slightly affect the send queue performance.
  The optimal choice depends strongly on the host architecture
  and should be determined experimentally.

  If ``sq_db_nc`` is set to zero, the doorbell is forced to be mapped to
  regular memory (with write combining), and the PMD performs an extra write
  memory barrier after writing to the doorbell. This might increase the CPU
  clocks needed per packet sent, but latency might be improved.

  If ``sq_db_nc`` is set to one, the doorbell is forced to be mapped to
  non-cached memory, and the PMD does not perform the extra write memory
  barrier after writing to the doorbell. On some architectures this might
  improve performance.

  If ``sq_db_nc`` is set to two, the doorbell is forced to be mapped to
  regular memory, and the PMD uses heuristics to decide whether a write memory
  barrier should be performed. For bursts whose size is a multiple of the
  recommended one (64 packets), the next burst is assumed to be coming, so the
  extra memory barrier is not issued (it is expected to be issued in the next
  burst, at least after descriptor writing). This might increase latency (on
  some hosts until the next packets are transmitted) and should be used with care.
  The PMD uses heuristics only for Tx queues. For other send queues, the doorbell
  is forced to be mapped to regular memory, as when ``sq_db_nc`` is set to 0.

  If ``sq_db_nc`` is omitted, the value of the ``MLX5_SHUT_UP_BF`` environment
  variable is used, if it is set. If ``MLX5_SHUT_UP_BF`` is not set, the
  default ``sq_db_nc`` value is zero for ARM64 hosts and one for others.

- ``cmd_fd`` parameter [int]

  File descriptor of ``ibv_context`` created outside the PMD.
  PMD will use this FD to import remote CTX. The ``cmd_fd`` is obtained from
  the ``ibv_context->cmd_fd`` member, which must be dup'd before being passed.
  This parameter is valid only if ``pd_handle`` parameter is specified.

  By default, the PMD will create a new ``ibv_context``.

  .. note::

     When the FD comes from another process, it is the user's responsibility to
     share the FD between the processes (e.g. by ``SCM_RIGHTS``).

- ``pd_handle`` parameter [int]

  Protection domain handle of ``ibv_pd`` created outside the PMD.
  PMD will use this handle to import remote PD. The ``pd_handle`` can be
  obtained from the original PD by reading its ``ibv_pd->handle`` member.
  This parameter is valid only if ``cmd_fd`` parameter is specified,
  and its value must be a valid kernel handle for a PD object
  in the context represented by given ``cmd_fd``.

  By default, the PMD will allocate a new PD.

  .. note::

     The ``ibv_pd->handle`` member is different from the ``mlx5dv_pd->pdn`` member.

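Several common arguments can be combined in a single device argument list.
The command below is only a sketch: the PCI address is taken from the
examples above and ``dpdk-testpmd`` stands for any DPDK application.

.. code-block:: shell

   # Probe with the eth PMD, disable memseg extension
   # and force non-cached doorbell mapping.
   dpdk-testpmd -a 0000:03:00.2,class=eth,mr_ext_memseg_en=0,sq_db_nc=1 -- -i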