..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2012 6WIND S.A.
    Copyright 2015 Mellanox Technologies, Ltd

MLX4 poll mode driver library
=============================

The MLX4 poll mode driver library (**librte_net_mlx4**) implements support
for **Mellanox ConnectX-3** and **Mellanox ConnectX-3 Pro** 10/40 Gbps adapters
as well as their virtual functions (VF) in SR-IOV context.

Information and documentation about this family of adapters can be found on
the `Mellanox website <http://www.mellanox.com>`_. Help is also provided by
the `Mellanox community <http://community.mellanox.com/welcome>`_.

There is also a `section dedicated to this poll mode driver
<http://www.mellanox.com/page/products_dyn?product_family=209&mtag=pmd_for_dpdk>`_.


Implementation details
----------------------

Most Mellanox ConnectX-3 devices provide two ports but expose a single PCI
bus address; thus, unlike most drivers, librte_net_mlx4 registers itself as a
PCI driver that allocates one Ethernet device per detected port.

For this reason, one cannot block (or allow) a single port without also
blocking (or allowing) the others on the same device.

Besides its dependency on libibverbs (which implies libmlx4 and associated
kernel support), librte_net_mlx4 relies heavily on system calls for control
operations such as querying/updating the MTU and flow control parameters.

For security and robustness reasons, this driver only deals with virtual
memory addresses. The way resource allocations are handled by the kernel,
combined with hardware specifications that allow it to handle virtual memory
addresses directly, ensures that DPDK applications cannot access random
physical memory (or memory that does not belong to the current process).

This capability allows the PMD to coexist with kernel network interfaces
which remain functional, although they stop receiving unicast packets as
long as they share the same MAC address.

The :ref:`flow_isolated_mode` is supported.
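
As a hedged example, flow isolated mode can be exercised from **testpmd**
before the port is started (port number is illustrative)::

   testpmd> flow isolate 0 true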

Compiling librte_net_mlx4 causes DPDK to be linked against libibverbs.

Configuration
-------------

Compilation options
~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option (see the example after this
list):

- ``shared`` (default): the PMD depends on some .so files.

- ``dlopen``: Split the dependencies glue in a separate library
  loaded when needed by dlopen.
  It makes the dependencies on libibverbs and libmlx4 optional,
  and has no performance impact.

- ``static``: Embed the static flavor of the dependencies libibverbs and
  libmlx4 in the PMD shared library or the executable static binary.
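
For example, a minimal Meson configuration selecting the ``dlopen`` flavor
(build directory name is illustrative)::

   meson setup build -Dibverbs_link=dlopen
   ninja -C build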


Environment variables
~~~~~~~~~~~~~~~~~~~~~

- ``MLX4_GLUE_PATH``

  A list of directories in which to search for the rdma-core "glue" plug-in,
  separated by colons or semi-colons.
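
  For example, assuming a hypothetical plug-in location::

      export MLX4_GLUE_PATH=/usr/local/lib/dpdk-glue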


Run-time configuration
~~~~~~~~~~~~~~~~~~~~~~

- librte_net_mlx4 brings kernel network interfaces up during initialization
  because it is affected by their state. Forcing them down prevents packet
  reception.
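
  If an interface was forced down, it can be brought back up with a standard
  ``ip`` command (interface name is illustrative)::

      ip link set eth2 up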

- **ethtool** operations on related kernel interfaces also affect the PMD.

- ``port`` parameter [int]

  This parameter provides a physical port to probe and can be specified multiple
  times for additional ports. All ports are probed by default if left
  unspecified.
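
  For example, restricting probing to a single physical port through device
  arguments (PCI address and port index are illustrative)::

      dpdk-testpmd -a 0000:83:00.0,port=0 -- -i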

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in the MR (Memory Region) lookup table on the
  datapath is minimized, which benefits performance. On the other hand, it
  worsens memory utilization because registered memory is pinned by the kernel
  driver. Even if a page in the extended chunk is freed, it does not become
  reusable until the entire memory is freed.

  Enabled by default.
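
  For example, disabling it through device arguments (PCI address is
  illustrative)::

      dpdk-testpmd -a 0000:83:00.0,mr_ext_memseg_en=0 -- -i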

Kernel module parameters
~~~~~~~~~~~~~~~~~~~~~~~~

The **mlx4_core** kernel module has several parameters that affect the
behavior and/or the performance of librte_net_mlx4. Some of them are described
below.

- **num_vfs** (integer or triplet, optionally prefixed by device address
  strings)

  Create the given number of VFs on the specified devices.
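
  For example, a hypothetical ``/etc/modprobe.d/mlx4_core.conf`` entry
  creating four VFs::

      options mlx4_core num_vfs=4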

- **log_num_mgm_entry_size** (integer)

  Device-managed flow steering (DMFS) is required by DPDK applications. It is
  enabled by using a negative value, the last four bits of which have a
  special meaning.

  - **-1**: force device-managed flow steering (DMFS).
  - **-7**: configure optimized steering mode to improve performance with the
    following limitation: VLAN filtering is not supported with this mode.
    This is the recommended mode when VLAN filtering is not needed.

Limitations
-----------

- For secondary processes:

  - Forked secondary processes are not supported.
  - External memory unregistered in the EAL memseg list cannot be used for DMA
    unless such memory has been registered by ``mlx4_mr_update_ext_mp()`` in
    the primary process and remapped to the same virtual address in the
    secondary process. If the external memory is registered by the primary
    process but has a different virtual address in the secondary process,
    unexpected errors may happen.

- CRC stripping is supported by default and always reported as "true".
  The ability to enable/disable CRC stripping requires OFED version
  4.3-1.5.0.0 and above or rdma-core version v18 and above.

- TSO (Transmit Segmentation Offload) is supported in OFED version
  4.4 and above.
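
  As a hedged example, TSO can be enabled on a port from **testpmd**
  (segment size and port number are illustrative)::

      testpmd> tso set 1460 0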

Prerequisites
-------------

This driver relies on external libraries and kernel drivers for resource
allocation and initialization. The following dependencies are not part of
DPDK and must be installed separately:

- **libibverbs** (provided by the rdma-core package)

  User space verbs framework used by librte_net_mlx4. This library provides
  a generic interface between the kernel and low-level user space drivers
  such as libmlx4.

  It allows slow and privileged operations (context initialization, hardware
  resource allocation) to be managed by the kernel and fast operations to
  never leave user space.

- **libmlx4** (provided by the rdma-core package)

  Low-level user space driver library for Mellanox ConnectX-3 devices;
  it is automatically loaded by libibverbs.

  This library basically implements send/receive calls to the hardware
  queues.

- **Kernel modules**

  They provide the kernel-side verbs API and low level device drivers that
  manage actual hardware initialization and resource sharing with user
  space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - mlx4_core: hardware driver managing Mellanox ConnectX-3 devices.
  - mlx4_en: Ethernet device driver that provides kernel network interfaces.
  - mlx4_ib: InfiniBand device driver.
  - ib_uverbs: user space driver for verbs (entry point for libibverbs).

- **Firmware update**

  Mellanox OFED releases include firmware updates for ConnectX-3 adapters.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

.. note::

   Both libraries are BSD and GPL licensed. Linux kernel modules are GPL
   licensed.

Depending on system constraints and user preferences, install either the RDMA
core library with a recent enough Linux kernel release (recommended) or
Mellanox OFED, which provides compatibility with older releases.

Current RDMA core package and Linux kernel (recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Minimal Linux kernel version: 4.14.
- Minimal RDMA core version: v15 (see `RDMA core installation documentation`_).

- Starting with rdma-core v21, static libraries can be built::

    cd build
    CFLAGS=-fPIC cmake -DIN_PLACE=1 -DENABLE_STATIC=1 -GNinja ..
    ninja

.. _`RDMA core installation documentation`: https://raw.githubusercontent.com/linux-rdma/rdma-core/master/README.md

.. _Mellanox_OFED_as_a_fallback:

Mellanox OFED as a fallback
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- `Mellanox OFED`_ version: **4.4, 4.5, 4.6**.
- firmware version: **2.42.5000** and above.

.. _`Mellanox OFED`: http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers

.. note::

   Several versions of Mellanox OFED are available. Installing the version
   this DPDK release was developed and tested against is strongly
   recommended. Please check the `prerequisites`_.

Installing Mellanox OFED
^^^^^^^^^^^^^^^^^^^^^^^^

1. Download the latest Mellanox OFED.

2. Install the required libraries and kernel modules either by installing
   only the required set, or by installing the entire Mellanox OFED:

   For bare metal use::

        ./mlnxofedinstall --dpdk --upstream-libs

   For SR-IOV hypervisors use::

        ./mlnxofedinstall --dpdk --upstream-libs --enable-sriov --hypervisor

   For SR-IOV virtual machines use::

        ./mlnxofedinstall --dpdk --upstream-libs --guest

3. Verify the firmware is the correct one::

        ibv_devinfo

4. Set all port links to Ethernet, following the instructions on the screen::

        connectx_port_config

5. Continue with :ref:`section 2 of the Quick Start Guide <QSG_2>`.

.. _qsg:

Quick Start Guide
-----------------

1. Set all port links to Ethernet::

        PCI=<NIC PCI address>
        echo eth > "/sys/bus/pci/devices/$PCI/mlx4_port0"
        echo eth > "/sys/bus/pci/devices/$PCI/mlx4_port1"

   .. note::

        If using Mellanox OFED, the port links can be set to Ethernet
        permanently with the connectx_port_config tool it provides.
        See :ref:`Mellanox_OFED_as_a_fallback`.

.. _QSG_2:

2. In case of bare metal or hypervisor, configure optimized steering mode
   by adding the following line to ``/etc/modprobe.d/mlx4_core.conf``::

        options mlx4_core log_num_mgm_entry_size=-7

   .. note::

        If VLAN filtering is used, set ``log_num_mgm_entry_size=-1``.
        Performance degradation can occur in this case.

3. Restart the driver::

        /etc/init.d/openibd restart

   or::

        service openibd restart

4. Install DPDK and you are ready to go.
   See :doc:`compilation instructions <../linux_gsg/build_dpdk>`.

Performance tuning
------------------

1. Verify the optimized steering mode is configured::

        cat /sys/module/mlx4_core/parameters/log_num_mgm_entry_size

2. For better performance, use CPUs on the NUMA node to which the PCIe
   adapter is connected. For VMs, verify that the right CPUs
   and NUMA node are pinned according to the above. Run::

        lstopo-no-graphics

   to identify the NUMA node to which the PCIe adapter is connected.
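
   The NUMA node can also be read directly from sysfs (PCI address is
   illustrative)::

        cat /sys/bus/pci/devices/0000:83:00.0/numa_node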

3. If more than one adapter is used, and root complex capabilities allow
   putting both adapters on the same NUMA node without PCI bandwidth
   degradation, it is recommended to locate both adapters on the same NUMA
   node, in order to forward packets from one to the other without a NUMA
   performance penalty.

4. Disable pause frames::

        ethtool -A <netdev> rx off tx off

5. Verify IO non-posted prefetch is disabled by default. This can be checked
   via the BIOS configuration. Please contact your server vendor for more
   information about the settings.

.. note::

        On some machines, depending on the machine integrator, it is
        beneficial to set the PCI max read request parameter to 1K. This can
        be done in the following way:

        To query the read request size use::

                setpci -s <NIC PCI address> 68.w

        If the output is different than 3XXX, set it by::

                setpci -s <NIC PCI address> 68.w=3XXX

        The XXX can be different on different systems. Make sure to configure
        according to the setpci output.

6. To minimize the overhead of searching Memory Regions (see the example
   after this list):

   - ``--socket-mem`` is recommended to reserve a predictable amount of
     memory.
   - Configure a per-lcore cache when creating mempools for packet buffers.
   - Refrain from dynamically allocating/freeing memory at run time.
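
   For example, a hedged **testpmd** invocation combining these
   recommendations (core list, memory amount, PCI address and cache size are
   illustrative)::

        dpdk-testpmd -l 8-15 -n 4 --socket-mem 2048 -a 0000:83:00.0 -- --mbcache=512 -i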

Usage example
-------------

This section demonstrates how to launch **testpmd** with Mellanox ConnectX-3
devices managed by librte_net_mlx4.

#. Load the kernel modules::

      modprobe -a ib_uverbs mlx4_en mlx4_core mlx4_ib

   Alternatively, if MLNX_OFED is fully installed, the following script can
   be run::

      /etc/init.d/openibd restart

   .. note::

      User space I/O kernel modules (uio and igb_uio) are not used and do
      not have to be loaded.

#. Make sure Ethernet interfaces are in working order and linked to kernel
   verbs. Related sysfs entries should be present::

      ls -d /sys/class/net/*/device/infiniband_verbs/uverbs* | cut -d / -f 5

   Example output::

      eth2
      eth3
      eth4
      eth5

#. Optionally, retrieve their PCI bus addresses to be used with the allow argument::

      {
          for intf in eth2 eth3 eth4 eth5;
          do
              (cd "/sys/class/net/${intf}/device/" && pwd -P);
          done;
      } |
      sed -n 's,.*/\(.*\),-a \1,p'

   Example output::

      -a 0000:83:00.0
      -a 0000:83:00.0
      -a 0000:84:00.0
      -a 0000:84:00.0

   .. note::

      There are only two distinct PCI bus addresses because the Mellanox
      ConnectX-3 adapters installed on this system are dual port.

#. Request huge pages::

      dpdk-hugepages.py --setup 2G

#. Start testpmd with basic parameters::

      dpdk-testpmd -l 8-15 -n 4 -a 0000:83:00.0 -a 0000:84:00.0 -- --rxq=2 --txq=2 -i

   Example output::

      [...]
      EAL: PCI device 0000:83:00.0 on NUMA socket 1
      EAL:   probe driver: 15b3:1007 librte_net_mlx4
      PMD: librte_net_mlx4: PCI information matches, using device "mlx4_0" (VF: false)
      PMD: librte_net_mlx4: 2 port(s) detected
      PMD: librte_net_mlx4: port 1 MAC address is 00:02:c9:b5:b7:50
      PMD: librte_net_mlx4: port 2 MAC address is 00:02:c9:b5:b7:51
      EAL: PCI device 0000:84:00.0 on NUMA socket 1
      EAL:   probe driver: 15b3:1007 librte_net_mlx4
      PMD: librte_net_mlx4: PCI information matches, using device "mlx4_1" (VF: false)
      PMD: librte_net_mlx4: 2 port(s) detected
      PMD: librte_net_mlx4: port 1 MAC address is 00:02:c9:b5:ba:b0
      PMD: librte_net_mlx4: port 2 MAC address is 00:02:c9:b5:ba:b1
      Interactive-mode selected
      Configuring Port 0 (socket 0)
      PMD: librte_net_mlx4: 0x867d60: TX queues number update: 0 -> 2
      PMD: librte_net_mlx4: 0x867d60: RX queues number update: 0 -> 2
      Port 0: 00:02:C9:B5:B7:50
      Configuring Port 1 (socket 0)
      PMD: librte_net_mlx4: 0x867da0: TX queues number update: 0 -> 2
      PMD: librte_net_mlx4: 0x867da0: RX queues number update: 0 -> 2
      Port 1: 00:02:C9:B5:B7:51
      Configuring Port 2 (socket 0)
      PMD: librte_net_mlx4: 0x867de0: TX queues number update: 0 -> 2
      PMD: librte_net_mlx4: 0x867de0: RX queues number update: 0 -> 2
      Port 2: 00:02:C9:B5:BA:B0
      Configuring Port 3 (socket 0)
      PMD: librte_net_mlx4: 0x867e20: TX queues number update: 0 -> 2
      PMD: librte_net_mlx4: 0x867e20: RX queues number update: 0 -> 2
      Port 3: 00:02:C9:B5:BA:B1
      Checking link statuses...
      Port 0 Link Up - speed 10000 Mbps - full-duplex
      Port 1 Link Up - speed 40000 Mbps - full-duplex
      Port 2 Link Up - speed 10000 Mbps - full-duplex
      Port 3 Link Up - speed 40000 Mbps - full-duplex
      Done
      testpmd>
453