..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2012 6WIND S.A.
    Copyright 2015 Mellanox Technologies, Ltd

MLX4 poll mode driver library
=============================

The MLX4 poll mode driver library (**librte_net_mlx4**) implements support
for **Mellanox ConnectX-3** and **Mellanox ConnectX-3 Pro** 10/40 Gbps adapters
as well as their virtual functions (VF) in SR-IOV context.

Information and documentation about this family of adapters can be found on
the `Mellanox website <http://www.mellanox.com>`_. Help is also provided by
the `Mellanox community <http://community.mellanox.com/welcome>`_.

There is also a `section dedicated to this poll mode driver
<http://www.mellanox.com/page/products_dyn?product_family=209&mtag=pmd_for_dpdk>`_.


Implementation details
----------------------

Most Mellanox ConnectX-3 devices provide two ports but expose a single PCI
bus address, thus unlike most drivers, librte_net_mlx4 registers itself as a
PCI driver that allocates one Ethernet device per detected port.

For this reason, one cannot block (or allow) a single port without also
blocking (or allowing) the others on the same device.

Besides its dependency on libibverbs (that implies libmlx4 and associated
kernel support), librte_net_mlx4 relies heavily on system calls for control
operations such as querying/updating the MTU and flow control parameters.

For security reasons and robustness, this driver only deals with virtual
memory addresses. The way resource allocations are handled by the kernel,
combined with hardware specifications that allow it to handle virtual memory
addresses directly, ensures that DPDK applications cannot access random
physical memory (or memory that does not belong to the current process).

This capability allows the PMD to coexist with kernel network interfaces
which remain functional, although they stop receiving unicast packets as
long as they share the same MAC address.

The :ref:`flow_isolated_mode` is supported.

Compiling librte_net_mlx4 causes DPDK to be linked against libibverbs.

Configuration
-------------

Compilation options
~~~~~~~~~~~~~~~~~~~

The ibverbs libraries can be linked with this PMD in a number of ways,
configured by the ``ibverbs_link`` build option:

- ``shared`` (default): the PMD depends on some .so files.

- ``dlopen``: Split the dependencies glue into a separate library
  loaded when needed by dlopen.
  It makes the dependencies on libibverbs and libmlx4 optional,
  and has no performance impact.

- ``static``: Embed the static flavor of the dependencies libibverbs and
  libmlx4 in the PMD shared library or the executable static binary.


Environment variables
~~~~~~~~~~~~~~~~~~~~~

- ``MLX4_GLUE_PATH``

  A list of directories in which to search for the rdma-core "glue" plug-in,
  separated by colons or semi-colons.


Run-time configuration
~~~~~~~~~~~~~~~~~~~~~~

- librte_net_mlx4 brings kernel network interfaces up during initialization
  because it is affected by their state. Forcing them down prevents packet
  reception.

- **ethtool** operations on related kernel interfaces also affect the PMD.

- ``port`` parameter [int]

  This parameter provides a physical port to probe and can be specified
  multiple times for additional ports. All ports are probed by default if left
  unspecified.

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in the MR (Memory Region) lookup table on the
  datapath is minimized, which benefits performance.
  On the other hand, it worsens memory
  utilization because registered memory is pinned by the kernel driver. Even if
  a page in the extended chunk is freed, it does not become reusable until the
  entire memory is freed.

  Enabled by default.

Kernel module parameters
~~~~~~~~~~~~~~~~~~~~~~~~

The **mlx4_core** kernel module has several parameters that affect the
behavior and/or the performance of librte_net_mlx4. Some of them are described
below.

- **num_vfs** (integer or triplet, optionally prefixed by device address
  strings)

  Create the given number of VFs on the specified devices.

- **log_num_mgm_entry_size** (integer)

  Device-managed flow steering (DMFS) is required by DPDK applications. It is
  enabled by using a negative value, the last four bits of which have a
  special meaning.

  - **-1**: force device-managed flow steering (DMFS).
  - **-7**: configure optimized steering mode to improve performance with the
    following limitation: VLAN filtering is not supported with this mode.
    This is the recommended mode in case VLAN filter is not needed.

Limitations
-----------

- For secondary process:

  - Forked secondary process not supported.
  - External memory unregistered in EAL memseg list cannot be used for DMA
    unless such memory has been registered by ``mlx4_mr_update_ext_mp()`` in
    primary process and remapped to the same virtual address in secondary
    process. If the external memory is registered by primary process but has
    a different virtual address in secondary process, unexpected errors may
    happen.

- CRC stripping is supported by default and always reported as "true".
  The ability to enable/disable CRC stripping requires OFED version
  4.3-1.5.0.0 and above or rdma-core version v18 and above.

- TSO (Transmit Segmentation Offload) is supported in OFED version
  4.4 and above.
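Several of the limitations above depend on the installed rdma-core (libibverbs) version. As a quick, best-effort check — this snippet is a sketch, not part of the original guide, and package names are assumptions that vary by distribution — the installed version can be queried as follows:

```shell
#!/bin/sh
# Best-effort query of the installed rdma-core/libibverbs version.
# Tries pkg-config first, then the Debian and RPM package databases;
# each probe is optional and failures are silenced.
version=$(pkg-config --modversion libibverbs 2>/dev/null)
if [ -z "$version" ]; then
    version=$(dpkg-query -W -f '${Version}' rdma-core 2>/dev/null)
fi
if [ -z "$version" ]; then
    version=$(rpm -q --qf '%{VERSION}' rdma-core 2>/dev/null) || version=""
fi
echo "rdma-core/libibverbs version: ${version:-not found}"
```

The reported version can then be compared against the minimums stated above (v18 for CRC stripping control, v15 as the overall minimum listed under Prerequisites).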

Prerequisites
-------------

This driver relies on external libraries and kernel drivers for resource
allocation and initialization. The following dependencies are not part of
DPDK and must be installed separately:

- **libibverbs** (provided by rdma-core package)

  User space verbs framework used by librte_net_mlx4. This library provides
  a generic interface between the kernel and low-level user space drivers
  such as libmlx4.

  It allows slow and privileged operations (context initialization, hardware
  resources allocations) to be managed by the kernel and fast operations to
  never leave user space.

- **libmlx4** (provided by rdma-core package)

  Low-level user space driver library for Mellanox ConnectX-3 devices,
  it is automatically loaded by libibverbs.

  This library basically implements send/receive calls to the hardware
  queues.

- **Kernel modules**

  They provide the kernel-side verbs API and low level device drivers that
  manage actual hardware initialization and resources sharing with user
  space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - mlx4_core: hardware driver managing Mellanox ConnectX-3 devices.
  - mlx4_en: Ethernet device driver that provides kernel network interfaces.
  - mlx4_ib: InfiniBand device driver.
  - ib_uverbs: user space driver for verbs (entry point for libibverbs).

- **Firmware update**

  Mellanox OFED releases include firmware updates for ConnectX-3 adapters.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

.. note::

   Both libraries are BSD and GPL licensed. Linux kernel modules are GPL
   licensed.
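As a quick sanity check — a sketch, not part of the original guide — whether the kernel modules listed above are currently loaded can be verified by reading ``/proc/modules`` directly, which works even where the lsmod tool is unavailable:

```shell
#!/bin/sh
# Report the load state of each required mlx4/verbs kernel module.
# /proc/modules lists loaded modules one per line, name first.
for mod in mlx4_core mlx4_en mlx4_ib ib_uverbs; do
    if grep -q "^${mod} " /proc/modules 2>/dev/null; then
        echo "${mod}: loaded"
    else
        echo "${mod}: not loaded"
    fi
done
```

A module reported as not loaded can be loaded with ``modprobe``, as shown in the usage example later in this guide.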

Depending on system constraints and user preferences, either the RDMA core
library with a recent enough Linux kernel release (recommended) or Mellanox
OFED, which provides compatibility with older releases, can be installed.

Current RDMA core package and Linux kernel (recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Minimal Linux kernel version: 4.14.
- Minimal RDMA core version: v15 (see `RDMA core installation documentation`_).

- Starting with rdma-core v21, static libraries can be built::

    cd build
    CFLAGS=-fPIC cmake -DIN_PLACE=1 -DENABLE_STATIC=1 -GNinja ..
    ninja

.. _`RDMA core installation documentation`: https://raw.githubusercontent.com/linux-rdma/rdma-core/master/README.md

.. _Mellanox_OFED_as_a_fallback:

Mellanox OFED as a fallback
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- `Mellanox OFED`_ version: **4.4, 4.5, 4.6**.
- firmware version: **2.42.5000** and above.

.. _`Mellanox OFED`: http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers

.. note::

   Several versions of Mellanox OFED are available. Installing the version
   this DPDK release was developed and tested against is strongly
   recommended. Please check the `prerequisites`_.

Installing Mellanox OFED
^^^^^^^^^^^^^^^^^^^^^^^^

1. Download latest Mellanox OFED.

2. Install the required libraries and kernel modules either by installing
   only the required set, or by installing the entire Mellanox OFED:

   For bare metal use::

      ./mlnxofedinstall --dpdk --upstream-libs

   For SR-IOV hypervisors use::

      ./mlnxofedinstall --dpdk --upstream-libs --enable-sriov --hypervisor

   For SR-IOV virtual machine use::

      ./mlnxofedinstall --dpdk --upstream-libs --guest

3. Verify the firmware is the correct one::

      ibv_devinfo

4.
   Set all ports links to Ethernet, follow instructions on the screen::

      connectx_port_config

5. Continue with :ref:`section 2 of the Quick Start Guide <QSG_2>`.

.. _qsg:

Quick Start Guide
-----------------

1. Set all ports links to Ethernet::

      PCI=<NIC PCI address>
      echo eth > "/sys/bus/pci/devices/$PCI/mlx4_port0"
      echo eth > "/sys/bus/pci/devices/$PCI/mlx4_port1"

   .. note::

      If using Mellanox OFED, the port link type can be set permanently to
      Ethernet using the connectx_port_config tool it provides. See
      :ref:`Mellanox_OFED_as_a_fallback`.

.. _QSG_2:

2. In case of bare metal or hypervisor, configure optimized steering mode
   by adding the following line to ``/etc/modprobe.d/mlx4_core.conf``::

      options mlx4_core log_num_mgm_entry_size=-7

   .. note::

      If VLAN filtering is used, set log_num_mgm_entry_size=-1.
      Performance degradation can occur in this case.

3. Restart the driver::

      /etc/init.d/openibd restart

   or::

      service openibd restart

4. Install DPDK and you are ready to go.
   See :doc:`compilation instructions <../linux_gsg/build_dpdk>`.

Performance tuning
------------------

1. Verify the optimized steering mode is configured::

      cat /sys/module/mlx4_core/parameters/log_num_mgm_entry_size

2. Use a CPU near the local NUMA node to which the PCIe adapter is connected,
   for better performance. For VMs, verify that the right CPU
   and NUMA node are pinned according to the above. Run::

      lstopo-no-graphics

   to identify the NUMA node to which the PCIe adapter is connected.

3. If more than one adapter is used, and root complex capabilities allow
   putting both adapters on the same NUMA node without PCI bandwidth
   degradation, it is recommended to locate both adapters on the same NUMA
   node.
   This is in order to forward packets from one to the other without
   a NUMA performance penalty.

4. Disable pause frames::

      ethtool -A <netdev> rx off tx off

5. Verify IO non-posted prefetch is disabled by default. This can be checked
   via the BIOS configuration. Please contact your server provider for more
   information about the settings.

.. note::

   On some machines, depending on the machine integrator, it is beneficial
   to set the PCI max read request parameter to 1K. This can be
   done in the following way:

   To query the read request size use::

      setpci -s <NIC PCI address> 68.w

   If the output is different than 3XXX, set it by::

      setpci -s <NIC PCI address> 68.w=3XXX

   The XXX can be different on different systems. Make sure to configure
   according to the setpci output.

6. To minimize the overhead of searching Memory Regions:

   - ``--socket-mem`` is recommended to pin memory by a predictable amount.
   - Configure per-lcore cache when creating Mempools for packet buffer.
   - Refrain from dynamically allocating/freeing memory at run-time.

Usage example
-------------

This section demonstrates how to launch **testpmd** with Mellanox ConnectX-3
devices managed by librte_net_mlx4.

#. Load the kernel modules::

      modprobe -a ib_uverbs mlx4_en mlx4_core mlx4_ib

   Alternatively if MLNX_OFED is fully installed, the following script can
   be run::

      /etc/init.d/openibd restart

   .. note::

      User space I/O kernel modules (uio and igb_uio) are not used and do
      not have to be loaded.

#. Make sure Ethernet interfaces are in working order and linked to kernel
   verbs. Related sysfs entries should be present::

      ls -d /sys/class/net/*/device/infiniband_verbs/uverbs* | cut -d / -f 5

   Example output::

      eth2
      eth3
      eth4
      eth5

#.
   Optionally, retrieve their PCI bus addresses to be used with the allow
   argument::

      {
          for intf in eth2 eth3 eth4 eth5;
          do
              (cd "/sys/class/net/${intf}/device/" && pwd -P);
          done;
      } |
      sed -n 's,.*/\(.*\),-a \1,p'

   Example output::

      -a 0000:83:00.0
      -a 0000:83:00.0
      -a 0000:84:00.0
      -a 0000:84:00.0

   .. note::

      There are only two distinct PCI bus addresses because the Mellanox
      ConnectX-3 adapters installed on this system are dual port.

#. Request huge pages::

      dpdk-hugepages.py --setup 2G

#. Start testpmd with basic parameters::

      dpdk-testpmd -l 8-15 -n 4 -a 0000:83:00.0 -a 0000:84:00.0 -- --rxq=2 --txq=2 -i

   Example output::

      [...]
      EAL: PCI device 0000:83:00.0 on NUMA socket 1
      EAL: probe driver: 15b3:1007 librte_net_mlx4
      PMD: librte_net_mlx4: PCI information matches, using device "mlx4_0" (VF: false)
      PMD: librte_net_mlx4: 2 port(s) detected
      PMD: librte_net_mlx4: port 1 MAC address is 00:02:c9:b5:b7:50
      PMD: librte_net_mlx4: port 2 MAC address is 00:02:c9:b5:b7:51
      EAL: PCI device 0000:84:00.0 on NUMA socket 1
      EAL: probe driver: 15b3:1007 librte_net_mlx4
      PMD: librte_net_mlx4: PCI information matches, using device "mlx4_1" (VF: false)
      PMD: librte_net_mlx4: 2 port(s) detected
      PMD: librte_net_mlx4: port 1 MAC address is 00:02:c9:b5:ba:b0
      PMD: librte_net_mlx4: port 2 MAC address is 00:02:c9:b5:ba:b1
      Interactive-mode selected
      Configuring Port 0 (socket 0)
      PMD: librte_net_mlx4: 0x867d60: TX queues number update: 0 -> 2
      PMD: librte_net_mlx4: 0x867d60: RX queues number update: 0 -> 2
      Port 0: 00:02:C9:B5:B7:50
      Configuring Port 1 (socket 0)
      PMD: librte_net_mlx4: 0x867da0: TX queues number update: 0 -> 2
      PMD: librte_net_mlx4: 0x867da0: RX queues number update: 0 -> 2
      Port 1: 00:02:C9:B5:B7:51
      Configuring Port 2 (socket 0)
      PMD: librte_net_mlx4: 0x867de0: TX queues number update: 0 -> 2
      PMD: librte_net_mlx4: 0x867de0: RX queues number update: 0 -> 2
      Port 2: 00:02:C9:B5:BA:B0
      Configuring Port 3 (socket 0)
      PMD: librte_net_mlx4: 0x867e20: TX queues number update: 0 -> 2
      PMD: librte_net_mlx4: 0x867e20: RX queues number update: 0 -> 2
      Port 3: 00:02:C9:B5:BA:B1
      Checking link statuses...
      Port 0 Link Up - speed 10000 Mbps - full-duplex
      Port 1 Link Up - speed 40000 Mbps - full-duplex
      Port 2 Link Up - speed 10000 Mbps - full-duplex
      Port 3 Link Up - speed 40000 Mbps - full-duplex
      Done
      testpmd>