..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2015 Intel Corporation.

Link Bonding Poll Mode Driver Library
=====================================

In addition to Poll Mode Drivers (PMDs) for physical and virtual hardware,
DPDK also includes a pure-software library that
allows physical PMDs to be bonded together to create a single logical PMD.

.. figure:: img/bond-overview.*

   Bonded PMDs


The Link Bonding PMD library (librte_pmd_bond) supports bonding of groups of
``rte_eth_dev`` ports of the same speed and duplex. It provides capabilities
similar to those of the Linux bonding driver, allowing the aggregation
of multiple (slave) NICs into a single logical interface between a server
and a switch. The new bonded PMD then processes these interfaces based on
the mode of operation specified, to provide support for features such as
redundant links, fault tolerance and/or load balancing.

The librte_pmd_bond library exports a C API for the
creation of bonded devices as well as the configuration and management of the
bonded device and its slave devices.

.. note::

   The Link Bonding PMD Library is enabled by default in the build
   configuration files. The library can be disabled by setting
   ``CONFIG_RTE_LIBRTE_PMD_BOND=n`` and recompiling the DPDK.

Link Bonding Modes Overview
---------------------------

Currently the Link Bonding PMD library supports the following modes of operation:

*   **Round-Robin (Mode 0):**

.. figure:: img/bond-mode-0.*

   Round-Robin (Mode 0)


    This mode provides load balancing and fault tolerance by transmission of
    packets in sequential order from the first available slave device through
    the last. Packets are bulk dequeued from devices then serviced in a
    round-robin manner. This mode does not guarantee in-order reception of
    packets, and downstream should be able to handle out-of-order packets.

*   **Active Backup (Mode 1):**

.. figure:: img/bond-mode-1.*

   Active Backup (Mode 1)


    In this mode only one slave in the bond is active at any time. A different
    slave becomes active if, and only if, the primary active slave fails,
    thereby providing fault tolerance to slave failure. The single logical
    bonded interface's MAC address is externally visible on only one NIC (port)
    to avoid confusing the network switch.

*   **Balance XOR (Mode 2):**

.. figure:: img/bond-mode-2.*

   Balance XOR (Mode 2)


    This mode provides transmit load balancing (based on the selected
    transmission policy) and fault tolerance. The default policy (layer 2) uses
    a simple calculation based on the packet flow source and destination MAC
    addresses as well as the number of active slaves available to the bonded
    device to classify the packet to a specific slave to transmit on. Two
    alternate transmission policies are supported. Layer 2+3 also takes the IP
    source and destination addresses into the calculation of the transmit
    slave port. Layer 3+4 uses the IP source and destination addresses as well
    as the TCP/UDP source and destination ports.

.. note::
    The coloring differences of the packets are used to identify the different
    flow classifications calculated by the selected transmit policy.


*   **Broadcast (Mode 3):**

.. figure:: img/bond-mode-3.*

   Broadcast (Mode 3)


    This mode provides fault tolerance by transmission of packets on all slave
    ports.

*   **Link Aggregation 802.3AD (Mode 4):**

.. figure:: img/bond-mode-4.*

   Link Aggregation 802.3AD (Mode 4)


    This mode provides dynamic link aggregation according to the 802.3ad
    specification. It negotiates and monitors aggregation groups that share the
    same speed and duplex settings using the selected balance transmit policy
    for balancing outgoing traffic.

    The DPDK implementation of this mode places some additional requirements
    on the application.

    #. It needs to call ``rte_eth_tx_burst`` and ``rte_eth_rx_burst`` at
       intervals of less than 100ms.

    #. Calls to ``rte_eth_tx_burst`` must have a buffer size of at least 2xN,
       where N is the number of slaves. This space is required for LACP
       frames. Additionally, LACP packets are included in the statistics, but
       they are not returned to the application.

*   **Transmit Load Balancing (Mode 5):**

.. figure:: img/bond-mode-5.*

   Transmit Load Balancing (Mode 5)


    This mode provides adaptive transmit load balancing. It dynamically
    changes the transmitting slave according to the computed load. Statistics
    are collected in 100ms intervals and scheduled every 10ms.


Implementation Details
----------------------

The librte_pmd_bond bonded devices are compatible with the Ethernet device API
exported by the Ethernet PMDs described in the *DPDK API Reference*.

The Link Bonding Library supports the creation of bonded devices at application
startup time during EAL initialization using the ``--vdev`` option as well as
programmatically via the C API ``rte_eth_bond_create`` function.

Bonded devices support the dynamic addition and removal of slave devices using
the ``rte_eth_bond_slave_add`` / ``rte_eth_bond_slave_remove`` APIs.

After a slave device is added to a bonded device, the slave is stopped using
``rte_eth_dev_stop`` and then reconfigured using ``rte_eth_dev_configure``.
The RX and TX queues are also reconfigured using ``rte_eth_tx_queue_setup`` /
``rte_eth_rx_queue_setup`` with the parameters used to configure the bonding
device. If RSS is enabled for the bonding device, this mode is also enabled on
the new slave and configured as well.
Any flow which was configured on the bond device is also configured on the added
slave.

Setting the multi-queue mode of a bonding device to RSS makes it fully
RSS-capable, so all slaves are synchronized with its configuration. This mode is
intended to make the RSS configuration of the slaves transparent to the client
application implementation.

The bonding device stores its own version of the RSS settings, i.e. RETA, RSS
hash function and RSS key, used to set up its slaves. This allows the RSS
configuration of the bonding device to be defined as the desired configuration
of the whole bond (as one unit), without referring to any individual slave.
This is required to ensure consistency and makes the configuration more
error-proof.

The RSS hash function set for the bonding device is the maximal set of RSS hash
functions supported by all bonded slaves. The RETA size is the GCD of the RETA
sizes of all its slaves, so it can easily be used as a pattern providing the
expected behavior, even if the slaves' RETA sizes differ. If the RSS key is not
set for the bonded device, it is not changed on the slaves and the default key
for each device is used.

As with the RSS configuration, flow consistency is maintained across the bonded
slaves for the following rte flow operations:

Validate:
	- Validate the flow for each slave. Failure for at least one slave causes
	  the bond validation to fail.

Create:
	- Create the flow in all slaves.
	- Save all the flow objects created in the slaves in the bonding internal
	  flow structure.
	- Failure in flow creation for an existing slave rejects the flow.
	- Failure in flow creation for a new slave at slave adding time rejects
	  the slave.

Destroy:
	- Destroy the flow in all slaves and release the bond internal flow
	  memory.

Flush:
	- Destroy all the bonding PMD flows in all the slaves.

.. note::

    Do not call flush directly on the slaves. It destroys all the slave flows,
    which may include external flows or the bond internal LACP flow.

Query:
	- Summarize flow counters from all the slaves, relevant only for
	  ``RTE_FLOW_ACTION_TYPE_COUNT``.

Isolate:
	- Call flow isolate for all slaves.
	- Failure in flow isolation for an existing slave rejects the isolate mode.
	- Failure in flow isolation for a new slave at slave adding time rejects
	  the slave.

All settings are managed through the bonding port API and are always propagated
in one direction (from the bonding device to the slaves).

Link Status Change Interrupts / Polling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link bonding devices support the registration of a link status change callback
using the ``rte_eth_dev_callback_register`` API. The callback is called when the
status of the bonding device changes. For example, in the case of a bonding
device which has 3 slaves, the link status will change to up when one slave
becomes active or change to down when all slaves become inactive. There is no
callback notification when a single slave changes state and the previous
conditions are not met. If a user wishes to monitor individual slaves then they
must register callbacks with those slaves directly.

The link bonding library also supports devices which do not implement link
status change interrupts. This is achieved by polling the device's link status
at a defined period which is set using the ``rte_eth_bond_link_monitoring_set``
API. The default polling interval is 10ms. When a device is added as a slave to
a bonding device, the ``RTE_PCI_DRV_INTR_LSC`` flag is used to determine
whether the device supports interrupts or whether the link status should be
monitored by polling it.

Requirements / Limitations
~~~~~~~~~~~~~~~~~~~~~~~~~~

The current implementation only allows devices that support the same speed
and duplex to be added as slaves to the same bonded device.
The bonded device
inherits these attributes from the first active slave added to the bonded
device, and all further slaves added to the bonded device must support
these parameters.

A bonding device must have a minimum of one slave before the bonding device
itself can be started.

To use the dynamic RSS configuration feature of a bonding device effectively,
all slaves must also be RSS-capable and support at least one hash function
common to all of them. Changing the RSS key is only
possible when all slave devices support the same key size.

To prevent inconsistency in how slaves process packets, once a device is added
to a bonding device, RSS and rte flow configurations should be managed through
the bonding device API, and not directly on the slave.

As with all other PMDs, all functions exported by a PMD are lock-free functions
that are assumed not to be invoked in parallel on different logical cores to
work on the same target object.

It should also be noted that the PMD receive function should not be invoked
directly on slave devices after they have been added to a bonded device, since
packets read directly from the slave device will no longer be available to the
bonded device to read.

Configuration
~~~~~~~~~~~~~

Link bonding devices are created using the ``rte_eth_bond_create`` API
which requires a unique device name, the bonding mode,
and the socket Id to allocate the bonding device's resources on.
The other configurable parameters for a bonded device are its slave devices,
its primary slave, a user defined MAC address and the transmission policy to
use if the device is in balance XOR mode.

Slave Devices
^^^^^^^^^^^^^

Bonding devices support up to a maximum of ``RTE_MAX_ETHPORTS`` slave devices
of the same speed and duplex. An Ethernet device can be added as a slave to a
maximum of one bonded device.
Slave devices are reconfigured with the
configuration of the bonded device on being added to a bonded device.

The bonded device also guarantees to restore the MAC address of a slave device
to its original value when the slave is removed from it.

Primary Slave
^^^^^^^^^^^^^

The primary slave is used to define the default port to use when a bonded
device is in active backup mode. A different port will be used if, and
only if, the current primary port goes down. If the user does not specify a
primary port it will default to being the first port added to the bonded device.

MAC Address
^^^^^^^^^^^

The bonded device can be configured with a user specified MAC address. This
address will be inherited by some or all of the slave devices depending on the
operating mode. If the device is in active backup mode then only the primary
device will have the user specified MAC; all other slaves will retain their
original MAC address. In modes 0, 2, 3 and 4 all slave devices are configured
with the bonded device's MAC address.

If a user defined MAC address is not defined then the bonded device will
default to using the primary slave's MAC address.

Balance XOR Transmit Policies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are 3 supported transmission policies for a bonded device running in
Balance XOR mode: Layer 2, Layer 2+3 and Layer 3+4.

*   **Layer 2:** Ethernet MAC address based balancing is the default
    transmission policy for Balance XOR bonding mode. It uses a simple XOR
    calculation on the source MAC address and destination MAC address of the
    packet, then takes the modulus of this value to select the slave
    device to transmit the packet on.

*   **Layer 2 + 3:** Ethernet MAC address & IP Address based balancing uses a
    combination of source/destination MAC addresses and the source/destination
    IP addresses of the data packet to decide which slave port the packet will
    be transmitted on.

*   **Layer 3 + 4:** IP Address & UDP Port based balancing uses a combination
    of source/destination IP Address and the source/destination UDP ports of
    the data packet to decide which slave port the packet will be
    transmitted on.

All these policies support 802.1Q VLAN Ethernet packets, as well as IPv4, IPv6
and UDP protocols for load balancing.

Using Link Bonding Devices
--------------------------

The librte_pmd_bond library supports two modes of device creation: the library's
full C API, or the EAL command line, which statically configures link
bonding devices at application startup. Using the EAL option it is possible to
use link bonding functionality transparently without specific knowledge of the
library's API. This can be used, for example, to add bonding functionality,
such as active backup, to an existing application which has no knowledge of
the link bonding C API.

Using the Poll Mode Driver from an Application
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using the librte_pmd_bond library's API it is possible to dynamically create
and manage link bonding devices from within any application. Link bonding
devices are created using the ``rte_eth_bond_create`` API which requires a
unique device name, the link bonding mode to initialize the device in and
finally the socket Id on which to allocate the device's resources. After
successful creation, a bonding device must be configured using the generic
Ethernet device configure API ``rte_eth_dev_configure`` and then the RX and TX
queues which will be used must be set up using ``rte_eth_tx_queue_setup`` /
``rte_eth_rx_queue_setup``.
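
The creation and configuration sequence can be sketched as follows. This is a
minimal illustration rather than a reference implementation: the helper name
``example_bond_setup``, the device name ``net_bonding0``, the use of socket 0,
the single RX/TX queue pair and the descriptor counts are all assumptions made
for the example, and port identifiers are assumed to be 16-bit. The mbuf pool
``mp`` and the two slave ports are assumed to have been set up by the caller.

.. code-block:: c

    #include <stdint.h>
    #include <string.h>

    #include <rte_ethdev.h>
    #include <rte_eth_bond.h>
    #include <rte_mempool.h>

    /*
     * Hypothetical helper (not part of the library): create an active backup
     * bonded device over two existing ports and start it.
     * Returns the bonded port id on success, or a negative value on error.
     */
    int
    example_bond_setup(uint16_t slave_a, uint16_t slave_b,
                       struct rte_mempool *mp)
    {
        struct rte_eth_conf conf;
        int bond_port;
        int ret;

        /* Zeroed port configuration; a real application would fill this in. */
        memset(&conf, 0, sizeof(conf));

        /* Unique name, bonding mode, socket for the device's resources. */
        bond_port = rte_eth_bond_create("net_bonding0",
                                        BONDING_MODE_ACTIVE_BACKUP, 0);
        if (bond_port < 0)
            return bond_port;

        /* One RX and one TX queue; the descriptor counts are arbitrary. */
        ret = rte_eth_dev_configure((uint16_t)bond_port, 1, 1, &conf);
        if (ret < 0)
            return ret;
        ret = rte_eth_rx_queue_setup((uint16_t)bond_port, 0, 128, 0, NULL, mp);
        if (ret < 0)
            return ret;
        ret = rte_eth_tx_queue_setup((uint16_t)bond_port, 0, 512, 0, NULL);
        if (ret < 0)
            return ret;

        /* Attach the slaves, then start the bonded device. */
        ret = rte_eth_bond_slave_add((uint16_t)bond_port, slave_a);
        if (ret < 0)
            return ret;
        ret = rte_eth_bond_slave_add((uint16_t)bond_port, slave_b);
        if (ret < 0)
            return ret;
        ret = rte_eth_dev_start((uint16_t)bond_port);
        if (ret < 0)
            return ret;

        return bond_port;
    }

The sketch must be called after EAL initialization, and the returned port id
is then used with ``rte_eth_rx_burst`` / ``rte_eth_tx_burst`` like any other
Ethernet port.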

Slave devices can be dynamically added to and removed from a link bonding device
using the ``rte_eth_bond_slave_add`` / ``rte_eth_bond_slave_remove``
APIs, but at least one slave device must be added to the link bonding device
before it can be started using ``rte_eth_dev_start``.

The link status of a bonded device is dictated by that of its slaves. If the
link status of all slave devices is down, or if all slaves are removed from the
link bonding device, then the link status of the bonding device will go down.

It is also possible to configure / query the configuration of the control
parameters of a bonded device using the provided APIs
``rte_eth_bond_mode_set/get``, ``rte_eth_bond_primary_set/get``,
``rte_eth_bond_mac_set/reset`` and ``rte_eth_bond_xmit_policy_set/get``.

Using Link Bonding Devices from the EAL Command Line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link bonding devices can be created at application startup time using the
``--vdev`` EAL command line option. The device name must start with the
net_bonding prefix followed by numbers or letters. The name must be unique for
each device. Each device can have multiple options arranged in a comma
separated list. Multiple device definitions can be provided by passing the
``--vdev`` option multiple times.

Device names and bonding options must be separated by commas as shown below:

.. code-block:: console

    $RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,bond_opt0=..,bond_opt1=..' --vdev 'net_bonding1,bond_opt0=..,bond_opt1=..'

Link Bonding EAL Options
^^^^^^^^^^^^^^^^^^^^^^^^

The options can be combined in multiple ways as
long as the following three rules are respected:

*   A unique device name, in the format of net_bondingX is provided,
    where X can be any combination of numbers and/or letters,
    and the name is no greater than 32 characters long.

*   At least one slave device is provided for each bonded device definition.

*   The operation mode of the bonded device being created is provided.

The different options are:

*   mode: Integer value defining the bonding mode of the device.
    Currently supports modes 0,1,2,3,4,5 (round-robin, active backup, balance,
    broadcast, link aggregation, transmit load balancing).

.. code-block:: console

    mode=2

*   slave: Defines the PMD device which will be added as a slave to the bonded
    device. This option can be specified multiple times, once for each device
    to be added as a slave. Physical devices should be specified using their
    PCI address, in the format domain:bus:devid.function

.. code-block:: console

    slave=0000:0a:00.0,slave=0000:0a:00.1

*   primary: Optional parameter which defines the primary slave port. It
    is used in active backup mode to select the primary slave for data TX/RX if
    it is available. The primary port is also used to select the MAC address to
    use when it is not defined by the user. This defaults to the first slave
    added to the device if it is not specified. The primary device must be a
    slave of the bonded device.

.. code-block:: console

    primary=0000:0a:00.0

*   socket_id: Optional parameter used to select which socket on a NUMA device
    the bonded device's resources will be allocated on.

.. code-block:: console

    socket_id=0

*   mac: Optional parameter to select a MAC address for the link bonding
    device. This overrides the value of the primary slave device.

.. code-block:: console

    mac=00:1e:67:1d:fd:1d

*   xmit_policy: Optional parameter which defines the transmission policy when
    the bonded device is in balance mode. If not specified by the user this
    defaults to l2 (layer 2) forwarding. The other transmission policies
    available are l23 (layer 2+3) and l34 (layer 3+4).

.. code-block:: console

    xmit_policy=l23

*   lsc_poll_period_ms: Optional parameter which defines the polling interval
    in milliseconds at which devices which don't support LSC interrupts are
    checked for a change in the device's link status.

.. code-block:: console

    lsc_poll_period_ms=100

*   up_delay: Optional parameter which adds a delay in milliseconds to the
    propagation of a device's link status changing to up. By default this
    parameter is zero.

.. code-block:: console

    up_delay=10

*   down_delay: Optional parameter which adds a delay in milliseconds to the
    propagation of a device's link status changing to down. By default this
    parameter is zero.

.. code-block:: console

    down_delay=50

Examples of Usage
^^^^^^^^^^^^^^^^^

Create a bonded device in round robin mode with two slaves specified by their PCI addresses:

.. code-block:: console

    $RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=0,slave=0000:0a:00.01,slave=0000:04:00.00' -- --port-topology=chained

Create a bonded device in round robin mode with two slaves specified by their PCI addresses and an overriding MAC address:

.. code-block:: console

    $RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=0,slave=0000:0a:00.01,slave=0000:04:00.00,mac=00:1e:67:1d:fd:1d' -- --port-topology=chained

Create a bonded device in active backup mode with two slaves and a primary slave, all specified by their PCI addresses:

.. code-block:: console

    $RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=1,slave=0000:0a:00.01,slave=0000:04:00.00,primary=0000:0a:00.01' -- --port-topology=chained

Create a bonded device in balance mode with two slaves specified by their PCI addresses, and a transmission policy of layer 3 + 4 forwarding:

.. code-block:: console

    $RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=2,slave=0000:0a:00.01,slave=0000:04:00.00,xmit_policy=l34' -- --port-topology=chained
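
The ``xmit_policy`` option in the last example selects how balance mode maps a
flow to a slave. As a rough illustration of the simplest case, the layer 2
policy described under Balance XOR Transmit Policies, the sketch below XORs the
source and destination MAC addresses and takes the result modulo the number of
active slaves. It is a simplified standalone model and not the library's
implementation: the function name ``l2_xor_slave``, the byte-wise folding of the
addresses and the handling of an empty slave set are assumptions made for the
example.

.. code-block:: c

    #include <stdint.h>

    #define MAC_LEN 6

    /*
     * Hypothetical helper: pick a slave index for a frame from its source
     * and destination MAC addresses (layer 2 style XOR balancing).
     */
    unsigned int
    l2_xor_slave(const uint8_t src[MAC_LEN], const uint8_t dst[MAC_LEN],
                 unsigned int active_slaves)
    {
        uint8_t hash = 0;
        int i;

        /* Assumption for this sketch: index 0 when no slave is active. */
        if (active_slaves == 0)
            return 0;

        /* Fold the XOR of both addresses into a single byte. */
        for (i = 0; i < MAC_LEN; i++)
            hash ^= (uint8_t)(src[i] ^ dst[i]);

        /* The modulus maps the hash onto the available slaves. */
        return hash % active_slaves;
    }

For example, with the MAC addresses 00:1e:67:1d:fd:1d and 00:1e:67:1d:fd:1e and
two active slaves, the folded XOR is 0x03 and the function returns slave index
1. Because XOR is symmetric, both directions of the same flow map to the same
slave in this model.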