..  BSD LICENSE
    Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
    * Neither the name of Intel Corporation nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Link Bonding Poll Mode Driver Library
=====================================

In addition to Poll Mode Drivers (PMDs) for physical and virtual hardware,
DPDK also includes a pure-software library that
allows physical PMDs to be bonded together to create a single logical PMD.

.. figure:: img/bond-overview.*

   Bonded PMDs


The Link Bonding PMD library (librte_pmd_bond) supports bonding of groups of
``rte_eth_dev`` ports of the same speed and duplex to provide capabilities
similar to those found in the Linux bonding driver, allowing the
aggregation of multiple (slave) NICs into a single logical interface between a
server and a switch. The new bonded PMD will then process these interfaces
based on the mode of operation specified to provide support for features such
as redundant links, fault tolerance and/or load balancing.

The librte_pmd_bond library exports a C API for the
creation of bonded devices as well as the configuration and management of the
bonded device and its slave devices.

.. note::

    The Link Bonding PMD Library is enabled by default in the build
    configuration files. The library can be disabled by setting
    ``CONFIG_RTE_LIBRTE_PMD_BOND=n`` and recompiling the DPDK.

Link Bonding Modes Overview
---------------------------

Currently the Link Bonding PMD library supports 6 modes of operation:

*   **Round-Robin (Mode 0):**

.. figure:: img/bond-mode-0.*

   Round-Robin (Mode 0)


   This mode provides load balancing and fault tolerance by transmission of
   packets in sequential order from the first available slave device through
   the last. Packets are bulk dequeued from devices then serviced in a
   round-robin manner. This mode does not guarantee in-order reception of
   packets, so downstream devices should be able to handle out-of-order
   packets.

*   **Active Backup (Mode 1):**

.. figure:: img/bond-mode-1.*

   Active Backup (Mode 1)


   In this mode only one slave in the bond is active at any time. A different
   slave becomes active if, and only if, the primary active slave fails,
   thereby providing fault tolerance to slave failure. The single logical
   bonded interface's MAC address is externally visible on only one NIC (port)
   to avoid confusing the network switch.

*   **Balance XOR (Mode 2):**

.. figure:: img/bond-mode-2.*

   Balance XOR (Mode 2)


   This mode provides transmit load balancing (based on the selected
   transmission policy) and fault tolerance. The default policy (layer 2) uses
   a simple calculation based on the packet flow source and destination MAC
   addresses as well as the number of active slaves available to the bonded
   device to classify the packet to a specific slave to transmit on. The
   alternate transmission policies supported are layer 2+3, which also takes
   the IP source and destination addresses into the calculation of the
   transmit slave port, and layer 3+4, which uses the IP source and
   destination addresses as well as the TCP/UDP source and destination ports.

.. note::
    The color differences of the packets in the figure identify the different
    flow classifications calculated by the selected transmit policy.


*   **Broadcast (Mode 3):**

.. figure:: img/bond-mode-3.*

   Broadcast (Mode 3)


   This mode provides fault tolerance by transmission of packets on all slave
   ports.

*   **Link Aggregation 802.3AD (Mode 4):**

.. figure:: img/bond-mode-4.*

   Link Aggregation 802.3AD (Mode 4)


   This mode provides dynamic link aggregation according to the 802.3ad
   specification. It negotiates and monitors aggregation groups that share the
   same speed and duplex settings using the selected balance transmit policy
   for balancing outgoing traffic.

   The DPDK implementation of this mode places some additional requirements
   on the application.

   #. It needs to call ``rte_eth_tx_burst`` and ``rte_eth_rx_burst`` at
      intervals of less than 100ms.

   #. Calls to ``rte_eth_tx_burst`` must have a buffer size of at least 2xN,
      where N is the number of slaves. This space is required for LACP
      frames. Additionally, LACP packets are included in the statistics, but
      they are not returned to the application.

*   **Transmit Load Balancing (Mode 5):**

.. figure:: img/bond-mode-5.*

   Transmit Load Balancing (Mode 5)


   This mode provides adaptive transmit load balancing. It dynamically
   changes the transmitting slave according to the computed load. Statistics
   are collected in 100ms intervals and scheduled every 10ms.


Implementation Details
----------------------

The librte_pmd_bond bonded devices are compatible with the Ethernet device API
exported by the Ethernet PMDs described in the *DPDK API Reference*.

The Link Bonding Library supports the creation of bonded devices at application
startup time during EAL initialization using the ``--vdev`` option as well as
programmatically via the C API ``rte_eth_bond_create`` function.

Bonded devices support the dynamic addition and removal of slave devices using
the ``rte_eth_bond_slave_add`` / ``rte_eth_bond_slave_remove`` APIs.

After a slave device is added to a bonded device, the slave is stopped using
``rte_eth_dev_stop`` and then reconfigured using ``rte_eth_dev_configure``.
The RX and TX queues are also reconfigured using ``rte_eth_tx_queue_setup`` /
``rte_eth_rx_queue_setup`` with the parameters used to configure the bonding
device. If RSS is enabled for the bonding device, this mode is also enabled
and configured on the new slave.

Setting the multi-queue mode of a bonding device to RSS makes it fully
RSS-capable, so all slaves are synchronized with its configuration. This mode is
intended to make the RSS configuration of the slaves transparent to the client
application.

The bonding device stores its own version of the RSS settings, i.e.
RETA, RSS hash function and RSS key, which are used to set up its slaves.
This allows the RSS configuration of the bonding device to be defined as the
desired configuration of the whole bond (as one unit), without referring to any
individual slave. This is required to ensure consistency and makes the
configuration more error-proof.

The RSS hash function set for a bonding device is the maximal set of RSS hash
functions supported by all bonded slaves. The RETA size is the GCD of the RETA
sizes of all its slaves, so it can easily be used as a pattern providing the
expected behavior, even if the slaves' RETA sizes differ. If the RSS key is not
set for the bonded device, it is not changed on the slaves and the default key
of each device is used.

All settings are managed through the bonding port API and are always propagated
in one direction (from the bonding device to its slaves).

Link Status Change Interrupts / Polling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link bonding devices support the registration of a link status change callback
using the ``rte_eth_dev_callback_register`` API. The callback is called when the
status of the bonding device changes. For example, in the case of a bonding
device which has 3 slaves, the link status will change to up when one slave
becomes active or change to down when all slaves become inactive. There is no
callback notification when a single slave changes state and the previous
conditions are not met. If a user wishes to monitor individual slaves then they
must register callbacks with each slave directly.

The link bonding library also supports devices which do not implement link
status change interrupts. This is achieved by polling the device's link status
at a defined period which is set using the ``rte_eth_bond_link_monitoring_set``
API. The default polling interval is 10ms.
When a device is added as a slave to
a bonding device, the ``RTE_PCI_DRV_INTR_LSC`` flag is used to determine
whether the device supports interrupts or whether the link status should be
monitored by polling it.

Requirements / Limitations
~~~~~~~~~~~~~~~~~~~~~~~~~~

The current implementation only supports devices that support the same speed
and duplex being added as slaves to the same bonded device. The bonded device
inherits these attributes from the first active slave added to the bonded
device and then all further slaves added to the bonded device must support
these parameters.

A bonding device must have a minimum of one slave before the bonding device
itself can be started.

To use the dynamic RSS configuration feature of a bonding device effectively,
all slaves must also be RSS-capable and support at least one hash function
that is common to all of them. Changing the RSS key is only
possible when all slave devices support the same key size.

To prevent inconsistency in how slaves process packets, once a device is added
to a bonding device, RSS configuration should be managed through the bonding
device API, and not directly on the slave.

As with all other PMDs, all functions exported by a PMD are lock-free functions
that are assumed not to be invoked in parallel on different logical cores to
work on the same target object.

It should also be noted that the PMD receive function should not be invoked
directly on slave devices after they have been added to a bonded device, since
packets read directly from the slave device will no longer be available to the
bonded device to read.

Configuration
~~~~~~~~~~~~~

Link bonding devices are created using the ``rte_eth_bond_create`` API
which requires a unique device name, the bonding mode,
and the socket Id to allocate the bonding device's resources on.
The other configurable parameters for a bonded device are its slave devices,
its primary slave, a user defined MAC address and the transmission policy to
use if the device is in balance XOR mode.

Slave Devices
^^^^^^^^^^^^^

Bonding devices support up to a maximum of ``RTE_MAX_ETHPORTS`` slave devices
of the same speed and duplex. An Ethernet device can be added as a slave to a
maximum of one bonded device. Slave devices are reconfigured with the
configuration of the bonded device on being added to a bonded device.

The bonded device also guarantees to return the MAC address of a slave device
to its original value on removal of the slave from it.

Primary Slave
^^^^^^^^^^^^^

The primary slave is used to define the default port to use when a bonded
device is in active backup mode. A different port will be used if, and
only if, the current primary port goes down. If the user does not specify a
primary port it will default to being the first port added to the bonded device.

MAC Address
^^^^^^^^^^^

The bonded device can be configured with a user specified MAC address. This
address will be inherited by some or all of the slave devices depending on the
operating mode. If the device is in active backup mode then only the primary
device will have the user specified MAC; all other slaves will retain their
original MAC address. In modes 0, 2, 3 and 4 all slave devices are configured
with the bonded device's MAC address.

If a user defined MAC address is not defined then the bonded device will
default to using the primary slave's MAC address.

Balance XOR Transmit Policies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are 3 supported transmission policies for a bonded device running in
Balance XOR mode: Layer 2, Layer 2+3 and Layer 3+4.

*   **Layer 2:** Ethernet MAC address based balancing is the default
    transmission policy for Balance XOR bonding mode. It uses a simple XOR
    calculation on the source MAC address and destination MAC address of the
    packet and then takes the modulus of this value, with respect to the
    number of active slaves, to select the slave device to transmit the
    packet on.

*   **Layer 2 + 3:** Ethernet MAC address & IP Address based balancing uses a
    combination of the source/destination MAC addresses and the
    source/destination IP addresses of the data packet to decide which slave
    port the packet will be transmitted on.

*   **Layer 3 + 4:** IP Address & UDP Port based balancing uses a combination
    of the source/destination IP addresses and the source/destination UDP
    ports of the data packet to decide which slave port the packet will be
    transmitted on.

All these policies support 802.1Q VLAN Ethernet packets, as well as IPv4, IPv6
and UDP protocols for load balancing.

Using Link Bonding Devices
--------------------------

The librte_pmd_bond library supports two modes of device creation: using the
library's full C API, or using the EAL command line to statically configure link
bonding devices at application startup. Using the EAL option it is possible to
use link bonding functionality transparently without specific knowledge of the
library's API. This can be used, for example, to add bonding functionality,
such as active backup, to an existing application which has no knowledge of
the link bonding C API.

Using the Poll Mode Driver from an Application
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using the librte_pmd_bond library's API it is possible to dynamically create
and manage link bonding devices from within any application. Link bonding
devices are created using the ``rte_eth_bond_create`` API which requires a
unique device name, the link bonding mode to initialize the device in and
finally the socket Id on which to allocate the device's resources.
After successful
creation of a bonding device it must be configured using the generic Ethernet
device configure API ``rte_eth_dev_configure`` and then the RX and TX queues
which will be used must be set up using ``rte_eth_tx_queue_setup`` /
``rte_eth_rx_queue_setup``.

Slave devices can be dynamically added and removed from a link bonding device
using the ``rte_eth_bond_slave_add`` / ``rte_eth_bond_slave_remove``
APIs but at least one slave device must be added to the link bonding device
before it can be started using ``rte_eth_dev_start``.

The link status of a bonded device is dictated by that of its slaves. If the
link status of all slave devices is down, or if all slaves are removed from the
link bonding device, then the link status of the bonding device will go down.

It is also possible to configure / query the configuration of the control
parameters of a bonded device using the provided APIs
``rte_eth_bond_mode_set/get``, ``rte_eth_bond_primary_set/get``,
``rte_eth_bond_mac_set/reset`` and ``rte_eth_bond_xmit_policy_set/get``.

Using Link Bonding Devices from the EAL Command Line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link bonding devices can be created at application startup time using the
``--vdev`` EAL command line option. The device name must start with the
eth_bond prefix followed by numbers or letters. The name must be unique for
each device. Each device can have multiple options arranged in a comma
separated list. Multiple device definitions can be provided by passing the
``--vdev`` option multiple times.

Device names and bonding options must be separated by commas as shown below:

.. code-block:: console

    $RTE_TARGET/app/testpmd -c f -n 4 --vdev 'eth_bond0,bond_opt0=..,bond_opt1=..' --vdev 'eth_bond1,bond_opt0=..,bond_opt1=..'

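The comma separated format shown above can also be assembled by a launcher
program before it is handed to the EAL. The following is a minimal sketch of
that idea, not part of librte_pmd_bond: the helper name
``build_bond_vdev_arg`` and its parameters are hypothetical, and it only
covers the ``mode`` and ``slave`` options.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a DPDK API): formats one --vdev argument of the
 * form "eth_bond<index>,mode=<mode>,slave=<addr>,slave=<addr>...".
 * Returns the length of the resulting string, or -1 if a slave address is
 * empty or the buffer is too small.
 */
static int
build_bond_vdev_arg(char *buf, size_t len, unsigned int index,
        unsigned int mode, const char *const slaves[], size_t n_slaves)
{
    size_t used;
    size_t i;
    int ret;

    assert(buf != NULL && len > 0);

    /* The device name and the bonding mode come first. */
    ret = snprintf(buf, len, "eth_bond%u,mode=%u", index, mode);
    if (ret < 0 || (size_t)ret >= len)
        return -1;
    used = (size_t)ret;

    /* Each slave is appended as its own comma separated option. */
    for (i = 0; i < n_slaves; i++) {
        if (slaves[i] == NULL || strlen(slaves[i]) == 0)
            return -1;
        ret = snprintf(buf + used, len - used, ",slave=%s", slaves[i]);
        if (ret < 0 || (size_t)ret >= len - used)
            return -1;
        used += (size_t)ret;
    }

    return (int)used;
}
```

The resulting string would be passed as the value of one ``--vdev`` option;
a second bonded device needs a second call with a different index so that the
device names stay unique.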

Link Bonding EAL Options
^^^^^^^^^^^^^^^^^^^^^^^^

There are multiple options that can be specified and combined in a device
definition, as long as the following rules are respected:

*   A unique device name, in the format of eth_bondX is provided,
    where X can be any combination of numbers and/or letters,
    and the name is no greater than 32 characters long.

*   At least one slave device is provided for each bonded device definition.

*   The operation mode of the bonded device being created is provided.

The different options are:

*   mode: Integer value defining the bonding mode of the device.
    Currently supports modes 0,1,2,3,4,5 (round-robin, active backup, balance,
    broadcast, link aggregation, transmit load balancing).

.. code-block:: console

    mode=2

*   slave: Defines the PMD device which will be added as slave to the bonded
    device. This option can be selected multiple times, for each device to be
    added as a slave. Physical devices should be specified using their PCI
    address, in the format domain:bus:devid.function

.. code-block:: console

    slave=0000:0a:00.0,slave=0000:0a:00.1

*   primary: Optional parameter which defines the primary slave port. It
    is used in active backup mode to select the primary slave for data TX/RX if
    it is available. The primary port is also used to select the MAC address to
    use when it is not defined by the user. This defaults to the first slave
    added to the device if it is not specified. The primary device must be a
    slave of the bonded device.

.. code-block:: console

    primary=0000:0a:00.0

*   socket_id: Optional parameter used to select which socket on a NUMA device
    the bonded device's resources will be allocated on.

.. code-block:: console

    socket_id=0

*   mac: Optional parameter to select a MAC address for the link bonding
    device. This overrides the value of the primary slave device.

.. code-block:: console

    mac=00:1e:67:1d:fd:1d

*   xmit_policy: Optional parameter which defines the transmission policy when
    the bonded device is in balance mode. If not specified by the user this
    defaults to l2 (layer 2) forwarding. The other transmission policies
    available are l23 (layer 2+3) and l34 (layer 3+4).

.. code-block:: console

    xmit_policy=l23

*   lsc_poll_period_ms: Optional parameter which defines the polling interval
    in milliseconds at which devices which don't support LSC interrupts are
    checked for a change in the device's link status.

.. code-block:: console

    lsc_poll_period_ms=100

*   up_delay: Optional parameter which adds a delay in milliseconds to the
    propagation of a device's link status changing to up. By default this
    parameter is zero.

.. code-block:: console

    up_delay=10

*   down_delay: Optional parameter which adds a delay in milliseconds to the
    propagation of a device's link status changing to down. By default this
    parameter is zero.

.. code-block:: console

    down_delay=50

Examples of Usage
^^^^^^^^^^^^^^^^^

Create a bonded device in round robin mode with two slaves specified by their PCI address:

.. code-block:: console

    $RTE_TARGET/app/testpmd -c '0xf' -n 4 --vdev 'eth_bond0,mode=0,slave=0000:0a:00.01,slave=0000:04:00.00' -- --port-topology=chained

Create a bonded device in round robin mode with two slaves specified by their PCI address and an overriding MAC address:

.. code-block:: console

    $RTE_TARGET/app/testpmd -c '0xf' -n 4 --vdev 'eth_bond0,mode=0,slave=0000:0a:00.01,slave=0000:04:00.00,mac=00:1e:67:1d:fd:1d' -- --port-topology=chained

Create a bonded device in active backup mode with two slaves and a primary slave, all specified by their PCI addresses:

.. code-block:: console

    $RTE_TARGET/app/testpmd -c '0xf' -n 4 --vdev 'eth_bond0,mode=1,slave=0000:0a:00.01,slave=0000:04:00.00,primary=0000:0a:00.01' -- --port-topology=chained

Create a bonded device in balance mode with two slaves specified by their PCI addresses, and a transmission policy of layer 3 + 4 forwarding:

.. code-block:: console

    $RTE_TARGET/app/testpmd -c '0xf' -n 4 --vdev 'eth_bond0,mode=2,slave=0000:0a:00.01,slave=0000:04:00.00,xmit_policy=l34' -- --port-topology=chained
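
The balance mode example above depends on the transmit policy to spread flows
across the slaves. As a rough illustration of the idea described for the
balance XOR policies (XOR the selected header fields, then take the modulus by
the number of active slaves), a minimal sketch is given below. It is not the
librte_pmd_bond implementation: the function names ``l2_hash``, ``l34_hash``
and ``select_slave`` are hypothetical, and the way the fields are folded
together is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAC_LEN 6 /* length of an Ethernet MAC address in bytes */

/* Layer 2 sketch: XOR the source and destination MAC addresses byte by byte. */
static uint32_t
l2_hash(const uint8_t src_mac[MAC_LEN], const uint8_t dst_mac[MAC_LEN])
{
    uint32_t hash = 0;
    size_t i;

    for (i = 0; i < MAC_LEN; i++)
        hash ^= (uint32_t)(src_mac[i] ^ dst_mac[i]);

    return hash;
}

/*
 * Layer 3+4 sketch (IPv4 only): XOR the IP addresses and the L4 ports.
 * Combining them with XOR is an assumption of this example.
 */
static uint32_t
l34_hash(uint32_t src_ip, uint32_t dst_ip, uint16_t src_port,
        uint16_t dst_port)
{
    return (src_ip ^ dst_ip) ^ (uint32_t)(src_port ^ dst_port);
}

/* Map a hash value onto one of the currently active slaves. */
static unsigned int
select_slave(uint32_t hash, unsigned int active_slaves)
{
    assert(active_slaves > 0);
    return hash % active_slaves;
}
```

Because XOR is symmetric, both directions of a flow produce the same hash in
this sketch, and every packet of a given flow maps to the same slave for as
long as the number of active slaves does not change.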