..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2015 Intel Corporation.

Link Bonding Poll Mode Driver Library
=====================================

In addition to Poll Mode Drivers (PMDs) for physical and virtual hardware,
DPDK also includes a pure-software library that
allows physical PMDs to be bonded together to create a single logical PMD.

.. figure:: img/bond-overview.*

   Bonding PMDs


The Link Bonding PMD library (librte_net_bond) supports bonding of groups of
``rte_eth_dev`` ports of the same speed and duplex. It provides capabilities
similar to those of the Linux bonding driver, allowing the aggregation
of multiple (member) NICs into a single logical interface between a server
and a switch. The new bonding PMD then processes these interfaces based on
the mode of operation specified to provide support for features such as
redundant links, fault tolerance and/or load balancing.

The librte_net_bond library exports a C API for the
creation of bonding devices as well as the configuration and management of the
bonding device and its member devices.

.. note::

   The Link Bonding PMD Library is enabled by default in the build
   configuration. The library can be disabled using the meson option
   "-Ddisable_drivers=net/bonding".


Link Bonding Modes Overview
---------------------------

Currently the Link Bonding PMD library supports the following modes of operation:

*   **Round-Robin (Mode 0):**

.. figure:: img/bond-mode-0.*

   Round-Robin (Mode 0)


    This mode provides load balancing and fault tolerance by transmission of
    packets in sequential order from the first available member device through
    the last. Packets are bulk dequeued from devices then serviced in a
    round-robin manner. This mode does not guarantee in-order reception of
    packets, so downstream devices should be able to handle out-of-order packets.

*   **Active Backup (Mode 1):**

.. figure:: img/bond-mode-1.*

   Active Backup (Mode 1)


    In this mode only one member in the bond is active at any time. A different
    member becomes active if, and only if, the primary active member fails,
    thereby providing fault tolerance to member failure. The single logical
    bonding interface's MAC address is externally visible on only one NIC (port)
    to avoid confusing the network switch.

*   **Balance XOR (Mode 2):**

.. figure:: img/bond-mode-2.*

   Balance XOR (Mode 2)


    This mode provides transmit load balancing (based on the selected
    transmission policy) and fault tolerance. The default policy (layer 2) uses
    a simple calculation based on the packet flow source and destination MAC
    addresses as well as the number of active members available to the bonding
    device to classify the packet to a specific member to transmit on. The
    alternative transmission policies are layer 2+3, which also takes the IP
    source and destination addresses into the calculation of the transmit
    member port, and layer 3+4, which uses the IP source and destination
    addresses as well as the TCP/UDP source and destination ports.

.. note::
    The different packet colors in the figure identify the different flow
    classifications calculated by the selected transmit policy.


*   **Broadcast (Mode 3):**

.. figure:: img/bond-mode-3.*

   Broadcast (Mode 3)


    This mode provides fault tolerance by transmission of packets on all member
    ports.

*   **Link Aggregation 802.3AD (Mode 4):**

.. figure:: img/bond-mode-4.*

   Link Aggregation 802.3AD (Mode 4)


    This mode provides dynamic link aggregation according to the 802.3ad
    specification. It negotiates and monitors aggregation groups that share the
    same speed and duplex settings using the selected balance transmit policy
    for balancing outgoing traffic.

    The DPDK implementation of this mode places some additional requirements on
    the application.

    #. It needs to call ``rte_eth_tx_burst`` and ``rte_eth_rx_burst`` at
       intervals of less than 100ms.

    #. Calls to ``rte_eth_tx_burst`` must have a buffer size of at least 2xN,
       where N is the number of members. This space is required for LACP
       frames. Additionally, LACP packets are included in the statistics, but
       they are not returned to the application.

*   **Transmit Load Balancing (Mode 5):**

.. figure:: img/bond-mode-5.*

   Transmit Load Balancing (Mode 5)


    This mode provides adaptive transmit load balancing. It dynamically
    changes the transmitting member according to the computed load. Statistics
    are collected in 100ms intervals and scheduled every 10ms.


Implementation Details
----------------------

The librte_net_bond bonding device is compatible with the Ethernet device API
exported by the Ethernet PMDs described in the *DPDK API Reference*.

The Link Bonding Library supports the creation of bonding devices at application
startup time during EAL initialization using the ``--vdev`` option as well as
programmatically via the C API ``rte_eth_bond_create`` function.

Bonding devices support the dynamic addition and removal of member devices using
the ``rte_eth_bond_member_add`` / ``rte_eth_bond_member_remove`` APIs.

After a member device is added to a bonding device, the member is stopped using
``rte_eth_dev_stop`` and then reconfigured using ``rte_eth_dev_configure``.
The RX and TX queues are also reconfigured using ``rte_eth_tx_queue_setup`` /
``rte_eth_rx_queue_setup`` with the parameters used to configure the bonding
device. If RSS is enabled for the bonding device, it is also enabled and
configured on the new member.
Any flow which was configured on the bonding device is also configured on the
added member.

Setting the multi-queue mode of a bonding device to RSS makes it fully
RSS-capable, so all members are synchronized with its configuration. This mode is
intended to make RSS configuration on the members transparent to the client
application.

The bonding device stores its own version of the RSS settings (RETA, RSS hash
function and RSS key), which it uses to set up its members. The RSS
configuration of a bonding device therefore describes the desired configuration
of the whole bond as one unit, without referring to any individual member. This
is required to ensure consistency and makes the configuration more error-proof.

The RSS hash function set for a bonding device is the maximal set of RSS hash
functions supported by all bonding members. The RETA size is the GCD of the
members' RETA sizes, so it can easily be used as a pattern providing the
expected behavior, even if the members' RETA sizes differ. If an RSS key is not
set for the bonding device, it is not changed on the members and the default key
for each device is used.

As with the RSS configuration, flow consistency is maintained across the bonding
members for the following rte flow operations:

Validate:
    - Validate the flow for each member. Failure on at least one member causes
      the bond validation to fail.

Create:
    - Create the flow in all members.
    - Save all the flow objects created in the members in the bonding internal
      flow structure.
    - Failure in flow creation for an existing member rejects the flow.
    - Failure in flow creation for a new member at member adding time rejects
      the member.

Destroy:
    - Destroy the flow in all members and release the bond internal flow
      memory.

Flush:
    - Destroy all the bonding PMD flows in all the members.

.. note::

    Do not call flush directly on the members. It destroys all the member flows,
    which may include external flows or the bond internal LACP flow.

Query:
    - Summarize flow counters from all the members, relevant only for
      ``RTE_FLOW_ACTION_TYPE_COUNT``.

Isolate:
    - Call flow isolate for all members.
    - Failure in flow isolation for an existing member rejects the isolate mode.
    - Failure in flow isolation for a new member at member adding time rejects
      the member.

All settings are managed through the bonding port API and are always propagated
in one direction (from the bonding device to its members).

Link Status Change Interrupts / Polling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link bonding devices support the registration of a link status change callback
using the ``rte_eth_dev_callback_register`` API. The callback is called when the
status of the bonding device changes. For example, in the case of a bonding
device which has 3 members, the link status will change to up when one member
becomes active or change to down when all members become inactive. There is no
callback notification when a single member changes state and the previous
conditions are not met. A user who wishes to monitor individual members must
register callbacks with those members directly.

The link bonding library also supports devices which do not implement link
status change interrupts. This is achieved by polling the device's link status at
a defined period which is set using the ``rte_eth_bond_link_monitoring_set``
API. The default polling interval is 10ms. When a device is added as a member to
a bonding device, the ``RTE_PCI_DRV_INTR_LSC`` flag is used to determine
whether the device supports interrupts or whether the link status should be
monitored by polling it.

Requirements / Limitations
~~~~~~~~~~~~~~~~~~~~~~~~~~

The current implementation only allows devices that support the same speed
and duplex to be added as members to the same bonding device.
The bonding device
inherits these attributes from the first active member added to the bonding
device and then all further members added to the bonding device must support
these parameters.

A bonding device must have a minimum of one member before the bonding device
itself can be started.

To use the dynamic RSS configuration feature of a bonding device effectively,
all members must also be RSS-capable and support at least one
hash function common to all of them. Changing the RSS key is only
possible when all member devices support the same key size.

To prevent inconsistency in how members process packets, once a device is added
to a bonding device, RSS and rte flow configurations should be managed through
the bonding device API, and not directly on the member.

As with all other PMDs, all functions exported by this PMD are lock-free functions
that are assumed not to be invoked in parallel on different logical cores to
work on the same target object.

It should also be noted that the PMD receive function should not be invoked
directly on member devices after they have been added to a bonding device, since
packets read directly from the member device will no longer be available to the
bonding device to read.

Configuration
~~~~~~~~~~~~~

Link bonding devices are created using the ``rte_eth_bond_create`` API
which requires a unique device name, the bonding mode,
and the socket ID to allocate the bonding device's resources on.
The other configurable parameters for a bonding device are its member devices,
its primary member, a user defined MAC address and the transmission policy to
use if the device is in balance XOR mode.

Member Devices
^^^^^^^^^^^^^^

Bonding devices support up to a maximum of ``RTE_MAX_ETHPORTS`` member devices
of the same speed and duplex.
Ethernet devices can be added as a member to a
maximum of one bonding device. Member devices are reconfigured with the
configuration of the bonding device on being added to a bonding device.

The bonding device also guarantees to restore the MAC address of a member device
to its original value when the member is removed from it.

Primary Member
^^^^^^^^^^^^^^

The primary member is used to define the default port to use when a bonding
device is in active backup mode. A different port will be used if, and
only if, the current primary port goes down. If the user does not specify a
primary port it will default to being the first port added to the bonding device.

MAC Address
^^^^^^^^^^^

The bonding device can be configured with a user specified MAC address. This
address will be inherited by some or all member devices depending on the
operating mode. If the device is in active backup mode then only the primary
device will have the user specified MAC; all other members will retain their
original MAC address. In modes 0, 2, 3 and 4 all member devices are configured
with the bonding device's MAC address.

If a user defined MAC address is not defined then the bonding device will
default to using the primary member's MAC address.

Balance XOR Transmit Policies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Three transmission policies are supported for a bonding device running in
Balance XOR mode: Layer 2, Layer 2+3 and Layer 3+4.

*   **Layer 2:**   Ethernet MAC address based balancing is the default
    transmission policy for Balance XOR bonding mode. It uses a simple XOR
    calculation on the source MAC address and destination MAC address of the
    packet and then takes the modulus of this value by the number of active
    members to select the member device to transmit the packet on.

*   **Layer 2 + 3:** Ethernet MAC address & IP Address based balancing uses a
    combination of source/destination MAC addresses and the source/destination
    IP addresses of the data packet to decide which member port the packet will
    be transmitted on.

*   **Layer 3 + 4:**  IP Address & UDP Port based balancing uses a combination
    of source/destination IP Address and the source/destination UDP ports of
    the data packet to decide which member port the packet will be
    transmitted on.

All these policies support 802.1Q VLAN Ethernet packets, as well as IPv4, IPv6
and UDP protocols for load balancing.

Using Link Bonding Devices
--------------------------

The librte_net_bond library supports two ways of creating devices: through the
library's full C API, or through the EAL command line to statically configure
link bonding devices at application startup. Using the EAL option it is possible
to use link bonding functionality transparently without specific knowledge of
the library's API. This can be used, for example, to add bonding functionality,
such as active backup, to an existing application which has no knowledge of
the link bonding C API.

Using the Poll Mode Driver from an Application
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using the librte_net_bond library's API it is possible to dynamically create
and manage link bonding devices from within any application. Link bonding
devices are created using the ``rte_eth_bond_create`` API which requires a
unique device name, the link bonding mode to initialize the device in and finally
the socket ID on which to allocate the device's resources.
After successful
creation of a bonding device it must be configured using the generic Ethernet
device configure API ``rte_eth_dev_configure`` and then the RX and TX queues
which will be used must be set up using ``rte_eth_tx_queue_setup`` /
``rte_eth_rx_queue_setup``.

Member devices can be dynamically added and removed from a link bonding device
using the ``rte_eth_bond_member_add`` / ``rte_eth_bond_member_remove``
APIs but at least one member device must be added to the link bonding device
before it can be started using ``rte_eth_dev_start``.

The link status of a bonding device is dictated by that of its members. If the
link status of all member devices is down, or if all members are removed from
the link bonding device, then the link status of the bonding device will go down.

It is also possible to configure / query the configuration of the control
parameters of a bonding device using the provided APIs
``rte_eth_bond_mode_set/get``, ``rte_eth_bond_primary_set/get``,
``rte_eth_bond_mac_set/reset`` and ``rte_eth_bond_xmit_policy_set/get``.

Using Link Bonding Devices from the EAL Command Line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link bonding devices can be created at application startup time using the
``--vdev`` EAL command line option. The device name must start with the
net_bonding prefix followed by numbers or letters. The name must be unique for
each device. Each device can have multiple options arranged in a comma
separated list. Multiple device definitions can be given by using the
``--vdev`` option multiple times.

Device names and bonding options must be separated by commas as shown below:

.. code-block:: console

   ./<build_dir>/app/dpdk-testpmd -l 0-3 -n 4 --vdev 'net_bonding0,bond_opt0=..,bond_opt1=..' --vdev 'net_bonding1,bond_opt0=..,bond_opt1=..'
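
The naming rule for ``--vdev`` bonding devices can be checked before a command
line is assembled. The following is a minimal sketch in plain C, not part of
DPDK: ``bond_vdev_name_is_valid`` is a hypothetical helper, and it assumes that
at least one letter or digit must follow the net_bonding prefix.

.. code-block:: c

   #include <assert.h>
   #include <ctype.h>
   #include <stdbool.h>
   #include <string.h>

   /* Prefix required for bonding devices created with --vdev. */
   #define BOND_NAME_PREFIX "net_bonding"

   /*
    * Hypothetical helper (not a DPDK function): returns true when the name
    * starts with the net_bonding prefix and is followed only by letters or
    * digits. Requiring at least one trailing character is an assumption.
    */
   static bool
   bond_vdev_name_is_valid(const char *name)
   {
       size_t prefix_len = strlen(BOND_NAME_PREFIX);
       size_t i;

       assert(name != NULL);

       /* The name must start with the net_bonding prefix. */
       if (strncmp(name, BOND_NAME_PREFIX, prefix_len) != 0)
           return false;

       /* Something must follow the prefix. */
       if (name[prefix_len] == '\0')
           return false;

       /* Only numbers or letters may follow the prefix. */
       for (i = prefix_len; name[i] != '\0'; i++) {
           if (!isalnum((unsigned char)name[i]))
               return false;
       }
       return true;
   }

With this sketch, ``net_bonding0`` is accepted, while ``net_bond0`` (wrong
prefix) and ``net_bonding_0`` (underscore after the prefix) are rejected.
Uniqueness across devices is not checked here.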

Link Bonding EAL Options
^^^^^^^^^^^^^^^^^^^^^^^^

The options can be combined in multiple ways as
long as the following rules are respected:

*   A unique device name, in the format of net_bondingX is provided,
    where X can be any combination of numbers and/or letters,
    and the name is no greater than 32 characters long.

*   At least one member device is provided for each bonding device definition.

*   The operation mode of the bonding device being created is provided.

The different options are:

*   mode: Integer value defining the bonding mode of the device.
    Currently supports modes 0,1,2,3,4,5 (round-robin, active backup, balance,
    broadcast, link aggregation, transmit load balancing).

.. code-block:: console

   mode=2

*   member: Defines the PMD device which will be added as a member to the bonding
    device. This option can be given multiple times, once for each device to be
    added as a member. Physical devices should be specified using their PCI
    address, in the format domain:bus:devid.function

.. code-block:: console

   member=0000:0a:00.0,member=0000:0a:00.1

*   primary: Optional parameter which defines the primary member port. It
    is used in active backup mode to select the primary member for data TX/RX if
    it is available. The primary port is also used to select the MAC address to
    use when it is not defined by the user. This defaults to the first member
    added to the device if it is not specified. The primary device must be a
    member of the bonding device.

.. code-block:: console

   primary=0000:0a:00.0

*   socket_id: Optional parameter used to select which socket on a NUMA device
    the bonding device's resources will be allocated on.

.. code-block:: console

   socket_id=0

*   mac: Optional parameter to select a MAC address for the link bonding device.
    This overrides the value of the primary member device.

.. code-block:: console

   mac=00:1e:67:1d:fd:1d

*   xmit_policy: Optional parameter which defines the transmission policy when
    the bonding device is in balance mode. If not specified by the user, this
    defaults to l2 (layer 2) forwarding. The other transmission policies
    available are l23 (layer 2+3) and l34 (layer 3+4).

.. code-block:: console

   xmit_policy=l23

*   lsc_poll_period_ms: Optional parameter which defines the polling interval
    in milliseconds at which devices which don't support LSC interrupts are
    checked for a change in the device's link status.

.. code-block:: console

   lsc_poll_period_ms=100

*   up_delay: Optional parameter which adds a delay in milliseconds to the
    propagation of a device's link status changing to up. By default this
    parameter is zero.

.. code-block:: console

   up_delay=10

*   down_delay: Optional parameter which adds a delay in milliseconds to the
    propagation of a device's link status changing to down. By default this
    parameter is zero.

.. code-block:: console

   down_delay=50

Examples of Usage
^^^^^^^^^^^^^^^^^

Create a bonding device in round robin mode with two members specified by their PCI address:

.. code-block:: console

   ./<build_dir>/app/dpdk-testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=0,member=0000:0a:00.01,member=0000:04:00.00' -- --port-topology=chained

Create a bonding device in round robin mode with two members specified by their PCI address and an overriding MAC address:

.. code-block:: console

   ./<build_dir>/app/dpdk-testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=0,member=0000:0a:00.01,member=0000:04:00.00,mac=00:1e:67:1d:fd:1d' -- --port-topology=chained

Create a bonding device in active backup mode with two members and a primary member, all specified by their PCI addresses:

.. code-block:: console

   ./<build_dir>/app/dpdk-testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=1,member=0000:0a:00.01,member=0000:04:00.00,primary=0000:0a:00.01' -- --port-topology=chained

Create a bonding device in balance mode with two members specified by their PCI addresses, and a transmission policy of layer 3 + 4 forwarding:

.. code-block:: console

   ./<build_dir>/app/dpdk-testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=2,member=0000:0a:00.01,member=0000:04:00.00,xmit_policy=l34' -- --port-topology=chained

.. _bonding_testpmd_commands:

Testpmd driver specific commands
--------------------------------

Some bonding driver specific features are integrated in testpmd.

create bonding device
~~~~~~~~~~~~~~~~~~~~~

Create a new bonding device::

   testpmd> create bonding device (mode) (socket)

For example, to create a bonding device in mode 1 on socket 0::

   testpmd> create bonding device 1 0
   created new bonding device (port X)

add bonding member
~~~~~~~~~~~~~~~~~~

Add an Ethernet device to a Link Bonding device::

   testpmd> add bonding member (member id) (port id)

For example, to add Ethernet device (port 6) to a Link Bonding device (port 10)::

   testpmd> add bonding member 6 10


remove bonding member
~~~~~~~~~~~~~~~~~~~~~

Remove an Ethernet member device from a Link Bonding device::

   testpmd> remove bonding member (member id) (port id)

For example, to remove Ethernet member device (port 6) from a Link Bonding device (port 10)::

   testpmd> remove bonding member 6 10

set bonding mode
~~~~~~~~~~~~~~~~

Set the Link Bonding mode of a Link Bonding device::

   testpmd> set bonding mode (value) (port id)

For example, to set the bonding mode of a Link Bonding device (port 10) to broadcast (mode 3)::

   testpmd> set bonding mode 3 10

set bonding primary
~~~~~~~~~~~~~~~~~~~

Set an Ethernet member device as the primary device on a Link Bonding device::

   testpmd> set bonding primary (member id) (port id)

For example, to set the Ethernet member device (port 6) as the primary port of a Link Bonding device (port 10)::

   testpmd> set bonding primary 6 10

set bonding mac
~~~~~~~~~~~~~~~

Set the MAC address of a Link Bonding device::

   testpmd> set bonding mac (port id) (mac)

For example, to set the MAC address of a Link Bonding device (port 10) to 00:00:00:00:00:01::

   testpmd> set bonding mac 10 00:00:00:00:00:01

set bonding balance_xmit_policy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Set the transmission policy for a Link Bonding device when it is in Balance XOR mode::

   testpmd> set bonding balance_xmit_policy (port_id) (l2|l23|l34)

For example, to set a Link Bonding device (port 10) to use a balance policy of layer 3+4 (IP addresses & UDP ports)::

   testpmd> set bonding balance_xmit_policy 10 l34


set bonding mon_period
~~~~~~~~~~~~~~~~~~~~~~

Set the link status monitoring polling period in milliseconds for a bonding device.

This adds support for PMD member devices which do not support link status interrupts.
When the mon_period is set to a value greater than 0, all PMDs which do not support
link status ISR will be queried every polling interval to check if their link status has changed::

   testpmd> set bonding mon_period (port_id) (value)

For example, to set the link status monitoring polling period of bonding device (port 5) to 150ms::

   testpmd> set bonding mon_period 5 150


set bonding lacp dedicated_queue
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Enable dedicated TX/RX queues on the members of a bonding device to handle LACP control plane traffic
when in mode 4 (link-aggregation-802.3ad)::

   testpmd> set bonding lacp dedicated_queues (port_id) (enable|disable)


set bonding agg_mode
~~~~~~~~~~~~~~~~~~~~

Select one of the aggregator modes when in mode 4 (link-aggregation-802.3ad)::

   testpmd> set bonding agg_mode (port_id) (bandwidth|count|stable)


show bonding config
~~~~~~~~~~~~~~~~~~~

Show the current configuration of a Link Bonding device.
It also shows link-aggregation-802.3ad information if the link mode is mode 4::

   testpmd> show bonding config (port id)

For example,
to show the configuration of a Link Bonding device (port 9) with 3 member devices (1, 3, 4)
in balance mode with a transmission policy of layer 2+3::

   testpmd> show bonding config 9
     - Dev basic:
        Bonding mode: BALANCE(2)
        Balance Xmit Policy: BALANCE_XMIT_POLICY_LAYER23
        Members (3): [1 3 4]
        Active Members (3): [1 3 4]
        Primary: [3]
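
To relate the reported balance transmit policy to member selection, the simplest
policy, layer 2, can be modelled in a few lines of plain C. This is a sketch
that follows the description given in this guide (XOR of the source and
destination MAC addresses, reduced modulo the number of active members). The
name ``l2_xor_member_index`` and the byte-wise folding are assumptions made for
illustration; the hash used inside the bonding PMD may differ in detail.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>

   #define ETHER_ADDR_LEN 6

   /*
    * Hypothetical model (not the PMD's own function): returns the index,
    * within the list of active members, of the member chosen to transmit
    * a packet with the given source and destination MAC addresses.
    */
   static uint16_t
   l2_xor_member_index(const uint8_t src[ETHER_ADDR_LEN],
                       const uint8_t dst[ETHER_ADDR_LEN],
                       uint16_t active_members)
   {
       uint8_t hash = 0;
       int i;

       assert(active_members > 0);

       /* Fold the byte-wise XOR of both addresses into a single value. */
       for (i = 0; i < ETHER_ADDR_LEN; i++)
           hash ^= src[i] ^ dst[i];

       /* Map the hash onto one of the active members. */
       return hash % active_members;
   }

With three active members, source ``00:1e:67:1d:fd:1d`` and destination
``00:00:00:00:00:01`` fold to 0x85 (133), which selects index 1. Because XOR is
symmetric, both directions of a flow map to the same index in this model.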