..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2015 Intel Corporation.

Link Bonding Poll Mode Driver Library
=====================================

In addition to Poll Mode Drivers (PMDs) for physical and virtual hardware,
DPDK also includes a pure-software library that
allows physical PMDs to be bonded together to create a single logical PMD.

.. figure:: img/bond-overview.*

   Bonded PMDs


The Link Bonding PMD library (librte_pmd_bond) supports bonding of groups of
``rte_eth_dev`` ports of the same speed and duplex. It provides capabilities
similar to those of the Linux bonding driver, aggregating multiple (slave)
NICs into a single logical interface between a server and a switch. The bonded
PMD then processes these interfaces according to the specified mode of
operation to support features such as redundant links, fault tolerance and/or
load balancing.

The librte_pmd_bond library exports a C API for creating bonded devices and
for configuring and managing a bonded device and its slave devices.

.. note::

    The Link Bonding PMD Library is enabled by default in the build
    configuration files. The library can be disabled by setting
    ``CONFIG_RTE_LIBRTE_PMD_BOND=n`` and recompiling the DPDK.

Link Bonding Modes Overview
---------------------------

Currently the Link Bonding PMD library supports the following modes of operation:

*   **Round-Robin (Mode 0):**

.. figure:: img/bond-mode-0.*

   Round-Robin (Mode 0)


    This mode provides load balancing and fault tolerance by transmission of
    packets in sequential order from the first available slave device through
    the last. Packets are bulk dequeued from devices then serviced in a
    round-robin manner. This mode does not guarantee in-order reception of
    packets, so downstream devices should be able to handle out-of-order
    packets.

*   **Active Backup (Mode 1):**

.. figure:: img/bond-mode-1.*

   Active Backup (Mode 1)


    In this mode only one slave in the bond is active at any time. A different
    slave becomes active if, and only if, the primary active slave fails,
    thereby providing fault tolerance to slave failure. The single logical
    bonded interface's MAC address is externally visible on only one NIC (port)
    to avoid confusing the network switch.

*   **Balance XOR (Mode 2):**

.. figure:: img/bond-mode-2.*

   Balance XOR (Mode 2)


    This mode provides transmit load balancing (based on the selected
    transmission policy) and fault tolerance. The default policy (layer 2) uses
    a simple calculation based on the packet flow source and destination MAC
    addresses as well as the number of active slaves available to the bonded
    device to classify the packet to a specific slave to transmit on. Two
    alternate transmission policies are supported: layer 2+3, which also takes
    the IP source and destination addresses into the calculation of the
    transmit slave port, and layer 3+4, which uses the IP source and
    destination addresses as well as the TCP/UDP source and destination ports.

.. note::
    The different colors of the packets identify the flow classification
    calculated by the selected transmit policy.


*   **Broadcast (Mode 3):**

.. figure:: img/bond-mode-3.*

   Broadcast (Mode 3)


    This mode provides fault tolerance by transmission of packets on all slave
    ports.

*   **Link Aggregation 802.3AD (Mode 4):**

.. figure:: img/bond-mode-4.*

   Link Aggregation 802.3AD (Mode 4)


    This mode provides dynamic link aggregation according to the 802.3ad
    specification. It negotiates and monitors aggregation groups that share the
    same speed and duplex settings using the selected balance transmit policy
    for balancing outgoing traffic.

    The DPDK implementation of this mode places some additional requirements
    on the application.

    #. It needs to call ``rte_eth_tx_burst`` and ``rte_eth_rx_burst`` at
       intervals of less than 100ms.

    #. Calls to ``rte_eth_tx_burst`` must have a buffer size of at least 2xN,
       where N is the number of slaves. This space is required for LACP
       frames. Additionally, LACP packets are included in the statistics, but
       they are not returned to the application.

*   **Transmit Load Balancing (Mode 5):**

.. figure:: img/bond-mode-5.*

   Transmit Load Balancing (Mode 5)


    This mode provides adaptive transmit load balancing. It dynamically
    changes the transmitting slave according to the computed load. Statistics
    are collected in 100ms intervals and scheduled every 10ms.
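
The round-robin distribution described for Mode 0 above can be illustrated with
a small, self-contained sketch. It is not the librte_pmd_bond implementation:
the function ``rr_distribute`` and its parameters are hypothetical, and packets
are represented only by their position in a burst.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Hypothetical helper, not part of librte_pmd_bond: assign each packet
     * of a burst to one of the active slaves in round-robin order.
     *
     * start        - index of the slave that receives the first packet
     * nb_slaves    - number of active slaves (must be > 0)
     * nb_pkts      - number of packets in the burst
     * slave_of_pkt - output, slave_of_pkt[i] is the slave chosen for packet i
     *
     * Returns the slave index to start from on the next burst, or -1 on error.
     */
    static int
    rr_distribute(unsigned int start, unsigned int nb_slaves,
                  size_t nb_pkts, uint16_t *slave_of_pkt)
    {
        size_t i;

        if (nb_slaves == 0 || slave_of_pkt == NULL)
            return -1;

        for (i = 0; i < nb_pkts; i++)
            slave_of_pkt[i] = (uint16_t)((start + i) % nb_slaves);

        return (int)((start + nb_pkts) % nb_slaves);
    }

Carrying the returned index into the next call keeps the distribution even
across bursts rather than always starting from the first slave.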


Implementation Details
----------------------

The librte_pmd_bond bonded devices are compatible with the Ethernet device API
exported by the Ethernet PMDs described in the *DPDK API Reference*.

The Link Bonding Library supports the creation of bonded devices at application
startup time during EAL initialization using the ``--vdev`` option as well as
programmatically via the C API ``rte_eth_bond_create`` function.

Bonded devices support the dynamic addition and removal of slave devices using
the ``rte_eth_bond_slave_add`` / ``rte_eth_bond_slave_remove`` APIs.

After a slave device is added to a bonded device, the slave is stopped using
``rte_eth_dev_stop`` and then reconfigured using ``rte_eth_dev_configure``.
The RX and TX queues are also reconfigured using ``rte_eth_tx_queue_setup`` /
``rte_eth_rx_queue_setup`` with the parameters used to configure the bonding
device. If RSS is enabled for the bonding device, this mode is also enabled on
the new slave and configured as well.

Setting the multi-queue mode of a bonding device to RSS makes it fully
RSS-capable, so all slaves are synchronized with its configuration. This mode is
intended to provide RSS configuration on slaves that is transparent to the
client application implementation.

The bonding device stores its own version of the RSS settings, i.e. RETA, RSS
hash function and RSS key, which it uses to set up its slaves. This allows the
RSS configuration of the bonding device to be defined as the desired
configuration of the whole bond (as one unit), without referring to any
individual slave. This is required to ensure consistency and makes the
configuration more error-proof.

The RSS hash function set for a bonding device is the maximal set of RSS hash
functions supported by all bonded slaves. The RETA size is the GCD of the RETA
sizes of all its slaves, so it can easily be used as a pattern providing the
expected behavior, even if the slaves' RETA sizes differ. If the RSS key is not
set for the bonded device, it is not changed on the slaves and the default key
for each device is used.
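
As a minimal sketch of the two rules above, the following code derives the
common hash functions (the bitwise AND of the slaves' capability masks) and the
GCD of the slaves' RETA sizes. The structure ``slave_rss_caps`` and the
functions ``gcd_u16`` and ``bond_rss_caps`` are hypothetical names used only
for illustration; they are not part of the librte_pmd_bond API.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Greatest common divisor (Euclid's algorithm). */
    static uint16_t
    gcd_u16(uint16_t a, uint16_t b)
    {
        while (b != 0) {
            uint16_t t = (uint16_t)(a % b);
            a = b;
            b = t;
        }
        return a;
    }

    /* Hypothetical per-device RSS capabilities, for illustration only. */
    struct slave_rss_caps {
        uint64_t hash_functions; /* bitmask of supported RSS hash functions */
        uint16_t reta_size;      /* number of RETA entries */
    };

    /*
     * Derive the bonding device's RSS capabilities from its slaves: the hash
     * functions supported by every slave and the GCD of the RETA sizes.
     * Returns 0 on success, -1 if the arguments are invalid.
     */
    static int
    bond_rss_caps(const struct slave_rss_caps *slaves, size_t nb_slaves,
                  struct slave_rss_caps *bond)
    {
        size_t i;

        if (slaves == NULL || bond == NULL || nb_slaves == 0)
            return -1;

        *bond = slaves[0];
        for (i = 1; i < nb_slaves; i++) {
            bond->hash_functions &= slaves[i].hash_functions;
            bond->reta_size = gcd_u16(bond->reta_size, slaves[i].reta_size);
        }
        return 0;
    }

For two slaves with RETA sizes 128 and 512, this sketch yields a bonding RETA
size of 128, which divides both slave tables evenly.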

All settings are managed through the bonding port API and are always propagated
in one direction (from the bonding device to its slaves).

Link Status Change Interrupts / Polling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link bonding devices support the registration of a link status change callback
using the ``rte_eth_dev_callback_register`` API. The callback is called when
the status of the bonding device changes. For example, in the case of a bonding
device which has 3 slaves, the link status will change to up when one slave
becomes active or change to down when all slaves become inactive. There is no
callback notification when a single slave changes state and the previous
conditions are not met. Users who wish to monitor individual slaves must
register callbacks with those slaves directly.
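
The notification rule above can be modelled with a short sketch. It only
illustrates when the bonding device's status changes; the functions
``bond_link_up`` and ``slave_state_change`` are hypothetical and do not
correspond to library calls.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Hypothetical model of a bonding device's link state: the bond is up
     * if at least one slave is active and down otherwise.
     */
    static bool
    bond_link_up(const bool *slave_active, size_t nb_slaves)
    {
        size_t i;

        for (i = 0; i < nb_slaves; i++)
            if (slave_active[i])
                return true;
        return false;
    }

    /*
     * Record a slave state change and report whether the bonding device's
     * own link status changed as a result, i.e. whether a callback
     * registered on the bonding device would be notified.
     */
    static bool
    slave_state_change(bool *slave_active, size_t nb_slaves, size_t slave,
                       bool active)
    {
        bool before, after;

        if (slave >= nb_slaves)
            return false;

        before = bond_link_up(slave_active, nb_slaves);
        slave_active[slave] = active;
        after = bond_link_up(slave_active, nb_slaves);

        return before != after;
    }

With three inactive slaves, the first slave to become active changes the bond
to up, while a second slave becoming active changes nothing at the bond level.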

The link bonding library also supports devices which do not implement link
status change interrupts. This is achieved by polling the devices' link status
at a defined period which is set using the ``rte_eth_bond_link_monitoring_set``
API. The default polling interval is 10ms. When a device is added as a slave to
a bonding device, the ``RTE_PCI_DRV_INTR_LSC`` flag is used to determine
whether the device supports interrupts or whether its link status should be
monitored by polling.

Requirements / Limitations
~~~~~~~~~~~~~~~~~~~~~~~~~~

The current implementation only allows devices of the same speed and duplex to
be added as slaves to the same bonded device. The bonded device inherits these
attributes from the first active slave added to it, and all further slaves
added to the bonded device must support these parameters.
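
The inheritance rule above might be sketched as follows. This is a simplified
illustration under stated assumptions: every added slave is treated as active,
and the types ``link_props`` and ``bond_props`` and the function
``bond_check_slave`` are hypothetical rather than taken from the library.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical link attributes, for illustration only. */
    struct link_props {
        uint32_t speed_mbps;
        bool full_duplex;
    };

    /* Hypothetical bonding state: attributes inherited from the first slave. */
    struct bond_props {
        bool has_props;          /* false until the first slave is added */
        struct link_props props;
    };

    /*
     * Check a slave against the bonded device. The first slave defines the
     * bond's attributes; later slaves must match them.
     * Returns 0 if the slave is accepted, -1 otherwise.
     */
    static int
    bond_check_slave(struct bond_props *bond, const struct link_props *slave)
    {
        if (!bond->has_props) {
            bond->props = *slave;
            bond->has_props = true;
            return 0;
        }
        if (bond->props.speed_mbps != slave->speed_mbps ||
            bond->props.full_duplex != slave->full_duplex)
            return -1;
        return 0;
    }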

A bonding device must have a minimum of one slave before the bonding device
itself can be started.

To use the dynamic RSS configuration feature of a bonding device effectively,
all slaves must also be RSS-capable and support at least one common hash
function. Changing the RSS key is only possible when all slave devices support
the same key size.

To prevent inconsistency in how slaves process packets, once a device is added
to a bonding device, RSS configuration should be managed through the bonding
device API, and not directly on the slave.

As with all other PMDs, all functions exported by a PMD are lock-free functions
that are assumed not to be invoked in parallel on different logical cores to
work on the same target object.

It should also be noted that the PMD receive function should not be invoked
directly on slave devices after they have been added to a bonded device, since
packets read directly from a slave device will no longer be available to the
bonded device to read.

Configuration
~~~~~~~~~~~~~

Link bonding devices are created using the ``rte_eth_bond_create`` API,
which requires a unique device name, the bonding mode,
and the socket ID on which to allocate the bonding device's resources.
The other configurable parameters for a bonded device are its slave devices,
its primary slave, a user-defined MAC address and the transmission policy to
use if the device is in balance XOR mode.

Slave Devices
^^^^^^^^^^^^^

Bonding devices support up to a maximum of ``RTE_MAX_ETHPORTS`` slave devices
of the same speed and duplex. An Ethernet device can be added as a slave to a
maximum of one bonded device. Slave devices are reconfigured with the
configuration of the bonded device on being added to a bonded device.

The bonded device also guarantees to return the MAC address of a slave device
to its original value on removal of the slave from it.

Primary Slave
^^^^^^^^^^^^^

The primary slave is used to define the default port to use when a bonded
device is in active backup mode. A different port will be used if, and only
if, the current primary port goes down. If the user does not specify a
primary port, it defaults to the first port added to the bonded device.
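
A minimal sketch of this selection rule follows. The function
``active_backup_select`` is hypothetical, and the choice of the first other
slave with link as the backup is an assumption made for the example; the text
above does not state how the backup port is chosen.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Hypothetical sketch of port selection in active backup mode.
     * link_up[i] is the link state of slave i; primary is the index of the
     * primary slave. Returns the index of the slave to use, or -1 if no
     * slave has link.
     */
    static int
    active_backup_select(const bool *link_up, size_t nb_slaves, size_t primary)
    {
        size_t i;

        /* The primary port is used whenever it is up. */
        if (primary < nb_slaves && link_up[primary])
            return (int)primary;

        /* Assumption: otherwise fall back to the first slave that is up. */
        for (i = 0; i < nb_slaves; i++)
            if (link_up[i])
                return (int)i;

        return -1;
    }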

MAC Address
^^^^^^^^^^^

The bonded device can be configured with a user-specified MAC address. This
address is inherited by some or all slave devices depending on the operating
mode. If the device is in active backup mode then only the primary device will
have the user-specified MAC; all other slaves will retain their original MAC
addresses. In modes 0, 2, 3 and 4 all slave devices are configured with the
bonded device's MAC address.

If a user-defined MAC address is not defined then the bonded device will
default to using the primary slave's MAC address.

Balance XOR Transmit Policies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are 3 supported transmission policies for a bonded device running in
Balance XOR mode: Layer 2, Layer 2+3 and Layer 3+4.

*   **Layer 2:**   Ethernet MAC address based balancing is the default
    transmission policy for Balance XOR bonding mode. It uses a simple XOR
    calculation on the source MAC address and destination MAC address of the
    packet and then calculates the modulus of this value to determine the slave
    device to transmit the packet on.

*   **Layer 2 + 3:** Ethernet MAC address & IP Address based balancing uses a
    combination of source/destination MAC addresses and the source/destination
    IP addresses of the data packet to decide which slave port the packet will
    be transmitted on.

*   **Layer 3 + 4:**  IP Address & UDP Port based balancing uses a combination
    of source/destination IP addresses and the source/destination UDP ports of
    the data packet to decide which slave port the packet will be transmitted
    on.

All these policies support 802.1Q VLAN Ethernet packets, as well as IPv4, IPv6
and UDP protocols for load balancing.
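
The Layer 2 policy can be sketched as below. This is an illustration of the
XOR-then-modulus idea, not the exact hash used by librte_pmd_bond: the function
``xmit_l2_slave`` is hypothetical, and folding the addresses byte by byte is an
assumption made to keep the example short.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    #define MAC_ADDR_LEN 6

    /*
     * Hypothetical layer 2 transmit policy: XOR the source and destination
     * MAC addresses byte by byte, fold the result into one value and take
     * it modulo the number of active slaves. Returns the slave index, or -1
     * if there are no active slaves.
     */
    static int
    xmit_l2_slave(const uint8_t src[MAC_ADDR_LEN],
                  const uint8_t dst[MAC_ADDR_LEN], unsigned int nb_slaves)
    {
        uint8_t hash = 0;
        size_t i;

        if (nb_slaves == 0)
            return -1;

        for (i = 0; i < MAC_ADDR_LEN; i++)
            hash ^= (uint8_t)(src[i] ^ dst[i]);

        return (int)(hash % nb_slaves);
    }

Because XOR is symmetric, both directions of a flow between the same pair of
MAC addresses map to the same slave in this sketch.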

Using Link Bonding Devices
--------------------------

The librte_pmd_bond library supports two modes of device creation: using the
library's full C API, or using the EAL command line to statically configure
link bonding devices at application startup. Using the EAL option it is
possible to use link bonding functionality transparently without specific
knowledge of the library's API. This can be used, for example, to add bonding
functionality, such as active backup, to an existing application which has no
knowledge of the link bonding C API.

Using the Poll Mode Driver from an Application
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using the librte_pmd_bond library's API it is possible to dynamically create
and manage link bonding devices from within any application. Link bonding
devices are created using the ``rte_eth_bond_create`` API, which requires a
unique device name, the link bonding mode in which to initialize the device and
finally the socket ID on which to allocate the device's resources. After
successful creation of a bonding device it must be configured using the generic
Ethernet device configure API ``rte_eth_dev_configure``, and then the RX and TX
queues which will be used must be set up using ``rte_eth_tx_queue_setup`` /
``rte_eth_rx_queue_setup``.

Slave devices can be dynamically added to and removed from a link bonding
device using the ``rte_eth_bond_slave_add`` / ``rte_eth_bond_slave_remove``
APIs, but at least one slave device must be added to the link bonding device
before it can be started using ``rte_eth_dev_start``.

The link status of a bonded device is dictated by that of its slaves. If the
link status of all slave devices is down, or if all slaves are removed from the
link bonding device, then the link status of the bonding device will go down.

It is also possible to configure / query the configuration of the control
parameters of a bonded device using the provided APIs
``rte_eth_bond_mode_set/get``, ``rte_eth_bond_primary_set/get``,
``rte_eth_bond_mac_set/reset`` and ``rte_eth_bond_xmit_policy_set/get``.

Using Link Bonding Devices from the EAL Command Line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link bonding devices can be created at application startup time using the
``--vdev`` EAL command line option. The device name must start with the
net_bonding prefix followed by numbers or letters. The name must be unique for
each device. Each device can have multiple options arranged in a
comma-separated list. Multiple device definitions can be provided by specifying
the ``--vdev`` option multiple times.

Device names and bonding options must be separated by commas as shown below:

.. code-block:: console

    $RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,bond_opt0=..,bond_opt1=..' --vdev 'net_bonding1,bond_opt0=..,bond_opt1=..'

Link Bonding EAL Options
^^^^^^^^^^^^^^^^^^^^^^^^

Device definitions can be specified and combined in multiple ways as long as
the following rules are respected:

*   A unique device name, in the format of net_bondingX is provided,
    where X can be any combination of numbers and/or letters,
    and the name is no greater than 32 characters long.

*   At least one slave device is provided for each bonded device definition.

*   The operation mode of the bonded device being created is provided.
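
The naming rule in the first item above could be checked with a sketch like the
following. The function ``bond_name_valid`` is hypothetical and not part of the
library, and requiring at least one character after the prefix is an assumption
made for the example.

.. code-block:: c

    #include <ctype.h>
    #include <stdbool.h>
    #include <string.h>

    #define BOND_NAME_PREFIX "net_bonding"
    #define BOND_NAME_MAX_LEN 32

    /*
     * Hypothetical check of the device naming rule: the name starts with
     * "net_bonding", is followed only by letters and digits (at least one,
     * by assumption), and is at most 32 characters long.
     */
    static bool
    bond_name_valid(const char *name)
    {
        size_t prefix_len = strlen(BOND_NAME_PREFIX);
        size_t len, i;

        if (name == NULL)
            return false;

        len = strlen(name);
        if (len <= prefix_len || len > BOND_NAME_MAX_LEN)
            return false;
        if (strncmp(name, BOND_NAME_PREFIX, prefix_len) != 0)
            return false;

        for (i = prefix_len; i < len; i++)
            if (!isalnum((unsigned char)name[i]))
                return false;

        return true;
    }

Under these assumptions ``net_bonding0`` and ``net_bondingA1`` are accepted,
while ``net_bond0`` and ``net_bonding_0`` are rejected.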

The different options are:

*   mode: Integer value defining the bonding mode of the device.
    Currently supports modes 0,1,2,3,4,5 (round-robin, active backup, balance,
    broadcast, link aggregation, transmit load balancing).

.. code-block:: console

        mode=2

*   slave: Defines the PMD device which will be added as a slave to the bonded
    device. This option can be specified multiple times, once for each device
    to be added as a slave. Physical devices should be specified using their
    PCI address, in the format domain:bus:devid.function

.. code-block:: console

        slave=0000:0a:00.0,slave=0000:0a:00.1

*   primary: Optional parameter which defines the primary slave port. It is
    used in active backup mode to select the primary slave for data TX/RX if
    it is available. The primary port is also used to select the MAC address
    to use when it is not defined by the user. If not specified, this defaults
    to the first slave added to the device. The primary device must be a slave
    of the bonded device.

.. code-block:: console

        primary=0000:0a:00.0

*   socket_id: Optional parameter used to select which socket on a NUMA device
    the bonded device's resources will be allocated on.

.. code-block:: console

        socket_id=0

*   mac: Optional parameter to select a MAC address for the link bonding
    device. This overrides the value of the primary slave device.

.. code-block:: console

        mac=00:1e:67:1d:fd:1d

*   xmit_policy: Optional parameter which defines the transmission policy when
    the bonded device is in balance mode. If not specified by the user, this
    defaults to l2 (layer 2) forwarding. The other transmission policies
    available are l23 (layer 2+3) and l34 (layer 3+4).

.. code-block:: console

        xmit_policy=l23

*   lsc_poll_period_ms: Optional parameter which defines the polling interval
    in milliseconds at which devices which don't support LSC interrupts are
    checked for a change in the device's link status.

.. code-block:: console

        lsc_poll_period_ms=100

*   up_delay: Optional parameter which adds a delay in milliseconds to the
    propagation of a device's link status changing to up. By default this
    parameter is zero.

.. code-block:: console

        up_delay=10

*   down_delay: Optional parameter which adds a delay in milliseconds to the
    propagation of a device's link status changing to down. By default this
    parameter is zero.

.. code-block:: console

        down_delay=50
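
An application that assembles its own EAL arguments could build the device
string from these options. The helper below is a hypothetical sketch, not a
library function: ``bond_vdev_arg`` only formats the name, the ``mode`` option
and one ``slave`` option per device, and leaves the optional parameters out.

.. code-block:: c

    #include <stddef.h>
    #include <stdio.h>

    /*
     * Hypothetical helper: write "name,mode=M,slave=S0,slave=S1,..." into
     * buf. Returns the length of the string, or -1 if it does not fit or
     * the arguments are invalid.
     */
    static int
    bond_vdev_arg(char *buf, size_t len, const char *name, int mode,
                  const char *const *slaves, size_t nb_slaves)
    {
        size_t used, i;
        int n;

        if (buf == NULL || name == NULL || slaves == NULL || nb_slaves == 0)
            return -1;

        n = snprintf(buf, len, "%s,mode=%d", name, mode);
        if (n < 0 || (size_t)n >= len)
            return -1;
        used = (size_t)n;

        for (i = 0; i < nb_slaves; i++) {
            n = snprintf(buf + used, len - used, ",slave=%s", slaves[i]);
            if (n < 0 || (size_t)n >= len - used)
                return -1;
            used += (size_t)n;
        }
        return (int)used;
    }

The resulting string is what would follow ``--vdev`` on the command line.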

Examples of Usage
^^^^^^^^^^^^^^^^^

Create a bonded device in round robin mode with two slaves specified by their PCI address:

.. code-block:: console

    $RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=0,slave=0000:0a:00.01,slave=0000:04:00.00' -- --port-topology=chained

Create a bonded device in round robin mode with two slaves specified by their PCI address and an overriding MAC address:

.. code-block:: console

    $RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=0,slave=0000:0a:00.01,slave=0000:04:00.00,mac=00:1e:67:1d:fd:1d' -- --port-topology=chained

Create a bonded device in active backup mode with two slaves and a primary slave, all specified by their PCI addresses:

.. code-block:: console

    $RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=1,slave=0000:0a:00.01,slave=0000:04:00.00,primary=0000:0a:00.01' -- --port-topology=chained

Create a bonded device in balance mode with two slaves specified by their PCI addresses, and a transmission policy of layer 3 + 4 forwarding:

.. code-block:: console

    $RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=2,slave=0000:0a:00.01,slave=0000:04:00.00,xmit_policy=l34' -- --port-topology=chained

460