..  BSD LICENSE
    Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
    * Neither the name of Intel Corporation nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Link Bonding Poll Mode Driver Library
=====================================

In addition to Poll Mode Drivers (PMDs) for physical and virtual hardware,
DPDK also includes a pure-software library that
allows physical PMDs to be bonded together to create a single logical PMD.

.. figure:: img/bond-overview.*

   Bonded PMDs


The Link Bonding PMD library (librte_pmd_bond) supports bonding of groups of
``rte_eth_dev`` ports of the same speed and duplex. It provides capabilities
similar to those of the Linux bonding driver, allowing the
aggregation of multiple (slave) NICs into a single logical interface between a
server and a switch. The new bonded PMD then processes these interfaces
based on the specified mode of operation to support features such
as redundant links, fault tolerance and/or load balancing.

The librte_pmd_bond library exports a C API for the
creation of bonded devices as well as the configuration and management of the
bonded device and its slave devices.

.. note::

    The Link Bonding PMD Library is enabled by default in the build
    configuration files. The library can be disabled by setting
    ``CONFIG_RTE_LIBRTE_PMD_BOND=n`` and recompiling the DPDK.

Link Bonding Modes Overview
---------------------------

Currently the Link Bonding PMD library supports 6 modes of operation:

*   **Round-Robin (Mode 0):**

.. figure:: img/bond-mode-0.*

   Round-Robin (Mode 0)


    This mode provides load balancing and fault tolerance by transmitting
    packets in sequential order from the first available slave device through
    the last. Packets are bulk dequeued from devices then serviced in a
    round-robin manner. This mode does not guarantee in-order reception of
    packets, so downstream devices should be able to handle out-of-order
    packets.

*   **Active Backup (Mode 1):**

.. figure:: img/bond-mode-1.*

   Active Backup (Mode 1)


    In this mode only one slave in the bond is active at any time. A different
    slave becomes active if, and only if, the primary active slave fails,
    thereby providing fault tolerance to slave failure. The single logical
    bonded interface's MAC address is externally visible on only one NIC (port)
    to avoid confusing the network switch.

*   **Balance XOR (Mode 2):**

.. figure:: img/bond-mode-2.*

   Balance XOR (Mode 2)


    This mode provides transmit load balancing (based on the selected
    transmission policy) and fault tolerance. The default policy (layer 2) uses
    a simple calculation based on the packet flow source and destination MAC
    addresses as well as the number of active slaves available to the bonded
    device to classify the packet to a specific slave to transmit on. The
    alternate transmission policies supported are layer 2+3, which also takes
    the IP source and destination addresses into the calculation of the
    transmit slave port, and layer 3+4, which uses the IP source and
    destination addresses as well as the TCP/UDP source and destination ports.

.. note::

    The coloring differences of the packets are used to identify the different
    flow classifications calculated by the selected transmit policy.


*   **Broadcast (Mode 3):**

.. figure:: img/bond-mode-3.*

   Broadcast (Mode 3)


    This mode provides fault tolerance by transmission of packets on all slave
    ports.

*   **Link Aggregation 802.3AD (Mode 4):**

.. figure:: img/bond-mode-4.*

   Link Aggregation 802.3AD (Mode 4)


    This mode provides dynamic link aggregation according to the 802.3ad
    specification. It negotiates and monitors aggregation groups that share the
    same speed and duplex settings using the selected balance transmit policy
    for balancing outgoing traffic.

    The DPDK implementation of this mode imposes some additional requirements
    on the application.

    #. It needs to call ``rte_eth_tx_burst`` and ``rte_eth_rx_burst`` at
       intervals of less than 100ms.

    #. Calls to ``rte_eth_tx_burst`` must have a buffer size of at least 2xN,
       where N is the number of slaves. This space is required for LACP
       frames. Additionally, LACP packets are included in the statistics, but
       they are not returned to the application.

*   **Transmit Load Balancing (Mode 5):**

.. figure:: img/bond-mode-5.*

   Transmit Load Balancing (Mode 5)


    This mode provides adaptive transmit load balancing. It dynamically
    changes the transmitting slave according to the computed load. Statistics
    are collected in 100ms intervals and scheduled every 10ms.


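The slave selection behind modes 0 and 1 can be pictured with a minimal,
self-contained C sketch. The ``bond_sketch`` structure and the ``rr_pick`` and
``ab_pick`` helpers are hypothetical names used only for illustration. They are
not part of librte_pmd_bond, and the library's own logic may differ.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical, simplified view of a bonded device (not a library type). */
    struct bond_sketch {
        uint16_t active_slaves[8]; /* port ids of slaves whose link is up */
        size_t   active_count;     /* number of valid entries above */
        size_t   rr_next;          /* round-robin cursor, used by mode 0 */
        uint16_t primary;          /* preferred port, used by mode 1 */
    };

    /* Mode 0: return the slave for the next packet and advance the cursor.
     * Returns -1 when no slave is active. */
    static int rr_pick(struct bond_sketch *b)
    {
        if (b->active_count == 0)
            return -1;
        size_t idx = b->rr_next % b->active_count;
        b->rr_next = (idx + 1) % b->active_count;
        return b->active_slaves[idx];
    }

    /* Mode 1: use the primary while it is active, otherwise fall back to the
     * first active slave. Returns -1 when no slave is active. */
    static int ab_pick(const struct bond_sketch *b)
    {
        if (b->active_count == 0)
            return -1;
        for (size_t i = 0; i < b->active_count; i++)
            if (b->active_slaves[i] == b->primary)
                return b->primary;
        return b->active_slaves[0];
    }

With active slaves 3, 5 and 9, successive ``rr_pick`` calls cycle through the
three ports in order, while ``ab_pick`` keeps returning the primary until it
leaves the active list.
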
Implementation Details
----------------------

The librte_pmd_bond bonded devices are compatible with the Ethernet device API
exported by the Ethernet PMDs described in the *DPDK API Reference*.

The Link Bonding Library supports the creation of bonded devices at application
startup time during EAL initialization using the ``--vdev`` option as well as
programmatically via the C API ``rte_eth_bond_create`` function.

Bonded devices support the dynamic addition and removal of slave devices using
the ``rte_eth_bond_slave_add`` / ``rte_eth_bond_slave_remove`` APIs.

After a slave device is added to a bonded device, the slave is stopped using
``rte_eth_dev_stop`` and then reconfigured using ``rte_eth_dev_configure``.
The RX and TX queues are also reconfigured using ``rte_eth_tx_queue_setup`` /
``rte_eth_rx_queue_setup`` with the parameters used to configure the bonding
device.

Link Status Change Interrupts / Polling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link bonding devices support the registration of a link status change callback
using the ``rte_eth_dev_callback_register`` API. The callback is called when the
status of the bonding device changes. For example, in the case of a bonding
device which has 3 slaves, the link status will change to up when one slave
becomes active or change to down when all slaves become inactive. There is no
callback notification when a single slave changes state and the previous
conditions are not met. Users who wish to monitor individual slaves
must register callbacks with those slaves directly.

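The rule for when the bonded device's callback fires can be sketched in a few
lines of standalone C. The helper names ``bond_link_up`` and ``slave_changed``
are hypothetical and only model the behaviour described above. They are not
functions of the library.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* The bonded link is up when at least one slave is active. */
    static bool bond_link_up(const bool *slave_up, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (slave_up[i])
                return true;
        return false;
    }

    /* Record a slave state change. Returns true only when the bonded device's
     * own link status changed, which is when a callback would be notified. */
    static bool slave_changed(bool *slave_up, size_t n, size_t idx, bool up)
    {
        bool before = bond_link_up(slave_up, n);
        slave_up[idx] = up;
        return bond_link_up(slave_up, n) != before;
    }

Starting from three inactive slaves, the first slave coming up changes the
bonded status, a second slave coming up does not, and only the last active
slave going down changes it again.
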
The link bonding library also supports devices which do not implement link
status change interrupts. This is achieved by polling the device's link status
at a defined period which is set using the ``rte_eth_bond_link_monitoring_set``
API; the default polling interval is 10ms. When a device is added as a slave to
a bonding device, the ``RTE_PCI_DRV_INTR_LSC`` flag is used to determine
whether the device supports interrupts or whether the link status should be
monitored by polling.

Requirements / Limitations
~~~~~~~~~~~~~~~~~~~~~~~~~~

The current implementation only allows devices that support the same speed
and duplex to be added as slaves to the same bonded device. The bonded device
inherits these attributes from the first active slave added to the bonded
device and then all further slaves added to the bonded device must support
these parameters.

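The inheritance rule above amounts to a simple compatibility check when a slave
is added. The following standalone sketch illustrates it. ``link_props``,
``bond_props`` and ``try_add_slave`` are hypothetical names chosen for this
example and do not correspond to the library's internal structures.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical link attributes of a port. */
    struct link_props {
        uint32_t speed_mbps;
        bool     full_duplex;
    };

    /* Hypothetical bonded device state: attributes are unset until the first
     * slave is added. */
    struct bond_props {
        bool              has_slave;
        struct link_props props;
    };

    /* Returns 0 when the slave is accepted, -1 when its speed or duplex does
     * not match the attributes inherited from the first slave. */
    static int try_add_slave(struct bond_props *b, struct link_props s)
    {
        if (!b->has_slave) {
            b->props = s;          /* first slave defines the attributes */
            b->has_slave = true;
            return 0;
        }
        if (b->props.speed_mbps != s.speed_mbps ||
            b->props.full_duplex != s.full_duplex)
            return -1;
        return 0;
    }
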
A bonding device must have a minimum of one slave before the bonding device
itself can be started.

Like all other PMDs, all functions exported by this PMD are lock-free functions
that are assumed not to be invoked in parallel on different logical cores to
work on the same target object.

It should also be noted that the PMD receive function should not be invoked
directly on a slave device after it has been added to a bonded device, since
packets read directly from the slave device will no longer be available to the
bonded device to read.

Configuration
~~~~~~~~~~~~~

Link bonding devices are created using the ``rte_eth_bond_create`` API,
which requires a unique device name, the bonding mode,
and the socket ID on which to allocate the bonding device's resources.
The other configurable parameters for a bonded device are its slave devices,
its primary slave, a user-defined MAC address and the transmission policy to
use if the device is in balance XOR mode.

Slave Devices
^^^^^^^^^^^^^

Bonding devices support up to a maximum of ``RTE_MAX_ETHPORTS`` slave devices
of the same speed and duplex. An Ethernet device can be added as a slave to a
maximum of one bonded device. Slave devices are reconfigured with the
configuration of the bonded device when they are added to it.

The bonded device also guarantees to restore the MAC address of a slave device
to its original value when the slave is removed from it.

Primary Slave
^^^^^^^^^^^^^

The primary slave is used to define the default port to use when a bonded
device is in active backup mode. A different port will be used if, and
only if, the current primary port goes down. If the user does not specify a
primary port, it defaults to the first port added to the bonded device.

MAC Address
^^^^^^^^^^^

The bonded device can be configured with a user-specified MAC address. This
address will be inherited by some or all slave devices depending on the
operating mode. If the device is in active backup mode then only the primary
device will have the user-specified MAC; all other slaves will retain their
original MAC address. In modes 0, 2, 3 and 4, all slave devices are configured
with the bonded device's MAC address.

If a user-defined MAC address is not defined then the bonded device will
default to using the primary slave's MAC address.

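The per-mode MAC rule can be restated as a small lookup. This is only a
restatement of the text above in standalone C. ``mac_choice`` and ``slave_mac``
are hypothetical names, and mode 5 is reported as unspecified because this
section does not describe it.

.. code-block:: c

    #include <stdbool.h>

    /* Which MAC address a slave ends up with (illustrative only). */
    enum mac_choice { MAC_ORIGINAL, MAC_BONDED, MAC_UNSPECIFIED };

    static enum mac_choice slave_mac(int mode, bool is_primary)
    {
        switch (mode) {
        case 1: /* active backup: only the primary carries the bonded MAC */
            return is_primary ? MAC_BONDED : MAC_ORIGINAL;
        case 0: /* round-robin */
        case 2: /* balance XOR */
        case 3: /* broadcast */
        case 4: /* 802.3ad */
            return MAC_BONDED;
        default: /* not covered by the description above */
            return MAC_UNSPECIFIED;
        }
    }
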
Balance XOR Transmit Policies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are 3 supported transmission policies for bonded devices running in
Balance XOR mode: Layer 2, Layer 2+3 and Layer 3+4.

*   **Layer 2:**   Ethernet MAC address based balancing is the default
    transmission policy for Balance XOR bonding mode. It uses a simple XOR
    calculation on the source MAC address and destination MAC address of the
    packet and then takes the modulus of this value with respect to the number
    of active slaves to select the slave device to transmit the packet on.

*   **Layer 2 + 3:** Ethernet MAC address & IP Address based balancing uses a
    combination of source/destination MAC addresses and the source/destination
    IP addresses of the data packet to decide which slave port the packet will
    be transmitted on.

*   **Layer 3 + 4:**  IP Address & UDP Port based balancing uses a combination
    of the source/destination IP addresses and the source/destination UDP ports
    of the data packet to decide which slave port the packet will be
    transmitted on.

All these policies support 802.1Q VLAN Ethernet packets, as well as IPv4, IPv6
and UDP protocols for load balancing.

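To make the three policies concrete, the following standalone C sketch hashes
a simplified IPv4 flow and reduces the result modulo the number of active
slaves. It is an illustration under stated assumptions, not the library's
implementation: the ``flow_sketch`` structure and the ``xmit_*`` helpers are
hypothetical, the exact way fields are folded together is an assumption, and
VLAN and IPv6 handling is omitted.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical, already-parsed view of a packet's flow fields. */
    struct flow_sketch {
        uint8_t  src_mac[6];
        uint8_t  dst_mac[6];
        uint32_t src_ip;    /* IPv4 only in this sketch */
        uint32_t dst_ip;
        uint16_t src_port;
        uint16_t dst_port;
    };

    /* XOR of source and destination MAC, folded byte by byte. */
    static uint32_t l2_hash(const struct flow_sketch *f)
    {
        uint32_t h = 0;
        for (size_t i = 0; i < 6; i++)
            h ^= (uint32_t)(f->src_mac[i] ^ f->dst_mac[i]);
        return h;
    }

    static uint32_t l3_hash(const struct flow_sketch *f)
    {
        return f->src_ip ^ f->dst_ip;
    }

    static uint32_t l4_hash(const struct flow_sketch *f)
    {
        return (uint32_t)(f->src_port ^ f->dst_port);
    }

    /* Each policy returns an index into the list of n active slaves. */
    static uint32_t xmit_l2(const struct flow_sketch *f, uint32_t n)
    {
        assert(n > 0);
        return l2_hash(f) % n;
    }

    static uint32_t xmit_l23(const struct flow_sketch *f, uint32_t n)
    {
        assert(n > 0);
        return (l2_hash(f) ^ l3_hash(f)) % n;
    }

    static uint32_t xmit_l34(const struct flow_sketch *f, uint32_t n)
    {
        assert(n > 0);
        return (l3_hash(f) ^ l4_hash(f)) % n;
    }

Because every packet of a flow carries the same header fields, each policy maps
a given flow to the same slave for as long as the number of active slaves does
not change.
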
Using Link Bonding Devices
--------------------------

The librte_pmd_bond library supports two modes of device creation: using the
library's full C API, or using the EAL command line to statically configure
link bonding devices at application startup. Using the EAL option it is
possible to use link bonding functionality transparently without specific
knowledge of the library's API. This can be used, for example, to add bonding
functionality, such as active backup, to an existing application which has no
knowledge of the link bonding C API.

Using the Poll Mode Driver from an Application
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using the librte_pmd_bond library's API it is possible to dynamically create
and manage link bonding devices from within any application. Link bonding
devices are created using the ``rte_eth_bond_create`` API, which requires a
unique device name, the link bonding mode in which to initialize the device and
finally the socket ID on which to allocate the device's resources. After
successful creation of a bonding device it must be configured using the generic
Ethernet device configure API ``rte_eth_dev_configure`` and then the RX and TX
queues which will be used must be set up using ``rte_eth_tx_queue_setup`` /
``rte_eth_rx_queue_setup``.

Slave devices can be dynamically added to and removed from a link bonding
device using the ``rte_eth_bond_slave_add`` / ``rte_eth_bond_slave_remove``
APIs, but at least one slave device must be added to the link bonding device
before it can be started using ``rte_eth_dev_start``.

The link status of a bonded device is dictated by that of its slaves. If the
link status of all slave devices is down, or if all slaves are removed from the
link bonding device, then the link status of the bonding device will go down.

It is also possible to configure / query the configuration of the control
parameters of a bonded device using the provided APIs
``rte_eth_bond_mode_set/get``, ``rte_eth_bond_primary_set/get``,
``rte_eth_bond_mac_set/reset`` and ``rte_eth_bond_xmit_policy_set/get``.

Using Link Bonding Devices from the EAL Command Line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link bonding devices can be created at application startup time using the
``--vdev`` EAL command line option. The device name must start with the
eth_bond prefix followed by numbers or letters. The name must be unique for
each device. Each device can have multiple options arranged in a comma
separated list. Multiple device definitions can be arranged by calling the
``--vdev`` option multiple times.

Device names and bonding options must be separated by commas as shown below:

.. code-block:: console

    $RTE_TARGET/app/testpmd -c f -n 4 --vdev 'eth_bond0,bond_opt0=..,bond_opt1=..' --vdev 'eth_bond1,bond_opt0=..,bond_opt1=..'

Link Bonding EAL Options
^^^^^^^^^^^^^^^^^^^^^^^^

Multiple option definitions can be combined as long as the following three
rules are respected:

*   A unique device name, in the format of eth_bondX is provided,
    where X can be any combination of numbers and/or letters,
    and the name is no greater than 32 characters long.

*   At least one slave device is provided for each bonded device definition.

*   The operation mode of the bonded device being created is provided.

The different options are:

355
356*   mode: Integer value defining the bonding mode of the device.
357    Currently supports modes 0,1,2,3,4,5 (round-robin, active backup, balance,
358    broadcast, link aggregation, transmit load balancing).
359
360.. code-block:: console
361
362        mode=2
363
364*   slave: Defines the PMD device which will be added as slave to the bonded
365    device. This option can be selected multiple time, for each device to be
366    added as a slave. Physical devices should be specified using their PCI
367    address, in the format domain:bus:devid.function
368
369.. code-block:: console
370
371        slave=0000:0a:00.0,slave=0000:0a:00.1
372
373*   primary: Optional parameter which defines the primary slave port,
374    is used in active backup mode to select the primary slave for data TX/RX if
375    it is available. The primary port also is used to select the MAC address to
376    use when it is not defined by the user. This defaults to the first slave
377    added to the device if it is specified. The primary device must be a slave
378    of the bonded device.
379
380.. code-block:: console
381
382        primary=0000:0a:00.0
383
384*   socket_id: Optional parameter used to select which socket on a NUMA device
385    the bonded devices resources will be allocated on.
386
387.. code-block:: console
388
389        socket_id=0
390
391*   mac: Optional parameter to select a MAC address for link bonding device,
392    this overrides the value of the primary slave device.
393
394.. code-block:: console
395
396        mac=00:1e:67:1d:fd:1d
397
398*   xmit_policy: Optional parameter which defines the transmission policy when
399    the bonded device is in  balance mode. If not user specified this defaults
400    to l2 (layer 2) forwarding, the other transmission policies available are
401    l23 (layer 2+3) and l34 (layer 3+4)
402
403.. code-block:: console
404
405        xmit_policy=l23
406
407*   lsc_poll_period_ms: Optional parameter which defines the polling interval
408    in milli-seconds at which devices which don't support lsc interrupts are
409    checked for a change in the devices link status
410
411.. code-block:: console
412
413        lsc_poll_period_ms=100
414
415*   up_delay: Optional parameter which adds a delay in milli-seconds to the
416    propagation of a devices link status changing to up, by default this
417    parameter is zero.
418
419.. code-block:: console
420
421        up_delay=10
422
423*   down_delay: Optional parameter which adds a delay in milli-seconds to the
424    propagation of a devices link status changing to down, by default this
425    parameter is zero.
426
427.. code-block:: console
428
429        down_delay=50
430
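An application launcher that assembles the ``--vdev`` argument itself could
enforce the three rules listed above before starting the EAL. The following is
a minimal standalone sketch. ``build_vdev_arg`` is a hypothetical helper, not
part of DPDK, and it only emits the ``mode`` and ``slave`` options.

.. code-block:: c

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Writes "<name>,mode=<m>,slave=<addr>[,slave=<addr>...]" into out.
     * Returns the string length, or -1 when a rule is violated or the
     * buffer is too small. */
    static int build_vdev_arg(char *out, size_t cap, const char *name, int mode,
                              const char *const *slaves, size_t n_slaves)
    {
        if (n_slaves == 0)                /* at least one slave is required */
            return -1;
        if (mode < 0 || mode > 5)         /* modes 0 to 5 are supported */
            return -1;
        if (strncmp(name, "eth_bond", 8) != 0 || strlen(name) > 32)
            return -1;                    /* eth_bondX, at most 32 characters */

        int len = snprintf(out, cap, "%s,mode=%d", name, mode);
        if (len < 0 || (size_t)len >= cap)
            return -1;

        for (size_t i = 0; i < n_slaves; i++) {
            size_t room = cap - (size_t)len;
            int w = snprintf(out + len, room, ",slave=%s", slaves[i]);
            if (w < 0 || (size_t)w >= room)
                return -1;
            len += w;
        }
        return len;
    }

The resulting string is what would be passed, quoted, after ``--vdev`` on the
command line. Optional parameters such as ``primary`` or ``xmit_policy`` could
be appended in the same way.
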
Examples of Usage
^^^^^^^^^^^^^^^^^

Create a bonded device in round robin mode with two slaves specified by their PCI address:

.. code-block:: console

    $RTE_TARGET/app/testpmd -c '0xf' -n 4 --vdev 'eth_bond0,mode=0,slave=0000:00a:00.01,slave=0000:004:00.00' -- --port-topology=chained

Create a bonded device in round robin mode with two slaves specified by their PCI address and an overriding MAC address:

.. code-block:: console

    $RTE_TARGET/app/testpmd -c '0xf' -n 4 --vdev 'eth_bond0,mode=0,slave=0000:00a:00.01,slave=0000:004:00.00,mac=00:1e:67:1d:fd:1d' -- --port-topology=chained

Create a bonded device in active backup mode with two slaves and a primary slave, all specified by their PCI addresses:

.. code-block:: console

    $RTE_TARGET/app/testpmd -c '0xf' -n 4 --vdev 'eth_bond0,mode=1,slave=0000:00a:00.01,slave=0000:004:00.00,primary=0000:00a:00.01' -- --port-topology=chained

Create a bonded device in balance mode with two slaves specified by their PCI addresses, and a transmission policy of layer 3 + 4 forwarding:

.. code-block:: console

    $RTE_TARGET/app/testpmd -c '0xf' -n 4 --vdev 'eth_bond0,mode=2,slave=0000:00a:00.01,slave=0000:004:00.00,xmit_policy=l34' -- --port-topology=chained

457