..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2015 Intel Corporation.

Link Bonding Poll Mode Driver Library
=====================================

In addition to Poll Mode Drivers (PMDs) for physical and virtual hardware,
DPDK also includes a pure-software library that
allows physical PMDs to be bonded together to create a single logical PMD.

.. figure:: img/bond-overview.*

   Bonding PMDs


The Link Bonding PMD library (librte_net_bond) supports bonding of groups of
``rte_eth_dev`` ports of the same speed and duplex. It provides capabilities
similar to those of the Linux bonding driver, allowing multiple (member) NICs
to be aggregated into a single logical interface between a server and a
switch. The new bonding PMD then processes these interfaces according to the
specified mode of operation to support features such as redundant links,
fault tolerance and/or load balancing.

The librte_net_bond library exports a C API for the creation of bonding
devices as well as the configuration and management of the bonding device and
its member devices.

.. note::

    The Link Bonding PMD Library is enabled by default in the build
    configuration. The library can be disabled using the meson option
    "-Ddisable_drivers=net/bonding".


Link Bonding Modes Overview
---------------------------

Currently the Link Bonding PMD library supports the following modes of operation:

*   **Round-Robin (Mode 0):**

.. figure:: img/bond-mode-0.*

   Round-Robin (Mode 0)


    This mode provides load balancing and fault tolerance by transmission of
    packets in sequential order from the first available member device through
    the last. Packets are bulk dequeued from devices then serviced in a
    round-robin manner. This mode does not guarantee in-order reception of
    packets, and downstream should be able to handle out-of-order packets.

*   **Active Backup (Mode 1):**

.. figure:: img/bond-mode-1.*

   Active Backup (Mode 1)


    In this mode only one member in the bond is active at any time. A different
    member becomes active if, and only if, the primary active member fails,
    thereby providing fault tolerance to member failure. The single logical
    bonding interface's MAC address is externally visible on only one NIC (port)
    to avoid confusing the network switch.

*   **Balance XOR (Mode 2):**

.. figure:: img/bond-mode-2.*

   Balance XOR (Mode 2)


    This mode provides transmit load balancing (based on the selected
    transmission policy) and fault tolerance. The default policy (layer 2) uses
    a simple calculation based on the packet flow source and destination MAC
    addresses as well as the number of active members available to the bonding
    device to classify the packet to a specific member to transmit on. Two
    alternate transmission policies are supported: layer 2+3, which also takes
    the IP source and destination addresses into the calculation of the
    transmit member port, and layer 3+4, which uses the IP source and
    destination addresses as well as the TCP/UDP source and destination ports.

.. note::

    The different packet colors in the figure identify the different flow
    classifications calculated by the selected transmit policy.


*   **Broadcast (Mode 3):**

.. figure:: img/bond-mode-3.*

   Broadcast (Mode 3)


    This mode provides fault tolerance by transmission of packets on all member
    ports.

*   **Link Aggregation 802.3AD (Mode 4):**

.. figure:: img/bond-mode-4.*

   Link Aggregation 802.3AD (Mode 4)


    This mode provides dynamic link aggregation according to the 802.3ad
    specification. It negotiates and monitors aggregation groups that share the
    same speed and duplex settings using the selected balance transmit policy
    for balancing outgoing traffic.

    The DPDK implementation of this mode places some additional requirements
    on the application.

    #. It needs to call ``rte_eth_tx_burst`` and ``rte_eth_rx_burst`` at
       intervals of less than 100ms.

    #. Calls to ``rte_eth_tx_burst`` must have a buffer size of at least 2xN,
       where N is the number of members. This space is required for LACP
       frames. Additionally, LACP packets are included in the statistics, but
       they are not returned to the application.

*   **Transmit Load Balancing (Mode 5):**

.. figure:: img/bond-mode-5.*

   Transmit Load Balancing (Mode 5)


    This mode provides adaptive transmit load balancing. It dynamically
    changes the transmitting member according to the computed load. Statistics
    are collected in 100ms intervals and scheduled every 10ms.

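The round-robin distribution of mode 0 can be illustrated with a minimal,
self-contained sketch. It is not the driver's implementation: the ``rr_state``
structure and the ``rr_assign_burst`` function are hypothetical names, and the
model simply assigns each packet of a burst to the next active member while
carrying the position over between bursts.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical per-bond state: index of the active member used next. */
    struct rr_state {
        uint16_t next;
    };

    /*
     * Assign each packet of a burst to an active member in round-robin order.
     * out[i] receives the active-member index (0..active_count-1) for packet i.
     * Returns the number of packets assigned (0 if no member is active).
     */
    static size_t
    rr_assign_burst(struct rr_state *st, uint16_t active_count,
                    uint16_t *out, size_t nb_pkts)
    {
        size_t i;

        if (active_count == 0)
            return 0;

        /* The active set may have shrunk since the previous burst. */
        st->next %= active_count;

        for (i = 0; i < nb_pkts; i++) {
            out[i] = st->next;
            st->next = (uint16_t)((st->next + 1) % active_count);
        }
        return nb_pkts;
    }

With three active members, a burst of five packets is spread as 0, 1, 2, 0, 1
and the following burst starts at member 2.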

Implementation Details
----------------------

The librte_net_bond bonding device is compatible with the Ethernet device API
exported by the Ethernet PMDs described in the *DPDK API Reference*.

The Link Bonding Library supports the creation of bonding devices at application
startup time during EAL initialization using the ``--vdev`` option as well as
programmatically via the C API ``rte_eth_bond_create`` function.

Bonding devices support the dynamic addition and removal of member devices using
the ``rte_eth_bond_member_add`` / ``rte_eth_bond_member_remove`` APIs.

After a member device is added to a bonding device, the member is stopped using
``rte_eth_dev_stop`` and then reconfigured using ``rte_eth_dev_configure``.
The RX and TX queues are also reconfigured using ``rte_eth_tx_queue_setup`` /
``rte_eth_rx_queue_setup`` with the parameters used to configure the bonding
device. If RSS is enabled for the bonding device, this mode is also enabled on
the new member and configured as well.
Any flow which was configured on the bonding device is also configured on the
added member.

Setting the multi-queue mode of a bonding device to RSS makes it fully
RSS-capable, so all members are synchronized with its configuration. This mode is
intended to make the RSS configuration of the members transparent to the client
application implementation.

The bonding device stores its own version of the RSS settings, i.e. RETA, RSS
hash function and RSS key, which are used to set up its members. This allows the
RSS configuration of the bonding device to be defined as the desired
configuration of the whole bond (as one unit), without referring to any
individual member. This is required to ensure consistency and makes the
configuration more error-proof.

The RSS hash function set for the bonding device is the maximal set of RSS hash
functions supported by all bonding members. The RETA size is the GCD of all the
member RETA sizes, so it can be easily used as a pattern providing expected
behavior, even if the member RETA sizes are different. If the RSS key is not set
for the bonding device, it is not changed on the members and the default key for
each device is used.
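
The GCD rule for the RETA size can be sketched in a few lines of plain C. This
is only an illustration of the arithmetic, not the driver's code; the helper
names ``gcd_u16`` and ``bond_reta_size`` are hypothetical.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Greatest common divisor (Euclid's algorithm). */
    static uint16_t
    gcd_u16(uint16_t a, uint16_t b)
    {
        while (b != 0) {
            uint16_t t = (uint16_t)(a % b);

            a = b;
            b = t;
        }
        return a;
    }

    /*
     * Hypothetical helper: derive the bonding RETA size from the RETA sizes
     * reported by the members. Returns 0 when there are no members.
     */
    static uint16_t
    bond_reta_size(const uint16_t *member_reta_sizes, size_t nb_members)
    {
        uint16_t size = 0;
        size_t i;

        /* gcd(0, x) == x, so the first member seeds the result. */
        for (i = 0; i < nb_members; i++)
            size = gcd_u16(size, member_reta_sizes[i]);
        return size;
    }

For example, members with RETA sizes of 128 and 512 entries give a bonding RETA
of 128 entries, a pattern that repeats evenly in both member tables.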

As with the RSS configuration, flow consistency is maintained across the bonding
members for the following rte flow operations:

Validate:
        - Validate the flow for each member; failure for at least one member
          causes the bond validation to fail.

Create:
        - Create the flow in all members.
        - Save all the flow objects created on the members in the bonding
          internal flow structure.
        - Failure in flow creation for an existing member rejects the flow.
        - Failure in flow creation for a new member at member add time rejects
          the member.

Destroy:
        - Destroy the flow in all members and release the bond internal flow
          memory.

Flush:
        - Destroy all the bonding PMD flows in all the members.

.. note::

    Do not call flush on the members directly. It destroys all the member
    flows, which may include external flows or the bond internal LACP flow.

Query:
        - Summarize flow counters from all the members, relevant only for
          ``RTE_FLOW_ACTION_TYPE_COUNT``.

Isolate:
        - Call flow isolate for all members.
        - Failure in flow isolation for an existing member rejects the isolate
          mode.
        - Failure in flow isolation for a new member at member add time rejects
          the member.

All settings are managed through the bonding port API and are always propagated
in one direction (from bonding to members).
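
The query operation above amounts to adding up per-member counters. A minimal
sketch of that summation follows; ``struct flow_count`` is a hypothetical
stand-in for the hit and byte counters a member reports for a count action, and
``bond_flow_query_sum`` is not a DPDK function.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the counters reported by one member. */
    struct flow_count {
        uint64_t hits;
        uint64_t bytes;
    };

    /* Sum the counters of all members into a single bond-level result. */
    static struct flow_count
    bond_flow_query_sum(const struct flow_count *member_counts,
                        size_t nb_members)
    {
        struct flow_count total = { 0, 0 };
        size_t i;

        for (i = 0; i < nb_members; i++) {
            total.hits += member_counts[i].hits;
            total.bytes += member_counts[i].bytes;
        }
        return total;
    }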

Link Status Change Interrupts / Polling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link bonding devices support the registration of a link status change callback
using the ``rte_eth_dev_callback_register`` API. The callback is called when the
status of the bonding device changes. For example, in the case of a bonding
device which has 3 members, the link status will change to up when one member
becomes active or change to down when all members become inactive. There is no
callback notification when a single member changes state and the previous
conditions are not met. If a user wishes to monitor individual members, then
they must register callbacks with that member directly.
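
The rule for when the bonding link status changes can be modelled as follows.
This is a simplified sketch with hypothetical function names; it only captures
the "up when at least one member is active, down when none is" behaviour
described above.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* The bonding link is up when at least one member is active. */
    static bool
    bond_link_up(const bool *member_active, size_t nb_members)
    {
        size_t i;

        for (i = 0; i < nb_members; i++)
            if (member_active[i])
                return true;
        return false;
    }

    /*
     * Record a member state change and report whether it altered the bonding
     * link status, i.e. whether a bond-level callback would be expected.
     */
    static bool
    bond_lsc_event(bool *member_active, size_t nb_members,
                   size_t idx, bool now_active)
    {
        bool before = bond_link_up(member_active, nb_members);

        member_active[idx] = now_active;
        return bond_link_up(member_active, nb_members) != before;
    }

In this model, with 3 members the first member coming up triggers an event, a
second member coming up does not, and an event is triggered again only when the
last active member goes down.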

The link bonding library also supports devices which do not implement link
status change interrupts. This is achieved by polling the device's link status
at a defined period, which is set using the ``rte_eth_bond_link_monitoring_set``
API; the default polling interval is 10ms. When a device is added as a member to
a bonding device, the ``RTE_PCI_DRV_INTR_LSC`` flag is used to determine
whether the device supports interrupts or whether the link status should be
monitored by polling it.

Requirements / Limitations
~~~~~~~~~~~~~~~~~~~~~~~~~~

The current implementation only allows devices that support the same speed
and duplex to be added as members to the same bonding device. The bonding device
inherits these attributes from the first active member added to the bonding
device and then all further members added to the bonding device must support
these parameters.

A bonding device must have a minimum of one member before the bonding device
itself can be started.

To use the dynamic RSS configuration feature of a bonding device effectively,
all members should also be RSS-capable and support at least one common hash
function. Changing the RSS key is only possible when all member devices
support the same key size.

To prevent inconsistency in how members process packets, once a device is added
to a bonding device, RSS and rte flow configurations should be managed through
the bonding device API, and not directly on the member.

As with all other PMDs, the functions exported by a PMD are lock-free functions
that are assumed not to be invoked in parallel on different logical cores to
work on the same target object.

It should also be noted that the PMD receive function should not be invoked
directly on a member device after it has been added to a bonding device, since
packets read directly from the member device will no longer be available to the
bonding device to read.

Configuration
~~~~~~~~~~~~~

Link bonding devices are created using the ``rte_eth_bond_create`` API
which requires a unique device name, the bonding mode,
and the socket ID to allocate the bonding device's resources on.
The other configurable parameters for a bonding device are its member devices,
its primary member, a user-defined MAC address and the transmission policy to
use if the device is in balance XOR mode.

Member Devices
^^^^^^^^^^^^^^

Bonding devices support up to a maximum of ``RTE_MAX_ETHPORTS`` member devices
of the same speed and duplex. An Ethernet device can be added as a member to a
maximum of one bonding device. Member devices are reconfigured with the
configuration of the bonding device on being added to a bonding device.

The bonding device also guarantees to restore the MAC address of a member device
to its original value when the member is removed from it.

Primary Member
^^^^^^^^^^^^^^

The primary member is used to define the default port to use when a bonding
device is in active backup mode. A different port will be used if, and
only if, the current primary port goes down. If the user does not specify a
primary port it will default to being the first port added to the bonding device.
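
As an illustration of the primary member rule, the sketch below picks the port
that carries traffic in active backup mode. It is a simplified model and not
the driver's logic: ``struct member``, ``NO_ACTIVE_PORT`` and
``active_backup_select`` are hypothetical, and both the choice of the first
member with its link up as the fallback and the immediate return to the primary
once its link is up again are assumptions of the example.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define NO_ACTIVE_PORT UINT16_MAX /* hypothetical "nothing usable" value */

    /* Hypothetical view of one member: its port id and link state. */
    struct member {
        uint16_t port_id;
        bool link_up;
    };

    /*
     * Use the primary while its link is up; otherwise fall back to the
     * first member whose link is up (assumption of this sketch).
     */
    static uint16_t
    active_backup_select(const struct member *members, size_t nb_members,
                         uint16_t primary_port)
    {
        uint16_t fallback = NO_ACTIVE_PORT;
        size_t i;

        for (i = 0; i < nb_members; i++) {
            if (!members[i].link_up)
                continue;
            if (members[i].port_id == primary_port)
                return primary_port;
            if (fallback == NO_ACTIVE_PORT)
                fallback = members[i].port_id;
        }
        return fallback;
    }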

MAC Address
^^^^^^^^^^^

The bonding device can be configured with a user-specified MAC address. This
address is inherited by some or all member devices depending on the
operating mode. If the device is in active backup mode then only the primary
device will have the user-specified MAC; all other members will retain their
original MAC address. In modes 0, 2, 3 and 4 all member devices are configured
with the bonding device's MAC address.

If a user-defined MAC address is not defined then the bonding device will
default to using the primary member's MAC address.
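
The per-mode MAC rule can be summarized as a small decision function. This is a
sketch of the text above only; ``member_uses_bond_mac`` is a hypothetical name,
modes are given as plain integers, and any mode the text does not cover is
reported as unknown rather than guessed.

.. code-block:: c

    #include <stdbool.h>

    /*
     * Returns 1 if a member takes the bonding device's MAC address,
     * 0 if it keeps its original address, and -1 if the mode is not
     * covered by the description above.
     */
    static int
    member_uses_bond_mac(int mode, bool is_primary)
    {
        switch (mode) {
        case 0: /* round-robin */
        case 2: /* balance XOR */
        case 3: /* broadcast */
        case 4: /* link aggregation 802.3ad */
            return 1;
        case 1: /* active backup: only the primary takes the bond MAC */
            return is_primary ? 1 : 0;
        default:
            return -1;
        }
    }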

Balance XOR Transmit Policies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are 3 supported transmission policies for a bonding device running in
Balance XOR mode: Layer 2, Layer 2+3 and Layer 3+4.

*   **Layer 2:**   Ethernet MAC address based balancing is the default
    transmission policy for Balance XOR bonding mode. It uses a simple XOR
    calculation on the source MAC address and destination MAC address of the
    packet and then calculates the modulus of this value to select the member
    device to transmit the packet on.

*   **Layer 2 + 3:** Ethernet MAC address & IP address based balancing uses a
    combination of the source/destination MAC addresses and the
    source/destination IP addresses of the data packet to decide which member
    port the packet will be transmitted on.

*   **Layer 3 + 4:**  IP address & UDP port based balancing uses a combination
    of the source/destination IP addresses and the source/destination UDP ports
    of the data packet to decide which member port the packet will be
    transmitted on.

All these policies support 802.1Q VLAN Ethernet packets, as well as IPv4, IPv6
and UDP protocols for load balancing.
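
The Layer 2 policy can be sketched as an XOR of the two MAC addresses followed
by a modulus over the number of active members. The code below is an
illustration under stated assumptions rather than the driver's hash: the
byte-wise fold of the XOR result and the name ``l2_xmit_member`` are choices
made for this example.

.. code-block:: c

    #include <stdint.h>

    #define MAC_LEN 6

    /*
     * XOR the source and destination MAC addresses, fold the result into a
     * single value (byte-wise fold is an assumption of this sketch) and take
     * it modulo the number of active members.
     */
    static uint16_t
    l2_xmit_member(const uint8_t src[MAC_LEN], const uint8_t dst[MAC_LEN],
                   uint16_t active_count)
    {
        uint32_t hash = 0;
        int i;

        /* No member to choose from; the caller should not transmit. */
        if (active_count == 0)
            return 0;

        for (i = 0; i < MAC_LEN; i++)
            hash ^= (uint32_t)(src[i] ^ dst[i]);

        return (uint16_t)(hash % active_count);
    }

Because XOR is symmetric, both directions of a flow between the same pair of
MAC addresses map to the same member index in this model.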

Using Link Bonding Devices
--------------------------

The librte_net_bond library supports two modes of device creation: the library's
full C API, or the EAL command line to statically configure link
bonding devices at application startup. Using the EAL option it is possible to
use link bonding functionality transparently without specific knowledge of the
library's API. This can be used, for example, to add bonding functionality,
such as active backup, to an existing application which has no knowledge of
the link bonding C API.

Using the Poll Mode Driver from an Application
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using the librte_net_bond library's API it is possible to dynamically create
and manage link bonding devices from within any application. Link bonding
devices are created using the ``rte_eth_bond_create`` API which requires a
unique device name, the link bonding mode to initialize the device in and
finally the socket ID on which to allocate the device's resources. After
successful creation of a bonding device it must be configured using the generic
Ethernet device configure API ``rte_eth_dev_configure`` and then the RX and TX
queues which will be used must be set up using ``rte_eth_tx_queue_setup`` /
``rte_eth_rx_queue_setup``.

Member devices can be dynamically added and removed from a link bonding device
using the ``rte_eth_bond_member_add`` / ``rte_eth_bond_member_remove``
APIs, but at least one member device must be added to the link bonding device
before it can be started using ``rte_eth_dev_start``.

The link status of a bonding device is dictated by that of its members. If the
link status of all member devices is down or if all members are removed from the
link bonding device then the link status of the bonding device will go down.

It is also possible to configure / query the configuration of the control
parameters of a bonding device using the provided APIs
``rte_eth_bond_mode_set/get``, ``rte_eth_bond_primary_set/get``,
``rte_eth_bond_mac_set/reset`` and ``rte_eth_bond_xmit_policy_set/get``.

Using Link Bonding Devices from the EAL Command Line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link bonding devices can be created at application startup time using the
``--vdev`` EAL command line option. The device name must start with the
net_bonding prefix followed by numbers or letters. The name must be unique for
each device. Each device can have multiple options arranged in a
comma-separated list. Multiple device definitions can be provided by using the
``--vdev`` option multiple times.

Device names and bonding options must be separated by commas as shown below:

.. code-block:: console

    ./<build_dir>/app/dpdk-testpmd -l 0-3 -n 4 --vdev 'net_bonding0,bond_opt0=..,bond_opt1=..' --vdev 'net_bonding1,bond_opt0=..,bond_opt1=..'

Link Bonding EAL Options
^^^^^^^^^^^^^^^^^^^^^^^^

Device definitions can be written and combined in multiple ways as
long as the following rules are respected:

*   A unique device name, in the format of net_bondingX, is provided,
    where X can be any combination of numbers and/or letters,
    and the name is no greater than 32 characters long.

*   At least one member device is provided for each bonding device definition.

*   The operation mode of the bonding device being created is provided.

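The rules above can be expressed as a small validation routine. This is an
illustrative sketch, not how EAL parses ``--vdev`` arguments: ``struct
bond_def`` and both function names are hypothetical, and requiring at least one
character after the ``net_bonding`` prefix is an assumption of the example.

.. code-block:: c

    #include <ctype.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    #define BOND_NAME_PREFIX "net_bonding"
    #define BOND_NAME_MAX_LEN 32

    /* Check "net_bondingX": X is letters/digits, whole name <= 32 chars. */
    static bool
    bond_name_is_valid(const char *name)
    {
        size_t prefix_len = strlen(BOND_NAME_PREFIX);
        size_t len = strlen(name);
        size_t i;

        if (len <= prefix_len || len > BOND_NAME_MAX_LEN)
            return false;
        if (strncmp(name, BOND_NAME_PREFIX, prefix_len) != 0)
            return false;
        for (i = prefix_len; i < len; i++)
            if (!isalnum((unsigned char)name[i]))
                return false;
        return true;
    }

    /* Hypothetical summary of one already-parsed device definition. */
    struct bond_def {
        const char *name;
        int nb_members; /* number of member= options seen */
        bool has_mode;  /* whether a mode= option was given */
    };

    /* Apply the three rules: valid name, at least one member, mode given. */
    static bool
    bond_def_is_valid(const struct bond_def *def)
    {
        return bond_name_is_valid(def->name) &&
               def->nb_members >= 1 && def->has_mode;
    }

Uniqueness of the name across definitions is not checked here; it would need
the full set of definitions.
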
The different options are:

*   mode: Integer value defining the bonding mode of the device.
    Currently supports modes 0,1,2,3,4,5 (round-robin, active backup, balance,
    broadcast, link aggregation, transmit load balancing).

.. code-block:: console

        mode=2

*   member: Defines the PMD device which will be added as a member to the
    bonding device. This option can be specified multiple times, once for each
    device to be added as a member. Physical devices should be specified using
    their PCI address, in the format domain:bus:devid.function.

.. code-block:: console

        member=0000:0a:00.0,member=0000:0a:00.1

*   primary: Optional parameter which defines the primary member port. It
    is used in active backup mode to select the primary member for data TX/RX if
    it is available. The primary port is also used to select the MAC address to
    use when it is not defined by the user. This defaults to the first member
    added to the device if it is not specified. The primary device must be a
    member of the bonding device.

.. code-block:: console

        primary=0000:0a:00.0

*   socket_id: Optional parameter used to select which socket on a NUMA device
    the bonding device's resources will be allocated on.

.. code-block:: console

        socket_id=0

*   mac: Optional parameter to select a MAC address for the link bonding device.
    This overrides the value of the primary member device.

.. code-block:: console

        mac=00:1e:67:1d:fd:1d

*   xmit_policy: Optional parameter which defines the transmission policy when
    the bonding device is in balance mode. If not specified by the user, this
    defaults to l2 (layer 2) forwarding. The other transmission policies
    available are l23 (layer 2+3) and l34 (layer 3+4).

.. code-block:: console

        xmit_policy=l23

*   lsc_poll_period_ms: Optional parameter which defines the polling interval
    in milliseconds at which devices which don't support LSC interrupts are
    checked for a change in the device's link status.

.. code-block:: console

        lsc_poll_period_ms=100

*   up_delay: Optional parameter which adds a delay in milliseconds to the
    propagation of a device's link status changing to up. By default this
    parameter is zero.

.. code-block:: console

        up_delay=10

*   down_delay: Optional parameter which adds a delay in milliseconds to the
    propagation of a device's link status changing to down. By default this
    parameter is zero.

.. code-block:: console

        down_delay=50

Examples of Usage
^^^^^^^^^^^^^^^^^

Create a bonding device in round robin mode with two members specified by their PCI address:

.. code-block:: console

    ./<build_dir>/app/dpdk-testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=0,member=0000:0a:00.01,member=0000:04:00.00' -- --port-topology=chained

Create a bonding device in round robin mode with two members specified by their PCI address and an overriding MAC address:

.. code-block:: console

    ./<build_dir>/app/dpdk-testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=0,member=0000:0a:00.01,member=0000:04:00.00,mac=00:1e:67:1d:fd:1d' -- --port-topology=chained

Create a bonding device in active backup mode with two members specified, and a primary member specified by their PCI addresses:

.. code-block:: console

    ./<build_dir>/app/dpdk-testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=1,member=0000:0a:00.01,member=0000:04:00.00,primary=0000:0a:00.01' -- --port-topology=chained

Create a bonding device in balance mode with two members specified by their PCI addresses, and a transmission policy of layer 3 + 4 forwarding:

.. code-block:: console

    ./<build_dir>/app/dpdk-testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=2,member=0000:0a:00.01,member=0000:04:00.00,xmit_policy=l34' -- --port-topology=chained

.. _bonding_testpmd_commands:

Testpmd driver specific commands
--------------------------------

Some bonding driver specific features are integrated in testpmd.

create bonding device
~~~~~~~~~~~~~~~~~~~~~

Create a new bonding device::

   testpmd> create bonding device (mode) (socket)

For example, to create a bonding device in mode 1 on socket 0::

   testpmd> create bonding device 1 0
   created new bonding device (port X)

add bonding member
~~~~~~~~~~~~~~~~~~

Add an Ethernet device to a Link Bonding device::

   testpmd> add bonding member (member id) (port id)

For example, to add an Ethernet device (port 6) to a Link Bonding device (port 10)::

   testpmd> add bonding member 6 10


remove bonding member
~~~~~~~~~~~~~~~~~~~~~

Remove an Ethernet member device from a Link Bonding device::

   testpmd> remove bonding member (member id) (port id)

For example, to remove an Ethernet member device (port 6) from a Link Bonding device (port 10)::

   testpmd> remove bonding member 6 10

set bonding mode
~~~~~~~~~~~~~~~~

Set the Link Bonding mode of a Link Bonding device::

   testpmd> set bonding mode (value) (port id)

For example, to set the bonding mode of a Link Bonding device (port 10) to broadcast (mode 3)::

   testpmd> set bonding mode 3 10

set bonding primary
~~~~~~~~~~~~~~~~~~~

Set an Ethernet member device as the primary device on a Link Bonding device::

   testpmd> set bonding primary (member id) (port id)

For example, to set the Ethernet member device (port 6) as the primary port of a Link Bonding device (port 10)::

   testpmd> set bonding primary 6 10

set bonding mac
~~~~~~~~~~~~~~~

Set the MAC address of a Link Bonding device::

   testpmd> set bonding mac (port id) (mac)

For example, to set the MAC address of a Link Bonding device (port 10) to 00:00:00:00:00:01::

   testpmd> set bonding mac 10 00:00:00:00:00:01

set bonding balance_xmit_policy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Set the transmission policy for a Link Bonding device when it is in Balance XOR mode::

   testpmd> set bonding balance_xmit_policy (port_id) (l2|l23|l34)

For example, to set a Link Bonding device (port 10) to use a balance policy of layer 3+4 (IP addresses & UDP ports)::

   testpmd> set bonding balance_xmit_policy 10 l34


set bonding mon_period
~~~~~~~~~~~~~~~~~~~~~~

Set the link status monitoring polling period in milliseconds for a bonding device.

This adds support for PMD member devices which do not support link status interrupts.
When the mon_period is set to a value greater than 0, all PMDs which do not support
link status ISR will be queried every polling interval to check if their link status has changed::

   testpmd> set bonding mon_period (port_id) (value)

For example, to set the link status monitoring polling period of a bonding device (port 5) to 150ms::

   testpmd> set bonding mon_period 5 150


set bonding lacp dedicated_queue
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Enable dedicated tx/rx queues on bonding device members to handle LACP control plane traffic
when in mode 4 (link-aggregation-802.3ad)::

   testpmd> set bonding lacp dedicated_queues (port_id) (enable|disable)


set bonding agg_mode
~~~~~~~~~~~~~~~~~~~~

Select one of the aggregator modes when in mode 4 (link-aggregation-802.3ad)::

   testpmd> set bonding agg_mode (port_id) (bandwidth|count|stable)


show bonding config
~~~~~~~~~~~~~~~~~~~

Show the current configuration of a Link Bonding device.
It also shows link-aggregation-802.3ad information if the link mode is mode 4::

   testpmd> show bonding config (port id)

For example,
to show the configuration of a Link Bonding device (port 9) with 3 member devices (1, 3, 4)
in balance mode with a transmission policy of layer 2+3::

   testpmd> show bonding config 9
     - Dev basic:
        Bonding mode: BALANCE(2)
        Balance Xmit Policy: BALANCE_XMIT_POLICY_LAYER23
        Members (3): [1 3 4]
        Active Members (3): [1 3 4]
        Primary: [3]

640