..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2020 Intel Corporation.

Driver for the Intel® Dynamic Load Balancer (DLB)
=================================================

The DPDK DLB poll mode driver supports the Intel® Dynamic Load Balancer,
hardware versions 2.0 and 2.5.

Prerequisites
-------------

Follow the DPDK :ref:`Getting Started Guide for Linux <linux_gsg>` to set up
the basic DPDK environment.

Configuration
-------------

The DLB PF PMD is a user-space PMD that uses VFIO to gain direct
device access. To use this operation mode, the PCIe PF device must
be bound to a DPDK-compatible VFIO driver, such as vfio-pci.

Eventdev API Notes
------------------

The DLB PMD provides the functions of a DPDK event device; specifically, it
supports atomic, ordered, and parallel scheduling of events from queues to
ports. However, the DLB hardware is not a perfect match to the eventdev API.
Some DLB features, such as directed ports, are abstracted by the PMD.

In general, the DLB PMD is designed for ease of use and does not require a
detailed understanding of the hardware, but these details are important when
writing high-performance code. This section describes the places where the
eventdev API and the DLB hardware misalign.

Scheduling Domain Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DLB supports 32 scheduling domains. When a scheduling domain is configured,
it allocates load-balanced and directed queues, ports, credits, and other
hardware resources. Some resource allocations are user-controlled -- the
number of queues, for example -- and others, like credit pools (one directed
and one load-balanced pool per scheduling domain), are not.

The DLB is a closed system eventdev, and as such the ``nb_events_limit``
device setup argument and the per-port ``new_event_threshold`` argument apply
as defined in the eventdev header file. The limit is applied to all enqueues,
regardless of whether the enqueue consumes a directed or load-balanced credit.
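
The eventdev API calls involved are sketched below; the device ID, sizes, and
port parameters are illustrative values chosen for this example, not
recommendations.

    .. code-block:: c

       /* Illustrative sketch: configure a scheduling domain with a
        * closed-system event limit, then set a per-port new_event_threshold.
        */
       uint8_t dev_id = 0; /* assumed device ID */
       struct rte_event_dev_info info;
       struct rte_event_dev_config config = {0};
       struct rte_event_port_conf port_conf;

       rte_event_dev_info_get(dev_id, &info);

       config.nb_event_queues = 4;
       config.nb_event_ports = 4;
       config.nb_events_limit = info.max_num_events;
       config.nb_event_queue_flows = info.max_event_queue_flows;
       config.nb_event_port_dequeue_depth = info.max_event_port_dequeue_depth;
       config.nb_event_port_enqueue_depth = info.max_event_port_enqueue_depth;
       rte_event_dev_configure(dev_id, &config);

       rte_event_port_default_conf_get(dev_id, 0, &port_conf);
       port_conf.new_event_threshold = config.nb_events_limit / 4;
       rte_event_port_setup(dev_id, 0, &port_conf);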

Load-Balanced Queues
~~~~~~~~~~~~~~~~~~~~

A load-balanced queue can support atomic and ordered scheduling, or atomic and
unordered scheduling, but not atomic, ordered, and unordered scheduling all at
once. A queue's scheduling types are controlled by the event queue
configuration.

If the user sets the ``RTE_EVENT_QUEUE_CFG_ALL_TYPES`` flag, the
``nb_atomic_order_sequences`` field determines the supported scheduling types.
With a non-zero ``nb_atomic_order_sequences``, the queue is configured for
atomic and ordered scheduling. In this case, ``RTE_SCHED_TYPE_PARALLEL``
scheduling is supported by scheduling those events as ordered events. Note
that when such an event is dequeued, its sched_type will be
``RTE_SCHED_TYPE_ORDERED``. If instead ``nb_atomic_order_sequences`` is zero,
the queue is configured for atomic and unordered scheduling. In this case,
``RTE_SCHED_TYPE_ORDERED`` is unsupported.

If the ``RTE_EVENT_QUEUE_CFG_ALL_TYPES`` flag is not set, the ``schedule_type``
field dictates the queue's scheduling type.

The ``nb_atomic_order_sequences`` queue configuration field sets the ordered
queue's reorder buffer size. DLB has 2 groups of ordered queues, where each
group is configured to contain either 1 queue with 1024 reorder entries, or 2
queues with 512 reorder entries, and so on down to 32 queues with 32 entries.

When a load-balanced queue is created, the PMD will configure a new sequence
number group on demand if the requested sequence number count does not match a
pre-existing group with available reorder buffer entries. If all sequence
number groups are in use, no new group will be created and queue configuration
will fail. (Note that when the PMD is used with a virtual DLB device, it
cannot change the sequence number configuration.)

The queue's ``nb_atomic_flows`` parameter is ignored by the DLB PMD, because
the DLB does not limit the number of flows a queue can track. In the DLB, all
load-balanced queues can use the full 16-bit flow ID range.

Load-balanced and Directed Ports
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DLB ports come in two flavors: load-balanced and directed. The eventdev API
does not have the same concept, but it has a similar one: ports and queues that
are singly-linked (i.e. linked to a single queue or port, respectively).

The ``rte_event_dev_info_get()`` function reports the number of available
event ports and queues (among other things). For the DLB PMD,
``max_event_ports`` and ``max_event_queues`` report the number of available
load-balanced ports and queues, and ``max_single_link_event_port_queue_pairs``
reports the number of available directed ports and queues.
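
For example, an application can query these limits before configuring the
device (a minimal sketch; error handling omitted):

    .. code-block:: c

       struct rte_event_dev_info info;

       rte_event_dev_info_get(dev_id, &info);
       /* Load-balanced ports and queues available on the DLB device. */
       printf("LB ports: %d, LB queues: %d\n",
              info.max_event_ports, info.max_event_queues);
       /* Directed port/queue pairs available. */
       printf("Directed port/queue pairs: %d\n",
              info.max_single_link_event_port_queue_pairs);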

When a scheduling domain is created in ``rte_event_dev_configure()``, the user
specifies ``nb_event_ports`` and ``nb_single_link_event_port_queues``, which
control the total number of ports (load-balanced and directed) and the number
of directed ports, respectively. Hence, the number of requested load-balanced
ports is ``nb_event_ports - nb_single_link_event_port_queues``. The
``nb_event_queues`` field specifies the total number of queues (load-balanced
and directed). The number of directed queues comes from
``nb_single_link_event_port_queues``, since directed ports and queues come in
pairs.

When a port is set up, the ``RTE_EVENT_PORT_CFG_SINGLE_LINK`` flag determines
whether it should be configured as a directed (the flag is set) or a
load-balanced (the flag is unset) port. Similarly, the
``RTE_EVENT_QUEUE_CFG_SINGLE_LINK`` queue configuration flag controls
whether it is a directed or load-balanced queue.

Load-balanced ports can only be linked to load-balanced queues, and directed
ports can only be linked to directed queues. Furthermore, directed ports can
only be linked to a single directed queue (and vice versa), and that link
cannot change after the eventdev is started.
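
The sketch below (illustrative IDs; error handling omitted) shows a directed
port/queue pair being set up and linked. With ``nb_event_ports = 4`` and
``nb_single_link_event_port_queues = 1``, for instance, the device would have
three load-balanced ports and one directed port/queue pair.

    .. code-block:: c

       struct rte_event_port_conf pconf;
       struct rte_event_queue_conf qconf;
       uint8_t dir_port = 3, dir_queue = 3; /* example IDs */

       rte_event_queue_default_conf_get(dev_id, dir_queue, &qconf);
       qconf.event_queue_cfg = RTE_EVENT_QUEUE_CFG_SINGLE_LINK;
       rte_event_queue_setup(dev_id, dir_queue, &qconf);

       rte_event_port_default_conf_get(dev_id, dir_port, &pconf);
       pconf.event_port_cfg = RTE_EVENT_PORT_CFG_SINGLE_LINK;
       rte_event_port_setup(dev_id, dir_port, &pconf);

       /* A directed port is linked to exactly one directed queue. */
       rte_event_port_link(dev_id, dir_port, &dir_queue, NULL, 1);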

The eventdev API does not have a directed scheduling type. To support directed
traffic, the DLB PMD detects when an event is being sent to a directed queue
and overrides its scheduling type. Note that the originally selected scheduling
type (atomic, ordered, or parallel) is not preserved, and an event's sched_type
will be set to ``RTE_SCHED_TYPE_ATOMIC`` when it is dequeued from a directed
port.

Finally, even though all three event types are supported on the same QID by
converting unordered events to ordered, such usage should be avoided where
possible, since mixing types on the same queue consumes valuable reorder
resources and imposes ordering on events that do not require it.

Flow ID
~~~~~~~

The flow ID field is preserved in the event when it is scheduled in the
DLB.
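
For instance (a minimal sketch, with ``m`` an assumed mbuf, illustrative port
IDs, and ``handle_event()`` a hypothetical application function), a flow ID
set at enqueue time can be read back from the dequeued event:

    .. code-block:: c

       struct rte_event ev = {0}, deq;

       ev.op = RTE_EVENT_OP_NEW;
       ev.queue_id = 0;
       ev.sched_type = RTE_SCHED_TYPE_ATOMIC;
       ev.flow_id = 0x1234;   /* preserved by the DLB */
       ev.mbuf = m;
       rte_event_enqueue_burst(dev_id, producer_port, &ev, 1);

       if (rte_event_dequeue_burst(dev_id, worker_port, &deq, 1, 0))
               handle_event(&deq); /* deq.flow_id == 0x1234 for this event */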

Hardware Credits
~~~~~~~~~~~~~~~~

DLB uses a hardware credit scheme to prevent software from overflowing hardware
event storage, with each unit of storage represented by a credit. A port spends
a credit to enqueue an event, and hardware refills the ports with credits as the
events are scheduled to ports. Refills come from credit pools.

For DLB v2.5, there is a single credit pool used for both load-balanced and
directed traffic.

For DLB v2.0, each port is a member of both a load-balanced credit pool and a
directed credit pool. The load-balanced credits are used to enqueue to
load-balanced queues, and directed credits are used for directed queues.
These pools' sizes are controlled by the ``nb_events_limit`` field in
``struct rte_event_dev_config``. The load-balanced pool is sized to contain
``nb_events_limit`` credits, and the directed pool is sized to contain
``nb_events_limit/2`` credits. The directed pool size can be overridden with
the ``num_dir_credits`` devargs argument, like so:

    .. code-block:: console

       --allow ea:00.0,num_dir_credits=<value>

This can be used if the default allocation is too low or too high for the
specific application needs. The PMD also supports a devarg that limits the
``max_num_events`` reported by ``rte_event_dev_info_get()``:

    .. code-block:: console

       --allow ea:00.0,max_num_events=<value>

By default, ``max_num_events`` is reported as the total available load-balanced
credits. If multiple DLB-based applications are being used, it may be desirable
to control how many load-balanced credits each application uses, particularly
when applications are written to configure ``nb_events_limit`` equal to the
reported ``max_num_events``.

Each port is a member of both credit pools. A port's credit allocation is
defined by its low watermark, high watermark, and refill quanta. These three
parameters are calculated by the DLB PMD like so:

- The load-balanced high watermark is set to the port's enqueue_depth.
  The directed high watermark is set to the minimum of the enqueue_depth and
  the directed pool size divided by the total number of ports.
- The refill quanta is set to half the high watermark.
- The low watermark is set to the minimum of 16 and the refill quanta.
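
Expressed as a sketch, the calculation mirrors the description above; it is
performed inside the PMD, and the variables (``enqueue_depth``,
``dir_pool_size``, ``num_ports``) are stand-ins for the PMD's internal values:

    .. code-block:: c

       uint32_t ldb_high_wm = enqueue_depth;
       uint32_t dir_high_wm = RTE_MIN(enqueue_depth,
                                      dir_pool_size / num_ports);
       uint32_t refill_quanta = ldb_high_wm / 2;
       uint32_t low_wm = RTE_MIN(16u, refill_quanta);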

When the eventdev is started, each port is pre-allocated a high watermark's
worth of credits. For example, if an eventdev contains four ports with enqueue
depths of 32 and a load-balanced credit pool size of 4096, each port will start
with 32 load-balanced credits, and there will be 3968 credits available to
replenish the ports. Thus, a single port is not capable of enqueueing up to the
``nb_events_limit`` (without any events being dequeued), since the other ports
are retaining their initial credit allocation; in short, all ports must enqueue
in order to reach the limit.

If a port attempts to enqueue and has no credits available, the enqueue
operation will fail and the application must retry the enqueue. Credits are
replenished asynchronously by the DLB hardware.

Software Credits
~~~~~~~~~~~~~~~~

The DLB is a "closed system" event dev, and the DLB PMD layers a software
credit scheme on top of the hardware credit scheme in order to comply with
the per-port backpressure described in the eventdev API.

The DLB's hardware scheme is local to a queue/pipeline stage: a port spends a
credit when it enqueues to a queue, and credits are later replenished after the
events are dequeued and released.

In the software credit scheme, a credit is consumed when a new (``.op =
RTE_EVENT_OP_NEW``) event is injected into the system, and the credit is
replenished when the event is released from the system (either explicitly with
``RTE_EVENT_OP_RELEASE`` or implicitly in ``rte_event_dequeue_burst()``).

In this model, an event is "in the system" from its first enqueue into the
eventdev until it is last dequeued. If the event goes through multiple event
queues, it is still considered "in the system" while a worker thread is
processing it.

A port will fail to enqueue if the number of events in the system exceeds its
``new_event_threshold`` (specified at port setup time). A port will also fail
to enqueue if it lacks enough hardware credits to enqueue; load-balanced
credits are used to enqueue to a load-balanced queue, and directed credits are
used to enqueue to a directed queue.
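
These credit checks map onto the event ``op`` values used at enqueue time. A
minimal sketch (illustrative port and queue IDs; no error handling):

    .. code-block:: c

       struct rte_event ev; /* previously dequeued or newly filled in */

       /* Producer: injecting a new event consumes a software credit and
        * counts against the port's new_event_threshold.
        */
       ev.op = RTE_EVENT_OP_NEW;
       rte_event_enqueue_burst(dev_id, producer_port, &ev, 1);

       /* Worker: forwarding keeps the event "in the system", so no software
        * credit is consumed, but a hardware credit is.
        */
       ev.op = RTE_EVENT_OP_FORWARD;
       ev.queue_id = next_queue;
       rte_event_enqueue_burst(dev_id, worker_port, &ev, 1);

       /* Final stage: releasing the event (explicitly as below, or implicitly
        * on the next dequeue) returns its software credit.
        */
       ev.op = RTE_EVENT_OP_RELEASE;
       rte_event_enqueue_burst(dev_id, worker_port, &ev, 1);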

The out-of-credit situations are typically transient, and an eventdev
application using the DLB ought to retry its enqueues if they fail.
If an enqueue fails, the DLB PMD sets rte_errno as follows:

- -ENOSPC: Credit exhaustion (either hardware or software).
- -EINVAL: Invalid argument, such as port ID, queue ID, or sched_type.

Depending on the pipeline the application has constructed, it's possible to
enter a credit deadlock scenario wherein a worker thread lacks the credit
to enqueue an event, and it must dequeue an event before it can recover the
credit. If the worker thread retries its enqueue indefinitely, it will not
make forward progress. Such a deadlock is possible if the application has event
"loops", in which an event is dequeued from queue A and later enqueued back to
queue A.

Due to this, a worker should stop retrying after a time, release the events it
is attempting to enqueue, and dequeue more events. It is important that the
worker release the events and not simply set them aside to retry the enqueue
again later, because the port has a limited history list size (by default, the
same as the port's dequeue_depth).
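
A worker loop following this guidance might look like the sketch below; the
retry bound, ``process()``, ``done``, and the port and queue IDs are
illustrative application-side choices:

    .. code-block:: c

       #define ENQ_RETRY_LIMIT 1000 /* example bound */

       struct rte_event ev;

       while (!done) {
               if (rte_event_dequeue_burst(dev_id, port, &ev, 1, 0) == 0)
                       continue;

               process(&ev); /* hypothetical application stage */

               ev.op = RTE_EVENT_OP_FORWARD;
               ev.queue_id = next_queue;

               int retries = 0;
               while (rte_event_enqueue_burst(dev_id, port, &ev, 1) == 0) {
                       if (rte_errno != -ENOSPC ||
                           ++retries > ENQ_RETRY_LIMIT) {
                               /* Give up: release the event rather than
                                * setting it aside, so the port's credits and
                                * history list are not exhausted.
                                */
                               ev.op = RTE_EVENT_OP_RELEASE;
                               rte_event_enqueue_burst(dev_id, port, &ev, 1);
                               break;
                       }
               }
       }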

Priority
~~~~~~~~

The DLB supports event priority and per-port queue service priority, as
described in the eventdev header file. The DLB does not support 'global' event
queue priority established at queue creation time.

DLB supports 4 event and queue service priority levels. For both priority
types, the PMD uses the upper three bits of the priority field to determine
the DLB priority, discarding the 5 least significant bits. However, the least
significant of those three bits is effectively ignored when binning into the 4
priority levels. The discarded 5 least significant event priority bits are not
preserved when an event is enqueued.

Note that event priority only works within the same event type.
When atomic and ordered or unordered events are enqueued to the same QID,
priority across the types is always equal, and both types are served in a
round-robin manner.
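
For example, using the eventdev priority macros (per the binning described
above, values that differ only in their low-order bits map to the same DLB
priority level):

    .. code-block:: c

       struct rte_event ev;

       /* Highest priority: macro value 0. */
       ev.priority = RTE_EVENT_DEV_PRIORITY_HIGHEST;

       /* Lowest priority: macro value 255. Values that differ only in their
        * low-order bits (e.g. 0 and 32) land in the same DLB priority level.
        */
       ev.priority = RTE_EVENT_DEV_PRIORITY_LOWEST;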

Reconfiguration
~~~~~~~~~~~~~~~

The eventdev API allows one to reconfigure a device, its ports, and its queues
by first stopping the device, calling the configuration function(s), then
restarting the device. The DLB does not support configuring an individual queue
or port without first reconfiguring the entire device, however, so there are
certain reconfiguration sequences that are valid in the eventdev API but not
supported by the PMD.

Specifically, the PMD supports the following configuration sequence (a code
sketch follows the list):

#. Configure and start the device.

#. Stop the device.

#. (Optional) Reconfigure the device.
   Set up the queue(s); the reconfigured queue(s) lose their previous port
   links. Set up the port(s); the reconfigured port(s) lose their previous
   queue links. Link the port(s) to the queue(s).

#. Restart the device. If the device is reconfigured in step 3 but one or more
   of its ports or queues are not, the PMD will apply their previous
   configuration (including port->queue links) at this time.
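
A sketch of the supported sequence (illustrative IDs; error handling omitted):

    .. code-block:: c

       /* 1. Configure and start the device (initial setup not shown). */
       rte_event_dev_start(dev_id);

       /* 2. Stop the device. */
       rte_event_dev_stop(dev_id);

       /* 3. Reconfigure the device, then re-setup and re-link the queues and
        *    ports whose configuration or links changed.
        */
       rte_event_dev_configure(dev_id, &new_config);
       rte_event_queue_setup(dev_id, queue_id, &queue_conf);
       rte_event_port_setup(dev_id, port_id, &port_conf);
       rte_event_port_link(dev_id, port_id, &queue_id, NULL, 1);

       /* 4. Restart the device. */
       rte_event_dev_start(dev_id);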

The PMD does not support the following configuration sequence:

#. Configure and start the device.

#. Stop the device.

#. Set up a queue or set up a port.

#. Start the device.

This sequence is not supported because the event device must be reconfigured
before its ports or queues can be reconfigured.

Atomic Inflights Allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the last stage prior to scheduling an atomic event to a CQ, DLB holds the
inflight event in a temporary buffer that is divided among load-balanced
queues. If a queue's atomic buffer storage fills up, this can result in
head-of-line blocking. For example:

- An LDB queue is allocated N atomic buffer entries.
- All N entries are filled with events from flow X, which is pinned to CQ 0.

Until CQ 0 releases one or more events, no other atomic flows for that LDB
queue can be scheduled. The likelihood of this case depends on the eventdev
configuration, traffic behavior, event processing latency, potential for a
worker to be interrupted or otherwise delayed, etc.

By default, the PMD allocates 64 buffer entries for each load-balanced queue,
which provides an even division across all 32 queues but potentially wastes
buffer space (e.g. if not all queues are used, or aren't used for atomic
scheduling).

QID Depth Threshold
~~~~~~~~~~~~~~~~~~~

DLB supports setting and tracking queue depth thresholds. Hardware uses
the thresholds to track how full a queue is relative to its configured
threshold. Four buckets are used:

- Less than or equal to 50% of the queue depth threshold
- Greater than 50%, but less than or equal to 75% of the depth threshold
- Greater than 75%, but less than or equal to 100% of the depth threshold
- Greater than 100% of the depth threshold

Per-queue threshold metrics are tracked in the DLB xstats, and are also
returned in the ``impl_opaque`` field of each received event.

The per-QID threshold can be specified as part of the device args, and
can be applied to all queues, a range of queues, or a single queue, as
shown below.

    .. code-block:: console

       --allow ea:00.0,qid_depth_thresh=all:<threshold_value>
       --allow ea:00.0,qid_depth_thresh=qidA-qidB:<threshold_value>
       --allow ea:00.0,qid_depth_thresh=qid:<threshold_value>

Class of service
~~~~~~~~~~~~~~~~

DLB supports provisioning the DLB bandwidth into 4 classes of service.
An LDB port or range of LDB ports may be configured to use one of the classes.
If a port's class of service (COS) is not defined, then it will be allocated
from class 0, class 1, class 2, or class 3, in that order, depending on
availability.

The sum of the ``cos_bw`` values may not exceed 100, and no more than
16 LDB ports may be assigned to a given class of service. If the port COS is
not defined on the command line, then each class is assigned 25% of the
bandwidth, and the available load-balanced ports are split between the classes.
Per-port class of service and bandwidth can be specified in the devargs,
as follows.

    .. code-block:: console

       --allow ea:00.0,port_cos=Px-Py:<0-3>,cos_bw=5:10:80:5
       --allow ea:00.0,port_cos=Px:<0-3>,cos_bw=5:10:80:5

Use X86 Vector Instructions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

DLB supports using x86 vector instructions to optimize the data path.

The default mode of operation is to use scalar instructions, but
the use of vector instructions can be enabled in the devargs, as
follows:

    .. code-block:: console

       --allow ea:00.0,vector_opts_enabled=<y/Y>

Maximum CQ Depth
~~~~~~~~~~~~~~~~

DLB supports configuring the maximum depth of a consumer queue (CQ).
The depth must be between 32 and 128, and must be a power of 2. Note
that credit deadlocks may occur as a result of changing the default depth.
To prevent deadlock, the user may also need to configure the maximum
enqueue depth.

    .. code-block:: console

       --allow ea:00.0,max_cq_depth=<depth>

Maximum Enqueue Depth
~~~~~~~~~~~~~~~~~~~~~

DLB supports configuring the maximum enqueue depth of a producer port (PP).
The depth must be between 32 and 1024, and must be a power of 2.

    .. code-block:: console

       --allow ea:00.0,max_enqueue_depth=<depth>

Producer Coremask
~~~~~~~~~~~~~~~~~

For best performance, applications running on certain cores should use
the DLB device locally available on the same tile along with other
resources. To allocate optimal resources, probing is done for each
producer port (PP) for a given CPU, and the best-performing ports are
allocated to producers. The CPU used for probing is either the first
core of the producer coremask (if present) or the second core of the EAL
coremask. This will be extended later to probe for all CPUs in the
producer coremask or EAL coremask. The producer coremask can be passed
along with the BDF of the DLB devices.

    .. code-block:: console

       -a xx:y.z,producer_coremask=<core_mask>

Default LDB Port Allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For optimal load balancing, ports that map to one or more QIDs in common
should not be in numerical sequence. The port->QID mapping is application
dependent, but the driver interleaves port IDs as much as possible to
reduce the likelihood of sequential ports mapping to the same QID(s).

Hence, DLB uses an initial allocation of port IDs to maximize the
average distance between an ID and its immediate neighbors (i.e. the
distance from 1 to 0 and to 2, the distance from 2 to 1 and to 3, etc.).
The initial port allocation option can be passed through a devarg: if set
to y (or Y), the initial port allocation is used; otherwise it is not.

    .. code-block:: console

       --allow ea:00.0,default_port_allocation=<y/Y>

QE Weight
~~~~~~~~~

DLB supports advanced scheduling mechanisms, such as CQ weight.
Each load-balanced CQ has a configurable work capacity (max 256),
which corresponds to the total QE weight DLB will allow to be enqueued
to that consumer. Every load-balanced event/QE carries a weight of 0, 2, 4,
or 8, and DLB will increment a (per-CQ) load indicator when it schedules a
QE to that CQ. The weight is also stored in the history list. When a
completion arrives, the weight is popped from the history list and used to
decrement the load indicator. This creates a new scheduling condition: a CQ
whose load is equal to or in excess of its capacity is not available for
traffic. Note that the weight may not exceed the maximum CQ depth.

Example command to enable the QE Weight feature:

    .. code-block:: console

       --allow ea:00.0,enable_cq_weight=<y/Y>

Running Eventdev Applications with DLB Device
---------------------------------------------

This section explains how to run eventdev applications with DLB hardware, and
how the command-line parameters differ between a DLB hardware device and a
virtual eventdev device such as SW0, so that users can run applications with
or without a DLB device and compare the performance of a DLB device.

In order to run eventdev applications, the DLB device must be bound
to a DPDK-compatible VFIO driver, such as vfio-pci.

Example command to bind the DLB device to the vfio-pci driver:

    .. code-block:: console

       ../usertools/dpdk-devbind.py -b vfio-pci ea:00.0

Eventdev applications can be run with or without a DLB device.
The examples below show how to run an eventdev application without and with a
DLB device. Notice that the primary difference between the two examples is the
``--vdev <driver><id>`` parameter: the first example uses the virtual eventdev
device SW0, while the second runs directly against a DLB device bound to the
VFIO driver.
483
484Example command to run eventdev application without a DLB device:
485
486	.. code-block:: console
487
488	   sudo <build_dir>/app/dpdk-test-eventdev --vdev=event_sw0 -- \
489					--test=order_queue --plcores 1 --wlcores 2,3
490
491After binding DLB device to a supported pci driver such as vfio-pci,
492eventdev applications can be run on the DLB device.
493
494Example command to run eventdev application with a DLB device:
495
496	.. code-block:: console
497
498		sudo build/app/dpdk-test-eventdev -- --test=order_queue\
499			--plcores=1 --wlcores=2-7 --stlist=o --worker_deq_depth=128\
500			--prod_enq_burst_sz=64 --nb_flows=64 --nb_pkts=1000000
501
502A particular DLB device can also be picked from command line by passing
503	``--a`` or  ``--allow`` option:
504
505	.. code-block:: console
506
507		sudo build/app/dpdk-test-eventdev --allow ea:00.0 -- --test=order_queue\
508			--plcores=1 --wlcores=2-7 --stlist=o --worker_deq_depth=128\
509			--prod_enq_burst_sz=64 --nb_flows=64 --nb_pkts=1000000

Debugging options
~~~~~~~~~~~~~~~~~

To specify the log level for a DLB device, use ``--log-level=dlb,8``.
Example command to run an eventdev application with DLB device logging enabled:

    .. code-block:: console

       sudo build/app/dpdk-test-eventdev --allow ea:00.0 --log-level=dlb,8 -- --test=order_queue \
            --plcores=1 --wlcores=2-7 --stlist=o --worker_deq_depth=128 \
            --prod_enq_burst_sz=64 --nb_flows=64 --nb_pkts=1000000