..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2020 Intel Corporation.

Driver for the Intel® Dynamic Load Balancer (DLB2)
==================================================

The DPDK dlb2 poll mode driver supports the Intel® Dynamic Load Balancer.

Prerequisites
-------------

Follow the DPDK :ref:`Getting Started Guide for Linux <linux_gsg>` to set up
the basic DPDK environment.

Configuration
-------------

The DLB2 PF PMD is a user-space PMD that uses VFIO to gain direct
device access. To use this operation mode, the PCIe PF device must be bound
to a DPDK-compatible VFIO driver, such as vfio-pci.

Eventdev API Notes
------------------

The DLB2 provides the functions of a DPDK event device; specifically, it
supports atomic, ordered, and parallel scheduling of events from queues to
ports. However, the DLB2 hardware is not a perfect match to the eventdev API.
Some DLB2 features are abstracted by the PMD, such as directed ports.

In general the dlb2 PMD is designed for ease-of-use and does not require a
detailed understanding of the hardware, but these details are important when
writing high-performance code. This section describes the places where the
eventdev API and DLB2 misalign.

Scheduling Domain Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are 32 scheduling domains in the DLB2.
When one is configured, it allocates load-balanced and
directed queues, ports, credits, and other hardware resources. Some
resource allocations are user-controlled -- the number of queues, for example
-- and others, like credit pools (one directed and one load-balanced pool per
scheduling domain), are not.

The DLB2 is a closed system eventdev, and as such the ``nb_events_limit``
device setup argument and the per-port ``new_event_threshold`` argument apply
as defined in the eventdev header file. The limit is applied to all enqueues,
regardless of whether they consume a directed or load-balanced credit.

Load-Balanced Queues
~~~~~~~~~~~~~~~~~~~~

A load-balanced queue can support atomic and ordered scheduling, or atomic and
unordered scheduling, but not atomic and unordered and ordered scheduling. A
queue's scheduling types are controlled by the event queue configuration.

If the user sets the ``RTE_EVENT_QUEUE_CFG_ALL_TYPES`` flag, the
``nb_atomic_order_sequences`` field determines the supported scheduling types.
With a non-zero ``nb_atomic_order_sequences``, the queue is configured for
atomic and ordered scheduling. In this case, ``RTE_SCHED_TYPE_PARALLEL``
scheduling is supported by scheduling those events as ordered events; note that
when such an event is dequeued, its sched_type will be
``RTE_SCHED_TYPE_ORDERED``. If ``nb_atomic_order_sequences`` is zero, the queue
is configured for atomic and unordered scheduling, and
``RTE_SCHED_TYPE_ORDERED`` is unsupported.

If the ``RTE_EVENT_QUEUE_CFG_ALL_TYPES`` flag is not set, schedule_type
dictates the queue's scheduling type.

The ``nb_atomic_order_sequences`` queue configuration field sets the ordered
queue's reorder buffer size. DLB2 has 4 groups of ordered queues, where each
group is configured to contain either 1 queue with 1024 reorder entries, 2
queues with 512 reorder entries, and so on down to 32 queues with 32 entries.
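
As an illustration of the above, a minimal sketch of setting up a load-balanced
queue for atomic and ordered scheduling follows; the device ID, queue ID, and
sequence count are assumptions chosen for the example, not requirements:

    .. code-block:: c

       #include <rte_eventdev.h>

       uint8_t dev_id = 0;   /* assumed: an already-configured DLB2 eventdev */
       uint8_t queue_id = 0; /* assumed queue ID */

       struct rte_event_queue_conf qconf = {
               /* Accept any sched_type on enqueue; because
                * nb_atomic_order_sequences is non-zero, the queue supports
                * atomic and ordered scheduling, and parallel events are
                * scheduled as ordered events.
                */
               .event_queue_cfg = RTE_EVENT_QUEUE_CFG_ALL_TYPES,
               .nb_atomic_order_sequences = 1024, /* reorder buffer size */
               .priority = RTE_EVENT_DEV_PRIORITY_NORMAL,
       };

       if (rte_event_queue_setup(dev_id, queue_id, &qconf) < 0) {
               /* Queue setup can fail if, for example, all sequence number
                * groups are already in use (see below).
                */
       }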
When a load-balanced queue is created, the PMD will configure a new sequence
number group on-demand if num_sequence_numbers does not match a pre-existing
group with available reorder buffer entries. If all sequence number groups are
in use, no new group will be created and queue configuration will fail. (Note
that when the PMD is used with a virtual DLB2 device, it cannot change the
sequence number configuration.)

The queue's ``nb_atomic_flows`` parameter is ignored by the DLB2 PMD, because
the DLB2 does not limit the number of flows a queue can track. In the DLB2, all
load-balanced queues can use the full 16-bit flow ID range.

Load-balanced and Directed Ports
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DLB2 ports come in two flavors: load-balanced and directed. The eventdev API
does not have the same concept, but it has a similar one: ports and queues that
are singly-linked (i.e. linked to a single queue or port, respectively).

The ``rte_event_dev_info_get()`` function reports the number of available
event ports and queues (among other things). For the DLB2 PMD, max_event_ports
and max_event_queues report the number of available load-balanced ports and
queues, and max_single_link_event_port_queue_pairs reports the number of
available directed ports and queues.
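
A hedged sketch of querying this split before configuring the device (the
device ID is an assumption for illustration):

    .. code-block:: c

       #include <stdio.h>
       #include <rte_eventdev.h>

       uint8_t dev_id = 0; /* assumed device ID */
       struct rte_event_dev_info info;

       rte_event_dev_info_get(dev_id, &info);

       /* For the DLB2 PMD, max_event_ports/max_event_queues count
        * load-balanced resources, and the single-link pair count covers
        * directed ports and queues.
        */
       printf("LB ports: %u, LB queues: %u, directed pairs: %u\n",
              info.max_event_ports, info.max_event_queues,
              info.max_single_link_event_port_queue_pairs);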
When a scheduling domain is created in ``rte_event_dev_configure()``, the user
specifies ``nb_event_ports`` and ``nb_single_link_event_port_queues``, which
control the total number of ports (load-balanced and directed) and the number
of directed ports. Hence, the number of requested load-balanced ports is
``nb_event_ports - nb_single_link_event_port_queues``. The ``nb_event_queues``
field specifies the total number of queues (load-balanced and directed). The
number of directed queues comes from ``nb_single_link_event_port_queues``,
since directed ports and queues come in pairs.

When a port is set up, the ``RTE_EVENT_PORT_CFG_SINGLE_LINK`` flag determines
whether it should be configured as a directed (the flag is set) or a
load-balanced (the flag is unset) port. Similarly, the
``RTE_EVENT_QUEUE_CFG_SINGLE_LINK`` queue configuration flag controls
whether it is a directed or load-balanced queue.

Load-balanced ports can only be linked to load-balanced queues, and directed
ports can only be linked to directed queues. Furthermore, directed ports can
only be linked to a single directed queue (and vice versa), and that link
cannot change after the eventdev is started.

The eventdev API does not have a directed scheduling type. To support directed
traffic, the dlb2 PMD detects when an event is being sent to a directed queue
and overrides its scheduling type. Note that the originally selected scheduling
type (atomic, ordered, or parallel) is not preserved, and an event's sched_type
will be set to ``RTE_SCHED_TYPE_ATOMIC`` when it is dequeued from a directed
port.

Flow ID
~~~~~~~

The flow ID field is preserved in the event when it is scheduled in the
DLB2.

Hardware Credits
~~~~~~~~~~~~~~~~

DLB2 uses a hardware credit scheme to prevent software from overflowing
hardware event storage, with each unit of storage represented by a credit. A
port spends a credit to enqueue an event, and hardware refills the ports with
credits as the events are scheduled to ports. Refills come from credit pools,
and each port is a member of a load-balanced credit pool and a directed credit
pool. The load-balanced credits are used to enqueue to load-balanced queues,
and directed credits are used for directed queues.

A DLB2 eventdev contains one load-balanced and one directed credit pool. These
pools' sizes are controlled by the nb_events_limit field in struct
rte_event_dev_config. The load-balanced pool is sized to contain
nb_events_limit credits, and the directed pool is sized to contain
nb_events_limit/4 credits. The directed pool size can be overridden with the
num_dir_credits vdev argument, like so:

    .. code-block:: console

       --vdev=dlb2_event,num_dir_credits=<value>

This can be used if the default allocation is too low or too high for the
specific application needs. The PMD also supports a vdev arg that limits the
max_num_events reported by rte_event_dev_info_get():

    .. code-block:: console

       --vdev=dlb2_event,max_num_events=<value>

By default, max_num_events is reported as the total available load-balanced
credits. If multiple DLB2-based applications are being used, it may be
desirable to control how many load-balanced credits each application uses,
particularly when application(s) are written to configure nb_events_limit
equal to the reported max_num_events.
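
A hedged sketch of that sizing pattern (the queue, port, and depth counts are
illustrative assumptions):

    .. code-block:: c

       #include <rte_eventdev.h>

       uint8_t dev_id = 0; /* assumed device ID */
       struct rte_event_dev_info info;

       rte_event_dev_info_get(dev_id, &info);

       struct rte_event_dev_config config = {
               .nb_event_queues = 4,
               .nb_event_ports = 4,
               .nb_single_link_event_port_queues = 1, /* one directed pair */
               /* Claim all reported load-balanced credits; with the
                * max_num_events vdev arg this can be capped per application.
                */
               .nb_events_limit = info.max_num_events,
               .nb_event_queue_flows = info.max_event_queue_flows,
               .nb_event_port_dequeue_depth = info.max_event_port_dequeue_depth,
               .nb_event_port_enqueue_depth = info.max_event_port_enqueue_depth,
       };

       if (rte_event_dev_configure(dev_id, &config) < 0) {
               /* handle configuration failure */
       }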
Each port is a member of both credit pools. A port's credit allocation is
defined by its low watermark, high watermark, and refill quanta. These three
parameters are calculated by the dlb2 PMD like so:

- The load-balanced high watermark is set to the port's enqueue_depth.
  The directed high watermark is set to the minimum of the enqueue_depth and
  the directed pool size divided by the total number of ports.
- The refill quanta is set to half the high watermark.
- The low watermark is set to the minimum of 16 and the refill quanta.

When the eventdev is started, each port is pre-allocated a high watermark's
worth of credits. For example, if an eventdev contains four ports with enqueue
depths of 32 and a load-balanced credit pool size of 4096, each port will start
with 32 load-balanced credits, and there will be 3968 credits available to
replenish the ports. Thus, a single port is not capable of enqueueing up to the
nb_events_limit (without any events being dequeued), since the other ports are
retaining their initial credit allocation; in short, all ports must enqueue in
order to reach the limit.

If a port attempts to enqueue and has no credits available, the enqueue
operation will fail and the application must retry the enqueue. Credits are
replenished asynchronously by the DLB2 hardware.

Software Credits
~~~~~~~~~~~~~~~~

The DLB2 is a "closed system" event dev, and the DLB2 PMD layers a software
credit scheme on top of the hardware credit scheme in order to comply with
the per-port backpressure described in the eventdev API.

The DLB2's hardware scheme is local to a queue/pipeline stage: a port spends a
credit when it enqueues to a queue, and credits are later replenished after the
events are dequeued and released.

In the software credit scheme, a credit is consumed when a new (.op =
RTE_EVENT_OP_NEW) event is injected into the system, and the credit is
replenished when the event is released from the system (either explicitly with
RTE_EVENT_OP_RELEASE or implicitly in dequeue_burst()).

In this model, an event is "in the system" from its first enqueue into the
eventdev until it is last dequeued. If the event goes through multiple event
queues, it is still considered "in the system" while a worker thread is
processing it.

A port will fail to enqueue if the number of events in the system exceeds its
``new_event_threshold`` (specified at port setup time). A port will also fail
to enqueue if it lacks enough hardware credits to enqueue; load-balanced
credits are used to enqueue to a load-balanced queue, and directed credits are
used to enqueue to a directed queue.

The out-of-credit situations are typically transient, and an eventdev
application using the DLB2 ought to retry its enqueues if they fail. If an
enqueue fails, the DLB2 PMD sets rte_errno as follows (see the sketch after
this list):

- -ENOSPC: Credit exhaustion (either hardware or software)
- -EINVAL: Invalid argument, such as port ID, queue ID, or sched_type.
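
A minimal, hedged sketch of handling an enqueue failure (the device, port, and
event values are assumptions for illustration):

    .. code-block:: c

       #include <rte_eventdev.h>
       #include <rte_errno.h>

       uint8_t dev_id = 0, port_id = 0; /* assumed IDs */
       struct rte_event ev = {
               .op = RTE_EVENT_OP_NEW,
               .queue_id = 0,
               .sched_type = RTE_SCHED_TYPE_ATOMIC,
       };

       if (rte_event_enqueue_burst(dev_id, port_id, &ev, 1) == 0) {
               if (rte_errno == -ENOSPC) {
                       /* Hardware or software credit exhaustion: typically
                        * transient, so the enqueue can be retried (but see
                        * the deadlock caveat below).
                        */
               } else {
                       /* -EINVAL: invalid port ID, queue ID, sched_type, etc. */
               }
       }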
Depending on the pipeline the application has constructed, it is possible to
enter a credit deadlock scenario wherein a worker thread lacks the credit to
enqueue an event, and it must dequeue an event before it can recover the
credit. If the worker thread retries its enqueue indefinitely, it will not
make forward progress. Such a deadlock is possible if the application has event
"loops", in which an event is dequeued from queue A and later enqueued back to
queue A.

Due to this, workers should stop retrying after a time, release the events they
are attempting to enqueue, and dequeue more events. It is important that the
workers release the events and do not simply set them aside to retry the
enqueue again later, because a port has a limited history list size (by
default, twice the port's dequeue_depth).

Priority
~~~~~~~~

The DLB2 supports event priority and per-port queue service priority, as
described in the eventdev header file. The DLB2 does not support 'global' event
queue priority established at queue creation time.

DLB2 supports 8 event and queue service priority levels. For both priority
types, the PMD uses the upper three bits of the priority field to determine the
DLB2 priority, discarding the 5 least significant bits. The 5 least significant
event priority bits are not preserved when an event is enqueued.

Reconfiguration
~~~~~~~~~~~~~~~

The eventdev API allows one to reconfigure a device, its ports, and its queues
by first stopping the device, calling the configuration function(s), then
restarting the device. The DLB2 does not support configuring an individual
queue or port without first reconfiguring the entire device, however, so there
are certain reconfiguration sequences that are valid in the eventdev API but
not supported by the PMD.

Specifically, the PMD supports the following configuration sequence (a code
sketch follows at the end of this section):

1. Configure and start the device
2. Stop the device
3. (Optional) Reconfigure the device
4. (Optional) If step 3 is run:

   a. Setup queue(s). The reconfigured queue(s) lose their previous port links.
   b. The reconfigured port(s) lose their previous queue links.

5. (Optional, only if steps 4a and 4b are run) Link port(s) to queue(s)
6. Restart the device. If the device is reconfigured in step 3 but one or more
   of its ports or queues are not, the PMD will apply their previous
   configuration (including port->queue links) at this time.

The PMD does not support the following configuration sequence:

1. Configure and start the device
2. Stop the device
3. Setup queue or setup port
4. Start the device

This sequence is not supported because the event device must be reconfigured
before its ports or queues can be.
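
A hedged sketch of the supported sequence above, after the device has already
been configured and started once (the configuration structures and IDs are
assumed to exist from that initial setup):

    .. code-block:: c

       #include <rte_eventdev.h>

       /* dev_id, config, qconf, pconf, port_id, and queue_id are assumed
        * to come from the initial configuration (step 1).
        */

       /* Step 2: stop the running device. */
       rte_event_dev_stop(dev_id);

       /* Step 3: the whole device must be reconfigured before any
        * individual queue or port can be set up again.
        */
       if (rte_event_dev_configure(dev_id, &config) < 0) {
               /* handle error */
       }

       /* Step 4: re-set up queues and ports; they lose their previous links. */
       rte_event_queue_setup(dev_id, queue_id, &qconf);
       rte_event_port_setup(dev_id, port_id, &pconf);

       /* Step 5: re-link, then step 6: restart. */
       rte_event_port_link(dev_id, port_id, &queue_id, NULL, 1);
       rte_event_dev_start(dev_id);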
Deferred Scheduling
~~~~~~~~~~~~~~~~~~~

The DLB2 PMD's default behavior for managing a CQ is to "pop" the CQ once per
dequeued event before returning from rte_event_dequeue_burst(). This frees the
corresponding entries in the CQ, which enables the DLB2 to schedule more events
to it.

To support applications seeking finer-grained scheduling control -- for example
deferring scheduling to get the best possible priority scheduling and
load-balancing -- the PMD supports a deferred scheduling mode. In this mode,
the CQ entry is not popped until the *subsequent* rte_event_dequeue_burst()
call. This mode only applies to load-balanced event ports with a dequeue depth
of 1.

To enable deferred scheduling, use the defer_sched vdev argument like so:

    .. code-block:: console

       --vdev=dlb2_event,defer_sched=on

Atomic Inflights Allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the last stage prior to scheduling an atomic event to a CQ, DLB2 holds the
inflight event in a temporary buffer that is divided among load-balanced
queues. If a queue's atomic buffer storage fills up, this can result in
head-of-line blocking. For example:

- An LDB queue is allocated N atomic buffer entries.
- All N entries are filled with events from flow X, which is pinned to CQ 0.

Until CQ 0 releases one or more events, no other atomic flows for that LDB
queue can be scheduled. The likelihood of this case depends on the eventdev
configuration, traffic behavior, event processing latency, potential for a
worker to be interrupted or otherwise delayed, etc.

By default, the PMD allocates 16 buffer entries for each load-balanced queue,
which provides an even division across all 128 queues but potentially wastes
buffer space (e.g. if not all queues are used, or aren't used for atomic
scheduling).

The PMD provides a dev arg to override the default per-queue allocation. To
increase a vdev's per-queue atomic-inflight allocation to (for example) 64:

    .. code-block:: console

       --vdev=dlb2_event,atm_inflights=64

QID Depth Threshold
~~~~~~~~~~~~~~~~~~~

DLB2 supports setting and tracking queue depth thresholds. Hardware uses the
thresholds to track how full a queue is compared to its threshold. Four
buckets are used:

- Less than or equal to 50% of queue depth threshold
- Greater than 50%, but less than or equal to 75% of depth threshold
- Greater than 75%, but less than or equal to 100% of depth threshold
- Greater than 100% of depth threshold

Per-queue threshold metrics are tracked in the DLB2 xstats, and are also
returned in the impl_opaque field of each received event.

The per-qid threshold can be specified as part of the device args, and can be
applied to all queues, a range of queues, or a single queue, as shown below.

    .. code-block:: console

       --vdev=dlb2_event,qid_depth_thresh=all:<threshold_value>
       --vdev=dlb2_event,qid_depth_thresh=qidA-qidB:<threshold_value>
       --vdev=dlb2_event,qid_depth_thresh=qid:<threshold_value>

Class of service
~~~~~~~~~~~~~~~~

DLB2 supports provisioning the DLB2 bandwidth into 4 classes of service.

- Class 4 corresponds to 40% of the DLB2 hardware bandwidth
- Class 3 corresponds to 30% of the DLB2 hardware bandwidth
- Class 2 corresponds to 20% of the DLB2 hardware bandwidth
- Class 1 corresponds to 10% of the DLB2 hardware bandwidth
- Class 0 corresponds to don't care

The classes are applied globally to the set of ports contained in this
scheduling domain, which is more appropriate for the bifurcated
PMD than for the PF PMD, since the PF PMD supports just 1 scheduling
domain.

Class of service can be specified in the devargs, as follows:

    .. code-block:: console

       --vdev=dlb2_event,cos=<0..4>