1.. SPDX-License-Identifier: BSD-3-Clause 2 Copyright(c) 2010-2014 Intel Corporation. 3 4Quality of Service (QoS) Framework 5================================== 6 7This chapter describes the DPDK Quality of Service (QoS) framework. 8 9Packet Pipeline with QoS Support 10-------------------------------- 11 12An example of a complex packet processing pipeline with QoS support is shown in the following figure. 13 14.. _figure_pkt_proc_pipeline_qos: 15 16.. figure:: ../img/pkt_proc_pipeline_qos.* 17 18 Complex Packet Processing Pipeline with QoS Support 19 20 21This pipeline can be built using reusable DPDK software libraries. 22The main blocks implementing QoS in this pipeline are: the policer, the dropper and the scheduler. 23A functional description of each block is provided in the following table. 24 25.. _table_qos_1: 26 27.. table:: Packet Processing Pipeline Implementing QoS 28 29 +---+------------------------+--------------------------------------------------------------------------------+ 30 | # | Block | Functional Description | 31 | | | | 32 +===+========================+================================================================================+ 33 | 1 | Packet I/O RX & TX | Packet reception/ transmission from/to multiple NIC ports. Poll mode drivers | 34 | | | (PMDs) for Intel 1 GbE/10 GbE NICs. | 35 | | | | 36 +---+------------------------+--------------------------------------------------------------------------------+ 37 | 2 | Packet parser | Identify the protocol stack of the input packet. Check the integrity of the | 38 | | | packet headers. | 39 | | | | 40 +---+------------------------+--------------------------------------------------------------------------------+ 41 | 3 | Flow classification | Map the input packet to one of the known traffic flows. Exact match table | 42 | | | lookup using configurable hash function (jhash, CRC and so on) and bucket | 43 | | | logic to handle collisions. | 44 | | | | 45 +---+------------------------+--------------------------------------------------------------------------------+ 46 | 4 | Policer | Packet metering using srTCM (RFC 2697) or trTCM (RFC2698) algorithms. | 47 | | | | 48 +---+------------------------+--------------------------------------------------------------------------------+ 49 | 5 | Load Balancer | Distribute the input packets to the application workers. Provide uniform load | 50 | | | to each worker. Preserve the affinity of traffic flows to workers and the | 51 | | | packet order within each flow. | 52 | | | | 53 +---+------------------------+--------------------------------------------------------------------------------+ 54 | 6 | Worker threads | Placeholders for the customer specific application workload (for example, IP | 55 | | | stack and so on). | 56 | | | | 57 +---+------------------------+--------------------------------------------------------------------------------+ 58 | 7 | Dropper | Congestion management using the Random Early Detection (RED) algorithm | 59 | | | (specified by the Sally Floyd - Van Jacobson paper) or Weighted RED (WRED) | 60 | | | or Proportional Integral Controller Enhanced (PIE). | 61 | | | Drop packets based on the current scheduler queue load level and packet | 62 | | | priority. When congestion is experienced, lower priority packets are dropped | 63 | | | first. 
| 64 | | | | 65 +---+------------------------+--------------------------------------------------------------------------------+ 66 | 8 | Hierarchical Scheduler | 5-level hierarchical scheduler (levels are: output port, subport, pipe, | 67 | | | traffic class and queue) with thousands (typically 64K) leaf nodes (queues). | 68 | | | Implements traffic shaping (for subport and pipe levels), strict priority | 69 | | | (for traffic class level) and Weighted Round Robin (WRR) (for queues within | 70 | | | each pipe traffic class). | 71 | | | | 72 +---+------------------------+--------------------------------------------------------------------------------+ 73 74The infrastructure blocks used throughout the packet processing pipeline are listed in the following table. 75 76.. _table_qos_2: 77 78.. table:: Infrastructure Blocks Used by the Packet Processing Pipeline 79 80 +---+-----------------------+-----------------------------------------------------------------------+ 81 | # | Block | Functional Description | 82 | | | | 83 +===+=======================+=======================================================================+ 84 | 1 | Buffer manager | Support for global buffer pools and private per-thread buffer caches. | 85 | | | | 86 +---+-----------------------+-----------------------------------------------------------------------+ 87 | 2 | Queue manager | Support for message passing between pipeline blocks. | 88 | | | | 89 +---+-----------------------+-----------------------------------------------------------------------+ 90 | 3 | Power saving | Support for power saving during low activity periods. | 91 | | | | 92 +---+-----------------------+-----------------------------------------------------------------------+ 93 94The mapping of pipeline blocks to CPU cores is configurable based on the performance level required by each specific application 95and the set of features enabled for each block. 96Some blocks might consume more than one CPU core (with each CPU core running a different instance of the same block on different input packets), 97while several other blocks could be mapped to the same CPU core. 98 99Hierarchical Scheduler 100---------------------- 101 102The hierarchical scheduler block, when present, usually sits on the TX side just before the transmission stage. 103Its purpose is to prioritize the transmission of packets from different users and different traffic classes 104according to the policy specified by the Service Level Agreements (SLAs) of each network node. 105 106Overview 107~~~~~~~~ 108 109The hierarchical scheduler block is similar to the traffic manager block used by network processors 110that typically implement per flow (or per group of flows) packet queuing and scheduling. 111It typically acts like a buffer that is able to temporarily store a large number of packets just before their transmission (enqueue operation); 112as the NIC TX is requesting more packets for transmission, 113these packets are later on removed and handed over to the NIC TX with the packet selection logic observing the predefined SLAs (dequeue operation). 114 115.. _figure_hier_sched_blk: 116 117.. figure:: ../img/hier_sched_blk.* 118 119 Hierarchical Scheduler Block Internal Diagram 120 121 122The hierarchical scheduler is optimized for a large number of packet queues. 123When only a small number of queues are needed, message passing queues should be used instead of this block. 124See `Worst Case Scenarios for Performance`_ for a more detailed discussion. 
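Before a packet can be enqueued, its destination within the scheduling hierarchy (subport, pipe, traffic class and queue) must already be stored in the packet mbuf, typically by the classification stage. The snippet below is a minimal sketch of such tagging; it assumes the ``rte_sched_port_pkt_write()`` helper declared in rte_sched.h (whose exact prototype has varied across DPDK releases) and uses hypothetical classification results.

.. code-block:: c

    #include <rte_mbuf.h>
    #include <rte_meter.h>
    #include <rte_sched.h>

    /*
     * Sketch only: tag a packet with its destination in the scheduling
     * hierarchy before calling rte_sched_port_enqueue(). The subport, pipe,
     * traffic class and queue values below are hypothetical results of the
     * flow classification stage.
     */
    static void
    app_tag_pkt_for_scheduler(struct rte_sched_port *port, struct rte_mbuf *pkt)
    {
        uint32_t subport = 0;        /* e.g. group of subscribers */
        uint32_t pipe = 1234;        /* e.g. individual subscriber */
        uint32_t traffic_class = 2;  /* e.g. video traffic */
        uint32_t queue = 0;          /* queue within the traffic class */

        rte_sched_port_pkt_write(port, pkt, subport, pipe,
                                 traffic_class, queue, RTE_COLOR_GREEN);
    }
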
125 126Scheduling Hierarchy 127~~~~~~~~~~~~~~~~~~~~ 128 129The scheduling hierarchy is shown in :numref:`figure_sched_hier_per_port`. 130The first level of the hierarchy is the Ethernet TX port 1/10/40 GbE, 131with subsequent hierarchy levels defined as subport, pipe, traffic class and queue. 132 133Typically, each subport represents a predefined group of users, while each pipe represents an individual user/subscriber. 134Each traffic class is the representation of a different traffic type with specific loss rate, 135delay and jitter requirements, such as voice, video or data transfers. 136Each queue hosts packets from one or multiple connections of the same type belonging to the same user. 137 138.. _figure_sched_hier_per_port: 139 140.. figure:: ../img/sched_hier_per_port.* 141 142 Scheduling Hierarchy per Port 143 144 145The functionality of each hierarchical level is detailed in the following table. 146 147.. _table_qos_3: 148 149.. table:: Port Scheduling Hierarchy 150 151 +---+--------------------+----------------------------+---------------------------------------------------------------+ 152 | # | Level | Siblings per Parent | Functional Description | 153 | | | | | 154 +===+====================+============================+===============================================================+ 155 | 1 | Port | - | #. Output Ethernet port 1/10/40 GbE. | 156 | | | | | 157 | | | | #. Multiple ports are scheduled in round robin order with | 158 | | | | all ports having equal priority. | 159 | | | | | 160 +---+--------------------+----------------------------+---------------------------------------------------------------+ 161 | 2 | Subport | Configurable (default: 8) | #. Traffic shaping using token bucket algorithm (one token | 162 | | | | bucket per subport). | 163 | | | | | 164 | | | | #. Upper limit enforced per Traffic Class (TC) at the | 165 | | | | subport level. | 166 | | | | | 167 | | | | #. Lower priority TCs able to reuse subport bandwidth | 168 | | | | currently unused by higher priority TCs. | 169 | | | | | 170 +---+--------------------+----------------------------+---------------------------------------------------------------+ 171 | 3 | Pipe | Configurable (default: 4K) | #. Traffic shaping using the token bucket algorithm (one | 172 | | | | token bucket per pipe. | 173 | | | | | 174 +---+--------------------+----------------------------+---------------------------------------------------------------+ 175 | 4 | Traffic Class (TC) | 13 | #. TCs of the same pipe handled in strict priority order. | 176 | | | | | 177 | | | | #. Upper limit enforced per TC at the pipe level. | 178 | | | | | 179 | | | | #. Lower priority TCs able to reuse pipe bandwidth currently | 180 | | | | unused by higher priority TCs. | 181 | | | | | 182 | | | | #. When subport TC is oversubscribed (configuration time | 183 | | | | event), pipe TC upper limit is capped to a dynamically | 184 | | | | adjusted value that is shared by all the subport pipes. | 185 | | | | | 186 +---+--------------------+----------------------------+---------------------------------------------------------------+ 187 | 5 | Queue | High priority TCs: 1, | #. All the high priority TCs (TC0, TC1, ...,TC11) have | 188 | | | Lowest priority TC: 4 | exactly 1 queue, while the lowest priority TC (TC12), | 189 | | | | called Best Effort (BE), has 4 queues. | 190 | | | | | 191 | | | | #. Queues of the lowest priority TC (BE) are serviced using | 192 | | | | Weighted Round Robin (WRR) according to predefined weights| 193 | | | | weights. 
| 194 | | | | | 195 +---+--------------------+----------------------------+---------------------------------------------------------------+ 196 197Application Programming Interface (API) 198~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 199 200Port Scheduler Configuration API 201^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 202 203The rte_sched.h file contains configuration functions for port, subport and pipe. 204 205Port Scheduler Enqueue API 206^^^^^^^^^^^^^^^^^^^^^^^^^^ 207 208The port scheduler enqueue API is very similar to the API of the DPDK PMD TX function. 209 210.. code-block:: c 211 212 int rte_sched_port_enqueue(struct rte_sched_port *port, struct rte_mbuf **pkts, uint32_t n_pkts); 213 214Port Scheduler Dequeue API 215^^^^^^^^^^^^^^^^^^^^^^^^^^ 216 217The port scheduler dequeue API is very similar to the API of the DPDK PMD RX function. 218 219.. code-block:: c 220 221 int rte_sched_port_dequeue(struct rte_sched_port *port, struct rte_mbuf **pkts, uint32_t n_pkts); 222 223Usage Example 224^^^^^^^^^^^^^ 225 226.. code-block:: c 227 228 /* File "application.c" */ 229 230 #define N_PKTS_RX 64 231 #define N_PKTS_TX 48 232 #define NIC_RX_PORT 0 233 #define NIC_RX_QUEUE 0 234 #define NIC_TX_PORT 1 235 #define NIC_TX_QUEUE 0 236 237 struct rte_sched_port *port = NULL; 238 struct rte_mbuf *pkts_rx[N_PKTS_RX], *pkts_tx[N_PKTS_TX]; 239 uint32_t n_pkts_rx, n_pkts_tx; 240 241 /* Initialization */ 242 243 <initialization code> 244 245 /* Runtime */ 246 while (1) { 247 /* Read packets from NIC RX queue */ 248 249 n_pkts_rx = rte_eth_rx_burst(NIC_RX_PORT, NIC_RX_QUEUE, pkts_rx, N_PKTS_RX); 250 251 /* Hierarchical scheduler enqueue */ 252 253 rte_sched_port_enqueue(port, pkts_rx, n_pkts_rx); 254 255 /* Hierarchical scheduler dequeue */ 256 257 n_pkts_tx = rte_sched_port_dequeue(port, pkts_tx, N_PKTS_TX); 258 259 /* Write packets to NIC TX queue */ 260 261 rte_eth_tx_burst(NIC_TX_PORT, NIC_TX_QUEUE, pkts_tx, n_pkts_tx); 262 } 263 264Implementation 265~~~~~~~~~~~~~~ 266 267Internal Data Structures per Port 268^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 269 270A schematic of the internal data structures in shown in with details in. 271 272.. _figure_data_struct_per_port: 273 274.. figure:: ../img/data_struct_per_port.* 275 276 Internal Data Structures per Port 277 278 279.. _table_qos_4: 280 281.. table:: Scheduler Internal Data Structures per Port 282 283 +---+----------------------+-------------------------+---------------------+------------------------------+---------------------------------------------------+ 284 | # | Data structure | Size (bytes) | # per port | Access type | Description | 285 | | | | | | | 286 | | | | +-------------+----------------+---------------------------------------------------+ 287 | | | | | Enq | Deq | | 288 | | | | | | | | 289 +===+======================+=========================+=====================+=============+================+===================================================+ 290 | 1 | Subport table entry | 64 | # subports per port | - | Rd, Wr | Persistent subport data (credits, etc). | 291 | | | | | | | | 292 +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+ 293 | 2 | Pipe table entry | 64 | # pipes per port | - | Rd, Wr | Persistent data for pipe, its TCs and its queues | 294 | | | | | | | (credits, etc) that is updated during run-time. | 295 | | | | | | | | 296 | | | | | | | The pipe configuration parameters do not change | 297 | | | | | | | during run-time. 
The same pipe configuration | 298 | | | | | | | parameters are shared by multiple pipes, | 299 | | | | | | | therefore they are not part of pipe table entry. | 300 | | | | | | | | 301 +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+ 302 | 3 | Queue table entry | 4 | #queues per port | Rd, Wr | Rd, Wr | Persistent queue data (read and write pointers). | 303 | | | | | | | The queue size is the same per TC for all queues, | 304 | | | | | | | allowing the queue base address to be computed | 305 | | | | | | | using a fast formula, so these two parameters are | 306 | | | | | | | not part of queue table entry. | 307 | | | | | | | | 308 | | | | | | | The queue table entries for any given pipe are | 309 | | | | | | | stored in the same cache line. | 310 | | | | | | | | 311 +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+ 312 | 4 | Queue storage area | Config (default: 64 x8) | # queues per port | Wr | Rd | Array of elements per queue; each element is 8 | 313 | | | | | | | byte in size (mbuf pointer). | 314 | | | | | | | | 315 +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+ 316 | 5 | Active queues bitmap | 1 bit per queue | 1 | Wr (Set) | Rd, Wr (Clear) | The bitmap maintains one status bit per queue: | 317 | | | | | | | queue not active (queue is empty) or queue active | 318 | | | | | | | (queue is not empty). | 319 | | | | | | | | 320 | | | | | | | Queue bit is set by the scheduler enqueue and | 321 | | | | | | | cleared by the scheduler dequeue when queue | 322 | | | | | | | becomes empty. | 323 | | | | | | | | 324 | | | | | | | Bitmap scan operation returns the next non-empty | 325 | | | | | | | pipe and its status (16-bit mask of active queue | 326 | | | | | | | in the pipe). | 327 | | | | | | | | 328 +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+ 329 | 6 | Grinder | ~128 | Config (default: 8) | - | Rd, Wr | Short list of active pipes currently under | 330 | | | | | | | processing. The grinder contains temporary data | 331 | | | | | | | during pipe processing. | 332 | | | | | | | | 333 | | | | | | | Once the current pipe exhausts packets or | 334 | | | | | | | credits, it is replaced with another active pipe | 335 | | | | | | | from the bitmap. | 336 | | | | | | | | 337 +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+ 338 339Multicore Scaling Strategy 340^^^^^^^^^^^^^^^^^^^^^^^^^^ 341 342The multicore scaling strategy is: 343 344#. Running different physical ports on different threads. The enqueue and dequeue of the same port are run by the same thread. 345 346#. Splitting the same physical port to different threads by running different sets of subports of the same physical port (virtual ports) on different threads. 347 Similarly, a subport can be split into multiple subports that are each run by a different thread. 348 The enqueue and dequeue of the same port are run by the same thread. 349 This is only required if, for performance reasons, it is not possible to handle a full port with a single core. 
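As a sketch of the first scaling option above (the enqueue and dequeue of a port owned by a single thread), the outline below runs one scheduling loop per output port on its own lcore. The ``sched_port_ctx`` structure and ``sched_loop()`` function are hypothetical application code; error handling and freeing of untransmitted mbufs are omitted.

.. code-block:: c

    #include <rte_ethdev.h>
    #include <rte_launch.h>
    #include <rte_sched.h>

    #define BURST_SIZE 64

    /* Hypothetical per-port context: one instance per output port. */
    struct sched_port_ctx {
        struct rte_sched_port *sched;
        uint16_t nic_rx_port;
        uint16_t nic_tx_port;
    };

    /* One lcore runs both enqueue and dequeue for its port. */
    static int
    sched_loop(void *arg)
    {
        struct sched_port_ctx *ctx = arg;
        struct rte_mbuf *pkts[BURST_SIZE];
        uint16_t n_rx, n_tx;

        for ( ; ; ) {
            n_rx = rte_eth_rx_burst(ctx->nic_rx_port, 0, pkts, BURST_SIZE);
            rte_sched_port_enqueue(ctx->sched, pkts, n_rx);
            n_tx = (uint16_t)rte_sched_port_dequeue(ctx->sched, pkts, BURST_SIZE);
            rte_eth_tx_burst(ctx->nic_tx_port, 0, pkts, n_tx);
        }
        return 0;
    }

    /* During initialization, launch one scheduling thread per port, e.g.:
     *     rte_eal_remote_launch(sched_loop, &ctx[i], lcore_id[i]);
     */
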
Enqueue and Dequeue for the Same Output Port
""""""""""""""""""""""""""""""""""""""""""""

Running the enqueue and dequeue operations for the same output port from different cores is likely to have a significant
impact on the scheduler's performance, so it is not recommended.

The port enqueue and dequeue operations share access to the following data structures:

#. Packet descriptors

#. Queue table

#. Queue storage area

#. Bitmap of active queues

The expected drop in performance is due to:

#. Need to make the queue and bitmap operations thread safe,
   which requires either using locking primitives for access serialization (for example, spinlocks/semaphores) or
   using atomic primitives for lockless access (for example, Test and Set, Compare And Swap, and so on).
   The impact is much higher in the former case.

#. Ping-pong of cache lines storing the shared data structures between the cache hierarchies of the two cores
   (done transparently by the MESI protocol cache coherency CPU hardware).

Therefore, the scheduler enqueue and dequeue operations have to be run from the same thread,
which allows the queues and the bitmap operations to be non-thread safe and
keeps the scheduler data structures internal to the same core.

Performance Scaling
"""""""""""""""""""

Scaling up the number of NIC ports simply requires a proportional increase in the number of CPU cores to be used for traffic scheduling.

Enqueue Pipeline
^^^^^^^^^^^^^^^^

The sequence of steps per packet:

#. *Access* the mbuf to read the data fields required to identify the destination queue for the packet.
   These fields are: port, subport, traffic class and queue within traffic class, and are typically set by the classification stage.

#. *Access* the queue structure to identify the write location in the queue array.
   If the queue is full, then the packet is discarded.

#. *Access* the queue array location to store the packet (i.e. write the mbuf pointer).

Note the strong data dependency between these steps: steps 2 and 3 cannot start before the results of steps 1 and 2 become available,
which prevents the processor out-of-order execution engine from providing any significant performance benefit.

Given the high rate of input packets and the large number of queues,
it is expected that the data structures accessed to enqueue the current packet are not present
in the L1 or L2 data cache of the current core, so the above 3 memory accesses would (on average) result in L1 and L2 data cache misses.
Three L1/L2 cache misses per packet is not acceptable for performance reasons.

The workaround is to prefetch the required data structures in advance. The prefetch operation has an execution latency during which
the processor should not attempt to access the data structure currently under prefetch, so the processor should execute other work.
The only other work available is to execute different stages of the enqueue sequence of operations on other input packets,
thus resulting in a pipelined implementation for the enqueue operation.

:numref:`figure_prefetch_pipeline` illustrates a pipelined implementation for the enqueue operation with 4 pipeline stages, each stage executing 2 different input packets.
No input packet can be part of more than one pipeline stage at a given time.

.. _figure_prefetch_pipeline:

.. figure:: ../img/prefetch_pipeline.*

    Prefetch Pipeline for the Hierarchical Scheduler Enqueue Operation


The congestion management scheme implemented by the enqueue pipeline described above is very basic:
packets are enqueued until a specific queue becomes full,
then all the packets destined to the same queue are dropped until packets are consumed (by the dequeue operation).
This can be improved by enabling RED/WRED or PIE as part of the enqueue pipeline, which looks at the queue occupancy and
packet priority in order to yield the enqueue/drop decision for a specific packet
(as opposed to enqueuing all packets / dropping all packets indiscriminately).

Dequeue State Machine
^^^^^^^^^^^^^^^^^^^^^

The sequence of steps to schedule the next packet from the current pipe is:

#. Identify the next active pipe using the bitmap scan operation and *prefetch* the pipe.

#. *Read* the pipe data structure. Update the credits for the current pipe and its subport.
   Identify the first active traffic class within the current pipe, select the next queue using WRR and
   *prefetch* the queue pointers for all the 16 queues of the current pipe.

#. *Read* the next element from the current WRR queue and *prefetch* its packet descriptor.

#. *Read* the packet length from the packet descriptor (mbuf structure).
   Based on the packet length and the available credits (of the current pipe, pipe traffic class, subport and subport traffic class),
   take the go/no go scheduling decision for the current packet.

To avoid cache misses, the above data structures (pipe, queue, queue array, mbufs) are prefetched in advance of being accessed.
The strategy for hiding the latency of the prefetch operations is to switch from the current pipe (in grinder A) to another pipe
(in grinder B) immediately after a prefetch is issued for the current pipe.
This gives the prefetch operation enough time to complete before the execution switches back to this pipe (in grinder A).

The dequeue pipe state machine exploits the presence of data in the processor cache,
so it tries to send as many packets from the same pipe traffic class and pipe as possible (up to the available packets and credits) before
moving to the next active traffic class of the same pipe (if any) or to another active pipe.

.. _figure_pipe_prefetch_sm:

.. figure:: ../img/pipe_prefetch_sm.*

    Pipe Prefetch State Machine for the Hierarchical Scheduler Dequeue
    Operation


Timing and Synchronization
^^^^^^^^^^^^^^^^^^^^^^^^^^

The output port is modeled as a conveyor belt of byte slots that need to be filled by the scheduler with data for transmission.
For 10 GbE, 1.25 billion byte slots need to be filled by the port scheduler every second.
If the scheduler is not fast enough to fill the slots, provided that enough packets and credits exist,
then some slots will be left unused and bandwidth will be wasted.

In principle, the hierarchical scheduler dequeue operation should be triggered by NIC TX.
Usually, once the occupancy of the NIC TX input queue drops below a predefined threshold,
the port scheduler is woken up (interrupt based or polling based,
by continuously monitoring the queue occupancy) to push more packets into the queue.
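As a quick sanity check of the conveyor belt model (a sketch, not part of the scheduler code), the arithmetic below reproduces the 1.25 billion byte slots per second figure and the resulting worst-case packet rate for minimum-size frames, using the standard Ethernet framing overhead listed later in this section.

.. code-block:: c

    #include <stdio.h>

    /*
     * Sketch: size of the 10 GbE "conveyor belt" the scheduler must keep full.
     * A minimum-size Ethernet frame occupies 84 byte slots on the wire:
     * 64 bytes of frame (including FCS) + 7 bytes preamble + 1 byte SFD +
     * 12 bytes inter-frame gap.
     */
    int main(void)
    {
        const double line_rate_bps = 10e9;                   /* 10 GbE */
        const double byte_slots_per_sec = line_rate_bps / 8; /* 1.25e9 */
        const double min_frame_slots = 84;
        const double max_pkts_per_sec = byte_slots_per_sec / min_frame_slots;

        printf("byte slots/s: %.4g\n", byte_slots_per_sec);  /* 1.25e9 */
        printf("64B packets/s: %.4g\n", max_pkts_per_sec);   /* ~14.88e6 */
        return 0;
    }
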

Internal Time Reference
"""""""""""""""""""""""

The scheduler needs to keep track of time advancement for the credit logic,
which requires credit updates based on time (for example, subport and pipe traffic shaping, traffic class upper limit enforcement, and so on).

Every time the scheduler decides to send a packet out to the NIC TX for transmission, the scheduler will increment its internal time reference accordingly.
Therefore, it is convenient to keep the internal time reference in units of bytes,
where a byte signifies the time duration required by the physical interface to send out a byte on the transmission medium.
This way, as a packet is scheduled for transmission, the time is incremented by (n + h),
where n is the packet length in bytes and h is the number of framing overhead bytes per packet.

Internal Time Reference Re-synchronization
""""""""""""""""""""""""""""""""""""""""""

The scheduler needs to align its internal time reference to the pace of the port conveyor belt.
The reason is to make sure that the scheduler does not feed the NIC TX with more bytes than the line rate of the physical medium, in order to prevent packet drop
(by the scheduler, due to the NIC TX input queue being full, or later on, internally by the NIC TX).

The scheduler reads the current time on every dequeue invocation.
The CPU time stamp can be obtained by reading either the Time Stamp Counter (TSC) register or the High Precision Event Timer (HPET) register.
The current CPU time stamp is converted from number of CPU clocks to number of bytes:
*time_bytes = time_cycles / cycles_per_byte*, where *cycles_per_byte*
is the number of CPU cycles equivalent to the transmission time of one byte on the wire
(for example, for a CPU frequency of 2 GHz and a 10 GbE port, *cycles_per_byte = 1.6*).

The scheduler maintains an internal time reference of the NIC time.
Whenever a packet is scheduled, the NIC time is incremented by the packet length (including framing overhead).
On every dequeue invocation, the scheduler checks its internal reference of the NIC time against the current time:

#. If NIC time is in the future (NIC time >= current time), no adjustment of NIC time is needed.
   This means that the scheduler is able to schedule packets before the NIC actually needs them, so the NIC TX is well supplied with packets;

#. If NIC time is in the past (NIC time < current time), then NIC time should be adjusted by setting it to the current time.
   This means that the scheduler is not able to keep up with the speed of the NIC byte conveyor belt,
   so NIC bandwidth is wasted due to poor packet supply to the NIC TX.

Scheduler Accuracy and Granularity
""""""""""""""""""""""""""""""""""

The scheduler round trip delay (SRTD) is the time (number of CPU cycles) between two consecutive examinations of the same pipe by the scheduler.

To keep up with the output port (that is, to avoid bandwidth loss),
the scheduler should be able to schedule n packets faster than the same n packets are transmitted by NIC TX.

The scheduler needs to keep up with the rate of each individual pipe,
as configured for the pipe token bucket, assuming that no port oversubscription is taking place.
523This means that the size of the pipe token bucket should be set high enough to prevent it from overflowing due to big SRTD, 524as this would result in credit loss (and therefore bandwidth loss) for the pipe. 525 526Credit Logic 527^^^^^^^^^^^^ 528 529Scheduling Decision 530""""""""""""""""""" 531 532The scheduling decision to send next packet from (subport S, pipe P, traffic class TC, queue Q) is favorable (packet is sent) 533when all the conditions below are met: 534 535* Pipe P of subport S is currently selected by one of the port grinders; 536 537* Traffic class TC is the highest priority active traffic class of pipe P; 538 539* Queue Q is the next queue selected by WRR within traffic class TC of pipe P; 540 541* Subport S has enough credits to send the packet; 542 543* Subport S has enough credits for traffic class TC to send the packet; 544 545* Pipe P has enough credits to send the packet; 546 547* Pipe P has enough credits for traffic class TC to send the packet. 548 549If all the above conditions are met, 550then the packet is selected for transmission and the necessary credits are subtracted from subport S, 551subport S traffic class TC, pipe P, pipe P traffic class TC. 552 553Framing Overhead 554"""""""""""""""" 555 556As the greatest common divisor for all packet lengths is one byte, the unit of credit is selected as one byte. 557The number of credits required for the transmission of a packet of n bytes is equal to (n+h), 558where h is equal to the number of framing overhead bytes per packet. 559 560.. _table_qos_5: 561 562.. table:: Ethernet Frame Overhead Fields 563 564 +---+--------------------------------+----------------+---------------------------------------------------------------------------+ 565 | # | Packet field | Length (bytes) | Comments | 566 | | | | | 567 +===+================================+================+===========================================================================+ 568 | 1 | Preamble | 7 | | 569 | | | | | 570 +---+--------------------------------+----------------+---------------------------------------------------------------------------+ 571 | 2 | Start of Frame Delimiter (SFD) | 1 | | 572 | | | | | 573 +---+--------------------------------+----------------+---------------------------------------------------------------------------+ 574 | 3 | Frame Check Sequence (FCS) | 4 | Considered overhead only if not included in the mbuf packet length field. | 575 | | | | | 576 +---+--------------------------------+----------------+---------------------------------------------------------------------------+ 577 | 4 | Inter Frame Gap (IFG) | 12 | | 578 | | | | | 579 +---+--------------------------------+----------------+---------------------------------------------------------------------------+ 580 | 5 | Total | 24 | | 581 | | | | | 582 +---+--------------------------------+----------------+---------------------------------------------------------------------------+ 583 584Traffic Shaping 585""""""""""""""" 586 587The traffic shaping for subport and pipe is implemented using a token bucket per subport/per pipe. 588Each token bucket is implemented using one saturated counter that keeps track of the number of available credits. 589 590The token bucket generic parameters and operations are presented in :numref:`table_qos_6` and :numref:`table_qos_7`. 591 592.. _table_qos_6: 593 594.. 
table:: Token Bucket Generic Parameters 595 596 +---+------------------------+--------------------+---------------------------------------------------------+ 597 | # | Token Bucket Parameter | Unit | Description | 598 | | | | | 599 +===+========================+====================+=========================================================+ 600 | 1 | bucket_rate | Credits per second | Rate of adding credits to the bucket. | 601 | | | | | 602 +---+------------------------+--------------------+---------------------------------------------------------+ 603 | 2 | bucket_size | Credits | Max number of credits that can be stored in the bucket. | 604 | | | | | 605 +---+------------------------+--------------------+---------------------------------------------------------+ 606 607.. _table_qos_7: 608 609.. table:: Token Bucket Generic Operations 610 611 +---+------------------------+------------------------------------------------------------------------------+ 612 | # | Token Bucket Operation | Description | 613 | | | | 614 +===+========================+==============================================================================+ 615 | 1 | Initialization | Bucket set to a predefined value, e.g. zero or half of the bucket size. | 616 | | | | 617 +---+------------------------+------------------------------------------------------------------------------+ 618 | 2 | Credit update | Credits are added to the bucket on top of existing ones, either periodically | 619 | | | or on demand, based on the bucket_rate. Credits cannot exceed the upper | 620 | | | limit defined by the bucket_size, so any credits to be added to the bucket | 621 | | | while the bucket is full are dropped. | 622 | | | | 623 +---+------------------------+------------------------------------------------------------------------------+ 624 | 3 | Credit consumption | As result of packet scheduling, the necessary number of credits is removed | 625 | | | from the bucket. The packet can only be sent if enough credits are in the | 626 | | | bucket to send the full packet (packet bytes and framing overhead for the | 627 | | | packet). | 628 | | | | 629 +---+------------------------+------------------------------------------------------------------------------+ 630 631To implement the token bucket generic operations described above, 632the current design uses the persistent data structure presented in :numref:`table_qos_8`, 633while the implementation of the token bucket operations is described in :numref:`table_qos_9`. 634 635.. _table_qos_8: 636 637.. table:: Token Bucket Persistent Data Structure 638 639 +---+------------------------+-------+----------------------------------------------------------------------+ 640 | # | Token bucket field | Unit | Description | 641 | | | | | 642 +===+========================+=======+======================================================================+ 643 | 1 | tb_time | Bytes | Time of the last credit update. Measured in bytes instead of seconds | 644 | | | | or CPU cycles for ease of credit consumption operation | 645 | | | | (as the current time is also maintained in bytes). | 646 | | | | | 647 | | | | See Section 26.2.4.5.1 "Internal Time Reference" for an | 648 | | | | explanation of why the time is maintained in byte units. 
| 649 | | | | | 650 +---+------------------------+-------+----------------------------------------------------------------------+ 651 | 2 | tb_period | Bytes | Time period that should elapse since the last credit update in order | 652 | | | | for the bucket to be awarded tb_credits_per_period worth or credits. | 653 | | | | | 654 +---+------------------------+-------+----------------------------------------------------------------------+ 655 | 3 | tb_credits_per_period | Bytes | Credit allowance per tb_period. | 656 | | | | | 657 +---+------------------------+-------+----------------------------------------------------------------------+ 658 | 4 | tb_size | Bytes | Bucket size, i.e. upper limit for the tb_credits. | 659 | | | | | 660 +---+------------------------+-------+----------------------------------------------------------------------+ 661 | 5 | tb_credits | Bytes | Number of credits currently in the bucket. | 662 | | | | | 663 +---+------------------------+-------+----------------------------------------------------------------------+ 664 665The bucket rate (in bytes per second) can be computed with the following formula: 666 667*bucket_rate = (tb_credits_per_period / tb_period) * r* 668 669where, r = port line rate (in bytes per second). 670 671.. _table_qos_9: 672 673.. table:: Token Bucket Operations 674 675 +---+-------------------------+-----------------------------------------------------------------------------+ 676 | # | Token bucket operation | Description | 677 | | | | 678 +===+=========================+=============================================================================+ 679 | 1 | Initialization | *tb_credits = 0; or tb_credits = tb_size / 2;* | 680 | | | | 681 +---+-------------------------+-----------------------------------------------------------------------------+ 682 | 2 | Credit update | Credit update options: | 683 | | | | 684 | | | * Every time a packet is sent for a port, update the credits of all the | 685 | | | the subports and pipes of that port. Not feasible. | 686 | | | | 687 | | | * Every time a packet is sent, update the credits for the pipe and | 688 | | | subport. Very accurate, but not needed (a lot of calculations). | 689 | | | | 690 | | | * Every time a pipe is selected (that is, picked by one | 691 | | | of the grinders), update the credits for the pipe and its subport. | 692 | | | | 693 | | | The current implementation is using option 3. According to Section | 694 | | | `Dequeue State Machine`_, the pipe and subport credits are | 695 | | | updated every time a pipe is selected by the dequeue process before the | 696 | | | pipe and subport credits are actually used. | 697 | | | | 698 | | | The implementation uses a tradeoff between accuracy and speed by updating | 699 | | | the bucket credits only when at least a full *tb_period* has elapsed since | 700 | | | the last update. | 701 | | | | 702 | | | * Full accuracy can be achieved by selecting the value for *tb_period* | 703 | | | for which *tb_credits_per_period = 1*. | 704 | | | | 705 | | | * When full accuracy is not required, better performance is achieved by | 706 | | | setting *tb_credits* to a larger value. 
| 707 | | | | 708 | | | Update operations: | 709 | | | | 710 | | | * n_periods = (time - tb_time) / tb_period; | 711 | | | | 712 | | | * tb_credits += n_periods * tb_credits_per_period; | 713 | | | | 714 | | | * tb_credits = min(tb_credits, tb_size); | 715 | | | | 716 | | | * tb_time += n_periods * tb_period; | 717 | | | | 718 +---+-------------------------+-----------------------------------------------------------------------------+ 719 | 3 | Credit consumption | As result of packet scheduling, the necessary number of credits is removed | 720 | | (on packet scheduling) | from the bucket. The packet can only be sent if enough credits are in the | 721 | | | bucket to send the full packet (packet bytes and framing overhead for the | 722 | | | packet). | 723 | | | | 724 | | | Scheduling operations: | 725 | | | | 726 | | | pkt_credits = pkt_len + frame_overhead; | 727 | | | if (tb_credits >= pkt_credits){tb_credits -= pkt_credits;} | 728 | | | | 729 +---+-------------------------+-----------------------------------------------------------------------------+ 730 731Traffic Classes 732""""""""""""""" 733 734Implementation of Strict Priority Scheduling 735'''''''''''''''''''''''''''''''''''''''''''' 736 737Strict priority scheduling of traffic classes within the same pipe is implemented by the pipe dequeue state machine, 738which selects the queues in ascending order. 739Therefore, queue 0 (associated with TC 0, highest priority TC) is handled before 740queue 1 (TC 1, lower priority than TC 0), 741which is handled before queue 2 (TC 2, lower priority than TC 1) and it continues until queues of all TCs except the 742lowest priority TC are handled. At last, queues 12..15 (best effort TC, lowest priority TC) are handled. 743 744Upper Limit Enforcement 745''''''''''''''''''''''' 746 747The traffic classes at the pipe and subport levels are not traffic shaped, 748so there is no token bucket maintained in this context. 749The upper limit for the traffic classes at the subport and 750pipe levels is enforced by periodically refilling the subport / pipe traffic class credit counter, 751out of which credits are consumed every time a packet is scheduled for that subport / pipe, 752as described in :numref:`table_qos_10` and :numref:`table_qos_11`. 753 754.. _table_qos_10: 755 756.. table:: Subport/Pipe Traffic Class Upper Limit Enforcement Persistent Data Structure 757 758 +---+-----------------------+-------+-----------------------------------------------------------------------+ 759 | # | Subport or pipe field | Unit | Description | 760 | | | | | 761 +===+=======================+=======+=======================================================================+ 762 | 1 | tc_time | Bytes | Time of the next update (upper limit refill) for the TCs of the | 763 | | | | current subport / pipe. | 764 | | | | | 765 | | | | See Section `Internal Time Reference`_ for the | 766 | | | | explanation of why the time is maintained in byte units. | 767 | | | | | 768 +---+-----------------------+-------+-----------------------------------------------------------------------+ 769 | 2 | tc_period | Bytes | Time between two consecutive updates for the all TCs of the current | 770 | | | | subport / pipe. This is expected to be many times bigger than the | 771 | | | | typical value of the token bucket tb_period. 
| 772 | | | | | 773 +---+-----------------------+-------+-----------------------------------------------------------------------+ 774 | 3 | tc_credits_per_period | Bytes | Upper limit for the number of credits allowed to be consumed by the | 775 | | | | current TC during each enforcement period tc_period. | 776 | | | | | 777 +---+-----------------------+-------+-----------------------------------------------------------------------+ 778 | 4 | tc_credits | Bytes | Current upper limit for the number of credits that can be consumed by | 779 | | | | the current traffic class for the remainder of the current | 780 | | | | enforcement period. | 781 | | | | | 782 +---+-----------------------+-------+-----------------------------------------------------------------------+ 783 784.. _table_qos_11: 785 786.. table:: Subport/Pipe Traffic Class Upper Limit Enforcement Operations 787 788 +---+--------------------------+----------------------------------------------------------------------------+ 789 | # | Traffic Class Operation | Description | 790 | | | | 791 +===+==========================+============================================================================+ 792 | 1 | Initialization | tc_credits = tc_credits_per_period; | 793 | | | | 794 | | | tc_time = tc_period; | 795 | | | | 796 +---+--------------------------+----------------------------------------------------------------------------+ 797 | 2 | Credit update | Update operations: | 798 | | | | 799 | | | if (time >= tc_time) { | 800 | | | | 801 | | | tc_credits = tc_credits_per_period; | 802 | | | | 803 | | | tc_time = time + tc_period; | 804 | | | | 805 | | | } | 806 | | | | 807 +---+--------------------------+----------------------------------------------------------------------------+ 808 | 3 | Credit consumption | As result of packet scheduling, the TC limit is decreased with the | 809 | | (on packet scheduling) | necessary number of credits. The packet can only be sent if enough credits | 810 | | | are currently available in the TC limit to send the full packet | 811 | | | (packet bytes and framing overhead for the packet). | 812 | | | | 813 | | | Scheduling operations: | 814 | | | | 815 | | | pkt_credits = pk_len + frame_overhead; | 816 | | | | 817 | | | if (tc_credits >= pkt_credits) {tc_credits -= pkt_credits;} | 818 | | | | 819 +---+--------------------------+----------------------------------------------------------------------------+ 820 821Weighted Round Robin (WRR) 822"""""""""""""""""""""""""" 823 824The evolution of the WRR design solution for the lowest priority traffic class (best effort TC) from simple to complex is shown in :numref:`table_qos_12`. 825 826.. _table_qos_12: 827 828.. table:: Weighted Round Robin (WRR) 829 830 +---+------------+-----------------+-------------+----------------------------------------------------------+ 831 | # | All Queues | Equal Weights | All Packets | Strategy | 832 | | Active? | for All Queues? | Equal? | | 833 +===+============+=================+=============+==========================================================+ 834 | 1 | Yes | Yes | Yes | **Byte level round robin** | 835 | | | | | | 836 | | | | | *Next queue* queue #i, i = *(i + 1) % n* | 837 | | | | | | 838 +---+------------+-----------------+-------------+----------------------------------------------------------+ 839 | 2 | Yes | Yes | No | **Packet level round robin** | 840 | | | | | | 841 | | | | | Consuming one byte from queue #i requires consuming | 842 | | | | | exactly one token for queue #i. 
| 843 | | | | | | 844 | | | | | T(i) = Accumulated number of tokens previously consumed | 845 | | | | | from queue #i. Every time a packet is consumed from | 846 | | | | | queue #i, T(i) is updated as: T(i) += *pkt_len*. | 847 | | | | | | 848 | | | | | *Next queue* : queue with the smallest T. | 849 | | | | | | 850 | | | | | | 851 +---+------------+-----------------+-------------+----------------------------------------------------------+ 852 | 3 | Yes | No | No | **Packet level weighted round robin** | 853 | | | | | | 854 | | | | | This case can be reduced to the previous case by | 855 | | | | | introducing a cost per byte that is different for each | 856 | | | | | queue. Queues with lower weights have a higher cost per | 857 | | | | | byte. This way, it is still meaningful to compare the | 858 | | | | | consumption amongst different queues in order to select | 859 | | | | | the next queue. | 860 | | | | | | 861 | | | | | w(i) = Weight of queue #i | 862 | | | | | | 863 | | | | | t(i) = Tokens per byte for queue #i, defined as the | 864 | | | | | inverse weight of queue #i. | 865 | | | | | For example, if w[0..3] = [1:2:4:8], | 866 | | | | | then t[0..3] = [8:4:2:1]; if w[0..3] = [1:4:15:20], | 867 | | | | | then t[0..3] = [60:15:4:3]. | 868 | | | | | Consuming one byte from queue #i requires consuming t(i) | 869 | | | | | tokens for queue #i. | 870 | | | | | | 871 | | | | | T(i) = Accumulated number of tokens previously consumed | 872 | | | | | from queue #i. Every time a packet is consumed from | 873 | | | | | queue #i, T(i) is updated as: *T(i) += pkt_len * t(i)*. | 874 | | | | | *Next queue* : queue with the smallest T. | 875 | | | | | | 876 +---+------------+-----------------+-------------+----------------------------------------------------------+ 877 | 4 | No | No | No | **Packet level weighted round robin with variable queue | 878 | | | | | status** | 879 | | | | | | 880 | | | | | Reduce this case to the previous case by setting the | 881 | | | | | consumption of inactive queues to a high number, so that | 882 | | | | | the inactive queues will never be selected by the | 883 | | | | | smallest T logic. | 884 | | | | | | 885 | | | | | To prevent T from overflowing as result of successive | 886 | | | | | accumulations, T(i) is truncated after each packet | 887 | | | | | consumption for all queues. | 888 | | | | | For example, T[0..3] = [1000, 1100, 1200, 1300] | 889 | | | | | is truncated to T[0..3] = [0, 100, 200, 300] | 890 | | | | | by subtracting the min T from T(i), i = 0..n. | 891 | | | | | | 892 | | | | | This requires having at least one active queue in the | 893 | | | | | set of input queues, which is guaranteed by the dequeue | 894 | | | | | state machine never selecting an inactive traffic class. | 895 | | | | | | 896 | | | | | *mask(i) = Saturation mask for queue #i, defined as:* | 897 | | | | | | 898 | | | | | mask(i) = (queue #i is active)? 0 : 0xFFFFFFFF; | 899 | | | | | | 900 | | | | | w(i) = Weight of queue #i | 901 | | | | | | 902 | | | | | t(i) = Tokens per byte for queue #i, defined as the | 903 | | | | | inverse weight of queue #i. | 904 | | | | | | 905 | | | | | T(i) = Accumulated numbers of tokens previously consumed | 906 | | | | | from queue #i. | 907 | | | | | | 908 | | | | | *Next queue* : queue with smallest T. 
| 909 | | | | | | 910 | | | | | Before packet consumption from queue #i: | 911 | | | | | | 912 | | | | | *T(i) |= mask(i)* | 913 | | | | | | 914 | | | | | After packet consumption from queue #i: | 915 | | | | | | 916 | | | | | T(j) -= T(i), j != i | 917 | | | | | | 918 | | | | | T(i) = pkt_len * t(i) | 919 | | | | | | 920 | | | | | Note: T(j) uses the T(i) value before T(i) is updated. | 921 | | | | | | 922 +---+------------+-----------------+-------------+----------------------------------------------------------+ 923 924Subport Traffic Class Oversubscription 925"""""""""""""""""""""""""""""""""""""" 926 927Problem Statement 928''''''''''''''''' 929 930Oversubscription for subport traffic class X is a configuration-time event that occurs when 931more bandwidth is allocated for traffic class X at the level of subport member pipes than 932allocated for the same traffic class at the parent subport level. 933 934The existence of the oversubscription for a specific subport and 935traffic class is solely the result of pipe and 936subport-level configuration as opposed to being created due 937to dynamic evolution of the traffic load at run-time (as congestion is). 938 939When the overall demand for traffic class X for the current subport is low, 940the existence of the oversubscription condition does not represent a problem, 941as demand for traffic class X is completely satisfied for all member pipes. 942However, this can no longer be achieved when the aggregated demand for traffic class X 943for all subport member pipes exceeds the limit configured at the subport level. 944 945Solution Space 946'''''''''''''' 947 948summarizes some of the possible approaches for handling this problem, 949with the third approach selected for implementation. 950 951.. _table_qos_13: 952 953.. table:: Subport Traffic Class Oversubscription 954 955 +-----+---------------------------+-------------------------------------------------------------------------+ 956 | No. | Approach | Description | 957 | | | | 958 +=====+===========================+=========================================================================+ 959 | 1 | Don't care | First come, first served. | 960 | | | | 961 | | | This approach is not fair amongst subport member pipes, as pipes that | 962 | | | are served first will use up as much bandwidth for TC X as they need, | 963 | | | while pipes that are served later will receive poor service due to | 964 | | | bandwidth for TC X at the subport level being scarce. | 965 | | | | 966 +-----+---------------------------+-------------------------------------------------------------------------+ 967 | 2 | Scale down all pipes | All pipes within the subport have their bandwidth limit for TC X scaled | 968 | | | down by the same factor. | 969 | | | | 970 | | | This approach is not fair among subport member pipes, as the low end | 971 | | | pipes (that is, pipes configured with low bandwidth) can potentially | 972 | | | experience severe service degradation that might render their service | 973 | | | unusable (if available bandwidth for these pipes drops below the | 974 | | | minimum requirements for a workable service), while the service | 975 | | | degradation for high end pipes might not be noticeable at all. | 976 | | | | 977 +-----+---------------------------+-------------------------------------------------------------------------+ 978 | 3 | Cap the high demand pipes | Each subport member pipe receives an equal share of the bandwidth | 979 | | | available at run-time for TC X at the subport level. 
Any bandwidth left | 980 | | | unused by the low-demand pipes is redistributed in equal portions to | 981 | | | the high-demand pipes. This way, the high-demand pipes are truncated | 982 | | | while the low-demand pipes are not impacted. | 983 | | | | 984 +-----+---------------------------+-------------------------------------------------------------------------+ 985 986Typically, the subport TC oversubscription feature is enabled only for the lowest priority traffic class, 987which is typically used for best effort traffic, 988with the management plane preventing this condition from occurring for the other (higher priority) traffic classes. 989 990To ease implementation, it is also assumed that the upper limit for subport best effort TC is set to 100% of the subport rate, 991and that the upper limit for pipe best effort TC is set to 100% of pipe rate for all subport member pipes. 992 993Implementation Overview 994''''''''''''''''''''''' 995 996The algorithm computes a watermark, which is periodically updated based on the current demand experienced by the subport member pipes, 997whose purpose is to limit the amount of traffic that each pipe is allowed to send for best effort TC. 998The watermark is computed at the subport level at the beginning of each traffic class upper limit enforcement period and 999the same value is used by all the subport member pipes throughout the current enforcement period. 1000illustrates how the watermark computed as subport level at the beginning of each period is propagated to all subport member pipes. 1001 1002At the beginning of the current enforcement period (which coincides with the end of the previous enforcement period), 1003the value of the watermark is adjusted based on the amount of bandwidth allocated to best effort TC at the beginning of the previous period that 1004was not left unused by the subport member pipes at the end of the previous period. 1005 1006If there was subport best effort TC bandwidth left unused, 1007the value of the watermark for the current period is increased to encourage the subport member pipes to consume more bandwidth. 1008Otherwise, the value of the watermark is decreased to enforce equality of bandwidth consumption among subport member pipes for best effort TC. 1009 1010The increase or decrease in the watermark value is done in small increments, 1011so several enforcement periods might be required to reach the equilibrium state. 1012This state can change at any moment due to variations in the demand experienced by the subport member pipes for best effort TC, for example, 1013as a result of demand increase (when the watermark needs to be lowered) or demand decrease (when the watermark needs to be increased). 1014 1015When demand is low, the watermark is set high to prevent it from impeding the subport member pipes from consuming more bandwidth. 1016The highest value for the watermark is picked as the highest rate configured for a subport member pipe. 1017:numref:`table_qos_14` and :numref:`table_qos_15` illustrates the watermark operation. 1018 1019.. _table_qos_14: 1020 1021.. table:: Watermark Propagation from Subport Level to Member Pipes at the Beginning of Each Traffic Class Upper Limit Enforcement Period 1022 1023 +-----+---------------------------------+----------------------------------------------------+ 1024 | No. 
| Subport Traffic Class Operation | Description | 1025 | | | | 1026 +=====+=================================+====================================================+ 1027 | 1 | Initialization | **Subport level**: subport_period_id= 0 | 1028 | | | | 1029 | | | **Pipe level**: pipe_period_id = 0 | 1030 | | | | 1031 +-----+---------------------------------+----------------------------------------------------+ 1032 | 2 | Credit update | **Subport Level**: | 1033 | | | | 1034 | | | if (time>=subport_tc_time) | 1035 | | | | 1036 | | | { | 1037 | | | subport_wm = water_mark_update(); | 1038 | | | | 1039 | | | subport_tc_time = time + subport_tc_period; | 1040 | | | | 1041 | | | subport_period_id++; | 1042 | | | | 1043 | | | } | 1044 | | | | 1045 | | | **Pipelevel:** | 1046 | | | | 1047 | | | if(pipe_period_id != subport_period_id) | 1048 | | | | 1049 | | | { | 1050 | | | | 1051 | | | pipe_ov_credits = subport_wm \* pipe_weight; | 1052 | | | | 1053 | | | pipe_period_id = subport_period_id; | 1054 | | | | 1055 | | | } | 1056 | | | | 1057 +-----+---------------------------------+----------------------------------------------------+ 1058 | 3 | Credit consumption | **Pipe level:** | 1059 | | (on packet scheduling) | | 1060 | | | pkt_credits = pk_len + frame_overhead; | 1061 | | | | 1062 | | | if(pipe_ov_credits >= pkt_credits{ | 1063 | | | | 1064 | | | pipe_ov_credits -= pkt_credits; | 1065 | | | | 1066 | | | } | 1067 | | | | 1068 +-----+---------------------------------+----------------------------------------------------+ 1069 1070.. _table_qos_15: 1071 1072.. table:: Watermark Calculation 1073 1074 +-----+------------------+----------------------------------------------------------------------------------+ 1075 | No. | Subport Traffic | Description | 1076 | | Class Operation | | 1077 +=====+==================+==================================================================================+ 1078 | 1 | Initialization | **Subport level:** | 1079 | | | | 1080 | | | wm = WM_MAX | 1081 | | | | 1082 +-----+------------------+----------------------------------------------------------------------------------+ 1083 | 2 | Credit update | **Subport level (water_mark_update):** | 1084 | | | | 1085 | | | tc0_cons = subport_tc0_credits_per_period - subport_tc0_credits; | 1086 | | | | 1087 | | | tc1_cons = subport_tc1_credits_per_period - subport_tc1_credits; | 1088 | | | | 1089 | | | tc2_cons = subport_tc2_credits_per_period - subport_tc2_credits; | 1090 | | | | 1091 | | | tc3_cons = subport_tc3_credits_per_period - subport_tc3_credits; | 1092 | | | | 1093 | | | tc4_cons = subport_tc4_credits_per_period - subport_tc4_credits; | 1094 | | | | 1095 | | | tc5_cons = subport_tc5_credits_per_period - subport_tc5_credits; | 1096 | | | | 1097 | | | tc6_cons = subport_tc6_credits_per_period - subport_tc6_credits; | 1098 | | | | 1099 | | | tc7_cons = subport_tc7_credits_per_period - subport_tc7_credits; | 1100 | | | | 1101 | | | tc8_cons = subport_tc8_credits_per_period - subport_tc8_credits; | 1102 | | | | 1103 | | | tc9_cons = subport_tc9_credits_per_period - subport_tc9_credits; | 1104 | | | | 1105 | | | tc10_cons = subport_tc10_credits_per_period - subport_tc10_credits; | 1106 | | | | 1107 | | | tc11_cons = subport_tc11_credits_per_period - subport_tc11_credits; | 1108 | | | | 1109 | | | tc_be_cons_max = subport_tc_be_credits_per_period - (tc0_cons + tc1_cons + | 1110 | | | tc2_cons + tc3_cons + tc4_cons + tc5_cons + tc6_cons + tc7_cons + tc8_cons + | 1111 | | | tc9_cons + tc10_cons + tc11_cons); | 1112 | | | | 1113 | | 
.. _table_qos_15:

.. table:: Watermark Calculation

    +-----+------------------+----------------------------------------------------------------------------------+
    | No. | Subport Traffic  | Description                                                                      |
    |     | Class Operation  |                                                                                  |
    +=====+==================+==================================================================================+
    | 1   | Initialization   | **Subport level:**                                                               |
    |     |                  |                                                                                  |
    |     |                  | wm = WM_MAX                                                                      |
    |     |                  |                                                                                  |
    +-----+------------------+----------------------------------------------------------------------------------+
    | 2   | Credit update    | **Subport level (water_mark_update):**                                           |
    |     |                  |                                                                                  |
    |     |                  | tc0_cons = subport_tc0_credits_per_period - subport_tc0_credits;                 |
    |     |                  |                                                                                  |
    |     |                  | tc1_cons = subport_tc1_credits_per_period - subport_tc1_credits;                 |
    |     |                  |                                                                                  |
    |     |                  | tc2_cons = subport_tc2_credits_per_period - subport_tc2_credits;                 |
    |     |                  |                                                                                  |
    |     |                  | tc3_cons = subport_tc3_credits_per_period - subport_tc3_credits;                 |
    |     |                  |                                                                                  |
    |     |                  | tc4_cons = subport_tc4_credits_per_period - subport_tc4_credits;                 |
    |     |                  |                                                                                  |
    |     |                  | tc5_cons = subport_tc5_credits_per_period - subport_tc5_credits;                 |
    |     |                  |                                                                                  |
    |     |                  | tc6_cons = subport_tc6_credits_per_period - subport_tc6_credits;                 |
    |     |                  |                                                                                  |
    |     |                  | tc7_cons = subport_tc7_credits_per_period - subport_tc7_credits;                 |
    |     |                  |                                                                                  |
    |     |                  | tc8_cons = subport_tc8_credits_per_period - subport_tc8_credits;                 |
    |     |                  |                                                                                  |
    |     |                  | tc9_cons = subport_tc9_credits_per_period - subport_tc9_credits;                 |
    |     |                  |                                                                                  |
    |     |                  | tc10_cons = subport_tc10_credits_per_period - subport_tc10_credits;              |
    |     |                  |                                                                                  |
    |     |                  | tc11_cons = subport_tc11_credits_per_period - subport_tc11_credits;              |
    |     |                  |                                                                                  |
    |     |                  | tc_be_cons = subport_tc_be_credits_per_period - subport_tc_be_credits;           |
    |     |                  |                                                                                  |
    |     |                  | tc_be_cons_max = subport_tc_be_credits_per_period - (tc0_cons + tc1_cons +       |
    |     |                  | tc2_cons + tc3_cons + tc4_cons + tc5_cons + tc6_cons + tc7_cons + tc8_cons +     |
    |     |                  | tc9_cons + tc10_cons + tc11_cons);                                               |
    |     |                  |                                                                                  |
    |     |                  | if(tc_be_cons > (tc_be_cons_max - MTU)) {                                        |
    |     |                  |                                                                                  |
    |     |                  | wm -= wm >> 7;                                                                   |
    |     |                  |                                                                                  |
    |     |                  | if(wm < WM_MIN) wm = WM_MIN;                                                     |
    |     |                  |                                                                                  |
    |     |                  | } else {                                                                         |
    |     |                  |                                                                                  |
    |     |                  | wm += (wm >> 7) + 1;                                                             |
    |     |                  |                                                                                  |
    |     |                  | if(wm > WM_MAX) wm = WM_MAX;                                                     |
    |     |                  |                                                                                  |
    |     |                  | }                                                                                |
    |     |                  |                                                                                  |
    +-----+------------------+----------------------------------------------------------------------------------+

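The watermark computation itself can be sketched in C as follows. This is only a restatement of
the table above with hypothetical names (the per-traffic-class credit arrays, the ``wm_min`` /
``wm_max`` fields, the ``TC_BE`` index); the production code in lib/sched/rte_sched.c is organized
differently.

.. code-block:: c

    #include <stdint.h>

    #define TC_BE 12    /* index of the best-effort traffic class (TC 0 .. 11 + best-effort) */

    /* Hypothetical per-subport state used by this sketch only. */
    struct subport_wm_state {
        uint32_t tc_credits_per_period[TC_BE + 1];
        uint32_t tc_credits[TC_BE + 1];   /* credits still available in the current period */
        uint32_t wm;                      /* current watermark */
        uint32_t wm_min;                  /* WM_MIN in the table */
        uint32_t wm_max;                  /* WM_MAX in the table */
        uint32_t mtu;                     /* port MTU, in bytes */
    };

    static uint32_t
    water_mark_update(struct subport_wm_state *s)
    {
        uint32_t tc_be_cons, tc_be_cons_max, sum = 0;
        uint32_t wm = s->wm;
        uint32_t i;

        /* Credits consumed by each higher priority traffic class in the period
         * that just ended. */
        for (i = 0; i < TC_BE; i++)
            sum += s->tc_credits_per_period[i] - s->tc_credits[i];

        /* Best-effort consumption and the maximum it could have reached. */
        tc_be_cons = s->tc_credits_per_period[TC_BE] - s->tc_credits[TC_BE];
        tc_be_cons_max = s->tc_credits_per_period[TC_BE] - sum;

        if (tc_be_cons > tc_be_cons_max - s->mtu) {
            /* Best-effort demand is close to the limit: lower the watermark. */
            wm -= wm >> 7;
            if (wm < s->wm_min)
                wm = s->wm_min;
        } else {
            /* Spare best-effort bandwidth: raise the watermark. */
            wm += (wm >> 7) + 1;
            if (wm > s->wm_max)
                wm = s->wm_max;
        }

        s->wm = wm;
        return wm;
    }
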
Worst Case Scenarios for Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Lots of Active Queues with Not Enough Credits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The more queues the scheduler has to examine for packets and credits in order to select one packet,
the lower the performance of the scheduler is.

The scheduler maintains the bitmap of active queues, which skips the non-active queues,
but in order to detect whether a specific pipe has enough credits,
the pipe has to be drilled down using the pipe dequeue state machine,
which consumes cycles regardless of the scheduling result
(whether zero or at least one packet is produced).

This scenario stresses the importance of the policer for the scheduler performance:
if the pipe does not have enough credits,
its packets should be dropped as soon as possible (before they reach the hierarchical scheduler),
thus rendering the pipe queues as not active,
which allows the dequeue side to skip that pipe with no cycles being spent on investigating the pipe credits
that would result in a "not enough credits" status.

Single Queue with 100% Line Rate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The port scheduler performance is optimized for a large number of queues.
If the number of queues is small,
then the performance of the port scheduler for the same level of active traffic is expected to be worse than
the performance of a small set of message passing queues.

.. _Droppers:

Droppers
--------

The purpose of the DPDK dropper is to drop packets arriving at a packet scheduler to avoid congestion.
The dropper supports the Proportional Integral Controller Enhanced (PIE), Random Early Detection (RED),
Weighted Random Early Detection (WRED) and tail drop algorithms.
:numref:`figure_blk_diag_dropper` illustrates how the dropper integrates with the scheduler.
The DPDK currently does not support congestion management,
so the dropper provides the only method for congestion avoidance.

.. _figure_blk_diag_dropper:

.. figure:: ../img/blk_diag_dropper.*

    High-level Block Diagram of the DPDK Dropper


The dropper uses one of two congestion avoidance algorithms:

 - the Random Early Detection (RED) algorithm, as documented in the reference publication.
 - the Proportional Integral Controller Enhanced (PIE) algorithm, as documented in RFC 8033.

The purpose of the RED/PIE algorithm is to monitor a packet queue,
determine the current congestion level in the queue and decide whether an arriving packet should be enqueued or dropped.

The RED algorithm uses an Exponential Weighted Moving Average (EWMA) filter to compute average queue size which
gives an indication of the current congestion level in the queue.

For each enqueue operation, the RED algorithm compares the average queue size to minimum and maximum thresholds.
Depending on whether the average queue size is below, above or in between these thresholds,
the RED algorithm calculates the probability that an arriving packet should be dropped and
makes a random decision based on this probability.

The dropper also supports Weighted Random Early Detection (WRED) by allowing the scheduler to select
different RED configurations for the same packet queue at run-time.
In the case of severe congestion, the dropper resorts to tail drop.
This occurs when a packet queue has reached maximum capacity and cannot store any more packets.
In this situation, all arriving packets are dropped.

The flow through the dropper is illustrated in :numref:`figure_flow_tru_dropper`.
The RED/WRED/PIE algorithm is exercised first and tail drop second.

.. _figure_flow_tru_dropper:

.. figure:: ../img/flow_tru_dropper.*

    Flow Through the Dropper

The PIE algorithm periodically updates the drop probability based on the latency samples.
It considers not only the current latency sample but also whether the latency is trending up or down.
This is the classical Proportional Integral (PI) controller method, which is known for
eliminating steady state errors.

When a congestion period ends, we might be left with a high drop probability with light
packet arrivals. Hence, the PIE algorithm includes a mechanism by which the drop probability
decays exponentially (rather than linearly) when the system is not congested.
This helps the drop probability converge to 0 more quickly, while the PI controller ensures
that it eventually reaches zero.

The use cases supported by the dropper are:

* Initialize configuration data

* Initialize run-time data

* Enqueue (make a decision to enqueue or drop an arriving packet)

* Mark empty (record the time at which a packet queue becomes empty)

The configuration use case is explained in :ref:`Section 2.23.3.1 <Configuration>`,
the enqueue operation is explained in :ref:`Section 2.23.3.2 <Enqueue_Operation>`
and the mark empty operation is explained in :ref:`Section 2.23.3.3 <Queue_Empty_Operation>`.

.. _Configuration:

Configuration
~~~~~~~~~~~~~

A RED configuration contains the parameters given in :numref:`table_qos_16`.

.. _table_qos_16:

.. table:: RED Configuration Parameters

    +--------------------------+---------+---------+------------------+
    | Parameter                | Minimum | Maximum | Typical          |
    |                          |         |         |                  |
    +==========================+=========+=========+==================+
    | Minimum Threshold        | 0       | 1022    | 1/4 x queue size |
    |                          |         |         |                  |
    +--------------------------+---------+---------+------------------+
    | Maximum Threshold        | 1       | 1023    | 1/2 x queue size |
    |                          |         |         |                  |
    +--------------------------+---------+---------+------------------+
    | Inverse Mark Probability | 1       | 255     | 10               |
    |                          |         |         |                  |
    +--------------------------+---------+---------+------------------+
    | EWMA Filter Weight       | 1       | 12      | 9                |
    |                          |         |         |                  |
    +--------------------------+---------+---------+------------------+

The meaning of these parameters is explained in more detail in the following sections.
The format of these parameters as specified to the dropper module API
corresponds to the format used by Cisco* in their RED implementation.
The minimum and maximum threshold parameters are specified to the dropper module in terms of number of packets.
The mark probability parameter is specified as an inverse value, for example,
an inverse mark probability parameter value of 10 corresponds
to a mark probability of 1/10 (that is, 1 in 10 packets will be dropped).
The EWMA filter weight parameter is specified as an inverse log value,
for example, a filter weight parameter value of 9 corresponds to a filter weight of 1/2^9.

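As an illustration, the typical values above can be turned into a RED configuration with the
``rte_red_config_init()`` helper declared in rte_red.h. The argument order shown here (filter
weight first, then the thresholds, then the inverse mark probability) is an assumption that should
be checked against the installed DPDK version; error handling is trimmed.

.. code-block:: c

    #include <rte_red.h>

    /* One RED configuration, shared by all queues that use these parameters. */
    static struct rte_red_config red_cfg;

    static int
    red_setup(void)
    {
        /* Typical values from the table above: wq = 1/2^9, min_th = 1/4 and
         * max_th = 1/2 of a 64-packet queue, inverse mark probability = 10. */
        const uint16_t wq_log2 = 9;
        const uint16_t min_th = 16;
        const uint16_t max_th = 32;
        const uint16_t maxp_inv = 10;

        return rte_red_config_init(&red_cfg, wq_log2, min_th, max_th, maxp_inv);
    }
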
A PIE configuration contains the parameters given in :numref:`table_qos_16a`.

.. _table_qos_16a:

.. table:: PIE Configuration Parameters

    +--------------------------+---------+---------+------------------+
    | Parameter                | Minimum | Maximum | Default          |
    |                          |         |         |                  |
    +==========================+=========+=========+==================+
    | Queue delay reference    | 1       | uint16  | 15               |
    | Latency Target Value     |         |         |                  |
    | Unit: ms                 |         |         |                  |
    +--------------------------+---------+---------+------------------+
    | Max Burst Allowance      | 1       | uint16  | 150              |
    | Unit: ms                 |         |         |                  |
    +--------------------------+---------+---------+------------------+
    | Tail Drop Threshold      | 1       | uint16  | 64               |
    | Unit: bytes              |         |         |                  |
    +--------------------------+---------+---------+------------------+
    | Period to calculate      | 1       | uint16  | 15               |
    | drop probability         |         |         |                  |
    | Unit: ms                 |         |         |                  |
    +--------------------------+---------+---------+------------------+

The meaning of these parameters is explained in more detail in the next sections.
The format of these parameters is the format expected by the dropper module API.
Applications may also compute suitable values at run time for fine tuning.

.. _Enqueue_Operation:

Enqueue Operation
~~~~~~~~~~~~~~~~~

In the example shown in :numref:`figure_ex_data_flow_tru_dropper`, q (actual queue size) is the input value,
avg (average queue size) and count (number of packets since the last drop) are run-time values,
decision is the output value and the remaining values are configuration parameters.

.. _figure_ex_data_flow_tru_dropper:

.. figure:: ../img/ex_data_flow_tru_dropper.*

    Example Data Flow Through Dropper


EWMA Filter Microblock
^^^^^^^^^^^^^^^^^^^^^^

The purpose of the EWMA Filter microblock is to filter queue size values to smooth out transient changes
that result from "bursty" traffic.
The output value is the average queue size which gives a more stable view of the current congestion level in the queue.

The EWMA filter has one configuration parameter, filter weight, which determines how quickly
or slowly the average queue size output responds to changes in the actual queue size input.
Higher values of filter weight mean that the average queue size responds more quickly to changes in actual queue size.

Average Queue Size Calculation when the Queue is not Empty
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

The definition of the EWMA filter is given in the following equation.

.. image:: ../img/ewma_filter_eq_1.*

Where:

* *avg* = average queue size

* *wq* = filter weight

* *q* = actual queue size

.. note::

    The filter weight, wq = 1/2^n, where n is the filter weight parameter value passed to the dropper module
    on configuration (see :ref:`Section 2.23.3.1 <Configuration>`).

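For illustration, this update can be written with fixed-point arithmetic and a shift, so that no
multiplication by wq is needed. The scaling factor below is an arbitrary example and this is not
the internal rte_red code.

.. code-block:: c

    #include <stdint.h>

    /* Number of fractional bits used to store avg; an arbitrary example value. */
    #define AVG_FRAC_BITS 8

    /* avg is stored with AVG_FRAC_BITS fractional bits, q is the actual queue
     * size in packets and wq_log2 is the filter weight parameter n (wq = 1/2^n).
     * Computes avg = (1 - wq) * avg + wq * q, rewritten as avg += wq * (q - avg),
     * which reduces to a subtraction and a right shift. */
    static inline uint32_t
    ewma_update(uint32_t avg, uint32_t q, uint16_t wq_log2)
    {
        uint32_t q_scaled = q << AVG_FRAC_BITS;

        if (q_scaled >= avg)
            return avg + ((q_scaled - avg) >> wq_log2);

        return avg - ((avg - q_scaled) >> wq_log2);
    }
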
Average Queue Size Calculation when the Queue is Empty
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The EWMA filter does not read time stamps and instead assumes that enqueue operations will happen quite regularly.
Special handling is required when the queue becomes empty as the queue could be empty for a short time or a long time.
When the queue becomes empty, average queue size should decay gradually to zero instead of dropping suddenly to zero
or remaining stagnant at the last computed value.
When a packet is enqueued on an empty queue, the average queue size is computed using the following formula:

.. image:: ../img/ewma_filter_eq_2.*

Where:

* *m* = the number of enqueue operations that could have occurred on this queue while the queue was empty

In the dropper module, *m* is defined as:

.. image:: ../img/m_definition.*

Where:

* *time* = current time

* *qtime* = time the queue became empty

* *s* = typical time between successive enqueue operations on this queue

The time reference is in units of bytes,
where a byte signifies the time duration required by the physical interface to send out a byte on the transmission medium
(see Section `Internal Time Reference`_).
The parameter s is defined in the dropper module as a constant with the value: s=2^22.
This corresponds to the time required by every leaf node in a hierarchy with 64K leaf nodes
to transmit one 64-byte packet onto the wire and represents the worst case scenario.
For much smaller scheduler hierarchies,
it may be necessary to reduce the parameter s, which is defined in the RED header source file (rte_red.h) as:

.. code-block:: c

    #define RTE_RED_S

Since the time reference is in bytes, the port speed is implied in the expression: *time-qtime*.
The dropper does not have to be configured with the actual port speed.
It adjusts automatically to low speed and high speed links.

Implementation
""""""""""""""

A numerical method is used to compute the factor (1-wq)^m that appears in Equation 2.

This method is based on the following identity:

.. image:: ../img/eq2_factor.*


This allows us to express the following:

.. image:: ../img/eq2_expression.*


In the dropper module, a look-up table is used to compute log2(1-wq) for each value of wq supported by the dropper module.
The factor (1-wq)^m can then be obtained by multiplying the table value by *m* and applying shift operations.
To avoid overflow in the multiplication, the value *m* and the look-up table values are limited to 16 bits.
The total size of the look-up table is 56 bytes.
Once the factor (1-wq)^m is obtained using this method, the average queue size can be calculated from Equation 2.

Alternative Approaches
""""""""""""""""""""""

Other methods for calculating the factor (1-wq)^m in the expression for computing average queue size
when the queue is empty (Equation 2) were considered.
These approaches include:

* Floating-point evaluation

* Fixed-point evaluation using a small look-up table (512B) and up to 16 multiplications
  (this is the approach used in the FreeBSD* ALTQ RED implementation)

* Fixed-point evaluation using a small look-up table (512B) and 16 SSE multiplications
  (SSE optimized version of the approach used in the FreeBSD* ALTQ RED implementation)

* Large look-up table (76 KB)

The method that was finally selected (described above in Section 26.3.2.2.1) outperforms all of these approaches
in terms of run-time performance and memory requirements and
also achieves accuracy comparable to floating-point evaluation.
:numref:`table_qos_17` lists the performance of each of these alternative approaches relative to the method that is used in the dropper.
As can be seen, the floating-point implementation achieved the worst performance.

.. _table_qos_17:

.. table:: Relative Performance of Alternative Approaches

    +------------------------------------------------------------------------------------+----------------------+
    | Method                                                                             | Relative Performance |
    |                                                                                    |                      |
    +====================================================================================+======================+
    | Current dropper method (see :ref:`Section 23.3.2.1.3 <Droppers>`)                 | 100%                 |
    |                                                                                    |                      |
    +------------------------------------------------------------------------------------+----------------------+
    | Fixed-point method with small (512B) look-up table                                | 148%                 |
    |                                                                                    |                      |
    +------------------------------------------------------------------------------------+----------------------+
    | SSE method with small (512B) look-up table                                        | 114%                 |
    |                                                                                    |                      |
    +------------------------------------------------------------------------------------+----------------------+
    | Large (76KB) look-up table                                                         | 118%                 |
    |                                                                                    |                      |
    +------------------------------------------------------------------------------------+----------------------+
    | Floating-point                                                                     | 595%                 |
    |                                                                                    |                      |
    +------------------------------------------------------------------------------------+----------------------+
    | **Note**: In this case, since performance is expressed as time spent executing the operation in a         |
    | specific condition, any relative performance value above 100% runs slower than the reference method.      |
    |                                                                                                           |
    +-----------------------------------------------------------------------------------------------------------+

Drop Decision Block
^^^^^^^^^^^^^^^^^^^

The Drop Decision block:

* Compares the average queue size with the minimum and maximum thresholds

* Calculates a packet drop probability

* Makes a random decision to enqueue or drop an arriving packet

The calculation of the drop probability occurs in two stages.
An initial drop probability is calculated based on the average queue size,
the minimum and maximum thresholds and the mark probability.
An actual drop probability is then computed from the initial drop probability.
The actual drop probability takes the count run-time value into consideration
so that the actual drop probability increases as more packets arrive to the packet queue
since the last packet was dropped.

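The two-stage calculation can be illustrated with the floating-point sketch below; the dropper
itself works in fixed point. Here ``maxp_inv`` is the inverse mark probability and ``count`` the
number of packets since the last drop; the exact formulas are given as Equations 3 and 4 in the
following subsections.

.. code-block:: c

    #include <stdlib.h>

    /* Returns non-zero when the arriving packet should be dropped.
     * Illustration only; rand() stands in for a faster PRNG. */
    static int
    red_drop_decision(double avg, double min_th, double max_th,
                      double maxp_inv, unsigned int count)
    {
        double pb, pa;

        if (avg < min_th)
            return 0;                      /* enqueue */
        if (avg >= max_th)
            return 1;                      /* drop */

        /* Stage 1 (Equation 3): initial drop probability. */
        pb = (avg - min_th) / (max_th - min_th) / maxp_inv;

        /* Stage 2 (Equation 4): actual drop probability, using the factor 2
         * deviation described in this document; out-of-range results are
         * treated as 1. */
        pa = pb / (2.0 - (double)count * pb);
        if (pa < 0.0 || pa > 1.0)
            pa = 1.0;

        /* Random decision based on pa. */
        return ((double)rand() / (double)RAND_MAX) < pa;
    }
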
Initial Packet Drop Probability
"""""""""""""""""""""""""""""""

The initial drop probability is calculated using the following equation.

.. image:: ../img/drop_probability_eq3.*

Where:

* *maxp* = mark probability

* *avg* = average queue size

* *minth* = minimum threshold

* *maxth* = maximum threshold

The calculation of the packet drop probability using Equation 3 is illustrated in :numref:`figure_pkt_drop_probability`.
If the average queue size is below the minimum threshold, an arriving packet is enqueued.
If the average queue size is at or above the maximum threshold, an arriving packet is dropped.
If the average queue size is between the minimum and maximum thresholds,
a drop probability is calculated to determine if the packet should be enqueued or dropped.

.. _figure_pkt_drop_probability:

.. figure:: ../img/pkt_drop_probability.*

    Packet Drop Probability for a Given RED Configuration


Actual Drop Probability
"""""""""""""""""""""""

If the average queue size is between the minimum and maximum thresholds,
then the actual drop probability is calculated from the following equation.

.. image:: ../img/drop_probability_eq4.*

Where:

* *Pb* = initial drop probability (from Equation 3)

* *count* = number of packets that have arrived since the last drop

The constant 2 in Equation 4 is the only deviation from the drop probability formulae
given in the reference document, where a value of 1 is used instead.
It should be noted that the value pa computed from Equation 4 can be negative or greater than 1.
If this is the case, then a value of 1 should be used instead.

The initial and actual drop probabilities are shown in :numref:`figure_drop_probability_graph`.
The actual drop probability is shown for the case where
the formula given in the reference document is used (blue curve)
and also for the case where the formula implemented in the dropper module
is used (red curve).
The formula in the reference document results in a significantly higher drop rate
compared to the mark probability configuration parameter specified by the user.
The choice to deviate from the reference document is simply a design decision and
one that has been taken by other RED implementations, for example, FreeBSD* ALTQ RED.

.. _figure_drop_probability_graph:

.. figure:: ../img/drop_probability_graph.*

    Initial Drop Probability (pb), Actual Drop probability (pa) Computed Using
    a Factor 1 (Blue Curve) and a Factor 2 (Red Curve)


.. _Queue_Empty_Operation:

Queue Empty Operation
~~~~~~~~~~~~~~~~~~~~~

The time at which a packet queue becomes empty must be recorded and saved with the RED run-time data
so that the EWMA filter block can calculate the average queue size on the next enqueue operation.
It is the responsibility of the calling application to inform the dropper module
through the API that a queue has become empty.

Source Files Location
~~~~~~~~~~~~~~~~~~~~~

The source files for the DPDK dropper are located at:

* DPDK/lib/sched/rte_red.h

* DPDK/lib/sched/rte_red.c

Integration with the DPDK QoS Scheduler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RED functionality in the DPDK QoS scheduler is disabled by default.
The parameter is found in the build configuration files in the DPDK/config directory.
RED configuration parameters are specified in the rte_red_params structure within the rte_sched_port_params structure
that is passed to the scheduler on initialization.
RED parameters are specified separately for each traffic class and three packet colors (green, yellow and red),
allowing the scheduler to implement Weighted Random Early Detection (WRED).

Integration with the DPDK QoS Scheduler Sample Application
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The DPDK QoS Scheduler Application reads a configuration file on start-up.
The configuration file includes a section containing RED parameters.
The format of these parameters is described in :ref:`Section 2.23.3.1 <Configuration>`.
A sample RED configuration is shown below. In this example, the queue size is 64 packets.

.. note::

    For correct operation, the same EWMA filter weight parameter (wred weight) should be used
    for each packet color (green, yellow, red) in the same traffic class (tc).

::

    ; RED params per traffic class and color (Green / Yellow / Red)

    [red]
    tc 0 wred min = 28 22 16
    tc 0 wred max = 32 32 32
    tc 0 wred inv prob = 10 10 10
    tc 0 wred weight = 9 9 9

    tc 1 wred min = 28 22 16
    tc 1 wred max = 32 32 32
    tc 1 wred inv prob = 10 10 10
    tc 1 wred weight = 9 9 9

    tc 2 wred min = 28 22 16
    tc 2 wred max = 32 32 32
    tc 2 wred inv prob = 10 10 10
    tc 2 wred weight = 9 9 9

    tc 3 wred min = 28 22 16
    tc 3 wred max = 32 32 32
    tc 3 wred inv prob = 10 10 10
    tc 3 wred weight = 9 9 9

    tc 4 wred min = 28 22 16
    tc 4 wred max = 32 32 32
    tc 4 wred inv prob = 10 10 10
    tc 4 wred weight = 9 9 9

    tc 5 wred min = 28 22 16
    tc 5 wred max = 32 32 32
    tc 5 wred inv prob = 10 10 10
    tc 5 wred weight = 9 9 9

    tc 6 wred min = 28 22 16
    tc 6 wred max = 32 32 32
    tc 6 wred inv prob = 10 10 10
    tc 6 wred weight = 9 9 9

    tc 7 wred min = 28 22 16
    tc 7 wred max = 32 32 32
    tc 7 wred inv prob = 10 10 10
    tc 7 wred weight = 9 9 9

    tc 8 wred min = 28 22 16
    tc 8 wred max = 32 32 32
    tc 8 wred inv prob = 10 10 10
    tc 8 wred weight = 9 9 9

    tc 9 wred min = 28 22 16
    tc 9 wred max = 32 32 32
    tc 9 wred inv prob = 10 10 10
    tc 9 wred weight = 9 9 9

    tc 10 wred min = 28 22 16
    tc 10 wred max = 32 32 32
    tc 10 wred inv prob = 10 10 10
    tc 10 wred weight = 9 9 9

    tc 11 wred min = 28 22 16
    tc 11 wred max = 32 32 32
    tc 11 wred inv prob = 10 10 10
    tc 11 wred weight = 9 9 9

    tc 12 wred min = 28 22 16
    tc 12 wred max = 32 32 32
    tc 12 wred inv prob = 10 10 10
    tc 12 wred weight = 9 9 9

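For reference, the ``tc 0`` lines above map onto the ``rte_red_params`` structure from rte_red.h
roughly as shown below. The two-dimensional array stands in for the RED parameter field of the
scheduler parameter structure mentioned earlier, and the traffic class count and color enum are
assumed to come from rte_sched.h and rte_meter.h.

.. code-block:: c

    #include <rte_meter.h>   /* enum rte_color, RTE_COLORS */
    #include <rte_red.h>
    #include <rte_sched.h>   /* RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE */

    /* Mirror of the "tc 0" lines in the configuration file above. */
    static struct rte_red_params
    red_params[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE][RTE_COLORS] = {
        [0][RTE_COLOR_GREEN]  = { .min_th = 28, .max_th = 32, .maxp_inv = 10, .wq_log2 = 9 },
        [0][RTE_COLOR_YELLOW] = { .min_th = 22, .max_th = 32, .maxp_inv = 10, .wq_log2 = 9 },
        [0][RTE_COLOR_RED]    = { .min_th = 16, .max_th = 32, .maxp_inv = 10, .wq_log2 = 9 },
        /* Traffic classes 1 .. 12 follow the same pattern in this example. */
    };
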
With this configuration file, the RED configuration that applies to green,
yellow and red packets in traffic class 0 is shown in :numref:`table_qos_18`.

.. _table_qos_18:

.. table:: RED Configuration Corresponding to RED Configuration File

    +--------------------+--------------------+-------+--------+-----+
    | RED Parameter      | Configuration Name | Green | Yellow | Red |
    |                    |                    |       |        |     |
    +====================+====================+=======+========+=====+
    | Minimum Threshold  | tc 0 wred min      | 28    | 22     | 16  |
    |                    |                    |       |        |     |
    +--------------------+--------------------+-------+--------+-----+
    | Maximum Threshold  | tc 0 wred max      | 32    | 32     | 32  |
    |                    |                    |       |        |     |
    +--------------------+--------------------+-------+--------+-----+
    | Mark Probability   | tc 0 wred inv prob | 10    | 10     | 10  |
    |                    |                    |       |        |     |
    +--------------------+--------------------+-------+--------+-----+
    | EWMA Filter Weight | tc 0 wred weight   | 9     | 9      | 9   |
    |                    |                    |       |        |     |
    +--------------------+--------------------+-------+--------+-----+

Application Programming Interface (API)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Enqueue API
^^^^^^^^^^^

The syntax of the enqueue API is as follows:

.. code-block:: c

    int rte_red_enqueue(const struct rte_red_config *red_cfg, struct rte_red *red, const unsigned q, const uint64_t time)


The arguments passed to the enqueue API are configuration data, run-time data,
the current size of the packet queue (in packets) and a value representing the current time.
The time reference is in units of bytes,
where a byte signifies the time duration required by the physical interface to send out a byte on the transmission medium
(see Section `Internal Time Reference`_).
The dropper reuses the scheduler time stamps for performance reasons.

Empty API
^^^^^^^^^

The syntax of the empty API is as follows:

.. code-block:: c

    void rte_red_mark_queue_empty(struct rte_red *red, const uint64_t time)

The arguments passed to the empty API are run-time data and the current time in bytes.

Traffic Metering
----------------

The traffic metering component implements the Single Rate Three Color Marker (srTCM) and
Two Rate Three Color Marker (trTCM) algorithms, as defined by IETF RFC 2697 and RFC 2698 respectively.
These algorithms meter the stream of incoming packets based on the allowance defined in advance for each traffic flow.
As a result, each incoming packet is tagged as green,
yellow or red based on the monitored consumption of the flow the packet belongs to.

Functional Overview
~~~~~~~~~~~~~~~~~~~

The srTCM algorithm defines two token buckets for each traffic flow,
with the two buckets sharing the same token update rate:

* Committed (C) bucket: fed with tokens at the rate defined by the Committed Information Rate (CIR) parameter
  (measured in IP packet bytes per second).
  The size of the C bucket is defined by the Committed Burst Size (CBS) parameter (measured in bytes);

* Excess (E) bucket: fed with tokens at the same rate as the C bucket.
  The size of the E bucket is defined by the Excess Burst Size (EBS) parameter (measured in bytes).

The trTCM algorithm defines two token buckets for each traffic flow,
with the two buckets being updated with tokens at independent rates:

* Committed (C) bucket: fed with tokens at the rate defined by the Committed Information Rate (CIR) parameter
  (measured in bytes of IP packet per second).
  The size of the C bucket is defined by the Committed Burst Size (CBS) parameter (measured in bytes);

* Peak (P) bucket: fed with tokens at the rate defined by the Peak Information Rate (PIR) parameter
  (measured in IP packet bytes per second).
  The size of the P bucket is defined by the Peak Burst Size (PBS) parameter (measured in bytes).

Please refer to RFC 2697 (for srTCM) and RFC 2698 (for trTCM) for details on how tokens are consumed
from the buckets and how the packet color is determined.

Color Blind and Color Aware Modes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For both algorithms, the color blind mode is functionally equivalent to the color aware mode with input color set as green.
For color aware mode, a packet with red input color can only get the red output color,
while a packet with yellow input color can only get the yellow or red output colors.

The reason why the color blind mode is still implemented separately from the color aware mode is
that the color blind mode can be implemented with fewer operations than the color aware mode.

Implementation Overview
~~~~~~~~~~~~~~~~~~~~~~~

For each input packet, the steps for the srTCM / trTCM algorithms are listed below
(a short usage sketch of the metering API follows the list):

* Update the C and E / P token buckets. This is done by reading the current time (from the CPU timestamp counter),
  identifying the amount of time since the last bucket update and computing the associated number of tokens
  (according to the pre-configured bucket rate).
  The number of tokens in the bucket is limited by the pre-configured bucket size;

* Identify the output color for the current packet based on the size of the IP packet
  and the amount of tokens currently available in the C and E / P buckets; for color aware mode only,
  the input color of the packet is also considered.
  When the output color is not red, a number of tokens equal to the length of the IP packet are
  subtracted from the C or E / P or both buckets, depending on the algorithm and the output color of the packet.

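A minimal srTCM usage sketch is shown below, assuming the ``rte_meter_srtcm_*`` API from
rte_meter.h (a shared profile plus one run-time context per flow) and the TSC as the time
reference; the CIR/CBS/EBS values are arbitrary examples and error handling is trimmed.

.. code-block:: c

    #include <rte_cycles.h>
    #include <rte_meter.h>

    static struct rte_meter_srtcm_profile profile;
    static struct rte_meter_srtcm meter;        /* one per traffic flow */

    static int
    meter_setup(void)
    {
        struct rte_meter_srtcm_params params = {
            .cir = 1250000,     /* 10 Mbps expressed in bytes per second */
            .cbs = 2048,        /* committed burst size, in bytes */
            .ebs = 2048,        /* excess burst size, in bytes */
        };

        if (rte_meter_srtcm_profile_config(&profile, &params) != 0)
            return -1;

        return rte_meter_srtcm_config(&meter, &profile);
    }

    /* Color one packet in color blind mode; the time stamp is in CPU cycles. */
    static enum rte_color
    meter_packet(uint32_t pkt_len)
    {
        return rte_meter_srtcm_color_blind_check(&meter, &profile,
                                                 rte_rdtsc(), pkt_len);
    }
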