xref: /dpdk/doc/guides/prog_guide/ethdev/qos_framework.rst (revision 41dd9a6bc2d9c6e20e139ad713cc9d172572dd43)
..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

Quality of Service (QoS) Framework
==================================

This chapter describes the DPDK Quality of Service (QoS) framework.

Packet Pipeline with QoS Support
--------------------------------

An example of a complex packet processing pipeline with QoS support is shown in the following figure.

.. _figure_pkt_proc_pipeline_qos:

.. figure:: ../img/pkt_proc_pipeline_qos.*

   Complex Packet Processing Pipeline with QoS Support


This pipeline can be built using reusable DPDK software libraries.
The main blocks implementing QoS in this pipeline are: the policer, the dropper and the scheduler.
A functional description of each block is provided in the following table.

.. _table_qos_1:

.. table:: Packet Processing Pipeline Implementing QoS

   +---+------------------------+--------------------------------------------------------------------------------+
   | # | Block                  | Functional Description                                                         |
   |   |                        |                                                                                |
   +===+========================+================================================================================+
   | 1 | Packet I/O RX & TX     | Packet reception/transmission from/to multiple NIC ports. Poll mode drivers    |
   |   |                        | (PMDs) for Intel 1 GbE/10 GbE NICs.                                            |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+
   | 2 | Packet parser          | Identify the protocol stack of the input packet. Check the integrity of the    |
   |   |                        | packet headers.                                                                |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+
   | 3 | Flow classification    | Map the input packet to one of the known traffic flows. Exact match table      |
   |   |                        | lookup using configurable hash function (jhash, CRC and so on) and bucket      |
   |   |                        | logic to handle collisions.                                                    |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+
   | 4 | Policer                | Packet metering using srTCM (RFC 2697) or trTCM (RFC 2698) algorithms.         |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+
   | 5 | Load Balancer          | Distribute the input packets to the application workers. Provide uniform load  |
   |   |                        | to each worker. Preserve the affinity of traffic flows to workers and the      |
   |   |                        | packet order within each flow.                                                 |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+
   | 6 | Worker threads         | Placeholders for the customer specific application workload (for example, IP   |
   |   |                        | stack and so on).                                                              |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+
   | 7 | Dropper                | Congestion management using the Random Early Detection (RED) algorithm         |
   |   |                        | (specified by the Sally Floyd - Van Jacobson paper) or Weighted RED (WRED)     |
   |   |                        | or Proportional Integral Controller Enhanced (PIE).                            |
   |   |                        | Drop packets based on the current scheduler queue load level and packet        |
   |   |                        | priority. When congestion is experienced, lower priority packets are dropped   |
   |   |                        | first.                                                                         |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+
   | 8 | Hierarchical Scheduler | 5-level hierarchical scheduler (levels are: output port, subport, pipe,        |
   |   |                        | traffic class and queue) with thousands (typically 64K) leaf nodes (queues).   |
   |   |                        | Implements traffic shaping (for subport and pipe levels), strict priority      |
   |   |                        | (for traffic class level) and Weighted Round Robin (WRR) (for queues within    |
   |   |                        | each pipe traffic class).                                                      |
   |   |                        |                                                                                |
   +---+------------------------+--------------------------------------------------------------------------------+

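The policer (block 4 above) meters packets with srTCM (RFC 2697): a committed bucket C and an excess bucket E, both refilled at the Committed Information Rate, decide the packet color. The following is a rough, color-blind-mode sketch of that decision; the structure and function names are invented for illustration (this is not the DPDK rte_meter API), and time is kept in coarse units for simplicity.

```c
#include <stdint.h>

/* Illustrative srTCM (RFC 2697) state: committed (C) and excess (E)
 * buckets, both refilled at the Committed Information Rate (CIR). */
struct srtcm {
    uint64_t tc;   /* current committed bucket occupancy (bytes) */
    uint64_t te;   /* current excess bucket occupancy (bytes) */
    uint64_t cbs;  /* Committed Burst Size (bytes) */
    uint64_t ebs;  /* Excess Burst Size (bytes) */
    uint64_t cir;  /* Committed Information Rate (bytes per time unit) */
    uint64_t last; /* time of last update */
};

enum color { GREEN, YELLOW, RED };

/* Refill both buckets from the elapsed time, then charge the packet
 * to C first, then to E; if neither can cover it, mark it red. */
static enum color
srtcm_color(struct srtcm *m, uint64_t now, uint32_t pkt_len)
{
    /* Credits earned since the last packet; C fills up to CBS and
     * the overflow spills into E up to EBS, as per RFC 2697. */
    m->tc += (now - m->last) * m->cir;
    m->last = now;
    if (m->tc > m->cbs) {
        m->te += m->tc - m->cbs;
        m->tc = m->cbs;
        if (m->te > m->ebs)
            m->te = m->ebs;
    }

    if (m->tc >= pkt_len) {
        m->tc -= pkt_len;
        return GREEN;
    }
    if (m->te >= pkt_len) {
        m->te -= pkt_len;
        return YELLOW;
    }
    return RED;
}
```

trTCM (RFC 2698) differs in that the second bucket is refilled at its own Peak Information Rate rather than from the committed bucket's overflow.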
The infrastructure blocks used throughout the packet processing pipeline are listed in the following table.

.. _table_qos_2:

.. table:: Infrastructure Blocks Used by the Packet Processing Pipeline

   +---+-----------------------+-----------------------------------------------------------------------+
   | # | Block                 | Functional Description                                                |
   |   |                       |                                                                       |
   +===+=======================+=======================================================================+
   | 1 | Buffer manager        | Support for global buffer pools and private per-thread buffer caches. |
   |   |                       |                                                                       |
   +---+-----------------------+-----------------------------------------------------------------------+
   | 2 | Queue manager         | Support for message passing between pipeline blocks.                  |
   |   |                       |                                                                       |
   +---+-----------------------+-----------------------------------------------------------------------+
   | 3 | Power saving          | Support for power saving during low activity periods.                 |
   |   |                       |                                                                       |
   +---+-----------------------+-----------------------------------------------------------------------+

The mapping of pipeline blocks to CPU cores is configurable based on the performance level required by each specific application
and the set of features enabled for each block.
Some blocks might consume more than one CPU core (with each CPU core running a different instance of the same block on different input packets),
while several other blocks could be mapped to the same CPU core.

Hierarchical Scheduler
----------------------

The hierarchical scheduler block, when present, usually sits on the TX side just before the transmission stage.
Its purpose is to prioritize the transmission of packets from different users and different traffic classes
according to the policy specified by the Service Level Agreements (SLAs) of each network node.

Overview
~~~~~~~~

The hierarchical scheduler block is similar to the traffic manager block used by network processors
that typically implement per flow (or per group of flows) packet queuing and scheduling.
It acts like a buffer that is able to temporarily store a large number of packets just before their transmission (enqueue operation);
as the NIC TX requests more packets for transmission,
these packets are removed and handed over to the NIC TX with the packet selection logic observing the predefined SLAs (dequeue operation).

.. _figure_hier_sched_blk:

.. figure:: ../img/hier_sched_blk.*

   Hierarchical Scheduler Block Internal Diagram


The hierarchical scheduler is optimized for a large number of packet queues.
When only a small number of queues are needed, message passing queues should be used instead of this block.
See `Worst Case Scenarios for Performance`_ for a more detailed discussion.

Scheduling Hierarchy
~~~~~~~~~~~~~~~~~~~~

The scheduling hierarchy is shown in :numref:`figure_sched_hier_per_port`.
The first level of the hierarchy is the Ethernet TX port (1/10/40 GbE),
with subsequent hierarchy levels defined as subport, pipe, traffic class and queue.

Typically, each subport represents a predefined group of users, while each pipe represents an individual user/subscriber.
Each traffic class is the representation of a different traffic type with specific loss rate,
delay and jitter requirements, such as voice, video or data transfers.
Each queue hosts packets from one or multiple connections of the same type belonging to the same user.

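Each packet is therefore mapped to a (subport, pipe, traffic class, queue) tuple. As a small sketch of how such a tuple can be flattened into a single per-port queue index, assuming the queue layout detailed in the table below (12 single-queue traffic classes plus a Best Effort class with 4 queues, i.e. 16 queues per pipe); the function and parameter names here are invented for illustration and are not the library's internal layout:

```c
#include <stdint.h>

#define N_TC            13  /* TC0..TC12 */
#define N_BE_QUEUES     4   /* TC12 (Best Effort) has 4 queues */
#define QUEUES_PER_PIPE (N_TC - 1 + N_BE_QUEUES)  /* 12 + 4 = 16 */

/* Illustrative flattening of (subport, pipe, tc, queue) into one
 * per-port index: pipes are numbered consecutively across subports,
 * and each pipe owns a contiguous block of 16 queues. */
static uint32_t
queue_index(uint32_t subport, uint32_t pipe, uint32_t tc, uint32_t q,
            uint32_t n_pipes_per_subport)
{
    uint32_t pipe_base = (subport * n_pipes_per_subport + pipe) * QUEUES_PER_PIPE;
    /* TC0..TC11 each map to one queue; the 4 BE queues follow them. */
    uint32_t offset = (tc < N_TC - 1) ? tc : (N_TC - 1) + q;

    return pipe_base + offset;
}
```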
.. _figure_sched_hier_per_port:

.. figure:: ../img/sched_hier_per_port.*

   Scheduling Hierarchy per Port


The functionality of each hierarchical level is detailed in the following table.

.. _table_qos_3:

.. table:: Port Scheduling Hierarchy

   +---+--------------------+----------------------------+---------------------------------------------------------------+
   | # | Level              | Siblings per Parent        | Functional Description                                        |
   |   |                    |                            |                                                               |
   +===+====================+============================+===============================================================+
   | 1 | Port               | -                          | #.  Output Ethernet port 1/10/40 GbE.                         |
   |   |                    |                            |                                                               |
   |   |                    |                            | #.  Multiple ports are scheduled in round robin order with    |
   |   |                    |                            |     all ports having equal priority.                          |
   |   |                    |                            |                                                               |
   +---+--------------------+----------------------------+---------------------------------------------------------------+
   | 2 | Subport            | Configurable (default: 8)  | #.  Traffic shaping using token bucket algorithm (one token   |
   |   |                    |                            |     bucket per subport).                                      |
   |   |                    |                            |                                                               |
   |   |                    |                            | #.  Upper limit enforced per Traffic Class (TC) at the        |
   |   |                    |                            |     subport level.                                            |
   |   |                    |                            |                                                               |
   |   |                    |                            | #.  Lower priority TCs able to reuse subport bandwidth        |
   |   |                    |                            |     currently unused by higher priority TCs.                  |
   |   |                    |                            |                                                               |
   +---+--------------------+----------------------------+---------------------------------------------------------------+
   | 3 | Pipe               | Configurable (default: 4K) | #.  Traffic shaping using the token bucket algorithm (one     |
   |   |                    |                            |     token bucket per pipe).                                   |
   |   |                    |                            |                                                               |
   +---+--------------------+----------------------------+---------------------------------------------------------------+
   | 4 | Traffic Class (TC) | 13                         | #.  TCs of the same pipe handled in strict priority order.    |
   |   |                    |                            |                                                               |
   |   |                    |                            | #.  Upper limit enforced per TC at the pipe level.            |
   |   |                    |                            |                                                               |
   |   |                    |                            | #.  Lower priority TCs able to reuse pipe bandwidth currently |
   |   |                    |                            |     unused by higher priority TCs.                            |
   |   |                    |                            |                                                               |
   |   |                    |                            | #.  When subport TC is oversubscribed (configuration time     |
   |   |                    |                            |     event), pipe TC upper limit is capped to a dynamically    |
   |   |                    |                            |     adjusted value that is shared by all the subport pipes.   |
   |   |                    |                            |                                                               |
   +---+--------------------+----------------------------+---------------------------------------------------------------+
   | 5 | Queue              |  High priority TCs: 1,     | #.  All the high priority TCs (TC0, TC1, ..., TC11) have      |
   |   |                    |  Lowest priority TC: 4     |     exactly 1 queue, while the lowest priority TC (TC12),     |
   |   |                    |                            |     called Best Effort (BE), has 4 queues.                    |
   |   |                    |                            |                                                               |
   |   |                    |                            | #.  Queues of the lowest priority TC (BE) are serviced using  |
   |   |                    |                            |     Weighted Round Robin (WRR) according to predefined        |
   |   |                    |                            |     weights.                                                  |
   |   |                    |                            |                                                               |
   +---+--------------------+----------------------------+---------------------------------------------------------------+
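Traffic shaping at the subport and pipe levels (rows 2 and 3 above) relies on one token bucket per node: credits accumulate at the configured rate up to a maximum burst size, and a packet may only be sent when enough credits are available. A minimal sketch of those credit mechanics follows; the names are invented for illustration, and the real implementation uses a more elaborate period/credits-per-period accounting.

```c
#include <stdint.h>

/* Illustrative token bucket: `credits` grows at `rate` bytes per time
 * unit, capped at `size` (the maximum allowed burst). */
struct token_bucket {
    uint64_t credits; /* currently available credits (bytes) */
    uint64_t size;    /* bucket size: maximum burst (bytes) */
    uint64_t rate;    /* fill rate (bytes per time unit) */
    uint64_t last;    /* time of last update */
};

/* Refill the bucket from the elapsed time, saturating at the burst size. */
static void
tb_update(struct token_bucket *tb, uint64_t now)
{
    tb->credits += (now - tb->last) * tb->rate;
    if (tb->credits > tb->size)
        tb->credits = tb->size;
    tb->last = now;
}

/* Returns 1 and consumes credits if the packet may be sent now,
 * 0 if it must wait for more credits to accumulate. */
static int
tb_consume(struct token_bucket *tb, uint32_t pkt_len)
{
    if (tb->credits < pkt_len)
        return 0;
    tb->credits -= pkt_len;
    return 1;
}
```

The saturation at `size` is what bounds the burst a subport or pipe can emit after an idle period, while `rate` enforces its long-term SLA rate.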
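The four Best Effort queues (row 5 above) share their traffic class bandwidth in Weighted Round Robin. A simplified sketch of how weights translate into a service order is shown below; it ignores empty queues and the byte-based costing a real scheduler would apply, and the function name is invented for illustration.

```c
#include <stdint.h>

#define N_BE 4 /* Best Effort (TC12) queue count */

/* Illustrative WRR: in each full round, queue i is offered up to
 * weight[i] transmission slots. Returns the queue that owns the
 * n-th slot of the repeating schedule. */
static uint32_t
wrr_pick(const uint32_t weight[N_BE], uint32_t slot)
{
    uint32_t round_len = 0, i;

    for (i = 0; i < N_BE; i++)
        round_len += weight[i];

    slot %= round_len; /* position within one round */
    for (i = 0; i < N_BE; i++) {
        if (slot < weight[i])
            return i;
        slot -= weight[i];
    }
    return 0; /* unreachable when round_len > 0 */
}
```

With weights {2, 1, 1, 1}, for instance, queue 0 receives two slots in every five, so it gets twice the share of the other BE queues when all queues are backlogged.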
Application Programming Interface (API)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Port Scheduler Configuration API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The rte_sched.h file contains configuration functions for port, subport and pipe.

Port Scheduler Enqueue API
^^^^^^^^^^^^^^^^^^^^^^^^^^

The port scheduler enqueue API is very similar to the API of the DPDK PMD TX function.

.. code-block:: c

    int rte_sched_port_enqueue(struct rte_sched_port *port, struct rte_mbuf **pkts, uint32_t n_pkts);

Port Scheduler Dequeue API
^^^^^^^^^^^^^^^^^^^^^^^^^^

The port scheduler dequeue API is very similar to the API of the DPDK PMD RX function.

.. code-block:: c

    int rte_sched_port_dequeue(struct rte_sched_port *port, struct rte_mbuf **pkts, uint32_t n_pkts);

Usage Example
^^^^^^^^^^^^^

.. code-block:: c

    /* File "application.c" */

    #define N_PKTS_RX   64
    #define N_PKTS_TX   48
    #define NIC_RX_PORT 0
    #define NIC_RX_QUEUE 0
    #define NIC_TX_PORT 1
    #define NIC_TX_QUEUE 0

    struct rte_sched_port *port = NULL;
    struct rte_mbuf *pkts_rx[N_PKTS_RX], *pkts_tx[N_PKTS_TX];
    uint32_t n_pkts_rx, n_pkts_tx;

    /* Initialization */

    <initialization code>

    /* Runtime */
    while (1) {
        /* Read packets from NIC RX queue */

        n_pkts_rx = rte_eth_rx_burst(NIC_RX_PORT, NIC_RX_QUEUE, pkts_rx, N_PKTS_RX);

        /* Hierarchical scheduler enqueue */

        rte_sched_port_enqueue(port, pkts_rx, n_pkts_rx);

        /* Hierarchical scheduler dequeue */

        n_pkts_tx = rte_sched_port_dequeue(port, pkts_tx, N_PKTS_TX);

        /* Write packets to NIC TX queue */

        rte_eth_tx_burst(NIC_TX_PORT, NIC_TX_QUEUE, pkts_tx, n_pkts_tx);
    }

Implementation
~~~~~~~~~~~~~~

Internal Data Structures per Port
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A schematic of the internal data structures is shown in :numref:`figure_data_struct_per_port`, with details in :numref:`table_qos_4`.

.. _figure_data_struct_per_port:

.. figure:: ../img/data_struct_per_port.*

    Internal Data Structures per Port


.. _table_qos_4:

.. table:: Scheduler Internal Data Structures per Port

   +---+----------------------+-------------------------+---------------------+------------------------------+---------------------------------------------------+
   | # | Data structure       | Size (bytes)            | # per port          | Access type                  | Description                                       |
   |   |                      |                         |                     |                              |                                                   |
   |   |                      |                         |                     +-------------+----------------+---------------------------------------------------+
   |   |                      |                         |                     | Enq         | Deq            |                                                   |
   |   |                      |                         |                     |             |                |                                                   |
   +===+======================+=========================+=====================+=============+================+===================================================+
   | 1 | Subport table entry  | 64                      | # subports per port | -           | Rd, Wr         | Persistent subport data (credits, etc).           |
   |   |                      |                         |                     |             |                |                                                   |
   +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
   | 2 | Pipe table entry     | 64                      | # pipes per port    | -           | Rd, Wr         | Persistent data for pipe, its TCs and its queues  |
   |   |                      |                         |                     |             |                | (credits, etc) that is updated during run-time.   |
   |   |                      |                         |                     |             |                |                                                   |
   |   |                      |                         |                     |             |                | The pipe configuration parameters do not change   |
   |   |                      |                         |                     |             |                | during run-time. The same pipe configuration      |
   |   |                      |                         |                     |             |                | parameters are shared by multiple pipes,          |
   |   |                      |                         |                     |             |                | therefore they are not part of pipe table entry.  |
   |   |                      |                         |                     |             |                |                                                   |
   +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
   | 3 | Queue table entry    | 4                       | # queues per port   | Rd, Wr      | Rd, Wr         | Persistent queue data (read and write pointers).  |
   |   |                      |                         |                     |             |                | The queue size is the same per TC for all queues, |
   |   |                      |                         |                     |             |                | allowing the queue base address to be computed    |
   |   |                      |                         |                     |             |                | using a fast formula, so these two parameters are |
   |   |                      |                         |                     |             |                | not part of queue table entry.                    |
   |   |                      |                         |                     |             |                |                                                   |
   |   |                      |                         |                     |             |                | The queue table entries for any given pipe are    |
   |   |                      |                         |                     |             |                | stored in the same cache line.                    |
   |   |                      |                         |                     |             |                |                                                   |
   +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
   | 4 | Queue storage area   | Config (default: 64 x 8)| # queues per port   | Wr          | Rd             | Array of elements per queue; each element is 8    |
   |   |                      |                         |                     |             |                | bytes in size (mbuf pointer).                     |
   |   |                      |                         |                     |             |                |                                                   |
   +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
   | 5 | Active queues bitmap | 1 bit per queue         | 1                   | Wr (Set)    | Rd, Wr (Clear) | The bitmap maintains one status bit per queue:    |
   |   |                      |                         |                     |             |                | queue not active (queue is empty) or queue active |
   |   |                      |                         |                     |             |                | (queue is not empty).                             |
   |   |                      |                         |                     |             |                |                                                   |
   |   |                      |                         |                     |             |                | Queue bit is set by the scheduler enqueue and     |
   |   |                      |                         |                     |             |                | cleared by the scheduler dequeue when queue       |
   |   |                      |                         |                     |             |                | becomes empty.                                    |
   |   |                      |                         |                     |             |                |                                                   |
   |   |                      |                         |                     |             |                | Bitmap scan operation returns the next non-empty  |
   |   |                      |                         |                     |             |                | pipe and its status (16-bit mask of active queues |
   |   |                      |                         |                     |             |                | in the pipe).                                     |
   |   |                      |                         |                     |             |                |                                                   |
   +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+
329*41dd9a6bSDavid Young   | 6 | Grinder              | ~128                    | Config (default: 8) | -           | Rd, Wr         | Short list of active pipes currently under        |
330*41dd9a6bSDavid Young   |   |                      |                         |                     |             |                | processing. The grinder contains temporary data   |
331*41dd9a6bSDavid Young   |   |                      |                         |                     |             |                | during pipe processing.                           |
332*41dd9a6bSDavid Young   |   |                      |                         |                     |             |                |                                                   |
333*41dd9a6bSDavid Young   |   |                      |                         |                     |             |                | Once the current pipe exhausts packets or         |
334*41dd9a6bSDavid Young   |   |                      |                         |                     |             |                | credits, it is replaced with another active pipe  |
335*41dd9a6bSDavid Young   |   |                      |                         |                     |             |                | from the bitmap.                                  |
336*41dd9a6bSDavid Young   |   |                      |                         |                     |             |                |                                                   |
337*41dd9a6bSDavid Young   +---+----------------------+-------------------------+---------------------+-------------+----------------+---------------------------------------------------+

Multicore Scaling Strategy
^^^^^^^^^^^^^^^^^^^^^^^^^^

The multicore scaling strategy is:

#.  Running different physical ports on different threads. The enqueue and dequeue of the same port are run by the same thread.

#.  Splitting the same physical port across different threads by running different sets of subports of the same physical port (virtual ports) on different threads.
    Similarly, a subport can be split into multiple subports that are each run by a different thread.
    The enqueue and dequeue of the same port are run by the same thread.
    This is only required if, for performance reasons, it is not possible to handle a full port with a single core.

Enqueue and Dequeue for the Same Output Port
""""""""""""""""""""""""""""""""""""""""""""

Running enqueue and dequeue operations for the same output port from different cores is likely to have
a significant negative impact on the scheduler's performance and is therefore not recommended.

The port enqueue and dequeue operations share access to the following data structures:

#.  Packet descriptors

#.  Queue table

#.  Queue storage area

#.  Bitmap of active queues

The expected drop in performance is due to:

#.  The need to make the queue and bitmap operations thread safe,
    which requires either using locking primitives for access serialization (for example, spinlocks/semaphores) or
    using atomic primitives for lockless access (for example, Test and Set, Compare and Swap, and so on).
    The impact is much higher in the former case.

#.  Ping-pong of cache lines storing the shared data structures between the cache hierarchies of the two cores
    (done transparently by the MESI protocol cache coherency CPU hardware).

Therefore, the scheduler enqueue and dequeue operations have to be run from the same thread,
which allows the queues and the bitmap operations to be non-thread safe and
keeps the scheduler data structures internal to the same core.

Performance Scaling
"""""""""""""""""""

Scaling up the number of NIC ports simply requires a proportional increase in the number of CPU cores to be used for traffic scheduling.

Enqueue Pipeline
^^^^^^^^^^^^^^^^

The sequence of steps per packet:

#.  *Access* the mbuf to read the data fields required to identify the destination queue for the packet.
    These fields are: port, subport, traffic class and queue within traffic class, and are typically set by the classification stage.

#.  *Access* the queue structure to identify the write location in the queue array.
    If the queue is full, then the packet is discarded.

#.  *Access* the queue array location to store the packet (i.e. write the mbuf pointer).

Note the strong data dependency between these steps: step 2 cannot start before the result of step 1 is available,
and step 3 cannot start before the result of step 2,
which prevents the processor's out-of-order execution engine from providing any significant performance benefit.

Given the high rate of input packets and the large number of queues,
it is expected that the data structures accessed to enqueue the current packet are not present
in the L1 or L2 data cache of the current core, thus the above 3 memory accesses would result (on average) in L1 and L2 data cache misses.
Three L1/L2 data cache misses per packet is not acceptable for performance reasons.

The workaround is to prefetch the required data structures in advance. The prefetch operation has an execution latency during which
the processor should not attempt to access the data structure currently under prefetch, so the processor should execute other work.
The only other work available is to execute different stages of the enqueue sequence of operations on other input packets,
thus resulting in a pipelined implementation for the enqueue operation.

:numref:`figure_prefetch_pipeline` illustrates a pipelined implementation for the enqueue operation with 4 pipeline stages and each stage executing 2 different input packets.
No input packet can be part of more than one pipeline stage at a given time.

.. _figure_prefetch_pipeline:

.. figure:: ../img/prefetch_pipeline.*

    Prefetch Pipeline for the Hierarchical Scheduler Enqueue Operation

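The software-pipelined enqueue described above can be sketched as follows. This is a simplified, self-contained illustration and not the librte_sched implementation: the ``prefetch()`` stand-in, the toy ``struct queue`` and the one-packet-per-stage loop are assumptions made for brevity (real DPDK code uses ``rte_prefetch0()`` and processes two packets per stage).

```c
#include <stddef.h>
#include <stdint.h>

#define QUEUE_SIZE 4          /* toy queue depth */

struct queue {
    uint32_t write;           /* total packets written  */
    uint32_t read;            /* total packets consumed */
    void *slots[QUEUE_SIZE];  /* stored mbuf pointers   */
};

/* Stand-in for rte_prefetch0(): a real implementation issues a cache
 * prefetch for the given address. */
static inline void prefetch(const void *addr) { (void)addr; }

/* Software-pipelined enqueue: while packet i is written into its queue
 * slot, the queue of packet i+1 and the mbuf of packet i+2 are already
 * being prefetched, hiding the memory access latency of later stages. */
static int
enqueue_burst(struct queue *queues, const uint32_t *qidx, void **mbufs, int n)
{
    int enqueued = 0;

    for (int i = 0; i < n; i++) {
        if (i + 2 < n)
            prefetch(mbufs[i + 2]);            /* stage 1: mbuf fields  */
        if (i + 1 < n)
            prefetch(&queues[qidx[i + 1]]);    /* stage 2: queue struct */

        struct queue *q = &queues[qidx[i]];    /* stage 3: store packet */
        if (q->write - q->read == QUEUE_SIZE)
            continue;                          /* queue full: drop      */
        q->slots[q->write % QUEUE_SIZE] = mbufs[i];
        q->write++;
        enqueued++;
    }
    return enqueued;
}
```

The drop-on-full branch at stage 3 is the basic congestion management scheme discussed below.
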

The congestion management scheme implemented by the enqueue pipeline described above is very basic:
packets are enqueued until a specific queue becomes full,
then all the packets destined to the same queue are dropped until packets are consumed (by the dequeue operation).
This can be improved by enabling RED/WRED or PIE as part of the enqueue pipeline, which looks at the queue occupancy and
packet priority in order to yield the enqueue/drop decision for a specific packet
(as opposed to enqueuing all packets / dropping all packets indiscriminately).

Dequeue State Machine
^^^^^^^^^^^^^^^^^^^^^

The sequence of steps to schedule the next packet from the current pipe is:

#.  Identify the next active pipe using the bitmap scan operation, *prefetch* pipe.

#.  *Read* pipe data structure. Update the credits for the current pipe and its subport.
    Identify the first active traffic class within the current pipe, select the next queue using WRR,
    *prefetch* queue pointers for all the 16 queues of the current pipe.

#.  *Read* next element from the current WRR queue and *prefetch* its packet descriptor.

#.  *Read* the packet length from the packet descriptor (mbuf structure).
    Based on the packet length and the available credits (of current pipe, pipe traffic class, subport and subport traffic class),
    take the go/no go scheduling decision for the current packet.

To avoid the cache misses, the above data structures (pipe, queue, queue array, mbufs) are prefetched in advance of being accessed.
The strategy for hiding the latency of the prefetch operations is to switch from the current pipe (in grinder A) to another pipe
(in grinder B) immediately after a prefetch is issued for the current pipe.
This gives enough time for the prefetch operation to complete before the execution switches back to this pipe (in grinder A).

The dequeue pipe state machine exploits the presence of data in the processor cache,
therefore it tries to send as many packets from the same pipe TC and pipe as possible (up to the available packets and credits) before
moving to the next active TC from the same pipe (if any) or to another active pipe.

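The grinder alternation above can be modeled with a small state machine. The following is a toy sketch under stated assumptions: all names are illustrative, "prefetch" states are no-ops, one ``struct pipe`` stands in for the whole pipe/TC/queue hierarchy, and the bitmap scan is replaced by a linear search.

```c
#include <stddef.h>

enum grinder_state {
    PREFETCH_PIPE,     /* bitmap scan done, pipe prefetch issued  */
    PREFETCH_QUEUES,   /* queue pointers prefetch issued          */
    PREFETCH_MBUF,     /* packet descriptor prefetch issued       */
    READ_MBUF          /* data now in cache: schedule packets     */
};

struct pipe    { int pkts; };   /* packets left in the pipe (toy model) */
struct grinder { enum grinder_state state; struct pipe *pipe; };

/* Advance one grinder by one state transition; each "prefetch" state
 * stands in for issuing a prefetch and yielding to another grinder.
 * Returns 1 if a packet was scheduled. */
static int
grinder_handle(struct grinder *g, struct pipe *pipes, size_t n_pipes, size_t *next)
{
    switch (g->state) {
    case PREFETCH_PIPE:                       /* bitmap scan stand-in */
        g->pipe = NULL;
        for (size_t i = 0; i < n_pipes; i++) {
            size_t idx = (*next + i) % n_pipes;
            if (pipes[idx].pkts > 0) {
                g->pipe = &pipes[idx];
                *next = (idx + 1) % n_pipes;
                g->state = PREFETCH_QUEUES;
                break;
            }
        }
        return 0;
    case PREFETCH_QUEUES:
        g->state = PREFETCH_MBUF;
        return 0;
    case PREFETCH_MBUF:
        g->state = READ_MBUF;
        return 0;
    case READ_MBUF:
        if (g->pipe->pkts <= 0) {             /* pipe drained meanwhile */
            g->state = PREFETCH_PIPE;
            return 0;
        }
        g->pipe->pkts--;                      /* "send" one packet; stay on
                                               * this pipe while it has more */
        if (g->pipe->pkts == 0)
            g->state = PREFETCH_PIPE;
        return 1;
    }
    return 0;
}

/* Round-robin over the grinders so that, by the time execution returns
 * to a grinder, its outstanding prefetch has had time to complete. */
static int
dequeue_packets(struct grinder *g, size_t n_grinders,
                struct pipe *pipes, size_t n_pipes, int want)
{
    size_t next = 0;
    int got = 0, rounds = 0;

    while (got < want && rounds++ < 1000)
        for (size_t i = 0; i < n_grinders && got < want; i++)
            got += grinder_handle(&g[i], pipes, n_pipes, &next);
    return got;
}
```

Note how a grinder in ``READ_MBUF`` keeps sending from the same pipe until packets run out, matching the cache-friendly behavior described above.
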
.. _figure_pipe_prefetch_sm:

.. figure:: ../img/pipe_prefetch_sm.*

   Pipe Prefetch State Machine for the Hierarchical Scheduler Dequeue
   Operation


Timing and Synchronization
^^^^^^^^^^^^^^^^^^^^^^^^^^

The output port is modeled as a conveyor belt of byte slots that need to be filled by the scheduler with data for transmission.
For 10 GbE, there are 1.25 billion byte slots that need to be filled by the port scheduler every second.
If the scheduler is not fast enough to fill the slots, provided that enough packets and credits exist,
then some slots will be left unused and bandwidth will be wasted.

In principle, the hierarchical scheduler dequeue operation should be triggered by NIC TX.
Usually, once the occupancy of the NIC TX input queue drops below a predefined threshold,
the port scheduler is woken up (interrupt based or polling based,
by continuously monitoring the queue occupancy) to push more packets into the queue.

Internal Time Reference
"""""""""""""""""""""""

The scheduler needs to keep track of time advancement for the credit logic,
which requires credit updates based on time (for example, subport and pipe traffic shaping, traffic class upper limit enforcement, and so on).

Every time the scheduler decides to send a packet out to the NIC TX for transmission, the scheduler will increment its internal time reference accordingly.
Therefore, it is convenient to keep the internal time reference in units of bytes,
where a byte signifies the time duration required by the physical interface to send out a byte on the transmission medium.
This way, as a packet is scheduled for transmission, the time is incremented with (n + h),
where n is the packet length in bytes and h is the number of framing overhead bytes per packet.

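A minimal sketch of this byte-based time advance (the function and field names are illustrative, not the librte_sched names; the 24-byte overhead value anticipates the Ethernet framing overhead table later in this section):

```c
#include <stdint.h>

#define FRAME_OVERHEAD 24   /* h: preamble + SFD + FCS + IFG, in bytes */

/* Advance the internal time reference by (n + h) bytes when a packet of
 * n bytes is scheduled for transmission. */
static uint64_t
time_advance(uint64_t time_bytes, uint32_t pkt_len)
{
    return time_bytes + pkt_len + FRAME_OVERHEAD;
}
```
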
Internal Time Reference Re-synchronization
""""""""""""""""""""""""""""""""""""""""""

The scheduler needs to align its internal time reference to the pace of the port conveyor belt.
This ensures that the scheduler does not feed the NIC TX with more bytes than the line rate of the physical medium,
in order to prevent packet drops (by the scheduler, due to the NIC TX input queue being full, or later on, internally by the NIC TX).

The scheduler reads the current time on every dequeue invocation.
The CPU time stamp can be obtained by reading either the Time Stamp Counter (TSC) register or the High Precision Event Timer (HPET) register.
The current CPU time stamp is converted from number of CPU clocks to number of bytes:
*time_bytes = time_cycles / cycles_per_byte*, where *cycles_per_byte*
is the amount of CPU cycles that is equivalent to the transmission time for one byte on the wire
(e.g. for a CPU frequency of 2 GHz and a 10 GbE port, *cycles_per_byte = 1.6*).

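This conversion can be written in pure integer arithmetic, which handles a fractional *cycles_per_byte* such as 1.6 without floating point. The sketch below is illustrative and ignores the overflow handling that a production implementation would need for large cycle counts:

```c
#include <stdint.h>

/* Convert a CPU time stamp (in cycles) to byte units:
 *   time_cycles / cycles_per_byte = time_cycles * bytes_per_second / cpu_hz
 * e.g. a 2 GHz CPU driving 10 GbE (1.25e9 bytes/s) gives 1.6 cycles/byte. */
static uint64_t
cycles_to_bytes(uint64_t time_cycles, uint64_t cpu_hz, uint64_t bytes_per_second)
{
    return time_cycles * bytes_per_second / cpu_hz;
}
```
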
The scheduler maintains an internal time reference of the NIC time.
Whenever a packet is scheduled, the NIC time is incremented with the packet length (including framing overhead).
On every dequeue invocation, the scheduler checks its internal reference of the NIC time against the current time:

#. If NIC time is in the future (NIC time >= current time), no adjustment of NIC time is needed.
   This means that the scheduler is able to schedule NIC packets before the NIC actually needs those packets, so the NIC TX is well supplied with packets;

#. If NIC time is in the past (NIC time < current time), then NIC time should be adjusted by setting it to the current time.
   This means that the scheduler is not able to keep up with the speed of the NIC byte conveyor belt,
   so NIC bandwidth is wasted due to poor packet supply to the NIC TX.

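The two cases above amount to a small re-synchronization step on each dequeue invocation. A hedged sketch (illustrative names, both times in byte units):

```c
#include <stdint.h>

/* Align the scheduler's NIC time with the current time.  Returns the
 * number of byte slots lost when the scheduler fell behind the port
 * conveyor belt (0 when the NIC TX is well supplied). */
static uint64_t
nic_time_resync(uint64_t *nic_time, uint64_t current_time)
{
    if (*nic_time >= current_time)
        return 0;                        /* case 1: NIC time in the future */

    uint64_t lost = current_time - *nic_time;
    *nic_time = current_time;            /* case 2: jump to current time   */
    return lost;
}
```
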
Scheduler Accuracy and Granularity
""""""""""""""""""""""""""""""""""

The scheduler round trip delay (SRTD) is the time (number of CPU cycles) between two consecutive examinations of the same pipe by the scheduler.

To keep up with the output port (that is, avoid bandwidth loss),
the scheduler should be able to schedule n packets faster than the same n packets are transmitted by NIC TX.

The scheduler needs to keep up with the rate of each individual pipe,
as configured for the pipe token bucket, assuming that no port oversubscription is taking place.
This means that the size of the pipe token bucket should be set high enough to prevent it from overflowing due to a big SRTD,
as this would result in credit loss (and therefore bandwidth loss) for the pipe.

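The sizing constraint above can be made concrete: over one SRTD, a pipe accrues roughly its rate multiplied by the SRTD duration worth of credits, and the bucket must be at least that large to avoid losing them. This helper is a hypothetical illustration of that calculation, not a librte_sched API:

```c
#include <stdint.h>

/* Smallest pipe token bucket size (in bytes) that does not overflow
 * while the pipe waits one scheduler round trip (SRTD, in CPU cycles)
 * between visits.  Rounds up so a short SRTD never yields zero. */
static uint64_t
pipe_tb_min_size(uint64_t pipe_rate_bytes_per_sec,
                 uint64_t srtd_cycles, uint64_t cpu_hz)
{
    return (pipe_rate_bytes_per_sec * srtd_cycles + cpu_hz - 1) / cpu_hz;
}
```
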
Credit Logic
^^^^^^^^^^^^

Scheduling Decision
"""""""""""""""""""

The scheduling decision to send the next packet from (subport S, pipe P, traffic class TC, queue Q) is favorable (packet is sent)
when all the conditions below are met:

*   Pipe P of subport S is currently selected by one of the port grinders;

*   Traffic class TC is the highest priority active traffic class of pipe P;

*   Queue Q is the next queue selected by WRR within traffic class TC of pipe P;

*   Subport S has enough credits to send the packet;

*   Subport S has enough credits for traffic class TC to send the packet;

*   Pipe P has enough credits to send the packet;

*   Pipe P has enough credits for traffic class TC to send the packet.

If all the above conditions are met,
then the packet is selected for transmission and the necessary credits are subtracted from subport S,
subport S traffic class TC, pipe P, and pipe P traffic class TC.

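The credit part of this decision (the last four conditions, plus the debit on success) can be sketched as an all-or-nothing check. Struct and function names are illustrative, not the librte_sched ones:

```c
#include <stdbool.h>
#include <stdint.h>

/* Credit counters touched by one scheduling decision. */
struct sched_credits {
    uint64_t subport;      /* subport S token bucket       */
    uint64_t subport_tc;   /* subport S, traffic class TC  */
    uint64_t pipe;         /* pipe P token bucket          */
    uint64_t pipe_tc;      /* pipe P, traffic class TC     */
};

/* Favorable decision only if all four counters cover the (n + h)
 * credits of the packet; on success all four are debited, on failure
 * nothing changes. */
static bool
sched_try_send(struct sched_credits *c, uint32_t pkt_credits)
{
    if (c->subport < pkt_credits || c->subport_tc < pkt_credits ||
        c->pipe < pkt_credits || c->pipe_tc < pkt_credits)
        return false;

    c->subport    -= pkt_credits;
    c->subport_tc -= pkt_credits;
    c->pipe       -= pkt_credits;
    c->pipe_tc    -= pkt_credits;
    return true;
}
```
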
Framing Overhead
""""""""""""""""

As the greatest common divisor for all packet lengths is one byte, the unit of credit is selected as one byte.
The number of credits required for the transmission of a packet of n bytes is equal to (n+h),
where h is equal to the number of framing overhead bytes per packet.

.. _table_qos_5:

.. table:: Ethernet Frame Overhead Fields

   +---+--------------------------------+----------------+---------------------------------------------------------------------------+
   | # | Packet field                   | Length (bytes) | Comments                                                                  |
   |   |                                |                |                                                                           |
   +===+================================+================+===========================================================================+
   | 1 | Preamble                       | 7              |                                                                           |
   |   |                                |                |                                                                           |
   +---+--------------------------------+----------------+---------------------------------------------------------------------------+
   | 2 | Start of Frame Delimiter (SFD) | 1              |                                                                           |
   |   |                                |                |                                                                           |
   +---+--------------------------------+----------------+---------------------------------------------------------------------------+
   | 3 | Frame Check Sequence (FCS)     | 4              | Considered overhead only if not included in the mbuf packet length field. |
   |   |                                |                |                                                                           |
   +---+--------------------------------+----------------+---------------------------------------------------------------------------+
   | 4 | Inter Frame Gap (IFG)          | 12             |                                                                           |
   |   |                                |                |                                                                           |
   +---+--------------------------------+----------------+---------------------------------------------------------------------------+
   | 5 | Total                          | 24             |                                                                           |
   |   |                                |                |                                                                           |
   +---+--------------------------------+----------------+---------------------------------------------------------------------------+

Traffic Shaping
"""""""""""""""

The traffic shaping for subport and pipe is implemented using a token bucket per subport/per pipe.
Each token bucket is implemented using one saturated counter that keeps track of the number of available credits.

The token bucket generic parameters and operations are presented in :numref:`table_qos_6` and :numref:`table_qos_7`.

.. _table_qos_6:

.. table:: Token Bucket Generic Parameters

   +---+------------------------+--------------------+---------------------------------------------------------+
   | # | Token Bucket Parameter | Unit               | Description                                             |
   |   |                        |                    |                                                         |
   +===+========================+====================+=========================================================+
   | 1 | bucket_rate            | Credits per second | Rate of adding credits to the bucket.                   |
   |   |                        |                    |                                                         |
   +---+------------------------+--------------------+---------------------------------------------------------+
   | 2 | bucket_size            | Credits            | Max number of credits that can be stored in the bucket. |
   |   |                        |                    |                                                         |
   +---+------------------------+--------------------+---------------------------------------------------------+

.. _table_qos_7:

.. table:: Token Bucket Generic Operations

   +---+------------------------+------------------------------------------------------------------------------+
   | # | Token Bucket Operation | Description                                                                  |
   |   |                        |                                                                              |
   +===+========================+==============================================================================+
   | 1 | Initialization         | Bucket set to a predefined value, e.g. zero or half of the bucket size.      |
   |   |                        |                                                                              |
   +---+------------------------+------------------------------------------------------------------------------+
   | 2 | Credit update          | Credits are added to the bucket on top of existing ones, either periodically |
   |   |                        | or on demand, based on the bucket_rate. Credits cannot exceed the upper      |
   |   |                        | limit defined by the bucket_size, so any credits to be added to the bucket   |
   |   |                        | while the bucket is full are dropped.                                        |
   |   |                        |                                                                              |
   +---+------------------------+------------------------------------------------------------------------------+
   | 3 | Credit consumption     | As a result of packet scheduling, the necessary number of credits is removed |
   |   |                        | from the bucket. The packet can only be sent if enough credits are in the    |
   |   |                        | bucket to send the full packet (packet bytes and framing overhead for the    |
   |   |                        | packet).                                                                     |
   |   |                        |                                                                              |
   +---+------------------------+------------------------------------------------------------------------------+

To implement the token bucket generic operations described above,
the current design uses the persistent data structure presented in :numref:`table_qos_8`,
while the implementation of the token bucket operations is described in :numref:`table_qos_9`.

.. _table_qos_8:

.. table:: Token Bucket Persistent Data Structure

   +---+------------------------+-------+----------------------------------------------------------------------+
   | # | Token bucket field     | Unit  | Description                                                          |
   |   |                        |       |                                                                      |
   +===+========================+=======+======================================================================+
   | 1 | tb_time                | Bytes | Time of the last credit update. Measured in bytes instead of seconds |
   |   |                        |       | or CPU cycles for ease of credit consumption operation               |
   |   |                        |       | (as the current time is also maintained in bytes).                   |
   |   |                        |       |                                                                      |
   |   |                        |       | See the "Internal Time Reference" section above for an               |
   |   |                        |       | explanation of why the time is maintained in byte units.             |
   |   |                        |       |                                                                      |
   +---+------------------------+-------+----------------------------------------------------------------------+
   | 2 | tb_period              | Bytes | Time period that should elapse since the last credit update in order |
   |   |                        |       | for the bucket to be awarded tb_credits_per_period worth of credits. |
   |   |                        |       |                                                                      |
   +---+------------------------+-------+----------------------------------------------------------------------+
   | 3 | tb_credits_per_period  | Bytes | Credit allowance per tb_period.                                      |
   |   |                        |       |                                                                      |
   +---+------------------------+-------+----------------------------------------------------------------------+
   | 4 | tb_size                | Bytes | Bucket size, i.e. upper limit for the tb_credits.                    |
   |   |                        |       |                                                                      |
   +---+------------------------+-------+----------------------------------------------------------------------+
   | 5 | tb_credits             | Bytes | Number of credits currently in the bucket.                           |
   |   |                        |       |                                                                      |
   +---+------------------------+-------+----------------------------------------------------------------------+

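The persistent fields above, together with the generic operations, can be sketched as a self-contained token bucket. This is an illustrative model consistent with the tables, not the actual librte_sched code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-subport/per-pipe token bucket state (fields as in the table above). */
struct token_bucket {
    uint64_t tb_time;                /* time of last credit update (bytes) */
    uint64_t tb_period;              /* bytes between credit updates       */
    uint64_t tb_credits_per_period;  /* credits granted per period         */
    uint64_t tb_size;                /* bucket upper limit                 */
    uint64_t tb_credits;             /* current credit balance             */
};

/* On-demand credit update: award whole elapsed periods and saturate at
 * tb_size; credits arriving while the bucket is full are dropped. */
static void
tb_credit_update(struct token_bucket *tb, uint64_t now_bytes)
{
    uint64_t n_periods = (now_bytes - tb->tb_time) / tb->tb_period;

    tb->tb_credits += n_periods * tb->tb_credits_per_period;
    if (tb->tb_credits > tb->tb_size)
        tb->tb_credits = tb->tb_size;
    tb->tb_time += n_periods * tb->tb_period;
}

/* Credit consumption: the packet is sent only if the full (n + h)
 * credit amount is available in the bucket. */
static bool
tb_credit_consume(struct token_bucket *tb, uint32_t pkt_credits)
{
    if (tb->tb_credits < pkt_credits)
        return false;
    tb->tb_credits -= pkt_credits;
    return true;
}
```

Note that ``tb_time`` advances by whole periods only, so any fractional remainder of the elapsed time is carried over to the next update rather than lost.
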
665*41dd9a6bSDavid YoungThe bucket rate (in bytes per second) can be computed with the following formula:
666*41dd9a6bSDavid Young
667*41dd9a6bSDavid Young*bucket_rate = (tb_credits_per_period / tb_period) * r*
668*41dd9a6bSDavid Young
669*41dd9a6bSDavid Youngwhere, r = port line rate (in bytes per second).

.. _table_qos_9:

.. table:: Token Bucket Operations

   +---+-------------------------+-----------------------------------------------------------------------------+
   | # | Token bucket operation  | Description                                                                 |
   |   |                         |                                                                             |
   +===+=========================+=============================================================================+
   | 1 | Initialization          | *tb_credits = 0; or tb_credits = tb_size / 2;*                              |
   |   |                         |                                                                             |
   +---+-------------------------+-----------------------------------------------------------------------------+
   | 2 | Credit update           | Credit update options:                                                      |
   |   |                         |                                                                             |
   |   |                         | *   Every time a packet is sent for a port, update the credits of all the   |
   |   |                         |     subports and pipes of that port. Not feasible.                          |
   |   |                         |                                                                             |
   |   |                         | *   Every time a packet is sent, update the credits for the pipe and        |
   |   |                         |     subport. Very accurate, but not needed (a lot of calculations).         |
   |   |                         |                                                                             |
   |   |                         | *   Every time a pipe is selected (that is, picked by one                   |
   |   |                         |     of the grinders), update the credits for the pipe and its subport.      |
   |   |                         |                                                                             |
   |   |                         | The current implementation uses option 3. According to Section              |
   |   |                         | `Dequeue State Machine`_, the pipe and subport credits are                  |
   |   |                         | updated every time a pipe is selected by the dequeue process before the     |
   |   |                         | pipe and subport credits are actually used.                                 |
   |   |                         |                                                                             |
   |   |                         | The implementation uses a tradeoff between accuracy and speed by updating   |
   |   |                         | the bucket credits only when at least a full *tb_period* has elapsed since  |
   |   |                         | the last update.                                                            |
   |   |                         |                                                                             |
   |   |                         | *   Full accuracy can be achieved by selecting the value for *tb_period*    |
   |   |                         |     for which *tb_credits_per_period = 1*.                                  |
   |   |                         |                                                                             |
   |   |                         | *   When full accuracy is not required, better performance is achieved by   |
   |   |                         |     setting *tb_credits* to a larger value.                                 |
   |   |                         |                                                                             |
   |   |                         | Update operations:                                                          |
   |   |                         |                                                                             |
   |   |                         | *   n_periods = (time - tb_time) / tb_period;                               |
   |   |                         |                                                                             |
   |   |                         | *   tb_credits += n_periods * tb_credits_per_period;                        |
   |   |                         |                                                                             |
   |   |                         | *   tb_credits = min(tb_credits, tb_size);                                  |
   |   |                         |                                                                             |
   |   |                         | *   tb_time += n_periods * tb_period;                                       |
   |   |                         |                                                                             |
   +---+-------------------------+-----------------------------------------------------------------------------+
   | 3 | Credit consumption      | As a result of packet scheduling, the necessary number of credits is        |
   |   |  (on packet scheduling) | removed from the bucket. The packet can only be sent if enough credits are  |
   |   |                         | in the bucket to send the full packet (packet bytes and framing overhead    |
   |   |                         | for the packet).                                                            |
   |   |                         |                                                                             |
   |   |                         | Scheduling operations:                                                      |
   |   |                         |                                                                             |
   |   |                         | pkt_credits = pkt_len + frame_overhead;                                     |
   |   |                         |                                                                             |
   |   |                         | if (tb_credits >= pkt_credits) {tb_credits -= pkt_credits;}                 |
   |   |                         |                                                                             |
   +---+-------------------------+-----------------------------------------------------------------------------+

Traffic Classes
"""""""""""""""

Implementation of Strict Priority Scheduling
''''''''''''''''''''''''''''''''''''''''''''

Strict priority scheduling of traffic classes within the same pipe is implemented by the pipe dequeue state machine,
which selects the queues in ascending order.
Therefore, queue 0 (associated with TC 0, the highest priority TC) is handled before
queue 1 (TC 1, lower priority than TC 0),
which is handled before queue 2 (TC 2, lower priority than TC 1), and so on,
until the queues of all TCs except the lowest priority TC are handled.
Finally, queues 12..15 (the best effort TC, which has the lowest priority) are handled.

Upper Limit Enforcement
'''''''''''''''''''''''

The traffic classes at the pipe and subport levels are not traffic shaped,
so there is no token bucket maintained in this context.
The upper limit for the traffic classes at the subport and
pipe levels is enforced by periodically refilling the subport / pipe traffic class credit counter,
out of which credits are consumed every time a packet is scheduled for that subport / pipe,
as described in :numref:`table_qos_10` and :numref:`table_qos_11`.

.. _table_qos_10:

.. table:: Subport/Pipe Traffic Class Upper Limit Enforcement Persistent Data Structure

   +---+-----------------------+-------+-----------------------------------------------------------------------+
   | # | Subport or pipe field | Unit  | Description                                                           |
   |   |                       |       |                                                                       |
   +===+=======================+=======+=======================================================================+
   | 1 | tc_time               | Bytes | Time of the next update (upper limit refill) for the TCs of the       |
   |   |                       |       | current subport / pipe.                                               |
   |   |                       |       |                                                                       |
   |   |                       |       | See Section `Internal Time Reference`_ for the                        |
   |   |                       |       | explanation of why the time is maintained in byte units.              |
   |   |                       |       |                                                                       |
   +---+-----------------------+-------+-----------------------------------------------------------------------+
   | 2 | tc_period             | Bytes | Time between two consecutive updates for all the TCs of the current   |
   |   |                       |       | subport / pipe. This is expected to be many times bigger than the     |
   |   |                       |       | typical value of the token bucket tb_period.                          |
   |   |                       |       |                                                                       |
   +---+-----------------------+-------+-----------------------------------------------------------------------+
   | 3 | tc_credits_per_period | Bytes | Upper limit for the number of credits allowed to be consumed by the   |
   |   |                       |       | current TC during each enforcement period tc_period.                  |
   |   |                       |       |                                                                       |
   +---+-----------------------+-------+-----------------------------------------------------------------------+
   | 4 | tc_credits            | Bytes | Current upper limit for the number of credits that can be consumed by |
   |   |                       |       | the current traffic class for the remainder of the current            |
   |   |                       |       | enforcement period.                                                   |
   |   |                       |       |                                                                       |
   +---+-----------------------+-------+-----------------------------------------------------------------------+

.. _table_qos_11:

.. table:: Subport/Pipe Traffic Class Upper Limit Enforcement Operations

   +---+--------------------------+----------------------------------------------------------------------------+
   | # | Traffic Class Operation  | Description                                                                |
   |   |                          |                                                                            |
   +===+==========================+============================================================================+
   | 1 | Initialization           | tc_credits = tc_credits_per_period;                                        |
   |   |                          |                                                                            |
   |   |                          | tc_time = tc_period;                                                       |
   |   |                          |                                                                            |
   +---+--------------------------+----------------------------------------------------------------------------+
   | 2 | Credit update            | Update operations:                                                         |
   |   |                          |                                                                            |
   |   |                          | if (time >= tc_time) {                                                     |
   |   |                          |                                                                            |
   |   |                          | tc_credits = tc_credits_per_period;                                        |
   |   |                          |                                                                            |
   |   |                          | tc_time = time + tc_period;                                                |
   |   |                          |                                                                            |
   |   |                          | }                                                                          |
   |   |                          |                                                                            |
   +---+--------------------------+----------------------------------------------------------------------------+
   | 3 | Credit consumption       | As a result of packet scheduling, the TC limit is decreased by the         |
   |   | (on packet scheduling)   | necessary number of credits. The packet can only be sent if enough credits |
   |   |                          | are currently available in the TC limit to send the full packet            |
   |   |                          | (packet bytes and framing overhead for the packet).                        |
   |   |                          |                                                                            |
   |   |                          | Scheduling operations:                                                     |
   |   |                          |                                                                            |
   |   |                          | pkt_credits = pkt_len + frame_overhead;                                    |
   |   |                          |                                                                            |
   |   |                          | if (tc_credits >= pkt_credits) {tc_credits -= pkt_credits;}                |
   |   |                          |                                                                            |
   +---+--------------------------+----------------------------------------------------------------------------+

Weighted Round Robin (WRR)
""""""""""""""""""""""""""

The evolution of the WRR design solution for the lowest priority traffic class (best effort TC) from simple to complex is shown in :numref:`table_qos_12`.

.. _table_qos_12:

.. table:: Weighted Round Robin (WRR)

   +---+------------+-----------------+-------------+----------------------------------------------------------+
   | # | All Queues | Equal Weights   | All Packets | Strategy                                                 |
   |   | Active?    | for All Queues? | Equal?      |                                                          |
   +===+============+=================+=============+==========================================================+
   | 1 | Yes        | Yes             | Yes         | **Byte level round robin**                               |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | *Next queue* : queue #i, i = *(i + 1) % n*               |
   |   |            |                 |             |                                                          |
   +---+------------+-----------------+-------------+----------------------------------------------------------+
   | 2 | Yes        | Yes             | No          | **Packet level round robin**                             |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | Consuming one byte from queue #i requires consuming      |
   |   |            |                 |             | exactly one token for queue #i.                          |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | T(i) = Accumulated number of tokens previously consumed  |
   |   |            |                 |             | from queue #i. Every time a packet is consumed from      |
   |   |            |                 |             | queue #i, T(i) is updated as: T(i) += *pkt_len*.         |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | *Next queue* : queue with the smallest T.                |
   |   |            |                 |             |                                                          |
   +---+------------+-----------------+-------------+----------------------------------------------------------+
   | 3 | Yes        | No              | No          | **Packet level weighted round robin**                    |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | This case can be reduced to the previous case by         |
   |   |            |                 |             | introducing a cost per byte that is different for each   |
   |   |            |                 |             | queue. Queues with lower weights have a higher cost per  |
   |   |            |                 |             | byte. This way, it is still meaningful to compare the    |
   |   |            |                 |             | consumption amongst different queues in order to select  |
   |   |            |                 |             | the next queue.                                          |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | w(i) = Weight of queue #i                                |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | t(i) = Tokens per byte for queue #i, defined as the      |
   |   |            |                 |             | inverse weight of queue #i.                              |
   |   |            |                 |             | For example, if w[0..3] = [1:2:4:8],                     |
   |   |            |                 |             | then t[0..3] = [8:4:2:1]; if w[0..3] = [1:4:15:20],      |
   |   |            |                 |             | then t[0..3] = [60:15:4:3].                              |
   |   |            |                 |             | Consuming one byte from queue #i requires consuming t(i) |
   |   |            |                 |             | tokens for queue #i.                                     |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | T(i) = Accumulated number of tokens previously consumed  |
   |   |            |                 |             | from queue #i. Every time a packet is consumed from      |
   |   |            |                 |             | queue #i, T(i) is updated as: *T(i) += pkt_len * t(i)*.  |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | *Next queue* : queue with the smallest T.                |
   |   |            |                 |             |                                                          |
   +---+------------+-----------------+-------------+----------------------------------------------------------+
   | 4 | No         | No              | No          | **Packet level weighted round robin with variable queue  |
   |   |            |                 |             | status**                                                 |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | Reduce this case to the previous case by setting the     |
   |   |            |                 |             | consumption of inactive queues to a high number, so that |
   |   |            |                 |             | the inactive queues will never be selected by the        |
   |   |            |                 |             | smallest T logic.                                        |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | To prevent T from overflowing as result of successive    |
   |   |            |                 |             | accumulations, T(i) is truncated after each packet       |
   |   |            |                 |             | consumption for all queues.                              |
   |   |            |                 |             | For example, T[0..3] = [1000, 1100, 1200, 1300]          |
   |   |            |                 |             | is truncated to T[0..3] = [0, 100, 200, 300]             |
   |   |            |                 |             | by subtracting the min T from T(i), i = 0..n.            |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | This requires having at least one active queue in the    |
   |   |            |                 |             | set of input queues, which is guaranteed by the dequeue  |
   |   |            |                 |             | state machine never selecting an inactive traffic class. |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | *mask(i) = Saturation mask for queue #i, defined as:*    |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | mask(i) = (queue #i is active)? 0 : 0xFFFFFFFF;          |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | w(i) = Weight of queue #i                                |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | t(i) = Tokens per byte for queue #i, defined as the      |
   |   |            |                 |             | inverse weight of queue #i.                              |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | T(i) = Accumulated numbers of tokens previously consumed |
   |   |            |                 |             | from queue #i.                                           |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | *Next queue* : queue with the smallest T.                |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | Before packet consumption from queue #i:                 |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | *T(i) |= mask(i)*                                        |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | After packet consumption from queue #i:                  |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | T(j) -= T(i), j != i                                     |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | T(i) = pkt_len * t(i)                                    |
   |   |            |                 |             |                                                          |
   |   |            |                 |             | Note: T(j) uses the T(i) value before T(i) is updated.   |
   |   |            |                 |             |                                                          |
   +---+------------+-----------------+-------------+----------------------------------------------------------+
923*41dd9a6bSDavid Young
Subport Traffic Class Oversubscription
""""""""""""""""""""""""""""""""""""""

Problem Statement
'''''''''''''''''

Oversubscription for subport traffic class X is a configuration-time event that occurs when
more bandwidth is allocated for traffic class X at the level of subport member pipes than
allocated for the same traffic class at the parent subport level.

The existence of the oversubscription for a specific subport and
traffic class is solely the result of pipe and
subport-level configuration as opposed to being created due
to dynamic evolution of the traffic load at run-time (as congestion is).

When the overall demand for traffic class X for the current subport is low,
the existence of the oversubscription condition does not represent a problem,
as demand for traffic class X is completely satisfied for all member pipes.
However, this can no longer be achieved when the aggregated demand for traffic class X
for all subport member pipes exceeds the limit configured at the subport level.

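The oversubscription condition described above is pure configuration arithmetic,
so it can be detected with a one-pass check at configuration time.
A minimal sketch (the ``pipe_cfg`` structure, field and function names are illustrative, not part of the DPDK API):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative pipe configuration record (not a DPDK structure). */
struct pipe_cfg {
    uint64_t tc_rate;   /* rate allocated to TC X, bytes/s */
};

/* TC X is oversubscribed for a subport when the sum of the rates
 * allocated to TC X across the member pipes exceeds the rate
 * allocated to TC X at the subport level. */
static int
tc_is_oversubscribed(uint64_t subport_tc_rate,
                     const struct pipe_cfg *pipes, int n_pipes)
{
    uint64_t sum = 0;
    int i;

    for (i = 0; i < n_pipes; i++)
        sum += pipes[i].tc_rate;
    return sum > subport_tc_rate;
}
```

Because the check depends only on configured rates, it can run once at configuration time,
which is exactly why oversubscription is a configuration-time event rather than a run-time one.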
Solution Space
''''''''''''''

:numref:`table_qos_13` summarizes some of the possible approaches for handling this problem,
with the third approach selected for implementation.

.. _table_qos_13:

.. table:: Subport Traffic Class Oversubscription

   +-----+---------------------------+-------------------------------------------------------------------------+
   | No. | Approach                  | Description                                                             |
   |     |                           |                                                                         |
   +=====+===========================+=========================================================================+
   | 1   | Don't care                | First come, first served.                                               |
   |     |                           |                                                                         |
   |     |                           | This approach is not fair amongst subport member pipes, as pipes that   |
   |     |                           | are served first will use up as much bandwidth for TC X as they need,   |
   |     |                           | while pipes that are served later will receive poor service due to      |
   |     |                           | bandwidth for TC X at the subport level being scarce.                   |
   |     |                           |                                                                         |
   +-----+---------------------------+-------------------------------------------------------------------------+
   | 2   | Scale down all pipes      | All pipes within the subport have their bandwidth limit for TC X scaled |
   |     |                           | down by the same factor.                                                |
   |     |                           |                                                                         |
   |     |                           | This approach is not fair among subport member pipes, as the low end    |
   |     |                           | pipes (that is, pipes configured with low bandwidth) can potentially    |
   |     |                           | experience severe service degradation that might render their service   |
   |     |                           | unusable (if available bandwidth for these pipes drops below the        |
   |     |                           | minimum requirements for a workable service), while the service         |
   |     |                           | degradation for high end pipes might not be noticeable at all.          |
   |     |                           |                                                                         |
   +-----+---------------------------+-------------------------------------------------------------------------+
   | 3   | Cap the high demand pipes | Each subport member pipe receives an equal share of the bandwidth       |
   |     |                           | available at run-time for TC X at the subport level. Any bandwidth left |
   |     |                           | unused by the low-demand pipes is redistributed in equal portions to    |
   |     |                           | the high-demand pipes. This way, the high-demand pipes are truncated    |
   |     |                           | while the low-demand pipes are not impacted.                            |
   |     |                           |                                                                         |
   +-----+---------------------------+-------------------------------------------------------------------------+

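The third approach is effectively a max-min fair split of the subport TC X bandwidth among the member pipes:
equal shares first, with any share left unused by the low-demand pipes redistributed to the high-demand ones.
A small illustrative sketch of that allocation policy (the scheduler approximates this at run-time through the watermark mechanism described later in this section, rather than computing it directly):

```c
#include <assert.h>
#include <stdint.h>

/* Max-min fair split of 'total' bandwidth among pipes with the given
 * demands: every active pipe gets an equal share, and any share left
 * unused by low-demand pipes is redistributed to high-demand ones. */
static void
maxmin_alloc(uint64_t total, const uint64_t *demand, uint64_t *alloc, int n)
{
    int done[16] = {0};   /* enough pipes for this sketch */
    int active = n;

    while (active > 0) {
        uint64_t share = total / active;
        int progress = 0;
        int i;

        for (i = 0; i < n; i++) {
            if (!done[i] && demand[i] <= share) {
                alloc[i] = demand[i];   /* fully satisfied */
                total -= demand[i];
                done[i] = 1;
                active--;
                progress = 1;
            }
        }
        if (!progress) {
            /* remaining pipes are all high demand: cap them equally */
            for (i = 0; i < n; i++)
                if (!done[i]) { alloc[i] = share; done[i] = 1; }
            active = 0;
        }
    }
}
```

For demands {10, 50, 100} and a subport TC X budget of 90, the low-demand pipe keeps its full 10,
and the two high-demand pipes are each capped at 40.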
Typically, the subport TC oversubscription feature is enabled only for the lowest priority traffic class,
which usually carries best effort traffic,
with the management plane preventing this condition from occurring for the other (higher priority) traffic classes.

To ease implementation, it is also assumed that the upper limit for the subport best effort TC is set to 100% of the subport rate,
and that the upper limit for the pipe best effort TC is set to 100% of the pipe rate for all subport member pipes.

Implementation Overview
'''''''''''''''''''''''

The algorithm computes a watermark whose purpose is to limit the amount of traffic that each pipe is allowed to send for the best effort TC;
the watermark is periodically updated based on the current demand experienced by the subport member pipes.
The watermark is computed at the subport level at the beginning of each traffic class upper limit enforcement period and
the same value is used by all the subport member pipes throughout the current enforcement period.
:numref:`table_qos_14` illustrates how the watermark computed at the subport level at the beginning of each period is propagated to all subport member pipes.

At the beginning of the current enforcement period (which coincides with the end of the previous enforcement period),
the value of the watermark is adjusted based on how much of the bandwidth allocated to the best effort TC at the beginning of the previous period
was actually consumed by the subport member pipes by the end of that period.

If there was subport best effort TC bandwidth left unused,
the value of the watermark for the current period is increased to encourage the subport member pipes to consume more bandwidth.
Otherwise, the value of the watermark is decreased to enforce equality of bandwidth consumption among subport member pipes for best effort TC.

The increase or decrease in the watermark value is done in small increments,
so several enforcement periods might be required to reach the equilibrium state.
This state can change at any moment due to variations in the demand experienced by the subport member pipes for best effort TC, for example,
as a result of demand increase (when the watermark needs to be lowered) or demand decrease (when the watermark needs to be increased).

When demand is low, the watermark is set high to prevent it from impeding the subport member pipes from consuming more bandwidth.
The highest value for the watermark is picked as the highest rate configured for a subport member pipe.
:numref:`table_qos_14` and :numref:`table_qos_15` illustrate the watermark operation.

.. _table_qos_14:

.. table:: Watermark Propagation from Subport Level to Member Pipes at the Beginning of Each Traffic Class Upper Limit Enforcement Period

   +-----+---------------------------------+----------------------------------------------------+
   | No. | Subport Traffic Class Operation | Description                                        |
   |     |                                 |                                                    |
   +=====+=================================+====================================================+
   | 1   | Initialization                  | **Subport level**: subport_period_id = 0           |
   |     |                                 |                                                    |
   |     |                                 | **Pipe level**: pipe_period_id = 0                 |
   |     |                                 |                                                    |
   +-----+---------------------------------+----------------------------------------------------+
   | 2   | Credit update                   | **Subport level**:                                 |
   |     |                                 |                                                    |
   |     |                                 | if (time >= subport_tc_time)                       |
   |     |                                 |                                                    |
   |     |                                 | {                                                  |
   |     |                                 |     subport_wm = water_mark_update();              |
   |     |                                 |                                                    |
   |     |                                 |     subport_tc_time = time + subport_tc_period;    |
   |     |                                 |                                                    |
   |     |                                 |     subport_period_id++;                           |
   |     |                                 |                                                    |
   |     |                                 | }                                                  |
   |     |                                 |                                                    |
   |     |                                 | **Pipe level:**                                    |
   |     |                                 |                                                    |
   |     |                                 | if (pipe_period_id != subport_period_id)           |
   |     |                                 |                                                    |
   |     |                                 | {                                                  |
   |     |                                 |                                                    |
   |     |                                 |     pipe_ov_credits = subport_wm \* pipe_weight;   |
   |     |                                 |                                                    |
   |     |                                 |     pipe_period_id = subport_period_id;            |
   |     |                                 |                                                    |
   |     |                                 | }                                                  |
   |     |                                 |                                                    |
   +-----+---------------------------------+----------------------------------------------------+
   | 3   | Credit consumption              | **Pipe level:**                                    |
   |     | (on packet scheduling)          |                                                    |
   |     |                                 | pkt_credits = pkt_len + frame_overhead;            |
   |     |                                 |                                                    |
   |     |                                 | if (pipe_ov_credits >= pkt_credits) {              |
   |     |                                 |                                                    |
   |     |                                 |    pipe_ov_credits -= pkt_credits;                 |
   |     |                                 |                                                    |
   |     |                                 | }                                                  |
   |     |                                 |                                                    |
   +-----+---------------------------------+----------------------------------------------------+

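The credit update and credit consumption steps from the table above can be sketched in plain C
(structure and field names mirror the pseudocode, not the actual librte_sched data structures):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state mirroring the pseudocode in the table above. */
struct subport { uint64_t tc_time, tc_period, wm; uint32_t period_id; };
struct pipe    { uint64_t ov_credits; uint32_t period_id, weight; };

/* Subport-level update, once per enforcement period. subport->wm is
 * assumed to have been recomputed by water_mark_update() elsewhere. */
static void
subport_period_update(struct subport *s, uint64_t time)
{
    if (time >= s->tc_time) {
        s->tc_time = time + s->tc_period;
        s->period_id++;
    }
}

/* Each pipe picks up the new watermark the first time it is visited
 * during the current enforcement period. */
static void
pipe_ov_credit_update(struct subport *s, struct pipe *p)
{
    if (p->period_id != s->period_id) {
        p->ov_credits = s->wm * p->weight;
        p->period_id = s->period_id;
    }
}

/* Credit consumption on packet scheduling. Returns 1 if the packet
 * fits within the pipe's best effort allowance for this period. */
static int
pipe_ov_credit_consume(struct pipe *p, uint64_t pkt_len,
                       uint64_t frame_overhead)
{
    uint64_t pkt_credits = pkt_len + frame_overhead;

    if (p->ov_credits >= pkt_credits) {
        p->ov_credits -= pkt_credits;
        return 1;
    }
    return 0;
}
```

Comparing period identifiers (rather than broadcasting the watermark to every pipe)
lets the subport publish one value per period and have each pipe pull it lazily, only when that pipe is actually scheduled.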
.. _table_qos_15:

.. table:: Watermark Calculation

   +-----+------------------+----------------------------------------------------------------------------------+
   | No. | Subport Traffic  | Description                                                                      |
   |     | Class Operation  |                                                                                  |
   +=====+==================+==================================================================================+
   | 1   | Initialization   | **Subport level:**                                                               |
   |     |                  |                                                                                  |
   |     |                  | wm = WM_MAX                                                                      |
   |     |                  |                                                                                  |
   +-----+------------------+----------------------------------------------------------------------------------+
   | 2   | Credit update    | **Subport level (water_mark_update):**                                           |
   |     |                  |                                                                                  |
   |     |                  | tc0_cons = subport_tc0_credits_per_period - subport_tc0_credits;                 |
   |     |                  |                                                                                  |
   |     |                  | tc1_cons = subport_tc1_credits_per_period - subport_tc1_credits;                 |
   |     |                  |                                                                                  |
   |     |                  | tc2_cons = subport_tc2_credits_per_period - subport_tc2_credits;                 |
   |     |                  |                                                                                  |
   |     |                  | tc3_cons = subport_tc3_credits_per_period - subport_tc3_credits;                 |
   |     |                  |                                                                                  |
   |     |                  | tc4_cons = subport_tc4_credits_per_period - subport_tc4_credits;                 |
   |     |                  |                                                                                  |
   |     |                  | tc5_cons = subport_tc5_credits_per_period - subport_tc5_credits;                 |
   |     |                  |                                                                                  |
   |     |                  | tc6_cons = subport_tc6_credits_per_period - subport_tc6_credits;                 |
   |     |                  |                                                                                  |
   |     |                  | tc7_cons = subport_tc7_credits_per_period - subport_tc7_credits;                 |
   |     |                  |                                                                                  |
   |     |                  | tc8_cons = subport_tc8_credits_per_period - subport_tc8_credits;                 |
   |     |                  |                                                                                  |
   |     |                  | tc9_cons = subport_tc9_credits_per_period - subport_tc9_credits;                 |
   |     |                  |                                                                                  |
   |     |                  | tc10_cons = subport_tc10_credits_per_period - subport_tc10_credits;              |
   |     |                  |                                                                                  |
   |     |                  | tc11_cons = subport_tc11_credits_per_period - subport_tc11_credits;              |
   |     |                  |                                                                                  |
   |     |                  | tc_be_cons_max = subport_tc_be_credits_per_period - (tc0_cons + tc1_cons +       |
   |     |                  | tc2_cons + tc3_cons + tc4_cons + tc5_cons + tc6_cons + tc7_cons + tc8_cons +     |
   |     |                  | tc9_cons + tc10_cons + tc11_cons);                                               |
   |     |                  |                                                                                  |
   |     |                  | tc_be_cons = subport_tc_be_credits_per_period - subport_tc_be_credits;           |
   |     |                  |                                                                                  |
   |     |                  | if (tc_be_cons > (tc_be_cons_max - MTU)) {                                       |
   |     |                  |                                                                                  |
   |     |                  |     wm -= wm >> 7;                                                               |
   |     |                  |                                                                                  |
   |     |                  |     if (wm < WM_MIN) wm = WM_MIN;                                                |
   |     |                  |                                                                                  |
   |     |                  | } else {                                                                         |
   |     |                  |                                                                                  |
   |     |                  |    wm += (wm >> 7) + 1;                                                          |
   |     |                  |                                                                                  |
   |     |                  |    if (wm > WM_MAX) wm = WM_MAX;                                                 |
   |     |                  |                                                                                  |
   |     |                  | }                                                                                |
   |     |                  |                                                                                  |
   +-----+------------------+----------------------------------------------------------------------------------+

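The watermark calculation above boils down to a multiplicative nudge of roughly 1/128 per enforcement period,
clamped between a floor and a ceiling. A condensed sketch (the WM_MIN/WM_MAX values and the consumption inputs are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define WM_MIN 64        /* illustrative floor */
#define WM_MAX 1000000   /* illustrative ceiling */

/* One watermark update step, as in the table above: shrink the
 * watermark by ~1/128 when the best effort TC consumed (nearly) all
 * the bandwidth available to it, grow it by ~1/128 otherwise. */
static uint64_t
water_mark_update(uint64_t wm, uint64_t tc_be_cons,
                  uint64_t tc_be_cons_max, uint64_t mtu)
{
    if (tc_be_cons > tc_be_cons_max - mtu) {
        wm -= wm >> 7;          /* demand is high: tighten the cap */
        if (wm < WM_MIN)
            wm = WM_MIN;
    } else {
        wm += (wm >> 7) + 1;    /* bandwidth left unused: relax the cap */
        if (wm > WM_MAX)
            wm = WM_MAX;
    }
    return wm;
}
```

The ``+ 1`` in the growth branch guarantees progress even when ``wm >> 7`` rounds to zero,
which is why the watermark can always climb back from its minimum.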
Worst Case Scenarios for Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Lots of Active Queues with Not Enough Credits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The more queues the scheduler has to examine for packets and credits in order to select one packet,
the lower the performance of the scheduler is.

The scheduler maintains the bitmap of active queues, which skips the non-active queues,
but in order to detect whether a specific pipe has enough credits,
the pipe has to be drilled down using the pipe dequeue state machine,
which consumes cycles regardless of the scheduling result
(whether zero packets or at least one packet is produced).

This scenario stresses the importance of the policer for scheduler performance:
if the pipe does not have enough credits,
its packets should be dropped as soon as possible (before they reach the hierarchical scheduler),
thus rendering the pipe queues as not active,
which allows the dequeue side to skip that pipe with no cycles being spent on investigating the pipe credits
that would result in a "not enough credits" status.

Single Queue with 100% Line Rate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The port scheduler performance is optimized for a large number of queues.
If the number of queues is small,
then the performance of the port scheduler for the same level of active traffic is expected to be worse than
the performance of a small set of message passing queues.

.. _Droppers:

Droppers
--------

The purpose of the DPDK dropper is to drop packets arriving at a packet scheduler to avoid congestion.
The dropper supports the Proportional Integral Controller Enhanced (PIE), Random Early Detection (RED),
Weighted Random Early Detection (WRED) and tail drop algorithms.
:numref:`figure_blk_diag_dropper` illustrates how the dropper integrates with the scheduler.
The DPDK currently does not support congestion management,
so the dropper provides the only method for congestion avoidance.

.. _figure_blk_diag_dropper:

.. figure:: ../img/blk_diag_dropper.*

   High-level Block Diagram of the DPDK Dropper


The dropper uses one of two congestion avoidance algorithms:
   - the Random Early Detection (RED) algorithm, as documented in the reference publication.
   - the Proportional Integral Controller Enhanced (PIE) algorithm, as documented in RFC 8033.

The purpose of the RED/PIE algorithm is to monitor a packet queue,
determine the current congestion level in the queue and decide whether an arriving packet should be enqueued or dropped.

The RED algorithm uses an Exponential Weighted Moving Average (EWMA) filter to compute the average queue size,
which gives an indication of the current congestion level in the queue.

For each enqueue operation, the RED algorithm compares the average queue size to minimum and maximum thresholds.
Depending on whether the average queue size is below, above or in between these thresholds,
the RED algorithm calculates the probability that an arriving packet should be dropped and
makes a random decision based on this probability.

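The threshold comparison just described can be sketched as follows
(a simplified floating-point illustration of the classic RED decision;
the DPDK implementation works in fixed point and maintains additional state):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Classic RED enqueue decision against min/max thresholds.
 * inv_mark_prob is the inverse mark probability (e.g. 10 => 1/10).
 * Returns 1 to enqueue the packet, 0 to drop it. */
static int
red_enqueue_decision(double avg_qsize, double min_th, double max_th,
                     unsigned inv_mark_prob)
{
    double p_drop;

    if (avg_qsize < min_th)
        return 1;               /* no congestion: always enqueue */
    if (avg_qsize >= max_th)
        return 0;               /* heavy congestion: always drop */

    /* In between: drop probability grows linearly from 0 at min_th
     * up to 1/inv_mark_prob at max_th. */
    p_drop = (avg_qsize - min_th) / (max_th - min_th)
             / (double)inv_mark_prob;
    return ((double)rand() / RAND_MAX) >= p_drop;
}
```

Only the middle region involves randomness; below the minimum threshold and at or above the maximum threshold the decision is deterministic.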
The dropper also supports Weighted Random Early Detection (WRED) by allowing the scheduler to select
different RED configurations for the same packet queue at run-time.
In the case of severe congestion, the dropper resorts to tail drop.
This occurs when a packet queue has reached maximum capacity and cannot store any more packets.
In this situation, all arriving packets are dropped.

The flow through the dropper is illustrated in :numref:`figure_flow_tru_dropper`.
The RED/WRED/PIE algorithm is exercised first and tail drop second.

.. _figure_flow_tru_dropper:

.. figure:: ../img/flow_tru_dropper.*

   Flow Through the Dropper

The PIE algorithm periodically updates the drop probability based on latency samples.
The update considers not only the current latency sample but also whether the latency is trending up or down.
This is the classical Proportional Integral (PI) controller method, which is known for
eliminating steady-state errors.

When a congestion period ends, we might be left with a high drop probability with light
packet arrivals. Hence, the PIE algorithm includes a mechanism by which the drop probability
decays exponentially (rather than linearly) when the system is not congested.
This helps the drop probability converge to zero more quickly, while the PI controller ensures
that it eventually reaches zero.

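Per RFC 8033, the drop probability moves with both the deviation of the queue delay from its target (the proportional term)
and the delay trend (the integral-style term), plus the exponential decay described above.
A simplified sketch; the alpha/beta gains shown here are illustrative, not the tuned values from the RFC:

```c
#include <assert.h>

/* One PIE drop-probability update (RFC 8033, simplified):
 * the proportional term pulls toward the target delay, the trend term
 * reacts to rising or falling latency, and an exponential decay drives
 * the probability down quickly once the queue stays empty. */
static double
pie_prob_update(double p, double qdelay, double qdelay_old,
                double target, double alpha, double beta)
{
    p += alpha * (qdelay - target) + beta * (qdelay - qdelay_old);

    if (qdelay == 0.0 && qdelay_old == 0.0)
        p *= 0.98;              /* exponential decay when not congested */

    if (p < 0.0) p = 0.0;       /* keep the probability in [0, 1] */
    if (p > 1.0) p = 1.0;
    return p;
}
```

With the multiplicative 0.98 factor, the probability halves in roughly 34 idle update intervals,
much faster than a fixed linear decrement once the probability is small.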
The use cases supported by the dropper are:

*   Initialize configuration data

*   Initialize run-time data

*   Enqueue (make a decision to enqueue or drop an arriving packet)

*   Mark empty (record the time at which a packet queue becomes empty)

The configuration use case is explained in :ref:`Section 2.23.3.1 <Configuration>`,
the enqueue operation is explained in :ref:`Section 2.23.3.2 <Enqueue_Operation>`
and the mark empty operation is explained in :ref:`Section 2.23.3.3 <Queue_Empty_Operation>`.

.. _Configuration:

Configuration
~~~~~~~~~~~~~

A RED configuration contains the parameters given in :numref:`table_qos_16`.

.. _table_qos_16:

.. table:: RED Configuration Parameters

   +--------------------------+---------+---------+------------------+
   | Parameter                | Minimum | Maximum | Typical          |
   |                          |         |         |                  |
   +==========================+=========+=========+==================+
   | Minimum Threshold        | 0       | 1022    | 1/4 x queue size |
   |                          |         |         |                  |
   +--------------------------+---------+---------+------------------+
   | Maximum Threshold        | 1       | 1023    | 1/2 x queue size |
   |                          |         |         |                  |
   +--------------------------+---------+---------+------------------+
   | Inverse Mark Probability | 1       | 255     | 10               |
   |                          |         |         |                  |
   +--------------------------+---------+---------+------------------+
   | EWMA Filter Weight       | 1       | 12      | 9                |
   |                          |         |         |                  |
   +--------------------------+---------+---------+------------------+

1261*41dd9a6bSDavid YoungThe meaning of these parameters is explained in more detail in the following sections.
1262*41dd9a6bSDavid YoungThe format of these parameters as specified to the dropper module API
1263*41dd9a6bSDavid Youngcorresponds to the format used by Cisco* in their RED implementation.
1264*41dd9a6bSDavid YoungThe minimum and maximum threshold parameters are specified to the dropper module in terms of number of packets.
1265*41dd9a6bSDavid YoungThe mark probability parameter is specified as an inverse value, for example,
1266*41dd9a6bSDavid Youngan inverse mark probability parameter value of 10 corresponds
1267*41dd9a6bSDavid Youngto a mark probability of 1/10 (that is, 1 in 10 packets will be dropped).
1268*41dd9a6bSDavid YoungThe EWMA filter weight parameter is specified as an inverse log value,
for example, a filter weight parameter value of 9 corresponds to a filter weight of 1/2^9.
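
For illustration only, a standalone sketch (with hypothetical helper names, not the dropper API) converting the inverse parameters back to their real values:

```c
#include <assert.h>

/* Convert an inverse mark probability (e.g. 10) to the real probability (1/10). */
static double mark_probability(unsigned int maxp_inv)
{
    return 1.0 / (double)maxp_inv;
}

/* Convert the inverse log filter weight parameter n to wq = 1/2^n. */
static double filter_weight(unsigned int wq_log2)
{
    return 1.0 / (double)(1u << wq_log2);
}
```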
1270*41dd9a6bSDavid Young
1271*41dd9a6bSDavid YoungA PIE configuration contains the parameters given in :numref:`table_qos_16a`.
1272*41dd9a6bSDavid Young
1273*41dd9a6bSDavid Young.. _table_qos_16a:
1274*41dd9a6bSDavid Young
1275*41dd9a6bSDavid Young.. table:: PIE Configuration Parameters
1276*41dd9a6bSDavid Young
1277*41dd9a6bSDavid Young   +--------------------------+---------+---------+------------------+
1278*41dd9a6bSDavid Young   | Parameter                | Minimum | Maximum | Default          |
1279*41dd9a6bSDavid Young   |                          |         |         |                  |
1280*41dd9a6bSDavid Young   +==========================+=========+=========+==================+
1281*41dd9a6bSDavid Young   | Queue delay reference    | 1       | uint16  | 15               |
1282*41dd9a6bSDavid Young   | Latency Target Value     |         |         |                  |
1283*41dd9a6bSDavid Young   | Unit: ms                 |         |         |                  |
1284*41dd9a6bSDavid Young   +--------------------------+---------+---------+------------------+
1285*41dd9a6bSDavid Young   | Max Burst Allowance      | 1       | uint16  | 150              |
1286*41dd9a6bSDavid Young   | Unit: ms                 |         |         |                  |
1287*41dd9a6bSDavid Young   +--------------------------+---------+---------+------------------+
1288*41dd9a6bSDavid Young   | Tail Drop Threshold      | 1       | uint16  | 64               |
1289*41dd9a6bSDavid Young   | Unit: bytes              |         |         |                  |
1290*41dd9a6bSDavid Young   +--------------------------+---------+---------+------------------+
1291*41dd9a6bSDavid Young   | Period to calculate      | 1       | uint16  | 15               |
1292*41dd9a6bSDavid Young   | drop probability         |         |         |                  |
1293*41dd9a6bSDavid Young   | Unit: ms                 |         |         |                  |
1294*41dd9a6bSDavid Young   +--------------------------+---------+---------+------------------+
1295*41dd9a6bSDavid Young
The meaning of these parameters is explained in more detail in the next sections.
The format shown is the format in which these parameters are passed to the dropper module API.
Applications may also compute these values at run time for fine tuning.
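
For orientation, the PIE control law (RFC 8033) periodically adjusts a drop probability based on how far the queue delay is from the latency target and on the delay trend. A simplified standalone sketch with illustrative gain constants (the library's internal scaling differs):

```c
#include <assert.h>

/* Simplified PIE control law (RFC 8033): adjust drop probability p once per
 * update period. alpha weights the deviation from the delay target, beta
 * weights the delay trend; the gain values below are illustrative only. */
static double pie_update(double p, double qdelay_ms, double qdelay_old_ms,
                         double target_ms)
{
    const double alpha = 0.125;
    const double beta = 1.25;

    p += alpha * (qdelay_ms - target_ms) / 1000.0
       + beta * (qdelay_ms - qdelay_old_ms) / 1000.0;

    if (p < 0.0)
        p = 0.0;
    if (p > 1.0)
        p = 1.0;
    return p;
}
```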
1299*41dd9a6bSDavid Young
1300*41dd9a6bSDavid Young.. _Enqueue_Operation:
1301*41dd9a6bSDavid Young
1302*41dd9a6bSDavid YoungEnqueue Operation
1303*41dd9a6bSDavid Young~~~~~~~~~~~~~~~~~
1304*41dd9a6bSDavid Young
1305*41dd9a6bSDavid YoungIn the example shown in :numref:`figure_ex_data_flow_tru_dropper`, q (actual queue size) is the input value,
1306*41dd9a6bSDavid Youngavg (average queue size) and count (number of packets since the last drop) are run-time values,
1307*41dd9a6bSDavid Youngdecision is the output value and the remaining values are configuration parameters.
1308*41dd9a6bSDavid Young
1309*41dd9a6bSDavid Young.. _figure_ex_data_flow_tru_dropper:
1310*41dd9a6bSDavid Young
1311*41dd9a6bSDavid Young.. figure:: ../img/ex_data_flow_tru_dropper.*
1312*41dd9a6bSDavid Young
1313*41dd9a6bSDavid Young   Example Data Flow Through Dropper
1314*41dd9a6bSDavid Young
1315*41dd9a6bSDavid Young
1316*41dd9a6bSDavid YoungEWMA Filter Microblock
1317*41dd9a6bSDavid Young^^^^^^^^^^^^^^^^^^^^^^
1318*41dd9a6bSDavid Young
1319*41dd9a6bSDavid YoungThe purpose of the EWMA Filter microblock is to filter queue size values to smooth out transient changes
1320*41dd9a6bSDavid Youngthat result from "bursty" traffic.
1321*41dd9a6bSDavid YoungThe output value is the average queue size which gives a more stable view of the current congestion level in the queue.
1322*41dd9a6bSDavid Young
1323*41dd9a6bSDavid YoungThe EWMA filter has one configuration parameter, filter weight, which determines how quickly
1324*41dd9a6bSDavid Youngor slowly the average queue size output responds to changes in the actual queue size input.
1325*41dd9a6bSDavid YoungHigher values of filter weight mean that the average queue size responds more quickly to changes in actual queue size.
1326*41dd9a6bSDavid Young
1327*41dd9a6bSDavid YoungAverage Queue Size Calculation when the Queue is not Empty
1328*41dd9a6bSDavid Young""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1329*41dd9a6bSDavid Young
1330*41dd9a6bSDavid YoungThe definition of the EWMA filter is given in the following equation.
1331*41dd9a6bSDavid Young
1332*41dd9a6bSDavid Young.. image:: ../img/ewma_filter_eq_1.*
1333*41dd9a6bSDavid Young
1334*41dd9a6bSDavid YoungWhere:
1335*41dd9a6bSDavid Young
1336*41dd9a6bSDavid Young*   *avg*  = average queue size
1337*41dd9a6bSDavid Young
1338*41dd9a6bSDavid Young*   *wq*   = filter weight
1339*41dd9a6bSDavid Young
1340*41dd9a6bSDavid Young*   *q*    = actual queue size
1341*41dd9a6bSDavid Young
1342*41dd9a6bSDavid Young.. note::
1343*41dd9a6bSDavid Young
    The filter weight, wq = 1/2^n, where n is the filter weight parameter value passed to the dropper module
    on configuration (see :ref:`Section 2.23.3.1 <Configuration>`).
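
With wq = 1/2^n, the EWMA update can be evaluated with shifts and adds only, by keeping the average in fixed point with n fractional bits. A minimal standalone sketch (the dropper's internal fixed-point format may differ):

```c
#include <assert.h>
#include <stdint.h>

/* EWMA update avg = (1 - wq) * avg + wq * q with wq = 1/2^n.
 * avg_fp stores avg with n fractional bits, so the update reduces to
 * avg_fp = avg_fp - (avg_fp >> n) + q. */
static uint32_t ewma_update(uint32_t avg_fp, uint32_t q, unsigned int n)
{
    return avg_fp - (avg_fp >> n) + q;
}

/* Recover the integer average queue size from the fixed-point value. */
static uint32_t ewma_value(uint32_t avg_fp, unsigned int n)
{
    return avg_fp >> n;
}
```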
1346*41dd9a6bSDavid Young
1347*41dd9a6bSDavid YoungAverage Queue Size Calculation when the Queue is Empty
1348*41dd9a6bSDavid Young^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1349*41dd9a6bSDavid Young
1350*41dd9a6bSDavid YoungThe EWMA filter does not read time stamps and instead assumes that enqueue operations will happen quite regularly.
1351*41dd9a6bSDavid YoungSpecial handling is required when the queue becomes empty as the queue could be empty for a short time or a long time.
1352*41dd9a6bSDavid YoungWhen the queue becomes empty, average queue size should decay gradually to zero instead of dropping suddenly to zero
1353*41dd9a6bSDavid Youngor remaining stagnant at the last computed value.
1354*41dd9a6bSDavid YoungWhen a packet is enqueued on an empty queue, the average queue size is computed using the following formula:
1355*41dd9a6bSDavid Young
1356*41dd9a6bSDavid Young.. image:: ../img/ewma_filter_eq_2.*
1357*41dd9a6bSDavid Young
1358*41dd9a6bSDavid YoungWhere:
1359*41dd9a6bSDavid Young
1360*41dd9a6bSDavid Young*   *m*   = the number of enqueue operations that could have occurred on this queue while the queue was empty
1361*41dd9a6bSDavid Young
1362*41dd9a6bSDavid YoungIn the dropper module, *m* is defined as:
1363*41dd9a6bSDavid Young
1364*41dd9a6bSDavid Young.. image:: ../img/m_definition.*
1365*41dd9a6bSDavid Young
1366*41dd9a6bSDavid YoungWhere:
1367*41dd9a6bSDavid Young
1368*41dd9a6bSDavid Young*   *time*  = current time
1369*41dd9a6bSDavid Young
1370*41dd9a6bSDavid Young*   *qtime* = time the queue became empty
1371*41dd9a6bSDavid Young
1372*41dd9a6bSDavid Young*   *s* = typical time between successive enqueue operations on this queue
1373*41dd9a6bSDavid Young
1374*41dd9a6bSDavid YoungThe time reference is in units of bytes,
1375*41dd9a6bSDavid Youngwhere a byte signifies the time duration required by the physical interface to send out a byte on the transmission medium
1376*41dd9a6bSDavid Young(see Section `Internal Time Reference`_).
1377*41dd9a6bSDavid YoungThe parameter s is defined in the dropper module as a constant with the value: s=2^22.
1378*41dd9a6bSDavid YoungThis corresponds to the time required by every leaf node in a hierarchy with 64K leaf nodes
1379*41dd9a6bSDavid Youngto transmit one 64-byte packet onto the wire and represents the worst case scenario.
1380*41dd9a6bSDavid YoungFor much smaller scheduler hierarchies,
1381*41dd9a6bSDavid Youngit may be necessary to reduce the parameter s, which is defined in the red header source file (rte_red.h) as:
1382*41dd9a6bSDavid Young
1383*41dd9a6bSDavid Young.. code-block:: c
1384*41dd9a6bSDavid Young
1385*41dd9a6bSDavid Young    #define RTE_RED_S
1386*41dd9a6bSDavid Young
1387*41dd9a6bSDavid YoungSince the time reference is in bytes, the port speed is implied in the expression: *time-qtime*.
1388*41dd9a6bSDavid YoungThe dropper does not have to be configured with the actual port speed.
1389*41dd9a6bSDavid YoungIt adjusts automatically to low speed and high speed links.
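
As an illustration, m can then be obtained from the byte-denominated time stamps with a single shift (a standalone sketch; the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Number of enqueue operations that could have occurred while the queue
 * was empty. Time is measured in bytes; s = 2^22 bytes is the worst-case
 * interval between successive enqueues on a 64K-leaf hierarchy. */
static uint64_t missed_enqueues(uint64_t time, uint64_t qtime)
{
    const unsigned int s_log2 = 22; /* s = 2^22 */

    return (time - qtime) >> s_log2;
}
```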
1390*41dd9a6bSDavid Young
1391*41dd9a6bSDavid YoungImplementation
1392*41dd9a6bSDavid Young""""""""""""""
1393*41dd9a6bSDavid Young
1394*41dd9a6bSDavid YoungA numerical method is used to compute the factor (1-wq)^m that appears in Equation 2.
1395*41dd9a6bSDavid Young
1396*41dd9a6bSDavid YoungThis method is based on the following identity:
1397*41dd9a6bSDavid Young
1398*41dd9a6bSDavid Young.. image:: ../img/eq2_factor.*
1399*41dd9a6bSDavid Young
1400*41dd9a6bSDavid Young
1401*41dd9a6bSDavid YoungThis allows us to express the following:
1402*41dd9a6bSDavid Young
1403*41dd9a6bSDavid Young.. image:: ../img/eq2_expression.*
1404*41dd9a6bSDavid Young
1405*41dd9a6bSDavid Young
1406*41dd9a6bSDavid YoungIn the dropper module, a look-up table is used to compute log2(1-wq) for each value of wq supported by the dropper module.
1407*41dd9a6bSDavid YoungThe factor (1-wq)^m can then be obtained by multiplying the table value by *m* and applying shift operations.
1408*41dd9a6bSDavid YoungTo avoid overflow in the multiplication, the value, *m*, and the look-up table values are limited to 16 bits.
1409*41dd9a6bSDavid YoungThe total size of the look-up table is 56 bytes.
1410*41dd9a6bSDavid YoungOnce the factor (1-wq)^m is obtained using this method, the average queue size can be calculated from Equation 2.
1411*41dd9a6bSDavid Young
1412*41dd9a6bSDavid YoungAlternative Approaches
1413*41dd9a6bSDavid Young""""""""""""""""""""""
1414*41dd9a6bSDavid Young
1415*41dd9a6bSDavid YoungOther methods for calculating the factor (1-wq)^m in the expression for computing average queue size
1416*41dd9a6bSDavid Youngwhen the queue is empty (Equation 2) were considered.
1417*41dd9a6bSDavid YoungThese approaches include:
1418*41dd9a6bSDavid Young
1419*41dd9a6bSDavid Young*   Floating-point evaluation
1420*41dd9a6bSDavid Young
1421*41dd9a6bSDavid Young*   Fixed-point evaluation using a small look-up table (512B) and up to 16 multiplications
1422*41dd9a6bSDavid Young    (this is the approach used in the FreeBSD* ALTQ RED implementation)
1423*41dd9a6bSDavid Young
1424*41dd9a6bSDavid Young*   Fixed-point evaluation using a small look-up table (512B) and 16 SSE multiplications
1425*41dd9a6bSDavid Young    (SSE optimized version of the approach used in the FreeBSD* ALTQ RED implementation)
1426*41dd9a6bSDavid Young
1427*41dd9a6bSDavid Young*   Large look-up table (76 KB)
1428*41dd9a6bSDavid Young
The method that was finally selected (described above in Section 26.3.2.2.1) outperforms all of these approaches
in terms of run-time performance and memory requirements and
also achieves accuracy comparable to floating-point evaluation.
1432*41dd9a6bSDavid Young:numref:`table_qos_17` lists the performance of each of these alternative approaches relative to the method that is used in the dropper.
1433*41dd9a6bSDavid YoungAs can be seen, the floating-point implementation achieved the worst performance.
1434*41dd9a6bSDavid Young
1435*41dd9a6bSDavid Young.. _table_qos_17:
1436*41dd9a6bSDavid Young
1437*41dd9a6bSDavid Young.. table:: Relative Performance of Alternative Approaches
1438*41dd9a6bSDavid Young
1439*41dd9a6bSDavid Young   +------------------------------------------------------------------------------------+----------------------+
1440*41dd9a6bSDavid Young   | Method                                                                             | Relative Performance |
1441*41dd9a6bSDavid Young   |                                                                                    |                      |
1442*41dd9a6bSDavid Young   +====================================================================================+======================+
1443*41dd9a6bSDavid Young   | Current dropper method (see :ref:`Section 23.3.2.1.3 <Droppers>`)                  | 100%                 |
1444*41dd9a6bSDavid Young   |                                                                                    |                      |
1445*41dd9a6bSDavid Young   +------------------------------------------------------------------------------------+----------------------+
1446*41dd9a6bSDavid Young   | Fixed-point method with small (512B) look-up table                                 | 148%                 |
1447*41dd9a6bSDavid Young   |                                                                                    |                      |
1448*41dd9a6bSDavid Young   +------------------------------------------------------------------------------------+----------------------+
1449*41dd9a6bSDavid Young   | SSE method with small (512B) look-up table                                         | 114%                 |
1450*41dd9a6bSDavid Young   |                                                                                    |                      |
1451*41dd9a6bSDavid Young   +------------------------------------------------------------------------------------+----------------------+
1452*41dd9a6bSDavid Young   | Large (76KB) look-up table                                                         | 118%                 |
1453*41dd9a6bSDavid Young   |                                                                                    |                      |
1454*41dd9a6bSDavid Young   +------------------------------------------------------------------------------------+----------------------+
1455*41dd9a6bSDavid Young   | Floating-point                                                                     | 595%                 |
1456*41dd9a6bSDavid Young   |                                                                                    |                      |
1457*41dd9a6bSDavid Young   +------------------------------------------------------------------------------------+----------------------+
1458*41dd9a6bSDavid Young   | **Note**: In this case, since performance is expressed as time spent executing the operation in a         |
1459*41dd9a6bSDavid Young   | specific condition, any relative performance value above 100% runs slower than the reference method.      |
1460*41dd9a6bSDavid Young   |                                                                                                           |
1461*41dd9a6bSDavid Young   +-----------------------------------------------------------------------------------------------------------+
1462*41dd9a6bSDavid Young
1463*41dd9a6bSDavid YoungDrop Decision Block
1464*41dd9a6bSDavid Young^^^^^^^^^^^^^^^^^^^
1465*41dd9a6bSDavid Young
1466*41dd9a6bSDavid YoungThe Drop Decision block:
1467*41dd9a6bSDavid Young
1468*41dd9a6bSDavid Young*   Compares the average queue size with the minimum and maximum thresholds
1469*41dd9a6bSDavid Young
1470*41dd9a6bSDavid Young*   Calculates a packet drop probability
1471*41dd9a6bSDavid Young
1472*41dd9a6bSDavid Young*   Makes a random decision to enqueue or drop an arriving packet
1473*41dd9a6bSDavid Young
1474*41dd9a6bSDavid YoungThe calculation of the drop probability occurs in two stages.
1475*41dd9a6bSDavid YoungAn initial drop probability is calculated based on the average queue size,
1476*41dd9a6bSDavid Youngthe minimum and maximum thresholds and the mark probability.
1477*41dd9a6bSDavid YoungAn actual drop probability is then computed from the initial drop probability.
1478*41dd9a6bSDavid YoungThe actual drop probability takes the count run-time value into consideration
1479*41dd9a6bSDavid Youngso that the actual drop probability increases as more packets arrive to the packet queue
1480*41dd9a6bSDavid Youngsince the last packet was dropped.
1481*41dd9a6bSDavid Young
1482*41dd9a6bSDavid YoungInitial Packet Drop Probability
1483*41dd9a6bSDavid Young"""""""""""""""""""""""""""""""
1484*41dd9a6bSDavid Young
1485*41dd9a6bSDavid YoungThe initial drop probability is calculated using the following equation.
1486*41dd9a6bSDavid Young
1487*41dd9a6bSDavid Young.. image:: ../img/drop_probability_eq3.*
1488*41dd9a6bSDavid Young
1489*41dd9a6bSDavid YoungWhere:
1490*41dd9a6bSDavid Young
1491*41dd9a6bSDavid Young*   *maxp*  = mark probability
1492*41dd9a6bSDavid Young
1493*41dd9a6bSDavid Young*   *avg*  = average queue size
1494*41dd9a6bSDavid Young
1495*41dd9a6bSDavid Young*   *minth*  = minimum threshold
1496*41dd9a6bSDavid Young
1497*41dd9a6bSDavid Young*   *maxth*  = maximum threshold
1498*41dd9a6bSDavid Young
1499*41dd9a6bSDavid YoungThe calculation of the packet drop probability using Equation 3 is illustrated in :numref:`figure_pkt_drop_probability`.
1500*41dd9a6bSDavid YoungIf the average queue size is below the minimum threshold, an arriving packet is enqueued.
1501*41dd9a6bSDavid YoungIf the average queue size is at or above the maximum threshold, an arriving packet is dropped.
1502*41dd9a6bSDavid YoungIf the average queue size is between the minimum and maximum thresholds,
1503*41dd9a6bSDavid Younga drop probability is calculated to determine if the packet should be enqueued or dropped.
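
Equation 3 amounts to a linear interpolation between the two thresholds, scaled by the mark probability. A standalone floating-point sketch of the three cases above:

```c
#include <assert.h>

/* Initial drop probability pb (Equation 3): 0 below the minimum threshold,
 * 1 at or above the maximum threshold, and a linear ramp scaled by the mark
 * probability (given here as an inverse value) in between. */
static double initial_drop_probability(double avg, double minth, double maxth,
                                       unsigned int maxp_inv)
{
    if (avg < minth)
        return 0.0;
    if (avg >= maxth)
        return 1.0;
    return ((avg - minth) / (maxth - minth)) / (double)maxp_inv;
}
```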
1504*41dd9a6bSDavid Young
1505*41dd9a6bSDavid Young.. _figure_pkt_drop_probability:
1506*41dd9a6bSDavid Young
1507*41dd9a6bSDavid Young.. figure:: ../img/pkt_drop_probability.*
1508*41dd9a6bSDavid Young
1509*41dd9a6bSDavid Young   Packet Drop Probability for a Given RED Configuration
1510*41dd9a6bSDavid Young
1511*41dd9a6bSDavid Young
1512*41dd9a6bSDavid YoungActual Drop Probability
1513*41dd9a6bSDavid Young"""""""""""""""""""""""
1514*41dd9a6bSDavid Young
1515*41dd9a6bSDavid YoungIf the average queue size is between the minimum and maximum thresholds,
1516*41dd9a6bSDavid Youngthen the actual drop probability is calculated from the following equation.
1517*41dd9a6bSDavid Young
1518*41dd9a6bSDavid Young.. image:: ../img/drop_probability_eq4.*
1519*41dd9a6bSDavid Young
1520*41dd9a6bSDavid YoungWhere:
1521*41dd9a6bSDavid Young
1522*41dd9a6bSDavid Young*   *Pb*  = initial drop probability (from Equation 3)
1523*41dd9a6bSDavid Young
1524*41dd9a6bSDavid Young*   *count* = number of packets that have arrived since the last drop
1525*41dd9a6bSDavid Young
The constant 2 in Equation 4 is the only deviation from the drop probability formula
given in the reference document, where a value of 1 is used instead.
It should be noted that the value pa computed from Equation 4 can be negative or greater than 1.
If this is the case, then a value of 1 should be used instead.
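
A standalone sketch of Equation 4 together with the clamping just described:

```c
#include <assert.h>

/* Actual drop probability pa (Equation 4): pa = pb / (2 - count * pb),
 * where count is the number of packets since the last drop. Results that
 * are negative or exceed 1 are clamped to 1. */
static double actual_drop_probability(double pb, unsigned int count)
{
    double denom = 2.0 - (double)count * pb;
    double pa;

    if (denom <= 0.0)
        return 1.0;
    pa = pb / denom;
    return pa > 1.0 ? 1.0 : pa;
}
```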
1530*41dd9a6bSDavid Young
The initial and actual drop probabilities are shown in :numref:`figure_drop_probability_graph`.
The actual drop probability is shown for the case where
the formula given in the reference document is used (blue curve)
and also for the case where the formula implemented in the dropper module
is used (red curve).
1536*41dd9a6bSDavid YoungThe formula in the reference document results in a significantly higher drop rate
1537*41dd9a6bSDavid Youngcompared to the mark probability configuration parameter specified by the user.
1538*41dd9a6bSDavid YoungThe choice to deviate from the reference document is simply a design decision and
1539*41dd9a6bSDavid Youngone that has been taken by other RED implementations, for example, FreeBSD* ALTQ RED.
1540*41dd9a6bSDavid Young
1541*41dd9a6bSDavid Young.. _figure_drop_probability_graph:
1542*41dd9a6bSDavid Young
1543*41dd9a6bSDavid Young.. figure:: ../img/drop_probability_graph.*
1544*41dd9a6bSDavid Young
1545*41dd9a6bSDavid Young   Initial Drop Probability (pb), Actual Drop probability (pa) Computed Using
1546*41dd9a6bSDavid Young   a Factor 1 (Blue Curve) and a Factor 2 (Red Curve)
1547*41dd9a6bSDavid Young
1548*41dd9a6bSDavid Young
1549*41dd9a6bSDavid Young.. _Queue_Empty_Operation:
1550*41dd9a6bSDavid Young
1551*41dd9a6bSDavid YoungQueue Empty Operation
1552*41dd9a6bSDavid Young~~~~~~~~~~~~~~~~~~~~~
1553*41dd9a6bSDavid Young
1554*41dd9a6bSDavid YoungThe time at which a packet queue becomes empty must be recorded and saved with the RED run-time data
1555*41dd9a6bSDavid Youngso that the EWMA filter block can calculate the average queue size on the next enqueue operation.
1556*41dd9a6bSDavid YoungIt is the responsibility of the calling application to inform the dropper module
1557*41dd9a6bSDavid Youngthrough the API that a queue has become empty.
1558*41dd9a6bSDavid Young
1559*41dd9a6bSDavid YoungSource Files Location
1560*41dd9a6bSDavid Young~~~~~~~~~~~~~~~~~~~~~
1561*41dd9a6bSDavid Young
1562*41dd9a6bSDavid YoungThe source files for the DPDK dropper are located at:
1563*41dd9a6bSDavid Young
1564*41dd9a6bSDavid Young*   DPDK/lib/sched/rte_red.h
1565*41dd9a6bSDavid Young
1566*41dd9a6bSDavid Young*   DPDK/lib/sched/rte_red.c
1567*41dd9a6bSDavid Young
1568*41dd9a6bSDavid YoungIntegration with the DPDK QoS Scheduler
1569*41dd9a6bSDavid Young~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1570*41dd9a6bSDavid Young
1571*41dd9a6bSDavid YoungRED functionality in the DPDK QoS scheduler is disabled by default.
1572*41dd9a6bSDavid YoungThe parameter is found in the build configuration files in the DPDK/config directory.
1573*41dd9a6bSDavid YoungRED configuration parameters are specified in the rte_red_params structure within the rte_sched_port_params structure
1574*41dd9a6bSDavid Youngthat is passed to the scheduler on initialization.
1575*41dd9a6bSDavid YoungRED parameters are specified separately for four traffic classes and three packet colors (green, yellow and red)
1576*41dd9a6bSDavid Youngallowing the scheduler to implement Weighted Random Early Detection (WRED).
1577*41dd9a6bSDavid Young
1578*41dd9a6bSDavid YoungIntegration with the DPDK QoS Scheduler Sample Application
1579*41dd9a6bSDavid Young~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1580*41dd9a6bSDavid Young
1581*41dd9a6bSDavid YoungThe DPDK QoS Scheduler Application reads a configuration file on start-up.
1582*41dd9a6bSDavid YoungThe configuration file includes a section containing RED parameters.
The format of these parameters is described in :ref:`Section 2.23.3.1 <Configuration>`.
1584*41dd9a6bSDavid YoungA sample RED configuration is shown below. In this example, the queue size is 64 packets.
1585*41dd9a6bSDavid Young
1586*41dd9a6bSDavid Young.. note::
1587*41dd9a6bSDavid Young
1588*41dd9a6bSDavid Young    For correct operation, the same EWMA filter weight parameter (wred weight) should be used
1589*41dd9a6bSDavid Young    for each packet color (green, yellow, red) in the same traffic class (tc).
1590*41dd9a6bSDavid Young
1591*41dd9a6bSDavid Young::
1592*41dd9a6bSDavid Young
   ; RED params per traffic class and color (Green / Yellow / Red)
1594*41dd9a6bSDavid Young
1595*41dd9a6bSDavid Young   [red]
1596*41dd9a6bSDavid Young   tc 0 wred min = 28 22 16
1597*41dd9a6bSDavid Young   tc 0 wred max = 32 32 32
1598*41dd9a6bSDavid Young   tc 0 wred inv prob = 10 10 10
1599*41dd9a6bSDavid Young   tc 0 wred weight = 9 9 9
1600*41dd9a6bSDavid Young
1601*41dd9a6bSDavid Young   tc 1 wred min = 28 22 16
1602*41dd9a6bSDavid Young   tc 1 wred max = 32 32 32
1603*41dd9a6bSDavid Young   tc 1 wred inv prob = 10 10 10
1604*41dd9a6bSDavid Young   tc 1 wred weight = 9 9 9
1605*41dd9a6bSDavid Young
1606*41dd9a6bSDavid Young   tc 2 wred min = 28 22 16
1607*41dd9a6bSDavid Young   tc 2 wred max = 32 32 32
1608*41dd9a6bSDavid Young   tc 2 wred inv prob = 10 10 10
1609*41dd9a6bSDavid Young   tc 2 wred weight = 9 9 9
1610*41dd9a6bSDavid Young
1611*41dd9a6bSDavid Young   tc 3 wred min = 28 22 16
1612*41dd9a6bSDavid Young   tc 3 wred max = 32 32 32
1613*41dd9a6bSDavid Young   tc 3 wred inv prob = 10 10 10
1614*41dd9a6bSDavid Young   tc 3 wred weight = 9 9 9
1615*41dd9a6bSDavid Young
1616*41dd9a6bSDavid Young   tc 4 wred min = 28 22 16
1617*41dd9a6bSDavid Young   tc 4 wred max = 32 32 32
1618*41dd9a6bSDavid Young   tc 4 wred inv prob = 10 10 10
1619*41dd9a6bSDavid Young   tc 4 wred weight = 9 9 9
1620*41dd9a6bSDavid Young
1621*41dd9a6bSDavid Young   tc 5 wred min = 28 22 16
1622*41dd9a6bSDavid Young   tc 5 wred max = 32 32 32
1623*41dd9a6bSDavid Young   tc 5 wred inv prob = 10 10 10
1624*41dd9a6bSDavid Young   tc 5 wred weight = 9 9 9
1625*41dd9a6bSDavid Young
1626*41dd9a6bSDavid Young   tc 6 wred min = 28 22 16
1627*41dd9a6bSDavid Young   tc 6 wred max = 32 32 32
1628*41dd9a6bSDavid Young   tc 6 wred inv prob = 10 10 10
1629*41dd9a6bSDavid Young   tc 6 wred weight = 9 9 9
1630*41dd9a6bSDavid Young
1631*41dd9a6bSDavid Young   tc 7 wred min = 28 22 16
1632*41dd9a6bSDavid Young   tc 7 wred max = 32 32 32
1633*41dd9a6bSDavid Young   tc 7 wred inv prob = 10 10 10
1634*41dd9a6bSDavid Young   tc 7 wred weight = 9 9 9
1635*41dd9a6bSDavid Young
1636*41dd9a6bSDavid Young   tc 8 wred min = 28 22 16
1637*41dd9a6bSDavid Young   tc 8 wred max = 32 32 32
1638*41dd9a6bSDavid Young   tc 8 wred inv prob = 10 10 10
1639*41dd9a6bSDavid Young   tc 8 wred weight = 9 9 9
1640*41dd9a6bSDavid Young
1641*41dd9a6bSDavid Young   tc 9 wred min = 28 22 16
1642*41dd9a6bSDavid Young   tc 9 wred max = 32 32 32
1643*41dd9a6bSDavid Young   tc 9 wred inv prob = 10 10 10
1644*41dd9a6bSDavid Young   tc 9 wred weight = 9 9 9
1645*41dd9a6bSDavid Young
1646*41dd9a6bSDavid Young
1647*41dd9a6bSDavid Young   tc 10 wred min = 28 22 16
1648*41dd9a6bSDavid Young   tc 10 wred max = 32 32 32
1649*41dd9a6bSDavid Young   tc 10 wred inv prob = 10 10 10
1650*41dd9a6bSDavid Young   tc 10 wred weight = 9 9 9
1651*41dd9a6bSDavid Young
1652*41dd9a6bSDavid Young   tc 11 wred min = 28 22 16
1653*41dd9a6bSDavid Young   tc 11 wred max = 32 32 32
1654*41dd9a6bSDavid Young   tc 11 wred inv prob = 10 10 10
1655*41dd9a6bSDavid Young   tc 11 wred weight = 9 9 9
1656*41dd9a6bSDavid Young
1657*41dd9a6bSDavid Young   tc 12 wred min = 28 22 16
1658*41dd9a6bSDavid Young   tc 12 wred max = 32 32 32
1659*41dd9a6bSDavid Young   tc 12 wred inv prob = 10 10 10
1660*41dd9a6bSDavid Young   tc 12 wred weight = 9 9 9
1661*41dd9a6bSDavid Young
1662*41dd9a6bSDavid YoungWith this configuration file, the RED configuration that applies to green,
1663*41dd9a6bSDavid Youngyellow and red packets in traffic class 0 is shown in :numref:`table_qos_18`.
1664*41dd9a6bSDavid Young
1665*41dd9a6bSDavid Young.. _table_qos_18:
1666*41dd9a6bSDavid Young
1667*41dd9a6bSDavid Young.. table:: RED Configuration Corresponding to RED Configuration File
1668*41dd9a6bSDavid Young
1669*41dd9a6bSDavid Young   +--------------------+--------------------+-------+--------+-----+
1670*41dd9a6bSDavid Young   | RED Parameter      | Configuration Name | Green | Yellow | Red |
1671*41dd9a6bSDavid Young   |                    |                    |       |        |     |
1672*41dd9a6bSDavid Young   +====================+====================+=======+========+=====+
1673*41dd9a6bSDavid Young   | Minimum Threshold  | tc 0 wred min      | 28    | 22     | 16  |
1674*41dd9a6bSDavid Young   |                    |                    |       |        |     |
1675*41dd9a6bSDavid Young   +--------------------+--------------------+-------+--------+-----+
1676*41dd9a6bSDavid Young   | Maximum Threshold  | tc 0 wred max      | 32    | 32     | 32  |
1677*41dd9a6bSDavid Young   |                    |                    |       |        |     |
1678*41dd9a6bSDavid Young   +--------------------+--------------------+-------+--------+-----+
1679*41dd9a6bSDavid Young   | Mark Probability   | tc 0 wred inv prob | 10    | 10     | 10  |
1680*41dd9a6bSDavid Young   |                    |                    |       |        |     |
1681*41dd9a6bSDavid Young   +--------------------+--------------------+-------+--------+-----+
1682*41dd9a6bSDavid Young   | EWMA Filter Weight | tc 0 wred weight   | 9     | 9      | 9   |
1683*41dd9a6bSDavid Young   |                    |                    |       |        |     |
1684*41dd9a6bSDavid Young   +--------------------+--------------------+-------+--------+-----+
1685*41dd9a6bSDavid Young
1686*41dd9a6bSDavid YoungApplication Programming Interface (API)
1687*41dd9a6bSDavid Young~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1688*41dd9a6bSDavid Young
1689*41dd9a6bSDavid YoungEnqueue API
1690*41dd9a6bSDavid Young^^^^^^^^^^^
1691*41dd9a6bSDavid Young
1692*41dd9a6bSDavid YoungThe syntax of the enqueue API is as follows:
1693*41dd9a6bSDavid Young
1694*41dd9a6bSDavid Young.. code-block:: c
1695*41dd9a6bSDavid Young
1696*41dd9a6bSDavid Young   int rte_red_enqueue(const struct rte_red_config *red_cfg, struct rte_red *red, const unsigned q, const uint64_t time)
1697*41dd9a6bSDavid Young
1698*41dd9a6bSDavid Young
1699*41dd9a6bSDavid YoungThe arguments passed to the enqueue API are configuration data, run-time data,
1700*41dd9a6bSDavid Youngthe current size of the packet queue (in packets) and a value representing the current time.
1701*41dd9a6bSDavid YoungThe time reference is in units of bytes,
1702*41dd9a6bSDavid Youngwhere a byte signifies the time duration required by the physical interface to send out a byte on the transmission medium
(see Section `Internal Time Reference`_).
1704*41dd9a6bSDavid YoungThe dropper reuses the scheduler time stamps for performance reasons.
1705*41dd9a6bSDavid Young
1706*41dd9a6bSDavid YoungEmpty API
1707*41dd9a6bSDavid Young^^^^^^^^^
1708*41dd9a6bSDavid Young
1709*41dd9a6bSDavid YoungThe syntax of the empty API is as follows:
1710*41dd9a6bSDavid Young
1711*41dd9a6bSDavid Young.. code-block:: c
1712*41dd9a6bSDavid Young
1713*41dd9a6bSDavid Young    void rte_red_mark_queue_empty(struct rte_red *red, const uint64_t time)
1714*41dd9a6bSDavid Young
1715*41dd9a6bSDavid YoungThe arguments passed to the empty API are run-time data and the current time in bytes.
1716*41dd9a6bSDavid Young
1717*41dd9a6bSDavid YoungTraffic Metering
1718*41dd9a6bSDavid Young----------------
1719*41dd9a6bSDavid Young
1720*41dd9a6bSDavid YoungThe traffic metering component implements the Single Rate Three Color Marker (srTCM) and
1721*41dd9a6bSDavid YoungTwo Rate Three Color Marker (trTCM) algorithms, as defined by IETF RFC 2697 and 2698 respectively.
1722*41dd9a6bSDavid YoungThese algorithms meter the stream of incoming packets based on the allowance defined in advance for each traffic flow.
As a result, each incoming packet is tagged as green,
1724*41dd9a6bSDavid Youngyellow or red based on the monitored consumption of the flow the packet belongs to.
1725*41dd9a6bSDavid Young
1726*41dd9a6bSDavid YoungFunctional Overview
1727*41dd9a6bSDavid Young~~~~~~~~~~~~~~~~~~~
1728*41dd9a6bSDavid Young
1729*41dd9a6bSDavid YoungThe srTCM algorithm defines two token buckets for each traffic flow,
1730*41dd9a6bSDavid Youngwith the two buckets sharing the same token update rate:
1731*41dd9a6bSDavid Young
1732*41dd9a6bSDavid Young*   Committed (C) bucket: fed with tokens at the rate defined by the Committed Information Rate (CIR) parameter
1733*41dd9a6bSDavid Young    (measured in IP packet bytes per second).
1734*41dd9a6bSDavid Young    The size of the C bucket is defined by the Committed Burst Size (CBS) parameter (measured in bytes);
1735*41dd9a6bSDavid Young
1736*41dd9a6bSDavid Young*   Excess (E) bucket: fed with tokens at the same rate as the C bucket.
1737*41dd9a6bSDavid Young    The size of the E bucket is defined by the Excess Burst Size (EBS) parameter (measured in bytes).

The trTCM algorithm defines two token buckets for each traffic flow,
with the two buckets being updated with tokens at independent rates:

*   Committed (C) bucket: fed with tokens at the rate defined by the Committed Information Rate (CIR) parameter
    (measured in IP packet bytes per second).
    The size of the C bucket is defined by the Committed Burst Size (CBS) parameter (measured in bytes);

*   Peak (P) bucket: fed with tokens at the rate defined by the Peak Information Rate (PIR) parameter
    (measured in IP packet bytes per second).
    The size of the P bucket is defined by the Peak Burst Size (PBS) parameter (measured in bytes).

Please refer to RFC 2697 (for srTCM) and RFC 2698 (for trTCM) for details on how tokens are consumed
from the buckets and how the packet color is determined.
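As a rough sketch of the rules RFC 2698 defines, the color blind trTCM decision for a packet of ``pkt_len`` bytes can be summarized as below. The structure and function names are illustrative only (they are not the DPDK ``librte_meter`` API), and the time-based token refill is assumed to have already happened:

```c
#include <assert.h>
#include <stdint.h>

enum color { GREEN, YELLOW, RED };

/* Simplified trTCM token-bucket state: current token counts, in bytes. */
struct toy_trtcm {
    uint64_t tc; /* tokens in the committed (C) bucket */
    uint64_t tp; /* tokens in the peak (P) bucket */
};

/* Color blind trTCM decision per RFC 2698: check the P bucket first,
 * then the C bucket; consume tokens only when the packet is not red. */
static enum color toy_trtcm_blind(struct toy_trtcm *m, uint64_t pkt_len)
{
    if (m->tp < pkt_len)
        return RED;              /* exceeds the peak allowance */
    if (m->tc < pkt_len) {
        m->tp -= pkt_len;        /* within peak, above committed */
        return YELLOW;
    }
    m->tp -= pkt_len;            /* within the committed allowance: */
    m->tc -= pkt_len;            /* consume from both buckets */
    return GREEN;
}
```

Note that a green packet consumes tokens from both buckets, while a yellow packet consumes only from the P bucket.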

Color Blind and Color Aware Modes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For both algorithms, the color blind mode is functionally equivalent to the color aware mode with the input color set to green.
In color aware mode, a packet with red input color can only get the red output color,
while a packet with yellow input color can only get the yellow or red output colors.
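The constraint just described means the output color is never less severe than the input color. With the ordering GREEN < YELLOW < RED, this property can be expressed as a simple maximum, as in the sketch below. This captures only the stated property, not the full color aware token accounting of the RFCs:

```c
#include <assert.h>

enum color { GREEN = 0, YELLOW = 1, RED = 2 };

/* Output severity is at least the input severity (illustrative only). */
static enum color clamp_output(enum color input, enum color metered)
{
    return input > metered ? input : metered;
}
```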

The color blind mode is nevertheless implemented separately from the color aware mode
because it can be implemented with fewer operations.

Implementation Overview
~~~~~~~~~~~~~~~~~~~~~~~

For each input packet, the steps for the srTCM / trTCM algorithms are:

*   Update the C and E / P token buckets. This is done by reading the current time (from the CPU timestamp counter),
    identifying the amount of time since the last bucket update and computing the associated number of tokens
    (according to the pre-configured bucket rate).
    The number of tokens in the bucket is limited by the pre-configured bucket size;

*   Identify the output color for the current packet based on the size of the IP packet
    and the number of tokens currently available in the C and E / P buckets; for color aware mode only,
    the input color of the packet is also considered.
    When the output color is not red, a number of tokens equal to the length of the IP packet is
    subtracted from the C bucket, the E / P bucket, or both, depending on the algorithm and the output color of the packet.
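The two steps above can be sketched for the color blind srTCM case as follows. The structure and helper names are illustrative, not the DPDK ``librte_meter`` API; time is measured in abstract units and the refill rate in bytes per time unit. Per RFC 2697, newly generated tokens fill the C bucket first and the overflow spills into the E bucket:

```c
#include <assert.h>
#include <stdint.h>

enum color { GREEN, YELLOW, RED };

/* Illustrative srTCM state: one shared refill rate (CIR), two buckets. */
struct toy_srtcm {
    uint64_t cir;       /* token refill rate, bytes per time unit */
    uint64_t cbs, ebs;  /* bucket sizes, in bytes */
    uint64_t tc, te;    /* current tokens in the C and E buckets */
    uint64_t last_time; /* time of the previous bucket update */
};

/* Step 1: refill from the elapsed time, capped at the bucket sizes.
 * The C bucket fills first; the overflow spills into the E bucket. */
static void toy_srtcm_update(struct toy_srtcm *m, uint64_t now)
{
    uint64_t tokens = (now - m->last_time) * m->cir;

    m->last_time = now;
    if (m->tc + tokens <= m->cbs) {
        m->tc += tokens;
        return;
    }
    tokens -= m->cbs - m->tc;
    m->tc = m->cbs;
    m->te = (m->te + tokens <= m->ebs) ? m->te + tokens : m->ebs;
}

/* Step 2: pick the output color and consume tokens (color blind). */
static enum color toy_srtcm_blind(struct toy_srtcm *m, uint64_t now,
                                  uint64_t pkt_len)
{
    toy_srtcm_update(m, now);
    if (m->tc >= pkt_len) {
        m->tc -= pkt_len;       /* within the committed allowance */
        return GREEN;
    }
    if (m->te >= pkt_len) {
        m->te -= pkt_len;       /* within the excess allowance */
        return YELLOW;
    }
    return RED;
}
```

The trTCM case follows the same two-step shape, except that the C and P buckets are refilled at independent rates (CIR and PIR) and the color decision checks the P bucket first.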