..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2017 Intel Corporation.

Traffic Management API
======================


Overview
--------

This is the generic API for the Quality of Service (QoS) Traffic Management of
Ethernet devices, which includes the following main features: hierarchical
scheduling, traffic shaping, congestion management and packet marking. This API
is agnostic of the underlying HW, SW or mixed HW-SW implementation.

Main features:

* Part of DPDK rte_ethdev API
* Capability query API per port, per hierarchy level and per hierarchy node
* Scheduling algorithms: Strict Priority (SP), Weighted Fair Queuing (WFQ)
* Traffic shaping: single/dual rate, private (per node) and
  shared (by multiple nodes) shapers
* Congestion management for hierarchy leaf nodes: tail drop, head drop, WRED
  with private (per node) and shared (by multiple nodes) WRED contexts, and
  PIE
* Packet marking: IEEE 802.1Q (VLAN DEI), IETF RFC 3168 (IPv4/IPv6 ECN for TCP
  and SCTP), IETF RFC 2597 (IPv4/IPv6 DSCP)


Capability API
--------------

The aim of these APIs is to advertise the capability information (i.e. critical
parameter values) that the TM implementation (HW/SW) is able to support for the
application. The APIs support information disclosure at the TM level, at
any hierarchical level of the TM and at any node of a specific hierarchical
level. Such information helps users quickly determine whether a specific
implementation meets the needs of the user application.

At the TM level, users can get a high-level idea of the implementation from
parameters such as the maximum number of nodes, maximum number of hierarchical
levels, maximum number of shapers, maximum number of private shapers and the
types of scheduling algorithm supported (Strict Priority, Weighted Fair
Queuing, etc.).

Likewise, users can query the capability of the TM at the hierarchical level to
get more granular knowledge about a specific level. Parameters such as the
maximum number of nodes at the level, the maximum number of leaf/non-leaf
nodes at the level and the type of shaper (dual rate, single rate) supported
at the level for non-leaf nodes are exposed by the hierarchical level
capability query.

Finally, the node level capability API describes the capability supported by
a node at any specific level. It reports, for example, whether private shapers
and dual rate shapers are supported, and the maximum and minimum shaper
rate.
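
As a rough illustration of how an application might use the advertised values,
the sketch below compares a capability record against the application's
requirements. The ``tm_caps`` structure and ``tm_caps_sufficient()`` are
hypothetical names invented for this example; the real capability structures
are declared in ``rte_tm.h`` and carry many more fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down capability record used only in this sketch. */
struct tm_caps {
    uint32_t max_nodes;           /* maximum number of nodes */
    uint32_t max_levels;          /* maximum number of hierarchy levels */
    uint32_t max_shapers;         /* maximum number of shapers */
    uint32_t max_private_shapers; /* maximum number of private shapers */
    bool wfq;                     /* WFQ available in addition to SP */
};

/* Return true when the advertised capabilities cover the requirements. */
static bool tm_caps_sufficient(const struct tm_caps *have,
                               const struct tm_caps *need)
{
    assert(have != NULL && need != NULL);
    return have->max_nodes >= need->max_nodes &&
           have->max_levels >= need->max_levels &&
           have->max_shapers >= need->max_shapers &&
           have->max_private_shapers >= need->max_private_shapers &&
           (have->wfq || !need->wfq);
}
```

In a DPDK application the ``have`` side would be filled from the port, level
or node capability queries (for example ``rte_tm_capabilities_get()``) rather
than written by hand.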


Scheduling Algorithms
---------------------

The fundamental scheduling algorithms that are supported are Strict Priority
(SP) and Weighted Fair Queuing (WFQ). The SP and WFQ algorithms are supported
at the level of each node of the scheduling hierarchy, regardless of the node
level/position in the tree. The SP algorithm is used to schedule between
sibling nodes with different priority, while WFQ is used to schedule among
the siblings within a group that shares the same priority.

Algorithms such as Weighted Round Robin (WRR), byte-level WRR, Deficit WRR
(DWRR), etc., are considered approximations of the ideal WFQ and are therefore
assimilated to WFQ, although an associated implementation-dependent accuracy,
performance and resource usage trade-off might exist.
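
The interaction of the two algorithms can be sketched as a two-step choice:
SP selects the priority group, then a WFQ approximation selects a sibling
inside that group. The code below is a minimal model, not the behavior of any
particular driver. It assumes that a lower priority value is served first and
uses smooth weighted round robin as the WFQ approximation; ``struct child``
and ``pick_next()`` are names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical child descriptor for this sketch; not an rte_tm type. */
struct child {
    uint32_t priority; /* SP priority, lower value is served first */
    uint32_t weight;   /* WFQ weight, at least 1 */
    uint32_t backlog;  /* packets currently waiting */
    int64_t credit;    /* running credit used by the WRR approximation */
};

/* Pick the child to serve next and consume one of its packets.
 * Returns the child index, or -1 when no child has a packet waiting. */
static int pick_next(struct child *c, size_t n)
{
    int64_t total = 0;
    int best = -1;
    int top = -1;
    size_t i;

    assert(c != NULL || n == 0);

    /* SP step: find a backlogged child with the lowest priority value. */
    for (i = 0; i < n; i++)
        if (c[i].backlog > 0 &&
            (top < 0 || c[i].priority < c[top].priority))
            top = (int)i;
    if (top < 0)
        return -1;

    /* WFQ step, approximated by smooth weighted round robin among the
     * backlogged siblings that share the selected priority. */
    for (i = 0; i < n; i++) {
        if (c[i].backlog == 0 || c[i].priority != c[top].priority)
            continue;
        c[i].credit += c[i].weight;
        total += c[i].weight;
        if (best < 0 || c[i].credit > c[best].credit)
            best = (int)i;
    }
    c[best].credit -= total;
    c[best].backlog--;
    return best;
}
```

With one backlogged child at priority 0 and two children at priority 1 with
weights 3 and 1, the priority 0 child is drained first, after which the other
two are served in a 3:1 ratio.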


Traffic Shaping
---------------

The TM API provides support for single rate and dual rate shapers (rate
limiters) for the hierarchy nodes, subject to the specific implementation
support being available.

Each hierarchy node has zero or one private shaper (only one node using it)
and/or zero, one or several shared shapers (multiple nodes use the same shaper
instance). A private shaper is used to perform traffic shaping for a single
node, while a shared shaper is used to perform traffic shaping for a group of
nodes.

The configuration of private and shared shapers is done through the definition
of shaper profiles. Any shaper profile (single rate or dual rate shaper) can be
used by one or several shaper instances (either private or shared).

Single rate shapers use a single token bucket. Therefore, a single rate shaper
is configured by setting the rate of the committed bucket to zero, which
effectively disables this bucket. The peak bucket is used to limit the rate
and the burst size for the single rate shaper. Dual rate shapers use both the
committed and the peak token buckets. The rate of the peak bucket has to be
greater than zero, as well as greater than or equal to the rate of the
committed bucket.
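
The profile rules above, together with the way a token bucket bounds both rate
and burst, can be sketched as follows. The types and function names are
hypothetical and only mirror the committed/peak terminology of this section.
The units (bytes per second, bytes, nanoseconds) and the choice to reject a
zero peak rate for single rate profiles as well are assumptions of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct token_bucket {
    uint64_t rate; /* assumed unit: bytes per second */
    uint64_t size; /* burst size, assumed unit: bytes */
};

struct shaper_profile {
    struct token_bucket committed;
    struct token_bucket peak;
};

enum shaper_kind { SHAPER_INVALID, SHAPER_SINGLE_RATE, SHAPER_DUAL_RATE };

/* Classify a profile according to the rules stated in the text. */
static enum shaper_kind shaper_classify(const struct shaper_profile *p)
{
    assert(p != NULL);
    if (p->peak.rate == 0)
        return SHAPER_INVALID;      /* peak bucket must be active */
    if (p->committed.rate == 0)
        return SHAPER_SINGLE_RATE;  /* committed bucket disabled */
    if (p->peak.rate < p->committed.rate)
        return SHAPER_INVALID;      /* peak must be >= committed */
    return SHAPER_DUAL_RATE;
}

struct bucket_state {
    uint64_t tokens;  /* bytes available, never above the bucket size */
    uint64_t last_ns; /* time of the last refill */
};

/* Refill the bucket for the elapsed time, then try to send pkt_len bytes.
 * Assumes now_ns >= last_ns and that elapsed_ns * rate fits in 64 bits. */
static bool bucket_consume(const struct token_bucket *tb,
                           struct bucket_state *s,
                           uint64_t now_ns, uint64_t pkt_len)
{
    uint64_t earned = (now_ns - s->last_ns) * tb->rate / 1000000000ULL;

    if (earned > 0) {
        s->tokens = (earned > tb->size - s->tokens) ?
                    tb->size : s->tokens + earned;
        s->last_ns = now_ns;
    }
    if (pkt_len > s->tokens)
        return false; /* not enough credit: packet must wait */
    s->tokens -= pkt_len;
    return true;
}
```

The bucket size caps how many bytes can leave back to back, while the rate
caps the long-term average. A dual rate shaper would run one such bucket for
the committed rate and one for the peak rate.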


Congestion Management
---------------------

Congestion management is used to control the admission of packets into a packet
queue or group of packet queues on congestion. The congestion management
algorithms that are supported are: Tail Drop, Head Drop, Weighted Random
Early Detection (WRED) and Proportional Integral Controller Enhanced (PIE).
They are made available for every leaf node in the hierarchy, subject to
the specific implementation supporting them.

On a request to write a new packet into the current queue while the queue is
full, the Tail Drop algorithm drops the new packet while leaving the queue
unmodified, as opposed to the Head Drop algorithm, which drops the packet
at the head of the queue (the oldest packet waiting in the queue) and admits
the new packet at the tail of the queue.
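
The difference between the two policies can be seen on a small ring buffer.
This is a minimal sketch with a hypothetical ``pkt_queue`` type and an
arbitrary capacity of four packets; it is not how any driver stores packets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP 4 /* illustrative capacity */

struct pkt_queue {
    uint32_t slot[QUEUE_CAP]; /* packet identifiers */
    size_t head;              /* index of the oldest packet */
    size_t count;             /* packets currently stored */
};

enum drop_policy { TAIL_DROP, HEAD_DROP };

/* Offer pkt to the queue. Returns true when a packet had to be dropped,
 * in which case *dropped holds the identifier of the dropped packet. */
static bool enqueue(struct pkt_queue *q, uint32_t pkt,
                    enum drop_policy policy, uint32_t *dropped)
{
    bool full;

    assert(q != NULL && dropped != NULL);
    full = (q->count == QUEUE_CAP);
    if (full) {
        if (policy == TAIL_DROP) {
            *dropped = pkt; /* new packet dropped, queue unmodified */
            return true;
        }
        /* Head drop: discard the oldest packet to make room. */
        *dropped = q->slot[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
    }
    /* Admit the new packet at the tail. */
    q->slot[(q->head + q->count) % QUEUE_CAP] = pkt;
    q->count++;
    return full;
}
```

Head drop keeps the freshest packets in the queue, which can matter for
latency-sensitive traffic, at the cost of touching both ends of the queue.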

The Random Early Detection (RED) algorithm works by proactively dropping more
and more input packets as the queue occupancy builds up. When the queue is full
or almost full, RED effectively works as Tail Drop. The Weighted RED (WRED)
algorithm uses a separate set of RED thresholds for each packet color.
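
A minimal sketch of the per-color threshold idea, using the textbook RED drop
curve: no drops below a minimum threshold, certain drop at or above a maximum
threshold, and a linear ramp in between. The structure names and threshold
values are illustrative assumptions. The averaging of the queue length and the
random draw against the returned probability are left out.

```c
#include <assert.h>
#include <stddef.h>

enum color { GREEN, YELLOW, RED, N_COLORS };

/* One set of RED thresholds (hypothetical layout for this sketch). */
struct red_params {
    double min_th; /* below this average occupancy nothing is dropped */
    double max_th; /* at or above this occupancy everything is dropped */
    double max_p;  /* drop probability reached just below max_th */
};

/* WRED: one independent set of RED thresholds per packet color. */
struct wred_profile {
    struct red_params color[N_COLORS];
};

/* Drop probability for a packet of color c at the given average queue
 * occupancy, following the textbook RED curve. */
static double wred_drop_probability(const struct wred_profile *w,
                                    enum color c, double avg_qlen)
{
    const struct red_params *p;

    assert(w != NULL && c < N_COLORS);
    p = &w->color[c];
    assert(p->min_th < p->max_th);

    if (avg_qlen < p->min_th)
        return 0.0;
    if (avg_qlen >= p->max_th)
        return 1.0; /* behaves like Tail Drop */
    return p->max_p * (avg_qlen - p->min_th) / (p->max_th - p->min_th);
}
```

Giving red packets lower thresholds than green ones means that, at the same
occupancy, red packets are dropped earlier and more often.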

Each hierarchy leaf node with WRED enabled as its congestion management mode
has zero or one private WRED context (only one leaf node using it) and/or zero,
one or several shared WRED contexts (multiple leaf nodes use the same WRED
context). A private WRED context is used to perform congestion management for
a single leaf node, while a shared WRED context is used to perform congestion
management for a group of leaf nodes.

The configuration of WRED private and shared contexts is done through the
definition of WRED profiles. Any WRED profile can be used by one or several
WRED contexts (either private or shared).

The Proportional Integral Controller Enhanced (PIE) algorithm works by
proactively dropping packets randomly. The calculated drop probability is
updated periodically, based on the measured and desired latency and on whether
the queuing latency is currently trending up or down. Queuing latency can be
obtained by direct measurement or estimated from the queue length and dequeue
rate. The random drop is triggered by a packet's arrival, before it is enqueued
into a queue.
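
The periodic update can be sketched with the basic proportional-integral rule
described in IETF RFC 8033: the probability moves with the deviation from the
target latency and with the latency trend. This is a simplified model that
omits the RFC's auto-tuning and burst allowance; the structure and function
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct pie_params {
    double target_delay; /* desired queuing latency, seconds */
    double alpha;        /* weight of the deviation from the target */
    double beta;         /* weight of the latency trend */
};

struct pie_state {
    double drop_prob; /* current drop probability, kept in [0, 1] */
    double old_delay; /* queuing latency seen at the previous update */
};

/* Estimate queuing latency from queue length and dequeue rate. */
static double pie_estimate_delay(double queue_bytes, double dequeue_rate)
{
    assert(dequeue_rate > 0.0); /* bytes per second */
    return queue_bytes / dequeue_rate;
}

/* Periodic update of the drop probability. */
static double pie_update(struct pie_state *s, const struct pie_params *p,
                         double cur_delay)
{
    double prob;

    assert(s != NULL && p != NULL);
    prob = s->drop_prob
         + p->alpha * (cur_delay - p->target_delay) /* above/below target */
         + p->beta * (cur_delay - s->old_delay);    /* trending up/down */
    if (prob < 0.0)
        prob = 0.0;
    if (prob > 1.0)
        prob = 1.0;
    s->drop_prob = prob;
    s->old_delay = cur_delay;
    return prob;
}
```

On each packet arrival the enqueue path would then draw a random number and
drop the packet when it falls below ``drop_prob``.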


Packet Marking
--------------

The TM APIs support various types of packet marking, such as VLAN DEI packet
marking (IEEE 802.1Q), IPv4/IPv6 ECN marking of TCP and SCTP packets
(IETF RFC 3168) and IPv4/IPv6 DSCP packet marking (IETF RFC 2597).

All VLAN frames of a given color get their DEI bit set if marking is enabled
for this color. When marking for a given color is not enabled, the
DEI bit is left as is (either set or not).
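
A minimal sketch of the DEI rule, operating on a VLAN Tag Control Information
(TCI) value in host byte order, where DEI is bit 12 between the 3-bit priority
and the 12-bit VLAN ID. The function name and the per-color enable array are
assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum color { GREEN, YELLOW, RED, N_COLORS };

#define VLAN_TCI_DEI 0x1000u /* DEI bit of the 16-bit TCI, host byte order */

/* Set the DEI bit when marking is enabled for the packet color; otherwise
 * return the TCI unchanged, whatever the DEI bit currently is. */
static uint16_t mark_vlan_dei(uint16_t tci, enum color c,
                              const bool enabled[N_COLORS])
{
    assert(enabled != NULL && c < N_COLORS);
    if (enabled[c])
        return (uint16_t)(tci | VLAN_TCI_DEI);
    return tci;
}
```

Note that marking only ever sets the bit: a frame that arrives with DEI
already set keeps it even when its color is not enabled for marking.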

All IPv4/IPv6 packets of a given color with ECN set to 2’b01 or 2’b10 carrying
TCP or SCTP have their ECN set to 2’b11 if the marking feature is enabled for
the current color, otherwise the ECN field is left as is.
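
The ECN rule can be sketched on the IPv4 Type of Service byte (or the IPv6
Traffic Class byte), whose two least significant bits carry the ECN field.
The function name, the ``is_tcp_or_sctp`` flag and the per-color enable array
are hypothetical; header parsing is assumed to have happened elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum color { GREEN, YELLOW, RED, N_COLORS };

#define ECN_MASK 0x03u /* ECN field: two low bits of TOS / Traffic Class */

/* Rewrite ECN from 01 or 10 (ECN-capable transport) to 11 (congestion
 * experienced) when marking is enabled for the color and the packet
 * carries TCP or SCTP. Any other packet is returned unchanged. */
static uint8_t mark_ecn(uint8_t tos, bool is_tcp_or_sctp, enum color c,
                        const bool enabled[N_COLORS])
{
    uint8_t ecn = (uint8_t)(tos & ECN_MASK);

    assert(enabled != NULL && c < N_COLORS);
    if (enabled[c] && is_tcp_or_sctp && (ecn == 0x1 || ecn == 0x2))
        return (uint8_t)(tos | ECN_MASK);
    return tos;
}
```

Packets with ECN 00 are not ECN-capable and are therefore never marked, and
the DSCP bits in the upper part of the byte are not touched.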

All IPv4/IPv6 packets have their color marked into DSCP bits 3 and 4 as
follows: green mapped to Low Drop Precedence (2’b01), yellow to Medium (2’b10)
and red to High (2’b11). Marking needs to be explicitly enabled for each color;
when not enabled for a given color, the DSCP field of all packets with that
color is left as is.
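
A minimal sketch of the DSCP rule on a 6-bit DSCP value. It reads "bits 3 and
4" as counted from the most significant DSCP bit starting at 0, which is where
the RFC 2597 Assured Forwarding code points keep their drop precedence; that
reading, like the function name, is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum color { GREEN, YELLOW, RED, N_COLORS };

/* Drop precedence occupies two bits of the 6-bit DSCP: value bits 2..1
 * when counted from the least significant bit. */
#define DSCP_DROP_PREC_MASK  0x06u
#define DSCP_DROP_PREC_SHIFT 1

/* Write the color into the drop precedence bits when marking is enabled
 * for that color; otherwise return the DSCP unchanged. */
static uint8_t mark_dscp(uint8_t dscp, enum color c,
                         const bool enabled[N_COLORS])
{
    /* green: low (01), yellow: medium (10), red: high (11) */
    static const uint8_t drop_prec[N_COLORS] = { 0x1, 0x2, 0x3 };

    assert(enabled != NULL && c < N_COLORS && dscp < 64);
    if (!enabled[c])
        return dscp;
    return (uint8_t)((dscp & ~DSCP_DROP_PREC_MASK) |
                     (unsigned int)(drop_prec[c] << DSCP_DROP_PREC_SHIFT));
}
```

For example, a packet carrying AF11 (DSCP 10) that is colored red leaves with
AF13 (DSCP 14), while the class bits stay the same.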


Steps to Setup the Hierarchy
----------------------------

The TM hierarchical tree consists of leaf nodes and non-leaf nodes. Each leaf
node sits on top of a scheduling queue of the current Ethernet port. Therefore,
the leaf nodes have predefined IDs in the range of 0 to (N-1), where N is the
number of scheduling queues of the current Ethernet port. The non-leaf nodes
have their IDs generated by the application outside of the above range, which
is reserved for leaf nodes.

Each non-leaf node has multiple inputs (its children nodes) and a single output
(which is input to its parent node). It arbitrates its inputs using Strict
Priority (SP) and Weighted Fair Queuing (WFQ) algorithms to schedule input
packets to its output while observing its shaping (rate limiting) constraints.

The children nodes with different priorities are scheduled using the SP
algorithm based on their priority, with 0 as the highest priority. Children
with the same priority are scheduled using the WFQ algorithm according to their
weights. The WFQ weight of a given child node is relative to the sum of the
weights of all its sibling nodes that have the same priority, with 1 as the
lowest weight. For each SP priority, the WFQ weight mode can be set as either
byte-based or packet-based.
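
To make the "relative to the sum of the weights" rule concrete, the sketch
below computes the fraction of its priority group's bandwidth that a child can
expect when every sibling in the group is backlogged. The ``sibling`` type and
``wfq_share()`` are hypothetical, and the sum is assumed to include the
child's own weight.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one child of a non-leaf node. */
struct sibling {
    uint32_t priority; /* SP priority, 0 is the highest */
    uint32_t weight;   /* WFQ weight, 1 is the lowest */
};

/* Fraction of the priority group's service that child idx receives,
 * assuming all siblings of the same priority have traffic waiting. */
static double wfq_share(const struct sibling *s, size_t n, size_t idx)
{
    uint64_t sum = 0;
    size_t i;

    assert(s != NULL && idx < n);
    for (i = 0; i < n; i++) {
        if (s[i].priority != s[idx].priority)
            continue; /* other priorities are arbitrated by SP */
        assert(s[i].weight >= 1);
        sum += s[i].weight;
    }
    return (double)s[idx].weight / (double)sum;
}
```

With weights 1, 3 and 4 at the same priority, the children receive 1/8, 3/8
and 1/2 of the group's service. Whether that service is counted in bytes or in
packets depends on the WFQ weight mode of the priority.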


Initial Hierarchy Specification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The hierarchy is specified by incrementally adding nodes to build up the
scheduling tree. The first node that is added to the hierarchy becomes the root
node and all the nodes that are subsequently added have to be added as
descendants of the root node. The parent of the root node has to be specified
as RTE_TM_NODE_ID_NULL and there can only be one node with this parent ID
(i.e. the root node). The unique ID that is assigned to each node when the node
is created is further used to update the node configuration or to connect
children nodes to it.
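
The rules for incremental construction can be modelled with a small table of
(node ID, parent ID) pairs. This is a stand-alone sketch, not the DPDK
implementation: ``NODE_ID_NULL`` stands in for ``RTE_TM_NODE_ID_NULL``, and
``tm_tree`` and ``tree_node_add()`` are hypothetical names. The real
``rte_tm_node_add()`` call also takes priority, weight, level and node
parameters, which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NODE_ID_NULL UINT32_MAX /* stand-in for RTE_TM_NODE_ID_NULL */
#define MAX_NODES 16            /* illustrative limit */

struct tm_node {
    uint32_t id;
    uint32_t parent_id;
};

struct tm_tree {
    struct tm_node node[MAX_NODES];
    size_t n_nodes;
};

static bool tree_has(const struct tm_tree *t, uint32_t id)
{
    size_t i;

    for (i = 0; i < t->n_nodes; i++)
        if (t->node[i].id == id)
            return true;
    return false;
}

/* Add a node. Returns 0 on success, -1 when the request is rejected. */
static int tree_node_add(struct tm_tree *t, uint32_t id, uint32_t parent_id)
{
    assert(t != NULL);
    if (id == NODE_ID_NULL || t->n_nodes == MAX_NODES || tree_has(t, id))
        return -1; /* reserved ID, table full or duplicate ID */
    if (parent_id == NODE_ID_NULL) {
        if (t->n_nodes != 0)
            return -1; /* only the first node can be the root */
    } else if (!tree_has(t, parent_id)) {
        return -1; /* parent must already be part of the tree */
    }
    t->node[t->n_nodes].id = id;
    t->node[t->n_nodes].parent_id = parent_id;
    t->n_nodes++;
    return 0;
}
```

Because every non-root node must name a parent that already exists, all nodes
end up as descendants of the single root. A typical call sequence adds a
non-leaf root with an application-chosen ID such as 100, then attaches leaf
node 0 beneath it.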

During this phase, some limited checks on the hierarchy specification can be
conducted, usually limited in scope to the current node, its parent node and
its sibling nodes. At this time, since the hierarchy is not fully defined,
there is typically no real action performed by the underlying implementation.


Hierarchy Commit
~~~~~~~~~~~~~~~~

The hierarchy commit API is called during the port initialization phase (before
the Ethernet port is started) to freeze the start-up hierarchy.  This function
typically performs the following steps:

* It validates the start-up hierarchy that was previously defined for the
  current port through successive node add API invocations.
* Assuming successful validation, it performs all the necessary implementation
  specific operations to install the specified hierarchy on the current port,
  with immediate effect once the port is started.

This function fails when the currently configured hierarchy is not supported by
the Ethernet port, in which case the user can abort or try out another
hierarchy configuration (e.g. a hierarchy with fewer leaf nodes), which can be
built from scratch or by modifying the existing hierarchy configuration. Note
that this function can still fail due to other causes (e.g. not enough memory
available in the system, etc.), even though the specified hierarchy is
supported in principle by the current port.


Run-Time Hierarchy Updates
~~~~~~~~~~~~~~~~~~~~~~~~~~

The TM API provides support for on-the-fly changes to the scheduling hierarchy,
so operations such as node add/delete, node suspend/resume, parent node
update, etc., can be invoked after the Ethernet port has been started, subject
to the specific implementation supporting them. The set of dynamic updates
supported by the implementation is advertised through the port capability set.
233