..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(C) 2020 Marvell International Ltd.

Graph Library and Inbuilt Nodes
===============================

Graph architecture abstracts the data processing functions as a ``node`` and
``links`` them together to create a complex ``graph`` to enable reusable/modular
data processing functions.

The graph library provides an API to enable graph framework operations such as
create, lookup, dump and destroy on graphs, and node operations such as clone,
edge update and edge shrink. The API also allows the creation of a stats
cluster to monitor per-graph and per-node statistics.

Features
--------

Features of the Graph library are:

- Nodes as plugins.
- Support for out-of-tree nodes.
- Inbuilt nodes for packet processing.
- Node-specific xstat counts.
- Multi-process support.
- Low overhead graph walk and node enqueue.
- Low overhead statistics collection infrastructure.
- Support to export the graph as a Graphviz dot file. See ``rte_graph_export()``.
- Allow having another graph walk implementation in the future by segregating
  the fast path (``rte_graph_worker.h``) and slow path code.

Advantages of Graph architecture
--------------------------------

- Memory latency is the enemy for high-speed packet processing; moving
  similar packet processing code to a node reduces instruction cache (I-cache)
  and data cache (D-cache) misses.
- Exploits the probability that most packets will follow the same nodes in the
  graph.
- Allows SIMD instructions for packet processing of the node.
- The modular scheme allows having reusable nodes for the consumers.
- The modular scheme allows us to abstract the vendor HW specific
  optimizations as a node.

Performance tuning parameters
-----------------------------

- Test with various burst size values (256, 128, 64, 32) using the
  RTE_GRAPH_BURST_SIZE config option.
  Testing shows that, on x86 and arm64 servers, the sweet spot is a burst
  size of 256, while on arm64 embedded SoCs it is either 64 or 128.
- Disable node statistics (using the ``RTE_LIBRTE_GRAPH_STATS`` config option)
  if not needed.

Programming model
-----------------

Anatomy of Node:
~~~~~~~~~~~~~~~~

.. _figure_anatomy_of_a_node:

.. figure:: img/anatomy_of_a_node.*

   Anatomy of a node

The node is the basic building block of the graph framework.

A node consists of:

process():
^^^^^^^^^^

The callback function that will be invoked by the worker thread using the
``rte_graph_walk()`` function when there is data to be processed by the node.
A graph node processes the objects handed to its ``process()`` callback and
enqueues them to the next downstream node using the ``rte_node_enqueue*()``
functions.
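
For illustration, below is a minimal sketch of a ``process()`` callback
(the node logic and the edge index are hypothetical) that simply forwards
the whole burst to its first downstream node:

.. code-block:: c

    #include <rte_graph.h>
    #include <rte_graph_worker.h>

    /* Illustrative edge index; edge 0 is the first entry in next_nodes[]. */
    #define MY_NODE_NEXT_PASS 0

    static uint16_t
    my_node_process(struct rte_graph *graph, struct rte_node *node,
                    void **objs, uint16_t nb_objs)
    {
        /* Application-specific per-object work would happen here. */

        /* Hand the whole burst over to the first downstream node. */
        rte_node_next_stream_move(graph, node, MY_NODE_NEXT_PASS);

        return nb_objs; /* Number of objects processed. */
    }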

Context memory:
^^^^^^^^^^^^^^^

It is memory allocated by the library to store the node-specific context
information. This memory will be used by the process(), init() and fini()
callbacks.

init():
^^^^^^^

The callback function that will be invoked by ``rte_graph_create()`` when
a node gets attached to a graph.

fini():
^^^^^^^

The callback function that will be invoked by ``rte_graph_destroy()`` when a
node gets detached from a graph.

Node name:
^^^^^^^^^^

It is the name of the node. When a node registers with the graph library, the
library returns an ID of type ``rte_node_t``. Either the ID or the name can be
used to look up the node; ``rte_node_from_name()`` and ``rte_node_id_to_name()``
are the node lookup functions.

nb_edges:
^^^^^^^^^

The number of downstream nodes connected to this node. The ``next_nodes[]``
stores the downstream node objects. The ``rte_node_edge_update()`` and
``rte_node_edge_shrink()`` functions shall be used to update the ``next_node[]``
objects. Consumers of the node APIs are free to update the ``next_node[]``
objects until ``rte_graph_create()`` is invoked.

next_node[]:
^^^^^^^^^^^^

The dynamic array to store the downstream nodes connected to this node. A
downstream node should not be the current node itself or a source node.

Source node:
^^^^^^^^^^^^

Source nodes are static nodes created using ``RTE_NODE_REGISTER`` by passing
``flags`` as ``RTE_NODE_SOURCE_F``.
While performing the graph walk, the ``process()`` function of all the source
nodes will be called first, so that these nodes can be used as input nodes for
a graph.

nb_xstats:
^^^^^^^^^^

The number of xstats that this node can report.
The ``xstat_desc[]`` stores the xstat descriptions which will later be propagated to stats.

xstat_desc[]:
^^^^^^^^^^^^^

The dynamic array to store the xstat descriptions that will be reported by this node.

Node creation and registration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Node implementer creates the node by implementing ops and attributes of
  ``struct rte_node_register``.

* The library registers the node by invoking RTE_NODE_REGISTER on library load
  using the constructor scheme. The constructor scheme is used here to support
  multi-process.
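
As an illustration, a node could be registered as sketched below; the node
name, callbacks and the second edge name are hypothetical, while ``pkt_drop``
is one of the inbuilt nodes described later in this guide:

.. code-block:: c

    #include <rte_graph.h>

    /* Fast-path callback, e.g. the my_node_process() sketch above. */
    static uint16_t my_node_process(struct rte_graph *graph,
                                    struct rte_node *node,
                                    void **objs, uint16_t nb_objs);

    static int
    my_node_init(const struct rte_graph *graph, struct rte_node *node)
    {
        /* Populate node->ctx (node-specific context memory) here. */
        return 0;
    }

    static struct rte_node_register my_node = {
        .name = "my_node",
        .process = my_node_process,
        .init = my_node_init,
        .nb_edges = 2,
        .next_nodes = {
            "pkt_drop",     /* Edge 0. */
            "my_next_node", /* Edge 1, hypothetical. */
        },
    };
    RTE_NODE_REGISTER(my_node);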

Link the Nodes to create the graph topology
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. _figure_link_the_nodes:

.. figure:: img/link_the_nodes.*

   Topology after linking the nodes

Once nodes are available to the program, the application or node public API
functions can link them together to create a complex packet processing graph.

There are multiple strategies to link the nodes.

Method (a):
^^^^^^^^^^^
Provide the ``next_nodes[]`` at the node registration time. See ``struct rte_node_register::nb_edges``.
This addresses the static node scheme, where one knows upfront the
``next_nodes[]`` of the node.

Method (b):
^^^^^^^^^^^
Use ``rte_node_edge_get()``, ``rte_node_edge_update()``, ``rte_node_edge_shrink()``
to update the ``next_nodes[]`` links for the node at runtime, but before graph
creation.

Method (c):
^^^^^^^^^^^
Use ``rte_node_clone()`` to clone an already existing node created using RTE_NODE_REGISTER.
When ``rte_node_clone()`` is invoked, the library clones all the attributes
of the node and creates a new one. The name of the cloned node shall be
``"parent_node_name-user_provided_name"``.

This method enables the use case of Rx and Tx nodes where multiple of those nodes
need to be cloned based on the number of CPUs available in the system.
The cloned nodes will be identical, except for the ``"context memory"``.
The context memory will hold the port and queue pair information in the case of
Rx and Tx ethdev nodes.
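
A minimal sketch of per-worker cloning (with an assumed parent node named
``"my_node"``) could look like this:

.. code-block:: c

    #include <stdio.h>

    #include <rte_debug.h>
    #include <rte_graph.h>
    #include <rte_lcore.h>

    static void
    clone_per_worker(void)
    {
        char name[RTE_NODE_NAMESIZE];
        rte_node_t parent, clone;
        unsigned int lcore_id;

        parent = rte_node_from_name("my_node"); /* Hypothetical parent. */

        RTE_LCORE_FOREACH_WORKER(lcore_id) {
            /* The resulting node name is "my_node-worker<N>". */
            snprintf(name, sizeof(name), "worker%u", lcore_id);
            clone = rte_node_clone(parent, name);
            if (clone == RTE_NODE_ID_INVALID)
                rte_panic("node clone failed\n");
        }
    }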

Create the graph object
~~~~~~~~~~~~~~~~~~~~~~~
Now that the nodes are linked, it is time to create a graph by including
the required nodes. The application can provide a set of node patterns to
form a graph object. The ``fnmatch()`` API is used underneath for the pattern
matching to include the required nodes. After the graph is created, any changes
to the nodes or the graph are not allowed.

The ``rte_graph_create()`` API shall be used to create the graph.

Example of a graph object creation:

.. code-block:: console

   {"ethdev_rx-0-0", "ip4*", "ethdev_tx-*"}

In the above example, a graph object will be created with the ethdev Rx
node of port 0 and queue 0, all ipv4* nodes in the system,
and the ethdev Tx nodes of all ports.
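
In C, an equivalent graph creation could look like the sketch below
(the graph name ``"worker0"`` is illustrative):

.. code-block:: c

    #include <string.h>

    #include <rte_common.h>
    #include <rte_graph.h>
    #include <rte_memory.h>

    static rte_graph_t
    create_graph(void)
    {
        static const char *patterns[] = {
            "ethdev_rx-0-0", "ip4*", "ethdev_tx-*",
        };
        struct rte_graph_param prm;

        memset(&prm, 0, sizeof(prm));
        prm.socket_id = SOCKET_ID_ANY;
        prm.nb_node_patterns = RTE_DIM(patterns);
        prm.node_patterns = patterns;

        /* Returns a valid graph id, or RTE_GRAPH_ID_INVALID on failure. */
        return rte_graph_create("worker0", &prm);
    }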

Graph models
~~~~~~~~~~~~
There are two different kinds of graph walking models. The user can select the
model using the ``rte_graph_worker_model_set()`` API. If the application decides
to use only one model, the fast path check can be avoided by defining the model
with RTE_GRAPH_MODEL_SELECT.
For example:

.. code-block:: c

  #define RTE_GRAPH_MODEL_SELECT RTE_GRAPH_MODEL_RTC
  #include "rte_graph_worker.h"

RTC (Run-To-Completion)
^^^^^^^^^^^^^^^^^^^^^^^
This is the default graph walking model. Specifically, the ``rte_graph_walk_rtc()``
and ``rte_node_enqueue*`` fast path API functions are designed to work on a single
core for better performance. The fast path API works on a graph object, so the
multi-core graph processing strategy is to create a graph object per worker.

Example:

Graph: node-0 -> node-1 -> node-2 @Core0.

.. code-block:: diff

    + - - - - - - - - - - - - - - - - - - - - - +
    '                  Core #0                  '
    '                                           '
    ' +--------+     +---------+     +--------+ '
    ' | Node-0 | --> | Node-1  | --> | Node-2 | '
    ' +--------+     +---------+     +--------+ '
    '                                           '
    + - - - - - - - - - - - - - - - - - - - - - +

Dispatch model
^^^^^^^^^^^^^^
The dispatch model enables a cross-core dispatching mechanism which employs
a scheduling work-queue to dispatch streams to other worker cores that are
associated with the destination node.

Use ``rte_graph_model_mcore_dispatch_lcore_affinity_set()`` to set lcore affinity
with the node.
Each worker core will have a replica of the graph. Use ``rte_graph_clone()``
to clone the graph for each worker and use
``rte_graph_model_mcore_dispatch_core_bind()`` to bind a graph with a worker core.

Example:

Graph topo: node-0 -> Core1; node-1 -> node-2; node-2 -> node-3.
Config graph: node-0 @Core0; node-1/3 @Core1; node-2 @Core2.

.. code-block:: diff

    + - - - - - -+     +- - - - - - - - - - - - - +     + - - - - - -+
    '  Core #0   '     '          Core #1         '     '  Core #2   '
    '            '     '                          '     '            '
    ' +--------+ '     ' +--------+    +--------+ '     ' +--------+ '
    ' | Node-0 | - - - ->| Node-1 |    | Node-3 |<- - - - | Node-2 | '
    ' +--------+ '     ' +--------+    +--------+ '     ' +--------+ '
    '            '     '     |                    '     '      ^     '
    + - - - - - -+     +- - -|- - - - - - - - - - +     + - - -|- - -+
                             |                                 |
                             + - - - - - - - - - - - - - - - - +

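Putting these calls together, a per-worker setup for the example above might
look like the following sketch. The graph and node names are illustrative,
and the exact prototypes of the dispatch model APIs should be verified
against the DPDK version in use:

.. code-block:: c

    #include <string.h>

    #include <rte_graph.h>
    #include <rte_graph_worker.h>

    static void
    setup_dispatch(rte_graph_t parent, unsigned int lcore1, unsigned int lcore2)
    {
        struct rte_graph_param prm;
        rte_graph_t g1;

        /* Select the dispatch model instead of the default RTC model. */
        rte_graph_worker_model_set(RTE_GRAPH_MODEL_MCORE_DISPATCH);

        /* Associate nodes with the lcores that should run them. */
        rte_graph_model_mcore_dispatch_lcore_affinity_set("node-1", lcore1);
        rte_graph_model_mcore_dispatch_lcore_affinity_set("node-2", lcore2);

        /* Clone the graph for a worker core and bind it to that core. */
        memset(&prm, 0, sizeof(prm));
        g1 = rte_graph_clone(parent, "worker1", &prm);
        rte_graph_model_mcore_dispatch_core_bind(g1, lcore1);
    }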

In fast path
~~~~~~~~~~~~
Typical fast-path code looks like below, where the application
gets the fast-path graph object using ``rte_graph_lookup()``
on the worker thread and runs ``rte_graph_walk()`` in a tight loop.

.. code-block:: c

    struct rte_graph *graph = rte_graph_lookup("worker0");

    while (!done) {
        rte_graph_walk(graph);
    }

Context update when graph walk in action
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The fast-path object for the node is ``struct rte_node``.

In the slow path, or while the graph walk is in action, the user may need to
update the context of a node and hence needs access to the ``struct rte_node *``
memory.

The ``rte_graph_foreach_node()``, ``rte_graph_node_get()`` and
``rte_graph_node_get_by_name()`` APIs can be used to get the
``struct rte_node *``. The ``rte_graph_foreach_node()`` iterator function works
on the ``struct rte_graph *`` fast-path graph object, while the others work on
the graph ID or name.
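
For instance, a node-specific context value could be updated from the slow
path as sketched below; the graph and node names and the meaning of the
context byte are illustrative:

.. code-block:: c

    #include <errno.h>

    #include <rte_graph.h>
    #include <rte_graph_worker.h>

    static int
    update_node_ctx(void)
    {
        struct rte_node *node;

        /* Look up the node "my_node" inside the graph named "worker0". */
        node = rte_graph_node_get_by_name("worker0", "my_node");
        if (node == NULL)
            return -ENOENT;

        /* node->ctx is RTE_NODE_CTX_SZ bytes of node-defined scratch
         * area; here the first byte is assumed to hold a port id.
         */
        node->ctx[0] = 0;

        return 0;
    }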

Get the node statistics using graph cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The user may need to know the aggregate statistics of a node across
multiple graph objects, especially in the situation where each graph object
is bound to a worker thread.

A graph cluster object is introduced for statistics.
The ``rte_graph_cluster_stats_create()`` API shall be used for creating a
graph cluster with multiple graph objects and ``rte_graph_cluster_stats_get()``
to get the aggregate node statistics.

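A sketch of creating such a cluster is shown below, assuming the per-worker
graphs were created with names matching the ``worker*`` pattern:

.. code-block:: c

    #include <stdio.h>
    #include <string.h>

    #include <rte_common.h>
    #include <rte_graph.h>
    #include <rte_memory.h>

    static struct rte_graph_cluster_stats *
    create_stats_cluster(void)
    {
        static const char *patterns[] = { "worker*" };
        struct rte_graph_cluster_stats_param prm;

        memset(&prm, 0, sizeof(prm));
        prm.socket_id = SOCKET_ID_ANY;
        prm.fn = NULL;   /* NULL selects the default print callback... */
        prm.f = stdout;  /* ...which dumps to this stream. */
        prm.nb_graph_patterns = RTE_DIM(patterns);
        prm.graph_patterns = patterns;

        return rte_graph_cluster_stats_create(&prm);
    }

A control thread can then call ``rte_graph_cluster_stats_get(stats, false)``
periodically to print the aggregate statistics.
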
An example statistics output from ``rte_graph_cluster_stats_get()``:

.. code-block:: diff

    +---------+-----------+-------------+---------------+-----------+---------------+-----------+
    |Node     |calls      |objs         |realloc_count  |objs/call  |objs/sec(10E6) |cycles/call|
    +---------+-----------+-------------+---------------+-----------+---------------+-----------+
    |node0    |12977424   |3322220544   |5              |256.000    |3047.151872    |20.0000    |
    |node1    |12977653   |3322279168   |0              |256.000    |3047.210496    |17.0000    |
    |node2    |12977696   |3322290176   |0              |256.000    |3047.221504    |17.0000    |
    |node3    |12977734   |3322299904   |0              |256.000    |3047.231232    |17.0000    |
    |node4    |12977784   |3322312704   |1              |256.000    |3047.243776    |17.0000    |
    |node5    |12977825   |3322323200   |0              |256.000    |3047.254528    |17.0000    |
    +---------+-----------+-------------+---------------+-----------+---------------+-----------+

Node writing guidelines
~~~~~~~~~~~~~~~~~~~~~~~

The ``process()`` function of a node is the fast-path function and needs
to be written carefully to achieve maximum performance.

Broadly speaking, there are two different types of nodes.

Static nodes
~~~~~~~~~~~~
The first kind of nodes are those that have a fixed ``next_nodes[]`` for the
complete burst (like ethdev_rx, ethdev_tx) and are simple to write.
The ``process()`` function can move the obj burst to the next node either using
``rte_node_next_stream_move()`` or using ``rte_node_next_stream_get()`` and
``rte_node_next_stream_put()``.

Intermediate nodes
~~~~~~~~~~~~~~~~~~
The second kind is ``intermediate nodes``, which decide the ``next_node[]``
to send to on a per-packet basis. In these nodes,

* Firstly, there has to be the best possible packet processing logic.

* Secondly, each packet needs to be queued to its next node.

This can be done using the ``rte_node_enqueue_[x1|x2|x4]()`` APIs if
the packets go to a single next node, or ``rte_node_enqueue_next()``, which
takes an array of next nodes.

In a scenario where multiple intermediate nodes are present, but most of the
time each node uses the same next node for all its packets, the cost of moving
every pointer from the current node's stream to the next node's stream can be
avoided. This is called the home run, and ``rte_node_next_stream_move()`` can be
used to just move the stream from the current node to the next node with the
least number of cycles. Since this shortcut applies only in the case where all
the packets are destined to the same next node, the node implementation should
also have worst-case handling where every packet could be going to a different
next node.

Example of intermediate node implementation with home run:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Start with the speculation that next_node = node->ctx.
   This could be the next_node that this node used in its previous invocation.

#. Get the next_node stream array with the required space using
   ``rte_node_next_stream_get(next_node, space)``.

#. While n_left_from > 0 (i.e. there are packets left to be sent), prefetch the
   next pkt_set and process the current pkt_set to find their next node.

#. If all the next nodes of the current pkt_set match the speculated next node,
   just count them as successfully speculated (``last_spec``) till now and
   continue the loop without actually moving them to the next node. Else, if
   there is a mismatch, copy all the pkt_set pointers that were ``last_spec``
   and move the current pkt_set to their respective next nodes using
   ``rte_node_enqueue_x1()``. Also, the speculated next_node can be updated to
   a more probable one. Finally, reset ``last_spec`` to zero.

#. If n_left_from != 0, then go to step 3 to process the remaining packets.

#. If last_spec == nb_objs, all the objects passed were successfully speculated
   to a single next node. So, the current stream can be moved to the next node
   using ``rte_node_next_stream_move(node, next_node)``.
   This is the ``home run`` where memcpy of buffer pointers to the next node is
   avoided.

#. Update the ``node->ctx`` with the more probable next node, as shown in the
   sketch below.

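Below is a simplified sketch of the above scheme. It omits the prefetching
and x2/x4 unrolling that production nodes use, ``classify_pkt()`` is a
hypothetical stand-in for the node's real per-packet lookup, and storing the
speculated edge in the first context byte is an assumption of this sketch:

.. code-block:: c

    #include <rte_common.h>
    #include <rte_graph.h>
    #include <rte_graph_worker.h>

    /* Hypothetical per-packet classification returning an edge index. */
    static rte_edge_t
    classify_pkt(void *obj)
    {
        RTE_SET_USED(obj); /* A real node would inspect the mbuf here. */
        return 0;
    }

    static uint16_t
    my_classify_process(struct rte_graph *graph, struct rte_node *node,
                        void **objs, uint16_t nb_objs)
    {
        rte_edge_t speculated = node->ctx[0]; /* Next node of last call. */
        uint16_t last_spec = 0;
        uint16_t i;

        /* Count the leading run of objects following the speculated edge. */
        while (last_spec < nb_objs &&
               classify_pkt(objs[last_spec]) == speculated)
            last_spec++;

        if (last_spec == nb_objs) {
            /* Home run: move the whole stream in one shot, without
             * copying any buffer pointers.
             */
            rte_node_next_stream_move(graph, node, speculated);
            return nb_objs;
        }

        /* Worst case: enqueue the speculated run in one call, then the
         * remaining objects one by one, preserving per-edge ordering.
         */
        rte_node_enqueue(graph, node, speculated, objs, last_spec);
        for (i = last_spec; i < nb_objs; i++)
            rte_node_enqueue_x1(graph, node, classify_pkt(objs[i]),
                                objs[i]);

        /* A production node would update node->ctx here if another edge
         * turned out to be more probable.
         */
        return nb_objs;
    }
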
Graph object memory layout
--------------------------
.. _figure_graph_mem_layout:

.. figure:: img/graph_mem_layout.*

   Memory layout

Understanding the memory layout helps to debug the graph library and
improve the performance if needed.

A graph object consists of a header, a circular buffer to store the pending
stream when walking over the graph, variable-length memory to store the
``rte_node`` objects, and variable-length memory to store the xstats reported
by each ``rte_node``.

``graph_nodes_mem_create()`` creates and populates this memory. Functions
such as ``rte_graph_walk()`` and ``rte_node_enqueue_*`` use this memory
to enable fast path services.

Inbuilt Nodes
-------------

DPDK provides a set of nodes for data processing.
The following diagram depicts the inbuilt nodes' data flow.

.. _figure_graph_inbuit_node_flow:

.. figure:: img/graph_inbuilt_node_flow.*

   Inbuilt nodes data flow

The following sections detail the individual inbuilt nodes.

ethdev_rx
~~~~~~~~~
This node does ``rte_eth_rx_burst()`` into the stream buffer passed to it
(the source node stream) and does ``rte_node_next_stream_move()`` only when
there are packets received. Each ``rte_node`` works only on the one Rx port and
queue that it gets from node->ctx. For each (port X, rx_queue Y),
a rte_node is cloned from ethdev_rx_base_node as ``ethdev_rx-X-Y`` in
``rte_node_eth_config()`` along with updating ``node->ctx``.
Each graph needs to be associated with a unique rte_node for a (port, rx_queue).

ethdev_tx
~~~~~~~~~
This node does ``rte_eth_tx_burst()`` for a burst of objs received by it.
It sends the burst to a fixed Tx port and queue, whose information comes
from node->ctx. For each port X, this ``rte_node`` is cloned from
ethdev_tx_node_base as "ethdev_tx-X" in ``rte_node_eth_config()``
along with updating node->context.

Since each graph doesn't need more than one Tx queue per port, a Tx queue is
assigned based on the graph ID to each rte_node instance. Each graph needs to
be associated with a rte_node for each port.

pkt_drop
~~~~~~~~
This node frees all the objects passed to it considering them as
``rte_mbufs`` that need to be freed.

ip4_lookup
~~~~~~~~~~
This node is an intermediate node that does LPM lookup for the received
ipv4 packets, and the result determines each packet's next node.

On a successful LPM lookup, the result contains the ``next_node`` id and
``next-hop`` id with which the packet needs to be further processed.

On LPM lookup failure, objects are redirected to the pkt_drop node.
``rte_node_ip4_route_add()`` is the control path API to add ipv4 routes.
To achieve the home run, the node uses ``rte_node_next_stream_move()`` as
mentioned in the sections above.

ip4_rewrite
~~~~~~~~~~~
This node gets packets from the ``ip4_lookup`` node, with the next-hop id for
each packet embedded in ``node_mbuf_priv1(mbuf)->nh``. This id is used
to determine the L2 header to be written to the packet before sending
the packet out to a particular ethdev_tx node.
``rte_node_ip4_rewrite_add()`` is the control path API to add next-hop info.
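
As an illustration, the control path could add a route and its rewrite data
as below; the addresses, next-hop id, port and MAC addresses are made up:

.. code-block:: c

    #include <rte_ip.h>
    #include <rte_node_ip4_api.h>

    static int
    add_route_and_rewrite(void)
    {
        /* L2 header (dst MAC, src MAC, ethertype) to prepend; made up. */
        uint8_t rewrite_data[] = {
            0x02, 0x00, 0x00, 0x00, 0x00, 0x01, /* dst MAC */
            0x02, 0x00, 0x00, 0x00, 0x00, 0x02, /* src MAC */
            0x08, 0x00,                         /* IPv4 ethertype */
        };
        int rc;

        /* 10.0.2.0/24 -> next-hop 0, continue to the ip4_rewrite node. */
        rc = rte_node_ip4_route_add(RTE_IPV4(10, 0, 2, 0), 24, 0,
                                    RTE_NODE_IP4_LOOKUP_NEXT_REWRITE);
        if (rc < 0)
            return rc;

        /* Attach the L2 header and output port 0 to next-hop 0. */
        return rte_node_ip4_rewrite_add(0, rewrite_data,
                                        sizeof(rewrite_data), 0);
    }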

ip4_reassembly
~~~~~~~~~~~~~~
This node is an intermediate node that reassembles ipv4 fragmented packets;
non-fragmented packets pass through the node unaffected.
The node rewrites its stream and moves it to the next node.
The fragment table and death row table should be set up via the
``rte_node_ip4_reassembly_configure`` API.

ip6_lookup
~~~~~~~~~~
This node is an intermediate node that does LPM lookup for the received
IPv6 packets, and the result determines each packet's next node.

On a successful LPM lookup, the result contains the ``next_node`` ID
and ``next-hop`` ID with which the packet needs to be further processed.

On LPM lookup failure, objects are redirected to the ``pkt_drop`` node.
``rte_node_ip6_route_add()`` is the control path API to add IPv6 routes.
To achieve the home run, the node uses ``rte_node_next_stream_move()``
as mentioned in the sections above.

ip6_rewrite
~~~~~~~~~~~
This node gets packets from the ``ip6_lookup`` node, with the next-hop ID
for each packet embedded in ``node_mbuf_priv1(mbuf)->nh``.
This ID is used to determine the L2 header to be written to the packet
before sending the packet out to a particular ``ethdev_tx`` node.
``rte_node_ip6_rewrite_add()`` is the control path API to add next-hop info.

null
~~~~
This node ignores the set of objects passed to it and reports that all are
processed.

kernel_tx
~~~~~~~~~
This node is an exit node that forwards the packets to the kernel.
It will be used to forward any control plane traffic to the kernel stack from DPDK.
It uses a raw socket interface to transmit the packets;
it uses the packet's destination IP address in a ``sockaddr_in`` address structure
and the ``sendto()`` function to send data on the raw socket.
After sending the burst of packets to the kernel,
this node frees up the packet buffers.

kernel_rx
~~~~~~~~~
This node is a source node which receives packets from the kernel
and forwards them to any of the intermediate nodes.
It uses the raw socket interface to receive packets from the kernel.
It uses the ``poll()`` function to poll on the socket fd
for ``POLLIN`` events to read the packets from the raw socket
into the stream buffer and does ``rte_node_next_stream_move()``
when there are received packets.

ip4_local
~~~~~~~~~
This node is an intermediate node that does a ``packet_type`` lookup for
the received ipv4 packets, and the result determines each packet's next node.

On a successful ``packet_type`` lookup, for any IPv4 protocol the result
contains the ``next_node`` id and ``next-hop`` id with which the packet
needs to be further processed.

On a packet_type lookup failure, objects are redirected to the ``pkt_drop`` node.
``rte_node_ip4_route_add()`` is the control path API to add an ipv4 address
with 32-bit depth to receive packets.
To achieve the home run, the node uses ``rte_node_next_stream_move()`` as
mentioned in the sections above.

udp4_input
~~~~~~~~~~
This node is an intermediate node that does a UDP destination port lookup for
the received ipv4 packets, and the result determines each packet's next node.

The user registers the ``udp4_input`` node with the graph library during
initialization, attaches a user-specified node as an edge to this node using
``rte_node_udp4_usr_node_add()``, and creates an empty hash table with the
destination port and node id as its fields.

After the successful addition of the user node as an edge, the edge id is
returned to the user.

The user registers the ip address in the ``ip4_lookup`` table with a 32-bit
mask for ip filtration using the API ``rte_node_ip4_route_add()``.

After the graph is created, the user updates the hash table with the custom
port and the previously obtained edge id using the API
``rte_node_udp4_dst_port_add()``.

When a packet is received, an LPM lookup is performed; if the ip matches, the
packet is handed over to the ip4_local node, where the packet is verified for
the UDP protocol, and on success the packet is enqueued to the ``udp4_input``
node.

A hash lookup is performed in the ``udp4_input`` node with the registered
destination port and the destination port in the UDP packet; on success, the
packet is handed to the ``udp_user_node``.
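
A control path sketch tying these steps together might look as follows; the
user node name and UDP port are illustrative, and the exact prototypes should
be checked against ``rte_node_udp4_input_api.h``:

.. code-block:: c

    #include <rte_ip.h>
    #include <rte_node_ip4_api.h>
    #include <rte_node_udp4_input_api.h>

    static int
    setup_udp4_input(void)
    {
        int edge;

        /* Attach the application's node as an edge of udp4_input; the
         * returned value is assumed to be the edge id (negative on
         * error).
         */
        edge = rte_node_udp4_usr_node_add("udp_user_node");
        if (edge < 0)
            return edge;

        /* Receive packets destined to 10.0.2.1/32 locally. */
        rte_node_ip4_route_add(RTE_IPV4(10, 0, 2, 1), 32, 0,
                               RTE_NODE_IP4_LOOKUP_NEXT_LOCAL);

        /* After graph creation: map UDP port 5000 to the user node. */
        return rte_node_udp4_dst_port_add(5000, (rte_edge_t)edge);
    }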
562