..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

.. _Multi-process_Support:

Multi-process Support
=====================

In the DPDK, multi-process support is designed to allow a group of DPDK processes
to work together in a simple transparent manner to perform packet processing,
or other workloads.
To support this functionality,
a number of additions have been made to the core DPDK Environment Abstraction Layer (EAL).

The EAL has been modified to allow different types of DPDK processes to be spawned,
each with different permissions on the hugepage memory used by the applications.
For now, there are two types of process specified:

*   primary processes, which can initialize shared memory and which have full permissions on it

*   secondary processes, which cannot initialize shared memory,
    but can attach to pre-initialized shared memory and create objects in it.

Standalone DPDK processes are primary processes,
while secondary processes can only run alongside a primary process or
after a primary process has already configured the hugepage shared memory for them.

.. note::

    Secondary processes should run alongside a primary process with the same DPDK version.

    Secondary processes that require access to physical devices in the primary process
    must be passed the same allow and block options.

To support these two process types, and other multi-process setups described later,
two additional command-line parameters are available to the EAL:

*   ``--proc-type``: specifies whether a given process instance is the primary or a secondary DPDK instance

*   ``--file-prefix``: allows processes that do not want to co-operate to have different memory regions

A number of example applications are provided that demonstrate how multiple DPDK processes can be used together.
These are more fully documented in the "Multi-process Sample Application" chapter
in the *DPDK Sample Applications User Guide*.

Memory Sharing
--------------

The key element in getting a multi-process application working using the DPDK is to ensure that
memory resources are properly shared among the processes making up the multi-process application.
Once there are blocks of shared memory available that can be accessed by multiple processes,
then issues such as inter-process communication (IPC) become much simpler.

On application start-up in a primary or standalone process,
the DPDK records to memory-mapped files the details of the memory configuration it is using: hugepages in use,
the virtual addresses they are mapped at, the number of memory channels present, etc.
When a secondary process is started, these files are read and the EAL recreates the same memory configuration
in the secondary process so that all memory zones are shared between processes and all pointers to that memory are valid,
and point to the same objects, in both processes.

.. note::

    Refer to `Multi-process Limitations`_ for details of
    how Linux kernel Address-Space Layout Randomization (ASLR) can affect memory sharing.

    If the primary process was run with the ``--legacy-mem`` or
    ``--single-file-segments`` switch, secondary processes must be run with the
    same switch specified. Otherwise, memory corruption may occur.

.. _figure_multi_process_memory:

.. figure:: img/multi_process_memory.*

   Memory Sharing in the DPDK Multi-process Sample Application


The EAL also supports an auto-detection mode (set by the EAL ``--proc-type=auto`` flag),
whereby a DPDK process is started as a secondary instance if a primary instance is already running.
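
As a minimal sketch of how an application might branch on its role once the EAL
has resolved the process type, the following queries ``rte_eal_process_type()``
after initialization. The two helper functions are hypothetical placeholders for
application code, not part of DPDK, and building the example requires the DPDK
headers and libraries.

.. code-block:: c

    #include <stdio.h>
    #include <stdlib.h>

    #include <rte_eal.h>

    /* Hypothetical application hooks, not part of DPDK. */
    static int setup_shared_objects(void) { return 0; }   /* primary: create rings, pools */
    static int attach_shared_objects(void) { return 0; }  /* secondary: look them up */

    int
    main(int argc, char **argv)
    {
        /* EAL options such as --proc-type=auto are consumed here. */
        int ret = rte_eal_init(argc, argv);

        if (ret < 0) {
            fprintf(stderr, "EAL initialization failed\n");
            return EXIT_FAILURE;
        }

        if (rte_eal_process_type() == RTE_PROC_PRIMARY)
            ret = setup_shared_objects();
        else
            ret = attach_shared_objects();

        rte_eal_cleanup();
        return ret == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
    }

Started twice with ``--proc-type=auto``, the first instance would be expected to
take the primary branch and the second the secondary branch.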

Deployment Models
-----------------

Symmetric/Peer Processes
~~~~~~~~~~~~~~~~~~~~~~~~

DPDK multi-process support can be used to create a set of peer processes where each process performs the same workload.
This model is equivalent to having multiple threads each running the same main-loop function,
as is done in most of the supplied DPDK sample applications.
In this model, the first of the processes spawned should be spawned using the ``--proc-type=primary`` EAL flag,
while all subsequent instances should be spawned using the ``--proc-type=secondary`` flag.

The ``simple_mp`` and ``symmetric_mp`` sample applications demonstrate this usage model.
They are described in the "Multi-process Sample Application" chapter in the *DPDK Sample Applications User Guide*.

Asymmetric/Non-Peer Processes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

An alternative deployment model that can be used for multi-process applications
is to have a single primary process instance that acts as a load-balancer or
server distributing received packets among worker or client instances, which are run as secondary processes.
In this case, extensive use is made of ``rte_ring`` objects, which are located in shared hugepage memory.
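
A minimal sketch of this pattern, assuming a single ring shared between the
primary and one secondary process: the primary creates the ring and the
secondary finds it by name. The ring name, the ring size and the
``get_work_ring()`` helper are illustrative choices, not taken from the sample
application.

.. code-block:: c

    #include <rte_eal.h>
    #include <rte_lcore.h>
    #include <rte_ring.h>

    #define WORK_RING_NAME "mp_work_ring"   /* hypothetical name agreed by both sides */
    #define WORK_RING_SIZE 1024             /* ring size must be a power of two */

    /* Returns the shared ring, or NULL if it could not be created or found. */
    static struct rte_ring *
    get_work_ring(void)
    {
        if (rte_eal_process_type() == RTE_PROC_PRIMARY)
            /* One producer (the primary) and one consumer (a secondary). */
            return rte_ring_create(WORK_RING_NAME, WORK_RING_SIZE,
                                   rte_socket_id(),
                                   RING_F_SP_ENQ | RING_F_SC_DEQ);

        /* Secondary: the ring already exists in shared hugepage memory. */
        return rte_ring_lookup(WORK_RING_NAME);
    }

Because both processes map the hugepage memory at the same virtual addresses,
object pointers enqueued by the primary remain valid when dequeued by the
secondary.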

The ``client_server_mp`` sample application shows this usage model.
It is described in the "Multi-process Sample Application" chapter in the *DPDK Sample Applications User Guide*.

Running Multiple Independent DPDK Applications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to the above scenarios involving multiple DPDK processes working together,
it is possible to run multiple DPDK processes concurrently,
where those processes are all working independently.
Support for this usage scenario is provided using the ``--file-prefix`` parameter to the EAL.

The EAL puts shared runtime files in a directory based on standard conventions.
If ``$RUNTIME_DIRECTORY`` is defined in the environment,
it is used (as ``$RUNTIME_DIRECTORY/dpdk``).
Otherwise, if DPDK is run as the root user, it uses ``/var/run/dpdk``;
if it is run as a non-root user, ``/tmp/dpdk`` (or ``$XDG_RUNTIME_DIR/dpdk``) is used.
Hugepage files on each hugetlbfs filesystem use the ``rtemap_X`` filename,
where X is in the range 0 to the maximum number of hugepages minus 1.
Similarly, the EAL creates shared configuration files, memory mapped in each process,
using the ``.rte_config`` filename.
The ``rte`` part of each of the above filenames is configurable using the ``--file-prefix`` parameter.
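
The directory selection described above can be sketched in plain C as follows.
This is an illustration of the stated rule only, not the EAL's own code: the
function names are hypothetical, the standard XDG variable name
``XDG_RUNTIME_DIR`` is used, and it is an assumption of this sketch that the XDG
directory is preferred over ``/tmp`` when both are available to a non-root user.

.. code-block:: c

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /*
     * Pick the runtime directory from explicit inputs so that the rule can be
     * exercised without touching the environment.
     * Returns 0 on success, -1 if buf is too small.
     */
    int
    runtime_dir_from(const char *runtime_directory, const char *xdg_runtime_dir,
                     int is_root, char *buf, size_t len)
    {
        const char *base;
        int n;

        if (runtime_directory != NULL)
            base = runtime_directory;    /* $RUNTIME_DIRECTORY takes priority */
        else if (is_root)
            base = "/var/run";
        else if (xdg_runtime_dir != NULL)
            base = xdg_runtime_dir;      /* assumed to be preferred over /tmp */
        else
            base = "/tmp";

        n = snprintf(buf, len, "%s/dpdk", base);
        return (n < 0 || (size_t)n >= len) ? -1 : 0;
    }

    /* Convenience wrapper that reads the real environment and user ID. */
    int
    runtime_dir(char *buf, size_t len)
    {
        return runtime_dir_from(getenv("RUNTIME_DIRECTORY"),
                                getenv("XDG_RUNTIME_DIR"),
                                geteuid() == 0, buf, len);
    }

For example, a non-root user with neither variable set would get ``/tmp/dpdk``.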

In addition to specifying the ``--file-prefix`` parameter,
any DPDK applications that are to be run side-by-side must explicitly limit their memory use.
This is less of a problem on Linux, as by default, applications will not
allocate more memory than they need. However, if ``--legacy-mem`` is used, DPDK
will attempt to preallocate all memory it can get to, and memory use must be
explicitly limited. This is done by passing the ``-m`` flag to each process to
specify how much hugepage memory, in megabytes, each process can use (or passing
``--socket-mem`` to specify how much hugepage memory on each socket each process
can use).

.. note::

    Independent DPDK instances running side-by-side on a single machine cannot share any network ports.
    Any network ports being used by one process should be blocked by every other process.

Running Multiple Independent Groups of DPDK Applications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the same way that it is possible to run independent DPDK applications side-by-side on a single system,
this can be trivially extended to multi-process groups of DPDK applications running side-by-side.
In this case, the secondary processes must use the same ``--file-prefix`` parameter
as the primary process whose shared memory they are connecting to.

.. note::

    All restrictions and issues with multiple independent DPDK processes running side-by-side
    apply in this usage scenario also.

Multi-process Limitations
-------------------------

There are a number of limitations to what can be done when running DPDK multi-process applications.
Some of these are documented below:

*   The multi-process feature requires that the exact same hugepage memory mappings be present in all applications.
    This makes secondary process startup generally unreliable. Disabling the
    Linux security feature Address-Space Layout Randomization (ASLR) may
    help to get more consistent mappings, but not necessarily more reliable ones:
    if the mappings are wrong, they will be consistently wrong!

.. warning::

    Disabling Address-Space Layout Randomization (ASLR) may have security implications,
    so it is recommended that it be disabled only when absolutely necessary,
    and only when the implications of this change have been understood.

*   All DPDK processes running as a single application and using shared memory must have distinct coremask/corelist arguments.
    It is not possible to have a primary and secondary instance, or two secondary instances,
    using any of the same logical cores.
    Attempting to do so can cause corruption of memory pool caches, among other issues.

*   The delivery of interrupts, such as Ethernet* device link status interrupts, does not work in secondary processes.
    All interrupts are triggered inside the primary process only.
    Any application needing interrupt notification in multiple processes should provide its own mechanism
    to transfer the interrupt information from the primary process to any secondary process that needs the information.

*   The use of function pointers between multiple processes running based on different compiled binaries is not supported,
    since the location of a given function in one process may be different to its location in a second.
    This prevents the librte_hash library from behaving properly in a multi-process instance,
    since it uses a pointer to the hash function internally.

To work around this issue, it is recommended that multi-process applications perform the hash calculations by directly calling
the hashing function from the code and then using the ``rte_hash_add_key_with_hash()`` and ``rte_hash_lookup_with_hash()`` functions
instead of the functions which do the hashing internally, such as ``rte_hash_add_key()`` and ``rte_hash_lookup()``.
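
The workaround could look like the following sketch, which computes the
signature locally with ``rte_jhash()`` and passes it to the ``*_with_hash``
variants declared in ``rte_hash.h``. It assumes that the table was created with
``rte_jhash`` as its hash function, with ``FLOW_HASH_SEED`` as
``hash_func_init_val`` and with a key length of ``sizeof(struct flow_key)``.
The key layout, the seed macro and the helper names are hypothetical.

.. code-block:: c

    #include <stdint.h>

    #include <rte_hash.h>
    #include <rte_jhash.h>

    /* Assumed to equal the hash_func_init_val used when the table was created. */
    #define FLOW_HASH_SEED 0

    struct flow_key {          /* hypothetical key layout, no padding */
        uint32_t src_ip;
        uint32_t dst_ip;
    };

    /* Add a key using a signature computed in this process. */
    static int32_t
    flow_add(const struct rte_hash *h, const struct flow_key *key)
    {
        hash_sig_t sig = rte_jhash(key, sizeof(*key), FLOW_HASH_SEED);

        return rte_hash_add_key_with_hash(h, key, sig);
    }

    /* Look up a key without going through the table's stored function pointer. */
    static int32_t
    flow_lookup(const struct rte_hash *h, const struct flow_key *key)
    {
        hash_sig_t sig = rte_jhash(key, sizeof(*key), FLOW_HASH_SEED);

        return rte_hash_lookup_with_hash(h, key, sig);
    }

Both helpers return a non-negative key index on success and a negative value on
error, as the underlying library calls do.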

*   Depending upon the hardware in use, and the number of DPDK processes used,
    it may not be possible to have HPET timers available in each DPDK instance.
    The minimum number of HPET comparators available to Linux* userspace can be just a single comparator,
    which means that only the first, primary DPDK process instance can open and mmap ``/dev/hpet``.
    If the number of required DPDK processes exceeds the number of available HPET comparators,
    the TSC (which is the default timer in this release) must be used as a time source across all processes instead of the HPET.

Communication between multiple processes
----------------------------------------

While there are multiple ways one can approach inter-process communication in
DPDK, there is also a native DPDK IPC API available. It is not intended to be
performance-critical, but rather is intended to be a convenient, general
purpose API to exchange short messages between primary and secondary processes.

The DPDK IPC API supports the following communication modes:

* Unicast message from secondary to primary
* Broadcast message from primary to all secondaries

In other words, any IPC message sent in a primary process will be delivered to
all secondaries, while any IPC message sent in a secondary process will only be
delivered to the primary process. Unicast from primary to secondary or from
secondary to secondary is not supported.

There are three types of communication available within the DPDK IPC API:

* Message
* Synchronous request
* Asynchronous request

A "message" type does not expect a response and is meant to be a best-effort
notification mechanism, while the two types of "requests" are meant to be a
two-way communication mechanism, with the requester expecting a response from
the other side.

Both messages and requests will trigger a named callback on the receiver side.
These callbacks will be called from within a dedicated IPC thread or the
interrupt thread, neither of which is an EAL lcore thread.

Registering for incoming messages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before any messages can be received, a callback needs to be registered.
This is accomplished by calling the ``rte_mp_action_register()`` function. This
function accepts a unique callback name, and a function pointer to a callback
that will be called when a message or a request matching this callback name
arrives.

If the application is no longer willing to receive messages intended for a
specific callback function, the ``rte_mp_action_unregister()`` function can be
called to ensure that the callback will not be triggered again.
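
A minimal sketch of registering and later unregistering a callback is shown
below. The callback name ``app_stats`` and the helper names are hypothetical,
and the example needs the DPDK headers to build.

.. code-block:: c

    #include <stdio.h>

    #include <rte_eal.h>

    #define MP_ACTION_STATS "app_stats"   /* hypothetical callback name */

    /* Runs in the IPC thread, not on an EAL lcore. */
    static int
    handle_stats_msg(const struct rte_mp_msg *msg, const void *peer)
    {
        (void)peer;   /* only needed when replying to a request */
        printf("received '%s' with %d bytes of data\n",
               msg->name, msg->len_param);
        return 0;
    }

    /* Returns 0 on success, a negative value on failure. */
    static int
    stats_ipc_start(void)
    {
        return rte_mp_action_register(MP_ACTION_STATS, handle_stats_msg);
    }

    static void
    stats_ipc_stop(void)
    {
        rte_mp_action_unregister(MP_ACTION_STATS);
    }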

Sending messages
~~~~~~~~~~~~~~~~

To send a message, a ``rte_mp_msg`` descriptor must be populated first. The
fields to be populated are as follows:

* ``name`` - message name. This name must match the receiver's callback name.
* ``param`` - message data (up to 256 bytes).
* ``len_param`` - length of message data.
* ``fds`` - file descriptors to pass along with the data (up to 8 fds).
* ``num_fds`` - number of file descriptors to send.

Once the structure is populated, calling ``rte_mp_sendmsg()`` will send the
descriptor either to all secondary processes (if sent from the primary process), or
to the primary process (if sent from a secondary process). The function will return
a value indicating whether sending the message succeeded or not.
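
As an illustration, the sketch below fills in a descriptor and sends it. The
``stats_update`` payload, the ``app_stats`` name and the ``send_stats()`` helper
are hypothetical; only the ``rte_mp_msg`` fields and ``rte_mp_sendmsg()`` come
from the API described above.

.. code-block:: c

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #include <rte_eal.h>

    struct stats_update {        /* hypothetical payload, well under 256 bytes */
        uint64_t rx_packets;
        uint64_t tx_packets;
    };

    /* Returns the result of rte_mp_sendmsg(): 0 on success, negative on failure. */
    static int
    send_stats(const struct stats_update *update)
    {
        struct rte_mp_msg msg;

        memset(&msg, 0, sizeof(msg));
        /* Must match the name registered by the receiver. */
        snprintf(msg.name, sizeof(msg.name), "%s", "app_stats");
        memcpy(msg.param, update, sizeof(*update));
        msg.len_param = sizeof(*update);
        msg.num_fds = 0;         /* no file descriptors attached */

        return rte_mp_sendmsg(&msg);
    }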

Sending requests
~~~~~~~~~~~~~~~~

Sending requests involves waiting for the other side to reply, so they can block
for a relatively long time.

To send a request, a message descriptor ``rte_mp_msg`` must be populated.
Additionally, a ``timespec`` value must be specified as a timeout, after which
IPC will stop waiting and return.

For synchronous requests, the ``rte_mp_reply`` descriptor must also be created.
This is where the responses will be stored.
The fields that will be populated by IPC are as follows:

* ``nb_sent`` - number indicating how many requests were sent (i.e. how many
  peer processes were active at the time of the request).
* ``nb_received`` - number indicating how many responses were received (i.e. of
  those peer processes that were active at the time of the request, how many
  have replied).
* ``msgs`` - pointer to where all of the responses are stored. The order in
  which responses appear is undefined. When doing synchronous requests, this
  memory must be freed by the requestor after the request completes!
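
A synchronous request might be issued as in the following sketch. The request
name ``app_query``, the five-second timeout and the ``query_peers()`` helper are
illustrative assumptions. The sketch also assumes that a failed call leaves no
reply buffer for the caller to free, which should be checked against the DPDK
release in use.

.. code-block:: c

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #include <rte_eal.h>

    /* Returns 0 if every active peer replied, -1 otherwise. */
    static int
    query_peers(void)
    {
        struct rte_mp_msg req;
        struct rte_mp_reply reply;
        struct timespec timeout = { .tv_sec = 5, .tv_nsec = 0 };
        int i, complete;

        memset(&req, 0, sizeof(req));
        memset(&reply, 0, sizeof(reply));
        snprintf(req.name, sizeof(req.name), "%s", "app_query");

        /* Blocks until all peers have replied or the timeout expires. */
        if (rte_mp_request_sync(&req, &reply, &timeout) < 0)
            return -1;   /* assumption: nothing left to free on failure */

        for (i = 0; i < reply.nb_received; i++)
            printf("reply %d carries %d bytes\n", i, reply.msgs[i].len_param);

        complete = (reply.nb_received == reply.nb_sent);
        free(reply.msgs);   /* the requestor owns the response array */
        return complete ? 0 : -1;
    }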

For asynchronous requests, a function pointer to the callback function must be
provided instead. This callback will be called when the request has either timed
out or received a response to all the messages that were sent.

.. warning::

    When an asynchronous request times out, the callback will be called not by
    a dedicated IPC thread, but rather from the EAL interrupt thread. Because of
    this, it may not be possible for DPDK to trigger another interrupt-based
    event (such as an alarm) while handling an asynchronous IPC callback.

When the callback is called, the original request descriptor will be provided
(so that it is possible to determine which sent message this callback belongs
to), along with a response descriptor like the one described above.
When doing asynchronous requests, there is no need to free the resulting
``rte_mp_reply`` descriptor.
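
The asynchronous variant could be used as sketched below. As before, the
``app_query`` name, the timeout value and the function names are hypothetical.

.. code-block:: c

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #include <rte_eal.h>

    /* Called once all replies have arrived or the request has timed out. */
    static int
    on_query_done(const struct rte_mp_msg *request,
                  const struct rte_mp_reply *reply)
    {
        if (reply->nb_received < reply->nb_sent)
            printf("'%s': only %d of %d peers replied\n",
                   request->name, reply->nb_received, reply->nb_sent);

        /* The reply descriptor is owned by the IPC layer: do not free it. */
        return 0;
    }

    /* Returns immediately; on_query_done() runs later in another thread. */
    static int
    query_peers_async(void)
    {
        struct rte_mp_msg req;
        struct timespec timeout = { .tv_sec = 5, .tv_nsec = 0 };

        memset(&req, 0, sizeof(req));
        snprintf(req.name, sizeof(req.name), "%s", "app_query");

        return rte_mp_request_async(&req, &timeout, on_query_done);
    }

Since the callback may run in the interrupt thread after a timeout, it is kept
short and does not wait on any other interrupt-based event.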

Receiving and responding to messages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To receive a message, a named callback must be registered using the
``rte_mp_action_register()`` function. The name of the callback must match the
``name`` field in the sender's ``rte_mp_msg`` message descriptor in order for this
message to be delivered and for the callback to be triggered.

The callback's type is ``rte_mp_t``, and it takes the incoming message
pointer ``msg`` and an opaque pointer ``peer``. The contents of ``msg`` will be
identical to those sent by the sender.

If a response is required, a new ``rte_mp_msg`` message descriptor must be
constructed and sent via the ``rte_mp_reply()`` function, along with the ``peer``
pointer. The resulting response will then be delivered to the correct requestor.
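
A request handler that always answers might look like the sketch below. The
``app_query`` name and the 32-bit status payload are hypothetical, and reusing
the request name for the response is an assumption of this sketch about how the
response is matched to the pending request.

.. code-block:: c

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #include <rte_eal.h>

    /* Runs in the IPC thread; replies to every request it receives. */
    static int
    handle_query(const struct rte_mp_msg *msg, const void *peer)
    {
        struct rte_mp_msg resp;
        uint32_t status = 0;   /* hypothetical application-defined status code */

        memset(&resp, 0, sizeof(resp));
        /* Assumption: the response carries the same name as the request. */
        snprintf(resp.name, sizeof(resp.name), "%s", msg->name);
        memcpy(resp.param, &status, sizeof(status));
        resp.len_param = sizeof(status);

        /* peer identifies the requestor and is passed through unchanged. */
        return rte_mp_reply(&resp, peer);
    }

    static int
    query_ipc_start(void)
    {
        return rte_mp_action_register("app_query", handle_query);
    }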

.. warning::

    Simply returning a value when processing a request callback will not send a
    response to the request: a response must always be sent explicitly, even in
    case of errors. Failing to do so will cause the requestor to time out while
    waiting on a response. Implementation of error signalling rests with the
    application; there is no built-in way to indicate success or error for a
    request.

Misc considerations
~~~~~~~~~~~~~~~~~~~

Due to the underlying IPC implementation being single-threaded, recursive
requests (i.e. sending a request while responding to another request) are not
supported. However, since sending messages (not requests) does not involve an
IPC thread, sending messages while processing another message or request is
supported.

Since the memory subsystem uses IPC internally, memory allocations and IPC must
not be mixed: it is not safe to use IPC inside a memory-related callback, nor is
it safe to allocate/free memory inside IPC callbacks. Attempting to do so may
lead to a deadlock.

Asynchronous request callbacks may be triggered either from the IPC thread or from
the interrupt thread, depending on whether the request has timed out. It is
therefore suggested to avoid waiting for interrupt-based events (such as alarms)
inside asynchronous IPC request callbacks. This limitation does not apply to
messages or synchronous requests.

If callbacks spend a long time processing the incoming requests, the requestor
might time out, so setting the right timeout value on the requestor side is
imperative.

If some of the messages timed out, the ``nb_sent`` and ``nb_received`` fields in the
``rte_mp_reply`` descriptor will not have matching values. This is not treated
as an error by the IPC API, and the user is responsible
for deciding how to handle such cases.

If a callback has been registered, IPC will assume that it is safe to call it.
This is important when registering callbacks during DPDK initialization.
During initialization, IPC will consider the receiving side as non-existent if
the callback has not been registered yet. However, once the callback has been
registered, IPC expects that it is safe to trigger it, even if the
rest of the DPDK initialization hasn't finished yet.