..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

Multi-process Support
=====================

In the DPDK, multi-process support is designed to allow a group of DPDK processes
to work together in a simple transparent manner to perform packet processing,
or other workloads.
To support this functionality,
a number of additions have been made to the core DPDK Environment Abstraction Layer (EAL).

The EAL has been modified to allow different types of DPDK processes to be spawned,
each with different permissions on the hugepage memory used by the applications.
For now, there are two types of process specified:

* primary processes, which can initialize and which have full permissions on shared memory

* secondary processes, which cannot initialize shared memory,
  but can attach to pre-initialized shared memory and create objects in it.

Standalone DPDK processes are primary processes,
while secondary processes can only run alongside a primary process or
after a primary process has already configured the hugepage shared memory for them.

.. note::

   Secondary processes should run alongside a primary process that uses the same DPDK version.

   Secondary processes that require access to physical devices in the primary process must
   be passed the same allow and block options.

To support these two process types, and other multi-process setups described later,
two additional command-line parameters are available to the EAL:

* ``--proc-type``: for specifying a given process instance as the primary or secondary DPDK instance

* ``--file-prefix``: to allow processes that do not want to co-operate to have different memory regions

A number of example applications are provided that demonstrate how multiple DPDK processes can be used together.
These are more fully documented in the "Multi-process Sample Application" chapter
in the *DPDK Sample Application's User Guide*.

Memory Sharing
--------------

The key element in getting a multi-process application working using the DPDK is to ensure that
memory resources are properly shared among the processes making up the multi-process application.
Once there are blocks of shared memory available that can be accessed by multiple processes,
then issues such as inter-process communication (IPC) become much simpler.

On application start-up in a primary or standalone process,
the DPDK records to memory-mapped files the details of the memory configuration it is using - hugepages in use,
the virtual addresses they are mapped at, the number of memory channels present, etc.
When a secondary process is started, these files are read and the EAL recreates the same memory configuration
in the secondary process so that all memory zones are shared between processes and all pointers to that memory are valid,
and point to the same objects, in both processes.

.. note::

   Refer to `Multi-process Limitations`_ for details of
   how Linux kernel Address-Space Layout Randomization (ASLR) can affect memory sharing.

   If the primary process was run with the ``--legacy-mem`` or
   ``--single-file-segments`` switch, secondary processes must be run with the
   same switch specified. Otherwise, memory corruption may occur.

.. _figure_multi_process_memory:

.. figure:: img/multi_process_memory.*

   Memory Sharing in the DPDK Multi-process Sample Application


The EAL also supports an auto-detection mode (set by the EAL ``--proc-type=auto`` flag),
whereby a DPDK process is started as a secondary instance if a primary instance is already running.

Deployment Models
-----------------

Symmetric/Peer Processes
~~~~~~~~~~~~~~~~~~~~~~~~

DPDK multi-process support can be used to create a set of peer processes where each process performs the same workload.
This model is equivalent to having multiple threads each running the same main-loop function,
as is done in most of the supplied DPDK sample applications.
In this model, the first of the processes spawned should be spawned using the ``--proc-type=primary`` EAL flag,
while all subsequent instances should be spawned using the ``--proc-type=secondary`` flag.

The ``simple_mp`` and ``symmetric_mp`` sample applications demonstrate this usage model.
They are described in the "Multi-process Sample Application" chapter in the *DPDK Sample Application's User Guide*.

Asymmetric/Non-Peer Processes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

An alternative deployment model that can be used for multi-process applications
is to have a single primary process instance that acts as a load-balancer or
server distributing received packets among worker or client threads, which are run as secondary processes.
In this case, extensive use is made of ``rte_ring`` objects, which are located in shared hugepage memory.

The ``client_server_mp`` sample application shows this usage model.
It is described in the "Multi-process Sample Application" chapter in the *DPDK Sample Application's User Guide*.

Running Multiple Independent DPDK Applications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to the above scenarios involving multiple DPDK processes working together,
it is possible to run multiple DPDK processes concurrently,
where those processes are all working independently.
Support for this usage scenario is provided using the ``--file-prefix`` parameter to the EAL.

The EAL puts shared runtime files in a directory based on standard conventions.
If ``$RUNTIME_DIRECTORY`` is defined in the environment,
it is used (as ``$RUNTIME_DIRECTORY/dpdk``).
Otherwise, if DPDK is run as the root user, it uses ``/var/run/dpdk``;
if run as a non-root user, ``/tmp/dpdk`` (or ``$XDG_RUNTIME_DIRECTORY/dpdk``) is used.
Hugepage files on each hugetlbfs filesystem use the ``rtemap_X`` filename,
where X is in the range 0 to the maximum number of hugepages minus 1.
Similarly, the EAL creates shared configuration files, memory mapped in each process,
using the ``.rte_config`` filename.
The ``rte`` part of each of the above filenames is configurable using the ``--file-prefix`` parameter.

In addition to specifying the ``--file-prefix`` parameter,
any DPDK applications that are to be run side-by-side must explicitly limit their memory use.
This is less of a problem on Linux, as by default, applications will not
allocate more memory than they need. However, if ``--legacy-mem`` is used, DPDK
will attempt to preallocate all memory it can get to, and memory use must be
explicitly limited. This is done by passing the ``-m`` flag to each process to
specify how much hugepage memory, in megabytes, each process can use (or passing
``--socket-mem`` to specify how much hugepage memory on each socket each process
can use).

.. note::

   Independent DPDK instances running side-by-side on a single machine cannot share any network ports.
   Any network ports being used by one process should be blocked by every other process.

Running Multiple Independent Groups of DPDK Applications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the same way that it is possible to run independent DPDK applications side-by-side on a single system,
this can be trivially extended to multi-process groups of DPDK applications running side-by-side.
In this case, the secondary processes must use the same ``--file-prefix`` parameter
as the primary process whose shared memory they are connecting to.

.. note::

   All restrictions and issues with multiple independent DPDK processes running side-by-side
   apply in this usage scenario also.
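
The file-naming convention described under *Running Multiple Independent DPDK Applications*
determines which files the processes of one group share, so it may help to see it spelled out.
The following is a minimal sketch in plain C, not EAL code: the ``demo_*`` helpers are hypothetical names,
the inputs are passed in explicitly instead of being read from the environment,
and the choice to prefer the per-user runtime directory over ``/tmp`` when one is supplied is an assumption.

```c
#include <stdio.h>

/* Returns 0 when snprintf() wrote the whole string, -1 on error or truncation. */
static int demo_fits(int written, size_t out_len)
{
    return (written < 0 || (size_t)written >= out_len) ? -1 : 0;
}

/*
 * Hypothetical helper (not part of the EAL): pick the runtime directory
 * in the order given in the text.
 *   runtime_directory - value of $RUNTIME_DIRECTORY, or NULL if unset
 *   user_runtime_dir  - per-user runtime directory, or NULL if unset
 *   is_root           - non-zero when running as the root user
 */
static int demo_runtime_dir(const char *runtime_directory,
                            const char *user_runtime_dir,
                            int is_root, char *out, size_t out_len)
{
    if (runtime_directory != NULL && runtime_directory[0] != '\0')
        return demo_fits(snprintf(out, out_len, "%s/dpdk", runtime_directory), out_len);
    if (is_root)
        return demo_fits(snprintf(out, out_len, "/var/run/dpdk"), out_len);
    /* Assumption: a per-user runtime directory wins over /tmp when supplied. */
    if (user_runtime_dir != NULL && user_runtime_dir[0] != '\0')
        return demo_fits(snprintf(out, out_len, "%s/dpdk", user_runtime_dir), out_len);
    return demo_fits(snprintf(out, out_len, "/tmp/dpdk"), out_len);
}

/* Hugepage file name: "<prefix>map_<index>", i.e. "rtemap_0" for the default prefix. */
static int demo_hugefile_name(const char *prefix, unsigned int index,
                              char *out, size_t out_len)
{
    return demo_fits(snprintf(out, out_len, "%smap_%u", prefix, index), out_len);
}

/* Shared configuration file name: ".<prefix>_config". */
static int demo_config_name(const char *prefix, char *out, size_t out_len)
{
    return demo_fits(snprintf(out, out_len, ".%s_config", prefix), out_len);
}
```

With the default ``rte`` prefix, the helpers produce ``rtemap_0`` and ``.rte_config``.
With a hypothetical ``--file-prefix=grp1``, they produce ``grp1map_0`` and ``.grp1_config``,
which is why two groups using different prefixes do not touch each other's files.
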

Multi-process Limitations
-------------------------

There are a number of limitations to what can be done when running DPDK multi-process applications.
Some of these are documented below:

* The multi-process feature requires that the exact same hugepage memory mappings be present in all applications.
  This makes secondary process startup generally unreliable. Disabling
  the Linux security feature Address-Space Layout Randomization (ASLR) may
  help to get more consistent mappings, but not necessarily more reliable ones:
  if the mappings are wrong, they will be consistently wrong.

  .. warning::

     Disabling Address-Space Layout Randomization (ASLR) may have security implications,
     so it is recommended that it be disabled only when absolutely necessary,
     and only when the implications of this change have been understood.

* All DPDK processes running as a single application and using shared memory must have distinct coremask/corelist arguments.
  It is not possible to have a primary and secondary instance, or two secondary instances,
  using any of the same logical cores.
  Attempting to do so can cause corruption of memory pool caches, among other issues.

* The delivery of interrupts, such as Ethernet device link status interrupts, does not work in secondary processes.
  All interrupts are triggered inside the primary process only.
  Any application needing interrupt notification in multiple processes should provide its own mechanism
  to transfer the interrupt information from the primary process to any secondary process that needs the information.

* The use of function pointers between multiple processes running different compiled binaries is not supported,
  since the location of a given function in one process may be different to its location in a second.
  This prevents the librte_hash library from behaving properly in a multi-process instance,
  since it uses a pointer to the hash function internally.

  To work around this issue, it is recommended that multi-process applications perform the hash calculations by directly calling
  the hashing function from the code and then using the ``rte_hash_add_key_with_hash()``/``rte_hash_lookup_with_hash()`` functions
  instead of the functions which do the hashing internally, such as ``rte_hash_add_key()``/``rte_hash_lookup()``.

* Depending upon the hardware in use, and the number of DPDK processes used,
  it may not be possible to have HPET timers available in each DPDK instance.
  The minimum number of HPET comparators available to Linux userspace can be just a single comparator,
  which means that only the first, primary DPDK process instance can open and mmap ``/dev/hpet``.
  If the number of required DPDK processes exceeds the number of available HPET comparators,
  the TSC (which is the default timer in this release) must be used as a time source across all processes instead of the HPET.

Communication between multiple processes
----------------------------------------

While there are multiple ways one can approach inter-process communication in
DPDK, there is also a native DPDK IPC API available. It is not intended to be
performance-critical, but rather is intended to be a convenient, general
purpose API to exchange short messages between primary and secondary processes.

The DPDK IPC API supports the following communication modes:

* Unicast message from secondary to primary
* Broadcast message from primary to all secondaries

In other words, any IPC message sent in a primary process will be delivered to
all secondaries, while any IPC message sent in a secondary process will only be
delivered to the primary process. Unicast from primary to secondary or from
secondary to secondary is not supported.
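
The two delivery modes above amount to a small routing rule.
The following plain-C sketch models that rule only; it does not call the DPDK IPC API,
and the ``demo_*`` names and the role enum are hypothetical stand-ins introduced for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two process roles; not the EAL's own type. */
enum demo_proc_role { DEMO_PRIMARY, DEMO_SECONDARY };

/* True when an IPC message sent by `sender` is delivered to `receiver`. */
static bool demo_ipc_delivers(enum demo_proc_role sender,
                              enum demo_proc_role receiver)
{
    if (sender == DEMO_PRIMARY)
        return receiver == DEMO_SECONDARY;  /* broadcast to every secondary */
    return receiver == DEMO_PRIMARY;        /* unicast to the primary only */
}

/* Count how many other processes in `group` receive a message from group[sender_idx]. */
static size_t demo_count_recipients(const enum demo_proc_role *group,
                                    size_t n, size_t sender_idx)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++) {
        if (i != sender_idx && demo_ipc_delivers(group[sender_idx], group[i]))
            count++;
    }
    return count;
}
```

In a group of one primary and three secondaries, a message from the primary reaches three processes,
while a message from any secondary reaches exactly one (the primary).
An application that needs secondary-to-secondary delivery would have to relay through the primary or use another mechanism.
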

There are three types of communication available within the DPDK IPC API:

* Message
* Synchronous request
* Asynchronous request

A "message" type does not expect a response and is meant to be a best-effort
notification mechanism, while the two types of "requests" are meant to be a two
way communication mechanism, with the requester expecting a response from the
other side.

Both messages and requests will trigger a named callback on the receiver side.
These callbacks will be called from within a dedicated IPC or interrupt thread
that is not one of the EAL lcore threads.

Registering for incoming messages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before any messages can be received, a callback will need to be registered.
This is accomplished by calling the ``rte_mp_action_register()`` function. This
function accepts a unique callback name, and a function pointer to a callback
that will be called when a message or a request matching this callback name
arrives.

If the application is no longer willing to receive messages intended for a
specific callback function, the ``rte_mp_action_unregister()`` function can be
called to ensure that the callback will not be triggered again.

Sending messages
~~~~~~~~~~~~~~~~

To send a message, a ``rte_mp_msg`` descriptor must be populated first. The
fields to be populated are as follows:

* ``name`` - message name. This name must match the receiver's callback name.
* ``param`` - message data (up to 256 bytes).
* ``len_param`` - length of message data.
* ``fds`` - file descriptors to pass along with the data (up to 8 fds).
* ``num_fds`` - number of file descriptors to send.

Once the structure is populated, calling ``rte_mp_sendmsg()`` will send the
descriptor either to all secondary processes (if sent from primary process), or
to primary process (if sent from secondary process).
The function will return
a value indicating whether sending the message succeeded or not.

Sending requests
~~~~~~~~~~~~~~~~

Sending requests involves waiting for the other side to reply, so they can block
for a relatively long time.

To send a request, a message descriptor ``rte_mp_msg`` must be populated.
Additionally, a ``timespec`` value must be specified as a timeout, after which
IPC will stop waiting and return.

For synchronous requests, the ``rte_mp_reply`` descriptor must also be created.
This is where the responses will be stored.
The fields that will be populated by IPC are as follows:

* ``nb_sent`` - number indicating how many requests were sent (i.e. how many
  peer processes were active at the time of the request).
* ``nb_received`` - number indicating how many responses were received (i.e. of
  those peer processes that were active at the time of request, how many have
  replied).
* ``msgs`` - pointer to where all of the responses are stored. The order in
  which responses appear is undefined. When doing synchronous requests, this
  memory must be freed by the requestor after the request completes.

For asynchronous requests, a function pointer to the callback function must be
provided instead. This callback will be called when the request either has timed
out, or has received a response to all the messages that were sent.

.. warning::

   When an asynchronous request times out, the callback will be called not by
   a dedicated IPC thread, but rather from the EAL interrupt thread. Because of
   this, it may not be possible for DPDK to trigger another interrupt-based
   event (such as an alarm) while handling an asynchronous IPC callback.

When the callback is called, the original request descriptor will be provided
(so that it is possible to determine which sent message this callback
corresponds to), along with a response descriptor like the one described above.
When doing asynchronous requests, there is no need to free the resulting
``rte_mp_reply`` descriptor.

Receiving and responding to messages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To receive a message, a named callback must be registered using the
``rte_mp_action_register()`` function. The name of the callback must match the
``name`` field in the sender's ``rte_mp_msg`` message descriptor in order for this
message to be delivered and for the callback to be triggered.

The callback's definition is ``rte_mp_t``, and consists of the incoming message
pointer ``msg``, and an opaque pointer ``peer``. Contents of ``msg`` will be
identical to the ones sent by the sender.

If a response is required, a new ``rte_mp_msg`` message descriptor must be
constructed and sent via the ``rte_mp_reply()`` function, along with the ``peer``
pointer. The resulting response will then be delivered to the correct requestor.

.. warning::

   Simply returning a value when processing a request callback will not send a
   response to the request - it must always be explicitly sent even in case
   of errors. Implementation of error signalling rests with the application:
   there is no built-in way to indicate success or error for a request. Failing
   to send a response will cause the requestor to time out while waiting for it.

Misc considerations
~~~~~~~~~~~~~~~~~~~~~~~~

Due to the underlying IPC implementation being single-threaded, recursive
requests (i.e. sending a request while responding to another request) are not
supported. However, since sending messages (not requests) does not involve an
IPC thread, sending messages while processing another message or request is
supported.

Since the memory subsystem uses IPC internally, memory allocations and IPC must
not be mixed: it is not safe to use IPC inside a memory-related callback, nor is
it safe to allocate/free memory inside IPC callbacks. Attempting to do so may
lead to a deadlock.

Asynchronous request callbacks may be triggered either from the IPC thread or from
the interrupt thread, depending on whether the request has timed out. It is
therefore suggested to avoid waiting for interrupt-based events (such as alarms)
inside asynchronous IPC request callbacks. This limitation does not apply to
messages or synchronous requests.

If callbacks spend a long time processing the incoming requests, the requestor
might time out, so setting the right timeout value on the requestor side is
imperative.

If some of the messages timed out, the ``nb_sent`` and ``nb_received`` fields in the
``rte_mp_reply`` descriptor will not have matching values. This is not treated
as an error by the IPC API, and it is expected that the user will be responsible
for deciding how to handle such cases.

If a callback has been registered, IPC will assume that it is safe to call it.
This is important when registering callbacks during DPDK initialization.
During initialization, IPC will consider the receiving side as non-existing if
the callback has not been registered yet. However, once the callback has been
registered, it is expected that IPC should be safe to trigger it, even if the
rest of the DPDK initialization hasn't finished yet.
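
Because a mismatch between ``nb_sent`` and ``nb_received`` is left to the application,
it can be useful to classify the outcome of a synchronous request in one place.
The following is a minimal sketch in plain C under stated assumptions: ``struct demo_mp_reply`` is a local stand-in
that mirrors the three reply fields listed earlier (a real application would use ``struct rte_mp_reply`` from the DPDK headers),
``struct demo_mp_msg`` is a placeholder with an arbitrary size, and the ``demo_*`` helpers are hypothetical.

```c
#include <stdlib.h>

/* Placeholder for a response message; the field and its size are arbitrary. */
struct demo_mp_msg {
    char name[32];
};

/* Local stand-in mirroring the reply fields described in the text. */
struct demo_mp_reply {
    int nb_sent;               /* requests sent to active peers */
    int nb_received;           /* responses that arrived before the timeout */
    struct demo_mp_msg *msgs;  /* response array, owned by the requestor */
};

enum demo_reply_status {
    DEMO_REPLY_NONE_SENT,  /* no peer was active, nothing to wait for */
    DEMO_REPLY_COMPLETE,   /* every peer answered */
    DEMO_REPLY_PARTIAL     /* at least one peer timed out */
};

/* Compare the counters; the IPC API itself does not flag a partial result. */
static enum demo_reply_status
demo_classify_reply(const struct demo_mp_reply *reply)
{
    if (reply->nb_sent == 0)
        return DEMO_REPLY_NONE_SENT;
    if (reply->nb_received == reply->nb_sent)
        return DEMO_REPLY_COMPLETE;
    return DEMO_REPLY_PARTIAL;
}

/* Synchronous case only: the requestor releases the response array when done. */
static void demo_release_reply(struct demo_mp_reply *reply)
{
    free(reply->msgs);
    reply->msgs = NULL;
}
```

How a partial result is handled (retry, proceed with the responses received, or abort) remains an application decision.
Per the memory-subsystem restriction above, the release step belongs in the requesting code path, not inside an IPC callback.
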