.. SPDX-License-Identifier: BSD-3-Clause
   Copyright(c) 2010-2014 Intel Corporation.

.. _Multi-process_Support:

Multi-process Support
=====================

In the DPDK, multi-process support is designed to allow a group of DPDK processes
to work together in a simple transparent manner to perform packet processing,
or other workloads.
To support this functionality,
a number of additions have been made to the core DPDK Environment Abstraction Layer (EAL).

The EAL has been modified to allow different types of DPDK processes to be spawned,
each with different permissions on the hugepage memory used by the applications.
For now, there are two types of process specified:

* primary processes, which can initialize shared memory and have full permissions on it

* secondary processes, which cannot initialize shared memory,
  but can attach to pre-initialized shared memory and create objects in it.

Standalone DPDK processes are primary processes,
while secondary processes can only run alongside a primary process or
after a primary process has already configured the hugepage shared memory for them.

.. note::

   Secondary processes should run alongside a primary process of the same DPDK version.

   Secondary processes that require access to physical devices in the primary process must
   be passed the same allow and block options.

To support these two process types, and other multi-process setups described later,
two additional command-line parameters are available to the EAL:

* ``--proc-type:`` for specifying a given process instance as the primary or secondary DPDK instance

* ``--file-prefix:`` to allow processes that do not want to co-operate to have different memory regions

A number of example applications are provided that demonstrate how multiple DPDK processes can be used together.
These are more fully documented in the "Multi-process Sample Application" chapter
in the *DPDK Sample Application's User Guide*.

Memory Sharing
--------------

The key element in getting a multi-process application working using the DPDK is to ensure that
memory resources are properly shared among the processes making up the multi-process application.
Once there are blocks of shared memory available that can be accessed by multiple processes,
then issues such as inter-process communication (IPC) become much simpler.

On application start-up in a primary or standalone process,
the DPDK records to memory-mapped files the details of the memory configuration it is using:
the hugepages in use, the virtual addresses they are mapped at, the number of memory channels present, etc.
When a secondary process is started, these files are read and the EAL recreates the same memory configuration
in the secondary process so that all memory zones are shared between processes and all pointers to that memory are valid,
and point to the same objects, in both processes.

.. note::

   Refer to `Multi-process Limitations`_ for details of
   how Linux kernel Address-Space Layout Randomization (ASLR) can affect memory sharing.

   If the primary process was run with the ``--legacy-mem`` or
   ``--single-file-segments`` switch, secondary processes must be run with the
   same switch specified. Otherwise, memory corruption may occur.

.. _figure_multi_process_memory:

.. figure:: img/multi_process_memory.*

   Memory Sharing in the DPDK Multi-process Sample Application


The EAL also supports an auto-detection mode (set by the EAL ``--proc-type=auto`` flag),
whereby a DPDK process is started as a secondary instance if a primary instance is already running.
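
The requirement described in this section, that shared memory must appear at the
same virtual address for stored pointers to remain valid, can be illustrated
without DPDK. The following is a minimal sketch using plain POSIX ``mmap()`` on a
temporary file rather than hugepages. It is not how the EAL is implemented, and
``struct shared_obj`` and ``remap_at_same_address()`` are hypothetical names used
only for illustration. The function maps the file, stores an object that contains
a pointer into the region, unmaps it, and then maps the file again at the recorded
address, as a second process would have to do.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical object living in shared memory; it holds a raw pointer
 * to another location inside the same mapping. */
struct shared_obj {
	int value;
	int *self_ref; /* only meaningful if the region keeps its address */
};

/*
 * Map a temporary file, store an object with an internal pointer, unmap,
 * then map the file again asking for the recorded address.
 * Returns 0 if the stored pointer is still valid after the second mapping,
 * -1 if the kernel placed the mapping elsewhere or a call failed.
 */
static int remap_at_same_address(void)
{
	char path[] = "/tmp/mp_sketch_XXXXXX";
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	struct shared_obj *obj;
	void *first, *second;
	int ret = -1;
	int fd = mkstemp(path);

	assert(len > 0);
	if (fd < 0)
		return -1;
	unlink(path); /* the open descriptor keeps the file alive */
	if (ftruncate(fd, (off_t)len) != 0)
		goto out;

	/* "Primary" role: let the kernel choose the address and record it. */
	first = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (first == MAP_FAILED)
		goto out;
	obj = first;
	obj->value = 42;
	obj->self_ref = &obj->value;
	munmap(first, len);

	/* "Secondary" role: pass the recorded address as a hint (no MAP_FIXED). */
	second = mmap(first, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (second == MAP_FAILED)
		goto out;
	if (second == first) {
		obj = second;
		/* The stored pointer resolves to the same object again. */
		if (obj->self_ref == &obj->value && *obj->self_ref == 42)
			ret = 0;
	}
	munmap(second, len);
out:
	close(fd);
	return ret;
}
```

Calling ``remap_at_same_address()`` returns 0 when the second mapping lands at the
recorded address. Because the address is passed only as a hint, the kernel may
place the mapping elsewhere, in which case the function returns -1 and the stored
pointer must not be used.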

Deployment Models
-----------------

Symmetric/Peer Processes
~~~~~~~~~~~~~~~~~~~~~~~~

DPDK multi-process support can be used to create a set of peer processes where each process performs the same workload.
This model is equivalent to having multiple threads each running the same main-loop function,
as is done in most of the supplied DPDK sample applications.
In this model, the first of the processes spawned should be spawned using the ``--proc-type=primary`` EAL flag,
while all subsequent instances should be spawned using the ``--proc-type=secondary`` flag.

The simple_mp and symmetric_mp sample applications demonstrate this usage model.
They are described in the "Multi-process Sample Application" chapter in the *DPDK Sample Application's User Guide*.

Asymmetric/Non-Peer Processes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

An alternative deployment model that can be used for multi-process applications
is to have a single primary process instance that acts as a load-balancer or
server distributing received packets among worker or client threads, which are run as secondary processes.
In this case, extensive use is made of ``rte_ring`` objects, which are located in shared hugepage memory.

The client_server_mp sample application shows this usage model.
It is described in the "Multi-process Sample Application" chapter in the *DPDK Sample Application's User Guide*.

Running Multiple Independent DPDK Applications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to the above scenarios involving multiple DPDK processes working together,
it is possible to run multiple DPDK processes side-by-side,
where those processes are all working independently.
Support for this usage scenario is provided using the ``--file-prefix`` parameter to the EAL.

By default, the EAL creates hugepage files on each hugetlbfs filesystem using the ``rtemap_X`` filename,
where X is in the range 0 to the maximum number of hugepages minus 1.
Similarly, it creates shared configuration files, memory mapped in each process, using the ``/var/run/.rte_config`` filename
when run as root (or ``$HOME/.rte_config`` when run as a non-root user,
if filesystem and device permissions are set up to allow this).
The ``rte`` part of the filenames of each of the above is configurable using the ``--file-prefix`` parameter.

In addition to specifying the ``--file-prefix`` parameter,
any DPDK applications that are to be run side-by-side must explicitly limit their memory use.
This is less of a problem on Linux, as by default, applications will not
allocate more memory than they need. However, if ``--legacy-mem`` is used, DPDK
will attempt to preallocate all the memory it can get, and memory use must be
explicitly limited. This is done by passing the ``-m`` flag to each process to
specify how much hugepage memory, in megabytes, each process can use (or passing
``--socket-mem`` to specify how much hugepage memory on each socket each process
can use).

.. note::

   Independent DPDK instances running side-by-side on a single machine cannot share any network ports.
   Any network ports being used by one process should be blocked by every other process.

Running Multiple Independent Groups of DPDK Applications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the same way that it is possible to run independent DPDK applications side-by-side on a single system,
this can be trivially extended to multi-process groups of DPDK applications running side-by-side.
In this case, the secondary processes must use the same ``--file-prefix`` parameter
as the primary process whose shared memory they are connecting to.

.. note::

   All restrictions and issues with multiple independent DPDK processes running side-by-side
   apply in this usage scenario also.

Multi-process Limitations
-------------------------

There are a number of limitations to what can be done when running DPDK multi-process applications.
Some of these are documented below:

* The multi-process feature requires that the exact same hugepage memory mappings be present in all applications.
  This makes secondary process startup generally unreliable. Disabling the
  Linux Address-Space Layout Randomization (ASLR) security feature may
  make the mappings more consistent, but not necessarily more reliable:
  if the mappings are wrong, they will be consistently wrong!

.. warning::

   Disabling Address-Space Layout Randomization (ASLR) may have security implications,
   so it is recommended that it be disabled only when absolutely necessary,
   and only when the implications of this change have been understood.

* All DPDK processes running as a single application and using shared memory must have distinct coremask/corelist arguments.
  It is not possible to have a primary and secondary instance, or two secondary instances,
  using any of the same logical cores.
  Attempting to do so can cause corruption of memory pool caches, among other issues.

* The delivery of interrupts, such as Ethernet* device link status interrupts, does not work in secondary processes.
  All interrupts are triggered inside the primary process only.
  Any application needing interrupt notification in multiple processes should provide its own mechanism
  to transfer the interrupt information from the primary process to any secondary process that needs the information.

* The use of function pointers between multiple processes built from different compiled binaries is not supported,
  since the location of a given function in one process may be different to its location in a second.
  This prevents the librte_hash library from behaving properly in a multi-process instance,
  since it uses a pointer to the hash function internally.

  To work around this issue, it is recommended that multi-process applications perform the hash calculations by directly calling
  the hashing function from the code and then using the ``rte_hash_add_with_hash()``/``rte_hash_lookup_with_hash()`` functions
  instead of the functions which do the hashing internally, such as ``rte_hash_add()``/``rte_hash_lookup()``.

* Depending upon the hardware in use, and the number of DPDK processes used,
  it may not be possible to have HPET timers available in each DPDK instance.
  The minimum number of HPET comparators available to Linux* userspace can be just a single comparator,
  which means that only the first, primary DPDK process instance can open and mmap ``/dev/hpet``.
  If the number of required DPDK processes exceeds the number of available HPET comparators,
  the TSC (which is the default timer in this release) must be used as a time source across all processes instead of the HPET.

Communication between multiple processes
----------------------------------------

While there are multiple ways one can approach inter-process communication in
DPDK, there is also a native DPDK IPC API available. It is not intended to be
performance-critical, but rather is intended to be a convenient, general
purpose API to exchange short messages between primary and secondary processes.

The DPDK IPC API supports the following communication modes:

* Unicast message from secondary to primary
* Broadcast message from primary to all secondaries

In other words, any IPC message sent in a primary process will be delivered to
all secondaries, while any IPC message sent in a secondary process will only be
delivered to the primary process. Unicast from primary to secondary or from
secondary to secondary is not supported.

There are three types of communication available within the DPDK IPC API:

* Message
* Synchronous request
* Asynchronous request

A "message" type does not expect a response and is meant to be a best-effort
notification mechanism, while the two types of "requests" are meant to be a
two-way communication mechanism, with the requester expecting a response from the
other side.

Both messages and requests will trigger a named callback on the receiver side.
These callbacks will be called from within a dedicated IPC or interrupt thread
that is not one of the EAL lcore threads.

Registering for incoming messages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before any messages can be received, a callback must be registered.
This is accomplished by calling the ``rte_mp_action_register()`` function. This
function accepts a unique callback name, and a function pointer to a callback
that will be called when a message or a request matching this callback name
arrives.

If the application is no longer willing to receive messages intended for a
specific callback function, the ``rte_mp_action_unregister()`` function can be
called to ensure that the callback will not be triggered again.

Sending messages
~~~~~~~~~~~~~~~~

To send a message, a ``rte_mp_msg`` descriptor must be populated first. The
fields to be populated are as follows:

* ``name`` - message name. This name must match the receiver's callback name.
* ``param`` - message data (up to 256 bytes).
* ``len_param`` - length of message data.
* ``fds`` - file descriptors to pass along with the data (up to 8 fd's).
* ``num_fds`` - number of file descriptors to send.

Once the structure is populated, calling ``rte_mp_sendmsg()`` will send the
descriptor either to all secondary processes (if sent from the primary process), or
to the primary process (if sent from a secondary process). The function will return
a value indicating whether sending the message succeeded or not.

Sending requests
~~~~~~~~~~~~~~~~

Sending requests involves waiting for the other side to reply, so they can block
for a relatively long time.

To send a request, a message descriptor ``rte_mp_msg`` must be populated.
Additionally, a ``timespec`` value must be specified as a timeout, after which
IPC will stop waiting and return.

For synchronous requests, the ``rte_mp_reply`` descriptor must also be created.
This is where the responses will be stored.
The fields that will be populated by IPC are as follows:

* ``nb_sent`` - number indicating how many requests were sent (i.e. how many
  peer processes were active at the time of the request).
* ``nb_received`` - number indicating how many responses were received (i.e. of
  those peer processes that were active at the time of request, how many have
  replied).
* ``msgs`` - pointer to where all of the responses are stored. The order in
  which responses appear is undefined. When doing synchronous requests, this
  memory must be freed by the requestor after the request completes!

For asynchronous requests, a function pointer to the callback function must be
provided instead. This callback will be called when the request has either timed
out or received responses to all the messages that were sent.

.. warning::

   When an asynchronous request times out, the callback will be called not by
   a dedicated IPC thread, but rather from the EAL interrupt thread. Because of
   this, it may not be possible for DPDK to trigger another interrupt-based
   event (such as an alarm) while handling an asynchronous IPC callback.

When the callback is called, the original request descriptor will be provided
(so that it is possible to determine which sent message this callback
belongs to), along with a response descriptor like the one described above.
When doing asynchronous requests, there is no need to free the resulting
``rte_mp_reply`` descriptor.

Receiving and responding to messages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To receive a message, a named callback must be registered using the
``rte_mp_action_register()`` function. The name of the callback must match the
``name`` field in the sender's ``rte_mp_msg`` message descriptor in order for this
message to be delivered and for the callback to be triggered.

The callback's definition is ``rte_mp_t``, and consists of the incoming message
pointer ``msg``, and an opaque pointer ``peer``. The contents of ``msg`` will be
identical to the ones sent by the sender.

If a response is required, a new ``rte_mp_msg`` message descriptor must be
constructed and sent via the ``rte_mp_reply()`` function, along with the ``peer``
pointer. The resulting response will then be delivered to the correct requestor.

.. warning::

   Simply returning a value when processing a request callback will not send a
   response to the request. A response must always be sent explicitly, even in
   case of errors; failing to do so will cause the requestor to time out while
   waiting on a response. Implementation of error signalling rests with the
   application, as there is no built-in way to indicate success or error for a
   request.
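
The rule that every request must be answered explicitly, with an
application-defined status, can be sketched in plain C. The example below does
not include any DPDK header so that it stays self-contained: ``struct msg_model``
is a local stand-in that models only a few of the descriptor fields listed
earlier, ``send_reply_model()`` stands in for the reply call and merely records
what was sent, and ``struct reply_payload`` is a hypothetical payload layout, not
something DPDK defines. The name buffer size is chosen arbitrarily for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for the message descriptor; NOT the DPDK definition.
 * Only the fields used by this sketch are modelled. */
struct msg_model {
	char name[64];      /* size chosen for the sketch */
	int len_param;
	uint8_t param[256]; /* the text above gives 256 bytes as the limit */
};

/* Hypothetical application-defined payload: the status travels in the data. */
struct reply_payload {
	int32_t status;  /* 0 on success, negative errno value on failure */
	uint32_t result;
};

/* Stand-in for the reply call: it only records the reply that was "sent". */
static struct msg_model last_reply;
static int replies_sent;

static int send_reply_model(const struct msg_model *reply, const void *peer)
{
	(void)peer; /* a real handler would pass the opaque peer pointer on */
	last_reply = *reply;
	replies_sent++;
	return 0;
}

/* Request handler: expects a 4-byte value and answers with its double.
 * Every path, including the error path, ends in an explicit reply. */
static int handle_request(const struct msg_model *msg, const void *peer)
{
	struct msg_model reply;
	struct reply_payload out = { .status = 0, .result = 0 };
	uint32_t in;

	assert(msg != NULL);
	memset(&reply, 0, sizeof(reply));
	memcpy(reply.name, msg->name, sizeof(reply.name));

	if (msg->len_param != (int)sizeof(in)) {
		out.status = -EINVAL; /* malformed request: still answer it */
	} else {
		memcpy(&in, msg->param, sizeof(in));
		out.result = in * 2;
	}

	memcpy(reply.param, &out, sizeof(out));
	reply.len_param = (int)sizeof(out);
	return send_reply_model(&reply, peer);
}
```

With this shape, the requestor can tell a rejected request (a reply carrying a
negative ``status``) apart from a peer that never answered (a timeout), which is
a distinction the IPC layer itself does not provide.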

Misc considerations
~~~~~~~~~~~~~~~~~~~

Due to the underlying IPC implementation being single-threaded, recursive
requests (i.e. sending a request while responding to another request) are not
supported. However, since sending messages (not requests) does not involve an
IPC thread, sending messages while processing another message or request is
supported.

Since the memory subsystem uses IPC internally, memory allocations and IPC must
not be mixed: it is not safe to use IPC inside a memory-related callback, nor is
it safe to allocate/free memory inside IPC callbacks. Attempting to do so may
lead to a deadlock.

Asynchronous request callbacks may be triggered either from the IPC thread or from
the interrupt thread, depending on whether the request has timed out. It is
therefore suggested to avoid waiting for interrupt-based events (such as alarms)
inside asynchronous IPC request callbacks. This limitation does not apply to
messages or synchronous requests.

If callbacks spend a long time processing the incoming requests, the requestor
might time out, so setting the right timeout value on the requestor side is
imperative.

If some of the requests timed out, the ``nb_sent`` and ``nb_received`` fields in the
``rte_mp_reply`` descriptor will not have matching values. This is not treated
as an error by the IPC API, and it is expected that the user will be responsible
for deciding how to handle such cases.

If a callback has been registered, IPC will assume that it is safe to call it.
This is important when registering callbacks during DPDK initialization.
During initialization, IPC will consider the receiving side as non-existent if
the callback has not been registered yet. However, once the callback has been
registered, IPC expects that it is safe to trigger it, even if the
rest of the DPDK initialization hasn't finished yet.
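
The point above about mismatched ``nb_sent`` and ``nb_received`` values, together
with the requirement to free ``msgs`` after a synchronous request, can be sketched
as a small helper. The types here are local stand-ins that mirror only the three
reply fields named in this chapter; they are not the DPDK definitions, and
``finish_sync_reply()`` is a hypothetical helper name. The sketch also assumes
that the response array can be released with ``free()``.

```c
#include <assert.h>
#include <stdlib.h>

/* Local stand-ins, NOT the DPDK definitions. */
struct resp_msg_model {
	int len_param;
};

struct reply_model {
	int nb_sent;                 /* requests sent */
	int nb_received;             /* responses that arrived in time */
	struct resp_msg_model *msgs; /* response array owned by the requestor */
};

/*
 * Finish handling a synchronous reply: report how many peers did not
 * answer before the timeout and release the response array.
 * A non-zero return value is not an IPC error; what to do about the
 * missing answers is left to the caller.
 */
static int finish_sync_reply(struct reply_model *reply)
{
	int missing;

	assert(reply != NULL);
	missing = reply->nb_sent - reply->nb_received;
	free(reply->msgs); /* free(NULL) is a no-op, so an empty reply is fine */
	reply->msgs = NULL;
	return missing;
}
```

A caller would typically inspect the responses first and call the helper last,
treating a non-zero result as "some peers timed out" and then retrying, logging
or ignoring it according to application policy.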