.. SPDX-License-Identifier: BSD-3-Clause
   Copyright(c) 2024 Ericsson AB

Lcore Variables
===============

The ``rte_lcore_var.h`` API provides a mechanism to allocate and
access per-lcore id variables in a space- and cycle-efficient manner.


Lcore Variables API
-------------------

A per-lcore id variable (or lcore variable for short)
holds a unique value for each EAL thread and registered non-EAL thread.
Thus, there is one distinct value for each past, current and future
lcore id-equipped thread, with a total of ``RTE_MAX_LCORE`` instances.

The value of the lcore variable for one lcore id is independent of the
values associated with other lcore ids within the same variable.

For detailed information on the lcore variables API,
please refer to the ``rte_lcore_var.h`` API documentation.


Lcore Variable Handle
~~~~~~~~~~~~~~~~~~~~~

To allocate and access an lcore variable's values, a *handle* is used.
The handle is represented by an opaque pointer,
only to be dereferenced using the appropriate ``<rte_lcore_var.h>`` macros.

The handle is a pointer to the value's type
(e.g., for a ``uint32_t`` lcore variable, the handle is a ``uint32_t *``).

The reason the handle is typed (i.e., it's not a void pointer or an integer)
is to enable type checking when accessing values of the lcore variable.

A handle may be passed between modules and threads
just like any other pointer.

A valid (i.e., allocated) handle never has the value NULL.
Thus, a handle set to NULL may be used
to signify that allocation has not yet been done.


Lcore Variable Allocation
~~~~~~~~~~~~~~~~~~~~~~~~~

An lcore variable is created in two steps:

1. Define an lcore variable handle by using ``RTE_LCORE_VAR_HANDLE``.
2. Allocate lcore variable storage and initialize the handle
   by using ``RTE_LCORE_VAR_ALLOC`` or ``RTE_LCORE_VAR_INIT``.
   Allocation generally occurs at the time of module initialization,
   but may be done at any time.

The lifetime of an lcore variable is not tied to the thread that created it.

Each lcore variable has ``RTE_MAX_LCORE`` values,
one for each possible lcore id.
All of an lcore variable's values may be accessed
from the moment the lcore variable is created,
throughout the lifetime of the EAL (i.e., until ``rte_eal_cleanup()``).

Lcore variables do not need to be freed and cannot be freed.


Access
~~~~~~

The value of any lcore variable for any lcore id
may be accessed from any thread (including unregistered threads),
but it should only be *frequently* read from or written to by the *owner*.
A thread is considered the owner of a particular lcore variable value instance
if it has the lcore id associated with that instance.

Non-owner accesses result in *false sharing*.
As long as non-owner accesses are rare,
they will have only a very slight effect on performance.
This property of the lcore variable memory organization is intentional.
See the implementation section for more information.

Values of the same lcore variable
that are associated with different lcore ids may be frequently read or written
by their respective owners without risking false sharing.

An appropriate synchronization mechanism,
such as atomic loads and stores,
should be employed to prevent data races between the owning thread
and any other thread accessing the same value instance.

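As a minimal sketch of such synchronization,
the standalone C11 code below lets an owner update its own counter
while any thread may read all counters, using relaxed atomic loads and stores.
It does not use the lcore variables API:
a plain array stands in for the per-lcore id values,
and ``CNT_MAX_LCORE``, ``cnt_add()`` and ``cnt_total()`` are hypothetical names.

.. code-block:: c

   #include <stdatomic.h>
   #include <stdint.h>

   #define CNT_MAX_LCORE 4 /* hypothetical stand-in for RTE_MAX_LCORE */

   /* One value instance per lcore id, zero-initialized. */
   static _Atomic uint64_t cnt_values[CNT_MAX_LCORE];

   /* To be called only by the owner of the value. With a single writer,
    * a relaxed load followed by a relaxed store is enough; no atomic
    * read-modify-write operation is needed. */
   static void
   cnt_add(unsigned int lcore_id, uint64_t n)
   {
           uint64_t old;

           old = atomic_load_explicit(&cnt_values[lcore_id],
                                      memory_order_relaxed);
           atomic_store_explicit(&cnt_values[lcore_id], old + n,
                                 memory_order_relaxed);
   }

   /* May be called, preferably rarely, by any thread. */
   static uint64_t
   cnt_total(void)
   {
           uint64_t total = 0;
           unsigned int i;

           for (i = 0; i < CNT_MAX_LCORE; i++)
                   total += atomic_load_explicit(&cnt_values[i],
                                                 memory_order_relaxed);

           return total;
   }

Relaxed ordering only prevents torn reads and writes of the counter itself;
stronger ordering would be needed if other data were published through it.
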
The value of the lcore variable for a particular lcore id
is accessed via ``RTE_LCORE_VAR_LCORE``.

A common pattern is for an EAL thread or a registered non-EAL thread
to access its own lcore variable value.
For this purpose, a shorthand exists as ``RTE_LCORE_VAR``.

``RTE_LCORE_VAR_FOREACH`` may be used to iterate
over all values of a particular lcore variable.

The handle, defined by ``RTE_LCORE_VAR_HANDLE``,
is a pointer of the same type as the value,
but it must be treated as an opaque identifier
and cannot be directly dereferenced.

Lcore variable handles and value pointers may be freely passed
between different threads.


Storage
~~~~~~~

An lcore variable's values may be of a primitive type like ``int``,
but are typically of a ``struct`` type.

The lcore variable handle introduces a per-variable
(not per-value/per-lcore id) overhead of ``sizeof(void *)`` bytes,
so there are some memory footprint gains to be made by organizing
all per-lcore id data for a particular module as one lcore variable
(e.g., as a struct).

An application may define an lcore variable handle
without ever allocating the lcore variable.

The size of an lcore variable's value cannot exceed
the DPDK build-time constant ``RTE_MAX_LCORE_VAR``.
An lcore variable's size is the size of one of its value instances,
not the aggregate of all its ``RTE_MAX_LCORE`` instances.

Lcore variables should generally *not* be ``__rte_cache_aligned``
and need *not* include a ``RTE_CACHE_GUARD`` field,
since these constructs are designed to avoid false sharing.
With lcore variables, false sharing is largely avoided by other means.
In the case of an lcore variable instance,
the thread most recently accessing nearby data structures
should almost always be the lcore variable's owner.
Adding padding (e.g., with ``RTE_CACHE_GUARD``)
will increase the effective memory working set size,
potentially reducing performance.

Lcore variable values are initialized to zero by default.

Lcore variables are not stored in huge page memory.


Example
~~~~~~~

Below is an example of the use of an lcore variable:

.. code-block:: c

   struct foo_lcore_state {
           int a;
           long b;
   };

   static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);

   long foo_get_a_plus_b(void)
   {
           const struct foo_lcore_state *state = RTE_LCORE_VAR(lcore_states);

           return state->a + state->b;
   }

   RTE_INIT(rte_foo_init)
   {
           RTE_LCORE_VAR_ALLOC(lcore_states);

           unsigned int lcore_id;
           struct foo_lcore_state *state;
           RTE_LCORE_VAR_FOREACH(lcore_id, state, lcore_states) {
                   /* initialize state */
           }

           /* other initialization */
   }


Implementation
--------------

This section gives an overview of the implementation of lcore variables,
and some background to its design.


Lcore Variable Buffers
~~~~~~~~~~~~~~~~~~~~~~

Lcore variable values are kept in a set of ``lcore_var_buffer`` structs.

.. literalinclude:: ../../../lib/eal/common/eal_common_lcore_var.c
   :language: c
   :start-after: base unit
   :end-before: last allocated unit

An lcore var buffer stores at a minimum one, but usually many, lcore variables.

The value instances for all lcore ids are stored in the same buffer.
However, each lcore id has its own slice of the ``data`` array.
Such a slice is ``RTE_MAX_LCORE_VAR`` bytes in size.

In this way, the values associated with a particular lcore id
are grouped spatially close (in memory).
No padding is required to prevent false sharing.

.. literalinclude:: ../../../lib/eal/common/eal_common_lcore_var.c
   :language: c
   :start-after: last allocated unit
   :end-before: >8 end of documented variables

The implementation maintains a current ``lcore_var_buffer`` and an ``offset``,
where the latter tracks how many bytes of this current buffer have been allocated.

The ``offset`` is progressively incremented
(by the size of the just-allocated lcore variable),
as lcore variables are being allocated.

If the allocation of a variable would result in an ``offset`` larger
than ``RTE_MAX_LCORE_VAR`` (i.e., the slice size), the buffer is full.
In that case, a new buffer is allocated off the heap, and the ``offset`` is reset.

The lcore var buffers are arranged in a linked list,
to allow freeing them at the point of ``rte_eal_cleanup()``.

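The allocation scheme described above can be sketched in standalone C as follows.
This is not the EAL implementation:
the constants are scaled down, all ``lvb_``-prefixed names are hypothetical,
and the rounding of the offset up to the requested alignment
is an assumption of the sketch.

.. code-block:: c

   #include <stddef.h>
   #include <stdlib.h>

   #define LVB_MAX_LCORE 4      /* hypothetical, scaled-down RTE_MAX_LCORE */
   #define LVB_MAX_LCORE_VAR 64 /* hypothetical, scaled-down slice size */

   struct lvb_buffer {
           /* One LVB_MAX_LCORE_VAR-byte slice per lcore id. */
           char data[LVB_MAX_LCORE * LVB_MAX_LCORE_VAR];
           /* Older buffer, kept so that all buffers can be freed. */
           struct lvb_buffer *prev;
   };

   static struct lvb_buffer *lvb_current;
   static size_t lvb_offset;

   /* Returns a handle, i.e., the address of the value for lcore id 0.
    * 'align' is assumed to be a power of two no larger than the
    * alignment guaranteed by calloc(). */
   static void *
   lvb_alloc(size_t size, size_t align)
   {
           size_t start = (lvb_offset + align - 1) & ~(align - 1);

           if (size > LVB_MAX_LCORE_VAR)
                   abort(); /* value too large for a slice */

           if (lvb_current == NULL || start + size > LVB_MAX_LCORE_VAR) {
                   /* Current buffer missing or full; calloc() zeroes the new one. */
                   struct lvb_buffer *buf = calloc(1, sizeof(*buf));

                   if (buf == NULL)
                           abort(); /* allocation failure is fatal */

                   buf->prev = lvb_current;
                   lvb_current = buf;
                   start = 0;
           }

           lvb_offset = start + size;

           return &lvb_current->data[start];
   }

In this sketch, a 4-byte variable followed by an 8-byte variable
ends up at offsets 0 and 8 of the first buffer,
and a subsequent 60-byte variable no longer fits and starts a second buffer.
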
The lcore variable buffers are allocated off the regular C heap.
There are a number of reasons for not using ``<rte_malloc.h>``
and huge pages for lcore variables:

- The libc heap is available at any time,
  including early in the DPDK initialization.
- The amount of data kept in lcore variables is projected to be small,
  and thus is unlikely to induce translation lookaside buffer (TLB) misses.
- The last (and potentially only) lcore buffer in the chain
  will likely only partially be in use.
  Huge pages of the sort used by DPDK are always resident in memory,
  and their use would result in a significant amount of memory going to waste.
  An example: ~256 kB worth of lcore variables are allocated
  by DPDK libraries, PMDs and the application.
  ``RTE_MAX_LCORE_VAR`` is set to 1 MB and ``RTE_MAX_LCORE`` to 128.
  With 4 kB OS pages, only the first ~64 pages of each of the 128 per-lcore id slices
  in the (only) ``lcore_var_buffer`` will actually be resident (paged in).
  Here, demand paging saves ~98 MB of memory.

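The arithmetic behind this kind of estimate can be sketched as below.
The helper names are hypothetical, and the sketch assumes
that every lcore id's slice is page-aligned and touched
only in its first ``used`` bytes.

.. code-block:: c

   #include <stddef.h>

   /* Bytes of address space spanned by one buffer. */
   static size_t
   page_reserved_bytes(size_t max_lcore, size_t max_lcore_var)
   {
           return max_lcore * max_lcore_var;
   }

   /* Bytes expected to be resident when only the first 'used' bytes
    * of each slice have been touched. */
   static size_t
   page_resident_bytes(size_t max_lcore, size_t used, size_t page_size)
   {
           size_t pages_per_slice = (used + page_size - 1) / page_size;

           return max_lcore * pages_per_slice * page_size;
   }

With 128 lcore ids, 1 MB (1048576-byte) slices, 256 kB in use per slice and 4 kB pages,
this gives 128 MiB reserved and 32 MiB resident,
a difference of 96 MiB (roughly 100 MB),
in line with the approximate saving quoted above.
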
.. note::

   Not residing in huge pages, lcore variables cannot be accessed from secondary processes.

Heap allocation failures are treated as fatal.
The reason for this unorthodox design is that a majority of the allocations
are deemed to happen at initialization.
An early heap allocation failure for a fixed amount of data is a situation
not unlike one where there is not enough memory available for static variables
(i.e., the BSS or data sections).

Provided these assumptions hold true, it's deemed acceptable
to leave the application out of handling memory allocation failures.

The upside of this approach is that no error handling code is required
on the API user side.


Lcore Variable Handles
~~~~~~~~~~~~~~~~~~~~~~

Upon lcore variable allocation, the lcore variables API returns
an opaque *handle* in the form of a pointer.
The value of the pointer is ``buffer->data + offset``.

Translating a handle base pointer to a pointer to a value
associated with a particular lcore id is straightforward:

.. literalinclude:: ../../../lib/eal/include/rte_lcore_var.h
   :language: c
   :start-after: access function 8<
   :end-before: >8 end of access function

``RTE_MAX_LCORE_VAR`` is a public macro to allow the compiler
to optimize the ``lcore_id * RTE_MAX_LCORE_VAR`` expression,
and replace the multiplication with a less expensive arithmetic operation.

To maintain type safety, the ``RTE_LCORE_VAR*()`` macros should be used,
instead of directly invoking ``rte_lcore_var_lcore()``.
The macros return a pointer of the same type as the handle
(i.e., a pointer to the value's type).


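To illustrate how a typed macro may wrap an untyped translation function,
the following standalone sketch mirrors the arithmetic described above.
It is not the ``rte_lcore_var.h`` code:
``HDL_MAX_LCORE_VAR``, ``hdl_value_ptr()`` and ``HDL_VALUE()`` are hypothetical names,
and ``__typeof__`` is a GCC/Clang extension.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   #define HDL_MAX_LCORE_VAR 64 /* hypothetical, scaled-down slice size */

   /* Untyped translation: step 'lcore_id' slices forward from the handle.
    * With a power-of-two constant, the compiler can turn the
    * multiplication into a shift. */
   static inline void *
   hdl_value_ptr(unsigned int lcore_id, void *handle)
   {
           return (char *)handle + (size_t)lcore_id * HDL_MAX_LCORE_VAR;
   }

   /* Typed wrapper: the result has the same pointer type as the handle,
    * so assigning it to a pointer of another type is diagnosed. */
   #define HDL_VALUE(lcore_id, handle) \
           ((__typeof__(handle))hdl_value_ptr(lcore_id, handle))

For a ``uint32_t *`` handle located 8 bytes into a buffer,
``HDL_VALUE(1, handle)`` is a ``uint32_t *`` pointing 72 bytes into that buffer.

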
Memory Layout
~~~~~~~~~~~~~

This section describes how lcore variables are organized in memory.

As an illustration, two example modules are used,
``rte_x`` and ``rte_y``, both maintaining per-lcore id state
as a part of their implementation.

Two different methods will be used to maintain such state:
lcore variables and, to serve as a reference, lcore id-indexed static arrays.

Certain parameters are scaled down to make graphical depictions more practical.

For the purpose of this exercise, a ``RTE_MAX_LCORE`` of 2 is assumed.
In a real-world configuration, the maximum number of
EAL threads and registered threads will be much greater (e.g., 128).

The lcore variables example assumes a ``RTE_MAX_LCORE_VAR`` of 64.
In a real-world configuration (as controlled by ``rte_config.h``),
the value of this compile-time constant will be much greater (e.g., 1048576).

The per-lcore id state is also smaller than what most real-world modules would have.

Lcore Variables Example
^^^^^^^^^^^^^^^^^^^^^^^

When lcore variables are used, the parts of ``rte_x`` and ``rte_y``
that deal with the declaration and allocation of per-lcore id data
may look something like below.

.. code-block:: c

   /* -- Lcore variables -- */

   /* rte_x.c */

   struct x_lcore
   {
       int a;
       char b;
   };

   static RTE_LCORE_VAR_HANDLE(struct x_lcore, x_lcores);
   RTE_LCORE_VAR_INIT(x_lcores);

   /../

   /* rte_y.c */

   struct y_lcore
   {
       long c;
       long d;
   };

   static RTE_LCORE_VAR_HANDLE(struct y_lcore, y_lcores);
   RTE_LCORE_VAR_INIT(y_lcores);

   /../

The resulting memory layout will look something like the following:

.. figure:: img/lcore_var_mem_layout.*

The above figure assumes that ``x_lcores`` is allocated prior to ``y_lcores``.
``RTE_LCORE_VAR_INIT()`` relies on constructors,
which are run prior to ``main()`` in an undefined order.

The use of lcore variables ensures that per-lcore id data is kept in close proximity,
within a designated region of memory.
This proximity enhances data locality and can improve performance.

Lcore Id Indexed Static Array Example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Below is an example of the struct declarations
and the resulting organization in memory
when an lcore id-indexed static array of cache-line aligned,
``RTE_CACHE_GUARD``-padded structs is used to maintain per-lcore id state.

This is a common pattern in DPDK, which lcore variables attempt to replace.

.. code-block:: c

   /* -- Cache-aligned static arrays -- */

   /* rte_x.c */

   struct __rte_cache_aligned x_lcore
   {
       int a;
       char b;
       RTE_CACHE_GUARD;
   };

   static struct x_lcore x_lcores[RTE_MAX_LCORE];

   /../

   /* rte_y.c */

   struct __rte_cache_aligned y_lcore
   {
       long c;
       long d;
       RTE_CACHE_GUARD;
   };

   static struct y_lcore y_lcores[RTE_MAX_LCORE];

   /../

In this approach, accessing the state for a particular lcore id is merely
a matter of retrieving the lcore id and looking up the correct struct instance.

.. code-block:: c

   struct x_lcore *my_lcore_state = &x_lcores[rte_lcore_id()];

The address "0" at the top of the left-most column in the figure
represents the base address for the ``x_lcores`` array
(in the BSS segment in memory).

The figure only includes the memory layout for the ``rte_x`` example module.
``rte_y`` would look very similar, with ``y_lcores`` being located
at some other address in the BSS section.

.. figure:: img/static_array_mem_layout.*

The static array approach results in the per-lcore id data
being organized around modules, not lcore ids.
To avoid false sharing, padding is used extensively,
causing cache fragmentation.

Because the padding is interspersed with the data,
demand paging is unlikely to reduce the actual resident DRAM memory footprint.
This is because the padding is smaller
than a typical operating system memory page (usually 4 kB).


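The cost of the padding can be made concrete with a standalone sketch.
The structs below only mimic the effect of ``__rte_cache_aligned`` and ``RTE_CACHE_GUARD``
using C11 ``_Alignas``; a 64-byte cache line and a single guard line are assumed,
and all ``pad_``-prefixed names are hypothetical.

.. code-block:: c

   #include <stddef.h>

   #define PAD_CACHE_LINE 64 /* assumed cache line size */

   /* Per-lcore id state without any padding. */
   struct pad_x_plain {
           int a;
           char b;
   };

   /* Stand-in for a cache-aligned struct followed by one guard line. */
   struct pad_x_padded {
           _Alignas(PAD_CACHE_LINE) int a;
           char b;
           _Alignas(PAD_CACHE_LINE) char guard[PAD_CACHE_LINE];
   };

   /* Bytes occupied by a static array holding 'n' padded instances. */
   static size_t
   pad_array_bytes(size_t n)
   {
           return n * sizeof(struct pad_x_padded);
   }

   /* Bytes of actual state held by 'n' unpadded instances. */
   static size_t
   pad_payload_bytes(size_t n)
   {
           return n * sizeof(struct pad_x_plain);
   }

Here, each padded instance occupies 128 bytes,
while the state itself is typically 8 bytes,
so most of the array consists of padding.

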
Performance
~~~~~~~~~~~

One of the goals of lcore variables is to improve performance.
This is achieved by packing often-used data in fewer cache lines,
thereby reducing fragmentation in CPU caches
and somewhat improving the effective cache size and cache hit rates.

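A rough way to reason about the packing effect is to count
the cache lines one lcore touches when accessing the state of several modules.
The sketch below is an estimate only:
it assumes 64-byte cache lines, ignores guard lines and alignment padding
between packed values, assumes the packed data starts on a cache line boundary,
and uses hypothetical ``cl_``-prefixed names.

.. code-block:: c

   #include <stddef.h>

   #define CL_SIZE 64 /* assumed cache line size */

   /* Lines touched when each module's state is a separate
    * cache-line aligned object. */
   static size_t
   cl_lines_padded(const size_t *sizes, size_t n)
   {
           size_t lines = 0;
           size_t i;

           for (i = 0; i < n; i++)
                   lines += (sizes[i] + CL_SIZE - 1) / CL_SIZE;

           return lines;
   }

   /* Lines touched when the same states are stored back to back. */
   static size_t
   cl_lines_packed(const size_t *sizes, size_t n)
   {
           size_t total = 0;
           size_t i;

           for (i = 0; i < n; i++)
                   total += sizes[i];

           return (total + CL_SIZE - 1) / CL_SIZE;
   }

For four modules with 8, 16, 24 and 12 bytes of per-lcore id state,
the estimate is four cache lines in the padded case and one in the packed case.
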
The application-level gains depend largely on how much data is kept in lcore variables,
how often it is accessed,
and how much pressure the application exerts on the CPU caches
(i.e., how much other memory it accesses).

The ``lcore_var_perf_autotest`` is an attempt at exploring
the performance benefits (or drawbacks) of lcore variables
compared to their alternatives.
Being a micro benchmark, it needs to be taken with a grain of salt.

Generally, one shouldn't expect more than some very modest gains in performance
after a switch from lcore id indexed arrays to lcore variables.

An additional benefit of the use of lcore variables is that it avoids
certain tricky issues related to CPU core hardware prefetching
(e.g., next-N-lines prefetching) that may cause false sharing
even when data used by two cores do not reside on the same cache line.
Hardware prefetch behavior is generally not publicly documented
and varies across CPU vendors, CPU generations and BIOS (or similar) configurations.
For applications aiming to be portable, this may cause issues.
Often, CPU hardware prefetch-induced issues are non-existent,
except in some particular circumstances, where their adverse effects may be significant.


Alternatives
------------

Lcore Id Indexed Static Arrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Lcore variables are designed to replace a pattern exemplified below:

.. code-block:: c

   struct __rte_cache_aligned foo_lcore_state {
           int a;
           long b;
           RTE_CACHE_GUARD;
   };

   static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];

This scheme is simple and effective, but has one drawback:
the data is organized so that objects related to all lcores for a particular module
are kept close in memory.
At a bare minimum, this requires sizing data structures
(e.g., using ``__rte_cache_aligned``) to a whole number of cache lines
and ensuring that such objects
are allocated cache line aligned to avoid false sharing.
With CPU hardware prefetching and memory loads resulting from speculative execution
(functions which seemingly are getting more eager faster
than they are getting more intelligent),
one or more "guard" cache lines may be required
to separate one lcore's data from another's and prevent false sharing.

Lcore variables offer the advantage of working with,
rather than against, the CPU's assumptions.
A next-line hardware prefetcher, for example, may function as intended
(i.e., to the benefit, not detriment, of system performance).


Thread Local Storage
~~~~~~~~~~~~~~~~~~~~

An alternative to ``rte_lcore_var.h`` is the ``rte_per_lcore.h`` API,
which makes use of thread-local storage
(TLS, e.g., GCC ``__thread`` or C11 ``_Thread_local``).

There are a number of differences between using TLS
and the use of lcore variables.

The lifecycle of a thread-local variable instance is tied to that of the thread.
The data cannot be accessed before the thread has been created,
nor after it has terminated.
As a result, thread-local variables must be initialized in a "lazy" manner
(e.g., at the point of thread creation).
Lcore variables may be accessed immediately after having been allocated
(which may occur before any thread beyond the main thread is running).

A thread-local variable is duplicated across all threads in the process,
including unregistered non-EAL threads (i.e., "regular" threads).
For DPDK applications heavily relying on multi-threading
(in conjunction with DPDK's "one thread per core" pattern),
either by having many concurrent threads or creating/destroying threads at a high rate,
an excessive use of thread-local variables may cause inefficiencies
(e.g., increased thread creation overhead due to thread-local storage initialization
or increased memory footprint).
Lcore variables *only* exist for threads with an lcore id.

Whether data in thread-local storage can be shared between threads
(i.e., whether a pointer to a thread-local variable can be passed to
and successfully dereferenced by a non-owning thread)
depends on the specifics of the TLS implementation.
With GCC ``__thread`` and GCC ``_Thread_local``,
data sharing between threads is supported.
In the C11 standard, accessing another thread's ``_Thread_local`` object
is implementation-defined.
Lcore variable instances may be accessed reliably by any thread.

Lcore variables also rely on TLS to retrieve the thread's lcore id.
However, the rest of the per-thread data is not kept in TLS.

From a memory layout perspective, TLS is similar to lcore variables,
and thus per-thread data structures need not be padded.

In case the above-mentioned drawbacks of the use of TLS are of no significance
to a particular application, TLS is a good alternative to lcore variables.
546