..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2024 Ericsson AB

Lcore Variables
===============

The ``rte_lcore_var.h`` API provides a mechanism to allocate and
access per-lcore id variables in a space- and cycle-efficient manner.


Lcore Variables API
-------------------

A per-lcore id variable (or lcore variable for short)
holds a unique value for each EAL thread and registered non-EAL thread.
Thus, there is one distinct value for each past, current and future
lcore id-equipped thread, with a total of ``RTE_MAX_LCORE`` instances.

The value of the lcore variable for one lcore id is independent of the
values associated with other lcore ids within the same variable.

For detailed information on the lcore variables API,
please refer to the ``rte_lcore_var.h`` API documentation.


Lcore Variable Handle
~~~~~~~~~~~~~~~~~~~~~

To allocate and access an lcore variable's values, a *handle* is used.
The handle is represented by an opaque pointer,
only to be dereferenced using the appropriate ``<rte_lcore_var.h>`` macros.

The handle is a pointer to the value's type
(e.g., for a ``uint32_t`` lcore variable, the handle is a ``uint32_t *``).

The reason the handle is typed (i.e., it's not a void pointer or an integer)
is to enable type checking when accessing values of the lcore variable.

A handle may be passed between modules and threads
just like any other pointer.

A valid (i.e., allocated) handle never has the value NULL.
Thus, a handle set to NULL may be used
to signify that allocation has not yet been done.


Lcore Variable Allocation
~~~~~~~~~~~~~~~~~~~~~~~~~

An lcore variable is created in two steps:

1. Define an lcore variable handle by using ``RTE_LCORE_VAR_HANDLE``.
2. Allocate lcore variable storage and initialize the handle
   by using ``RTE_LCORE_VAR_ALLOC`` or ``RTE_LCORE_VAR_INIT``.
   Allocation generally occurs at the time of module initialization,
   but may be done at any time.

The lifetime of an lcore variable is not tied to the thread that created it.

Each lcore variable has ``RTE_MAX_LCORE`` values,
one for each possible lcore id.
All of an lcore variable's values may be accessed
from the moment the lcore variable is created,
throughout the lifetime of the EAL (i.e., until ``rte_eal_cleanup()``).

Lcore variables do not need to be freed and cannot be freed.


Access
~~~~~~

The value of any lcore variable for any lcore id
may be accessed from any thread (including unregistered threads),
but it should only be *frequently* read from or written to by the *owner*.
A thread is considered the owner of a particular lcore variable value instance
if it has the lcore id associated with that instance.

Non-owner accesses result in *false sharing*.
As long as non-owner accesses are rare,
they will have only a very slight effect on performance.
This property of lcore variables memory organization is intentional.
See the implementation section for more information.

Values of the same lcore variable associated with different lcore ids
may be frequently read or written
by their respective owners without risking false sharing.

An appropriate synchronization mechanism,
such as atomic loads and stores,
should be employed to prevent data races between the owning thread
and any other thread accessing the same value instance.

The value of the lcore variable for a particular lcore id
is accessed via ``RTE_LCORE_VAR_LCORE``.

A common pattern is for an EAL thread or a registered non-EAL thread
to access its own lcore variable value.
For this purpose, a shorthand exists as ``RTE_LCORE_VAR``.

``RTE_LCORE_VAR_FOREACH`` may be used to iterate
over all values of a particular lcore variable.
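
The ownership and synchronization rules above can be illustrated
with a minimal, DPDK-free sketch.
It assumes a plain array indexed by a hypothetical lcore id
as a stand-in for lcore variable storage,
so it shows only the access discipline, not the memory layout.
In real code, the values would be reached through ``RTE_LCORE_VAR``
and ``RTE_LCORE_VAR_LCORE``.
``MODEL_MAX_LCORE``, ``owner_count_packet()`` and ``read_total()``
are names made up for this illustration.

.. code-block:: c

   #include <assert.h>
   #include <stdatomic.h>
   #include <stdint.h>

   /* Hypothetical stand-in for RTE_MAX_LCORE; not the DPDK constant. */
   #define MODEL_MAX_LCORE 4

   /*
    * One value instance per (hypothetical) lcore id.
    * Assumption: a plain array replaces the lcore variable storage.
    */
   static _Atomic uint64_t packet_count[MODEL_MAX_LCORE];

   /* Called frequently, and only by the thread owning 'lcore_id'. */
   void owner_count_packet(unsigned int lcore_id)
   {
       assert(lcore_id < MODEL_MAX_LCORE);

       /*
        * The owner is the only writer of this instance, so a relaxed
        * load followed by a relaxed store is enough to avoid a data
        * race with readers; no atomic read-modify-write is needed.
        */
       uint64_t count = atomic_load_explicit(&packet_count[lcore_id],
                                             memory_order_relaxed);
       atomic_store_explicit(&packet_count[lcore_id], count + 1,
                             memory_order_relaxed);
   }

   /* Called rarely, from any thread (i.e., non-owner accesses). */
   uint64_t read_total(void)
   {
       uint64_t total = 0;

       for (unsigned int lcore_id = 0; lcore_id < MODEL_MAX_LCORE; lcore_id++)
           total += atomic_load_explicit(&packet_count[lcore_id],
                                         memory_order_relaxed);

       return total;
   }

Relaxed ordering is chosen here because the counters carry no ordering
relationship to other data; a value that publishes other state
would need acquire/release semantics instead.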

The handle, defined by ``RTE_LCORE_VAR_HANDLE``,
is a pointer of the same type as the value,
but it must be treated as an opaque identifier
and cannot be directly dereferenced.

Lcore variable handles and value pointers may be freely passed
between different threads.


Storage
~~~~~~~

An lcore variable's values may be of a primitive type like ``int``,
but are typically of a ``struct`` type.

The lcore variable handle introduces a per-variable
(not per-value/per-lcore id) overhead of ``sizeof(void *)`` bytes,
so there are some memory footprint gains to be made by organizing
all per-lcore id data for a particular module as one lcore variable
(e.g., as a struct).

An application may define an lcore variable handle
without ever allocating the lcore variable.

The size of an lcore variable's value cannot exceed
the DPDK build-time constant ``RTE_MAX_LCORE_VAR``.
An lcore variable's size is the size of one of its value instances,
not the aggregate of all its ``RTE_MAX_LCORE`` instances.

Lcore variables should generally *not* be ``__rte_cache_aligned``
and need *not* include a ``RTE_CACHE_GUARD`` field,
since these constructs are designed to avoid false sharing.
With lcore variables, false sharing is largely avoided by other means.
In the case of an lcore variable instance,
the thread most recently accessing nearby data structures
should almost always be the lcore variable's owner.
Adding padding (e.g., with ``RTE_CACHE_GUARD``)
will increase the effective memory working set size,
potentially reducing performance.

Lcore variable values are initialized to zero by default.

Lcore variables are not stored in huge page memory.


Example
~~~~~~~

Below is an example of the use of an lcore variable:

.. code-block:: c

   struct foo_lcore_state {
       int a;
       long b;
   };

   /* Step 1: define the handle. */
   static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);

   long foo_get_a_plus_b(void)
   {
       /* Access the value owned by the calling thread's lcore id. */
       const struct foo_lcore_state *state = RTE_LCORE_VAR(lcore_states);

       return state->a + state->b;
   }

   RTE_INIT(rte_foo_init)
   {
       /* Step 2: allocate storage and initialize the handle. */
       RTE_LCORE_VAR_ALLOC(lcore_states);

       unsigned int lcore_id;
       struct foo_lcore_state *state;
       RTE_LCORE_VAR_FOREACH(lcore_id, state, lcore_states) {
           /* initialize state */
       }

       /* other initialization */
   }


Implementation
--------------

This section gives an overview of the implementation of lcore variables,
and some background to its design.


Lcore Variable Buffers
~~~~~~~~~~~~~~~~~~~~~~

Lcore variable values are kept in a set of ``lcore_var_buffer`` structs.

.. literalinclude:: ../../../lib/eal/common/eal_common_lcore_var.c
   :language: c
   :start-after: base unit
   :end-before: last allocated unit

An lcore var buffer stores at a minimum one, but usually many, lcore variables.

The value instances for all lcore ids are stored in the same buffer.
However, each lcore id has its own slice of the ``data`` array.
Such a slice is ``RTE_MAX_LCORE_VAR`` bytes in size.

In this way, the values associated with a particular lcore id
are grouped spatially close (in memory).
No padding is required to prevent false sharing.

.. literalinclude:: ../../../lib/eal/common/eal_common_lcore_var.c
   :language: c
   :start-after: last allocated unit
   :end-before: >8 end of documented variables

The implementation maintains a current ``lcore_var_buffer`` and an ``offset``,
where the latter tracks how many bytes of this current buffer have been allocated.

The ``offset`` is progressively incremented
(by the size of the just-allocated lcore variable),
as lcore variables are being allocated.
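
The buffer and ``offset`` bookkeeping can be modelled
with a small, DPDK-free sketch.
This is not the actual ``eal_common_lcore_var.c`` code:
``model_buffer``, ``model_alloc()`` and the scaled-down
``MODEL_MAX_LCORE`` and ``MODEL_MAX_LCORE_VAR`` constants are hypothetical,
and rounding the offset up to the requested alignment
is an assumption of the sketch.
It also chains a new buffer when the current one cannot fit the request.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>
   #include <stdio.h>
   #include <stdlib.h>

   /* Scaled-down, hypothetical stand-ins for the DPDK build-time constants. */
   #define MODEL_MAX_LCORE 4
   #define MODEL_MAX_LCORE_VAR 64

   /* One slice of MODEL_MAX_LCORE_VAR bytes per lcore id. */
   struct model_buffer {
       char data[MODEL_MAX_LCORE * MODEL_MAX_LCORE_VAR];
       struct model_buffer *prev; /* chain of older, full buffers */
   };

   static struct model_buffer *current_buffer;
   static size_t offset; /* bytes used in each slice of current_buffer */

   /*
    * Returns the address of the value for lcore id 0.
    * The value for lcore id N lives N * MODEL_MAX_LCORE_VAR bytes further on.
    * Assumption: 'align' is a power of two no larger than the alignment
    * guaranteed by calloc().
    */
   void *model_alloc(size_t size, size_t align)
   {
       assert(size > 0 && size <= MODEL_MAX_LCORE_VAR);
       assert(align > 0 && (align & (align - 1)) == 0);

       /* Round the current offset up to the requested alignment. */
       size_t start = (offset + align - 1) & ~(align - 1);

       if (current_buffer == NULL || start + size > MODEL_MAX_LCORE_VAR) {
           /* No buffer yet, or the slice is full: start a new buffer. */
           struct model_buffer *buffer = calloc(1, sizeof(*buffer));

           if (buffer == NULL) {
               /* Allocation failure is treated as fatal in this model. */
               fprintf(stderr, "model_alloc: out of memory\n");
               abort();
           }

           buffer->prev = current_buffer;
           current_buffer = buffer;
           start = 0;
       }

       offset = start + size;

       return current_buffer->data + start;
   }

Because ``calloc()`` is used, all value instances start out zeroed,
and the ``prev`` pointers let every buffer be found again for cleanup.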

If the allocation of a variable would result in an ``offset`` larger
than ``RTE_MAX_LCORE_VAR`` (i.e., the slice size), the buffer is full.
In that case, a new buffer is allocated off the heap, and the ``offset`` is reset.

The lcore var buffers are arranged in a linked list,
to allow freeing them at the point of ``rte_eal_cleanup()``.

The lcore variable buffers are allocated off the regular C heap.
There are a number of reasons for not using ``<rte_malloc.h>``
and huge pages for lcore variables:

- The libc heap is available at any time,
  including early in the DPDK initialization.
- The amount of data kept in lcore variables is projected to be small,
  and thus is unlikely to induce translation lookaside buffer (TLB) misses.
- The last (and potentially only) lcore buffer in the chain
  will likely only partially be in use.
  Huge pages of the sort used by DPDK are always resident in memory,
  and their use would result in a significant amount of memory going to waste.
  An example: ~256 kB worth of lcore variables are allocated
  by DPDK libraries, PMDs and the application.
  ``RTE_MAX_LCORE_VAR`` is set to 1 MB and ``RTE_MAX_LCORE`` to 128.
  With 4 kB OS pages, only the first ~64 pages of each of the 128 per-lcore id slices
  in the (only) ``lcore_var_buffer`` will actually be resident (paged in).
  Here, demand paging saves ~98 MB of memory
  (128 slices, each with ~768 kB left untouched).

.. note::

   Not residing in huge pages, lcore variables cannot be accessed from secondary processes.

Heap allocation failures are treated as fatal.
The reason for this unorthodox design is that a majority of the allocations
are deemed to happen at initialization.
An early heap allocation failure for a fixed amount of data is a situation
not unlike one where there is not enough memory available for static variables
(i.e., the BSS or data sections).

Provided these assumptions hold true, it's deemed acceptable
to leave the application out of handling memory allocation failures.

The upside of this approach is that no error handling code is required
on the API user side.


Lcore Variable Handles
~~~~~~~~~~~~~~~~~~~~~~

Upon lcore variable allocation, the lcore variables API returns
an opaque *handle* in the form of a pointer.
The value of the pointer is ``buffer->data + offset``.

Translating a handle base pointer to a pointer to a value
associated with a particular lcore id is straightforward:

.. literalinclude:: ../../../lib/eal/include/rte_lcore_var.h
   :language: c
   :start-after: access function 8<
   :end-before: >8 end of access function

``RTE_MAX_LCORE_VAR`` is a public macro to allow the compiler
to optimize the ``lcore_id * RTE_MAX_LCORE_VAR`` expression,
and replace the multiplication with a less expensive arithmetic operation.

To maintain type safety, the ``RTE_LCORE_VAR*()`` macros should be used,
instead of directly invoking ``rte_lcore_var_lcore()``.
The macros return a pointer of the same type as the handle
(i.e., a pointer to the value's type).


Memory Layout
~~~~~~~~~~~~~

This section describes how lcore variables are organized in memory.

As an illustration, two example modules are used,
``rte_x`` and ``rte_y``, both maintaining per-lcore id state
as a part of their implementation.

Two different methods will be used to maintain such state:
lcore variables and, to serve as a reference, lcore id-indexed static arrays.

Certain parameters are scaled down to make graphical depictions more practical.

For the purpose of this exercise, a ``RTE_MAX_LCORE`` of 2 is assumed.
In a real-world configuration, the maximum number of
EAL threads and registered threads will be much greater (e.g., 128).

The lcore variables example assumes a ``RTE_MAX_LCORE_VAR`` of 64.
In a real-world configuration (as controlled by ``rte_config.h``),
the value of this compile-time constant will be much greater (e.g., 1048576).

The per-lcore id state is also smaller than what most real-world modules would have.

Lcore Variables Example
^^^^^^^^^^^^^^^^^^^^^^^

When lcore variables are used, the parts of ``rte_x`` and ``rte_y``
that deal with the declaration and allocation of per-lcore id data
may look something like below.

.. code-block:: c

   /* -- Lcore variables -- */

   /* rte_x.c */

   struct x_lcore
   {
       int a;
       char b;
   };

   static RTE_LCORE_VAR_HANDLE(struct x_lcore, x_lcores);
   RTE_LCORE_VAR_INIT(x_lcores);

   /../

   /* rte_y.c */

   struct y_lcore
   {
       long c;
       long d;
   };

   static RTE_LCORE_VAR_HANDLE(struct y_lcore, y_lcores);
   RTE_LCORE_VAR_INIT(y_lcores);

   /../

The resulting memory layout will look something like the following:

.. figure:: img/lcore_var_mem_layout.*

The above figure assumes that ``x_lcores`` is allocated prior to ``y_lcores``.
``RTE_LCORE_VAR_INIT()`` relies on constructors, which run prior to ``main()`` in an undefined order.

The use of lcore variables ensures that per-lcore id data is kept in close proximity,
within a designated region of memory.
This proximity enhances data locality and can improve performance.

Lcore Id Index Static Array Example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Below is an example of the struct declarations
and the resulting organization in memory
when an lcore id indexed static array of cache-line aligned,
``RTE_CACHE_GUARD``-ed structs is used to maintain per-lcore id state.

This is a common pattern in DPDK, which lcore variables attempt to replace.

.. code-block:: c

   /* -- Cache-aligned static arrays -- */

   /* rte_x.c */

   struct __rte_cache_aligned x_lcore
   {
       int a;
       char b;
       RTE_CACHE_GUARD;
   };

   static struct x_lcore x_lcores[RTE_MAX_LCORE];

   /../

   /* rte_y.c */

   struct __rte_cache_aligned y_lcore
   {
       long c;
       long d;
       RTE_CACHE_GUARD;
   };

   static struct y_lcore y_lcores[RTE_MAX_LCORE];

   /../

In this approach, accessing the state for a particular lcore id is merely
a matter of retrieving the lcore id and looking up the correct struct instance.

.. code-block:: c

   struct x_lcore *my_lcore_state = &x_lcores[rte_lcore_id()];

The address "0" at the top of the left-most column in the figure
represents the base address for the ``x_lcores`` array
(in the BSS segment in memory).

The figure only includes the memory layout for the ``rte_x`` example module.
``rte_y`` would look very similar, with ``y_lcores`` being located
at some other address in the BSS section.

.. figure:: img/static_array_mem_layout.*

The static array approach results in the per-lcore id data
being organized around modules, not lcore ids.
To avoid false sharing, an extensive use of padding is employed,
causing cache fragmentation.

Because the padding is interspersed with the data,
demand paging is unlikely to reduce the actual resident DRAM memory footprint.
This is because the padding is smaller
than a typical operating system memory page (usually 4 kB).


Performance
~~~~~~~~~~~

One of the goals of lcore variables is to improve performance.
This is achieved by packing often-used data in fewer cache lines,
which reduces fragmentation in CPU caches
and thereby somewhat improves the effective cache size and cache hit rates.
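
As a rough illustration of the packing effect,
the sketch below counts how many cache lines one lcore id touches
when it reads both its ``x`` and its ``y`` state under the two layouts.
It is a plain C11 model, not DPDK code:
the 64-byte cache line, the single guard line per struct,
the ``MODEL_*`` constants and both ``lines_touched_*()`` helpers
are assumptions made for the example.

.. code-block:: c

   #include <assert.h>
   #include <stdalign.h>
   #include <stddef.h>
   #include <stdint.h>

   #define MODEL_CACHE_LINE 64    /* assumed cache line size */
   #define MODEL_MAX_LCORE 2      /* scaled down, as in the text */
   #define MODEL_MAX_LCORE_VAR 64 /* scaled down, as in the text */

   /* Static array style: cache-line aligned, plus one guard line. */
   struct x_padded {
       alignas(MODEL_CACHE_LINE) int a;
       char b;
       alignas(MODEL_CACHE_LINE) char guard[MODEL_CACHE_LINE];
   };

   struct y_padded {
       alignas(MODEL_CACHE_LINE) long c;
       long d;
       alignas(MODEL_CACHE_LINE) char guard[MODEL_CACHE_LINE];
   };

   static struct x_padded x_array[MODEL_MAX_LCORE];
   static struct y_padded y_array[MODEL_MAX_LCORE];

   /* Lcore variable style: unpadded values packed into a per-lcore slice. */
   struct x_lcore { int a; char b; };
   struct y_lcore { long c; long d; };

   static alignas(MODEL_CACHE_LINE) char slices[MODEL_MAX_LCORE][MODEL_MAX_LCORE_VAR];

   static uintptr_t cache_line_of(const void *addr)
   {
       return (uintptr_t)addr / MODEL_CACHE_LINE;
   }

   /* Each padded struct starts on its own cache line. */
   unsigned int lines_touched_static(unsigned int lcore_id)
   {
       assert(lcore_id < MODEL_MAX_LCORE);

       return cache_line_of(&x_array[lcore_id]) ==
              cache_line_of(&y_array[lcore_id]) ? 1 : 2;
   }

   /* x sits at offset 0 of the slice, y at the next suitably aligned offset. */
   unsigned int lines_touched_lcore_var(unsigned int lcore_id)
   {
       assert(lcore_id < MODEL_MAX_LCORE);

       const char *x = slices[lcore_id];
       size_t y_offset = (sizeof(struct x_lcore) + alignof(struct y_lcore) - 1) &
                         ~(alignof(struct y_lcore) - 1);
       const char *y_last = x + y_offset + sizeof(struct y_lcore) - 1;

       return cache_line_of(x) == cache_line_of(y_last) ? 1 : 2;
   }

In this model, the static arrays spend two cache lines on each module's
per-lcore id state, while the lcore variable slice holds both values
within a single line.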

The application-level gains depend much on how much data is kept in lcore variables,
how often it is accessed,
and how much pressure the application exerts on the CPU caches
(i.e., how much other memory it accesses).

The ``lcore_var_perf_autotest`` is an attempt at exploring
the performance benefits (or drawbacks) of lcore variables
compared to their alternatives.
Being a micro benchmark, it needs to be taken with a grain of salt.

Generally, one shouldn't expect more than some very modest gains in performance
after a switch from lcore id indexed arrays to lcore variables.

An additional benefit of the use of lcore variables is that it avoids
certain tricky issues related to CPU core hardware prefetching
(e.g., next-N-lines prefetching) that may cause false sharing
even when data used by two cores do not reside on the same cache line.
Hardware prefetch behavior is generally not publicly documented
and varies across CPU vendors, CPU generations and BIOS (or similar) configurations.
For applications aiming to be portable, this may cause issues.
Often, CPU hardware prefetch-induced issues are non-existent,
except in some particular circumstances, where their adverse effects may be significant.


Alternatives
------------

Lcore Id Indexed Static Arrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Lcore variables are designed to replace a pattern exemplified below:

.. code-block:: c

   struct __rte_cache_aligned foo_lcore_state {
       int a;
       long b;
       RTE_CACHE_GUARD;
   };

   static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];

This scheme is simple and effective, but has one drawback:
the data is organized so that objects related to all lcores for a particular module
are kept close in memory.
At a bare minimum, this requires sizing data structures
(e.g., using ``__rte_cache_aligned``) to a whole number of cache lines
and ensuring that such objects are allocated
cache line aligned, to avoid false sharing.
With CPU hardware prefetching and memory loads resulting from speculative execution
(functions which seemingly are getting more eager faster
than they are getting more intelligent),
one or more "guard" cache lines may be required
to separate one lcore's data from another's and prevent false sharing.

Lcore variables offer the advantage of working with,
rather than against, the CPU's assumptions.
A next-line hardware prefetcher, for example, may function as intended
(i.e., to the benefit, not detriment, of system performance).


Thread Local Storage
~~~~~~~~~~~~~~~~~~~~

An alternative to ``rte_lcore_var.h`` is the ``rte_per_lcore.h`` API,
which makes use of thread-local storage
(TLS, e.g., GCC ``__thread`` or C11 ``_Thread_local``).

There are a number of differences between using TLS
and the use of lcore variables.

The lifecycle of a thread-local variable instance is tied to that of the thread.
The data cannot be accessed before the thread has been created,
nor after it has terminated.
As a result, thread-local variables must be initialized in a "lazy" manner
(e.g., at the point of thread creation).
Lcore variables may be accessed immediately after having been allocated
(which may occur before any thread beyond the main thread is running).

A thread-local variable is duplicated across all threads in the process,
including unregistered non-EAL threads (i.e., "regular" threads).
For DPDK applications heavily relying on multi-threading
(in conjunction with DPDK's "one thread per core" pattern),
either by having many concurrent threads or creating/destroying threads at a high rate,
an excessive use of thread-local variables may cause inefficiencies
(e.g., increased thread creation overhead due to thread-local storage initialization
or increased memory footprint).
Lcore variables *only* exist for threads with an lcore id.

Whether data in thread-local storage can be shared between threads
(i.e., whether a pointer to a thread-local variable can be passed to
and successfully dereferenced by a non-owning thread)
depends on the specifics of the TLS implementation.
With GCC ``__thread`` and GCC ``_Thread_local``,
data sharing between threads is supported.
In the C11 standard, accessing another thread's ``_Thread_local`` object
is implementation-defined.
Lcore variable instances may be accessed reliably by any thread.

Lcore variables also rely on TLS to retrieve the thread's lcore id.
However, the rest of the per-thread data is not kept in TLS.

From a memory layout perspective, TLS is similar to lcore variables,
and thus per-thread data structures need not be padded.

If the above-mentioned drawbacks of TLS are of no significance
to a particular application, TLS is a good alternative to lcore variables.