1ecc1da8aSBen Walker# Message Passing and Concurrency {#concurrency} 2ecc1da8aSBen Walker 31e1fd9acSwawryk## Theory 4ecc1da8aSBen Walker 5ecc1da8aSBen WalkerOne of the primary aims of SPDK is to scale linearly with the addition of 600d692d0SBen Walkerhardware. This can mean many things in practice. For instance, moving from one 700d692d0SBen WalkerSSD to two should double the number of I/O's per second. Or doubling the number 800d692d0SBen Walkerof CPU cores should double the amount of computation possible. Or even doubling 900d692d0SBen Walkerthe number of NICs should double the network throughput. To achieve this, the 1000d692d0SBen Walkersoftware's threads of execution must be independent from one another as much as 1100d692d0SBen Walkerpossible. In practice, that means avoiding software locks and even atomic 1200d692d0SBen Walkerinstructions. 13ecc1da8aSBen Walker 14ecc1da8aSBen WalkerTraditionally, software achieves concurrency by placing some shared data onto 15ecc1da8aSBen Walkerthe heap, protecting it with a lock, and then having all threads of execution 1600d692d0SBen Walkeracquire the lock only when accessing the data. This model has many great 1700d692d0SBen Walkerproperties: 18ecc1da8aSBen Walker 1900d692d0SBen Walker* It's easy to convert single-threaded programs to multi-threaded programs 2000d692d0SBen Walker because you don't have to change the data model from the single-threaded 2100d692d0SBen Walker version. You add a lock around the data. 22ecc1da8aSBen Walker* You can write your program as a synchronous, imperative list of statements 23ecc1da8aSBen Walker that you read from top to bottom. 2400d692d0SBen Walker* The scheduler can interrupt threads, allowing for efficient time-sharing 2500d692d0SBen Walker of CPU resources. 26ecc1da8aSBen Walker 2700d692d0SBen WalkerUnfortunately, as the number of threads scales up, contention on the lock around 2800d692d0SBen Walkerthe shared data does too. More granular locking helps, but then also increases 2900d692d0SBen Walkerthe complexity of the program. Even then, beyond a certain number of contended 3000d692d0SBen Walkerlocks, threads will spend most of their time attempting to acquire the locks and 3100d692d0SBen Walkerthe program will not benefit from more CPU cores. 32ecc1da8aSBen Walker 33ecc1da8aSBen WalkerSPDK takes a different approach altogether. Instead of placing shared data in a 34ecc1da8aSBen Walkerglobal location that all threads access after acquiring a lock, SPDK will often 3500d692d0SBen Walkerassign that data to a single thread. When other threads want to access the data, 3600d692d0SBen Walkerthey pass a message to the owning thread to perform the operation on their 3700d692d0SBen Walkerbehalf. This strategy, of course, is not at all new. For instance, it is one of 3800d692d0SBen Walkerthe core design principles of 39ecc1da8aSBen Walker[Erlang](http://erlang.org/download/armstrong_thesis_2003.pdf) and is the main 40ecc1da8aSBen Walkerconcurrency mechanism in [Go](https://tour.golang.org/concurrency/2). A message 4100d692d0SBen Walkerin SPDK consists of a function pointer and a pointer to some context. Messages 4200d692d0SBen Walkerare passed between threads using a 43ecc1da8aSBen Walker[lockless ring](http://dpdk.org/doc/guides/prog_guide/ring_lib.html). Message 4400d692d0SBen Walkerpassing is often much faster than most software developer's intuition leads them 4500d692d0SBen Walkerto believe due to caching effects. If a single core is accessing the same data 4600d692d0SBen Walker(on behalf of all of the other cores), then that data is far more likely to be 4700d692d0SBen Walkerin a cache closer to that core. It's often most efficient to have each core work 4800d692d0SBen Walkeron a small set of data sitting in its local cache and then hand off a small 4900d692d0SBen Walkermessage to the next core when done. 50ecc1da8aSBen Walker 5100d692d0SBen WalkerIn more extreme cases where even message passing may be too costly, each thread 5200d692d0SBen Walkermay make a local copy of the data. The thread will then only reference its local 5300d692d0SBen Walkercopy. To mutate the data, threads will send a message to each other thread 5400d692d0SBen Walkertelling them to perform the update on their local copy. This is great when the 5500d692d0SBen Walkerdata isn't mutated very often, but is read very frequently, and is often 5600d692d0SBen Walkeremployed in the I/O path. This of course trades memory size for computational 5700d692d0SBen Walkerefficiency, so it is used in only the most critical code paths. 58ecc1da8aSBen Walker 591e1fd9acSwawryk## Message Passing Infrastructure 60ecc1da8aSBen Walker 61ecc1da8aSBen WalkerSPDK provides several layers of message passing infrastructure. The most 62ecc1da8aSBen Walkerfundamental libraries in SPDK, for instance, don't do any message passing on 63ecc1da8aSBen Walkertheir own and instead enumerate rules about when functions may be called in 64ecc1da8aSBen Walkertheir documentation (e.g. @ref nvme). Most libraries, however, depend on SPDK's 6553cbe5d0SBen Walker[thread](http://www.spdk.io/doc/thread_8h.html) 6653cbe5d0SBen Walkerabstraction, located in `libspdk_thread.a`. The thread abstraction provides a 6753cbe5d0SBen Walkerbasic message passing framework and defines a few key primitives. 68ecc1da8aSBen Walker 6900d692d0SBen WalkerFirst, `spdk_thread` is an abstraction for a lightweight, stackless thread of 7000d692d0SBen Walkerexecution. A lower level framework can execute an `spdk_thread` for a single 7100d692d0SBen Walkertimeslice by calling `spdk_thread_poll()`. A lower level framework is allowed to 7200d692d0SBen Walkermove an `spdk_thread` between system threads at any time, as long as there is 7300d692d0SBen Walkeronly a single system thread executing `spdk_thread_poll()` on that 7400d692d0SBen Walker`spdk_thread` at any given time. New lightweight threads may be created at any 7500d692d0SBen Walkertime by calling `spdk_thread_create()` and destroyed by calling 7600d692d0SBen Walker`spdk_thread_destroy()`. The lightweight thread is the foundational abstraction for 7700d692d0SBen Walkerthreading in SPDK. 78ecc1da8aSBen Walker 7900d692d0SBen WalkerThere are then a few additional abstractions layered on top of the 8000d692d0SBen Walker`spdk_thread`. One is the `spdk_poller`, which is an abstraction for a 8100d692d0SBen Walkerfunction that should be repeatedly called on the given thread. Another is an 8200d692d0SBen Walker`spdk_msg_fn`, which is a function pointer and a context pointer, that can 8300d692d0SBen Walkerbe sent to a thread for execution via `spdk_thread_send_msg()`. 8400d692d0SBen Walker 8500d692d0SBen WalkerThe library also defines two additional abstractions: `spdk_io_device` and 8600d692d0SBen Walker`spdk_io_channel`. In the course of implementing SPDK we noticed the same 8700d692d0SBen Walkerpattern emerging in a number of different libraries. In order to implement a 8800d692d0SBen Walkermessage passing strategy, the code would describe some object with global state 8900d692d0SBen Walkerand also some per-thread context associated with that object that was accessed 9000d692d0SBen Walkerin the I/O path to avoid locking on the global state. The pattern was clearest 9100d692d0SBen Walkerin the lowest layers where I/O was being submitted to block devices. These 9200d692d0SBen Walkerdevices often expose multiple queues that can be assigned to threads and then 9300d692d0SBen Walkeraccessed without a lock to submit I/O. To abstract that, we generalized the 9400d692d0SBen Walkerdevice to `spdk_io_device` and the thread-specific queue to `spdk_io_channel`. 9500d692d0SBen WalkerOver time, however, the pattern has appeared in a huge number of places that 9600d692d0SBen Walkerdon't fit quite so nicely with the names we originally chose. In today's code 9700d692d0SBen Walker`spdk_io_device` is any pointer, whose uniqueness is predicated only on its 9800d692d0SBen Walkermemory address, and `spdk_io_channel` is the per-thread context associated with 9900d692d0SBen Walkera particular `spdk_io_device`. 100ecc1da8aSBen Walker 10153cbe5d0SBen WalkerThe threading abstraction provides functions to send a message to any other 102ecc1da8aSBen Walkerthread, to send a message to all threads one by one, and to send a message to 103ecc1da8aSBen Walkerall threads for which there is an io_channel for a given io_device. 104ecc1da8aSBen Walker 10500d692d0SBen WalkerMost critically, the thread abstraction does not actually spawn any system level 10600d692d0SBen Walkerthreads of its own. Instead, it relies on the existence of some lower level 10700d692d0SBen Walkerframework that spawns system threads and sets up event loops. Inside those event 10800d692d0SBen Walkerloops, the threading abstraction simply requires the lower level framework to 10900d692d0SBen Walkerrepeatedly call `spdk_thread_poll()` on each `spdk_thread()` that exists. This 11000d692d0SBen Walkermakes SPDK very portable to a wide variety of asynchronous, event-based 11100d692d0SBen Walkerframeworks such as [Seastar](https://www.seastar.io) or [libuv](https://libuv.org/). 11200d692d0SBen Walker 113*b09ae853SMike Gerdts## SPDK Spinlocks 114*b09ae853SMike Gerdts 115*b09ae853SMike GerdtsThere are some cases where locks are used. These should be limited in favor of 116*b09ae853SMike Gerdtsthe message passing interface described above. When locks are needed, 117*b09ae853SMike GerdtsSPDK spinlocks should be used instead of POSIX locks. 118*b09ae853SMike Gerdts 119*b09ae853SMike GerdtsPOSIX locks like `pthread_mutex_t` and `pthread_spinlock_t` do not properly 120*b09ae853SMike Gerdtshandle locking between SPDK's lightweight threads. SPDK's `spdk_spinlock` 121*b09ae853SMike Gerdtsis safe to use in SPDK libraries and applications. This safety comes from 122*b09ae853SMike Gerdtsimposing restrictions on when locks can be held. See 123*b09ae853SMike Gerdts[spdk_spinlock](structspdk__spinlock.html) for details. 124*b09ae853SMike Gerdts 1251e1fd9acSwawryk## The event Framework 126ecc1da8aSBen Walker 12700d692d0SBen WalkerThe SPDK project didn't want to officially pick an asynchronous, event-based 12800d692d0SBen Walkerframework for all of the example applications it shipped with, in the interest 12900d692d0SBen Walkerof supporting the widest variety of frameworks possible. But the applications do 13000d692d0SBen Walkerof course require something that implements an asynchronous event loop in order 13100d692d0SBen Walkerto run, so enter the `event` framework located in `lib/event`. This framework 13271ccea94SMaciej Szwedincludes things like polling and scheduling the lightweight threads, installing 13371ccea94SMaciej Szwedsignal handlers to cleanly shutdown, and basic command line option parsing. 13471ccea94SMaciej SzwedOnly established applications should consider directly integrating the lower 13571ccea94SMaciej Szwedlevel libraries. 136ecc1da8aSBen Walker 1371e1fd9acSwawryk## Limitations of the C Language 138ecc1da8aSBen Walker 139ecc1da8aSBen WalkerMessage passing is efficient, but it results in asynchronous code. 140ecc1da8aSBen WalkerUnfortunately, asynchronous code is a challenge in C. It's often implemented by 141ecc1da8aSBen Walkerpassing function pointers that are called when an operation completes. This 142ecc1da8aSBen Walkerchops up the code so that it isn't easy to follow, especially through logic 143ecc1da8aSBen Walkerbranches. The best solution is to use a language with support for 144ecc1da8aSBen Walker[futures and promises](https://en.wikipedia.org/wiki/Futures_and_promises), 145ecc1da8aSBen Walkersuch as C++, Rust, Go, or almost any other higher level language. However, SPDK is a low 146ecc1da8aSBen Walkerlevel library and requires very wide compatibility and portability, so we've 147ecc1da8aSBen Walkerelected to stay with plain old C. 148ecc1da8aSBen Walker 149ecc1da8aSBen WalkerWe do have a few recommendations to share, though. For _simple_ callback chains, 150ecc1da8aSBen Walkerit's easiest if you write the functions from bottom to top. By that we mean if 151ecc1da8aSBen Walkerfunction `foo` performs some asynchronous operation and when that completes 152ecc1da8aSBen Walkerfunction `bar` is called, then function `bar` performs some operation that 153ecc1da8aSBen Walkercalls function `baz` on completion, a good way to write it is as such: 154ecc1da8aSBen Walker 155111d4276SMaciej Wawryk```c 156ecc1da8aSBen Walker void baz(void *ctx) { 157ecc1da8aSBen Walker ... 158ecc1da8aSBen Walker } 159ecc1da8aSBen Walker 160ecc1da8aSBen Walker void bar(void *ctx) { 161ecc1da8aSBen Walker async_op(baz, ctx); 162ecc1da8aSBen Walker } 163ecc1da8aSBen Walker 164ecc1da8aSBen Walker void foo(void *ctx) { 165ecc1da8aSBen Walker async_op(bar, ctx); 166ecc1da8aSBen Walker } 167111d4276SMaciej Wawryk``` 168ecc1da8aSBen Walker 169ecc1da8aSBen WalkerDon't split these functions up - keep them as a nice unit that can be read from bottom to top. 170ecc1da8aSBen Walker 171ecc1da8aSBen WalkerFor more complex callback chains, especially ones that have logical branches 172ecc1da8aSBen Walkeror loops, it's best to write out a state machine. It turns out that higher 1731f813ec3SChen Wanglevel languages that support futures and promises are just generating state 174ecc1da8aSBen Walkermachines at compile time, so even though we don't have the ability to generate 175ecc1da8aSBen Walkerthem in C we can still write them out by hand. As an example, here's a 176ecc1da8aSBen Walkercallback chain that performs `foo` 5 times and then calls `bar` - effectively 177ecc1da8aSBen Walkeran asynchronous for loop. 178ecc1da8aSBen Walker 179111d4276SMaciej Wawryk```c 180ecc1da8aSBen Walker enum states { 181ecc1da8aSBen Walker FOO_START = 0, 182ecc1da8aSBen Walker FOO_END, 183ecc1da8aSBen Walker BAR_START, 184ecc1da8aSBen Walker BAR_END 185ecc1da8aSBen Walker }; 186ecc1da8aSBen Walker 187ecc1da8aSBen Walker struct state_machine { 188ecc1da8aSBen Walker enum states state; 189ecc1da8aSBen Walker 190ecc1da8aSBen Walker int count; 191ecc1da8aSBen Walker }; 192ecc1da8aSBen Walker 193ecc1da8aSBen Walker static void 194ecc1da8aSBen Walker foo_complete(void *ctx) 195ecc1da8aSBen Walker { 196ecc1da8aSBen Walker struct state_machine *sm = ctx; 197ecc1da8aSBen Walker 198ecc1da8aSBen Walker sm->state = FOO_END; 199ecc1da8aSBen Walker run_state_machine(sm); 200ecc1da8aSBen Walker } 201ecc1da8aSBen Walker 202ecc1da8aSBen Walker static void 203ecc1da8aSBen Walker foo(struct state_machine *sm) 204ecc1da8aSBen Walker { 205ecc1da8aSBen Walker do_async_op(foo_complete, sm); 206ecc1da8aSBen Walker } 207ecc1da8aSBen Walker 208ecc1da8aSBen Walker static void 209ecc1da8aSBen Walker bar_complete(void *ctx) 210ecc1da8aSBen Walker { 211ecc1da8aSBen Walker struct state_machine *sm = ctx; 212ecc1da8aSBen Walker 213ecc1da8aSBen Walker sm->state = BAR_END; 214ecc1da8aSBen Walker run_state_machine(sm); 215ecc1da8aSBen Walker } 216ecc1da8aSBen Walker 217ecc1da8aSBen Walker static void 218ecc1da8aSBen Walker bar(struct state_machine *sm) 219ecc1da8aSBen Walker { 220ecc1da8aSBen Walker do_async_op(bar_complete, sm); 221ecc1da8aSBen Walker } 222ecc1da8aSBen Walker 223ecc1da8aSBen Walker static void 224ecc1da8aSBen Walker run_state_machine(struct state_machine *sm) 225ecc1da8aSBen Walker { 226ecc1da8aSBen Walker enum states prev_state; 227ecc1da8aSBen Walker 228ecc1da8aSBen Walker do { 229ecc1da8aSBen Walker prev_state = sm->state; 230ecc1da8aSBen Walker 231ecc1da8aSBen Walker switch (sm->state) { 232ecc1da8aSBen Walker case FOO_START: 233ecc1da8aSBen Walker foo(sm); 234ecc1da8aSBen Walker break; 235ecc1da8aSBen Walker case FOO_END: 236ecc1da8aSBen Walker /* This is the loop condition */ 237ecc1da8aSBen Walker if (sm->count++ < 5) { 238ecc1da8aSBen Walker sm->state = FOO_START; 239ecc1da8aSBen Walker } else { 240ecc1da8aSBen Walker sm->state = BAR_START; 241ecc1da8aSBen Walker } 242ecc1da8aSBen Walker break; 243ecc1da8aSBen Walker case BAR_START: 244ecc1da8aSBen Walker bar(sm); 245ecc1da8aSBen Walker break; 246ecc1da8aSBen Walker case BAR_END: 247ecc1da8aSBen Walker break; 248ecc1da8aSBen Walker } 249ecc1da8aSBen Walker } while (prev_state != sm->state); 250ecc1da8aSBen Walker } 251ecc1da8aSBen Walker 252ecc1da8aSBen Walker void do_async_for(void) 253ecc1da8aSBen Walker { 254ecc1da8aSBen Walker struct state_machine *sm; 255ecc1da8aSBen Walker 256ecc1da8aSBen Walker sm = malloc(sizeof(*sm)); 257ecc1da8aSBen Walker sm->state = FOO_START; 258ecc1da8aSBen Walker sm->count = 0; 259ecc1da8aSBen Walker 260ecc1da8aSBen Walker run_state_machine(sm); 261ecc1da8aSBen Walker } 262111d4276SMaciej Wawryk``` 263ecc1da8aSBen Walker 264ecc1da8aSBen WalkerThis is complex, of course, but the `run_state_machine` function can be read 265ecc1da8aSBen Walkerfrom top to bottom to get a clear overview of what's happening in the code 266ecc1da8aSBen Walkerwithout having to chase through each of the callbacks. 267