# Message Passing and Concurrency {#concurrency}

## Theory

One of the primary aims of SPDK is to scale linearly with the addition of
hardware. This can mean many things in practice. For instance, moving from one
SSD to two should double the number of I/Os per second. Or doubling the number
of CPU cores should double the amount of computation possible. Or even doubling
the number of NICs should double the network throughput. To achieve this, the
software's threads of execution must be as independent of one another as
possible. In practice, that means avoiding software locks and even atomic
instructions.

Traditionally, software achieves concurrency by placing some shared data onto
the heap, protecting it with a lock, and then having all threads of execution
acquire the lock only when accessing the data. This model has many great
properties:

* It's easy to convert single-threaded programs to multi-threaded programs
  because you don't have to change the data model from the single-threaded
  version. You simply add a lock around the data.
* You can write your program as a synchronous, imperative list of statements
  that you read from top to bottom.
* The scheduler can interrupt threads, allowing for efficient time-sharing
  of CPU resources.

Unfortunately, as the number of threads scales up, so does contention on the
lock around the shared data. More granular locking helps, but it also greatly
increases the complexity of the program. Even then, beyond a certain number of
contended locks, threads will spend most of their time attempting to acquire
the locks and the program will not benefit from additional CPU cores.

SPDK takes a different approach altogether. Instead of placing shared data in a
global location that all threads access after acquiring a lock, SPDK will often
assign that data to a single thread. When other threads want to access the
data, they pass a message to the owning thread to perform the operation on
their behalf. This strategy, of course, is not at all new. For instance, it is
one of the core design principles of
[Erlang](http://erlang.org/download/armstrong_thesis_2003.pdf) and is the main
concurrency mechanism in [Go](https://tour.golang.org/concurrency/2). A message
in SPDK consists of a function pointer and a pointer to some context. Messages
are passed between threads using a
[lockless ring](http://dpdk.org/doc/guides/prog_guide/ring_lib.html). Message
passing is often much faster than most software developers' intuition leads
them to believe, primarily due to caching effects. If a single core is
accessing the same data (on behalf of all of the other cores), then that data
is far more likely to be in a cache closer to that core. It's often most
efficient to have each core work on a small set of data sitting in its local
cache and then hand off a small message to the next core when done.

In more extreme cases where even message passing may be too costly, each thread
may make a local copy of the data. The thread will then only reference its
local copy. To mutate the data, threads will send a message to each other
thread telling them to perform the update on their local copy. This works best
when the data isn't mutated very often, but is read very frequently, and is
often employed in the I/O path. This of course trades memory size for
computational efficiency, so it is used only in the most critical code paths.
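To make the message concept concrete, here is a minimal sketch of a message
and the polling loop that an owning core might run. The names (`struct msg`,
`struct ring`, `ring_dequeue()`) are illustrative stand-ins, not SPDK's actual
API; a real implementation would use a lockless ring such as DPDK's `rte_ring`.

```c
#include <stddef.h>

/* Illustrative types only - not SPDK's actual API. */
struct msg {
	void (*fn)(void *ctx);	/* operation to perform on the owning core */
	void *ctx;		/* the data the operation works on */
};

struct ring;	/* opaque lockless ring (e.g. DPDK's rte_ring) */

/* Hypothetical dequeue; returns NULL when the ring is empty. */
struct msg *ring_dequeue(struct ring *ring);

/*
 * The owning core drains its ring and executes each message locally.
 * Because only this core touches the underlying data, no lock is
 * needed and the data stays hot in this core's cache.
 */
void
poll_messages(struct ring *ring)
{
	struct msg *m;

	while ((m = ring_dequeue(ring)) != NULL) {
		m->fn(m->ctx);
	}
}
```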
## Message Passing Infrastructure

SPDK provides several layers of message passing infrastructure. The most
fundamental libraries in SPDK, for instance, don't do any message passing on
their own and instead enumerate rules about when functions may be called in
their documentation (e.g. @ref nvme). Most libraries, however, depend on SPDK's
[thread](http://www.spdk.io/doc/thread_8h.html)
abstraction, located in `libspdk_thread.a`. The thread abstraction provides a
basic message passing framework and defines a few key primitives.

First, `spdk_thread` is an abstraction for a lightweight, stackless thread of
execution. A lower level framework can execute an `spdk_thread` for a single
timeslice by calling `spdk_thread_poll()`. The framework is allowed to move an
`spdk_thread` between system threads at any time, as long as there is only a
single system thread executing `spdk_thread_poll()` on that `spdk_thread` at
any given time. New lightweight threads may be created at any time by calling
`spdk_thread_create()` and destroyed by calling `spdk_thread_destroy()`. The
lightweight thread is the foundational abstraction for threading in SPDK.

There are then a few additional abstractions layered on top of the
`spdk_thread`. One is the `spdk_poller`, which is an abstraction for a
function that should be repeatedly called on the given thread. Another is the
`spdk_msg_fn`, a function pointer that, paired with a context pointer, can be
sent to a thread for execution via `spdk_thread_send_msg()`.

The library also defines two additional abstractions: `spdk_io_device` and
`spdk_io_channel`. In the course of implementing SPDK we noticed the same
pattern emerging in a number of different libraries. In order to implement a
message passing strategy, the code would describe some object with global state
and also some per-thread context associated with that object that was accessed
in the I/O path to avoid locking on the global state. The pattern was clearest
in the lowest layers where I/O was being submitted to block devices. These
devices often expose multiple queues that can be assigned to threads and then
accessed without a lock to submit I/O. To abstract that, we generalized the
device to `spdk_io_device` and the thread-specific queue to `spdk_io_channel`.
Over time, however, the pattern has appeared in a huge number of places that
don't fit quite so nicely with the names we originally chose. In today's code,
an `spdk_io_device` is any pointer whose uniqueness is predicated only on its
memory address, and an `spdk_io_channel` is the per-thread context associated
with a particular `spdk_io_device`.

The threading abstraction provides functions to send a message to any other
thread, to send a message to all threads one by one, and to send a message to
all threads for which there is an io_channel for a given io_device.

Most critically, the thread abstraction does not actually spawn any system
level threads of its own. Instead, it relies on the existence of some lower
level framework that spawns system threads and sets up event loops. Inside
those event loops, the threading abstraction simply requires the lower level
framework to repeatedly call `spdk_thread_poll()` on each `spdk_thread` that
exists. This makes SPDK very portable to a wide variety of asynchronous,
event-based frameworks such as [Seastar](https://www.seastar.io) or
[libuv](https://libuv.org/).
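As a small sketch of how a message is sent with this abstraction (the
return-value convention of `spdk_thread_send_msg()` has changed across SPDK
releases, so consult the headers for your version):

```c
#include <stdint.h>
#include "spdk/thread.h"

/* Executed later on the owning thread; ctx is the pointer passed to
 * spdk_thread_send_msg(). */
static void
increment_counter(void *ctx)
{
	uint64_t *counter = ctx;

	(*counter)++;
}

/*
 * Instead of locking the counter, ask the thread that owns it to do
 * the update. Recent SPDK releases return a negative errno on failure
 * (e.g. the target thread is exiting), so check the return value.
 */
static void
request_increment(struct spdk_thread *owner, uint64_t *counter)
{
	if (spdk_thread_send_msg(owner, increment_counter, counter) != 0) {
		/* Handle the failure, e.g. retry or report an error. */
	}
}
```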
## SPDK Spinlocks

There are some cases where locks are still necessary, but their use should be
limited in favor of the message passing interface described above. When locks
are needed, SPDK spinlocks should be used instead of POSIX locks.

POSIX locks like `pthread_mutex_t` and `pthread_spinlock_t` do not properly
handle locking between SPDK's lightweight threads. SPDK's `spdk_spinlock`
is safe to use in SPDK libraries and applications. This safety comes from
imposing restrictions on when locks can be held. See
[spdk_spinlock](structspdk__spinlock.html) for details.

## The event Framework

The SPDK project didn't want to officially pick an asynchronous, event-based
framework for all of the example applications it shipped with, in the interest
of supporting the widest variety of frameworks possible. But the applications
do of course require something that implements an asynchronous event loop in
order to run, so enter the `event` framework located in `lib/event`. This
framework includes things like polling and scheduling the lightweight threads,
installing signal handlers to cleanly shut down, and basic command line option
parsing. Only established applications should consider directly integrating
the lower level libraries.

## Limitations of the C Language

Message passing is efficient, but it results in asynchronous code.
Unfortunately, asynchronous code is a challenge in C. It's often implemented by
passing function pointers that are called when an operation completes. This
chops up the code so that it isn't easy to follow, especially through logic
branches. The best solution is to use a language with support for
[futures and promises](https://en.wikipedia.org/wiki/Futures_and_promises),
such as C++, Rust, Go, or almost any other higher level language. However,
SPDK is a low level library and requires very wide compatibility and
portability, so we've elected to stay with plain old C.

We do have a few recommendations to share, though. For _simple_ callback
chains, it's easiest if you write the functions from bottom to top. By that we
mean if function `foo` performs some asynchronous operation and when that
completes function `bar` is called, then function `bar` performs some operation
that calls function `baz` on completion, a good way to write it is as follows:

```c
void baz(void *ctx) {
	...
}

void bar(void *ctx) {
	async_op(baz, ctx);
}

void foo(void *ctx) {
	async_op(bar, ctx);
}
```

Don't split these functions up - keep them as a nice unit that can be read
from bottom to top.

For more complex callback chains, especially ones that have logical branches
or loops, it's best to write out a state machine. It turns out that higher
level languages that support futures and promises are just generating state
machines at compile time, so even though we don't have the ability to generate
them in C we can still write them out by hand. As an example, here's a
callback chain that performs `foo` five times and then calls `bar` -
effectively an asynchronous for loop.
```c
#include <stdlib.h>

/* An arbitrary asynchronous operation, as described above. */
void do_async_op(void (*cb_fn)(void *cb_ctx), void *cb_ctx);

enum states {
	FOO_START = 0,
	FOO_END,
	BAR_START,
	BAR_END
};

struct state_machine {
	enum states state;

	int count;
};

static void run_state_machine(struct state_machine *sm);

static void
foo_complete(void *ctx)
{
	struct state_machine *sm = ctx;

	sm->state = FOO_END;
	run_state_machine(sm);
}

static void
foo(struct state_machine *sm)
{
	do_async_op(foo_complete, sm);
}

static void
bar_complete(void *ctx)
{
	struct state_machine *sm = ctx;

	sm->state = BAR_END;
	run_state_machine(sm);
}

static void
bar(struct state_machine *sm)
{
	do_async_op(bar_complete, sm);
}

static void
run_state_machine(struct state_machine *sm)
{
	enum states prev_state;

	/* Step the machine until an async operation is in flight (the
	 * state stops changing) or the final state is reached. */
	do {
		prev_state = sm->state;

		switch (sm->state) {
		case FOO_START:
			foo(sm);
			break;
		case FOO_END:
			/* This is the loop condition */
			if (sm->count++ < 5) {
				sm->state = FOO_START;
			} else {
				sm->state = BAR_START;
			}
			break;
		case BAR_START:
			bar(sm);
			break;
		case BAR_END:
			/* Done; release the state allocated in do_async_for(). */
			free(sm);
			return;
		}
	} while (prev_state != sm->state);
}

void
do_async_for(void)
{
	struct state_machine *sm;

	sm = malloc(sizeof(*sm));
	if (sm == NULL) {
		/* Handle allocation failure. */
		return;
	}

	sm->state = FOO_START;
	sm->count = 0;

	run_state_machine(sm);
}
```

This is complex, of course, but the `run_state_machine` function can be read
from top to bottom to get a clear overview of what's happening in the code
without having to chase through each of the callbacks.
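In practice, the caller of `do_async_for()` usually also needs to know when
the loop has finished. One common extension, sketched below with hypothetical
`cpl_fn`/`cpl_ctx` fields (not a fixed SPDK pattern), is to store a completion
callback in the state machine and invoke it when the final state is reached.
This snippet reuses `enum states` from the listing above:

```c
/* Hypothetical extension: the caller supplies a completion callback. */
struct state_machine {
	enum states state;
	int count;
	void (*cpl_fn)(void *cpl_ctx);	/* called once when the loop finishes */
	void *cpl_ctx;
};

void
do_async_for(void (*cpl_fn)(void *cpl_ctx), void *cpl_ctx)
{
	struct state_machine *sm;

	sm = malloc(sizeof(*sm));
	if (sm == NULL) {
		/* Handle allocation failure. */
		return;
	}

	sm->state = FOO_START;
	sm->count = 0;
	sm->cpl_fn = cpl_fn;
	sm->cpl_ctx = cpl_ctx;

	run_state_machine(sm);
}
```

The `BAR_END` case in `run_state_machine()` would then call
`sm->cpl_fn(sm->cpl_ctx)` just before freeing `sm`, completing the chain the
same way the simpler callback examples above do.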