# Message Passing and Concurrency {#concurrency}

## Theory

One of the primary aims of SPDK is to scale linearly with the addition of
hardware. This can mean many things in practice. For instance, moving from one
SSD to two should double the number of I/Os per second. Or doubling the number
of CPU cores should double the amount of computation possible. Or even doubling
the number of NICs should double the network throughput. To achieve this, the
software's threads of execution must be independent from one another as much as
possible. In practice, that means avoiding software locks and even atomic
instructions.

Traditionally, software achieves concurrency by placing some shared data onto
the heap, protecting it with a lock, and then having all threads of execution
acquire the lock only when accessing the data. This model has many great
properties:

* It's easy to convert single-threaded programs to multi-threaded programs
  because you don't have to change the data model from the single-threaded
  version. You add a lock around the data.
* You can write your program as a synchronous, imperative list of statements
  that you read from top to bottom.
* The scheduler can interrupt threads, allowing for efficient time-sharing
  of CPU resources.

Unfortunately, as the number of threads scales up, contention on the lock around
the shared data does too. More granular locking helps, but then also increases
the complexity of the program. Even then, beyond a certain number of contended
locks, threads will spend most of their time attempting to acquire the locks and
the program will not benefit from more CPU cores.

SPDK takes a different approach altogether. Instead of placing shared data in a
global location that all threads access after acquiring a lock, SPDK will often
assign that data to a single thread. When other threads want to access the data,
they pass a message to the owning thread to perform the operation on their
behalf. This strategy, of course, is not at all new. For instance, it is one of
the core design principles of
[Erlang](http://erlang.org/download/armstrong_thesis_2003.pdf) and is the main
concurrency mechanism in [Go](https://tour.golang.org/concurrency/2). A message
in SPDK consists of a function pointer and a pointer to some context. Messages
are passed between threads using a
[lockless ring](http://dpdk.org/doc/guides/prog_guide/ring_lib.html). Message
passing is often much faster than most software developers' intuition leads them
to believe, due to caching effects. If a single core is accessing the same data
(on behalf of all of the other cores), then that data is far more likely to be
in a cache closer to that core. It's often most efficient to have each core work
on a small set of data sitting in its local cache and then hand off a small
message to the next core when done.

In more extreme cases where even message passing may be too costly, each thread
may make a local copy of the data. The thread will then only reference its local
copy. To mutate the data, threads will send a message to each other thread
telling them to perform the update on their local copy. This is great when the
data isn't mutated very often, but is read very frequently, and is often
employed in the I/O path. This of course trades memory size for computational
efficiency, so it is used in only the most critical code paths.
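
To make the message concept concrete, here is a minimal sketch of the pattern
in C. This is not SPDK code: `struct ring`, `ring_enqueue`, `ring_dequeue`,
`send_msg`, and `poll_msgs` are all hypothetical names standing in for a
lockless ring such as DPDK's `rte_ring` and the code built around it.

    #include <stdlib.h>

    struct ring;                                  /* hypothetical lockless ring */
    int ring_enqueue(struct ring *r, void *obj);  /* returns 0 on success */
    int ring_dequeue(struct ring *r, void **obj); /* returns 0 on success */

    /* A message is just a function pointer plus a context pointer. */
    struct msg {
        void (*fn)(void *ctx);
        void *ctx;
    };

    /* Sender: ask the owning thread to operate on its data. */
    static int
    send_msg(struct ring *r, void (*fn)(void *), void *ctx)
    {
        struct msg *m = malloc(sizeof(*m));

        if (m == NULL) {
            return -1;
        }
        m->fn = fn;
        m->ctx = ctx;
        return ring_enqueue(r, m);
    }

    /* Owner: drain the ring inside an event loop, executing each message
     * against data that only this thread ever touches. */
    static void
    poll_msgs(struct ring *r)
    {
        struct msg *m;

        while (ring_dequeue(r, (void **)&m) == 0) {
            m->fn(m->ctx);
            free(m);
        }
    }

Because only the owning thread ever dereferences the shared data, no lock is
taken on the data itself; the only synchronization point is the ring.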
## Message Passing Infrastructure

SPDK provides several layers of message passing infrastructure. The most
fundamental libraries in SPDK, for instance, don't do any message passing on
their own and instead enumerate rules about when functions may be called in
their documentation (e.g. @ref nvme). Most libraries, however, depend on SPDK's
[thread](http://www.spdk.io/doc/thread_8h.html)
abstraction, located in `libspdk_thread.a`. The thread abstraction provides a
basic message passing framework and defines a few key primitives.

First, `spdk_thread` is an abstraction for a lightweight, stackless thread of
execution. A lower level framework can execute an `spdk_thread` for a single
timeslice by calling `spdk_thread_poll()`. A lower level framework is allowed to
move an `spdk_thread` between system threads at any time, as long as there is
only a single system thread executing `spdk_thread_poll()` on that
`spdk_thread` at any given time. New lightweight threads may be created at any
time by calling `spdk_thread_create()` and destroyed by calling
`spdk_thread_destroy()`. The lightweight thread is the foundational abstraction
for threading in SPDK.

There are then a few additional abstractions layered on top of the
`spdk_thread`. One is the `spdk_poller`, which is an abstraction for a
function that should be repeatedly called on the given thread. Another is an
`spdk_msg_fn`, which is a function pointer and a context pointer, that can
be sent to a thread for execution via `spdk_thread_send_msg()`.

The library also defines two additional abstractions: `spdk_io_device` and
`spdk_io_channel`. In the course of implementing SPDK we noticed the same
pattern emerging in a number of different libraries. In order to implement a
message passing strategy, the code would describe some object with global state
and also some per-thread context associated with that object that was accessed
in the I/O path to avoid locking on the global state. The pattern was clearest
in the lowest layers where I/O was being submitted to block devices. These
devices often expose multiple queues that can be assigned to threads and then
accessed without a lock to submit I/O. To abstract that, we generalized the
device to `spdk_io_device` and the thread-specific queue to `spdk_io_channel`.
Over time, however, the pattern has appeared in a huge number of places that
don't fit quite so nicely with the names we originally chose. In today's code
`spdk_io_device` is any pointer, whose uniqueness is predicated only on its
memory address, and `spdk_io_channel` is the per-thread context associated with
a particular `spdk_io_device`.

The threading abstraction provides functions to send a message to any other
thread, to send a message to all threads one by one, and to send a message to
all threads for which there is an io_channel for a given io_device.

Most critically, the thread abstraction does not actually spawn any system level
threads of its own. Instead, it relies on the existence of some lower level
framework that spawns system threads and sets up event loops. Inside those event
loops, the threading abstraction simply requires the lower level framework to
repeatedly call `spdk_thread_poll()` on each `spdk_thread` that exists. This
makes SPDK very portable to a wide variety of asynchronous, event-based
frameworks such as [Seastar](https://www.seastar.io) or [libuv](https://libuv.org/).
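
As an illustration of how these primitives fit together, here is a compressed
sketch of a "lower level framework" that creates one lightweight thread, queues
a message to it, and gives it a single timeslice. This is a sketch only, with
error handling omitted; consult `include/spdk/thread.h` for the exact function
signatures in your SPDK version, and note that a real framework would call
`spdk_thread_poll()` repeatedly inside its event loop.

    #include <stdio.h>

    #include "spdk/thread.h"

    static void
    say_hello(void *ctx)
    {
        /* Runs in the context of the lightweight thread. */
        printf("hello from %s\n", spdk_thread_get_name(spdk_get_thread()));
    }

    void
    poll_once_example(void)
    {
        struct spdk_thread *thread;

        /* Initialize the thread library with no per-thread context. */
        spdk_thread_lib_init(NULL, 0);
        thread = spdk_thread_create("worker", NULL);

        /* Queue a message - a function pointer plus a context pointer. */
        spdk_thread_send_msg(thread, say_hello, NULL);

        /* Give the lightweight thread one timeslice, which executes the
         * queued message and any registered pollers. */
        spdk_thread_poll(thread, 0, 0);
    }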
## The event Framework

The SPDK project didn't want to officially pick an asynchronous, event-based
framework for all of the example applications it shipped with, in the interest
of supporting the widest variety of frameworks possible. But the applications do
of course require something that implements an asynchronous event loop in order
to run, so enter the `event` framework located in `lib/event`. This framework
includes things like spawning one thread per core, pinning each thread to a
unique core, polling and scheduling the lightweight threads, installing signal
handlers to shut down cleanly, and basic command line option parsing. When
started through `spdk_app_start()`, the library automatically spawns all of the
threads requested, pins them, and is ready for lightweight threads to be
created. This makes it much easier to implement a brand new SPDK application and
is the recommended method for those starting out. Only established applications
should consider directly integrating the lower level libraries.

## Limitations of the C Language

Message passing is efficient, but it results in asynchronous code.
Unfortunately, asynchronous code is a challenge in C. It's often implemented by
passing function pointers that are called when an operation completes. This
chops up the code so that it isn't easy to follow, especially through logic
branches. The best solution is to use a language with support for
[futures and promises](https://en.wikipedia.org/wiki/Futures_and_promises),
such as C++, Rust, Go, or almost any other higher level language. However, SPDK
is a low level library and requires very wide compatibility and portability, so
we've elected to stay with plain old C.

We do have a few recommendations to share, though. For _simple_ callback chains,
it's easiest if you write the functions from bottom to top. By that we mean if
function `foo` performs some asynchronous operation and when that completes
function `bar` is called, then function `bar` performs some operation that
calls function `baz` on completion, a good way to write it is as such:

    /* Final step in the chain. */
    void baz(void *ctx) {
        ...
    }

    /* Second step: starts the operation that completes with baz. */
    void bar(void *ctx) {
        async_op(baz, ctx);
    }

    /* First step: starts the operation that completes with bar. */
    void foo(void *ctx) {
        async_op(bar, ctx);
    }

Don't split these functions up - keep them as a nice unit that can be read from
bottom to top.

For more complex callback chains, especially ones that have logical branches
or loops, it's best to write out a state machine. It turns out that higher
level languages that support futures and promises are just generating state
machines at compile time, so even though we don't have the ability to generate
them in C we can still write them out by hand. As an example, here's a
callback chain that performs `foo` 5 times and then calls `bar` - effectively
an asynchronous for loop.
    #include <stdlib.h>

    /* Starts an asynchronous operation and calls cb(ctx) when it completes.
     * Provided elsewhere. */
    void do_async_op(void (*cb)(void *ctx), void *ctx);

    enum states {
        FOO_START = 0,
        FOO_END,
        BAR_START,
        BAR_END
    };

    struct state_machine {
        enum states state;

        int count;
    };

    static void run_state_machine(struct state_machine *sm);

    static void
    foo_complete(void *ctx)
    {
        struct state_machine *sm = ctx;

        sm->state = FOO_END;
        run_state_machine(sm);
    }

    static void
    foo(struct state_machine *sm)
    {
        do_async_op(foo_complete, sm);
    }

    static void
    bar_complete(void *ctx)
    {
        struct state_machine *sm = ctx;

        sm->state = BAR_END;
        run_state_machine(sm);
    }

    static void
    bar(struct state_machine *sm)
    {
        do_async_op(bar_complete, sm);
    }

    static void
    run_state_machine(struct state_machine *sm)
    {
        enum states prev_state;

        do {
            prev_state = sm->state;

            switch (sm->state) {
            case FOO_START:
                foo(sm);
                break;
            case FOO_END:
                /* This is the loop condition: foo has now completed
                 * sm->count times. */
                if (++sm->count < 5) {
                    sm->state = FOO_START;
                } else {
                    sm->state = BAR_START;
                }
                break;
            case BAR_START:
                bar(sm);
                break;
            case BAR_END:
                /* All done; release the context so it isn't leaked. */
                free(sm);
                return;
            }
        } while (prev_state != sm->state);
    }

    void
    do_async_for(void)
    {
        struct state_machine *sm;

        sm = malloc(sizeof(*sm));
        if (sm == NULL) {
            return;
        }
        sm->state = FOO_START;
        sm->count = 0;

        run_state_machine(sm);
    }

This is complex, of course, but the `run_state_machine` function can be read
from top to bottom to get a clear overview of what's happening in the code
without having to chase through each of the callbacks.
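
The example deliberately leaves `do_async_op` abstract; any function that
begins an operation and later invokes the callback will do. As one hypothetical
backing (an illustration, not part of the original example), it could simply
defer the callback through the thread abstraction described earlier:

    #include "spdk/thread.h"

    /* Hypothetical: "complete" the operation by deferring the callback to
     * the current lightweight thread's message queue rather than invoking
     * it inline. */
    void
    do_async_op(void (*cb)(void *ctx), void *ctx)
    {
        spdk_thread_send_msg(spdk_get_thread(), cb, ctx);
    }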