xref: /spdk/doc/concurrency.md (revision 838e61c3772fdefb17e1a0b8f9880e2bcb9c4c0d)
# Message Passing and Concurrency {#concurrency}

## Theory

One of the primary aims of SPDK is to scale linearly with the addition of
hardware. This can mean many things in practice. For instance, moving from one
SSD to two should double the number of I/Os per second. Doubling the number
of CPU cores should double the amount of computation possible, and doubling
the number of NICs should double the network throughput. To achieve this, the
software's threads of execution must be as independent from one another as
possible. In practice, that means avoiding software locks and even atomic
instructions.

Traditionally, software achieves concurrency by placing some shared data onto
the heap, protecting it with a lock, and then having all threads of execution
acquire the lock only when accessing the data. This model has many great
properties:

* It's easy to convert single-threaded programs to multi-threaded programs
  because you don't have to change the data model from the single-threaded
  version. You just add a lock around the data.
* You can write your program as a synchronous, imperative list of statements
  that you read from top to bottom.
* The scheduler can interrupt threads, allowing for efficient time-sharing
  of CPU resources.

Unfortunately, as the number of threads scales up, so does contention on the
lock around the shared data. More granular locking helps, but it also increases
the complexity of the program. Even then, beyond a certain number of contended
locks, threads will spend most of their time attempting to acquire them and
the program will not benefit from additional CPU cores.

SPDK takes a different approach altogether. Instead of placing shared data in a
global location that all threads access after acquiring a lock, SPDK will often
assign that data to a single thread. When other threads want to access the data,
they pass a message to the owning thread to perform the operation on their
behalf. This strategy, of course, is not at all new. For instance, it is one of
the core design principles of
[Erlang](http://erlang.org/download/armstrong_thesis_2003.pdf) and is the main
concurrency mechanism in [Go](https://tour.golang.org/concurrency/2). A message
in SPDK consists of a function pointer and a pointer to some context. Messages
are passed between threads using a
[lockless ring](http://dpdk.org/doc/guides/prog_guide/ring_lib.html). Message
passing is often much faster than most software developers' intuition leads them
to believe, primarily due to caching effects. If a single core is accessing the
same data (on behalf of all of the other cores), then that data is far more
likely to be in a cache close to that core. It's often most efficient to have
each core work on a small set of data sitting in its local cache and then hand
off a small message to the next core when done.
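
Concretely, a message is just those two pointers. The sketch below models the
mechanism in plain C with hypothetical names (a plain array stands in for the
lockless ring, and the real ring is multi-producer so any thread can send):

```c
#include <stddef.h>

/* Hypothetical model of an SPDK-style message: a function pointer plus a
 * context pointer, queued onto a ring that only the owning thread drains. */
typedef void (*msg_fn)(void *ctx);

struct msg {
    msg_fn fn;
    void *ctx;
};

#define RING_SIZE 16

struct msg_ring {
    struct msg slots[RING_SIZE];
    size_t head;    /* next slot to consume */
    size_t tail;    /* next slot to produce */
};

/* Sender side: package the operation and hand it to the owning thread. */
static int
ring_send(struct msg_ring *ring, msg_fn fn, void *ctx)
{
    if (ring->tail - ring->head == RING_SIZE) {
        return -1;  /* ring full */
    }
    ring->slots[ring->tail % RING_SIZE] = (struct msg){ fn, ctx };
    ring->tail++;
    return 0;
}

/* Owner side: drain the ring and run each message against the data this
 * thread owns. No locks are required because only the owner mutates it. */
static size_t
ring_poll(struct msg_ring *ring)
{
    size_t n = 0;

    while (ring->head != ring->tail) {
        struct msg *m = &ring->slots[ring->head % RING_SIZE];

        m->fn(m->ctx);
        ring->head++;
        n++;
    }
    return n;
}

/* Example handler: mutate a counter owned by the polling thread. */
static void
add_one(void *ctx)
{
    *(int *)ctx += 1;
}
```

Sending two `add_one` messages and then polling runs both on the owner's side;
the counter is updated without any lock or atomic instruction.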

In more extreme cases where even message passing may be too costly, each thread
may make a local copy of the data. The thread will then only reference its local
copy. To mutate the data, threads send a message to every other thread telling
them to perform the update on their local copy. This works well when the
data isn't mutated very often but is read very frequently, and is often
employed in the I/O path. It trades memory size for computational
efficiency, so it is used only in the most critical code paths.
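
A minimal sketch of this local-copy pattern, with simulated threads standing in
for real ones (all names here are illustrative, not SPDK API):

```c
/* Each simulated thread keeps a local copy of a rarely-mutated,
 * frequently-read value, so the hot path reads it with no synchronization.
 * An update is "broadcast" by invoking the handler once per thread,
 * standing in for sending each thread a message to patch its own copy. */
#define NUM_THREADS 4

struct local_cfg {
    int queue_depth;    /* read on every I/O, so duplicated per thread */
};

static struct local_cfg g_cfg[NUM_THREADS];

/* Handler each thread would run against its own copy. */
static void
update_queue_depth(void *ctx, int thread_id)
{
    g_cfg[thread_id].queue_depth = *(int *)ctx;
}

/* Broadcast: in SPDK this would be one message sent to each thread. */
static void
broadcast_queue_depth(int new_depth)
{
    for (int i = 0; i < NUM_THREADS; i++) {
        update_queue_depth(&new_depth, i);
    }
}
```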

## Message Passing Infrastructure

SPDK provides several layers of message passing infrastructure. The most
fundamental libraries in SPDK, for instance, don't do any message passing on
their own and instead enumerate rules about when functions may be called in
their documentation (e.g. @ref nvme). Most libraries, however, depend on SPDK's
[thread](http://www.spdk.io/doc/thread_8h.html)
abstraction, located in `libspdk_thread.a`. The thread abstraction provides a
basic message passing framework and defines a few key primitives.

First, `spdk_thread` is an abstraction for a lightweight, stackless thread of
execution. A lower level framework can execute an `spdk_thread` for a single
timeslice by calling `spdk_thread_poll()`. The framework is allowed to
move an `spdk_thread` between system threads at any time, as long as only
a single system thread is executing `spdk_thread_poll()` on that
`spdk_thread` at any given time. New lightweight threads may be created at any
time by calling `spdk_thread_create()` and destroyed by calling
`spdk_thread_destroy()`. The lightweight thread is the foundational abstraction
for threading in SPDK.

There are then a few additional abstractions layered on top of the
`spdk_thread`. One is the `spdk_poller`, which is an abstraction for a
function that should be repeatedly called on the given thread. Another is an
`spdk_msg_fn`, a function pointer paired with a context pointer, which can
be sent to a thread for execution via `spdk_thread_send_msg()`.
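
The poller primitive can be modeled in a few lines of plain C. The sketch below
is a hypothetical mock, not the real API: a poller is simply a function the
owning thread invokes on every timeslice, returning nonzero when it found work:

```c
#include <stddef.h>

/* Illustrative stand-in for a registered poller. */
typedef int (*poller_fn)(void *arg);

struct poller {
    poller_fn fn;
    void *arg;
};

/* One timeslice: run every registered poller once, roughly what
 * spdk_thread_poll() does on top of draining the message ring. */
static int
run_pollers_once(struct poller *pollers, size_t count)
{
    int busy = 0;

    for (size_t i = 0; i < count; i++) {
        if (pollers[i].fn(pollers[i].arg) > 0) {
            busy = 1;
        }
    }
    return busy;
}

/* Example poller: pretend there are queued completions left to reap. */
static int
drain_queue(void *arg)
{
    int *remaining = arg;

    if (*remaining > 0) {
        (*remaining)--;
        return 1;   /* did work */
    }
    return 0;       /* idle */
}
```

The busy/idle return value is what lets a hosting framework decide whether a
thread still has work or can be put to sleep.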

The library also defines two additional abstractions: `spdk_io_device` and
`spdk_io_channel`. In the course of implementing SPDK we noticed the same
pattern emerging in a number of different libraries. In order to implement a
message passing strategy, the code would describe some object with global state
and also some per-thread context associated with that object that was accessed
in the I/O path to avoid locking on the global state. The pattern was clearest
in the lowest layers where I/O was being submitted to block devices. These
devices often expose multiple queues that can be assigned to threads and then
accessed without a lock to submit I/O. To abstract that, we generalized the
device to `spdk_io_device` and the thread-specific queue to `spdk_io_channel`.
Over time, however, the pattern has appeared in a huge number of places that
don't fit quite so nicely with the names we originally chose. In today's code,
`spdk_io_device` is any pointer whose uniqueness is predicated only on its
memory address, and `spdk_io_channel` is the per-thread context associated with
a particular `spdk_io_device`.
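
As a toy model of this pattern (illustrative names only, not the real
`spdk_io_device_register()`/`spdk_get_io_channel()` calls), the device is
identified purely by pointer, and each (device, thread) pair gets its own
lazily created channel:

```c
#include <stddef.h>

/* Toy model: the io_device is any pointer, unique by address alone; the
 * io_channel is the per-thread context found - or lazily created - for a
 * given (device, thread) pair. */
#define MAX_CHANNELS 8

struct io_channel {
    void *device;   /* which io_device this channel belongs to */
    int thread_id;  /* which thread owns it */
    int queue;      /* stand-in for a dedicated hardware queue */
};

static struct io_channel g_channels[MAX_CHANNELS];
static size_t g_num_channels;
static int g_next_queue;

/* Find or create the calling thread's channel for a device. */
static struct io_channel *
get_io_channel(void *device, int thread_id)
{
    struct io_channel *ch;

    for (size_t i = 0; i < g_num_channels; i++) {
        if (g_channels[i].device == device &&
            g_channels[i].thread_id == thread_id) {
            return &g_channels[i];
        }
    }
    if (g_num_channels == MAX_CHANNELS) {
        return NULL;    /* table full in this toy model */
    }
    ch = &g_channels[g_num_channels++];
    ch->device = device;
    ch->thread_id = thread_id;
    ch->queue = g_next_queue++; /* e.g. claim one of the device's queues */
    return ch;
}
```

Because each thread gets its own channel (and therefore its own queue), the
I/O path never touches the device's global state.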

The threading abstraction provides functions to send a message to any other
thread, to send a message to all threads one by one, and to send a message to
all threads for which there is an io_channel for a given io_device.

Most critically, the thread abstraction does not actually spawn any system level
threads of its own. Instead, it relies on the existence of some lower level
framework that spawns system threads and sets up event loops. Inside those event
loops, the threading abstraction simply requires the lower level framework to
repeatedly call `spdk_thread_poll()` on each `spdk_thread` that exists. This
makes SPDK very portable to a wide variety of asynchronous, event-based
frameworks such as [Seastar](https://www.seastar.io) or [libuv](https://libuv.org/).
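
The hosting framework's obligation can be sketched in plain C. Here
`thread_poll()` is a stand-in for `spdk_thread_poll()` and the loop plays the
role of the framework's event loop; the names are illustrative, not SPDK API:

```c
/* Stand-in lightweight thread: just a count of queued messages. */
struct lw_thread {
    int pending;    /* queued messages still to run */
};

/* One timeslice: run at most one queued message; nonzero means "busy". */
static int
thread_poll(struct lw_thread *t)
{
    if (t->pending > 0) {
        t->pending--;
        return 1;
    }
    return 0;
}

/* The loop belongs to the framework, not to the thread abstraction. This
 * toy version exits once every thread reports idle and returns the number
 * of loop iterations; a real framework would run until shutdown. */
static int
event_loop(struct lw_thread *threads, int count)
{
    int iterations = 0;
    int busy;

    do {
        busy = 0;
        for (int i = 0; i < count; i++) {
            busy |= thread_poll(&threads[i]);
        }
        iterations++;
    } while (busy);
    return iterations;
}
```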

## SPDK Spinlocks

There are some cases where locks are used. Their use should be limited in favor
of the message passing interface described above. When locks are needed,
SPDK spinlocks should be used instead of POSIX locks.

POSIX locks like `pthread_mutex_t` and `pthread_spinlock_t` do not properly
handle locking between SPDK's lightweight threads. SPDK's `spdk_spinlock`
is safe to use in SPDK libraries and applications. This safety comes from
imposing restrictions on when locks can be held. See
[spdk_spinlock](structspdk__spinlock.html) for details.
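
To illustrate the shape of the API, here is a stand-in spinlock built on a C11
`atomic_flag`. The real `spdk_spin_init()`/`spdk_spin_lock()`/`spdk_spin_unlock()`
behave like this busy-wait lock but additionally assert the restrictions on
when the lock may be held; the `try_lock` helper below is purely illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative busy-wait lock; not the spdk_spinlock implementation. */
struct spinlock {
    atomic_flag locked;
};

static void
spin_init(struct spinlock *l)
{
    atomic_flag_clear(&l->locked);
}

static void
spin_lock(struct spinlock *l)
{
    /* Busy-wait until the flag transitions from clear to set for us. */
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire)) {
    }
}

/* Non-blocking acquire attempt; true on success. */
static bool
spin_try_lock(struct spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

static void
spin_unlock(struct spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Spinning instead of sleeping is the right trade-off here because SPDK threads
must never block inside the kernel while holding work.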

## The event Framework

The SPDK project didn't want to officially pick an asynchronous, event-based
framework for all of the example applications it shipped with, in the interest
of supporting the widest variety of frameworks possible. But the applications do
of course require something that implements an asynchronous event loop in order
to run, so enter the `event` framework located in `lib/event`. This framework
includes things like polling and scheduling the lightweight threads, installing
signal handlers to shut down cleanly, and basic command line option parsing.
Only established applications should consider directly integrating the lower
level libraries.

## Limitations of the C Language

Message passing is efficient, but it results in asynchronous code.
Unfortunately, asynchronous code is a challenge in C. It's often implemented by
passing function pointers that are called when an operation completes. This
chops up the code so that it isn't easy to follow, especially through logic
branches. The best solution is to use a language with support for
[futures and promises](https://en.wikipedia.org/wiki/Futures_and_promises),
such as C++, Rust, Go, or almost any other higher level language. However, SPDK
is a low level library and requires very wide compatibility and portability, so
we've elected to stay with plain old C.

We do have a few recommendations to share, though. For _simple_ callback chains,
it's easiest if you write the functions from bottom to top. By that we mean if
function `foo` performs some asynchronous operation and when that completes
function `bar` is called, then function `bar` performs some operation that
calls function `baz` on completion, a good way to write it is as follows:
```c
void baz(void *ctx) {
        ...
}

void bar(void *ctx) {
        async_op(baz, ctx);
}

void foo(void *ctx) {
        async_op(bar, ctx);
}
```

Don't split these functions up - keep them as a nice unit that can be read from bottom to top.

For more complex callback chains, especially ones that have logical branches
or loops, it's best to write out a state machine. It turns out that higher
level languages that support futures and promises are just generating state
machines at compile time, so even though we don't have the ability to generate
them in C, we can still write them out by hand. As an example, here's a
callback chain that performs `foo` 5 times and then calls `bar` - effectively
an asynchronous for loop.
```c
enum states {
    FOO_START = 0,
    FOO_END,
    BAR_START,
    BAR_END
};

struct state_machine {
    enum states state;

    int count;
};

static void run_state_machine(struct state_machine *sm);

static void
foo_complete(void *ctx)
{
    struct state_machine *sm = ctx;

    sm->state = FOO_END;
    run_state_machine(sm);
}

static void
foo(struct state_machine *sm)
{
    do_async_op(foo_complete, sm);
}

static void
bar_complete(void *ctx)
{
    struct state_machine *sm = ctx;

    sm->state = BAR_END;
    run_state_machine(sm);
}

static void
bar(struct state_machine *sm)
{
    do_async_op(bar_complete, sm);
}

static void
run_state_machine(struct state_machine *sm)
{
    enum states prev_state;

    do {
        prev_state = sm->state;

        switch (sm->state) {
        case FOO_START:
            foo(sm);
            break;
        case FOO_END:
            /* Loop condition: run foo 5 times in total */
            if (++sm->count < 5) {
                sm->state = FOO_START;
            } else {
                sm->state = BAR_START;
            }
            break;
        case BAR_START:
            bar(sm);
            break;
        case BAR_END:
            /* Done; a real implementation would free sm here. */
            break;
        }
    } while (prev_state != sm->state);
}

void do_async_for(void)
{
    struct state_machine *sm;

    sm = malloc(sizeof(*sm));
    if (sm == NULL) {
        return;
    }
    sm->state = FOO_START;
    sm->count = 0;

    run_state_machine(sm);
}
```

This is complex, of course, but the `run_state_machine` function can be read
from top to bottom to get a clear overview of what's happening in the code
without having to chase through each of the callbacks.
267