# Message Passing and Concurrency {#concurrency}

# Theory

One of the primary aims of SPDK is to scale linearly with the addition of
hardware. This can mean many things in practice. For instance, moving from one
SSD to two should double the number of I/Os per second. Or doubling the number
of CPU cores should double the amount of computation possible. Or even doubling
the number of NICs should double the network throughput. To achieve this, the
software's threads of execution must be as independent from one another as
possible. In practice, that means avoiding software locks and even atomic
instructions.

Traditionally, software achieves concurrency by placing some shared data onto
the heap, protecting it with a lock, and then having all threads of execution
acquire the lock only when accessing the data. This model has many great
properties:

* It's easy to convert single-threaded programs to multi-threaded programs
  because you don't have to change the data model from the single-threaded
  version. You add a lock around the data.
* You can write your program as a synchronous, imperative list of statements
  that you read from top to bottom.
* The scheduler can interrupt threads, allowing for efficient time-sharing
  of CPU resources.

Unfortunately, as the number of threads scales up, contention on the lock around
the shared data does too. More granular locking helps, but then also increases
the complexity of the program. Even then, beyond a certain number of contended
locks, threads will spend most of their time attempting to acquire the locks and
the program will not benefit from more CPU cores.

SPDK takes a different approach altogether. Instead of placing shared data in a
global location that all threads access after acquiring a lock, SPDK will often
assign that data to a single thread. When other threads want to access the data,
they pass a message to the owning thread to perform the operation on their
behalf. This strategy, of course, is not at all new. For instance, it is one of
the core design principles of
[Erlang](http://erlang.org/download/armstrong_thesis_2003.pdf) and is the main
concurrency mechanism in [Go](https://tour.golang.org/concurrency/2). A message
in SPDK consists of a function pointer and a pointer to some context. Messages
are passed between threads using a
[lockless ring](http://dpdk.org/doc/guides/prog_guide/ring_lib.html). Message
passing is often much faster than most software developers' intuition leads them
to believe, due to caching effects. If a single core is accessing the same data
(on behalf of all of the other cores), then that data is far more likely to be
in a cache closer to that core. It's often most efficient to have each core work
on a small set of data sitting in its local cache and then hand off a small
message to the next core when done.

In more extreme cases where even message passing may be too costly, each thread
may make a local copy of the data. The thread will then only reference its local
copy. To mutate the data, threads will send a message to each other thread
telling them to perform the update on their local copy. This is great when the
data isn't mutated very often, but is read very frequently, and is often
employed in the I/O path. This of course trades memory size for computational
efficiency, so it is used in only the most critical code paths.

# Message Passing Infrastructure

SPDK provides several layers of message passing infrastructure. The most
fundamental libraries in SPDK, for instance, don't do any message passing on
their own and instead enumerate rules about when functions may be called in
their documentation (e.g. @ref nvme). Most libraries, however, depend on SPDK's
[thread](http://www.spdk.io/doc/thread_8h.html)
abstraction, located in `libspdk_thread.a`. The thread abstraction provides a
basic message passing framework and defines a few key primitives.

First, `spdk_thread` is an abstraction for a lightweight, stackless thread of
execution. A lower level framework can execute an `spdk_thread` for a single
timeslice by calling `spdk_thread_poll()`. A lower level framework is allowed to
move an `spdk_thread` between system threads at any time, as long as there is
only a single system thread executing `spdk_thread_poll()` on that
`spdk_thread` at any given time. New lightweight threads may be created at any
time by calling `spdk_thread_create()` and destroyed by calling
`spdk_thread_destroy()`. The lightweight thread is the foundational abstraction
for threading in SPDK.

There are then a few additional abstractions layered on top of the
`spdk_thread`. One is the `spdk_poller`, which is an abstraction for a
function that should be repeatedly called on the given thread. Another is an
`spdk_msg_fn`, which is a function pointer and a context pointer, that can
be sent to a thread for execution via `spdk_thread_send_msg()`.

The library also defines two additional abstractions: `spdk_io_device` and
`spdk_io_channel`. In the course of implementing SPDK we noticed the same
pattern emerging in a number of different libraries. In order to implement a
message passing strategy, the code would describe some object with global state
and also some per-thread context associated with that object that was accessed
in the I/O path to avoid locking on the global state. The pattern was clearest
in the lowest layers where I/O was being submitted to block devices. These
devices often expose multiple queues that can be assigned to threads and then
accessed without a lock to submit I/O. To abstract that, we generalized the
device to `spdk_io_device` and the thread-specific queue to `spdk_io_channel`.
Over time, however, the pattern has appeared in a huge number of places that
don't fit quite so nicely with the names we originally chose. In today's code
`spdk_io_device` is any pointer, whose uniqueness is predicated only on its
memory address, and `spdk_io_channel` is the per-thread context associated with
a particular `spdk_io_device`.

The threading abstraction provides functions to send a message to any other
thread, to send a message to all threads one by one, and to send a message to
all threads for which there is an io_channel for a given io_device.

Most critically, the thread abstraction does not actually spawn any system level
threads of its own. Instead, it relies on the existence of some lower level
framework that spawns system threads and sets up event loops. Inside those event
loops, the threading abstraction simply requires the lower level framework to
repeatedly call `spdk_thread_poll()` on each `spdk_thread` that exists. This
makes SPDK very portable to a wide variety of asynchronous, event-based
frameworks such as [Seastar](https://www.seastar.io) or [libuv](https://libuv.org/).

# The event Framework

The SPDK project didn't want to officially pick an asynchronous, event-based
framework for all of the example applications it shipped with, in the interest
of supporting the widest variety of frameworks possible. But the applications do
of course require something that implements an asynchronous event loop in order
to run, so enter the `event` framework located in `lib/event`. This framework
includes things like polling and scheduling the lightweight threads, installing
signal handlers to shut down cleanly, and basic command line option parsing.
Only established applications should consider directly integrating the lower
level libraries.

# Limitations of the C Language

Message passing is efficient, but it results in asynchronous code.
Unfortunately, asynchronous code is a challenge in C. It's often implemented by
passing function pointers that are called when an operation completes. This
chops up the code so that it isn't easy to follow, especially through logic
branches. The best solution is to use a language with support for
[futures and promises](https://en.wikipedia.org/wiki/Futures_and_promises),
such as C++, Rust, Go, or almost any other higher level language. However, SPDK
is a low level library and requires very wide compatibility and portability, so
we've elected to stay with plain old C.

We do have a few recommendations to share, though. For _simple_ callback chains,
it's easiest if you write the functions from bottom to top. By that we mean if
function `foo` performs some asynchronous operation and when that completes
function `bar` is called, then function `bar` performs some operation that
calls function `baz` on completion, a good way to write it is as such:

    void baz(void *ctx) {
            ...
    }

    void bar(void *ctx) {
            async_op(baz, ctx);
    }

    void foo(void *ctx) {
            async_op(bar, ctx);
    }

Don't split these functions up - keep them as a nice unit that can be read from bottom to top.

For more complex callback chains, especially ones that have logical branches
or loops, it's best to write out a state machine. It turns out that higher
level languages that support futures and promises are just generating state
machines at compile time, so even though we don't have the ability to generate
them in C we can still write them out by hand. As an example, here's a
callback chain that performs `foo` 5 times and then calls `bar` - effectively
an asynchronous for loop.

    #include <stdlib.h>

    enum states {
            FOO_START = 0,
            FOO_END,
            BAR_START,
            BAR_END
    };

    struct state_machine {
            enum states state;

            int count;
    };

    static void run_state_machine(struct state_machine *sm);

    static void
    foo_complete(void *ctx)
    {
            struct state_machine *sm = ctx;

            sm->state = FOO_END;
            run_state_machine(sm);
    }

    static void
    foo(struct state_machine *sm)
    {
            do_async_op(foo_complete, sm);
    }

    static void
    bar_complete(void *ctx)
    {
            struct state_machine *sm = ctx;

            sm->state = BAR_END;
            run_state_machine(sm);
    }

    static void
    bar(struct state_machine *sm)
    {
            do_async_op(bar_complete, sm);
    }

    static void
    run_state_machine(struct state_machine *sm)
    {
            enum states prev_state;

            do {
                    prev_state = sm->state;

                    switch (sm->state) {
                    case FOO_START:
                            foo(sm);
                            break;
                    case FOO_END:
                            /* This is the loop condition */
                            if (++sm->count < 5) {
                                    sm->state = FOO_START;
                            } else {
                                    sm->state = BAR_START;
                            }
                            break;
                    case BAR_START:
                            bar(sm);
                            break;
                    case BAR_END:
                            /* Done. The machine's owner can now free sm. */
                            break;
                    }
            } while (prev_state != sm->state);
    }

    void do_async_for(void)
    {
            struct state_machine *sm;

            sm = malloc(sizeof(*sm));
            sm->state = FOO_START;
            sm->count = 0;

            run_state_machine(sm);
    }

This is complex, of course, but the `run_state_machine` function can be read
from top to bottom to get a clear overview of what's happening in the code
without having to chase through each of the callbacks.