# Message Passing and Concurrency {#concurrency}

# Theory

One of the primary aims of SPDK is to scale linearly with the addition of
hardware. This can mean many things in practice. For instance, moving from one
SSD to two should double the number of I/Os per second. Or doubling the number
of CPU cores should double the amount of computation possible. Or even doubling
the number of NICs should double the network throughput. To achieve this, the
software's threads of execution must be independent of one another as much as
possible. In practice, that means avoiding software locks and even atomic
instructions.

Traditionally, software achieves concurrency by placing some shared data onto
the heap, protecting it with a lock, and then having all threads of execution
acquire the lock only when accessing the data (sketched just after the list
below). This model has many great properties:

* It's easy to convert single-threaded programs to multi-threaded programs
  because you don't have to change the data model from the single-threaded
  version. You add a lock around the data.
* You can write your program as a synchronous, imperative list of statements
  that you read from top to bottom.
* The scheduler can interrupt threads, allowing for efficient time-sharing
  of CPU resources.
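
As a point of reference, the traditional model looks roughly like this minimal
sketch using the standard pthread API (the `g_counter` data and the function
names are invented for illustration):

    #include <pthread.h>

    static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
    static int g_counter;

    void *
    worker(void *arg)
    {
            /* Every thread serializes on the same lock to touch the
             * shared data. */
            pthread_mutex_lock(&g_lock);
            g_counter++;
            pthread_mutex_unlock(&g_lock);
            return NULL;
    }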

Unfortunately, as the number of threads scales up, contention on the lock around
the shared data does too. More granular locking helps, but it also increases the
complexity of the program. Even then, beyond a certain level of lock contention,
threads will spend most of their time attempting to acquire the locks and the
program will not benefit from additional CPU cores.

SPDK takes a different approach altogether. Instead of placing shared data in a
global location that all threads access after acquiring a lock, SPDK will often
assign that data to a single thread. When other threads want to access the data,
they pass a message to the owning thread to perform the operation on their
behalf. This strategy, of course, is not at all new. For instance, it is one of
the core design principles of
[Erlang](http://erlang.org/download/armstrong_thesis_2003.pdf) and is the main
concurrency mechanism in [Go](https://tour.golang.org/concurrency/2). A message
in SPDK consists of a function pointer and a pointer to some context. Messages
are passed between threads using a
[lockless ring](http://dpdk.org/doc/guides/prog_guide/ring_lib.html). Message
passing is often much faster than most software developers' intuition leads them
to believe, primarily due to caching effects. If a single core is accessing the
same data (on behalf of all of the other cores), then that data is far more
likely to be in a cache closer to that core. It's often most efficient to have
each core work on a small set of data sitting in its local cache and then hand
off a small message to the next core when done.
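
Concretely, a message is nothing more than those two pointers bundled together.
This is a conceptual sketch rather than SPDK's actual internal definition:

    struct message {
            /* Function to execute on the target thread. */
            void (*fn)(void *ctx);
            /* Context handed to fn when it runs. */
            void *ctx;
    };

The sending thread enqueues such a message onto the target thread's ring, and
the target thread dequeues it during its next timeslice and calls
`msg->fn(msg->ctx)`.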

In more extreme cases where even message passing may be too costly, each thread
may make a local copy of the data. The thread will then only reference its local
copy. To mutate the data, threads will send a message to every other thread
telling it to perform the update on its local copy, as in the sketch below. This
works best when the data isn't mutated very often but is read very frequently,
and it is often employed in the I/O path. This of course trades memory size for
computational efficiency, so it is used in only the most critical code paths.
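
A rough sketch of that pattern, with invented names (`local_copy`,
`broadcast_update`) and using the `spdk_thread_send_msg()` primitive described
in the next section:

    #include <stdlib.h>

    #include "spdk/thread.h"

    struct local_copy {
            int value;      /* one thread's private copy of the data */
    };

    struct update_msg {
            struct local_copy *copy;
            int new_value;
    };

    static void
    update_local_copy(void *ctx)
    {
            struct update_msg *msg = ctx;

            /* Runs on the thread that owns msg->copy, so no lock is needed. */
            msg->copy->value = msg->new_value;
            free(msg);
    }

    /* copies[i] is the local copy owned by threads[i]. */
    void
    broadcast_update(struct spdk_thread **threads, struct local_copy *copies,
                     int count, int new_value)
    {
            int i;

            for (i = 0; i < count; i++) {
                    struct update_msg *msg = malloc(sizeof(*msg));

                    if (msg == NULL) {
                            continue;
                    }
                    msg->copy = &copies[i];
                    msg->new_value = new_value;
                    spdk_thread_send_msg(threads[i], update_local_copy, msg);
            }
    }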

# Message Passing Infrastructure

SPDK provides several layers of message passing infrastructure. The most
fundamental libraries in SPDK, for instance, don't do any message passing on
their own and instead enumerate rules in their documentation about when their
functions may be called (e.g. @ref nvme). Most libraries, however, depend on
SPDK's [thread](http://www.spdk.io/doc/thread_8h.html)
abstraction, located in `libspdk_thread.a`. The thread abstraction provides a
basic message passing framework and defines a few key primitives.

First, `spdk_thread` is an abstraction for a lightweight, stackless thread of
execution. A lower level framework can execute an `spdk_thread` for a single
timeslice by calling `spdk_thread_poll()`. A lower level framework is allowed to
move an `spdk_thread` between system threads at any time, as long as there is
only a single system thread executing `spdk_thread_poll()` on that
`spdk_thread` at any given time. New lightweight threads may be created at any
time by calling `spdk_thread_create()` and destroyed by calling
`spdk_thread_destroy()`. The lightweight thread is the foundational abstraction
for threading in SPDK.
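
To make that concrete, a lower level framework's event loop might look roughly
like the following. This is a minimal sketch, not code from any real framework,
and it assumes a recent SPDK where `spdk_thread_lib_init()` must be called
before creating lightweight threads:

    #include "spdk/thread.h"

    void
    run_event_loop(void)
    {
            struct spdk_thread *threads[2];
            int i;

            /* Initialize the thread library before creating any
             * lightweight threads. */
            spdk_thread_lib_init(NULL, 0);

            for (i = 0; i < 2; i++) {
                    threads[i] = spdk_thread_create("lw_thread", NULL);
            }

            for (;;) {
                    /* Give each lightweight thread one timeslice, running
                     * its pending messages and pollers. */
                    for (i = 0; i < 2; i++) {
                            spdk_thread_poll(threads[i], 0, 0);
                    }
            }
    }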

There are then a few additional abstractions layered on top of the
`spdk_thread`. One is the `spdk_poller`, an abstraction for a function that
should be called repeatedly on a given thread. Another is the `spdk_msg_fn`, a
function pointer that, paired with a context pointer, can be sent to a thread
for execution via `spdk_thread_send_msg()`.
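
Here is a sketch of both primitives in use. The `hello_msg` and `tick`
functions are invented for illustration, and the `SPDK_POLLER_BUSY` return
value assumes a recent SPDK release:

    #include <stdio.h>

    #include "spdk/thread.h"

    static void
    hello_msg(void *ctx)
    {
            /* Runs on the target thread, not on the sender. */
            printf("hello from the owning thread\n");
    }

    static int
    tick(void *ctx)
    {
            /* Called repeatedly by the owning thread's event loop. */
            return SPDK_POLLER_BUSY;
    }

    void
    example(struct spdk_thread *target)
    {
            struct spdk_poller *poller;

            /* Ask `target` to run hello_msg(NULL) on our behalf. */
            spdk_thread_send_msg(target, hello_msg, NULL);

            /* Run tick() on the current thread every 1000 microseconds;
             * spdk_poller_unregister(&poller) stops it later. */
            poller = spdk_poller_register(tick, NULL, 1000);
    }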

The library also defines two additional abstractions: `spdk_io_device` and
`spdk_io_channel`. In the course of implementing SPDK we noticed the same
pattern emerging in a number of different libraries. In order to implement a
message passing strategy, the code would describe some object with global state
and also some per-thread context associated with that object that was accessed
in the I/O path to avoid locking on the global state. The pattern was clearest
in the lowest layers where I/O was being submitted to block devices. These
devices often expose multiple queues that can be assigned to threads and then
accessed without a lock to submit I/O. To abstract that, we generalized the
device to `spdk_io_device` and the thread-specific queue to `spdk_io_channel`.
Over time, however, the pattern has appeared in a huge number of places that
don't fit quite so nicely with the names we originally chose. In today's code,
`spdk_io_device` is any pointer whose uniqueness is predicated only on its
memory address, and `spdk_io_channel` is the per-thread context associated with
a particular `spdk_io_device`.
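
A sketch of how a library typically uses this pair follows; the `my_device`
and `my_channel_ctx` names are invented for illustration:

    #include "spdk/thread.h"

    struct my_device {
            int id;                 /* global state for the device */
    };

    struct my_channel_ctx {
            int submissions;        /* per-thread state, accessed lock-free */
    };

    static int
    channel_create(void *io_device, void *ctx_buf)
    {
            struct my_channel_ctx *ctx = ctx_buf;

            ctx->submissions = 0;
            return 0;
    }

    static void
    channel_destroy(void *io_device, void *ctx_buf)
    {
    }

    void
    register_device(struct my_device *dev)
    {
            /* Any unique pointer works as the io_device. */
            spdk_io_device_register(dev, channel_create, channel_destroy,
                                    sizeof(struct my_channel_ctx), "my_device");
    }

    void
    per_thread_io(struct my_device *dev)
    {
            /* Each thread gets (or creates) its own channel for the device. */
            struct spdk_io_channel *ch = spdk_get_io_channel(dev);
            struct my_channel_ctx *ctx = spdk_io_channel_get_ctx(ch);

            ctx->submissions++;
            spdk_put_io_channel(ch);
    }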

The threading abstraction provides functions to send a message to any other
thread (`spdk_thread_send_msg()`), to send a message to all threads one by one
(`spdk_for_each_thread()`), and to send a message to all threads for which
there is an io_channel for a given io_device (`spdk_for_each_channel()`).
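
For instance, continuing the invented `my_device` sketch from above, resetting
every thread's channel context for a device might look roughly like this:

    static void
    reset_one_channel(struct spdk_io_channel_iter *i)
    {
            struct spdk_io_channel *ch = spdk_io_channel_iter_get_channel(i);
            struct my_channel_ctx *ctx = spdk_io_channel_get_ctx(ch);

            /* Runs on the thread that owns this channel. */
            ctx->submissions = 0;
            spdk_for_each_channel_continue(i, 0);
    }

    static void
    reset_done(struct spdk_io_channel_iter *i, int status)
    {
            /* Runs once every channel has been visited. */
    }

    void
    reset_all_channels(struct my_device *dev)
    {
            spdk_for_each_channel(dev, reset_one_channel, NULL, reset_done);
    }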

Most critically, the thread abstraction does not actually spawn any system level
threads of its own. Instead, it relies on the existence of some lower level
framework that spawns the system threads and sets up event loops. Inside those
event loops, the threading abstraction simply requires the lower level framework
to repeatedly call `spdk_thread_poll()` on each `spdk_thread` that exists. This
makes SPDK very portable to a wide variety of asynchronous, event-based
frameworks such as [Seastar](https://www.seastar.io) or [libuv](https://libuv.org/).

# The event Framework

The SPDK project didn't want to officially pick an asynchronous, event-based
framework for all of the example applications it shipped with, in the interest
of supporting the widest variety of frameworks possible. But the applications do
of course require something that implements an asynchronous event loop in order
to run, so enter the `event` framework located in `lib/event`. This framework
includes things like spawning one thread per core, pinning each thread to a
unique core, polling and scheduling the lightweight threads, installing signal
handlers to shut down cleanly, and basic command line option parsing. When
started through `spdk_app_start()`, the library automatically spawns all of the
threads requested, pins them, and is ready for lightweight threads to be
created. This makes it much easier to implement a brand new SPDK application and
is the recommended method for those starting out. Only established applications
should consider directly integrating the lower level libraries.
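
A minimal application built on the event framework looks roughly like the
sketch below. Note that the exact `spdk_app_opts` fields and the
`spdk_app_opts_init()` signature vary between SPDK versions:

    #include <stdio.h>

    #include "spdk/event.h"

    static void
    start_fn(void *ctx)
    {
            /* Runs on the first lightweight thread once the framework is up. */
            printf("app started\n");
            spdk_app_stop(0);
    }

    int
    main(int argc, char **argv)
    {
            struct spdk_app_opts opts = {};
            int rc;

            spdk_app_opts_init(&opts, sizeof(opts));
            opts.name = "hello";

            /* Spawns and pins one thread per core, then calls start_fn. */
            rc = spdk_app_start(&opts, start_fn, NULL);
            spdk_app_fini();
            return rc;
    }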

# Limitations of the C Language

Message passing is efficient, but it results in asynchronous code.
Unfortunately, asynchronous code is a challenge in C. It's often implemented by
passing function pointers that are called when an operation completes. This
chops up the code so that it isn't easy to follow, especially through logic
branches. The best solution is to use a language with support for
[futures and promises](https://en.wikipedia.org/wiki/Futures_and_promises),
such as C++, Rust, Go, or almost any other higher level language. However,
SPDK is a low level library and requires very wide compatibility and
portability, so we've elected to stay with plain old C.

We do have a few recommendations to share, though. For _simple_ callback chains,
it's easiest if you write the functions from bottom to top. By that we mean: if
function `foo` begins some asynchronous operation whose completion calls
function `bar`, and `bar` in turn begins an operation whose completion calls
function `baz`, a good way to write it is as follows (`async_op` is a
placeholder for any call that takes a completion callback):

    void baz(void *ctx) {
            ...
    }

    void bar(void *ctx) {
            async_op(baz, ctx);
    }

    void foo(void *ctx) {
            async_op(bar, ctx);
    }

Don't split these functions up - keep them as a nice unit that can be read from bottom to top.

For more complex callback chains, especially ones that have logical branches
or loops, it's best to write out a state machine. It turns out that compilers
for higher level languages with futures and promises (or async/await) are just
generating state machines like this at compile time, so even though we can't
generate them automatically in C, we can still write them out by hand. As an
example, here's a callback chain that performs `foo` 5 times and then calls
`bar` - effectively an asynchronous for loop.

    enum states {
            FOO_START = 0,
            FOO_END,
            BAR_START,
            BAR_END
    };

    struct state_machine {
            enum states state;

            int count;
    };

    static void
    foo_complete(void *ctx)
    {
            struct state_machine *sm = ctx;

            sm->state = FOO_END;
            run_state_machine(sm);
    }

    static void
    foo(struct state_machine *sm)
    {
            do_async_op(foo_complete, sm);
    }

    static void
    bar_complete(void *ctx)
    {
            struct state_machine *sm = ctx;

            sm->state = BAR_END;
            run_state_machine(sm);
    }

    static void
    bar(struct state_machine *sm)
    {
            do_async_op(bar_complete, sm);
    }

    static void
    run_state_machine(struct state_machine *sm)
    {
            enum states prev_state;

            do {
                    prev_state = sm->state;

                    switch (sm->state) {
                    case FOO_START:
                            foo(sm);
                            break;
                    case FOO_END:
                            /* This is the loop condition: run foo 5 times
                             * in total before moving on to bar. */
                            if (++sm->count < 5) {
                                    sm->state = FOO_START;
                            } else {
                                    sm->state = BAR_START;
                            }
                            break;
                    case BAR_START:
                            bar(sm);
                            break;
                    case BAR_END:
                            /* Done; nothing further to drive. */
                            break;
                    }
            } while (prev_state != sm->state);
    }

    void
    do_async_for(void)
    {
            struct state_machine *sm;

            sm = malloc(sizeof(*sm));
            if (sm == NULL) {
                    return;
            }

            sm->state = FOO_START;
            sm->count = 0;

            /* Note: a complete implementation would free sm once the
             * machine reaches BAR_END. */
            run_state_machine(sm);
    }

This is complex, of course, but the `run_state_machine` function can be read
from top to bottom to get a clear overview of what's happening in the code
without having to chase through each of the callbacks.
255