xref: /spdk/doc/concurrency.md (revision b09ae853f9a595a42f9e816a1f50d1f06b6ff27a)
# Message Passing and Concurrency {#concurrency}

## Theory

One of the primary aims of SPDK is to scale linearly with the addition of
hardware. This can mean many things in practice. For instance, moving from one
SSD to two should double the number of I/Os per second. Doubling the number of
CPU cores should double the amount of computation possible, and doubling the
number of NICs should double the network throughput. To achieve this, the
software's threads of execution must be as independent from one another as
possible. In practice, that means avoiding software locks and even atomic
instructions.

Traditionally, software achieves concurrency by placing some shared data onto
the heap, protecting it with a lock, and then having all threads of execution
acquire the lock only when accessing the data. This model has many great
properties:

* It's easy to convert single-threaded programs to multi-threaded programs
  because you don't have to change the data model from the single-threaded
  version. You add a lock around the data.
* You can write your program as a synchronous, imperative list of statements
  that you read from top to bottom.
* The scheduler can interrupt threads, allowing for efficient time-sharing
  of CPU resources.

Unfortunately, as the number of threads scales up, contention on the lock around
the shared data does too. More granular locking helps, but then also increases
the complexity of the program. Even then, beyond a certain number of contended
locks, threads will spend most of their time attempting to acquire the locks and
the program will not benefit from more CPU cores.

SPDK takes a different approach altogether. Instead of placing shared data in a
global location that all threads access after acquiring a lock, SPDK will often
assign that data to a single thread. When other threads want to access the data,
they pass a message to the owning thread to perform the operation on their
behalf. This strategy, of course, is not at all new. For instance, it is one of
the core design principles of
[Erlang](http://erlang.org/download/armstrong_thesis_2003.pdf) and is the main
concurrency mechanism in [Go](https://tour.golang.org/concurrency/2). A message
in SPDK consists of a function pointer and a pointer to some context. Messages
are passed between threads using a
[lockless ring](http://dpdk.org/doc/guides/prog_guide/ring_lib.html). Message
passing is often much faster than most software developers' intuition leads them
to believe, due to caching effects. If a single core is accessing the same data
(on behalf of all of the other cores), then that data is far more likely to be
in a cache closer to that core. It's often most efficient to have each core work
on a small set of data sitting in its local cache and then hand off a small
message to the next core when done.

In more extreme cases where even message passing may be too costly, each thread
may make a local copy of the data. The thread will then only reference its local
copy. To mutate the data, threads will send a message to each other thread
telling them to perform the update on their local copy. This is great when the
data isn't mutated very often, but is read very frequently, and is often
employed in the I/O path. This of course trades memory size for computational
efficiency, so it is used in only the most critical code paths.

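Within SPDK's thread library, this broadcast-update pattern is typically
expressed with `spdk_for_each_thread()`. A minimal sketch, assuming the thread
library was initialized with a per-thread context size via
`spdk_thread_lib_init()`; the structure and function names are illustrative:

```c
#include <stdlib.h>

#include "spdk/thread.h"

/* Per-thread context, sized via spdk_thread_lib_init(NULL, sizeof(...)). */
struct per_thread_data {
    int limit; /* local copy, read without locks on the fast path */
};

static void
update_local_copy(void *ctx)
{
    /* Runs once on every spdk_thread. */
    struct per_thread_data *data = spdk_thread_get_ctx(spdk_get_thread());

    data->limit = *(int *)ctx;
}

static void
update_done(void *ctx)
{
    /* Runs on the originating thread once every copy is updated. */
    free(ctx);
}

static void
set_limit_on_all_threads(int new_limit)
{
    int *val = malloc(sizeof(*val));

    if (val == NULL) {
        return;
    }
    *val = new_limit;
    spdk_for_each_thread(update_local_copy, val, update_done);
}
```
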
## Message Passing Infrastructure

SPDK provides several layers of message passing infrastructure. The most
fundamental libraries in SPDK, for instance, don't do any message passing on
their own and instead enumerate rules about when functions may be called in
their documentation (e.g. @ref nvme). Most libraries, however, depend on SPDK's
[thread](http://www.spdk.io/doc/thread_8h.html)
abstraction, located in `libspdk_thread.a`. The thread abstraction provides a
basic message passing framework and defines a few key primitives.

First, `spdk_thread` is an abstraction for a lightweight, stackless thread of
execution. A lower level framework can execute an `spdk_thread` for a single
timeslice by calling `spdk_thread_poll()`. The framework is allowed to move an
`spdk_thread` between system threads at any time, as long as there is only a
single system thread executing `spdk_thread_poll()` on that `spdk_thread` at
any given time. New lightweight threads may be created at any time by calling
`spdk_thread_create()` and destroyed by calling `spdk_thread_destroy()`. The
lightweight thread is the foundational abstraction for threading in SPDK.

There are then a few additional abstractions layered on top of the
`spdk_thread`. One is the `spdk_poller`, which is an abstraction for a
function that should be repeatedly called on the given thread. Another is an
`spdk_msg_fn`, which is a function pointer paired with a context pointer that
can be sent to a thread for execution via `spdk_thread_send_msg()`.

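As a sketch of how these primitives fit together, assuming the `spdk/thread.h`
API (names like `my_poller_fn` are illustrative and error handling is omitted):

```c
#include "spdk/thread.h"

struct my_ctx {
    uint64_t tick_count;
};

static struct spdk_poller *g_poller;

/* An spdk_poller_fn: called repeatedly on the thread that registered it. */
static int
my_poller_fn(void *arg)
{
    struct my_ctx *ctx = arg;

    ctx->tick_count++;
    return SPDK_POLLER_IDLE; /* SPDK_POLLER_BUSY if work was actually done */
}

/* An spdk_msg_fn: runs exactly once on the thread it was sent to. */
static void
my_msg_fn(void *arg)
{
    struct my_ctx *ctx = arg;

    /* Register a poller on the receiving thread, run every 1000 us. */
    g_poller = spdk_poller_register(my_poller_fn, ctx, 1000);
}

static void
send_to_thread(struct spdk_thread *thread, struct my_ctx *ctx)
{
    /* Queue my_msg_fn for execution on `thread`. */
    spdk_thread_send_msg(thread, my_msg_fn, ctx);
}
```
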
The library also defines two additional abstractions: `spdk_io_device` and
`spdk_io_channel`. In the course of implementing SPDK we noticed the same
pattern emerging in a number of different libraries. In order to implement a
message passing strategy, the code would describe some object with global state
and also some per-thread context associated with that object that was accessed
in the I/O path to avoid locking on the global state. The pattern was clearest
in the lowest layers where I/O was being submitted to block devices. These
devices often expose multiple queues that can be assigned to threads and then
accessed without a lock to submit I/O. To abstract that, we generalized the
device to `spdk_io_device` and the thread-specific queue to `spdk_io_channel`.
Over time, however, the pattern has appeared in a huge number of places that
don't fit quite so nicely with the names we originally chose. In today's code
`spdk_io_device` is any pointer, whose uniqueness is predicated only on its
memory address, and `spdk_io_channel` is the per-thread context associated with
a particular `spdk_io_device`.

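A minimal sketch of the pattern, assuming the `spdk/thread.h` API; the device
and context structures here are illustrative:

```c
#include "spdk/thread.h"

/* Global state: the io_device is identified only by its address. */
struct my_device {
    int id;
};

/* Per-thread state carried in each io_channel. */
struct my_channel_ctx {
    uint64_t io_count;
};

static int
my_channel_create(void *io_device, void *ctx_buf)
{
    struct my_channel_ctx *ctx = ctx_buf;

    ctx->io_count = 0; /* set up per-thread resources (e.g. a queue) */
    return 0;
}

static void
my_channel_destroy(void *io_device, void *ctx_buf)
{
    /* Tear down per-thread resources here. */
}

static void
register_device(struct my_device *dev)
{
    spdk_io_device_register(dev, my_channel_create, my_channel_destroy,
                            sizeof(struct my_channel_ctx), "my_device");
}

static void
do_io_on_current_thread(struct my_device *dev)
{
    struct spdk_io_channel *ch = spdk_get_io_channel(dev);
    struct my_channel_ctx *ctx = spdk_io_channel_get_ctx(ch);

    ctx->io_count++; /* no lock needed: this context is per-thread */
    spdk_put_io_channel(ch);
}
```

In practice a channel is usually obtained once and cached rather than fetched
around every I/O.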
The threading abstraction provides functions to send a message to any other
thread, to send a message to all threads one by one, and to send a message to
all threads for which there is an io_channel for a given io_device.

Most critically, the thread abstraction does not actually spawn any system level
threads of its own. Instead, it relies on the existence of some lower level
framework that spawns system threads and sets up event loops. Inside those event
loops, the threading abstraction simply requires the lower level framework to
repeatedly call `spdk_thread_poll()` on each `spdk_thread` that exists. This
makes SPDK very portable to a wide variety of asynchronous, event-based
frameworks such as [Seastar](https://www.seastar.io) or [libuv](https://libuv.org/).

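A lower level framework's event loop can therefore be as simple as the
following sketch, assuming the `spdk/thread.h` API and that the thread was
created with `spdk_thread_create()` after `spdk_thread_lib_init()`; shutdown
handling is omitted:

```c
#include <stdbool.h>

#include "spdk/thread.h"

/* A minimal event loop: one system thread driving one spdk_thread. A real
 * framework would track many spdk_threads and a proper shutdown signal. */
static void
run_event_loop(struct spdk_thread *thread, volatile bool *running)
{
    /* Bind the spdk_thread to this system thread for the duration. */
    spdk_set_thread(thread);

    while (*running) {
        /* Execute one timeslice: drain pending messages, run pollers. */
        spdk_thread_poll(thread, 0, 0);
    }
}
```
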
## SPDK Spinlocks

There are some cases where locks are still needed. Their use should be limited
in favor of the message passing interface described above. When locks are
needed, SPDK spinlocks should be used instead of POSIX locks.

POSIX locks like `pthread_mutex_t` and `pthread_spinlock_t` do not properly
handle locking between SPDK's lightweight threads. SPDK's `spdk_spinlock`
is safe to use in SPDK libraries and applications. This safety comes from
imposing restrictions on when locks can be held. See
[spdk_spinlock](structspdk__spinlock.html) for details.

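A minimal sketch of `spdk_spinlock` usage, assuming the `spdk/thread.h` API;
the structure here is illustrative:

```c
#include "spdk/thread.h"

struct shared_state {
    struct spdk_spinlock lock;
    uint64_t value;
};

static void
shared_state_init(struct shared_state *s)
{
    spdk_spin_init(&s->lock);
    s->value = 0;
}

static void
shared_state_bump(struct shared_state *s)
{
    spdk_spin_lock(&s->lock);
    /* Keep the critical section short; never block, yield, or send
     * messages while holding the lock. */
    s->value++;
    spdk_spin_unlock(&s->lock);
}
```
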
## The event Framework

The SPDK project didn't want to officially pick an asynchronous, event-based
framework for all of the example applications it ships with, in the interest
of supporting the widest variety of frameworks possible. But the applications do
of course require something that implements an asynchronous event loop in order
to run, so enter the `event` framework located in `lib/event`. This framework
includes things like polling and scheduling the lightweight threads, installing
signal handlers to shut down cleanly, and basic command line option parsing.
Only established applications should consider directly integrating the lower
level libraries.

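A minimal application built on the event framework looks roughly like this
sketch, assuming the `spdk/event.h` API; the application name and callback are
illustrative:

```c
#include "spdk/event.h"

static void
app_start_cb(void *arg)
{
    /* Runs on the first SPDK thread once the framework is up.
     * Register pollers, io_devices, etc. here. Call spdk_app_stop()
     * to leave the event loop. */
}

int
main(int argc, char **argv)
{
    struct spdk_app_opts opts;
    int rc;

    (void)argc;
    (void)argv;

    spdk_app_opts_init(&opts, sizeof(opts));
    opts.name = "hello_event";

    /* Blocks until spdk_app_stop() is called. */
    rc = spdk_app_start(&opts, app_start_cb, NULL);

    spdk_app_fini();
    return rc;
}
```
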
## Limitations of the C Language

Message passing is efficient, but it results in asynchronous code.
Unfortunately, asynchronous code is a challenge in C. It's often implemented by
passing function pointers that are called when an operation completes. This
chops up the code so that it isn't easy to follow, especially through logic
branches. The best solution is to use a language with support for
[futures and promises](https://en.wikipedia.org/wiki/Futures_and_promises),
such as C++, Rust, Go, or almost any other higher level language. However, SPDK is a low
level library and requires very wide compatibility and portability, so we've
elected to stay with plain old C.

We do have a few recommendations to share, though. For _simple_ callback chains,
it's easiest if you write the functions from bottom to top. By that we mean: if
function `foo` performs some asynchronous operation, and function `bar` is
called when that completes, and function `bar` in turn performs some operation
that calls function `baz` on completion, a good way to write it is as such:

```c
void baz(void *ctx) {
        ...
}

void bar(void *ctx) {
        async_op(baz, ctx);
}

void foo(void *ctx) {
        async_op(bar, ctx);
}
```

Don't split these functions up - keep them as a nice unit that can be read from bottom to top.

For more complex callback chains, especially ones that have logical branches
or loops, it's best to write out a state machine. It turns out that higher
level languages that support futures and promises are just generating state
machines at compile time, so even though we don't have the ability to generate
them in C we can still write them out by hand. As an example, here's a
callback chain that performs `foo` 5 times and then calls `bar` - effectively
an asynchronous for loop.

```c
enum states {
    FOO_START = 0,
    FOO_END,
    BAR_START,
    BAR_END
};

struct state_machine {
    enum states state;

    int count;
};

static void run_state_machine(struct state_machine *sm);

static void
foo_complete(void *ctx)
{
    struct state_machine *sm = ctx;

    sm->state = FOO_END;
    run_state_machine(sm);
}

static void
foo(struct state_machine *sm)
{
    do_async_op(foo_complete, sm);
}

static void
bar_complete(void *ctx)
{
    struct state_machine *sm = ctx;

    sm->state = BAR_END;
    run_state_machine(sm);
}

static void
bar(struct state_machine *sm)
{
    do_async_op(bar_complete, sm);
}

static void
run_state_machine(struct state_machine *sm)
{
    enum states prev_state;

    do {
        prev_state = sm->state;

        switch (sm->state) {
        case FOO_START:
            foo(sm);
            break;
        case FOO_END:
            /* This is the loop condition */
            if (++sm->count < 5) {
                sm->state = FOO_START;
            } else {
                sm->state = BAR_START;
            }
            break;
        case BAR_START:
            bar(sm);
            break;
        case BAR_END:
            /* The chain is done; release the context. */
            free(sm);
            return;
        }
    } while (prev_state != sm->state);
}

void
do_async_for(void)
{
    struct state_machine *sm;

    sm = malloc(sizeof(*sm));
    if (sm == NULL) {
        return;
    }
    sm->state = FOO_START;
    sm->count = 0;

    run_state_machine(sm);
}
```

This is complex, of course, but the `run_state_machine` function can be read
from top to bottom to get a clear overview of what's happening in the code
without having to chase through each of the callbacks.