# Message Passing and Concurrency {#concurrency}

## Theory

One of the primary aims of SPDK is to scale linearly with the addition of
hardware. This can mean many things in practice. For instance, moving from one
SSD to two should double the number of I/Os per second. Or doubling the number
of CPU cores should double the amount of computation possible. Or even doubling
the number of NICs should double the network throughput. To achieve this, the
software's threads of execution must be independent from one another as much as
possible. In practice, that means avoiding software locks and even atomic
instructions.

Traditionally, software achieves concurrency by placing some shared data onto
the heap, protecting it with a lock, and then having all threads of execution
acquire the lock only when accessing the data. This model has many great
properties:

* It's easy to convert single-threaded programs to multi-threaded programs
  because you don't have to change the data model from the single-threaded
  version. You add a lock around the data.
* You can write your program as a synchronous, imperative list of statements
  that you read from top to bottom.
* The scheduler can interrupt threads, allowing for efficient time-sharing
  of CPU resources.

Unfortunately, as the number of threads scales up, contention on the lock around
the shared data does too. More granular locking helps, but then also increases
the complexity of the program. Even then, beyond a certain number of contended
locks, threads will spend most of their time attempting to acquire the locks and
the program will not benefit from more CPU cores.

SPDK takes a different approach altogether. Instead of placing shared data in a
global location that all threads access after acquiring a lock, SPDK will often
assign that data to a single thread. When other threads want to access the data,
they pass a message to the owning thread to perform the operation on their
behalf. This strategy, of course, is not at all new. For instance, it is one of
the core design principles of
[Erlang](http://erlang.org/download/armstrong_thesis_2003.pdf) and is the main
concurrency mechanism in [Go](https://tour.golang.org/concurrency/2). A message
in SPDK consists of a function pointer and a pointer to some context. Messages
are passed between threads using a
[lockless ring](http://dpdk.org/doc/guides/prog_guide/ring_lib.html). Message
passing is often much faster than most software developers' intuition leads them
to believe, due to caching effects. If a single core is accessing the same data
(on behalf of all of the other cores), then that data is far more likely to be
in a cache closer to that core. It's often most efficient to have each core work
on a small set of data sitting in its local cache and then hand off a small
message to the next core when done.
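
As a rough illustration of the idea, the sketch below models a message as a
function pointer plus a context pointer and delivers it through a small ring
that only the owning thread drains. It is a single-threaded toy, not SPDK code:
the plain array ring stands in for the lockless ring, and every name in it
(`msg_fn`, `send_msg`, `poll_msgs`, `RING_SIZE`, `owned_counter`) is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A message is a function pointer plus a context pointer. */
typedef void (*msg_fn)(void *ctx);

struct msg {
    msg_fn fn;
    void *ctx;
};

#define RING_SIZE 8 /* hypothetical capacity, a power of two */

/* Toy ring; a real one would be lockless and safe for multiple producers. */
struct ring {
    struct msg slots[RING_SIZE];
    size_t head; /* next slot to dequeue */
    size_t tail; /* next slot to enqueue */
};

/* Queue a message for the owning thread. Returns 0, or -1 if the ring is full. */
static int
send_msg(struct ring *r, msg_fn fn, void *ctx)
{
    assert(fn != NULL);
    if (r->tail - r->head == RING_SIZE) {
        return -1;
    }
    r->slots[r->tail % RING_SIZE] = (struct msg){ fn, ctx };
    r->tail++;
    return 0;
}

/* Run by the owning thread: drain the ring and execute each message. */
static size_t
poll_msgs(struct ring *r)
{
    size_t n = 0;

    while (r->head != r->tail) {
        struct msg m = r->slots[r->head % RING_SIZE];

        r->head++;
        m.fn(m.ctx);
        n++;
    }
    return n;
}

/* Data owned by one thread; only message handlers run by that thread touch it. */
static long owned_counter;

static void
add_to_counter(void *ctx)
{
    owned_counter += *(long *)ctx;
}
```

A sender would call `send_msg(&r, add_to_counter, &value)` and carry on; the
counter changes only when the owner later calls `poll_msgs(&r)`, so the data
itself never needs a lock.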

In more extreme cases where even message passing may be too costly, each thread
may make a local copy of the data. The thread will then only reference its local
copy. To mutate the data, threads will send a message to each other thread
telling them to perform the update on their local copy. This is great when the
data isn't mutated very often, but is read very frequently, and is often
employed in the I/O path. This of course trades memory size for computational
efficiency, so it is used in only the most critical code paths.
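
The per-thread copy pattern might look roughly like the following sketch. It is
a single-threaded model with hypothetical names (`toy_thread`, `read_limit`,
`broadcast_limit`, `poll_thread`), and a one-slot mailbox stands in for a real
message ring.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_THREADS 4 /* hypothetical thread count */

/* Each thread keeps a private copy of a rarely changing value. */
struct toy_thread {
    int local_limit;   /* read in the hot path, never shared */
    bool has_update;   /* one-slot mailbox standing in for a message ring */
    int pending_limit; /* value carried by the pending message */
};

/* Hot path: read only the thread's own copy, with no lock and no message. */
static int
read_limit(const struct toy_thread *t)
{
    return t->local_limit;
}

/* Mutation: send the new value to every thread, one by one. */
static void
broadcast_limit(struct toy_thread *threads, size_t n, int new_limit)
{
    assert(threads != NULL);
    for (size_t i = 0; i < n; i++) {
        threads[i].pending_limit = new_limit;
        threads[i].has_update = true;
    }
}

/* Each thread applies the update to its own copy the next time it polls. */
static void
poll_thread(struct toy_thread *t)
{
    if (t->has_update) {
        t->local_limit = t->pending_limit;
        t->has_update = false;
    }
}
```

Note that a thread keeps seeing the old value until it processes its own copy
of the message, which is the price paid for lock-free reads.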

## Message Passing Infrastructure

SPDK provides several layers of message passing infrastructure. The most
fundamental libraries in SPDK, for instance, don't do any message passing on
their own and instead enumerate rules about when functions may be called in
their documentation (e.g. @ref nvme). Most libraries, however, depend on SPDK's
[thread](http://www.spdk.io/doc/thread_8h.html)
abstraction, located in `libspdk_thread.a`. The thread abstraction provides a
basic message passing framework and defines a few key primitives.

First, `spdk_thread` is an abstraction for a lightweight, stackless thread of
execution. A lower level framework can execute an `spdk_thread` for a single
timeslice by calling `spdk_thread_poll()`. A lower level framework is allowed to
move an `spdk_thread` between system threads at any time, as long as there is
only a single system thread executing `spdk_thread_poll()` on that
`spdk_thread` at any given time. New lightweight threads may be created at any
time by calling `spdk_thread_create()` and destroyed by calling
`spdk_thread_destroy()`. The lightweight thread is the foundational abstraction for
threading in SPDK.

There are then a few additional abstractions layered on top of the
`spdk_thread`. One is the `spdk_poller`, which is an abstraction for a
function that should be repeatedly called on the given thread. Another is the
`spdk_msg_fn`, a function pointer that, together with a context pointer, can
be sent to a thread for execution via `spdk_thread_send_msg()`.

The library also defines two additional abstractions: `spdk_io_device` and
`spdk_io_channel`. In the course of implementing SPDK we noticed the same
pattern emerging in a number of different libraries. In order to implement a
message passing strategy, the code would describe some object with global state
and also some per-thread context associated with that object that was accessed
in the I/O path to avoid locking on the global state. The pattern was clearest
in the lowest layers where I/O was being submitted to block devices. These
devices often expose multiple queues that can be assigned to threads and then
accessed without a lock to submit I/O. To abstract that, we generalized the
device to `spdk_io_device` and the thread-specific queue to `spdk_io_channel`.
Over time, however, the pattern has appeared in a huge number of places that
don't fit quite so nicely with the names we originally chose. In today's code
`spdk_io_device` is any pointer, whose uniqueness is predicated only on its
memory address, and `spdk_io_channel` is the per-thread context associated with
a particular `spdk_io_device`.
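
To make the device/channel relationship concrete, here is a minimal sketch in
which a "device" is just an address used as a lookup key and a "channel" is the
per-thread state found under that key. This is a toy model only: it does not
use or mirror the real SPDK functions, and all names (`toy_thread`,
`toy_channel`, `toy_get_channel`, `toy_put_channel`, `MAX_CHANNELS`) are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHANNELS 4 /* hypothetical per-thread limit */

/* Per-thread context for one device; touched only by its owning thread. */
struct toy_channel {
    void *io_device;         /* key: the device's address, NULL if the slot is free */
    int ref;                 /* number of users on this thread */
    unsigned long submitted; /* example of per-thread state */
};

struct toy_thread {
    struct toy_channel channels[MAX_CHANNELS];
};

/* Return this thread's channel for io_device, creating it on first use. */
static struct toy_channel *
toy_get_channel(struct toy_thread *t, void *io_device)
{
    struct toy_channel *free_slot = NULL;

    assert(io_device != NULL);
    for (size_t i = 0; i < MAX_CHANNELS; i++) {
        struct toy_channel *ch = &t->channels[i];

        if (ch->io_device == io_device) {
            ch->ref++;
            return ch;
        }
        if (ch->io_device == NULL && free_slot == NULL) {
            free_slot = ch;
        }
    }
    if (free_slot != NULL) {
        free_slot->io_device = io_device;
        free_slot->ref = 1;
        free_slot->submitted = 0;
    }
    return free_slot; /* NULL when the table is full */
}

/* Drop one reference; the slot is released when the last user is gone. */
static void
toy_put_channel(struct toy_channel *ch)
{
    assert(ch->ref > 0);
    if (--ch->ref == 0) {
        ch->io_device = NULL;
    }
}
```

Two lookups of the same address on one thread return the same channel, while a
second thread gets its own, so the I/O path never shares mutable state across
threads.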

The threading abstraction provides functions to send a message to any other
thread, to send a message to all threads one by one, and to send a message to
all threads for which there is an io_channel for a given io_device.

Most critically, the thread abstraction does not actually spawn any system level
threads of its own. Instead, it relies on the existence of some lower level
framework that spawns system threads and sets up event loops. Inside those event
loops, the threading abstraction simply requires the lower level framework to
repeatedly call `spdk_thread_poll()` on each `spdk_thread` that exists. This
makes SPDK very portable to a wide variety of asynchronous, event-based
frameworks such as [Seastar](https://www.seastar.io) or [libuv](https://libuv.org/).
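
The division of labor between the framework and the lightweight threads can be
sketched as follows. This is a toy model under stated assumptions, not the SPDK
API: `toy_thread`, `toy_thread_poll`, `toy_event_loop` and `count_poller` are
hypothetical names, and the loop body is what a single system thread owned by
the framework would run.

```c
#include <assert.h>
#include <stddef.h>

/* A poller is a function that the thread calls every time it is polled. */
typedef int (*toy_poller_fn)(void *ctx);

/* Toy lightweight thread: no stack of its own, just work to run when polled. */
struct toy_thread {
    toy_poller_fn poller;
    void *poller_ctx;
};

/* Execute one timeslice of a lightweight thread. Returns the work done. */
static int
toy_thread_poll(struct toy_thread *t)
{
    assert(t->poller != NULL);
    return t->poller(t->poller_ctx);
}

/* The framework's event loop: poll every lightweight thread in turn. */
static int
toy_event_loop(struct toy_thread *threads, size_t n, int iterations)
{
    int total = 0;

    for (int i = 0; i < iterations; i++) {
        for (size_t j = 0; j < n; j++) {
            total += toy_thread_poll(&threads[j]);
        }
    }
    return total;
}

/* Example poller: count how many times it ran. */
static int
count_poller(void *ctx)
{
    (*(int *)ctx)++;
    return 1;
}
```

Because the lightweight threads never block and never own a system thread, the
same loop body could be driven by any event framework that is able to call a
function repeatedly.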

## SPDK Spinlocks

There are some cases where locks are used. These should be limited in favor of
the message passing interface described above. When locks are needed,
SPDK spinlocks should be used instead of POSIX locks.

POSIX locks like `pthread_mutex_t` and `pthread_spinlock_t` do not properly
handle locking between SPDK's lightweight threads. SPDK's `spdk_spinlock`
is safe to use in SPDK libraries and applications. This safety comes from
imposing restrictions on when locks can be held. See
[spdk_spinlock](structspdk__spinlock.html) for details.

## The event Framework

The SPDK project didn't want to officially pick an asynchronous, event-based
framework for all of the example applications it shipped with, in the interest
of supporting the widest variety of frameworks possible. But the applications do
of course require something that implements an asynchronous event loop in order
to run, so enter the `event` framework located in `lib/event`. This framework
includes things like polling and scheduling the lightweight threads, installing
signal handlers to shut down cleanly, and basic command line option parsing.
Only established applications should consider directly integrating the lower
level libraries.

## Limitations of the C Language

Message passing is efficient, but it results in asynchronous code.
Unfortunately, asynchronous code is a challenge in C. It's often implemented by
passing function pointers that are called when an operation completes. This
chops up the code so that it isn't easy to follow, especially through logic
branches. The best solution is to use a language with support for
[futures and promises](https://en.wikipedia.org/wiki/Futures_and_promises),
such as C++, Rust, Go, or almost any other higher level language. However, SPDK is a low
level library and requires very wide compatibility and portability, so we've
elected to stay with plain old C.

We do have a few recommendations to share, though. For _simple_ callback chains,
it's easiest if you write the functions from bottom to top. By that we mean if
function `foo` performs some asynchronous operation and when that completes
function `bar` is called, then function `bar` performs some operation that
calls function `baz` on completion, a good way to write it is as such:

```c
/* async_op() is a placeholder for any operation that calls back on completion. */

void baz(void *ctx) {
    /* ... */
}

void bar(void *ctx) {
    async_op(baz, ctx);
}

void foo(void *ctx) {
    async_op(bar, ctx);
}
```

Don't split these functions up - keep them as a nice unit that can be read from bottom to top.

For more complex callback chains, especially ones that have logical branches
or loops, it's best to write out a state machine. It turns out that higher
level languages that support futures and promises are just generating state
machines at compile time, so even though we don't have the ability to generate
them in C we can still write them out by hand. As an example, here's a
callback chain that performs `foo` 5 times and then calls `bar` - effectively
an asynchronous for loop.

```c
#include <stdlib.h>

/*
 * do_async_op() is a placeholder. It is assumed to invoke its callback later
 * (for example from the event loop), never inline from within the call.
 */

enum states {
    FOO_START = 0,
    FOO_END,
    BAR_START,
    BAR_END
};

struct state_machine {
    enum states state;

    int count; /* number of completed foo operations */
};

/* Forward declaration: the completion callbacks re-enter the state machine. */
static void run_state_machine(struct state_machine *sm);

static void
foo_complete(void *ctx)
{
    struct state_machine *sm = ctx;

    sm->state = FOO_END;
    run_state_machine(sm);
}

static void
foo(struct state_machine *sm)
{
    do_async_op(foo_complete, sm);
}

static void
bar_complete(void *ctx)
{
    struct state_machine *sm = ctx;

    sm->state = BAR_END;
    run_state_machine(sm);
}

static void
bar(struct state_machine *sm)
{
    do_async_op(bar_complete, sm);
}

static void
run_state_machine(struct state_machine *sm)
{
    enum states prev_state;

    do {
        prev_state = sm->state;

        switch (sm->state) {
            case FOO_START:
                foo(sm);
                break;
            case FOO_END:
                /* This is the loop condition: repeat until foo has run 5 times */
                if (++sm->count < 5) {
                    sm->state = FOO_START;
                } else {
                    sm->state = BAR_START;
                }
                break;
            case BAR_START:
                bar(sm);
                break;
            case BAR_END:
                /* Done: release the state and stop without touching sm again */
                free(sm);
                return;
        }
    } while (prev_state != sm->state);
}

void do_async_for(void)
{
    struct state_machine *sm;

    sm = malloc(sizeof(*sm));
    if (sm == NULL) {
        return;
    }
    sm->state = FOO_START;
    sm->count = 0;

    run_state_machine(sm);
}
```

This is complex, of course, but the `run_state_machine` function can be read
from top to bottom to get a clear overview of what's happening in the code
without having to chase through each of the callbacks.