# Submitting I/O to an NVMe Device {#nvme_spec}

## The NVMe Specification

The NVMe specification describes a hardware interface for interacting with
storage devices. The specification includes network transport definitions for
remote storage as well as a hardware register layout for local PCIe devices.
What follows here is an overview of how an I/O is submitted to a local PCIe
device through SPDK.

NVMe devices allow host software (in our case, the SPDK NVMe driver) to allocate
queue pairs in host memory. The "host" is the system that the NVMe SSD is
plugged into. A queue pair consists of two queues: a submission queue and a
completion queue. These queues are more accurately described as circular rings
of fixed-size entries. The submission queue is an array of 64 byte command
structures, plus 2 integers (head and tail indices). The completion queue is
similarly an array of 16 byte completion structures, plus 2 integers (head and
tail indices). Each queue pair also has two 32-bit device registers called
doorbells: a submission queue tail doorbell and a completion queue head
doorbell.
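
To make the layout concrete, here is a minimal sketch of a queue pair in plain
C. The struct and function names (`queue_pair`, `ring_next`, and so on) and the
queue depth of 8 are assumptions made for illustration; they are not SPDK's
definitions. Only the 64 byte and 16 byte entry sizes come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: hypothetical names and depth, not SPDK's structures. */
#define QUEUE_DEPTH 8u /* fixed number of entries in each ring */

struct sq_entry { uint8_t bytes[64]; }; /* one 64 byte command */
struct cq_entry { uint8_t bytes[16]; }; /* one 16 byte completion */

_Static_assert(sizeof(struct sq_entry) == 64, "commands are 64 bytes");
_Static_assert(sizeof(struct cq_entry) == 16, "completions are 16 bytes");

struct queue_pair {
    struct sq_entry sq[QUEUE_DEPTH];      /* submission ring in host memory */
    uint16_t sq_head, sq_tail;            /* submission ring indices */
    struct cq_entry cq[QUEUE_DEPTH];      /* completion ring in host memory */
    uint16_t cq_head, cq_tail;            /* completion ring indices */
    volatile uint32_t *sq_tail_doorbell;  /* 32-bit device register */
    volatile uint32_t *cq_head_doorbell;  /* 32-bit device register */
};

/* Advance a ring index by one slot, wrapping to 0 at the end of the ring. */
static inline uint16_t ring_next(uint16_t index)
{
    return (uint16_t)((index + 1u) % QUEUE_DEPTH);
}

/* Number of occupied slots between head and tail. */
static inline uint16_t ring_used(uint16_t head, uint16_t tail)
{
    return (uint16_t)((tail + QUEUE_DEPTH - head) % QUEUE_DEPTH);
}

/* The ring is full when advancing the tail would make it equal the head. */
static inline int ring_full(uint16_t head, uint16_t tail)
{
    return ring_next(tail) == head;
}
```

Because "head equals tail" means empty, a ring of this kind can hold at most
one entry fewer than its depth.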

An I/O is submitted to an NVMe device by constructing a 64 byte command, placing
it into the submission queue at the current location of the submission queue
tail index, and then writing the new index of the submission queue tail to the
submission queue tail doorbell register. It is also valid to copy a whole set
of commands into open slots in the ring and then write the doorbell just once
to submit the whole batch.
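
The batching behavior can be sketched as follows. This is a simulation in plain
C, not driver code: `sub_queue`, `sq_submit_batch` and the depth of 8 are
hypothetical, and the doorbell is an ordinary variable standing in for the
device register.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SQ_DEPTH 8u /* hypothetical ring size */

struct nvme_cmd { uint8_t bytes[64]; }; /* opaque 64 byte command */

struct sub_queue {
    struct nvme_cmd ring[SQ_DEPTH];
    uint16_t head;                    /* last head position reported by the device */
    uint16_t tail;                    /* next free slot, owned by the host */
    volatile uint32_t *tail_doorbell; /* stand-in for the doorbell register */
};

/*
 * Copy up to `count` commands into free slots, then write the doorbell once.
 * Returns how many commands were actually placed in the ring.
 */
static inline uint32_t sq_submit_batch(struct sub_queue *sq,
                                       const struct nvme_cmd *cmds,
                                       uint32_t count)
{
    uint32_t queued = 0;

    assert(sq->tail < SQ_DEPTH);
    while (queued < count) {
        uint16_t next = (uint16_t)((sq->tail + 1u) % SQ_DEPTH);
        if (next == sq->head) {
            break; /* ring is full */
        }
        memcpy(&sq->ring[sq->tail], &cmds[queued], sizeof(struct nvme_cmd));
        sq->tail = next;
        queued++;
    }

    if (queued > 0) {
        /* A real driver needs a write barrier here so the device never sees
         * the new tail before the command bytes are visible in memory. */
        *sq->tail_doorbell = sq->tail;
    }
    return queued;
}
```

Note that the doorbell receives the new tail index, not a count of commands, so
one write is enough no matter how many slots were filled.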

The NVMe specification, which is available from the main page at
[NVM Express](https://nvmexpress.org), describes the command submission and
completion process in detail.

Most importantly, the command itself describes the operation and also, if
necessary, a location in host memory containing a descriptor for host memory
associated with the command. This host memory holds the data to be written on a
write command, or receives the data on a read command. Data is transferred to
or from this location using a DMA engine on the NVMe device.

The completion queue works similarly, but the device is instead the one writing
entries into the ring. Each entry contains a "phase" bit that toggles between 0
and 1 on each pass through the entire ring. When a queue pair is set up to
generate interrupts, the interrupt contains the index of the completion queue
head. However, SPDK doesn't enable interrupts and instead polls on the phase
bit to detect completions. Interrupts are expensive operations, so polling this
phase bit is often far more efficient.
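
Here is a small simulation of phase-bit polling. Everything in it is an
assumption made for illustration: the ring depth, the names, and the
`fake_device` that plays the device's role. The completion entry is modeled as
16 bytes with the phase bit in bit 0 of the status halfword. The host expects
phase 1 on the first pass because the ring starts zeroed.

```c
#include <assert.h>
#include <stdint.h>

#define CQ_DEPTH 4u /* hypothetical ring size */

/* Simplified 16 byte completion entry; bit 0 of `status` is the phase bit. */
struct cpl {
    uint32_t cdw0;
    uint32_t rsvd;
    uint16_t sqhd;
    uint16_t sqid;
    uint16_t cid;
    uint16_t status;
};

_Static_assert(sizeof(struct cpl) == 16, "completions are 16 bytes");

struct comp_queue {
    struct cpl ring[CQ_DEPTH]; /* must start zeroed: all phase bits 0 */
    uint16_t head;             /* host: next slot to examine */
    uint8_t phase;             /* host: phase value of a new entry, starts at 1 */
};

/* Stand-in for the device side of the ring (hypothetical). */
struct fake_device {
    uint16_t tail;
    uint8_t phase; /* starts at 1, flips each time the tail wraps */
};

/* Device writes one completion; it does not check for a full ring. */
static inline void device_post(struct fake_device *dev, struct comp_queue *cq,
                               uint16_t cid)
{
    struct cpl *entry = &cq->ring[dev->tail];

    entry->cid = cid;
    entry->status = dev->phase; /* success status with the current phase */
    if (++dev->tail == CQ_DEPTH) {
        dev->tail = 0;
        dev->phase ^= 1;
    }
}

/* Host polls the slot at head. Returns 1 and stores the CID if it is new. */
static inline int cq_poll_one(struct comp_queue *cq, uint16_t *cid_out)
{
    const struct cpl *entry = &cq->ring[cq->head];

    if ((entry->status & 1u) != cq->phase) {
        return 0; /* stale entry from the previous pass, nothing new */
    }
    *cid_out = entry->cid;
    if (++cq->head == CQ_DEPTH) {
        cq->head = 0;
        cq->phase ^= 1; /* expect the opposite phase on the next pass */
    }
    return 1;
}
```

After the first wrap, old entries still carry phase 1 while the host expects
phase 0, which is how a stale slot is told apart from a fresh one without any
extra bookkeeping.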

## The SPDK NVMe Driver I/O Path

Now that we know how the ring structures work, let's cover how the SPDK NVMe
driver uses them. The user constructs a queue pair early in the life cycle of
the program, so that's not part of the "hot" path. Then, they'll call functions
like `spdk_nvme_ns_cmd_read()` to perform an I/O operation. The user supplies a
data buffer, the target LBA, and the length, as well as other information like
which NVMe namespace the command is targeted at and which NVMe queue pair to
use. Finally, the user provides a callback function and a context pointer. The
callback is invoked with that context when a completion for the resulting
command is discovered during a later call to
`spdk_nvme_qpair_process_completions()`.

The first stage in the driver is allocating a request object to track the
operation. The operations are asynchronous, so the driver can't simply track
the state of the request on the call stack. Allocating a new request object on
the heap would be far too slow, so SPDK keeps a pre-allocated set of request
objects inside the NVMe queue pair object - `struct spdk_nvme_qpair`. The
number of requests allocated to the queue pair is larger than the actual queue
depth of the NVMe submission queue because SPDK supports a couple of key
convenience features. The first is software queueing - SPDK will allow the user
to submit more requests than the hardware queue can actually hold and SPDK will
automatically queue in software. The second is splitting. SPDK will split a
request for many reasons, some of which are outlined next. The number of
request objects is configurable at queue pair creation time and if not
specified, SPDK will pick a sensible number based on the hardware queue depth.
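
The idea of a pre-allocated request pool with software queueing can be sketched
like this. It is a toy model under stated assumptions, not SPDK's
implementation: the struct fields, the function names, and the tiny sizes (2
hardware slots, 4 requests) are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HW_QUEUE_DEPTH 2u /* slots the simulated hardware queue can hold */
#define NUM_REQUESTS   4u /* pre-allocated requests, deliberately larger */

struct request {
    struct request *next;        /* links the free list or the software queue */
    void (*cb_fn)(void *cb_arg); /* user completion callback */
    void *cb_arg;                /* user context pointer */
};

struct qpair {
    struct request pool[NUM_REQUESTS];         /* allocated once, up front */
    struct request *free_list;                 /* unused request objects */
    struct request *queued_head, *queued_tail; /* software queue (FIFO) */
    uint32_t hw_in_flight;                     /* requests holding a hardware slot */
};

static inline void qpair_init(struct qpair *qp)
{
    qp->free_list = NULL;
    qp->queued_head = qp->queued_tail = NULL;
    qp->hw_in_flight = 0;
    for (uint32_t i = 0; i < NUM_REQUESTS; i++) {
        qp->pool[i].next = qp->free_list;
        qp->free_list = &qp->pool[i];
    }
}

/* Take a request from the pool; NULL means every request object is in use. */
static inline struct request *qpair_submit(struct qpair *qp,
                                           void (*cb_fn)(void *), void *cb_arg)
{
    struct request *req = qp->free_list;

    if (req == NULL) {
        return NULL;
    }
    qp->free_list = req->next;
    req->next = NULL;
    req->cb_fn = cb_fn;
    req->cb_arg = cb_arg;

    if (qp->hw_in_flight < HW_QUEUE_DEPTH) {
        qp->hw_in_flight++; /* would go straight to the submission queue */
    } else if (qp->queued_tail != NULL) {
        qp->queued_tail->next = req; /* hardware full: queue in software */
        qp->queued_tail = req;
    } else {
        qp->queued_head = qp->queued_tail = req;
    }
    return req;
}

/* Finish an in-flight request, recycle it, and promote one queued request. */
static inline void qpair_complete(struct qpair *qp, struct request *req)
{
    assert(qp->hw_in_flight > 0);
    req->cb_fn(req->cb_arg);
    req->next = qp->free_list;
    qp->free_list = req;
    qp->hw_in_flight--;

    if (qp->queued_head != NULL) {
        struct request *promoted = qp->queued_head;
        qp->queued_head = promoted->next;
        if (qp->queued_head == NULL) {
            qp->queued_tail = NULL;
        }
        promoted->next = NULL;
        qp->hw_in_flight++;
    }
}

/* Example callback: counts completions through the context pointer. */
static inline void count_completion(void *cb_arg)
{
    (*(int *)cb_arg)++;
}
```

Nothing is allocated after `qpair_init()`, so submitting and completing a
request only moves pointers between lists.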

The second stage is building the 64 byte NVMe command itself. The command is
built into memory embedded into the request object - not directly into an NVMe
submission queue slot. Once the command has been constructed, SPDK attempts to
obtain an open slot in the NVMe submission queue. For each element in the
submission queue an object called a tracker is allocated. The trackers are
allocated in an array, so they can be quickly looked up by an index. The tracker
itself contains a pointer to the request currently occupying that slot. When a
particular tracker is obtained, the command's CID value is updated with the
index of the tracker. The NVMe specification requires the device to return that
CID value in the completion, so the request can be recovered by looking up the
tracker via the CID value and then following the pointer.
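
The CID round trip can be sketched as below. The names (`tracker_table`,
`tracker_get`, `tracker_complete`) and the table size are hypothetical, and the
linear scan for a free tracker is only for brevity; a real driver would more
likely keep a free list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TRACKERS 4u /* one tracker per submission queue slot (hypothetical) */

struct request { int id; }; /* stand-in for the real request object */

struct command {
    uint8_t opc;  /* opcode */
    uint16_t cid; /* command identifier, echoed back in the completion */
};

struct tracker {
    struct request *req; /* request occupying this slot, NULL when free */
    uint16_t cid;        /* this tracker's index in the array */
};

struct tracker_table {
    struct tracker tr[NUM_TRACKERS];
};

static inline void tracker_table_init(struct tracker_table *t)
{
    for (uint16_t i = 0; i < NUM_TRACKERS; i++) {
        t->tr[i].req = NULL;
        t->tr[i].cid = i;
    }
}

/* Claim a free tracker and stamp its index into the command's CID field. */
static inline struct tracker *tracker_get(struct tracker_table *t,
                                          struct request *req,
                                          struct command *cmd)
{
    assert(req != NULL);
    for (uint16_t i = 0; i < NUM_TRACKERS; i++) {
        if (t->tr[i].req == NULL) {
            t->tr[i].req = req;
            cmd->cid = t->tr[i].cid;
            return &t->tr[i];
        }
    }
    return NULL; /* every slot is occupied */
}

/* On completion, index the array by CID, recover the request, free the slot. */
static inline struct request *tracker_complete(struct tracker_table *t,
                                               uint16_t cid)
{
    struct request *req;

    if (cid >= NUM_TRACKERS) {
        return NULL; /* CID outside the table */
    }
    req = t->tr[cid].req;
    t->tr[cid].req = NULL;
    return req;
}
```

Because the CID is an array index, recovering the request on completion takes
constant time no matter how many commands are outstanding.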

Once a tracker (slot) is obtained, the data buffer associated with it is
processed to build a PRP list. That's essentially an NVMe scatter gather list,
although it is a bit more restricted. The user provides SPDK with the virtual
address of the buffer, so SPDK has to perform a page table lookup to find the
physical address (pa) or I/O virtual address (iova) backing that virtual
memory. A virtually contiguous memory region may not be physically contiguous,
so this may result in a PRP list with multiple elements. Sometimes this may
result in a set of physical addresses that can't actually be expressed as a
single PRP list, so SPDK will automatically and transparently split the user
operation into two separate requests. For more information on how memory is
managed, see @ref memory.
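
As a rough sketch of the translation step, the code below walks a virtually
contiguous buffer page by page and records one address per page touched. It
assumes a 4 KiB memory page size. `fake_vtophys()` is a made-up mapping that
stands in for the real page table lookup, and `build_prp_entries()` is a
hypothetical helper, not an SPDK function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NVME_PAGE_SIZE 4096u /* assumed memory page size */

/* Signature of a virtual-to-physical lookup (hypothetical). */
typedef uint64_t (*translate_fn)(uint64_t vaddr);

/* Made-up mapping that scatters neighboring virtual pages, for the demo. */
static inline uint64_t fake_vtophys(uint64_t vaddr)
{
    uint64_t vpn = vaddr / NVME_PAGE_SIZE;
    uint64_t offset = vaddr % NVME_PAGE_SIZE;

    return (vpn ^ 0x3u) * NVME_PAGE_SIZE + offset;
}

/*
 * Fill prp[] with one translated address per page the buffer touches.
 * Only the first entry may carry an offset into its page; the rest are
 * page aligned. Returns the entry count, or -1 if more than max_entries
 * would be needed.
 */
static inline int build_prp_entries(uint64_t vaddr, uint32_t len,
                                    translate_fn vtophys,
                                    uint64_t *prp, int max_entries)
{
    int n = 0;

    assert(vtophys != NULL);
    while (len > 0) {
        uint32_t offset = (uint32_t)(vaddr & (NVME_PAGE_SIZE - 1u));
        uint32_t chunk = NVME_PAGE_SIZE - offset; /* bytes left in this page */

        if (chunk > len) {
            chunk = len;
        }
        if (n == max_entries) {
            return -1;
        }
        prp[n++] = vtophys(vaddr);
        vaddr += chunk;
        len -= chunk;
    }
    return n;
}
```

With this fake mapping, an 8 KiB buffer starting at virtual address `0x4200`
touches three pages and yields the entries `0x7200`, `0x6000` and `0x5000`:
contiguous in virtual memory, scattered in physical memory.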

The PRP list is not built until a tracker is obtained because the PRP list
description must be allocated in DMA-able memory and can be quite large. Since
SPDK typically allocates a large number of requests, we didn't want to allocate
enough space to pre-build the worst case scenario PRP list, especially given
that the common case does not require a separate PRP list at all.

Each NVMe command has two PRP list elements embedded into it, so a separate PRP
list isn't required if the request is 4KiB (or if it is 8KiB and aligned
perfectly). Profiling shows that this section of the code is not a major
contributor to the overall CPU use.
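
The arithmetic behind that rule is short enough to write down. The sketch below
assumes a 4 KiB memory page size, and the helper names are hypothetical. A
buffer needs one PRP entry per page it touches, and a separate list is only
needed once that count exceeds the two entries embedded in the command.

```c
#include <assert.h>
#include <stdint.h>

#define NVME_PAGE_SIZE 4096u /* assumed memory page size */

/* Number of PRP entries (pages touched) for a buffer at addr of len bytes. */
static inline uint32_t prp_entries_needed(uint64_t addr, uint32_t len)
{
    uint64_t offset = addr & (NVME_PAGE_SIZE - 1u);

    assert(len > 0);
    return (uint32_t)((offset + len + NVME_PAGE_SIZE - 1u) / NVME_PAGE_SIZE);
}

/* True when the two entries embedded in the command are not enough. */
static inline int prp_needs_separate_list(uint64_t addr, uint32_t len)
{
    return prp_entries_needed(addr, len) > 2u;
}
```

For example, an 8KiB buffer at a page-aligned address touches 2 pages and fits
in the command, while the same 8KiB buffer starting 512 bytes into a page
touches 3 pages and needs a separate list.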

With a tracker filled out, SPDK copies the 64 byte command into the actual NVMe
submission queue slot and then rings the submission queue tail doorbell to tell
the device to go process it. SPDK then returns to the user, without waiting
for a completion.

The user can periodically call `spdk_nvme_qpair_process_completions()` to tell
SPDK to examine the completion queue. Specifically, it reads the phase bit of
the next expected completion slot and, when the bit flips, looks at the CID
value to find the tracker, which points at the request object. The request
object contains the function pointer that the user provided initially, which is
then called to complete the command.

The `spdk_nvme_qpair_process_completions()` function will keep advancing to the
next completion slot until it runs out of completions, at which point it will
write the completion queue head doorbell to let the device know that it can use
the completion queue slots for new completions and return.
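
Putting the pieces together, a completion loop of this shape might look like
the sketch below. It is a simplified model, not the body of
`spdk_nvme_qpair_process_completions()`: the structs are trimmed to the fields
the loop needs, all names are hypothetical, and the doorbell is a plain
variable. Treating a limit of 0 as "no limit" is a convention chosen for this
sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEPTH 4u /* hypothetical ring and tracker count */

struct cpl {
    uint16_t cid;    /* which command finished */
    uint16_t status; /* bit 0 is the phase bit */
};

struct request {
    void (*cb_fn)(void *cb_arg, const struct cpl *cpl); /* user callback */
    void *cb_arg;                                       /* user context */
};

struct tracker { struct request *req; };

struct qpair {
    struct cpl cq[DEPTH];               /* completion ring */
    struct tracker tr[DEPTH];           /* indexed by CID */
    uint16_t cq_head;                   /* next slot to examine */
    uint8_t phase;                      /* expected phase, starts at 1 */
    volatile uint32_t *cq_head_doorbell;
};

/* Process up to `max` completions (0 means no limit); returns the count. */
static inline uint32_t process_completions(struct qpair *qp, uint32_t max)
{
    uint32_t processed = 0;

    while (max == 0 || processed < max) {
        const struct cpl *cpl = &qp->cq[qp->cq_head];
        struct request *req;

        if ((cpl->status & 1u) != qp->phase) {
            break; /* phase has not flipped: no new completion here */
        }
        assert(cpl->cid < DEPTH && qp->tr[cpl->cid].req != NULL);
        req = qp->tr[cpl->cid].req; /* CID -> tracker -> request */
        qp->tr[cpl->cid].req = NULL;

        if (++qp->cq_head == DEPTH) {
            qp->cq_head = 0;
            qp->phase ^= 1;
        }
        req->cb_fn(req->cb_arg, cpl); /* complete the command to the user */
        processed++;
    }

    if (processed > 0) {
        /* One doorbell write releases every consumed slot to the device. */
        *qp->cq_head_doorbell = qp->cq_head;
    }
    return processed;
}

/* Example callback: counts completions through the context pointer. */
static inline void count_cb(void *cb_arg, const struct cpl *cpl)
{
    (void)cpl;
    (*(int *)cb_arg)++;
}
```

The doorbell is written after the loop rather than once per entry, so a burst
of completions costs a single register write.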