# Submitting I/O to an NVMe Device {#nvme_spec}

## The NVMe Specification

The NVMe specification describes a hardware interface for interacting with
storage devices. The specification includes network transport definitions for
remote storage as well as a hardware register layout for local PCIe devices.
What follows here is an overview of how an I/O is submitted to a local PCIe
device through SPDK.

NVMe devices allow host software (in our case, the SPDK NVMe driver) to allocate
queue pairs in host memory. The "host" is the system that the NVMe SSD is
plugged into. A queue pair consists of two queues - a submission queue and a
completion queue. These queues are more accurately described as circular rings
of fixed-size entries. The submission queue is an array of 64-byte command
structures, plus 2 integers (head and tail indices). The completion queue is
similarly an array of 16-byte completion structures, plus 2 integers (head and
tail indices). There are also two 32-bit registers involved, called doorbells:
the submission queue tail doorbell and the completion queue head doorbell.

An I/O is submitted to an NVMe device by constructing a 64-byte command, placing
it into the submission queue at the current location of the submission queue
tail index, and then writing the new index of the submission queue tail to the
submission queue tail doorbell register. It's also valid to copy a whole set
of commands into open slots in the ring and then write the doorbell just one
time to submit the whole batch.

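The submission steps above can be sketched in plain C. This is a minimal model
under stated assumptions, not SPDK's code: the names `sketch_cmd`, `sketch_sq`,
`sq_enqueue` and `sq_ring_doorbell` and the ring size are all hypothetical, and
the doorbell is an ordinary variable standing in for the device register.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SQ_ENTRIES 8u /* hypothetical ring size chosen for this sketch */

/* Stand-in for the 64-byte NVMe command; the real layout has named fields. */
struct sketch_cmd {
    uint8_t bytes[64];
};

struct sketch_sq {
    struct sketch_cmd ring[SQ_ENTRIES]; /* the ring itself, in host memory */
    uint32_t head;                      /* next entry the device will consume */
    uint32_t tail;                      /* next free entry for the host */
    volatile uint32_t *tail_doorbell;   /* a device register in a real driver */
};

/* Copy one command into the slot at the tail index.
 * Returns 0 on success or -1 if the ring is full. */
static int sq_enqueue(struct sketch_sq *sq, const struct sketch_cmd *cmd)
{
    uint32_t next = (sq->tail + 1) % SQ_ENTRIES;

    if (next == sq->head) {
        return -1; /* one slot stays empty so full and empty are distinguishable */
    }
    memcpy(&sq->ring[sq->tail], cmd, sizeof(*cmd));
    sq->tail = next;
    return 0;
}

/* Publish everything enqueued so far with a single doorbell write.
 * A real driver also needs a write barrier before this store. */
static void sq_ring_doorbell(struct sketch_sq *sq)
{
    assert(sq->tail_doorbell != NULL);
    *sq->tail_doorbell = sq->tail;
}
```

Calling `sq_enqueue()` several times and `sq_ring_doorbell()` once corresponds
to the batched submission described above: the device only learns about the new
commands when the tail value is written to the doorbell.
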
There is a very detailed description of the command submission and completion
process in the NVMe specification, which is conveniently available from the main
page over at [NVM Express](https://nvmexpress.org).

Most importantly, the command itself describes the operation and also, if
necessary, a location in host memory containing a descriptor for host memory
associated with the command. This host memory holds the data to be written on a
write command, or is the location to place the data on a read command. Data is
transferred to or from this location using a DMA engine on the NVMe device.

The completion queue works similarly, but the device is instead the one writing
entries into the ring. Each entry contains a "phase" bit that toggles between 0
and 1 on each loop through the entire ring. When a queue pair is set up to
generate interrupts, the interrupt contains the index of the completion queue
head. However, SPDK doesn't enable interrupts and instead polls on the phase
bit to detect completions. Interrupts are very heavy operations, so polling this
phase bit is often far more efficient.

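To make the phase bit concrete, here is a small sketch of polling a completion
ring. It is an illustration only: `sketch_cpl`, `sketch_cq`, `cq_poll` and
`device_post` are hypothetical names, the ring size is arbitrary, and
`device_post` merely plays the role of the device so the example can run
without hardware. The sketch assumes the host expects phase 1 on its first pass
through a zeroed ring.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CQ_ENTRIES 4u /* hypothetical ring size chosen for this sketch */

/* 16-byte completion entry. Field names are illustrative. */
struct sketch_cpl {
    uint32_t cdw0;
    uint32_t rsvd;
    uint16_t sqhd;
    uint16_t sqid;
    uint16_t cid;
    uint16_t status; /* bit 0 is the phase bit */
};

struct sketch_cq {
    struct sketch_cpl ring[CQ_ENTRIES];
    uint32_t head;  /* next slot the host expects a completion in */
    uint8_t phase;  /* phase value that marks a new entry on this pass */
};

static void cq_init(struct sketch_cq *cq)
{
    memset(cq, 0, sizeof(*cq)); /* all entries start with phase 0 */
    cq->phase = 1;              /* so the first pass looks for phase 1 */
}

/* Plays the device's role: write an entry carrying the given phase bit. */
static void device_post(struct sketch_cq *cq, uint32_t slot, uint16_t cid,
                        uint8_t phase)
{
    assert(slot < CQ_ENTRIES);
    cq->ring[slot].cid = cid;
    cq->ring[slot].status = (uint16_t)(phase & 1u);
}

/* Check the slot at head. Returns true and stores the CID if the phase bit
 * shows a new completion; otherwise returns false without touching head. */
static bool cq_poll(struct sketch_cq *cq, uint16_t *cid_out)
{
    const volatile struct sketch_cpl *cpl = &cq->ring[cq->head];

    if ((cpl->status & 1u) != cq->phase) {
        return false; /* stale entry left over from the previous pass */
    }
    *cid_out = cpl->cid;
    if (++cq->head == CQ_ENTRIES) {
        cq->head = 0;
        cq->phase ^= 1u; /* expected phase flips on every wrap */
    }
    return true;
}
```

After a full pass the old entries still carry the previous phase value, which is
why `cq_poll()` reports nothing new until the device overwrites a slot with the
flipped bit.
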
## The SPDK NVMe Driver I/O Path

Now that we know how the ring structures work, let's cover how the SPDK NVMe
driver uses them. The user constructs a queue pair early in the life cycle of
the program, so that's not part of the "hot" path. Then, they'll call functions
like `spdk_nvme_ns_cmd_read()` to perform an I/O operation. The user supplies a
data buffer, the target LBA, and the length, as well as other information like
which NVMe namespace the command is targeted at and which NVMe queue pair to
use. Finally, the user provides a callback function and context pointer that
will be called when a completion for the resulting command is discovered during
a later call to `spdk_nvme_qpair_process_completions()`.

The first stage in the driver is allocating a request object to track the
operation. The operations are asynchronous, so the driver can't simply track the
state of the request on the call stack. Allocating a new request object on the
heap would be far too slow, so SPDK keeps a pre-allocated set of request objects
inside of the NVMe queue pair object - `struct spdk_nvme_qpair`. The number of
requests allocated to the queue pair is larger than the actual queue depth of
the NVMe submission queue because SPDK supports a couple of key convenience
features. The first is software queueing - SPDK will allow the user to submit
more requests than the hardware queue can actually hold and SPDK will
automatically queue in software. The second is splitting. SPDK will split a
request for many reasons, some of which are outlined next. The number of request
objects is configurable at queue pair creation time and if not specified, SPDK
will pick a sensible number based on the hardware queue depth.

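The interplay between the pre-allocated request pool and software queueing can
be sketched as follows. This is a simplified model, not SPDK's implementation:
`sketch_request`, `sketch_qpair`, the pool size and the hardware depth are all
hypothetical, and the sketch only counts requests owned by the hardware rather
than building commands.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REQUESTS 8u   /* hypothetical pool size, larger than the HW depth */
#define HW_QUEUE_DEPTH 2u /* hypothetical number of hardware slots */

struct sketch_request {
    struct sketch_request *next; /* link for the free list or software queue */
    unsigned id;
};

struct sketch_qpair {
    struct sketch_request pool[NUM_REQUESTS]; /* allocated once, up front */
    struct sketch_request *free_list;
    struct sketch_request *queued_head; /* FIFO of requests waiting for a slot */
    struct sketch_request *queued_tail;
    unsigned in_hardware;
};

static void qpair_init(struct sketch_qpair *qp)
{
    qp->free_list = NULL;
    for (unsigned i = NUM_REQUESTS; i-- > 0;) {
        qp->pool[i].id = i;
        qp->pool[i].next = qp->free_list;
        qp->free_list = &qp->pool[i];
    }
    qp->queued_head = qp->queued_tail = NULL;
    qp->in_hardware = 0;
}

/* Take a request from the pool. It goes to the hardware if a slot is open,
 * otherwise onto the software queue. Returns -1 when the pool is exhausted. */
static int qpair_submit(struct sketch_qpair *qp)
{
    struct sketch_request *req = qp->free_list;

    if (req == NULL) {
        return -1;
    }
    qp->free_list = req->next;
    req->next = NULL;
    if (qp->in_hardware < HW_QUEUE_DEPTH) {
        qp->in_hardware++;
    } else if (qp->queued_tail != NULL) {
        qp->queued_tail->next = req;
        qp->queued_tail = req;
    } else {
        qp->queued_head = qp->queued_tail = req;
    }
    return 0;
}

/* A hardware completion: recycle the request, then promote the oldest
 * software-queued request into the slot that just opened. */
static void qpair_complete(struct sketch_qpair *qp, struct sketch_request *req)
{
    assert(qp->in_hardware > 0);
    req->next = qp->free_list;
    qp->free_list = req;
    qp->in_hardware--;
    if (qp->queued_head != NULL) {
        struct sketch_request *promoted = qp->queued_head;

        qp->queued_head = promoted->next;
        if (qp->queued_head == NULL) {
            qp->queued_tail = NULL;
        }
        promoted->next = NULL;
        qp->in_hardware++;
    }
}
```

With these numbers the caller can have eight requests outstanding even though
only two fit in the hardware at a time. No allocation happens on the submission
path because every request comes from the pool.
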
The second stage is building the 64-byte NVMe command itself. The command is
built into memory embedded into the request object - not directly into an NVMe
submission queue slot. Once the command has been constructed, SPDK attempts to
obtain an open slot in the NVMe submission queue. For each element in the
submission queue an object called a tracker is allocated. The trackers are
allocated in an array, so they can be quickly looked up by an index. The tracker
itself contains a pointer to the request currently occupying that slot. When a
particular tracker is obtained, the command's CID value is updated with the
index of the tracker. The NVMe specification requires the device to return that
CID value in the completion, so the request can be recovered by looking up the
tracker via the CID value and then following the pointer.

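The CID round trip described above can be sketched like this. All names here
(`sketch_io_request`, `sketch_tracker`, `tracker_acquire`, `tracker_complete`,
`count_completion`) are hypothetical, and the linear scan for a free tracker is
a simplification chosen to keep the example short, not a claim about how SPDK
finds free trackers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TRACKERS 4u /* hypothetical: one tracker per submission queue slot */

typedef void (*sketch_cb)(void *ctx);

struct sketch_io_request {
    sketch_cb cb_fn; /* user callback */
    void *cb_arg;    /* user context pointer */
};

struct sketch_tracker {
    struct sketch_io_request *req; /* NULL while the tracker is free */
};

struct sketch_tracker_table {
    struct sketch_tracker tr[NUM_TRACKERS];
};

/* Claim a free tracker for the request. The returned index is the value that
 * would be stored in the command's CID field. Returns -1 if none is free. */
static int tracker_acquire(struct sketch_tracker_table *t,
                           struct sketch_io_request *req)
{
    for (uint16_t i = 0; i < NUM_TRACKERS; i++) {
        if (t->tr[i].req == NULL) {
            t->tr[i].req = req;
            return i;
        }
    }
    return -1;
}

/* Handle a completion: the CID indexes the tracker array, the tracker points
 * at the request, and the request carries the user's callback. */
static void tracker_complete(struct sketch_tracker_table *t, uint16_t cid)
{
    struct sketch_io_request *req;

    assert(cid < NUM_TRACKERS && t->tr[cid].req != NULL);
    req = t->tr[cid].req;
    t->tr[cid].req = NULL; /* tracker is free again */
    req->cb_fn(req->cb_arg);
}

/* Example callback: counts completions through the context pointer. */
static void count_completion(void *ctx)
{
    (*(int *)ctx)++;
}
```

Because the lookup is a plain array index, recovering the request from a
completion costs the same no matter how many commands are in flight.
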
Once a tracker (slot) is obtained, the data buffer associated with it is
processed to build a PRP list. That's essentially an NVMe scatter gather list,
although it is a bit more restricted. The user provides SPDK with the virtual
address of the buffer, so SPDK has to perform a page table lookup to find the
physical address (pa) or I/O virtual address (iova) backing that virtual
memory. A virtually contiguous memory region may not be physically contiguous,
so this may result in a PRP list with multiple elements. Sometimes the result
is a set of physical addresses that can't actually be expressed as a single PRP
list, so SPDK will automatically and transparently split the user operation
into two separate requests. For more information on how memory is managed,
see @ref memory.

The PRP list is not built until a tracker is obtained because the PRP list
description must be allocated in DMA-able memory and can be quite large. Since
SPDK typically allocates a large number of requests, we didn't want to allocate
enough space to pre-build the worst case scenario PRP list, especially given
that the common case does not require a separate PRP list at all.

Each NVMe command has two PRP list elements embedded into it, so a separate PRP
list isn't required if the request is 4KiB (or if it is 8KiB and aligned
perfectly). Profiling shows that this section of the code is not a major
contributor to the overall CPU use.

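The 4KiB and 8KiB cases follow from simple arithmetic, sketched below. The
sketch assumes a 4KiB memory page size and a buffer that is contiguous in the
address space the device sees. The function names and `SKETCH_PAGE_SIZE` are
hypothetical and do not correspond to SPDK functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4KiB memory pages */

/* Number of PRP entries needed to describe a buffer. The first entry may
 * start anywhere inside a page; every later entry starts on a page boundary,
 * so the count is the number of pages the buffer touches. */
static uint32_t prp_entries_needed(uint64_t addr, uint32_t len)
{
    uint64_t offset;

    assert(len > 0);
    offset = addr % SKETCH_PAGE_SIZE;
    return (uint32_t)((offset + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE);
}

/* The command has room for two entries. Anything beyond that needs a
 * separate PRP list in DMA-able memory. */
static bool needs_separate_prp_list(uint64_t addr, uint32_t len)
{
    return prp_entries_needed(addr, len) > 2;
}
```

For example, an 8KiB buffer at a page-aligned address touches two pages and fits
in the command, while the same 8KiB starting 512 bytes into a page touches three
pages and needs a separate list.
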
With a tracker filled out, SPDK copies the 64-byte command into the actual NVMe
submission queue slot and then rings the submission queue tail doorbell to tell
the device to go process it. SPDK then returns to the user without waiting
for a completion.

The user can periodically call `spdk_nvme_qpair_process_completions()` to tell
SPDK to examine the completion queue. Specifically, it reads the phase bit of
the next expected completion slot and when it flips, looks at the CID value to
find the tracker, which points at the request object. The request object
contains a function pointer that the user provided initially, which is then
called to complete the command.

The `spdk_nvme_qpair_process_completions()` function will keep advancing to the
next completion slot until it runs out of completions, at which point it will
write the completion queue head doorbell to let the device know that it can use
the completion queue slots for new completions, and then return.