# NVMe Driver {#nvme}

# In this document {#nvme_toc}

* @ref nvme_intro
* @ref nvme_examples
* @ref nvme_interface
* @ref nvme_design
* @ref nvme_fabrics_host
* @ref nvme_multi_process
* @ref nvme_hotplug
* @ref nvme_cuse

# Introduction {#nvme_intro}

The NVMe driver is a C library that may be linked directly into an application;
it provides direct, zero-copy data transfer to and from
[NVMe SSDs](http://nvmexpress.org/). It is entirely passive, meaning that it spawns
no threads and only performs actions in response to function calls from the
application itself. The library controls NVMe devices by directly mapping the
[PCI BAR](https://en.wikipedia.org/wiki/PCI_configuration_space) into the local
process and performing [MMIO](https://en.wikipedia.org/wiki/Memory-mapped_I/O).
I/O is submitted asynchronously via queue pairs, and the general flow is
similar to Linux's
[libaio](http://man7.org/linux/man-pages/man2/io_submit.2.html).

More recently, the library has been improved to also connect to remote NVMe
devices via NVMe over Fabrics. Users may now call spdk_nvme_probe() on both
local PCI buses and on remote NVMe over Fabrics discovery services. The API is
otherwise unchanged.

# Examples {#nvme_examples}

## Getting Started with Hello World {#nvme_helloworld}

There are a number of examples provided that demonstrate how to use the NVMe
library. They are all in the [examples/nvme](https://github.com/spdk/spdk/tree/master/examples/nvme)
directory in the repository. The best place to start is
[hello_world](https://github.com/spdk/spdk/blob/master/examples/nvme/hello_world/hello_world.c).

## Running Benchmarks with Fio Plugin {#nvme_fioplugin}

SPDK provides a plugin to the very popular [fio](https://github.com/axboe/fio)
tool for running some basic benchmarks. See the fio plugin
[guide](https://github.com/spdk/spdk/blob/master/examples/nvme/fio_plugin/)
for more details.

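As a purely illustrative sketch (the exact plugin location, ioengine name, and
filename syntax are described in the guide linked above and may differ between
SPDK versions), a fio job file for the SPDK NVMe plugin might look roughly like
the one below, run with the plugin preloaded, e.g.
`LD_PRELOAD=<path to built fio_plugin> fio example.fio`:

~~~
# example.fio -- illustrative only; see the fio plugin guide for the
# authoritative options and plugin path
[global]
ioengine=spdk      # ioengine provided by the preloaded SPDK plugin
thread=1
direct=1
rw=randread
bs=4k
iodepth=128
time_based=1
runtime=60

[test]
# Local NVMe SSD by PCIe address (':' replaced with '.') and namespace id
filename=trtype=PCIe traddr=0000.04.00.0 ns=1
~~~
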
## Running Benchmarks with Perf Tool {#nvme_perf}

The NVMe perf utility in [examples/nvme/perf](https://github.com/spdk/spdk/tree/master/examples/nvme/perf)
is another example that can be used for performance tests. The fio
tool is widely used because it is very flexible. However, that flexibility adds
overhead and reduces the efficiency of SPDK. Therefore, SPDK provides a perf
benchmarking tool which has minimal overhead during benchmarking. We have
measured up to 2.6 times more IOPS/core when using perf vs. fio with the
4K 100% Random Read workload. The perf benchmarking tool provides several
run time options to support the most common workloads. The following examples
demonstrate how to use perf.

Example: Using perf for a 4K 100% Random Read workload to a local NVMe SSD for 300 seconds
~~~{.sh}
perf -q 128 -o 4096 -w randread -r 'trtype:PCIe traddr:0000:04:00.0' -t 300
~~~

Example: Using perf for a 4K 100% Random Read workload to a remote NVMe SSD exported over the network via NVMe-oF
~~~{.sh}
perf -q 128 -o 4096 -w randread -r 'trtype:RDMA adrfam:IPv4 traddr:192.168.100.8 trsvcid:4420' -t 300
~~~

Example: Using perf for a 4K 70/30 Random Read/Write mix workload to all local NVMe SSDs for 300 seconds
~~~{.sh}
perf -q 128 -o 4096 -w randrw -M 70 -t 300
~~~

Example: Using perf for an extended LBA format CRC guard test to a local NVMe SSD.
The user must write to the SSD before reading the LBAs back from it.
~~~{.sh}
perf -q 1 -o 4096 -w write -r 'trtype:PCIe traddr:0000:04:00.0' -t 300 -e 'PRACT=0,PRCKH=GUARD'
perf -q 1 -o 4096 -w read -r 'trtype:PCIe traddr:0000:04:00.0' -t 200 -e 'PRACT=0,PRCKH=GUARD'
~~~

# Public Interface {#nvme_interface}

- spdk/nvme.h

Key Functions                                | Description
-------------------------------------------- | -----------
spdk_nvme_probe()                            | @copybrief spdk_nvme_probe()
spdk_nvme_ctrlr_alloc_io_qpair()             | @copybrief spdk_nvme_ctrlr_alloc_io_qpair()
spdk_nvme_ctrlr_get_ns()                     | @copybrief spdk_nvme_ctrlr_get_ns()
spdk_nvme_ns_cmd_read()                      | @copybrief spdk_nvme_ns_cmd_read()
spdk_nvme_ns_cmd_readv()                     | @copybrief spdk_nvme_ns_cmd_readv()
spdk_nvme_ns_cmd_read_with_md()              | @copybrief spdk_nvme_ns_cmd_read_with_md()
spdk_nvme_ns_cmd_write()                     | @copybrief spdk_nvme_ns_cmd_write()
spdk_nvme_ns_cmd_writev()                    | @copybrief spdk_nvme_ns_cmd_writev()
spdk_nvme_ns_cmd_write_with_md()             | @copybrief spdk_nvme_ns_cmd_write_with_md()
spdk_nvme_ns_cmd_write_zeroes()              | @copybrief spdk_nvme_ns_cmd_write_zeroes()
spdk_nvme_ns_cmd_dataset_management()        | @copybrief spdk_nvme_ns_cmd_dataset_management()
spdk_nvme_ns_cmd_flush()                     | @copybrief spdk_nvme_ns_cmd_flush()
spdk_nvme_qpair_process_completions()        | @copybrief spdk_nvme_qpair_process_completions()
spdk_nvme_ctrlr_cmd_admin_raw()              | @copybrief spdk_nvme_ctrlr_cmd_admin_raw()
spdk_nvme_ctrlr_process_admin_completions()  | @copybrief spdk_nvme_ctrlr_process_admin_completions()
spdk_nvme_ctrlr_cmd_io_raw()                 | @copybrief spdk_nvme_ctrlr_cmd_io_raw()
spdk_nvme_ctrlr_cmd_io_raw_with_md()         | @copybrief spdk_nvme_ctrlr_cmd_io_raw_with_md()

# NVMe Driver Design {#nvme_design}

## NVMe I/O Submission {#nvme_io_submission}

I/O is submitted to an NVMe namespace using the spdk_nvme_ns_cmd_xxx functions. The NVMe
driver submits the I/O request as an NVMe submission queue entry on the queue
pair specified in the command. The function returns immediately, prior to the
completion of the command. The application must poll for I/O completion on each
queue pair with outstanding I/O to receive completion callbacks by calling
spdk_nvme_qpair_process_completions().

@sa spdk_nvme_ns_cmd_read, spdk_nvme_ns_cmd_write, spdk_nvme_ns_cmd_dataset_management,
spdk_nvme_ns_cmd_flush, spdk_nvme_qpair_process_completions

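The following is a minimal sketch of this flow. It assumes `ctrlr` and `ns` were
obtained during attach (for example as in the hello_world example) and that `buf`
is a DMA-capable buffer, such as one allocated with spdk_zmalloc(); error handling
is abbreviated.

~~~
#include <stdio.h>
#include "spdk/nvme.h"

static bool g_read_done;

static void
read_complete(void *cb_arg, const struct spdk_nvme_cpl *cpl)
{
	/* Runs from within spdk_nvme_qpair_process_completions(). */
	if (spdk_nvme_cpl_is_error(cpl)) {
		fprintf(stderr, "read failed\n");
	}
	g_read_done = true;
}

static void
read_first_block(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns, void *buf)
{
	struct spdk_nvme_qpair *qpair;
	int rc;

	/* Create an I/O queue pair with default options. */
	qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
	if (qpair == NULL) {
		return;
	}

	/* Submit a one-block read of LBA 0; this returns before the command completes. */
	rc = spdk_nvme_ns_cmd_read(ns, qpair, buf, 0, 1, read_complete, NULL, 0);
	if (rc != 0) {
		spdk_nvme_ctrlr_free_io_qpair(qpair);
		return;
	}

	/* Poll this queue pair until the completion callback has been invoked. */
	while (!g_read_done) {
		spdk_nvme_qpair_process_completions(qpair, 0);
	}

	spdk_nvme_ctrlr_free_io_qpair(qpair);
}
~~~
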
### Fused operations {#nvme_fuses}

To "fuse" two commands, the first command should have the SPDK_NVME_IO_FLAGS_FUSE_FIRST
I/O flag set, and the next one should have the SPDK_NVME_IO_FLAGS_FUSE_SECOND flag set.

In addition, the following rules must be met to execute two commands as an atomic unit:

 - The commands shall be inserted next to each other in the same submission queue.
 - The LBA range should be the same for the two commands.

For example, to send a fused compare-and-write operation, the user must call
spdk_nvme_ns_cmd_compare followed by spdk_nvme_ns_cmd_write, and make sure no other
operations are submitted in between on the same queue, as in the example below:

~~~
	rc = spdk_nvme_ns_cmd_compare(ns, qpair, cmp_buf, 0, 1, nvme_fused_first_cpl_cb,
			NULL, SPDK_NVME_IO_FLAGS_FUSE_FIRST);
	if (rc != 0) {
		...
	}

	rc = spdk_nvme_ns_cmd_write(ns, qpair, write_buf, 0, 1, nvme_fused_second_cpl_cb,
			NULL, SPDK_NVME_IO_FLAGS_FUSE_SECOND);
	if (rc != 0) {
		...
	}
~~~

The NVMe specification currently defines compare-and-write as a fused operation.
Support for compare-and-write is reported by the controller flag
SPDK_NVME_CTRLR_COMPARE_AND_WRITE_SUPPORTED.

### Scaling Performance {#nvme_scaling}

NVMe queue pairs (struct spdk_nvme_qpair) provide parallel submission paths for
I/O. I/O may be submitted on multiple queue pairs simultaneously from different
threads. Queue pairs contain no locks or atomics, however, so a given queue
pair may only be used by a single thread at a time. This requirement is not
enforced by the NVMe driver (doing so would require a lock), and violating this
requirement results in undefined behavior.

The number of queue pairs allowed is dictated by the NVMe SSD itself. The
specification allows for thousands, but most devices support between 32
and 128. The specification makes no guarantees about the performance available from
each queue pair, but in practice the full performance of a device is almost
always achievable using just one queue pair. For example, if a device claims to
be capable of 450,000 I/O per second at queue depth 128, in practice it does
not matter if the driver is using 4 queue pairs each with queue depth 32, or a
single queue pair with queue depth 128.

Given the above, the easiest threading model for an application using SPDK is
to spawn a fixed number of threads in a pool and dedicate a single NVMe queue
pair to each thread. A further improvement would be to pin each thread to a
separate CPU core, and often the SPDK documentation will use "CPU core" and
"thread" interchangeably because we have this threading model in mind.

The NVMe driver takes no locks in the I/O path, so it scales linearly in terms
of performance per thread as long as a queue pair and a CPU core are dedicated
to each new thread. In order to take full advantage of this scaling,
applications should consider organizing their internal data structures such
that data is assigned exclusively to a single thread. All operations that
require that data should be done by sending a request to the owning thread.
This results in a message passing architecture, as opposed to a locking
architecture, and will result in superior scaling across CPU cores.

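The sketch below illustrates this one-queue-pair-per-thread model using plain
pthreads (this is not an SPDK threading API, just an illustration). `g_ctrlr`,
the work submission placeholder, and the thread count are assumptions for the
sake of the example.

~~~
#include <pthread.h>
#include "spdk/nvme.h"

/* Assumed to have been attached earlier, e.g. from an attach_cb. */
extern struct spdk_nvme_ctrlr *g_ctrlr;

static volatile bool g_running = true;

static void *
io_thread(void *arg)
{
	/* Each thread owns exactly one queue pair; only this thread ever
	 * submits to it or polls it, so no locks are required. */
	struct spdk_nvme_qpair *qpair;

	qpair = spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, NULL, 0);
	if (qpair == NULL) {
		return NULL;
	}

	while (g_running) {
		/* Submit I/O for data owned exclusively by this thread ... */

		/* ... then reap completions for this queue pair only. */
		spdk_nvme_qpair_process_completions(qpair, 0);
	}

	spdk_nvme_ctrlr_free_io_qpair(qpair);
	return NULL;
}

static void
start_io_threads(pthread_t *threads, int num_threads)
{
	for (int i = 0; i < num_threads; i++) {
		/* Pinning each thread to its own CPU core (e.g. with
		 * pthread_setaffinity_np()) further improves scaling. */
		pthread_create(&threads[i], NULL, io_thread, NULL);
	}
}
~~~
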
## NVMe Driver Internal Memory Usage {#nvme_memory_usage}

The SPDK NVMe driver provides a zero-copy data transfer path, which means that
the driver allocates no internal data buffers for I/O commands. However, some
admin commands may involve a data copy, depending on the API used by the user.

Each queue pair has a number of trackers used to track commands submitted by the
caller. The number of trackers for I/O queues depends on the user's requested queue
size and on the value read from the controller capabilities register field Maximum
Queue Entries Supported (MQES, a 0-based value). Each tracker has a fixed size of
4096 bytes, so the maximum memory used for the trackers of each I/O queue is
(MQES + 1) * 4 KiB.

I/O queue pairs can be allocated in host memory; this is used for most NVMe
controllers. Some NVMe controllers which support the Controller Memory Buffer feature
can place I/O queue pairs in their PCI BAR space. The SPDK NVMe driver can place the
I/O submission queue in the controller memory buffer, depending on the user's input
and the controller's capabilities. Each submission queue entry (SQE) and completion
queue entry (CQE) consumes 64 bytes and 16 bytes respectively. Therefore, the maximum
memory used for the queue entries of each I/O queue pair is (MQES + 1) * (64 + 16) bytes.

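As a worked example, assume a controller that reports MQES = 1023 (i.e. 1024 queue
entries) and an I/O queue pair created at that full size:

~~~
trackers:          1024 * 4 KiB         = 4 MiB per I/O queue pair
SQ + CQ entries:   1024 * (64 B + 16 B) = 80 KiB per I/O queue pair
~~~
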
# NVMe over Fabrics Host Support {#nvme_fabrics_host}

The NVMe driver supports connecting to remote NVMe-oF targets and
interacting with them in the same manner as local NVMe SSDs.

## Specifying Remote NVMe over Fabrics Targets {#nvme_fabrics_trid}

The method for connecting to a remote NVMe-oF target is very similar
to the normal enumeration process for local PCIe-attached NVMe devices.
To connect to a remote NVMe over Fabrics subsystem, the user may call
spdk_nvme_probe() with the `trid` parameter specifying the address of
the NVMe-oF target.

The caller may fill out the spdk_nvme_transport_id structure manually
or use the spdk_nvme_transport_id_parse() function to convert a
human-readable string representation into the required structure.

The spdk_nvme_transport_id may contain the address of a discovery service
or a single NVM subsystem.  If a discovery service address is specified,
the NVMe library will call the spdk_nvme_probe() `probe_cb` for each
discovered NVM subsystem, which allows the user to select the desired
subsystems to be attached.  Alternatively, if the address specifies a
single NVM subsystem directly, the NVMe library will call `probe_cb`
for just that subsystem; this allows the user to skip the discovery step
and connect directly to a subsystem with a known address.

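A minimal sketch of this flow is shown below. The address string is a placeholder
and the callbacks are simplified for illustration; a real application would record
the attached controller in attach_cb and apply its own selection logic in probe_cb.

~~~
#include <string.h>
#include "spdk/nvme.h"

static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	/* Return true to attach to this (discovered) NVM subsystem. */
	return true;
}

static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	  struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
	/* The controller is now attached and ready for namespace and queue pair setup. */
}

static int
connect_remote_target(void)
{
	struct spdk_nvme_transport_id trid;
	int rc;

	memset(&trid, 0, sizeof(trid));

	/* Placeholder address of a discovery service or a single NVM subsystem. */
	rc = spdk_nvme_transport_id_parse(&trid,
		"trtype:RDMA adrfam:IPv4 traddr:192.168.100.8 trsvcid:4420");
	if (rc != 0) {
		return rc;
	}

	/* Enumerates the remote target just like a local PCIe bus scan. */
	return spdk_nvme_probe(&trid, NULL, probe_cb, attach_cb, NULL);
}
~~~
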
## RDMA Limitations

Please refer to the NVMe-oF target's @ref nvmf_rdma_limitations.

# NVMe Multi Process {#nvme_multi_process}

This capability enables the SPDK NVMe driver to support multiple processes accessing the
same NVMe device. The NVMe driver allocates critical structures from shared memory, so
that each process can map that memory and create its own queue pairs or share the admin
queue. There is a limited number of I/O queue pairs per NVMe controller.

The primary motivation for this feature is to support management tools that can attach
to long running applications, perform some maintenance work or gather information, and
then detach.

## Configuration {#nvme_multi_process_configuration}

DPDK EAL allows different types of processes to be spawned, each with different permissions
on the hugepage memory used by the applications.

There are two types of processes:

1. a primary process which initializes the shared memory and has full privileges and
2. a secondary process which can attach to the primary process by mapping its shared memory
   regions and perform NVMe operations including creating queue pairs.

This feature is enabled by default and is controlled by selecting a value for the shared
memory group ID. This ID is a positive integer and two applications with the same shared
memory group ID will share memory. The first application with a given shared memory group
ID will be considered the primary and all others secondary.

Example: identical shm_id and non-overlapping core masks
~~~{.sh}
./perf options [AIO device(s)]...
	[-c core mask for I/O submission/completion]
	[-i shared memory group ID]

./perf -q 1 -o 4096 -w randread -c 0x1 -t 60 -i 1
./perf -q 8 -o 131072 -w write -c 0x10 -t 60 -i 1
~~~

## Limitations {#nvme_multi_process_limitations}

1. Two processes sharing memory may not share any cores in their core mask.
2. If a primary process exits while secondary processes are still running, those processes
   will continue to run. However, a new primary process cannot be created.
3. Applications are responsible for coordinating access to logical blocks.
4. If a process exits unexpectedly, the allocated memory will be released when the last
   process exits.

@sa spdk_nvme_probe, spdk_nvme_ctrlr_process_admin_completions

# NVMe Hotplug {#nvme_hotplug}

At the NVMe driver level, we provide the following support for hotplug:

1. Hotplug event detection:
   The user of the NVMe library can call spdk_nvme_probe() periodically to detect
   hotplug events, as sketched after this list. The probe_cb, followed by the attach_cb,
   will be called for each new device detected. The user may optionally also provide a
   remove_cb that will be called if a previously attached NVMe device is no longer
   present on the system. All subsequent I/O to the removed device will return an error.

2. Hot remove of an NVMe device under I/O load:
   When a device is hot removed while I/O is occurring, all access to the PCI BAR will
   result in a SIGBUS error. The NVMe driver automatically handles this case by installing
   a SIGBUS handler and remapping the PCI BAR to a new, placeholder memory location.
   This means I/O in flight during a hot remove will complete with an appropriate error
   code and will not crash the application.

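A minimal sketch of such a periodic probe loop follows. The callbacks are simplified
and the one-second polling interval is just an illustration.

~~~
#include <unistd.h>
#include "spdk/nvme.h"

static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	return true;	/* Attach to every newly detected device. */
}

static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	  struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
	/* A new device appeared; set up namespaces and queue pairs here. */
}

static void
remove_cb(void *cb_ctx, struct spdk_nvme_ctrlr *ctrlr)
{
	/* A previously attached device was removed; all further I/O to it
	 * will return an error. */
}

static void
hotplug_poll_loop(void)
{
	for (;;) {
		/* Periodically re-enumerate the local PCIe bus (trid == NULL);
		 * the callbacks report any newly added or removed devices. */
		spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, remove_cb);
		sleep(1);
	}
}
~~~
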
@sa spdk_nvme_probe

# NVMe Character Devices {#nvme_cuse}

This feature is considered experimental.

![NVMe character devices processing diagram](nvme_cuse.svg)

For each controller, as well as for each of its namespaces, a character device is
created at these locations:
~~~{.sh}
    /dev/spdk/nvmeX
    /dev/spdk/nvmeXnY
    ...
~~~
where X is the unique SPDK NVMe controller index and Y is the namespace ID.

Requests from CUSE are handled by pthreads that are created when the controller and
namespace character devices are created. Those threads pass the I/O or admin commands
via a ring to a thread that processes them using nvme_io_msg_process().

Ioctls that request information gathered when attaching the NVMe controller receive an
immediate response, without being passed through the ring.

This interface reserves one I/O qpair per controller for submitting the I/O.

## Enabling cuse support for NVMe

CUSE support is disabled by default. To enable support for NVMe devices, SPDK
must be compiled with "./configure --with-nvme-cuse".

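For example, from the top of the SPDK repository:

~~~{.sh}
./configure --with-nvme-cuse
make
~~~
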
## Limitations

NVMe namespaces are created as character devices and their use may be limited for
tools expecting block devices.

Sysfs is not updated by SPDK.

SPDK NVMe CUSE creates nodes in the "/dev/spdk/" directory to explicitly differentiate
them from other devices. Tools that only search in the "/dev" directory might not work
with SPDK NVMe CUSE.

The SCSI to NVMe translation layer is not implemented. Tools that use this layer to
identify, manage or operate the device might not work properly or their use may be limited.

### Examples of using smartctl

The smartctl tool recognizes the device type based on the device path. If none of the
expected patterns match, the SCSI translation layer is used to identify the device.

To use smartctl, the '-d nvme' parameter must be used in addition to the full path to
the NVMe device.

~~~{.sh}
    smartctl -d nvme -i /dev/spdk/nvme0
    smartctl -d nvme -H /dev/spdk/nvme1
    ...
~~~