# NVMe Driver {#nvme}

## In this document {#nvme_toc}

- @ref nvme_intro
- @ref nvme_examples
- @ref nvme_interface
- @ref nvme_design
- @ref nvme_fabrics_host
- @ref nvme_multi_process
- @ref nvme_hotplug
- @ref nvme_cuse
- @ref nvme_led

## Introduction {#nvme_intro}

The NVMe driver is a C library that may be linked directly into an application.
It provides direct, zero-copy data transfer to and from
[NVMe SSDs](http://nvmexpress.org/). It is entirely passive, meaning that it spawns
no threads and only performs actions in response to function calls from the
application itself. The library controls NVMe devices by directly mapping the
[PCI BAR](https://en.wikipedia.org/wiki/PCI_configuration_space) into the local
process and performing [MMIO](https://en.wikipedia.org/wiki/Memory-mapped_I/O).
I/O is submitted asynchronously via queue pairs, and the general flow is broadly
similar to Linux's
[libaio](http://man7.org/linux/man-pages/man2/io_submit.2.html).

More recently, the library has been improved to also connect to remote NVMe
devices via NVMe over Fabrics. Users may now call spdk_nvme_probe() on both
local PCI buses and on remote NVMe over Fabrics discovery services. The API is
otherwise unchanged.
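
As a brief illustration of that flow, the following minimal sketch probes the local PCIe
bus and attaches every controller found. The callback names and error handling are
illustrative only; the hello_world example linked in the next section shows the complete
sequence.

~~~c
#include <stdio.h>

#include "spdk/env.h"
#include "spdk/nvme.h"

/* Called once per discovered controller; returning true requests an attach. */
static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	return true;
}

/* Called after the controller has been attached and initialized. */
static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	  struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
	printf("Attached to %s\n", trid->traddr);
}

int
main(void)
{
	struct spdk_env_opts opts;

	spdk_env_opts_init(&opts);
	if (spdk_env_init(&opts) < 0) {
		return 1;
	}

	/* A NULL trid means "enumerate the local PCIe bus". */
	if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0) {
		return 1;
	}
	return 0;
}
~~~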

## Examples {#nvme_examples}

### Getting Started with Hello World {#nvme_helloworld}

There are a number of examples provided that demonstrate how to use the NVMe
library. They are all in the [examples/nvme](https://github.com/spdk/spdk/tree/master/examples/nvme)
directory in the repository. The best place to start is
[hello_world](https://github.com/spdk/spdk/blob/master/examples/nvme/hello_world/hello_world.c).

### Running Benchmarks with Fio Plugin {#nvme_fioplugin}

SPDK provides a plugin to the very popular [fio](https://github.com/axboe/fio)
tool for running some basic benchmarks. See the fio plugin
[guide](https://github.com/spdk/spdk/blob/master/app/fio/nvme/)
for more details.

### Running Benchmarks with Perf Tool {#nvme_perf}

The NVMe perf utility in [app/spdk_nvme_perf](https://github.com/spdk/spdk/tree/master/app/spdk_nvme_perf)
can also be used for performance tests. The fio
tool is widely used because it is very flexible. However, that flexibility adds
overhead and reduces the efficiency of SPDK. Therefore, SPDK provides a perf
benchmarking tool which has minimal overhead during benchmarking. We have
measured up to 2.6 times more IOPS/core when using perf vs. fio with the
4K 100% Random Read workload. The perf benchmarking tool provides several
run time options to support the most common workloads. The following examples
demonstrate how to use perf.

Example: Using perf for 4K 100% Random Read workload to a local NVMe SSD for 300 seconds
~~~{.sh}
perf -q 128 -o 4096 -w randread -r 'trtype:PCIe traddr:0000:04:00.0' -t 300
~~~

Example: Using perf for 4K 100% Random Read workload to a remote NVMe SSD exported over the network via NVMe-oF
~~~{.sh}
perf -q 128 -o 4096 -w randread -r 'trtype:RDMA adrfam:IPv4 traddr:192.168.100.8 trsvcid:4420' -t 300
~~~

Example: Using perf for 4K 70/30 Random Read/Write mix workload to all local NVMe SSDs for 300 seconds
~~~{.sh}
perf -q 128 -o 4096 -w randrw -M 70 -t 300
~~~

Example: Using perf for an extended LBA format CRC guard test to a local NVMe SSD;
the user must write to the SSD before reading the LBA back
~~~{.sh}
perf -q 1 -o 4096 -w write -r 'trtype:PCIe traddr:0000:04:00.0' -t 300 -e 'PRACT=0,PRCKH=GUARD'
perf -q 1 -o 4096 -w read -r 'trtype:PCIe traddr:0000:04:00.0' -t 200 -e 'PRACT=0,PRCKH=GUARD'
~~~

## Public Interface {#nvme_interface}

- spdk/nvme.h

Key Functions                               | Description
------------------------------------------- | -----------
spdk_nvme_probe()                           | @copybrief spdk_nvme_probe()
spdk_nvme_ctrlr_alloc_io_qpair()            | @copybrief spdk_nvme_ctrlr_alloc_io_qpair()
spdk_nvme_ctrlr_get_ns()                    | @copybrief spdk_nvme_ctrlr_get_ns()
spdk_nvme_ns_cmd_read()                     | @copybrief spdk_nvme_ns_cmd_read()
spdk_nvme_ns_cmd_readv()                    | @copybrief spdk_nvme_ns_cmd_readv()
spdk_nvme_ns_cmd_read_with_md()             | @copybrief spdk_nvme_ns_cmd_read_with_md()
spdk_nvme_ns_cmd_write()                    | @copybrief spdk_nvme_ns_cmd_write()
spdk_nvme_ns_cmd_writev()                   | @copybrief spdk_nvme_ns_cmd_writev()
spdk_nvme_ns_cmd_write_with_md()            | @copybrief spdk_nvme_ns_cmd_write_with_md()
spdk_nvme_ns_cmd_write_zeroes()             | @copybrief spdk_nvme_ns_cmd_write_zeroes()
spdk_nvme_ns_cmd_dataset_management()       | @copybrief spdk_nvme_ns_cmd_dataset_management()
spdk_nvme_ns_cmd_flush()                    | @copybrief spdk_nvme_ns_cmd_flush()
spdk_nvme_qpair_process_completions()       | @copybrief spdk_nvme_qpair_process_completions()
spdk_nvme_ctrlr_cmd_admin_raw()             | @copybrief spdk_nvme_ctrlr_cmd_admin_raw()
spdk_nvme_ctrlr_process_admin_completions() | @copybrief spdk_nvme_ctrlr_process_admin_completions()
spdk_nvme_ctrlr_cmd_io_raw()                | @copybrief spdk_nvme_ctrlr_cmd_io_raw()
spdk_nvme_ctrlr_cmd_io_raw_with_md()        | @copybrief spdk_nvme_ctrlr_cmd_io_raw_with_md()

## NVMe Driver Design {#nvme_design}

### NVMe I/O Submission {#nvme_io_submission}

I/O is submitted to an NVMe namespace using the spdk_nvme_ns_cmd_xxx functions. The NVMe
driver submits the I/O request as an NVMe submission queue entry on the queue
pair specified in the command. The function returns immediately, prior to the
completion of the command. The application must poll for I/O completion on each
queue pair with outstanding I/O to receive completion callbacks by calling
spdk_nvme_qpair_process_completions().

@sa spdk_nvme_ns_cmd_read, spdk_nvme_ns_cmd_write, spdk_nvme_ns_cmd_dataset_management,
spdk_nvme_ns_cmd_flush, spdk_nvme_qpair_process_completions
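
A minimal sketch of this submit-then-poll flow is shown below. The buffer size, LBA, and
callback names are illustrative only; the key points are that the payload comes from
DMA-able memory and that completions only arrive while the queue pair is being polled.

~~~c
#include <stdbool.h>
#include <stdio.h>

#include "spdk/env.h"
#include "spdk/nvme.h"

/* Illustrative completion callback; cb_arg points at a "done" flag. */
static void
read_complete(void *arg, const struct spdk_nvme_cpl *cpl)
{
	if (spdk_nvme_cpl_is_error(cpl)) {
		fprintf(stderr, "read failed\n");
	}
	*(bool *)arg = true;
}

static void
read_one_block(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair)
{
	bool done = false;
	/* Payload buffers must come from pinned, DMA-able memory. */
	void *buf = spdk_zmalloc(0x1000, 0x1000, NULL, SPDK_ENV_SOCKET_ID_ANY,
				 SPDK_MALLOC_DMA);

	/* Submit a 1-block read of LBA 0; this returns before the command completes. */
	if (spdk_nvme_ns_cmd_read(ns, qpair, buf, 0, 1, read_complete, &done, 0) != 0) {
		spdk_free(buf);
		return;
	}

	/* Poll the queue pair until our completion callback fires. */
	while (!done) {
		spdk_nvme_qpair_process_completions(qpair, 0);
	}
	spdk_free(buf);
}
~~~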

#### Fused operations {#nvme_fuses}

To "fuse" two commands, the first command should have the SPDK_NVME_IO_FLAGS_FUSE_FIRST
io flag set, and the next one should have the SPDK_NVME_IO_FLAGS_FUSE_SECOND io flag set.

In addition, the following rules must be met to execute two commands as an atomic unit:

- The commands shall be inserted next to each other in the same submission queue.
- The LBA range shall be the same for the two commands.

For example, to send a fused compare-and-write operation, the user must call spdk_nvme_ns_cmd_compare
followed by spdk_nvme_ns_cmd_write and make sure no other operations are submitted
in between on the same queue, as in the example below:

~~~c
	rc = spdk_nvme_ns_cmd_compare(ns, qpair, cmp_buf, 0, 1, nvme_fused_first_cpl_cb,
			NULL, SPDK_NVME_IO_FLAGS_FUSE_FIRST);
	if (rc != 0) {
		...
	}

	rc = spdk_nvme_ns_cmd_write(ns, qpair, write_buf, 0, 1, nvme_fused_second_cpl_cb,
			NULL, SPDK_NVME_IO_FLAGS_FUSE_SECOND);
	if (rc != 0) {
		...
	}
~~~

The NVMe specification currently defines compare-and-write as a fused operation.
Support for compare-and-write is reported by the controller flag
SPDK_NVME_CTRLR_COMPARE_AND_WRITE_SUPPORTED.

#### Scaling Performance {#nvme_scaling}

NVMe queue pairs (struct spdk_nvme_qpair) provide parallel submission paths for
I/O. I/O may be submitted on multiple queue pairs simultaneously from different
threads. Queue pairs contain no locks or atomics, however, so a given queue
pair may only be used by a single thread at a time. This requirement is not
enforced by the NVMe driver (doing so would require a lock), and violating this
requirement results in undefined behavior.

The number of queue pairs allowed is dictated by the NVMe SSD itself. The
specification allows for thousands, but most devices support between 32
and 128. The specification makes no guarantees about the performance available from
each queue pair, but in practice the full performance of a device is almost
always achievable using just one queue pair. For example, if a device claims to
be capable of 450,000 I/O per second at queue depth 128, in practice it does
not matter if the driver is using 4 queue pairs each with queue depth 32, or a
single queue pair with queue depth 128.

Given the above, the easiest threading model for an application using SPDK is
to spawn a fixed number of threads in a pool and dedicate a single NVMe queue
pair to each thread. A further improvement would be to pin each thread to a
separate CPU core, and often the SPDK documentation will use "CPU core" and
"thread" interchangeably because we have this threading model in mind.

The NVMe driver takes no locks in the I/O path, so it scales linearly in terms
of performance per thread as long as a queue pair and a CPU core are dedicated
to each new thread. In order to take full advantage of this scaling,
applications should consider organizing their internal data structures such
that data is assigned exclusively to a single thread. All operations that
require that data should be done by sending a request to the owning thread.
This results in a message passing architecture, as opposed to a locking
architecture, and will result in superior scaling across CPU cores.
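
As a hedged sketch of that threading model (the worker count and the use of raw pthreads
here are illustrative, not part of the SPDK API), each worker allocates its own queue pair
and is the only thread that ever touches it:

~~~c
#include <pthread.h>

#include "spdk/nvme.h"

#define NUM_WORKERS 4

/* Each worker owns exactly one queue pair, so no locking is required. */
static void *
worker_fn(void *arg)
{
	struct spdk_nvme_ctrlr *ctrlr = arg;
	struct spdk_nvme_qpair *qpair;

	qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
	if (qpair == NULL) {
		return NULL;
	}

	/* Submit I/O on this qpair and poll it for completions here.
	 * A real application would also pin this thread to a dedicated CPU core.
	 */

	spdk_nvme_ctrlr_free_io_qpair(qpair);
	return NULL;
}

static void
start_workers(struct spdk_nvme_ctrlr *ctrlr)
{
	pthread_t workers[NUM_WORKERS];

	for (int i = 0; i < NUM_WORKERS; i++) {
		pthread_create(&workers[i], NULL, worker_fn, ctrlr);
	}
	for (int i = 0; i < NUM_WORKERS; i++) {
		pthread_join(workers[i], NULL);
	}
}
~~~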

### NVMe Driver Internal Memory Usage {#nvme_memory_usage}

The SPDK NVMe driver provides a zero-copy data transfer path, which means that
the driver allocates no data buffers for I/O commands. However, some Admin commands
involve data copies, depending on the API used by the user.

Each queue pair has a number of trackers used to track commands submitted by the
caller. The number of trackers for an I/O queue depends on the queue size requested
by the user and the Maximum Queue Entries Supported (MQES, a 0-based value) field read
from the controller capabilities register. Each tracker has a fixed size of 4096 bytes,
so the maximum memory used for the trackers of each I/O queue is (MQES + 1) * 4 KiB.

I/O queue pairs are usually allocated in host memory, and this is what most NVMe
controllers use. Some NVMe controllers support a Controller Memory Buffer (CMB), which
allows I/O queue pairs to be placed in the controller's PCI BAR space; depending on the
user's input and the controller's capabilities, the SPDK NVMe driver can place the I/O
submission queue in the controller memory buffer.
Each submission queue entry (SQE) and completion queue entry (CQE) consumes 64 bytes
and 16 bytes respectively. Therefore, the maximum memory used for the entries of each
I/O queue pair is (MQES + 1) * (64 + 16) bytes.
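
For example, for a controller reporting MQES = 127 (i.e. 128 queue entries), the trackers
consume up to 128 * 4 KiB = 512 KiB per I/O queue, and the queue entries themselves consume
128 * (64 + 16) bytes = 10 KiB per I/O queue pair.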

## NVMe over Fabrics Host Support {#nvme_fabrics_host}

The NVMe driver supports connecting to remote NVMe-oF targets and
interacting with them in the same manner as local NVMe SSDs.

### Specifying Remote NVMe over Fabrics Targets {#nvme_fabrics_trid}

The method for connecting to a remote NVMe-oF target is very similar
to the normal enumeration process for local PCIe-attached NVMe devices.
To connect to a remote NVMe over Fabrics subsystem, the user may call
spdk_nvme_probe() with the `trid` parameter specifying the address of
the NVMe-oF target.

The caller may fill out the spdk_nvme_transport_id structure manually
or use the spdk_nvme_transport_id_parse() function to convert a
human-readable string representation into the required structure.

The spdk_nvme_transport_id may contain the address of a discovery service
or a single NVM subsystem.  If a discovery service address is specified,
the NVMe library will call the spdk_nvme_probe() `probe_cb` for each
discovered NVM subsystem, which allows the user to select the desired
subsystems to be attached.  Alternatively, if the address specifies a
single NVM subsystem directly, the NVMe library will call `probe_cb`
for just that subsystem; this allows the user to skip the discovery step
and connect directly to a subsystem with a known address.
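
For instance, the sketch below builds a transport ID from a human-readable string and
passes it to spdk_nvme_probe(), reusing callbacks like the probe_cb/attach_cb shown
earlier. The address and subsystem NQN are placeholders, not values the library requires.

~~~c
#include <string.h>

#include "spdk/nvme.h"

static int
probe_nvmf_target(void)
{
	struct spdk_nvme_transport_id trid;

	memset(&trid, 0, sizeof(trid));

	/* Placeholder address and subsystem NQN; adjust for the actual target.
	 * Using the well-known discovery NQN (SPDK_NVMF_DISCOVERY_NQN) as subnqn
	 * would instead enumerate every subsystem the discovery service reports.
	 */
	if (spdk_nvme_transport_id_parse(&trid,
			"trtype:RDMA adrfam:IPv4 traddr:192.168.100.8 trsvcid:4420 "
			"subnqn:nqn.2016-06.io.spdk:cnode1") != 0) {
		return -1;
	}

	/* probe_cb/attach_cb are the same callbacks used for local PCIe probing. */
	return spdk_nvme_probe(&trid, NULL, probe_cb, attach_cb, NULL);
}
~~~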

### RDMA Limitations

Please refer to the NVMe-oF target's @ref nvmf_rdma_limitations.

## NVMe Multi Process {#nvme_multi_process}

This capability enables the SPDK NVMe driver to support multiple processes accessing the
same NVMe device. The NVMe driver allocates critical structures from shared memory, so
that each process can map that memory and create its own queue pairs or share the admin
queue. There is a limited number of I/O queue pairs per NVMe controller.

The primary motivation for this feature is to support management tools that can attach
to long running applications, perform some maintenance work or gather information, and
then detach.

### Configuration {#nvme_multi_process_configuration}

DPDK EAL allows different types of processes to be spawned, each with different permissions
on the hugepage memory used by the applications.

There are two types of processes:

1. a primary process which initializes the shared memory and has full privileges and
2. a secondary process which can attach to the primary process by mapping its shared memory
   regions and perform NVMe operations including creating queue pairs.

This feature is enabled by default and is controlled by selecting a value for the shared
memory group ID. This ID is a positive integer and two applications with the same shared
memory group ID will share memory. The first application with a given shared memory group
ID will be considered the primary and all others secondary.

Example: identical shm_id and non-overlapping core masks
~~~{.sh}
spdk_nvme_perf options [AIO device(s)]...
	[-c core mask for I/O submission/completion]
	[-i shared memory group ID]

spdk_nvme_perf -q 1 -o 4096 -w randread -c 0x1 -t 60 -i 1
spdk_nvme_perf -q 8 -o 131072 -w write -c 0x10 -t 60 -i 1
~~~

### Limitations {#nvme_multi_process_limitations}

1. Two processes sharing memory may not share any cores in their core mask.
2. If a primary process exits while secondary processes are still running, those processes
   will continue to run. However, a new primary process cannot be created.
3. Applications are responsible for coordinating access to logical blocks.
4. If a process exits unexpectedly, the allocated memory will be released when the last
   process exits.

@sa spdk_nvme_probe, spdk_nvme_ctrlr_process_admin_completions

## NVMe Hotplug {#nvme_hotplug}

At the NVMe driver level, we provide the following support for hotplug:

1. Hotplug event detection:
   The user of the NVMe library can call spdk_nvme_probe() periodically to detect
   hotplug events, as in the sketch after this list. The probe_cb, followed by the
   attach_cb, will be called for each new device detected. The user may optionally
   also provide a remove_cb that will be called if a previously attached NVMe device
   is no longer present on the system. All subsequent I/O to the removed device will
   return an error.

2. Hot remove of NVMe devices under I/O load:
   When a device is hot removed while I/O is occurring, all access to the PCI BAR will
   result in a SIGBUS error. The NVMe driver automatically handles this case by installing
   a SIGBUS handler and remapping the PCI BAR to a new, placeholder memory location.
   This means I/O in flight during a hot remove will complete with an appropriate error
   code and will not crash the application.

@sa spdk_nvme_probe
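
A hedged sketch of such a polling loop is shown below, reusing the probe_cb and attach_cb
callbacks from the earlier sketches; the one-second interval and the body of remove_cb are
illustrative only.

~~~c
#include <stdio.h>
#include <unistd.h>

#include "spdk/nvme.h"

/* Called when a previously attached controller disappears from the system. */
static void
remove_cb(void *cb_ctx, struct spdk_nvme_ctrlr *ctrlr)
{
	/* All further I/O to this controller will fail; the application should
	 * stop submitting and eventually call spdk_nvme_detach() on it.
	 */
	printf("Controller removed\n");
}

static void
hotplug_poll_loop(void)
{
	for (;;) {
		/* Re-probing the local PCIe bus picks up newly inserted devices
		 * and reports removed ones through remove_cb.
		 */
		spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, remove_cb);
		sleep(1);
	}
}
~~~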

## NVMe Character Devices {#nvme_cuse}

### Design

![NVMe character devices processing diagram](nvme_cuse.svg)

For each controller, as well as for each namespace, character devices are created in the
following locations:
~~~{.sh}
    /dev/spdk/nvmeX
    /dev/spdk/nvmeXnY
    ...
~~~
where X is the unique SPDK NVMe controller index and Y is the namespace ID.

Requests from CUSE are handled by pthreads that are created when the controller and
namespace devices are created. These threads pass the I/O or admin commands via a ring
to a thread that processes them using nvme_io_msg_process().

Ioctls that request information obtained when attaching the NVMe controller receive an
immediate response, without passing through the ring.

This interface reserves one additional qpair per controller for sending down the I/O.

### Usage

#### Enabling cuse support for NVMe

CUSE support is enabled by default on Linux. Make sure to install the required dependencies:
~~~{.sh}
sudo scripts/pkgdep.sh
~~~

#### Creating NVMe-CUSE device

First make sure to prepare the environment (see @ref getting_started).
This includes loading the CUSE kernel module.
Any NVMe controller attached to a running SPDK application can be
exposed via the NVMe-CUSE interface. When the SPDK application is shut down,
the NVMe-CUSE devices are unregistered.

~~~{.sh}
$ sudo scripts/setup.sh
$ sudo modprobe cuse
$ sudo build/bin/spdk_tgt
# Continue in another session
$ sudo scripts/rpc.py bdev_nvme_attach_controller -b Nvme0 -t PCIe -a 0000:82:00.0
Nvme0n1
$ sudo scripts/rpc.py bdev_nvme_get_controllers
[
  {
    "name": "Nvme0",
    "trid": {
      "trtype": "PCIe",
      "traddr": "0000:82:00.0"
    }
  }
]
$ sudo scripts/rpc.py bdev_nvme_cuse_register -n Nvme0
$ ls /dev/spdk/
nvme0  nvme0n1
~~~

#### Example of using nvme-cli

Most nvme-cli commands can point to a specific controller or namespace by providing a path to it.
This can be leveraged to issue commands to the SPDK NVMe-CUSE devices.

~~~{.sh}
sudo nvme id-ctrl /dev/spdk/nvme0
sudo nvme smart-log /dev/spdk/nvme0
sudo nvme id-ns /dev/spdk/nvme0n1
~~~

Note: the `nvme list` command does not display SPDK NVMe-CUSE devices;
see nvme-cli [PR #773](https://github.com/linux-nvme/nvme-cli/pull/773).

#### Examples of using smartctl

The smartctl tool recognizes the device type based on the device path. If none of the expected
patterns match, the SCSI translation layer is used to identify the device.

To use smartctl, the '-d nvme' parameter must be used in addition to the full path to
the NVMe device.

~~~{.sh}
    smartctl -d nvme -i /dev/spdk/nvme0
    smartctl -d nvme -H /dev/spdk/nvme1
    ...
~~~

### Limitations

NVMe namespaces are created as character devices, and their use may be limited for
tools expecting block devices.

Sysfs is not updated by SPDK.

SPDK NVMe CUSE creates nodes in the "/dev/spdk/" directory to explicitly differentiate
them from other devices. Tools that only search in the "/dev" directory might not work
with SPDK NVMe CUSE.

The SCSI to NVMe translation layer is not implemented. Tools that use this layer to
identify, manage, or operate devices might not work properly, or their use may be limited.

### SPDK_CUSE_GET_TRANSPORT ioctl command

nvme-cli mostly uses IOCTLs to obtain information, but transport information is
obtained through sysfs. Since SPDK does not populate sysfs, the SPDK plugin leverages
an SPDK/CUSE specific ioctl to get the information.

~~~{.c}
#define SPDK_CUSE_GET_TRANSPORT _IOWR('n', 0x1, struct cuse_transport)
~~~

~~~{.c}
struct cuse_transport {
	char trstring[SPDK_NVMF_TRSTRING_MAX_LEN + 1];
	char traddr[SPDK_NVMF_TRADDR_MAX_LEN + 1];
} tr;
~~~
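
A minimal sketch of issuing this ioctl from a C program might look like the following. It
assumes the definitions above (and the SPDK_NVMF_* length constants from the SPDK headers)
are in scope, and the device path is just an example.

~~~c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int
main(void)
{
	struct cuse_transport tr;
	/* Example CUSE node; any /dev/spdk/nvmeX controller device works. */
	int fd = open("/dev/spdk/nvme0", O_RDONLY);

	if (fd < 0) {
		return 1;
	}
	if (ioctl(fd, SPDK_CUSE_GET_TRANSPORT, &tr) == 0) {
		printf("trtype: %s traddr: %s\n", tr.trstring, tr.traddr);
	}
	close(fd);
	return 0;
}
~~~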

## NVMe LED management {#nvme_led}

It is possible to use the ledctl(8) utility to control the state of LEDs in systems supporting
NPEM (Native PCIe Enclosure Management), even when the NVMe devices are controlled by SPDK.
However, in this case it is necessary to determine the slot device number because the block device
is unavailable. The [ledctl.sh](https://github.com/spdk/spdk/tree/master/scripts/ledctl.sh) script
can be used to help with this. It takes the name of the nvme bdev and invokes ledctl with
appropriate options.