# NVMe Driver {#nvme}

## In this document {#nvme_toc}

- @ref nvme_intro
- @ref nvme_examples
- @ref nvme_interface
- @ref nvme_design
- @ref nvme_fabrics_host
- @ref nvme_multi_process
- @ref nvme_hotplug
- @ref nvme_cuse
- @ref nvme_led

## Introduction {#nvme_intro}

The NVMe driver is a C library that may be linked directly into an application.
It provides direct, zero-copy data transfer to and from
[NVMe SSDs](http://nvmexpress.org/). It is entirely passive, meaning that it spawns
no threads and only performs actions in response to function calls from the
application itself. The library controls NVMe devices by directly mapping the
[PCI BAR](https://en.wikipedia.org/wiki/PCI_configuration_space) into the local
process and performing [MMIO](https://en.wikipedia.org/wiki/Memory-mapped_I/O).
I/O is submitted asynchronously via queue pairs and the general flow is
similar to that of Linux's
[libaio](http://man7.org/linux/man-pages/man2/io_submit.2.html).

More recently, the library has been improved to also connect to remote NVMe
devices via NVMe over Fabrics. Users may now call spdk_nvme_probe() on both
local PCI buses and on remote NVMe over Fabrics discovery services. The API is
otherwise unchanged.

## Examples {#nvme_examples}

### Getting Started with Hello World {#nvme_helloworld}

There are a number of examples provided that demonstrate how to use the NVMe
library. They are all in the [examples/nvme](https://github.com/spdk/spdk/tree/master/examples/nvme)
directory in the repository. The best place to start is
[hello_world](https://github.com/spdk/spdk/blob/master/examples/nvme/hello_world/hello_world.c).
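
For orientation, below is a heavily condensed sketch of the flow that hello_world demonstrates
in full: initialize the SPDK environment, probe for local controllers, and obtain a namespace
and I/O queue pair. The callback and variable names (`probe_cb`, `attach_cb`, `g_ctrlr`) are
illustrative choices, and error handling is abbreviated.

~~~c
#include "spdk/env.h"
#include "spdk/nvme.h"

static struct spdk_nvme_ctrlr *g_ctrlr;

/* Return true to tell the driver to attach to this controller. */
static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	return true;
}

/* Called after the controller has been attached and initialized. */
static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	  struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
	g_ctrlr = ctrlr;
}

int
main(int argc, char **argv)
{
	struct spdk_env_opts opts;
	struct spdk_nvme_ns *ns;
	struct spdk_nvme_qpair *qpair;

	spdk_env_opts_init(&opts);
	opts.name = "nvme_sketch";
	if (spdk_env_init(&opts) < 0) {
		return 1;
	}

	/* A NULL trid enumerates NVMe devices on the local PCIe bus. */
	if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0 || g_ctrlr == NULL) {
		return 1;
	}

	ns = spdk_nvme_ctrlr_get_ns(g_ctrlr, 1);	/* namespace IDs start at 1 */
	qpair = spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, NULL, 0);

	/* ... submit I/O to ns on qpair, see @ref nvme_io_submission ... */

	spdk_nvme_ctrlr_free_io_qpair(qpair);
	spdk_nvme_detach(g_ctrlr);
	return 0;
}
~~~
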
### Running Benchmarks with Fio Plugin {#nvme_fioplugin}

SPDK provides a plugin to the very popular [fio](https://github.com/axboe/fio)
tool for running some basic benchmarks. See the fio plugin
[guide](https://github.com/spdk/spdk/blob/master/app/fio/nvme/)
for more details.

### Running Benchmarks with Perf Tool {#nvme_perf}

The NVMe perf utility in [app/spdk_nvme_perf](https://github.com/spdk/spdk/tree/master/app/spdk_nvme_perf)
is another example that can be used for performance tests. The fio
tool is widely used because it is very flexible. However, that flexibility adds
overhead and reduces the efficiency of SPDK. Therefore, SPDK provides a perf
benchmarking tool which has minimal overhead during benchmarking. We have
measured up to 2.6 times more IOPS/core when using perf vs. fio with a
4K 100% Random Read workload. The perf benchmarking tool provides several
run time options to support the most common workloads. The following examples
demonstrate how to use perf.

Example: Using perf for 4K 100% Random Read workload to a local NVMe SSD for 300 seconds
~~~{.sh}
perf -q 128 -o 4096 -w randread -r 'trtype:PCIe traddr:0000:04:00.0' -t 300
~~~

Example: Using perf for 4K 100% Random Read workload to a remote NVMe SSD exported over the network via NVMe-oF
~~~{.sh}
perf -q 128 -o 4096 -w randread -r 'trtype:RDMA adrfam:IPv4 traddr:192.168.100.8 trsvcid:4420' -t 300
~~~

Example: Using perf for 4K 70/30 Random Read/Write mix workload to all local NVMe SSDs for 300 seconds
~~~{.sh}
perf -q 128 -o 4096 -w randrw -M 70 -t 300
~~~

Example: Using perf for an extended LBA format CRC guard test to a local NVMe SSD;
the user must write to the SSD before reading the LBA back from it
~~~{.sh}
perf -q 1 -o 4096 -w write -r 'trtype:PCIe traddr:0000:04:00.0' -t 300 -e 'PRACT=0,PRCKH=GUARD'
perf -q 1 -o 4096 -w read -r 'trtype:PCIe traddr:0000:04:00.0' -t 200 -e 'PRACT=0,PRCKH=GUARD'
~~~

## Public Interface {#nvme_interface}

- spdk/nvme.h

Key Functions                               | Description
------------------------------------------- | -----------
spdk_nvme_probe()                           | @copybrief spdk_nvme_probe()
spdk_nvme_ctrlr_alloc_io_qpair()            | @copybrief spdk_nvme_ctrlr_alloc_io_qpair()
spdk_nvme_ctrlr_get_ns()                    | @copybrief spdk_nvme_ctrlr_get_ns()
spdk_nvme_ns_cmd_read()                     | @copybrief spdk_nvme_ns_cmd_read()
spdk_nvme_ns_cmd_readv()                    | @copybrief spdk_nvme_ns_cmd_readv()
spdk_nvme_ns_cmd_read_with_md()             | @copybrief spdk_nvme_ns_cmd_read_with_md()
spdk_nvme_ns_cmd_write()                    | @copybrief spdk_nvme_ns_cmd_write()
spdk_nvme_ns_cmd_writev()                   | @copybrief spdk_nvme_ns_cmd_writev()
spdk_nvme_ns_cmd_write_with_md()            | @copybrief spdk_nvme_ns_cmd_write_with_md()
spdk_nvme_ns_cmd_write_zeroes()             | @copybrief spdk_nvme_ns_cmd_write_zeroes()
spdk_nvme_ns_cmd_dataset_management()       | @copybrief spdk_nvme_ns_cmd_dataset_management()
spdk_nvme_ns_cmd_flush()                    | @copybrief spdk_nvme_ns_cmd_flush()
spdk_nvme_qpair_process_completions()       | @copybrief spdk_nvme_qpair_process_completions()
spdk_nvme_ctrlr_cmd_admin_raw()             | @copybrief spdk_nvme_ctrlr_cmd_admin_raw()
spdk_nvme_ctrlr_process_admin_completions() | @copybrief spdk_nvme_ctrlr_process_admin_completions()
spdk_nvme_ctrlr_cmd_io_raw()                | @copybrief spdk_nvme_ctrlr_cmd_io_raw()
spdk_nvme_ctrlr_cmd_io_raw_with_md()        | @copybrief spdk_nvme_ctrlr_cmd_io_raw_with_md()

## NVMe Driver Design {#nvme_design}

### NVMe I/O Submission {#nvme_io_submission}

I/O is submitted to an NVMe namespace using the spdk_nvme_ns_cmd_xxx functions. The NVMe
driver submits the I/O request as an NVMe submission queue entry on the queue
pair specified in the command. The function returns immediately, prior to the
completion of the command. The application must poll for I/O completion on each
queue pair with outstanding I/O to receive completion callbacks by calling
spdk_nvme_qpair_process_completions().

@sa spdk_nvme_ns_cmd_read, spdk_nvme_ns_cmd_write, spdk_nvme_ns_cmd_dataset_management,
spdk_nvme_ns_cmd_flush, spdk_nvme_qpair_process_completions
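
To illustrate this submit-then-poll model, here is a minimal sketch that issues a single read
and polls the queue pair until its completion callback runs. It assumes `ns` and `qpair` were
obtained as in the hello_world example; `read_done`, `read_first_block`, and the `g_done` flag
are illustrative names, not part of the SPDK API.

~~~c
#include <stdio.h>
#include <stdbool.h>

#include "spdk/env.h"
#include "spdk/nvme.h"

static bool g_done;

/* Completion callback; runs from spdk_nvme_qpair_process_completions(). */
static void
read_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
	if (spdk_nvme_cpl_is_error(cpl)) {
		fprintf(stderr, "read failed\n");
	}
	g_done = true;
}

static void
read_first_block(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair)
{
	/* Data buffers must come from spdk_malloc()/spdk_zmalloc() so they are DMA-able. */
	void *buf = spdk_zmalloc(spdk_nvme_ns_get_sector_size(ns), 0x1000, NULL,
				 SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);
	int rc;

	/* Submission returns immediately; the command has not completed yet. */
	rc = spdk_nvme_ns_cmd_read(ns, qpair, buf, 0 /* LBA */, 1 /* LBA count */,
				   read_done, NULL, 0);
	if (rc != 0) {
		spdk_free(buf);
		return;
	}

	/* Poll the queue pair until our completion callback has run. */
	while (!g_done) {
		spdk_nvme_qpair_process_completions(qpair, 0);
	}

	spdk_free(buf);
}
~~~
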
#### Fused operations {#nvme_fuses}

To "fuse" two commands, the first command should have the SPDK_NVME_IO_FLAGS_FUSE_FIRST
I/O flag set, and the next one should have the SPDK_NVME_IO_FLAGS_FUSE_SECOND flag set.

In addition, the following rules must be met to execute the two commands as an atomic unit:

- The commands shall be inserted next to each other in the same submission queue.
- The LBA range should be the same for the two commands.

For example, to send a fused compare-and-write operation, the user must call spdk_nvme_ns_cmd_compare
followed by spdk_nvme_ns_cmd_write and make sure no other operations are submitted
in between on the same queue, as in the example below:

~~~c
rc = spdk_nvme_ns_cmd_compare(ns, qpair, cmp_buf, 0, 1, nvme_fused_first_cpl_cb,
			      NULL, SPDK_NVME_IO_FLAGS_FUSE_FIRST);
if (rc != 0) {
	...
}

rc = spdk_nvme_ns_cmd_write(ns, qpair, write_buf, 0, 1, nvme_fused_second_cpl_cb,
			    NULL, SPDK_NVME_IO_FLAGS_FUSE_SECOND);
if (rc != 0) {
	...
}
~~~

The NVMe specification currently defines compare-and-write as a fused operation.
Support for compare-and-write is reported by the controller flag
SPDK_NVME_CTRLR_COMPARE_AND_WRITE_SUPPORTED.

#### Scaling Performance {#nvme_scaling}

NVMe queue pairs (struct spdk_nvme_qpair) provide parallel submission paths for
I/O. I/O may be submitted on multiple queue pairs simultaneously from different
threads. Queue pairs contain no locks or atomics, however, so a given queue
pair may only be used by a single thread at a time. This requirement is not
enforced by the NVMe driver (doing so would require a lock), and violating this
requirement results in undefined behavior.

The number of queue pairs allowed is dictated by the NVMe SSD itself. The
specification allows for thousands, but most devices support between 32
and 128. The specification makes no guarantees about the performance available from
each queue pair, but in practice the full performance of a device is almost
always achievable using just one queue pair. For example, if a device claims to
be capable of 450,000 I/Os per second at queue depth 128, in practice it does
not matter if the driver is using 4 queue pairs each with queue depth 32, or a
single queue pair with queue depth 128.

Given the above, the easiest threading model for an application using SPDK is
to spawn a fixed number of threads in a pool and dedicate a single NVMe queue
pair to each thread. A further improvement would be to pin each thread to a
separate CPU core, and often the SPDK documentation will use "CPU core" and
"thread" interchangeably because we have this threading model in mind.

The NVMe driver takes no locks in the I/O path, so it scales linearly in terms
of performance per thread as long as a queue pair and a CPU core are dedicated
to each new thread. In order to take full advantage of this scaling,
applications should consider organizing their internal data structures such
that data is assigned exclusively to a single thread. All operations that
require that data should be done by sending a request to the owning thread.
This results in a message passing architecture, as opposed to a locking
architecture, and will result in superior scaling across CPU cores.

### NVMe Driver Internal Memory Usage {#nvme_memory_usage}

The SPDK NVMe driver provides a zero-copy data transfer path, which means that
there are no data buffers for I/O commands. However, some Admin commands have
data copies depending on the API used by the user.

Each queue pair has a number of trackers used to track commands submitted by the
caller. The number of trackers for an I/O queue depends on the queue size requested
by the user and the value of the Maximum Queue Entries Supported (MQES, a 0-based
value) field in the controller capabilities register. Each tracker has a fixed size
of 4096 bytes, so the maximum memory used for each I/O queue is (MQES + 1) * 4 KiB.

I/O queue pairs are usually allocated in host memory; this is the case for most NVMe
controllers. However, for NVMe controllers that support a Controller Memory Buffer,
the SPDK NVMe driver can place the I/O submission queue in the controller memory buffer
(part of the controller's PCI BAR space), depending on the user's input and the
controller's capabilities. Each submission queue entry (SQE) and completion queue
entry (CQE) consumes 64 bytes and 16 bytes respectively. Therefore, the maximum memory
used for each I/O queue pair is (MQES + 1) * (64 + 16) bytes.
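
As a worked example of the two formulas above, using an assumed, purely illustrative MQES value:

~~~c
/* Illustrative values only: a controller reporting MQES = 1023 (0-based),
 * i.e. up to 1024 entries per queue, with the I/O queue created at full depth. */
#define MQES           1023u
#define ENTRIES        (MQES + 1)               /* 1024 entries                        */

#define TRACKER_BYTES  (ENTRIES * 4096u)        /* 4 MiB of trackers per I/O queue     */
#define SQ_CQ_BYTES    (ENTRIES * (64u + 16u))  /* 80 KiB of SQ + CQ entries per qpair */
~~~
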
## NVMe over Fabrics Host Support {#nvme_fabrics_host}

The NVMe driver supports connecting to remote NVMe-oF targets and
interacting with them in the same manner as local NVMe SSDs.

### Specifying Remote NVMe over Fabrics Targets {#nvme_fabrics_trid}

The method for connecting to a remote NVMe-oF target is very similar
to the normal enumeration process for local PCIe-attached NVMe devices.
To connect to a remote NVMe over Fabrics subsystem, the user may call
spdk_nvme_probe() with the `trid` parameter specifying the address of
the NVMe-oF target.

The caller may fill out the spdk_nvme_transport_id structure manually
or use the spdk_nvme_transport_id_parse() function to convert a
human-readable string representation into the required structure.

The spdk_nvme_transport_id may contain the address of a discovery service
or a single NVM subsystem. If a discovery service address is specified,
the NVMe library will call the spdk_nvme_probe() `probe_cb` for each
discovered NVM subsystem, which allows the user to select the desired
subsystems to be attached. Alternatively, if the address specifies a
single NVM subsystem directly, the NVMe library will call `probe_cb`
for just that subsystem; this allows the user to skip the discovery step
and connect directly to a subsystem with a known address.
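
For illustration, the sketch below parses the same `key:value` transport string format shown
in the perf examples above and probes the resulting discovery service. The function name, the
address, and the callbacks passed in are placeholders chosen for this example, not part of the
SPDK API.

~~~c
#include <stdio.h>

#include "spdk/nvme.h"
#include "spdk/nvmf_spec.h"

static int
connect_remote_subsystems(spdk_nvme_probe_cb probe_cb, spdk_nvme_attach_cb attach_cb)
{
	struct spdk_nvme_transport_id trid = {};

	/* Same "key:value" string format accepted by the perf tool's -r option. */
	if (spdk_nvme_transport_id_parse(&trid,
	    "trtype:RDMA adrfam:IPv4 traddr:192.168.100.8 trsvcid:4420") != 0) {
		return -1;
	}

	/* Point at the discovery service; probe_cb is then invoked once per
	 * discovered NVM subsystem. */
	snprintf(trid.subnqn, sizeof(trid.subnqn), "%s", SPDK_NVMF_DISCOVERY_NQN);

	return spdk_nvme_probe(&trid, NULL, probe_cb, attach_cb, NULL);
}
~~~
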
### RDMA Limitations

Please refer to the NVMe-oF target's @ref nvmf_rdma_limitations.

## NVMe Multi Process {#nvme_multi_process}

This capability enables the SPDK NVMe driver to support multiple processes accessing the
same NVMe device. The NVMe driver allocates critical structures from shared memory, so
that each process can map that memory and create its own queue pairs or share the admin
queue. There is a limited number of I/O queue pairs per NVMe controller.

The primary motivation for this feature is to support management tools that can attach
to long running applications, perform some maintenance work or gather information, and
then detach.

### Configuration {#nvme_multi_process_configuration}

DPDK EAL allows different types of processes to be spawned, each with different permissions
on the hugepage memory used by the applications.

There are two types of processes:

1. a primary process which initializes the shared memory and has full privileges, and
2. a secondary process which can attach to the primary process by mapping its shared memory
   regions and perform NVMe operations including creating queue pairs.

This feature is enabled by default and is controlled by selecting a value for the shared
memory group ID. This ID is a positive integer and two applications with the same shared
memory group ID will share memory. The first application with a given shared memory group
ID will be considered the primary and all others secondary.

Example: identical shm_id and non-overlapping core masks
~~~{.sh}
spdk_nvme_perf options [AIO device(s)]...
	[-c core mask for I/O submission/completion]
	[-i shared memory group ID]

spdk_nvme_perf -q 1 -o 4096 -w randread -c 0x1 -t 60 -i 1
spdk_nvme_perf -q 8 -o 131072 -w write -c 0x10 -t 60 -i 1
~~~

### Limitations {#nvme_multi_process_limitations}

1. Two processes sharing memory may not share any cores in their core mask.
2. If a primary process exits while secondary processes are still running, those processes
   will continue to run. However, a new primary process cannot be created.
3. Applications are responsible for coordinating access to logical blocks.
4. If a process exits unexpectedly, the allocated memory will be released when the last
   process exits.

@sa spdk_nvme_probe, spdk_nvme_ctrlr_process_admin_completions

## NVMe Hotplug {#nvme_hotplug}

At the NVMe driver level, we provide the following support for hotplug:

1. Hotplug event detection:
   The user of the NVMe library can call spdk_nvme_probe() periodically to detect
   hotplug events. The probe_cb, followed by the attach_cb, will be called for each
   new device detected. The user may optionally also provide a remove_cb that will be
   called if a previously attached NVMe device is no longer present on the system.
   All subsequent I/O to the removed device will return an error (see the sketch below).

2. Hot remove NVMe with I/O loads:
   When a device is hot removed while I/O is occurring, all access to the PCI BAR will
   result in a SIGBUS error. The NVMe driver automatically handles this case by installing
   a SIGBUS handler and remapping the PCI BAR to a new, placeholder memory location.
   This means I/O in flight during a hot remove will complete with an appropriate error
   code and will not crash the application.

@sa spdk_nvme_probe
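
The sketch below illustrates the periodic-probe pattern described in item 1 above. The callback
names and the one-second interval are arbitrary choices for illustration only.

~~~c
#include <unistd.h>
#include <stdio.h>
#include <stdbool.h>

#include "spdk/nvme.h"

static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	return true; /* attach to every newly discovered controller */
}

static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	  struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
	printf("controller attached: %s\n", trid->traddr);
}

static void
remove_cb(void *cb_ctx, struct spdk_nvme_ctrlr *ctrlr)
{
	/* The controller is gone; outstanding I/O will complete with errors.
	 * Clean up its qpairs and call spdk_nvme_detach() when convenient. */
	printf("controller removed\n");
}

static void
hotplug_poll_loop(void)
{
	for (;;) {
		/* Re-probing the local PCIe bus (NULL trid) detects both newly
		 * inserted and removed controllers. */
		spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, remove_cb);
		sleep(1);
	}
}
~~~
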
## NVMe Character Devices {#nvme_cuse}

### Design

For each controller, as well as each namespace, character devices are created in the
following locations:
~~~{.sh}
/dev/spdk/nvmeX
/dev/spdk/nvmeXnY
...
~~~
Where X is the unique SPDK NVMe controller index and Y is the namespace ID.

Requests from CUSE are handled by pthreads that are started when the controller and
namespace devices are created. They pass the I/O or admin commands via a ring to a
thread that processes them using nvme_io_msg_process().

Ioctls that request information obtained when attaching the NVMe controller receive an
immediate response, without passing through the ring.

This interface reserves one additional qpair per controller for sending down the I/O.

### Usage

#### Enabling cuse support for NVMe

CUSE support is enabled by default on Linux. Make sure to install the required dependencies:
~~~{.sh}
sudo scripts/pkgdep.sh
~~~

#### Creating NVMe-CUSE device

First make sure to prepare the environment (see @ref getting_started).
This includes loading the CUSE kernel module.
Any NVMe controller attached to a running SPDK application can be
exposed via the NVMe-CUSE interface. When the SPDK application is shut down,
the NVMe-CUSE devices are unregistered.

~~~{.sh}
$ sudo scripts/setup.sh
$ sudo modprobe cuse
$ sudo build/bin/spdk_tgt
# Continue in another session
$ sudo scripts/rpc.py bdev_nvme_attach_controller -b Nvme0 -t PCIe -a 0000:82:00.0
Nvme0n1
$ sudo scripts/rpc.py bdev_nvme_get_controllers
[
  {
    "name": "Nvme0",
    "trid": {
      "trtype": "PCIe",
      "traddr": "0000:82:00.0"
    }
  }
]
$ sudo scripts/rpc.py bdev_nvme_cuse_register -n Nvme0
$ ls /dev/spdk/
nvme0 nvme0n1
~~~

#### Example of using nvme-cli

Most nvme-cli commands can point to a specific controller or namespace by providing a path to it.
This can be leveraged to issue commands to the SPDK NVMe-CUSE devices.

~~~{.sh}
sudo nvme id-ctrl /dev/spdk/nvme0
sudo nvme smart-log /dev/spdk/nvme0
sudo nvme id-ns /dev/spdk/nvme0n1
~~~

Note: the `nvme list` command does not display SPDK NVMe-CUSE devices;
see nvme-cli [PR #773](https://github.com/linux-nvme/nvme-cli/pull/773).

#### Examples of using smartctl

The smartctl tool recognizes the device type based on the device path. If none of the expected
patterns match, the SCSI translation layer is used to identify the device.

To use smartctl, the '-d nvme' parameter must be used in addition to the full path to
the NVMe device.

~~~{.sh}
smartctl -d nvme -i /dev/spdk/nvme0
smartctl -d nvme -H /dev/spdk/nvme1
...
~~~

### Limitations

NVMe namespaces are created as character devices and their use may be limited for
tools expecting block devices.

Sysfs is not updated by SPDK.

SPDK NVMe CUSE creates nodes in the "/dev/spdk/" directory to explicitly differentiate
them from other devices. Tools that only search in the "/dev" directory might not work
with SPDK NVMe CUSE.

The SCSI to NVMe Translation Layer is not implemented. Tools that use this layer to
identify, manage or operate the device might not work properly, or their use may be limited.

### SPDK_CUSE_GET_TRANSPORT ioctl command

nvme-cli mostly uses IOCTLs to obtain information, but transport information is
obtained through sysfs. Since SPDK does not populate sysfs, the SPDK plugin leverages
an SPDK/CUSE-specific ioctl to get the information.

~~~{.c}
#define SPDK_CUSE_GET_TRANSPORT _IOWR('n', 0x1, struct cuse_transport)
~~~

~~~{.c}
struct cuse_transport {
	char trstring[SPDK_NVMF_TRSTRING_MAX_LEN + 1];
	char traddr[SPDK_NVMF_TRADDR_MAX_LEN + 1];
} tr;
~~~

## NVMe LED management {#nvme_led}

It is possible to use the ledctl(8) utility to control the state of LEDs in systems supporting
NPEM (Native PCIe Enclosure Management), even when the NVMe devices are controlled by SPDK.
However, in this case it is necessary to determine the slot device number because the block device
is unavailable. The [ledctl.sh](https://github.com/spdk/spdk/tree/master/scripts/ledctl.sh) script
can be used to help with this. It takes the name of the nvme bdev and invokes ledctl with
appropriate options.