# NVMe Driver {#nvme}

# In this document {#nvme_toc}

* @ref nvme_intro
* @ref nvme_examples
* @ref nvme_interface
* @ref nvme_design
* @ref nvme_fabrics_host
* @ref nvme_multi_process
* @ref nvme_hotplug

# Introduction {#nvme_intro}

The NVMe driver is a C library that may be linked directly into an application
to provide direct, zero-copy data transfer to and from
[NVMe SSDs](http://nvmexpress.org/). It is entirely passive, meaning that it spawns
no threads and only performs actions in response to function calls from the
application itself. The library controls NVMe devices by directly mapping the
[PCI BAR](https://en.wikipedia.org/wiki/PCI_configuration_space) into the local
process and performing [MMIO](https://en.wikipedia.org/wiki/Memory-mapped_I/O).
I/O is submitted asynchronously via queue pairs and the general flow isn't
entirely dissimilar from Linux's
[libaio](http://man7.org/linux/man-pages/man2/io_submit.2.html).

More recently, the library has been improved to also connect to remote NVMe
devices via NVMe over Fabrics. Users may now call spdk_nvme_probe() on both
local PCI buses and on remote NVMe over Fabrics discovery services. The API is
otherwise unchanged.

# Examples {#nvme_examples}

## Getting Started with Hello World {#nvme_helloworld}

There are a number of examples provided that demonstrate how to use the NVMe
library. They are all in the [examples/nvme](https://github.com/spdk/spdk/tree/master/examples/nvme)
directory in the repository. The best place to start is
[hello_world](https://github.com/spdk/spdk/blob/master/examples/nvme/hello_world/hello_world.c).
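The enumeration flow at the core of hello_world can be condensed to a few calls. The sketch below is a minimal illustration rather than a substitute for the full example: it assumes default environment options, attaches to every controller it finds, and omits namespace enumeration, I/O, and cleanup. The application name is a placeholder.

~~~{.c}
#include <stdio.h>

#include "spdk/env.h"
#include "spdk/nvme.h"

/* Return true to tell the library to attach to this controller. */
static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	printf("Probed %s\n", trid->traddr);
	return true;
}

/* Called once for each controller after it has been attached and initialized. */
static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	  struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
	printf("Attached to %s\n", trid->traddr);
	/* A real application would save ctrlr and enumerate its namespaces here. */
}

int
main(void)
{
	struct spdk_env_opts env_opts;

	/* Initialize the SPDK environment (hugepages, PCI access, etc.). */
	spdk_env_opts_init(&env_opts);
	env_opts.name = "nvme_sketch"; /* placeholder application name */
	if (spdk_env_init(&env_opts) < 0) {
		return 1;
	}

	/* A NULL transport ID enumerates all local PCIe NVMe devices. */
	if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0) {
		return 1;
	}

	return 0;
}
~~~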
## Running Benchmarks with Fio Plugin {#nvme_fioplugin}

SPDK provides a plugin to the very popular [fio](https://github.com/axboe/fio)
tool for running some basic benchmarks. See the fio start up
[guide](https://github.com/spdk/spdk/blob/master/examples/nvme/fio_plugin/)
for more details.

## Running Benchmarks with Perf Tool {#nvme_perf}

The NVMe perf utility in [examples/nvme/perf](https://github.com/spdk/spdk/tree/master/examples/nvme/perf)
is another of the provided examples and can also be used for performance tests. The fio
tool is widely used because it is very flexible. However, that flexibility adds
overhead and reduces the efficiency of SPDK. Therefore, SPDK provides a perf
benchmarking tool which has minimal overhead during benchmarking. We have
measured up to 2.6 times more IOPS/core when using perf vs. fio with the
4K 100% Random Read workload. The perf benchmarking tool provides several
run time options to support the most common workloads. The following examples
demonstrate how to use perf.

Example: Using perf for 4K 100% Random Read workload to a local NVMe SSD for 300 seconds
~~~{.sh}
perf -q 128 -o 4096 -w randread -r 'trtype:PCIe traddr:0000:04:00.0' -t 300
~~~

Example: Using perf for 4K 100% Random Read workload to a remote NVMe SSD exported over the network via NVMe-oF
~~~{.sh}
perf -q 128 -o 4096 -w randread -r 'trtype:RDMA adrfam:IPv4 traddr:192.168.100.8 trsvcid:4420' -t 300
~~~

Example: Using perf for 4K 70/30 Random Read/Write mix workload to all local NVMe SSDs for 300 seconds
~~~{.sh}
perf -q 128 -o 4096 -w randrw -M 70 -t 300
~~~

Example: Using perf for an extended LBA format CRC guard test to a local NVMe SSD.
The user must write to the SSD before reading the LBAs back.
~~~{.sh}
perf -q 1 -o 4096 -w write -r 'trtype:PCIe traddr:0000:04:00.0' -t 300 -e 'PRACT=0,PRCKH=GUARD'
perf -q 1 -o 4096 -w read -r 'trtype:PCIe traddr:0000:04:00.0' -t 200 -e 'PRACT=0,PRCKH=GUARD'
~~~

# Public Interface {#nvme_interface}

- spdk/nvme.h

Key Functions                               | Description
------------------------------------------- | -----------
spdk_nvme_probe()                           | @copybrief spdk_nvme_probe()
spdk_nvme_ctrlr_alloc_io_qpair()            | @copybrief spdk_nvme_ctrlr_alloc_io_qpair()
spdk_nvme_ctrlr_get_ns()                    | @copybrief spdk_nvme_ctrlr_get_ns()
spdk_nvme_ns_cmd_read()                     | @copybrief spdk_nvme_ns_cmd_read()
spdk_nvme_ns_cmd_readv()                    | @copybrief spdk_nvme_ns_cmd_readv()
spdk_nvme_ns_cmd_read_with_md()             | @copybrief spdk_nvme_ns_cmd_read_with_md()
spdk_nvme_ns_cmd_write()                    | @copybrief spdk_nvme_ns_cmd_write()
spdk_nvme_ns_cmd_writev()                   | @copybrief spdk_nvme_ns_cmd_writev()
spdk_nvme_ns_cmd_write_with_md()            | @copybrief spdk_nvme_ns_cmd_write_with_md()
spdk_nvme_ns_cmd_write_zeroes()             | @copybrief spdk_nvme_ns_cmd_write_zeroes()
spdk_nvme_ns_cmd_dataset_management()       | @copybrief spdk_nvme_ns_cmd_dataset_management()
spdk_nvme_ns_cmd_flush()                    | @copybrief spdk_nvme_ns_cmd_flush()
spdk_nvme_qpair_process_completions()       | @copybrief spdk_nvme_qpair_process_completions()
spdk_nvme_ctrlr_cmd_admin_raw()             | @copybrief spdk_nvme_ctrlr_cmd_admin_raw()
spdk_nvme_ctrlr_process_admin_completions() | @copybrief spdk_nvme_ctrlr_process_admin_completions()
spdk_nvme_ctrlr_cmd_io_raw()                | @copybrief spdk_nvme_ctrlr_cmd_io_raw()
spdk_nvme_ctrlr_cmd_io_raw_with_md()        | @copybrief spdk_nvme_ctrlr_cmd_io_raw_with_md()

# NVMe Driver Design {#nvme_design}

## NVMe I/O Submission {#nvme_io_submission}

I/O is submitted to an NVMe namespace using the spdk_nvme_ns_cmd_xxx functions. The NVMe
driver submits the I/O request as an NVMe submission queue entry on the queue
pair specified in the command. The function returns immediately, prior to the
completion of the command. The application must poll for I/O completion on each
queue pair with outstanding I/O to receive completion callbacks by calling
spdk_nvme_qpair_process_completions().

@sa spdk_nvme_ns_cmd_read, spdk_nvme_ns_cmd_write, spdk_nvme_ns_cmd_dataset_management,
spdk_nvme_ns_cmd_flush, spdk_nvme_qpair_process_completions
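To make the submit-then-poll flow concrete, the sketch below allocates an I/O queue pair, reads one block from LBA 0, and polls the queue pair until the completion callback fires. It assumes a controller and an active namespace have already been obtained (for example, from an attach callback) and keeps error handling to a minimum.

~~~{.c}
#include <stdio.h>

#include "spdk/env.h"
#include "spdk/nvme.h"

static volatile bool g_read_done;

/* Invoked from spdk_nvme_qpair_process_completions() when the read finishes. */
static void
read_complete(void *cb_arg, const struct spdk_nvme_cpl *cpl)
{
	if (spdk_nvme_cpl_is_error(cpl)) {
		fprintf(stderr, "read failed\n");
	}
	g_read_done = true;
}

static int
read_first_block(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
	struct spdk_nvme_qpair *qpair;
	void *buf;

	/* NULL options select the controller's default queue size. */
	qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
	if (qpair == NULL) {
		return -1;
	}

	/* I/O buffers must come from DMA-safe, pinned memory. */
	buf = spdk_zmalloc(spdk_nvme_ns_get_sector_size(ns), 0x1000, NULL,
			   SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);
	if (buf == NULL) {
		spdk_nvme_ctrlr_free_io_qpair(qpair);
		return -1;
	}

	/* Queue the read; this returns before the command completes. */
	if (spdk_nvme_ns_cmd_read(ns, qpair, buf, 0 /* LBA */, 1 /* LBA count */,
				  read_complete, NULL, 0) != 0) {
		spdk_free(buf);
		spdk_nvme_ctrlr_free_io_qpair(qpair);
		return -1;
	}

	/* Poll this queue pair until the completion callback has run. */
	while (!g_read_done) {
		spdk_nvme_qpair_process_completions(qpair, 0);
	}

	spdk_free(buf);
	spdk_nvme_ctrlr_free_io_qpair(qpair);
	return 0;
}
~~~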
### Scaling Performance {#nvme_scaling}

NVMe queue pairs (struct spdk_nvme_qpair) provide parallel submission paths for
I/O. I/O may be submitted on multiple queue pairs simultaneously from different
threads. Queue pairs contain no locks or atomics, however, so a given queue
pair may only be used by a single thread at a time. This requirement is not
enforced by the NVMe driver (doing so would require a lock), and violating this
requirement results in undefined behavior.

The number of queue pairs allowed is dictated by the NVMe SSD itself. The
specification allows for thousands, but most devices support between 32
and 128. The specification makes no guarantees about the performance available from
each queue pair, but in practice the full performance of a device is almost
always achievable using just one queue pair. For example, if a device claims to
be capable of 450,000 I/O per second at queue depth 128, in practice it does
not matter if the driver is using 4 queue pairs each with queue depth 32, or a
single queue pair with queue depth 128.

Given the above, the easiest threading model for an application using SPDK is
to spawn a fixed number of threads in a pool and dedicate a single NVMe queue
pair to each thread. A further improvement would be to pin each thread to a
separate CPU core, and often the SPDK documentation will use "CPU core" and
"thread" interchangeably because we have this threading model in mind.

The NVMe driver takes no locks in the I/O path, so it scales linearly in terms
of performance per thread as long as a queue pair and a CPU core are dedicated
to each new thread. In order to take full advantage of this scaling,
applications should consider organizing their internal data structures such
that data is assigned exclusively to a single thread. All operations that
require that data should be done by sending a request to the owning thread.
This results in a message passing architecture, as opposed to a locking
architecture, and will result in superior scaling across CPU cores.

## NVMe Driver Internal Memory Usage {#nvme_memory_usage}

The SPDK NVMe driver provides a zero-copy data transfer path, which means that
there are no data buffers for I/O commands. However, some Admin commands have
data copies depending on the API used by the user.

Each queue pair has a number of trackers used to track commands submitted by the
caller. The number of trackers for an I/O queue depends on the queue size requested
by the user and on the Maximum Queue Entries Supported (MQES, a 0-based value) field
read from the controller capabilities register. Each tracker has a fixed size of
4096 bytes, so the maximum memory used for the trackers of each I/O queue is
(MQES + 1) * 4 KiB.

I/O queue pairs are allocated in host memory for most NVMe controllers. For
controllers that support a Controller Memory Buffer, which exposes queue storage
in the controller's PCI BAR space, the SPDK NVMe driver can instead place the I/O
submission queue in the controller memory buffer, depending on the user's input
and the controller's capabilities. Each submission queue entry (SQE) and completion
queue entry (CQE) consumes 64 bytes and 16 bytes respectively. Therefore, the
maximum memory used for the queues of each I/O queue pair is (MQES + 1) * (64 + 16)
bytes. For example, a controller reporting MQES = 1023 (1024 entries) would require
at most 1024 * 4 KiB = 4 MiB of tracker memory and 1024 * 80 bytes = 80 KiB of queue
memory per I/O queue pair.

# NVMe over Fabrics Host Support {#nvme_fabrics_host}

The NVMe driver supports connecting to remote NVMe-oF targets and
interacting with them in the same manner as local NVMe SSDs.

## Specifying Remote NVMe over Fabrics Targets {#nvme_fabrics_trid}

The method for connecting to a remote NVMe-oF target is very similar
to the normal enumeration process for local PCIe-attached NVMe devices.
To connect to a remote NVMe over Fabrics subsystem, the user may call
spdk_nvme_probe() with the `trid` parameter specifying the address of
the NVMe-oF target.

The caller may fill out the spdk_nvme_transport_id structure manually
or use the spdk_nvme_transport_id_parse() function to convert a
human-readable string representation into the required structure.

The spdk_nvme_transport_id may contain the address of a discovery service
or a single NVM subsystem. If a discovery service address is specified,
the NVMe library will call the spdk_nvme_probe() `probe_cb` for each
discovered NVM subsystem, which allows the user to select the desired
subsystems to be attached. Alternatively, if the address specifies a
single NVM subsystem directly, the NVMe library will call `probe_cb`
for just that subsystem; this allows the user to skip the discovery step
and connect directly to a subsystem with a known address.
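As an illustration, the sketch below parses a transport ID string and probes the remote target. The transport type, address, and service ID mirror the placeholder values used in the perf examples above, and the callbacks are trivial stand-ins for application logic; they play the same roles as in the local PCIe case.

~~~{.c}
#include <string.h>

#include "spdk/nvme.h"

/* Accept every NVM subsystem reported by the discovery service. */
static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	return true;
}

static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	  struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
	/* Save ctrlr for later I/O, exactly as with a local controller. */
}

static int
probe_remote_target(void)
{
	struct spdk_nvme_transport_id trid;

	memset(&trid, 0, sizeof(trid));

	/* Placeholder address of a discovery service or NVM subsystem. */
	if (spdk_nvme_transport_id_parse(&trid,
			"trtype:RDMA adrfam:IPv4 traddr:192.168.100.8 trsvcid:4420") != 0) {
		return -1;
	}

	return spdk_nvme_probe(&trid, NULL, probe_cb, attach_cb, NULL);
}
~~~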
## RDMA Limitations

Please refer to the NVMe-oF target's @ref nvmf_rdma_limitations

# NVMe Multi Process {#nvme_multi_process}

This capability enables the SPDK NVMe driver to support multiple processes accessing the
same NVMe device. The NVMe driver allocates critical structures from shared memory, so
that each process can map that memory and create its own queue pairs or share the admin
queue. There is a limited number of I/O queue pairs per NVMe controller.

The primary motivation for this feature is to support management tools that can attach
to long running applications, perform some maintenance work or gather information, and
then detach.

## Configuration {#nvme_multi_process_configuration}

DPDK EAL allows different types of processes to be spawned, each with different permissions
on the hugepage memory used by the applications.

There are two types of processes:
1. a primary process which initializes the shared memory and has full privileges and
2. a secondary process which can attach to the primary process by mapping its shared memory
regions and perform NVMe operations including creating queue pairs.

This feature is enabled by default and is controlled by selecting a value for the shared
memory group ID. This ID is a positive integer and two applications with the same shared
memory group ID will share memory. The first application with a given shared memory group
ID will be considered the primary and all others secondary.

Example: identical shm_id and non-overlapping core masks
~~~{.sh}
./perf options [AIO device(s)]...
    [-c core mask for I/O submission/completion]
    [-i shared memory group ID]

./perf -q 1 -o 4096 -w randread -c 0x1 -t 60 -i 1
./perf -q 8 -o 131072 -w write -c 0x10 -t 60 -i 1
~~~
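From an application's point of view, one way to select the shared memory group ID is through the `shm_id` field of spdk_env_opts before initializing the environment; perf's `-i` option corresponds to this value. The sketch below is a minimal illustration with placeholder name, ID, and core masks: two processes running this code with the same shm_id and non-overlapping core masks would share the driver's state, with the first one acting as the primary.

~~~{.c}
#include "spdk/env.h"

/* Each cooperating process initializes the environment with the same shm_id.
 * The first process to do so is considered the primary; later ones attach as
 * secondaries. Core masks must not overlap between the processes. */
static int
init_env_for_multi_process(const char *core_mask)
{
	struct spdk_env_opts opts;

	spdk_env_opts_init(&opts);
	opts.name = "nvme_multi_process_sketch"; /* placeholder name */
	opts.shm_id = 1;                         /* same value in every process */
	opts.core_mask = core_mask;              /* e.g. "0x1" in one process, "0x10" in another */

	return spdk_env_init(&opts);
}
~~~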
## Limitations {#nvme_multi_process_limitations}

1. Two processes sharing memory may not share any cores in their core mask.
2. If a primary process exits while secondary processes are still running, those processes
will continue to run. However, a new primary process cannot be created.
3. Applications are responsible for coordinating access to logical blocks.
4. If a process exits unexpectedly, the allocated memory will be released when the last
process exits.

@sa spdk_nvme_probe, spdk_nvme_ctrlr_process_admin_completions

# NVMe Hotplug {#nvme_hotplug}

At the NVMe driver level, we provide the following support for hotplug:

1. Hotplug events detection:
The user of the NVMe library can call spdk_nvme_probe() periodically to detect
hotplug events. The probe_cb, followed by the attach_cb, will be called for each
new device detected. The user may optionally also provide a remove_cb that will be
called if a previously attached NVMe device is no longer present on the system.
All subsequent I/O to the removed device will return an error. A sketch of this
periodic probing is shown at the end of this section.

2. Hot remove of an NVMe device under I/O load:
When a device is hot removed while I/O is occurring, all access to the PCI BAR will
result in a SIGBUS error. The NVMe driver automatically handles this case by installing
a SIGBUS handler and remapping the PCI BAR to a new, placeholder memory location.
This means I/O in flight during a hot remove will complete with an appropriate error
code and will not crash the application.

@sa spdk_nvme_probe
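A minimal sketch of the periodic probing described in case 1 follows. It assumes probe_cb and attach_cb are the same callbacks the application used at startup, and the one-second polling interval is an arbitrary choice.

~~~{.c}
#include <stdio.h>
#include <unistd.h>

#include "spdk/nvme.h"

/* probe_cb and attach_cb are assumed to be defined elsewhere in the application. */
bool probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	      struct spdk_nvme_ctrlr_opts *opts);
void attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	       struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts);

/* Called when a previously attached controller is no longer present. */
static void
remove_cb(void *cb_ctx, struct spdk_nvme_ctrlr *ctrlr)
{
	fprintf(stderr, "controller removed; outstanding I/O will complete with errors\n");
	/* A real application would stop submitting I/O and eventually detach ctrlr. */
}

/* Re-probe periodically: new devices trigger probe_cb/attach_cb, removals trigger remove_cb. */
static void
hotplug_poll_loop(void)
{
	for (;;) {
		spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, remove_cb);
		sleep(1);
	}
}
~~~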