# NVMe Driver {#nvme}

# In this document {#nvme_toc}

* @ref nvme_intro
* @ref nvme_examples
* @ref nvme_interface
* @ref nvme_design
* @ref nvme_fabrics_host
* @ref nvme_multi_process
* @ref nvme_hotplug
* @ref nvme_cuse

# Introduction {#nvme_intro}

The NVMe driver is a C library that may be linked directly into an application
to provide direct, zero-copy data transfer to and from
[NVMe SSDs](http://nvmexpress.org/). It is entirely passive, meaning that it spawns
no threads and only performs actions in response to function calls from the
application itself. The library controls NVMe devices by directly mapping the
[PCI BAR](https://en.wikipedia.org/wiki/PCI_configuration_space) into the local
process and performing [MMIO](https://en.wikipedia.org/wiki/Memory-mapped_I/O).
I/O is submitted asynchronously via queue pairs and the general flow isn't
entirely dissimilar from Linux's
[libaio](http://man7.org/linux/man-pages/man2/io_submit.2.html).

More recently, the library has been improved to also connect to remote NVMe
devices via NVMe over Fabrics. Users may now call spdk_nvme_probe() on both
local PCI buses and on remote NVMe over Fabrics discovery services. The API is
otherwise unchanged.

# Examples {#nvme_examples}

## Getting Started with Hello World {#nvme_helloworld}

There are a number of examples provided that demonstrate how to use the NVMe
library. They are all in the [examples/nvme](https://github.com/spdk/spdk/tree/master/examples/nvme)
directory in the repository. The best place to start is
[hello_world](https://github.com/spdk/spdk/blob/master/examples/nvme/hello_world/hello_world.c).

## Running Benchmarks with Fio Plugin {#nvme_fioplugin}

SPDK provides a plugin to the very popular [fio](https://github.com/axboe/fio)
tool for running some basic benchmarks. See the fio start up
[guide](https://github.com/spdk/spdk/blob/master/examples/nvme/fio_plugin/)
for more details.

## Running Benchmarks with Perf Tool {#nvme_perf}

The NVMe perf utility in [examples/nvme/perf](https://github.com/spdk/spdk/tree/master/examples/nvme/perf)
is another example that can be used for performance tests. The fio
tool is widely used because it is very flexible. However, that flexibility adds
overhead and reduces the efficiency of SPDK. Therefore, SPDK provides a perf
benchmarking tool which has minimal overhead during benchmarking. We have
measured up to 2.6 times more IOPS/core when using perf vs. fio with the
4K 100% Random Read workload. The perf benchmarking tool provides several
run time options to support the most common workloads. The following examples
demonstrate how to use perf.

Example: Using perf for a 4K 100% Random Read workload to a local NVMe SSD for 300 seconds
~~~{.sh}
perf -q 128 -o 4096 -w randread -r 'trtype:PCIe traddr:0000:04:00.0' -t 300
~~~

Example: Using perf for a 4K 100% Random Read workload to a remote NVMe SSD exported over the network via NVMe-oF
~~~{.sh}
perf -q 128 -o 4096 -w randread -r 'trtype:RDMA adrfam:IPv4 traddr:192.168.100.8 trsvcid:4420' -t 300
~~~

Example: Using perf for a 4K 70/30 Random Read/Write mix workload to all local NVMe SSDs for 300 seconds
~~~{.sh}
perf -q 128 -o 4096 -w randrw -M 70 -t 300
~~~

Example: Using perf for an extended LBA format CRC guard test to a local NVMe SSD.
Users must write to the SSD before reading the LBA from the SSD.
~~~{.sh}
perf -q 1 -o 4096 -w write -r 'trtype:PCIe traddr:0000:04:00.0' -t 300 -e 'PRACT=0,PRCKH=GUARD'
perf -q 1 -o 4096 -w read -r 'trtype:PCIe traddr:0000:04:00.0' -t 200 -e 'PRACT=0,PRCKH=GUARD'
~~~

# Public Interface {#nvme_interface}

- spdk/nvme.h

Key Functions                               | Description
------------------------------------------- | -----------
spdk_nvme_probe()                           | @copybrief spdk_nvme_probe()
spdk_nvme_ctrlr_alloc_io_qpair()            | @copybrief spdk_nvme_ctrlr_alloc_io_qpair()
spdk_nvme_ctrlr_get_ns()                    | @copybrief spdk_nvme_ctrlr_get_ns()
spdk_nvme_ns_cmd_read()                     | @copybrief spdk_nvme_ns_cmd_read()
spdk_nvme_ns_cmd_readv()                    | @copybrief spdk_nvme_ns_cmd_readv()
spdk_nvme_ns_cmd_read_with_md()             | @copybrief spdk_nvme_ns_cmd_read_with_md()
spdk_nvme_ns_cmd_write()                    | @copybrief spdk_nvme_ns_cmd_write()
spdk_nvme_ns_cmd_writev()                   | @copybrief spdk_nvme_ns_cmd_writev()
spdk_nvme_ns_cmd_write_with_md()            | @copybrief spdk_nvme_ns_cmd_write_with_md()
spdk_nvme_ns_cmd_write_zeroes()             | @copybrief spdk_nvme_ns_cmd_write_zeroes()
spdk_nvme_ns_cmd_dataset_management()       | @copybrief spdk_nvme_ns_cmd_dataset_management()
spdk_nvme_ns_cmd_flush()                    | @copybrief spdk_nvme_ns_cmd_flush()
spdk_nvme_qpair_process_completions()       | @copybrief spdk_nvme_qpair_process_completions()
spdk_nvme_ctrlr_cmd_admin_raw()             | @copybrief spdk_nvme_ctrlr_cmd_admin_raw()
spdk_nvme_ctrlr_process_admin_completions() | @copybrief spdk_nvme_ctrlr_process_admin_completions()
spdk_nvme_ctrlr_cmd_io_raw()                | @copybrief spdk_nvme_ctrlr_cmd_io_raw()
spdk_nvme_ctrlr_cmd_io_raw_with_md()        | @copybrief spdk_nvme_ctrlr_cmd_io_raw_with_md()

# NVMe Driver Design {#nvme_design}

## NVMe I/O Submission {#nvme_io_submission}

I/O is submitted to an NVMe namespace using nvme_ns_cmd_xxx functions. The NVMe
driver submits the I/O request as an NVMe submission queue entry on the queue
pair specified in the command. The function returns immediately, prior to the
completion of the command. The application must poll for I/O completion on each
queue pair with outstanding I/O to receive completion callbacks by calling
spdk_nvme_qpair_process_completions(), as sketched in the example below.

@sa spdk_nvme_ns_cmd_read, spdk_nvme_ns_cmd_write, spdk_nvme_ns_cmd_dataset_management,
spdk_nvme_ns_cmd_flush, spdk_nvme_qpair_process_completions
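
The snippet below is a minimal sketch of this submit-then-poll flow, not taken from
the SPDK examples. It assumes the application has already attached a controller and
obtained a namespace handle `ns` and a queue pair `qpair` (for example via
spdk_nvme_ctrlr_get_ns() and spdk_nvme_ctrlr_alloc_io_qpair()); the helper name
`read_first_block`, the LBA, and the buffer parameters are illustrative only.

~~~
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

static bool g_read_done;

/* Completion callback, invoked from spdk_nvme_qpair_process_completions(). */
static void
read_complete(void *cb_arg, const struct spdk_nvme_cpl *cpl)
{
	if (spdk_nvme_cpl_is_error(cpl)) {
		fprintf(stderr, "read failed\n");
	}
	g_read_done = true;
}

/* Illustrative helper: read LBA 0 of the namespace and wait for completion. */
static int
read_first_block(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair)
{
	/* Payload buffers must come from pinned, DMA-able memory. */
	void *buf = spdk_zmalloc(spdk_nvme_ns_get_sector_size(ns), 0x1000, NULL,
				 SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);
	if (buf == NULL) {
		return -ENOMEM;
	}

	/* Queue the read; the call returns before the command completes. */
	int rc = spdk_nvme_ns_cmd_read(ns, qpair, buf, 0 /* LBA */, 1 /* count */,
				       read_complete, NULL, 0);
	if (rc != 0) {
		spdk_free(buf);
		return rc;
	}

	/* Poll the queue pair until the completion callback has fired. */
	while (!g_read_done) {
		spdk_nvme_qpair_process_completions(qpair, 0);
	}

	spdk_free(buf);
	return 0;
}
~~~
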
### Fused operations {#nvme_fuses}

To "fuse" two commands, the first command should have the SPDK_NVME_IO_FLAGS_FUSE_FIRST
io flag set, and the next one should have the SPDK_NVME_IO_FLAGS_FUSE_SECOND flag set.

In addition, the following rules must be met to execute two commands as an atomic unit:

- The commands shall be inserted next to each other in the same submission queue.
- The LBA range should be the same for the two commands.

For example, to send a fused compare-and-write operation, the user must call
spdk_nvme_ns_cmd_compare followed by spdk_nvme_ns_cmd_write and make sure no other
operations are submitted in between on the same queue, as in the example below:

~~~
	rc = spdk_nvme_ns_cmd_compare(ns, qpair, cmp_buf, 0, 1, nvme_fused_first_cpl_cb,
				      NULL, SPDK_NVME_IO_FLAGS_FUSE_FIRST);
	if (rc != 0) {
		...
	}

	rc = spdk_nvme_ns_cmd_write(ns, qpair, write_buf, 0, 1, nvme_fused_second_cpl_cb,
				    NULL, SPDK_NVME_IO_FLAGS_FUSE_SECOND);
	if (rc != 0) {
		...
	}
~~~

The NVMe specification currently defines compare-and-write as a fused operation.
Support for compare-and-write is reported by the controller flag
SPDK_NVME_CTRLR_COMPARE_AND_WRITE_SUPPORTED.

### Scaling Performance {#nvme_scaling}

NVMe queue pairs (struct spdk_nvme_qpair) provide parallel submission paths for
I/O. I/O may be submitted on multiple queue pairs simultaneously from different
threads. Queue pairs contain no locks or atomics, however, so a given queue
pair may only be used by a single thread at a time. This requirement is not
enforced by the NVMe driver (doing so would require a lock), and violating this
requirement results in undefined behavior.

The number of queue pairs allowed is dictated by the NVMe SSD itself. The
specification allows for thousands, but most devices support between 32
and 128. The specification makes no guarantees about the performance available from
each queue pair, but in practice the full performance of a device is almost
always achievable using just one queue pair. For example, if a device claims to
be capable of 450,000 I/O per second at queue depth 128, in practice it does
not matter whether the driver is using 4 queue pairs each with queue depth 32, or a
single queue pair with queue depth 128.

Given the above, the easiest threading model for an application using SPDK is
to spawn a fixed number of threads in a pool and dedicate a single NVMe queue
pair to each thread. A further improvement would be to pin each thread to a
separate CPU core, and often the SPDK documentation will use "CPU core" and
"thread" interchangeably because we have this threading model in mind.

The NVMe driver takes no locks in the I/O path, so it scales linearly in terms
of performance per thread as long as a queue pair and a CPU core are dedicated
to each new thread. In order to take full advantage of this scaling,
applications should consider organizing their internal data structures such
that data is assigned exclusively to a single thread. All operations that
require that data should be done by sending a request to the owning thread.
This results in a message passing architecture, as opposed to a locking
architecture, and will result in superior scaling across CPU cores. A sketch of
the queue-pair-per-thread model is shown below.
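
As an illustration only (not taken from the SPDK examples), the sketch below shows
the shape of such a model: each worker thread allocates its own queue pair with
spdk_nvme_ctrlr_alloc_io_qpair() and is the only thread that ever submits to or
polls that queue pair. The `worker_ctx`/`worker_fn` names and the work-distribution
details are placeholders.

~~~
#include <pthread.h>
#include <stdbool.h>
#include "spdk/nvme.h"

/* Illustrative per-thread context: the queue pair is owned exclusively
 * by the thread running worker_fn(). */
struct worker_ctx {
	struct spdk_nvme_ctrlr *ctrlr;
	struct spdk_nvme_ns    *ns;
	struct spdk_nvme_qpair *qpair;
	volatile bool           stop;
};

static void *
worker_fn(void *arg)
{
	struct worker_ctx *w = arg;

	/* Allocate this thread's private queue pair (default options). */
	w->qpair = spdk_nvme_ctrlr_alloc_io_qpair(w->ctrlr, NULL, 0);
	if (w->qpair == NULL) {
		return NULL;
	}

	while (!w->stop) {
		/* Placeholder: issue spdk_nvme_ns_cmd_read()/write() here
		 * using w->ns and w->qpair, then poll for completions. */
		spdk_nvme_qpair_process_completions(w->qpair, 0);
	}

	spdk_nvme_ctrlr_free_io_qpair(w->qpair);
	return NULL;
}
~~~
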
## NVMe Driver Internal Memory Usage {#nvme_memory_usage}

The SPDK NVMe driver provides a zero-copy data transfer path, which means that
there are no data buffers for I/O commands. However, some admin commands have
data copies depending on the API used by the user.

Each queue pair has a number of trackers used to track commands submitted by the
caller. The number of trackers for I/O queues depends on the user's requested queue
size and the value read from the Maximum Queue Entries Supported (MQES, a 0-based
value) field of the controller capabilities register. Each tracker has a fixed size
of 4096 bytes, so the maximum memory used for each I/O queue is (MQES + 1) * 4 KiB.

I/O queue pairs can be allocated in host memory; this is used for most NVMe
controllers. Some NVMe controllers that support a Controller Memory Buffer may
place I/O queue pairs in the controller's PCI BAR space, and the SPDK NVMe driver
can put the I/O submission queue into the controller memory buffer, depending on
the user's input and the controller's capabilities. Each submission queue entry
(SQE) and completion queue entry (CQE) consumes 64 bytes and 16 bytes respectively.
Therefore, the maximum memory used for each I/O queue pair is
(MQES + 1) * (64 + 16) bytes.

# NVMe over Fabrics Host Support {#nvme_fabrics_host}

The NVMe driver supports connecting to remote NVMe-oF targets and
interacting with them in the same manner as local NVMe SSDs.

## Specifying Remote NVMe over Fabrics Targets {#nvme_fabrics_trid}

The method for connecting to a remote NVMe-oF target is very similar
to the normal enumeration process for local PCIe-attached NVMe devices.
To connect to a remote NVMe over Fabrics subsystem, the user may call
spdk_nvme_probe() with the `trid` parameter specifying the address of
the NVMe-oF target.

The caller may fill out the spdk_nvme_transport_id structure manually
or use the spdk_nvme_transport_id_parse() function to convert a
human-readable string representation into the required structure.

The spdk_nvme_transport_id may contain the address of a discovery service
or a single NVM subsystem. If a discovery service address is specified,
the NVMe library will call the spdk_nvme_probe() `probe_cb` for each
discovered NVM subsystem, which allows the user to select the desired
subsystems to be attached. Alternatively, if the address specifies a
single NVM subsystem directly, the NVMe library will call `probe_cb`
for just that subsystem; this allows the user to skip the discovery step
and connect directly to a subsystem with a known address.
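
The following is a minimal sketch of that flow, assuming a discovery service
reachable over RDMA at the illustrative address 192.168.100.8:4420 used elsewhere
in this document; the function name `connect_fabrics_target` is hypothetical, and
the callbacks simply accept and report every subsystem the discovery service lists.

~~~
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include "spdk/nvme.h"

/* Decide whether to attach to a reported NVM subsystem.
 * Returning true tells the library to attach. */
static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	printf("Discovered subsystem %s\n", trid->subnqn);
	return true;
}

/* Called once a controller has been attached and initialized. */
static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	  struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
	printf("Attached to %s\n", trid->traddr);
	/* A real application would store the ctrlr handle for later I/O. */
}

static int
connect_fabrics_target(void)
{
	struct spdk_nvme_transport_id trid;
	int rc;

	memset(&trid, 0, sizeof(trid));

	/* Illustrative address of an NVMe-oF discovery service. */
	rc = spdk_nvme_transport_id_parse(&trid,
		"trtype:RDMA adrfam:IPv4 traddr:192.168.100.8 trsvcid:4420");
	if (rc != 0) {
		return rc;
	}

	/* Probe the discovery service; probe_cb/attach_cb run for each
	 * NVM subsystem it reports. */
	return spdk_nvme_probe(&trid, NULL, probe_cb, attach_cb, NULL);
}
~~~
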
## RDMA Limitations

Please refer to the NVMe-oF target's @ref nvmf_rdma_limitations

# NVMe Multi Process {#nvme_multi_process}

This capability enables the SPDK NVMe driver to support multiple processes accessing the
same NVMe device. The NVMe driver allocates critical structures from shared memory, so
that each process can map that memory and create its own queue pairs or share the admin
queue. There is a limited number of I/O queue pairs per NVMe controller.

The primary motivation for this feature is to support management tools that can attach
to long running applications, perform some maintenance work or gather information, and
then detach.

## Configuration {#nvme_multi_process_configuration}

DPDK EAL allows different types of processes to be spawned, each with different permissions
on the hugepage memory used by the applications.

There are two types of processes:

1. a primary process which initializes the shared memory and has full privileges, and
2. a secondary process which can attach to the primary process by mapping its shared memory
   regions and performing NVMe operations including creating queue pairs.

This feature is enabled by default and is controlled by selecting a value for the shared
memory group ID. This ID is a positive integer, and two applications with the same shared
memory group ID will share memory. The first application with a given shared memory group
ID will be considered the primary and all others secondary.

Example: identical shm_id and non-overlapping core masks
~~~{.sh}
./perf options [AIO device(s)]...
	[-c core mask for I/O submission/completion]
	[-i shared memory group ID]

./perf -q 1 -o 4096 -w randread -c 0x1 -t 60 -i 1
./perf -q 8 -o 131072 -w write -c 0x10 -t 60 -i 1
~~~

## Limitations {#nvme_multi_process_limitations}

1. Two processes sharing memory may not share any cores in their core mask.
2. If a primary process exits while secondary processes are still running, those processes
   will continue to run. However, a new primary process cannot be created.
3. Applications are responsible for coordinating access to logical blocks.
4. If a process exits unexpectedly, the allocated memory will be released when the last
   process exits.

@sa spdk_nvme_probe, spdk_nvme_ctrlr_process_admin_completions

# NVMe Hotplug {#nvme_hotplug}

At the NVMe driver level, we provide the following support for hotplug:

1. Hotplug event detection:
   The user of the NVMe library can call spdk_nvme_probe() periodically to detect
   hotplug events, as sketched after this list. The probe_cb, followed by the attach_cb,
   will be called for each new device detected. The user may optionally also provide a
   remove_cb that will be called if a previously attached NVMe device is no longer
   present on the system. All subsequent I/O to the removed device will return an error.

2. Hot remove NVMe with I/O loads:
   When a device is hot removed while I/O is occurring, all access to the PCI BAR will
   result in a SIGBUS error. The NVMe driver automatically handles this case by installing
   a SIGBUS handler and remapping the PCI BAR to a new, placeholder memory location.
   This means I/O in flight during a hot remove will complete with an appropriate error
   code and will not crash the application.

@sa spdk_nvme_probe
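
Below is a minimal sketch of such periodic probing, not taken from the SPDK examples;
the callbacks are illustrative placeholders, and a real application would track the
attached controllers and quiesce I/O before detaching when remove_cb fires.

~~~
#include <stdbool.h>
#include <stdio.h>
#include "spdk/nvme.h"

static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	/* Attach to every newly discovered device. */
	return true;
}

static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	  struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
	printf("Hotplug: attached %s\n", trid->traddr);
}

static void
remove_cb(void *cb_ctx, struct spdk_nvme_ctrlr *ctrlr)
{
	/* The device is gone; outstanding and future I/O will fail. A real
	 * application would quiesce I/O and then call spdk_nvme_detach(). */
	printf("Hotplug: controller removed\n");
}

/* Illustrative helper: call this periodically (e.g. once per second) from the
 * application's main loop to pick up newly inserted or removed PCIe devices. */
static void
check_hotplug(void)
{
	/* A NULL trid probes the local PCIe bus. */
	spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, remove_cb);
}
~~~
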
# NVMe Character Devices {#nvme_cuse}

This feature is considered experimental.

For each controller as well as namespace, character devices are created in the
locations:
~~~{.sh}
	/dev/spdk/nvmeX
	/dev/spdk/nvmeXnY
	...
~~~
where X is the unique SPDK NVMe controller index and Y is the namespace ID.

Requests from CUSE are handled by pthreads that are created when the controller and
namespaces are created. Those pass the I/O or admin commands via a ring to a thread
that processes them using nvme_io_msg_process().

Ioctls that request information obtained when attaching the NVMe controller receive
an immediate response, without being passed through the ring.

This interface reserves one qpair per controller for sending down the I/O.

## Enabling cuse support for NVMe

Cuse support is disabled by default. To enable support for NVMe devices, SPDK
must be compiled with "./configure --with-nvme-cuse".

## Limitations

NVMe namespaces are created as character devices and their use may be limited for
tools expecting block devices.

Sysfs is not updated by SPDK.

SPDK NVMe CUSE creates nodes in the "/dev/spdk/" directory to explicitly differentiate
them from other devices. Tools that only search in the "/dev" directory might not work
with SPDK NVMe CUSE.

The SCSI to NVMe Translation Layer is not implemented. Tools that use this layer to
identify, manage, or operate devices might not work properly, or their use may be limited.

### Examples of using smartctl

The smartctl tool recognizes the device type based on the device path. If none of the
expected patterns match, the SCSI translation layer is used to identify the device.

To use smartctl, the '-d nvme' parameter must be used in addition to the full path to
the NVMe device.

~~~{.sh}
	smartctl -d nvme -i /dev/spdk/nvme0
	smartctl -d nvme -H /dev/spdk/nvme1
	...
~~~