# NVMe over Fabrics Target Programming Guide {#nvmf_tgt_pg}

## Target Audience

This programming guide is intended for developers authoring applications that use the SPDK NVMe-oF target library (`lib/nvmf`). It is intended to provide background context, architectural insight, and design recommendations. This guide will not cover how to use the SPDK NVMe-oF target application. For a guide on how to use the existing application as-is, see @ref nvmf.

## Introduction

The SPDK NVMe-oF target library is located in `lib/nvmf`. The library implements all logic required to create an NVMe-oF target application. It is used in the implementation of the example NVMe-oF target application in `app/nvmf_tgt`, but is intended to be consumed independently.

This guide is written assuming that the reader is familiar with both NVMe and NVMe over Fabrics. The best way to become familiar with those is to read their [specifications](https://nvmexpress.org/specifications/).

## Primitives

The library exposes a number of primitives - basic objects that the user creates and interacts with. They are:

`struct spdk_nvmf_tgt`: An NVMe-oF target. This concept, surprisingly, does not appear in the NVMe-oF specification. SPDK defines this to mean the collection of subsystems with the associated namespaces, plus the set of transports and their associated network connections. This will be referred to throughout this guide as a **target**.

`struct spdk_nvmf_subsystem`: An NVMe-oF subsystem, as defined by the NVMe-oF specification. Subsystems contain namespaces and controllers and perform access control. This will be referred to throughout this guide as a **subsystem**.

`struct spdk_nvmf_ns`: An NVMe-oF namespace, as defined by the NVMe-oF specification. Namespaces are **bdevs**. See @ref bdev for an explanation of the SPDK bdev layer. This will be referred to throughout this guide as a **namespace**.

`struct spdk_nvmf_qpair`: An NVMe-oF queue pair, as defined by the NVMe-oF specification. These map 1:1 to network connections. This will be referred to throughout this guide as a **qpair**.

`struct spdk_nvmf_transport`: An abstraction for a network fabric, as defined by the NVMe-oF specification. The specification is designed to allow for many different network fabrics, so the code mirrors that and implements a plugin system. Currently, only the RDMA transport is available. This will be referred to throughout this guide as a **transport**.

`struct spdk_nvmf_poll_group`: An abstraction for a collection of network connections that can be polled as a unit. This is an SPDK-defined concept that does not appear in the NVMe-oF specification. Often, network transports have facilities to check for incoming data on groups of connections more efficiently than checking each one individually (e.g. epoll), so poll groups provide a generic abstraction for that. This will be referred to throughout this guide as a **poll group**.

`struct spdk_nvmf_listener`: A network address at which the target will accept new connections.

`struct spdk_nvmf_host`: An NVMe-oF NQN representing a host (initiator) system. This is used for access control.
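
To make the relationships concrete, an application typically keeps handles to these primitives in a context structure of its own. The sketch below is illustrative only: the struct and field names are hypothetical, and only the SPDK types come from the public `spdk/nvmf.h` header.

```c
#include "spdk/nvmf.h"

/* Hypothetical application context. One target owns the subsystems and
 * transports; each application thread owns exactly one poll group that it
 * created locally. */
struct example_nvmf_app {
	struct spdk_nvmf_tgt		*tgt;        /* the target */
	struct spdk_nvmf_subsystem	*subsystem;  /* one of possibly many subsystems */
	struct spdk_nvmf_poll_group	*pg;         /* this thread's poll group */
};
```
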
## The Basics

A user of the NVMe-oF target library begins by creating a target using spdk_nvmf_tgt_create(), setting up a set of addresses on which to accept connections by calling spdk_nvmf_tgt_listen_ext(), then creating a subsystem using spdk_nvmf_subsystem_create().

Subsystems begin in an inactive state and must be activated by calling spdk_nvmf_subsystem_start(). Subsystems may be modified at run time, but only when in the paused or inactive state. A running subsystem may be paused by calling spdk_nvmf_subsystem_pause() and resumed by calling spdk_nvmf_subsystem_resume().

Namespaces may be added to the subsystem by calling spdk_nvmf_subsystem_add_ns_ext() when the subsystem is inactive or paused. Namespaces are bdevs. See @ref bdev for more information about the SPDK bdev layer. A bdev may be obtained by calling spdk_bdev_get_by_name().

Once a subsystem exists and the target is listening on an address, new connections will be automatically assigned to poll groups as they are detected.

All I/O to a subsystem is driven by a poll group, which polls for incoming network I/O. Poll groups may be created by calling spdk_nvmf_poll_group_create(). They automatically begin polling upon creation, on the thread from which they were created. Most importantly, *a poll group may only be accessed from the thread on which it was created.*

## Access Control

Access control is performed at the subsystem level by adding allowed listen addresses and hosts to a subsystem (see spdk_nvmf_subsystem_add_listener() and spdk_nvmf_subsystem_add_host()). By default, a subsystem will not accept connections from any host or on any listen address. Listeners and hosts may only be added to inactive or paused subsystems.

## Discovery Subsystems

A discovery subsystem, as defined by the NVMe-oF specification, is automatically created for each NVMe-oF target constructed. Connections to the discovery subsystem are handled in the same way as any other subsystem.

## Transports

The NVMe-oF specification defines multiple network transports (the "Fabrics" in NVMe over Fabrics) and has an extensible system for adding new fabrics in the future. The SPDK NVMe-oF target library implements a plugin system for network transports to mirror the specification. The API a new transport must implement is located in `lib/nvmf/transport.h`. As of this writing, only an RDMA transport has been implemented.

The SPDK NVMe-oF target is designed to be able to process I/O from multiple fabrics simultaneously.

## Choosing a Threading Model

The SPDK NVMe-oF target library does not strictly dictate a threading model, but poll groups do all of their polling and I/O processing on the thread they are created on. Given that, it almost always makes sense to create one poll group per thread used in the application.
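
Putting "The Basics" and this threading model together, the sketch below shows one way an application might bring up a target, publish a subsystem backed by an existing bdev, and create one poll group per SPDK thread. Treat it as a hedged sketch rather than copy-paste code: option structures and helpers such as spdk_nvmf_listen_opts_init() have changed across SPDK releases, the bdev name `Malloc0` and the transport address are assumptions, and error handling is reduced to early returns. It also assumes a transport has already been created and added to the target.

```c
#include "spdk/stdinc.h"
#include "spdk/nvme.h"
#include "spdk/nvmf.h"
#include "spdk/thread.h"

/* Called on each application thread: poll groups must be created (and later
 * used) only on the thread that owns them. */
static void
create_poll_group(void *ctx)
{
	struct spdk_nvmf_tgt *tgt = ctx;
	struct spdk_nvmf_poll_group *pg = spdk_nvmf_poll_group_create(tgt);

	(void)pg; /* Stash this in per-thread state in a real application. */
}

/* Completion callback for spdk_nvmf_subsystem_start(). */
static void
subsystem_started(struct spdk_nvmf_subsystem *subsystem, void *ctx, int status)
{
	(void)subsystem;
	(void)ctx;
	(void)status; /* 0 on success. */
}

static int
start_target(struct spdk_thread **threads, uint32_t num_threads)
{
	struct spdk_nvmf_target_opts tgt_opts = { 0 };
	struct spdk_nvme_transport_id trid = { 0 };
	struct spdk_nvmf_listen_opts listen_opts;
	struct spdk_nvmf_tgt *tgt;
	struct spdk_nvmf_subsystem *subsystem;
	uint32_t i;

	snprintf(tgt_opts.name, sizeof(tgt_opts.name), "example_tgt");
	tgt = spdk_nvmf_tgt_create(&tgt_opts);
	if (tgt == NULL) {
		return -1;
	}

	/* Accept new connections on an RDMA address (illustrative values). */
	if (spdk_nvme_transport_id_parse(&trid,
	    "trtype:RDMA adrfam:IPv4 traddr:192.168.0.10 trsvcid:4420") != 0) {
		return -1;
	}
	spdk_nvmf_listen_opts_init(&listen_opts, sizeof(listen_opts));
	if (spdk_nvmf_tgt_listen_ext(tgt, &trid, &listen_opts) != 0) {
		return -1;
	}

	/* Create a subsystem and attach a namespace while it is inactive.
	 * The final argument is the maximum namespace count; 0 is assumed
	 * here to mean "no fixed limit". */
	subsystem = spdk_nvmf_subsystem_create(tgt, "nqn.2016-06.io.spdk:cnode1",
					       SPDK_NVMF_SUBTYPE_NVME, 0);
	if (subsystem == NULL ||
	    spdk_nvmf_subsystem_add_ns_ext(subsystem, "Malloc0", NULL, 0, NULL) == 0) {
		return -1;
	}

	/* Activate the subsystem, then create one poll group per thread. */
	if (spdk_nvmf_subsystem_start(subsystem, subsystem_started, NULL) != 0) {
		return -1;
	}
	for (i = 0; i < num_threads; i++) {
		spdk_thread_send_msg(threads[i], create_poll_group, tgt);
	}
	return 0;
}
```

Listener and host access control (spdk_nvmf_subsystem_add_listener() and spdk_nvmf_subsystem_add_host()) is intentionally omitted from the sketch; see the Access Control section above.
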
## Scaling Across CPU Cores

Each incoming I/O request is picked up by the poll group that polls its assigned qpair. For regular NVMe commands such as READ and WRITE, the I/O request is processed on the initial thread from start to the point where it is submitted to the backing storage device, without interruption. Completions are discovered by polling the backing storage device and also processed to completion on the polling thread. **Regular NVMe commands (READ, WRITE, etc.) do not require any cross-thread coordination, and therefore take no locks.**

NVMe ADMIN commands, which are used for managing the NVMe device itself, may modify global state in the subsystem. For instance, an NVMe ADMIN command may perform namespace management, such as shrinking a namespace. For these commands, the subsystem will temporarily enter a paused state by sending a message to each thread in the system. All new incoming I/O on any thread targeting the subsystem will be queued during this time. Once the subsystem is fully paused, the state change will occur, and messages will be sent to each thread to release queued I/O and resume. Management commands are rare, so this style of coordination is preferable to forcing all commands to take locks in the I/O path.

## Zero Copy Support

For the RDMA transport, data is transferred from the RDMA NIC to host memory and then from host memory to the SSD (or vice versa), without any intermediate copies. Data is never moved from one location in host memory to another. Other transports in the future may require data copies.

## RDMA

The SPDK NVMe-oF RDMA transport is implemented on top of the libibverbs and rdmacm libraries, which are packaged and available on most Linux distributions. It does not use a user-space RDMA driver stack through DPDK.

In order to scale to large numbers of connections, the SPDK NVMe-oF RDMA transport allocates a single RDMA completion queue per poll group. All new qpairs assigned to the poll group are given their own RDMA send and receive queues, but share this common completion queue. This allows the poll group to poll a single queue for incoming messages instead of iterating through each one.

Each RDMA request is handled by a state machine that walks the request through a number of states. This keeps the code organized and makes all of the corner cases much more obvious.

RDMA SEND, READ, and WRITE operations are ordered with respect to one another, but RDMA RECVs are not necessarily ordered with SEND acknowledgements. For instance, it is possible to detect an incoming RDMA RECV message containing a new NVMe-oF capsule prior to detecting the acknowledgement of a previous SEND containing an NVMe completion. This is problematic at full queue depth because there may not yet be a free request structure. To handle this, the RDMA request structure is broken into two parts: an `rdma_recv` and an `rdma_request`. New RDMA RECVs will always grab a free `rdma_recv`, but may need to wait in a queue for a SEND acknowledgement before they can acquire a full `rdma_request` object.

Further, RDMA NICs expose different queue depths for READ/WRITE operations than they do for SEND/RECV operations. The RDMA transport reports available queue depth based on SEND/RECV operation limits and will queue in software as necessary to accommodate (usually lower) limits on READ/WRITE operations.
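
As a conceptual illustration of the `rdma_recv`/`rdma_request` split, the hedged sketch below pairs an incoming RECV with a request context when one is free and parks it otherwise. The structure and function names are hypothetical and greatly simplified; the real definitions live in `lib/nvmf/rdma.c` and carry many more fields.

```c
/* Simplified, hypothetical illustration of the two-part RDMA request split
 * described above. Queue heads must be initialized with TAILQ_INIT() before
 * use. */
#include <stddef.h>
#include <sys/queue.h>

struct rdma_recv {
	void *capsule;			/* incoming NVMe-oF capsule from an RDMA RECV */
	TAILQ_ENTRY(rdma_recv) link;
};

struct rdma_request {
	/* Full request context; becomes free only once the SEND carrying the
	 * previous NVMe completion has been acknowledged. */
	struct rdma_recv *recv;
	TAILQ_ENTRY(rdma_request) link;
};

struct rdma_qpair_queues {
	TAILQ_HEAD(, rdma_recv) pending_recvs;	/* RECVs waiting for a request context */
	TAILQ_HEAD(, rdma_request) free_requests;	/* contexts freed by SEND acks */
};

/* Pair a newly detected RECV with a request context if one is free;
 * otherwise leave it queued until a SEND acknowledgement releases one. */
static struct rdma_request *
try_start_request(struct rdma_qpair_queues *q, struct rdma_recv *recv)
{
	struct rdma_request *req = TAILQ_FIRST(&q->free_requests);

	if (req == NULL) {
		TAILQ_INSERT_TAIL(&q->pending_recvs, recv, link);
		return NULL;
	}
	TAILQ_REMOVE(&q->free_requests, req, link);
	req->recv = recv;
	return req;
}
```
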