# NVMe over Fabrics Target Programming Guide {#nvmf_tgt_pg}

## Target Audience

This programming guide is intended for developers authoring applications that
use the SPDK NVMe-oF target library (`lib/nvmf`). It is intended to provide
background context, architectural insight, and design recommendations. This
guide will not cover how to use the SPDK NVMe-oF target application. For a
guide on how to use the existing application as-is, see @ref nvmf.

## Introduction

The SPDK NVMe-oF target library is located in `lib/nvmf`. The library
implements all logic required to create an NVMe-oF target application. It is
used in the implementation of the example NVMe-oF target application in
`app/nvmf_tgt`, but is intended to be consumed independently.

This guide is written assuming that the reader is familiar with both NVMe and
NVMe over Fabrics. The best way to become familiar with those is to read their
[specifications](https://nvmexpress.org/specifications/).

## Primitives

The library exposes a number of primitives - basic objects that the user
creates and interacts with. They are:

`struct spdk_nvmf_tgt`: An NVMe-oF target. This concept, surprisingly, does
not appear in the NVMe-oF specification. SPDK defines this to mean the
collection of subsystems with the associated namespaces, plus the set of
transports and their associated network connections. This will be referred to
throughout this guide as a **target**.

`struct spdk_nvmf_subsystem`: An NVMe-oF subsystem, as defined by the NVMe-oF
specification. Subsystems contain namespaces and controllers and perform
access control. This will be referred to throughout this guide as a
**subsystem**.

`struct spdk_nvmf_ns`: An NVMe-oF namespace, as defined by the NVMe-oF
specification. Namespaces are **bdevs**. See @ref bdev for an explanation of
the SPDK bdev layer. This will be referred to throughout this guide as a
**namespace**.

`struct spdk_nvmf_qpair`: An NVMe-oF queue pair, as defined by the NVMe-oF
specification. These map 1:1 to network connections. This will be referred to
throughout this guide as a **qpair**.

`struct spdk_nvmf_transport`: An abstraction for a network fabric, as defined
by the NVMe-oF specification. The specification is designed to allow for many
different network fabrics, so the code mirrors that and implements a plugin
system. Currently, only the RDMA transport is available. This will be referred
to throughout this guide as a **transport**.

`struct spdk_nvmf_poll_group`: An abstraction for a collection of network
connections that can be polled as a unit. This is an SPDK-defined concept that
does not appear in the NVMe-oF specification. Often, network transports have
facilities to check for incoming data on groups of connections more
efficiently than checking each one individually (e.g. epoll), so poll groups
provide a generic abstraction for that.
This will be referred to throughout
this guide as a **poll group**.

`struct spdk_nvmf_listener`: A network address at which the target will accept
new connections.

`struct spdk_nvmf_host`: An NVMe-oF NQN representing a host (initiator)
system. This is used for access control.

## The Basics

A user of the NVMe-oF target library begins by creating a target using
spdk_nvmf_tgt_create(), setting up a set of addresses on which to accept
connections by calling spdk_nvmf_tgt_listen_ext(), then creating a subsystem
using spdk_nvmf_subsystem_create().

Subsystems begin in an inactive state and must be activated by calling
spdk_nvmf_subsystem_start(). Subsystems may be modified at run time, but only
when in the paused or inactive state. A running subsystem may be paused by
calling spdk_nvmf_subsystem_pause() and resumed by calling
spdk_nvmf_subsystem_resume().

Namespaces may be added to the subsystem by calling
spdk_nvmf_subsystem_add_ns_ext() when the subsystem is inactive or paused.
Namespaces are bdevs. See @ref bdev for more information about the SPDK bdev
layer. A bdev may be obtained by calling spdk_bdev_get_by_name().

Once a subsystem exists and the target is listening on an address, new
connections will be automatically assigned to poll groups as they are
detected.

All I/O to a subsystem is driven by a poll group, which polls for incoming
network I/O.
Poll groups may be created by calling
spdk_nvmf_poll_group_create(). They automatically request to begin polling
upon creation on the thread from which they were created. Most importantly, *a
poll group may only be accessed from the thread on which it was created.*

## Access Control

Access control is performed at the subsystem level by adding allowed listen
addresses and hosts to a subsystem (see spdk_nvmf_subsystem_add_listener() and
spdk_nvmf_subsystem_add_host()). By default, a subsystem will not accept
connections from any host or over any established listen address. Listeners
and hosts may only be added to inactive or paused subsystems.

## Discovery Subsystems

A discovery subsystem, as defined by the NVMe-oF specification, is
automatically created for each NVMe-oF target constructed. Connections to the
discovery subsystem are handled in the same way as any other subsystem.

## Transports

The NVMe-oF specification defines multiple network transports (the "Fabrics"
in NVMe over Fabrics) and has an extensible system for adding new fabrics
in the future. The SPDK NVMe-oF target library implements a plugin system for
network transports to mirror the specification. The API a new transport must
implement is located in `lib/nvmf/transport.h`. As of this writing, only an
RDMA transport has been implemented.

The SPDK NVMe-oF target is designed to be able to process I/O from multiple
fabrics simultaneously.

## Choosing a Threading Model

The SPDK NVMe-oF target library does not strictly dictate threading model, but
poll groups do all of their polling and I/O processing on the thread they are
created on. Given that, it almost always makes sense to create one poll group
per thread used in the application.

## Scaling Across CPU Cores

Incoming I/O requests are picked up by the poll group polling their assigned
qpair. For regular NVMe commands such as READ and WRITE, the I/O request is
processed on the initial thread from start to the point where it is submitted
to the backing storage device, without interruption. Completions are
discovered by polling the backing storage device and also processed to
completion on the polling thread. **Regular NVMe commands (READ, WRITE, etc.)
do not require any cross-thread coordination, and therefore take no locks.**

NVMe ADMIN commands, which are used for managing the NVMe device itself, may
modify global state in the subsystem. For instance, an NVMe ADMIN command may
perform namespace management, such as shrinking a namespace. For these
commands, the subsystem will temporarily enter a paused state by sending a
message to each thread in the system. All new incoming I/O on any thread
targeting the subsystem will be queued during this time. Once the subsystem is
fully paused, the state change will occur, and messages will be sent to each
thread to release queued I/O and resume.
Management commands are rare, so this
style of coordination is preferable to forcing all commands to take locks in
the I/O path.

## Zero Copy Support

For the RDMA transport, data is transferred from the RDMA NIC to host memory
and then from host memory to the SSD (or vice versa), without any intermediate
copies. Data is never moved from one location in host memory to another. Other
transports in the future may require data copies.

## RDMA

The SPDK NVMe-oF RDMA transport is implemented on top of the libibverbs and
rdmacm libraries, which are packaged and available on most Linux
distributions. It does not use a user-space RDMA driver stack through DPDK.

In order to scale to large numbers of connections, the SPDK NVMe-oF RDMA
transport allocates a single RDMA completion queue per poll group. All new
qpairs assigned to the poll group are given their own RDMA send and receive
queues, but share this common completion queue. This allows the poll group to
poll a single queue for incoming messages instead of iterating through each
one.

Each RDMA request is handled by a state machine that walks the request through
a number of states. This keeps the code organized and makes all of the corner
cases much more obvious.

RDMA SEND, READ, and WRITE operations are ordered with respect to one another,
but RDMA RECVs are not necessarily ordered with SEND acknowledgements.
For
instance, it is possible to detect an incoming RDMA RECV message containing a
new NVMe-oF capsule prior to detecting the acknowledgement of a previous SEND
containing an NVMe completion. This is problematic at full queue depth because
there may not yet be a free request structure. To handle this, the RDMA
request structure is broken into two parts - an `rdma_recv` and an
`rdma_request`. New RDMA RECVs will always grab a free `rdma_recv`, but may
need to wait in a queue for a SEND acknowledgement before they can acquire a
full `rdma_request` object.

Further, RDMA NICs expose different queue depths for READ/WRITE operations
than they do for SEND/RECV operations. The RDMA transport reports available
queue depth based on SEND/RECV operation limits and will queue in software as
necessary to accommodate (usually lower) limits on READ/WRITE operations.