# NVMe over Fabrics Target Programming Guide {#nvmf_tgt_pg}

## Target Audience

This programming guide is intended for developers authoring applications that
use the SPDK NVMe-oF target library (`lib/nvmf`). It provides background
context, architectural insight, and design recommendations. It does not cover
how to use the SPDK NVMe-oF target application; for a guide on using the
existing application as-is, see @ref nvmf.

## Introduction

The SPDK NVMe-oF target library is located in `lib/nvmf`. The library
implements all logic required to create an NVMe-oF target application. It is
used in the implementation of the example NVMe-oF target application in
`app/nvmf_tgt`, but is intended to be consumed independently.

This guide is written assuming that the reader is familiar with both NVMe and
NVMe over Fabrics. The best way to become familiar with those is to read their
[specifications](https://nvmexpress.org/specifications/).

## Primitives

The library exposes a number of primitives - basic objects that the user
creates and interacts with. They are:

`struct spdk_nvmf_tgt`: An NVMe-oF target. This concept, surprisingly, does
not appear in the NVMe-oF specification. SPDK defines this to mean the
collection of subsystems with the associated namespaces, plus the set of
transports and their associated network connections. This will be referred to
throughout this guide as a **target**.

`struct spdk_nvmf_subsystem`: An NVMe-oF subsystem, as defined by the NVMe-oF
specification. Subsystems contain namespaces and controllers and perform
access control. This will be referred to throughout this guide as a
**subsystem**.

`struct spdk_nvmf_ns`: An NVMe-oF namespace, as defined by the NVMe-oF
specification. Namespaces are **bdevs**. See @ref bdev for an explanation of
the SPDK bdev layer. This will be referred to throughout this guide as a
**namespace**.

`struct spdk_nvmf_qpair`: An NVMe-oF queue pair, as defined by the NVMe-oF
specification. These map 1:1 to network connections. This will be referred to
throughout this guide as a **qpair**.

`struct spdk_nvmf_transport`: An abstraction for a network fabric, as defined
by the NVMe-oF specification. The specification is designed to allow for many
different network fabrics, so the code mirrors that and implements a plugin
system. Currently, only the RDMA transport is available. This will be referred
to throughout this guide as a **transport**.

`struct spdk_nvmf_poll_group`: An abstraction for a collection of network
connections that can be polled as a unit. This is an SPDK-defined concept that
does not appear in the NVMe-oF specification. Often, network transports have
facilities to check for incoming data on groups of connections more
efficiently than checking each one individually (e.g. epoll), so poll groups
provide a generic abstraction for that. This will be referred to throughout
this guide as a **poll group**.

`struct spdk_nvmf_listener`: A network address at which the target will accept
new connections.

`struct spdk_nvmf_host`: An NVMe-oF NQN representing a host (initiator)
system. This is used for access control.

## The Basics

A user of the NVMe-oF target library begins by creating a target using
spdk_nvmf_tgt_create(), setting up a set of addresses on which to accept
connections by calling spdk_nvmf_tgt_listen_ext(), then creating a subsystem
using spdk_nvmf_subsystem_create().
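
The following sketch illustrates that sequence. The target name, address, and
subsystem NQN are placeholders, unset option fields are left at their defaults,
and it assumes the RDMA transport has already been created and added to the
target (for example with spdk_nvmf_transport_create() and
spdk_nvmf_tgt_add_transport()); exact option fields vary between SPDK releases.

```c
#include "spdk/stdinc.h"
#include "spdk/nvme.h"
#include "spdk/nvmf.h"

static struct spdk_nvmf_tgt *
create_target_and_subsystem(void)
{
	struct spdk_nvmf_target_opts tgt_opts = {};
	struct spdk_nvme_transport_id trid = {};
	struct spdk_nvmf_listen_opts listen_opts;
	struct spdk_nvmf_subsystem *subsystem;
	struct spdk_nvmf_tgt *tgt;

	/* Create the target; unset option fields keep their defaults. */
	snprintf(tgt_opts.name, sizeof(tgt_opts.name), "example_tgt");
	tgt = spdk_nvmf_tgt_create(&tgt_opts);
	if (tgt == NULL) {
		return NULL;
	}

	/* Accept new connections on an RDMA address (placeholder values). This
	 * assumes the RDMA transport was already created and added to the target. */
	spdk_nvme_trid_populate_transport(&trid, SPDK_NVME_TRANSPORT_RDMA);
	trid.adrfam = SPDK_NVMF_ADRFAM_IPV4;
	snprintf(trid.traddr, sizeof(trid.traddr), "192.168.0.1");
	snprintf(trid.trsvcid, sizeof(trid.trsvcid), "4420");

	spdk_nvmf_listen_opts_init(&listen_opts, sizeof(listen_opts));
	if (spdk_nvmf_tgt_listen_ext(tgt, &trid, &listen_opts) != 0) {
		return NULL;
	}

	/* Create an NVMe subsystem with room for one namespace. */
	subsystem = spdk_nvmf_subsystem_create(tgt, "nqn.2016-06.io.spdk:cnode1",
					       SPDK_NVMF_SUBTYPE_NVME, 1);
	if (subsystem == NULL) {
		return NULL;
	}

	return tgt;
}
```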

Subsystems begin in an inactive state and must be activated by calling
spdk_nvmf_subsystem_start(). Subsystems may be modified at run time, but only
when in the paused or inactive state. A running subsystem may be paused by
calling spdk_nvmf_subsystem_pause() and resumed by calling
spdk_nvmf_subsystem_resume().
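
These state changes are asynchronous and complete through a callback. A minimal
sketch of activating a subsystem follows; pause and resume use the same pattern,
though their exact argument lists vary slightly across SPDK releases.

```c
#include "spdk/stdinc.h"
#include "spdk/log.h"
#include "spdk/nvmf.h"

/* Called once the (asynchronous) state change has been applied everywhere. */
static void
subsystem_start_done(struct spdk_nvmf_subsystem *subsystem, void *cb_arg, int status)
{
	if (status != 0) {
		SPDK_ERRLOG("Failed to start subsystem\n");
	}
}

static void
activate_subsystem(struct spdk_nvmf_subsystem *subsystem)
{
	/* Transition the subsystem from inactive to active. */
	if (spdk_nvmf_subsystem_start(subsystem, subsystem_start_done, NULL) != 0) {
		SPDK_ERRLOG("Could not begin starting subsystem\n");
	}
}
```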

Namespaces may be added to the subsystem by calling
spdk_nvmf_subsystem_add_ns_ext() when the subsystem is inactive or paused.
Namespaces are bdevs. See @ref bdev for more information about the SPDK bdev
layer. A bdev may be obtained by calling spdk_bdev_get_by_name().
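
For example, a sketch that attaches a bdev to a paused or inactive subsystem,
assuming the `_ext` variant accepts the bdev by name and that a namespace ID of
0 indicates failure:

```c
#include "spdk/stdinc.h"
#include "spdk/nvmf.h"

/* Attach a bdev, by name, to a paused or inactive subsystem as a new namespace. */
static uint32_t
add_namespace(struct spdk_nvmf_subsystem *subsystem, const char *bdev_name)
{
	struct spdk_nvmf_ns_opts ns_opts;

	spdk_nvmf_ns_opts_get_defaults(&ns_opts, sizeof(ns_opts));
	/* Returns the allocated namespace ID, or 0 on failure. */
	return spdk_nvmf_subsystem_add_ns_ext(subsystem, bdev_name, &ns_opts,
					      sizeof(ns_opts), NULL);
}
```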

Once a subsystem exists and the target is listening on an address, new
connections will be automatically assigned to poll groups as they are
detected.

All I/O to a subsystem is driven by a poll group, which polls for incoming
network I/O. Poll groups may be created by calling
spdk_nvmf_poll_group_create(). Upon creation, they automatically begin polling
on the thread from which they were created. Most importantly, *a poll group
may only be accessed from the thread on which it was created.*
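
A minimal sketch, assuming it is called from an SPDK thread the application has
already created:

```c
#include "spdk/stdinc.h"
#include "spdk/nvmf.h"
#include "spdk/thread.h"

/* Create a poll group owned by the calling SPDK thread. All later operations
 * on the returned poll group must happen on this same thread. */
static struct spdk_nvmf_poll_group *
create_poll_group(struct spdk_nvmf_tgt *tgt)
{
	assert(spdk_get_thread() != NULL);
	return spdk_nvmf_poll_group_create(tgt);
}
```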

## Access Control

Access control is performed at the subsystem level by adding allowed listen
addresses and hosts to a subsystem (see spdk_nvmf_subsystem_add_listener() and
spdk_nvmf_subsystem_add_host()). By default, a subsystem will not accept
connections from any host or over any established listen address. Listeners
and hosts may only be added to inactive or paused subsystems.
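
A sketch of allowing a host NQN and a listen address on a paused or inactive
subsystem follows. The exact signatures have changed across SPDK releases
(newer versions add option structures and completion callbacks), so treat the
argument lists below as an approximation:

```c
#include "spdk/stdinc.h"
#include "spdk/nvme.h"
#include "spdk/nvmf.h"

static void
add_listener_done(void *ctx, int status)
{
	/* The listener is now usable on every poll group, or failed (status != 0). */
}

/* Allow a specific host NQN and listen address; the subsystem must be
 * inactive or paused while this runs. */
static int
configure_access(struct spdk_nvmf_subsystem *subsystem,
		 struct spdk_nvme_transport_id *trid, const char *host_nqn)
{
	int rc;

	rc = spdk_nvmf_subsystem_add_host(subsystem, host_nqn);
	if (rc != 0) {
		return rc;
	}

	spdk_nvmf_subsystem_add_listener(subsystem, trid, add_listener_done, NULL);
	return 0;
}
```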

## Discovery Subsystems

A discovery subsystem, as defined by the NVMe-oF specification, is
automatically created for each NVMe-oF target constructed. Connections to the
discovery subsystem are handled in the same way as any other subsystem.

## Transports

The NVMe-oF specification defines multiple network transports (the "Fabrics"
in NVMe over Fabrics) and has an extensible system for adding new fabrics
in the future. The SPDK NVMe-oF target library implements a plugin system for
network transports to mirror the specification. The API a new transport must
implement is located in `lib/nvmf/transport.h`. As of this writing, only an RDMA
transport has been implemented.
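
From the application side, a transport is created and attached to a target
before the target can listen on an address of that type. The sketch below does
this for RDMA; the option-initialization call is an assumption, since its exact
signature (including whether it takes a size argument) differs between SPDK
releases.

```c
#include "spdk/stdinc.h"
#include "spdk/nvmf.h"

static void
add_transport_done(void *cb_arg, int status)
{
	/* The transport is attached to the target (status == 0) or failed. */
}

/* Create an RDMA transport with default options and attach it to the target. */
static int
attach_rdma_transport(struct spdk_nvmf_tgt *tgt)
{
	struct spdk_nvmf_transport_opts opts = {};
	struct spdk_nvmf_transport *transport;

	/* Note: some SPDK releases omit the size argument here. */
	if (!spdk_nvmf_transport_opts_init("RDMA", &opts, sizeof(opts))) {
		return -1;
	}

	transport = spdk_nvmf_transport_create("RDMA", &opts);
	if (transport == NULL) {
		return -1;
	}

	spdk_nvmf_tgt_add_transport(tgt, transport, add_transport_done, NULL);
	return 0;
}
```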

The SPDK NVMe-oF target is designed to be able to process I/O from multiple
fabrics simultaneously.

## Choosing a Threading Model

The SPDK NVMe-oF target library does not strictly dictate a threading model,
but poll groups do all of their polling and I/O processing on the thread they
are created on. Given that, it almost always makes sense to create one poll
group per thread used in the application.
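
One way to follow that advice is sketched below, under the assumption that the
application has already created one SPDK thread per core it wants to use:

```c
#include "spdk/stdinc.h"
#include "spdk/nvmf.h"
#include "spdk/thread.h"

/* Per-thread context: one poll group per application thread. */
struct app_thread_ctx {
	struct spdk_thread *thread;
	struct spdk_nvmf_poll_group *group;
	struct spdk_nvmf_tgt *tgt;
};

static void
create_group_on_thread(void *arg)
{
	struct app_thread_ctx *ctx = arg;

	/* Runs on ctx->thread, so the new poll group belongs to that thread. */
	ctx->group = spdk_nvmf_poll_group_create(ctx->tgt);
}

/* Ask every application thread to create its own poll group. */
static void
create_poll_groups(struct app_thread_ctx *ctxs, size_t num_threads)
{
	for (size_t i = 0; i < num_threads; i++) {
		spdk_thread_send_msg(ctxs[i].thread, create_group_on_thread, &ctxs[i]);
	}
}
```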

## Scaling Across CPU Cores

Incoming I/O requests are picked up by the poll group that is polling their
assigned qpair. For regular NVMe commands such as READ and WRITE, the I/O
request is processed on the initial thread from start to the point where it is
submitted to the backing storage device, without interruption. Completions are
discovered by polling the backing storage device and are also processed to
completion on the polling thread. **Regular NVMe commands (READ, WRITE, etc.)
do not require any cross-thread coordination, and therefore take no locks.**

NVMe ADMIN commands, which are used for managing the NVMe device itself, may
modify global state in the subsystem. For instance, an NVMe ADMIN command may
perform namespace management, such as shrinking a namespace. For these
commands, the library temporarily pauses the subsystem by sending a message to
each thread in the system. All new incoming I/O on any thread targeting the
subsystem will be queued during this time. Once the subsystem is fully paused,
the state change is applied, and messages are sent to each thread to release
queued I/O and resume. Management commands are rare, so this style of
coordination is preferable to forcing all commands to take locks in the I/O
path.
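
The same broadcast-and-complete pattern is available to library users through
SPDK's thread library. The sketch below is a conceptual illustration of the
coordination described above, not the library's internal code:

```c
#include "spdk/thread.h"

/* Runs once on every SPDK thread: for example, flag a per-thread structure so
 * new I/O destined for the subsystem is queued rather than submitted. */
static void
pause_on_thread(void *ctx)
{
	/* per-thread work goes here */
}

/* Runs once, after every thread has executed pause_on_thread(). At this point
 * the global state change can be applied without locks, after which a resume
 * can be broadcast the same way. */
static void
pause_complete(void *ctx)
{
	/* apply the global state change here */
}

static void
broadcast_pause(void *ctx)
{
	spdk_for_each_thread(pause_on_thread, ctx, pause_complete);
}
```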

## Zero Copy Support

For the RDMA transport, data is transferred from the RDMA NIC to host memory
and then host memory to the SSD (or vice versa), without any intermediate
copies. Data is never moved from one location in host memory to another. Other
transports in the future may require data copies.

## RDMA

The SPDK NVMe-oF RDMA transport is implemented on top of the libibverbs and
rdmacm libraries, which are packaged and available on most Linux
distributions. It does not use a user-space RDMA driver stack through DPDK.

In order to scale to large numbers of connections, the SPDK NVMe-oF RDMA
transport allocates a single RDMA completion queue per poll group. All new
qpairs assigned to the poll group are given their own RDMA send and receive
queues, but share this common completion queue. This allows the poll group to
poll a single queue for incoming messages instead of iterating through each
one.
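
A conceptual sketch of why this helps, using the libibverbs API directly (the
transport's real polling loop does considerably more bookkeeping):

```c
#include <infiniband/verbs.h>

/* One completion queue serves every qpair in the poll group, so a single
 * ibv_poll_cq() call surfaces completions from all of them at once. */
static int
poll_group_poll_cq(struct ibv_cq *shared_cq)
{
	struct ibv_wc wc[32];
	int count;

	count = ibv_poll_cq(shared_cq, 32, wc);
	for (int i = 0; i < count; i++) {
		/* wc[i].wr_id identifies the qpair/request that completed;
		 * dispatch it to the appropriate handler. */
	}
	return count;
}
```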

Each RDMA request is handled by a state machine that walks the request through
a number of states. This keeps the code organized and makes all of the corner
cases much more obvious.
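
The state names below are hypothetical, simplified stand-ins for the real
identifiers in `lib/nvmf/rdma.c`, shown only to illustrate the shape of that
state machine:

```c
/* Illustrative only: a simplified view of the per-request state machine. */
enum rdma_request_state_example {
	REQ_STATE_FREE,                    /* not associated with a command      */
	REQ_STATE_NEW,                     /* capsule received, not yet parsed   */
	REQ_STATE_NEED_BUFFER,             /* waiting for a data buffer          */
	REQ_STATE_TRANSFERRING_FROM_HOST,  /* RDMA READ of write data in flight  */
	REQ_STATE_READY_TO_EXECUTE,        /* data present, ready for the bdev   */
	REQ_STATE_EXECUTING,               /* submitted to the backing bdev      */
	REQ_STATE_TRANSFERRING_TO_HOST,    /* RDMA WRITE of read data in flight  */
	REQ_STATE_COMPLETING,              /* completion SEND posted, awaiting ack */
	REQ_STATE_COMPLETED                /* done; return to the free list      */
};
```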

RDMA SEND, READ, and WRITE operations are ordered with respect to one another,
but RDMA RECVs are not necessarily ordered with SEND acknowledgements. For
instance, it is possible to detect an incoming RDMA RECV message containing a
new NVMe-oF capsule prior to detecting the acknowledgement of a previous SEND
containing an NVMe completion. This is problematic at full queue depth because
there may not yet be a free request structure. To handle this, the RDMA
request structure is broken into two parts - an rdma_recv and an rdma_request.
New RDMA RECVs will always grab a free rdma_recv, but may need to wait in a
queue for a SEND acknowledgement before they can acquire a full rdma_request
object.

Further, RDMA NICs expose different queue depths for READ/WRITE operations
than they do for SEND/RECV operations. The RDMA transport reports available
queue depth based on SEND/RECV operation limits and will queue in software as
necessary to accommodate the (usually lower) limits on READ/WRITE operations.
188