# NVMe over Fabrics Target Programming Guide {#nvmf_tgt_pg}

## Target Audience

This programming guide is intended for developers authoring applications that
use the SPDK NVMe-oF target library (`lib/nvmf`). It provides background
context, architectural insight, and design recommendations. This guide does
not cover how to use the SPDK NVMe-oF target application. For a guide on how
to use the existing application as-is, see @ref nvmf.

## Introduction

The SPDK NVMe-oF target library is located in `lib/nvmf`. The library
implements all logic required to create an NVMe-oF target application. It is
used in the implementation of the example NVMe-oF target application in
`app/nvmf_tgt`, but is intended to be consumed independently.

This guide assumes that the reader is familiar with both NVMe and NVMe over
Fabrics. The best way to become familiar with those is to read their
[specifications](https://nvmexpress.org/specifications/).

## Primitives

The library exposes a number of primitives - basic objects that the user
creates and interacts with. They are:

`struct spdk_nvmf_tgt`: An NVMe-oF target. This concept, surprisingly, does
not appear in the NVMe-oF specification. SPDK defines this to mean the
collection of subsystems with the associated namespaces, plus the set of
transports and their associated network connections. This will be referred to
throughout this guide as a **target**.

`struct spdk_nvmf_subsystem`: An NVMe-oF subsystem, as defined by the NVMe-oF
specification. Subsystems contain namespaces and controllers and perform
access control. This will be referred to throughout this guide as a
**subsystem**.

`struct spdk_nvmf_ns`: An NVMe-oF namespace, as defined by the NVMe-oF
specification. Namespaces are **bdevs**. See @ref bdev for an explanation of
the SPDK bdev layer. This will be referred to throughout this guide as a
**namespace**.

`struct spdk_nvmf_qpair`: An NVMe-oF queue pair, as defined by the NVMe-oF
specification. These map 1:1 to network connections. This will be referred to
throughout this guide as a **qpair**.

`struct spdk_nvmf_transport`: An abstraction for a network fabric, as defined
by the NVMe-oF specification. The specification is designed to allow for many
different network fabrics, so the code mirrors that and implements a plugin
system. Currently, only the RDMA transport is available. This will be referred
to throughout this guide as a **transport**.

`struct spdk_nvmf_poll_group`: An abstraction for a collection of network
connections that can be polled as a unit. This is an SPDK-defined concept that
does not appear in the NVMe-oF specification. Often, network transports have
facilities to check for incoming data on groups of connections more
efficiently than checking each one individually (e.g. epoll), so poll groups
provide a generic abstraction for that. This will be referred to throughout
this guide as a **poll group**.

`struct spdk_nvmf_listener`: A network address at which the target will accept
new connections.

`struct spdk_nvmf_host`: An NVMe-oF NQN representing a host (initiator)
system. This is used for access control.

## The Basics

A user of the NVMe-oF target library begins by creating a target using
spdk_nvmf_tgt_create(), setting up a set of addresses on which to accept
connections by calling spdk_nvmf_tgt_listen_ext(), then creating a subsystem
using spdk_nvmf_subsystem_create().

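The flow above can be sketched as follows. This is a minimal sketch rather
than a complete application: error handling is omitted, the target name, NQN,
and address values are illustrative, and exact option-structure fields vary
between SPDK versions.

```c
#include <stdio.h>

#include "spdk/nvme.h"
#include "spdk/nvmf.h"

static struct spdk_nvmf_subsystem *
create_target_and_subsystem(void)
{
	struct spdk_nvmf_target_opts tgt_opts = {};
	struct spdk_nvme_transport_id trid = {};
	struct spdk_nvmf_listen_opts listen_opts = {};
	struct spdk_nvmf_tgt *tgt;

	/* "my_tgt" is an illustrative name. */
	snprintf(tgt_opts.name, sizeof(tgt_opts.name), "my_tgt");
	tgt = spdk_nvmf_tgt_create(&tgt_opts);

	/* Accept new connections on an RDMA address (illustrative values). */
	spdk_nvme_trid_populate_transport(&trid, SPDK_NVME_TRANSPORT_RDMA);
	trid.adrfam = SPDK_NVMF_ADRFAM_IPV4;
	snprintf(trid.traddr, sizeof(trid.traddr), "192.168.0.1");
	snprintf(trid.trsvcid, sizeof(trid.trsvcid), "4420");
	spdk_nvmf_listen_opts_init(&listen_opts, sizeof(listen_opts));
	spdk_nvmf_tgt_listen_ext(tgt, &trid, &listen_opts);

	/* Create a subsystem that will expose namespaces to hosts. */
	return spdk_nvmf_subsystem_create(tgt, "nqn.2016-06.io.spdk:cnode1",
					  SPDK_NVMF_SUBTYPE_NVME, 0);
}
```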
Subsystems begin in an inactive state and must be activated by calling
spdk_nvmf_subsystem_start(). Subsystems may be modified at run time, but only
when in the paused or inactive state. A running subsystem may be paused by
calling spdk_nvmf_subsystem_pause() and resumed by calling
spdk_nvmf_subsystem_resume().

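These state changes are asynchronous and report completion through a callback.
A hedged sketch (the exact signatures differ between SPDK versions - newer
releases, for example, add a namespace ID argument to
spdk_nvmf_subsystem_pause()):

```c
#include "spdk/nvmf.h"

/* Completion callback for subsystem state changes. */
static void
state_change_done(struct spdk_nvmf_subsystem *subsystem, void *cb_arg,
		  int status)
{
	if (status != 0) {
		/* The state change failed; the subsystem remains in its
		 * prior state. */
		return;
	}
	/* If this completed a pause, the subsystem may now be modified;
	 * if it completed a start or resume, it is serving I/O again. */
}

static void
activate(struct spdk_nvmf_subsystem *subsystem)
{
	/* Begin activation; state_change_done fires once it is running. */
	spdk_nvmf_subsystem_start(subsystem, state_change_done, NULL);
}
```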
Namespaces may be added to the subsystem by calling
spdk_nvmf_subsystem_add_ns_ext() when the subsystem is inactive or paused.
Namespaces are bdevs. See @ref bdev for more information about the SPDK bdev
layer. A bdev may be obtained by calling spdk_bdev_get_by_name().

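For example (a sketch; "Malloc0" is a hypothetical bdev name and the
subsystem is assumed to already be inactive or paused):

```c
#include "spdk/nvmf.h"

/* Sketch: add the bdev named "Malloc0" (hypothetical) as a namespace.
 * Leaving nsid at its default lets the target pick the namespace ID. */
static uint32_t
add_namespace(struct spdk_nvmf_subsystem *subsystem)
{
	struct spdk_nvmf_ns_opts ns_opts;

	spdk_nvmf_ns_opts_get_defaults(&ns_opts, sizeof(ns_opts));
	/* Returns the new namespace ID, or 0 on failure. The final
	 * argument is an optional reservation persistence file. */
	return spdk_nvmf_subsystem_add_ns_ext(subsystem, "Malloc0", &ns_opts,
					      sizeof(ns_opts), NULL);
}
```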
Once a subsystem exists and the target is listening on an address, new
connections will be automatically assigned to poll groups as they are
detected.

All I/O to a subsystem is driven by a poll group, which polls for incoming
network I/O. Poll groups may be created by calling
spdk_nvmf_poll_group_create(). They automatically request to begin polling
upon creation on the thread from which they were created. Most importantly, *a
poll group may only be accessed from the thread on which it was created.*

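Because of that thread-affinity rule, a common pattern is to send the creation
to the SPDK thread that will own the group. A sketch (the helper names here
are illustrative, not part of the library):

```c
#include "spdk/nvmf.h"
#include "spdk/thread.h"

/* Runs on the owning thread: the group begins polling as soon as it is
 * created and must only be accessed from this thread afterwards. */
static void
create_poll_group(void *ctx)
{
	struct spdk_nvmf_tgt *tgt = ctx;
	struct spdk_nvmf_poll_group *group;

	group = spdk_nvmf_poll_group_create(tgt);
	if (group == NULL) {
		/* Poll group creation failed. */
	}
}

/* Illustrative helper: dispatch create_poll_group() to its owning thread. */
static void
start_poll_group_on(struct spdk_thread *thread, struct spdk_nvmf_tgt *tgt)
{
	spdk_thread_send_msg(thread, create_poll_group, tgt);
}
```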
## Access Control

Access control is performed at the subsystem level by adding allowed listen
addresses and hosts to a subsystem (see spdk_nvmf_subsystem_add_listener() and
spdk_nvmf_subsystem_add_host()). By default, a subsystem will not accept
connections from any host or over any established listen address. Listeners
and hosts may only be added to inactive or paused subsystems.

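A sketch of allowing one host NQN and one listen address (the host NQN is
hypothetical, the subsystem is assumed to be inactive or paused, and these
signatures vary between SPDK versions - newer releases give
spdk_nvmf_subsystem_add_listener() a completion callback and add option
arguments to spdk_nvmf_subsystem_add_host()):

```c
#include "spdk/nvmf.h"

static void
listen_done(void *ctx, int status)
{
	/* status == 0 means the listener was attached to the subsystem. */
}

static void
allow_host_and_listener(struct spdk_nvmf_subsystem *subsystem,
			struct spdk_nvme_transport_id *trid)
{
	/* "nqn.2016-06.io.spdk:host1" is a hypothetical host NQN. */
	spdk_nvmf_subsystem_add_host(subsystem, "nqn.2016-06.io.spdk:host1");

	/* trid must match an address the target is already listening on. */
	spdk_nvmf_subsystem_add_listener(subsystem, trid, listen_done, NULL);
}
```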
## Discovery Subsystems

A discovery subsystem, as defined by the NVMe-oF specification, is
automatically created for each NVMe-oF target constructed. Connections to the
discovery subsystem are handled in the same way as any other subsystem.

## Transports

The NVMe-oF specification defines multiple network transports (the "Fabrics"
in NVMe over Fabrics) and has an extensible system for adding new fabrics in
the future. The SPDK NVMe-oF target library implements a plugin system for
network transports to mirror the specification. The API a new transport must
implement is located in `lib/nvmf/transport.h`. As of this writing, only an
RDMA transport has been implemented.

The SPDK NVMe-oF target is designed to process I/O from multiple fabrics
simultaneously.

## Choosing a Threading Model

The SPDK NVMe-oF target library does not strictly dictate a threading model,
but poll groups do all of their polling and I/O processing on the thread they
are created on. Given that, it almost always makes sense to create one poll
group per thread used in the application.

## Scaling Across CPU Cores

Incoming I/O requests are picked up by the poll group polling their assigned
qpair. For regular NVMe commands such as READ and WRITE, the I/O request is
processed on the initial thread from start to the point where it is submitted
to the backing storage device, without interruption. Completions are
discovered by polling the backing storage device and also processed to
completion on the polling thread. **Regular NVMe commands (READ, WRITE, etc.)
do not require any cross-thread coordination, and therefore take no locks.**

NVMe ADMIN commands, which are used for managing the NVMe device itself, may
modify global state in the subsystem. For instance, an NVMe ADMIN command may
perform namespace management, such as shrinking a namespace. For these
commands, the subsystem will temporarily enter a paused state by sending a
message to each thread in the system. All new incoming I/O on any thread
targeting the subsystem will be queued during this time. Once the subsystem is
fully paused, the state change will occur, and messages will be sent to each
thread to release queued I/O and resume. Management commands are rare, so this
style of coordination is preferable to forcing all commands to take locks in
the I/O path.

## Zero Copy Support

For the RDMA transport, data is transferred from the RDMA NIC to host memory
and then from host memory to the SSD (or vice versa), without any intermediate
copies. Data is never moved from one location in host memory to another. Other
transports in the future may require data copies.

## RDMA

The SPDK NVMe-oF RDMA transport is implemented on top of the libibverbs and
rdmacm libraries, which are packaged and available on most Linux
distributions. It does not use a user-space RDMA driver stack through DPDK.

In order to scale to large numbers of connections, the SPDK NVMe-oF RDMA
transport allocates a single RDMA completion queue per poll group. All new
qpairs assigned to the poll group are given their own RDMA send and receive
queues, but share this common completion queue. This allows the poll group to
poll a single queue for incoming messages instead of iterating through each
one.

Each RDMA request is handled by a state machine that walks the request through
a number of states. This keeps the code organized and makes all of the corner
cases much more obvious.

RDMA SEND, READ, and WRITE operations are ordered with respect to one another,
but RDMA RECVs are not necessarily ordered with SEND acknowledgements. For
instance, it is possible to detect an incoming RDMA RECV message containing a
new NVMe-oF capsule prior to detecting the acknowledgement of a previous SEND
containing an NVMe completion. This is problematic at full queue depth because
there may not yet be a free request structure. To handle this, the RDMA
request structure is broken into two parts - an rdma_recv and an rdma_request.
New RDMA RECVs will always grab a free rdma_recv, but may need to wait in a
queue for a SEND acknowledgement before they can acquire a full rdma_request
object.

Further, RDMA NICs expose different queue depths for READ/WRITE operations
than they do for SEND/RECV operations. The RDMA transport reports available
queue depth based on SEND/RECV operation limits and will queue in software as
necessary to accommodate (usually lower) limits on READ/WRITE operations.