# NVMe over Fabrics Target {#nvmf}

@sa @ref nvme_fabrics_host
@sa @ref tracepoints

## NVMe-oF Target Getting Started Guide {#nvmf_getting_started}

The SPDK NVMe over Fabrics target is a user space application that presents block devices over a fabric
such as Ethernet, InfiniBand or Fibre Channel. SPDK currently supports RDMA and TCP transports.

The NVMe over Fabrics specification defines subsystems that can be exported over different transports.
SPDK has chosen to call the software that exports these subsystems a "target", which is the term used
for iSCSI. The specification refers to the "client" that connects to the target as a "host". Many
people will also refer to the host as an "initiator", which is the equivalent thing in iSCSI
parlance. SPDK will try to stick to the terms "target" and "host" to match the specification.

The Linux kernel also implements an NVMe-oF target and host, and SPDK is tested for
interoperability with the Linux kernel implementations.

If you want to kill the application with a signal, use SIGTERM so that the application can release
all of its shared memory resources before exiting. If it is killed with SIGKILL, the application has
no chance to release these resources, and you may need to clean them up manually.
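
For example, assuming `nvmf_tgt` is the only SPDK target instance running, it can be stopped
gracefully like this:

~~~{.sh}
# Ask the target to shut down cleanly so it can release its shared memory (hugepages).
kill -SIGTERM $(pidof nvmf_tgt)
# After a SIGKILL, stale hugepage files (e.g. under /dev/hugepages) may have to be removed manually.
~~~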

## RDMA transport support {#nvmf_rdma_transport}

The RDMA transport requires an RDMA-capable NIC with its corresponding OFED (OpenFabrics Enterprise
Distribution) software package installed. Many OS distributions provide the required packages, but
OFED is also available [here](https://downloads.openfabrics.org/OFED/).

### Prerequisites {#nvmf_prereqs}

To build nvmf_tgt with the RDMA transport, there are some additional dependencies,
which can be installed using the pkgdep.sh script.

~~~{.sh}
sudo scripts/pkgdep.sh --rdma
~~~

Then build SPDK with RDMA enabled:

~~~{.sh}
./configure --with-rdma <other config parameters>
make
~~~

Once built, the binary will be in `build/bin`.

### Prerequisites for InfiniBand/RDMA Verbs {#nvmf_prereqs_verbs}

Before starting our NVMe-oF target with the RDMA transport we must load the InfiniBand and RDMA modules
that allow userspace processes to use InfiniBand/RDMA verbs directly.

~~~{.sh}
modprobe ib_cm
modprobe ib_core
# Please note that ib_ucm does not exist in newer versions of the kernel and is not required.
modprobe ib_ucm || true
modprobe ib_umad
modprobe ib_uverbs
modprobe iw_cm
modprobe rdma_cm
modprobe rdma_ucm
~~~

### Prerequisites for RDMA NICs {#nvmf_prereqs_rdma_nics}

Before starting our NVMe-oF target we must detect RDMA NICs and assign them IP addresses.

### Finding RDMA NICs and associated network interfaces

~~~{.sh}
ls /sys/class/infiniband/*/device/net
~~~
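
Each RDMA device is listed together with its associated network interface. The device and interface
names below are only illustrative; the output depends on your hardware:

~~~{.sh}
/sys/class/infiniband/mlx5_0/device/net:
eth1

/sys/class/infiniband/mlx5_1/device/net:
eth2
~~~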

#### Mellanox ConnectX-3 RDMA NICs

~~~{.sh}
modprobe mlx4_core
modprobe mlx4_ib
modprobe mlx4_en
~~~

#### Mellanox ConnectX-4 RDMA NICs

~~~{.sh}
modprobe mlx5_core
modprobe mlx5_ib
~~~

#### Assigning IP addresses to RDMA NICs

~~~{.sh}
ifconfig eth1 192.168.100.8 netmask 255.255.255.0 up
ifconfig eth2 192.168.100.9 netmask 255.255.255.0 up
~~~
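
On systems without `ifconfig`, the equivalent configuration can be applied with the iproute2 `ip`
tool (the interface names are examples and will differ on your system):

~~~{.sh}
ip addr add 192.168.100.8/24 dev eth1
ip link set eth1 up
ip addr add 192.168.100.9/24 dev eth2
ip link set eth2 up
~~~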

### RDMA Limitations {#nvmf_rdma_limitations}

Because RDMA NICs limit the number of memory regions that can be registered, the SPDK NVMe-oF
target application may eventually start failing to allocate more DMA-able memory. This is
an imperfection of the DPDK dynamic memory management and is most likely to occur when too
many 2MB hugepages are reserved at runtime. One type of memory bottleneck is the number of NIC memory
regions, e.g., some NICs report as many as 2048 for the maximum number of memory regions. This
gives a 4GB memory limit with 2MB hugepages for the total memory regions. It can be overcome by
using 1GB hugepages or by pre-reserving memory at application startup with the `--mem-size` or `-s`
option. All pre-reserved memory will be registered as a single region, but won't be returned to the
system until the SPDK application is terminated.
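
For example, to pre-reserve 4GB of hugepage memory at startup (the amount is illustrative; size it
to your workload):

~~~{.sh}
build/bin/nvmf_tgt -s 4096
~~~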

Another known issue occurs when using the E810 NICs in RoCE mode. Specifically, the NVMe-oF target
sometimes cannot destroy a qpair, because its posted work requests don't get flushed. This can leave
the NVMe-oF target application unable to terminate cleanly.

## TCP transport support {#nvmf_tcp_transport}

The TCP transport is built into nvmf_tgt by default and does not require any special libraries.

## FC transport support {#nvmf_fc_transport}

To build nvmf_tgt with the FC transport, there is an additional FC LLD (Low Level Driver) code dependency.
Please contact your FC vendor for instructions on obtaining the FC driver module.

### Broadcom FC LLD code

The FC LLD driver for Broadcom FC NVMe capable adapters can be obtained from
https://github.com/ecdufcdrvr/bcmufctdrvr.

### Fetch FC LLD module and then build SPDK with FC enabled

After cloning the SPDK repo and initializing its submodules, build the FC LLD library, which can then
be linked with the FC transport.

~~~{.sh}
git clone https://github.com/spdk/spdk --recursive
git clone https://github.com/ecdufcdrvr/bcmufctdrvr fc
cd fc
make DPDK_DIR=../spdk/dpdk/build SPDK_DIR=../spdk
cd ../spdk
./configure --with-fc=../fc/build
make
~~~

## Configuring the SPDK NVMe over Fabrics Target {#nvmf_config}

An NVMe over Fabrics target can be configured using JSON RPCs.
The basic RPCs needed to configure the NVMe-oF subsystem are detailed below. More information about
working with NVMe over Fabrics specific RPCs can be found on the @ref jsonrpc_components_nvmf_tgt RPC page.

### Using RPCs {#nvmf_config_rpc}

Start the nvmf_tgt application with elevated privileges. Once the target is started,
the nvmf_create_transport RPC can be used to initialize a given transport. Below is an
example where the target is started and configured with two different transports.
The RDMA transport is configured with an I/O unit size of 8192 bytes, a max I/O size of 131072 bytes,
and an in capsule data size of 8192 bytes. The TCP transport is configured with an I/O unit size of
16384 bytes, 8 max qpairs per controller, and an in capsule data size of 8192 bytes.

~~~{.sh}
build/bin/nvmf_tgt
scripts/rpc.py nvmf_create_transport -t RDMA -u 8192 -i 131072 -c 8192
scripts/rpc.py nvmf_create_transport -t TCP -u 16384 -m 8 -c 8192
~~~
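
The created transports and their options can be verified with the `nvmf_get_transports` RPC:

~~~{.sh}
scripts/rpc.py nvmf_get_transports
~~~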

Below is an example of creating a malloc bdev and assigning it to a subsystem. Adjust the bdevs,
NQN, serial number, and IP address for the RDMA transport to your own circumstances. If you replace
"rdma" with "tcp", then the subsystem will add a listener with the TCP transport instead.

~~~{.sh}
scripts/rpc.py bdev_malloc_create -b Malloc0 512 512
scripts/rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -a -s SPDK00000000000001 -d SPDK_Controller1
scripts/rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 Malloc0
scripts/rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t rdma -a 192.168.100.8 -s 4420
~~~
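
The resulting configuration, including the namespace and listener just added, can be inspected with
the `nvmf_get_subsystems` RPC:

~~~{.sh}
scripts/rpc.py nvmf_get_subsystems
~~~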

### NQN Formal Definition

NVMe qualified names or NQNs are defined in section 7.9 of the
[NVMe specification](http://nvmexpress.org/wp-content/uploads/NVM_Express_Revision_1.3.pdf). SPDK has attempted to
formalize that definition using [Extended Backus-Naur form](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form).
SPDK modules use this formal definition (provided below) when validating NQNs.

~~~{.sh}

Basic Types
year = 4 * digit ;
month = '01' | '02' | '03' | '04' | '05' | '06' | '07' | '08' | '09' | '10' | '11' | '12' ;
digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
hex digit = 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | '0' |
'1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;

NQN Definition
NVMe Qualified Name = ( NVMe-oF Discovery NQN | NVMe UUID NQN | NVMe Domain NQN ), '\0' ;
NVMe-oF Discovery NQN = "nqn.2014-08.org.nvmexpress.discovery" ;
NVMe UUID NQN = "nqn.2014-08.org.nvmexpress:uuid:", string UUID ;
string UUID = 8 * hex digit, '-', 3 * (4 * hex digit, '-'), 12 * hex digit ;
NVMe Domain NQN = "nqn.", year, '-', month, '.', reverse domain, ':', utf-8 string ;

~~~


Please note that the following types from the definition above are defined elsewhere:

1. utf-8 string: Defined in [rfc 3629](https://tools.ietf.org/html/rfc3629).
2. reverse domain: Equivalent to domain name as defined in [rfc 1034](https://tools.ietf.org/html/rfc1034).
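
For illustration, each of the following NQNs matches one of the three productions above (the UUID
value is just an example):

~~~{.sh}
nqn.2014-08.org.nvmexpress.discovery
nqn.2014-08.org.nvmexpress:uuid:11111111-2222-3333-4444-555555555555
nqn.2016-06.io.spdk:cnode1
~~~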

While not stated in the formal definition, SPDK enforces the requirement from the spec that the
"maximum name is 223 bytes in length". SPDK does not include the null terminating character when
defining the length of an NQN, and will accept an NQN containing up to 223 valid bytes with an
additional null terminator. To be precise, SPDK follows the same conventions as the C standard
library function [strlen()](http://man7.org/linux/man-pages/man3/strlen.3.html).

#### NQN Comparisons

SPDK compares NQNs byte for byte without case matching or unicode normalization. This has specific implications for
UUID-based NQNs. The following pair of NQNs, for example, would not match when compared in the SPDK NVMe-oF Target:

nqn.2014-08.org.nvmexpress:uuid:11111111-aaaa-bbdd-ffee-123456789abc
nqn.2014-08.org.nvmexpress:uuid:11111111-AAAA-BBDD-FFEE-123456789ABC

In order to ensure the consistency of UUID-based NQNs while using SPDK, users should use lowercase when representing
alphabetic hex digits in their NQNs.

### Assigning CPU Cores to the NVMe over Fabrics Target {#nvmf_config_lcore}

SPDK uses the [DPDK Environment Abstraction Layer](http://dpdk.org/doc/guides/prog_guide/env_abstraction_layer.html)
to gain access to hardware resources such as huge memory pages and CPU core(s). DPDK EAL provides
functions to assign threads to specific cores.
To ensure the SPDK NVMe-oF target has the best performance, configure the NICs and NVMe devices to
be located on the same NUMA node.

The `-m` core mask option specifies a bit mask of the CPU cores that
SPDK is allowed to execute work items on.
For example, to allow SPDK to use cores 24, 25, 26 and 27:
~~~{.sh}
build/bin/nvmf_tgt -m 0xF000000
~~~
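
The mask simply has bits 24 through 27 set. One way to compute such a mask in the shell:

~~~{.sh}
printf '0x%X\n' $(( 0xF << 24 ))   # prints 0xF000000, i.e. cores 24-27
~~~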

## Configuring the Linux NVMe over Fabrics Host {#nvmf_host}

Both the Linux kernel and SPDK implement an NVMe over Fabrics host.
The Linux kernel NVMe-oF host support is provided by the `nvme-rdma` driver (for the RDMA
transport) and the `nvme-tcp` driver (for the TCP transport). The following commands load
these drivers.

~~~{.sh}
modprobe nvme-rdma
modprobe nvme-tcp
~~~

The nvme-cli tool may be used to interface with the Linux kernel NVMe over Fabrics host.
See below for examples of the discover, connect and disconnect commands. In all three instances, the
transport can be changed to TCP by replacing 'rdma' with 'tcp'.

Discovery:
~~~{.sh}
nvme discover -t rdma -a 192.168.100.8 -s 4420
~~~

Connect:
~~~{.sh}
nvme connect -t rdma -n "nqn.2016-06.io.spdk:cnode1" -a 192.168.100.8 -s 4420
~~~

Disconnect:
~~~{.sh}
nvme disconnect -n "nqn.2016-06.io.spdk:cnode1"
~~~
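
After a successful connect, the namespaces exported by the subsystem appear as regular NVMe block
devices on the host and can be listed with:

~~~{.sh}
nvme list
~~~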

## Enabling NVMe-oF target tracepoints for offline analysis and debug {#nvmf_trace}

SPDK has a tracing framework for capturing low-level event information at runtime.
@ref tracepoints enable analysis of both performance and application crashes.

## Enabling NVMe-oF Multipath

The SPDK NVMe-oF target and initiator support multiple independent paths to the same NVMe-oF subsystem.
For step-by-step instructions for configuring and switching between paths, see @ref nvmf_multipath_howto .

## Enabling NVMe-oF TLS

The SPDK NVMe-oF target and initiator support establishing a secure TCP connection using the Transport
Layer Security (TLS) protocol in compliance with the NVMe TCP transport specification. Only version 1.3
of the TLS protocol is supported. This feature is considered experimental.

Currently, it is only possible to establish a fabric secure channel using TLS. The channel is
protected by a symmetric pre-shared key (PSK) using either the `TLS_AES_256_GCM_SHA384` (recommended) or
`TLS_AES_128_GCM_SHA256` cipher suite. The cipher suite is selected based on the hash function
associated with a key. During configuration, the keys are expected to be in the PSK interchange
format (see NVMe TCP transport specification 1.0c, section 3.6.1.5).

The target supports assigning different keys for each host connecting to a given subsystem. It is
also possible for a single host to use different keys for different subsystems. The keys are
expected to be placed in separate files (with permissions configured to allow read/write access
only to the owner) and can be configured using the `--psk` option of the `nvmf_subsystem_add_host`
RPC. Additionally, to allow establishing TLS connections on a given listener, it must be created
with the `--secure-channel` option enabled. It's also worth noting that this option is mutually
exclusive with the `--allow-any-host` subsystem option, and trying to add a listener to such a subsystem
will result in an error.
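
For example, a key file can be restricted to its owner before being passed to the RPC (the file
name is just an example):

~~~{.sh}
chmod 600 key.txt
~~~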

On the initiator side, the key can be specified using the `--psk` option of the
`bdev_nvme_attach_controller` RPC.

Recommendations on the pre-shared keys:

* It is strongly recommended to change the keys at least once a year.
* Use a strong cryptographic random number generator that provides sufficient entropy
  to generate the keys (e.g. HSM).
* Use a single key to secure transmission between two systems only.
* Delete files containing PSKs as soon as they are not needed.

Additionally, it is recommended to follow:
[RFC 9257 'Guidance for External Pre-Shared Key (PSK) Usage in TLS'](https://www.rfc-editor.org/rfc/rfc9257.html)

### Target setup

~~~{.sh}
cat key.txt
NVMeTLSkey-1:01:MDAxMTIyMzM0NDU1NjY3Nzg4OTlhYWJiY2NkZGVlZmZwJEiQ:

build/bin/nvmf_tgt &
scripts/rpc.py nvmf_create_transport -t TCP
scripts/rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -s SPDK00000000000001 -m 10
scripts/rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t tcp -a 127.0.0.1 -s 4420 \
               --secure-channel
scripts/rpc.py nvmf_subsystem_add_host nqn.2016-06.io.spdk:cnode1 nqn.2016-06.io.spdk:host1 \
               --psk key.txt
~~~

### Initiator setup

As an SPDK initiator example, the bdevperf application may be used, because it is built on top of
SPDK's NVMe TCP driver.

~~~{.sh}
cat key.txt
NVMeTLSkey-1:01:MDAxMTIyMzM0NDU1NjY3Nzg4OTlhYWJiY2NkZGVlZmZwJEiQ:

build/examples/bdevperf -m 0x2 -z -r /var/tmp/bdevperf.sock -q 128 -o 4096 -w verify -t 10 &
scripts/rpc.py -s /var/tmp/bdevperf.sock bdev_nvme_attach_controller -b TLSTEST -t tcp -a 127.0.0.1 \
               -s 4420 -f ipv4 -n nqn.2016-06.io.spdk:cnode1 -q nqn.2016-06.io.spdk:host1 \
               --psk key.txt
~~~

The first of the two commands launches bdevperf; the second attempts to construct an NVMe bdev
and establish a TLS connection. Of course, the same PSK must be used on both the target and the
initiator side.
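
Once the controller is attached, it and the bdevs it exposes can be verified over the same RPC
socket before running I/O:

~~~{.sh}
scripts/rpc.py -s /var/tmp/bdevperf.sock bdev_nvme_get_controllers
scripts/rpc.py -s /var/tmp/bdevperf.sock bdev_get_bdevs
~~~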

## NVMe-oF in-band authentication

The NVMe-oF driver and NVMe-oF target both support in-band authentication using the DH-HMAC-CHAP
protocol.  It allows the target to authenticate the host and the host to authenticate the target
(the latter part is optional).

Authentication is performed if a subsystem is configured to allow a host with a set of
DH-HMAC-CHAP keys.  Each host is allowed to use different keys to connect to different subsystems
and each subsystem might use different keys for different hosts.  For instance, the following
configures three hosts, two of which can request bidirectional authentication:

```{.sh}
$ scripts/rpc.py nvmf_subsystem_add_host nqn.2024-05.io.spdk:cnode0 nqn.2024-05.io.spdk:host0 \
    --dhchap-key key0 --dhchap-ctrlr-key ctrlr-key0
$ scripts/rpc.py nvmf_subsystem_add_host nqn.2024-05.io.spdk:cnode0 nqn.2024-05.io.spdk:host1 \
    --dhchap-key key1 --dhchap-ctrlr-key ctrlr-key1
$ scripts/rpc.py nvmf_subsystem_add_host nqn.2024-05.io.spdk:cnode0 nqn.2024-05.io.spdk:host2 \
    --dhchap-key key2
```
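
The key names used above (`key0`, `ctrlr-key0`, etc.) refer to keys registered in SPDK's keyring.
Assuming the DH-HMAC-CHAP secrets are stored in files, they can be registered with the
`keyring_file_add_key` RPC before being referenced (the file paths here are just examples):

```{.sh}
$ scripts/rpc.py keyring_file_add_key key0 /path/to/key0.txt
$ scripts/rpc.py keyring_file_add_key ctrlr-key0 /path/to/ctrlr-key0.txt
```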

Additionally, it's possible to change the keys while preserving existing connections to a subsystem
via `nvmf_subsystem_set_keys`.  After that's done, new connections and reauthentication requests
will be required to use the new keys.

```{.sh}
$ scripts/rpc.py nvmf_subsystem_add_host nqn.2024-05.io.spdk:cnode0 nqn.2024-05.io.spdk:host0 \
    --dhchap-key key0 --dhchap-ctrlr-key ctrlr-key0
# Host nqn.2024-05.io.spdk:host0 connects to subsystem nqn.2024-05.io.spdk:cnode0
$ scripts/rpc.py nvmf_subsystem_set_keys nqn.2024-05.io.spdk:cnode0 nqn.2024-05.io.spdk:host0 \
    --dhchap-key key1 --dhchap-ctrlr-key ctrlr-key1
```

On the host side, the keys are specified when attaching controllers, e.g.:

```{.sh}
$ scripts/rpc.py bdev_nvme_attach_controller -b nvme0 -t tcp -f ipv4 -a 127.0.0.1 -s 4420 \
    -n nqn.2024-05.io.spdk:cnode0 -q nqn.2024-05.io.spdk:host0 --dhchap-key key0 \
    --dhchap-ctrlr-key ctrlr-key0
```

All hash functions/Diffie-Hellman groups defined in the NVMe Base Specification 2.0d are supported
and the algorithms used for a given DH-HMAC-CHAP transaction are negotiated at the beginning.  The
SPDK NVMe-oF target selects the strongest available hash/group depending on its configuration and
the capabilities of a peer.  Users can limit the allowed hash functions and/or Diffie-Hellman groups
via RPCs.  For example, the following limits the target (`nvmf_set_config`) and the driver
(`bdev_nvme_set_options`) to use sha384, sha512 and ffdhe6144, ffdhe8192:

```{.sh}
$ scripts/rpc.py nvmf_set_config --dhchap-digests sha384,sha512 \
    --dhchap-dhgroups ffdhe6144,ffdhe8192
$ scripts/rpc.py bdev_nvme_set_options --dhchap-digests sha384,sha512 \
    --dhchap-dhgroups ffdhe6144,ffdhe8192
```

The NVMe specification describes the method for using in-band authentication in conjunction with
establishing a secure channel (e.g. TLS).  However, that isn't supported currently, so in order to
perform in-band authentication, hosts must connect over regular listeners (i.e. those that weren't
created with the `--secure-channel` option).
404