# NVMe Multipath {#nvme_multipath}

## Introduction

The NVMe bdev module supports two modes: failover and multipath. In failover mode, only one
active connection is maintained, and alternate paths are connected only during switch-over.
This can lead to delays and failed I/O reported to upper layers, but it does reduce the number
of active connections at any given time. In multipath mode, active connections are maintained for
every path and used according to either an active-passive or an active-active policy. Multipath
mode also supports Asymmetric Namespace Access (ANA) and uses it to make policy decisions.

## Design

### Multipath Mode

A user may establish connections on multiple independent paths to the same NVMe-oF subsystem
for NVMe bdevs by calling the `bdev_nvme_attach_controller` RPC multiple times with the same NVMe
bdev controller name. Additionally, the `multipath` parameter for this RPC must be set to
"multipath" when connecting the second or later paths.

For each path created by the `bdev_nvme_attach_controller` RPC, an NVMe-oF controller is created.
Then the set of namespaces presented by that controller is discovered. For each namespace found,
the NVMe bdev module attempts to match it with an existing NVMe bdev. If it finds a match, it adds
the given namespace as an alternate path. If it does not find a match, it creates a new NVMe bdev.

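The match-or-create step described above can be modeled in a few lines of Python. This is a
minimal sketch, not the module's code: it assumes namespaces are matched by namespace ID under the
bdev controller name (as the `Nvme0n1` naming in the Usage section suggests), and the names
`NvmeBdev` and `attach_path` are hypothetical. The real module may compare additional identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class NvmeBdev:
    name: str
    # (path label, namespace ID) pairs that back this bdev
    paths: list = field(default_factory=list)


def attach_path(bdevs: dict, ctrlr_name: str, path_label: str, nsids: list) -> None:
    """Register every namespace reported on one path under ctrlr_name."""
    for nsid in nsids:
        bdev_name = f"{ctrlr_name}n{nsid}"
        bdev = bdevs.get(bdev_name)
        if bdev is None:
            # No match: create a new NVMe bdev for this namespace.
            bdev = bdevs[bdev_name] = NvmeBdev(bdev_name)
        # Matched or freshly created: record this namespace as a path.
        bdev.paths.append((path_label, nsid))


bdevs = {}
attach_path(bdevs, "Nvme0", "192.168.100.8", [1])
attach_path(bdevs, "Nvme0", "192.168.100.9", [1, 2])
```
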
I/O and admin qpairs are necessary to access an NVMe-oF controller. A single admin qpair is created
and is shared by all SPDK threads. To submit I/O without taking locks, an I/O qpair is created for
each SPDK thread as a dynamic context of an I/O channel for an NVMe-oF controller.

For each SPDK thread, the NVMe bdev module creates an I/O channel for an NVMe bdev and provides it to
the upper layer. The I/O channel for the NVMe bdev has an I/O path for each namespace. An I/O path is
an additional abstraction for submitting I/O to a namespace, and consists of an I/O qpair context and a
namespace. If an NVMe bdev has multiple namespaces, the I/O channel for the NVMe bdev has a list of
multiple I/O paths. The I/O channel for the NVMe bdev also has a retry I/O list and a path selection
policy.

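The relationship between threads, I/O channels, and I/O paths can be pictured with the following
Python sketch. It is only an illustration of the layout described above; `IoPath`, `BdevChannel`,
and `get_channel` are hypothetical names, and qpairs are represented by plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class IoPath:
    qpair: str  # stands in for the per-thread I/O qpair context
    nsid: int   # namespace reached through that qpair


@dataclass
class BdevChannel:
    policy: str = "active_passive"
    io_paths: list = field(default_factory=list)       # one IoPath per namespace
    retry_io_list: list = field(default_factory=list)  # I/Os waiting to be retried


def get_channel(channels: dict, thread: str, ctrlrs: list) -> BdevChannel:
    """Create the channel for a thread on first use, with one I/O path per controller."""
    if thread not in channels:
        ch = BdevChannel()
        for ctrlr in ctrlrs:
            # Each thread gets its own qpair per controller, so no lock is shared.
            ch.io_paths.append(IoPath(qpair=f"{ctrlr}/{thread}", nsid=1))
        channels[thread] = ch
    return channels[thread]


channels = {}
ch0 = get_channel(channels, "thread0", ["ctrlr_a", "ctrlr_b"])
ch1 = get_channel(channels, "thread1", ["ctrlr_a", "ctrlr_b"])
```
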
### Path Error Recovery

If the NVMe driver detects an error on a qpair, it disconnects the qpair and reports the error to
the NVMe bdev module. The NVMe bdev module then starts resetting the corresponding NVMe-oF controller.
The NVMe-oF controller reset consists of the following steps:

1. Disconnect and delete all I/O qpairs.
2. Disconnect the admin qpair.
3. Connect the admin qpair.
4. Configure the NVMe-oF controller.
5. Create and connect all I/O qpairs.

If step 3, 4, or 5 fails, the reset reverts to step 3 and is retried after
`reconnect_delay_sec` seconds. The NVMe-oF controller is deleted automatically if it does not
recover within `ctrlr_loss_timeout_sec` seconds. If `ctrlr_loss_timeout_sec` is -1, the reset is
retried indefinitely.

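The interaction of the two timers can be sketched as a schedule of reconnect attempts. The
function below is a simplified model under stated assumptions: each attempt is treated as
instantaneous, an attempt that would fall after `ctrlr_loss_timeout_sec` is never made, and the
function name and the `limit` parameter are hypothetical (the latter only keeps the "retry
forever" case finite).

```python
def reconnect_attempt_times(reconnect_delay_sec, ctrlr_loss_timeout_sec, limit=5):
    """Return the times (seconds after the failed reset) at which reconnects are tried."""
    times = []
    t = reconnect_delay_sec
    while len(times) < limit:
        if ctrlr_loss_timeout_sec != -1 and t > ctrlr_loss_timeout_sec:
            break  # the controller would already have been deleted
        times.append(t)
        t += reconnect_delay_sec
    return times
```

For example, `reconnect_attempt_times(20, 50)` yields attempts at 20 and 40 seconds, while a
timeout of -1 keeps producing attempts until `limit` is reached.
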
By default, error detection on a qpair is very slow for TCP and RDMA transports. For faster error
detection, the global option `transport_ack_timeout` is useful.

### Path Selection

Multipath mode supports two path selection policies: active-passive and active-active.

For both path selection policies, only ANA optimal I/O paths are used unless there are no ANA
optimal I/O paths available.

For the active-passive policy, each I/O channel for an NVMe bdev caches the first I/O path it finds
that is connected and ANA optimal, and uses it for I/O submission. Some users may want
to specify the preferred I/O path manually. They can dynamically set the preferred I/O path using
the `bdev_nvme_set_preferred_path` RPC. The assignment is implemented by moving the
I/O path to the head of the I/O path list. By default, if the preferred I/O path is restored,
failback to it is done automatically. Automatic failback can be disabled with the global option
`disable_auto_failback`. In this case, the `bdev_nvme_set_preferred_path` RPC can be used
to fail back manually.

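A rough Python model of active-passive selection is shown below. It is a sketch, not SPDK code:
the per-channel cache is omitted, ANA state is reduced to a single boolean, and the names
`IoPath`, `select_active_passive`, and `set_preferred_path` are hypothetical. It illustrates why
moving a path to the head of the list is enough to make it preferred.

```python
from dataclasses import dataclass


@dataclass
class IoPath:
    name: str
    connected: bool = True
    ana_optimized: bool = True


def select_active_passive(io_paths):
    """First connected, ANA-optimal path; else first connected path; else None."""
    fallback = None
    for path in io_paths:
        if not path.connected:
            continue
        if path.ana_optimized:
            return path
        if fallback is None:
            fallback = path  # used only when no optimal path exists
    return fallback


def set_preferred_path(io_paths, name):
    """Move the named path to the head of the list so it is found first."""
    for i, path in enumerate(io_paths):
        if path.name == name:
            io_paths.insert(0, io_paths.pop(i))
            return True
    return False


paths = [IoPath("a"), IoPath("b"), IoPath("c", ana_optimized=False)]
```
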
The active-active policy uses either the round-robin algorithm or the minimum queue depth algorithm.
The round-robin algorithm submits I/Os to the I/O paths in circular order. The minimum queue depth
algorithm selects an I/O path and submits I/Os to it according to the number of outstanding I/Os
on each I/O qpair. For these path selection algorithms, the number of I/Os routed to the current I/O
path before switching to another I/O path is configurable.

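The two selectors can be illustrated with the following sketch. It is a simplified model rather
than the module's implementation: paths are represented by list indices, all paths are assumed
usable, and the names `RoundRobinSelector`, `select_min_queue_depth`, and `rr_min_io` (the number
of I/Os sent to one path before switching) are chosen here for illustration.

```python
class RoundRobinSelector:
    """Send rr_min_io I/Os to one path, then advance to the next in circular order."""

    def __init__(self, num_paths, rr_min_io=1):
        self.num_paths = num_paths
        self.rr_min_io = rr_min_io
        self.current = 0
        self.sent_on_current = 0

    def next_path(self):
        if self.sent_on_current >= self.rr_min_io:
            # Quota for the current path is used up: move to the next path.
            self.current = (self.current + 1) % self.num_paths
            self.sent_on_current = 0
        self.sent_on_current += 1
        return self.current


def select_min_queue_depth(outstanding):
    """Return the index of the path whose qpair has the fewest outstanding I/Os."""
    return min(range(len(outstanding)), key=lambda i: outstanding[i])
```

With two paths and `rr_min_io=2`, successive calls to `next_path()` return 0, 0, 1, 1, 0, 0.
In this sketch, ties in `select_min_queue_depth` go to the lowest index.
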
### I/O Retry

The NVMe bdev module has a global option, `bdev_retry_count`, to control the number of retries when
an I/O is returned with an error. Each I/O has a retry count. If the retry count of an I/O is less than
`bdev_retry_count`, the I/O is allowed to retry and the retry count is incremented.

NOTE: `bdev_retry_count` is not used directly by the failover process in multipath mode, but it must
be non-zero for failover to a different path to work, because the retry count is always checked first
when an I/O is returned with an error.

Each I/O has a timer to schedule an I/O retry at a particular time in the future. Each I/O channel
for an NVMe bdev has a sorted I/O retry list. Retried I/Os are inserted into the I/O retry list.

If an I/O is returned with an error, the I/O completion handler in the NVMe bdev module executes the
following steps:

1. If the DNR (Do Not Retry) bit is set or the retry count has reached the limit, complete the
   I/O with the returned error.
2. If the error is a path error, insert the I/O into the I/O retry list with no delay.
3. Otherwise, insert the I/O into the I/O retry list with the delay reported by the CRD (Command
   Retry Delay).

Then the I/O retry poller is scheduled for the closest expiration. If there are no retried I/Os,
the I/O retry poller is stopped.

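The completion-handler steps and the retry list can be sketched as follows. This is a minimal
model under assumptions: the sorted retry list is represented by a heap ordered by due time,
times are in milliseconds, and `FailedIo`, `handle_error`, and `next_poller_expiration` are
hypothetical names rather than SPDK functions.

```python
import heapq
from dataclasses import dataclass


@dataclass
class FailedIo:
    dnr: bool          # Do Not Retry bit from the completion status
    path_error: bool   # True if the status is a path-related error
    crd_delay_ms: int  # delay derived from CRD (0 if none)
    retry_count: int = 0


def handle_error(io, now_ms, retry_list, bdev_retry_count):
    """Return "complete" or "retry"; on retry, queue the I/O by its due time."""
    if io.dnr or io.retry_count >= bdev_retry_count:
        return "complete"
    io.retry_count += 1
    delay = 0 if io.path_error else io.crd_delay_ms
    # id(io) breaks ties so heapq never has to compare FailedIo objects.
    heapq.heappush(retry_list, (now_ms + delay, id(io), io))
    return "retry"


def next_poller_expiration(retry_list):
    """Earliest due time, or None when the poller can be stopped."""
    return retry_list[0][0] if retry_list else None
```
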
When submitting an I/O, there may be no available I/O path. If any I/O path is
recovering, the I/O is inserted into the I/O retry list with a one-second delay. This may result in
queueing many I/Os indefinitely. To avoid such indefinite queueing, a per-NVMe-oF-controller option,
`fast_io_fail_timeout_sec`, is provided. If the corresponding NVMe-oF controller does not recover
within `fast_io_fail_timeout_sec` seconds, the I/O is not queued to wait for the recovery but is returned
with an I/O error to the upper layer.

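The queue-or-fail decision can be written as a small function. This sketch makes two assumptions
that the text above does not state: a `fast_io_fail_timeout_sec` of 0 means the option is not set,
and an I/O fails immediately when no path is recovering. The function name `on_no_io_path` is
hypothetical.

```python
def on_no_io_path(any_path_recovering, sec_since_failure, fast_io_fail_timeout_sec):
    """Decide what to do with an I/O when no I/O path is currently usable.

    Returns ("queue", delay_sec) or ("fail", None).
    """
    if not any_path_recovering:
        return ("fail", None)
    if fast_io_fail_timeout_sec and sec_since_failure >= fast_io_fail_timeout_sec:
        # Recovery has taken too long: stop queueing and report the error.
        return ("fail", None)
    return ("queue", 1)  # retry in one second
```
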
### Asymmetric Namespace Access (ANA) Handling

If an I/O is returned with an ANA error or an ANA change notice event is received, the ANA log page
may have changed. In this case, the NVMe bdev module reads the ANA log page to check for ANA state
changes.

As described above, only ANA optimal I/O paths are used unless there are no ANA optimal I/O paths
available.

If an I/O path is in ANA transition, i.e., its namespace reports the ANA inaccessible state or the ANA
change state, the NVMe bdev module queues I/Os to wait until the namespace becomes accessible again.
The ANA transition should end within ANATT (ANA Transition Time) seconds. If the namespace does
not report the ANA optimized state or the ANA accessible state within ANATT seconds, I/Os are
returned with an I/O error to the upper layer.

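The handling of a path in ANA transition can be sketched per path as below. This is a simplified
model with assumptions: the state names are illustrative strings, the non-optimized state is
treated as accessible, any other state (for example persistent loss) fails the I/O, and the
existence of alternative paths is ignored. The function name `ana_action` is hypothetical.

```python
def ana_action(ana_state, sec_in_transition, anatt_sec):
    """Map one path's ANA state to "submit", "queue", or "fail"."""
    if ana_state in ("optimized", "non_optimized"):
        return "submit"
    if ana_state in ("inaccessible", "change"):
        # In transition: wait, but only for up to ANATT seconds.
        return "queue" if sec_in_transition < anatt_sec else "fail"
    return "fail"
```
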
### I/O Timeout

The NVMe driver supports I/O timeout for submitted I/Os. The NVMe bdev module provides three
actions to take when the NVMe driver reports an I/O timeout: ABORT, RESET, or NONE. Users can
choose one of the actions with the global option `action_on_timeout`. Users can set different timeout
values for I/O commands and admin commands with the global options `timeout_us` and `timeout_admin_us`.

For ABORT, the NVMe bdev module tries to abort the timed-out I/O and, if the abort fails, starts the
NVMe-oF controller reset. For RESET, the NVMe bdev module starts the NVMe-oF controller reset.

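The dispatch on `action_on_timeout` can be sketched as follows. The callbacks `try_abort` and
`reset_ctrlr` are hypothetical stand-ins for the abort and reset operations, the lowercase action
strings are chosen for illustration, and treating NONE as "take no action" is an assumption based
on its name.

```python
def on_io_timeout(action_on_timeout, try_abort, reset_ctrlr):
    """Dispatch on the configured action; try_abort() returns True on success."""
    if action_on_timeout == "abort":
        if not try_abort():
            reset_ctrlr()  # abort failed: fall back to a controller reset
    elif action_on_timeout == "reset":
        reset_ctrlr()
    elif action_on_timeout != "none":
        raise ValueError(f"unknown action: {action_on_timeout}")
```
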
## Usage

The following example attaches two NVMe-oF controllers and aggregates them into a single
NVMe bdev controller `Nvme0`.

```bash
./scripts/rpc.py bdev_nvme_attach_controller -b Nvme0 -t rdma -a 192.168.100.8 -s 4420 -f ipv4 -n nqn.2016-06.io.spdk:cnode1 -l -1 -o 20
./scripts/rpc.py bdev_nvme_attach_controller -b Nvme0 -t rdma -a 192.168.100.9 -s 4420 -f ipv4 -n nqn.2016-06.io.spdk:cnode1 -l -1 -o 20 -x multipath
```

In this example, if these two NVMe-oF controllers have a shared namespace whose namespace ID is 1,
a single NVMe bdev `Nvme0n1` is created. For the NVMe bdev module, the default value of
`bdev_retry_count` is 3 and I/O retry is enabled by default. `ctrlr_loss_timeout_sec` is set to -1
and `reconnect_delay_sec` is set to 20. Hence, the NVMe-oF controller reconnect is retried once
every 20 seconds until it succeeds.

To confirm that multipath is configured correctly, two RPCs, `bdev_get_bdevs` and
`bdev_nvme_get_controllers`, are available.

```bash
./scripts/rpc.py bdev_get_bdevs -b Nvme0n1
./scripts/rpc.py bdev_nvme_get_controllers -n Nvme0
```

To monitor the current multipath state, the `bdev_nvme_get_io_paths` RPC is available.

```bash
./scripts/rpc.py bdev_nvme_get_io_paths -n Nvme0n1
```

To configure the path selection policy, the `bdev_nvme_set_multipath_policy` RPC is available.
The following example sets, for a single NVMe bdev `Nvme0n1`, the path selection policy to
active-active, the path selector to round-robin, and the number of I/Os routed to the
current I/O path before switching to another I/O path to 10.

```bash
./scripts/rpc.py bdev_nvme_set_multipath_policy -b Nvme0n1 -p active_active -s round_robin -r 10
```

## Limitations

SPDK NVMe multipath is transport protocol independent. Heterogeneous multipath configurations (e.g.,
TCP and RDMA) are supported. However, in this type of configuration, memory domains are not available
yet because memory domains are currently supported only by the RDMA transport.

The RPCs `bdev_get_iostat` and `bdev_nvme_get_transport_statistics` display I/O statistics, but
neither is aware of multipath.