# Flash Translation Layer {#ftl}

The Flash Translation Layer library provides block device access on top of non-block SSDs
implementing the Open Channel interface. It handles the logical-to-physical address mapping,
responds to asynchronous media management events, and manages the defragmentation process.

# Terminology {#ftl_terminology}

## Logical to physical address map

 * Shorthand: L2P

Contains the mapping of the logical addresses (LBAs) to their on-disk physical locations (PPAs). The
LBAs are contiguous and in the range from 0 to the number of surfaced blocks (the number of spare
blocks is calculated during device formation and subtracted from the available address space). The
spare blocks account for chunks going offline throughout the lifespan of the device, as well as
provide the necessary buffer for data [defragmentation](#ftl_reloc).
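
Conceptually, the L2P is a flat lookup table indexed by LBA. The following is a minimal C sketch of
the idea; the type and function names (`ppa_t`, `l2p_set()`, and so on) are hypothetical and do not
correspond to the actual SPDK implementation:

```
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical packed physical address (group, parallel unit, chunk, block offset). */
typedef uint64_t ppa_t;
#define PPA_INVALID ((ppa_t)-1)

struct l2p_map {
    uint64_t num_lbas; /* number of surfaced blocks (spare blocks excluded) */
    ppa_t *table;      /* table[lba] -> current physical location */
};

static int l2p_init(struct l2p_map *l2p, uint64_t num_lbas)
{
    l2p->num_lbas = num_lbas;
    l2p->table = malloc(num_lbas * sizeof(ppa_t));
    if (l2p->table == NULL) {
        return -1;
    }
    for (uint64_t i = 0; i < num_lbas; i++) {
        l2p->table[i] = PPA_INVALID; /* LBA has never been written */
    }
    return 0;
}

/* A write installs the new physical location, implicitly invalidating the old one. */
static void l2p_set(struct l2p_map *l2p, uint64_t lba, ppa_t ppa)
{
    l2p->table[lba] = ppa;
}

static ppa_t l2p_get(const struct l2p_map *l2p, uint64_t lba)
{
    return l2p->table[lba];
}
```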

## Band {#ftl_band}

A band describes a collection of chunks, each belonging to a different parallel unit. All writes to
the band follow the same pattern - a batch of logical blocks is written to one chunk, another batch
to the next one and so on. This ensures the parallelism of the write operations, as they can be
executed independently on different chunks. Each band keeps track of the LBAs it consists of, as
well as their validity, as some of the data will be invalidated by subsequent writes to the same
logical address. The L2P mapping can be restored from the SSD by reading this information in order
from the oldest band to the youngest.

             +--------------+        +--------------+                        +--------------+
    band 1   |   chunk 1    +--------+     chk 1    +---- --- --- --- --- ---+     chk 1    |
             +--------------+        +--------------+                        +--------------+
    band 2   |   chunk 2    +--------+     chk 2    +---- --- --- --- --- ---+     chk 2    |
             +--------------+        +--------------+                        +--------------+
    band 3   |   chunk 3    +--------+     chk 3    +---- --- --- --- --- ---+     chk 3    |
             +--------------+        +--------------+                        +--------------+
             |     ...      |        |     ...      |                        |     ...      |
             +--------------+        +--------------+                        +--------------+
    band m   |   chunk m    +--------+     chk m    +---- --- --- --- --- ---+     chk m    |
             +--------------+        +--------------+                        +--------------+
             |     ...      |        |     ...      |                        |     ...      |
             +--------------+        +--------------+                        +--------------+

              parallel unit 1              pu 2                                    pu n

The address map and the valid map are, along with several other pieces of information (e.g. the UUID
of the device they're part of, the number of surfaced LBAs, the band's sequence number, etc.), part
of the band's metadata. The metadata is split into two parts:
 * the head part, containing information already known when opening the band (the device's UUID, the
   band's sequence number, etc.), located in the first blocks of the band,
 * the tail part, containing the address map and the valid map, located at the end of the band.


       head metadata               band's data               tail metadata
    +-------------------+-------------------------------+----------------------+
    |chk 1|...|chk n|...|...|chk 1|...|                 | ... |chk  m-1 |chk  m|
    |lbk 1|   |lbk 1|   |   |lbk x|   |                 |     |lblk y   |lblk y|
    +-------------------+-------------+-----------------+----------------------+


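The layout shown above could be modelled roughly as below. This is a simplified C sketch with
hypothetical field names, not the on-disk format used by SPDK:

```
#include <stdint.h>

#define FTL_UUID_SIZE 16

/* Head metadata: written to the first blocks of the band; everything in it is
 * already known when the band is opened. */
struct band_md_head {
    uint8_t  device_uuid[FTL_UUID_SIZE]; /* UUID of the FTL device the band is part of */
    uint64_t seq;                        /* band's sequence number (write order) */
    uint64_t num_lbas;                   /* number of surfaced LBAs on the device */
};

/* Tail metadata: written to the last blocks of the band, only known once all
 * user data has been written. */
struct band_md_tail {
    uint64_t *lba_map;   /* LBA stored in each of the band's blocks */
    uint64_t *valid_map; /* bitmap marking which blocks still hold current data */
};
```
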
Bands are written sequentially (in the manner described earlier). Before a band can be written to,
all of its chunks need to be erased. During that time, the band is considered to be in the `PREP`
state. After that is done, the band transitions to the `OPENING` state, in which the head metadata
is written. Then the band moves to the `OPEN` state and actual user data can be written to the band.
Once the whole available space is filled, the tail metadata is written and the band transitions to
the `CLOSING` state. When that finishes, the band becomes `CLOSED`.
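
This lifecycle amounts to a simple state machine. The enum below is only an illustration; the state
names are hypothetical:

```
/* Illustrative band lifecycle; the actual names used by SPDK may differ. */
enum band_state {
    BAND_STATE_PREP,    /* chunks belonging to the band are being erased */
    BAND_STATE_OPENING, /* head metadata is being written */
    BAND_STATE_OPEN,    /* user data is being written */
    BAND_STATE_CLOSING, /* tail metadata is being written */
    BAND_STATE_CLOSED,  /* fully written; only reads and relocation from now on */
};
```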

## Ring write buffer {#ftl_rwb}

 * Shorthand: RWB

Because the smallest write size the SSD may support can be a multiple of the block size, in order to
support writes to a single block, the data needs to be buffered. The write buffer is the solution to
this problem. It consists of a number of pre-allocated buffers called batches, each sized to allow
for a single transfer to the SSD. A single batch is divided into block-sized buffer entries.

                 write buffer
    +-----------------------------------+
    |batch 1                            |
    |   +-----------------------------+ |
    |   |rwb    |rwb    | ... |rwb    | |
    |   |entry 1|entry 2|     |entry n| |
    |   +-----------------------------+ |
    +-----------------------------------+
    | ...                               |
    +-----------------------------------+
    |batch m                            |
    |   +-----------------------------+ |
    |   |rwb    |rwb    | ... |rwb    | |
    |   |entry 1|entry 2|     |entry n| |
    |   +-----------------------------+ |
    +-----------------------------------+

When a write is scheduled, it needs to acquire an entry for each of its blocks and copy the data
into the buffer. Once all blocks are copied, the write can be signalled as completed to the user.
In the meantime, the `rwb` is polled for filled batches and, if one is found, it's sent to the SSD.
After that operation completes, the whole batch can be freed. For the whole time the data is in
the `rwb`, the L2P points at the buffer entry instead of a location on the SSD. This allows for
servicing read requests from the buffer.
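
The write path described above could be sketched in C as follows. The helpers
(`rwb_acquire_entry()`, `ppa_for_rwb_entry()`, etc.) and types are hypothetical stand-ins, not
SPDK's actual API:

```
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal, hypothetical types used only by this sketch. */
struct rwb_entry { uint64_t lba; void *data; };
struct rwb;      /* ring write buffer */
struct l2p_map;  /* logical to physical map */

struct ftl_dev {
    struct rwb *rwb;
    struct l2p_map *l2p;
    size_t block_size;
};

/* Hypothetical helpers assumed to exist for the purpose of the sketch. */
struct rwb_entry *rwb_acquire_entry(struct rwb *rwb);
uint64_t ppa_for_rwb_entry(struct rwb_entry *entry);
void l2p_set(struct l2p_map *l2p, uint64_t lba, uint64_t ppa);

int ftl_write(struct ftl_dev *dev, uint64_t lba, const void *data, size_t num_blocks)
{
    for (size_t i = 0; i < num_blocks; i++) {
        /* Grab a free block-sized entry from one of the batches. */
        struct rwb_entry *entry = rwb_acquire_entry(dev->rwb);
        if (entry == NULL) {
            return -EAGAIN; /* buffer full, retry later */
        }

        memcpy(entry->data, (const char *)data + i * dev->block_size, dev->block_size);
        entry->lba = lba + i;

        /* Until the batch is flushed, reads of this LBA are served from the entry. */
        l2p_set(dev->l2p, lba + i, ppa_for_rwb_entry(entry));
    }

    /* The write can now be completed back to the user. A poller later picks up
     * filled batches, submits them to the SSD and repoints the L2P at the
     * on-disk location before the batch is freed. */
    return 0;
}
```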

## Defragmentation and relocation {#ftl_reloc}

 * Shorthand: defrag, reloc

Since a write to the same LBA invalidates its previous physical location, some of the blocks on a
band might contain stale data that merely wastes space. As there is no way to overwrite an already
written block, this data will stay there until the whole chunk is reset. This might create a
situation in which all of the bands contain some valid data and no band can be erased, so no writes
can be executed anymore. Therefore, a mechanism is needed to move valid data and invalidate whole
bands, so that they can be reused.

                    band                                             band
    +-----------------------------------+            +-----------------------------------+
    | ** *    * ***      *    *** * *   |            |                                   |
    |**  *       *    *    * *     *   *|   +---->   |                                   |
    |*     ***  *      *            *   |            |                                   |
    +-----------------------------------+            +-----------------------------------+

Valid blocks are marked with an asterisk '\*'.

Another reason for data relocation might be an event from the SSD indicating that the data might
become corrupt if it's not relocated. This might happen due to the data's age (it was written a
long time ago) or due to read disturb (a media characteristic that causes corruption of neighbouring
blocks during a read operation).

The module responsible for data relocation is called `reloc`. When a band is chosen for
defragmentation or an ANM (asynchronous NAND management) event is received, the appropriate blocks
are marked as needing to be moved. The `reloc` module takes a band that has such blocks marked,
checks their validity and, if they're still valid, copies them to a new location.
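
Conceptually, processing a single band by `reloc` could look like the sketch below; the helper
functions and types are hypothetical, not SPDK's actual code:

```
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ftl_dev; /* as in the earlier write-path sketch */
struct band;

/* Hypothetical helpers assumed for the purpose of the sketch. */
uint64_t band_num_blocks(const struct band *band);
bool band_block_marked_for_move(const struct band *band, uint64_t block);
bool band_block_is_valid(const struct band *band, uint64_t block);
uint64_t band_block_lba(const struct band *band, uint64_t block);
int ftl_read(struct ftl_dev *dev, uint64_t lba, void *buf, size_t num_blocks);
int ftl_write(struct ftl_dev *dev, uint64_t lba, const void *buf, size_t num_blocks);

void reloc_band(struct ftl_dev *dev, struct band *band, void *block_buf)
{
    for (uint64_t block = 0; block < band_num_blocks(band); block++) {
        if (!band_block_marked_for_move(band, block)) {
            continue;
        }
        /* The data might have been invalidated since it was marked. */
        if (!band_block_is_valid(band, block)) {
            continue;
        }

        uint64_t lba = band_block_lba(band, block);
        /* Rewriting through the regular write path places the data on a
         * currently open band and updates the L2P to the new location. */
        ftl_read(dev, lba, block_buf, 1);
        ftl_write(dev, lba, block_buf, 1);
    }
}
```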

Choosing a band for defragmentation depends on several factors: its valid ratio (1) (proportion of
valid blocks to all user blocks), its age (2) (when it was written) and the write count / wear level
index of its chunks (3) (how many times the band was written to). The lower the ratio (1), the
higher the age (2) and the lower the write count (3), the higher the chance the band will be chosen
for defrag.
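
One way to fold these three factors into a single ranking is sketched below; the weights and the
formula are purely illustrative and not the heuristic actually used by SPDK:

```
#include <stdint.h>

/* Hypothetical per-band statistics used by the sketch. */
struct band_stats {
    double valid_ratio;  /* (1) valid blocks / all user blocks, in the range 0.0 - 1.0 */
    uint64_t age;        /* (2) how long ago the band was written */
    uint64_t wr_cnt;     /* (3) write count / wear level index of its chunks */
};

/* Higher score means a better defrag candidate: low valid ratio, high age, low wear. */
static double defrag_score(const struct band_stats *b, uint64_t max_age, uint64_t max_wr_cnt)
{
    double invalid = 1.0 - b->valid_ratio;
    double age = max_age ? (double)b->age / (double)max_age : 0.0;
    double wear = max_wr_cnt ? 1.0 - (double)b->wr_cnt / (double)max_wr_cnt : 1.0;

    /* Equal weights, purely for illustration. */
    return (invalid + age + wear) / 3.0;
}
```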

# Usage {#ftl_usage}

## Prerequisites {#ftl_prereq}

In order to use the FTL module, an Open Channel SSD is required. The easiest way to obtain one is to
emulate it using QEMU. A QEMU build with patches providing Open Channel support can be found on
SPDK's QEMU fork, on the [spdk-3.0.0](https://github.com/spdk/qemu/tree/spdk-3.0.0) branch.

## Configuring QEMU {#ftl_qemu_config}

To emulate an Open Channel device, QEMU expects parameters describing the characteristics and
geometry of the SSD:
 - `serial` - serial number,
 - `lver` - version of the OCSSD standard (0 - disabled, 1 - "1.2", 2 - "2.0"), libftl only supports
   2.0,
 - `lba_index` - default LBA format. Possible values (libftl only supports lba_index >= 3):
        |lba_index| data| metadata|
        |---------|-----|---------|
        |    0    | 512B|    0B   |
        |    1    | 512B|    8B   |
        |    2    | 512B|   16B   |
        |    3    |4096B|    0B   |
        |    4    |4096B|   64B   |
        |    5    |4096B|  128B   |
        |    6    |4096B|   16B   |
 - `lnum_ch` - number of groups,
 - `lnum_lun` - number of parallel units,
 - `lnum_pln` - number of planes (logical blocks from all planes constitute a chunk),
 - `lpgs_per_blk` - number of pages (smallest programmable unit) per chunk,
 - `lsecs_per_pg` - number of sectors in a page,
 - `lblks_per_pln` - number of chunks in a parallel unit,
 - `laer_thread_sleep` - timeout in ms between asynchronous events requesting the host to relocate
   the data based on media feedback,
 - `lmetadata` - metadata file.

For a more detailed description of the available options, consult the `hw/block/nvme.c` file in
the QEMU repository.

Example:

```
$ /path/to/qemu [OTHER PARAMETERS] -drive format=raw,file=/path/to/data/file,if=none,id=myocssd0 \
        -device nvme,drive=myocssd0,serial=deadbeef,lver=2,lba_index=3,lnum_ch=1,lnum_lun=8,lnum_pln=4,lpgs_per_blk=1536,lsecs_per_pg=4,lblks_per_pln=512,lmetadata=/path/to/md/file
```

In the above example, a device is created with 1 channel, 8 parallel units, 512 chunks per parallel
unit and 24576 (`lnum_pln` * `lpgs_per_blk` * `lsecs_per_pg`) logical blocks in each chunk, with each
logical block being 4096B. Therefore the data file needs to be at least 384G (8 * 512 * 24576 *
4096B) in size and can be created with the following command:

```
$ fallocate -l 384G /path/to/data/file
```

## Configuring SPDK {#ftl_spdk_config}

To verify that the drive is emulated correctly, one can check the output of the NVMe identify app
(assuming that `scripts/setup.sh` was run beforehand and the device has been bound to a userspace
driver):

```
$ examples/nvme/identify/identify
=====================================================
NVMe Controller at 0000:00:0a.0 [1d1d:1f1f]
=====================================================
Controller Capabilities/Features
================================
Vendor ID:                             1d1d
Subsystem Vendor ID:                   1af4
Serial Number:                         deadbeef
Model Number:                          QEMU NVMe Ctrl

... other info ...

Namespace OCSSD Geometry
=======================
OC version: maj:2 min:0

... other info ...

Groups (channels): 1
PUs (LUNs) per group: 8
Chunks per LUN: 512
Logical blks per chunk: 24576

... other info ...

```

Similarly to other bdevs, the FTL bdevs can be created either based on config files or via RPC. Both
interfaces require the same arguments, which are described by the `--help` option of the
`bdev_ftl_create` RPC call:
 - bdev's name
 - transport type of the device (e.g. PCIe)
 - transport address of the device (e.g. `00:0a.0`)
 - parallel unit range
 - UUID of the FTL device (if the FTL is to be restored from the SSD)

Example config:

```
[Ftl]
 TransportID "trtype:PCIe traddr:00:0a.0" nvme0 "0-3" 00000000-0000-0000-0000-000000000000
 TransportID "trtype:PCIe traddr:00:0a.0" nvme1 "4-5" e9825835-b03c-49d7-bc3e-5827cbde8a88
```

The above will result in the creation of two devices:
 - `nvme0` on `00:0a.0` using parallel units 0-3, created from scratch
 - `nvme1` on the same device using parallel units 4-5, restored from the SSD using the UUID
   provided

The same can be achieved with the following two RPC calls:

```
$ scripts/rpc.py bdev_ftl_create -b nvme0 -l 0-3 -a 00:0a.0
{
        "name": "nvme0",
        "uuid": "b4624a89-3174-476a-b9e5-5fd27d73e870"
}
$ scripts/rpc.py bdev_ftl_create -b nvme1 -l 4-5 -a 00:0a.0 -u e9825835-b03c-49d7-bc3e-5827cbde8a88
{
        "name": "nvme1",
        "uuid": "e9825835-b03c-49d7-bc3e-5827cbde8a88"
}
```
