# Flash Translation Layer {#ftl}

The Flash Translation Layer library provides efficient 4K block device access on top of devices
with >4K write unit size (e.g. raid5f bdev) or devices with large indirection units (some
capacity-focused NAND drives), which don't handle 4K writes well. It handles the logical to
physical address mapping and manages the garbage collection process.

## Terminology {#ftl_terminology}

### Logical to physical address map {#ftl_l2p}

- Shorthand: `L2P`

Contains the mapping of the logical addresses (LBA) to their on-disk physical location. The LBAs
are contiguous and in range from 0 to the number of surfaced blocks (the number of spare blocks
is calculated during device formation and is subtracted from the available address space). The
spare blocks account for zones going offline throughout the lifespan of the device as well as
provide the necessary buffer for data [garbage collection](#ftl_reloc).

Since the L2P would occupy a significant amount of DRAM (4B/LBA for drives smaller than 16TiB,
8B/LBA for bigger drives), FTL will, by default, store only the 2GiB of most recently used L2P
addresses in memory (the amount is configurable), and page them in and out of the cache device
as necessary.

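To get a feel for why the L2P is paged, the per-LBA figures above can be turned into a quick
estimate. This is a minimal sketch, not FTL code: it assumes 4K logical blocks, ignores spare
blocks, and the function names are made up for illustration.

```python
KIB = 1024
TIB = 1024 ** 4
BLOCK_SIZE = 4 * KIB  # assumption: one L2P entry per 4K logical block


def l2p_entry_size(capacity_bytes: int) -> int:
    """Bytes per L2P entry: 4B below 16TiB, 8B otherwise (figures from the text)."""
    return 4 if capacity_bytes < 16 * TIB else 8


def l2p_table_size(capacity_bytes: int) -> int:
    """Size in bytes of a fully resident L2P table (spare blocks are ignored)."""
    num_lbas = capacity_bytes // BLOCK_SIZE
    return num_lbas * l2p_entry_size(capacity_bytes)


if __name__ == "__main__":
    for tib in (1, 8, 32):
        size_gib = l2p_table_size(tib * TIB) / 1024 ** 3
        print(f"{tib:>2} TiB drive -> {size_gib:g} GiB of L2P")
```

Under these assumptions an 8TiB drive already needs 8GiB for a fully resident table, four times
the default 2GiB in-memory limit.
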
### Band {#ftl_band}

A band describes a collection of zones, each belonging to a different parallel unit. All writes to
a band follow the same pattern - a batch of logical blocks is written to one zone, another batch
to the next one and so on. This ensures the parallelism of the write operations, as they can be
executed independently on different zones. Each band keeps track of the LBAs it consists of, as
well as their validity, as some of the data will be invalidated by subsequent writes to the same
logical address. The L2P mapping can be restored from the SSD by reading this information in order
from the oldest band to the youngest.

```text
             +--------------+        +--------------+                        +--------------+
    band 1   |   zone 1     +--------+    zone 1    +---- --- --- --- --- ---+     zone 1   |
             +--------------+        +--------------+                        +--------------+
    band 2   |   zone 2     +--------+     zone 2   +---- --- --- --- --- ---+     zone 2   |
             +--------------+        +--------------+                        +--------------+
    band 3   |   zone 3     +--------+     zone 3   +---- --- --- --- --- ---+     zone 3   |
             +--------------+        +--------------+                        +--------------+
             |     ...      |        |     ...      |                        |     ...      |
             +--------------+        +--------------+                        +--------------+
    band m   |   zone m     +--------+     zone m   +---- --- --- --- --- ---+     zone m   |
             +--------------+        +--------------+                        +--------------+
             |     ...      |        |     ...      |                        |     ...      |
             +--------------+        +--------------+                        +--------------+

              parallel unit 1              pu 2                                    pu n
```

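The write pattern within a band (one batch per zone, then on to the next zone) can be sketched
as a small mapping function. The `batch_size` and `num_zones` parameters and the function name
are illustrative assumptions; the actual batch size used by FTL is not stated here.

```python
def band_location(block_index: int, num_zones: int, batch_size: int) -> tuple[int, int]:
    """Map the n-th block written to a band to (zone index, block offset in that zone)."""
    batch = block_index // batch_size   # which batch the block belongs to
    zone = batch % num_zones            # batches rotate across the zones
    stripe = batch // num_zones         # completed passes over all zones
    offset = stripe * batch_size + block_index % batch_size
    return zone, offset


if __name__ == "__main__":
    # With 3 zones and batches of 2 blocks, writes rotate across zones 0, 1, 2.
    for i in range(8):
        print(i, band_location(i, num_zones=3, batch_size=2))
```
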
The address map (`P2L`) is saved as a part of the band's metadata, at the end of each band:

```text
                        band's data                        tail metadata
    +-------------------+-------------------------------+------------------------+
    |zone 1 |...|zone n |...|...|zone 1 |...|           | ... |zone  m-1 |zone  m|
    |block 1|   |block 1|   |   |block x|   |           |     |block y   |block y|
    +-------------------+-------------+-----------------+------------------------+
```

Bands are written sequentially (in the way described earlier). Before a band can be written
to, all of its zones need to be erased. During that time, the band is considered to be in the `PREP`
state. Then the band moves to the `OPEN` state and actual user data can be written to the
band. Once the whole available space is filled, tail metadata is written and the band transitions to
the `CLOSING` state. When that finishes, the band becomes `CLOSED`.

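The band lifecycle described above is a simple linear state machine. The following sketch only
models the four states named in the text; the class and function names are hypothetical and do
not correspond to FTL's internal identifiers.

```python
from enum import Enum


class BandState(Enum):
    PREP = "PREP"        # zones are being erased
    OPEN = "OPEN"        # user data can be written
    CLOSING = "CLOSING"  # tail metadata is being written
    CLOSED = "CLOSED"    # tail metadata is written, the band is full


# Allowed transitions, in the order given in the text.
NEXT_STATE = {
    BandState.PREP: BandState.OPEN,
    BandState.OPEN: BandState.CLOSING,
    BandState.CLOSING: BandState.CLOSED,
}


def advance(state: BandState) -> BandState:
    """Return the state that follows `state`, or raise if the band is already closed."""
    if state not in NEXT_STATE:
        raise ValueError(f"no transition out of {state.name} in this sketch")
    return NEXT_STATE[state]
```
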
### Non-volatile cache {#ftl_nvcache}

- Shorthand: `nvcache`

Nvcache is a bdev that is used for buffering user writes and storing various metadata.
Nvcache data space is divided into chunks. Chunks are written in a sequential manner.
When the number of free chunks falls below the assigned threshold, data from fully written chunks
is moved to the base bdev. This process is called chunk compaction.

```text
                      nvcache
    +-----------------------------------------+
    |chunk 1                                  |
    |   +--------------------------------- +  |
    |   |blk 1 + md| blk 2 + md| blk n + md|  |
    |   +----------------------------------|  |
    +-----------------------------------------+
    | ...                                     |
    +-----------------------------------------+
    +-----------------------------------------+
    |chunk N                                  |
    |   +--------------------------------- +  |
    |   |blk 1 + md| blk 2 + md| blk n + md|  |
    |   +----------------------------------|  |
    +-----------------------------------------+
```

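As a rough illustration of the compaction trigger, the sketch below moves data out of full
chunks while the number of free chunks is under a threshold. It is a toy model: the `Chunk`
class, the `compact` function and the plain Python lists standing in for devices are all
assumptions, and it copies every block rather than tracking validity.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    capacity: int                               # number of blocks the chunk can hold
    blocks: list = field(default_factory=list)  # blocks written so far, in order

    @property
    def is_free(self) -> bool:
        return not self.blocks

    @property
    def is_full(self) -> bool:
        return len(self.blocks) == self.capacity


def compact(chunks: list, base_bdev: list, free_threshold: int) -> int:
    """Move data from full chunks to the base bdev until enough chunks are free.

    Returns the number of chunks that were compacted.
    """
    compacted = 0
    for chunk in chunks:
        free = sum(c.is_free for c in chunks)
        if free >= free_threshold:
            break                               # enough free chunks, stop compacting
        if chunk.is_full:
            base_bdev.extend(chunk.blocks)      # write the chunk's data to the base device
            chunk.blocks.clear()                # the chunk can now be reused
            compacted += 1
    return compacted
```
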
### Garbage collection and relocation {#ftl_reloc}

- Shorthand: `gc`, `reloc`

Since a write to the same LBA invalidates its previous physical location, some of the blocks on a
band might contain stale data that wastes space. As there is no way to overwrite an already
written block on a ZNS drive, this data will stay there until the whole zone is reset. This might
create a situation in which all of the bands contain some valid data and no band can be erased, so
no writes can be executed anymore. Therefore a mechanism is needed to move valid data and
invalidate whole bands, so that they can be reused.

```text
                    band                                             band
    +-----------------------------------+            +-----------------------------------+
    | ** *    * ***      *    *** * *   |            |                                   |
    |**  *       *    *    * *     *   *|   +---->   |                                   |
    |*     ***  *      *            *   |            |                                   |
    +-----------------------------------+            +-----------------------------------+
```

Valid blocks are marked with an asterisk '\*'.

The module responsible for data relocation is called `reloc`. When a band is chosen for garbage
collection, the appropriate blocks are marked as required to be moved. The `reloc` module takes a
band that has some such blocks marked, checks their validity and, if they're still valid, copies them.

Choosing a band for garbage collection depends on its validity ratio (proportion of valid blocks to all
user blocks). The lower the ratio, the higher the chance the band will be chosen for gc.

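The selection criterion can be illustrated with a greedy simplification that always picks the
band with the lowest validity ratio. The text only says that a lower ratio makes a band more
likely to be chosen, so treat this as a sketch; the function names and the dictionary layout
are assumptions.

```python
def validity_ratio(valid_blocks: int, user_blocks: int) -> float:
    """Proportion of valid blocks to all user blocks in a band."""
    return valid_blocks / user_blocks


def pick_band_for_gc(bands: dict) -> str:
    """Return the name of the band with the lowest validity ratio.

    `bands` maps a band name to a (valid_blocks, user_blocks) tuple.
    """
    return min(bands, key=lambda name: validity_ratio(*bands[name]))


if __name__ == "__main__":
    bands = {"band0": (900, 1000), "band1": (100, 1000), "band2": (500, 1000)}
    print(pick_band_for_gc(bands))
```

Relocating the band with the fewest valid blocks frees a whole band while copying the least data.
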
## Metadata {#ftl_metadata}

In addition to the [L2P](#ftl_l2p), FTL stores additional metadata on both the cache and the
base devices. The following types of metadata are persisted:

- Superblock - stores the global state of FTL; stored on cache, mirrored to the base device

- L2P - see the [L2P](#ftl_l2p) section for details

- Band - stores the state of bands (write pointers, their OPEN/FREE/CLOSE state); stored on cache, mirrored to a different section of the cache device

- Valid map - bitmask of all the valid physical addresses, used for improving [relocation](#ftl_reloc)

- Chunk - stores the state of chunks (write pointers, their OPEN/FREE/CLOSE state); stored on cache, mirrored to a different section of the cache device

- P2L - stores the address mapping (P2L, see [band](#ftl_band)) of currently open bands. This allows for the recovery of open
  bands after dirty shutdown without needing VSS DIX metadata on the base device; stored on the cache device

- Trim - stores information about unmapped (trimmed) LBAs; stored on cache, mirrored to a different section of the cache device

## Dirty shutdown recovery {#ftl_dirty_shutdown}

After power failure, FTL needs to rebuild the whole L2P using the address maps (`P2L`) stored within each band/chunk.
This needs to be done, because while individual L2P pages may have been paged out and persisted to the cache device,
there's no way to tell which, if any, pages were dirty before the power failure occurred. The P2L consists of not only
the mapping itself, but also a sequence id (`seq_id`), which describes the relative age of a given logical block
(multiple writes to the same logical block produce the same number of P2L entries, with only the last one holding the current data).

FTL will therefore rebuild the whole L2P by reading the P2L of all closed bands and chunks. For open bands, the P2L is stored on
the cache device, in a separate metadata region (see [the P2L section](#ftl_metadata)). Open chunks can be restored thanks to storing
the mapping in the VSS DIX metadata, which the cache device must be formatted with.

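The role of `seq_id` during the rebuild can be shown with a small sketch: for every LBA, the
P2L entry with the highest sequence id wins. The tuple layout and the function name are
assumptions made for illustration, and the real on-disk P2L format is not modelled.

```python
def rebuild_l2p(p2l_entries) -> dict:
    """Rebuild an LBA -> physical address map from P2L entries.

    `p2l_entries` is an iterable of (physical_addr, lba, seq_id) tuples, in any order.
    """
    l2p = {}     # lba -> physical address holding the current data
    newest = {}  # lba -> highest seq_id seen so far
    for phys, lba, seq_id in p2l_entries:
        if lba not in newest or seq_id > newest[lba]:
            newest[lba] = seq_id
            l2p[lba] = phys
    return l2p


if __name__ == "__main__":
    # LBA 7 was written three times; only the write with seq_id 3 is current.
    entries = [(100, 7, 1), (205, 7, 3), (150, 7, 2), (300, 9, 1)]
    print(rebuild_l2p(entries))
```

Because the comparison relies on `seq_id` alone, the entries can be read in any order in this
sketch.
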
### Shared memory recovery {#ftl_shm_recovery}

In order to shorten the recovery after a crash of the target application, FTL also stores its metadata in shared memory (`shm`). This
allows it to keep track of the dirty state of individual pages and shortens the recovery time dramatically, as FTL will only
need to mark any L2P pages which were being paged out at the time of the crash as dirty and reissue the writes. There's no need
to read the whole P2L in this case.

### Trim {#ftl_trim}

Due to metadata size constraints and the difficulty of maintaining consistent data returned before and after dirty shutdown, FTL
currently only allows for trims (unmaps) aligned to 4MiB (alignment concerns both the offset and length of the trim command).

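A caller can check the alignment rule before issuing an unmap. This sketch assumes the 4K
logical block size mentioned at the top of the document, so 4MiB corresponds to 1024 blocks;
the function name is hypothetical.

```python
BLOCK_SIZE = 4 * 1024                            # 4K logical block (assumption)
TRIM_ALIGNMENT = 4 * 1024 * 1024                 # 4MiB, from the text
BLOCKS_PER_UNIT = TRIM_ALIGNMENT // BLOCK_SIZE   # 1024 blocks per 4MiB unit


def is_trim_aligned(offset_blocks: int, num_blocks: int) -> bool:
    """True when the length is non-zero and both offset and length are multiples of 4MiB."""
    return (
        num_blocks > 0
        and offset_blocks % BLOCKS_PER_UNIT == 0
        and num_blocks % BLOCKS_PER_UNIT == 0
    )
```
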
## Usage {#ftl_usage}

### Prerequisites {#ftl_prereq}

In order to use the FTL module, a cache device formatted with VSS DIX metadata is required.

### FTL bdev creation {#ftl_create}

Similar to other bdevs, the FTL bdevs can be created either based on JSON config files or via RPC.
Both interfaces require the same arguments, which are described by the `--help` option of the
`bdev_ftl_create` RPC call:

- bdev's name
- base bdev's name
- cache bdev's name (cache bdev must support VSS DIX mode)
- UUID of the FTL device (if the FTL is to be restored from the SSD)

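For the JSON config route, the arguments listed above could be assembled as follows. This is a
hedged sketch: the parameter keys `name`, `base_bdev`, `cache` and `uuid` are assumptions, so
verify them against the `--help` output of `bdev_ftl_create` before relying on them. The base
and cache bdevs are assumed to be created earlier in the same config.

```python
import json


def ftl_config(name: str, base_bdev: str, cache: str, uuid: str = None) -> dict:
    """Build a JSON config fragment that creates one FTL bdev.

    The parameter keys are assumptions; check them against `bdev_ftl_create --help`.
    """
    params = {"name": name, "base_bdev": base_bdev, "cache": cache}
    if uuid is not None:
        params["uuid"] = uuid  # only needed when restoring an existing FTL device
    return {
        "subsystems": [
            {
                "subsystem": "bdev",
                "config": [{"method": "bdev_ftl_create", "params": params}],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(ftl_config("ftl0", "nvme0n1", "nvme1n1"), indent=2))
```
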
## FTL bdev stack {#ftl_bdev_stack}

In order to create FTL on top of a regular bdev:

1) Create regular bdev e.g. `bdev_nvme`, `bdev_null`, `bdev_malloc`
2) Create second regular bdev for nvcache
3) Create FTL bdev on top of bdev created in step 1 and step 2

Example:

```
$ scripts/rpc.py bdev_nvme_attach_controller -b nvme0 -a 00:05.0 -t pcie
	nvme0n1

$ scripts/rpc.py bdev_nvme_attach_controller -b nvme1 -a 00:06.0 -t pcie
	nvme1n1

$ scripts/rpc.py bdev_ftl_create -b ftl0 -d nvme0n1 -c nvme1n1
{
	"name": "ftl0",
	"uuid": "3b469565-1fa5-4bfb-8341-747ec9f3a9b9"
}
```