# Flash Translation Layer {#ftl}

The Flash Translation Layer library provides efficient 4K block device access on top of devices
with >4K write unit size (e.g. the raid5f bdev) or devices with large indirection units (some
capacity-focused NAND drives), which don't handle 4K writes well. It handles the logical to
physical address mapping and manages the garbage collection process.

## Terminology {#ftl_terminology}

### Logical to physical address map {#ftl_l2p}

- Shorthand: `L2P`

Contains the mapping of logical addresses (LBA) to their on-disk physical locations. The LBAs
are contiguous and range from 0 to the number of blocks surfaced to the user (the number of spare
blocks is calculated during device formatting and subtracted from the available address space). The
spare blocks account for zones going offline throughout the lifespan of the device, as well as
provide the necessary buffer for data [garbage collection](#ftl_reloc).

Since the L2P would occupy a significant amount of DRAM (4B/LBA for drives smaller than 16TiB,
8B/LBA for bigger drives), FTL will, by default, store only 2GiB of the most recently used L2P
addresses in memory (the amount is configurable), paging them in and out of the cache device
as necessary.

### Band {#ftl_band}

A band describes a collection of zones, each belonging to a different parallel unit. All writes to
a band follow the same pattern - a batch of logical blocks is written to one zone, another batch
to the next one, and so on. This ensures the parallelism of the write operations, as they can be
executed independently on different zones. Each band keeps track of the LBAs it consists of, as
well as their validity, as some of the data will be invalidated by subsequent writes to the same
logical address. The L2P mapping can be restored from the SSD by reading this information in order
from the oldest band to the youngest.

```text
           +--------------+        +--------------+                        +--------------+
  band 1   |    zone 1    +--------+    zone 1    +---- --- --- --- --- ---+    zone 1    |
           +--------------+        +--------------+                        +--------------+
  band 2   |    zone 2    +--------+    zone 2    +---- --- --- --- --- ---+    zone 2    |
           +--------------+        +--------------+                        +--------------+
  band 3   |    zone 3    +--------+    zone 3    +---- --- --- --- --- ---+    zone 3    |
           +--------------+        +--------------+                        +--------------+
           |     ...      |        |     ...      |                        |     ...      |
           +--------------+        +--------------+                        +--------------+
  band m   |    zone m    +--------+    zone m    +---- --- --- --- --- ---+    zone m    |
           +--------------+        +--------------+                        +--------------+
           |     ...      |        |     ...      |                        |     ...      |
           +--------------+        +--------------+                        +--------------+

           parallel unit 1               pu 2                                    pu n
```

The address map (`P2L`) is saved as a part of the band's metadata, at the end of each band:

```text
       band's data                                        tail metadata
+-------------------+-------------------------------+-----------------------+
|zone 1 |...|zone n |...|...|zone 1 |...|           | ... |zone m-1 |zone  m|
|block 1|   |block 1|   |   |block x|   |           |     |block y  |block y|
+-------------------+-------------------------------+-----------------------+
```

Bands are written sequentially (in the way described earlier). Before a band can be written
to, all of its zones need to be erased. During that time, the band is considered to be in a `PREP`
state. Then the band moves to the `OPEN` state and actual user data can be written to the band.
Once the whole available space is filled, tail metadata is written and the band transitions to the
`CLOSING` state. When that finishes, the band becomes `CLOSED`.
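
The striped write pattern described above can be modeled with a short sketch. This is only a
minimal illustration, not FTL code: the batch size (`BATCH_SIZE`), the structure, and the function
names below are hypothetical, and batches are simply placed round-robin across the band's zones
(parallel units).

```c
/*
 * Minimal illustration of the band write pattern - not the SPDK implementation.
 * Assumes a hypothetical, fixed batch size and round-robin placement of batches
 * across the band's zones (parallel units).
 */
#include <stdint.h>
#include <stdio.h>

#define BATCH_SIZE 16 /* hypothetical: blocks written to one zone before moving to the next */

struct zone_addr {
	uint32_t parallel_unit; /* which of the band's zones receives the block */
	uint64_t zone_offset;   /* block offset within that zone */
};

/* Translate an offset within a band into a zone and an offset within that zone. */
static struct zone_addr
band_offset_to_zone(uint64_t band_offset, uint32_t num_parallel_units)
{
	uint64_t batch = band_offset / BATCH_SIZE;
	struct zone_addr addr = {
		.parallel_unit = (uint32_t)(batch % num_parallel_units),
		.zone_offset = (batch / num_parallel_units) * BATCH_SIZE +
			       band_offset % BATCH_SIZE,
	};

	return addr;
}

int
main(void)
{
	/* Consecutive batches land on consecutive parallel units. */
	for (uint64_t off = 0; off < 64; off += BATCH_SIZE) {
		struct zone_addr a = band_offset_to_zone(off, 4);

		printf("band offset %3lu -> pu %u, zone offset %lu\n",
		       (unsigned long)off, a.parallel_unit, (unsigned long)a.zone_offset);
	}
	return 0;
}
```

With four parallel units and the assumed batch size of 16, band offsets 0-15 land on the first
zone, 16-31 on the second, and so on, wrapping back to the first zone once every unit has received
a batch.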

### Non volatile cache {#ftl_nvcache}

- Shorthand: `nvcache`

Nvcache is a bdev that is used for buffering user writes and storing various metadata.
The nvcache data space is divided into chunks, which are written in a sequential manner. When the
number of free chunks falls below an assigned threshold, data from fully written chunks is moved
to the base bdev. This process is called chunk compaction.

```text
                  nvcache
+-----------------------------------------+
|chunk 1                                  |
|  +----------------------------------+   |
|  |blk 1 + md| blk 2 + md| blk n + md|   |
|  +----------------------------------+   |
+-----------------------------------------+
|                   ...                   |
+-----------------------------------------+
+-----------------------------------------+
|chunk N                                  |
|  +----------------------------------+   |
|  |blk 1 + md| blk 2 + md| blk n + md|   |
|  +----------------------------------+   |
+-----------------------------------------+
```

### Garbage collection and relocation {#ftl_reloc}

- Shorthand: `gc`, `reloc`

Since a write to the same LBA invalidates its previous physical location, some of the blocks on a
band might contain old data that simply wastes space. As there is no way to overwrite an already
written block on a ZNS drive, this data will stay there until the whole zone is reset. This might
create a situation in which all of the bands contain some valid data and no band can be erased, so
no writes can be executed anymore. Therefore, a mechanism is needed to move valid data and
invalidate whole bands, so that they can be reused.

```text
 band                                               band
 +-----------------------------------+            +-----------------------------------+
 |  ** *    *  ***   *   ***  *   *  |            |                                   |
 |**   *    *   *    *  *     *     *|  +---->    |                                   |
 |*   ***      *       *      *      |            |                                   |
 +-----------------------------------+            +-----------------------------------+
```

Valid blocks are marked with an asterisk '\*'.

The module responsible for data relocation is called `reloc`. When a band is chosen for garbage
collection, the appropriate blocks are marked as needing to be moved. The `reloc` module takes a
band that has such blocks marked, checks their validity and, if they're still valid, copies them.

Choosing a band for garbage collection depends on its validity ratio (the proportion of valid
blocks to all user blocks). The lower the ratio, the higher the chance the band will be chosen
for gc.
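
The selection heuristic can be sketched in a few lines. The structures and the policy below are a
simplified model, not the actual `reloc` implementation (which may weigh more than the validity
ratio alone); the sketch simply picks the band with the lowest ratio.

```c
/*
 * Simplified sketch of picking a garbage collection candidate by validity
 * ratio - hypothetical types, not the SPDK reloc implementation.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct band_info {
	uint64_t valid_blocks; /* blocks still referenced by the L2P */
	uint64_t user_blocks;  /* all user data blocks the band can hold */
};

/* Return the index of the band with the lowest validity ratio. */
static size_t
pick_gc_candidate(const struct band_info *bands, size_t num_bands)
{
	size_t best = 0;
	double best_ratio = 2.0; /* higher than any possible ratio */

	for (size_t i = 0; i < num_bands; i++) {
		double ratio = (double)bands[i].valid_blocks /
			       (double)bands[i].user_blocks;

		if (ratio < best_ratio) {
			best_ratio = ratio;
			best = i;
		}
	}

	/* The fewer valid blocks, the less data reloc has to copy out. */
	return best;
}

int
main(void)
{
	struct band_info bands[] = {
		{ .valid_blocks = 900, .user_blocks = 1000 },
		{ .valid_blocks = 150, .user_blocks = 1000 }, /* best candidate */
		{ .valid_blocks = 600, .user_blocks = 1000 },
	};

	printf("gc candidate: band %zu\n",
	       pick_gc_candidate(bands, sizeof(bands) / sizeof(bands[0])));
	return 0;
}
```

Picking the band with the least valid data minimizes the amount of data that has to be copied
before the band can be erased and reused.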

## Metadata {#ftl_metadata}

In addition to the [L2P](#ftl_l2p), FTL will store additional metadata both on the cache and on the
base device. The following types of metadata are persisted:

- Superblock - stores the global state of FTL; stored on cache, mirrored to the base device

- L2P - see the [L2P](#ftl_l2p) section for details

- Band - stores the state of bands - write pointers, their OPEN/FREE/CLOSE state; stored on cache, mirrored to a different section of the cache device

- Valid map - bitmask of all the valid physical addresses, used for improving [relocation](#ftl_reloc)

- Chunk - stores the state of chunks - write pointers, their OPEN/FREE/CLOSE state; stored on cache, mirrored to a different section of the cache device

- P2L - stores the address mapping (P2L, see [band](#ftl_band)) of currently open bands. This allows for the recovery of open
  bands after dirty shutdown without needing VSS DIX metadata on the base device; stored on the cache device

- Trim - stores information about unmapped (trimmed) LBAs; stored on cache, mirrored to a different section of the cache device

## Dirty shutdown recovery {#ftl_dirty_shutdown}

After power failure, FTL needs to rebuild the whole L2P using the address maps (`P2L`) stored within each band/chunk.
This needs to be done, because while individual L2P pages may have been paged out and persisted to the cache device,
there's no way to tell which, if any, pages were dirty before the power failure occurred. The P2L consists of not only
the mapping itself, but also a sequence id (`seq_id`), which describes the relative age of a given logical block
(multiple writes to the same logical block produce the same number of P2L entries, with only the last one holding the current data).

FTL will therefore rebuild the whole L2P by reading the P2L of all closed bands and chunks. For open bands, the P2L is stored on
the cache device, in a separate metadata region (see [the P2L section](#ftl_metadata)). Open chunks can be restored thanks to
the mapping being stored in the VSS DIX metadata, which the cache device must be formatted with.

### Shared memory recovery {#ftl_shm_recovery}

In order to shorten the recovery after a crash of the target application, FTL also stores its metadata in shared memory (`shm`) - this
allows it to keep track of the dirtiness state of individual pages and shortens the recovery time dramatically, as FTL only needs
to mark as dirty any L2P pages that were being paged out at the time of the crash and reissue those writes. There's no need
to read the whole P2L in this case.

### Trim {#ftl_trim}

Due to metadata size constraints and the difficulty of maintaining consistent data returned before and after dirty shutdown, FTL
currently only allows for trims (unmaps) aligned to 4MiB (the alignment concerns both the offset and length of the trim command).

## Usage {#ftl_usage}

### Prerequisites {#ftl_prereq}

In order to use the FTL module, a cache device formatted with VSS DIX metadata is required.

### FTL bdev creation {#ftl_create}

Similar to other bdevs, FTL bdevs can be created either based on JSON config files or via RPC.
Both interfaces require the same arguments, which are described by the `--help` option of the
`bdev_ftl_create` RPC call:

- bdev's name
- base bdev's name
- cache bdev's name (the cache bdev must support VSS DIX mode - this can be emulated by providing the SPDK_FTL_VSS_EMU=1 flag to make;
  emulating VSS should be done for testing purposes only, as it is not power-fail safe)
- UUID of the FTL device (if the FTL is to be restored from the SSD)

## FTL bdev stack {#ftl_bdev_stack}

In order to create FTL on top of a regular bdev:

1) Create a regular bdev, e.g. `bdev_nvme`, `bdev_null`, `bdev_malloc`
2) Create a second regular bdev for the nvcache
3) Create the FTL bdev on top of the bdevs created in steps 1 and 2

Example:

```
$ scripts/rpc.py bdev_nvme_attach_controller -b nvme0 -a 00:05.0 -t pcie
  nvme0n1

$ scripts/rpc.py bdev_nvme_attach_controller -b nvme1 -a 00:06.0 -t pcie
  nvme1n1

$ scripts/rpc.py bdev_ftl_create -b ftl0 -d nvme0n1 -c nvme1n1
{
  "name": "ftl0",
  "uuid": "3b469565-1fa5-4bfb-8341-747ec9f3a9b9"
}
```
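
To close, the rebuild rule from the [dirty shutdown recovery](#ftl_dirty_shutdown) section can be
summarized in a short sketch. The types, constants, and function names below are hypothetical, not
the SPDK implementation; the point is only that, when replaying P2L entries, a mapping for an LBA
is accepted only if its `seq_id` is newer than the one already recorded.

```c
/*
 * Illustrative sketch of rebuilding the L2P from P2L entries after a dirty
 * shutdown - hypothetical types, not the SPDK implementation. Each P2L entry
 * carries a seq_id; for a given LBA, only the newest mapping is kept.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_LBAS    8
#define INVALID_PPA UINT64_MAX

struct p2l_entry {
	uint64_t lba;    /* logical block address */
	uint64_t ppa;    /* physical address the block was written to */
	uint64_t seq_id; /* relative age of the write */
};

struct l2p_slot {
	uint64_t ppa;
	uint64_t seq_id;
};

static void
l2p_apply_p2l(struct l2p_slot *l2p, const struct p2l_entry *entries, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		const struct p2l_entry *e = &entries[i];

		/* Accept the mapping only if it is newer than what is already recorded. */
		if (l2p[e->lba].ppa == INVALID_PPA || e->seq_id > l2p[e->lba].seq_id) {
			l2p[e->lba].ppa = e->ppa;
			l2p[e->lba].seq_id = e->seq_id;
		}
	}
}

int
main(void)
{
	struct l2p_slot l2p[NUM_LBAS];

	for (size_t i = 0; i < NUM_LBAS; i++) {
		l2p[i] = (struct l2p_slot){ .ppa = INVALID_PPA, .seq_id = 0 };
	}

	/* Two writes to LBA 3 - only the one with the higher seq_id survives. */
	const struct p2l_entry band_p2l[] = {
		{ .lba = 3, .ppa = 100, .seq_id = 1 },
		{ .lba = 5, .ppa = 101, .seq_id = 2 },
		{ .lba = 3, .ppa = 250, .seq_id = 7 },
	};

	l2p_apply_p2l(l2p, band_p2l, sizeof(band_p2l) / sizeof(band_p2l[0]));
	printf("LBA 3 -> PPA %llu\n", (unsigned long long)l2p[3].ppa);
	return 0;
}
```

This mirrors the rebuild order mentioned in the [band](#ftl_band) section - the P2L regions are
replayed from the oldest band to the youngest, with the highest `seq_id` winning for every LBA.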