# NAND Flash SSD Internals {#ssd_internals}

Solid State Drives (SSDs) are complex devices and their performance depends
on how they're used. The following description is intended to help software
developers understand what is occurring inside the SSD, so that they can come
up with better software designs. It should not be thought of as a strictly
accurate guide to how SSD hardware really works.

As of this writing, SSDs are generally implemented on top of
[NAND Flash](https://en.wikipedia.org/wiki/Flash_memory) memory. At a
very high level, this media has a few important properties:

* The media is grouped onto chips called NAND dies and each die can
  operate in parallel.
* Flipping a bit is a highly asymmetric process. Flipping it one way is
  easy, but flipping it back is quite hard.

NAND Flash media is grouped into large units often referred to as **erase
blocks**. The size of an erase block is highly implementation specific, but
it can be thought of as somewhere between 1MiB and 8MiB. Within each erase
block, a bit may be programmed (i.e. flipped from its erased value of 1 to 0)
only once. In order to write to the erase block a second time, the entire
block must be erased (i.e. all bits in the block are reset to 1). This is the
asymmetry from the list above: programming a bit is easy, but undoing it
requires erasing the whole block. Erasing a block causes a measurable amount
of wear and each block may only be erased a limited number of times.

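To make these rules concrete, the sketch below models the constraints a
single erase block imposes. It is a toy model with invented names
(`nand_erase_block`, `nand_program_page`, `nand_erase`), not a real device
interface: pages may only be programmed once between erases, and every erase
consumes part of the block's limited wear budget.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGES_PER_BLOCK 512

/* Toy model of one NAND erase block. */
struct nand_erase_block {
	bool     programmed[PAGES_PER_BLOCK]; /* program-once flag per page */
	uint32_t erase_count;                 /* wear accumulated so far */
};

/* A page may be programmed only if it has not been written since the last
 * erase; there is no in-place rewrite. */
static bool
nand_program_page(struct nand_erase_block *blk, uint32_t page)
{
	if (blk->programmed[page]) {
		return false; /* the entire block must be erased first */
	}
	blk->programmed[page] = true;
	return true;
}

/* Erasing resets every page at once and wears the block a little more. */
static void
nand_erase(struct nand_erase_block *blk)
{
	memset(blk->programmed, 0, sizeof(blk->programmed));
	blk->erase_count++;
}
```

The only way to reuse any single page is to pay for erasing (and wearing) the
entire block, which is what pushes designs toward the log-structured approach
described next.
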
SSDs expose an interface to the host system that makes it appear as if the
drive is composed of a set of fixed size **logical blocks** which are usually
512B or 4KiB in size. These blocks are entirely logical constructs of the
device firmware and they do not statically map to a location on the backing
media. Instead, upon each write to a logical block, a new location on the NAND
Flash is selected and written, and the mapping of the logical block to its
physical location is updated. The algorithm for choosing this location is a
key part of overall SSD performance and is often called the **flash
translation layer** or FTL. This algorithm must correctly distribute the
blocks to account for wear (called **wear-leveling**) and spread them across
NAND dies to improve total available performance. The simplest model is to
group all of the physical media on each die together using an algorithm
similar to RAID and then write to that set sequentially. Real SSDs are far
more complicated, but this is an excellent mental model for software
developers: imagine the drive is simply logging incoming writes to a RAID
volume and updating an in-memory hash table.

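A minimal sketch of that mental model, with invented names and no relation to
any real firmware: the toy FTL below keeps an in-memory logical-to-physical
table and appends every write to the current head of a log, exactly like
logging to a RAID volume and updating a hash table.

```c
#include <stdint.h>

#define FTL_NUM_LBAS (1u << 20) /* hypothetical small drive */
#define FTL_UNMAPPED UINT64_MAX

/* Toy FTL: a logical-to-physical table plus the head of a log.
 * Initialize every l2p entry to FTL_UNMAPPED before use. */
struct toy_ftl {
	uint64_t l2p[FTL_NUM_LBAS]; /* logical block -> physical location */
	uint64_t log_head;          /* next free physical location */
};

/* Each write lands at the log head; the LBA's previous physical location
 * (if any) silently becomes stale garbage to be reclaimed later. */
static uint64_t
toy_ftl_write(struct toy_ftl *ftl, uint64_t lba)
{
	uint64_t new_loc = ftl->log_head++;

	ftl->l2p[lba] = new_loc;
	return new_loc;
}

/* Reads simply follow the current mapping. */
static uint64_t
toy_ftl_read(const struct toy_ftl *ftl, uint64_t lba)
{
	return ftl->l2p[lba]; /* FTL_UNMAPPED if deallocated or never written */
}
```

A real FTL additionally stripes the log across NAND dies and persists the
table across power loss, but the core data structures are recognizably the
same.
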
One consequence of the flash translation layer is that logical blocks do not
necessarily correspond to physical locations on the NAND at all times. In
fact, there is a command that clears the translation for a block. In NVMe,
this command is called deallocate, in SCSI it is called unmap, and in SATA it
is called trim. When a user attempts to read a block that doesn't have a
mapping to a physical location, drives will do one of two things:

1. Immediately complete the read request successfully, without performing any
   data transfer. This is acceptable because the data the drive would return
   is no more valid than the data already in the user's data buffer.
2. Return all 0's as the data.

Choice #1 is much more common, and performing reads against a fully
deallocated device will often show performance far beyond what the drive
claims to be capable of, precisely because it is not actually transferring
any data. Write to all blocks prior to reading them when benchmarking!

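For reference, issuing the deallocate command from SPDK's NVMe driver looks
like the sketch below. The call is `spdk_nvme_ns_cmd_dataset_management()`;
everything around it (an initialized namespace and queue pair from the usual
probe/attach flow, and the caller-supplied completion callback) is assumed.

```c
#include <string.h>

#include "spdk/nvme.h"

/* Deallocate (trim) a range of logical blocks. A sketch: assumes 'ns' and
 * 'qpair' were obtained during normal SPDK controller initialization. */
static int
deallocate_range(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair,
		 uint64_t start_lba, uint32_t num_blocks,
		 spdk_nvme_cmd_cb cb_fn, void *cb_arg)
{
	struct spdk_nvme_dsm_range range;

	memset(&range, 0, sizeof(range));
	range.starting_lba = start_lba;
	range.length = num_blocks; /* length is in logical blocks */

	/* Ask the drive to drop the logical-to-physical mapping. */
	return spdk_nvme_ns_cmd_dataset_management(ns, qpair,
						   SPDK_NVME_DSM_ATTR_DEALLOCATE,
						   &range, 1, cb_fn, cb_arg);
}
```

As with any other SPDK NVMe command, the completion callback fires from
`spdk_nvme_qpair_process_completions()`.
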
As SSDs are written to, the internal log will eventually consume all of the
available erase blocks. In order to continue writing, the SSD must free some
of them. This process is often called **garbage collection**. All SSDs reserve
some number of erase blocks so that they can guarantee there are free erase
blocks available for garbage collection. Garbage collection generally proceeds
by the following steps (sketched in code below):

1. Selecting a target erase block (a good mental model is that it picks the
   least recently used erase block).
2. Walking through each entry in the erase block and determining whether it
   is still a valid logical block.
3. Moving valid logical blocks by reading them and writing them to a
   different erase block (i.e. the current head of the log).
4. Erasing the entire erase block and marking it available for use.

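The same four steps, continued in the toy FTL from above. The `p2l` reverse
map is invented bookkeeping for this sketch, not how any particular firmware
works; it is sized the same as the logical space for simplicity, whereas a
real drive has more physical than logical space.

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES_PER_ERASE_BLOCK 512

/* Hypothetical extension of the toy FTL: alongside l2p, record which LBA
 * each physical location was written for, so GC can tell live entries from
 * stale ones. Initialize p2l entries to FTL_UNMAPPED as well. */
struct toy_gc_ftl {
	struct toy_ftl ftl;
	uint64_t p2l[FTL_NUM_LBAS]; /* physical location -> LBA */
};

/* Step 2: an entry is live iff the forward map still points back at it;
 * any later overwrite of the LBA moved the mapping elsewhere. */
static bool
entry_is_live(const struct toy_gc_ftl *g, uint64_t phys)
{
	uint64_t lba = g->p2l[phys];

	return lba != FTL_UNMAPPED && g->ftl.l2p[lba] == phys;
}

/* Reclaim the erase block whose entries start at 'base'. The caller
 * implements step 1 by choosing the victim (e.g. least recently used). */
static void
toy_gc_one_block(struct toy_gc_ftl *g, uint64_t base)
{
	for (uint64_t phys = base; phys < base + ENTRIES_PER_ERASE_BLOCK; phys++) {
		if (!entry_is_live(g, phys)) {
			continue;
		}
		/* Step 3: rewrite live data at the log head. This is write
		 * amplification: the host never asked for this write. */
		uint64_t lba = g->p2l[phys];
		uint64_t new_phys = toy_ftl_write(&g->ftl, lba);

		g->p2l[new_phys] = lba;
	}
	/* Step 4: every entry here is now stale, so the underlying NAND
	 * block can be erased (nand_erase() above) and reused. */
}
```
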
Garbage collection is clearly far more efficient when step #3 can be skipped
because the erase block is already empty. There are two ways to make it much
more likely that step #3 can be skipped. The first is that SSDs reserve
additional erase blocks beyond their reported capacity (called
**over-provisioning**), so that statistically it's much more likely that an
erase block will not contain valid data. The second is that software can
write to the blocks on the device in sequential order in a circular pattern,
throwing away old data when it is no longer needed. In this case, the
software guarantees that the least recently used erase blocks will not
contain any valid data that must be moved.

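A sketch of that second approach, as seen from the host. The names and the
fixed wrap-at-the-end policy are invented, but the circular pattern is the
whole trick:

```c
#include <stdint.h>

/* Hypothetical host-side circular log: write strictly sequentially and
 * wrap at the end of the device, so old data is overwritten in the same
 * order it was originally written. */
struct circular_log {
	uint64_t head;     /* next LBA to write */
	uint64_t num_lbas; /* usable device capacity in logical blocks */
};

/* Reserve the next 'blocks' logical blocks to write. Because the host
 * overwrites strictly in LBA order, the drive's oldest erase blocks hold
 * no live data by the time GC selects them, so step #3 can be skipped. */
static uint64_t
circular_log_next(struct circular_log *log, uint64_t blocks)
{
	uint64_t lba;

	if (log->head + blocks > log->num_lbas) {
		log->head = 0; /* wrap rather than split the I/O */
	}
	lba = log->head;
	log->head += blocks;
	return lba;
}
```
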
The amount of over-provisioning a device has can dramatically impact the
performance of random read and write workloads if the workload fills up the
entire device. However, the same effect can typically be obtained by simply
reserving a given amount of space on the device in software. This
understanding is critical to producing consistent benchmarks. In particular,
if background garbage collection cannot keep up and the drive must switch to
on-demand garbage collection, the latency of writes will increase
dramatically. Therefore, for consistency, the internal state of the device
must be forced into some known state prior to running benchmarks. This is
usually accomplished by writing to the device sequentially two times, from
start to finish. For a highly detailed description of exactly how to force an
SSD into a known state for benchmarking, see this
[SNIA Article](http://www.snia.org/sites/default/files/SSS_PTS_Enterprise_v1.1.pdf).

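As an illustration, those two sequential passes can be done with a tool like
fio or with a simple loop like the sketch below. Plain POSIX I/O is used for
brevity; `/dev/nvme0n1`, the 1MiB write size, and the data pattern are all
assumptions, and an SPDK application would issue the same two passes through
its NVMe driver instead of the kernel block device.

```c
#define _GNU_SOURCE /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WRITE_SIZE (1024 * 1024) /* 1MiB sequential writes */

/* Precondition a drive by writing it start to finish, twice, so benchmarks
 * begin from a known internal state. */
int
main(void)
{
	int fd = open("/dev/nvme0n1", O_WRONLY | O_DIRECT);
	off_t size, off;
	void *buf;
	int pass;

	if (fd < 0) {
		return 1;
	}
	size = lseek(fd, 0, SEEK_END); /* device capacity in bytes */
	if (posix_memalign(&buf, 4096, WRITE_SIZE) != 0) {
		return 1;
	}
	memset(buf, 0xA5, WRITE_SIZE); /* data pattern is arbitrary here */

	for (pass = 0; pass < 2; pass++) {
		for (off = 0; off + WRITE_SIZE <= size; off += WRITE_SIZE) {
			if (pwrite(fd, buf, WRITE_SIZE, off) != WRITE_SIZE) {
				return 1;
			}
		}
	}
	free(buf);
	close(fd);
	return 0;
}
```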