# NAND Flash SSD Internals {#ssd_internals}

Solid State Devices (SSDs) are complex devices whose performance depends on
how they are used. The following description is intended to help software
developers understand what is occurring inside the SSD, so that they can come
up with better software designs. It should not be thought of as a strictly
accurate guide to how SSD hardware really works.

As of this writing, SSDs are generally implemented on top of
[NAND Flash](https://en.wikipedia.org/wiki/Flash_memory) memory. At a very
high level, this media has a few important properties:

* The media is grouped onto chips called NAND dies, and each die can
  operate in parallel.
* Flipping a bit is a highly asymmetric process. Flipping it one way is
  easy, but flipping it back is quite hard.

NAND Flash media is grouped into large units often referred to as **erase
blocks**. The size of an erase block is highly implementation specific, but
can be thought of as somewhere between 1MiB and 8MiB. Within an erase block,
each bit may be written (i.e. flipped from 0 to 1) with bit granularity
exactly once. In order to write to the erase block a second time, the entire
block must be erased (i.e. all bits in the block are flipped back to 0). This
is the asymmetry mentioned above. Erasing a block causes a measurable amount
of wear, and each block may only be erased a limited number of times.

SSDs expose an interface to the host system that makes it appear as if the
drive is composed of a set of fixed-size **logical blocks**, usually 512B or
4KiB in size. These blocks are entirely logical constructs of the device
firmware and do not statically map to a location on the backing media.
Instead, upon each write to a logical block, a new location on the NAND Flash
is selected and written, and the mapping of the logical block to its physical
location is updated. The algorithm for choosing this location is a key part
of overall SSD performance and is often called the **flash translation
layer**, or FTL. This algorithm must correctly distribute the blocks to
account for wear (called **wear-leveling**) and spread them across NAND dies
to improve total available performance. The simplest model is to group all of
the physical media on each die together using an algorithm similar to RAID
and then write to that set sequentially. Real SSDs are far more complicated,
but this is an excellent simple model for software developers: imagine the
drive is simply logging to a RAID volume and updating an in-memory hash-table
(a small sketch of this model appears below).

One consequence of the flash translation layer is that logical blocks do not
necessarily correspond to physical locations on the NAND at all times. In
fact, there is a command that clears the translation for a block. In NVMe,
this command is called deallocate, in SCSI it is called unmap, and in SATA it
is called trim. When a user attempts to read a block that doesn't have a
mapping to a physical location, drives will do one of two things:

1. Immediately complete the read request successfully, without performing any
   data transfer. This is acceptable because the data the drive would return
   is no more valid than the data already in the user's data buffer.
2. Return all 0's as the data.

Choice #1 is much more common, and performing reads to a fully deallocated
device will often show performance far beyond what the drive claims to be
capable of, precisely because it is not actually transferring any data. Write
to all blocks prior to reading them when benchmarking!
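To make the log-plus-hash-table model and the deallocate behavior concrete,
here is a minimal sketch in Python. It is purely illustrative: real FTLs are
firmware that also track erase blocks, wear, and per-die parallelism, and all
names here (`ToyFtl`, `BLOCK_SIZE`, etc.) are invented for the example. It
models a read of an unmapped block as returning zeroes (choice #2 above).

```python
# Toy model of a flash translation layer (FTL). Purely illustrative; it does
# not model erase blocks, wear-leveling, dies, or garbage collection.

BLOCK_SIZE = 4096            # logical block size exposed to the host
ZERO_BLOCK = bytes(BLOCK_SIZE)

class ToyFtl:
    def __init__(self, num_physical_blocks):
        self.media = [None] * num_physical_blocks  # physical NAND locations
        self.mapping = {}                          # logical block -> physical location
        self.head = 0                              # current head of the log

    def write(self, lba, data):
        # Every write lands at a brand new physical location (append to the
        # log), and the in-memory logical-to-physical mapping is updated.
        self.media[self.head] = data
        self.mapping[lba] = self.head
        self.head += 1

    def deallocate(self, lba):
        # NVMe deallocate / SCSI unmap / SATA trim: clear the translation.
        self.mapping.pop(lba, None)

    def read(self, lba):
        if lba not in self.mapping:
            # No mapping: a real drive may complete immediately without any
            # data transfer, or return zeroes. We model the latter here.
            return ZERO_BLOCK
        return self.media[self.mapping[lba]]

ftl = ToyFtl(num_physical_blocks=1024)
ftl.write(7, b"x" * BLOCK_SIZE)
assert ftl.read(7) == b"x" * BLOCK_SIZE
ftl.deallocate(7)
assert ftl.read(7) == ZERO_BLOCK   # unmapped reads carry no meaningful data
```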
As an SSD is written to, the internal log will eventually consume all of the
available erase blocks. In order to continue writing, the SSD must free some
of them. This process is often called **garbage collection**. All SSDs reserve
some number of erase blocks so that they can guarantee there are free erase
blocks available for garbage collection. Garbage collection generally proceeds
by:

1. Selecting a target erase block (a good mental model is that it picks the
   least recently used erase block).
2. Walking through each entry in the erase block and determining whether it is
   still a valid logical block.
3. Moving valid logical blocks by reading them and writing them to a different
   erase block (i.e. the current head of the log).
4. Erasing the entire erase block and marking it available for use.

A small code sketch of these steps appears at the end of this section.
Garbage collection is clearly far more efficient when step #3 can be skipped
because the erase block is already empty. There are two ways to make it much
more likely that step #3 can be skipped. The first is that SSDs reserve
additional erase blocks beyond their reported capacity (called
**over-provisioning**), so that statistically it is much more likely that an
erase block will not contain valid data. The second is that software can write
to the blocks on the device in sequential order in a circular pattern,
throwing away old data when it is no longer needed. In this case, the software
guarantees that the least recently used erase blocks will not contain any
valid data that must be moved.

The amount of over-provisioning a device has can dramatically impact
performance on random read and write workloads if the workload fills up the
entire device. However, the same effect can typically be obtained by simply
reserving a given amount of space on the device in software. This
understanding is critical to producing consistent benchmarks. In particular,
if background garbage collection cannot keep up and the drive must switch to
on-demand garbage collection, the latency of writes will increase
dramatically. Therefore, the internal state of the device must be forced into
some known state prior to running benchmarks for consistency. This is usually
accomplished by writing to the device sequentially two times, from start to
finish. For a highly detailed description of exactly how to force an SSD into
a known state for benchmarking, see this
[SNIA Article](http://www.snia.org/sites/default/files/SSS_PTS_Enterprise_v1.1.pdf).
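As a rough sketch of that preconditioning step, the following writes a block
device sequentially from start to finish twice. The device path and write size
are placeholders, Linux with O_DIRECT and root privileges is assumed, and this
destroys all data on the device; in practice a dedicated tool such as fio is
typically used, and the SNIA document above describes the complete procedure.

```python
# Rough preconditioning sketch: two full sequential write passes over a
# device. WARNING: destroys all data on DEVICE. Partial writes and any
# unaligned tail of the device are not handled, for brevity.
import mmap
import os

DEVICE = "/dev/nvme0n1"                  # placeholder; pick your device
CHUNK = 1024 * 1024                      # 1 MiB sequential writes

# O_DIRECT requires an aligned buffer; an anonymous mmap is page-aligned.
buf = mmap.mmap(-1, CHUNK)
buf.write(os.urandom(CHUNK))

fd = os.open(DEVICE, os.O_WRONLY | os.O_DIRECT)
try:
    size = os.lseek(fd, 0, os.SEEK_END)  # total device capacity in bytes
    for _pass in range(2):               # two passes, start to finish
        os.lseek(fd, 0, os.SEEK_SET)
        written = 0
        while written + CHUNK <= size:
            written += os.write(fd, buf)
finally:
    os.close(fd)
```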
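Finally, here is the sketch of the garbage-collection steps referenced
earlier. It extends the toy mapping idea from the first example: the mapping
records where the live copy of each logical block resides, the least recently
used erase block is chosen as the victim, live entries are rewritten at the
head of the log, and the victim is then erased. This is an illustrative model
with invented names, not a description of any particular firmware.

```python
# Illustrative garbage collection over a toy log-structured layout.
# erase_blocks[i] holds (lba, data) entries in write order; mapping maps an
# LBA to the (block_index, entry_index) of its live copy.

def garbage_collect(erase_blocks, mapping, lru_order, head):
    # 1. Select a target erase block; model it as the least recently used one.
    victim = lru_order.pop(0)

    # 2. + 3. Walk every entry in the victim block. Entries that are still the
    # live copy of some logical block must be moved: rewrite them at the
    # current head of the log and update the mapping.
    for entry_index, (lba, data) in enumerate(erase_blocks[victim]):
        if mapping.get(lba) == (victim, entry_index):
            erase_blocks[head].append((lba, data))
            mapping[lba] = (head, len(erase_blocks[head]) - 1)

    # 4. Erase the whole block and mark it available for new writes.
    erase_blocks[victim].clear()
    return victim

# Example: block 0 is the LRU victim; LBA 7 was already rewritten into block 1,
# so only LBA 8 still has its live copy in block 0 and must be moved.
blocks = [[(7, b"old"), (8, b"b")], [(7, b"new")]]
mapping = {7: (1, 0), 8: (0, 1)}
garbage_collect(blocks, mapping, lru_order=[0, 1], head=1)
assert blocks[0] == [] and mapping[8] == (1, 1)
```

If every entry in the victim block was already overwritten or deallocated, the
loop moves nothing and step #3 is effectively skipped, which is exactly what
over-provisioning and circular write patterns make more likely.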