1cf0e7348SJim Harris# SPDK "Reduce" Block Compression Algorithm {#reduce} 2cf0e7348SJim Harris 3cf0e7348SJim Harris## Overview 4cf0e7348SJim Harris 5cf0e7348SJim HarrisThe SPDK "reduce" block compression scheme is based on using SSDs for storing compressed blocks of 6cf0e7348SJim Harrisstorage and persistent memory for metadata. This metadata includes mappings of logical blocks 7cf0e7348SJim Harrisrequested by a user to the compressed blocks on SSD. The scheme described in this document 8cf0e7348SJim Harrisis generic and not tied to any specific block device framework such as the SPDK block device (bdev) 9cf0e7348SJim Harrisframework. This algorithm will be implemented in a library called "libreduce". Higher-level 10cf0e7348SJim Harrissoftware modules can built on top of this library to create and present block devices in a 11cf0e7348SJim Harrisspecific block device framework. For SPDK, a bdev_reduce module will serve as a wrapper around 12cf0e7348SJim Harristhe libreduce library, to present the compressed block devices as an SPDK bdev. 13cf0e7348SJim Harris 14cf0e7348SJim HarrisThis scheme only describes how compressed blocks are stored on an SSD and the metadata for tracking 15cf0e7348SJim Harristhose compressed blocks. It relies on the higher-software module to perform the compression 16cf0e7348SJim Harrisalgorithm itself. For SPDK, the bdev_reduce module will utilize the DPDK compressdev framework 17cf0e7348SJim Harristo perform compression and decompression on behalf of the libreduce library. 18cf0e7348SJim Harris 19cf0e7348SJim Harris(Note that in some cases, blocks of storage may not be compressible, or cannot be compressed enough 20cf0e7348SJim Harristo realize savings from the compression. In these cases, the data may be stored uncompressed on 21cf0e7348SJim Harrisdisk. The phrase "compressed blocks of storage" includes these uncompressed blocks.) 22cf0e7348SJim Harris 23cf0e7348SJim HarrisA compressed block device is a logical entity built on top of a similarly-sized backing storage 24cf0e7348SJim Harrisdevice. The backing storage device must be thin-provisioned to realize any savings from 25cf0e7348SJim Harriscompression for reasons described later in this document. This algorithm has no direct knowledge 26cf0e7348SJim Harrisof the implementation of the backing storage device, except that it will always use the 27cf0e7348SJim Harrislowest-numbered blocks available on the backing storage device. This will ensure that when this 28cf0e7348SJim Harrisalgorithm is used on a thin-provisioned backing storage device, blocks will not be allocated until 29cf0e7348SJim Harristhey are actually needed. 30cf0e7348SJim Harris 31cf0e7348SJim HarrisThe backing storage device must be sized for the worst case scenario, where no data can be 32cf0e7348SJim Harriscompressed. In this case, the size of the backing storage device would be the same as the 33cf0e7348SJim Harriscompressed block device. Since this algorithm ensures atomicity by never overwriting data 34cf0e7348SJim Harrisin place, some additional backing storage is required to temporarily store data for writes in 35cf0e7348SJim Harrisprogress before the associated metadata is updated. 36cf0e7348SJim Harris 37cf0e7348SJim HarrisStorage from the backing storage device will be allocated, read, and written to in 4KB units for 38cf0e7348SJim Harrisbest NVMe performance. These 4KB units are called "backing IO units". They are indexed from 0 to N-1 39cf0e7348SJim Harriswith the indices called "backing IO unit indices". At start, the full set of indices represent the 40cf0e7348SJim Harris"free backing IO unit list". 41cf0e7348SJim Harris 42cf0e7348SJim HarrisA compressed block device compresses and decompresses data in units of chunks, where a chunk is a 43cf0e7348SJim Harrismultiple of at least two 4KB backing IO units. The number of backing IO units per chunk determines 44cf0e7348SJim Harristhe chunk size and is specified when the compressed block device is created. A chunk 45cf0e7348SJim Harrisconsumes a number of 4KB backing IO units between 1 and the number of 4KB units in the chunk. For 46cf0e7348SJim Harrisexample, a 16KB chunk consumes 1, 2, 3 or 4 backing IO units. The number of backing IO units depends on how 47cf0e7348SJim Harrismuch the chunk was able to be compressed. The blocks on disk associated with a chunk are stored in a 48cf0e7348SJim Harris"chunk map" in persistent memory. Each chunk map consists of N 64-bit values, where N is the maximum 49cf0e7348SJim Harrisnumber of backing IO units in the chunk. Each 64-bit value corresponds to a backing IO unit index. A 50cf0e7348SJim Harrisspecial value (for example, 2^64-1) is used for backing IO units not needed due to compression. The 51cf0e7348SJim Harrisnumber of chunk maps allocated is equal to the size of the compressed block device divided by its chunk 52cf0e7348SJim Harrissize, plus some number of extra chunk maps. These extra chunk maps are used to ensure atomicity on 53cf0e7348SJim Harriswrites and will be explained later in this document. At start, all of the chunk maps represent the 54cf0e7348SJim Harris"free chunk map list". 55cf0e7348SJim Harris 56cf0e7348SJim HarrisFinally, the logical view of the compressed block device is represented by the "logical map". The 57cf0e7348SJim Harrislogical map is a mapping of chunk offsets into the compressed block device to the corresponding 58cf0e7348SJim Harrischunk map. Each entry in the logical map is a 64-bit value, denoting the associated chunk map. 59cf0e7348SJim HarrisA special value (UINT64_MAX) is used if there is no associated chunk map. The mapping is 60cf0e7348SJim Harrisdetermined by dividing the byte offset by the chunk size to get an index, which is used as an 61cf0e7348SJim Harrisarray index into the array of chunk map entries. At start, all entries in the logical map have no 62cf0e7348SJim Harrisassociated chunk map. Note that while access to the backing storage device is in 4KB units, the 63cf0e7348SJim Harrislogical view may allow 4KB or 512B unit access and should perform similarly. 64cf0e7348SJim Harris 65cf0e7348SJim Harris## Example 66cf0e7348SJim Harris 67cf0e7348SJim HarrisTo illustrate this algorithm, we will use a real example at a very small scale. 68cf0e7348SJim Harris 69cf0e7348SJim HarrisThe size of the compressed block device is 64KB, with a chunk size of 16KB. This will 70cf0e7348SJim Harrisrealize the following: 71cf0e7348SJim Harris 72cf0e7348SJim Harris* "Backing storage" will consist of an 80KB thin-provisioned logical volume. This 73cf0e7348SJim Harris corresponds to the 64KB size of the compressed block device, plus an extra 16KB to handle 74cf0e7348SJim Harris additional write operations under a worst-case compression scenario. 75cf0e7348SJim Harris* "Free backing IO unit list" will consist of indices 0 through 19 (inclusive). These represent 76cf0e7348SJim Harris the 20 4KB IO units in the backing storage. 77cf0e7348SJim Harris* A "chunk map" will be 32 bytes in size. This corresponds to 4 backing IO units per chunk 78cf0e7348SJim Harris (16KB / 4KB), and 8B (64b) per backing IO unit index. 79dc07387aSRami Rosen* 5 chunk maps will be allocated in 160B of persistent memory. This corresponds to 4 chunk maps 80cf0e7348SJim Harris for the 4 chunks in the compressed block device (64KB / 16KB), plus an extra chunk map for use 81cf0e7348SJim Harris when overwriting an existing chunk. 82cf0e7348SJim Harris* "Free chunk map list" will consist of indices 0 through 4 (inclusive). These represent the 83cf0e7348SJim Harris 5 allocated chunk maps. 84cf0e7348SJim Harris* The "logical map" will be allocated in 32B of persistent memory. This corresponds to 85cf0e7348SJim Harris 4 entries for the 4 chunks in the compressed block device and 8B (64b) per entry. 86cf0e7348SJim Harris 87cf0e7348SJim HarrisIn these examples, the value "X" will represent the special value (2^64-1) described above. 88cf0e7348SJim Harris 89cf0e7348SJim Harris### Initial Creation 90cf0e7348SJim Harris 91*63ee471bSMaciej Wawryk```text 92cf0e7348SJim Harris +--------------------+ 93cf0e7348SJim Harris Backing Device | | 94cf0e7348SJim Harris +--------------------+ 95cf0e7348SJim Harris 96cf0e7348SJim Harris Free Backing IO Unit List 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 97cf0e7348SJim Harris 98cf0e7348SJim Harris +------------+------------+------------+------------+------------+ 99cf0e7348SJim Harris Chunk Maps | | | | | | 100cf0e7348SJim Harris +------------+------------+------------+------------+------------+ 101cf0e7348SJim Harris 102cf0e7348SJim Harris Free Chunk Map List 0, 1, 2, 3, 4 103cf0e7348SJim Harris 104cf0e7348SJim Harris +---+---+---+---+ 105cf0e7348SJim Harris Logical Map | X | X | X | X | 106cf0e7348SJim Harris +---+---+---+---+ 1070dd5f32eSDarek Stojaczyk``` 108cf0e7348SJim Harris 109cf0e7348SJim Harris### Write 16KB at Offset 32KB 110cf0e7348SJim Harris 111cf0e7348SJim Harris* Find the corresponding index into the logical map. Offset 32KB divided by the chunk size 112cf0e7348SJim Harris (16KB) is 2. 113cf0e7348SJim Harris* Entry 2 in the logical map is "X". This means no part of this 16KB has been written to yet. 114cf0e7348SJim Harris* Allocate a 16KB buffer in memory 115cf0e7348SJim Harris* Compress the incoming 16KB of data into this allocated buffer 116cf0e7348SJim Harris* Assume this data compresses to 6KB. This requires 2 4KB backing IO units. 117cf0e7348SJim Harris* Allocate 2 blocks (0 and 1) from the free backing IO unit list. Always use the lowest numbered 118cf0e7348SJim Harris entries in the free backing IO unit list - this ensures that unnecessary backing storage 119cf0e7348SJim Harris is not allocated in the thin-provisioned logical volume holding the backing storage. 120cf0e7348SJim Harris* Write the 6KB of data to backing IO units 0 and 1. 121cf0e7348SJim Harris* Allocate a chunk map (0) from the free chunk map list. 122cf0e7348SJim Harris* Write (0, 1, X, X) to the chunk map. This represents that only 2 backing IO units were used to 123cf0e7348SJim Harris store the 16KB of data. 124cf0e7348SJim Harris* Write the chunk map index to entry 2 in the logical map. 125cf0e7348SJim Harris 126*63ee471bSMaciej Wawryk```text 127cf0e7348SJim Harris +--------------------+ 128cf0e7348SJim Harris Backing Device |01 | 129cf0e7348SJim Harris +--------------------+ 130cf0e7348SJim Harris 131cf0e7348SJim Harris Free Backing IO Unit List 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 132cf0e7348SJim Harris 133cf0e7348SJim Harris +------------+------------+------------+------------+------------+ 134cf0e7348SJim Harris Chunk Maps | 0 1 X X | | | | | 135cf0e7348SJim Harris +------------+------------+------------+------------+------------+ 136cf0e7348SJim Harris 137cf0e7348SJim Harris Free Chunk Map List 1, 2, 3, 4 138cf0e7348SJim Harris 139cf0e7348SJim Harris +---+---+---+---+ 140cf0e7348SJim Harris Logical Map | X | X | 0 | X | 141cf0e7348SJim Harris +---+---+---+---+ 1420dd5f32eSDarek Stojaczyk``` 143cf0e7348SJim Harris 144cf0e7348SJim Harris### Write 4KB at Offset 8KB 145cf0e7348SJim Harris 146cf0e7348SJim Harris* Find the corresponding index into the logical map. Offset 8KB divided by the chunk size is 0. 147cf0e7348SJim Harris* Entry 0 in the logical map is "X". This means no part of this 16KB has been written to yet. 148cf0e7348SJim Harris* The write is not for the entire 16KB chunk, so we must allocate a 16KB chunk-sized buffer for 149cf0e7348SJim Harris source data. 150cf0e7348SJim Harris* Copy the incoming 4KB data to offset 8KB of this 16KB buffer. Zero the rest of the 16KB buffer. 151cf0e7348SJim Harris* Allocate a 16KB destination buffer. 152cf0e7348SJim Harris* Compress the 16KB source data buffer into the 16KB destination buffer 153cf0e7348SJim Harris* Assume this data compresses to 3KB. This requires 1 4KB backing IO unit. 154cf0e7348SJim Harris* Allocate 1 block (2) from the free backing IO unit list. 155cf0e7348SJim Harris* Write the 3KB of data to block 2. 156cf0e7348SJim Harris* Allocate a chunk map (1) from the free chunk map list. 157cf0e7348SJim Harris* Write (2, X, X, X) to the chunk map. 158cf0e7348SJim Harris* Write the chunk map index to entry 0 in the logical map. 159cf0e7348SJim Harris 160*63ee471bSMaciej Wawryk```text 161cf0e7348SJim Harris +--------------------+ 162cf0e7348SJim Harris Backing Device |012 | 163cf0e7348SJim Harris +--------------------+ 164cf0e7348SJim Harris 165cf0e7348SJim Harris Free Backing IO Unit List 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 166cf0e7348SJim Harris 167cf0e7348SJim Harris +------------+------------+------------+------------+------------+ 168cf0e7348SJim Harris Chunk Maps | 0 1 X X | 2 X X X | | | | 169cf0e7348SJim Harris +------------+------------+------------+------------+------------+ 170cf0e7348SJim Harris 171cf0e7348SJim Harris Free Chunk Map List 2, 3, 4 172cf0e7348SJim Harris 173cf0e7348SJim Harris +---+---+---+---+ 174cf0e7348SJim Harris Logical Map | 1 | X | 0 | X | 175cf0e7348SJim Harris +---+---+---+---+ 1760dd5f32eSDarek Stojaczyk``` 177cf0e7348SJim Harris 178cf0e7348SJim Harris### Read 16KB at Offset 16KB 179cf0e7348SJim Harris 180cf0e7348SJim Harris* Offset 16KB maps to index 1 in the logical map. 181cf0e7348SJim Harris* Entry 1 in the logical map is "X". This means no part of this 16KB has been written to yet. 182cf0e7348SJim Harris* Since no data has been written to this chunk, return all 0's to satisfy the read I/O. 183cf0e7348SJim Harris 184cf0e7348SJim Harris### Write 4KB at Offset 4KB 185cf0e7348SJim Harris 186cf0e7348SJim Harris* Offset 4KB maps to index 0 in the logical map. 187cf0e7348SJim Harris* Entry 0 in the logical map is "1". Since we are not overwriting the entire chunk, we must 188cf0e7348SJim Harris do a read-modify-write. 189cf0e7348SJim Harris* Chunk map 1 only specifies one backing IO unit (2). Allocate a 16KB buffer and read block 190cf0e7348SJim Harris 2 into it. This will be called the compressed data buffer. Note that 16KB is allocated 191cf0e7348SJim Harris instead of 4KB so that we can reuse this buffer to hold the compressed data that will 192cf0e7348SJim Harris be written later back to disk. 193cf0e7348SJim Harris* Allocate a 16KB buffer for the uncompressed data for this chunk. Decompress the data from 194cf0e7348SJim Harris the compressed data buffer into this buffer. 195cf0e7348SJim Harris* Copy the incoming 4KB of data to offset 4KB of the uncompressed data buffer. 196cf0e7348SJim Harris* Compress the 16KB uncompressed data buffer into the compressed data buffer. 197cf0e7348SJim Harris* Assume this data compresses to 5KB. This requires 2 4KB backing IO units. 198cf0e7348SJim Harris* Allocate blocks 3 and 4 from the free backing IO unit list. 199cf0e7348SJim Harris* Write the 5KB of data to blocks 3 and 4. 200cf0e7348SJim Harris* Allocate chunk map 2 from the free chunk map list. 201cf0e7348SJim Harris* Write (3, 4, X, X) to chunk map 2. Note that at this point, the chunk map is not referenced 202cf0e7348SJim Harris by the logical map. If there was a power fail at this point, the previous data for this chunk 203cf0e7348SJim Harris would still be fully valid. 204cf0e7348SJim Harris* Write chunk map 2 to entry 0 in the logical map. 205cf0e7348SJim Harris* Free chunk map 1 back to the free chunk map list. 206cf0e7348SJim Harris* Free backing IO unit 2 back to the free backing IO unit list. 207cf0e7348SJim Harris 208*63ee471bSMaciej Wawryk```text 209cf0e7348SJim Harris +--------------------+ 210cf0e7348SJim Harris Backing Device |01 34 | 211cf0e7348SJim Harris +--------------------+ 212cf0e7348SJim Harris 213cf0e7348SJim Harris Free Backing IO Unit List 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 214cf0e7348SJim Harris 215cf0e7348SJim Harris +------------+------------+------------+------------+------------+ 216cf0e7348SJim Harris Chunk Maps | 0 1 X X | | 3 4 X X | | | 217cf0e7348SJim Harris +------------+------------+------------+------------+------------+ 218cf0e7348SJim Harris 219cf0e7348SJim Harris Free Chunk Map List 1, 3, 4 220cf0e7348SJim Harris 221cf0e7348SJim Harris +---+---+---+---+ 222cf0e7348SJim Harris Logical Map | 2 | X | 0 | X | 223cf0e7348SJim Harris +---+---+---+---+ 2240dd5f32eSDarek Stojaczyk``` 225cf0e7348SJim Harris 226cf0e7348SJim Harris### Operations that span across multiple chunks 227cf0e7348SJim Harris 228cf0e7348SJim HarrisOperations that span a chunk boundary are logically split into multiple operations, each of 229cf0e7348SJim Harriswhich is associated with a single chunk. 230cf0e7348SJim Harris 231cf0e7348SJim HarrisExample: 20KB write at offset 4KB 232cf0e7348SJim Harris 233cf0e7348SJim HarrisIn this case, the write operation is split into a 12KB write at offset 4KB (affecting only 234cf0e7348SJim Harrischunk 0 in the logical map) and a 8KB write at offset 16KB (affecting only chunk 1 in the 235cf0e7348SJim Harrislogical map). Each write is processed independently using the algorithm described above. 236cf0e7348SJim HarrisCompletion of the 20KB write does not occur until both operations have completed. 237cf0e7348SJim Harris 238cf0e7348SJim Harris### Unmap Operations 239cf0e7348SJim Harris 240cf0e7348SJim HarrisUnmap operations on an entire chunk are achieved by removing the chunk map entry (if any) from 241cf0e7348SJim Harristhe logical map. The chunk map is returned to the free chunk map list, and any backing IO units 242cf0e7348SJim Harrisassociated with the chunk map are returned to the free backing IO unit list. 243cf0e7348SJim Harris 244cf0e7348SJim HarrisUnmap operations that affect only part of a chunk can be treated as writing zeroes to that 245cf0e7348SJim Harrisregion of the chunk. If the entire chunk is unmapped via several operations, it can be 246cf0e7348SJim Harrisdetected via the uncompressed data equaling all zeroes. When this occurs, the chunk map entry 247cf0e7348SJim Harrismay be removed from the logical map. 248cf0e7348SJim Harris 249cf0e7348SJim HarrisAfter an entire chunk has been unmapped, subsequent reads to the chunk will return all zeroes. 250cf0e7348SJim HarrisThis is similar to the "Read 16KB at offset 16KB" example above. 251cf0e7348SJim Harris 252cf0e7348SJim Harris### Write Zeroes Operations 253cf0e7348SJim Harris 254cf0e7348SJim HarrisWrite zeroes operations are handled similarly to unmap operations. If a write zeroes 255cf0e7348SJim Harrisoperation covers an entire chunk, we can remove the chunk's entry in the logical map 256cf0e7348SJim Harriscompletely. Then subsequent reads to that chunk will return all zeroes. 257cf0e7348SJim Harris 258cf0e7348SJim Harris### Restart 259cf0e7348SJim Harris 260cf0e7348SJim HarrisAn application using libreduce will periodically exit and need to be restarted. When the 261cf0e7348SJim Harrisapplication restarts, it will reload compressed volumes so they can be used again from the 262cf0e7348SJim Harrissame state as when the application exited. 263cf0e7348SJim Harris 264cf0e7348SJim HarrisWhen the compressed volume is reloaded, the free chunk map list and free backing IO unit list 265cf0e7348SJim Harrisare reconstructed by walking the logical map. The logical map will only point to valid 266cf0e7348SJim Harrischunk maps, and the valid chunk maps will only point to valid backing IO units. Any chunk maps 267cf0e7348SJim Harrisand backing IO units not referenced go into their respective free lists. 268cf0e7348SJim Harris 269cf0e7348SJim HarrisThis ensures that if a system crashes in the middle of a write operation - i.e. during or 270cf0e7348SJim Harrisafter a chunk map is updated, but before it is written to the logical map - that everything 271cf0e7348SJim Harrisrelated to that in-progress write will be ignored after the compressed volume is restarted. 272cf0e7348SJim Harris 273cf0e7348SJim Harris### Overlapping operations on same chunk 274cf0e7348SJim Harris 275cf0e7348SJim HarrisImplementations must take care to handle overlapping operations on the same chunk. For example, 276cf0e7348SJim Harrisoperation 1 writes some data to chunk A, and while this is in progress, operation 2 also writes 277cf0e7348SJim Harrissome data to chunk A. In this case, operation 2 should not start until operation 1 has 278cf0e7348SJim Harriscompleted. Further optimizations are outside the scope of this document. 279cf0e7348SJim Harris 280cf0e7348SJim Harris### Thin provisioned backing storage 281cf0e7348SJim Harris 282cf0e7348SJim HarrisBacking storage must be thin provisioned to realize any savings from compression. This algorithm 283cf0e7348SJim Harriswill always use (and reuse) backing IO units available closest to offset 0 on the backing device. 284cf0e7348SJim HarrisThis ensures that even though backing storage device may have been sized similarly to the size of 285cf0e7348SJim Harristhe compressed volume, storage for the backing storage device will not actually be allocated 286cf0e7348SJim Harrisuntil the backing IO units are actually needed. 287