xref: /spdk/doc/compression.md (revision 63ee471b6411a7b35c8e2495a0a89d61d4d3d10f)
1# SPDK "Reduce" Block Compression Algorithm {#reduce}
2
3## Overview
4
5The SPDK "reduce" block compression scheme is based on using SSDs for storing compressed blocks of
6storage and persistent memory for metadata.  This metadata includes mappings of logical blocks
7requested by a user to the compressed blocks on SSD.  The scheme described in this document
8is generic and not tied to any specific block device framework such as the SPDK block device (bdev)
9framework.  This algorithm will be implemented in a library called "libreduce".  Higher-level
10software modules can built on top of this library to create and present block devices in a
11specific block device framework.  For SPDK, a bdev_reduce module will serve as a wrapper around
12the libreduce library, to present the compressed block devices as an SPDK bdev.
13
14This scheme only describes how compressed blocks are stored on an SSD and the metadata for tracking
15those compressed blocks.  It relies on the higher-software module to perform the compression
16algorithm itself.  For SPDK, the bdev_reduce module will utilize the DPDK compressdev framework
17to perform compression and decompression on behalf of the libreduce library.
18
19(Note that in some cases, blocks of storage may not be compressible, or cannot be compressed enough
20to realize savings from the compression.  In these cases, the data may be stored uncompressed on
21disk.  The phrase "compressed blocks of storage" includes these uncompressed blocks.)
22
23A compressed block device is a logical entity built on top of a similarly-sized backing storage
24device.  The backing storage device must be thin-provisioned to realize any savings from
25compression for reasons described later in this document.  This algorithm has no direct knowledge
26of the implementation of the backing storage device, except that it will always use the
27lowest-numbered blocks available on the backing storage device.  This will ensure that when this
28algorithm is used on a thin-provisioned backing storage device, blocks will not be allocated until
29they are actually needed.
30
31The backing storage device must be sized for the worst case scenario, where no data can be
32compressed.  In this case, the size of the backing storage device would be the same as the
33compressed block device.  Since this algorithm ensures atomicity by never overwriting data
34in place, some additional backing storage is required to temporarily store data for writes in
35progress before the associated metadata is updated.
36
37Storage from the backing storage device will be allocated, read, and written to in 4KB units for
38best NVMe performance.  These 4KB units are called "backing IO units".  They are indexed from 0 to N-1
39with the indices called "backing IO unit indices".  At start, the full set of indices represent the
40"free backing IO unit list".
41
42A compressed block device compresses and decompresses data in units of chunks, where a chunk is a
43multiple of at least two 4KB backing IO units.  The number of backing IO units per chunk determines
44the chunk size and is specified when the compressed block device is created.  A chunk
45consumes a number of 4KB backing IO units between 1 and the number of 4KB units in the chunk.  For
46example, a 16KB chunk consumes 1, 2, 3 or 4 backing IO units.  The number of backing IO units depends on how
47much the chunk was able to be compressed.  The blocks on disk associated with a chunk are stored in a
48"chunk map" in persistent memory.  Each chunk map consists of N 64-bit values, where N is the maximum
49number of backing IO units in the chunk.  Each 64-bit value corresponds to a backing IO unit index.  A
50special value (for example, 2^64-1) is used for backing IO units not needed due to compression.  The
51number of chunk maps allocated is equal to the size of the compressed block device divided by its chunk
52size, plus some number of extra chunk maps.  These extra chunk maps are used to ensure atomicity on
53writes and will be explained later in this document.  At start, all of the chunk maps represent the
54"free chunk map list".
55
56Finally, the logical view of the compressed block device is represented by the "logical map".  The
57logical map is a mapping of chunk offsets into the compressed block device to the corresponding
58chunk map.  Each entry in the logical map is a 64-bit value, denoting the associated chunk map.
59A special value (UINT64_MAX) is used if there is no associated chunk map.  The mapping is
60determined by dividing the byte offset by the chunk size to get an index, which is used as an
61array index into the array of chunk map entries.  At start, all entries in the logical map have no
62associated chunk map.  Note that while access to the backing storage device is in 4KB units, the
63logical view may allow 4KB or 512B unit access and should perform similarly.
64
65## Example
66
67To illustrate this algorithm, we will use a real example at a very small scale.
68
69The size of the compressed block device is 64KB, with a chunk size of 16KB.  This will
70realize the following:
71
72* "Backing storage" will consist of an 80KB thin-provisioned logical volume.  This
73  corresponds to the 64KB size of the compressed block device, plus an extra 16KB to handle
74  additional write operations under a worst-case compression scenario.
75* "Free backing IO unit list" will consist of indices 0 through 19 (inclusive).  These represent
76  the 20 4KB IO units in the backing storage.
77* A "chunk map" will be 32 bytes in size.  This corresponds to 4 backing IO units per chunk
78  (16KB / 4KB), and 8B (64b) per backing IO unit index.
79* 5 chunk maps will be allocated in 160B of persistent memory.  This corresponds to 4 chunk maps
80  for the 4 chunks in the compressed block device (64KB / 16KB), plus an extra chunk map for use
81  when overwriting an existing chunk.
82* "Free chunk map list" will consist of indices 0 through 4 (inclusive).  These represent the
83  5 allocated chunk maps.
84* The "logical map" will be allocated in 32B of persistent memory.  This corresponds to
85  4 entries for the 4 chunks in the compressed block device and 8B (64b) per entry.
86
87In these examples, the value "X" will represent the special value (2^64-1) described above.
88
89### Initial Creation
90
91```text
92                  +--------------------+
93  Backing Device  |                    |
94                  +--------------------+
95
96  Free Backing IO Unit List  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
97
98             +------------+------------+------------+------------+------------+
99  Chunk Maps |            |            |            |            |            |
100             +------------+------------+------------+------------+------------+
101
102  Free Chunk Map List  0, 1, 2, 3, 4
103
104              +---+---+---+---+
105  Logical Map | X | X | X | X |
106              +---+---+---+---+
107```
108
109### Write 16KB at Offset 32KB
110
111* Find the corresponding index into the logical map.  Offset 32KB divided by the chunk size
112  (16KB) is 2.
113* Entry 2 in the logical map is "X".  This means no part of this 16KB has been written to yet.
114* Allocate a 16KB buffer in memory
115* Compress the incoming 16KB of data into this allocated buffer
116* Assume this data compresses to 6KB.  This requires 2 4KB backing IO units.
117* Allocate 2 blocks (0 and 1) from the free backing IO unit list.  Always use the lowest numbered
118  entries in the free backing IO unit list - this ensures that unnecessary backing storage
119  is not allocated in the thin-provisioned logical volume holding the backing storage.
120* Write the 6KB of data to backing IO units 0 and 1.
121* Allocate a chunk map (0) from the free chunk map list.
122* Write (0, 1, X, X) to the chunk map.  This represents that only 2 backing IO units were used to
123  store the 16KB of data.
124* Write the chunk map index to entry 2 in the logical map.
125
126```text
127                  +--------------------+
128  Backing Device  |01                  |
129                  +--------------------+
130
131  Free Backing IO Unit List  2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
132
133             +------------+------------+------------+------------+------------+
134  Chunk Maps | 0 1 X X    |            |            |            |            |
135             +------------+------------+------------+------------+------------+
136
137  Free Chunk Map List  1, 2, 3, 4
138
139              +---+---+---+---+
140  Logical Map | X | X | 0 | X |
141              +---+---+---+---+
142```
143
144### Write 4KB at Offset 8KB
145
146* Find the corresponding index into the logical map.  Offset 8KB divided by the chunk size is 0.
147* Entry 0 in the logical map is "X".  This means no part of this 16KB has been written to yet.
148* The write is not for the entire 16KB chunk, so we must allocate a 16KB chunk-sized buffer for
149  source data.
150* Copy the incoming 4KB data to offset 8KB of this 16KB buffer.  Zero the rest of the 16KB buffer.
151* Allocate a 16KB destination buffer.
152* Compress the 16KB source data buffer into the 16KB destination buffer
153* Assume this data compresses to 3KB.  This requires 1 4KB backing IO unit.
154* Allocate 1 block (2) from the free backing IO unit list.
155* Write the 3KB of data to block 2.
156* Allocate a chunk map (1) from the free chunk map list.
157* Write (2, X, X, X) to the chunk map.
158* Write the chunk map index to entry 0 in the logical map.
159
160```text
161                  +--------------------+
162  Backing Device  |012                 |
163                  +--------------------+
164
165  Free Backing IO Unit List  3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
166
167             +------------+------------+------------+------------+------------+
168  Chunk Maps | 0 1 X X    | 2 X X X    |            |            |            |
169             +------------+------------+------------+------------+------------+
170
171  Free Chunk Map List  2, 3, 4
172
173              +---+---+---+---+
174  Logical Map | 1 | X | 0 | X |
175              +---+---+---+---+
176```
177
178### Read 16KB at Offset 16KB
179
180* Offset 16KB maps to index 1 in the logical map.
181* Entry 1 in the logical map is "X".  This means no part of this 16KB has been written to yet.
182* Since no data has been written to this chunk, return all 0's to satisfy the read I/O.
183
184### Write 4KB at Offset 4KB
185
186* Offset 4KB maps to index 0 in the logical map.
187* Entry 0 in the logical map is "1".  Since we are not overwriting the entire chunk, we must
188  do a read-modify-write.
189* Chunk map 1 only specifies one backing IO unit (2).  Allocate a 16KB buffer and read block
190  2 into it.  This will be called the compressed data buffer.  Note that 16KB is allocated
191  instead of 4KB so that we can reuse this buffer to hold the compressed data that will
192  be written later back to disk.
193* Allocate a 16KB buffer for the uncompressed data for this chunk.  Decompress the data from
194  the compressed data buffer into this buffer.
195* Copy the incoming 4KB of data to offset 4KB of the uncompressed data buffer.
196* Compress the 16KB uncompressed data buffer into the compressed data buffer.
197* Assume this data compresses to 5KB.  This requires 2 4KB backing IO units.
198* Allocate blocks 3 and 4 from the free backing IO unit list.
199* Write the 5KB of data to blocks 3 and 4.
200* Allocate chunk map 2 from the free chunk map list.
201* Write (3, 4, X, X) to chunk map 2.  Note that at this point, the chunk map is not referenced
202  by the logical map.  If there was a power fail at this point, the previous data for this chunk
203  would still be fully valid.
204* Write chunk map 2 to entry 0 in the logical map.
205* Free chunk map 1 back to the free chunk map list.
206* Free backing IO unit 2 back to the free backing IO unit list.
207
208```text
209                  +--------------------+
210  Backing Device  |01 34               |
211                  +--------------------+
212
213  Free Backing IO Unit List  2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
214
215             +------------+------------+------------+------------+------------+
216  Chunk Maps | 0 1 X X    |            | 3 4 X X    |            |            |
217             +------------+------------+------------+------------+------------+
218
219  Free Chunk Map List  1, 3, 4
220
221              +---+---+---+---+
222  Logical Map | 2 | X | 0 | X |
223              +---+---+---+---+
224```
225
226### Operations that span across multiple chunks
227
228Operations that span a chunk boundary are logically split into multiple operations, each of
229which is associated with a single chunk.
230
231Example: 20KB write at offset 4KB
232
233In this case, the write operation is split into a 12KB write at offset 4KB (affecting only
234chunk 0 in the logical map) and a 8KB write at offset 16KB (affecting only chunk 1 in the
235logical map).  Each write is processed independently using the algorithm described above.
236Completion of the 20KB write does not occur until both operations have completed.
237
238### Unmap Operations
239
240Unmap operations on an entire chunk are achieved by removing the chunk map entry (if any) from
241the logical map.  The chunk map is returned to the free chunk map list, and any backing IO units
242associated with the chunk map are returned to the free backing IO unit list.
243
244Unmap operations that affect only part of a chunk can be treated as writing zeroes to that
245region of the chunk.  If the entire chunk is unmapped via several operations, it can be
246detected via the uncompressed data equaling all zeroes.  When this occurs, the chunk map entry
247may be removed from the logical map.
248
249After an entire chunk has been unmapped, subsequent reads to the chunk will return all zeroes.
250This is similar to the "Read 16KB at offset 16KB" example above.
251
252### Write Zeroes Operations
253
254Write zeroes operations are handled similarly to unmap operations.  If a write zeroes
255operation covers an entire chunk, we can remove the chunk's entry in the logical map
256completely.  Then subsequent reads to that chunk will return all zeroes.
257
258### Restart
259
260An application using libreduce will periodically exit and need to be restarted.  When the
261application restarts, it will reload compressed volumes so they can be used again from the
262same state as when the application exited.
263
264When the compressed volume is reloaded, the free chunk map list and free backing IO unit list
265are reconstructed by walking the logical map.  The logical map will only point to valid
266chunk maps, and the valid chunk maps will only point to valid backing IO units.  Any chunk maps
267and backing IO units not referenced go into their respective free lists.
268
269This ensures that if a system crashes in the middle of a write operation - i.e. during or
270after a chunk map is updated, but before it is written to the logical map - that everything
271related to that in-progress write will be ignored after the compressed volume is restarted.
272
273### Overlapping operations on same chunk
274
275Implementations must take care to handle overlapping operations on the same chunk.  For example,
276operation 1 writes some data to chunk A, and while this is in progress, operation 2 also writes
277some data to chunk A.  In this case, operation 2 should not start until operation 1 has
278completed.  Further optimizations are outside the scope of this document.
279
280### Thin provisioned backing storage
281
282Backing storage must be thin provisioned to realize any savings from compression.  This algorithm
283will always use (and reuse) backing IO units available closest to offset 0 on the backing device.
284This ensures that even though backing storage device may have been sized similarly to the size of
285the compressed volume, storage for the backing storage device will not actually be allocated
286until the backing IO units are actually needed.
287