xref: /spdk/doc/compression.md (revision 63ee471b6411a7b35c8e2495a0a89d61d4d3d10f)
1cf0e7348SJim Harris# SPDK "Reduce" Block Compression Algorithm {#reduce}
2cf0e7348SJim Harris
3cf0e7348SJim Harris## Overview
4cf0e7348SJim Harris
5cf0e7348SJim HarrisThe SPDK "reduce" block compression scheme is based on using SSDs for storing compressed blocks of
6cf0e7348SJim Harrisstorage and persistent memory for metadata.  This metadata includes mappings of logical blocks
7cf0e7348SJim Harrisrequested by a user to the compressed blocks on SSD.  The scheme described in this document
8cf0e7348SJim Harrisis generic and not tied to any specific block device framework such as the SPDK block device (bdev)
9cf0e7348SJim Harrisframework.  This algorithm will be implemented in a library called "libreduce".  Higher-level
10cf0e7348SJim Harrissoftware modules can built on top of this library to create and present block devices in a
11cf0e7348SJim Harrisspecific block device framework.  For SPDK, a bdev_reduce module will serve as a wrapper around
12cf0e7348SJim Harristhe libreduce library, to present the compressed block devices as an SPDK bdev.
13cf0e7348SJim Harris
14cf0e7348SJim HarrisThis scheme only describes how compressed blocks are stored on an SSD and the metadata for tracking
15cf0e7348SJim Harristhose compressed blocks.  It relies on the higher-software module to perform the compression
16cf0e7348SJim Harrisalgorithm itself.  For SPDK, the bdev_reduce module will utilize the DPDK compressdev framework
17cf0e7348SJim Harristo perform compression and decompression on behalf of the libreduce library.
18cf0e7348SJim Harris
19cf0e7348SJim Harris(Note that in some cases, blocks of storage may not be compressible, or cannot be compressed enough
20cf0e7348SJim Harristo realize savings from the compression.  In these cases, the data may be stored uncompressed on
21cf0e7348SJim Harrisdisk.  The phrase "compressed blocks of storage" includes these uncompressed blocks.)
22cf0e7348SJim Harris
23cf0e7348SJim HarrisA compressed block device is a logical entity built on top of a similarly-sized backing storage
24cf0e7348SJim Harrisdevice.  The backing storage device must be thin-provisioned to realize any savings from
25cf0e7348SJim Harriscompression for reasons described later in this document.  This algorithm has no direct knowledge
26cf0e7348SJim Harrisof the implementation of the backing storage device, except that it will always use the
27cf0e7348SJim Harrislowest-numbered blocks available on the backing storage device.  This will ensure that when this
28cf0e7348SJim Harrisalgorithm is used on a thin-provisioned backing storage device, blocks will not be allocated until
29cf0e7348SJim Harristhey are actually needed.
30cf0e7348SJim Harris
31cf0e7348SJim HarrisThe backing storage device must be sized for the worst case scenario, where no data can be
32cf0e7348SJim Harriscompressed.  In this case, the size of the backing storage device would be the same as the
33cf0e7348SJim Harriscompressed block device.  Since this algorithm ensures atomicity by never overwriting data
34cf0e7348SJim Harrisin place, some additional backing storage is required to temporarily store data for writes in
35cf0e7348SJim Harrisprogress before the associated metadata is updated.
36cf0e7348SJim Harris
37cf0e7348SJim HarrisStorage from the backing storage device will be allocated, read, and written to in 4KB units for
38cf0e7348SJim Harrisbest NVMe performance.  These 4KB units are called "backing IO units".  They are indexed from 0 to N-1
39cf0e7348SJim Harriswith the indices called "backing IO unit indices".  At start, the full set of indices represent the
40cf0e7348SJim Harris"free backing IO unit list".
41cf0e7348SJim Harris
42cf0e7348SJim HarrisA compressed block device compresses and decompresses data in units of chunks, where a chunk is a
43cf0e7348SJim Harrismultiple of at least two 4KB backing IO units.  The number of backing IO units per chunk determines
44cf0e7348SJim Harristhe chunk size and is specified when the compressed block device is created.  A chunk
45cf0e7348SJim Harrisconsumes a number of 4KB backing IO units between 1 and the number of 4KB units in the chunk.  For
46cf0e7348SJim Harrisexample, a 16KB chunk consumes 1, 2, 3 or 4 backing IO units.  The number of backing IO units depends on how
47cf0e7348SJim Harrismuch the chunk was able to be compressed.  The blocks on disk associated with a chunk are stored in a
48cf0e7348SJim Harris"chunk map" in persistent memory.  Each chunk map consists of N 64-bit values, where N is the maximum
49cf0e7348SJim Harrisnumber of backing IO units in the chunk.  Each 64-bit value corresponds to a backing IO unit index.  A
50cf0e7348SJim Harrisspecial value (for example, 2^64-1) is used for backing IO units not needed due to compression.  The
51cf0e7348SJim Harrisnumber of chunk maps allocated is equal to the size of the compressed block device divided by its chunk
52cf0e7348SJim Harrissize, plus some number of extra chunk maps.  These extra chunk maps are used to ensure atomicity on
53cf0e7348SJim Harriswrites and will be explained later in this document.  At start, all of the chunk maps represent the
54cf0e7348SJim Harris"free chunk map list".
55cf0e7348SJim Harris
56cf0e7348SJim HarrisFinally, the logical view of the compressed block device is represented by the "logical map".  The
57cf0e7348SJim Harrislogical map is a mapping of chunk offsets into the compressed block device to the corresponding
58cf0e7348SJim Harrischunk map.  Each entry in the logical map is a 64-bit value, denoting the associated chunk map.
59cf0e7348SJim HarrisA special value (UINT64_MAX) is used if there is no associated chunk map.  The mapping is
60cf0e7348SJim Harrisdetermined by dividing the byte offset by the chunk size to get an index, which is used as an
61cf0e7348SJim Harrisarray index into the array of chunk map entries.  At start, all entries in the logical map have no
62cf0e7348SJim Harrisassociated chunk map.  Note that while access to the backing storage device is in 4KB units, the
63cf0e7348SJim Harrislogical view may allow 4KB or 512B unit access and should perform similarly.
64cf0e7348SJim Harris
65cf0e7348SJim Harris## Example
66cf0e7348SJim Harris
67cf0e7348SJim HarrisTo illustrate this algorithm, we will use a real example at a very small scale.
68cf0e7348SJim Harris
69cf0e7348SJim HarrisThe size of the compressed block device is 64KB, with a chunk size of 16KB.  This will
70cf0e7348SJim Harrisrealize the following:
71cf0e7348SJim Harris
72cf0e7348SJim Harris* "Backing storage" will consist of an 80KB thin-provisioned logical volume.  This
73cf0e7348SJim Harris  corresponds to the 64KB size of the compressed block device, plus an extra 16KB to handle
74cf0e7348SJim Harris  additional write operations under a worst-case compression scenario.
75cf0e7348SJim Harris* "Free backing IO unit list" will consist of indices 0 through 19 (inclusive).  These represent
76cf0e7348SJim Harris  the 20 4KB IO units in the backing storage.
77cf0e7348SJim Harris* A "chunk map" will be 32 bytes in size.  This corresponds to 4 backing IO units per chunk
78cf0e7348SJim Harris  (16KB / 4KB), and 8B (64b) per backing IO unit index.
79dc07387aSRami Rosen* 5 chunk maps will be allocated in 160B of persistent memory.  This corresponds to 4 chunk maps
80cf0e7348SJim Harris  for the 4 chunks in the compressed block device (64KB / 16KB), plus an extra chunk map for use
81cf0e7348SJim Harris  when overwriting an existing chunk.
82cf0e7348SJim Harris* "Free chunk map list" will consist of indices 0 through 4 (inclusive).  These represent the
83cf0e7348SJim Harris  5 allocated chunk maps.
84cf0e7348SJim Harris* The "logical map" will be allocated in 32B of persistent memory.  This corresponds to
85cf0e7348SJim Harris  4 entries for the 4 chunks in the compressed block device and 8B (64b) per entry.
86cf0e7348SJim Harris
87cf0e7348SJim HarrisIn these examples, the value "X" will represent the special value (2^64-1) described above.
88cf0e7348SJim Harris
89cf0e7348SJim Harris### Initial Creation
90cf0e7348SJim Harris
91*63ee471bSMaciej Wawryk```text
92cf0e7348SJim Harris                  +--------------------+
93cf0e7348SJim Harris  Backing Device  |                    |
94cf0e7348SJim Harris                  +--------------------+
95cf0e7348SJim Harris
96cf0e7348SJim Harris  Free Backing IO Unit List  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
97cf0e7348SJim Harris
98cf0e7348SJim Harris             +------------+------------+------------+------------+------------+
99cf0e7348SJim Harris  Chunk Maps |            |            |            |            |            |
100cf0e7348SJim Harris             +------------+------------+------------+------------+------------+
101cf0e7348SJim Harris
102cf0e7348SJim Harris  Free Chunk Map List  0, 1, 2, 3, 4
103cf0e7348SJim Harris
104cf0e7348SJim Harris              +---+---+---+---+
105cf0e7348SJim Harris  Logical Map | X | X | X | X |
106cf0e7348SJim Harris              +---+---+---+---+
1070dd5f32eSDarek Stojaczyk```
108cf0e7348SJim Harris
109cf0e7348SJim Harris### Write 16KB at Offset 32KB
110cf0e7348SJim Harris
111cf0e7348SJim Harris* Find the corresponding index into the logical map.  Offset 32KB divided by the chunk size
112cf0e7348SJim Harris  (16KB) is 2.
113cf0e7348SJim Harris* Entry 2 in the logical map is "X".  This means no part of this 16KB has been written to yet.
114cf0e7348SJim Harris* Allocate a 16KB buffer in memory
115cf0e7348SJim Harris* Compress the incoming 16KB of data into this allocated buffer
116cf0e7348SJim Harris* Assume this data compresses to 6KB.  This requires 2 4KB backing IO units.
117cf0e7348SJim Harris* Allocate 2 blocks (0 and 1) from the free backing IO unit list.  Always use the lowest numbered
118cf0e7348SJim Harris  entries in the free backing IO unit list - this ensures that unnecessary backing storage
119cf0e7348SJim Harris  is not allocated in the thin-provisioned logical volume holding the backing storage.
120cf0e7348SJim Harris* Write the 6KB of data to backing IO units 0 and 1.
121cf0e7348SJim Harris* Allocate a chunk map (0) from the free chunk map list.
122cf0e7348SJim Harris* Write (0, 1, X, X) to the chunk map.  This represents that only 2 backing IO units were used to
123cf0e7348SJim Harris  store the 16KB of data.
124cf0e7348SJim Harris* Write the chunk map index to entry 2 in the logical map.
125cf0e7348SJim Harris
126*63ee471bSMaciej Wawryk```text
127cf0e7348SJim Harris                  +--------------------+
128cf0e7348SJim Harris  Backing Device  |01                  |
129cf0e7348SJim Harris                  +--------------------+
130cf0e7348SJim Harris
131cf0e7348SJim Harris  Free Backing IO Unit List  2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
132cf0e7348SJim Harris
133cf0e7348SJim Harris             +------------+------------+------------+------------+------------+
134cf0e7348SJim Harris  Chunk Maps | 0 1 X X    |            |            |            |            |
135cf0e7348SJim Harris             +------------+------------+------------+------------+------------+
136cf0e7348SJim Harris
137cf0e7348SJim Harris  Free Chunk Map List  1, 2, 3, 4
138cf0e7348SJim Harris
139cf0e7348SJim Harris              +---+---+---+---+
140cf0e7348SJim Harris  Logical Map | X | X | 0 | X |
141cf0e7348SJim Harris              +---+---+---+---+
1420dd5f32eSDarek Stojaczyk```
143cf0e7348SJim Harris
144cf0e7348SJim Harris### Write 4KB at Offset 8KB
145cf0e7348SJim Harris
146cf0e7348SJim Harris* Find the corresponding index into the logical map.  Offset 8KB divided by the chunk size is 0.
147cf0e7348SJim Harris* Entry 0 in the logical map is "X".  This means no part of this 16KB has been written to yet.
148cf0e7348SJim Harris* The write is not for the entire 16KB chunk, so we must allocate a 16KB chunk-sized buffer for
149cf0e7348SJim Harris  source data.
150cf0e7348SJim Harris* Copy the incoming 4KB data to offset 8KB of this 16KB buffer.  Zero the rest of the 16KB buffer.
151cf0e7348SJim Harris* Allocate a 16KB destination buffer.
152cf0e7348SJim Harris* Compress the 16KB source data buffer into the 16KB destination buffer
153cf0e7348SJim Harris* Assume this data compresses to 3KB.  This requires 1 4KB backing IO unit.
154cf0e7348SJim Harris* Allocate 1 block (2) from the free backing IO unit list.
155cf0e7348SJim Harris* Write the 3KB of data to block 2.
156cf0e7348SJim Harris* Allocate a chunk map (1) from the free chunk map list.
157cf0e7348SJim Harris* Write (2, X, X, X) to the chunk map.
158cf0e7348SJim Harris* Write the chunk map index to entry 0 in the logical map.
159cf0e7348SJim Harris
160*63ee471bSMaciej Wawryk```text
161cf0e7348SJim Harris                  +--------------------+
162cf0e7348SJim Harris  Backing Device  |012                 |
163cf0e7348SJim Harris                  +--------------------+
164cf0e7348SJim Harris
165cf0e7348SJim Harris  Free Backing IO Unit List  3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
166cf0e7348SJim Harris
167cf0e7348SJim Harris             +------------+------------+------------+------------+------------+
168cf0e7348SJim Harris  Chunk Maps | 0 1 X X    | 2 X X X    |            |            |            |
169cf0e7348SJim Harris             +------------+------------+------------+------------+------------+
170cf0e7348SJim Harris
171cf0e7348SJim Harris  Free Chunk Map List  2, 3, 4
172cf0e7348SJim Harris
173cf0e7348SJim Harris              +---+---+---+---+
174cf0e7348SJim Harris  Logical Map | 1 | X | 0 | X |
175cf0e7348SJim Harris              +---+---+---+---+
1760dd5f32eSDarek Stojaczyk```
177cf0e7348SJim Harris
178cf0e7348SJim Harris### Read 16KB at Offset 16KB
179cf0e7348SJim Harris
180cf0e7348SJim Harris* Offset 16KB maps to index 1 in the logical map.
181cf0e7348SJim Harris* Entry 1 in the logical map is "X".  This means no part of this 16KB has been written to yet.
182cf0e7348SJim Harris* Since no data has been written to this chunk, return all 0's to satisfy the read I/O.
183cf0e7348SJim Harris
184cf0e7348SJim Harris### Write 4KB at Offset 4KB
185cf0e7348SJim Harris
186cf0e7348SJim Harris* Offset 4KB maps to index 0 in the logical map.
187cf0e7348SJim Harris* Entry 0 in the logical map is "1".  Since we are not overwriting the entire chunk, we must
188cf0e7348SJim Harris  do a read-modify-write.
189cf0e7348SJim Harris* Chunk map 1 only specifies one backing IO unit (2).  Allocate a 16KB buffer and read block
190cf0e7348SJim Harris  2 into it.  This will be called the compressed data buffer.  Note that 16KB is allocated
191cf0e7348SJim Harris  instead of 4KB so that we can reuse this buffer to hold the compressed data that will
192cf0e7348SJim Harris  be written later back to disk.
193cf0e7348SJim Harris* Allocate a 16KB buffer for the uncompressed data for this chunk.  Decompress the data from
194cf0e7348SJim Harris  the compressed data buffer into this buffer.
195cf0e7348SJim Harris* Copy the incoming 4KB of data to offset 4KB of the uncompressed data buffer.
196cf0e7348SJim Harris* Compress the 16KB uncompressed data buffer into the compressed data buffer.
197cf0e7348SJim Harris* Assume this data compresses to 5KB.  This requires 2 4KB backing IO units.
198cf0e7348SJim Harris* Allocate blocks 3 and 4 from the free backing IO unit list.
199cf0e7348SJim Harris* Write the 5KB of data to blocks 3 and 4.
200cf0e7348SJim Harris* Allocate chunk map 2 from the free chunk map list.
201cf0e7348SJim Harris* Write (3, 4, X, X) to chunk map 2.  Note that at this point, the chunk map is not referenced
202cf0e7348SJim Harris  by the logical map.  If there was a power fail at this point, the previous data for this chunk
203cf0e7348SJim Harris  would still be fully valid.
204cf0e7348SJim Harris* Write chunk map 2 to entry 0 in the logical map.
205cf0e7348SJim Harris* Free chunk map 1 back to the free chunk map list.
206cf0e7348SJim Harris* Free backing IO unit 2 back to the free backing IO unit list.
207cf0e7348SJim Harris
208*63ee471bSMaciej Wawryk```text
209cf0e7348SJim Harris                  +--------------------+
210cf0e7348SJim Harris  Backing Device  |01 34               |
211cf0e7348SJim Harris                  +--------------------+
212cf0e7348SJim Harris
213cf0e7348SJim Harris  Free Backing IO Unit List  2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
214cf0e7348SJim Harris
215cf0e7348SJim Harris             +------------+------------+------------+------------+------------+
216cf0e7348SJim Harris  Chunk Maps | 0 1 X X    |            | 3 4 X X    |            |            |
217cf0e7348SJim Harris             +------------+------------+------------+------------+------------+
218cf0e7348SJim Harris
219cf0e7348SJim Harris  Free Chunk Map List  1, 3, 4
220cf0e7348SJim Harris
221cf0e7348SJim Harris              +---+---+---+---+
222cf0e7348SJim Harris  Logical Map | 2 | X | 0 | X |
223cf0e7348SJim Harris              +---+---+---+---+
2240dd5f32eSDarek Stojaczyk```
225cf0e7348SJim Harris
226cf0e7348SJim Harris### Operations that span across multiple chunks
227cf0e7348SJim Harris
228cf0e7348SJim HarrisOperations that span a chunk boundary are logically split into multiple operations, each of
229cf0e7348SJim Harriswhich is associated with a single chunk.
230cf0e7348SJim Harris
231cf0e7348SJim HarrisExample: 20KB write at offset 4KB
232cf0e7348SJim Harris
233cf0e7348SJim HarrisIn this case, the write operation is split into a 12KB write at offset 4KB (affecting only
234cf0e7348SJim Harrischunk 0 in the logical map) and a 8KB write at offset 16KB (affecting only chunk 1 in the
235cf0e7348SJim Harrislogical map).  Each write is processed independently using the algorithm described above.
236cf0e7348SJim HarrisCompletion of the 20KB write does not occur until both operations have completed.
237cf0e7348SJim Harris
238cf0e7348SJim Harris### Unmap Operations
239cf0e7348SJim Harris
240cf0e7348SJim HarrisUnmap operations on an entire chunk are achieved by removing the chunk map entry (if any) from
241cf0e7348SJim Harristhe logical map.  The chunk map is returned to the free chunk map list, and any backing IO units
242cf0e7348SJim Harrisassociated with the chunk map are returned to the free backing IO unit list.
243cf0e7348SJim Harris
244cf0e7348SJim HarrisUnmap operations that affect only part of a chunk can be treated as writing zeroes to that
245cf0e7348SJim Harrisregion of the chunk.  If the entire chunk is unmapped via several operations, it can be
246cf0e7348SJim Harrisdetected via the uncompressed data equaling all zeroes.  When this occurs, the chunk map entry
247cf0e7348SJim Harrismay be removed from the logical map.
248cf0e7348SJim Harris
249cf0e7348SJim HarrisAfter an entire chunk has been unmapped, subsequent reads to the chunk will return all zeroes.
250cf0e7348SJim HarrisThis is similar to the "Read 16KB at offset 16KB" example above.
251cf0e7348SJim Harris
252cf0e7348SJim Harris### Write Zeroes Operations
253cf0e7348SJim Harris
254cf0e7348SJim HarrisWrite zeroes operations are handled similarly to unmap operations.  If a write zeroes
255cf0e7348SJim Harrisoperation covers an entire chunk, we can remove the chunk's entry in the logical map
256cf0e7348SJim Harriscompletely.  Then subsequent reads to that chunk will return all zeroes.
257cf0e7348SJim Harris
258cf0e7348SJim Harris### Restart
259cf0e7348SJim Harris
260cf0e7348SJim HarrisAn application using libreduce will periodically exit and need to be restarted.  When the
261cf0e7348SJim Harrisapplication restarts, it will reload compressed volumes so they can be used again from the
262cf0e7348SJim Harrissame state as when the application exited.
263cf0e7348SJim Harris
264cf0e7348SJim HarrisWhen the compressed volume is reloaded, the free chunk map list and free backing IO unit list
265cf0e7348SJim Harrisare reconstructed by walking the logical map.  The logical map will only point to valid
266cf0e7348SJim Harrischunk maps, and the valid chunk maps will only point to valid backing IO units.  Any chunk maps
267cf0e7348SJim Harrisand backing IO units not referenced go into their respective free lists.
268cf0e7348SJim Harris
269cf0e7348SJim HarrisThis ensures that if a system crashes in the middle of a write operation - i.e. during or
270cf0e7348SJim Harrisafter a chunk map is updated, but before it is written to the logical map - that everything
271cf0e7348SJim Harrisrelated to that in-progress write will be ignored after the compressed volume is restarted.
272cf0e7348SJim Harris
273cf0e7348SJim Harris### Overlapping operations on same chunk
274cf0e7348SJim Harris
275cf0e7348SJim HarrisImplementations must take care to handle overlapping operations on the same chunk.  For example,
276cf0e7348SJim Harrisoperation 1 writes some data to chunk A, and while this is in progress, operation 2 also writes
277cf0e7348SJim Harrissome data to chunk A.  In this case, operation 2 should not start until operation 1 has
278cf0e7348SJim Harriscompleted.  Further optimizations are outside the scope of this document.
279cf0e7348SJim Harris
280cf0e7348SJim Harris### Thin provisioned backing storage
281cf0e7348SJim Harris
282cf0e7348SJim HarrisBacking storage must be thin provisioned to realize any savings from compression.  This algorithm
283cf0e7348SJim Harriswill always use (and reuse) backing IO units available closest to offset 0 on the backing device.
284cf0e7348SJim HarrisThis ensures that even though backing storage device may have been sized similarly to the size of
285cf0e7348SJim Harristhe compressed volume, storage for the backing storage device will not actually be allocated
286cf0e7348SJim Harrisuntil the backing IO units are actually needed.
287