xref: /netbsd-src/external/bsd/zstd/dist/doc/zstd_compression_format.md (revision 3117ece4fc4a4ca4489ba793710b60b0d26bab6c)
1*3117ece4SchristosZstandard Compression Format
2*3117ece4Schristos============================
3*3117ece4Schristos
4*3117ece4Schristos### Notices
5*3117ece4Schristos
6*3117ece4SchristosCopyright (c) Meta Platforms, Inc. and affiliates.
7*3117ece4Schristos
8*3117ece4SchristosPermission is granted to copy and distribute this document
9*3117ece4Schristosfor any purpose and without charge,
10*3117ece4Schristosincluding translations into other languages
11*3117ece4Schristosand incorporation into compilations,
12*3117ece4Schristosprovided that the copyright notice and this notice are preserved,
13*3117ece4Schristosand that any substantive changes or deletions from the original
14*3117ece4Schristosare clearly marked.
15*3117ece4SchristosDistribution of this document is unlimited.
16*3117ece4Schristos
17*3117ece4Schristos### Version
18*3117ece4Schristos
19*3117ece4Schristos0.4.0 (2023-06-05)
20*3117ece4Schristos
21*3117ece4Schristos
22*3117ece4SchristosIntroduction
23*3117ece4Schristos------------
24*3117ece4Schristos
25*3117ece4SchristosThe purpose of this document is to define a lossless compressed data format,
26*3117ece4Schristosthat is independent of CPU type, operating system,
27*3117ece4Schristosfile system and character set, suitable for
28*3117ece4Schristosfile compression, pipe and streaming compression,
29*3117ece4Schristosusing the [Zstandard algorithm](https://facebook.github.io/zstd/).
30*3117ece4SchristosThe text of the specification assumes a basic background in programming
31*3117ece4Schristosat the level of bits and other primitive data representations.
32*3117ece4Schristos
33*3117ece4SchristosThe data can be produced or consumed,
34*3117ece4Schristoseven for an arbitrarily long sequentially presented input data stream,
35*3117ece4Schristosusing only an a priori bounded amount of intermediate storage,
36*3117ece4Schristosand hence can be used in data communications.
37*3117ece4SchristosThe format uses the Zstandard compression method,
38*3117ece4Schristosand optional [xxHash-64 checksum method](https://cyan4973.github.io/xxHash/),
39*3117ece4Schristosfor detection of data corruption.
40*3117ece4Schristos
41*3117ece4SchristosThe data format defined by this specification
42*3117ece4Schristosdoes not attempt to allow random access to compressed data.
43*3117ece4Schristos
44*3117ece4SchristosUnless otherwise indicated below,
45*3117ece4Schristosa compliant compressor must produce data sets
46*3117ece4Schristosthat conform to the specifications presented here.
47*3117ece4SchristosIt doesn’t need to support all options though.
48*3117ece4Schristos
49*3117ece4SchristosA compliant decompressor must be able to decompress
50*3117ece4Schristosat least one working set of parameters
51*3117ece4Schristosthat conforms to the specifications presented here.
52*3117ece4SchristosIt may also ignore informative fields, such as checksum.
53*3117ece4SchristosWhenever it does not support a parameter defined in the compressed stream,
54*3117ece4Schristosit must produce a non-ambiguous error code and associated error message
55*3117ece4Schristosexplaining which parameter is unsupported.
56*3117ece4Schristos
57*3117ece4SchristosThis specification is intended for use by implementers of software
58*3117ece4Schristosto compress data into Zstandard format and/or decompress data from Zstandard format.
59*3117ece4SchristosThe Zstandard format is supported by an open source reference implementation,
60*3117ece4Schristoswritten in portable C, and available at : https://github.com/facebook/zstd .
61*3117ece4Schristos
62*3117ece4Schristos
63*3117ece4Schristos### Overall conventions
64*3117ece4SchristosIn this document:
65*3117ece4Schristos- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
66*3117ece4Schristos- the naming convention for identifiers is `Mixed_Case_With_Underscores`
67*3117ece4Schristos
68*3117ece4Schristos### Definitions
69*3117ece4SchristosContent compressed by Zstandard is transformed into a Zstandard __frame__.
70*3117ece4SchristosMultiple frames can be appended into a single file or stream.
71*3117ece4SchristosA frame is completely independent, has a defined beginning and end,
72*3117ece4Schristosand a set of parameters which tells the decoder how to decompress it.
73*3117ece4Schristos
74*3117ece4SchristosA frame encapsulates one or multiple __blocks__.
75*3117ece4SchristosEach block contains arbitrary content, which is described by its header,
76*3117ece4Schristosand has a guaranteed maximum content size, which depends on frame parameters.
77*3117ece4SchristosUnlike frames, each block depends on previous blocks for proper decoding.
78*3117ece4SchristosHowever, each block can be decompressed without waiting for its successor,
79*3117ece4Schristosallowing streaming operations.
80*3117ece4Schristos
81*3117ece4SchristosOverview
82*3117ece4Schristos---------
83*3117ece4Schristos- [Frames](#frames)
84*3117ece4Schristos  - [Zstandard frames](#zstandard-frames)
85*3117ece4Schristos    - [Blocks](#blocks)
86*3117ece4Schristos      - [Literals Section](#literals-section)
87*3117ece4Schristos      - [Sequences Section](#sequences-section)
88*3117ece4Schristos      - [Sequence Execution](#sequence-execution)
89*3117ece4Schristos  - [Skippable frames](#skippable-frames)
90*3117ece4Schristos- [Entropy Encoding](#entropy-encoding)
91*3117ece4Schristos  - [FSE](#fse)
92*3117ece4Schristos  - [Huffman Coding](#huffman-coding)
93*3117ece4Schristos- [Dictionary Format](#dictionary-format)
94*3117ece4Schristos
95*3117ece4SchristosFrames
96*3117ece4Schristos------
97*3117ece4SchristosZstandard compressed data is made of one or more __frames__.
98*3117ece4SchristosEach frame is independent and can be decompressed independently of other frames.
99*3117ece4SchristosThe decompressed content of multiple concatenated frames is the concatenation of
100*3117ece4Schristoseach frame decompressed content.
101*3117ece4Schristos
102*3117ece4SchristosThere are two frame formats defined by Zstandard:
103*3117ece4Schristos  Zstandard frames and Skippable frames.
104*3117ece4SchristosZstandard frames contain compressed data, while
105*3117ece4Schristosskippable frames contain custom user metadata.
106*3117ece4Schristos
107*3117ece4Schristos## Zstandard frames
108*3117ece4SchristosThe structure of a single Zstandard frame is following:
109*3117ece4Schristos
110*3117ece4Schristos| `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] |
111*3117ece4Schristos|:--------------:|:--------------:|:----------:| ------------------ |:--------------------:|
112*3117ece4Schristos|  4 bytes       |  2-14 bytes    |  n bytes   |                    |     0-4 bytes        |
113*3117ece4Schristos
114*3117ece4Schristos__`Magic_Number`__
115*3117ece4Schristos
116*3117ece4Schristos4 Bytes, __little-endian__ format.
117*3117ece4SchristosValue : 0xFD2FB528
118*3117ece4SchristosNote: This value was selected to be less probable to find at the beginning of some random file.
119*3117ece4SchristosIt avoids trivial patterns (0x00, 0xFF, repeated bytes, increasing bytes, etc.),
120*3117ece4Schristoscontains byte values outside of ASCII range,
121*3117ece4Schristosand doesn't map into UTF8 space.
122*3117ece4SchristosIt reduces the chances that a text file represent this value by accident.
123*3117ece4Schristos
124*3117ece4Schristos__`Frame_Header`__
125*3117ece4Schristos
126*3117ece4Schristos2 to 14 Bytes, detailed in [`Frame_Header`](#frame_header).
127*3117ece4Schristos
128*3117ece4Schristos__`Data_Block`__
129*3117ece4Schristos
130*3117ece4SchristosDetailed in [`Blocks`](#blocks).
131*3117ece4SchristosThat’s where compressed data is stored.
132*3117ece4Schristos
133*3117ece4Schristos__`Content_Checksum`__
134*3117ece4Schristos
135*3117ece4SchristosAn optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
136*3117ece4SchristosThe content checksum is the result
137*3117ece4Schristosof [xxh64() hash function](https://cyan4973.github.io/xxHash/)
138*3117ece4Schristosdigesting the original (decoded) data as input, and a seed of zero.
139*3117ece4SchristosThe low 4 bytes of the checksum are stored in __little-endian__ format.
140*3117ece4Schristos
141*3117ece4Schristos### `Frame_Header`
142*3117ece4Schristos
143*3117ece4SchristosThe `Frame_Header` has a variable size, with a minimum of 2 bytes,
144*3117ece4Schristosand up to 14 bytes depending on optional parameters.
145*3117ece4SchristosThe structure of `Frame_Header` is following:
146*3117ece4Schristos
147*3117ece4Schristos| `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] |
148*3117ece4Schristos| ------------------------- | --------------------- | ----------------- | ---------------------- |
149*3117ece4Schristos| 1 byte                    | 0-1 byte              | 0-4 bytes         | 0-8 bytes              |
150*3117ece4Schristos
151*3117ece4Schristos#### `Frame_Header_Descriptor`
152*3117ece4Schristos
153*3117ece4SchristosThe first header's byte is called the `Frame_Header_Descriptor`.
154*3117ece4SchristosIt describes which other fields are present.
155*3117ece4SchristosDecoding this byte is enough to tell the size of `Frame_Header`.
156*3117ece4Schristos
157*3117ece4Schristos| Bit number | Field name                |
158*3117ece4Schristos| ---------- | ----------                |
159*3117ece4Schristos| 7-6        | `Frame_Content_Size_flag` |
160*3117ece4Schristos| 5          | `Single_Segment_flag`     |
161*3117ece4Schristos| 4          | `Unused_bit`              |
162*3117ece4Schristos| 3          | `Reserved_bit`            |
163*3117ece4Schristos| 2          | `Content_Checksum_flag`   |
164*3117ece4Schristos| 1-0        | `Dictionary_ID_flag`      |
165*3117ece4Schristos
166*3117ece4SchristosIn this table, bit 7 is the highest bit, while bit 0 is the lowest one.
167*3117ece4Schristos
168*3117ece4Schristos__`Frame_Content_Size_flag`__
169*3117ece4Schristos
170*3117ece4SchristosThis is a 2-bits flag (`= Frame_Header_Descriptor >> 6`),
171*3117ece4Schristosspecifying if `Frame_Content_Size` (the decompressed data size)
172*3117ece4Schristosis provided within the header.
173*3117ece4Schristos`Flag_Value` provides `FCS_Field_Size`,
174*3117ece4Schristoswhich is the number of bytes used by `Frame_Content_Size`
175*3117ece4Schristosaccording to the following table:
176*3117ece4Schristos
177*3117ece4Schristos|  `Flag_Value`  |    0   |  1  |  2  |  3  |
178*3117ece4Schristos| -------------- | ------ | --- | --- | --- |
179*3117ece4Schristos|`FCS_Field_Size`| 0 or 1 |  2  |  4  |  8  |
180*3117ece4Schristos
181*3117ece4SchristosWhen `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag` :
182*3117ece4Schristosif `Single_Segment_flag` is set, `FCS_Field_Size` is 1.
183*3117ece4SchristosOtherwise, `FCS_Field_Size` is 0 : `Frame_Content_Size` is not provided.
184*3117ece4Schristos
185*3117ece4Schristos__`Single_Segment_flag`__
186*3117ece4Schristos
187*3117ece4SchristosIf this flag is set,
188*3117ece4Schristosdata must be regenerated within a single continuous memory segment.
189*3117ece4Schristos
190*3117ece4SchristosIn this case, `Window_Descriptor` byte is skipped,
191*3117ece4Schristosbut `Frame_Content_Size` is necessarily present.
192*3117ece4SchristosAs a consequence, the decoder must allocate a memory segment
193*3117ece4Schristosof size equal or larger than `Frame_Content_Size`.
194*3117ece4Schristos
195*3117ece4SchristosIn order to preserve the decoder from unreasonable memory requirements,
196*3117ece4Schristosa decoder is allowed to reject a compressed frame
197*3117ece4Schristoswhich requests a memory size beyond decoder's authorized range.
198*3117ece4Schristos
199*3117ece4SchristosFor broader compatibility, decoders are recommended to support
200*3117ece4Schristosmemory sizes of at least 8 MB.
201*3117ece4SchristosThis is only a recommendation,
202*3117ece4Schristoseach decoder is free to support higher or lower limits,
203*3117ece4Schristosdepending on local limitations.
204*3117ece4Schristos
205*3117ece4Schristos__`Unused_bit`__
206*3117ece4Schristos
207*3117ece4SchristosA decoder compliant with this specification version shall not interpret this bit.
208*3117ece4SchristosIt might be used in any future version,
209*3117ece4Schristosto signal a property which is transparent to properly decode the frame.
210*3117ece4SchristosAn encoder compliant with this specification version must set this bit to zero.
211*3117ece4Schristos
212*3117ece4Schristos__`Reserved_bit`__
213*3117ece4Schristos
214*3117ece4SchristosThis bit is reserved for some future feature.
215*3117ece4SchristosIts value _must be zero_.
216*3117ece4SchristosA decoder compliant with this specification version must ensure it is not set.
217*3117ece4SchristosThis bit may be used in a future revision,
218*3117ece4Schristosto signal a feature that must be interpreted to decode the frame correctly.
219*3117ece4Schristos
220*3117ece4Schristos__`Content_Checksum_flag`__
221*3117ece4Schristos
222*3117ece4SchristosIf this flag is set, a 32-bits `Content_Checksum` will be present at frame's end.
223*3117ece4SchristosSee `Content_Checksum` paragraph.
224*3117ece4Schristos
225*3117ece4Schristos__`Dictionary_ID_flag`__
226*3117ece4Schristos
227*3117ece4SchristosThis is a 2-bits flag (`= FHD & 3`),
228*3117ece4Schristostelling if a dictionary ID is provided within the header.
229*3117ece4SchristosIt also specifies the size of this field as `DID_Field_Size`.
230*3117ece4Schristos
231*3117ece4Schristos|`Flag_Value`    |  0  |  1  |  2  |  3  |
232*3117ece4Schristos| -------------- | --- | --- | --- | --- |
233*3117ece4Schristos|`DID_Field_Size`|  0  |  1  |  2  |  4  |
234*3117ece4Schristos
235*3117ece4Schristos#### `Window_Descriptor`
236*3117ece4Schristos
237*3117ece4SchristosProvides guarantees on minimum memory buffer required to decompress a frame.
238*3117ece4SchristosThis information is important for decoders to allocate enough memory.
239*3117ece4Schristos
240*3117ece4SchristosThe `Window_Descriptor` byte is optional.
241*3117ece4SchristosWhen `Single_Segment_flag` is set, `Window_Descriptor` is not present.
242*3117ece4SchristosIn this case, `Window_Size` is `Frame_Content_Size`,
243*3117ece4Schristoswhich can be any value from 0 to 2^64-1 bytes (16 ExaBytes).
244*3117ece4Schristos
245*3117ece4Schristos| Bit numbers |     7-3    |     2-0    |
246*3117ece4Schristos| ----------- | ---------- | ---------- |
247*3117ece4Schristos| Field name  | `Exponent` | `Mantissa` |
248*3117ece4Schristos
249*3117ece4SchristosThe minimum memory buffer size is called `Window_Size`.
250*3117ece4SchristosIt is described by the following formulas :
251*3117ece4Schristos```
252*3117ece4SchristoswindowLog = 10 + Exponent;
253*3117ece4SchristoswindowBase = 1 << windowLog;
254*3117ece4SchristoswindowAdd = (windowBase / 8) * Mantissa;
255*3117ece4SchristosWindow_Size = windowBase + windowAdd;
256*3117ece4Schristos```
257*3117ece4SchristosThe minimum `Window_Size` is 1 KB.
258*3117ece4SchristosThe maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.
259*3117ece4Schristos
260*3117ece4SchristosIn general, larger `Window_Size` tend to improve compression ratio,
261*3117ece4Schristosbut at the cost of memory usage.
262*3117ece4Schristos
263*3117ece4SchristosTo properly decode compressed data,
264*3117ece4Schristosa decoder will need to allocate a buffer of at least `Window_Size` bytes.
265*3117ece4Schristos
266*3117ece4SchristosIn order to preserve decoder from unreasonable memory requirements,
267*3117ece4Schristosa decoder is allowed to reject a compressed frame
268*3117ece4Schristoswhich requests a memory size beyond decoder's authorized range.
269*3117ece4Schristos
270*3117ece4SchristosFor improved interoperability,
271*3117ece4Schristosit's recommended for decoders to support `Window_Size` of up to 8 MB,
272*3117ece4Schristosand it's recommended for encoders to not generate frame requiring `Window_Size` larger than 8 MB.
273*3117ece4SchristosIt's merely a recommendation though,
274*3117ece4Schristosdecoders are free to support larger or lower limits,
275*3117ece4Schristosdepending on local limitations.
276*3117ece4Schristos
277*3117ece4Schristos#### `Dictionary_ID`
278*3117ece4Schristos
279*3117ece4SchristosThis is a variable size field, which contains
280*3117ece4Schristosthe ID of the dictionary required to properly decode the frame.
281*3117ece4Schristos`Dictionary_ID` field is optional. When it's not present,
282*3117ece4Schristosit's up to the decoder to know which dictionary to use.
283*3117ece4Schristos
284*3117ece4Schristos`Dictionary_ID` field size is provided by `DID_Field_Size`.
285*3117ece4Schristos`DID_Field_Size` is directly derived from value of `Dictionary_ID_flag`.
286*3117ece4Schristos1 byte can represent an ID 0-255.
287*3117ece4Schristos2 bytes can represent an ID 0-65535.
288*3117ece4Schristos4 bytes can represent an ID 0-4294967295.
289*3117ece4SchristosFormat is __little-endian__.
290*3117ece4Schristos
291*3117ece4SchristosIt's allowed to represent a small ID (for example `13`)
292*3117ece4Schristoswith a large 4-bytes dictionary ID, even if it is less efficient.
293*3117ece4Schristos
294*3117ece4SchristosA value of `0` has same meaning as no `Dictionary_ID`,
295*3117ece4Schristosin which case the frame may or may not need a dictionary to be decoded,
296*3117ece4Schristosand the ID of such a dictionary is not specified.
297*3117ece4SchristosThe decoder must know this information by other means.
298*3117ece4Schristos
299*3117ece4Schristos#### `Frame_Content_Size`
300*3117ece4Schristos
301*3117ece4SchristosThis is the original (uncompressed) size. This information is optional.
302*3117ece4Schristos`Frame_Content_Size` uses a variable number of bytes, provided by `FCS_Field_Size`.
303*3117ece4Schristos`FCS_Field_Size` is provided by the value of `Frame_Content_Size_flag`.
304*3117ece4Schristos`FCS_Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
305*3117ece4Schristos
306*3117ece4Schristos| `FCS_Field_Size` |    Range   |
307*3117ece4Schristos| ---------------- | ---------- |
308*3117ece4Schristos|        0         |   unknown  |
309*3117ece4Schristos|        1         |   0 - 255  |
310*3117ece4Schristos|        2         | 256 - 65791|
311*3117ece4Schristos|        4         | 0 - 2^32-1 |
312*3117ece4Schristos|        8         | 0 - 2^64-1 |
313*3117ece4Schristos
314*3117ece4Schristos`Frame_Content_Size` format is __little-endian__.
315*3117ece4SchristosWhen `FCS_Field_Size` is 1, 4 or 8 bytes, the value is read directly.
316*3117ece4SchristosWhen `FCS_Field_Size` is 2, _the offset of 256 is added_.
317*3117ece4SchristosIt's allowed to represent a small size (for example `18`) using any compatible variant.
318*3117ece4Schristos
319*3117ece4Schristos
320*3117ece4SchristosBlocks
321*3117ece4Schristos-------
322*3117ece4Schristos
323*3117ece4SchristosAfter `Magic_Number` and `Frame_Header`, there are some number of blocks.
324*3117ece4SchristosEach frame must have at least one block,
325*3117ece4Schristosbut there is no upper limit on the number of blocks per frame.
326*3117ece4Schristos
327*3117ece4SchristosThe structure of a block is as follows:
328*3117ece4Schristos
329*3117ece4Schristos| `Block_Header` | `Block_Content` |
330*3117ece4Schristos|:--------------:|:---------------:|
331*3117ece4Schristos|    3 bytes     |     n bytes     |
332*3117ece4Schristos
333*3117ece4Schristos__`Block_Header`__
334*3117ece4Schristos
335*3117ece4Schristos`Block_Header` uses 3 bytes, written using __little-endian__ convention.
336*3117ece4SchristosIt contains 3 fields :
337*3117ece4Schristos
338*3117ece4Schristos| `Last_Block` | `Block_Type` | `Block_Size` |
339*3117ece4Schristos|:------------:|:------------:|:------------:|
340*3117ece4Schristos|    bit 0     |  bits 1-2    |  bits 3-23   |
341*3117ece4Schristos
342*3117ece4Schristos__`Last_Block`__
343*3117ece4Schristos
344*3117ece4SchristosThe lowest bit signals if this block is the last one.
345*3117ece4SchristosThe frame will end after this last block.
346*3117ece4SchristosIt may be followed by an optional `Content_Checksum`
347*3117ece4Schristos(see [Zstandard Frames](#zstandard-frames)).
348*3117ece4Schristos
349*3117ece4Schristos__`Block_Type`__
350*3117ece4Schristos
351*3117ece4SchristosThe next 2 bits represent the `Block_Type`.
352*3117ece4Schristos`Block_Type` influences the meaning of `Block_Size`.
353*3117ece4SchristosThere are 4 block types :
354*3117ece4Schristos
355*3117ece4Schristos|    Value     |      0      |      1      |         2          |     3     |
356*3117ece4Schristos| ------------ | ----------- | ----------- | ------------------ | --------- |
357*3117ece4Schristos| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`|
358*3117ece4Schristos
359*3117ece4Schristos- `Raw_Block` - this is an uncompressed block.
360*3117ece4Schristos  `Block_Content` contains `Block_Size` bytes.
361*3117ece4Schristos
362*3117ece4Schristos- `RLE_Block` - this is a single byte, repeated `Block_Size` times.
363*3117ece4Schristos  `Block_Content` consists of a single byte.
364*3117ece4Schristos  On the decompression side, this byte must be repeated `Block_Size` times.
365*3117ece4Schristos
366*3117ece4Schristos- `Compressed_Block` - this is a [Zstandard compressed block](#compressed-blocks),
367*3117ece4Schristos  explained later on.
368*3117ece4Schristos  `Block_Size` is the length of `Block_Content`, the compressed data.
369*3117ece4Schristos  The decompressed size is not known,
370*3117ece4Schristos  but its maximum possible value is guaranteed (see below)
371*3117ece4Schristos
372*3117ece4Schristos- `Reserved` - this is not a block.
373*3117ece4Schristos  This value cannot be used with current version of this specification.
374*3117ece4Schristos  If such a value is present, it is considered corrupted data.
375*3117ece4Schristos
376*3117ece4Schristos__`Block_Size`__
377*3117ece4Schristos
378*3117ece4SchristosThe upper 21 bits of `Block_Header` represent the `Block_Size`.
379*3117ece4Schristos
380*3117ece4SchristosWhen `Block_Type` is `Compressed_Block` or `Raw_Block`,
381*3117ece4Schristos`Block_Size` is the size of `Block_Content` (hence excluding `Block_Header`).
382*3117ece4Schristos
383*3117ece4SchristosWhen `Block_Type` is `RLE_Block`, since `Block_Content`’s size is always 1,
384*3117ece4Schristos`Block_Size` represents the number of times this byte must be repeated.
385*3117ece4Schristos
386*3117ece4Schristos`Block_Size` is limited by `Block_Maximum_Size` (see below).
387*3117ece4Schristos
388*3117ece4Schristos__`Block_Content`__ and __`Block_Maximum_Size`__
389*3117ece4Schristos
390*3117ece4SchristosThe size of `Block_Content` is limited by `Block_Maximum_Size`,
391*3117ece4Schristoswhich is the smallest of:
392*3117ece4Schristos-  `Window_Size`
393*3117ece4Schristos-  128 KB
394*3117ece4Schristos
395*3117ece4Schristos`Block_Maximum_Size` is constant for a given frame.
396*3117ece4SchristosThis maximum is applicable to both the decompressed size
397*3117ece4Schristosand the compressed size of any block in the frame.
398*3117ece4Schristos
399*3117ece4SchristosThe reasoning for this limit is that a decoder can read this information
400*3117ece4Schristosat the beginning of a frame and use it to allocate buffers.
401*3117ece4SchristosThe guarantees on the size of blocks ensure that
402*3117ece4Schristosthe buffers will be large enough for any following block of the valid frame.
403*3117ece4Schristos
404*3117ece4Schristos
405*3117ece4SchristosCompressed Blocks
406*3117ece4Schristos-----------------
407*3117ece4SchristosTo decompress a compressed block, the compressed size must be provided
408*3117ece4Schristosfrom `Block_Size` field within `Block_Header`.
409*3117ece4Schristos
410*3117ece4SchristosA compressed block consists of 2 sections :
411*3117ece4Schristos- [Literals Section](#literals-section)
412*3117ece4Schristos- [Sequences Section](#sequences-section)
413*3117ece4Schristos
414*3117ece4SchristosThe results of the two sections are then combined to produce the decompressed
415*3117ece4Schristosdata in [Sequence Execution](#sequence-execution)
416*3117ece4Schristos
417*3117ece4Schristos#### Prerequisites
418*3117ece4SchristosTo decode a compressed block, the following elements are necessary :
419*3117ece4Schristos- Previous decoded data, up to a distance of `Window_Size`,
420*3117ece4Schristos  or beginning of the Frame, whichever is smaller.
421*3117ece4Schristos- List of "recent offsets" from previous `Compressed_Block`.
422*3117ece4Schristos- The previous Huffman tree, required by `Treeless_Literals_Block` type
423*3117ece4Schristos- Previous FSE decoding tables, required by `Repeat_Mode`
424*3117ece4Schristos  for each symbol type (literals lengths, match lengths, offsets)
425*3117ece4Schristos
426*3117ece4SchristosNote that decoding tables aren't always from the previous `Compressed_Block`.
427*3117ece4Schristos
428*3117ece4Schristos- Every decoding table can come from a dictionary.
429*3117ece4Schristos- The Huffman tree comes from the previous `Compressed_Literals_Block`.
430*3117ece4Schristos
431*3117ece4SchristosLiterals Section
432*3117ece4Schristos----------------
433*3117ece4SchristosAll literals are regrouped in the first part of the block.
434*3117ece4SchristosThey can be decoded first, and then copied during [Sequence Execution],
435*3117ece4Schristosor they can be decoded on the flow during [Sequence Execution].
436*3117ece4Schristos
437*3117ece4SchristosLiterals can be stored uncompressed or compressed using Huffman prefix codes.
438*3117ece4SchristosWhen compressed, a tree description may optionally be present,
439*3117ece4Schristosfollowed by 1 or 4 streams.
440*3117ece4Schristos
441*3117ece4Schristos| `Literals_Section_Header` | [`Huffman_Tree_Description`] | [jumpTable] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
442*3117ece4Schristos| ------------------------- | ---------------------------- | ----------- | ------- | --------- | --------- | --------- |
443*3117ece4Schristos
444*3117ece4Schristos
445*3117ece4Schristos### `Literals_Section_Header`
446*3117ece4Schristos
447*3117ece4SchristosHeader is in charge of describing how literals are packed.
448*3117ece4SchristosIt's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
449*3117ece4Schristosusing __little-endian__ convention.
450*3117ece4Schristos
451*3117ece4Schristos| `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] |
452*3117ece4Schristos| --------------------- | ------------- | ------------------ | ------------------- |
453*3117ece4Schristos|       2 bits          |  1 - 2 bits   |    5 - 20 bits     |     0 - 18 bits     |
454*3117ece4Schristos
455*3117ece4SchristosIn this representation, bits on the left are the lowest bits.
456*3117ece4Schristos
457*3117ece4Schristos__`Literals_Block_Type`__
458*3117ece4Schristos
459*3117ece4SchristosThis field uses 2 lowest bits of first byte, describing 4 different block types :
460*3117ece4Schristos
461*3117ece4Schristos| `Literals_Block_Type`       | Value |
462*3117ece4Schristos| --------------------------- | ----- |
463*3117ece4Schristos| `Raw_Literals_Block`        |   0   |
464*3117ece4Schristos| `RLE_Literals_Block`        |   1   |
465*3117ece4Schristos| `Compressed_Literals_Block` |   2   |
466*3117ece4Schristos| `Treeless_Literals_Block`   |   3   |
467*3117ece4Schristos
468*3117ece4Schristos- `Raw_Literals_Block` - Literals are stored uncompressed.
469*3117ece4Schristos- `RLE_Literals_Block` - Literals consist of a single byte value
470*3117ece4Schristos        repeated `Regenerated_Size` times.
471*3117ece4Schristos- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
472*3117ece4Schristos        starting with a Huffman tree description.
473*3117ece4Schristos        In this mode, there are at least 2 different literals represented in the Huffman tree description.
474*3117ece4Schristos        See details below.
475*3117ece4Schristos- `Treeless_Literals_Block` - This is a Huffman-compressed block,
476*3117ece4Schristos        using Huffman tree _from previous Huffman-compressed literals block_.
477*3117ece4Schristos        `Huffman_Tree_Description` will be skipped.
478*3117ece4Schristos        Note: If this mode is triggered without any previous Huffman-table in the frame
479*3117ece4Schristos        (or [dictionary](#dictionary-format)), this should be treated as data corruption.
480*3117ece4Schristos
481*3117ece4Schristos__`Size_Format`__
482*3117ece4Schristos
483*3117ece4Schristos`Size_Format` is divided into 2 families :
484*3117ece4Schristos
485*3117ece4Schristos- For `Raw_Literals_Block` and `RLE_Literals_Block`,
486*3117ece4Schristos  it's only necessary to decode `Regenerated_Size`.
487*3117ece4Schristos  There is no `Compressed_Size` field.
488*3117ece4Schristos- For `Compressed_Block` and `Treeless_Literals_Block`,
489*3117ece4Schristos  it's required to decode both `Compressed_Size`
490*3117ece4Schristos  and `Regenerated_Size` (the decompressed size).
491*3117ece4Schristos  It's also necessary to decode the number of streams (1 or 4).
492*3117ece4Schristos
493*3117ece4SchristosFor values spanning several bytes, convention is __little-endian__.
494*3117ece4Schristos
495*3117ece4Schristos__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
496*3117ece4Schristos
497*3117ece4Schristos`Size_Format` uses 1 _or_ 2 bits.
498*3117ece4SchristosIts value is : `Size_Format = (Literals_Section_Header[0]>>2) & 3`
499*3117ece4Schristos
500*3117ece4Schristos- `Size_Format` == 00 or 10 : `Size_Format` uses 1 bit.
501*3117ece4Schristos               `Regenerated_Size` uses 5 bits (0-31).
502*3117ece4Schristos               `Literals_Section_Header` uses 1 byte.
503*3117ece4Schristos               `Regenerated_Size = Literals_Section_Header[0]>>3`
504*3117ece4Schristos- `Size_Format` == 01 : `Size_Format` uses 2 bits.
505*3117ece4Schristos               `Regenerated_Size` uses 12 bits (0-4095).
506*3117ece4Schristos               `Literals_Section_Header` uses 2 bytes.
507*3117ece4Schristos               `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4)`
508*3117ece4Schristos- `Size_Format` == 11 : `Size_Format` uses 2 bits.
509*3117ece4Schristos               `Regenerated_Size` uses 20 bits (0-1048575).
510*3117ece4Schristos               `Literals_Section_Header` uses 3 bytes.
511*3117ece4Schristos               `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)`
512*3117ece4Schristos
513*3117ece4SchristosOnly Stream1 is present for these cases.
514*3117ece4SchristosNote : it's allowed to represent a short value (for example `27`)
515*3117ece4Schristosusing a long format, even if it's less efficient.
516*3117ece4Schristos
517*3117ece4Schristos__`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ :
518*3117ece4Schristos
519*3117ece4Schristos`Size_Format` always uses 2 bits.
520*3117ece4Schristos
521*3117ece4Schristos- `Size_Format` == 00 : _A single stream_.
522*3117ece4Schristos               Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
523*3117ece4Schristos               `Literals_Section_Header` uses 3 bytes.
524*3117ece4Schristos- `Size_Format` == 01 : 4 streams.
525*3117ece4Schristos               Both `Regenerated_Size` and `Compressed_Size` use 10 bits (6-1023).
526*3117ece4Schristos               `Literals_Section_Header` uses 3 bytes.
527*3117ece4Schristos- `Size_Format` == 10 : 4 streams.
528*3117ece4Schristos               Both `Regenerated_Size` and `Compressed_Size` use 14 bits (6-16383).
529*3117ece4Schristos               `Literals_Section_Header` uses 4 bytes.
530*3117ece4Schristos- `Size_Format` == 11 : 4 streams.
531*3117ece4Schristos               Both `Regenerated_Size` and `Compressed_Size` use 18 bits (6-262143).
532*3117ece4Schristos               `Literals_Section_Header` uses 5 bytes.
533*3117ece4Schristos
534*3117ece4SchristosBoth `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention.
535*3117ece4SchristosNote: `Compressed_Size` __includes__ the size of the Huffman Tree description
536*3117ece4Schristos_when_ it is present.
537*3117ece4SchristosNote 2: `Compressed_Size` can never be `==0`.
538*3117ece4SchristosEven in single-stream scenario, assuming an empty content, it must be `>=1`,
539*3117ece4Schristossince it contains at least the final end bit flag.
540*3117ece4SchristosIn 4-streams scenario, a valid `Compressed_Size` is necessarily `>= 10`
541*3117ece4Schristos(6 bytes for the jump table, + 4x1 bytes for the 4 streams).
542*3117ece4Schristos
543*3117ece4Schristos4 streams is faster than 1 stream in decompression speed,
544*3117ece4Schristosby exploiting instruction level parallelism.
545*3117ece4SchristosBut it's also more expensive,
546*3117ece4Schristoscosting on average ~7.3 bytes more than the 1 stream mode, mostly from the jump table.
547*3117ece4Schristos
548*3117ece4SchristosIn general, use the 4 streams mode when there are more literals to decode,
549*3117ece4Schristosto favor higher decompression speeds.
550*3117ece4SchristosNote that beyond >1KB of literals, the 4 streams mode is compulsory.
551*3117ece4Schristos
552*3117ece4SchristosNote that a minimum of 6 bytes is required for the 4 streams mode.
553*3117ece4SchristosThat's a technical minimum, but it's not recommended to employ the 4 streams mode
554*3117ece4Schristosfor such a small quantity, that would be wasteful.
555*3117ece4SchristosA more practical lower bound would be around ~256 bytes.
556*3117ece4Schristos
557*3117ece4Schristos#### Raw Literals Block
558*3117ece4SchristosThe data in Stream1 is `Regenerated_Size` bytes long,
559*3117ece4Schristosit contains the raw literals data to be used during [Sequence Execution].
560*3117ece4Schristos
561*3117ece4Schristos#### RLE Literals Block
562*3117ece4SchristosStream1 consists of a single byte which should be repeated `Regenerated_Size` times
563*3117ece4Schristosto generate the decoded literals.
564*3117ece4Schristos
565*3117ece4Schristos#### Compressed Literals Block and Treeless Literals Block
566*3117ece4SchristosBoth of these modes contain Huffman encoded data.
567*3117ece4Schristos
568*3117ece4SchristosFor `Treeless_Literals_Block`,
569*3117ece4Schristosthe Huffman table comes from previously compressed literals block,
570*3117ece4Schristosor from a dictionary.
571*3117ece4Schristos
572*3117ece4Schristos
573*3117ece4Schristos### `Huffman_Tree_Description`
574*3117ece4SchristosThis section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
575*3117ece4SchristosThe tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256.
576*3117ece4SchristosThe format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
577*3117ece4SchristosThe size of `Huffman_Tree_Description` is determined during decoding process,
578*3117ece4Schristosit must be used to determine where streams begin.
579*3117ece4Schristos`Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`.
580*3117ece4Schristos
581*3117ece4Schristos
582*3117ece4Schristos### Jump Table
583*3117ece4SchristosThe Jump Table is only present when there are 4 Huffman-coded streams.
584*3117ece4Schristos
585*3117ece4SchristosReminder : Huffman compressed data consists of either 1 or 4 streams.
586*3117ece4Schristos
587*3117ece4SchristosIf only one stream is present, it is a single bitstream occupying the entire
588*3117ece4Schristosremaining portion of the literals block, encoded as described in
589*3117ece4Schristos[Huffman-Coded Streams](#huffman-coded-streams).
590*3117ece4Schristos
591*3117ece4SchristosIf there are four streams, `Literals_Section_Header` only provided
592*3117ece4Schristosenough information to know the decompressed and compressed sizes
593*3117ece4Schristosof all four streams _combined_.
594*3117ece4SchristosThe decompressed size of _each_ stream is equal to `(Regenerated_Size+3)/4`,
595*3117ece4Schristosexcept for the last stream which may be up to 3 bytes smaller,
596*3117ece4Schristosto reach a total decompressed size as specified in `Regenerated_Size`.
597*3117ece4Schristos
598*3117ece4SchristosThe compressed size of each stream is provided explicitly in the Jump Table.
599*3117ece4SchristosJump Table is 6 bytes long, and consists of three 2-byte __little-endian__ fields,
600*3117ece4Schristosdescribing the compressed sizes of the first three streams.
601*3117ece4Schristos`Stream4_Size` is computed from `Total_Streams_Size` minus sizes of other streams:
602*3117ece4Schristos
603*3117ece4Schristos`Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size`.
604*3117ece4Schristos
605*3117ece4Schristos`Stream4_Size` is necessarily `>= 1`. Therefore,
606*3117ece4Schristosif `Total_Streams_Size < Stream1_Size + Stream2_Size + Stream3_Size + 6 + 1`,
607*3117ece4Schristosdata is considered corrupted.
608*3117ece4Schristos
609*3117ece4SchristosEach of these 4 bitstreams is then decoded independently as a Huffman-Coded stream,
610*3117ece4Schristosas described in [Huffman-Coded Streams](#huffman-coded-streams)
611*3117ece4Schristos
612*3117ece4Schristos
613*3117ece4SchristosSequences Section
614*3117ece4Schristos-----------------
615*3117ece4SchristosA compressed block is a succession of _sequences_ .
616*3117ece4SchristosA sequence is a literal copy command, followed by a match copy command.
617*3117ece4SchristosA literal copy command specifies a length.
618*3117ece4SchristosIt is the number of bytes to be copied (or extracted) from the Literals Section.
619*3117ece4SchristosA match copy command specifies an offset and a length.
620*3117ece4Schristos
621*3117ece4SchristosWhen all _sequences_ are decoded,
622*3117ece4Schristosif there are literals left in the _literals section_,
623*3117ece4Schristosthese bytes are added at the end of the block.
624*3117ece4Schristos
625*3117ece4SchristosThis is described in more detail in [Sequence Execution](#sequence-execution).
626*3117ece4Schristos
627*3117ece4SchristosThe `Sequences_Section` regroup all symbols required to decode commands.
628*3117ece4SchristosThere are 3 symbol types : literals lengths, offsets and match lengths.
629*3117ece4SchristosThey are encoded together, interleaved, in a single _bitstream_.
630*3117ece4Schristos
631*3117ece4SchristosThe `Sequences_Section` starts by a header,
632*3117ece4Schristosfollowed by optional probability tables for each symbol type,
633*3117ece4Schristosfollowed by the bitstream.
634*3117ece4Schristos
635*3117ece4Schristos| `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream |
636*3117ece4Schristos| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |
637*3117ece4Schristos
638*3117ece4SchristosTo decode the `Sequences_Section`, it's required to know its size.
639*3117ece4SchristosIts size is deduced from the size of `Literals_Section`:
640*3117ece4Schristos`Sequences_Section_Size = Block_Size - Literals_Section_Size`.
641*3117ece4Schristos
642*3117ece4Schristos
643*3117ece4Schristos#### `Sequences_Section_Header`
644*3117ece4Schristos
645*3117ece4SchristosConsists of 2 items:
646*3117ece4Schristos- `Number_of_Sequences`
647*3117ece4Schristos- Symbol compression modes
648*3117ece4Schristos
649*3117ece4Schristos__`Number_of_Sequences`__
650*3117ece4Schristos
651*3117ece4SchristosThis is a variable size field using between 1 and 3 bytes.
652*3117ece4SchristosLet's call its first byte `byte0`.
653*3117ece4Schristos- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
654*3117ece4Schristos- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0 - 0x80) << 8) + byte1`. Uses 2 bytes.
655*3117ece4Schristos            Note that the 2 bytes format fully overlaps the 1 byte format.
656*3117ece4Schristos- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00`. Uses 3 bytes.
657*3117ece4Schristos
658*3117ece4Schristos`if (Number_of_Sequences == 0)` : there are no sequences.
659*3117ece4Schristos            The sequence section stops immediately,
660*3117ece4Schristos            FSE tables used in `Repeat_Mode` aren't updated.
661*3117ece4Schristos            Block's decompressed content is defined solely by the Literals Section content.
662*3117ece4Schristos
663*3117ece4Schristos__Symbol compression modes__
664*3117ece4Schristos
665*3117ece4SchristosThis is a single byte, defining the compression mode of each symbol type.
666*3117ece4Schristos
667*3117ece4Schristos|Bit number|          7-6            |      5-4       |        3-2           |     1-0    |
668*3117ece4Schristos| -------- | ----------------------- | -------------- | -------------------- | ---------- |
669*3117ece4Schristos|Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` |
670*3117ece4Schristos
671*3117ece4SchristosThe last field, `Reserved`, must be all-zeroes.
672*3117ece4Schristos
673*3117ece4Schristos`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
674*3117ece4Schristosliterals lengths, offsets, and match lengths symbols respectively.
675*3117ece4Schristos
676*3117ece4SchristosThey follow the same enumeration :
677*3117ece4Schristos
678*3117ece4Schristos|        Value       |         0         |      1     |           2           |       3       |
679*3117ece4Schristos| ------------------ | ----------------- | ---------- | --------------------- | ------------- |
680*3117ece4Schristos| `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` |
681*3117ece4Schristos
682*3117ece4Schristos- `Predefined_Mode` : A predefined FSE distribution table is used, defined in
683*3117ece4Schristos          [default distributions](#default-distributions).
684*3117ece4Schristos          No distribution table will be present.
685*3117ece4Schristos- `RLE_Mode` : The table description consists of a single byte, which contains the symbol's value.
686*3117ece4Schristos          This symbol will be used for all sequences.
687*3117ece4Schristos- `FSE_Compressed_Mode` : standard FSE compression.
688*3117ece4Schristos          A distribution table will be present.
689*3117ece4Schristos          The format of this distribution table is described in [FSE Table Description](#fse-table-description).
690*3117ece4Schristos          Note that the maximum allowed accuracy log for literals length and match length tables is 9,
691*3117ece4Schristos          and the maximum accuracy log for the offsets table is 8.
692*3117ece4Schristos          `FSE_Compressed_Mode` must not be used when only one symbol is present,
693*3117ece4Schristos          `RLE_Mode` should be used instead (although any other mode will work).
694*3117ece4Schristos- `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again,
695*3117ece4Schristos          or if this is the first block, table in the dictionary will be used.
696*3117ece4Schristos          Note that this includes `RLE_mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
697*3117ece4Schristos          It also includes `Predefined_Mode`, in which case `Repeat_Mode` will have same outcome as `Predefined_Mode`.
698*3117ece4Schristos          No distribution table will be present.
699*3117ece4Schristos          If this mode is used without any previous sequence table in the frame
700*3117ece4Schristos          (nor [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
701*3117ece4Schristos
702*3117ece4Schristos#### The codes for literals lengths, match lengths, and offsets.
703*3117ece4Schristos
704*3117ece4SchristosEach symbol is a _code_ in its own context,
705*3117ece4Schristoswhich specifies `Baseline` and `Number_of_Bits` to add.
706*3117ece4Schristos_Codes_ are FSE compressed,
707*3117ece4Schristosand interleaved with raw additional bits in the same bitstream.
708*3117ece4Schristos
709*3117ece4Schristos##### Literals length codes
710*3117ece4Schristos
711*3117ece4SchristosLiterals length codes are values ranging from `0` to `35` included.
712*3117ece4SchristosThey define lengths from 0 to 131071 bytes.
713*3117ece4SchristosThe literals length is equal to the decoded `Baseline` plus
714*3117ece4Schristosthe result of reading `Number_of_Bits` bits from the bitstream,
715*3117ece4Schristosas a __little-endian__ value.
716*3117ece4Schristos
717*3117ece4Schristos| `Literals_Length_Code` |         0-15           |
718*3117ece4Schristos| ---------------------- | ---------------------- |
719*3117ece4Schristos| length                 | `Literals_Length_Code` |
720*3117ece4Schristos| `Number_of_Bits`       |          0             |
721*3117ece4Schristos
722*3117ece4Schristos| `Literals_Length_Code` |  16  |  17  |  18  |  19  |  20  |  21  |  22  |  23  |
723*3117ece4Schristos| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
724*3117ece4Schristos| `Baseline`             |  16  |  18  |  20  |  22  |  24  |  28  |  32  |  40  |
725*3117ece4Schristos| `Number_of_Bits`       |   1  |   1  |   1  |   1  |   2  |   2  |   3  |   3  |
726*3117ece4Schristos
727*3117ece4Schristos| `Literals_Length_Code` |  24  |  25  |  26  |  27  |  28  |  29  |  30  |  31  |
728*3117ece4Schristos| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
729*3117ece4Schristos| `Baseline`             |  48  |  64  |  128 |  256 |  512 | 1024 | 2048 | 4096 |
730*3117ece4Schristos| `Number_of_Bits`       |   4  |   6  |   7  |   8  |   9  |  10  |  11  |  12  |
731*3117ece4Schristos
732*3117ece4Schristos| `Literals_Length_Code` |  32  |  33  |  34  |  35  |
733*3117ece4Schristos| ---------------------- | ---- | ---- | ---- | ---- |
734*3117ece4Schristos| `Baseline`             | 8192 |16384 |32768 |65536 |
735*3117ece4Schristos| `Number_of_Bits`       |  13  |  14  |  15  |  16  |
736*3117ece4Schristos
737*3117ece4Schristos
738*3117ece4Schristos##### Match length codes
739*3117ece4Schristos
740*3117ece4SchristosMatch length codes are values ranging from `0` to `52` included.
741*3117ece4SchristosThey define lengths from 3 to 131074 bytes.
742*3117ece4SchristosThe match length is equal to the decoded `Baseline` plus
743*3117ece4Schristosthe result of reading `Number_of_Bits` bits from the bitstream,
744*3117ece4Schristosas a __little-endian__ value.
745*3117ece4Schristos
746*3117ece4Schristos| `Match_Length_Code` |         0-31            |
747*3117ece4Schristos| ------------------- | ----------------------- |
748*3117ece4Schristos| value               | `Match_Length_Code` + 3 |
749*3117ece4Schristos| `Number_of_Bits`    |          0              |
750*3117ece4Schristos
751*3117ece4Schristos| `Match_Length_Code` |  32  |  33  |  34  |  35  |  36  |  37  |  38  |  39  |
752*3117ece4Schristos| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
753*3117ece4Schristos| `Baseline`          |  35  |  37  |  39  |  41  |  43  |  47  |  51  |  59  |
754*3117ece4Schristos| `Number_of_Bits`    |   1  |   1  |   1  |   1  |   2  |   2  |   3  |   3  |
755*3117ece4Schristos
756*3117ece4Schristos| `Match_Length_Code` |  40  |  41  |  42  |  43  |  44  |  45  |  46  |  47  |
757*3117ece4Schristos| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
758*3117ece4Schristos| `Baseline`          |  67  |  83  |  99  |  131 |  259 |  515 | 1027 | 2051 |
759*3117ece4Schristos| `Number_of_Bits`    |   4  |   4  |   5  |   7  |   8  |   9  |  10  |  11  |
760*3117ece4Schristos
761*3117ece4Schristos| `Match_Length_Code` |  48  |  49  |  50  |  51  |  52  |
762*3117ece4Schristos| ------------------- | ---- | ---- | ---- | ---- | ---- |
763*3117ece4Schristos| `Baseline`          | 4099 | 8195 |16387 |32771 |65539 |
764*3117ece4Schristos| `Number_of_Bits`    |  12  |  13  |  14  |  15  |  16  |
765*3117ece4Schristos
766*3117ece4Schristos##### Offset codes
767*3117ece4Schristos
768*3117ece4SchristosOffset codes are values ranging from `0` to `N`.
769*3117ece4Schristos
770*3117ece4SchristosA decoder is free to limit its maximum `N` supported.
771*3117ece4SchristosRecommendation is to support at least up to `22`.
772*3117ece4SchristosFor information, at the time of this writing.
773*3117ece4Schristosthe reference decoder supports a maximum `N` value of `31`.
774*3117ece4Schristos
775*3117ece4SchristosAn offset code is also the number of additional bits to read in __little-endian__ fashion,
776*3117ece4Schristosand can be translated into an `Offset_Value` using the following formulas :
777*3117ece4Schristos
778*3117ece4Schristos```
779*3117ece4SchristosOffset_Value = (1 << offsetCode) + readNBits(offsetCode);
780*3117ece4Schristosif (Offset_Value > 3) offset = Offset_Value - 3;
781*3117ece4Schristos```
782*3117ece4SchristosIt means that maximum `Offset_Value` is `(2^(N+1))-1`
783*3117ece4Schristossupporting back-reference distances up to `(2^(N+1))-4`,
784*3117ece4Schristosbut is limited by [maximum back-reference distance](#window_descriptor).
785*3117ece4Schristos
786*3117ece4Schristos`Offset_Value` from 1 to 3 are special : they define "repeat codes".
787*3117ece4SchristosThis is described in more detail in [Repeat Offsets](#repeat-offsets).
788*3117ece4Schristos
789*3117ece4Schristos#### Decoding Sequences
790*3117ece4SchristosFSE bitstreams are read in reverse direction than written. In zstd,
791*3117ece4Schristosthe compressor writes bits forward into a block and the decompressor
792*3117ece4Schristosmust read the bitstream _backwards_.
793*3117ece4Schristos
794*3117ece4SchristosTo find the start of the bitstream it is therefore necessary to
795*3117ece4Schristosknow the offset of the last byte of the block which can be found
796*3117ece4Schristosby counting `Block_Size` bytes after the block header.
797*3117ece4Schristos
798*3117ece4SchristosAfter writing the last bit containing information, the compressor
799*3117ece4Schristoswrites a single `1`-bit and then fills the byte with 0-7 `0` bits of
800*3117ece4Schristospadding. The last byte of the compressed bitstream cannot be `0` for
801*3117ece4Schristosthat reason.
802*3117ece4Schristos
803*3117ece4SchristosWhen decompressing, the last byte containing the padding is the first
804*3117ece4Schristosbyte to read. The decompressor needs to skip 0-7 initial `0`-bits and
805*3117ece4Schristosthe first `1`-bit it occurs. Afterwards, the useful part of the bitstream
806*3117ece4Schristosbegins.
807*3117ece4Schristos
808*3117ece4SchristosFSE decoding requires a 'state' to be carried from symbol to symbol.
809*3117ece4SchristosFor more explanation on FSE decoding, see the [FSE section](#fse).
810*3117ece4Schristos
811*3117ece4SchristosFor sequence decoding, a separate state keeps track of each
812*3117ece4Schristosliteral lengths, offsets, and match lengths symbols.
813*3117ece4SchristosSome FSE primitives are also used.
814*3117ece4SchristosFor more details on the operation of these primitives, see the [FSE section](#fse).
815*3117ece4Schristos
816*3117ece4Schristos##### Starting states
817*3117ece4SchristosThe bitstream starts with initial FSE state values,
818*3117ece4Schristoseach using the required number of bits in their respective _accuracy_,
819*3117ece4Schristosdecoded previously from their normalized distribution.
820*3117ece4Schristos
821*3117ece4SchristosIt starts by `Literals_Length_State`,
822*3117ece4Schristosfollowed by `Offset_State`,
823*3117ece4Schristosand finally `Match_Length_State`.
824*3117ece4Schristos
825*3117ece4SchristosReminder : always keep in mind that all values are read _backward_,
826*3117ece4Schristosso the 'start' of the bitstream is at the highest position in memory,
827*3117ece4Schristosimmediately before the last `1`-bit for padding.
828*3117ece4Schristos
829*3117ece4SchristosAfter decoding the starting states, a single sequence is decoded
830*3117ece4Schristos`Number_Of_Sequences` times.
831*3117ece4SchristosThese sequences are decoded in order from first to last.
832*3117ece4SchristosSince the compressor writes the bitstream in the forward direction,
833*3117ece4Schristosthis means the compressor must encode the sequences starting with the last
834*3117ece4Schristosone and ending with the first.
835*3117ece4Schristos
836*3117ece4Schristos##### Decoding a sequence
837*3117ece4SchristosFor each of the symbol types, the FSE state can be used to determine the appropriate code.
838*3117ece4SchristosThe code then defines the `Baseline` and `Number_of_Bits` to read for each type.
839*3117ece4SchristosSee the [description of the codes] for how to determine these values.
840*3117ece4Schristos
841*3117ece4Schristos[description of the codes]: #the-codes-for-literals-lengths-match-lengths-and-offsets
842*3117ece4Schristos
843*3117ece4SchristosDecoding starts by reading the `Number_of_Bits` required to decode `Offset`.
844*3117ece4SchristosIt then does the same for `Match_Length`, and then for `Literals_Length`.
845*3117ece4SchristosThis sequence is then used for [sequence execution](#sequence-execution).
846*3117ece4Schristos
847*3117ece4SchristosIf it is not the last sequence in the block,
848*3117ece4Schristosthe next operation is to update states.
849*3117ece4SchristosUsing the rules pre-calculated in the decoding tables,
850*3117ece4Schristos`Literals_Length_State` is updated,
851*3117ece4Schristosfollowed by `Match_Length_State`,
852*3117ece4Schristosand then `Offset_State`.
853*3117ece4SchristosSee the [FSE section](#fse) for details on how to update states from the bitstream.
854*3117ece4Schristos
855*3117ece4SchristosThis operation will be repeated `Number_of_Sequences` times.
856*3117ece4SchristosAt the end, the bitstream shall be entirely consumed,
857*3117ece4Schristosotherwise the bitstream is considered corrupted.
858*3117ece4Schristos
859*3117ece4Schristos#### Default Distributions
860*3117ece4SchristosIf `Predefined_Mode` is selected for a symbol type,
861*3117ece4Schristosits FSE decoding table is generated from a predefined distribution table defined here.
862*3117ece4SchristosFor details on how to convert this distribution into a decoding table, see the [FSE section].
863*3117ece4Schristos
864*3117ece4Schristos[FSE section]: #from-normalized-distribution-to-decoding-tables
865*3117ece4Schristos
866*3117ece4Schristos##### Literals Length
867*3117ece4SchristosThe decoding table uses an accuracy log of 6 bits (64 states).
868*3117ece4Schristos```
869*3117ece4Schristosshort literalsLength_defaultDistribution[36] =
870*3117ece4Schristos        { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
871*3117ece4Schristos          2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
872*3117ece4Schristos         -1,-1,-1,-1 };
873*3117ece4Schristos```
874*3117ece4Schristos
875*3117ece4Schristos##### Match Length
876*3117ece4SchristosThe decoding table uses an accuracy log of 6 bits (64 states).
877*3117ece4Schristos```
878*3117ece4Schristosshort matchLengths_defaultDistribution[53] =
879*3117ece4Schristos        { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
880*3117ece4Schristos          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
881*3117ece4Schristos          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
882*3117ece4Schristos         -1,-1,-1,-1,-1 };
883*3117ece4Schristos```
884*3117ece4Schristos
885*3117ece4Schristos##### Offset Codes
886*3117ece4SchristosThe decoding table uses an accuracy log of 5 bits (32 states),
887*3117ece4Schristosand supports a maximum `N` value of 28, allowing offset values up to 536,870,908 .
888*3117ece4Schristos
889*3117ece4SchristosIf any sequence in the compressed block requires a larger offset than this,
890*3117ece4Schristosit's not possible to use the default distribution to represent it.
891*3117ece4Schristos```
892*3117ece4Schristosshort offsetCodes_defaultDistribution[29] =
893*3117ece4Schristos        { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
894*3117ece4Schristos          1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
895*3117ece4Schristos```
896*3117ece4Schristos
897*3117ece4Schristos
898*3117ece4SchristosSequence Execution
899*3117ece4Schristos------------------
900*3117ece4SchristosOnce literals and sequences have been decoded,
901*3117ece4Schristosthey are combined to produce the decoded content of a block.
902*3117ece4Schristos
903*3117ece4SchristosEach sequence consists of a tuple of (`literals_length`, `offset_value`, `match_length`),
904*3117ece4Schristosdecoded as described in the [Sequences Section](#sequences-section).
905*3117ece4SchristosTo execute a sequence, first copy `literals_length` bytes
906*3117ece4Schristosfrom the decoded literals to the output.
907*3117ece4Schristos
908*3117ece4SchristosThen `match_length` bytes are copied from previous decoded data.
909*3117ece4SchristosThe offset to copy from is determined by `offset_value`:
910*3117ece4Schristosif `offset_value > 3`, then the offset is `offset_value - 3`.
911*3117ece4SchristosIf `offset_value` is from 1-3, the offset is a special repeat offset value.
912*3117ece4SchristosSee the [repeat offset](#repeat-offsets) section for how the offset is determined
913*3117ece4Schristosin this case.
914*3117ece4Schristos
915*3117ece4SchristosThe offset is defined as from the current position, so an offset of 6
916*3117ece4Schristosand a match length of 3 means that 3 bytes should be copied from 6 bytes back.
917*3117ece4SchristosNote that all offsets leading to previously decoded data
918*3117ece4Schristosmust be smaller than `Window_Size` defined in `Frame_Header_Descriptor`.
919*3117ece4Schristos
920*3117ece4Schristos#### Repeat offsets
921*3117ece4SchristosAs seen in [Sequence Execution](#sequence-execution),
922*3117ece4Schristosthe first 3 values define a repeated offset and we will call them
923*3117ece4Schristos`Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`.
924*3117ece4SchristosThey are sorted in recency order, with `Repeated_Offset1` meaning "most recent one".
925*3117ece4Schristos
926*3117ece4SchristosIf `offset_value == 1`, then the offset used is `Repeated_Offset1`, etc.
927*3117ece4Schristos
928*3117ece4SchristosThere is an exception though, when current sequence's `literals_length = 0`.
929*3117ece4SchristosIn this case, repeated offsets are shifted by one,
930*3117ece4Schristosso an `offset_value` of 1 means `Repeated_Offset2`,
931*3117ece4Schristosan `offset_value` of 2 means `Repeated_Offset3`,
932*3117ece4Schristosand an `offset_value` of 3 means `Repeated_Offset1 - 1`.
933*3117ece4Schristos
934*3117ece4SchristosIn the final case, if `Repeated_Offset1 - 1` evaluates to 0, then the
935*3117ece4Schristosdata is considered corrupted.
936*3117ece4Schristos
937*3117ece4SchristosFor the first block, the starting offset history is populated with following values :
938*3117ece4Schristos`Repeated_Offset1`=1, `Repeated_Offset2`=4, `Repeated_Offset3`=8,
939*3117ece4Schristosunless a dictionary is used, in which case they come from the dictionary.
940*3117ece4Schristos
941*3117ece4SchristosThen each block gets its starting offset history from the ending values of the most recent `Compressed_Block`.
942*3117ece4SchristosNote that blocks which are not `Compressed_Block` are skipped, they do not contribute to offset history.
943*3117ece4Schristos
944*3117ece4Schristos[Offset Codes]: #offset-codes
945*3117ece4Schristos
946*3117ece4Schristos###### Offset updates rules
947*3117ece4Schristos
948*3117ece4SchristosDuring the execution of the sequences of a `Compressed_Block`, the
949*3117ece4Schristos`Repeated_Offsets`' values are kept up to date, so that they always represent
950*3117ece4Schristosthe three most-recently used offsets. In order to achieve that, they are
951*3117ece4Schristosupdated after executing each sequence in the following way:
952*3117ece4Schristos
953*3117ece4SchristosWhen the sequence's `offset_value` does not refer to one of the
954*3117ece4Schristos`Repeated_Offsets`--when it has value greater than 3, or when it has value 3
955*3117ece4Schristosand the sequence's `literals_length` is zero--the `Repeated_Offsets`' values
956*3117ece4Schristosare shifted back one, and `Repeated_Offset1` takes on the value of the
957*3117ece4Schristosjust-used offset.
958*3117ece4Schristos
959*3117ece4SchristosOtherwise, when the sequence's `offset_value` refers to one of the
960*3117ece4Schristos`Repeated_Offsets`--when it has value 1 or 2, or when it has value 3 and the
961*3117ece4Schristossequence's `literals_length` is non-zero--the `Repeated_Offsets` are re-ordered
962*3117ece4Schristosso that `Repeated_Offset1` takes on the value of the used Repeated_Offset, and
963*3117ece4Schristosthe existing values are pushed back from the first `Repeated_Offset` through to
964*3117ece4Schristosthe `Repeated_Offset` selected by the `offset_value`. This effectively performs
965*3117ece4Schristosa single-stepped wrapping rotation of the values of these offsets, so that
966*3117ece4Schristostheir order again reflects the recency of their use.
967*3117ece4Schristos
968*3117ece4SchristosThe following table shows the values of the `Repeated_Offsets` as a series of
969*3117ece4Schristossequences are applied to them:
970*3117ece4Schristos
971*3117ece4Schristos| `offset_value` | `literals_length` | `Repeated_Offset1` | `Repeated_Offset2` | `Repeated_Offset3` | Comment                 |
972*3117ece4Schristos|:--------------:|:-----------------:|:------------------:|:------------------:|:------------------:|:-----------------------:|
973*3117ece4Schristos|                |                   |                  1 |                  4 |                  8 | starting values         |
974*3117ece4Schristos|           1114 |                11 |               1111 |                  1 |                  4 | non-repeat              |
975*3117ece4Schristos|              1 |                22 |               1111 |                  1 |                  4 | repeat 1: no change     |
976*3117ece4Schristos|           2225 |                22 |               2222 |               1111 |                  1 | non-repeat              |
977*3117ece4Schristos|           1114 |               111 |               1111 |               2222 |               1111 | non-repeat              |
978*3117ece4Schristos|           3336 |                33 |               3333 |               1111 |               2222 | non-repeat              |
979*3117ece4Schristos|              2 |                22 |               1111 |               3333 |               2222 | repeat 2: swap 1 & 2    |
980*3117ece4Schristos|              3 |                33 |               2222 |               1111 |               3333 | repeat 3: rotate 3 to 1 |
981*3117ece4Schristos|              3 |                 0 |               2221 |               2222 |               1111 | special case : insert `repeat1 - 1` |
982*3117ece4Schristos|              1 |                 0 |               2222 |               2221 |               1111 | == repeat 2             |
983*3117ece4Schristos
984*3117ece4Schristos
985*3117ece4SchristosSkippable Frames
986*3117ece4Schristos----------------
987*3117ece4Schristos
988*3117ece4Schristos| `Magic_Number` | `Frame_Size` | `User_Data` |
989*3117ece4Schristos|:--------------:|:------------:|:-----------:|
990*3117ece4Schristos|   4 bytes      |  4 bytes     |   n bytes   |
991*3117ece4Schristos
992*3117ece4SchristosSkippable frames allow the insertion of user-defined metadata
993*3117ece4Schristosinto a flow of concatenated frames.
994*3117ece4Schristos
995*3117ece4SchristosSkippable frames defined in this specification are compatible with [LZ4] ones.
996*3117ece4Schristos
997*3117ece4Schristos[LZ4]:https://lz4.github.io/lz4/
998*3117ece4Schristos
999*3117ece4SchristosFrom a compliant decoder perspective, skippable frames need just be skipped,
1000*3117ece4Schristosand their content ignored, resuming decoding after the skippable frame.
1001*3117ece4Schristos
1002*3117ece4SchristosIt can be noted that a skippable frame
1003*3117ece4Schristoscan be used to watermark a stream of concatenated frames
1004*3117ece4Schristosembedding any kind of tracking information (even just a UUID).
1005*3117ece4SchristosUsers wary of such possibility should scan the stream of concatenated frames
1006*3117ece4Schristosin an attempt to detect such frame for analysis or removal.
1007*3117ece4Schristos
1008*3117ece4Schristos__`Magic_Number`__
1009*3117ece4Schristos
1010*3117ece4Schristos4 Bytes, __little-endian__ format.
1011*3117ece4SchristosValue : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
1012*3117ece4SchristosAll 16 values are valid to identify a skippable frame.
1013*3117ece4SchristosThis specification doesn't detail any specific tagging for skippable frames.
1014*3117ece4Schristos
1015*3117ece4Schristos__`Frame_Size`__
1016*3117ece4Schristos
1017*3117ece4SchristosThis is the size, in bytes, of the following `User_Data`
1018*3117ece4Schristos(without including the magic number nor the size field itself).
1019*3117ece4SchristosThis field is represented using 4 Bytes, __little-endian__ format, unsigned 32-bits.
1020*3117ece4SchristosThis means `User_Data` can’t be bigger than (2^32-1) bytes.
1021*3117ece4Schristos
1022*3117ece4Schristos__`User_Data`__
1023*3117ece4Schristos
1024*3117ece4SchristosThe `User_Data` can be anything. Data will just be skipped by the decoder.
1025*3117ece4Schristos
1026*3117ece4Schristos
1027*3117ece4Schristos
1028*3117ece4SchristosEntropy Encoding
1029*3117ece4Schristos----------------
1030*3117ece4SchristosTwo types of entropy encoding are used by the Zstandard format:
1031*3117ece4SchristosFSE, and Huffman coding.
1032*3117ece4SchristosHuffman is used to compress literals,
1033*3117ece4Schristoswhile FSE is used for all other symbols
1034*3117ece4Schristos(`Literals_Length_Code`, `Match_Length_Code`, offset codes)
1035*3117ece4Schristosand to compress Huffman headers.
1036*3117ece4Schristos
1037*3117ece4Schristos
1038*3117ece4SchristosFSE
1039*3117ece4Schristos---
1040*3117ece4SchristosFSE, short for Finite State Entropy, is an entropy codec based on [ANS].
1041*3117ece4SchristosFSE encoding/decoding involves a state that is carried over between symbols,
1042*3117ece4Schristosso decoding must be done in the opposite direction as encoding.
1043*3117ece4SchristosTherefore, all FSE bitstreams are read from end to beginning.
1044*3117ece4SchristosNote that the order of the bits in the stream is not reversed,
1045*3117ece4Schristoswe just read the elements in the reverse order they are written.
1046*3117ece4Schristos
1047*3117ece4SchristosFor additional details on FSE, see [Finite State Entropy].
1048*3117ece4Schristos
1049*3117ece4Schristos[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
1050*3117ece4Schristos
1051*3117ece4SchristosFSE decoding involves a decoding table which has a power of 2 size, and contain three elements:
1052*3117ece4Schristos`Symbol`, `Num_Bits`, and `Baseline`.
1053*3117ece4SchristosThe `log2` of the table size is its `Accuracy_Log`.
1054*3117ece4SchristosAn FSE state value represents an index in this table.
1055*3117ece4Schristos
1056*3117ece4SchristosTo obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
1057*3117ece4SchristosThe next symbol in the stream is the `Symbol` indicated in the table for that state.
1058*3117ece4SchristosTo obtain the next state value,
1059*3117ece4Schristosthe decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.
1060*3117ece4Schristos
1061*3117ece4Schristos[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
1062*3117ece4Schristos
1063*3117ece4Schristos### FSE Table Description
1064*3117ece4SchristosTo decode FSE streams, it is necessary to construct the decoding table.
1065*3117ece4SchristosThe Zstandard format encodes FSE table descriptions as follows:
1066*3117ece4Schristos
1067*3117ece4SchristosAn FSE distribution table describes the probabilities of all symbols
1068*3117ece4Schristosfrom `0` to the last present one (included)
1069*3117ece4Schristoson a normalized scale of `1 << Accuracy_Log` .
1070*3117ece4SchristosNote that there must be two or more symbols with nonzero probability.
1071*3117ece4Schristos
1072*3117ece4SchristosIt's a bitstream which is read forward, in __little-endian__ fashion.
1073*3117ece4SchristosIt's not necessary to know bitstream exact size,
1074*3117ece4Schristosit will be discovered and reported by the decoding process.
1075*3117ece4Schristos
1076*3117ece4SchristosThe bitstream starts by reporting on which scale it operates.
1077*3117ece4SchristosLet's `low4Bits` designate the lowest 4 bits of the first byte :
1078*3117ece4Schristos`Accuracy_Log = low4bits + 5`.
1079*3117ece4Schristos
1080*3117ece4SchristosThen follows each symbol value, from `0` to last present one.
1081*3117ece4SchristosThe number of bits used by each field is variable.
1082*3117ece4SchristosIt depends on :
1083*3117ece4Schristos
1084*3117ece4Schristos- Remaining probabilities + 1 :
1085*3117ece4Schristos  __example__ :
1086*3117ece4Schristos  Presuming an `Accuracy_Log` of 8,
1087*3117ece4Schristos  and presuming 100 probabilities points have already been distributed,
1088*3117ece4Schristos  the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
1089*3117ece4Schristos  Therefore, it may read up to `log2sup(157) == 8` bits, where `log2sup(N)`
1090*3117ece4Schristos  is the smallest integer `T` that satisfies `(1 << T) > N`.
1091*3117ece4Schristos
1092*3117ece4Schristos- Value decoded : small values use 1 less bit :
1093*3117ece4Schristos  __example__ :
1094*3117ece4Schristos  Presuming values from 0 to 157 (inclusive) are possible,
1095*3117ece4Schristos  255-157 = 98 values are remaining in an 8-bits field.
1096*3117ece4Schristos  They are used this way :
1097*3117ece4Schristos  first 98 values (hence from 0 to 97) use only 7 bits,
1098*3117ece4Schristos  values from 98 to 157 use 8 bits.
1099*3117ece4Schristos  This is achieved through this scheme :
1100*3117ece4Schristos
1101*3117ece4Schristos  | Value read | Value decoded | Number of bits used |
1102*3117ece4Schristos  | ---------- | ------------- | ------------------- |
1103*3117ece4Schristos  |   0 -  97  |   0 -  97     |  7                  |
1104*3117ece4Schristos  |  98 - 127  |  98 - 127     |  8                  |
1105*3117ece4Schristos  | 128 - 225  |   0 -  97     |  7                  |
1106*3117ece4Schristos  | 226 - 255  | 128 - 157     |  8                  |
1107*3117ece4Schristos
1108*3117ece4SchristosSymbols probabilities are read one by one, in order.
1109*3117ece4Schristos
1110*3117ece4SchristosProbability is obtained from Value decoded by following formula :
1111*3117ece4Schristos`Proba = value - 1`
1112*3117ece4Schristos
1113*3117ece4SchristosIt means value `0` becomes negative probability `-1`.
1114*3117ece4Schristos`-1` is a special probability, which means "less than 1".
1115*3117ece4SchristosIts effect on distribution table is described in the [next section].
1116*3117ece4SchristosFor the purpose of calculating total allocated probability points, it counts as one.
1117*3117ece4Schristos
1118*3117ece4Schristos[next section]:#from-normalized-distribution-to-decoding-tables
1119*3117ece4Schristos
1120*3117ece4SchristosWhen a symbol has a __probability__ of `zero`,
1121*3117ece4Schristosit is followed by a 2-bits repeat flag.
1122*3117ece4SchristosThis repeat flag tells how many probabilities of zeroes follow the current one.
1123*3117ece4SchristosIt provides a number ranging from 0 to 3.
1124*3117ece4SchristosIf it is a 3, another 2-bits repeat flag follows, and so on.
1125*3117ece4Schristos
1126*3117ece4SchristosWhen last symbol reaches cumulated total of `1 << Accuracy_Log`,
1127*3117ece4Schristosdecoding is complete.
1128*3117ece4SchristosIf the last symbol makes cumulated total go above `1 << Accuracy_Log`,
1129*3117ece4Schristosdistribution is considered corrupted.
1130*3117ece4SchristosIf this process results in a non-zero probability for a value outside of the
1131*3117ece4Schristosvalid range of values that the FSE table is defined for, even if that value is
1132*3117ece4Schristosnot used, then the data is considered corrupted.
1133*3117ece4Schristos
1134*3117ece4SchristosThen the decoder can tell how many bytes were used in this process,
1135*3117ece4Schristosand how many symbols are present.
1136*3117ece4SchristosThe bitstream consumes a round number of bytes.
1137*3117ece4SchristosAny remaining bit within the last byte is just unused.
1138*3117ece4Schristos
1139*3117ece4Schristos#### From normalized distribution to decoding tables
1140*3117ece4Schristos
1141*3117ece4SchristosThe distribution of normalized probabilities is enough
1142*3117ece4Schristosto create a unique decoding table.
1143*3117ece4Schristos
1144*3117ece4SchristosIt follows the following build rule :
1145*3117ece4Schristos
1146*3117ece4SchristosThe table has a size of `Table_Size = 1 << Accuracy_Log`.
1147*3117ece4SchristosEach cell describes the symbol decoded,
1148*3117ece4Schristosand instructions to get the next state (`Number_of_Bits` and `Baseline`).
1149*3117ece4Schristos
1150*3117ece4SchristosSymbols are scanned in their natural order for "less than 1" probabilities.
1151*3117ece4SchristosSymbols with this probability are being attributed a single cell,
1152*3117ece4Schristosstarting from the end of the table and retreating.
1153*3117ece4SchristosThese symbols define a full state reset, reading `Accuracy_Log` bits.
1154*3117ece4Schristos
1155*3117ece4SchristosThen, all remaining symbols, sorted in natural order, are allocated cells.
1156*3117ece4SchristosStarting from symbol `0` (if it exists), and table position `0`,
1157*3117ece4Schristoseach symbol gets allocated as many cells as its probability.
1158*3117ece4SchristosCell allocation is spread, not linear :
1159*3117ece4Schristoseach successor position follows this rule :
1160*3117ece4Schristos
1161*3117ece4Schristos```
1162*3117ece4Schristosposition += (tableSize>>1) + (tableSize>>3) + 3;
1163*3117ece4Schristosposition &= tableSize-1;
1164*3117ece4Schristos```
1165*3117ece4Schristos
1166*3117ece4SchristosA position is skipped if already occupied by a "less than 1" probability symbol.
1167*3117ece4Schristos`position` does not reset between symbols, it simply iterates through
1168*3117ece4Schristoseach position in the table, switching to the next symbol when enough
1169*3117ece4Schristosstates have been allocated to the current one.
1170*3117ece4Schristos
1171*3117ece4SchristosThe process guarantees that the table is entirely filled.
1172*3117ece4SchristosEach cell corresponds to a state value, which contains the symbol being decoded.
1173*3117ece4Schristos
1174*3117ece4SchristosTo add the `Number_of_Bits` and `Baseline` required to retrieve next state,
1175*3117ece4Schristosit's first necessary to sort all occurrences of each symbol in state order.
1176*3117ece4SchristosLower states will need 1 more bit than higher ones.
1177*3117ece4SchristosThe process is repeated for each symbol.
1178*3117ece4Schristos
1179*3117ece4Schristos__Example__ :
1180*3117ece4SchristosPresuming a symbol has a probability of 5,
1181*3117ece4Schristosit receives 5 cells, corresponding to 5 state values.
1182*3117ece4SchristosThese state values are then sorted in natural order.
1183*3117ece4Schristos
1184*3117ece4SchristosNext power of 2 after 5 is 8.
1185*3117ece4SchristosSpace of probabilities must be divided into 8 equal parts.
1186*3117ece4SchristosPresuming the `Accuracy_Log` is 7, it defines a space of 128 states.
1187*3117ece4SchristosDivided by 8, each share is 16 large.
1188*3117ece4Schristos
1189*3117ece4SchristosIn order to reach 8 shares, 8-5=3 lowest states will count "double",
1190*3117ece4Schristosdoubling their shares (32 in width), hence requiring one more bit.
1191*3117ece4Schristos
1192*3117ece4SchristosBaseline is assigned starting from the higher states using fewer bits,
1193*3117ece4Schristosincreasing at each state, then resuming at the first state,
1194*3117ece4Schristoseach state takes its allocated width from Baseline.
1195*3117ece4Schristos
1196*3117ece4Schristos| state order      |   0   |   1   |    2   |   3  |    4   |
1197*3117ece4Schristos| ---------------- | ----- | ----- | ------ | ---- | ------ |
1198*3117ece4Schristos| state value      |   1   |  39   |   77   |  84  |  122   |
1199*3117ece4Schristos| width            |  32   |  32   |   32   |  16  |   16   |
1200*3117ece4Schristos| `Number_of_Bits` |   5   |   5   |    5   |   4  |    4   |
1201*3117ece4Schristos| range number     |   2   |   4   |    6   |   0  |    1   |
1202*3117ece4Schristos| `Baseline`       |  32   |  64   |   96   |   0  |   16   |
1203*3117ece4Schristos| range            | 32-63 | 64-95 | 96-127 | 0-15 | 16-31  |
1204*3117ece4Schristos
1205*3117ece4SchristosDuring decoding, the next state value is determined from current state value,
1206*3117ece4Schristosby reading the required `Number_of_Bits`, and adding the specified `Baseline`.
1207*3117ece4Schristos
1208*3117ece4SchristosSee [Appendix A] for the results of this process applied to the default distributions.
1209*3117ece4Schristos
1210*3117ece4Schristos[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes
1211*3117ece4Schristos
1212*3117ece4Schristos
1213*3117ece4SchristosHuffman Coding
1214*3117ece4Schristos--------------
1215*3117ece4SchristosZstandard Huffman-coded streams are read backwards,
1216*3117ece4Schristossimilar to the FSE bitstreams.
1217*3117ece4SchristosTherefore, to find the start of the bitstream, it is required to
1218*3117ece4Schristosknow the offset of the last byte of the Huffman-coded stream.
1219*3117ece4Schristos
1220*3117ece4SchristosAfter writing the last bit containing information, the compressor
1221*3117ece4Schristoswrites a single `1`-bit and then fills the byte with 0-7 `0` bits of
1222*3117ece4Schristospadding. The last byte of the compressed bitstream cannot be `0` for
1223*3117ece4Schristosthat reason.
1224*3117ece4Schristos
1225*3117ece4SchristosWhen decompressing, the last byte containing the padding is the first
1226*3117ece4Schristosbyte to read. The decompressor needs to skip 0-7 initial `0`-bits and
1227*3117ece4Schristosthe first `1`-bit it occurs. Afterwards, the useful part of the bitstream
1228*3117ece4Schristosbegins.
1229*3117ece4Schristos
1230*3117ece4SchristosThe bitstream contains Huffman-coded symbols in __little-endian__ order,
1231*3117ece4Schristoswith the codes defined by the method below.
1232*3117ece4Schristos
1233*3117ece4Schristos### Huffman Tree Description
1234*3117ece4Schristos
1235*3117ece4SchristosPrefix coding represents symbols from an a priori known alphabet
1236*3117ece4Schristosby bit sequences (codewords), one codeword for each symbol,
1237*3117ece4Schristosin a manner such that different symbols may be represented
1238*3117ece4Schristosby bit sequences of different lengths,
1239*3117ece4Schristosbut a parser can always parse an encoded string
1240*3117ece4Schristosunambiguously symbol-by-symbol.
1241*3117ece4Schristos
1242*3117ece4SchristosGiven an alphabet with known symbol frequencies,
1243*3117ece4Schristosthe Huffman algorithm allows the construction of an optimal prefix code
1244*3117ece4Schristosusing the fewest bits of any possible prefix codes for that alphabet.
1245*3117ece4Schristos
1246*3117ece4SchristosPrefix code must not exceed a maximum code length.
1247*3117ece4SchristosMore bits improve accuracy but cost more header size,
1248*3117ece4Schristosand require more memory or more complex decoding operations.
1249*3117ece4SchristosThis specification limits maximum code length to 11 bits.
1250*3117ece4Schristos
1251*3117ece4Schristos#### Representation
1252*3117ece4Schristos
1253*3117ece4SchristosAll literal values from zero (included) to last present one (excluded)
1254*3117ece4Schristosare represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
1255*3117ece4SchristosTransformation from `Weight` to `Number_of_Bits` follows this formula :
1256*3117ece4Schristos```
1257*3117ece4SchristosNumber_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
1258*3117ece4Schristos```
1259*3117ece4SchristosWhen a literal value is not present, it receives a `Weight` of 0.
1260*3117ece4SchristosThe least frequent symbol receives a `Weight` of 1.
1261*3117ece4SchristosIf no literal has a `Weight` of 1, then the data is considered corrupted.
1262*3117ece4SchristosIf there are not at least two literals with non-zero `Weight`, then the data
1263*3117ece4Schristosis considered corrupted.
1264*3117ece4SchristosThe most frequent symbol receives a `Weight` anywhere between 1 and 11 (max).
1265*3117ece4SchristosThe last symbol's `Weight` is deduced from previously retrieved Weights,
1266*3117ece4Schristosby completing to the nearest power of 2. It's necessarily non 0.
1267*3117ece4SchristosIf it's not possible to reach a clean power of 2 with a single `Weight` value,
1268*3117ece4Schristosthe Huffman Tree Description is considered invalid.
1269*3117ece4SchristosThis final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
1270*3117ece4Schristos`Max_Number_of_Bits` must be <= 11,
1271*3117ece4Schristosotherwise the representation is considered corrupted.
1272*3117ece4Schristos
1273*3117ece4Schristos__Example__ :
1274*3117ece4SchristosLet's presume the following Huffman tree must be described :
1275*3117ece4Schristos
1276*3117ece4Schristos|  literal value   |  0  |  1  |  2  |  3  |  4  |  5  |
1277*3117ece4Schristos| ---------------- | --- | --- | --- | --- | --- | --- |
1278*3117ece4Schristos| `Number_of_Bits` |  1  |  2  |  3  |  0  |  4  |  4  |
1279*3117ece4Schristos
1280*3117ece4SchristosThe tree depth is 4, since its longest elements uses 4 bits
1281*3117ece4Schristos(longest elements are the one with smallest frequency).
1282*3117ece4SchristosLiteral value `5` will not be listed, as it can be determined from previous values 0-4,
1283*3117ece4Schristosnor will values above `5` as they are all 0.
1284*3117ece4SchristosValues from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
1285*3117ece4SchristosWeight formula is :
1286*3117ece4Schristos```
1287*3117ece4SchristosWeight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
1288*3117ece4Schristos```
1289*3117ece4SchristosIt gives the following series of weights :
1290*3117ece4Schristos
1291*3117ece4Schristos| literal value |  0  |  1  |  2  |  3  |  4  |
1292*3117ece4Schristos| ------------- | --- | --- | --- | --- | --- |
1293*3117ece4Schristos|   `Weight`    |  4  |  3  |  2  |  0  |  1  |
1294*3117ece4Schristos
1295*3117ece4SchristosThe decoder will do the inverse operation :
1296*3117ece4Schristoshaving collected weights of literal symbols from `0` to `4`,
1297*3117ece4Schristosit knows the last literal, `5`, is present with a non-zero `Weight`.
1298*3117ece4SchristosThe `Weight` of `5` can be determined by advancing to the next power of 2.
1299*3117ece4SchristosThe sum of `2^(Weight-1)` (excluding 0's) is :
1300*3117ece4Schristos`8 + 4 + 2 + 0 + 1 = 15`.
1301*3117ece4SchristosNearest larger power of 2 value is 16.
1302*3117ece4SchristosTherefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`.
1303*3117ece4Schristos
1304*3117ece4Schristos#### Huffman Tree header
1305*3117ece4Schristos
1306*3117ece4SchristosThis is a single byte value (0-255),
1307*3117ece4Schristoswhich describes how the series of weights is encoded.
1308*3117ece4Schristos
1309*3117ece4Schristos- if `headerByte` < 128 :
1310*3117ece4Schristos  the series of weights is compressed using FSE (see below).
1311*3117ece4Schristos  The length of the FSE-compressed series is equal to `headerByte` (0-127).
1312*3117ece4Schristos
1313*3117ece4Schristos- if `headerByte` >= 128 :
1314*3117ece4Schristos  + the series of weights uses a direct representation,
1315*3117ece4Schristos    where each `Weight` is encoded directly as a 4 bits field (0-15).
1316*3117ece4Schristos  + They are encoded forward, 2 weights to a byte,
1317*3117ece4Schristos    first weight taking the top four bits and second one taking the bottom four.
1318*3117ece4Schristos    * e.g. the following operations could be used to read the weights:
1319*3117ece4Schristos      `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.
1320*3117ece4Schristos  + The full representation occupies `Ceiling(Number_of_Weights/2)` bytes,
1321*3117ece4Schristos    meaning it uses only full bytes even if `Number_of_Weights` is odd.
1322*3117ece4Schristos  + `Number_of_Weights = headerByte - 127`.
1323*3117ece4Schristos    * Note that maximum `Number_of_Weights` is 255-127 = 128,
1324*3117ece4Schristos      therefore, only up to 128 `Weight` can be encoded using direct representation.
1325*3117ece4Schristos    * Since the last non-zero `Weight` is _not_ encoded,
1326*3117ece4Schristos      this scheme is compatible with alphabet sizes of up to 129 symbols,
1327*3117ece4Schristos      hence including literal symbol 128.
1328*3117ece4Schristos    * If any literal symbol > 128 has a non-zero `Weight`,
1329*3117ece4Schristos      direct representation is not possible.
1330*3117ece4Schristos      In such case, it's necessary to use FSE compression.
1331*3117ece4Schristos
1332*3117ece4Schristos
1333*3117ece4Schristos#### Finite State Entropy (FSE) compression of Huffman weights
1334*3117ece4Schristos
1335*3117ece4SchristosIn this case, the series of Huffman weights is compressed using FSE compression.
1336*3117ece4SchristosIt's a single bitstream with 2 interleaved states,
1337*3117ece4Schristossharing a single distribution table.
1338*3117ece4Schristos
1339*3117ece4SchristosTo decode an FSE bitstream, it is necessary to know its compressed size.
1340*3117ece4SchristosCompressed size is provided by `headerByte`.
1341*3117ece4SchristosIt's also necessary to know its _maximum possible_ decompressed size,
1342*3117ece4Schristoswhich is `255`, since literal values span from `0` to `255`,
1343*3117ece4Schristosand last symbol's `Weight` is not represented.
1344*3117ece4Schristos
1345*3117ece4SchristosAn FSE bitstream starts by a header, describing probabilities distribution.
1346*3117ece4SchristosIt will create a Decoding Table.
1347*3117ece4SchristosFor a list of Huffman weights, the maximum accuracy log is 6 bits.
1348*3117ece4SchristosFor more description see the [FSE header description](#fse-table-description)
1349*3117ece4Schristos
1350*3117ece4SchristosThe Huffman header compression uses 2 states,
1351*3117ece4Schristoswhich share the same FSE distribution table.
1352*3117ece4SchristosThe first state (`State1`) encodes the even indexed symbols,
1353*3117ece4Schristosand the second (`State2`) encodes the odd indexed symbols.
1354*3117ece4Schristos`State1` is initialized first, and then `State2`, and they take turns
1355*3117ece4Schristosdecoding a single symbol and updating their state.
1356*3117ece4SchristosFor more details on these FSE operations, see the [FSE section](#fse).
1357*3117ece4Schristos
1358*3117ece4SchristosThe number of symbols to decode is determined
1359*3117ece4Schristosby tracking bitStream overflow condition:
1360*3117ece4SchristosIf updating state after decoding a symbol would require more bits than
1361*3117ece4Schristosremain in the stream, it is assumed that extra bits are 0.  Then,
1362*3117ece4Schristossymbols for each of the final states are decoded and the process is complete.
1363*3117ece4Schristos
1364*3117ece4SchristosIf this process would produce more weights than the maximum number of decoded
1365*3117ece4Schristosweights (255), then the data is considered corrupted.
1366*3117ece4Schristos
1367*3117ece4Schristos#### Conversion from weights to Huffman prefix codes
1368*3117ece4Schristos
1369*3117ece4SchristosAll present symbols shall now have a `Weight` value.
1370*3117ece4SchristosIt is possible to transform weights into `Number_of_Bits`, using this formula:
1371*3117ece4Schristos```
1372*3117ece4SchristosNumber_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0
1373*3117ece4Schristos```
1374*3117ece4SchristosSymbols are sorted by `Weight`.
1375*3117ece4SchristosWithin same `Weight`, symbols keep natural sequential order.
1376*3117ece4SchristosSymbols with a `Weight` of zero are removed.
1377*3117ece4SchristosThen, starting from lowest `Weight`, prefix codes are distributed in sequential order.
1378*3117ece4Schristos
1379*3117ece4Schristos__Example__ :
1380*3117ece4SchristosLet's presume the following list of weights has been decoded :
1381*3117ece4Schristos
1382*3117ece4Schristos| Literal  |  0  |  1  |  2  |  3  |  4  |  5  |
1383*3117ece4Schristos| -------- | --- | --- | --- | --- | --- | --- |
1384*3117ece4Schristos| `Weight` |  4  |  3  |  2  |  0  |  1  |  1  |
1385*3117ece4Schristos
1386*3117ece4SchristosSorted by weight and then natural sequential order,
1387*3117ece4Schristosit gives the following distribution :
1388*3117ece4Schristos
1389*3117ece4Schristos| Literal          |  3  |  4  |  5  |  2  |  1  |   0  |
1390*3117ece4Schristos| ---------------- | --- | --- | --- | --- | --- | ---- |
1391*3117ece4Schristos| `Weight`         |  0  |  1  |  1  |  2  |  3  |   4  |
1392*3117ece4Schristos| `Number_of_Bits` |  0  |  4  |  4  |  3  |  2  |   1  |
1393*3117ece4Schristos| prefix codes     | N/A | 0000| 0001| 001 | 01  |   1  |
1394*3117ece4Schristos
1395*3117ece4Schristos### Huffman-coded Streams
1396*3117ece4Schristos
1397*3117ece4SchristosGiven a Huffman decoding table,
1398*3117ece4Schristosit's possible to decode a Huffman-coded stream.
1399*3117ece4Schristos
1400*3117ece4SchristosEach bitstream must be read _backward_,
1401*3117ece4Schristosthat is starting from the end down to the beginning.
1402*3117ece4SchristosTherefore it's necessary to know the size of each bitstream.
1403*3117ece4Schristos
1404*3117ece4SchristosIt's also necessary to know exactly which _bit_ is the last one.
1405*3117ece4SchristosThis is detected by a final bit flag :
1406*3117ece4Schristosthe highest bit of latest byte is a final-bit-flag.
1407*3117ece4SchristosConsequently, a last byte of `0` is not possible.
1408*3117ece4SchristosAnd the final-bit-flag itself is not part of the useful bitstream.
1409*3117ece4SchristosHence, the last byte contains between 0 and 7 useful bits.
1410*3117ece4Schristos
1411*3117ece4SchristosStarting from the end,
1412*3117ece4Schristosit's possible to read the bitstream in a __little-endian__ fashion,
1413*3117ece4Schristoskeeping track of already used bits. Since the bitstream is encoded in reverse
1414*3117ece4Schristosorder, starting from the end read symbols in forward order.
1415*3117ece4Schristos
1416*3117ece4SchristosFor example, if the literal sequence "0145" was encoded using above prefix code,
1417*3117ece4Schristosit would be encoded (in reverse order) as:
1418*3117ece4Schristos
1419*3117ece4Schristos|Symbol  |   5  |   4  |  1 | 0 | Padding |
1420*3117ece4Schristos|--------|------|------|----|---|---------|
1421*3117ece4Schristos|Encoding|`0000`|`0001`|`01`|`1`| `00001` |
1422*3117ece4Schristos
1423*3117ece4SchristosResulting in following 2-bytes bitstream :
1424*3117ece4Schristos```
1425*3117ece4Schristos00010000 00001101
1426*3117ece4Schristos```
1427*3117ece4Schristos
1428*3117ece4SchristosHere is an alternative representation with the symbol codes separated by underscore:
1429*3117ece4Schristos```
1430*3117ece4Schristos0001_0000 00001_1_01
1431*3117ece4Schristos```
1432*3117ece4Schristos
1433*3117ece4SchristosReading highest `Max_Number_of_Bits` bits,
1434*3117ece4Schristosit's possible to compare extracted value to decoding table,
1435*3117ece4Schristosdetermining the symbol to decode and number of bits to discard.
1436*3117ece4Schristos
1437*3117ece4SchristosThe process continues up to reading the required number of symbols per stream.
1438*3117ece4SchristosIf a bitstream is not entirely and exactly consumed,
1439*3117ece4Schristoshence reaching exactly its beginning position with _all_ bits consumed,
1440*3117ece4Schristosthe decoding process is considered faulty.
1441*3117ece4Schristos
1442*3117ece4Schristos
1443*3117ece4SchristosDictionary Format
1444*3117ece4Schristos-----------------
1445*3117ece4Schristos
1446*3117ece4SchristosZstandard is compatible with "raw content" dictionaries,
1447*3117ece4Schristosfree of any format restriction, except that they must be at least 8 bytes.
1448*3117ece4SchristosThese dictionaries function as if they were just the `Content` part
1449*3117ece4Schristosof a formatted dictionary.
1450*3117ece4Schristos
1451*3117ece4SchristosBut dictionaries created by `zstd --train` follow a format, described here.
1452*3117ece4Schristos
1453*3117ece4Schristos__Pre-requisites__ : a dictionary has a size,
1454*3117ece4Schristos                     defined either by a buffer limit, or a file size.
1455*3117ece4Schristos
1456*3117ece4Schristos| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` |
1457*3117ece4Schristos| -------------- | --------------- | ---------------- | --------- |
1458*3117ece4Schristos
1459*3117ece4Schristos__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, __little-endian__ format
1460*3117ece4Schristos
1461*3117ece4Schristos__`Dictionary_ID`__ : 4 bytes, stored in __little-endian__ format.
1462*3117ece4Schristos              `Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
1463*3117ece4Schristos              It's used by decoders to check if they use the correct dictionary.
1464*3117ece4Schristos
1465*3117ece4Schristos_Reserved ranges :_
1466*3117ece4SchristosIf the dictionary is going to be distributed in a public environment,
1467*3117ece4Schristosthe following ranges of `Dictionary_ID` are reserved for some future registrar
1468*3117ece4Schristosand shall not be used :
1469*3117ece4Schristos
1470*3117ece4Schristos    - low range  : <= 32767
1471*3117ece4Schristos    - high range : >= (2^31)
1472*3117ece4Schristos
1473*3117ece4SchristosOutside of these ranges, any value of `Dictionary_ID`
1474*3117ece4Schristoswhich is both `>= 32768` and `< (1<<31)` can be used freely,
1475*3117ece4Schristoseven in public environment.
1476*3117ece4Schristos
1477*3117ece4Schristos
1478*3117ece4Schristos__`Entropy_Tables`__ : follow the same format as tables in [compressed blocks].
1479*3117ece4Schristos              See the relevant [FSE](#fse-table-description)
1480*3117ece4Schristos              and [Huffman](#huffman-tree-description) sections for how to decode these tables.
1481*3117ece4Schristos              They are stored in following order :
1482*3117ece4Schristos              Huffman tables for literals, FSE table for offsets,
1483*3117ece4Schristos              FSE table for match lengths, and FSE table for literals lengths.
1484*3117ece4Schristos              These tables populate the Repeat Stats literals mode and
1485*3117ece4Schristos              Repeat distribution mode for sequence decoding.
1486*3117ece4Schristos              It's finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`),
1487*3117ece4Schristos              stored in order, 4-bytes __little-endian__ each, for a total of 12 bytes.
1488*3117ece4Schristos              Each recent offset must have a value <= dictionary content size, and cannot equal 0.
1489*3117ece4Schristos
1490*3117ece4Schristos__`Content`__ : The rest of the dictionary is its content.
1491*3117ece4Schristos              The content act as a "past" in front of data to compress or decompress,
1492*3117ece4Schristos              so it can be referenced in sequence commands.
1493*3117ece4Schristos              As long as the amount of data decoded from this frame is less than or
1494*3117ece4Schristos              equal to `Window_Size`, sequence commands may specify offsets longer
1495*3117ece4Schristos              than the total length of decoded output so far to reference back to the
1496*3117ece4Schristos              dictionary, even parts of the dictionary with offsets larger than `Window_Size`.
1497*3117ece4Schristos              After the total output has surpassed `Window_Size` however,
1498*3117ece4Schristos              this is no longer allowed and the dictionary is no longer accessible.
1499*3117ece4Schristos
1500*3117ece4Schristos[compressed blocks]: #the-format-of-compressed_block
1501*3117ece4Schristos
1502*3117ece4SchristosIf a dictionary is provided by an external source,
1503*3117ece4Schristosit should be loaded with great care, its content considered untrusted.
1504*3117ece4Schristos
1505*3117ece4Schristos
1506*3117ece4Schristos
1507*3117ece4SchristosAppendix A - Decoding tables for predefined codes
1508*3117ece4Schristos-------------------------------------------------
1509*3117ece4Schristos
1510*3117ece4SchristosThis appendix contains FSE decoding tables
1511*3117ece4Schristosfor the predefined literal length, match length, and offset codes.
1512*3117ece4SchristosThe tables have been constructed using the algorithm as given above in chapter
1513*3117ece4Schristos"from normalized distribution to decoding tables".
1514*3117ece4SchristosThe tables here can be used as examples
1515*3117ece4Schristosto crosscheck that an implementation build its decoding tables correctly.
1516*3117ece4Schristos
1517*3117ece4Schristos#### Literal Length Code:
1518*3117ece4Schristos
1519*3117ece4Schristos| State | Symbol | Number_Of_Bits | Base |
1520*3117ece4Schristos| ----- | ------ | -------------- | ---- |
1521*3117ece4Schristos|     0 |      0 |              4 |    0 |
1522*3117ece4Schristos|     1 |      0 |              4 |   16 |
1523*3117ece4Schristos|     2 |      1 |              5 |   32 |
1524*3117ece4Schristos|     3 |      3 |              5 |    0 |
1525*3117ece4Schristos|     4 |      4 |              5 |    0 |
1526*3117ece4Schristos|     5 |      6 |              5 |    0 |
1527*3117ece4Schristos|     6 |      7 |              5 |    0 |
1528*3117ece4Schristos|     7 |      9 |              5 |    0 |
1529*3117ece4Schristos|     8 |     10 |              5 |    0 |
1530*3117ece4Schristos|     9 |     12 |              5 |    0 |
1531*3117ece4Schristos|    10 |     14 |              6 |    0 |
1532*3117ece4Schristos|    11 |     16 |              5 |    0 |
1533*3117ece4Schristos|    12 |     18 |              5 |    0 |
1534*3117ece4Schristos|    13 |     19 |              5 |    0 |
1535*3117ece4Schristos|    14 |     21 |              5 |    0 |
1536*3117ece4Schristos|    15 |     22 |              5 |    0 |
1537*3117ece4Schristos|    16 |     24 |              5 |    0 |
1538*3117ece4Schristos|    17 |     25 |              5 |   32 |
1539*3117ece4Schristos|    18 |     26 |              5 |    0 |
1540*3117ece4Schristos|    19 |     27 |              6 |    0 |
1541*3117ece4Schristos|    20 |     29 |              6 |    0 |
1542*3117ece4Schristos|    21 |     31 |              6 |    0 |
1543*3117ece4Schristos|    22 |      0 |              4 |   32 |
1544*3117ece4Schristos|    23 |      1 |              4 |    0 |
1545*3117ece4Schristos|    24 |      2 |              5 |    0 |
1546*3117ece4Schristos|    25 |      4 |              5 |   32 |
1547*3117ece4Schristos|    26 |      5 |              5 |    0 |
1548*3117ece4Schristos|    27 |      7 |              5 |   32 |
1549*3117ece4Schristos|    28 |      8 |              5 |    0 |
1550*3117ece4Schristos|    29 |     10 |              5 |   32 |
1551*3117ece4Schristos|    30 |     11 |              5 |    0 |
1552*3117ece4Schristos|    31 |     13 |              6 |    0 |
1553*3117ece4Schristos|    32 |     16 |              5 |   32 |
1554*3117ece4Schristos|    33 |     17 |              5 |    0 |
1555*3117ece4Schristos|    34 |     19 |              5 |   32 |
1556*3117ece4Schristos|    35 |     20 |              5 |    0 |
1557*3117ece4Schristos|    36 |     22 |              5 |   32 |
1558*3117ece4Schristos|    37 |     23 |              5 |    0 |
1559*3117ece4Schristos|    38 |     25 |              4 |    0 |
1560*3117ece4Schristos|    39 |     25 |              4 |   16 |
1561*3117ece4Schristos|    40 |     26 |              5 |   32 |
1562*3117ece4Schristos|    41 |     28 |              6 |    0 |
1563*3117ece4Schristos|    42 |     30 |              6 |    0 |
1564*3117ece4Schristos|    43 |      0 |              4 |   48 |
1565*3117ece4Schristos|    44 |      1 |              4 |   16 |
1566*3117ece4Schristos|    45 |      2 |              5 |   32 |
1567*3117ece4Schristos|    46 |      3 |              5 |   32 |
1568*3117ece4Schristos|    47 |      5 |              5 |   32 |
1569*3117ece4Schristos|    48 |      6 |              5 |   32 |
1570*3117ece4Schristos|    49 |      8 |              5 |   32 |
1571*3117ece4Schristos|    50 |      9 |              5 |   32 |
1572*3117ece4Schristos|    51 |     11 |              5 |   32 |
1573*3117ece4Schristos|    52 |     12 |              5 |   32 |
1574*3117ece4Schristos|    53 |     15 |              6 |    0 |
1575*3117ece4Schristos|    54 |     17 |              5 |   32 |
1576*3117ece4Schristos|    55 |     18 |              5 |   32 |
1577*3117ece4Schristos|    56 |     20 |              5 |   32 |
1578*3117ece4Schristos|    57 |     21 |              5 |   32 |
1579*3117ece4Schristos|    58 |     23 |              5 |   32 |
1580*3117ece4Schristos|    59 |     24 |              5 |   32 |
1581*3117ece4Schristos|    60 |     35 |              6 |    0 |
1582*3117ece4Schristos|    61 |     34 |              6 |    0 |
1583*3117ece4Schristos|    62 |     33 |              6 |    0 |
1584*3117ece4Schristos|    63 |     32 |              6 |    0 |
1585*3117ece4Schristos
1586*3117ece4Schristos#### Match Length Code:
1587*3117ece4Schristos
1588*3117ece4Schristos| State | Symbol | Number_Of_Bits | Base |
1589*3117ece4Schristos| ----- | ------ | -------------- | ---- |
1590*3117ece4Schristos|     0 |      0 |              6 |    0 |
1591*3117ece4Schristos|     1 |      1 |              4 |    0 |
1592*3117ece4Schristos|     2 |      2 |              5 |   32 |
1593*3117ece4Schristos|     3 |      3 |              5 |    0 |
1594*3117ece4Schristos|     4 |      5 |              5 |    0 |
1595*3117ece4Schristos|     5 |      6 |              5 |    0 |
1596*3117ece4Schristos|     6 |      8 |              5 |    0 |
1597*3117ece4Schristos|     7 |     10 |              6 |    0 |
1598*3117ece4Schristos|     8 |     13 |              6 |    0 |
1599*3117ece4Schristos|     9 |     16 |              6 |    0 |
1600*3117ece4Schristos|    10 |     19 |              6 |    0 |
1601*3117ece4Schristos|    11 |     22 |              6 |    0 |
1602*3117ece4Schristos|    12 |     25 |              6 |    0 |
1603*3117ece4Schristos|    13 |     28 |              6 |    0 |
1604*3117ece4Schristos|    14 |     31 |              6 |    0 |
1605*3117ece4Schristos|    15 |     33 |              6 |    0 |
1606*3117ece4Schristos|    16 |     35 |              6 |    0 |
1607*3117ece4Schristos|    17 |     37 |              6 |    0 |
1608*3117ece4Schristos|    18 |     39 |              6 |    0 |
1609*3117ece4Schristos|    19 |     41 |              6 |    0 |
1610*3117ece4Schristos|    20 |     43 |              6 |    0 |
1611*3117ece4Schristos|    21 |     45 |              6 |    0 |
1612*3117ece4Schristos|    22 |      1 |              4 |   16 |
1613*3117ece4Schristos|    23 |      2 |              4 |    0 |
1614*3117ece4Schristos|    24 |      3 |              5 |   32 |
1615*3117ece4Schristos|    25 |      4 |              5 |    0 |
1616*3117ece4Schristos|    26 |      6 |              5 |   32 |
1617*3117ece4Schristos|    27 |      7 |              5 |    0 |
1618*3117ece4Schristos|    28 |      9 |              6 |    0 |
1619*3117ece4Schristos|    29 |     12 |              6 |    0 |
1620*3117ece4Schristos|    30 |     15 |              6 |    0 |
1621*3117ece4Schristos|    31 |     18 |              6 |    0 |
1622*3117ece4Schristos|    32 |     21 |              6 |    0 |
1623*3117ece4Schristos|    33 |     24 |              6 |    0 |
1624*3117ece4Schristos|    34 |     27 |              6 |    0 |
1625*3117ece4Schristos|    35 |     30 |              6 |    0 |
1626*3117ece4Schristos|    36 |     32 |              6 |    0 |
1627*3117ece4Schristos|    37 |     34 |              6 |    0 |
1628*3117ece4Schristos|    38 |     36 |              6 |    0 |
1629*3117ece4Schristos|    39 |     38 |              6 |    0 |
1630*3117ece4Schristos|    40 |     40 |              6 |    0 |
1631*3117ece4Schristos|    41 |     42 |              6 |    0 |
1632*3117ece4Schristos|    42 |     44 |              6 |    0 |
1633*3117ece4Schristos|    43 |      1 |              4 |   32 |
1634*3117ece4Schristos|    44 |      1 |              4 |   48 |
1635*3117ece4Schristos|    45 |      2 |              4 |   16 |
1636*3117ece4Schristos|    46 |      4 |              5 |   32 |
1637*3117ece4Schristos|    47 |      5 |              5 |   32 |
1638*3117ece4Schristos|    48 |      7 |              5 |   32 |
1639*3117ece4Schristos|    49 |      8 |              5 |   32 |
1640*3117ece4Schristos|    50 |     11 |              6 |    0 |
1641*3117ece4Schristos|    51 |     14 |              6 |    0 |
1642*3117ece4Schristos|    52 |     17 |              6 |    0 |
1643*3117ece4Schristos|    53 |     20 |              6 |    0 |
1644*3117ece4Schristos|    54 |     23 |              6 |    0 |
1645*3117ece4Schristos|    55 |     26 |              6 |    0 |
1646*3117ece4Schristos|    56 |     29 |              6 |    0 |
1647*3117ece4Schristos|    57 |     52 |              6 |    0 |
1648*3117ece4Schristos|    58 |     51 |              6 |    0 |
1649*3117ece4Schristos|    59 |     50 |              6 |    0 |
1650*3117ece4Schristos|    60 |     49 |              6 |    0 |
1651*3117ece4Schristos|    61 |     48 |              6 |    0 |
1652*3117ece4Schristos|    62 |     47 |              6 |    0 |
1653*3117ece4Schristos|    63 |     46 |              6 |    0 |
1654*3117ece4Schristos
1655*3117ece4Schristos#### Offset Code:
1656*3117ece4Schristos
1657*3117ece4Schristos| State | Symbol | Number_Of_Bits | Base |
1658*3117ece4Schristos| ----- | ------ | -------------- | ---- |
1659*3117ece4Schristos|     0 |      0 |              5 |    0 |
1660*3117ece4Schristos|     1 |      6 |              4 |    0 |
1661*3117ece4Schristos|     2 |      9 |              5 |    0 |
1662*3117ece4Schristos|     3 |     15 |              5 |    0 |
1663*3117ece4Schristos|     4 |     21 |              5 |    0 |
1664*3117ece4Schristos|     5 |      3 |              5 |    0 |
1665*3117ece4Schristos|     6 |      7 |              4 |    0 |
1666*3117ece4Schristos|     7 |     12 |              5 |    0 |
1667*3117ece4Schristos|     8 |     18 |              5 |    0 |
1668*3117ece4Schristos|     9 |     23 |              5 |    0 |
1669*3117ece4Schristos|    10 |      5 |              5 |    0 |
1670*3117ece4Schristos|    11 |      8 |              4 |    0 |
1671*3117ece4Schristos|    12 |     14 |              5 |    0 |
1672*3117ece4Schristos|    13 |     20 |              5 |    0 |
1673*3117ece4Schristos|    14 |      2 |              5 |    0 |
1674*3117ece4Schristos|    15 |      7 |              4 |   16 |
1675*3117ece4Schristos|    16 |     11 |              5 |    0 |
1676*3117ece4Schristos|    17 |     17 |              5 |    0 |
1677*3117ece4Schristos|    18 |     22 |              5 |    0 |
1678*3117ece4Schristos|    19 |      4 |              5 |    0 |
1679*3117ece4Schristos|    20 |      8 |              4 |   16 |
1680*3117ece4Schristos|    21 |     13 |              5 |    0 |
1681*3117ece4Schristos|    22 |     19 |              5 |    0 |
1682*3117ece4Schristos|    23 |      1 |              5 |    0 |
1683*3117ece4Schristos|    24 |      6 |              4 |   16 |
1684*3117ece4Schristos|    25 |     10 |              5 |    0 |
1685*3117ece4Schristos|    26 |     16 |              5 |    0 |
1686*3117ece4Schristos|    27 |     28 |              5 |    0 |
1687*3117ece4Schristos|    28 |     27 |              5 |    0 |
1688*3117ece4Schristos|    29 |     26 |              5 |    0 |
1689*3117ece4Schristos|    30 |     25 |              5 |    0 |
1690*3117ece4Schristos|    31 |     24 |              5 |    0 |
1691*3117ece4Schristos
1692*3117ece4Schristos
1693*3117ece4Schristos
1694*3117ece4SchristosAppendix B - Resources for implementers
1695*3117ece4Schristos-------------------------------------------------
1696*3117ece4Schristos
1697*3117ece4SchristosAn open source reference implementation is available on :
1698*3117ece4Schristoshttps://github.com/facebook/zstd
1699*3117ece4Schristos
1700*3117ece4SchristosThe project contains a frame generator, called [decodeCorpus],
1701*3117ece4Schristoswhich can be used by any 3rd-party implementation
1702*3117ece4Schristosto verify that a tested decoder is compliant with the specification.
1703*3117ece4Schristos
1704*3117ece4Schristos[decodeCorpus]: https://github.com/facebook/zstd/tree/v1.3.4/tests#decodecorpus---tool-to-generate-zstandard-frames-for-decoder-testing
1705*3117ece4Schristos
1706*3117ece4Schristos`decodeCorpus` generates random valid frames.
1707*3117ece4SchristosA compliant decoder should be able to decode them all,
1708*3117ece4Schristosor at least provide a meaningful error code explaining for which reason it cannot
1709*3117ece4Schristos(memory limit restrictions for example).
1710*3117ece4Schristos
1711*3117ece4Schristos
1712*3117ece4SchristosVersion changes
1713*3117ece4Schristos---------------
1714*3117ece4Schristos- 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov
1715*3117ece4Schristos- 0.3.9 : clarifications for Huffman-compressed literal sizes.
1716*3117ece4Schristos- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
1717*3117ece4Schristos- 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878
1718*3117ece4Schristos- 0.3.6 : clarifications for Dictionary_ID
1719*3117ece4Schristos- 0.3.5 : clarifications for Block_Maximum_Size
1720*3117ece4Schristos- 0.3.4 : clarifications for FSE decoding table
1721*3117ece4Schristos- 0.3.3 : clarifications for field Block_Size
1722*3117ece4Schristos- 0.3.2 : remove additional block size restriction on compressed blocks
1723*3117ece4Schristos- 0.3.1 : minor clarification regarding offset history update rules
1724*3117ece4Schristos- 0.3.0 : minor edits to match RFC8478
1725*3117ece4Schristos- 0.2.9 : clarifications for huffman weights direct representation, by Ulrich Kunitz
1726*3117ece4Schristos- 0.2.8 : clarifications for IETF RFC discuss
1727*3117ece4Schristos- 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell
1728*3117ece4Schristos- 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz
1729*3117ece4Schristos- 0.2.5 : minor typos and clarifications
1730*3117ece4Schristos- 0.2.4 : section restructuring, by Sean Purcell
1731*3117ece4Schristos- 0.2.3 : clarified several details, by Sean Purcell
1732*3117ece4Schristos- 0.2.2 : added predefined codes, by Johannes Rudolph
1733*3117ece4Schristos- 0.2.1 : clarify field names, by Przemyslaw Skibinski
1734*3117ece4Schristos- 0.2.0 : numerous format adjustments for zstd v0.8+
1735*3117ece4Schristos- 0.1.2 : limit Huffman tree depth to 11 bits
1736*3117ece4Schristos- 0.1.1 : reserved dictID ranges
1737*3117ece4Schristos- 0.1.0 : initial release
1738