xref: /openbsd-src/lib/libz/algorithm.doc (revision 36f395ce93ef6e98c75f0b713640c56c765ef52e)
19baa294fSmillert1. Compression algorithm (deflate)
29baa294fSmillert
315ce0796SmillertThe deflation algorithm used by gzip (also zip and zlib) is a variation of
49baa294fSmillertLZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in
59baa294fSmillertthe input data.  The second occurrence of a string is replaced by a
69baa294fSmillertpointer to the previous string, in the form of a pair (distance,
79baa294fSmillertlength).  Distances are limited to 32K bytes, and lengths are limited
89baa294fSmillertto 258 bytes. When a string does not occur anywhere in the previous
99baa294fSmillert32K bytes, it is emitted as a sequence of literal bytes.  (In this
109baa294fSmillertdescription, `string' must be taken as an arbitrary sequence of bytes,
119baa294fSmillertand is not restricted to printable characters.)
129baa294fSmillert
139baa294fSmillertLiterals or match lengths are compressed with one Huffman tree, and
149baa294fSmillertmatch distances are compressed with another tree. The trees are stored
159baa294fSmillertin a compact form at the start of each block. The blocks can have any
169baa294fSmillertsize (except that the compressed data for one block must fit in
179baa294fSmillertavailable memory). A block is terminated when deflate() determines that
189baa294fSmillertit would be useful to start another block with fresh trees. (This is
199baa294fSmillertsomewhat similar to the behavior of LZW-based _compress_.)
209baa294fSmillert
219baa294fSmillertDuplicated strings are found using a hash table. All input strings of
229baa294fSmillertlength 3 are inserted in the hash table. A hash index is computed for
239baa294fSmillertthe next 3 bytes. If the hash chain for this index is not empty, all
249baa294fSmillertstrings in the chain are compared with the current input string, and
259baa294fSmillertthe longest match is selected.
269baa294fSmillert
279baa294fSmillertThe hash chains are searched starting with the most recent strings, to
289baa294fSmillertfavor small distances and thus take advantage of the Huffman encoding.
299baa294fSmillertThe hash chains are singly linked. There are no deletions from the
309baa294fSmillerthash chains, the algorithm simply discards matches that are too old.
319baa294fSmillert
329baa294fSmillertTo avoid a worst-case situation, very long hash chains are arbitrarily
339baa294fSmillerttruncated at a certain length, determined by a runtime option (level
349baa294fSmillertparameter of deflateInit). So deflate() does not always find the longest
359baa294fSmillertpossible match but generally finds a match which is long enough.
369baa294fSmillert
379baa294fSmillertdeflate() also defers the selection of matches with a lazy evaluation
3815ce0796Smillertmechanism. After a match of length N has been found, deflate() searches for
3915ce0796Smillerta longer match at the next input byte. If a longer match is found, the
409baa294fSmillertprevious match is truncated to a length of one (thus producing a single
4115ce0796Smillertliteral byte) and the process of lazy evaluation begins again. Otherwise,
4215ce0796Smillertthe original match is kept, and the next match search is attempted only N
4315ce0796Smillertsteps later.
449baa294fSmillert
459baa294fSmillertThe lazy match evaluation is also subject to a runtime parameter. If
469baa294fSmillertthe current match is long enough, deflate() reduces the search for a longer
479baa294fSmillertmatch, thus speeding up the whole process. If compression ratio is more
489baa294fSmillertimportant than speed, deflate() attempts a complete second search even if
499baa294fSmillertthe first match is already long enough.
509baa294fSmillert
519baa294fSmillertThe lazy match evaluation is not performed for the fastest compression
529baa294fSmillertmodes (level parameter 1 to 3). For these fast modes, new strings
539baa294fSmillertare inserted in the hash table only when no match was found, or
549baa294fSmillertwhen the match is not too long. This degrades the compression ratio
559baa294fSmillertbut saves time since there are both fewer insertions and fewer searches.
569baa294fSmillert
579baa294fSmillert
589baa294fSmillert2. Decompression algorithm (inflate)
599baa294fSmillert
6015ce0796Smillert2.1 Introduction
6115ce0796Smillert
6285c48e79ShenningThe key question is how to represent a Huffman code (or any prefix code) so
6385c48e79Shenningthat you can decode fast.  The most important characteristic is that shorter
6485c48e79Shenningcodes are much more common than longer codes, so pay attention to decoding the
6585c48e79Shenningshort codes fast, and let the long codes take longer to decode.
669baa294fSmillert
679baa294fSmillertinflate() sets up a first level table that covers some number of bits of
689baa294fSmillertinput less than the length of longest code.  It gets that many bits from the
699baa294fSmillertstream, and looks it up in the table.  The table will tell if the next
709baa294fSmillertcode is that many bits or less and how many, and if it is, it will tell
719baa294fSmillertthe value, else it will point to the next level table for which inflate()
729baa294fSmillertgrabs more bits and tries to decode a longer code.
739baa294fSmillert
749baa294fSmillertHow many bits to make the first lookup is a tradeoff between the time it
759baa294fSmillerttakes to decode and the time it takes to build the table.  If building the
769baa294fSmillerttable took no time (and if you had infinite memory), then there would only
779baa294fSmillertbe a first level table to cover all the way to the longest code.  However,
789baa294fSmillertbuilding the table ends up taking a lot longer for more bits since short
799baa294fSmillertcodes are replicated many times in such a table.  What inflate() does is
8085c48e79Shenningsimply to make the number of bits in the first table a variable, and  then
8185c48e79Shenningto set that variable for the maximum speed.
829baa294fSmillert
8385c48e79ShenningFor inflate, which has 286 possible codes for the literal/length tree, the size
8485c48e79Shenningof the first table is nine bits.  Also the distance trees have 30 possible
8585c48e79Shenningvalues, and the size of the first table is six bits.  Note that for each of
8685c48e79Shenningthose cases, the table ended up one bit longer than the ``average'' code
8785c48e79Shenninglength, i.e. the code length of an approximately flat code which would be a
8885c48e79Shenninglittle more than eight bits for 286 symbols and a little less than five bits
8985c48e79Shenningfor 30 symbols.
909baa294fSmillert
919baa294fSmillert
9215ce0796Smillert2.2 More details on the inflate table lookup
9315ce0796Smillert
9415ce0796SmillertOk, you want to know what this cleverly obfuscated inflate tree actually
9515ce0796Smillertlooks like.  You are correct that it's not a Huffman tree.  It is simply a
9615ce0796Smillertlookup table for the first, let's say, nine bits of a Huffman symbol.  The
9715ce0796Smillertsymbol could be as short as one bit or as long as 15 bits.  If a particular
9815ce0796Smillertsymbol is shorter than nine bits, then that symbol's translation is duplicated
9915ce0796Smillertin all those entries that start with that symbol's bits.  For example, if the
10015ce0796Smillertsymbol is four bits, then it's duplicated 32 times in a nine-bit table.  If a
10115ce0796Smillertsymbol is nine bits long, it appears in the table once.
10215ce0796Smillert
10315ce0796SmillertIf the symbol is longer than nine bits, then that entry in the table points
10415ce0796Smillertto another similar table for the remaining bits.  Again, there are duplicated
10515ce0796Smillertentries as needed.  The idea is that most of the time the symbol will be short
10615ce0796Smillertand there will only be one table look up.  (That's whole idea behind data
10715ce0796Smillertcompression in the first place.)  For the less frequent long symbols, there
10815ce0796Smillertwill be two lookups.  If you had a compression method with really long
10915ce0796Smillertsymbols, you could have as many levels of lookups as is efficient.  For
11015ce0796Smillertinflate, two is enough.
11115ce0796Smillert
11215ce0796SmillertSo a table entry either points to another table (in which case nine bits in
11315ce0796Smillertthe above example are gobbled), or it contains the translation for the symbol
11415ce0796Smillertand the number of bits to gobble.  Then you start again with the next
11515ce0796Smillertungobbled bit.
11615ce0796Smillert
11715ce0796SmillertYou may wonder: why not just have one lookup table for how ever many bits the
11815ce0796Smillertlongest symbol is?  The reason is that if you do that, you end up spending
11915ce0796Smillertmore time filling in duplicate symbol entries than you do actually decoding.
12015ce0796SmillertAt least for deflate's output that generates new trees every several 10's of
12115ce0796Smillertkbytes.  You can imagine that filling in a 2^15 entry table for a 15-bit code
12215ce0796Smillertwould take too long if you're only decoding several thousand symbols.  At the
12315ce0796Smillertother extreme, you could make a new table for every bit in the code.  In fact,
12415ce0796Smillertthat's essentially a Huffman tree.  But then you spend two much time
12515ce0796Smillerttraversing the tree while decoding, even for short symbols.
12615ce0796Smillert
12715ce0796SmillertSo the number of bits for the first lookup table is a trade of the time to
12815ce0796Smillertfill out the table vs. the time spent looking at the second level and above of
12915ce0796Smillertthe table.
13015ce0796Smillert
13115ce0796SmillertHere is an example, scaled down:
13215ce0796Smillert
13315ce0796SmillertThe code being decoded, with 10 symbols, from 1 to 6 bits long:
13415ce0796Smillert
13515ce0796SmillertA: 0
13615ce0796SmillertB: 10
13715ce0796SmillertC: 1100
13815ce0796SmillertD: 11010
13915ce0796SmillertE: 11011
14015ce0796SmillertF: 11100
14115ce0796SmillertG: 11101
14215ce0796SmillertH: 11110
14315ce0796SmillertI: 111110
14415ce0796SmillertJ: 111111
14515ce0796Smillert
14615ce0796SmillertLet's make the first table three bits long (eight entries):
14715ce0796Smillert
14815ce0796Smillert000: A,1
14915ce0796Smillert001: A,1
15015ce0796Smillert010: A,1
15115ce0796Smillert011: A,1
15215ce0796Smillert100: B,2
15315ce0796Smillert101: B,2
15415ce0796Smillert110: -> table X (gobble 3 bits)
15515ce0796Smillert111: -> table Y (gobble 3 bits)
15615ce0796Smillert
15785c48e79ShenningEach entry is what the bits decode as and how many bits that is, i.e. how
15815ce0796Smillertmany bits to gobble.  Or the entry points to another table, with the number of
15915ce0796Smillertbits to gobble implicit in the size of the table.
16015ce0796Smillert
16115ce0796SmillertTable X is two bits long since the longest code starting with 110 is five bits
16215ce0796Smillertlong:
16315ce0796Smillert
16415ce0796Smillert00: C,1
16515ce0796Smillert01: C,1
16615ce0796Smillert10: D,2
16715ce0796Smillert11: E,2
16815ce0796Smillert
16915ce0796SmillertTable Y is three bits long since the longest code starting with 111 is six
17015ce0796Smillertbits long:
17115ce0796Smillert
17215ce0796Smillert000: F,2
17315ce0796Smillert001: F,2
17415ce0796Smillert010: G,2
17515ce0796Smillert011: G,2
17615ce0796Smillert100: H,2
17715ce0796Smillert101: H,2
17815ce0796Smillert110: I,3
17915ce0796Smillert111: J,3
18015ce0796Smillert
18115ce0796SmillertSo what we have here are three tables with a total of 20 entries that had to
18215ce0796Smillertbe constructed.  That's compared to 64 entries for a single table.  Or
18315ce0796Smillertcompared to 16 entries for a Huffman tree (six two entry tables and one four
18415ce0796Smillertentry table).  Assuming that the code ideally represents the probability of
18515ce0796Smillertthe symbols, it takes on the average 1.25 lookups per symbol.  That's compared
18615ce0796Smillertto one lookup for the single table, or 1.66 lookups per symbol for the
18715ce0796SmillertHuffman tree.
18815ce0796Smillert
18915ce0796SmillertThere, I think that gives you a picture of what's going on.  For inflate, the
19015ce0796Smillertmeaning of a particular symbol is often more than just a letter.  It can be a
19115ce0796Smillertbyte (a "literal"), or it can be either a length or a distance which
19215ce0796Smillertindicates a base value and a number of bits to fetch after the code that is
19315ce0796Smillertadded to the base value.  Or it might be the special end-of-block code.  The
19415ce0796Smillertdata structures created in inftrees.c try to encode all that information
19515ce0796Smillertcompactly in the tables.
19615ce0796Smillert
19715ce0796Smillert
1989baa294fSmillertJean-loup Gailly        Mark Adler
19915ce0796Smillertjloup@gzip.org          madler@alumni.caltech.edu
2009baa294fSmillert
2019baa294fSmillert
2029baa294fSmillertReferences:
2039baa294fSmillert
2049baa294fSmillert[LZ77] Ziv J., Lempel A., ``A Universal Algorithm for Sequential Data
2059baa294fSmillertCompression,'' IEEE Transactions on Information Theory, Vol. 23, No. 3,
2069baa294fSmillertpp. 337-343.
2079baa294fSmillert
2089baa294fSmillert``DEFLATE Compressed Data Format Specification'' available in
20985c48e79Shenninghttp://www.ietf.org/rfc/rfc1951.txt
210*36f395ceStb1. Compression algorithm (deflate)
211*36f395ceStb
212*36f395ceStbThe deflation algorithm used by gzip (also zip and zlib) is a variation of
213*36f395ceStbLZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in
214*36f395ceStbthe input data.  The second occurrence of a string is replaced by a
215*36f395ceStbpointer to the previous string, in the form of a pair (distance,
216*36f395ceStblength).  Distances are limited to 32K bytes, and lengths are limited
217*36f395ceStbto 258 bytes. When a string does not occur anywhere in the previous
218*36f395ceStb32K bytes, it is emitted as a sequence of literal bytes.  (In this
219*36f395ceStbdescription, `string' must be taken as an arbitrary sequence of bytes,
220*36f395ceStband is not restricted to printable characters.)
221*36f395ceStb
222*36f395ceStbLiterals or match lengths are compressed with one Huffman tree, and
223*36f395ceStbmatch distances are compressed with another tree. The trees are stored
224*36f395ceStbin a compact form at the start of each block. The blocks can have any
225*36f395ceStbsize (except that the compressed data for one block must fit in
226*36f395ceStbavailable memory). A block is terminated when deflate() determines that
227*36f395ceStbit would be useful to start another block with fresh trees. (This is
228*36f395ceStbsomewhat similar to the behavior of LZW-based _compress_.)
229*36f395ceStb
230*36f395ceStbDuplicated strings are found using a hash table. All input strings of
231*36f395ceStblength 3 are inserted in the hash table. A hash index is computed for
232*36f395ceStbthe next 3 bytes. If the hash chain for this index is not empty, all
233*36f395ceStbstrings in the chain are compared with the current input string, and
234*36f395ceStbthe longest match is selected.
235*36f395ceStb
236*36f395ceStbThe hash chains are searched starting with the most recent strings, to
237*36f395ceStbfavor small distances and thus take advantage of the Huffman encoding.
238*36f395ceStbThe hash chains are singly linked. There are no deletions from the
239*36f395ceStbhash chains, the algorithm simply discards matches that are too old.
240*36f395ceStb
241*36f395ceStbTo avoid a worst-case situation, very long hash chains are arbitrarily
242*36f395ceStbtruncated at a certain length, determined by a runtime option (level
243*36f395ceStbparameter of deflateInit). So deflate() does not always find the longest
244*36f395ceStbpossible match but generally finds a match which is long enough.
245*36f395ceStb
246*36f395ceStbdeflate() also defers the selection of matches with a lazy evaluation
247*36f395ceStbmechanism. After a match of length N has been found, deflate() searches for
248*36f395ceStba longer match at the next input byte. If a longer match is found, the
249*36f395ceStbprevious match is truncated to a length of one (thus producing a single
250*36f395ceStbliteral byte) and the process of lazy evaluation begins again. Otherwise,
251*36f395ceStbthe original match is kept, and the next match search is attempted only N
252*36f395ceStbsteps later.
253*36f395ceStb
254*36f395ceStbThe lazy match evaluation is also subject to a runtime parameter. If
255*36f395ceStbthe current match is long enough, deflate() reduces the search for a longer
256*36f395ceStbmatch, thus speeding up the whole process. If compression ratio is more
257*36f395ceStbimportant than speed, deflate() attempts a complete second search even if
258*36f395ceStbthe first match is already long enough.
259*36f395ceStb
260*36f395ceStbThe lazy match evaluation is not performed for the fastest compression
261*36f395ceStbmodes (level parameter 1 to 3). For these fast modes, new strings
262*36f395ceStbare inserted in the hash table only when no match was found, or
263*36f395ceStbwhen the match is not too long. This degrades the compression ratio
264*36f395ceStbbut saves time since there are both fewer insertions and fewer searches.
265*36f395ceStb
266*36f395ceStb
267*36f395ceStb2. Decompression algorithm (inflate)
268*36f395ceStb
269*36f395ceStb2.1 Introduction
270*36f395ceStb
271*36f395ceStbThe key question is how to represent a Huffman code (or any prefix code) so
272*36f395ceStbthat you can decode fast.  The most important characteristic is that shorter
273*36f395ceStbcodes are much more common than longer codes, so pay attention to decoding the
274*36f395ceStbshort codes fast, and let the long codes take longer to decode.
275*36f395ceStb
276*36f395ceStbinflate() sets up a first level table that covers some number of bits of
277*36f395ceStbinput less than the length of longest code.  It gets that many bits from the
278*36f395ceStbstream, and looks it up in the table.  The table will tell if the next
279*36f395ceStbcode is that many bits or less and how many, and if it is, it will tell
280*36f395ceStbthe value, else it will point to the next level table for which inflate()
281*36f395ceStbgrabs more bits and tries to decode a longer code.
282*36f395ceStb
283*36f395ceStbHow many bits to make the first lookup is a tradeoff between the time it
284*36f395ceStbtakes to decode and the time it takes to build the table.  If building the
285*36f395ceStbtable took no time (and if you had infinite memory), then there would only
286*36f395ceStbbe a first level table to cover all the way to the longest code.  However,
287*36f395ceStbbuilding the table ends up taking a lot longer for more bits since short
288*36f395ceStbcodes are replicated many times in such a table.  What inflate() does is
289*36f395ceStbsimply to make the number of bits in the first table a variable, and  then
290*36f395ceStbto set that variable for the maximum speed.
291*36f395ceStb
292*36f395ceStbFor inflate, which has 286 possible codes for the literal/length tree, the size
293*36f395ceStbof the first table is nine bits.  Also the distance trees have 30 possible
294*36f395ceStbvalues, and the size of the first table is six bits.  Note that for each of
295*36f395ceStbthose cases, the table ended up one bit longer than the ``average'' code
296*36f395ceStblength, i.e. the code length of an approximately flat code which would be a
297*36f395ceStblittle more than eight bits for 286 symbols and a little less than five bits
298*36f395ceStbfor 30 symbols.
299*36f395ceStb
300*36f395ceStb
301*36f395ceStb2.2 More details on the inflate table lookup
302*36f395ceStb
303*36f395ceStbOk, you want to know what this cleverly obfuscated inflate tree actually
304*36f395ceStblooks like.  You are correct that it's not a Huffman tree.  It is simply a
305*36f395ceStblookup table for the first, let's say, nine bits of a Huffman symbol.  The
306*36f395ceStbsymbol could be as short as one bit or as long as 15 bits.  If a particular
307*36f395ceStbsymbol is shorter than nine bits, then that symbol's translation is duplicated
308*36f395ceStbin all those entries that start with that symbol's bits.  For example, if the
309*36f395ceStbsymbol is four bits, then it's duplicated 32 times in a nine-bit table.  If a
310*36f395ceStbsymbol is nine bits long, it appears in the table once.
311*36f395ceStb
312*36f395ceStbIf the symbol is longer than nine bits, then that entry in the table points
313*36f395ceStbto another similar table for the remaining bits.  Again, there are duplicated
314*36f395ceStbentries as needed.  The idea is that most of the time the symbol will be short
315*36f395ceStband there will only be one table look up.  (That's whole idea behind data
316*36f395ceStbcompression in the first place.)  For the less frequent long symbols, there
317*36f395ceStbwill be two lookups.  If you had a compression method with really long
318*36f395ceStbsymbols, you could have as many levels of lookups as is efficient.  For
319*36f395ceStbinflate, two is enough.
320*36f395ceStb
321*36f395ceStbSo a table entry either points to another table (in which case nine bits in
322*36f395ceStbthe above example are gobbled), or it contains the translation for the symbol
323*36f395ceStband the number of bits to gobble.  Then you start again with the next
324*36f395ceStbungobbled bit.
325*36f395ceStb
326*36f395ceStbYou may wonder: why not just have one lookup table for how ever many bits the
327*36f395ceStblongest symbol is?  The reason is that if you do that, you end up spending
328*36f395ceStbmore time filling in duplicate symbol entries than you do actually decoding.
329*36f395ceStbAt least for deflate's output that generates new trees every several 10's of
330*36f395ceStbkbytes.  You can imagine that filling in a 2^15 entry table for a 15-bit code
331*36f395ceStbwould take too long if you're only decoding several thousand symbols.  At the
332*36f395ceStbother extreme, you could make a new table for every bit in the code.  In fact,
333*36f395ceStbthat's essentially a Huffman tree.  But then you spend two much time
334*36f395ceStbtraversing the tree while decoding, even for short symbols.
335*36f395ceStb
336*36f395ceStbSo the number of bits for the first lookup table is a trade of the time to
337*36f395ceStbfill out the table vs. the time spent looking at the second level and above of
338*36f395ceStbthe table.
339*36f395ceStb
340*36f395ceStbHere is an example, scaled down:
341*36f395ceStb
342*36f395ceStbThe code being decoded, with 10 symbols, from 1 to 6 bits long:
343*36f395ceStb
344*36f395ceStbA: 0
345*36f395ceStbB: 10
346*36f395ceStbC: 1100
347*36f395ceStbD: 11010
348*36f395ceStbE: 11011
349*36f395ceStbF: 11100
350*36f395ceStbG: 11101
351*36f395ceStbH: 11110
352*36f395ceStbI: 111110
353*36f395ceStbJ: 111111
354*36f395ceStb
355*36f395ceStbLet's make the first table three bits long (eight entries):
356*36f395ceStb
357*36f395ceStb000: A,1
358*36f395ceStb001: A,1
359*36f395ceStb010: A,1
360*36f395ceStb011: A,1
361*36f395ceStb100: B,2
362*36f395ceStb101: B,2
363*36f395ceStb110: -> table X (gobble 3 bits)
364*36f395ceStb111: -> table Y (gobble 3 bits)
365*36f395ceStb
366*36f395ceStbEach entry is what the bits decode as and how many bits that is, i.e. how
367*36f395ceStbmany bits to gobble.  Or the entry points to another table, with the number of
368*36f395ceStbbits to gobble implicit in the size of the table.
369*36f395ceStb
370*36f395ceStbTable X is two bits long since the longest code starting with 110 is five bits
371*36f395ceStblong:
372*36f395ceStb
373*36f395ceStb00: C,1
374*36f395ceStb01: C,1
375*36f395ceStb10: D,2
376*36f395ceStb11: E,2
377*36f395ceStb
378*36f395ceStbTable Y is three bits long since the longest code starting with 111 is six
379*36f395ceStbbits long:
380*36f395ceStb
381*36f395ceStb000: F,2
382*36f395ceStb001: F,2
383*36f395ceStb010: G,2
384*36f395ceStb011: G,2
385*36f395ceStb100: H,2
386*36f395ceStb101: H,2
387*36f395ceStb110: I,3
388*36f395ceStb111: J,3
389*36f395ceStb
390*36f395ceStbSo what we have here are three tables with a total of 20 entries that had to
391*36f395ceStbbe constructed.  That's compared to 64 entries for a single table.  Or
392*36f395ceStbcompared to 16 entries for a Huffman tree (six two entry tables and one four
393*36f395ceStbentry table).  Assuming that the code ideally represents the probability of
394*36f395ceStbthe symbols, it takes on the average 1.25 lookups per symbol.  That's compared
395*36f395ceStbto one lookup for the single table, or 1.66 lookups per symbol for the
396*36f395ceStbHuffman tree.
397*36f395ceStb
398*36f395ceStbThere, I think that gives you a picture of what's going on.  For inflate, the
399*36f395ceStbmeaning of a particular symbol is often more than just a letter.  It can be a
400*36f395ceStbbyte (a "literal"), or it can be either a length or a distance which
401*36f395ceStbindicates a base value and a number of bits to fetch after the code that is
402*36f395ceStbadded to the base value.  Or it might be the special end-of-block code.  The
403*36f395ceStbdata structures created in inftrees.c try to encode all that information
404*36f395ceStbcompactly in the tables.
405*36f395ceStb
406*36f395ceStb
407*36f395ceStbJean-loup Gailly        Mark Adler
408*36f395ceStbjloup@gzip.org          madler@alumni.caltech.edu
409*36f395ceStb
410*36f395ceStb
411*36f395ceStbReferences:
412*36f395ceStb
413*36f395ceStb[LZ77] Ziv J., Lempel A., ``A Universal Algorithm for Sequential Data
414*36f395ceStbCompression,'' IEEE Transactions on Information Theory, Vol. 23, No. 3,
415*36f395ceStbpp. 337-343.
416*36f395ceStb
417*36f395ceStb``DEFLATE Compressed Data Format Specification'' available in
418*36f395ceStbhttp://www.ietf.org/rfc/rfc1951.txt
419*36f395ceStb1. Compression algorithm (deflate)
420*36f395ceStb
421*36f395ceStbThe deflation algorithm used by gzip (also zip and zlib) is a variation of
422*36f395ceStbLZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in
423*36f395ceStbthe input data.  The second occurrence of a string is replaced by a
424*36f395ceStbpointer to the previous string, in the form of a pair (distance,
425*36f395ceStblength).  Distances are limited to 32K bytes, and lengths are limited
426*36f395ceStbto 258 bytes. When a string does not occur anywhere in the previous
427*36f395ceStb32K bytes, it is emitted as a sequence of literal bytes.  (In this
428*36f395ceStbdescription, `string' must be taken as an arbitrary sequence of bytes,
429*36f395ceStband is not restricted to printable characters.)
430*36f395ceStb
431*36f395ceStbLiterals or match lengths are compressed with one Huffman tree, and
432*36f395ceStbmatch distances are compressed with another tree. The trees are stored
433*36f395ceStbin a compact form at the start of each block. The blocks can have any
434*36f395ceStbsize (except that the compressed data for one block must fit in
435*36f395ceStbavailable memory). A block is terminated when deflate() determines that
436*36f395ceStbit would be useful to start another block with fresh trees. (This is
437*36f395ceStbsomewhat similar to the behavior of LZW-based _compress_.)
438*36f395ceStb
439*36f395ceStbDuplicated strings are found using a hash table. All input strings of
440*36f395ceStblength 3 are inserted in the hash table. A hash index is computed for
441*36f395ceStbthe next 3 bytes. If the hash chain for this index is not empty, all
442*36f395ceStbstrings in the chain are compared with the current input string, and
443*36f395ceStbthe longest match is selected.
444*36f395ceStb
445*36f395ceStbThe hash chains are searched starting with the most recent strings, to
446*36f395ceStbfavor small distances and thus take advantage of the Huffman encoding.
447*36f395ceStbThe hash chains are singly linked. There are no deletions from the
448*36f395ceStbhash chains, the algorithm simply discards matches that are too old.
449*36f395ceStb
450*36f395ceStbTo avoid a worst-case situation, very long hash chains are arbitrarily
451*36f395ceStbtruncated at a certain length, determined by a runtime option (level
452*36f395ceStbparameter of deflateInit). So deflate() does not always find the longest
453*36f395ceStbpossible match but generally finds a match which is long enough.
454*36f395ceStb
455*36f395ceStbdeflate() also defers the selection of matches with a lazy evaluation
456*36f395ceStbmechanism. After a match of length N has been found, deflate() searches for
457*36f395ceStba longer match at the next input byte. If a longer match is found, the
458*36f395ceStbprevious match is truncated to a length of one (thus producing a single
459*36f395ceStbliteral byte) and the process of lazy evaluation begins again. Otherwise,
460*36f395ceStbthe original match is kept, and the next match search is attempted only N
461*36f395ceStbsteps later.
462*36f395ceStb
463*36f395ceStbThe lazy match evaluation is also subject to a runtime parameter. If
464*36f395ceStbthe current match is long enough, deflate() reduces the search for a longer
465*36f395ceStbmatch, thus speeding up the whole process. If compression ratio is more
466*36f395ceStbimportant than speed, deflate() attempts a complete second search even if
467*36f395ceStbthe first match is already long enough.
468*36f395ceStb
469*36f395ceStbThe lazy match evaluation is not performed for the fastest compression
470*36f395ceStbmodes (level parameter 1 to 3). For these fast modes, new strings
471*36f395ceStbare inserted in the hash table only when no match was found, or
472*36f395ceStbwhen the match is not too long. This degrades the compression ratio
473*36f395ceStbbut saves time since there are both fewer insertions and fewer searches.
474*36f395ceStb
475*36f395ceStb
476*36f395ceStb2. Decompression algorithm (inflate)
477*36f395ceStb
478*36f395ceStb2.1 Introduction
479*36f395ceStb
480*36f395ceStbThe key question is how to represent a Huffman code (or any prefix code) so
481*36f395ceStbthat you can decode fast.  The most important characteristic is that shorter
482*36f395ceStbcodes are much more common than longer codes, so pay attention to decoding the
483*36f395ceStbshort codes fast, and let the long codes take longer to decode.
484*36f395ceStb
485*36f395ceStbinflate() sets up a first level table that covers some number of bits of
486*36f395ceStbinput less than the length of longest code.  It gets that many bits from the
487*36f395ceStbstream, and looks it up in the table.  The table will tell if the next
488*36f395ceStbcode is that many bits or less and how many, and if it is, it will tell
489*36f395ceStbthe value, else it will point to the next level table for which inflate()
490*36f395ceStbgrabs more bits and tries to decode a longer code.
491*36f395ceStb
492*36f395ceStbHow many bits to make the first lookup is a tradeoff between the time it
493*36f395ceStbtakes to decode and the time it takes to build the table.  If building the
494*36f395ceStbtable took no time (and if you had infinite memory), then there would only
495*36f395ceStbbe a first level table to cover all the way to the longest code.  However,
496*36f395ceStbbuilding the table ends up taking a lot longer for more bits since short
497*36f395ceStbcodes are replicated many times in such a table.  What inflate() does is
498*36f395ceStbsimply to make the number of bits in the first table a variable, and  then
499*36f395ceStbto set that variable for the maximum speed.
500*36f395ceStb
501*36f395ceStbFor inflate, which has 286 possible codes for the literal/length tree, the size
502*36f395ceStbof the first table is nine bits.  Also the distance trees have 30 possible
503*36f395ceStbvalues, and the size of the first table is six bits.  Note that for each of
504*36f395ceStbthose cases, the table ended up one bit longer than the ``average'' code
505*36f395ceStblength, i.e. the code length of an approximately flat code which would be a
506*36f395ceStblittle more than eight bits for 286 symbols and a little less than five bits
507*36f395ceStbfor 30 symbols.
508*36f395ceStb
509*36f395ceStb
510*36f395ceStb2.2 More details on the inflate table lookup
511*36f395ceStb
512*36f395ceStbOk, you want to know what this cleverly obfuscated inflate tree actually
513*36f395ceStblooks like.  You are correct that it's not a Huffman tree.  It is simply a
514*36f395ceStblookup table for the first, let's say, nine bits of a Huffman symbol.  The
515*36f395ceStbsymbol could be as short as one bit or as long as 15 bits.  If a particular
516*36f395ceStbsymbol is shorter than nine bits, then that symbol's translation is duplicated
517*36f395ceStbin all those entries that start with that symbol's bits.  For example, if the
518*36f395ceStbsymbol is four bits, then it's duplicated 32 times in a nine-bit table.  If a
519*36f395ceStbsymbol is nine bits long, it appears in the table once.
520*36f395ceStb
521*36f395ceStbIf the symbol is longer than nine bits, then that entry in the table points
522*36f395ceStbto another similar table for the remaining bits.  Again, there are duplicated
523*36f395ceStbentries as needed.  The idea is that most of the time the symbol will be short
524*36f395ceStband there will only be one table look up.  (That's whole idea behind data
525*36f395ceStbcompression in the first place.)  For the less frequent long symbols, there
526*36f395ceStbwill be two lookups.  If you had a compression method with really long
527*36f395ceStbsymbols, you could have as many levels of lookups as is efficient.  For
528*36f395ceStbinflate, two is enough.
529*36f395ceStb
530*36f395ceStbSo a table entry either points to another table (in which case nine bits in
531*36f395ceStbthe above example are gobbled), or it contains the translation for the symbol
532*36f395ceStband the number of bits to gobble.  Then you start again with the next
533*36f395ceStbungobbled bit.
534*36f395ceStb
535*36f395ceStbYou may wonder: why not just have one lookup table for how ever many bits the
536*36f395ceStblongest symbol is?  The reason is that if you do that, you end up spending
537*36f395ceStbmore time filling in duplicate symbol entries than you do actually decoding.
538*36f395ceStbAt least for deflate's output that generates new trees every several 10's of
539*36f395ceStbkbytes.  You can imagine that filling in a 2^15 entry table for a 15-bit code
540*36f395ceStbwould take too long if you're only decoding several thousand symbols.  At the
541*36f395ceStbother extreme, you could make a new table for every bit in the code.  In fact,
542*36f395ceStbthat's essentially a Huffman tree.  But then you spend two much time
543*36f395ceStbtraversing the tree while decoding, even for short symbols.
544*36f395ceStb
545*36f395ceStbSo the number of bits for the first lookup table is a trade of the time to
546*36f395ceStbfill out the table vs. the time spent looking at the second level and above of
547*36f395ceStbthe table.
548*36f395ceStb
549*36f395ceStbHere is an example, scaled down:
550*36f395ceStb
551*36f395ceStbThe code being decoded, with 10 symbols, from 1 to 6 bits long:
552*36f395ceStb
553*36f395ceStbA: 0
554*36f395ceStbB: 10
555*36f395ceStbC: 1100
556*36f395ceStbD: 11010
557*36f395ceStbE: 11011
558*36f395ceStbF: 11100
559*36f395ceStbG: 11101
560*36f395ceStbH: 11110
561*36f395ceStbI: 111110
562*36f395ceStbJ: 111111
563*36f395ceStb
564*36f395ceStbLet's make the first table three bits long (eight entries):
565*36f395ceStb
566*36f395ceStb000: A,1
567*36f395ceStb001: A,1
568*36f395ceStb010: A,1
569*36f395ceStb011: A,1
570*36f395ceStb100: B,2
571*36f395ceStb101: B,2
572*36f395ceStb110: -> table X (gobble 3 bits)
573*36f395ceStb111: -> table Y (gobble 3 bits)
574*36f395ceStb
575*36f395ceStbEach entry is what the bits decode as and how many bits that is, i.e. how
576*36f395ceStbmany bits to gobble.  Or the entry points to another table, with the number of
577*36f395ceStbbits to gobble implicit in the size of the table.
578*36f395ceStb
579*36f395ceStbTable X is two bits long since the longest code starting with 110 is five bits
580*36f395ceStblong:
581*36f395ceStb
582*36f395ceStb00: C,1
583*36f395ceStb01: C,1
584*36f395ceStb10: D,2
585*36f395ceStb11: E,2
586*36f395ceStb
587*36f395ceStbTable Y is three bits long since the longest code starting with 111 is six
588*36f395ceStbbits long:
589*36f395ceStb
590*36f395ceStb000: F,2
591*36f395ceStb001: F,2
592*36f395ceStb010: G,2
593*36f395ceStb011: G,2
594*36f395ceStb100: H,2
595*36f395ceStb101: H,2
596*36f395ceStb110: I,3
597*36f395ceStb111: J,3
598*36f395ceStb
599*36f395ceStbSo what we have here are three tables with a total of 20 entries that had to
600*36f395ceStbbe constructed.  That's compared to 64 entries for a single table.  Or
601*36f395ceStbcompared to 16 entries for a Huffman tree (six two entry tables and one four
602*36f395ceStbentry table).  Assuming that the code ideally represents the probability of
603*36f395ceStbthe symbols, it takes on the average 1.25 lookups per symbol.  That's compared
604*36f395ceStbto one lookup for the single table, or 1.66 lookups per symbol for the
605*36f395ceStbHuffman tree.
606*36f395ceStb
607*36f395ceStbThere, I think that gives you a picture of what's going on.  For inflate, the
608*36f395ceStbmeaning of a particular symbol is often more than just a letter.  It can be a
609*36f395ceStbbyte (a "literal"), or it can be either a length or a distance which
610*36f395ceStbindicates a base value and a number of bits to fetch after the code that is
611*36f395ceStbadded to the base value.  Or it might be the special end-of-block code.  The
612*36f395ceStbdata structures created in inftrees.c try to encode all that information
613*36f395ceStbcompactly in the tables.
614*36f395ceStb
615*36f395ceStb
616*36f395ceStbJean-loup Gailly        Mark Adler
617*36f395ceStbjloup@gzip.org          madler@alumni.caltech.edu
618*36f395ceStb
619*36f395ceStb
620*36f395ceStbReferences:
621*36f395ceStb
622*36f395ceStb[LZ77] Ziv J., Lempel A., ``A Universal Algorithm for Sequential Data
623*36f395ceStbCompression,'' IEEE Transactions on Information Theory, Vol. 23, No. 3,
624*36f395ceStbpp. 337-343.
625*36f395ceStb
626*36f395ceStb``DEFLATE Compressed Data Format Specification'' available in
627*36f395ceStbhttp://www.ietf.org/rfc/rfc1951.txt
628