lib/libz/algorithm.doc

1. Compression algorithm (deflate)

The deflation algorithm used by gzip (also zip and zlib) is a variation of
LZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in
the input data.  The second occurrence of a string is replaced by a
pointer to the previous string, in the form of a pair (distance,
length).  Distances are limited to 32K bytes, and lengths are limited
to 258 bytes. When a string does not occur anywhere in the previous
32K bytes, it is emitted as a sequence of literal bytes.  (In this
description, `string' must be taken as an arbitrary sequence of bytes,
and is not restricted to printable characters.)

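To make the (distance, length) pair concrete, here is a minimal sketch in C
of how a decoder could expand one such pair.  The name copy_match and its
signature are invented for this example and are not part of zlib.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Append `length` bytes to out[0..pos) by copying from `distance` bytes
 * back in the output already produced.  Returns the new output position.
 * The copy is done byte by byte on purpose: when length > distance the
 * source overlaps the bytes being written, which repeats a short pattern.
 */
size_t copy_match(unsigned char *out, size_t pos, size_t distance,
                  size_t length)
{
    assert(distance >= 1 && distance <= pos);  /* must point at earlier output */
    for (size_t i = 0; i < length; i++) {
        out[pos] = out[pos - distance];
        pos++;
    }
    return pos;
}
```

With "abc" already in the output, the pair (distance 3, length 6) yields
"abcabcabc".  The caller is assumed to have room for `length` more bytes.
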
Literals or match lengths are compressed with one Huffman tree, and
match distances are compressed with another tree. The trees are stored
in a compact form at the start of each block. The blocks can have any
size (except that the compressed data for one block must fit in
available memory). A block is terminated when deflate() determines that
it would be useful to start another block with fresh trees. (This is
somewhat similar to the behavior of LZW-based _compress_.)

Duplicated strings are found using a hash table. All input strings of
length 3 are inserted in the hash table. A hash index is computed for
the next 3 bytes. If the hash chain for this index is not empty, all
strings in the chain are compared with the current input string, and
the longest match is selected.

The hash chains are searched starting with the most recent strings, to
favor small distances and thus take advantage of the Huffman encoding.
The hash chains are singly linked. There are no deletions from the
hash chains; the algorithm simply discards matches that are too old.

To avoid a worst-case situation, very long hash chains are arbitrarily
truncated at a certain length, determined by a runtime option (level
parameter of deflateInit). So deflate() does not always find the longest
possible match but generally finds a match which is long enough.

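As a rough illustration of the hash chains described above, here is a
minimal sketch in C.  It is not zlib's code: the names (matcher, hash3,
matcher_insert, matcher_longest), the hash function and the NIL marker are
made up for this example, and the whole input is assumed to fit in one 32K
window, so no chain entry ever becomes too old.  The max_chain argument
stands in for the chain length limit that the compression level selects.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WSIZE     32768u            /* distances are limited to 32K */
#define HASH_BITS 15
#define HASH_SIZE (1u << HASH_BITS)
#define MIN_MATCH 3
#define MAX_MATCH 258
#define NIL       0xFFFFu           /* assumed "end of chain" marker */

typedef struct {
    const unsigned char *win;       /* input, at most WSIZE bytes in this sketch */
    size_t len;                     /* number of valid bytes in win */
    unsigned short head[HASH_SIZE]; /* newest position for each hash value */
    unsigned short prev[WSIZE];     /* older position with the same hash */
} matcher;

/* Hypothetical hash of three bytes; zlib's real hash function differs. */
static unsigned hash3(const unsigned char *p)
{
    return ((p[0] << 10) ^ (p[1] << 5) ^ p[2]) & (HASH_SIZE - 1);
}

void matcher_init(matcher *m, const unsigned char *buf, size_t len)
{
    assert(len <= WSIZE);
    m->win = buf;
    m->len = len;
    memset(m->head, 0xFF, sizeof m->head);  /* every chain starts out as NIL */
}

/* Put the 3-byte string at pos at the front of its (singly linked) chain. */
void matcher_insert(matcher *m, size_t pos)
{
    if (pos + MIN_MATCH > m->len)
        return;                             /* fewer than 3 bytes left */
    unsigned h = hash3(m->win + pos);
    m->prev[pos] = m->head[h];
    m->head[h] = (unsigned short)pos;
}

/*
 * Walk the chain for the string at pos, newest entry first, looking at no
 * more than max_chain candidates.  Returns the best match length (0 if it
 * is shorter than MIN_MATCH) and stores its distance in *dist.
 */
size_t matcher_longest(const matcher *m, size_t pos, unsigned max_chain,
                       size_t *dist)
{
    size_t best = 0;
    if (pos + MIN_MATCH > m->len)
        return 0;
    size_t limit = m->len - pos;
    if (limit > MAX_MATCH)
        limit = MAX_MATCH;
    unsigned cur = m->head[hash3(m->win + pos)];
    while (cur != NIL && max_chain-- > 0) {
        size_t n = 0;
        while (n < limit && m->win[cur + n] == m->win[pos + n])
            n++;
        if (n > best) {
            best = n;
            *dist = pos - cur;
        }
        cur = m->prev[cur];
    }
    return best >= MIN_MATCH ? best : 0;
}
```

Call matcher_longest() for a position before matcher_insert() for that same
position; otherwise the newest chain entry is the current string itself.
With a small max_chain the search stops early and may settle for a shorter
match, which is the tradeoff described above.
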
deflate() also defers the selection of matches with a lazy evaluation
mechanism. After a match of length N has been found, deflate() searches for
a longer match at the next input byte. If a longer match is found, the
previous match is truncated to a length of one (thus producing a single
literal byte) and the process of lazy evaluation begins again. Otherwise,
the original match is kept, and the next match search is attempted only N
steps later.

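The lazy evaluation loop can be sketched as follows.  This is only an
illustration under simplifying assumptions: find_match is a brute-force
stand-in for the hash chain search, the token type and the function names
are invented here, and the runtime parameters that tune the real search
are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_MATCH 3
#define MAX_MATCH 258
#define MAX_DIST  32768

/* len == 0 means "literal byte lit", otherwise a (dist, len) match. */
typedef struct { size_t dist; size_t len; unsigned char lit; } token;

/* Brute-force longest match at pos; a stand-in for the hash chains. */
static size_t find_match(const unsigned char *s, size_t n, size_t pos,
                         size_t *dist)
{
    size_t best = 0;
    size_t start = pos > MAX_DIST ? pos - MAX_DIST : 0;
    for (size_t cand = pos; cand-- > start; ) {   /* nearest candidates first */
        size_t len = 0;
        while (pos + len < n && len < MAX_MATCH &&
               s[cand + len] == s[pos + len])
            len++;
        if (len > best) {
            best = len;
            *dist = pos - cand;
        }
    }
    return best >= MIN_MATCH ? best : 0;
}

/* Tokenize s[0..n) into out[], which must have room for n tokens. */
size_t lazy_tokenize(const unsigned char *s, size_t n, token *out)
{
    size_t count = 0, pos = 0;
    while (pos < n) {
        size_t dist = 0;
        size_t len = find_match(s, n, pos, &dist);
        /* While a strictly longer match starts one byte later, cut the
           current match down to a single literal and try again. */
        while (len != 0 && pos + 1 < n) {
            size_t ndist = 0;
            size_t nlen = find_match(s, n, pos + 1, &ndist);
            if (nlen <= len)
                break;                      /* keep the original match */
            out[count++] = (token){0, 0, s[pos]};
            pos++;
            dist = ndist;
            len = nlen;
        }
        if (len == 0) {
            out[count++] = (token){0, 0, s[pos]};
            pos++;
        } else {
            out[count++] = (token){dist, len, 0};
            pos += len;                     /* next search only len steps later */
        }
    }
    return count;
}
```

For the input "abc-bcde-abcde" the final "abcde" first matches "abc"
(length 3), but one byte later "bcde" matches with length 4, so the sketch
emits the literal 'a' followed by the longer match.
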
The lazy match evaluation is also subject to a runtime parameter. If
the current match is long enough, deflate() reduces the search for a longer
match, thus speeding up the whole process. If compression ratio is more
important than speed, deflate() attempts a complete second search even if
the first match is already long enough.

The lazy match evaluation is not performed for the fastest compression
modes (level parameter 1 to 3). For these fast modes, new strings
are inserted in the hash table only when no match was found, or
when the match is not too long. This degrades the compression ratio
but saves time since there are both fewer insertions and fewer searches.


2. Decompression algorithm (inflate)

2.1 Introduction

The key question is how to represent a Huffman code (or any prefix code) so
that you can decode fast.  The most important characteristic is that shorter
codes are much more common than longer codes, so pay attention to decoding the
short codes fast, and let the long codes take longer to decode.

inflate() sets up a first level table that covers some number of bits of
input less than the length of the longest code.  It gets that many bits from
the stream, and looks them up in the table.  The table will tell if the next
code is that many bits or less and how many, and if it is, it will tell
the value, else it will point to the next level table for which inflate()
grabs more bits and tries to decode a longer code.

How many bits to make the first lookup is a tradeoff between the time it
takes to decode and the time it takes to build the table.  If building the
table took no time (and if you had infinite memory), then there would only
be a first level table to cover all the way to the longest code.  However,
building the table ends up taking a lot longer for more bits since short
codes are replicated many times in such a table.  What inflate() does is
simply to make the number of bits in the first table a variable, and then
to set that variable for the maximum speed.

For inflate, which has 286 possible codes for the literal/length tree, the size
of the first table is nine bits.  Also the distance trees have 30 possible
values, and the size of the first table is six bits.  Note that for each of
those cases, the table ended up one bit longer than the ``average'' code
length, i.e. the code length of an approximately flat code which would be a
little more than eight bits for 286 symbols and a little less than five bits
for 30 symbols.


2.2 More details on the inflate table lookup

Ok, you want to know what this cleverly obfuscated inflate tree actually
looks like.  You are correct that it's not a Huffman tree.  It is simply a
lookup table for the first, let's say, nine bits of a Huffman symbol.  The
symbol could be as short as one bit or as long as 15 bits.  If a particular
symbol is shorter than nine bits, then that symbol's translation is duplicated
in all those entries that start with that symbol's bits.  For example, if the
symbol is four bits, then it's duplicated 32 times in a nine-bit table.  If a
symbol is nine bits long, it appears in the table once.

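The duplication can be sketched in a few lines of C.  fill_root is a
hypothetical helper, not a zlib function, and it assumes the code's bits
form the high-order bits of the table index; the bit order used by the real
inflate code may differ.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Store `value` in every entry of a 2^root_bits table whose index starts
 * with the len-bit code `code`.  Returns the number of entries written,
 * which is 2^(root_bits - len).
 */
size_t fill_root(unsigned short *table, unsigned root_bits,
                 unsigned code, unsigned len, unsigned short value)
{
    assert(len >= 1 && len <= root_bits);
    unsigned shift = root_bits - len;       /* bits not fixed by the code */
    size_t first = (size_t)code << shift;   /* code followed by all zeros */
    size_t count = (size_t)1 << shift;      /* every setting of the free bits */
    for (size_t i = 0; i < count; i++)
        table[first + i] = value;
    return count;
}
```

A four-bit code in a nine-bit table leaves five free bits, hence the 32
copies; a nine-bit code leaves none and is written once.
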
If the symbol is longer than nine bits, then that entry in the table points
to another similar table for the remaining bits.  Again, there are duplicated
entries as needed.  The idea is that most of the time the symbol will be short
and there will only be one table lookup.  (That's the whole idea behind data
compression in the first place.)  For the less frequent long symbols, there
will be two lookups.  If you had a compression method with really long
symbols, you could have as many levels of lookups as is efficient.  For
inflate, two is enough.

So a table entry either points to another table (in which case nine bits in
the above example are gobbled), or it contains the translation for the symbol
and the number of bits to gobble.  Then you start again with the next
ungobbled bit.

You may wonder: why not just have one lookup table for however many bits the
longest symbol is?  The reason is that if you do that, you end up spending
more time filling in duplicate symbol entries than you do actually decoding.
That is true at least for deflate's output, which generates new trees every
several tens of kbytes.  You can imagine that filling in a 2^15 entry table
for a 15-bit code would take too long if you're only decoding several
thousand symbols.  At the other extreme, you could make a new table for every
bit in the code.  In fact, that's essentially a Huffman tree.  But then you
spend too much time traversing the tree while decoding, even for short
symbols.

So the number of bits for the first lookup table is a tradeoff between the
time to fill out the table and the time spent looking at the second level and
above of the table.

Here is an example, scaled down:

The code being decoded, with 10 symbols, from 1 to 6 bits long:

A: 0
B: 10
C: 1100
D: 11010
E: 11011
F: 11100
G: 11101
H: 11110
I: 111110
J: 111111

Let's make the first table three bits long (eight entries):

000: A,1
001: A,1
010: A,1
011: A,1
100: B,2
101: B,2
110: -> table X (gobble 3 bits)
111: -> table Y (gobble 3 bits)

Each entry is what the bits decode as and how many bits that is, i.e. how
many bits to gobble.  Or the entry points to another table, with the number of
bits to gobble implicit in the size of the table.

Table X is two bits long since the longest code starting with 110 is five bits
long:

00: C,1
01: C,1
10: D,2
11: E,2

Table Y is three bits long since the longest code starting with 111 is six
bits long:

000: F,2
001: F,2
010: G,2
011: G,2
100: H,2
101: H,2
110: I,3
111: J,3

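The three tables above can be written out and exercised directly.  The
following C sketch is only an illustration: the entry type and the names
root, table_x, table_y, peek and decode are invented here, and the input
is a string of '0' and '1' characters read left to right rather than a
packed bit stream.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ROOT_BITS 3

typedef struct entry {
    char sym;                   /* decoded symbol, unused for a link */
    unsigned char bits;         /* symbol: bits to gobble; link: index width */
    const struct entry *sub;    /* non-null when the entry is a link */
} entry;

static const entry table_x[4] = {
    {'C', 1, 0}, {'C', 1, 0}, {'D', 2, 0}, {'E', 2, 0}
};
static const entry table_y[8] = {
    {'F', 2, 0}, {'F', 2, 0}, {'G', 2, 0}, {'G', 2, 0},
    {'H', 2, 0}, {'H', 2, 0}, {'I', 3, 0}, {'J', 3, 0}
};
static const entry root[8] = {
    {'A', 1, 0}, {'A', 1, 0}, {'A', 1, 0}, {'A', 1, 0},
    {'B', 2, 0}, {'B', 2, 0}, {0, 2, table_x}, {0, 3, table_y}
};

/* Read `count` bits starting at pos without consuming them; bits past the
   end of the input read as zero. */
static unsigned peek(const char *bits, size_t n, size_t pos, unsigned count)
{
    unsigned v = 0;
    for (unsigned i = 0; i < count; i++) {
        v <<= 1;
        if (pos + i < n && bits[pos + i] == '1')
            v |= 1;
    }
    return v;
}

/* Decode a string of whole codes into out; returns the symbol count. */
size_t decode(const char *bits, char *out, size_t out_cap)
{
    size_t n = strlen(bits), pos = 0, k = 0;
    assert(out_cap >= 1);
    while (pos < n && k + 1 < out_cap) {
        const entry *e = &root[peek(bits, n, pos, ROOT_BITS)];
        if (e->sub) {                   /* long code: second lookup */
            pos += ROOT_BITS;           /* gobble the root index bits */
            e = &e->sub[peek(bits, n, pos, e->bits)];
        }
        pos += e->bits;                 /* gobble only the bits of this code */
        out[k++] = e->sym;
    }
    out[k] = '\0';
    return k;
}
```

Note that the lookup always peeks at a full index but advances only by the
number of bits stored in the entry, so the unused bits are seen again by
the next lookup.  The sketch assumes the input ends on a code boundary.
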
So what we have here are three tables with a total of 20 entries that had to
be constructed.  That's compared to 64 entries for a single table.  Or
compared to 16 entries for a Huffman tree (six two-entry tables and one
four-entry table).  Assuming that the code ideally represents the probability
of the symbols, it takes on the average 1.25 lookups per symbol.  That's
compared to one lookup for the single table, or 1.66 lookups per symbol for
the Huffman tree.

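The 1.25 figure can be checked with a short calculation.  In the sketch
below (average_lookups is a name made up for this example) a symbol with
an n-bit code is assumed to occur with probability 2^-n, and any code
longer than the first table costs exactly two lookups.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Expected table lookups per symbol for a two-level table whose first
 * level covers root_bits bits.  lengths[i] is the code length of symbol i,
 * and each symbol is assumed to have probability 2^-lengths[i].
 */
double average_lookups(const unsigned char *lengths, size_t n,
                       unsigned root_bits)
{
    double avg = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(lengths[i] >= 1 && lengths[i] < 32);
        double p = 1.0 / (double)(1u << lengths[i]);
        avg += p * (lengths[i] <= root_bits ? 1.0 : 2.0);
    }
    return avg;
}
```

For the ten code lengths of the example (1, 2, 4, 5, 5, 5, 5, 5, 6, 6) and
a three-bit first table, A and B account for 3/4 of the symbols at one
lookup each and the rest for 1/4 at two lookups, giving 0.75 + 0.5 = 1.25.
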
There, I think that gives you a picture of what's going on.  For inflate, the
meaning of a particular symbol is often more than just a letter.  It can be a
byte (a "literal"), or it can be either a length or a distance which
indicates a base value and a number of bits to fetch after the code that is
added to the base value.  Or it might be the special end-of-block code.  The
data structures created in inftrees.c try to encode all that information
compactly in the tables.

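The "base value plus extra bits" idea can be sketched for match lengths.
The two arrays below hold the base lengths and extra bit counts for
literal/length symbols 257 to 285 as listed in RFC 1951; the function
names are invented for this example, and this is not the layout used in
inftrees.c.

```c
#include <assert.h>

/* Base match length for literal/length symbols 257..285 (RFC 1951). */
static const unsigned short length_base[29] = {
    3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
    35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258
};
/* Number of extra bits that follow each of those symbols. */
static const unsigned char length_extra[29] = {
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
    3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0
};

/* How many bits to fetch after the code for length symbol sym. */
unsigned length_extra_bits(unsigned sym)
{
    assert(sym >= 257 && sym <= 285);
    return length_extra[sym - 257];
}

/* Match length for symbol sym once its extra bits have been read. */
unsigned match_length(unsigned sym, unsigned extra)
{
    assert(sym >= 257 && sym <= 285);
    assert(extra < (1u << length_extra[sym - 257]));
    return length_base[sym - 257] + extra;
}
```

For instance, symbol 265 has base 11 and one extra bit, so it stands for a
length of 11 or 12.  Distances work the same way with their own tables.
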

Jean-loup Gailly        Mark Adler
jloup@gzip.org          madler@alumni.caltech.edu


References:

[LZ77] Ziv J., Lempel A., ``A Universal Algorithm for Sequential Data
Compression,'' IEEE Transactions on Information Theory, Vol. 23, No. 3,
pp. 337-343.

``DEFLATE Compressed Data Format Specification'' available in
http://www.ietf.org/rfc/rfc1951.txt
*36f395ceStb1. Compression algorithm (deflate)
*36f395ceStb
*36f395ceStbThe deflation algorithm used by gzip (also zip and zlib) is a variation of
*36f395ceStbLZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in
*36f395ceStbthe input data.  The second occurrence of a string is replaced by a
*36f395ceStbpointer to the previous string, in the form of a pair (distance,
*36f395ceStblength).  Distances are limited to 32K bytes, and lengths are limited
*36f395ceStbto 258 bytes. When a string does not occur anywhere in the previous
*36f395ceStb32K bytes, it is emitted as a sequence of literal bytes.  (In this
*36f395ceStbdescription, `string' must be taken as an arbitrary sequence of bytes,
*36f395ceStband is not restricted to printable characters.)
*36f395ceStb
*36f395ceStbLiterals or match lengths are compressed with one Huffman tree, and
*36f395ceStbmatch distances are compressed with another tree. The trees are stored
*36f395ceStbin a compact form at the start of each block. The blocks can have any
*36f395ceStbsize (except that the compressed data for one block must fit in
*36f395ceStbavailable memory). A block is terminated when deflate() determines that
*36f395ceStbit would be useful to start another block with fresh trees. (This is
*36f395ceStbsomewhat similar to the behavior of LZW-based _compress_.)
*36f395ceStb
*36f395ceStbDuplicated strings are found using a hash table. All input strings of
*36f395ceStblength 3 are inserted in the hash table. A hash index is computed for
*36f395ceStbthe next 3 bytes. If the hash chain for this index is not empty, all
*36f395ceStbstrings in the chain are compared with the current input string, and
*36f395ceStbthe longest match is selected.
*36f395ceStb
*36f395ceStbThe hash chains are searched starting with the most recent strings, to
*36f395ceStbfavor small distances and thus take advantage of the Huffman encoding.
*36f395ceStbThe hash chains are singly linked. There are no deletions from the
*36f395ceStbhash chains, the algorithm simply discards matches that are too old.
*36f395ceStb
*36f395ceStbTo avoid a worst-case situation, very long hash chains are arbitrarily
*36f395ceStbtruncated at a certain length, determined by a runtime option (level
*36f395ceStbparameter of deflateInit). So deflate() does not always find the longest
*36f395ceStbpossible match but generally finds a match which is long enough.
*36f395ceStb
*36f395ceStbdeflate() also defers the selection of matches with a lazy evaluation
*36f395ceStbmechanism. After a match of length N has been found, deflate() searches for
*36f395ceStba longer match at the next input byte. If a longer match is found, the
*36f395ceStbprevious match is truncated to a length of one (thus producing a single
*36f395ceStbliteral byte) and the process of lazy evaluation begins again. Otherwise,
*36f395ceStbthe original match is kept, and the next match search is attempted only N
*36f395ceStbsteps later.
*36f395ceStb
*36f395ceStbThe lazy match evaluation is also subject to a runtime parameter. If
*36f395ceStbthe current match is long enough, deflate() reduces the search for a longer
*36f395ceStbmatch, thus speeding up the whole process. If compression ratio is more
*36f395ceStbimportant than speed, deflate() attempts a complete second search even if
*36f395ceStbthe first match is already long enough.
*36f395ceStb
*36f395ceStbThe lazy match evaluation is not performed for the fastest compression
*36f395ceStbmodes (level parameter 1 to 3). For these fast modes, new strings
*36f395ceStbare inserted in the hash table only when no match was found, or
*36f395ceStbwhen the match is not too long. This degrades the compression ratio
*36f395ceStbbut saves time since there are both fewer insertions and fewer searches.
*36f395ceStb
*36f395ceStb
*36f395ceStb2. Decompression algorithm (inflate)
*36f395ceStb
*36f395ceStb2.1 Introduction
*36f395ceStb
*36f395ceStbThe key question is how to represent a Huffman code (or any prefix code) so
*36f395ceStbthat you can decode fast.  The most important characteristic is that shorter
*36f395ceStbcodes are much more common than longer codes, so pay attention to decoding the
*36f395ceStbshort codes fast, and let the long codes take longer to decode.
*36f395ceStb
*36f395ceStbinflate() sets up a first level table that covers some number of bits of
*36f395ceStbinput less than the length of longest code.  It gets that many bits from the
*36f395ceStbstream, and looks it up in the table.  The table will tell if the next
*36f395ceStbcode is that many bits or less and how many, and if it is, it will tell
*36f395ceStbthe value, else it will point to the next level table for which inflate()
*36f395ceStbgrabs more bits and tries to decode a longer code.
*36f395ceStb
*36f395ceStbHow many bits to make the first lookup is a tradeoff between the time it
*36f395ceStbtakes to decode and the time it takes to build the table.  If building the
*36f395ceStbtable took no time (and if you had infinite memory), then there would only
*36f395ceStbbe a first level table to cover all the way to the longest code.  However,
*36f395ceStbbuilding the table ends up taking a lot longer for more bits since short
*36f395ceStbcodes are replicated many times in such a table.  What inflate() does is
*36f395ceStbsimply to make the number of bits in the first table a variable, and  then
*36f395ceStbto set that variable for the maximum speed.
*36f395ceStb
*36f395ceStbFor inflate, which has 286 possible codes for the literal/length tree, the size
*36f395ceStbof the first table is nine bits.  Also the distance trees have 30 possible
*36f395ceStbvalues, and the size of the first table is six bits.  Note that for each of
*36f395ceStbthose cases, the table ended up one bit longer than the ``average'' code
*36f395ceStblength, i.e. the code length of an approximately flat code which would be a
*36f395ceStblittle more than eight bits for 286 symbols and a little less than five bits
*36f395ceStbfor 30 symbols.
*36f395ceStb
*36f395ceStb
*36f395ceStb2.2 More details on the inflate table lookup
*36f395ceStb
*36f395ceStbOk, you want to know what this cleverly obfuscated inflate tree actually
*36f395ceStblooks like.  You are correct that it's not a Huffman tree.  It is simply a
*36f395ceStblookup table for the first, let's say, nine bits of a Huffman symbol.  The
*36f395ceStbsymbol could be as short as one bit or as long as 15 bits.  If a particular
*36f395ceStbsymbol is shorter than nine bits, then that symbol's translation is duplicated
*36f395ceStbin all those entries that start with that symbol's bits.  For example, if the
*36f395ceStbsymbol is four bits, then it's duplicated 32 times in a nine-bit table.  If a
*36f395ceStbsymbol is nine bits long, it appears in the table once.
*36f395ceStb
*36f395ceStbIf the symbol is longer than nine bits, then that entry in the table points
*36f395ceStbto another similar table for the remaining bits.  Again, there are duplicated
*36f395ceStbentries as needed.  The idea is that most of the time the symbol will be short
*36f395ceStband there will only be one table look up.  (That's whole idea behind data
*36f395ceStbcompression in the first place.)  For the less frequent long symbols, there
*36f395ceStbwill be two lookups.  If you had a compression method with really long
*36f395ceStbsymbols, you could have as many levels of lookups as is efficient.  For
*36f395ceStbinflate, two is enough.
*36f395ceStb
*36f395ceStbSo a table entry either points to another table (in which case nine bits in
*36f395ceStbthe above example are gobbled), or it contains the translation for the symbol
*36f395ceStband the number of bits to gobble.  Then you start again with the next
*36f395ceStbungobbled bit.
*36f395ceStb
*36f395ceStbYou may wonder: why not just have one lookup table for how ever many bits the
*36f395ceStblongest symbol is?  The reason is that if you do that, you end up spending
*36f395ceStbmore time filling in duplicate symbol entries than you do actually decoding.
*36f395ceStbAt least for deflate's output that generates new trees every several 10's of
*36f395ceStbkbytes.  You can imagine that filling in a 2^15 entry table for a 15-bit code
*36f395ceStbwould take too long if you're only decoding several thousand symbols.  At the
*36f395ceStbother extreme, you could make a new table for every bit in the code.  In fact,
*36f395ceStbthat's essentially a Huffman tree.  But then you spend two much time
*36f395ceStbtraversing the tree while decoding, even for short symbols.
*36f395ceStb
*36f395ceStbSo the number of bits for the first lookup table is a trade of the time to
*36f395ceStbfill out the table vs. the time spent looking at the second level and above of
*36f395ceStbthe table.
*36f395ceStb
*36f395ceStbHere is an example, scaled down:
*36f395ceStb
*36f395ceStbThe code being decoded, with 10 symbols, from 1 to 6 bits long:
*36f395ceStb
*36f395ceStbA: 0
*36f395ceStbB: 10
*36f395ceStbC: 1100
*36f395ceStbD: 11010
*36f395ceStbE: 11011
*36f395ceStbF: 11100
*36f395ceStbG: 11101
*36f395ceStbH: 11110
*36f395ceStbI: 111110
*36f395ceStbJ: 111111
*36f395ceStb
*36f395ceStbLet's make the first table three bits long (eight entries):
*36f395ceStb
*36f395ceStb000: A,1
*36f395ceStb001: A,1
*36f395ceStb010: A,1
*36f395ceStb011: A,1
*36f395ceStb100: B,2
*36f395ceStb101: B,2
*36f395ceStb110: -> table X (gobble 3 bits)
*36f395ceStb111: -> table Y (gobble 3 bits)
*36f395ceStb
*36f395ceStbEach entry is what the bits decode as and how many bits that is, i.e. how
*36f395ceStbmany bits to gobble.  Or the entry points to another table, with the number of
*36f395ceStbbits to gobble implicit in the size of the table.
*36f395ceStb
*36f395ceStbTable X is two bits long since the longest code starting with 110 is five bits
*36f395ceStblong:
*36f395ceStb
*36f395ceStb00: C,1
*36f395ceStb01: C,1
*36f395ceStb10: D,2
*36f395ceStb11: E,2
*36f395ceStb
*36f395ceStbTable Y is three bits long since the longest code starting with 111 is six
*36f395ceStbbits long:
*36f395ceStb
*36f395ceStb000: F,2
*36f395ceStb001: F,2
*36f395ceStb010: G,2
*36f395ceStb011: G,2
*36f395ceStb100: H,2
*36f395ceStb101: H,2
*36f395ceStb110: I,3
*36f395ceStb111: J,3
*36f395ceStb
*36f395ceStbSo what we have here are three tables with a total of 20 entries that had to
*36f395ceStbbe constructed.  That's compared to 64 entries for a single table.  Or
*36f395ceStbcompared to 16 entries for a Huffman tree (six two entry tables and one four
*36f395ceStbentry table).  Assuming that the code ideally represents the probability of
*36f395ceStbthe symbols, it takes on the average 1.25 lookups per symbol.  That's compared
*36f395ceStbto one lookup for the single table, or 1.66 lookups per symbol for the
*36f395ceStbHuffman tree.
*36f395ceStb
*36f395ceStbThere, I think that gives you a picture of what's going on.  For inflate, the
*36f395ceStbmeaning of a particular symbol is often more than just a letter.  It can be a
*36f395ceStbbyte (a "literal"), or it can be either a length or a distance which
*36f395ceStbindicates a base value and a number of bits to fetch after the code that is
*36f395ceStbadded to the base value.  Or it might be the special end-of-block code.  The
*36f395ceStbdata structures created in inftrees.c try to encode all that information
*36f395ceStbcompactly in the tables.
*36f395ceStb
*36f395ceStb
*36f395ceStbJean-loup Gailly        Mark Adler
*36f395ceStbjloup@gzip.org          madler@alumni.caltech.edu
*36f395ceStb
*36f395ceStb
*36f395ceStbReferences:
*36f395ceStb
*36f395ceStb[LZ77] Ziv J., Lempel A., ``A Universal Algorithm for Sequential Data
*36f395ceStbCompression,'' IEEE Transactions on Information Theory, Vol. 23, No. 3,
*36f395ceStbpp. 337-343.
*36f395ceStb
*36f395ceStb``DEFLATE Compressed Data Format Specification'' available in
*36f395ceStbhttp://www.ietf.org/rfc/rfc1951.txt
*36f395ceStb1. Compression algorithm (deflate)
*36f395ceStb
*36f395ceStbThe deflation algorithm used by gzip (also zip and zlib) is a variation of
*36f395ceStbLZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in
*36f395ceStbthe input data.  The second occurrence of a string is replaced by a
*36f395ceStbpointer to the previous string, in the form of a pair (distance,
*36f395ceStblength).  Distances are limited to 32K bytes, and lengths are limited
*36f395ceStbto 258 bytes. When a string does not occur anywhere in the previous
*36f395ceStb32K bytes, it is emitted as a sequence of literal bytes.  (In this
*36f395ceStbdescription, `string' must be taken as an arbitrary sequence of bytes,
*36f395ceStband is not restricted to printable characters.)
*36f395ceStb
*36f395ceStbLiterals or match lengths are compressed with one Huffman tree, and
*36f395ceStbmatch distances are compressed with another tree. The trees are stored
*36f395ceStbin a compact form at the start of each block. The blocks can have any
*36f395ceStbsize (except that the compressed data for one block must fit in
*36f395ceStbavailable memory). A block is terminated when deflate() determines that
*36f395ceStbit would be useful to start another block with fresh trees. (This is
*36f395ceStbsomewhat similar to the behavior of LZW-based _compress_.)
*36f395ceStb
*36f395ceStbDuplicated strings are found using a hash table. All input strings of
*36f395ceStblength 3 are inserted in the hash table. A hash index is computed for
*36f395ceStbthe next 3 bytes. If the hash chain for this index is not empty, all
*36f395ceStbstrings in the chain are compared with the current input string, and
*36f395ceStbthe longest match is selected.
*36f395ceStb
*36f395ceStbThe hash chains are searched starting with the most recent strings, to
*36f395ceStbfavor small distances and thus take advantage of the Huffman encoding.
*36f395ceStbThe hash chains are singly linked. There are no deletions from the
*36f395ceStbhash chains, the algorithm simply discards matches that are too old.
*36f395ceStb
*36f395ceStbTo avoid a worst-case situation, very long hash chains are arbitrarily
*36f395ceStbtruncated at a certain length, determined by a runtime option (level
*36f395ceStbparameter of deflateInit). So deflate() does not always find the longest
*36f395ceStbpossible match but generally finds a match which is long enough.
*36f395ceStb
*36f395ceStbdeflate() also defers the selection of matches with a lazy evaluation
*36f395ceStbmechanism. After a match of length N has been found, deflate() searches for
*36f395ceStba longer match at the next input byte. If a longer match is found, the
*36f395ceStbprevious match is truncated to a length of one (thus producing a single
*36f395ceStbliteral byte) and the process of lazy evaluation begins again. Otherwise,
*36f395ceStbthe original match is kept, and the next match search is attempted only N
*36f395ceStbsteps later.
*36f395ceStb
*36f395ceStbThe lazy match evaluation is also subject to a runtime parameter. If
*36f395ceStbthe current match is long enough, deflate() reduces the search for a longer
*36f395ceStbmatch, thus speeding up the whole process. If compression ratio is more
*36f395ceStbimportant than speed, deflate() attempts a complete second search even if
*36f395ceStbthe first match is already long enough.
*36f395ceStb
*36f395ceStbThe lazy match evaluation is not performed for the fastest compression
*36f395ceStbmodes (level parameter 1 to 3). For these fast modes, new strings
*36f395ceStbare inserted in the hash table only when no match was found, or
*36f395ceStbwhen the match is not too long. This degrades the compression ratio
*36f395ceStbbut saves time since there are both fewer insertions and fewer searches.
*36f395ceStb
*36f395ceStb
*36f395ceStb2. Decompression algorithm (inflate)
*36f395ceStb
*36f395ceStb2.1 Introduction
*36f395ceStb
*36f395ceStbThe key question is how to represent a Huffman code (or any prefix code) so
*36f395ceStbthat you can decode fast.  The most important characteristic is that shorter
*36f395ceStbcodes are much more common than longer codes, so pay attention to decoding the
*36f395ceStbshort codes fast, and let the long codes take longer to decode.
*36f395ceStb
*36f395ceStbinflate() sets up a first level table that covers some number of bits of
*36f395ceStbinput less than the length of longest code.  It gets that many bits from the
*36f395ceStbstream, and looks it up in the table.  The table will tell if the next
*36f395ceStbcode is that many bits or less and how many, and if it is, it will tell
*36f395ceStbthe value, else it will point to the next level table for which inflate()
*36f395ceStbgrabs more bits and tries to decode a longer code.
*36f395ceStb
*36f395ceStbHow many bits to make the first lookup is a tradeoff between the time it
*36f395ceStbtakes to decode and the time it takes to build the table.  If building the
*36f395ceStbtable took no time (and if you had infinite memory), then there would only
*36f395ceStbbe a first level table to cover all the way to the longest code.  However,
*36f395ceStbbuilding the table ends up taking a lot longer for more bits since short
*36f395ceStbcodes are replicated many times in such a table.  What inflate() does is
*36f395ceStbsimply to make the number of bits in the first table a variable, and  then
*36f395ceStbto set that variable for the maximum speed.
*36f395ceStb
*36f395ceStbFor inflate, which has 286 possible codes for the literal/length tree, the size
*36f395ceStbof the first table is nine bits.  Also the distance trees have 30 possible
*36f395ceStbvalues, and the size of the first table is six bits.  Note that for each of
*36f395ceStbthose cases, the table ended up one bit longer than the ``average'' code
*36f395ceStblength, i.e. the code length of an approximately flat code which would be a
*36f395ceStblittle more than eight bits for 286 symbols and a little less than five bits
*36f395ceStbfor 30 symbols.
*36f395ceStb
*36f395ceStb
*36f395ceStb2.2 More details on the inflate table lookup
*36f395ceStb
*36f395ceStbOk, you want to know what this cleverly obfuscated inflate tree actually
*36f395ceStblooks like.  You are correct that it's not a Huffman tree.  It is simply a
*36f395ceStblookup table for the first, let's say, nine bits of a Huffman symbol.  The
*36f395ceStbsymbol could be as short as one bit or as long as 15 bits.  If a particular
*36f395ceStbsymbol is shorter than nine bits, then that symbol's translation is duplicated
*36f395ceStbin all those entries that start with that symbol's bits.  For example, if the
*36f395ceStbsymbol is four bits, then it's duplicated 32 times in a nine-bit table.  If a
*36f395ceStbsymbol is nine bits long, it appears in the table once.

If the symbol is longer than nine bits, then that entry in the table points
to another similar table for the remaining bits.  Again, there are duplicated
entries as needed.  The idea is that most of the time the symbol will be short
and there will only be one table lookup.  (That's the whole idea behind data
compression in the first place.)  For the less frequent long symbols, there
will be two lookups.  If you had a compression method with really long
symbols, you could have as many levels of lookups as is efficient.  For
inflate, two is enough.

So a table entry either points to another table (in which case nine bits in
the above example are gobbled), or it contains the translation for the symbol
and the number of bits to gobble.  Then you start again with the next
ungobbled bit.

You may wonder: why not just have one lookup table for however many bits the
longest symbol is?  The reason is that if you do that, you end up spending
more time filling in duplicate symbol entries than you do actually decoding,
at least for deflate's output, which generates new trees every several tens of
kbytes.  You can imagine that filling in a 2^15 entry table for a 15-bit code
would take too long if you're only decoding several thousand symbols.  At the
other extreme, you could make a new table for every bit in the code.  In fact,
that's essentially a Huffman tree.  But then you spend too much time
traversing the tree while decoding, even for short symbols.

So the number of bits for the first lookup table is a tradeoff between the
time to fill out the table and the time spent looking at the second level and
above of the table.

Here is an example, scaled down:

The code being decoded, with 10 symbols, from 1 to 6 bits long:

A: 0
B: 10
C: 1100
D: 11010
E: 11011
F: 11100
G: 11101
H: 11110
I: 111110
J: 111111

Let's make the first table three bits long (eight entries):

000: A,1
001: A,1
010: A,1
011: A,1
100: B,2
101: B,2
110: -> table X (gobble 3 bits)
111: -> table Y (gobble 3 bits)

Each entry is what the bits decode as and how many bits that is, i.e. how
many bits to gobble.  Or the entry points to another table, with the number of
bits to gobble implicit in the size of the table.

Table X is two bits long since the longest code starting with 110 is five bits
long:

00: C,1
01: C,1
10: D,2
11: E,2

Table Y is three bits long since the longest code starting with 111 is six
bits long:

000: F,2
001: F,2
010: G,2
011: G,2
100: H,2
101: H,2
110: I,3
111: J,3
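To make the walk through the three tables concrete, here is a minimal C sketch
that hard-codes them and decodes a string of '0' and '1' characters.  It is an
illustration under stated assumptions, not inflate's decoder: the struct
layout, the names peek and decode, and the use of a character string in place
of a real bit buffer are all made up for the example.  A link entry stores the
index width of its second-level table explicitly, which is one possible way to
represent the "implicit" size mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ROOT_BITS 3

/* Hypothetical entry: a decoded symbol, or a link to a second table. */
struct entry {
    char sym;                 /* decoded symbol, 0 for a link */
    unsigned char bits;       /* symbol: bits to gobble; link: index width */
    const struct entry *sub;  /* second-level table, NULL for a symbol */
};

static const struct entry table_x[4] = {
    {'C', 1, NULL}, {'C', 1, NULL}, {'D', 2, NULL}, {'E', 2, NULL}
};
static const struct entry table_y[8] = {
    {'F', 2, NULL}, {'F', 2, NULL}, {'G', 2, NULL}, {'G', 2, NULL},
    {'H', 2, NULL}, {'H', 2, NULL}, {'I', 3, NULL}, {'J', 3, NULL}
};
static const struct entry root[8] = {
    {'A', 1, NULL}, {'A', 1, NULL}, {'A', 1, NULL}, {'A', 1, NULL},
    {'B', 2, NULL}, {'B', 2, NULL}, {0, 2, table_x}, {0, 3, table_y}
};

/* Look at n bits starting at pos without gobbling them; pad with zeros. */
static unsigned peek(const char *bitstr, size_t len, size_t pos, unsigned n)
{
    unsigned v = 0;

    while (n-- > 0) {
        v = (v << 1) | (unsigned)(pos < len && bitstr[pos] == '1');
        pos++;
    }
    return v;
}

/* Decode bitstr into out (cap bytes, NUL included); returns symbol count. */
static size_t decode(const char *bitstr, char *out, size_t cap)
{
    size_t len = strlen(bitstr), pos = 0, n = 0;

    assert(cap > 0);
    while (pos < len && n + 1 < cap) {
        const struct entry *e = &root[peek(bitstr, len, pos, ROOT_BITS)];

        if (e->sub != NULL) {
            const struct entry *sub = e->sub;
            unsigned width = e->bits;

            pos += ROOT_BITS;                 /* gobble the root index */
            e = &sub[peek(bitstr, len, pos, width)];
        }
        if (pos + e->bits > len)
            break;                            /* truncated final code */
        pos += e->bits;                       /* gobble the symbol's bits */
        out[n++] = e->sym;
    }
    out[n] = '\0';
    return n;
}
```

Note that a lookup always peeks at a full index (three or two bits) but only
gobbles as many bits as the entry says, so the leftover bits are seen again by
the next lookup.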

So what we have here are three tables with a total of 20 entries that had to
be constructed.  That's compared to 64 entries for a single table.  Or
compared to 18 entries for a Huffman tree (nine two-entry tables, one for each
internal node).  Assuming that the code ideally represents the probability of
the symbols, it takes on the average 1.25 lookups per symbol.  That's compared
to one lookup for the single table, or about 2.22 lookups per symbol (the
average code length) for the Huffman tree.
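The 20-entry and 64-entry counts and the 1.25 figure can be checked with a
short C sketch.  The function names and the example_lens array are made up for
the illustration, and the code assumes a canonical code (shortest codes first,
counting upward), which happens to reproduce the A..J code above.  A symbol of
length n is given probability 2^-n, which is what "ideally represents the
probability" is taken to mean here.

```c
#include <assert.h>

#define MAX_LEN 15

/* Code lengths of symbols A..J, in non-decreasing order. */
static const unsigned example_lens[10] = {1, 2, 4, 5, 5, 5, 5, 5, 6, 6};

/* Total entries in a two-level table whose first table has root bits. */
static unsigned long table_entries(const unsigned *lens, unsigned n,
                                   unsigned root)
{
    static unsigned char sub_bits[1u << MAX_LEN];
    unsigned long total, code = 0, p;
    unsigned i;

    assert(root >= 1 && root <= MAX_LEN);
    for (p = 0; p < (1ul << root); p++)
        sub_bits[p] = 0;

    for (i = 0; i < n; i++) {
        assert(lens[i] >= 1 && lens[i] <= MAX_LEN);
        if (i > 0) {
            assert(lens[i] >= lens[i - 1]);
            /* next canonical code: add one, then extend with zeros */
            code = (code + 1) << (lens[i] - lens[i - 1]);
        }
        if (lens[i] > root) {
            /* the first root bits select the second-level table */
            p = code >> (lens[i] - root);
            if (lens[i] - root > sub_bits[p])
                sub_bits[p] = (unsigned char)(lens[i] - root);
        }
    }

    total = 1ul << root;
    for (p = 0; p < (1ul << root); p++)
        if (sub_bits[p] != 0)
            total += 1ul << sub_bits[p];
    return total;
}

/* Expected lookups per symbol: one if the code fits in the root table. */
static double average_lookups(const unsigned *lens, unsigned n,
                              unsigned root)
{
    double avg = 0.0;
    unsigned i;

    for (i = 0; i < n; i++) {
        double prob = 1.0 / (double)(1ul << lens[i]);

        avg += prob * (lens[i] <= root ? 1.0 : 2.0);
    }
    return avg;
}
```

Calling table_entries(example_lens, 10, 3) gives the three-table layout, and
passing 6 as the root width gives the single-table case.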

There, I think that gives you a picture of what's going on.  For inflate, the
meaning of a particular symbol is often more than just a letter.  It can be a
byte (a "literal"), or it can be either a length or a distance, which
indicates a base value and a number of extra bits to fetch after the code; the
value of those bits is added to the base value.  Or it might be the special
end-of-block code.  The data structures created in inftrees.c try to encode
all that information compactly in the tables.
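As a rough illustration of those meanings, the following C sketch classifies a
literal/length symbol and computes a match length from its base value and
extra bits.  The base and extra-bit values are the ones listed in the DEFLATE
specification referenced below; the type and function names are hypothetical
and this is not the layout that inftrees.c actually builds.

```c
#include <assert.h>

enum kind { LITERAL, END_OF_BLOCK, LENGTH, INVALID };

struct meaning {
    enum kind kind;
    unsigned base;   /* literal byte value, or base match length */
    unsigned extra;  /* number of extra bits that follow the code */
};

/* Base lengths and extra-bit counts for symbols 257..285 (RFC 1951). */
static const unsigned short len_base[29] = {
    3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
    35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258
};
static const unsigned char len_extra[29] = {
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
    3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0
};

/* Say what a literal/length symbol (0..285) stands for. */
static struct meaning interpret(unsigned sym)
{
    struct meaning m = { INVALID, 0, 0 };

    if (sym < 256) {
        m.kind = LITERAL;
        m.base = sym;
    } else if (sym == 256) {
        m.kind = END_OF_BLOCK;
    } else if (sym <= 285) {
        m.kind = LENGTH;
        m.base = len_base[sym - 257];
        m.extra = len_extra[sym - 257];
    }
    return m;
}

/* Add the value of the extra bits to the base to get the match length. */
static unsigned match_length(struct meaning m, unsigned extra_value)
{
    assert(m.kind == LENGTH);
    assert(extra_value < (1u << m.extra));
    return m.base + extra_value;
}
```

For example, symbol 265 has base 11 and one extra bit, so it stands for a
match length of 11 or 12 depending on that bit.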


Jean-loup Gailly        Mark Adler
jloup@gzip.org          madler@alumni.caltech.edu


References:

[LZ77] Ziv J., Lempel A., ``A Universal Algorithm for Sequential Data
Compression,'' IEEE Transactions on Information Theory, Vol. 23, No. 3,
pp. 337-343.

``DEFLATE Compressed Data Format Specification'' available in
http://www.ietf.org/rfc/rfc1951.txt