19baa294fSmillert1. Compression algorithm (deflate) 29baa294fSmillert 315ce0796SmillertThe deflation algorithm used by gzip (also zip and zlib) is a variation of 49baa294fSmillertLZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in 59baa294fSmillertthe input data. The second occurrence of a string is replaced by a 69baa294fSmillertpointer to the previous string, in the form of a pair (distance, 79baa294fSmillertlength). Distances are limited to 32K bytes, and lengths are limited 89baa294fSmillertto 258 bytes. When a string does not occur anywhere in the previous 99baa294fSmillert32K bytes, it is emitted as a sequence of literal bytes. (In this 109baa294fSmillertdescription, `string' must be taken as an arbitrary sequence of bytes, 119baa294fSmillertand is not restricted to printable characters.) 129baa294fSmillert 139baa294fSmillertLiterals or match lengths are compressed with one Huffman tree, and 149baa294fSmillertmatch distances are compressed with another tree. The trees are stored 159baa294fSmillertin a compact form at the start of each block. The blocks can have any 169baa294fSmillertsize (except that the compressed data for one block must fit in 179baa294fSmillertavailable memory). A block is terminated when deflate() determines that 189baa294fSmillertit would be useful to start another block with fresh trees. (This is 199baa294fSmillertsomewhat similar to the behavior of LZW-based _compress_.) 209baa294fSmillert 219baa294fSmillertDuplicated strings are found using a hash table. All input strings of 229baa294fSmillertlength 3 are inserted in the hash table. A hash index is computed for 239baa294fSmillertthe next 3 bytes. If the hash chain for this index is not empty, all 249baa294fSmillertstrings in the chain are compared with the current input string, and 259baa294fSmillertthe longest match is selected. 269baa294fSmillert 279baa294fSmillertThe hash chains are searched starting with the most recent strings, to 289baa294fSmillertfavor small distances and thus take advantage of the Huffman encoding. 299baa294fSmillertThe hash chains are singly linked. There are no deletions from the 309baa294fSmillerthash chains, the algorithm simply discards matches that are too old. 319baa294fSmillert 329baa294fSmillertTo avoid a worst-case situation, very long hash chains are arbitrarily 339baa294fSmillerttruncated at a certain length, determined by a runtime option (level 349baa294fSmillertparameter of deflateInit). So deflate() does not always find the longest 359baa294fSmillertpossible match but generally finds a match which is long enough. 369baa294fSmillert 379baa294fSmillertdeflate() also defers the selection of matches with a lazy evaluation 3815ce0796Smillertmechanism. After a match of length N has been found, deflate() searches for 3915ce0796Smillerta longer match at the next input byte. If a longer match is found, the 409baa294fSmillertprevious match is truncated to a length of one (thus producing a single 4115ce0796Smillertliteral byte) and the process of lazy evaluation begins again. Otherwise, 4215ce0796Smillertthe original match is kept, and the next match search is attempted only N 4315ce0796Smillertsteps later. 449baa294fSmillert 459baa294fSmillertThe lazy match evaluation is also subject to a runtime parameter. If 469baa294fSmillertthe current match is long enough, deflate() reduces the search for a longer 479baa294fSmillertmatch, thus speeding up the whole process. If compression ratio is more 489baa294fSmillertimportant than speed, deflate() attempts a complete second search even if 499baa294fSmillertthe first match is already long enough. 509baa294fSmillert 519baa294fSmillertThe lazy match evaluation is not performed for the fastest compression 529baa294fSmillertmodes (level parameter 1 to 3). For these fast modes, new strings 539baa294fSmillertare inserted in the hash table only when no match was found, or 549baa294fSmillertwhen the match is not too long. This degrades the compression ratio 559baa294fSmillertbut saves time since there are both fewer insertions and fewer searches. 569baa294fSmillert 579baa294fSmillert 589baa294fSmillert2. Decompression algorithm (inflate) 599baa294fSmillert 6015ce0796Smillert2.1 Introduction 6115ce0796Smillert 6285c48e79ShenningThe key question is how to represent a Huffman code (or any prefix code) so 6385c48e79Shenningthat you can decode fast. The most important characteristic is that shorter 6485c48e79Shenningcodes are much more common than longer codes, so pay attention to decoding the 6585c48e79Shenningshort codes fast, and let the long codes take longer to decode. 669baa294fSmillert 679baa294fSmillertinflate() sets up a first level table that covers some number of bits of 689baa294fSmillertinput less than the length of longest code. It gets that many bits from the 699baa294fSmillertstream, and looks it up in the table. The table will tell if the next 709baa294fSmillertcode is that many bits or less and how many, and if it is, it will tell 719baa294fSmillertthe value, else it will point to the next level table for which inflate() 729baa294fSmillertgrabs more bits and tries to decode a longer code. 739baa294fSmillert 749baa294fSmillertHow many bits to make the first lookup is a tradeoff between the time it 759baa294fSmillerttakes to decode and the time it takes to build the table. If building the 769baa294fSmillerttable took no time (and if you had infinite memory), then there would only 779baa294fSmillertbe a first level table to cover all the way to the longest code. However, 789baa294fSmillertbuilding the table ends up taking a lot longer for more bits since short 799baa294fSmillertcodes are replicated many times in such a table. What inflate() does is 8085c48e79Shenningsimply to make the number of bits in the first table a variable, and then 8185c48e79Shenningto set that variable for the maximum speed. 829baa294fSmillert 8385c48e79ShenningFor inflate, which has 286 possible codes for the literal/length tree, the size 8485c48e79Shenningof the first table is nine bits. Also the distance trees have 30 possible 8585c48e79Shenningvalues, and the size of the first table is six bits. Note that for each of 8685c48e79Shenningthose cases, the table ended up one bit longer than the ``average'' code 8785c48e79Shenninglength, i.e. the code length of an approximately flat code which would be a 8885c48e79Shenninglittle more than eight bits for 286 symbols and a little less than five bits 8985c48e79Shenningfor 30 symbols. 909baa294fSmillert 919baa294fSmillert 9215ce0796Smillert2.2 More details on the inflate table lookup 9315ce0796Smillert 9415ce0796SmillertOk, you want to know what this cleverly obfuscated inflate tree actually 9515ce0796Smillertlooks like. You are correct that it's not a Huffman tree. It is simply a 9615ce0796Smillertlookup table for the first, let's say, nine bits of a Huffman symbol. The 9715ce0796Smillertsymbol could be as short as one bit or as long as 15 bits. If a particular 9815ce0796Smillertsymbol is shorter than nine bits, then that symbol's translation is duplicated 9915ce0796Smillertin all those entries that start with that symbol's bits. For example, if the 10015ce0796Smillertsymbol is four bits, then it's duplicated 32 times in a nine-bit table. If a 10115ce0796Smillertsymbol is nine bits long, it appears in the table once. 10215ce0796Smillert 10315ce0796SmillertIf the symbol is longer than nine bits, then that entry in the table points 10415ce0796Smillertto another similar table for the remaining bits. Again, there are duplicated 10515ce0796Smillertentries as needed. The idea is that most of the time the symbol will be short 10615ce0796Smillertand there will only be one table look up. (That's whole idea behind data 10715ce0796Smillertcompression in the first place.) For the less frequent long symbols, there 10815ce0796Smillertwill be two lookups. If you had a compression method with really long 10915ce0796Smillertsymbols, you could have as many levels of lookups as is efficient. For 11015ce0796Smillertinflate, two is enough. 11115ce0796Smillert 11215ce0796SmillertSo a table entry either points to another table (in which case nine bits in 11315ce0796Smillertthe above example are gobbled), or it contains the translation for the symbol 11415ce0796Smillertand the number of bits to gobble. Then you start again with the next 11515ce0796Smillertungobbled bit. 11615ce0796Smillert 11715ce0796SmillertYou may wonder: why not just have one lookup table for how ever many bits the 11815ce0796Smillertlongest symbol is? The reason is that if you do that, you end up spending 11915ce0796Smillertmore time filling in duplicate symbol entries than you do actually decoding. 12015ce0796SmillertAt least for deflate's output that generates new trees every several 10's of 12115ce0796Smillertkbytes. You can imagine that filling in a 2^15 entry table for a 15-bit code 12215ce0796Smillertwould take too long if you're only decoding several thousand symbols. At the 12315ce0796Smillertother extreme, you could make a new table for every bit in the code. In fact, 12415ce0796Smillertthat's essentially a Huffman tree. But then you spend two much time 12515ce0796Smillerttraversing the tree while decoding, even for short symbols. 12615ce0796Smillert 12715ce0796SmillertSo the number of bits for the first lookup table is a trade of the time to 12815ce0796Smillertfill out the table vs. the time spent looking at the second level and above of 12915ce0796Smillertthe table. 13015ce0796Smillert 13115ce0796SmillertHere is an example, scaled down: 13215ce0796Smillert 13315ce0796SmillertThe code being decoded, with 10 symbols, from 1 to 6 bits long: 13415ce0796Smillert 13515ce0796SmillertA: 0 13615ce0796SmillertB: 10 13715ce0796SmillertC: 1100 13815ce0796SmillertD: 11010 13915ce0796SmillertE: 11011 14015ce0796SmillertF: 11100 14115ce0796SmillertG: 11101 14215ce0796SmillertH: 11110 14315ce0796SmillertI: 111110 14415ce0796SmillertJ: 111111 14515ce0796Smillert 14615ce0796SmillertLet's make the first table three bits long (eight entries): 14715ce0796Smillert 14815ce0796Smillert000: A,1 14915ce0796Smillert001: A,1 15015ce0796Smillert010: A,1 15115ce0796Smillert011: A,1 15215ce0796Smillert100: B,2 15315ce0796Smillert101: B,2 15415ce0796Smillert110: -> table X (gobble 3 bits) 15515ce0796Smillert111: -> table Y (gobble 3 bits) 15615ce0796Smillert 15785c48e79ShenningEach entry is what the bits decode as and how many bits that is, i.e. how 15815ce0796Smillertmany bits to gobble. Or the entry points to another table, with the number of 15915ce0796Smillertbits to gobble implicit in the size of the table. 16015ce0796Smillert 16115ce0796SmillertTable X is two bits long since the longest code starting with 110 is five bits 16215ce0796Smillertlong: 16315ce0796Smillert 16415ce0796Smillert00: C,1 16515ce0796Smillert01: C,1 16615ce0796Smillert10: D,2 16715ce0796Smillert11: E,2 16815ce0796Smillert 16915ce0796SmillertTable Y is three bits long since the longest code starting with 111 is six 17015ce0796Smillertbits long: 17115ce0796Smillert 17215ce0796Smillert000: F,2 17315ce0796Smillert001: F,2 17415ce0796Smillert010: G,2 17515ce0796Smillert011: G,2 17615ce0796Smillert100: H,2 17715ce0796Smillert101: H,2 17815ce0796Smillert110: I,3 17915ce0796Smillert111: J,3 18015ce0796Smillert 18115ce0796SmillertSo what we have here are three tables with a total of 20 entries that had to 18215ce0796Smillertbe constructed. That's compared to 64 entries for a single table. Or 18315ce0796Smillertcompared to 16 entries for a Huffman tree (six two entry tables and one four 18415ce0796Smillertentry table). Assuming that the code ideally represents the probability of 18515ce0796Smillertthe symbols, it takes on the average 1.25 lookups per symbol. That's compared 18615ce0796Smillertto one lookup for the single table, or 1.66 lookups per symbol for the 18715ce0796SmillertHuffman tree. 18815ce0796Smillert 18915ce0796SmillertThere, I think that gives you a picture of what's going on. For inflate, the 19015ce0796Smillertmeaning of a particular symbol is often more than just a letter. It can be a 19115ce0796Smillertbyte (a "literal"), or it can be either a length or a distance which 19215ce0796Smillertindicates a base value and a number of bits to fetch after the code that is 19315ce0796Smillertadded to the base value. Or it might be the special end-of-block code. The 19415ce0796Smillertdata structures created in inftrees.c try to encode all that information 19515ce0796Smillertcompactly in the tables. 19615ce0796Smillert 19715ce0796Smillert 1989baa294fSmillertJean-loup Gailly Mark Adler 19915ce0796Smillertjloup@gzip.org madler@alumni.caltech.edu 2009baa294fSmillert 2019baa294fSmillert 2029baa294fSmillertReferences: 2039baa294fSmillert 2049baa294fSmillert[LZ77] Ziv J., Lempel A., ``A Universal Algorithm for Sequential Data 2059baa294fSmillertCompression,'' IEEE Transactions on Information Theory, Vol. 23, No. 3, 2069baa294fSmillertpp. 337-343. 2079baa294fSmillert 2089baa294fSmillert``DEFLATE Compressed Data Format Specification'' available in 20985c48e79Shenninghttp://www.ietf.org/rfc/rfc1951.txt 210*36f395ceStb1. Compression algorithm (deflate) 211*36f395ceStb 212*36f395ceStbThe deflation algorithm used by gzip (also zip and zlib) is a variation of 213*36f395ceStbLZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in 214*36f395ceStbthe input data. The second occurrence of a string is replaced by a 215*36f395ceStbpointer to the previous string, in the form of a pair (distance, 216*36f395ceStblength). Distances are limited to 32K bytes, and lengths are limited 217*36f395ceStbto 258 bytes. When a string does not occur anywhere in the previous 218*36f395ceStb32K bytes, it is emitted as a sequence of literal bytes. (In this 219*36f395ceStbdescription, `string' must be taken as an arbitrary sequence of bytes, 220*36f395ceStband is not restricted to printable characters.) 221*36f395ceStb 222*36f395ceStbLiterals or match lengths are compressed with one Huffman tree, and 223*36f395ceStbmatch distances are compressed with another tree. The trees are stored 224*36f395ceStbin a compact form at the start of each block. The blocks can have any 225*36f395ceStbsize (except that the compressed data for one block must fit in 226*36f395ceStbavailable memory). A block is terminated when deflate() determines that 227*36f395ceStbit would be useful to start another block with fresh trees. (This is 228*36f395ceStbsomewhat similar to the behavior of LZW-based _compress_.) 229*36f395ceStb 230*36f395ceStbDuplicated strings are found using a hash table. All input strings of 231*36f395ceStblength 3 are inserted in the hash table. A hash index is computed for 232*36f395ceStbthe next 3 bytes. If the hash chain for this index is not empty, all 233*36f395ceStbstrings in the chain are compared with the current input string, and 234*36f395ceStbthe longest match is selected. 235*36f395ceStb 236*36f395ceStbThe hash chains are searched starting with the most recent strings, to 237*36f395ceStbfavor small distances and thus take advantage of the Huffman encoding. 238*36f395ceStbThe hash chains are singly linked. There are no deletions from the 239*36f395ceStbhash chains, the algorithm simply discards matches that are too old. 240*36f395ceStb 241*36f395ceStbTo avoid a worst-case situation, very long hash chains are arbitrarily 242*36f395ceStbtruncated at a certain length, determined by a runtime option (level 243*36f395ceStbparameter of deflateInit). So deflate() does not always find the longest 244*36f395ceStbpossible match but generally finds a match which is long enough. 245*36f395ceStb 246*36f395ceStbdeflate() also defers the selection of matches with a lazy evaluation 247*36f395ceStbmechanism. After a match of length N has been found, deflate() searches for 248*36f395ceStba longer match at the next input byte. If a longer match is found, the 249*36f395ceStbprevious match is truncated to a length of one (thus producing a single 250*36f395ceStbliteral byte) and the process of lazy evaluation begins again. Otherwise, 251*36f395ceStbthe original match is kept, and the next match search is attempted only N 252*36f395ceStbsteps later. 253*36f395ceStb 254*36f395ceStbThe lazy match evaluation is also subject to a runtime parameter. If 255*36f395ceStbthe current match is long enough, deflate() reduces the search for a longer 256*36f395ceStbmatch, thus speeding up the whole process. If compression ratio is more 257*36f395ceStbimportant than speed, deflate() attempts a complete second search even if 258*36f395ceStbthe first match is already long enough. 259*36f395ceStb 260*36f395ceStbThe lazy match evaluation is not performed for the fastest compression 261*36f395ceStbmodes (level parameter 1 to 3). For these fast modes, new strings 262*36f395ceStbare inserted in the hash table only when no match was found, or 263*36f395ceStbwhen the match is not too long. This degrades the compression ratio 264*36f395ceStbbut saves time since there are both fewer insertions and fewer searches. 265*36f395ceStb 266*36f395ceStb 267*36f395ceStb2. Decompression algorithm (inflate) 268*36f395ceStb 269*36f395ceStb2.1 Introduction 270*36f395ceStb 271*36f395ceStbThe key question is how to represent a Huffman code (or any prefix code) so 272*36f395ceStbthat you can decode fast. The most important characteristic is that shorter 273*36f395ceStbcodes are much more common than longer codes, so pay attention to decoding the 274*36f395ceStbshort codes fast, and let the long codes take longer to decode. 275*36f395ceStb 276*36f395ceStbinflate() sets up a first level table that covers some number of bits of 277*36f395ceStbinput less than the length of longest code. It gets that many bits from the 278*36f395ceStbstream, and looks it up in the table. The table will tell if the next 279*36f395ceStbcode is that many bits or less and how many, and if it is, it will tell 280*36f395ceStbthe value, else it will point to the next level table for which inflate() 281*36f395ceStbgrabs more bits and tries to decode a longer code. 282*36f395ceStb 283*36f395ceStbHow many bits to make the first lookup is a tradeoff between the time it 284*36f395ceStbtakes to decode and the time it takes to build the table. If building the 285*36f395ceStbtable took no time (and if you had infinite memory), then there would only 286*36f395ceStbbe a first level table to cover all the way to the longest code. However, 287*36f395ceStbbuilding the table ends up taking a lot longer for more bits since short 288*36f395ceStbcodes are replicated many times in such a table. What inflate() does is 289*36f395ceStbsimply to make the number of bits in the first table a variable, and then 290*36f395ceStbto set that variable for the maximum speed. 291*36f395ceStb 292*36f395ceStbFor inflate, which has 286 possible codes for the literal/length tree, the size 293*36f395ceStbof the first table is nine bits. Also the distance trees have 30 possible 294*36f395ceStbvalues, and the size of the first table is six bits. Note that for each of 295*36f395ceStbthose cases, the table ended up one bit longer than the ``average'' code 296*36f395ceStblength, i.e. the code length of an approximately flat code which would be a 297*36f395ceStblittle more than eight bits for 286 symbols and a little less than five bits 298*36f395ceStbfor 30 symbols. 299*36f395ceStb 300*36f395ceStb 301*36f395ceStb2.2 More details on the inflate table lookup 302*36f395ceStb 303*36f395ceStbOk, you want to know what this cleverly obfuscated inflate tree actually 304*36f395ceStblooks like. You are correct that it's not a Huffman tree. It is simply a 305*36f395ceStblookup table for the first, let's say, nine bits of a Huffman symbol. The 306*36f395ceStbsymbol could be as short as one bit or as long as 15 bits. If a particular 307*36f395ceStbsymbol is shorter than nine bits, then that symbol's translation is duplicated 308*36f395ceStbin all those entries that start with that symbol's bits. For example, if the 309*36f395ceStbsymbol is four bits, then it's duplicated 32 times in a nine-bit table. If a 310*36f395ceStbsymbol is nine bits long, it appears in the table once. 311*36f395ceStb 312*36f395ceStbIf the symbol is longer than nine bits, then that entry in the table points 313*36f395ceStbto another similar table for the remaining bits. Again, there are duplicated 314*36f395ceStbentries as needed. The idea is that most of the time the symbol will be short 315*36f395ceStband there will only be one table look up. (That's whole idea behind data 316*36f395ceStbcompression in the first place.) For the less frequent long symbols, there 317*36f395ceStbwill be two lookups. If you had a compression method with really long 318*36f395ceStbsymbols, you could have as many levels of lookups as is efficient. For 319*36f395ceStbinflate, two is enough. 320*36f395ceStb 321*36f395ceStbSo a table entry either points to another table (in which case nine bits in 322*36f395ceStbthe above example are gobbled), or it contains the translation for the symbol 323*36f395ceStband the number of bits to gobble. Then you start again with the next 324*36f395ceStbungobbled bit. 325*36f395ceStb 326*36f395ceStbYou may wonder: why not just have one lookup table for how ever many bits the 327*36f395ceStblongest symbol is? The reason is that if you do that, you end up spending 328*36f395ceStbmore time filling in duplicate symbol entries than you do actually decoding. 329*36f395ceStbAt least for deflate's output that generates new trees every several 10's of 330*36f395ceStbkbytes. You can imagine that filling in a 2^15 entry table for a 15-bit code 331*36f395ceStbwould take too long if you're only decoding several thousand symbols. At the 332*36f395ceStbother extreme, you could make a new table for every bit in the code. In fact, 333*36f395ceStbthat's essentially a Huffman tree. But then you spend two much time 334*36f395ceStbtraversing the tree while decoding, even for short symbols. 335*36f395ceStb 336*36f395ceStbSo the number of bits for the first lookup table is a trade of the time to 337*36f395ceStbfill out the table vs. the time spent looking at the second level and above of 338*36f395ceStbthe table. 339*36f395ceStb 340*36f395ceStbHere is an example, scaled down: 341*36f395ceStb 342*36f395ceStbThe code being decoded, with 10 symbols, from 1 to 6 bits long: 343*36f395ceStb 344*36f395ceStbA: 0 345*36f395ceStbB: 10 346*36f395ceStbC: 1100 347*36f395ceStbD: 11010 348*36f395ceStbE: 11011 349*36f395ceStbF: 11100 350*36f395ceStbG: 11101 351*36f395ceStbH: 11110 352*36f395ceStbI: 111110 353*36f395ceStbJ: 111111 354*36f395ceStb 355*36f395ceStbLet's make the first table three bits long (eight entries): 356*36f395ceStb 357*36f395ceStb000: A,1 358*36f395ceStb001: A,1 359*36f395ceStb010: A,1 360*36f395ceStb011: A,1 361*36f395ceStb100: B,2 362*36f395ceStb101: B,2 363*36f395ceStb110: -> table X (gobble 3 bits) 364*36f395ceStb111: -> table Y (gobble 3 bits) 365*36f395ceStb 366*36f395ceStbEach entry is what the bits decode as and how many bits that is, i.e. how 367*36f395ceStbmany bits to gobble. Or the entry points to another table, with the number of 368*36f395ceStbbits to gobble implicit in the size of the table. 369*36f395ceStb 370*36f395ceStbTable X is two bits long since the longest code starting with 110 is five bits 371*36f395ceStblong: 372*36f395ceStb 373*36f395ceStb00: C,1 374*36f395ceStb01: C,1 375*36f395ceStb10: D,2 376*36f395ceStb11: E,2 377*36f395ceStb 378*36f395ceStbTable Y is three bits long since the longest code starting with 111 is six 379*36f395ceStbbits long: 380*36f395ceStb 381*36f395ceStb000: F,2 382*36f395ceStb001: F,2 383*36f395ceStb010: G,2 384*36f395ceStb011: G,2 385*36f395ceStb100: H,2 386*36f395ceStb101: H,2 387*36f395ceStb110: I,3 388*36f395ceStb111: J,3 389*36f395ceStb 390*36f395ceStbSo what we have here are three tables with a total of 20 entries that had to 391*36f395ceStbbe constructed. That's compared to 64 entries for a single table. Or 392*36f395ceStbcompared to 16 entries for a Huffman tree (six two entry tables and one four 393*36f395ceStbentry table). Assuming that the code ideally represents the probability of 394*36f395ceStbthe symbols, it takes on the average 1.25 lookups per symbol. That's compared 395*36f395ceStbto one lookup for the single table, or 1.66 lookups per symbol for the 396*36f395ceStbHuffman tree. 397*36f395ceStb 398*36f395ceStbThere, I think that gives you a picture of what's going on. For inflate, the 399*36f395ceStbmeaning of a particular symbol is often more than just a letter. It can be a 400*36f395ceStbbyte (a "literal"), or it can be either a length or a distance which 401*36f395ceStbindicates a base value and a number of bits to fetch after the code that is 402*36f395ceStbadded to the base value. Or it might be the special end-of-block code. The 403*36f395ceStbdata structures created in inftrees.c try to encode all that information 404*36f395ceStbcompactly in the tables. 405*36f395ceStb 406*36f395ceStb 407*36f395ceStbJean-loup Gailly Mark Adler 408*36f395ceStbjloup@gzip.org madler@alumni.caltech.edu 409*36f395ceStb 410*36f395ceStb 411*36f395ceStbReferences: 412*36f395ceStb 413*36f395ceStb[LZ77] Ziv J., Lempel A., ``A Universal Algorithm for Sequential Data 414*36f395ceStbCompression,'' IEEE Transactions on Information Theory, Vol. 23, No. 3, 415*36f395ceStbpp. 337-343. 416*36f395ceStb 417*36f395ceStb``DEFLATE Compressed Data Format Specification'' available in 418*36f395ceStbhttp://www.ietf.org/rfc/rfc1951.txt 419*36f395ceStb1. Compression algorithm (deflate) 420*36f395ceStb 421*36f395ceStbThe deflation algorithm used by gzip (also zip and zlib) is a variation of 422*36f395ceStbLZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in 423*36f395ceStbthe input data. The second occurrence of a string is replaced by a 424*36f395ceStbpointer to the previous string, in the form of a pair (distance, 425*36f395ceStblength). Distances are limited to 32K bytes, and lengths are limited 426*36f395ceStbto 258 bytes. When a string does not occur anywhere in the previous 427*36f395ceStb32K bytes, it is emitted as a sequence of literal bytes. (In this 428*36f395ceStbdescription, `string' must be taken as an arbitrary sequence of bytes, 429*36f395ceStband is not restricted to printable characters.) 430*36f395ceStb 431*36f395ceStbLiterals or match lengths are compressed with one Huffman tree, and 432*36f395ceStbmatch distances are compressed with another tree. The trees are stored 433*36f395ceStbin a compact form at the start of each block. The blocks can have any 434*36f395ceStbsize (except that the compressed data for one block must fit in 435*36f395ceStbavailable memory). A block is terminated when deflate() determines that 436*36f395ceStbit would be useful to start another block with fresh trees. (This is 437*36f395ceStbsomewhat similar to the behavior of LZW-based _compress_.) 438*36f395ceStb 439*36f395ceStbDuplicated strings are found using a hash table. All input strings of 440*36f395ceStblength 3 are inserted in the hash table. A hash index is computed for 441*36f395ceStbthe next 3 bytes. If the hash chain for this index is not empty, all 442*36f395ceStbstrings in the chain are compared with the current input string, and 443*36f395ceStbthe longest match is selected. 444*36f395ceStb 445*36f395ceStbThe hash chains are searched starting with the most recent strings, to 446*36f395ceStbfavor small distances and thus take advantage of the Huffman encoding. 447*36f395ceStbThe hash chains are singly linked. There are no deletions from the 448*36f395ceStbhash chains, the algorithm simply discards matches that are too old. 449*36f395ceStb 450*36f395ceStbTo avoid a worst-case situation, very long hash chains are arbitrarily 451*36f395ceStbtruncated at a certain length, determined by a runtime option (level 452*36f395ceStbparameter of deflateInit). So deflate() does not always find the longest 453*36f395ceStbpossible match but generally finds a match which is long enough. 454*36f395ceStb 455*36f395ceStbdeflate() also defers the selection of matches with a lazy evaluation 456*36f395ceStbmechanism. After a match of length N has been found, deflate() searches for 457*36f395ceStba longer match at the next input byte. If a longer match is found, the 458*36f395ceStbprevious match is truncated to a length of one (thus producing a single 459*36f395ceStbliteral byte) and the process of lazy evaluation begins again. Otherwise, 460*36f395ceStbthe original match is kept, and the next match search is attempted only N 461*36f395ceStbsteps later. 462*36f395ceStb 463*36f395ceStbThe lazy match evaluation is also subject to a runtime parameter. If 464*36f395ceStbthe current match is long enough, deflate() reduces the search for a longer 465*36f395ceStbmatch, thus speeding up the whole process. If compression ratio is more 466*36f395ceStbimportant than speed, deflate() attempts a complete second search even if 467*36f395ceStbthe first match is already long enough. 468*36f395ceStb 469*36f395ceStbThe lazy match evaluation is not performed for the fastest compression 470*36f395ceStbmodes (level parameter 1 to 3). For these fast modes, new strings 471*36f395ceStbare inserted in the hash table only when no match was found, or 472*36f395ceStbwhen the match is not too long. This degrades the compression ratio 473*36f395ceStbbut saves time since there are both fewer insertions and fewer searches. 474*36f395ceStb 475*36f395ceStb 476*36f395ceStb2. Decompression algorithm (inflate) 477*36f395ceStb 478*36f395ceStb2.1 Introduction 479*36f395ceStb 480*36f395ceStbThe key question is how to represent a Huffman code (or any prefix code) so 481*36f395ceStbthat you can decode fast. The most important characteristic is that shorter 482*36f395ceStbcodes are much more common than longer codes, so pay attention to decoding the 483*36f395ceStbshort codes fast, and let the long codes take longer to decode. 484*36f395ceStb 485*36f395ceStbinflate() sets up a first level table that covers some number of bits of 486*36f395ceStbinput less than the length of longest code. It gets that many bits from the 487*36f395ceStbstream, and looks it up in the table. The table will tell if the next 488*36f395ceStbcode is that many bits or less and how many, and if it is, it will tell 489*36f395ceStbthe value, else it will point to the next level table for which inflate() 490*36f395ceStbgrabs more bits and tries to decode a longer code. 491*36f395ceStb 492*36f395ceStbHow many bits to make the first lookup is a tradeoff between the time it 493*36f395ceStbtakes to decode and the time it takes to build the table. If building the 494*36f395ceStbtable took no time (and if you had infinite memory), then there would only 495*36f395ceStbbe a first level table to cover all the way to the longest code. However, 496*36f395ceStbbuilding the table ends up taking a lot longer for more bits since short 497*36f395ceStbcodes are replicated many times in such a table. What inflate() does is 498*36f395ceStbsimply to make the number of bits in the first table a variable, and then 499*36f395ceStbto set that variable for the maximum speed. 500*36f395ceStb 501*36f395ceStbFor inflate, which has 286 possible codes for the literal/length tree, the size 502*36f395ceStbof the first table is nine bits. Also the distance trees have 30 possible 503*36f395ceStbvalues, and the size of the first table is six bits. Note that for each of 504*36f395ceStbthose cases, the table ended up one bit longer than the ``average'' code 505*36f395ceStblength, i.e. the code length of an approximately flat code which would be a 506*36f395ceStblittle more than eight bits for 286 symbols and a little less than five bits 507*36f395ceStbfor 30 symbols. 508*36f395ceStb 509*36f395ceStb 510*36f395ceStb2.2 More details on the inflate table lookup 511*36f395ceStb 512*36f395ceStbOk, you want to know what this cleverly obfuscated inflate tree actually 513*36f395ceStblooks like. You are correct that it's not a Huffman tree. It is simply a 514*36f395ceStblookup table for the first, let's say, nine bits of a Huffman symbol. The 515*36f395ceStbsymbol could be as short as one bit or as long as 15 bits. If a particular 516*36f395ceStbsymbol is shorter than nine bits, then that symbol's translation is duplicated 517*36f395ceStbin all those entries that start with that symbol's bits. For example, if the 518*36f395ceStbsymbol is four bits, then it's duplicated 32 times in a nine-bit table. If a 519*36f395ceStbsymbol is nine bits long, it appears in the table once. 520*36f395ceStb 521*36f395ceStbIf the symbol is longer than nine bits, then that entry in the table points 522*36f395ceStbto another similar table for the remaining bits. Again, there are duplicated 523*36f395ceStbentries as needed. The idea is that most of the time the symbol will be short 524*36f395ceStband there will only be one table look up. (That's whole idea behind data 525*36f395ceStbcompression in the first place.) For the less frequent long symbols, there 526*36f395ceStbwill be two lookups. If you had a compression method with really long 527*36f395ceStbsymbols, you could have as many levels of lookups as is efficient. For 528*36f395ceStbinflate, two is enough. 529*36f395ceStb 530*36f395ceStbSo a table entry either points to another table (in which case nine bits in 531*36f395ceStbthe above example are gobbled), or it contains the translation for the symbol 532*36f395ceStband the number of bits to gobble. Then you start again with the next 533*36f395ceStbungobbled bit. 534*36f395ceStb 535*36f395ceStbYou may wonder: why not just have one lookup table for how ever many bits the 536*36f395ceStblongest symbol is? The reason is that if you do that, you end up spending 537*36f395ceStbmore time filling in duplicate symbol entries than you do actually decoding. 538*36f395ceStbAt least for deflate's output that generates new trees every several 10's of 539*36f395ceStbkbytes. You can imagine that filling in a 2^15 entry table for a 15-bit code 540*36f395ceStbwould take too long if you're only decoding several thousand symbols. At the 541*36f395ceStbother extreme, you could make a new table for every bit in the code. In fact, 542*36f395ceStbthat's essentially a Huffman tree. But then you spend two much time 543*36f395ceStbtraversing the tree while decoding, even for short symbols. 544*36f395ceStb 545*36f395ceStbSo the number of bits for the first lookup table is a trade of the time to 546*36f395ceStbfill out the table vs. the time spent looking at the second level and above of 547*36f395ceStbthe table. 548*36f395ceStb 549*36f395ceStbHere is an example, scaled down: 550*36f395ceStb 551*36f395ceStbThe code being decoded, with 10 symbols, from 1 to 6 bits long: 552*36f395ceStb 553*36f395ceStbA: 0 554*36f395ceStbB: 10 555*36f395ceStbC: 1100 556*36f395ceStbD: 11010 557*36f395ceStbE: 11011 558*36f395ceStbF: 11100 559*36f395ceStbG: 11101 560*36f395ceStbH: 11110 561*36f395ceStbI: 111110 562*36f395ceStbJ: 111111 563*36f395ceStb 564*36f395ceStbLet's make the first table three bits long (eight entries): 565*36f395ceStb 566*36f395ceStb000: A,1 567*36f395ceStb001: A,1 568*36f395ceStb010: A,1 569*36f395ceStb011: A,1 570*36f395ceStb100: B,2 571*36f395ceStb101: B,2 572*36f395ceStb110: -> table X (gobble 3 bits) 573*36f395ceStb111: -> table Y (gobble 3 bits) 574*36f395ceStb 575*36f395ceStbEach entry is what the bits decode as and how many bits that is, i.e. how 576*36f395ceStbmany bits to gobble. Or the entry points to another table, with the number of 577*36f395ceStbbits to gobble implicit in the size of the table. 578*36f395ceStb 579*36f395ceStbTable X is two bits long since the longest code starting with 110 is five bits 580*36f395ceStblong: 581*36f395ceStb 582*36f395ceStb00: C,1 583*36f395ceStb01: C,1 584*36f395ceStb10: D,2 585*36f395ceStb11: E,2 586*36f395ceStb 587*36f395ceStbTable Y is three bits long since the longest code starting with 111 is six 588*36f395ceStbbits long: 589*36f395ceStb 590*36f395ceStb000: F,2 591*36f395ceStb001: F,2 592*36f395ceStb010: G,2 593*36f395ceStb011: G,2 594*36f395ceStb100: H,2 595*36f395ceStb101: H,2 596*36f395ceStb110: I,3 597*36f395ceStb111: J,3 598*36f395ceStb 599*36f395ceStbSo what we have here are three tables with a total of 20 entries that had to 600*36f395ceStbbe constructed. That's compared to 64 entries for a single table. Or 601*36f395ceStbcompared to 16 entries for a Huffman tree (six two entry tables and one four 602*36f395ceStbentry table). Assuming that the code ideally represents the probability of 603*36f395ceStbthe symbols, it takes on the average 1.25 lookups per symbol. That's compared 604*36f395ceStbto one lookup for the single table, or 1.66 lookups per symbol for the 605*36f395ceStbHuffman tree. 606*36f395ceStb 607*36f395ceStbThere, I think that gives you a picture of what's going on. For inflate, the 608*36f395ceStbmeaning of a particular symbol is often more than just a letter. It can be a 609*36f395ceStbbyte (a "literal"), or it can be either a length or a distance which 610*36f395ceStbindicates a base value and a number of bits to fetch after the code that is 611*36f395ceStbadded to the base value. Or it might be the special end-of-block code. The 612*36f395ceStbdata structures created in inftrees.c try to encode all that information 613*36f395ceStbcompactly in the tables. 614*36f395ceStb 615*36f395ceStb 616*36f395ceStbJean-loup Gailly Mark Adler 617*36f395ceStbjloup@gzip.org madler@alumni.caltech.edu 618*36f395ceStb 619*36f395ceStb 620*36f395ceStbReferences: 621*36f395ceStb 622*36f395ceStb[LZ77] Ziv J., Lempel A., ``A Universal Algorithm for Sequential Data 623*36f395ceStbCompression,'' IEEE Transactions on Information Theory, Vol. 23, No. 3, 624*36f395ceStbpp. 337-343. 625*36f395ceStb 626*36f395ceStb``DEFLATE Compressed Data Format Specification'' available in 627*36f395ceStbhttp://www.ietf.org/rfc/rfc1951.txt 628