.\" $NetBSD: bzip2.1,v 1.5 2019/07/21 21:07:12 wiz Exp $
.\"
.Dd July 13, 2019
.Dt BZIP2 1
.Os
.Sh NAME
.Nm bzip2 ,
.Nm bunzip2 ,
.Nm bzcat ,
.Nm bzip2recover
.Nd block-sorting file compressor
.Sh SYNOPSIS
.Nm bzip2
.Op Fl 123456789cdfkLqstVvz
.Op Ar filename Ar
.Pp
.Nm bunzip2
.Op Fl fkLVvs
.Op Ar filename Ar
.Pp
.Nm bzcat
.Op Fl s
.Op Ar filename Ar
.Pp
.Nm bzip2recover
.Ar filename
.Sh DESCRIPTION
.Nm bzip2
compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding.
Compression is generally considerably better than that achieved by
more conventional LZ77/LZ78-based compressors, and approaches the
performance of the PPM family of statistical compressors.
.Pp
.Nm bzcat
decompresses files to stdout, and
.Nm bzip2recover
recovers data from damaged bzip2 files.
.Pp
The command-line options are deliberately very similar to
those of
.Xr gzip 1 ,
but they are not identical.
.Pp
.Nm bzip2
expects a list of file names to accompany the command-line flags.
Each file is replaced by a compressed version of
itself, with the name
.Dq Pa original_name.bz2 .
Each compressed file has the same modification date, permissions, and,
when possible, ownership as the corresponding original, so that these
properties can be correctly restored at decompression time.
File name handling is naive in the sense that there is no mechanism
for preserving original file names, permissions, ownerships or dates
in filesystems which lack these concepts, or have serious file name
length restrictions, such as
.Tn MS-DOS .
.Nm bzip2
and
.Nm bunzip2
will by default not overwrite existing files.
If you want this to happen, specify the
.Fl f
flag.
.Pp
If no file names are specified,
.Nm bzip2
compresses from standard input to standard output.
In this case,
.Nm bzip2
will decline to write compressed output to a terminal, as this would
be entirely incomprehensible and therefore pointless.
.Pp
.Nm bunzip2
(or
.Nm bzip2 Fl d )
decompresses all specified files.
Files which were not created by
.Nm bzip2
will be detected and ignored, and a warning issued.
.Nm bzip2
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:
.Bl -column "filename.tbz2" "becomes" -offset indent
.It Pa filename.bz2 Ta becomes Ta Pa filename
.It Pa filename.bz Ta becomes Ta Pa filename
.It Pa filename.tbz2 Ta becomes Ta Pa filename.tar
.It Pa filename.tbz Ta becomes Ta Pa filename.tar
.It Pa anyothername Ta becomes Ta Pa anyothername.out
.El
.Pp
If the file does not end in one of the recognised endings,
.Pa .bz2 ,
.Pa .bz ,
.Pa .tbz2 ,
or
.Pa .tbz ,
.Nm bzip2
complains that it cannot guess the name of the original file, and uses
the original name with
.Pa .out
appended.
.Pp
As with compression, supplying no filenames causes decompression from
standard input to standard output.
.Pp
.Nm bunzip2
will correctly decompress a file which is the concatenation of two or
more compressed files.
The result is the concatenation of the corresponding uncompressed
files.
Integrity testing
.Pq Fl t
of concatenated compressed files is also supported.
.Pp
You can also compress or decompress files to the standard output by
giving the
.Fl c
flag.
Multiple files may be compressed and decompressed like this.
The resulting outputs are fed sequentially to stdout.
Compression of multiple files in this manner generates a stream
containing multiple compressed file representations.
Such a stream can be decompressed correctly only by
.Nm bzip2
version 0.9.0 or later.
Earlier versions of
.Nm bzip2
will stop after decompressing
the first file in the stream.
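.Pp
The concatenation behaviour described above can be checked with a short
sketch.
It is illustrative only and is not part of
.Nm bzip2 :
it assumes Python 3.3 or later, whose standard
.Pa bz2
module accepts multi-stream input, and the sample byte strings are
arbitrary.
.Bd -literal -offset indent
```python
import bz2

# Compress two inputs separately, as two bzip2 invocations would.
first = bz2.compress(b"hello, ")
second = bz2.compress(b"world")

# Joining the compressed streams mimics concatenating two .bz2 files.
combined = first + second

# Decompressing the joined stream yields the joined originals.
restored = bz2.decompress(combined)
```
.Ed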
.Pp
.Nm bzcat
(or
.Nm bzip2 Fl dc )
decompresses all specified files to the standard output.
.Pp
Compression is always performed, even if the compressed file is
slightly larger than the original.
Files of less than about one hundred bytes tend to get larger, since
the compression mechanism has a constant overhead in the region of 50
bytes.
Random data (including the output of most file compressors) is coded
at about 8.05 bits per byte, giving an expansion of around 0.5%.
.Pp
As a self-check for your protection,
.Nm bzip2
uses 32-bit CRCs to make sure that the decompressed version of a file
is identical to the original.
This guards against corruption of the compressed data, and against
undetected bugs in
.Nm bzip2
(hopefully very unlikely).
The chance of data corruption going undetected is microscopic, about
one chance in four billion for each file processed.
Be aware, though, that the check occurs upon decompression, so it can
only tell you that something is wrong.
It can't help you recover the original uncompressed data.
You can use
.Nm bzip2recover
to try to recover data from
damaged files.
.Sh OPTIONS
.Bl -tag -width "XXrepetitiveXfastXX"
.It Fl Fl
Treats all subsequent arguments as file names, even if they start with
a dash.
This is so you can handle files with names beginning with a dash, for
example:
.Dl bzip2 -- -myfilename .
.It Fl 1 , Fl Fl fast
to
.It Fl 9 , Fl Fl best
Set the block size to 100 k, 200 k ... 900 k when compressing.
Has no effect when decompressing.
See
.Sx MEMORY MANAGEMENT
below.
The
.Fl Fl fast
and
.Fl Fl best
aliases are primarily for GNU
.Xr gzip 1
compatibility.
In particular,
.Fl Fl fast
doesn't make things significantly faster, and
.Fl Fl best
merely selects the default behaviour.
.It Fl c , Fl Fl stdout
Compress or decompress to standard output.
.It Fl d , Fl Fl decompress
Force decompression.
.Nm bzip2 ,
.Nm bunzip2 ,
and
.Nm bzcat
are really the same program, and the decision about what actions to
take is done on the basis of which name is used.
This flag overrides that mechanism, and forces
.Nm bzip2
to decompress.
.It Fl f , Fl Fl force
Force overwrite of output files.
Normally,
.Nm bzip2
will not overwrite existing output files.
Also forces
.Nm bzip2
to break hard links
to files, which it otherwise wouldn't do.
.Pp
.Nm bzip2
normally declines to decompress files which don't have the correct
magic header bytes.
If forced
.Pq Fl f ,
however, it will pass such files through unmodified.
This is how GNU
.Xr gzip 1
behaves.
.It Fl k , Fl Fl keep
Keep (don't delete) input files during compression
or decompression.
.It Fl L , Fl Fl license
Display the license terms and conditions.
.It Fl q , Fl Fl quiet
Suppress non-essential warning messages.
Messages pertaining to I/O errors and other critical events will not
be suppressed.
.It Fl Fl repetitive-fast
.It Fl Fl repetitive-best
These flags are redundant in versions 0.9.5 and above.
They provided some coarse control over the behaviour of the sorting
algorithm in earlier versions, which was sometimes useful.
0.9.5 and above have an improved algorithm which renders these flags
irrelevant.
.It Fl s , Fl Fl small
Reduce memory usage, for compression, decompression and testing.
Files are decompressed and tested using a modified algorithm which
only requires 2.5 bytes per block byte.
This means any file can be decompressed in 2300k of memory, albeit at
about half the normal speed.
During compression,
.Fl s
selects a block size of 200k, which limits memory use to around the
same figure, at the expense of your compression ratio.
In short, if your machine is low on memory (8 megabytes or less), use
.Fl s
for everything.
See
.Sx MEMORY MANAGEMENT
below.
.It Fl t , Fl Fl test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.It Fl V , Fl Fl version
Display the software version.
.It Fl v , Fl Fl verbose
Verbose mode: show the compression ratio for each file processed.
Further
.Fl v Ap s
increase the verbosity level, spewing out lots of information which is
primarily of interest for diagnostic purposes.
.It Fl z , Fl Fl compress
The complement to
.Fl d :
forces compression, regardless of the invocation name.
.El
.Ss MEMORY MANAGEMENT
.Nm bzip2
compresses large files in blocks.
The block size affects both the compression ratio achieved, and the
amount of memory needed for compression and decompression.
The flags
.Fl 1
through
.Fl 9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively.
At decompression time, the block size used for compression is read
from the header of the compressed file, and
.Nm bunzip2
then allocates itself just enough memory to decompress the file.
Since block sizes are stored in compressed files, it follows that the
flags
.Fl 1
to
.Fl 9
are irrelevant to and so ignored during decompression.
.Pp
Compression and decompression requirements, in bytes, can be estimated
as:
.Bl -tag -width "Decompression:" -offset indent
.It Compression :
400k + ( 8 x block size )
.It Decompression :
100k + ( 4 x block size ), or 100k + ( 2.5 x block size )
.El
Larger block sizes give rapidly diminishing marginal returns.
Most of the compression comes from the first two or three hundred k of
block size, a fact worth bearing in mind when using
.Nm bzip2
on small machines.
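.Pp
The estimates above can be worked through with a small sketch.
It is not code from
.Nm bzip2 ;
the function names are hypothetical, and it assumes that
.Dq k
means 1000 bytes, as the 100,000 to 900,000 byte block sizes suggest.
.Bd -literal -offset indent
```python
def block_size(level):
    """Block size in bytes for the flags -1 through -9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return level * 100_000


def compression_memory(level):
    """Estimated bytes for compression: 400k + 8 x block size."""
    return 400_000 + 8 * block_size(level)


def decompression_memory(level, small=False):
    """Estimated bytes for decompression.

    Normal mode: 100k + 4 x block size.
    Small mode (the -s flag): 100k + 2.5 x block size.
    """
    block = block_size(level)
    if small:
        # 2.5 x block, kept in integer arithmetic.
        return 100_000 + (5 * block) // 2
    return 100_000 + 4 * block
```
.Ed
.Pp
Under these assumptions,
.Li compression_memory(9)
evaluates to 7,600,000 bytes and
.Li decompression_memory(9)
to 3,700,000 bytes.
.Pp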
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.
.Pp
For files compressed with the default 900k block size,
.Nm bunzip2
will require about 3700 kbytes to decompress.
To support decompression of any file on a 4 megabyte machine,
.Nm bunzip2
has an option to decompress using approximately half this amount of
memory, about 2300 kbytes.
Decompression speed is also halved, so you should use this option only
where necessary.
The relevant flag is
.Fl s .
.Pp
In general, try to use the largest block size memory constraints
allow, since that maximises the compression achieved.
Compression and decompression speed are virtually unaffected by block
size.
.Pp
Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size.
The amount of real memory touched is proportional to the size of the
file, since the file is smaller than a block.
For example, compressing a file 20,000 bytes long with the flag
.Fl 9
will cause the compressor to allocate around 7600k of memory, but only
touch 400k + 20000 * 8 = 560 kbytes of it.
Similarly, the decompressor will allocate 3700k but only touch 100k +
20000 * 4 = 180 kbytes.
.Pp
Here is a table which summarises the maximum memory usage for different
block sizes.
Also recorded is the total compressed size for 14 files of the Calgary
Text Compression Corpus totalling 3,141,622 bytes.
This column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes
for larger files, since the Corpus is dominated by smaller files.
.Bl -column "Flag" "Compression" "Decompression" "DecompressionXXs" "Corpus size"
.It Sy Flag Ta Sy Compression Ta Sy Decompression Ta Sy Decompression Fl s Ta Sy Corpus size
.It -1 Ta 1200k Ta 500k Ta 350k Ta 914704
.It -2 Ta 2000k Ta 900k Ta 600k Ta 877703
.It -3 Ta 2800k Ta 1300k Ta 850k Ta 860338
.It -4 Ta 3600k Ta 1700k Ta 1100k Ta 846899
.It -5 Ta 4400k Ta 2100k Ta 1350k Ta 845160
.It -6 Ta 5200k Ta 2500k Ta 1600k Ta 838626
.It -7 Ta 6100k Ta 2900k Ta 1850k Ta 834096
.It -8 Ta 6800k Ta 3300k Ta 2100k Ta 828642
.It -9 Ta 7600k Ta 3700k Ta 2350k Ta 828642
.El
.Ss RECOVERING DATA FROM DAMAGED FILES
.Nm bzip2
compresses files in blocks, usually 900kbytes long.
Each block is handled independently.
If a media or transmission error causes a multi-block
.Pa .bz2
file to become damaged, it may be possible to recover data from the
undamaged blocks in the file.
.Pp
The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty.
Each block also carries its own 32-bit CRC, so damaged blocks can be
distinguished from undamaged ones.
.Pp
.Nm bzip2recover
is a simple program whose purpose is to search for blocks in
.Pa .bz2
files, and write each block out into its own
.Pa .bz2
file.
You can then use
.Nm bzip2
.Fl t
to test the integrity of the resulting files, and decompress those
which are undamaged.
.Pp
.Nm bzip2recover
takes a single argument, the name of the damaged file, and writes a
number of files
.Dq Pa rec00001file.bz2 ,
.Dq Pa rec00002file.bz2 ,
etc., containing the extracted blocks.
The output filenames are designed so that the use of wildcards in
subsequent processing -- for example,
.Dl bzip2 -dc rec*file.bz2 \*[Gt] recovered_data
-- processes the files in the correct order.
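.Pp
The same reassembly step can be sketched outside the shell.
The following is an illustration under stated assumptions, not a
replacement for
.Nm bzip2recover :
it uses Python's standard
.Pa bz2
and
.Pa glob
modules, the function name is hypothetical, and it simply skips any
extracted file that fails to decompress.
.Bd -literal -offset indent
```python
import bz2
import glob


def join_recovered_blocks(pattern):
    """Decompress every file matching pattern, in sorted name order.

    Returns (data, skipped): the concatenated output of the readable
    files, and the names of the files that failed to decompress.
    """
    pieces = []
    skipped = []
    # Sorting reproduces the order of the zero-padded names
    # rec00001file.bz2, rec00002file.bz2, and so on.
    for name in sorted(glob.glob(pattern)):
        with open(name, "rb") as handle:
            raw = handle.read()
        try:
            pieces.append(bz2.decompress(raw))
        except (OSError, ValueError):
            # Damaged or truncated block: leave it out of the result.
            skipped.append(name)
    return b"".join(pieces), skipped
```
.Ed
.Pp
A call such as
.Li join_recovered_blocks("rec*file.bz2")
returns the recovered bytes together with the list of files it had to
skip.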
.Pp
.Nm bzip2recover
should be of most use dealing with large
.Pa .bz2
files, as these will contain many blocks.
It is clearly futile to use it on damaged single-block files, since a
damaged block cannot be recovered.
If you wish to minimise any potential data loss through media or
transmission errors, you might consider compressing with a smaller
block size.
.Ss PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in
the file.
Because of this, files containing very long runs of repeated
symbols, like
.Dq aabaabaabaab...
(repeated several hundred times) may compress more slowly than normal.
Versions 0.9.5 and above fare much better than previous versions in
this respect.
The ratio between worst-case and average-case compression time is in
the region of 10:1.
For previous versions, this figure was more like 100:1.
You can use the
.Fl vvvv
option to monitor progress in great detail, if you want.
.Pp
Decompression speed is unaffected by these phenomena.
.Pp
.Nm bzip2
usually allocates several megabytes of memory to operate in, and then
charges all over it in a fairly random fashion.
This means that performance, both for compressing and decompressing,
is largely determined by the speed at which your machine can service
cache misses.
Because of this, small changes to the code to reduce the miss rate
have been observed to give disproportionately large performance
improvements.
I imagine
.Nm bzip2
will perform best on machines with very large caches.
.Sh ENVIRONMENT
.Nm bzip2
will read arguments from the environment variables
.Ev BZIP2
and
.Ev BZIP ,
in that order, and will process them before any arguments read from
the command line.
This gives a convenient way to supply default arguments.
.Sh EXIT STATUS
0 for a normal exit, 1 for environmental problems (file not found,
invalid flags, I/O errors, etc.), 2 to indicate a corrupt compressed
file, 3 for an internal consistency error (e.g., bug) which caused
.Nm bzip2
to panic.
.Sh AUTHORS
.An -nosplit
.An Julian Seward
.Aq jseward@bzip.org
.Pp
.Pa http://www.bzip.org
.Pp
The ideas embodied in
.Nm bzip2
are due to (at least) the following people:
.An Michael Burrows
and
.An David Wheeler
(for the block sorting transformation),
.An David Wheeler
(again, for the Huffman coder),
.An Peter Fenwick
(for the structured coding model in the original
.Nm bzip ,
and many refinements), and
.An Alistair Moffat ,
.An Radford Neal ,
and
.An Ian Witten
(for the arithmetic coder in the original
.Nm bzip ) .
I am much indebted for their help, support and advice.
See the manual in the source distribution for pointers to sources of
documentation.
Christian von Roques encouraged me to look for faster sorting
algorithms, so as to speed up compression.
Bela Lubkin encouraged me to improve the worst-case compression
performance.
Donna Robinson XMLised the documentation.
The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped with portability problems, lent
machines, gave advice and were generally helpful.
.Sh CAVEATS
I/O error messages are not as helpful as they could be.
.Nm bzip2
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.
.Pp
This manual page pertains to version 1.0.8 of
.Nm bzip2 .
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the
following exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files.
0.1pl2 cannot do this; it will stop after decompressing just the first
file in the stream.
.Pp
.Nm bzip2recover
versions prior to 1.0.2 used 32-bit integers to represent bit
positions in compressed files, so they could not handle compressed
files more than 512 megabytes long.
Versions 1.0.2 and above use 64-bit ints on some platforms which
support them (GNU supported targets, and Windows).
To establish whether or not
.Nm bzip2recover
was built with such a limitation, run it without arguments.
In any event you can build yourself an unlimited version if you can
recompile it with MaybeUInt64 set to be an unsigned 64-bit integer.
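.Pp
The 512 megabyte figure follows from simple arithmetic, sketched below.
This is a worked illustration rather than code from
.Nm bzip2recover ;
the function name is hypothetical, and it assumes the bit positions are
held in an unsigned integer and that a megabyte is 1024 x 1024 bytes.
.Bd -literal -offset indent
```python
def max_addressable_bytes(position_bits):
    """Largest file size, in bytes, whose bit offsets fit in an
    unsigned integer that is position_bits wide."""
    # 2 ** position_bits distinct bit offsets, at 8 bits per byte.
    return 2 ** position_bits // 8


limit_32 = max_addressable_bytes(32)
limit_32_megabytes = limit_32 // (1024 * 1024)
```
.Ed
.Pp
With these assumptions
.Li limit_32_megabytes
is 512, which matches the limit stated above for versions prior to
1.0.2.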