.\"	$NetBSD: bzip2.1,v 1.5 2019/07/21 21:07:12 wiz Exp $
.\"
.Dd July 13, 2019
.Dt BZIP2 1
.Os
.Sh NAME
.Nm bzip2 ,
.Nm bunzip2 ,
.Nm bzcat ,
.Nm bzip2recover
.Nd block-sorting file compressor
.Sh SYNOPSIS
.Nm bzip2
.Op Fl 123456789cdfkLqstVvz
.Op Ar filename ...
.Pp
.Nm bunzip2
.Op Fl fkLVvs
.Op Ar filename ...
.Pp
.Nm bzcat
.Op Fl s
.Op Ar filename ...
.Pp
.Nm bzip2recover
.Ar filename
.Sh DESCRIPTION
.Nm bzip2
compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding.
Compression is generally considerably better than that achieved by
more conventional LZ77/LZ78-based compressors, and approaches the
performance of the PPM family of statistical compressors.
.Pp
.Nm bzcat
decompresses files to stdout, and
.Nm bzip2recover
recovers data from damaged bzip2 files.
.Pp
The command-line options are deliberately very similar to
those of
.Xr gzip 1 ,
but they are not identical.
.Pp
.Nm bzip2
expects a list of file names to accompany the command-line flags.
Each file is replaced by a compressed version of
itself, with the name
.Dq Pa original_name.bz2 .
Each compressed file has the same modification date, permissions, and,
when possible, ownership as the corresponding original, so that these
properties can be correctly restored at decompression time.
File name handling is naive in the sense that there is no mechanism
for preserving original file names, permissions, ownerships or dates
in filesystems which lack these concepts, or have serious file name
length restrictions, such as
.Tn MS-DOS .
.Nm bzip2
and
.Nm bunzip2
will by default not overwrite existing files.
If you want this to happen, specify the
.Fl f
flag.
.Pp
If no file names are specified,
.Nm bzip2
compresses from standard input to standard output.
In this case,
.Nm bzip2
will decline to write compressed output to a terminal, as this would
be entirely incomprehensible and therefore pointless.
.Pp
.Nm bunzip2
(or
.Nm bzip2 Fl d )
decompresses all specified files.
Files which were not created by
.Nm bzip2
will be detected and ignored, and a warning issued.
.Nm bzip2
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:
.Bl -column "filename.tbz2" "becomes" -offset indent
.It Pa filename.bz2  Ta becomes Ta Pa filename
.It Pa filename.bz   Ta becomes Ta Pa filename
.It Pa filename.tbz2 Ta becomes Ta Pa filename.tar
.It Pa filename.tbz  Ta becomes Ta Pa filename.tar
.It Pa anyothername  Ta becomes Ta Pa anyothername.out
.El
.Pp
If the file does not end in one of the recognised endings,
.Pa .bz2 ,
.Pa .bz ,
.Pa .tbz2 ,
or
.Pa .tbz ,
.Nm bzip2
complains that it cannot guess the name of the original file, and uses
the original name with
.Pa .out
appended.
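.Pp
As an illustration only, the renaming rules above can be sketched in
Python.
This is not the code
.Nm bzip2
uses; the function name
.Fn guess_output_name
and the table
.Va SUFFIX_MAP
are hypothetical.
.Bd -literal -offset indent
```python
# Illustrative sketch only; names here are hypothetical.
# Suffix table copied from the list above: (compressed suffix, replacement).
SUFFIX_MAP = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(name):
    """Return the decompressed file name for a compressed file name."""
    for suffix, replacement in SUFFIX_MAP:
        if name.endswith(suffix):
            return name[: len(name) - len(suffix)] + replacement
    # Unrecognised ending: fall back to appending ".out".
    return name + ".out"


print(guess_output_name("archive.tbz2"))  # archive.tar
print(guess_output_name("notes.txt"))     # notes.txt.out
```
.Ed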
.Pp
As with compression, supplying no filenames causes decompression from
standard input to standard output.
.Pp
.Nm bunzip2
will correctly decompress a file which is the concatenation of two or
more compressed files.
The result is the concatenation of the corresponding uncompressed
files.
Integrity testing
.Pq Fl t
of concatenated compressed files is also supported.
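.Pp
The same property can be observed through Python's standard
.Li bz2
module, which wraps the same compression format.
The sketch below is an illustration under that assumption, not a
description of how
.Nm bunzip2
is implemented; the sample byte strings are made up.
.Bd -literal -offset indent
```python
import bz2

first = bz2.compress(b"first file; ")
second = bz2.compress(b"second file")

# Joining two complete compressed streams, as concatenating two .bz2
# files on disk would.
combined = first + second

# bz2.decompress() accepts multi-stream input (Python 3.3 and later).
restored = bz2.decompress(combined)
print(restored)
```
.Ed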
.Pp
You can also compress or decompress files to the standard output by
giving the
.Fl c
flag.
Multiple files may be compressed and decompressed like this.
The resulting outputs are fed sequentially to stdout.
Compression of multiple files in this manner generates a stream
containing multiple compressed file representations.
Such a stream can be decompressed correctly only by
.Nm bzip2
version 0.9.0 or later.
Earlier versions of
.Nm bzip2
will stop after decompressing
the first file in the stream.
.Pp
.Nm bzcat
(or
.Nm bzip2 Fl dc )
decompresses all specified files to the standard output.
.Pp
Compression is always performed, even if the compressed file is
slightly larger than the original.
Files of less than about one hundred bytes tend to get larger, since
the compression mechanism has a constant overhead in the region of 50
bytes.
Random data (including the output of most file compressors) is coded
at about 8.05 bits per byte, giving an expansion of around 0.5%.
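.Pp
A rough way to see this expansion, assuming Python's standard
.Li bz2
module is available, is to compare input and output sizes for a tiny
input and for pseudo-random bytes.
The helper
.Fn compressed_size
is hypothetical, and the exact sizes printed depend on the library
version, so none are quoted here.
.Bd -literal -offset indent
```python
import bz2
import random


def compressed_size(data):
    """Return the size in bytes of data after bzip2 compression."""
    return len(bz2.compress(data, compresslevel=9))


tiny = b"x"
# Pseudo-random bytes stand in for incompressible input; the seed is arbitrary.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(100_000))

# Each line shows the original size followed by the compressed size.
print(len(tiny), compressed_size(tiny))
print(len(noise), compressed_size(noise))
```
.Ed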
.Pp
As a self-check for your protection,
.Nm bzip2
uses 32-bit CRCs to make sure that the decompressed version of a file
is identical to the original.
This guards against corruption of the compressed data, and against
undetected bugs in
.Nm bzip2
(hopefully very unlikely).
The chance of data corruption going undetected is microscopic, about
one chance in four billion for each file processed.
Be aware, though, that the check occurs upon decompression, so it can
only tell you that something is wrong.
It can't help you recover the original uncompressed data.
You can use
.Nm bzip2recover
to try to recover data from
damaged files.
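.Pp
The effect of the CRC check can be sketched with Python's standard
.Li bz2
module.
The example flips one bit of the stored CRC of the first block and
shows that decompression then fails.
The byte offset used is an assumption about the stream layout (a 4-byte
stream header followed by a 6-byte block marker), and
.Fn is_intact
is a hypothetical helper.
.Bd -literal -offset indent
```python
import bz2

original = b"block-sorting file compressor " * 200
packed = bz2.compress(original)

# Assumed stream layout: 4-byte header, 6-byte block marker, then the
# 32-bit CRC of the first block, so byte 10 belongs to that CRC.
damaged = bytearray(packed)
damaged[10] ^= 0x01


def is_intact(blob):
    """Return True if blob decompresses cleanly, False otherwise."""
    try:
        bz2.decompress(blob)
    except (OSError, ValueError):
        # OSError: invalid data or CRC mismatch; ValueError: truncated stream.
        return False
    return True


print(is_intact(packed), is_intact(bytes(damaged)))  # True False
```
.Ed
As the text above notes, the check only reports the damage; it does not
repair it.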
.Sh OPTIONS
.Bl -tag -width "XXrepetitiveXfastXX"
.It Fl Fl
Treats all subsequent arguments as file names, even if they start with
a dash.
This is so you can handle files with names beginning with a dash, for
example:
.Dl bzip2 -- -myfilename .
.It Fl 1 , Fl Fl fast
to
.It Fl 9 , Fl Fl best
Set the block size to 100 k, 200 k ... 900 k when compressing.
Has no effect when decompressing.
See
.Sx MEMORY MANAGEMENT
below.
The
.Fl Fl fast
and
.Fl Fl best
aliases are primarily for GNU
.Xr gzip 1
compatibility.
In particular,
.Fl Fl fast
doesn't make things significantly faster, and
.Fl Fl best
merely selects the default behaviour.
.It Fl c , Fl Fl stdout
Compress or decompress to standard output.
.It Fl d , Fl Fl decompress
Force decompression.
.Nm bzip2 ,
.Nm bunzip2 ,
and
.Nm bzcat
are really the same program, and the decision about what actions to
take is done on the basis of which name is used.
This flag overrides that mechanism, and forces
.Nm bzip2
to decompress.
.It Fl f , Fl Fl force
Force overwrite of output files.
Normally,
.Nm bzip2
will not overwrite existing output files.
Also forces
.Nm bzip2
to break hard links
to files, which it otherwise wouldn't do.
.Pp
.Nm bzip2
normally declines to decompress files which don't have the correct
magic header bytes.
If forced
.Pq Fl f ,
however, it will pass such files through unmodified.
This is how GNU
.Xr gzip 1
behaves.
.It Fl k , Fl Fl keep
Keep (don't delete) input files during compression
or decompression.
.It Fl L , Fl Fl license
Display the license terms and conditions.
.It Fl q , Fl Fl quiet
Suppress non-essential warning messages.
Messages pertaining to I/O errors and other critical events will not
be suppressed.
.It Fl Fl repetitive-fast
.It Fl Fl repetitive-best
These flags are redundant in versions 0.9.5 and above.
They provided some coarse control over the behaviour of the sorting
algorithm in earlier versions, which was sometimes useful.
0.9.5 and above have an improved algorithm which renders these flags
irrelevant.
.It Fl s , Fl Fl small
Reduce memory usage, for compression, decompression and testing.
Files are decompressed and tested using a modified algorithm which
only requires 2.5 bytes per block byte.
This means any file can be decompressed in 2300k of memory, albeit at
about half the normal speed.
During compression,
.Fl s
selects a block size of 200k, which limits memory use to around the
same figure, at the expense of your compression ratio.
In short, if your machine is low on memory (8 megabytes or less), use
.Fl s
for everything.
See
.Sx MEMORY MANAGEMENT
below.
.It Fl t , Fl Fl test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.It Fl V , Fl Fl version
Display the software version.
.It Fl v , Fl Fl verbose
Verbose mode: show the compression ratio for each file processed.
Further
.Fl v Ap s
increase the verbosity level, spewing out lots of information which is
primarily of interest for diagnostic purposes.
.It Fl z , Fl Fl compress
The complement to
.Fl d :
forces compression, regardless of the invocation name.
.El
.Ss MEMORY MANAGEMENT
.Nm bzip2
compresses large files in blocks.
The block size affects both the compression ratio achieved, and the
amount of memory needed for compression and decompression.
The flags
.Fl 1
through
.Fl 9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively.
At decompression time, the block size used for compression is read
from the header of the compressed file, and
.Nm bunzip2
then allocates itself just enough memory to decompress the file.
Since block sizes are stored in compressed files, it follows that the
flags
.Fl 1
to
.Fl 9
are irrelevant to and so ignored during decompression.
.Pp
Compression and decompression requirements, in bytes, can be estimated
as:
.Bl -tag -width "Decompression:" -offset indent
.It Compression :
400k + ( 8 x block size )
.It Decompression :
100k + ( 4 x block size ), or 100k + ( 2.5 x block size ) with
.Fl s
.El
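.Pp
The two estimates can be written as a small Python function.
This is a sketch of the arithmetic only, under the assumption that
.Dq k
means 1000 bytes and that flag
.Fl N
selects a block of N x 100k; the name
.Fn memory_estimate_k
is hypothetical.
.Bd -literal -offset indent
```python
def memory_estimate_k(level, small=False):
    """Estimate (compression, decompression) memory in k for -1 to -9.

    The small argument models decompression with -s only.
    """
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    block_k = 100 * level
    compress_k = 400 + 8 * block_k
    # Decompression needs 4 bytes per block byte, or 2.5 with the -s flag.
    per_byte = 2.5 if small else 4
    decompress_k = 100 + per_byte * block_k
    return compress_k, decompress_k


print(memory_estimate_k(9))              # (7600, 3700)
print(memory_estimate_k(9, small=True))  # (7600, 2350.0)
```
.Ed
.Pp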
Larger block sizes give rapidly diminishing marginal returns.
Most of the compression comes from the first two or three hundred k of
block size, a fact worth bearing in mind when using
.Nm bzip2
on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.
.Pp
For files compressed with the default 900k block size,
.Nm bunzip2
will require about 3700 kbytes to decompress.
To support decompression of any file on a 4 megabyte machine,
.Nm bunzip2
has an option to decompress using approximately half this amount of
memory, about 2300 kbytes.
Decompression speed is also halved, so you should use this option only
where necessary.
The relevant flag is
.Fl s .
.Pp
In general, try to use the largest block size memory constraints
allow, since that maximises the compression achieved.
Compression and decompression speed are virtually unaffected by block
size.
.Pp
Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size.
The amount of real memory touched is proportional to the size of the
file, since the file is smaller than a block.
For example, compressing a file 20,000 bytes long with the flag
.Fl 9
will cause the compressor to allocate around 7600k of memory, but only
touch 400k + 20000 * 8 = 560 kbytes of it.
Similarly, the decompressor will allocate 3700k but only touch 100k +
20000 * 4 = 180 kbytes.
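.Pp
The worked example above generalises to any file that fits in one
block.
The Python sketch below repeats that arithmetic, assuming
.Dq k
means 1000 bytes; the function name
.Fn touched_memory_bytes
is hypothetical.
.Bd -literal -offset indent
```python
def touched_memory_bytes(file_size, level=9):
    """Return (compression, decompression) bytes actually touched.

    Assumes k means 1000 bytes and the file fits in a single block.
    """
    block_size = 100_000 * level
    if file_size > block_size:
        raise ValueError("file does not fit in a single block")
    return 400_000 + 8 * file_size, 100_000 + 4 * file_size


print(touched_memory_bytes(20_000))  # (560000, 180000)
```
.Ed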
.Pp
Here is a table which summarises the maximum memory usage for different
block sizes.
Also recorded is the total compressed size for 14 files of the Calgary
Text Compression Corpus totalling 3,141,622 bytes.
This column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes
for larger files, since the Corpus is dominated by smaller files.
.Bl -column "Flag" "Compression" "Decompression" "DecompressionXXs" "Corpus size"
.It Sy Flag Ta Sy Compression Ta Sy Decompression Ta Sy Decompression Fl s Ta Sy Corpus size
.It -1 Ta 1200k Ta  500k Ta  350k Ta 914704
.It -2 Ta 2000k Ta  900k Ta  600k Ta 877703
.It -3 Ta 2800k Ta 1300k Ta  850k Ta 860338
.It -4 Ta 3600k Ta 1700k Ta 1100k Ta 846899
.It -5 Ta 4400k Ta 2100k Ta 1350k Ta 845160
.It -6 Ta 5200k Ta 2500k Ta 1600k Ta 838626
.It -7 Ta 6100k Ta 2900k Ta 1850k Ta 834096
.It -8 Ta 6800k Ta 3300k Ta 2100k Ta 828642
.It -9 Ta 7600k Ta 3700k Ta 2350k Ta 828642
.El
.Ss RECOVERING DATA FROM DAMAGED FILES
.Nm bzip2
compresses files in blocks, usually 900kbytes long.
Each block is handled independently.
If a media or transmission error causes a multi-block
.Pa .bz2
file to become damaged, it may be possible to recover data from the
undamaged blocks in the file.
.Pp
The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty.
Each block also carries its own 32-bit CRC, so damaged blocks can be
distinguished from undamaged ones.
.Pp
.Nm bzip2recover
is a simple program whose purpose is to search for blocks in
.Pa .bz2
files, and write each block out into its own
.Pa .bz2
file.
You can then use
.Nm bzip2
.Fl t
to test the integrity of the resulting files, and decompress those
which are undamaged.
.Pp
.Nm bzip2recover
takes a single argument, the name of the damaged file, and writes a
number of files
.Dq Pa rec00001file.bz2 ,
.Dq Pa rec00002file.bz2 ,
etc., containing the extracted blocks.
The output filenames are designed so that the use of wildcards in
subsequent processing -- for example,
.Dl bzip2 -dc rec*file.bz2 \*[Gt] recovered_data
-- processes the files in the correct order.
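.Pp
The test-then-decompress step can also be scripted.
The Python sketch below is one possible approach, not part of
.Nm bzip2recover :
it reads the extracted block files in sorted order, keeps those that
decompress cleanly and skips the rest.
The function name
.Fn salvage
is hypothetical, and it relies on the zero-padded file names sorting
in block order.
.Bd -literal -offset indent
```python
import bz2
import glob


def salvage(pattern):
    """Concatenate the contents of every intact block file matching pattern."""
    pieces = []
    # Zero-padded names such as rec00001file.bz2 sort lexically in block order.
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as handle:
            blob = handle.read()
        try:
            pieces.append(bz2.decompress(blob))
        except (OSError, ValueError):
            # Damaged block: skip it and carry on with the next one.
            print("skipping damaged block:", path)
    return b"".join(pieces)
```
.Ed
A call such as
.Li salvage("rec*file.bz2")
returns whatever could be recovered as a single byte string.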
.Pp
.Nm bzip2recover
should be of most use dealing with large
.Pa .bz2
files, as these will contain many blocks.
It is clearly futile to use it on damaged single-block files, since a
damaged block cannot be recovered.
If you wish to minimise any potential data loss through media or
transmission errors, you might consider compressing with a smaller
block size.
.Ss PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in
the file.
Because of this, files containing very long runs of repeated
symbols, like
.Dq aabaabaabaab...
(repeated several hundred times) may compress more slowly than normal.
Versions 0.9.5 and above fare much better than previous versions in
this respect.
The ratio between worst-case and average-case compression time is in
the region of 10:1.
For previous versions, this figure was more like 100:1.
You can use the
.Fl vvvv
option to monitor progress in great detail, if you want.
.Pp
Decompression speed is unaffected by these phenomena.
.Pp
.Nm bzip2
usually allocates several megabytes of memory to operate in, and then
charges all over it in a fairly random fashion.
This means that performance, both for compressing and decompressing,
is largely determined by the speed at which your machine can service
cache misses.
Because of this, small changes to the code to reduce the miss rate
have been observed to give disproportionately large performance
improvements.
I imagine
.Nm bzip2
will perform best on machines with very large caches.
.Sh ENVIRONMENT
.Nm bzip2
will read arguments from the environment variables
.Ev BZIP2
and
.Ev BZIP ,
in that order, and will process them before any arguments read from
the command line.
This gives a convenient way to supply default arguments.
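.Pp
The resulting order of arguments can be modelled in Python as follows.
This is a sketch of the ordering described above, not the actual
parsing code; it assumes each variable is split on whitespace, and the
name
.Fn effective_arguments
is hypothetical.
.Bd -literal -offset indent
```python
import os


def effective_arguments(argv, environ=None):
    """Return the argument list with BZIP2 and BZIP contents placed first."""
    if environ is None:
        environ = os.environ
    args = []
    for name in ("BZIP2", "BZIP"):
        # Assumption: the value of the variable is split on whitespace.
        args.extend(environ.get(name, "").split())
    return args + list(argv)


sample_env = {"BZIP2": "-9 -v", "BZIP": "-q"}
print(effective_arguments(["-k", "data.txt"], sample_env))
# ['-9', '-v', '-q', '-k', 'data.txt']
```
.Ed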
.Sh EXIT STATUS
0 for a normal exit, 1 for environmental problems (file not found,
invalid flags, I/O errors, etc.), 2 to indicate a corrupt compressed
file, 3 for an internal consistency error (e.g., bug) which caused
.Nm bzip2
to panic.
.Sh AUTHORS
.An -nosplit
.An Julian Seward
.Aq jseward@bzip.org
.Pp
.Pa http://www.bzip.org
.Pp
The ideas embodied in
.Nm bzip2
are due to (at least) the following people:
.An Michael Burrows
and
.An David Wheeler
(for the block sorting transformation),
.An David Wheeler
(again, for the Huffman coder),
.An Peter Fenwick
(for the structured coding model in the original
.Nm bzip ,
and many refinements), and
.An Alistair Moffat ,
.An Radford Neal ,
and
.An Ian Witten
(for the arithmetic coder in the original
.Nm bzip ) .
I am much indebted for their help, support and advice.
See the manual in the source distribution for pointers to sources of
documentation.
Christian von Roques encouraged me to look for faster sorting
algorithms, so as to speed up compression.
Bela Lubkin encouraged me to improve the worst-case compression
performance.
Donna Robinson XMLised the documentation.
The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped with portability problems, lent
machines, gave advice and were generally helpful.
.Sh CAVEATS
I/O error messages are not as helpful as they could be.
.Nm bzip2
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.
.Pp
This manual page pertains to version 1.0.8 of
.Nm bzip2 .
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the
following exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files.
0.1pl2 cannot do this; it will stop after decompressing just the first
file in the stream.
.Pp
.Nm bzip2recover
versions prior to 1.0.2 used 32-bit integers to represent bit
positions in compressed files, so they could not handle compressed
files more than 512 megabytes long.
Versions 1.0.2 and above use 64-bit ints on some platforms which
support them (GNU supported targets, and Windows).
To establish whether or not
.Nm bzip2recover
was built with such a limitation, run it without arguments.
In any event you can build yourself an unlimited version if you can
recompile it with MaybeUInt64 set to be an unsigned 64-bit integer.
511