=head1 NAME

IO::Compress::FAQ -- Frequently Asked Questions about IO::Compress

=head1 DESCRIPTION

Common questions answered.

=head1 GENERAL

=head2 Compatibility with Unix compress/uncompress.

Although C<Compress::Zlib> has a pair of functions called C<compress> and
C<uncompress>, they are I<not> related to the Unix programs of the same
name. The C<Compress::Zlib> module is not compatible with Unix
C<compress>.

If you have the C<uncompress> program available, you can use this to read
compressed files

    open F, "uncompress -c $filename |"
        or die "Cannot run uncompress: $!\n";
    while (<F>)
    {
        ...
    }

Alternatively, if you have the C<gunzip> program available, you can use
this to read compressed files

    open F, "gunzip -c $filename |"
        or die "Cannot run gunzip: $!\n";
    while (<F>)
    {
        ...
    }

and this to write compressed files, if you have the C<compress> program
available

    open F, "| compress -c >$filename"
        or die "Cannot run compress: $!\n";
    print F "data";
    ...
    close F ;

=head2 Accessing .tar.Z files

The C<Archive::Tar> module can optionally use C<Compress::Zlib> (via the
C<IO::Zlib> module) to access tar files that have been compressed with
C<gzip>. Unfortunately tar files compressed with the Unix C<compress>
utility cannot be read by C<Compress::Zlib> and so cannot be directly
accessed by C<Archive::Tar>.

If the C<uncompress> or C<gunzip> programs are available, you can use one
of these workarounds to read C<.tar.Z> files from C<Archive::Tar>

Firstly with C<uncompress>

    use strict;
    use warnings;
    use Archive::Tar;

    open F, "uncompress -c $filename |";
    my $tar = Archive::Tar->new(*F);
    ...

and this with C<gunzip>

    use strict;
    use warnings;
    use Archive::Tar;

    open F, "gunzip -c $filename |";
    my $tar = Archive::Tar->new(*F);
    ...

Similarly, if the C<compress> program is available, you can use this to
write a C<.tar.Z> file

    use strict;
    use warnings;
    use Archive::Tar;
    use IO::File;

    my $fh = new IO::File "| compress -c >$filename";
    my $tar = Archive::Tar->new();
    ...
    $tar->write($fh);
    $fh->close ;

=head2 How do I recompress using a different compression?

This is easier than you might expect if you realise that all the
C<IO::Compress::*> objects are derived from C<IO::File> and that all the
C<IO::Uncompress::*> modules can read from an C<IO::File> filehandle.

So, for example, say you have a file compressed with gzip that you want to
recompress with bzip2. Here is all that is needed to carry out the
recompression.

    use IO::Uncompress::Gunzip ':all';
    use IO::Compress::Bzip2 ':all';

    my $gzipFile = "somefile.gz";
    my $bzipFile = "somefile.bz2";

    my $gunzip = new IO::Uncompress::Gunzip $gzipFile
        or die "Cannot gunzip $gzipFile: $GunzipError\n" ;

    bzip2 $gunzip => $bzipFile
        or die "Cannot bzip2 to $bzipFile: $Bzip2Error\n" ;

Note that this technique has a limitation. Some compression file
formats store extra information along with the compressed data payload. For
example, gzip can optionally store the original filename and Zip stores a
lot of information about the original file. If the original compressed file
contains any of this extra information, it will not be transferred to the
new compressed file using the technique above.

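Where the target format has a matching field, the extra information can be
carried across by hand. The following is a minimal sketch, not part of the
original FAQ, that recompresses an in-memory gzip stream as a zip member
and passes the stored filename along. It relies on the C<Name> key returned
by C<getHeaderInfo> and on the C<Name> option of C<IO::Compress::Zip>; the
sample data, the name C<hello.txt> and the fallback name C<unnamed> are
made up for illustration.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);
use IO::Compress::Zip      qw($ZipError);
use IO::Uncompress::Unzip  qw($UnzipError);

# Build a small gzip stream in memory that records a filename in its header.
# (The data and the name are illustrative only.)
my $original = "hello world\n" x 10;
my $gz_data;
gzip \$original => \$gz_data, Name => "hello.txt"
    or die "gzip failed: $GzipError\n";

# Open the gzip stream and fetch the header fields before reading the payload.
my $gunzip = IO::Uncompress::Gunzip->new(\$gz_data)
    or die "Cannot gunzip: $GunzipError\n";
my $name = $gunzip->getHeaderInfo()->{Name};
$name = "unnamed" unless defined $name && length $name;

# Recompress as a zip member, passing the filename across explicitly.
my $zip_data;
my $zip = IO::Compress::Zip->new(\$zip_data, Name => $name)
    or die "Cannot create zip: $ZipError\n";

my ($chunk, $status);
while (($status = $gunzip->read($chunk)) > 0) {
    $zip->print($chunk);
}
die "Error reading gzip stream: $GunzipError\n" if $status < 0;

$gunzip->close();
$zip->close();

# Read the zip member back to see whether the name and the data survived.
my $unzip = IO::Uncompress::Unzip->new(\$zip_data)
    or die "Cannot unzip: $UnzipError\n";
my $zip_name  = $unzip->getHeaderInfo()->{Name};
my $roundtrip = do { local $/; <$unzip> };
$unzip->close();

print "member name: $zip_name\n";
```

Only the filename is handled here. Other header fields, such as the
modification time or a comment, would need the same explicit treatment.
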
=head1 ZIP

=head2 What Compression Types do IO::Compress::Zip & IO::Uncompress::Unzip support?

The following compression formats are supported by C<IO::Compress::Zip> and
C<IO::Uncompress::Unzip>

=over 5

=item * Store (method 0)

No compression at all.

=item * Deflate (method 8)

This is the default compression used when creating a zip file with
C<IO::Compress::Zip>.

=item * Bzip2 (method 12)

Only supported if the C<IO-Compress-Bzip2> module is installed.

=item * Lzma (method 14)

Only supported if the C<IO-Compress-Lzma> module is installed.

=back

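As an illustration of the list above, here is a short sketch that selects
the compression method with the C<Method> option and the C<ZIP_CM_*>
constants exported by C<IO::Compress::Zip>. The payload and the member name
C<data.txt> are invented for the example, and only the two methods that
need no extra modules are shown.

```perl
use strict;
use warnings;
use IO::Compress::Zip     qw(:all);
use IO::Uncompress::Unzip qw(:all);

# Illustrative payload: repetitive text that deflate shrinks easily.
my $data = "some highly repetitive text\n" x 100;

# Store (method 0): the payload is copied into the archive unchanged.
my $stored;
zip \$data => \$stored, Name => "data.txt", Method => ZIP_CM_STORE
    or die "zip (store) failed: $ZipError\n";

# Deflate (method 8): the default, requested explicitly here.
my $deflated;
zip \$data => \$deflated, Name => "data.txt", Method => ZIP_CM_DEFLATE
    or die "zip (deflate) failed: $ZipError\n";

printf "store: %d bytes, deflate: %d bytes\n",
    length($stored), length($deflated);

# Unzip works out the method from the archive; no option is needed.
my ($from_stored, $from_deflated);
unzip \$stored => \$from_stored
    or die "unzip (store) failed: $UnzipError\n";
unzip \$deflated => \$from_deflated
    or die "unzip (deflate) failed: $UnzipError\n";
```
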
=head2 Can I Read/Write Zip files larger than 4 Gig?

Yes, both the C<IO::Compress::Zip> and C<IO::Uncompress::Unzip> modules
support the zip feature called I<Zip64>. That allows them to read/write
files/buffers larger than 4 Gig.

If you are creating a Zip file using the one-shot interface, and any of the
input files is greater than 4 Gig, a zip64 compliant zip file will be
created.

    zip "really-large-file" => "my.zip";

Similarly with the one-shot interface, if the input is a buffer larger than
4 Gig, a zip64 compliant zip file will be created.

    zip \$really_large_buffer => "my.zip";

The one-shot interface allows you to force the creation of a zip64 zip file
by including the C<Zip64> option.

    zip $filehandle => "my.zip", Zip64 => 1;

If you want to create a zip64 zip file with the OO interface you must
specify the C<Zip64> option.

    my $zip = new IO::Compress::Zip "whatever", Zip64 => 1;

When uncompressing, C<IO::Uncompress::Unzip> will automatically
detect if the zip file is zip64.

If you intend to manipulate the Zip64 zip files created with
C<IO::Compress::Zip> using an external zip/unzip, make sure that it supports
Zip64.

In particular, if you are using Info-Zip you need to have zip version 3.x
or better to update a Zip64 archive and unzip version 6.x to read a zip64
archive.

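Since Zip64 can be forced regardless of size, the behaviour can be tried
out without a multi-gigabyte file. The sketch below is an illustration
only: it forces C<Zip64> on a tiny made-up buffer, reads it back, and looks
for the Zip64 end-of-central-directory signature that APPNOTE defines as
0x06064b50.

```perl
use strict;
use warnings;
use IO::Compress::Zip     qw(zip $ZipError);
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Illustrative payload, far below the 4 Gig limit.
my $data = "small payload\n";

# Force Zip64 structures even though the payload is tiny.
my $zip64;
zip \$data => \$zip64, Name => "small.txt", Zip64 => 1
    or die "zip failed: $ZipError\n";

# No option is needed on the reading side; Zip64 is detected automatically.
my $restored;
unzip \$zip64 => \$restored
    or die "unzip failed: $UnzipError\n";

# The Zip64 end-of-central-directory record starts with the bytes
# "PK\x06\x06" (0x06064b50 stored little-endian).
my $has_zip64_record = index($zip64, "PK\x06\x06") >= 0;

print "round trip ok\n" if $restored eq $data;
print "zip64 record present: ", ($has_zip64_record ? "yes" : "no"), "\n";
```
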
=head2 Can I write more than 64K entries in a Zip file?

Yes. Zip64 allows this. See previous question.

=head2 Zip Resources

The primary reference for zip files is the "appnote" document available at
L<http://www.pkware.com/documents/casestudies/APPNOTE.TXT>

An alternative is the Info-Zip appnote. This is available from
L<ftp://ftp.info-zip.org/pub/infozip/doc/>

=head1 GZIP

=head2 Gzip Resources

The primary reference for gzip files is RFC 1952
L<http://www.faqs.org/rfcs/rfc1952.html>

The primary site for gzip is F<http://www.gzip.org>.

=head2 Dealing with Concatenated gzip files

If the gunzip program encounters a file containing multiple gzip files
concatenated together it will automatically uncompress them all.
The example below illustrates this behaviour

    $ echo abc | gzip -c >x.gz
    $ echo def | gzip -c >>x.gz
    $ gunzip -c x.gz
    abc
    def

By default C<IO::Uncompress::Gunzip> will I<not> behave like the gunzip
program. It will only uncompress the first gzip data stream in the file, as
shown below

    $ perl -MIO::Uncompress::Gunzip=:all -e 'gunzip "x.gz" => \*STDOUT'
    abc

To force C<IO::Uncompress::Gunzip> to uncompress all the gzip data streams,
include the C<MultiStream> option, as shown below

    $ perl -MIO::Uncompress::Gunzip=:all -e 'gunzip "x.gz" => \*STDOUT, MultiStream => 1'
    abc
    def

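The same behaviour can be reproduced inside a script without touching the
filesystem. This sketch builds the equivalent of F<x.gz> in a buffer and
uncompresses it with and without C<MultiStream>; the variable names are
arbitrary.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Build the equivalent of x.gz in memory: two gzip streams back to back.
my ($abc, $def) = ("abc\n", "def\n");
my ($first, $second);
gzip \$abc => \$first
    or die "gzip failed: $GzipError\n";
gzip \$def => \$second
    or die "gzip failed: $GzipError\n";
my $concatenated = $first . $second;

# Default behaviour: stop after the first gzip data stream.
my $default_out;
gunzip \$concatenated => \$default_out
    or die "gunzip failed: $GunzipError\n";

# MultiStream: keep going until the input is exhausted.
my $multi_out;
gunzip \$concatenated => \$multi_out, MultiStream => 1
    or die "gunzip failed: $GunzipError\n";

print "default:     $default_out";
print "multistream: $multi_out";
```
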
=head1 ZLIB

=head2 Zlib Resources

The primary site for the I<zlib> compression library is
F<http://www.zlib.org>.

=head1 Bzip2

=head2 Bzip2 Resources

The primary site for bzip2 is F<http://www.bzip.org>.

=head2 Dealing with Concatenated bzip2 files

If the bunzip2 program encounters a file containing multiple bzip2 files
concatenated together it will automatically uncompress them all.
The example below illustrates this behaviour

    $ echo abc | bzip2 -c >x.bz2
    $ echo def | bzip2 -c >>x.bz2
    $ bunzip2 -c x.bz2
    abc
    def

By default C<IO::Uncompress::Bunzip2> will I<not> behave like the bunzip2
program. It will only uncompress the first bzip2 data stream in the file, as
shown below

    $ perl -MIO::Uncompress::Bunzip2=:all -e 'bunzip2 "x.bz2" => \*STDOUT'
    abc

To force C<IO::Uncompress::Bunzip2> to uncompress all the bzip2 data streams,
include the C<MultiStream> option, as shown below

    $ perl -MIO::Uncompress::Bunzip2=:all -e 'bunzip2 "x.bz2" => \*STDOUT, MultiStream => 1'
    abc
    def

=head2 Interoperating with Pbzip2

Pbzip2 (L<http://compression.ca/pbzip2/>) is a parallel implementation of
bzip2. The output from pbzip2 consists of a series of concatenated bzip2
data streams.

By default C<IO::Uncompress::Bunzip2> will only uncompress the first bzip2
data stream in a pbzip2 file. To uncompress the complete pbzip2 file you
must include the C<MultiStream> option, like this.

    bunzip2 $input => \$output, MultiStream => 1
        or die "bunzip2 failed: $Bunzip2Error\n";

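To see why C<MultiStream> matters here without installing pbzip2, the
sketch below imitates its output by compressing fixed-size chunks
independently and concatenating the results. This is only a stand-in for
what pbzip2 produces: the 4096-byte chunk size and the sample data are
arbitrary choices made for the example.

```perl
use strict;
use warnings;
use IO::Compress::Bzip2     qw(bzip2 $Bzip2Error);
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

# Illustrative input, a little under 9000 bytes long.
my $input = join "", map { "line $_\n" } 1 .. 1000;

# Stand-in for pbzip2: compress each chunk on its own and append the
# resulting bzip2 data stream to the output.
my $chunk_size  = 4096;
my $pbzip2_like = "";
for (my $offset = 0; $offset < length($input); $offset += $chunk_size) {
    my $chunk = substr($input, $offset, $chunk_size);
    my $compressed;
    bzip2 \$chunk => \$compressed
        or die "bzip2 failed: $Bzip2Error\n";
    $pbzip2_like .= $compressed;
}

# Without MultiStream only the first chunk comes back.
my $first_only;
bunzip2 \$pbzip2_like => \$first_only
    or die "bunzip2 failed: $Bunzip2Error\n";

# With MultiStream the complete input is recovered.
my $complete;
bunzip2 \$pbzip2_like => \$complete, MultiStream => 1
    or die "bunzip2 failed: $Bunzip2Error\n";

printf "input %d bytes, default %d bytes, MultiStream %d bytes\n",
    length($input), length($first_only), length($complete);
```
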
=head1 HTTP & NETWORK

=head2 Apache::GZip Revisited

Below is a mod_perl Apache compression module, called C<Apache::GZip>,
taken from
F<http://perl.apache.org/docs/tutorials/tips/mod_perl_tricks/mod_perl_tricks.html#On_the_Fly_Compression>

  package Apache::GZip;
  #File: Apache::GZip.pm

  use strict vars;
  use Apache::Constants ':common';
  use Compress::Zlib;
  use IO::File;
  use constant GZIP_MAGIC => 0x1f8b;
  use constant OS_MAGIC => 0x03;

  sub handler {
      my $r = shift;
      my ($fh,$gz);
      my $file = $r->filename;
      return DECLINED unless $fh=IO::File->new($file);
      $r->header_out('Content-Encoding'=>'gzip');
      $r->send_http_header;
      return OK if $r->header_only;

      tie *STDOUT,'Apache::GZip',$r;
      print($_) while <$fh>;
      untie *STDOUT;
      return OK;
  }

  sub TIEHANDLE {
      my($class,$r) = @_;
      # initialize a deflation stream
      my $d = deflateInit(-WindowBits=>-MAX_WBITS()) || return undef;

      # gzip header -- don't ask how I found out
      $r->print(pack("nccVcc",GZIP_MAGIC,Z_DEFLATED,0,time(),0,OS_MAGIC));

      return bless { r   => $r,
                     crc =>  crc32(undef),
                     d   => $d,
                     l   =>  0
                   },$class;
  }

  sub PRINT {
      my $self = shift;
      foreach (@_) {
        # deflate the data
        my $data = $self->{d}->deflate($_);
        $self->{r}->print($data);
        # keep track of its length and crc
        $self->{l} += length($_);
        $self->{crc} = crc32($_,$self->{crc});
      }
  }

  sub DESTROY {
     my $self = shift;

     # flush the output buffers
     my $data = $self->{d}->flush;
     $self->{r}->print($data);

     # print the CRC and the total length (uncompressed)
     $self->{r}->print(pack("LL",@{$self}{qw/crc l/}));
  }

  1;

Here's the Apache configuration entry you'll need to make use of it. Once
set, everything in the /compressed directory will be compressed
automagically.

  <Location /compressed>
     SetHandler  perl-script
     PerlHandler Apache::GZip
  </Location>

Although at first sight there seems to be quite a lot going on in
C<Apache::GZip>, you could sum up what the code is doing as follows --
read the contents of the file in C<< $r->filename >>, compress it and write
the compressed data to standard output. That's all.

This code has to jump through a few hoops to achieve this because

=over

=item 1.

The gzip support in C<Compress::Zlib> version 1.x can only work with a real
filesystem filehandle. The filehandles used by Apache modules are not
associated with the filesystem.

=item 2.

That means all the gzip support has to be done by hand - in this case by
creating a tied filehandle to deal with creating the gzip header and
trailer.

=back

C<IO::Compress::Gzip> doesn't have that filehandle limitation (this was one
of the reasons for writing it in the first place). So if
C<IO::Compress::Gzip> is used instead of C<Compress::Zlib> the whole tied
filehandle code can be removed. Here is the rewritten code.

  package Apache::GZip;

  use strict vars;
  use Apache::Constants ':common';
  use IO::Compress::Gzip;
  use IO::File;

  sub handler {
      my $r = shift;
      my $fh;
      my $file = $r->filename;
      return DECLINED unless $fh=IO::File->new($file);
      $r->header_out('Content-Encoding'=>'gzip');
      $r->send_http_header;
      return OK if $r->header_only;

      # '-' makes IO::Compress::Gzip write to standard output
      my $gz = new IO::Compress::Gzip '-', Minimal => 1
          or return DECLINED ;

      print $gz $_ while <$fh>;

      return OK;
  }

or even more succinctly, like this, using a one-shot gzip

  package Apache::GZip;

  use strict vars;
  use Apache::Constants ':common';
  use IO::Compress::Gzip qw(gzip);

  sub handler {
      my $r = shift;
      $r->header_out('Content-Encoding'=>'gzip');
      $r->send_http_header;
      return OK if $r->header_only;

      gzip $r->filename => '-', Minimal => 1
        or return DECLINED ;

      return OK;
  }

  1;

The use of one-shot C<gzip> above just reads from C<< $r->filename >> and
writes the compressed data to standard output.

Note the use of the C<Minimal> option in the code above. When using gzip
for Content-Encoding you should I<always> use this option. In the example
above it will prevent the filename being included in the gzip header and
make the gzip data stream slightly smaller.

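The effect of C<Minimal> can be checked by looking at the gzip header
directly. The following sketch compresses the same made-up buffer twice,
once with an explicit C<Name> and once with C<Minimal>, and inspects the
FLG byte that RFC 1952 places at offset 3 of the header. The payload and
the name C<index.html> are placeholders.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Illustrative response body.
my $body = "<html><body>Hello</body></html>\n";

# A gzip stream whose header records a filename.
my $with_name;
gzip \$body => \$with_name, Name => "index.html"
    or die "gzip failed: $GzipError\n";

# A gzip stream with the smallest possible header.
my $minimal;
gzip \$body => \$minimal, Minimal => 1
    or die "gzip failed: $GzipError\n";

# Byte 3 of a gzip header is the FLG field; bit 0x08 (FNAME) is set when
# a filename follows the fixed part of the header.
my $flags_named   = unpack "C", substr($with_name, 3, 1);
my $flags_minimal = unpack "C", substr($minimal,   3, 1);

printf "named:   %d bytes, FLG=0x%02x\n", length($with_name), $flags_named;
printf "minimal: %d bytes, FLG=0x%02x\n", length($minimal),   $flags_minimal;
```
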
=head2 Compressed files and Net::FTP

The C<Net::FTP> module provides two low-level methods called C<stor> and
C<retr> that both return filehandles. These filehandles can be used with the
C<IO::Compress/Uncompress> modules to compress or uncompress files read
from or written to an FTP Server on the fly, without having to create a
temporary file.

Firstly, here is code that uses C<retr> to uncompress a file as it is
read from the FTP Server.

    use Net::FTP;
    use IO::Uncompress::Gunzip qw(:all);

    my $ftp = new Net::FTP ...

    my $retr_fh = $ftp->retr($compressed_filename);
    gunzip $retr_fh => $outFilename, AutoClose => 1
        or die "Cannot uncompress '$compressed_filename': $GunzipError\n";

and this to compress a file as it is written to the FTP Server

    use Net::FTP;
    use IO::Compress::Gzip qw(:all);

    my $stor_fh = $ftp->stor($filename);
    gzip $filename => $stor_fh, AutoClose => 1
        or die "Cannot compress '$filename': $GzipError\n";

=head1 MISC

=head2 Using C<InputLength> to uncompress data embedded in a larger file/buffer.

A fairly common use-case is where compressed data is embedded in a larger
file/buffer and you want to read both.

As an example consider the structure of a zip file. This is a well-defined
file format that mixes both compressed and uncompressed sections of data in
a single file.

For the purposes of this discussion you can think of a zip file as a sequence
of compressed data streams, each of which is prefixed by an uncompressed
local header. The local header contains information about the compressed
data stream, including the name of the compressed file and, in particular,
the length of the compressed data stream.

To illustrate how to use C<InputLength> here is a script that walks a zip
file and prints out how many lines are in each compressed file (if you
intend to write code that walks through a zip file for real, see
L<IO::Uncompress::Unzip/"Walking through a zip file">). Also, although
this example uses the zlib-based compression, the technique can be used by
the other C<IO::Uncompress::*> modules.

    use strict;
    use warnings;

    use IO::File;
    use IO::Uncompress::RawInflate qw(:all);

    use constant ZIP_LOCAL_HDR_SIG  => 0x04034b50;
    use constant ZIP_LOCAL_HDR_LENGTH => 30;

    my $file = $ARGV[0] ;

    my $fh = new IO::File "<$file"
                or die "Cannot open '$file': $!\n";

    while (1)
    {
        my $buffer;

        $fh->read($buffer, ZIP_LOCAL_HDR_LENGTH) == ZIP_LOCAL_HDR_LENGTH
            or die "Truncated file: $!\n";

        my $signature = unpack ("V", substr($buffer, 0, 4));

        # Anything other than a local header marks the end of the entries
        last unless $signature == ZIP_LOCAL_HDR_SIG;

        # Read Local Header
        my $gpFlag             = unpack ("v", substr($buffer, 6, 2));
        my $compressedMethod   = unpack ("v", substr($buffer, 8, 2));
        my $compressedLength   = unpack ("V", substr($buffer, 18, 4));
        my $uncompressedLength = unpack ("V", substr($buffer, 22, 4));
        my $filename_length    = unpack ("v", substr($buffer, 26, 2));
        my $extra_length       = unpack ("v", substr($buffer, 28, 2));

        my $filename ;
        $fh->read($filename, $filename_length) == $filename_length
            or die "Truncated file\n";

        $fh->read($buffer, $extra_length) == $extra_length
            or die "Truncated file\n";

        if ($compressedMethod != 8 && $compressedMethod != 0)
        {
            warn "Skipping file '$filename' - not deflated $compressedMethod\n";
            $fh->read($buffer, $compressedLength) == $compressedLength
                or die "Truncated file\n";
            next;
        }

        # Parentheses are needed: == binds more tightly than &
        if ($compressedMethod == 0 && ($gpFlag & 8) == 8)
        {
            die "Streamed Stored not supported for '$filename'\n";
        }

        next if $compressedLength == 0;

        # Done reading the Local Header

        my $inf = new IO::Uncompress::RawInflate $fh,
                            Transparent => 1,
                            InputLength => $compressedLength
          or die "Cannot uncompress $file [$filename]: $RawInflateError\n"  ;

        my $line_count = 0;

        while (<$inf>)
        {
            ++ $line_count;
        }

        print "$filename: $line_count\n";
    }

The majority of the code above is concerned with reading the zip local
header data. The code that I want to focus on is at the bottom.

    while (1) {

        # read local zip header data
        # get $filename
        # get $compressedLength

        my $inf = new IO::Uncompress::RawInflate $fh,
                            Transparent => 1,
                            InputLength => $compressedLength
          or die "Cannot uncompress $file [$filename]: $RawInflateError\n"  ;

        my $line_count = 0;

        while (<$inf>)
        {
            ++ $line_count;
        }

        print "$filename: $line_count\n";
    }

The call to C<IO::Uncompress::RawInflate> creates a new filehandle C<$inf>
that can be used to read from the parent filehandle C<$fh>, uncompressing
it as it goes. The use of the C<InputLength> option will guarantee that
I<at most> C<$compressedLength> bytes of compressed data will be read from
the C<$fh> filehandle. The only exception is an error case such as a
truncated file or a corrupt data stream.

This means that once RawInflate is finished C<$fh> will be left at the
byte directly after the compressed data stream.

Now consider what the code looks like without C<InputLength>

    while (1) {

        # read local zip header data
        # get $filename
        # get $compressedLength

        # read all the compressed data into $data
        my $data;
        read($fh, $data, $compressedLength);

        my $inf = new IO::Uncompress::RawInflate \$data,
                            Transparent => 1
          or die "Cannot uncompress $file [$filename]: $RawInflateError\n"  ;

        my $line_count = 0;

        while (<$inf>)
        {
            ++ $line_count;
        }

        print "$filename: $line_count\n";
    }

The difference here is the addition of the temporary variable C<$data>.
This is used to store a copy of the compressed data while it is being
uncompressed.

If you know that C<$compressedLength> isn't that big then using temporary
storage won't be a problem. But if C<$compressedLength> is very large or
you are writing an application that other people will use, and so have no
idea how big C<$compressedLength> will be, it could be an issue.

Using C<InputLength> avoids the use of temporary storage and means the
application can cope with large compressed data streams.

One final point -- obviously C<InputLength> can only be used when you
know the length of the compressed data beforehand, like here with a zip
file.

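The zip walker needs a real zip file to run against. For a quick
experiment, the sketch below applies the same idea to a container format
invented for this example: a 4-byte big-endian length, followed by raw
deflate data, followed by a fixed text trailer. None of this layout comes
from the zip specification. It is only there to show that the parent
filehandle is left directly after the compressed data.

```perl
use strict;
use warnings;
use IO::File;
use IO::Compress::RawDeflate   qw(rawdeflate $RawDeflateError);
use IO::Uncompress::RawInflate qw($RawInflateError);

# Hypothetical container layout, invented for this sketch:
#   4-byte big-endian length | raw deflate data | text trailer
my $text    = "one\ntwo\nthree\n";
my $trailer = "TRAILER";

my $compressed;
rawdeflate \$text => \$compressed
    or die "rawdeflate failed: $RawDeflateError\n";

# Write the container to an anonymous temporary file and rewind it.
my $fh = IO::File->new_tmpfile()
    or die "Cannot create temporary file: $!\n";
binmode $fh;
print {$fh} pack("N", length($compressed)), $compressed, $trailer;
seek($fh, 0, 0)
    or die "Cannot rewind: $!\n";

# Read the length prefix, then let RawInflate consume at most that
# many bytes from the parent filehandle.
read($fh, my $prefix, 4) == 4
    or die "Truncated container\n";
my $compressed_length = unpack "N", $prefix;

my $inf = IO::Uncompress::RawInflate->new($fh,
                            InputLength => $compressed_length)
    or die "Cannot inflate: $RawInflateError\n";

my $line_count = 0;
my $inflated   = "";
while (my $line = <$inf>) {
    ++$line_count;
    $inflated .= $line;
}
$inf->close();

# $fh should now sit on the first byte after the compressed data,
# so the next read returns the trailer.
read($fh, my $rest, length($trailer));

print "lines: $line_count, trailer: $rest\n";
```
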
=head1 SEE ALSO

L<Compress::Zlib>, L<IO::Compress::Gzip>, L<IO::Uncompress::Gunzip>, L<IO::Compress::Deflate>, L<IO::Uncompress::Inflate>, L<IO::Compress::RawDeflate>, L<IO::Uncompress::RawInflate>, L<IO::Compress::Bzip2>, L<IO::Uncompress::Bunzip2>, L<IO::Compress::Lzma>, L<IO::Uncompress::UnLzma>, L<IO::Compress::Xz>, L<IO::Uncompress::UnXz>, L<IO::Compress::Lzop>, L<IO::Uncompress::UnLzop>, L<IO::Compress::Lzf>, L<IO::Uncompress::UnLzf>, L<IO::Uncompress::AnyInflate>, L<IO::Uncompress::AnyUncompress>

L<IO::Compress::FAQ|IO::Compress::FAQ>

L<File::GlobMapper|File::GlobMapper>, L<Archive::Zip|Archive::Zip>,
L<Archive::Tar|Archive::Tar>,
L<IO::Zlib|IO::Zlib>

=head1 AUTHOR

This module was written by Paul Marquess, F<pmqs@cpan.org>.

=head1 MODIFICATION HISTORY

See the Changes file.

=head1 COPYRIGHT AND LICENSE

Copyright (c) 2005-2014 Paul Marquess. All rights reserved.

This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

673