=head1 NAME

perlpacktut - tutorial on C<pack> and C<unpack>

=head1 DESCRIPTION

C<pack> and C<unpack> are two functions for transforming data according
to a user-defined template, between the guarded way Perl stores values
and some well-defined representation as might be required in the
environment of a Perl program. Unfortunately, they're also two of
the most misunderstood and most often overlooked functions that Perl
provides. This tutorial will demystify them for you.


=head1 The Basic Principle

Most programming languages don't shelter the memory where variables are
stored. In C, for instance, you can take the address of some variable,
and the C<sizeof> operator tells you how many bytes are allocated to
the variable. Using the address and the size, you may access the storage
to your heart's content.

In Perl, you just can't access memory at random, but the structural and
representational conversion provided by C<pack> and C<unpack> is an
excellent alternative. The C<pack> function converts values to a byte
sequence containing representations according to a given specification,
the so-called "template" argument. C<unpack> is the reverse process,
deriving some values from the contents of a string of bytes. (Be cautioned,
however, that not all that has been packed together can be neatly unpacked -
a very common experience as seasoned travellers are likely to confirm.)

Why, you may ask, would you need a chunk of memory containing some values
in binary representation? One good reason is input and output to or from
some file, device, or network connection, where this binary
representation is either forced on you or gives you some benefit
in processing. Another is passing data to some system call that
is not available as a Perl function: C<syscall> requires you to provide
parameters stored the way they would be in a C program. Even text processing
(as shown in the next section) may be simplified with judicious use
of these two functions.

To see how (un)packing works, we'll start with a simple template
code where the conversion is in low gear: between the contents of a byte
sequence and a string of hexadecimal digits. Let's use C<unpack>, since
this is likely to remind you of a dump program, or some desperate last
message unfortunate programs are wont to throw at you before they expire
into the wild blue yonder. Assuming that the variable C<$mem> holds a
sequence of bytes that we'd like to inspect without assuming anything
about its meaning, we can write

   my( $hex ) = unpack( 'H*', $mem );
   print "$hex\n";

whereupon we might see something like this, with each pair of hex digits
corresponding to a byte:

   41204d414e204120504c414e20412043414e414c2050414e414d41

What was in this chunk of memory? Numbers, characters, or a mixture of
both? Assuming that we're on a computer where ASCII (or some similar)
encoding is used: hexadecimal values in the range C<0x41> - C<0x5A>
indicate an uppercase letter, and C<0x20> encodes a space. So we might
assume it is a piece of text, which some are able to read like a tabloid;
but others will have to get hold of an ASCII table and relive that
first-grader feeling. Not caring too much about which way to read this,
we note that C<unpack> with the template code C<H> converts the contents
of a sequence of bytes into the customary hexadecimal notation. Since
"a sequence of" is a pretty vague indication of quantity, C<H> has been
defined to convert just a single hexadecimal digit unless it is followed
by a repeat count. An asterisk for the repeat count means to use whatever
remains.

The inverse operation - packing byte contents from a string of hexadecimal
digits - is just as easily written. For instance:

   my $s = pack( 'H2' x 10, map { "3$_" } ( 0..9 ) );
   print "$s\n";

Since we feed a list of ten 2-digit hexadecimal strings to C<pack>, the
pack template should contain ten pack codes. If this is run on a computer
with ASCII character coding, it will print C<0123456789>.
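The round trip between bytes and hex digits can be sketched as follows. This is only an illustration: the sample text and the variable names C<$first_digit> and C<$round_trip> are assumptions made for the example, and the hex values shown in the dump above presume an ASCII system.

```perl
use strict;
use warnings;

# Sample bytes; on an ASCII system this is the text behind the hex dump.
my $mem = 'A MAN A PLAN A CANAL PANAMA';

# 'H*' converts every byte into two hex digits, high nybble first.
my ($hex) = unpack('H*', $mem);

# Without a repeat count, 'H' converts just a single hex digit.
my ($first_digit) = unpack('H', $mem);

# Packing the hex string with 'H*' restores the original bytes.
my $round_trip = pack('H*', $hex);

print "$hex\n$first_digit\n$round_trip\n";
```

Because C<H*> consumes whatever remains, the same template serves in both directions regardless of the length of the data.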


=head1 Packing Text

Let's suppose you've got to read in a data file like this:

    Date      |Description                | Income|Expenditure
    01/24/2001 Ahmed's Camel Emporium                  1147.99
    01/28/2001 Flea spray                                24.99
    01/29/2001 Camel rides to tourists      235.00

How do we do it? You might think first to use C<split>; however, since
C<split> collapses blank fields, you'll never know whether a record was
income or expenditure. Oops. Well, you could always use C<substr>:

    while (<>) {
        my $date   = substr($_,  0, 11);
        my $desc   = substr($_, 12, 27);
        my $income = substr($_, 40,  7);
        my $expend = substr($_, 52,  7);
        ...
    }

It's not really a barrel of laughs, is it? In fact, it's worse than it
may seem; the eagle-eyed may notice that the first field should only be
10 characters wide, and the error has propagated right through the other
numbers - which we've had to count by hand. So it's error-prone as well
as horribly unfriendly.

Or maybe we could use regular expressions:

    while (<>) {
        my($date, $desc, $income, $expend) =
            m|(\d\d/\d\d/\d{4}) (.{27}) (.{7})(.*)|;
        ...
    }

Urgh. Well, it's a bit better, but - well, would you want to maintain
that?

Hey, isn't Perl supposed to make this sort of thing easy? Well, it does,
if you use the right tools. C<pack> and C<unpack> are designed to help
you out when dealing with fixed-width data like the above. Let's have a
look at a solution with C<unpack>:

    while (<>) {
        my($date, $desc, $income, $expend) = unpack("A10xA27xA7xA*", $_);
        ...
    }

That looks a bit nicer; but we've got to take apart that weird template.
Where did I pull that out of?

OK, let's have a look at some of our data again; in fact, we'll include
the headers, and a handy ruler so we can keep track of where we are.

             1         2         3         4         5
    1234567890123456789012345678901234567890123456789012345678
    Date      |Description                | Income|Expenditure
    01/28/2001 Flea spray                                24.99
    01/29/2001 Camel rides to tourists      235.00

From this, we can see that the date column stretches from column 1 to
column 10 - ten characters wide. The C<pack>-ese for "character" is
C<A>, and ten of them are C<A10>. So if we just wanted to extract the
dates, we could say this:

    my($date) = unpack("A10", $_);

OK, what's next? Between the date and the description is a blank column;
we want to skip over that. The C<x> template means "skip forward", so we
want one of those. Next, we have another batch of characters, from 12 to
38. That's 27 more characters, hence C<A27>. (Don't make the fencepost
error - there are 27 characters between 12 and 38, not 26. Count 'em!)

Now we skip another character and pick up the next 7 characters:

    my($date,$description,$income) = unpack("A10xA27xA7", $_);

Now comes the clever bit. Lines in our ledger which are just income and
not expenditure might end at column 46. Hence, we don't want to tell our
C<unpack> pattern that we B<need> to find another 12 characters; we'll
just say "if there's anything left, take it". As you might guess from
regular expressions, that's what the C<*> means: "use everything
remaining".

=over 3

=item *

Be warned, though, that unlike regular expressions, if the C<unpack>
template doesn't match the incoming data, Perl will scream and die.

=back


Hence, putting it all together:

    my($date,$description,$income,$expend) = unpack("A10xA27xA7xA*", $_);
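The template can be tried on sample records without a data file. The sketch below is an illustration under stated assumptions: the records are generated with C<sprintf> (so the columns are guaranteed to line up with the ruler), and the names C<@lines> and C<@records> are hypothetical.

```perl
use strict;
use warnings;

# Build sample records in the layout shown by the ruler: date in
# columns 1-10, description in 12-38, income in 40-46, expenditure
# from column 48 on. sprintf is used only to avoid hand-counting.
my @lines = map { sprintf("%-10s %-27s %7s %11s\n", @$_) } (
    [ '01/24/2001', "Ahmed's Camel Emporium",  '',       '1147.99' ],
    [ '01/28/2001', 'Flea spray',              '',       '24.99'   ],
    [ '01/29/2001', 'Camel rides to tourists', '235.00', ''        ],
);

my ($tot_income, $tot_expend) = (0, 0);
my @records;
for my $line (@lines) {
    my ($date, $desc, $income, $expend) = unpack('A10xA27xA7xA*', $line);
    push @records, [ $date, $desc, $income, $expend ];
    # 'A' strips trailing padding, so an empty column unpacks as ''.
    $tot_income += $income || 0;
    $tot_expend += $expend || 0;
}

printf "income %.2f, expenditure %.2f\n", $tot_income, $tot_expend;
```

Note that C<A> strips only trailing padding, so a right-justified amount keeps its leading spaces; Perl ignores these when the string is used as a number.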

Now, that's our data parsed. I suppose what we might want to do now is
total up our income and expenditure, and add another line to the end of
our ledger - in the same format - saying how much we've brought in and
how much we've spent:

    while (<>) {
        my($date, $desc, $income, $expend) = unpack("A10xA27xA7xA*", $_);
        $tot_income += $income;
        $tot_expend += $expend;
    }

    $tot_income = sprintf("%.2f", $tot_income); # Get them into
    $tot_expend = sprintf("%.2f", $tot_expend); # "financial" format

    use POSIX ();   # load the POSIX module so POSIX::strftime is available
    $date = POSIX::strftime("%m/%d/%Y", localtime);

    # OK, let's go:

    print pack("A10xA27xA7xA*", $date, "Totals", $tot_income, $tot_expend);

Oh, hmm. That didn't quite work. Let's see what happened:

    01/24/2001 Ahmed's Camel Emporium                   1147.99
    01/28/2001 Flea spray                                 24.99
    01/29/2001 Camel rides to tourists     1235.00
    03/23/2001Totals                     1235.001172.98

OK, it's a start, but what happened to the spaces? We put C<x>, didn't
we? Shouldn't it skip forward? Let's look at what L<perlfunc/pack> says:

    x   A null byte.

Urgh. No wonder. There's a big difference between "a null byte",
character zero, and "a space", character 32. Perl's put something
between the date and the description - but unfortunately, we can't see
it!

What we actually need to do is expand the width of the fields. The C<A>
format pads any non-existent characters with spaces, so we can use the
additional spaces to line up our fields, like this:

    print pack("A11 A28 A8 A*", $date, "Totals", $tot_income, $tot_expend);

(Note that you can put spaces in the template to make it more readable,
but they don't translate to spaces in the output.) Here's what we got
this time:

    01/24/2001 Ahmed's Camel Emporium                   1147.99
    01/28/2001 Flea spray                                 24.99
    01/29/2001 Camel rides to tourists     1235.00
    03/23/2001 Totals                      1235.00 1172.98

That's a bit better, but we still have that last column which needs to
be moved further over. There's an easy way to fix this up:
unfortunately, we can't get C<pack> to right-justify our fields, but we
can get C<sprintf> to do it:

    $tot_income = sprintf("%.2f", $tot_income);
    $tot_expend = sprintf("%12.2f", $tot_expend);
    $date = POSIX::strftime("%m/%d/%Y", localtime);
    print pack("A11 A28 A8 A*", $date, "Totals", $tot_income, $tot_expend);

This time we get the right answer:

    01/28/2001 Flea spray                                 24.99
    01/29/2001 Camel rides to tourists     1235.00
    03/23/2001 Totals                      1235.00      1172.98

So that's how we consume and produce fixed-width data. Let's recap what
we've seen of C<pack> and C<unpack> so far:

=over 3

=item *

Use C<pack> to go from several pieces of data to one fixed-width
version; use C<unpack> to turn a fixed-width-format string into several
pieces of data.

=item *

The pack format C<A> means "any character"; if you're C<pack>ing and
you've run out of things to pack, C<pack> will fill the rest up with
spaces.

=item *

C<x> means "skip a byte" when C<unpack>ing; when C<pack>ing, it means
"introduce a null byte" - that's probably not what you mean if you're
dealing with plain text.

=item *

You can follow the formats with numbers to say how many characters
should be affected by that format: C<A12> means "take 12 characters";
C<x6> means "skip 6 bytes" or "character 0, 6 times".

=item *

Instead of a number, you can use C<*> to mean "consume everything else
left".

B<Warning>: when packing multiple pieces of data, C<*> only means
"consume all of the current piece of data". That's to say

    pack("A*A*", $one, $two)

packs all of C<$one> into the first C<A*> and then all of C<$two> into
the second. This is a general principle: each format character
corresponds to one piece of data to be C<pack>ed.

=back
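The points of this recap can be checked with a few one-liners. This is a small sketch rather than part of the ledger program: the variable names are illustrative, and a fixed date string stands in for the C<strftime> call so that the result is repeatable.

```perl
use strict;
use warnings;

# 'A' pads short values with spaces and truncates long ones.
my $padded    = pack('A6', 'abc');       # 'abc   '
my $truncated = pack('A3', 'abcdef');    # 'abc'

# 'x' inserts a null byte when packing, not a space.
my $with_null = pack('A2xA2', 'ab', 'cd');   # "ab\0cd"

# Each 'A*' consumes exactly one list item.
my $joined = pack('A*A*', 'one', 'two');     # 'onetwo'

# A totals line built as in the text: padding comes from 'A',
# right-justification of the last column comes from sprintf.
my $line = pack('A11 A28 A8 A*',
    '03/23/2001', 'Totals', '1235.00', sprintf('%12.2f', 1172.98));
print "$line\n";
```

The widths C<A11>, C<A28> and C<A8> each include the separating blank column, which is why no C<x> is needed when producing text.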



=head1 Packing Numbers

So much for textual data. Let's get onto the meaty stuff that C<pack>
and C<unpack> are best at: handling binary formats for numbers. There is,
of course, not just one binary format - life would be too simple - but
Perl will do all the finicky labor for you.


=head2 Integers

Packing and unpacking numbers implies conversion to and from some
I<specific> binary representation. Leaving floating point numbers
aside for the moment, the salient properties of any such representation
are:

=over 4

=item *

the number of bytes used for storing the integer,

=item *

whether the contents are interpreted as a signed or unsigned number,

=item *

the byte ordering: whether the first byte is the least or most
significant byte (or: little-endian or big-endian, respectively).

=back

So, for instance, to pack 20302 to a signed 16 bit integer in your
computer's representation you write

   my $ps = pack( 's', 20302 );

Again, the result is a string, now containing 2 bytes. If you print
this string (which is, generally, not recommended) you might see
C<ON> or C<NO> (depending on your system's byte ordering) - or something
entirely different if your computer doesn't use ASCII character encoding.
Unpacking C<$ps> with the same template returns the original integer value:

   my( $s ) = unpack( 's', $ps );

This is true for all numeric template codes. But don't expect miracles:
if the packed value exceeds the allotted byte capacity, high order bits
are silently discarded, and unpack certainly won't be able to pull them
back out of some magic hat. And, when you pack using a signed template
code such as C<s>, an excess value may result in the sign bit
getting set, and unpacking this will smartly return a negative value.
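A short sketch of the round trip and of the overflow behaviour just described. The values 40000 and 70000 are arbitrary choices that do not fit into 16 bits (signed and unsigned, respectively), and the variable names are illustrative.

```perl
use strict;
use warnings;

# Round trip through a native signed 16-bit integer.
my $ps  = pack('s', 20302);
my ($s) = unpack('s', $ps);

# 20302 is 0x4F4E; the byte order in the string depends on the machine.
my ($bytes) = unpack('H*', $ps);    # '4e4f' or '4f4e'

# 40000 needs the 16th bit, which 's' interprets as the sign bit.
my ($wrapped) = unpack('s', pack('s', 40000));

# 70000 needs 17 bits; the high-order bit is silently dropped.
my ($clipped) = unpack('S', pack('S', 70000));

print "$s $bytes $wrapped $clipped\n";
```

Here C<$wrapped> comes back as 40000 - 65536 and C<$clipped> as 70000 - 65536, which is exactly the "silently discarded" behaviour the text warns about.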

16 bits won't get you too far with integers, but there are C<l> and C<L>
for signed and unsigned 32-bit integers. And if this is not enough and
your system supports 64 bit integers you can push the limits much closer
to infinity with pack codes C<q> and C<Q>. A notable exception is provided
by pack codes C<i> and C<I> for signed and unsigned integers of the
"local custom" variety: Such an integer will take up as many bytes as
a local C compiler returns for C<sizeof(int)>, but it'll use I<at least>
32 bits.

Each of the integer pack codes C<sSlLqQ> results in a fixed number of bytes,
no matter where you execute your program. This may be useful for some
applications, but it does not provide a portable way to pass data
structures between Perl and C programs (bound to happen when you call
XS extensions or the Perl function C<syscall>), or to read or
write binary files. What you'll need in this case are template codes that
depend on what your local C compiler compiles when you code C<short> or
C<unsigned long>, for instance. These codes and their corresponding
byte lengths are shown in the table below.  Since the C standard leaves
much leeway with respect to the relative sizes of these data types, actual
values may vary, and that's why the values are given as expressions in
C and Perl. (If you'd like to use values from C<%Config> in your program
you have to import it with C<use Config>.)

   signed unsigned  byte length in C   byte length in Perl
     s!     S!      sizeof(short)      $Config{shortsize}
     i!     I!      sizeof(int)        $Config{intsize}
     l!     L!      sizeof(long)       $Config{longsize}
     q!     Q!      sizeof(long long)  $Config{longlongsize}

The C<i!> and C<I!> codes aren't different from C<i> and C<I>; they are
tolerated for completeness' sake.
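The table can be checked against a particular perl by comparing packed lengths with the C<%Config> entries. This is only a sketch: the hash C<%native> and the array C<@fixed> are names made up for the example, and the C<q!> row is left out because it requires a perl built with 64-bit integer support.

```perl
use strict;
use warnings;
use Config;

# Native-size codes paired with the %Config entry from the table.
my %native = (
    's!' => $Config{shortsize},
    'i'  => $Config{intsize},
    'l!' => $Config{longsize},
);

my %packed_len;
for my $code (sort keys %native) {
    $packed_len{$code} = length(pack($code, 0));
    printf "%-2s packs to %d byte(s); Config says %d\n",
        $code, $packed_len{$code}, $native{$code};
}

# The plain codes have the same size everywhere: 2, 2, 4 and 4 bytes.
my @fixed = map { length(pack($_, 0)) } qw(s S l L);
print "@fixed\n";
```

On many current systems C<l!> reports 8 bytes while C<l> stays at 4, which is the difference the exclamation mark is there to express.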


=head2 Unpacking a Stack Frame

Requesting a particular byte ordering may be necessary when you work with
binary data coming from some specific architecture whereas your program could
run on a totally different system. As an example, assume you have 24 bytes
containing a stack frame as it happens on an Intel 8086:

      +---------+        +----+----+               +---------+
 TOS: |   IP    |  TOS+4:| FL | FH | FLAGS  TOS+14:|   SI    |
      +---------+        +----+----+               +---------+
      |   CS    |        | AL | AH | AX            |   DI    |
      +---------+        +----+----+               +---------+
                         | BL | BH | BX            |   BP    |
                         +----+----+               +---------+
                         | CL | CH | CX            |   DS    |
                         +----+----+               +---------+
                         | DL | DH | DX            |   ES    |
                         +----+----+               +---------+

First, we note that this time-honored 16-bit CPU uses little-endian order,
and that's why the low order byte is stored at the lower address. To
unpack such an (unsigned) short we'll have to use code C<v>. A repeat
count unpacks all 12 shorts:

   my( $ip, $cs, $flags, $ax, $bx, $cx, $dx, $si, $di, $bp, $ds, $es ) =
     unpack( 'v12', $frame );

Alternatively, we could have used C<C> to unpack the individually
accessible byte registers FL, FH, AL, AH, etc.:

   my( $fl, $fh, $al, $ah, $bl, $bh, $cl, $ch, $dl, $dh ) =
     unpack( 'C10', substr( $frame, 4, 10 ) );

It would be nice if we could do this in one fell swoop: unpack a short,
back up a little, and then unpack 2 bytes. Since Perl I<is> nice, it
proffers the template code C<X> to back up one byte. Putting this all
together, we may now write:

   my( $ip, $cs,
       $flags,$fl,$fh,
       $ax,$al,$ah, $bx,$bl,$bh, $cx,$cl,$ch, $dx,$dl,$dh,
       $si, $di, $bp, $ds, $es ) =
   unpack( 'v2' . ('vXXCC' x 5) . 'v5', $frame );

(The clumsy construction of the template can be avoided - just read on!)

We've taken some pains to construct the template so that it matches
the contents of our frame buffer. Otherwise we'd either get undefined values,
or C<unpack> could not unpack all. If C<pack> runs out of items, it will
supply null strings (which are coerced into zeroes whenever the pack code
says so).
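To try the combined template without real 8086 data, a frame can be fabricated with C<pack>. The register values below are invented purely for illustration; only the layout follows the diagram.

```perl
use strict;
use warnings;

# Fabricate a 24-byte frame of twelve little-endian 16-bit words.
my $frame = pack('v12',
    0x0100, 0x2000,                  # IP, CS
    0x0246,                          # FLAGS
    0x1234, 0x5678, 0x9ABC, 0xDEF0,  # AX, BX, CX, DX
    1, 2, 3, 4, 5);                  # SI, DI, BP, DS, ES

# Each 'vXXCC' reads a word, backs up two bytes, and reads the same
# two bytes again as low byte and high byte.
my ( $ip, $cs,
     $flags, $fl, $fh,
     $ax, $al, $ah, $bx, $bl, $bh, $cx, $cl, $ch, $dx, $dl, $dh,
     $si, $di, $bp, $ds, $es ) =
    unpack( 'v2' . ('vXXCC' x 5) . 'v5', $frame );

printf "AX=%04X AL=%02X AH=%02X\n", $ax, $al, $ah;
```

Since the low-order byte sits at the lower address, the first C<C> after backing up yields AL and the second yields AH, matching the order in the diagram.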


=head2 How to Eat an Egg on a Net

The pack code for big-endian (high order byte at the lowest address) is
C<n> for 16 bit and C<N> for 32 bit integers. You use these codes
if you know that your data comes from a compliant architecture, but,
surprisingly enough, you should also use these pack codes if you
exchange binary data, across the network, with some system that you
know next to nothing about. The simple reason is that this
order has been chosen as the I<network order>, and all standard-fearing
programs ought to follow this convention. (This is, of course, a stern
backing for one of the Lilliputian parties and may well influence the
political development there.) So, if the protocol expects you to send
a message by sending the length first, followed by just so many bytes,
you could write:

   my $buf = pack( 'N', length( $msg ) ) . $msg;

or even:

   my $buf = pack( 'NA*', length( $msg ), $msg );

and pass C<$buf> to your send routine. Some protocols demand that the
count should include the length of the count itself: then just add 4
to the data length. (But make sure to read L<"Lengths and Widths"> before
you really code this!)
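The receiving side of such a length-prefixed message might look like the sketch below. It assumes that C<$msg> is a plain byte string and that the whole buffer has already arrived; the names C<$payload> and C<$prefix_hex> are illustrative, and no actual network code is shown.

```perl
use strict;
use warnings;

my $msg = 'Hello, net!';    # assumed to be a plain byte string

# Sender: a 4-byte big-endian length, then the payload.
my $buf = pack('N', length($msg)) . $msg;

# Receiver: read the length, then take that many bytes after it.
my ($len)   = unpack('N', $buf);
my $payload = substr($buf, 4, $len);

# The prefix has the same byte sequence on every host.
my $prefix_hex = unpack('H*', substr($buf, 0, 4));

print "$len $payload $prefix_hex\n";
```

Because C<N> fixes both the width and the byte order, the four prefix bytes do not depend on the machine that produced them.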



=head2 Floating point Numbers

For packing floating point numbers you have the choice between the
pack codes C<f> and C<d> which pack into (or unpack from) single-precision or
double-precision representation as it is provided by your system. (There
is no such thing as a network representation for reals, so if you want
to send your real numbers across computer boundaries, you'd better stick
to ASCII representation, unless you're absolutely sure what's on the other
end of the line.)
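The difference between the two codes shows up as soon as a value is packed and unpacked again. The sketch below assumes a perl whose own numbers are C doubles (the usual build configuration); the value 0.1 is an arbitrary choice that has no exact binary representation.

```perl
use strict;
use warnings;

my $x = 0.1;

# Double precision: the value normally comes back unchanged.
my ($from_double) = unpack('d', pack('d', $x));

# Single precision keeps fewer digits, so a small error appears.
my ($from_float) = unpack('f', pack('f', $x));
my $error = abs($from_float - $x);

printf "double: %.17g\nfloat:  %.17g\nerror:  %g\n",
    $from_double, $from_float, $error;
```

The packed strings themselves are in the machine's native format, so they are only safe to exchange between systems known to share it.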
472*0Sstevel@tonic-gate
473*0Sstevel@tonic-gate
474*0Sstevel@tonic-gate
=head1 Exotic Templates


=head2 Bit Strings

Bits are the atoms in the memory world. Access to individual bits may
have to be used either as a last resort or because it is the most
convenient way to handle your data. Bit string (un)packing converts
between strings containing a series of C<0> and C<1> characters and
a sequence of bytes each containing a group of 8 bits. This is almost
as simple as it sounds, except that there are two ways the contents of
a byte may be written as a bit string. Let's have a look at an annotated
byte:

     7 6 5 4 3 2 1 0
   +-----------------+
   | 1 0 0 0 1 1 0 0 |
   +-----------------+
    MSB           LSB

It's egg-eating all over again: Some think that as a bit string this should
be written "10001100" i.e. beginning with the most significant bit, others
insist on "00110001". Well, Perl isn't biased, so that's why we have two bit
string codes:

   $byte = pack( 'B8', '10001100' ); # start with MSB
   $byte = pack( 'b8', '00110001' ); # start with LSB

It is not possible to pack or unpack bit fields - just integral bytes.
C<pack> always starts at the next byte boundary and "rounds up" to the
next multiple of 8 by adding zero bits as required. (If you do want bit
fields, there is L<perlfunc/vec>. Or you could implement bit field
handling at the character string level, using split, substr, and
concatenation on unpacked bit strings.)
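
As a minimal sketch of the two bit orders and of the rounding rule (the
variable names are made up for illustration), the following packs the
annotated byte both ways and then packs a three-bit string to show the
zero fill:

```perl
use strict;
use warnings;

# The same byte, written MSB-first and LSB-first.
my $msb_first = pack( 'B8', '10001100' );
my $lsb_first = pack( 'b8', '00110001' );
my $same      = $msb_first eq $lsb_first;     # both hold the byte 0x8C
my $hex       = unpack( 'H2', $msb_first );

# Only three bits supplied: pack fills the rest of the byte with zero bits.
my $padded_B = unpack( 'H2', pack( 'B3', '101' ) );   # bits 101 then 00000
my $padded_b = unpack( 'H2', pack( 'b3', '101' ) );   # bit 0 and bit 2 set

print "$hex $padded_B $padded_b\n";
```

With C<B> the supplied bits land at the high end of the byte, with C<b>
at the low end, which is why the two padded results differ.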

To illustrate unpacking for bit strings, we'll decompose a simple
status register (a "-" stands for a "reserved" bit):

   +-----------------+-----------------+
   | S Z - A - P - C | - - - - O D I T |
   +-----------------+-----------------+
    MSB           LSB MSB           LSB

Converting these two bytes to a string can be done with the unpack
template C<'b16'>. To obtain the individual bit values from the bit
string we use C<split> with the "empty" separator pattern which dissects
into individual characters. Bit values from the "reserved" positions are
simply assigned to C<undef>, a convenient notation for "I don't care where
this goes".

   ($carry, undef, $parity, undef, $auxcarry, undef, $zero, $sign,
    $trace, $interrupt, $direction, $overflow) =
      split( //, unpack( 'b16', $status ) );

We could have used an unpack template C<'b12'> just as well, since the
last 4 bits can be ignored anyway.
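
To see the register decoding at work, here is a small sketch that builds
a made-up status word (the bit pattern is an assumption chosen for
illustration, with S, A, C, D and T set) and takes it apart in the way
described above:

```perl
use strict;
use warnings;

# Hypothetical register contents, written MSB-first per byte:
#   first byte  S Z - A - P - C  =  1 0 0 1 0 0 0 1
#   second byte - - - - O D I T  =  0 0 0 0 0 1 0 1
my $status = pack( 'B8 B8', '10010001', '00000101' );

# 'b16' lists the bits of each byte starting with the LSB.
my $bits = unpack( 'b16', $status );

my ($carry, undef, $parity, undef, $auxcarry, undef, $zero, $sign,
    $trace, $interrupt, $direction, $overflow) = split( //, $bits );

print "bits=$bits carry=$carry sign=$sign trace=$trace overflow=$overflow\n";
```

Because C<b> starts with the least significant bit, the variables appear
in the reverse order of the diagram within each byte.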


=head2 Uuencoding

Another odd-man-out in the template alphabet is C<u>, which packs an
"uuencoded string". ("uu" is short for Unix-to-Unix.) Chances are that
you won't ever need this encoding technique, which was invented to overcome
the shortcomings of old-fashioned transmission media that support
nothing but simple ASCII data. The essential recipe is simple: Take three
bytes, or 24 bits. Split them into 4 six-packs, adding a space (0x20) to
each. Repeat until all of the data is blended. Fold groups of 4 bytes into
lines no longer than 60 and garnish them in front with the original byte count
(incremented by 0x20) and a C<"\n"> at the end. The C<pack> chef will
prepare this for you, a la minute, when you select pack code C<u> on the menu:

   my $uubuf = pack( 'u', $bindat );

A repeat count after C<u> sets the number of bytes to put into an
uuencoded line, which is the maximum of 45 by default, but could be
set to some (smaller) integer multiple of three. C<unpack> simply ignores
the repeat count.
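
A short sketch of C<u> in action, using arbitrary sample strings: it
encodes three bytes, decodes them again, and uses a repeat count of 3 to
force one encoded line per three input bytes.

```perl
use strict;
use warnings;

# Three input bytes become a count character, four data characters
# and a newline.
my $uubuf = pack( 'u', 'cat' );
my $back  = unpack( 'u', $uubuf );

# A repeat count of 3 puts three input bytes on each output line,
# so six input bytes yield two lines.
my $short_lines = pack( 'u3', 'abcdef' );
my @lines       = split /\n/, $short_lines;
my $back2       = unpack( 'u', $short_lines );

print $uubuf;
print scalar( @lines ), " lines\n";
```

Each line starts with C<#>, which is the byte count 3 incremented by 0x20.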


=head2 Doing Sums

An even stranger template code is C<%>E<lt>I<number>E<gt>. First, because
it's used as a prefix to some other template code. Second, because it
cannot be used in C<pack> at all, and third, because in C<unpack> it doesn't
return the data as defined by the template code it precedes. Instead it'll
give you an integer of I<number> bits that is computed from the data value by
doing sums. For numeric unpack codes, no big feat is achieved:

    my $buf = pack( 'iii', 100, 20, 3 );
    print unpack( '%32i3', $buf ), "\n";  # prints 123

For string values, C<%> returns the sum of the byte values, saving
you the trouble of a sum loop with C<substr> and C<ord>:

    print unpack( '%32A*', "\x01\x10" ), "\n";  # prints 17

Although the C<%> code is documented as returning a "checksum",
don't put your trust in such values! Even when applied to a small number
of bytes, they won't guarantee a noticeable Hamming distance.

In connection with C<b> or C<B>, C<%> simply adds bits, and this can be put
to good use to count set bits efficiently:

    my $bitcount = unpack( '%32b*', $mask );

And an even parity bit can be determined like this:

    my $evenparity = unpack( '%1b*', $mask );
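
The following sketch, with arbitrary sample masks, counts bits and derives
the parity bit. It also illustrates why the byte sum is a weak checksum:
swapping two bytes leaves it unchanged.

```perl
use strict;
use warnings;

my $mask = "\x0F\x81";                        # 00001111 10000001, six bits set

my $bitcount   = unpack( '%32b*', $mask );    # number of set bits
my $evenparity = unpack( '%1b*',  $mask );    # lowest bit of that count
my $oddmask    = unpack( '%1b*',  "\x07" );   # three bits set

# A plain byte sum cannot tell "ab" from "ba".
my $sum_ab = unpack( '%32C*', 'ab' );
my $sum_ba = unpack( '%32C*', 'ba' );

print "$bitcount $evenparity $oddmask $sum_ab $sum_ba\n";
```

The parity value is the bit you would append to make the total number of
set bits even.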


=head2 Unicode

Unicode is a character set that can represent most characters in most of
the world's languages, providing room for over one million different
characters. Unicode 3.1 specifies 94,140 characters: The Basic Latin
characters are assigned to the numbers 0 - 127. The Latin-1 Supplement with
characters that are used in several European languages is in the next
range, up to 255. After some more Latin extensions we find the character
sets from languages using non-Roman alphabets, interspersed with a
variety of symbol sets such as currency symbols, Zapf Dingbats or Braille.
(You might want to visit L<www.unicode.org> for a look at some of
them - my personal favourites are Telugu and Kannada.)

The Unicode character set associates characters with integers. Encoding
these numbers in an equal number of bytes would more than double the
requirements for storing texts written in Latin alphabets.
The UTF-8 encoding avoids this by storing the most common (from a western
point of view) characters in a single byte while encoding the rarer
ones in two or more bytes.

So what has this got to do with C<pack>? Well, if you want to convert
between a Unicode number and its UTF-8 representation you can do so by
using template code C<U>. As an example, let's produce the UTF-8
representation of the Euro currency symbol (code number 0x20AC):

   $UTF8{Euro} = pack( 'U', 0x20AC );

Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes: "\xe2\x82\xac". The
round trip can be completed with C<unpack>:

   $Unicode{Euro} = unpack( 'U', $UTF8{Euro} );

Usually you'll want to pack or unpack UTF-8 strings:

   # pack and unpack the Hebrew alphabet
   my $alefbet = pack( 'U*', 0x05d0..0x05ea );
   my @hebrew = unpack( 'U*', $alefbet );
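
Here is a hedged sketch of the Euro round trip. One assumption to flag:
on current Perls C<pack( 'U', ... )> returns a character string, so this
sketch obtains the UTF-8 octets explicitly with C<utf8::encode> on a copy
before dumping them in hex. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $euro  = pack( 'U', 0x20AC );        # one character, U+20AC
my $bytes = $euro;                      # work on a copy
utf8::encode( $bytes );                 # $bytes now holds the UTF-8 octets
my $hex   = unpack( 'H*', $bytes );     # hex dump of those octets
my $code  = unpack( 'U', $euro );       # back to the code number

# The Hebrew block used above: 0x05d0 through 0x05ea.
my $alefbet = pack( 'U*', 0x05d0 .. 0x05ea );
my @hebrew  = unpack( 'U*', $alefbet );

printf "%s U+%04X %d letters\n", $hex, $code, scalar @hebrew;
```

The hex dump shows the same three bytes quoted in the text.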


=head2 Another Portable Binary Encoding

The pack code C<w> has been added to support a portable binary data
encoding scheme that goes way beyond simple integers. (Details can
be found at L<Casbah.org>, the Scarab project.)  A BER (Binary Encoded
Representation) compressed unsigned integer stores base 128
digits, most significant digit first, with as few digits as possible.
Bit eight (the high bit) is set on each byte except the last. There
is no size limit to BER encoding, but Perl won't go to extremes.

   my $berbuf = pack( 'w*', 1, 128, 128+1, 128*128+127 );

A hex dump of C<$berbuf>, with spaces inserted at the right places,
shows 01 8100 8101 81807F. Since the last byte is always less than
128, C<unpack> knows where to stop.
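
The hex dump mentioned above can be reproduced with a few lines. This is
only a sketch; C<$hex> and C<@back> are names introduced here.

```perl
use strict;
use warnings;

my $berbuf = pack( 'w*', 1, 128, 128+1, 128*128+127 );

my $hex  = unpack( 'H*', $berbuf );   # contiguous hex dump, lower case
my @back = unpack( 'w*', $berbuf );   # decode all four integers again

print "$hex\n";
print "@back\n";
```

The four values occupy one, two, two and three bytes respectively, and
C<unpack> finds the boundaries from the high bits alone.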


=head1 Template Grouping

Prior to Perl 5.8, repetitions of templates had to be made by
C<x>-multiplication of template strings. Now there is a better way as
we may use the pack codes C<(> and C<)> combined with a repeat count.
The C<unpack> template from the Stack Frame example can simply
be written like this:

   unpack( 'v2 (vXXCC)5 v5', $frame )

Let's explore this feature a little more. We'll begin with the equivalent of

   join( '', map( substr( $_, 0, 1 ), @str ) )

which returns a string consisting of the first character from each string.
Using pack, we can write

   pack( '(A)'.@str, @str )

or, because a repeat count C<*> means "repeat as often as required",
simply

   pack( '(A)*', @str )

(Note that the template C<A*> would only have packed C<$str[0]> in full
length.)

To pack dates stored as triplets ( day, month, year ) in an array C<@dates>
into a sequence of byte, byte, short integer we can write

   $pd = pack( '(CCS)*', map( @$_, @dates ) );

To swap pairs of characters in a string (with even length) one could use
several techniques. First, let's use C<x> and C<X> to skip forward and back:

   $s = pack( '(A)*', unpack( '(xAXXAx)*', $s ) );

We can also use C<@> to jump to an offset, with 0 being the position where
we were when the last C<(> was encountered:

   $s = pack( '(A)*', unpack( '(@1A @0A @2)*', $s ) );

Finally, there is also an entirely different approach by unpacking big
endian shorts and packing them in the reverse byte order:

   $s = pack( '(v)*', unpack( '(n)*', $s ) );
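
The grouping examples can be tried out together. The sample strings and
dates below are invented for illustration; the templates are the ones
from the text.

```perl
use strict;
use warnings;

my @str    = qw( apple banana cherry );
my $firsts = pack( '(A)*', @str );                     # first character of each

my @dates  = ( [ 24, 12, 2003 ], [ 1, 1, 2004 ] );     # ( day, month, year )
my $pd     = pack( '(CCS)*', map( @$_, @dates ) );     # byte, byte, short each

my $s      = 'abcdef';
my $swap_x = pack( '(A)*', unpack( '(xAXXAx)*',     $s ) );   # skip forward and back
my $swap_a = pack( '(A)*', unpack( '(@1A @0A @2)*', $s ) );   # jump to offsets
my $swap_n = pack( '(v)*', unpack( '(n)*',          $s ) );   # reverse byte order

print "$firsts $swap_x $swap_a $swap_n ", length( $pd ), "\n";
```

All three swapping techniques give the same result for a string of even
length.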


=head1 Lengths and Widths

=head2 String Lengths

In the previous section we've seen a network message that was constructed
by prefixing the binary message length to the actual message. You'll find
that packing a length followed by so many bytes of data is a
frequently used recipe since appending a null byte won't work
if a null byte may be part of the data. Here is an example where both
techniques are used: after two null terminated strings with source and
destination address, a Short Message (to a mobile phone) is sent after
a length byte:

   my $msg = pack( 'Z*Z*CA*', $src, $dst, length( $sm ), $sm );

Unpacking this message can be done with the same template:

   ( $src, $dst, $len, $sm ) = unpack( 'Z*Z*CA*', $msg );

There's a subtle trap lurking in the offing: Adding another field after
the Short Message (in variable C<$sm>) is all right when packing, but this
cannot be unpacked naively:

   # pack a message
   my $msg = pack( 'Z*Z*CA*C', $src, $dst, length( $sm ), $sm, $prio );

   # unpack fails - $prio remains undefined!
   ( $src, $dst, $len, $sm, $prio ) = unpack( 'Z*Z*CA*C', $msg );

The pack code C<A*> gobbles up all remaining bytes, and C<$prio> remains
undefined! Before we let disappointment dampen the morale: Perl's got
the trump card to make this trick too, just a little further up the sleeve.
Watch this:

   # pack a message: ASCIIZ, ASCIIZ, length/string, byte
   my $msg = pack( 'Z* Z* C/A* C', $src, $dst, $sm, $prio );

   # unpack
   ( $src, $dst, $sm, $prio ) = unpack( 'Z* Z* C/A* C', $msg );

Combining two pack codes with a slash (C</>) associates them with a single
value from the argument list. In C<pack>, the length of the argument is
taken and packed according to the first code while the argument itself
is added after being converted with the template code after the slash.
This saves us the trouble of inserting the C<length> call, but it is
in C<unpack> where we really score: The value of the length byte marks the
end of the string to be taken from the buffer. Since, for strings, this
combination doesn't make sense unless the second pack code is C<a*>, C<A*>
or C<Z*>, Perl won't let you use any other string code there.

The pack code preceding C</> may be anything that's fit to represent a
number: All the numeric binary pack codes, and even text codes such as
C<A4> or C<Z*>:

   # pack/unpack a string preceded by its length in ASCII
   my $buf = pack( 'A4/A*', "Humpty-Dumpty" );
   # unpack $buf: '13  Humpty-Dumpty'
   my $txt = unpack( 'A4/A*', $buf );
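
To make the trap and its cure concrete, here is a small sketch with
invented addresses and message text. It packs the same message both ways,
shows that the naive C<unpack> loses the priority, and that the C</>
form recovers every field.

```perl
use strict;
use warnings;

# Hypothetical sample values
my ( $src, $dst, $sm, $prio ) = ( '+4123456', '+4176543', 'Hello there', 3 );

# Naive layout: when unpacking, A* swallows the priority byte as well.
my $naive_msg = pack( 'Z*Z*CA*C', $src, $dst, length( $sm ), $sm, $prio );
my @naive     = unpack( 'Z*Z*CA*C', $naive_msg );

# Length/string layout: the length byte tells unpack where the text ends.
my $msg    = pack( 'Z* Z* C/A* C', $src, $dst, $sm, $prio );
my @fields = unpack( 'Z* Z* C/A* C', $msg );

# The length may also be stored as ASCII text.
my $buf = pack( 'A4/A*', 'Humpty-Dumpty' );
my $txt = unpack( 'A4/A*', $buf );

print scalar( @naive ), " fields naively, ", scalar( @fields ), " with C/A*\n";
```

Both C<pack> calls produce identical bytes; only the C<unpack> template
decides whether the trailing field can be recovered.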

C</> is not implemented in Perls before 5.6, so if your code is required to
work on older Perls you'll need to C<unpack( 'Z* Z* C')> to get the length,
then use it to make a new unpack string. For example

   # pack a message: ASCIIZ, ASCIIZ, length, string, byte (5.005 compatible)
   my $msg = pack( 'Z* Z* C A* C', $src, $dst, length $sm, $sm, $prio );

   # unpack
   ( undef, undef, $len) = unpack( 'Z* Z* C', $msg );
   ($src, $dst, $sm, $prio) = unpack ( "Z* Z* x A$len C", $msg );

But that second C<unpack> is rushing ahead. It isn't using a simple literal
string for the template. So maybe we should introduce...

=head2 Dynamic Templates

So far, we've seen literals used as templates. If the list of pack
items doesn't have fixed length, an expression constructing the
template is required (whenever, for some reason, C<()*> cannot be used).
Here's an example: To store named string values in a way that can be
conveniently parsed by a C program, we create a sequence of names and
null terminated ASCII strings, with C<=> between the name and the value,
followed by an additional delimiting null byte. Here's how:

   my $env = pack( '(A*A*Z*)' . keys( %Env ) . 'C',
                   map( { ( $_, '=', $Env{$_} ) } keys( %Env ) ), 0 );

Let's examine the cogs of this byte mill, one by one. There's the C<map>
call, creating the items we intend to stuff into the C<$env> buffer:
to each key (in C<$_>) it adds the C<=> separator and the hash entry value.
Each triplet is packed with the template code sequence C<A*A*Z*> that
is repeated according to the number of keys. (Yes, that's what the C<keys>
function returns in scalar context.) To get the very last null byte,
we add a C<0> at the end of the C<pack> list, to be packed with C<C>.
(Attentive readers may have noticed that we could have omitted the 0.)

For the reverse operation, we'll have to determine the number of items
in the buffer before we can let C<unpack> rip it apart:

   my $n = $env =~ tr/\0// - 1;
   my %env = map( split( /=/, $_ ), unpack( "(Z*)$n", $env ) );

The C<tr> counts the null bytes. The C<unpack> call returns a list of
name-value pairs each of which is taken apart in the C<map> block.
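
A runnable sketch of the byte mill, under a few assumptions: the hash
contents are invented, the keys are sorted so that the buffer is
reproducible, and C<split> is given a limit of 2 so that a value
containing C<=> would survive (the text above does not do this).

```perl
use strict;
use warnings;

my %Env   = ( HOME => '/home/user', SHELL => '/bin/sh' );   # hypothetical data
my @names = sort keys %Env;                                 # fixed order

# Build the template from the number of names: '(A*A*Z*)2C' here.
my $template = '(A*A*Z*)' . @names . 'C';
my $env      = pack( $template,
                     ( map { ( $_, '=', $Env{$_} ) } @names ), 0 );

# Reverse operation: count the null bytes, drop the final delimiter.
my $n    = ( $env =~ tr/\0// ) - 1;
my %back = map { split( /=/, $_, 2 ) } unpack( "(Z*)$n", $env );

print "$template, $n entries\n";
```

The template string is assembled at run time, which is all that "dynamic"
means here.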


=head2 Counting Repetitions

Rather than storing a sentinel at the end of a data item (or a list of items),
we could precede the data with a count. Again, we pack keys and values of
a hash, preceding each with an unsigned short length count, and up front
we store the number of pairs:

   my $env = pack( 'S(S/A* S/A*)*', scalar keys( %Env ), %Env );

This simplifies the reverse operation as the number of repetitions can be
unpacked with the C</> code:

   my %env = unpack( 'S/(S/A* S/A*)', $env );

Note that this is one of the rare cases where you cannot use the same
template for C<pack> and C<unpack> because C<pack> can't determine
a repeat count for a C<()>-group.
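
The counted variant round-trips as follows. This is a sketch with an
invented hash; the two templates are the ones given above.

```perl
use strict;
use warnings;

my %Env = ( HOME => '/home/user', SHELL => '/bin/sh' );   # hypothetical data

# Pair count up front, then each key and value with its own length.
my $env  = pack( 'S(S/A* S/A*)*', scalar keys( %Env ), %Env );

# The leading count becomes the repeat count of the group.
my %back = unpack( 'S/(S/A* S/A*)', $env );

print length( $env ), " bytes, ", scalar( keys %back ), " pairs\n";
```

The buffer needs two bytes for the pair count plus two length bytes in
front of every key and every value.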


=head1 Packing and Unpacking C Structures

In previous sections we have seen how to pack numbers and character
strings. If it were not for a couple of snags we could conclude this
section right away with the terse remark that C structures don't
contain anything else, and therefore you already know all there is to it.
Sorry, no: read on, please.

=head2 The Alignment Pit

In the consideration of speed against memory requirements the balance
has been tilted in favor of faster execution. This has influenced the
way C compilers allocate memory for structures: On architectures
where a 16-bit or 32-bit operand can be moved faster between places in
memory, or to or from a CPU register, if it is aligned at an even or
multiple-of-four or even at a multiple-of-eight address, a C compiler
will give you this speed benefit by stuffing extra bytes into structures.
If you don't cross the C shoreline this is not likely to cause you any
grief (although you should care when you design large data structures,
or you want your code to be portable between architectures (you do want
that, don't you?)).

To see how this affects C<pack> and C<unpack>, we'll compare these two
C structures:

   typedef struct {
     char     c1;
     short    s;
     char     c2;
     long     l;
   } gappy_t;

   typedef struct {
     long     l;
     short    s;
     char     c1;
     char     c2;
   } dense_t;

Typically, a C compiler allocates 12 bytes to a C<gappy_t> variable, but
requires only 8 bytes for a C<dense_t>. After investigating this further,
we can draw memory maps, showing where the extra 4 bytes are hidden:

   0           +4          +8          +12
   +--+--+--+--+--+--+--+--+--+--+--+--+
   |c1|xx|  s  |c2|xx|xx|xx|     l     |    xx = fill byte
   +--+--+--+--+--+--+--+--+--+--+--+--+
   gappy_t

   0           +4          +8
   +--+--+--+--+--+--+--+--+
   |     l     |  s  |c1|c2|
   +--+--+--+--+--+--+--+--+
   dense_t

And that's where the first quirk strikes: C<pack> and C<unpack>
templates have to be stuffed with C<x> codes to get those extra fill bytes.

The natural question: "Why can't Perl compensate for the gaps?" warrants
an answer. One good reason is that C compilers might provide (non-ANSI)
extensions permitting all sorts of fancy control over the way structures
are aligned, even at the level of an individual structure field. And, if
this were not enough, there is an insidious thing called C<union> where
the amount of fill bytes cannot be derived from the alignment of the next
item alone.

OK, so let's bite the bullet. Here's one way to get the alignment right
by inserting template codes C<x>, which don't take a corresponding item
from the list:

  my $gappy = pack( 'cxs cxxx l!', $c1, $s, $c2, $l );

Note the C<!> after C<l>: We want to make sure that we pack a long
integer as it is compiled by our C compiler. And even now, it will only
work for the platforms where the compiler aligns things as above.
And somebody somewhere has a platform where it doesn't.
[Probably a Cray, where C<short>s, C<int>s and C<long>s are all 8 bytes. :-)]

Counting bytes and watching alignments in lengthy structures is bound to
be a drag. Isn't there a way we can create the template with a simple
program? Here's a C program that does the trick:

   #include <stdio.h>
   #include <stddef.h>

   typedef struct {
     char     fc1;
     short    fs;
     char     fc2;
     long     fl;
   } gappy_t;

   /* print "@<offset><template code> " for one structure field;
      offsetof yields a size_t, so cast it to match %d */
   #define Pt(type,field,tchar) \
     printf( "@%d%s ", (int) offsetof(type,field), # tchar )

   int main(void) {
     Pt( gappy_t, fc1, c  );
     Pt( gappy_t, fs,  s! );
     Pt( gappy_t, fc2, c  );
     Pt( gappy_t, fl,  l! );
     printf( "\n" );
     return 0;
   }

The output line can be used as a template in a C<pack> or C<unpack> call:

  my $gappy = pack( '@0c @2s! @4c @8l!', $c1, $s, $c2, $l );

Gee, yet another template code - as if we hadn't plenty. But
C<@> saves our day by enabling us to specify the offset from the beginning
of the pack buffer to the next item: This is just the value
the C<offsetof> macro (defined in C<E<lt>stddef.hE<gt>>) returns when
given a C<struct> type and one of its field names ("member-designator" in
C standardese).

Neither using offsets nor adding C<x>'s to bridge the gaps is satisfactory.
(Just imagine what happens if the structure changes.) What we really need
is a way of saying "skip as many bytes as required to the next multiple of N".
In fluent Templatese, you say this with C<x!N> where N is replaced by the
appropriate value. Here's the next version of our struct packaging:

  my $gappy = pack( 'c x!2 s c x!4 l!', $c1, $s, $c2, $l );

That's certainly better, but we still have to know how long all the
integers are, and portability is far away. Rather than C<2>,
for instance, we want to say "however long a short is". But this can be
done by enclosing the appropriate pack code in brackets: C<[s]>. So, here's
the very best we can do:

  my $gappy = pack( 'c x![s] s c x![l!] l!', $c1, $s, $c2, $l );
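
The four templates can be compared side by side. This sketch uses made-up
field values and assumes a platform like the one drawn above, except that
C<long> may be 4 or 8 bytes: a 2-byte C<short> and a C<long> that starts
at offset 8. On such a platform all four produce the same bytes.

```perl
use strict;
use warnings;

my ( $c1, $s, $c2, $l ) = ( 1, 2, 3, 4 );   # hypothetical field values

my $by_hand  = pack( 'cxs cxxx l!',           $c1, $s, $c2, $l );
my $by_at    = pack( '@0c @2s! @4c @8l!',     $c1, $s, $c2, $l );
my $by_align = pack( 'c x!2 s c x!4 l!',      $c1, $s, $c2, $l );
my $portable = pack( 'c x![s] s c x![l!] l!', $c1, $s, $c2, $l );

# The same template works for unpack and skips the fill bytes again.
my @back = unpack( 'c x![s] s c x![l!] l!', $portable );

printf "%d bytes: %s\n", length( $portable ), unpack( 'H*', $portable );
```

Only the last template adapts by itself when the size of a C<long>
changes; the others merely happen to agree with it here.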


=head2 Alignment, Take 2

I'm afraid that we're not quite through with the alignment catch yet. The
hydra raises another ugly head when you pack arrays of structures:

   typedef struct {
     short    count;
     char     glyph;
   } cell_t;

   typedef cell_t buffer_t[BUFLEN];

Where's the catch? Padding is neither required before the first field C<count>,
nor between this and the next field C<glyph>, so why can't we simply pack
like this:

   # something goes wrong here:
   pack( 's!a' x @buffer,
         map{ ( $_->{count}, $_->{glyph} ) } @buffer );

This packs C<3*@buffer> bytes, but it turns out that the size of
C<buffer_t> is four times C<BUFLEN>! The moral of the story is that
the required alignment of a structure or array is propagated to the
next higher level where we have to consider padding I<at the end>
of each component as well. Thus the correct template is:

   pack( 's!ax' x @buffer,
         map{ ( $_->{count}, $_->{glyph} ) } @buffer );
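
A quick way to see the difference is to pack a small buffer both ways and
compare the lengths. The cell contents are invented, a 2-byte C<short> is
assumed, and the group template used for unpacking is an alternative
spelling introduced here, not taken from the text.

```perl
use strict;
use warnings;

my @buffer = ( { count => 1, glyph => 'a' },
               { count => 2, glyph => 'b' },
               { count => 3, glyph => 'c' } );

# Without and with the fill byte at the end of each cell.
my $too_short = pack( 's!a'  x @buffer,
                      map { ( $_->{count}, $_->{glyph} ) } @buffer );
my $padded    = pack( 's!ax' x @buffer,
                      map { ( $_->{count}, $_->{glyph} ) } @buffer );

# Reading the cells back, skipping the fill byte each time.
my @cells = unpack( '(s! a x)*', $padded );

print length( $too_short ), " vs ", length( $padded ), " bytes: @cells\n";
```

The padded buffer has the four bytes per cell that C<buffer_t> has on
such a platform.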
972*0Sstevel@tonic-gate
973*0Sstevel@tonic-gate=head2 Alignment, Take 3
974*0Sstevel@tonic-gate
975*0Sstevel@tonic-gateAnd even if you take all the above into account, ANSI still lets this:
976*0Sstevel@tonic-gate
977*0Sstevel@tonic-gate   typedef struct {
978*0Sstevel@tonic-gate     char     foo[2];
979*0Sstevel@tonic-gate   } foo_t;
980*0Sstevel@tonic-gate
981*0Sstevel@tonic-gatevary in size. The alignment constraint of the structure can be greater than
982*0Sstevel@tonic-gateany of its elements. [And if you think that this doesn't affect anything
983*0Sstevel@tonic-gatecommon, dismember the next cellphone that you see. Many have ARM cores, and
984*0Sstevel@tonic-gatethe ARM structure rules make C<sizeof (foo_t)> == 4]
985*0Sstevel@tonic-gate
986*0Sstevel@tonic-gate=head2 Pointers for How to Use Them
987*0Sstevel@tonic-gate
988*0Sstevel@tonic-gateThe title of this section indicates the second problem you may run into
989*0Sstevel@tonic-gatesooner or later when you pack C structures. If the function you intend
990*0Sstevel@tonic-gateto call expects a, say, C<void *> value, you I<cannot> simply take
991*0Sstevel@tonic-gatea reference to a Perl variable. (Although that value certainly is a
992*0Sstevel@tonic-gatememory address, it's not the address where the variable's contents are
993*0Sstevel@tonic-gatestored.)
994*0Sstevel@tonic-gate
995*0Sstevel@tonic-gateTemplate code C<P> promises to pack a "pointer to a fixed length string".
996*0Sstevel@tonic-gateIsn't this what we want? Let's try:
997*0Sstevel@tonic-gate
998*0Sstevel@tonic-gate    # allocate some storage and pack a pointer to it
999*0Sstevel@tonic-gate    my $memory = "\x00" x $size;
1000*0Sstevel@tonic-gate    my $memptr = pack( 'P', $memory );
1001*0Sstevel@tonic-gate
1002*0Sstevel@tonic-gateBut wait: doesn't C<pack> just return a sequence of bytes? How can we pass this
1003*0Sstevel@tonic-gatestring of bytes to some C code expecting a pointer which is, after all,
1004*0Sstevel@tonic-gatenothing but a number? The answer is simple: We have to obtain the numeric
1005*0Sstevel@tonic-gateaddress from the bytes returned by C<pack>.
1006*0Sstevel@tonic-gate
1007*0Sstevel@tonic-gate    my $ptr = unpack( 'L!', $memptr );

Obviously this assumes that it is possible to typecast a pointer
to an unsigned long and vice versa, which frequently works but should not
be taken as a universal law.

Now that we have this pointer, the next question is: how can we put it to
good use? We need a call to some C function where a pointer is expected.
The read(2) system call comes to mind:

    ssize_t read(int fd, void *buf, size_t count);

After reading L<perlfunc> explaining how to use C<syscall> we can write
this Perl function copying a file to standard output:

    require 'syscall.ph';   # may need to run h2ph to generate this file
    sub cat($){
        my $path = shift();
        my $size = -s $path;
        my $memory = "\x00" x $size;  # allocate some memory
        # 'L!' is the native unsigned long, matching the unpack shown earlier
        my $ptr = unpack( 'L!', pack( 'P', $memory ) );
        open( F, $path ) || die( "$path: cannot open ($!)\n" );
        my $fd = fileno(F);
        my $res = syscall( &SYS_read, $fd, $ptr, $size );
        print $memory;
        close( F );
    }

This is neither a specimen of simplicity nor a paragon of portability, but
it illustrates the point: we are able to sneak behind the scenes and
access Perl's otherwise well-guarded memory! (Important note: Perl's
C<syscall> does I<not> require you to construct pointers in this roundabout
way. You simply pass a string variable, and Perl forwards the address.)

How does C<unpack> with C<P> work? Imagine some pointer in the buffer
about to be unpacked: If it isn't the null pointer (which will smartly
produce the C<undef> value) we have a start address - but then what?
Perl has no way of knowing how long this "fixed length string" is, so
it's up to you to specify the actual size as an explicit length after C<P>.

    my $mem = "abcdefghijklmn";
    print unpack( 'P5', pack( 'P', $mem ) ); # prints "abcde"

As a consequence, C<pack> ignores any number or C<*> after C<P>.
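
A slightly longer sketch may help to make the role of the length concrete.
The variable names are ours, chosen for illustration only. It reads the same
buffer with two different lengths, and it also packs C<undef>, which yields
the null pointer mentioned above.

    use strict;
    use warnings;

    my $mem    = "abcdefghijklmn";
    my $packed = pack( 'P', $mem );

    # The length after P decides how many bytes unpack copies.
    my $first3 = unpack( 'P3', $packed );
    my $all    = unpack( 'P' . length($mem), $packed );

    # Packing undef produces a null pointer, which unpacks to undef.
    my $null    = pack( 'P', undef );
    my $nothing = unpack( 'P4', $null );

    print "$first3 $all ", defined $nothing ? "defined" : "undef", "\n";

Never ask for more bytes than the variable actually holds: C<unpack> trusts
the length you give it.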

Now that we have seen C<P> at work, we might as well give C<p> a whirl.
Why do we need a second template code for packing pointers at all? The
answer lies behind the simple fact that an C<unpack> with C<p> promises
a null-terminated string starting at the address taken from the buffer,
and that implies a length for the data item to be returned:

    my $buf = pack( 'p', "abc\x00efhijklmn" );
    print unpack( 'p', $buf );    # prints "abc"

This is apt to be confusing, though: since the length is implied by the
string's terminating null byte, a number after pack code C<p> is a repeat
count, not a length as after C<P>.
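
The repeat count can be seen at work in a small sketch like the following
(the array and variable names are illustrative only). Two strings go in, two
pointers are packed, and two strings come back out.

    use strict;
    use warnings;

    my @words = ( "alpha", "beta" );

    # 'p2' packs two pointers, one for each string.
    my $buf  = pack( 'p2', @words );
    my @back = unpack( 'p2', $buf );

    print "@back\n";

The buffer is therefore exactly twice as long as a single packed pointer.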

C<pack(..., $x)> with C<P> or C<p>, as a means of getting the address where
C<$x> is actually stored, must be used with circumspection. Perl's internal
machinery considers the relation between a variable and that address as its
very own private matter and doesn't really care that we have obtained a copy.
Therefore:

=over 4

=item *

Do not use C<pack> with C<p> or C<P> to obtain the address of a variable
that's bound to go out of scope (thereby freeing its memory) before you
are done using the memory at that address.

=item *

Be very careful with Perl operations that change the value of the
variable. Appending something to the variable, for instance, might require
reallocation of its storage, leaving you with a pointer into no-man's land.

=item *

Don't think that you can get the address of a Perl variable
when it is stored as an integer or double number! C<pack('P', $x)> will
force the variable's internal representation to string, just as if you
had written something like C<$x .= ''>.

=back

It's safe, however, to P- or p-pack a string literal, because Perl simply
allocates an anonymous variable.
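
One defensive habit that follows from these cautions is to pack the pointer
again after every modification of the variable and to use it while the
variable is still alive. The sketch below, with made-up variable names,
assumes nothing about whether the buffer actually moves. It merely reports
what happened and reads only through the fresh pointer.

    use strict;
    use warnings;

    my $data = "short";
    my $old  = pack( 'P', $data );

    $data .= "x" x 100_000;          # may force a reallocation

    my $new = pack( 'P', $data );    # pack again after the change
    print $old eq $new ? "buffer stayed put\n" : "buffer moved\n";

    # Reading through the fresh pointer is fine while $data is untouched.
    my $head = unpack( 'P5', $new );

    # A string literal can be p-packed directly.
    my $lit = unpack( 'p', pack( 'p', "literal" ) );

    print "$head $lit\n";

Whether the first C<print> reports a move depends on your Perl and its memory
allocator, which is precisely why the old pointer must not be trusted.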

=head1 Pack Recipes

Here is a collection of (possibly) useful canned recipes for C<pack>
and C<unpack>:

    # Convert IP address for socket functions
    pack( "C4", split /\./, "123.4.5.6" );

    # Count the bits in a chunk of memory (e.g. a select vector)
    unpack( '%32b*', $mask );

    # Determine the endianness of your system
    $is_little_endian = unpack( 'c', pack( 's', 1 ) );
    $is_big_endian = unpack( 'xc', pack( 's', 1 ) );

    # Determine the number of bits in a native integer
    # (pack ~0 as a native unsigned int, then count its set bits)
    $bits = unpack( '%32b*', pack( 'I!', ~0 ) );

    # Prepare argument for the nanosleep system call
    # (assumes both fields of struct timespec are the size of a native long)
    my $timespec = pack( 'L!L!', $secs, $nanosecs );
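
To check a couple of these recipes, we might run them on known input. In
this sketch the sample mask and the variable names are our own; C<inet_aton>
comes from the standard C<Socket> module and serves only as a cross-check.

    use strict;
    use warnings;
    use Socket qw(inet_aton);

    # The packed address is four bytes and round-trips through unpack.
    my $packed_ip = pack( 'C4', split /\./, '123.4.5.6' );
    my $dotted    = join '.', unpack( 'C4', $packed_ip );
    my $same      = $packed_ip eq inet_aton('123.4.5.6');

    # 0xFF has eight bits set and 0x0F has four.
    my $mask     = "\xFF\x0F";
    my $set_bits = unpack( '%32b*', $mask );

    print "$dotted, $set_bits bits set\n";   # prints "123.4.5.6, 12 bits set"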

For a simple memory dump we unpack some bytes into just as
many pairs of hex digits, and use C<map> to handle the traditional
spacing - 16 bytes to a line:

    my $i;
    print map( ++$i % 16 ? "$_ " : "$_\n",
               unpack( 'H2' x length( $mem ), $mem ) ),
          length( $mem ) % 16 ? "\n" : '';
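
If the dump is needed in more than one place, the same expression can be
wrapped in a subroutine that returns the text instead of printing it. The
name C<hexdump> is hypothetical, chosen for this sketch only.

    use strict;
    use warnings;

    # Return a hex dump of $mem, 16 bytes to a line.
    sub hexdump {
        my ($mem) = @_;
        my $i = 0;
        return join '',
            map( ++$i % 16 ? "$_ " : "$_\n",
                 unpack( 'H2' x length( $mem ), $mem ) ),
            length( $mem ) % 16 ? "\n" : '';
    }

    print hexdump("ABC");    # prints "41 42 43 " and a newline

A final newline is appended only when the last line is incomplete, so that
the output always ends with exactly one line break.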

=head1 Funnies Section

    # Pulling digits out of nowhere...
    print unpack( 'C', pack( 'x' ) ),
          unpack( '%B*', pack( 'A' ) ),
          unpack( 'H', pack( 'A' ) ),
          unpack( 'A', unpack( 'C', pack( 'A' ) ) ), "\n";

    # One for the road ;-)
    my $advice = pack( 'all u can in a van' );

=head1 Authors

Simon Cozens and Wolfgang Laun.