package bytes;

our $VERSION = '1.05';

# Bit in $^H (the compile-time hints variable) that marks "use bytes" scope.
$bytes::hint_bits = 0x00000008;

sub import {
    $^H |= $bytes::hint_bits;
}

sub unimport {
    $^H &= ~$bytes::hint_bits;
}

# Load the function bodies from bytes_heavy.pl on first use.
sub AUTOLOAD {
    require "bytes_heavy.pl";
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    require Carp;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

# Forward declarations with prototypes; the bodies are loaded via AUTOLOAD.
sub length (_);
sub chr (_);
sub ord (_);
sub substr ($$;$$);
sub index ($$;$);
sub rindex ($$;$);

1;
__END__

=head1 NAME

bytes - Perl pragma to expose the individual bytes of characters

=head1 NOTICE

Because the bytes pragma breaks encapsulation (i.e. it exposes the innards of
how the perl executable currently happens to store a string), the byte values
that result are in an unspecified encoding.

B<Use of this module for anything other than debugging purposes is
strongly discouraged.>  If you feel that the functions herein
might be useful for your application, this possibly indicates a
mismatch between your mental model of Perl Unicode and the current
reality.  In that case, you may wish to read some of the perl Unicode
documentation: L<perluniintro>, L<perlunitut>, L<perlunifaq> and
L<perlunicode>.

=head1 SYNOPSIS

    use bytes;
    ... chr(...);       # or bytes::chr
    ... index(...);     # or bytes::index
    ... length(...);    # or bytes::length
    ... ord(...);       # or bytes::ord
    ... rindex(...);    # or bytes::rindex
    ... substr(...);    # or bytes::substr
    no bytes;

=head1 DESCRIPTION

Perl's characters are stored internally as sequences of one or more bytes.
This pragma allows for the examination of the individual bytes that together
comprise a character.

Originally the pragma was designed for the loftier goal of helping incorporate
Unicode into Perl, but the approach that used it was found to be defective,
and the one remaining legitimate use is for debugging when you need to
non-destructively examine characters' individual bytes.  Just insert this
pragma temporarily, and remove it after the debugging is finished.

The original usage can be accomplished by explicit (rather than this pragma's
implicit) encoding using the L<Encode> module:

    use Encode qw/encode/;

    my $utf8_byte_string   = encode "UTF8",   $string;
    my $latin1_byte_string = encode "Latin1", $string;

Or, if performance is needed and you are only interested in the UTF-8
representation:

    use utf8;

    utf8::encode(my $utf8_byte_string = $string);

C<no bytes> can be used to reverse the effect of C<use bytes> within the
current lexical scope.

As an example, when Perl sees C<$x = chr(400)>, it encodes the character
in UTF-8 and stores it in C<$x>.  Then it is marked as character data, so,
for instance, C<length $x> returns C<1>.  However, in the scope of the
C<bytes> pragma, C<$x> is treated as a series of bytes - the bytes that make
up the UTF-8 encoding - and C<length $x> returns C<2>:

    $x = chr(400);
    print "Length is ", length $x, "\n";     # "Length is 1"
    printf "Contents are %vd\n", $x;         # "Contents are 400"
    {
        use bytes; # or "require bytes; bytes::length()"
        print "Length is ", length $x, "\n"; # "Length is 2"
        printf "Contents are %vd\n", $x;     # "Contents are 198.144 (on
                                             # ASCII platforms)"
    }

C<chr()>, C<ord()>, C<substr()>, C<index()> and C<rindex()> behave similarly.

For more on the implications, see L<perluniintro> and L<perlunicode>.

C<bytes::length()> is admittedly handy if you need to know the
B<byte length> of a Perl scalar.  But a more modern way is:

    use Encode 'encode';
    length(encode('UTF-8', $scalar))

=head1 LIMITATIONS

C<bytes::substr()> does not work as an I<lvalue>.

=head1 SEE ALSO

L<perluniintro>, L<perlunicode>, L<utf8>, L<Encode>

=cut
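
=head1 EXAMPLES

The remark in L</DESCRIPTION> that C<bytes::length()> can be replaced by
measuring an explicitly encoded copy can be illustrated with a minimal
sketch.  The helper name C<byte_lengths> is hypothetical, and the output
shown assumes an ASCII platform.  It suggests why the two measurements are
not interchangeable: C<bytes::length()> reports the size of whatever internal
storage perl currently uses, which may change when the same character string
is upgraded, while the explicitly encoded length stays the same.

    use strict;
    use warnings;
    use Encode qw(encode);
    require bytes;    # load the bytes:: functions without enabling the pragma

    # Hypothetical helper: return both measurements for one string.
    sub byte_lengths {
        my ($str) = @_;
        return (
            bytes::length($str),              # bytes in perl's internal buffer
            length(encode('UTF-8', $str)),    # bytes in an explicit UTF-8 copy
        );
    }

    my $s = "\xE9";                   # one character, U+00E9
    my @before = byte_lengths($s);

    utf8::upgrade($s);                # same characters, different storage
    my @after = byte_lengths($s);

    printf "before upgrade: internal=%d encoded=%d\n", @before;
    printf "after upgrade:  internal=%d encoded=%d\n", @after;

    # On an ASCII platform this prints:
    #   before upgrade: internal=1 encoded=2
    #   after upgrade:  internal=2 encoded=2

Because the internal figure depends on how the string happens to be stored,
the encoded length is usually the one to rely on when a byte count for output
or storage is needed.

=cut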