package Encode::Unicode;

use strict;
use warnings;
no warnings 'redefine';

our $VERSION = do { my @r = (q$Revision: 1.40 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };

use XSLoader;
XSLoader::load(__PACKAGE__,$VERSION);

#
# Object generator: creates all 8 transcoders at once!
#

require Encode;

our %BOM_Unknown = map {$_ => 1} qw(UTF-16 UTF-32);

for my $name (qw(UTF-16 UTF-16BE UTF-16LE
                 UTF-32 UTF-32BE UTF-32LE
                        UCS-2BE  UCS-2LE))
{
    my ($size, $endian, $ucs2, $mask);
    $name =~ /^(\w+)-(\d+)(\w*)$/o;
    if ($ucs2 = ($1 eq 'UCS')){
        $size = 2;
    }else{
        $size = $2/8;
    }
    $endian = ($3 eq 'BE') ? 'n' : ($3 eq 'LE') ?
                                     'v' : '' ;
    # 'n'/'v' are the pack() templates for 16-bit BE/LE;
    # 'N'/'V' are their 32-bit counterparts.
    $size == 4 and $endian = uc($endian);

    $Encode::Encoding{$name} =
        bless {
               Name   =>   $name,
               size   =>   $size,
               endian => $endian,
               ucs2   =>   $ucs2,
              } => __PACKAGE__;
}

use base qw(Encode::Encoding);

sub renew {
    my $self = shift;
    $BOM_Unknown{$self->name} or return $self;
    my $clone = bless { %$self } => ref($self);
    $clone->{clone} = 1; # so the caller knows it is renewed.
    return $clone;
}

# There used to be a Perl implementation of (en|de)code, but now that
# the XS version is ripe, the Perl version is zapped for optimal speed.

*decode = \&decode_xs;
*encode = \&encode_xs;

1;
__END__

=head1 NAME

Encode::Unicode -- Various Unicode Transformation Formats

=cut

=head1 SYNOPSIS

    use Encode qw/encode decode/;
    $ucs2 = encode("UCS-2BE", $utf8);
    $utf8 = decode("UCS-2BE", $ucs2);

=head1 ABSTRACT

This module implements all Character Encoding Schemes of Unicode that
are officially documented by the Unicode Consortium (except, of course,
for UTF-8, which
is a native format in perl).

=over 4

=item L<http://www.unicode.org/glossary/> says:

I<Character Encoding Scheme> A character encoding form plus byte
serialization. There are seven character encoding schemes in Unicode:
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32 (UCS-4), UTF-32BE (UCS-4BE),
and UTF-32LE (UCS-4LE).

Since UTF-7 is a 7-bit (re)encoded version of UTF-16BE, it is not part of
Unicode's Character Encoding Schemes.  It is separately implemented in
Encode::Unicode::UTF7.  For details see L<Encode::Unicode::UTF7>.

=item Quick Reference

                Decodes from ord(N)           Encodes chr(N) to...
       octet/char BOM S.P d800-dfff  ord > 0xffff     \x{1abcd} ==
  ---------------+-----------------+------------------------------
  UCS-2BE       2   N   N  is bogus                  Not Available
  UCS-2LE       2   N   N     bogus                  Not Available
  UTF-16      2/4   Y   Y  is   S.P           S.P            BE/LE
  UTF-16BE    2/4   N   Y       S.P           S.P    0xd82a,0xdfcd
  UTF-16LE    2/4   N   Y       S.P           S.P    0x2ad8,0xcddf
  UTF-32        4   Y   -  is bogus         As is            BE/LE
  UTF-32BE      4   N   -     bogus         As is       0x0001abcd
  UTF-32LE      4   N   -     bogus         As is       0xcdab0100
  UTF-8       1-4   -   -     bogus   >= 4 octets \xf0\x9a\xaf\x8d
  ---------------+-----------------+------------------------------

=back

=head1 Size, Endianness, and BOM

You can categorize these CES by 3
criteria: size of each character,
endianness, and Byte Order Mark.

=head2 by size

UCS-2 is a fixed-length encoding with each character taking 16 bits.
It B<does not> support I<surrogate pairs>.  When a surrogate pair
is encountered during decode(), its place is filled with \x{FFFD}
if I<CHECK> is 0, or the routine croaks if I<CHECK> is 1.  When a
character whose ord value is larger than 0xFFFF is encountered
during encode(), its place is filled with \x{FFFD} if I<CHECK> is 0,
or the routine croaks if I<CHECK> is 1.

UTF-16 is almost the same as UCS-2 but it supports I<surrogate pairs>.
When it encounters a high surrogate (0xD800-0xDBFF), it fetches the
following low surrogate (0xDC00-0xDFFF) and C<desurrogate>s them to
form a character.  Bogus surrogates result in death.  When \x{10000}
or above is encountered during encode(), it C<ensurrogate>s it and
pushes the surrogate pair to the output stream.

UTF-32 (UCS-4) is a fixed-length encoding with each character taking 32 bits.
Since it is 32-bit, there is no need for I<surrogate pairs>.

=head2 by endianness

The first (and now failed) goal of Unicode was to map all character
repertoires into a fixed-length integer so that programmers are happy.
Since each character is either a I<short> or I<long> in C, you have to
pay attention to the endianness of each platform when you pass data
from one platform to another.

Anything marked as BE is Big Endian (or network byte order) and LE is
Little Endian (aka VAX byte order).  For anything not marked either
BE or LE, a character called Byte Order Mark (BOM) indicating the
endianness is prepended to the string.

=over 4

=item BOM as integer when fetched in network byte order

              16         32 bits/char
  -------------------------
  BE      0xFeFF 0x0000FeFF
  LE      0xFFFe 0xFFFe0000
  -------------------------

=back

This module handles the BOM as follows.

=over 4

=item *

When BE or LE is explicitly stated as the name of encoding, BOM is
simply treated as a normal character (ZERO WIDTH NO-BREAK SPACE).

=item *

When BE or LE is omitted during decode(), it checks if BOM is at the
beginning of the string; if one is found, the endianness is set to
what the BOM says.  If no BOM is found, the routine dies.

=item *

When BE or LE is omitted during encode(), it returns a BE-encoded
string with BOM prepended.  So when you want to encode a whole text
file, make sure you encode() the whole text at once, not line by line;
otherwise each line, rather than the file, will have a BOM prepended.

=item *

C<UCS-2> is an exception.  Unlike others, this is an alias of UCS-2BE.
UCS-2 is already registered by IANA and others that way.

=back

=head1 Surrogate Pairs

To say the least, surrogate pairs were the biggest mistake of the
Unicode Consortium.  But according to the late Douglas Adams in I<The
Hitchhiker's Guide to the Galaxy> Trilogy, C<In the beginning the
Universe was created. This has made a lot of people very angry and
been widely regarded as a bad move>.  Their mistake was not of this
magnitude so let's forgive them.

(I don't dare make any comparison between the Unicode Consortium and
the Vogons here ;)  Or, comparing Encode to Babel Fish is completely
appropriate -- if you can only stick this into your ear :)

Surrogate pairs were born when the Unicode Consortium finally
admitted that 16 bits were not big enough to hold all the world's
character repertoires.  But they had already made UCS-2 16-bit.  What
do we do?

Back then, the range 0xD800-0xDFFF was not allocated.  Let's split
that range in half and use the first half to represent the C<upper
half of a character> and the second half to represent the C<lower
half of a character>.  That way, you can represent 1024 * 1024 =
1048576 more characters.  Now we can store character ranges up to
\x{10ffff} even with 16-bit encodings.  This pair of half-characters is
now called a I<surrogate pair> and UTF-16 is the name of the encoding
that embraces them.

Here is a formula to ensurrogate a Unicode character \x{10000} and
above:

  $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
  $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

And to desurrogate:

  $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);

Note this move has made \x{D800}-\x{DFFF} into a forbidden zone but
perl does not prohibit the use of characters within this range.  To perl,
every one of \x{0000_0000} up to \x{ffff_ffff} (*) is I<a character>.

  (*) or \x{ffff_ffff_ffff_ffff} if your perl is compiled with 64-bit
  integer support!
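
As a rough illustration of the two formulas above, the following sketch
round-trips \x{1abcd} through a surrogate pair.  The helper names
C<ensurrogate> and C<desurrogate> are hypothetical, chosen only to match
the terms used in this document; they are not functions exported by this
module.  Shifts and masks are used instead of division and modulus so
that the arithmetic stays in integers.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; not part of Encode::Unicode.

# Split a code point in 0x10000..0x10FFFF into (high, low) surrogates.
sub ensurrogate {
    my ($uni) = @_;
    die sprintf("U+%04X is outside 0x10000..0x10FFFF\n", $uni)
        if $uni < 0x10000 or $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;
    my $hi = ($offset >> 10)   + 0xD800;    # upper 10 bits
    my $lo = ($offset & 0x3FF) + 0xDC00;    # lower 10 bits
    return ($hi, $lo);
}

# Combine a (high, low) surrogate pair back into one code point.
sub desurrogate {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my ($hi, $lo) = ensurrogate(0x1ABCD);
printf "U+1ABCD -> 0x%04x,0x%04x\n", $hi, $lo;       # 0xd82a,0xdfcd
printf "back    -> U+%04X\n", desurrogate($hi, $lo); # U+1ABCD
```

The pair printed here agrees with the UTF-16BE entry for \x{1abcd} in
the Quick Reference table.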

=head1 SEE ALSO

L<Encode>, L<Encode::Unicode::UTF7>, L<http://www.unicode.org/glossary/>,
L<http://www.unicode.org/unicode/faq/utf_bom.html>,

RFC 2781 L<http://rfc.net/rfc2781.html>,

The whole Unicode standard L<http://www.unicode.org/unicode/uni2book/u2.html>

Ch. 15, pp. 403 of C<Programming Perl (3rd Edition)>
by Larry Wall, Tom Christiansen, Jon Orwant;
O'Reilly & Associates; ISBN 0-596-00027-8

=cut