package Encode::Unicode;

use strict;
use warnings;
no warnings 'redefine';

our $VERSION = do { my @r = (q$Revision: 1.40 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };

use XSLoader;
XSLoader::load(__PACKAGE__,$VERSION);

#
# Object generator: builds all 8 transcoders at once!
#

require Encode;

# Encodings whose byte order is unknown until a BOM has been read.
our %BOM_Unknown = map {$_ => 1} qw(UTF-16 UTF-32);

for my $name (qw(UTF-16 UTF-16BE UTF-16LE
                 UTF-32 UTF-32BE UTF-32LE
                        UCS-2BE  UCS-2LE))
{
    my ($size, $endian, $ucs2, $mask);
    $name =~ /^(\w+)-(\d+)(\w*)$/o;
    if ($ucs2 = ($1 eq 'UCS')){
        $size = 2;          # UCS-2 is always 2 octets per character
    }else{
        $size = $2/8;       # 16 or 32 bits, expressed in octets
    }
    # pack() template letter: 'n' for BE, 'v' for LE, '' when a BOM decides.
    $endian = ($3 eq 'BE') ? 'n' : ($3 eq 'LE') ? 'v' : '' ;
    # 32-bit encodings use the upper-case templates 'N' and 'V'.
    $size == 4 and $endian = uc($endian);

    $Encode::Encoding{$name} =
        bless {
               Name   =>   $name,
               size   =>   $size,
               endian => $endian,
               ucs2   =>   $ucs2,
              } => __PACKAGE__;
}

use base qw(Encode::Encoding);

sub renew {
    my $self = shift;
    # Only the BOM-dependent encodings carry per-stream state worth cloning.
    $BOM_Unknown{$self->name} or return $self;
    my $clone = bless { %$self } => ref($self);
    $clone->{clone} = 1; # so the caller knows it is renewed.
    return $clone;
}

# There used to be a Perl implementation of (en|de)code, but now that the
# XS version is ripe, the Perl version has been zapped for optimal speed.

*decode = \&decode_xs;
*encode = \&encode_xs;

1;
__END__

=head1 NAME

Encode::Unicode -- Various Unicode Transformation Formats

=cut

=head1 SYNOPSIS

    use Encode qw/encode decode/;
    $ucs2 = encode("UCS-2BE", $utf8);
    $utf8 = decode("UCS-2BE", $ucs2);

=head1 ABSTRACT

This module implements all Character Encoding Schemes of Unicode that
are officially documented by the Unicode Consortium (except, of course,
for UTF-8, which is a native format in perl).

=over 4

=item L<http://www.unicode.org/glossary/> says:

I<Character Encoding Scheme> A character encoding form plus byte
serialization. There are seven character encoding schemes in Unicode:
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32 (UCS-4), UTF-32BE (UCS-4BE) and
UTF-32LE (UCS-4LE), and UTF-7.

Since UTF-7 is a 7-bit (re)encoded version of UTF-16BE, it is not part of
Unicode's Character Encoding Schemes.  It is separately implemented in
Encode::Unicode::UTF7.  For details see L<Encode::Unicode::UTF7>.

=item Quick Reference

                Decodes from ord(N)           Encodes chr(N) to...
       octet/char BOM S.P d800-dfff  ord > 0xffff     \x{1abcd} ==
  ---------------+-----------------+------------------------------
  UCS-2BE       2   N   N  is bogus                  Not Available
  UCS-2LE       2   N   N     bogus                  Not Available
  UTF-16      2/4   Y   Y  is   S.P           S.P            BE/LE
  UTF-16BE    2/4   N   Y       S.P           S.P    0xd82a,0xdfcd
  UTF-16LE    2/4   N   Y       S.P           S.P    0x2ad8,0xcddf
  UTF-32        4   Y   -  is bogus         As is            BE/LE
  UTF-32BE      4   N   -     bogus         As is       0x0001abcd
  UTF-32LE      4   N   -     bogus         As is       0xcdab0100
  UTF-8       1-4   -   -     bogus   >= 4 octets \xf0\x9a\xaf\x8d
  ---------------+-----------------+------------------------------

In this table, S.P stands for I<surrogate pair>.

=back

=head1 Size, Endianness, and BOM

You can categorize these CES by 3 criteria:  size of each character,
endianness, and Byte Order Mark.

=head2 by size

UCS-2 is a fixed-length encoding with each character taking 16 bits.
It B<does not> support I<surrogate pairs>.  When a surrogate pair
is encountered during decode(), its place is filled with \x{FFFD}
if I<CHECK> is 0, or the routine croaks if I<CHECK> is 1.  When a
character whose ord value is larger than 0xFFFF is encountered
during encode(), its place is filled with \x{FFFD} if I<CHECK> is 0,
or the routine croaks if I<CHECK> is 1.
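
As a minimal sketch of the encode() side of this behaviour, the
following script pushes \x{1abcd} (the sample character from the Quick
Reference table) through UCS-2BE with and without I<CHECK>.  The
variable names are illustrative only, and a copy of the string is
passed in the strict case because a non-zero I<CHECK> may modify its
argument.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+1ABCD lies above 0xFFFF, so UCS-2 cannot represent it.
my $char = "\x{1abcd}";

# CHECK omitted (treated as 0): the character is replaced with U+FFFD.
my $lenient = encode("UCS-2BE", $char);
printf "lenient: %s\n", unpack("H*", $lenient);

# CHECK = Encode::FB_CROAK (1): the call dies instead of substituting.
my $copy   = $char;
my $strict = eval { encode("UCS-2BE", $copy, Encode::FB_CROAK()) };
print "strict:  ", (defined $strict ? "encoded" : "croaked"), "\n";
```

Wrapping the strict call in eval keeps the script alive, so both
outcomes can be compared side by side.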

UTF-16 is almost the same as UCS-2 but it supports I<surrogate pairs>.
When it encounters a high surrogate (0xD800-0xDBFF), it fetches the
following low surrogate (0xDC00-0xDFFF) and C<desurrogate>s them to
form a character.  Bogus surrogates result in death.  When \x{10000}
or above is encountered during encode(), it C<ensurrogate>s the
character and pushes the surrogate pair to the output stream.

UTF-32 (UCS-4) is a fixed-length encoding with each character taking 32 bits.
Since it is 32-bit, there is no need for I<surrogate pairs>.
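
The difference between the two sizes can be seen by encoding the same
supplementary character both ways.  This is only a sketch; the %hex
hash and the loop are illustrative, while the encoding names are the
ones this module registers.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char = "\x{1abcd}";

# Encode one character above 0xFFFF with each explicit-endian scheme.
my %hex;
for my $name (qw(UTF-16BE UTF-16LE UTF-32BE UTF-32LE)) {
    $hex{$name} = unpack("H*", encode($name, $char));
    printf "%-8s %s\n", $name, $hex{$name};
}

# Decoding a surrogate pair yields the single original character.
my $back = decode("UTF-16BE", "\xD8\x2A\xDF\xCD");
printf "decoded  U+%04X\n", ord $back;
```

The UTF-16 results take four octets (one surrogate pair), while the
UTF-32 results hold the code point as is.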

=head2 by endianness

The first (and now failed) goal of Unicode was to map all character
repertoires into a fixed-length integer so that programmers are happy.
Since each character is either a I<short> or I<long> in C, you have to
pay attention to the endianness of each platform when you pass data
from one platform to another.

Anything marked as BE is Big Endian (or network byte order) and LE is
Little Endian (aka VAX byte order).  For anything not marked either
BE or LE, a character called Byte Order Mark (BOM) indicating the
endianness is prepended to the string.

=over 4

=item BOM as integer when fetched in network byte order

              16         32 bits/char
  -------------------------
  BE      0xFeFF 0x0000FeFF
  LE      0xFFFe 0xFFFe0000
  -------------------------

=back
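
The integers in this table can be reproduced with a short script.  It
is a sketch that encodes U+FEFF with each explicit-endian scheme and
reads the octets back in network byte order with unpack(); the
%as_integer hash is an illustrative name.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bom = "\x{FEFF}";

# "n" and "N" read 16-bit and 32-bit integers in network (big-endian) order.
my %as_integer = (
    "UTF-16BE" => scalar unpack("n", encode("UTF-16BE", $bom)),
    "UTF-16LE" => scalar unpack("n", encode("UTF-16LE", $bom)),
    "UTF-32BE" => scalar unpack("N", encode("UTF-32BE", $bom)),
    "UTF-32LE" => scalar unpack("N", encode("UTF-32LE", $bom)),
);

for my $name (sort keys %as_integer) {
    my $width = $name =~ /16/ ? 4 : 8;
    printf "%-8s 0x%0*X\n", $name, $width, $as_integer{$name};
}
```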

This module handles the BOM as follows.

=over 4

=item *

When BE or LE is explicitly stated as the name of encoding, BOM is
simply treated as a normal character (ZERO WIDTH NO-BREAK SPACE).

=item *

When BE or LE is omitted during decode(), it checks if BOM is at the
beginning of the string; if one is found, the endianness is set to
what the BOM says.  If no BOM is found, the routine dies.

=item *

When BE or LE is omitted during encode(), it returns a BE-encoded
string with BOM prepended.  So when you want to encode a whole text
file, make sure you encode() the whole text at once, not line by line,
or each line, not the file, will have a BOM prepended.

=item *

C<UCS-2> is an exception.  Unlike others, this is an alias of UCS-2BE.
UCS-2 is already registered by IANA and others that way.

=back
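
The rules above can be exercised with a short script.  This is a
sketch with illustrative variable names; the missing-BOM case is left
out on purpose, because what decode() does there may depend on the
Encode release in use.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# encode() without BE/LE: big-endian output with a BOM in front.
my $with_bom = encode("UTF-16", "A");
print unpack("H*", $with_bom), "\n";

# decode() without BE/LE: the BOM selects the byte order and is consumed.
my $from_le = decode("UTF-16", "\xFF\xFE\x41\x00");
print "$from_le\n";

# Explicit byte order: a leading U+FEFF stays as an ordinary character.
my $kept = decode("UTF-16BE", "\xFE\xFF\x00\x41");
printf "%d characters\n", length $kept;

# Encoding line by line repeats the BOM at the start of every line.
my $per_line = join "", map { encode("UTF-16", $_) } "a\n", "b\n";
my $boms     = () = $per_line =~ /\xFE\xFF/g;
print "$boms BOMs\n";
```

To write a file with a single BOM, encode the whole text in one call,
or use an explicit BE or LE encoding and emit the BOM yourself.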

=head1 Surrogate Pairs

To say the least, surrogate pairs were the biggest mistake of the
Unicode Consortium.  But according to the late Douglas Adams in I<The
Hitchhiker's Guide to the Galaxy> Trilogy, C<In the beginning the
Universe was created. This has made a lot of people very angry and
been widely regarded as a bad move>.  Their mistake was not of this
magnitude so let's forgive them.

(I don't dare make any comparison between the Unicode Consortium and
the Vogons here ;)  On the other hand, comparing Encode to the Babel
Fish is completely appropriate -- if only you could stick it into your
ear :)

Surrogate pairs were born when the Unicode Consortium finally
admitted that 16 bits were not big enough to hold all the world's
character repertoires.  But they had already made UCS-2 16-bit.  What
do we do?

Back then, the range 0xD800-0xDFFF was not allocated.  Let's split
that range in half and use the first half to represent the C<upper
half of a character> and the second half to represent the C<lower
half of a character>.  That way, you can represent 1024 * 1024 =
1048576 more characters.  Now we can store character ranges up to
\x{10ffff} even with 16-bit encodings.  This pair of half-characters is
now called a I<surrogate pair> and UTF-16 is the name of the encoding
that embraces them.

Here is a formula to ensurrogate a Unicode character \x{10000} and
above (the division is an integer division, hence the int()):

  $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
  $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

And to desurrogate:

  $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
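
As a worked example, the two formulas can be wrapped in a pair of
subroutines and checked against the \x{1abcd} row of the Quick
Reference table.  The names ensurrogate() and desurrogate() are
hypothetical helpers for this sketch; the module itself does this work
in XS and does not export such functions.

```perl
use strict;
use warnings;

# Split a code point at or above 0x10000 into (high, low) surrogates.
# Hypothetical helper, not part of Encode::Unicode's interface.
sub ensurrogate {
    my ($uni) = @_;
    my $offset = $uni - 0x10000;
    my $hi = int($offset / 0x400) + 0xD800;   # upper 10 bits
    my $lo = $offset % 0x400 + 0xDC00;        # lower 10 bits
    return ($hi, $lo);
}

# Combine a (high, low) surrogate pair back into one code point.
sub desurrogate {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = ensurrogate(0x1abcd);
printf "U+1ABCD -> 0x%04X 0x%04X\n", $hi, $lo;
printf "back    -> U+%04X\n", desurrogate($hi, $lo);
```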

Note this move has made \x{D800}-\x{DFFF} into a forbidden zone but
perl does not prohibit the use of characters within this range.  To perl,
every one of \x{0000_0000} up to \x{ffff_ffff} (*) is I<a character>.

  (*) or \x{ffff_ffff_ffff_ffff} if your perl is compiled with 64-bit
  integer support!

=head1 SEE ALSO

L<Encode>, L<Encode::Unicode::UTF7>, L<http://www.unicode.org/glossary/>,
L<http://www.unicode.org/unicode/faq/utf_bom.html>,

RFC 2781 L<http://rfc.net/rfc2781.html>,

The whole Unicode standard L<http://www.unicode.org/unicode/uni2book/u2.html>

Ch. 15, pp. 403 of C<Programming Perl (3rd Edition)>
by Larry Wall, Tom Christiansen, Jon Orwant;
O'Reilly & Associates; ISBN 0-596-00027-8

=cut