=head1 NAME

perluniintro - Perl Unicode introduction

=head1 DESCRIPTION

This document gives a general idea of Unicode and how to use Unicode
in Perl.

=head2 Unicode

Unicode is a character set standard which plans to codify all of the
writing systems of the world, plus many other symbols.

Unicode and ISO/IEC 10646 are coordinated standards that provide code
points for characters in almost all modern character set standards,
covering more than 30 writing systems and hundreds of languages,
including all commercially-important modern languages.  All characters
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded. The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.
Unicode 1.0 was released in October 1991, and 4.0 in April 2003.

A Unicode I<character> is an abstract entity.  It is not bound to any
particular integer width, especially not to the C language C<char>.
Unicode is language-neutral and display-neutral: it does not encode the
language of the text and it does not define fonts or other graphical
layout details.  Unicode operates on characters and on text built from
those characters.

Unicode defines characters like C<LATIN CAPITAL LETTER A> or C<GREEK
SMALL LETTER ALPHA> and unique numbers for the characters, in this
case 0x0041 and 0x03B1, respectively.  These unique numbers are called
I<code points>.

The Unicode standard prefers using hexadecimal notation for the code
points.  If numbers like C<0x0041> are unfamiliar to you, take a peek
at a later section, L</"Hexadecimal Notation">.  The Unicode standard
uses the notation C<U+0041 LATIN CAPITAL LETTER A>, to give the
hexadecimal code point and the normative name of the character.

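As a small illustration of code points and the C<U+> notation, the
following sketch formats the code point of a character the way the
standard writes it.  The helper name C<u_plus> is made up for this
example; only C<ord()> and C<sprintf()> are Perl built-ins.

```perl
use strict;
use warnings;

# Hypothetical helper: format a character's code point as U+XXXX.
sub u_plus {
    my ($char) = @_;
    return sprintf("U+%04X", ord($char));
}

# LATIN CAPITAL LETTER A and GREEK SMALL LETTER ALPHA
my @labels = map { u_plus($_) } ("A", "\x{03B1}");
print join(", ", @labels), "\n";    # prints U+0041, U+03B1
```
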
Unicode also defines various I<properties> for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation";
these properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.

A Unicode character consists either of a single code point, or a
I<base character> (like C<LATIN CAPITAL LETTER A>), followed by one or
more I<modifiers> (like C<COMBINING ACUTE ACCENT>).  This sequence of
base character and modifiers is called a I<combining character
sequence>.

Whether to call these combining character sequences "characters"
depends on your point of view. If you are a programmer, you probably
would tend towards seeing each element in the sequences as one unit,
or "character".  The whole sequence could be seen as one "character",
however, from the user's point of view, since that's probably what it
looks like in the context of the user's language.

With this "whole sequence" view of characters, the total number of
characters is open-ended. But in the programmer's "one unit is one
character" point of view, the concept of "characters" is more
deterministic.  In this document, we take that second point of view:
one "character" is one Unicode code point, be it a base character or
a combining character.

For some combinations, there are I<precomposed> characters.
C<LATIN CAPITAL LETTER A WITH ACUTE>, for example, is defined as
a single code point.  These precomposed characters are, however,
only available for some combinations, and are mainly
meant to support round-trip conversions between Unicode and legacy
standards (like ISO 8859).  In the general case, the composing
method is more extensible.  To support conversion between
different compositions of the characters, various I<normalization
forms> to standardize representations are also defined.

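To make the relationship between precomposed characters and combining
character sequences concrete, here is a minimal sketch that uses the
C<NFC> and C<NFD> functions of the C<Unicode::Normalize> module.  The
variable names are chosen for this example only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $decomposed  = "A\x{0301}";   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
my $precomposed = "\x{00C1}";    # LATIN CAPITAL LETTER A WITH ACUTE

# Compared code point by code point, the two spellings differ ...
my $same_raw = ($decomposed eq $precomposed)      ? "equal" : "different";
# ... but they agree once both are brought to the same normalization form.
my $same_nfc = (NFC($decomposed) eq $precomposed) ? "equal" : "different";
my $same_nfd = (NFD($precomposed) eq $decomposed) ? "equal" : "different";

print "raw: $same_raw, NFC: $same_nfc, NFD: $same_nfd\n";
printf "lengths: %d composed, %d decomposed\n",
    length(NFC($decomposed)), length(NFD($precomposed));
```

The composed form is one code point long and the decomposed form two,
which is why normalizing both sides first matters for comparisons.
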
Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character".  The same character could
be represented differently in several legacy encodings.  The
converse is also not true: some code points do not have an assigned
character.  Firstly, there are unallocated code points within
otherwise used blocks.  Secondly, there are special Unicode control
characters that do not represent true characters.

A common myth about Unicode is that it would be "16-bit", that is,
Unicode is only represented as C<0x10000> (or 65536) characters from
C<0x0000> to C<0xFFFF>.  B<This is untrue.>  Since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
and since Unicode 3.1 (March 2001), characters have been defined
beyond C<0xFFFF>.  The first C<0x10000> characters are called
I<Plane 0>, or the I<Basic Multilingual Plane> (BMP).  With Unicode
3.1, 17 (yes, seventeen) planes in all were defined--but they are
nowhere near full of defined characters, yet.

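The count of seventeen planes follows from simple arithmetic: each
plane holds C<0x10000> code points, so the plane of a code point is its
value shifted right by 16 bits.  The function name C<plane_of> below is
invented for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: the plane number is the code point divided by 0x10000.
sub plane_of {
    my ($code_point) = @_;
    return $code_point >> 16;
}

# First and last code points of plane 0, start of plane 1, and the
# highest code point, which lands in plane 16 (planes 0..16 make 17).
printf "U+%04X is in plane %d\n", $_, plane_of($_)
    for (0x0041, 0xFFFF, 0x10000, 0x10FFFF);
```
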
Another myth is that the 256-character blocks have something to
do with languages--that each block would define the characters used
by a language or a set of languages.  B<This is also untrue.>
The division into blocks exists, but it is almost completely
accidental--an artifact of how the characters have been and
still are allocated.  Instead, there is a concept called I<scripts>,
which is more useful: there is C<Latin> script, C<Greek> script, and
so on.  Scripts usually span varied parts of several blocks.
For further information see L<Unicode::UCD>.

The Unicode code points are just abstract numbers.  To input and
output these abstract numbers, the numbers must be I<encoded> or
I<serialised> somehow.  Unicode defines several I<character encoding
forms>, of which I<UTF-8> is perhaps the most popular.  UTF-8 is a
variable length encoding that encodes Unicode characters as 1 to 6
bytes (only 4 with the currently defined characters).  Other encodings
include UTF-16 and UTF-32 and their big- and little-endian variants
(UTF-8 is byte-order independent).  ISO/IEC 10646 defines the UCS-2
and UCS-4 encoding forms.

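The variable length of UTF-8 can be observed directly.  The sketch
below, which assumes the C<Encode> module is available, encodes four
sample code points and reports how many bytes each one takes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One sample code point for each UTF-8 length from 1 to 4 bytes.
for my $code_point (0x41, 0x3B1, 0x263A, 0x10000) {
    my $bytes = encode("UTF-8", chr($code_point));
    printf "U+%04X -> %d byte(s): %s\n",
        $code_point, length($bytes),
        join(" ", map { sprintf "%02x", ord($_) } split //, $bytes);
}
```
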
For more information about encodings--for instance, to learn what
I<surrogates> and I<byte order marks> (BOMs) are--see L<perlunicode>.

=head2 Perl's Unicode Support

Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
natively.  Perl 5.8.0, however, is the first recommended release for
serious Unicode work.  The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but, for example,
regular expressions still do not work with Unicode in 5.6.1.

B<Starting from Perl 5.8.0, the use of C<use utf8> is no longer
necessary.> In earlier releases the C<utf8> pragma was used to declare
that operations in the current block or file would be Unicode-aware.
This model was found to be wrong, or at least clumsy: the "Unicodeness"
is now carried with the data, instead of being attached to the
operations.  Only one case remains where an explicit C<use utf8> is
needed: if your Perl script itself is encoded in UTF-8, you can use
UTF-8 in your identifier names, and in string and regular expression
literals, by saying C<use utf8>.  This is not the default because
scripts with legacy 8-bit data in them would break.  See L<utf8>.

=head2 Perl's Unicode Model

Perl supports both pre-5.6 strings of eight-bit native bytes, and
strings of Unicode characters.  The principle is that Perl tries to
keep its data as eight-bit bytes for as long as possible, but as soon
as Unicodeness cannot be avoided, the data is transparently upgraded
to Unicode.

Internally, Perl currently uses either the native eight-bit character
set of the platform (for example Latin-1) or UTF-8 to encode Unicode
strings. Specifically, if all code points in
the string are C<0xFF> or less, Perl uses the native eight-bit
character set.  Otherwise, it uses UTF-8.

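The transparent upgrade can be observed with C<utf8::is_utf8()>, which
reports whether a string is currently stored as UTF-8 internally.  This
is only a peek for illustration; the variable names are invented here,
and ordinary code should not depend on the internal representation.

```perl
use strict;
use warnings;

my $narrow = "\x{DF}";          # every code point fits in eight bits
my $wide   = "\x{0100}";        # needs more than eight bits
my $joined = $narrow . $wide;   # the narrow data is upgraded when combined

for my $pair (["narrow", $narrow], ["wide", $wide], ["joined", $joined]) {
    my ($name, $string) = @$pair;
    printf "%-6s length=%d stored as UTF-8: %s\n",
        $name, length($string), utf8::is_utf8($string) ? "yes" : "no";
}
```

Note that the length of the joined string is 2 characters regardless
of how many bytes Perl uses to store it.
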
A user of Perl does not normally need to know nor care how Perl
happens to encode its internal strings, but it becomes relevant when
outputting Unicode strings to a stream without a PerlIO layer -- one with
the "default" encoding.  In such a case, the raw bytes used internally
(the native character set or UTF-8, as appropriate for each string)
will be used, and a "Wide character" warning will be issued if those
strings contain a character beyond 0x00FF.

For example,

      perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

produces a fairly useless mixture of native bytes and UTF-8, as well
as a warning:

     Wide character in print at ...

To output UTF-8, use the C<:utf8> output layer.  Prepending

      binmode(STDOUT, ":utf8");

to this sample program ensures that the output is completely UTF-8,
and removes the program's warning.

You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.

Note that this means that Perl expects other software to work, too:
if Perl has been led to believe that STDIN should be UTF-8, but then
STDIN coming in from another command is not UTF-8, Perl will complain
about the malformed UTF-8.

All features that combine Unicode and I/O also require using the new
PerlIO feature.  Almost all Perl 5.8 platforms do use PerlIO, though:
you can see whether yours is by running "perl -V" and looking for
C<useperlio=define>.

=head2 Unicode and EBCDIC

Perl 5.8.0 also supports Unicode on EBCDIC platforms.  There,
Unicode support is somewhat more complex to implement since
additional conversions are needed at every step.  Some problems
remain; see L<perlebcdic> for details.

In any case, the Unicode support on EBCDIC platforms is better than
in the 5.6 series, which didn't work much at all on EBCDIC platforms.
On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
instead of UTF-8.  The difference is that UTF-8 is "ASCII-safe", in
that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
"EBCDIC-safe".

=head2 Creating Unicode

To create Unicode characters in literals for code points above C<0xFF>,
use the C<\x{...}> notation in double-quoted strings:

    my $smiley = "\x{263a}";

Similarly, it can be used in regular expression literals:

    $smiley =~ /\x{263a}/;

At run-time you can use C<chr()>:

    my $hebrew_alef = chr(0x05d0);

See L</"Further Resources"> for how to find all these numeric codes.

Naturally, C<ord()> will do the reverse: it turns a character into
a code point.

Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
and C<chr(...)> for arguments less than C<0x100> (decimal 256)
generate an eight-bit character for backward compatibility with older
Perls.  For arguments of C<0x100> or more, Unicode characters are
always produced. If you want to force the production of Unicode
characters regardless of the numeric value, use C<pack("U", ...)>
instead of C<\x..>, C<\x{...}>, or C<chr()>.

You can also use the C<charnames> pragma to invoke characters
by name in double-quoted strings:

    use charnames ':full';
    my $arabic_alef = "\N{ARABIC LETTER ALEF}";

And, as mentioned above, you can also C<pack()> numbers into Unicode
characters:

   my $georgian_an  = pack("U", 0x10a0);

Note that both C<\x{...}> and C<\N{...}> are compile-time string
constants: you cannot use variables in them.  If you want similar
run-time functionality, use C<chr()> and C<charnames::vianame()>.

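A minimal sketch of the run-time route: the character name is held in
a variable (here a fixed string, though it could just as well come from
user input), C<charnames::vianame()> turns it into a code point, and
C<chr()> turns the code point into a character.

```perl
use strict;
use warnings;
use charnames ':full';

my $name       = "GREEK SMALL LETTER ALPHA";   # known only at run time
my $code_point = charnames::vianame($name);    # name -> code point
my $char       = chr($code_point);             # code point -> character

printf "%s is U+%04X\n", $name, ord($char);
```
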
If you want to force the result of C<pack()> to Unicode characters, use
the special C<"U0"> prefix.  It consumes no arguments but forces the
result to be in Unicode characters, instead of bytes.

   my $chars = pack("U0C*", 0x80, 0x42);

Likewise, you can force the result to be bytes by using the special
C<"C0"> prefix.

=head2 Handling Unicode

Handling Unicode is for the most part transparent: just use the
strings as usual.  Functions like C<index()>, C<length()>, and
C<substr()> will work on the Unicode characters; regular expressions
will work on the Unicode characters (see L<perlunicode> and L<perlretut>).

Note that Perl counts each code point of a combining character
sequence as a separate character, so for example

    use charnames ':full';
    print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";

will print 2, not 1.  The only exception is that regular expressions
have C<\X> for matching a combining character sequence.

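The difference between counting code points and counting combining
character sequences can be sketched with C<length()> on one side and a
global C<\X> match on the other.  The sample string is made up for the
example.

```perl
use strict;
use warnings;

my $string = "A\x{0301}bc";   # A + COMBINING ACUTE ACCENT, then "b" and "c"

my $code_points = length($string);
# Assigning the list of matches to () and then to a scalar counts them.
my $sequences   = () = $string =~ /\X/g;

print "code points: $code_points, combining character sequences: $sequences\n";
```
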
Life is not quite so transparent, however, when working with legacy
encodings, I/O, and certain special cases.

=head2 Legacy Encodings

When you combine legacy data and Unicode, the legacy data needs
to be upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if
applicable) is assumed.  You can override this assumption by
using the C<encoding> pragma, for example

    use encoding 'latin2'; # ISO 8859-2

in which case literals (string or regular expressions), C<chr()>,
and C<ord()> in your whole script are assumed to produce Unicode
characters from ISO 8859-2 code points.  Note that the matching for
encoding names is forgiving: instead of C<latin2> you could have
said C<Latin 2>, or C<iso8859-2>, or other variations.  With just

    use encoding;

the environment variable C<PERL_ENCODING> will be consulted.
If that variable isn't set, the encoding pragma will fail.

The C<Encode> module knows about many encodings and has interfaces
for doing conversions between those encodings:

    use Encode 'decode';
    $data = decode("iso-8859-3", $data); # convert from legacy to utf-8

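As a slightly fuller sketch of the same idea, the following decodes a
single ISO 8859-2 byte into a Unicode character and then encodes that
character as UTF-8 bytes.  The sample byte C<0xB1> (LATIN SMALL LETTER A
WITH OGONEK in ISO 8859-2) is chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $legacy = "\xB1";                          # one ISO 8859-2 byte
my $text   = decode("iso-8859-2", $legacy);   # bytes -> Unicode characters
my $utf8   = encode("UTF-8", $text);          # characters -> UTF-8 bytes

printf "decoded to U+%04X, re-encoded as %d UTF-8 byte(s)\n",
    ord($text), length($utf8);
```
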
=head2 Unicode I/O

Normally, writing out Unicode data

    print FH $some_string_with_unicode, "\n";

produces raw bytes that Perl happens to use to internally encode the
Unicode string.  Perl's internal encoding depends on the system as
well as what characters happen to be in the string at the time. If
any of the characters are at code points C<0x100> or above, you will get
a warning.  To ensure that the output is explicitly rendered in the
encoding you desire--and to avoid the warning--open the stream with
the desired encoding. Some examples:

    open FH, ">:utf8", "file";

    open FH, ">:encoding(ucs2)",      "file";
    open FH, ">:encoding(UTF-8)",     "file";
    open FH, ">:encoding(shift_jis)", "file";

and on already open streams, use C<binmode()>:

    binmode(STDOUT, ":utf8");

    binmode(STDOUT, ":encoding(ucs2)");
    binmode(STDOUT, ":encoding(UTF-8)");
    binmode(STDOUT, ":encoding(shift_jis)");

The matching of encoding names is loose: case does not matter, and
many encodings have several aliases.  Note that the C<:utf8> layer
must always be specified exactly like that; it is I<not> subject to
the loose matching of encoding names.

See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
L<Encode::PerlIO> for the C<:encoding()> layer, and
L<Encode::Supported> for many encodings supported by the C<Encode>
module.

Reading in a file that you know happens to be encoded in one of the
Unicode or legacy encodings does not magically turn the data into
Unicode in Perl's eyes.  To do that, specify the appropriate
layer when opening files:

    open(my $fh,'<:utf8', 'anything');
    my $line_of_unicode = <$fh>;

    open(my $fh,'<:encoding(Big5)', 'anything');
    my $line_of_unicode = <$fh>;

The I/O layers can also be specified more flexibly with
the C<open> pragma.  See L<open>, or look at the following example.

    use open ':utf8'; # input and output default layer will be UTF-8
    open X, ">file";
    print X chr(0x100), "\n";
    close X;
    open Y, "<file";
    printf "%#x\n", ord(<Y>); # this should print 0x100
    close Y;

With the C<open> pragma you can use the C<:locale> layer:

    BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
    # the :locale will probe the locale environment variables like LC_ALL
    use open OUT => ':locale'; # russki parusski
    open(O, ">koi8");
    print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
    close O;
    open(I, "<koi8");
    printf "%#x\n", ord(<I>); # this should print 0xc1
    close I;

or you can also use the C<':encoding(...)'> layer:

    open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
    my $line_of_unicode = <$epic>;

These methods install a transparent filter on the I/O stream that
converts data from the specified encoding when it is read in from the
stream.  The result is always Unicode.

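To try an input layer without creating a file on disk, the following
sketch opens an in-memory file (a reference to a scalar) with an
C<:encoding(...)> layer.  The three sample bytes are the ISO 8859-7
codes for the Greek small letters alpha, beta, and gamma; the use of an
in-memory file is a convenience of this example, not something the
text above requires.

```perl
use strict;
use warnings;

my $bytes = "\xE1\xE2\xE3\n";   # alpha, beta, gamma in ISO 8859-7

# Open the byte string as an in-memory file, decoding ISO 8859-7 on input.
open(my $in, '<:encoding(iso-8859-7)', \$bytes)
    or die "cannot open in-memory file: $!";
my $line = <$in>;
close $in;
chomp $line;

# Show the code points of the characters that came out of the layer.
print join(" ", map { sprintf "U+%04X", ord($_) } split //, $line), "\n";
```
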
The L<open> pragma affects all the C<open()> calls after the pragma by
setting default layers.  If you want to affect only certain
streams, use explicit layers directly in the C<open()> call.

You can switch encodings on an already opened stream by using
C<binmode()>; see L<perlfunc/binmode>.

The C<:locale> layer does not currently (as of Perl 5.8.0) work with
C<open()> and C<binmode()>, only with the C<open> pragma.  The
C<:utf8> and C<:encoding(...)> methods do work with all of C<open()>,
C<binmode()>, and the C<open> pragma.

Similarly, you may use these I/O layers on output streams to
automatically convert Unicode to the specified encoding when it is
written to the stream. For example, the following snippet copies the
contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
the file "text.utf8", encoded as UTF-8:

    open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
    open(my $unicode, '>:utf8',                  'text.utf8');
    while (<$nihongo>) { print $unicode $_ }

The naming of encodings, both by the C<open()> and by the C<open>
pragma, is similar to the C<encoding> pragma in that it allows for
flexible names: C<koi8-r> and C<KOI8R> will both be understood.

Common encodings defined by ISO, MIME, IANA, and various other
standardisation organisations are recognised; for a more detailed
list see L<Encode::Supported>.

C<read()> reads characters and returns the number of characters.
C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
and C<sysseek()>.

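The distinction between character counts and byte counts can be seen
in a small sketch.  It assumes an in-memory file holding two Greek
letters encoded as UTF-8 (two bytes each), reads one character, and
then asks C<tell()> for the position.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bytes = encode("UTF-8", "\x{3B1}\x{3B2}");   # two characters, four bytes

open(my $in, '<:utf8', \$bytes) or die "cannot open in-memory file: $!";
my $count    = read($in, my $buffer, 1);   # ask for one character
my $position = tell($in);                  # byte offset after that read
close $in;

printf "read %d character(s), now at byte offset %d\n", $count, $position;
```

The count returned by C<read()> is in characters, while the offset
reported by C<tell()> reflects the bytes consumed from the stream.
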
Notice that because of the default behaviour of not doing any
conversion upon input if there is no default layer,
it is easy to mistakenly write code that keeps on expanding a file
by repeatedly encoding the data:

    # BAD CODE WARNING
    open F, "file";
    local $/; ## read in the whole file of 8-bit characters
    $t = <F>;
    close F;
    open F, ">:utf8", "file";
    print F $t; ## convert to UTF-8 on output
    close F;

If you run this code twice, the contents of the F<file> will be twice
UTF-8 encoded.  A C<use open ':utf8'> would have avoided the bug, as
would explicitly opening the F<file> for input as UTF-8 as well.

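The growth caused by double encoding can be reproduced without any
files.  In this sketch, which uses the C<Encode> module and a made-up
three-character sample string, encoding already-encoded bytes inflates
the data, while decoding first keeps the size stable.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $text  = "\x{E9}t\x{E9}";           # three characters
my $once  = encode("UTF-8", $text);    # correct: 5 bytes
my $twice = encode("UTF-8", $once);    # bug: bytes mistaken for characters
my $fixed = encode("UTF-8", decode("UTF-8", $once));   # decode, then encode

printf "once: %d bytes, twice: %d bytes, fixed: %d bytes\n",
    length($once), length($twice), length($fixed);
```

Decoding on input plays the same role here that opening the file with
an input layer plays in the file-based code.
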
B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
Perl has been built with the new PerlIO feature (which is the default
on most systems).

=head2 Displaying Unicode As Text

Sometimes you might want to display Perl scalars containing Unicode as
simple ASCII (or EBCDIC) text.  The following subroutine converts
its argument so that Unicode characters with code points greater than
255 are displayed as C<\x{...}>, control characters (like C<\n>) are
displayed as C<\x..>, and the rest of the characters as themselves:

   sub nice_string {
       join("",
         map { $_ > 255 ?                  # if wide character...
               sprintf("\\x{%04X}", $_) :  # \x{...}
               chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
               sprintf("\\x%02X", $_) :    # \x..
               quotemeta(chr($_))          # else quoted or as themselves
         } unpack("U*", $_[0]));           # unpack Unicode characters
   }

For example,

   nice_string("foo\x{100}bar\n")

returns the string

   'foo\x{0100}bar\x0A'

which is ready to be printed.

468*0Sstevel@tonic-gate=head2 Special Cases
469*0Sstevel@tonic-gate
470*0Sstevel@tonic-gate=over 4
471*0Sstevel@tonic-gate
472*0Sstevel@tonic-gate=item *
473*0Sstevel@tonic-gate
474*0Sstevel@tonic-gateBit Complement Operator ~ And vec()
475*0Sstevel@tonic-gate
476*0Sstevel@tonic-gateThe bit complement operator C<~> may produce surprising results if
477*0Sstevel@tonic-gateused on strings containing characters with ordinal values above
478*0Sstevel@tonic-gate255. In such a case, the results are consistent with the internal
479*0Sstevel@tonic-gateencoding of the characters, but not with much else. So don't do
480*0Sstevel@tonic-gatethat. Similarly for C<vec()>: you will be operating on the
481*0Sstevel@tonic-gateinternally-encoded bit patterns of the Unicode characters, not on
482*0Sstevel@tonic-gatethe code point values, which is very probably not what you want.
483*0Sstevel@tonic-gate
484*0Sstevel@tonic-gate=item *
485*0Sstevel@tonic-gate
486*0Sstevel@tonic-gatePeeking At Perl's Internal Encoding
487*0Sstevel@tonic-gate
488*0Sstevel@tonic-gateNormal users of Perl should never care how Perl encodes any particular
489*0Sstevel@tonic-gateUnicode string (because the normal ways to get at the contents of a
490*0Sstevel@tonic-gatestring with Unicode--via input and output--should always be via
491*0Sstevel@tonic-gateexplicitly-defined I/O layers). But if you must, there are two
492*0Sstevel@tonic-gateways of looking behind the scenes.
493*0Sstevel@tonic-gate
494*0Sstevel@tonic-gateOne way of peeking inside the internal encoding of Unicode characters
495*0Sstevel@tonic-gateis to use C<unpack("C*", ...> to get the bytes or C<unpack("H*", ...)>
496*0Sstevel@tonic-gateto display the bytes:
497*0Sstevel@tonic-gate
498*0Sstevel@tonic-gate    # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
499*0Sstevel@tonic-gate    print join(" ", unpack("H*", pack("U", 0x100))), "\n";
500*0Sstevel@tonic-gate
The other way would be to use the Devel::Peek module:
502*0Sstevel@tonic-gate
503*0Sstevel@tonic-gate    perl -MDevel::Peek -e 'Dump(chr(0x100))'
504*0Sstevel@tonic-gate
505*0Sstevel@tonic-gateThat shows the C<UTF8> flag in FLAGS and both the UTF-8 bytes
506*0Sstevel@tonic-gateand Unicode characters in C<PV>.  See also later in this document
507*0Sstevel@tonic-gatethe discussion about the C<utf8::is_utf8()> function.
508*0Sstevel@tonic-gate
509*0Sstevel@tonic-gate=back
510*0Sstevel@tonic-gate
511*0Sstevel@tonic-gate=head2 Advanced Topics
512*0Sstevel@tonic-gate
513*0Sstevel@tonic-gate=over 4
514*0Sstevel@tonic-gate
515*0Sstevel@tonic-gate=item *
516*0Sstevel@tonic-gate
517*0Sstevel@tonic-gateString Equivalence
518*0Sstevel@tonic-gate
519*0Sstevel@tonic-gateThe question of string equivalence turns somewhat complicated
520*0Sstevel@tonic-gatein Unicode: what do you mean by "equal"?
521*0Sstevel@tonic-gate
522*0Sstevel@tonic-gate(Is C<LATIN CAPITAL LETTER A WITH ACUTE> equal to
523*0Sstevel@tonic-gateC<LATIN CAPITAL LETTER A>?)
524*0Sstevel@tonic-gate
The short answer is that by default Perl tests equivalence (C<eq>,
C<ne>) based only on the code points of the characters.  In the above
case, the answer is no (because 0x00C1 != 0x0041).  But sometimes all
variants of CAPITAL LETTER A should be considered equal, or even all
variants of A in either case.
529*0Sstevel@tonic-gate
530*0Sstevel@tonic-gateThe long answer is that you need to consider character normalization
531*0Sstevel@tonic-gateand casing issues: see L<Unicode::Normalize>, Unicode Technical
532*0Sstevel@tonic-gateReports #15 and #21, I<Unicode Normalization Forms> and I<Case
533*0Sstevel@tonic-gateMappings>, http://www.unicode.org/unicode/reports/tr15/ and
534*0Sstevel@tonic-gatehttp://www.unicode.org/unicode/reports/tr21/
535*0Sstevel@tonic-gate
536*0Sstevel@tonic-gateAs of Perl 5.8.0, the "Full" case-folding of I<Case
537*0Sstevel@tonic-gateMappings/SpecialCasing> is implemented.
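As a rough illustration of the short and long answers, the sketch below
compares two spellings of the same letter, first by raw code points and
then after normalization.  The variable names are made up for this
example; C<NFC> and C<NFD> are functions exported by
L<Unicode::Normalize>.  Stripping combining marks with C<\p{M}> is only
one possible way of treating all As as equal, not a recommendation.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Two spellings of "A with acute" (names are illustrative only).
my $precomposed = "\x{00C1}";      # LATIN CAPITAL LETTER A WITH ACUTE
my $decomposed  = "A\x{0301}";     # A + COMBINING ACUTE ACCENT

# eq compares code points, so the two spellings differ.
my $raw_equal = ($precomposed eq $decomposed) ? 1 : 0;

# After normalizing both sides to the same form they compare equal.
my $nfc_equal = (NFC($precomposed) eq NFC($decomposed)) ? 1 : 0;

# To treat any "A" as equal: decompose, drop the combining marks,
# and fold case before comparing.
(my $base = NFD($precomposed)) =~ s/\p{M}+//g;
my $loose_equal = (lc($base) eq lc("a")) ? 1 : 0;

print "raw: $raw_equal, normalized: $nfc_equal, loose: $loose_equal\n";
```

Which of the three comparisons is the right one depends entirely on
what "equal" should mean for the data at hand.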
538*0Sstevel@tonic-gate
539*0Sstevel@tonic-gate=item *
540*0Sstevel@tonic-gate
541*0Sstevel@tonic-gateString Collation
542*0Sstevel@tonic-gate
543*0Sstevel@tonic-gatePeople like to see their strings nicely sorted--or as Unicode
544*0Sstevel@tonic-gateparlance goes, collated.  But again, what do you mean by collate?
545*0Sstevel@tonic-gate
546*0Sstevel@tonic-gate(Does C<LATIN CAPITAL LETTER A WITH ACUTE> come before or after
547*0Sstevel@tonic-gateC<LATIN CAPITAL LETTER A WITH GRAVE>?)
548*0Sstevel@tonic-gate
549*0Sstevel@tonic-gateThe short answer is that by default, Perl compares strings (C<lt>,
550*0Sstevel@tonic-gateC<le>, C<cmp>, C<ge>, C<gt>) based only on the code points of the
551*0Sstevel@tonic-gatecharacters.  In the above case, the answer is "after", since
552*0Sstevel@tonic-gateC<0x00C1> > C<0x00C0>.
553*0Sstevel@tonic-gate
554*0Sstevel@tonic-gateThe long answer is that "it depends", and a good answer cannot be
555*0Sstevel@tonic-gategiven without knowing (at the very least) the language context.
556*0Sstevel@tonic-gateSee L<Unicode::Collate>, and I<Unicode Collation Algorithm>
557*0Sstevel@tonic-gatehttp://www.unicode.org/unicode/reports/tr10/
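A minimal sketch of the difference, assuming the default (untailored)
ordering of L<Unicode::Collate> is acceptable; the word list is invented
for the example, and a real application would probably need
language-specific tailoring.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = ("banana", "Zebra", "apple");

# The built-in sort compares code points: every capital sorts before "a".
my @by_code_point = sort @words;

# Unicode::Collate applies the default Unicode Collation Algorithm
# ordering, which is not tailored to any particular language.
my $collator    = Unicode::Collate->new;
my @by_collator = $collator->sort(@words);

print "code points: @by_code_point\n";   # Zebra apple banana
print "collated:    @by_collator\n";     # apple banana Zebra
```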
558*0Sstevel@tonic-gate
559*0Sstevel@tonic-gate=back
560*0Sstevel@tonic-gate
561*0Sstevel@tonic-gate=head2 Miscellaneous
562*0Sstevel@tonic-gate
563*0Sstevel@tonic-gate=over 4
564*0Sstevel@tonic-gate
565*0Sstevel@tonic-gate=item *
566*0Sstevel@tonic-gate
567*0Sstevel@tonic-gateCharacter Ranges and Classes
568*0Sstevel@tonic-gate
Character ranges in regular expression character classes (C</[a-z]/>)
and in the C<tr///> (also known as C<y///>) operator are not magically
Unicode-aware.  What this means is that C<[A-Za-z]> will not magically
start to mean "all alphabetic letters".  (Not that it means that even
for 8-bit characters; there you should be using C</[[:alpha:]]/>.)
574*0Sstevel@tonic-gate
575*0Sstevel@tonic-gateFor specifying character classes like that in regular expressions,
576*0Sstevel@tonic-gateyou can use the various Unicode properties--C<\pL>, or perhaps
577*0Sstevel@tonic-gateC<\p{Alphabetic}>, in this particular case.  You can use Unicode
578*0Sstevel@tonic-gatecode points as the end points of character ranges, but there is no
579*0Sstevel@tonic-gatemagic associated with specifying a certain range.  For further
580*0Sstevel@tonic-gateinformation--there are dozens of Unicode character classes--see
581*0Sstevel@tonic-gateL<perlunicode>.
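The following sketch contrasts an ASCII range with the C<\pL> property
on a short sample string.  The string and variable names are chosen
purely for illustration.

```perl
use strict;
use warnings;

# "naive" with i-diaeresis, a space, then GREEK SMALL LETTER ALPHA and BETA.
my $text = "na\x{EF}ve \x{3B1}\x{3B2}";

my @ascii_letters = $text =~ /[A-Za-z]/g;   # n, a, v, e
my @all_letters   = $text =~ /\pL/g;        # those four plus the three others

printf "ASCII range: %d, \\pL: %d\n",
    scalar @ascii_letters, scalar @all_letters;
```

The range finds four letters and the property finds seven; the space is
matched by neither.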
582*0Sstevel@tonic-gate
583*0Sstevel@tonic-gate=item *
584*0Sstevel@tonic-gate
585*0Sstevel@tonic-gateString-To-Number Conversions
586*0Sstevel@tonic-gate
587*0Sstevel@tonic-gateUnicode does define several other decimal--and numeric--characters
588*0Sstevel@tonic-gatebesides the familiar 0 to 9, such as the Arabic and Indic digits.
589*0Sstevel@tonic-gatePerl does not support string-to-number conversion for digits other
590*0Sstevel@tonic-gatethan ASCII 0 to 9 (and ASCII a to f for hexadecimal).
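If such digits do need to be converted, one possible workaround is to
map them onto ASCII digits first.  The sketch below assumes the input
uses only the Arabic-Indic digits U+0660 to U+0669; other digit sets
would need their own ranges.

```perl
use strict;
use warnings;

my $arabic = "\x{0664}\x{0662}";   # ARABIC-INDIC DIGIT FOUR, DIGIT TWO

# Perl sees no ASCII digits here, so numeric conversion yields 0
# (and a "numeric" warning, silenced for the demonstration).
my $direct = do { no warnings 'numeric'; $arabic + 0 };

# Map U+0660..U+0669 onto ASCII 0..9 on a copy, then convert as usual.
(my $ascii = $arabic) =~ tr/\x{0660}-\x{0669}/0-9/;
my $value = $ascii + 0;

print "direct: $direct, translated: $value\n";   # direct: 0, translated: 42
```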
591*0Sstevel@tonic-gate
592*0Sstevel@tonic-gate=back
593*0Sstevel@tonic-gate
594*0Sstevel@tonic-gate=head2 Questions With Answers
595*0Sstevel@tonic-gate
596*0Sstevel@tonic-gate=over 4
597*0Sstevel@tonic-gate
598*0Sstevel@tonic-gate=item *
599*0Sstevel@tonic-gate
600*0Sstevel@tonic-gateWill My Old Scripts Break?
601*0Sstevel@tonic-gate
Very probably not.  Unless you are generating Unicode characters
somehow, old behaviour should be preserved.  About the only behaviour
that has changed and which could start generating Unicode is the old
behaviour of C<chr()>, where supplying an argument greater than 255
produced a character modulo 256.  C<chr(300)>, for example, was equal
to C<chr(44)> or "," (in ASCII); now it is LATIN CAPITAL LETTER I WITH
BREVE.
609*0Sstevel@tonic-gate
610*0Sstevel@tonic-gate=item *
611*0Sstevel@tonic-gate
612*0Sstevel@tonic-gateHow Do I Make My Scripts Work With Unicode?
613*0Sstevel@tonic-gate
614*0Sstevel@tonic-gateVery little work should be needed since nothing changes until you
615*0Sstevel@tonic-gategenerate Unicode data.  The most important thing is getting input as
616*0Sstevel@tonic-gateUnicode; for that, see the earlier I/O discussion.
617*0Sstevel@tonic-gate
618*0Sstevel@tonic-gate=item *
619*0Sstevel@tonic-gate
620*0Sstevel@tonic-gateHow Do I Know Whether My String Is In Unicode?
621*0Sstevel@tonic-gate
622*0Sstevel@tonic-gateYou shouldn't care.  No, you really shouldn't.  No, really.  If you
623*0Sstevel@tonic-gatehave to care--beyond the cases described above--it means that we
624*0Sstevel@tonic-gatedidn't get the transparency of Unicode quite right.
625*0Sstevel@tonic-gate
626*0Sstevel@tonic-gateOkay, if you insist:
627*0Sstevel@tonic-gate
628*0Sstevel@tonic-gate    print utf8::is_utf8($string) ? 1 : 0, "\n";
629*0Sstevel@tonic-gate
But note that this doesn't mean that any of the characters in the
string are necessarily UTF-8 encoded, or that any of the characters have
code points greater than 0xFF (255) or even 0x80 (128), or that the
string has any characters at all.  All C<is_utf8()> does is return
the value of the internal "utf8ness" flag attached to C<$string>.
If the flag is off, the bytes in the scalar are interpreted as a
single-byte encoding.  If the flag is on, the bytes in the scalar
are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
points of the characters.

Bytes added to a UTF-8 encoded string are automatically upgraded to
UTF-8.  If mixed non-UTF-8 and UTF-8 scalars are merged (double-quoted
interpolation, explicit concatenation, and printf/sprintf parameter
substitution), the result will be UTF-8 encoded as if copies of the
byte strings were upgraded to UTF-8: for example,
643*0Sstevel@tonic-gate
644*0Sstevel@tonic-gate    $a = "ab\x80c";
645*0Sstevel@tonic-gate    $b = "\x{100}";
646*0Sstevel@tonic-gate    print "$a = $b\n";
647*0Sstevel@tonic-gate
648*0Sstevel@tonic-gatethe output string will be UTF-8-encoded C<ab\x80c = \x{100}\n>, but
649*0Sstevel@tonic-gateC<$a> will stay byte-encoded.
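The flag behaviour can be observed directly.  This is only a sketch; it
uses differently named variables rather than C<$a> and C<$b>, which are
also the special variables of C<sort>.

```perl
use strict;
use warnings;

my $bytes  = "ab\x80c";     # all code points below 0x100: flag stays off
my $wide   = "\x{100}";     # needs more than one byte: flag is on
my $joined = "$bytes = $wide";

printf "bytes: %d, wide: %d, joined: %d\n",
    map { utf8::is_utf8($_) ? 1 : 0 } $bytes, $wide, $joined;

# The characters themselves are unchanged by the upgrade.
print length($joined), "\n";    # 8
```

The first line of output is C<bytes: 0, wide: 1, joined: 1>: only the
merged copy was upgraded.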
650*0Sstevel@tonic-gate
651*0Sstevel@tonic-gateSometimes you might really need to know the byte length of a string
652*0Sstevel@tonic-gateinstead of the character length. For that use either the
653*0Sstevel@tonic-gateC<Encode::encode_utf8()> function or the C<bytes> pragma and its only
654*0Sstevel@tonic-gatedefined function C<length()>:
655*0Sstevel@tonic-gate
656*0Sstevel@tonic-gate    my $unicode = chr(0x100);
657*0Sstevel@tonic-gate    print length($unicode), "\n"; # will print 1
658*0Sstevel@tonic-gate    require Encode;
659*0Sstevel@tonic-gate    print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
660*0Sstevel@tonic-gate    use bytes;
661*0Sstevel@tonic-gate    print length($unicode), "\n"; # will also print 2
662*0Sstevel@tonic-gate                                  # (the 0xC4 0x80 of the UTF-8)
663*0Sstevel@tonic-gate
664*0Sstevel@tonic-gate=item *
665*0Sstevel@tonic-gate
666*0Sstevel@tonic-gateHow Do I Detect Data That's Not Valid In a Particular Encoding?
667*0Sstevel@tonic-gate
Use the C<Encode> package to try converting it.  For example, to check
whether a string of bytes is valid UTF-8, try decoding it and see
whether that fails:

    use Encode 'decode';
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps the original string untouched
    if (eval { decode('UTF-8', $string_of_bytes_that_I_think_is_utf8,
                      Encode::FB_CROAK() | Encode::LEAVE_SRC()); 1 }) {
        # valid
    } else {
        # invalid
    }
677*0Sstevel@tonic-gate
678*0Sstevel@tonic-gateFor UTF-8 only, you can use:
679*0Sstevel@tonic-gate
680*0Sstevel@tonic-gate    use warnings;
681*0Sstevel@tonic-gate    @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);
682*0Sstevel@tonic-gate
If invalid, a C<Malformed UTF-8 character (byte 0x##) in unpack>
warning is produced.  The "U0" means "expect strictly UTF-8 encoded
Unicode".  Without that, C<unpack("U*", ...)> would also accept
data like C<chr(0xFF)>, similarly to the C<pack> we saw earlier.
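One way to package such a check is sketched below.  The helper name
C<looks_like_utf8> is hypothetical; it relies on C<Encode::decode()>
dying on malformed input when given the C<FB_CROAK> check value.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if $bytes decodes cleanly as UTF-8,
# 0 otherwise.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

print looks_like_utf8("\xC4\x80"), "\n";   # 1: the encoding of U+0100
print looks_like_utf8("\xC4"),     "\n";   # 0: truncated sequence
print looks_like_utf8("\xFF"),     "\n";   # 0: 0xFF never occurs in UTF-8
```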
687*0Sstevel@tonic-gate
688*0Sstevel@tonic-gate=item *
689*0Sstevel@tonic-gate
690*0Sstevel@tonic-gateHow Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa?
691*0Sstevel@tonic-gate
692*0Sstevel@tonic-gateThis probably isn't as useful as you might think.
693*0Sstevel@tonic-gateNormally, you shouldn't need to.
694*0Sstevel@tonic-gate
In one sense, what you are asking doesn't make much sense: encodings
are for characters, and binary data are not "characters", so converting
"data" into some encoding isn't meaningful unless you know what
character set and encoding the binary data is in, in which case it's
not just binary data, now is it?
700*0Sstevel@tonic-gate
701*0Sstevel@tonic-gateIf you have a raw sequence of bytes that you know should be
702*0Sstevel@tonic-gateinterpreted via a particular encoding, you can use C<Encode>:
703*0Sstevel@tonic-gate
704*0Sstevel@tonic-gate    use Encode 'from_to';
705*0Sstevel@tonic-gate    from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8
706*0Sstevel@tonic-gate
707*0Sstevel@tonic-gateThe call to C<from_to()> changes the bytes in C<$data>, but nothing
708*0Sstevel@tonic-gatematerial about the nature of the string has changed as far as Perl is
709*0Sstevel@tonic-gateconcerned.  Both before and after the call, the string C<$data>
710*0Sstevel@tonic-gatecontains just a bunch of 8-bit bytes. As far as Perl is concerned,
711*0Sstevel@tonic-gatethe encoding of the string remains as "system-native 8-bit bytes".
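A small sketch of what that means in practice, using an invented sample
string: the byte count changes, but the string is still a plain byte
string with the internal flag off.

```perl
use strict;
use warnings;
use Encode qw(from_to);

my $data       = "caf\xE9";               # "cafe" with e-acute, Latin-1 bytes
my $len_before = length $data;            # 4

from_to($data, "iso-8859-1", "utf-8");    # rewrites $data in place

my $len_after = length $data;             # 5: e-acute is now 0xC3 0xA9
my $flag      = utf8::is_utf8($data) ? 1 : 0;

print "$len_before -> $len_after bytes, utf8 flag: $flag\n";
```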
712*0Sstevel@tonic-gate
713*0Sstevel@tonic-gateYou might relate this to a fictional 'Translate' module:
714*0Sstevel@tonic-gate
715*0Sstevel@tonic-gate   use Translate;
716*0Sstevel@tonic-gate   my $phrase = "Yes";
717*0Sstevel@tonic-gate   Translate::from_to($phrase, 'english', 'deutsch');
718*0Sstevel@tonic-gate   ## phrase now contains "Ja"
719*0Sstevel@tonic-gate
The contents of the string change, but not the nature of the string.
Perl doesn't know any more after the call than before that the
contents of the string indicate the affirmative.
723*0Sstevel@tonic-gate
724*0Sstevel@tonic-gateBack to converting data.  If you have (or want) data in your system's
725*0Sstevel@tonic-gatenative 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you can use
726*0Sstevel@tonic-gatepack/unpack to convert to/from Unicode.
727*0Sstevel@tonic-gate
728*0Sstevel@tonic-gate    $native_string  = pack("C*", unpack("U*", $Unicode_string));
729*0Sstevel@tonic-gate    $Unicode_string = pack("U*", unpack("C*", $native_string));
730*0Sstevel@tonic-gate
731*0Sstevel@tonic-gateIf you have a sequence of bytes you B<know> is valid UTF-8,
732*0Sstevel@tonic-gatebut Perl doesn't know it yet, you can make Perl a believer, too:
733*0Sstevel@tonic-gate
734*0Sstevel@tonic-gate    use Encode 'decode_utf8';
735*0Sstevel@tonic-gate    $Unicode = decode_utf8($bytes);
736*0Sstevel@tonic-gate
You can convert well-formed UTF-8 to a sequence of bytes, but if
you just want to convert random binary data into UTF-8, you can't.
B<Not every random collection of bytes is well-formed UTF-8>.  You can
use C<unpack("C*", $string)> for the former, and you can create
well-formed Unicode data by C<pack("U*", 0xff, ...)>.
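As a sketch of the "make Perl a believer" step and its inverse, the
following decodes two bytes known to be valid UTF-8 and encodes the
result again.  The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

my $bytes   = "\xC4\x80";             # UTF-8 encoding of U+0100, raw bytes
my $unicode = decode_utf8($bytes);    # one character as far as Perl knows

printf "%d byte(s) -> %d character(s), U+%04X\n",
    length($bytes), length($unicode), ord($unicode);

# Going the other way yields the original bytes again.
my $round_trip = encode_utf8($unicode);
print $round_trip eq $bytes ? "round trip ok\n" : "round trip differs\n";
```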
742*0Sstevel@tonic-gate
743*0Sstevel@tonic-gate=item *
744*0Sstevel@tonic-gate
745*0Sstevel@tonic-gateHow Do I Display Unicode?  How Do I Input Unicode?
746*0Sstevel@tonic-gate
747*0Sstevel@tonic-gateSee http://www.alanwood.net/unicode/ and
748*0Sstevel@tonic-gatehttp://www.cl.cam.ac.uk/~mgk25/unicode.html
749*0Sstevel@tonic-gate
750*0Sstevel@tonic-gate=item *
751*0Sstevel@tonic-gate
752*0Sstevel@tonic-gateHow Does Unicode Work With Traditional Locales?
753*0Sstevel@tonic-gate
In Perl, not very well.  Avoid using locales through the C<locale>
pragma together with Unicode: use one or the other, not both.  But see
L<perlrun> for the description of the C<-C> switch and its environment
counterpart, C<$ENV{PERL_UNICODE}>, to see how to enable various
Unicode features, for example by using locale settings.
759*0Sstevel@tonic-gate
760*0Sstevel@tonic-gate=back
761*0Sstevel@tonic-gate
762*0Sstevel@tonic-gate=head2 Hexadecimal Notation
763*0Sstevel@tonic-gate
764*0Sstevel@tonic-gateThe Unicode standard prefers using hexadecimal notation because
765*0Sstevel@tonic-gatethat more clearly shows the division of Unicode into blocks of 256 characters.
766*0Sstevel@tonic-gateHexadecimal is also simply shorter than decimal.  You can use decimal
767*0Sstevel@tonic-gatenotation, too, but learning to use hexadecimal just makes life easier
768*0Sstevel@tonic-gatewith the Unicode standard.  The C<U+HHHH> notation uses hexadecimal,
769*0Sstevel@tonic-gatefor example.
770*0Sstevel@tonic-gate
The C<0x> prefix means a hexadecimal number; the digits are 0-9 I<and>
772*0Sstevel@tonic-gatea-f (or A-F, case doesn't matter).  Each hexadecimal digit represents
773*0Sstevel@tonic-gatefour bits, or half a byte.  C<print 0x..., "\n"> will show a
774*0Sstevel@tonic-gatehexadecimal number in decimal, and C<printf "%x\n", $decimal> will
775*0Sstevel@tonic-gateshow a decimal number in hexadecimal.  If you have just the
776*0Sstevel@tonic-gate"hex digits" of a hexadecimal number, you can use the C<hex()> function.
777*0Sstevel@tonic-gate
778*0Sstevel@tonic-gate    print 0x0009, "\n";    # 9
779*0Sstevel@tonic-gate    print 0x000a, "\n";    # 10
780*0Sstevel@tonic-gate    print 0x000f, "\n";    # 15
781*0Sstevel@tonic-gate    print 0x0010, "\n";    # 16
782*0Sstevel@tonic-gate    print 0x0011, "\n";    # 17
783*0Sstevel@tonic-gate    print 0x0100, "\n";    # 256
784*0Sstevel@tonic-gate
785*0Sstevel@tonic-gate    print 0x0041, "\n";    # 65
786*0Sstevel@tonic-gate
787*0Sstevel@tonic-gate    printf "%x\n",  65;    # 41
788*0Sstevel@tonic-gate    printf "%#x\n", 65;    # 0x41
789*0Sstevel@tonic-gate
790*0Sstevel@tonic-gate    print hex("41"), "\n"; # 65
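The same functions can be combined to move between a character and its
C<U+HHHH> label.  This is a minimal sketch; the four-digit padding is an
assumption that matches the usual notation for code points up to
0xFFFF.

```perl
use strict;
use warnings;

# Format a character's code point in U+HHHH notation.
my $char  = chr(0x100);
my $label = sprintf "U+%04X", ord($char);     # "U+0100"

# Parse such a label back: strip the prefix and let hex() do the work.
(my $digits = $label) =~ s/^U\+//;
my $code_point = hex($digits);                # 256

print "$label is $code_point in decimal\n";
```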
791*0Sstevel@tonic-gate
792*0Sstevel@tonic-gate=head2 Further Resources
793*0Sstevel@tonic-gate
794*0Sstevel@tonic-gate=over 4
795*0Sstevel@tonic-gate
796*0Sstevel@tonic-gate=item *
797*0Sstevel@tonic-gate
798*0Sstevel@tonic-gateUnicode Consortium
799*0Sstevel@tonic-gate
800*0Sstevel@tonic-gate    http://www.unicode.org/
801*0Sstevel@tonic-gate
802*0Sstevel@tonic-gate=item *
803*0Sstevel@tonic-gate
804*0Sstevel@tonic-gateUnicode FAQ
805*0Sstevel@tonic-gate
806*0Sstevel@tonic-gate    http://www.unicode.org/unicode/faq/
807*0Sstevel@tonic-gate
808*0Sstevel@tonic-gate=item *
809*0Sstevel@tonic-gate
810*0Sstevel@tonic-gateUnicode Glossary
811*0Sstevel@tonic-gate
812*0Sstevel@tonic-gate    http://www.unicode.org/glossary/
813*0Sstevel@tonic-gate
814*0Sstevel@tonic-gate=item *
815*0Sstevel@tonic-gate
816*0Sstevel@tonic-gateUnicode Useful Resources
817*0Sstevel@tonic-gate
818*0Sstevel@tonic-gate    http://www.unicode.org/unicode/onlinedat/resources.html
819*0Sstevel@tonic-gate
820*0Sstevel@tonic-gate=item *
821*0Sstevel@tonic-gate
822*0Sstevel@tonic-gateUnicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
823*0Sstevel@tonic-gate
824*0Sstevel@tonic-gate    http://www.alanwood.net/unicode/
825*0Sstevel@tonic-gate
826*0Sstevel@tonic-gate=item *
827*0Sstevel@tonic-gate
828*0Sstevel@tonic-gateUTF-8 and Unicode FAQ for Unix/Linux
829*0Sstevel@tonic-gate
830*0Sstevel@tonic-gate    http://www.cl.cam.ac.uk/~mgk25/unicode.html
831*0Sstevel@tonic-gate
832*0Sstevel@tonic-gate=item *
833*0Sstevel@tonic-gate
834*0Sstevel@tonic-gateLegacy Character Sets
835*0Sstevel@tonic-gate
836*0Sstevel@tonic-gate    http://www.czyborra.com/
837*0Sstevel@tonic-gate    http://www.eki.ee/letter/
838*0Sstevel@tonic-gate
839*0Sstevel@tonic-gate=item *
840*0Sstevel@tonic-gate
841*0Sstevel@tonic-gateThe Unicode support files live within the Perl installation in the
842*0Sstevel@tonic-gatedirectory
843*0Sstevel@tonic-gate
844*0Sstevel@tonic-gate    $Config{installprivlib}/unicore
845*0Sstevel@tonic-gate
846*0Sstevel@tonic-gatein Perl 5.8.0 or newer, and
847*0Sstevel@tonic-gate
848*0Sstevel@tonic-gate    $Config{installprivlib}/unicode
849*0Sstevel@tonic-gate
850*0Sstevel@tonic-gatein the Perl 5.6 series.  (The renaming to F<lib/unicore> was done to
851*0Sstevel@tonic-gateavoid naming conflicts with lib/Unicode in case-insensitive filesystems.)
852*0Sstevel@tonic-gateThe main Unicode data file is F<UnicodeData.txt> (or F<Unicode.301> in
853*0Sstevel@tonic-gatePerl 5.6.1.)  You can find the C<$Config{installprivlib}> by
854*0Sstevel@tonic-gate
855*0Sstevel@tonic-gate    perl "-V:installprivlib"
856*0Sstevel@tonic-gate
857*0Sstevel@tonic-gateYou can explore various information from the Unicode data files using
858*0Sstevel@tonic-gatethe C<Unicode::UCD> module.
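For instance, a short sketch using the C<charinfo()> function of
C<Unicode::UCD> to look up one code point (300 decimal, chosen because
it appears earlier in this document):

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# charinfo() returns a hash reference describing one code point.
my $info = charinfo(0x012C);    # 300 in decimal

print "U+$info->{code} $info->{name}\n";
print "general category: $info->{category}\n";
```

The first line printed is C<U+012C LATIN CAPITAL LETTER I WITH BREVE>.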
859*0Sstevel@tonic-gate
860*0Sstevel@tonic-gate=back
861*0Sstevel@tonic-gate
862*0Sstevel@tonic-gate=head1 UNICODE IN OLDER PERLS
863*0Sstevel@tonic-gate
864*0Sstevel@tonic-gateIf you cannot upgrade your Perl to 5.8.0 or later, you can still
865*0Sstevel@tonic-gatedo some Unicode processing by using the modules C<Unicode::String>,
866*0Sstevel@tonic-gateC<Unicode::Map8>, and C<Unicode::Map>, available from CPAN.
867*0Sstevel@tonic-gateIf you have the GNU recode installed, you can also use the
868*0Sstevel@tonic-gatePerl front-end C<Convert::Recode> for character conversions.
869*0Sstevel@tonic-gate
The following are fast conversions from ISO 8859-1 (Latin-1) bytes
to UTF-8 bytes and back; the code works even with older Perl 5 versions.
872*0Sstevel@tonic-gate
873*0Sstevel@tonic-gate    # ISO 8859-1 to UTF-8
874*0Sstevel@tonic-gate    s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
875*0Sstevel@tonic-gate
876*0Sstevel@tonic-gate    # UTF-8 to ISO 8859-1
877*0Sstevel@tonic-gate    s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
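To try the two substitutions out, they can be wrapped in small
functions.  The function names below are hypothetical, and both
functions take and return plain byte strings.

```perl
use strict;
use warnings;

# Hypothetical wrappers around the two substitutions shown above.
sub latin1_to_utf8 {
    my ($s) = @_;
    $s =~ s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
    return $s;
}

sub utf8_to_latin1 {
    my ($s) = @_;
    $s =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
    return $s;
}

my $latin1 = "caf\xE9";                   # e-acute as the single byte 0xE9
my $utf8   = latin1_to_utf8($latin1);     # e-acute becomes 0xC3 0xA9

# Show the converted bytes in hex, then convert back and compare.
print join(" ", map { sprintf("%02x", ord($_)) } split //, $utf8), "\n";
print utf8_to_latin1($utf8) eq $latin1 ? "round trip ok\n" : "mismatch\n";
```

The first line printed is C<63 61 66 c3 a9>.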
878*0Sstevel@tonic-gate
879*0Sstevel@tonic-gate=head1 SEE ALSO
880*0Sstevel@tonic-gate
881*0Sstevel@tonic-gateL<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
882*0Sstevel@tonic-gateL<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
883*0Sstevel@tonic-gateL<Unicode::UCD>
884*0Sstevel@tonic-gate
885*0Sstevel@tonic-gate=head1 ACKNOWLEDGMENTS
886*0Sstevel@tonic-gate
887*0Sstevel@tonic-gateThanks to the kind readers of the perl5-porters@perl.org,
888*0Sstevel@tonic-gateperl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
889*0Sstevel@tonic-gatemailing lists for their valuable feedback.
890*0Sstevel@tonic-gate
891*0Sstevel@tonic-gate=head1 AUTHOR, COPYRIGHT, AND LICENSE
892*0Sstevel@tonic-gate
893*0Sstevel@tonic-gateCopyright 2001-2002 Jarkko Hietaniemi E<lt>jhi@iki.fiE<gt>
894*0Sstevel@tonic-gate
895*0Sstevel@tonic-gateThis document may be distributed under the same terms as Perl itself.