=head1 NAME

perluniintro - Perl Unicode introduction

=head1 DESCRIPTION

This document gives a general idea of Unicode and how to use Unicode
in Perl.

=head2 Unicode

Unicode is a character set standard that aims to codify all of the
writing systems of the world, plus many other symbols.

Unicode and ISO/IEC 10646 are coordinated standards that provide code
points for characters in almost all modern character set standards,
covering more than 30 writing systems and hundreds of languages,
including all commercially-important modern languages. All characters
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded. The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.
Unicode 1.0 was released in October 1991, and 4.0 in April 2003.

A Unicode I<character> is an abstract entity. It is not bound to any
particular integer width, especially not to the C language C<char>.
Unicode is language-neutral and display-neutral: it does not encode the
language of the text and it does not define fonts or other graphical
layout details. Unicode operates on characters and on text built from
those characters.

Unicode defines characters like C<LATIN CAPITAL LETTER A> or C<GREEK
SMALL LETTER ALPHA> and unique numbers for the characters, in this
case 0x0041 and 0x03B1, respectively. These unique numbers are called
I<code points>.

The Unicode standard prefers using hexadecimal notation for the code
points. If numbers like C<0x0041> are unfamiliar to you, take a peek
at a later section, L</"Hexadecimal Notation">. The Unicode standard
uses the notation C<U+0041 LATIN CAPITAL LETTER A>, to give the
hexadecimal code point and the normative name of the character.

Unicode also defines various I<properties> for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation";
these properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.

A Unicode character consists either of a single code point, or a
I<base character> (like C<LATIN CAPITAL LETTER A>), followed by one or
more I<modifiers> (like C<COMBINING ACUTE ACCENT>). This sequence of
base character and modifiers is called a I<combining character
sequence>.

Whether to call these combining character sequences "characters"
depends on your point of view. If you are a programmer, you probably
would tend towards seeing each element in the sequences as one unit,
or "character". The whole sequence could be seen as one "character",
however, from the user's point of view, since that's probably what it
looks like in the context of the user's language.

With this "whole sequence" view of characters, the total number of
characters is open-ended. But in the programmer's "one unit is one
character" point of view, the concept of "characters" is more
deterministic. In this document, we take that second point of view:
one "character" is one Unicode code point, be it a base character or
a combining character.

For some combinations, there are I<precomposed> characters.
C<LATIN CAPITAL LETTER A WITH ACUTE>, for example, is defined as
a single code point. These precomposed characters are, however,
only available for some combinations, and are mainly
meant to support round-trip conversions between Unicode and legacy
standards (like the ISO 8859). In the general case, the composing
method is more extensible. To support conversion between
different compositions of the characters, various I<normalization
forms> to standardize representations are also defined.

Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character". The same character could
be represented differently in several legacy encodings. The
converse is also not true: some code points do not have an assigned
character. Firstly, there are unallocated code points within
otherwise used blocks. Secondly, there are special Unicode control
characters that do not represent true characters.

A common myth about Unicode is that it would be "16-bit", that is,
Unicode is only represented as C<0x10000> (or 65536) characters from
C<0x0000> to C<0xFFFF>. B<This is untrue.> Since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
and since Unicode 3.1 (March 2001), characters have been defined
beyond C<0xFFFF>. The first C<0x10000> characters are called
I<Plane 0>, or the I<Basic Multilingual Plane> (BMP). With Unicode
3.1, 17 (yes, seventeen) planes in all were defined--but they are
nowhere near full of defined characters, yet.

Another myth is that the 256-character blocks have something to
do with languages--that each block would define the characters used
by a language or a set of languages. B<This is also untrue.>
The division into blocks exists, but it is almost completely
accidental--an artifact of how the characters have been and
still are allocated. Instead, there is a concept called I<scripts>,
which is more useful: there is C<Latin> script, C<Greek> script, and
so on. Scripts usually span varied parts of several blocks.
For further information see L<Unicode::UCD>.
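
As a minimal sketch of the difference between blocks and scripts, the
following looks up both for three Latin letters using C<charblock()>
and C<charscript()> from L<Unicode::UCD>. The hash names C<%block_of>
and C<%script_of> are illustrative only, and the exact block names
returned depend on the Unicode version shipped with your Perl.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock charscript);

# Illustrative lookup tables, keyed by code point.
my (%block_of, %script_of);

# Three Latin letters: A, e with acute, and A with macron.
for my $cp (0x0041, 0x00E9, 0x0100) {
    $block_of{$cp}  = charblock($cp);
    $script_of{$cp} = charscript($cp);
    printf "U+%04X block=%s script=%s\n",
        $cp, $block_of{$cp}, $script_of{$cp};
}
```

All three code points should report the same script while sitting in
different blocks, which is the sense in which scripts span several
blocks.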

The Unicode code points are just abstract numbers. To input and
output these abstract numbers, the numbers must be I<encoded> or
I<serialised> somehow. Unicode defines several I<character encoding
forms>, of which I<UTF-8> is perhaps the most popular. UTF-8 is a
variable length encoding that encodes Unicode characters as 1 to 6
bytes (only 4 with the currently defined characters). Other encodings
include UTF-16 and UTF-32 and their big- and little-endian variants
(UTF-8 is byte-order independent). The ISO/IEC 10646 defines the UCS-2
and UCS-4 encoding forms.

For more information about encodings--for instance, to learn what
I<surrogates> and I<byte order marks> (BOMs) are--see L<perlunicode>.

=head2 Perl's Unicode Support

Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
natively. Perl 5.8.0, however, is the first recommended release for
serious Unicode work. The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but for example
regular expressions still do not work with Unicode in 5.6.1.

B<Starting from Perl 5.8.0, the use of C<use utf8> is no longer
necessary.> In earlier releases the C<utf8> pragma was used to declare
that operations in the current block or file would be Unicode-aware.
This model was found to be wrong, or at least clumsy: the "Unicodeness"
is now carried with the data, instead of being attached to the
operations. Only one case remains where an explicit C<use utf8> is
needed: if your Perl script itself is encoded in UTF-8, you can use
UTF-8 in your identifier names, and in string and regular expression
literals, by saying C<use utf8>. This is not the default because
scripts with legacy 8-bit data in them would break. See L<utf8>.

=head2 Perl's Unicode Model

Perl supports both pre-5.6 strings of eight-bit native bytes, and
strings of Unicode characters. The principle is that Perl tries to
keep its data as eight-bit bytes for as long as possible, but as soon
as Unicodeness cannot be avoided, the data is transparently upgraded
to Unicode.

Internally, Perl currently uses either the native eight-bit
character set of the platform (for example Latin-1) or
UTF-8 to encode Unicode strings. Specifically, if all code points in
the string are C<0xFF> or less, Perl uses the native eight-bit
character set. Otherwise, it uses UTF-8.

A user of Perl does not normally need to know or care how Perl
happens to encode its internal strings, but it becomes relevant when
outputting Unicode strings to a stream without a PerlIO layer -- one with
the "default" encoding. In such a case, the raw bytes used internally
(the native character set or UTF-8, as appropriate for each string)
will be used, and a "Wide character" warning will be issued if those
strings contain a character beyond 0x00FF.

For example,

    perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

produces a fairly useless mixture of native bytes and UTF-8, as well
as a warning:

    Wide character in print at ...

To output UTF-8, use the C<:utf8> output layer. Prepending

    binmode(STDOUT, ":utf8");

to this sample program ensures that the output is completely UTF-8,
and removes the program's warning.

You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.

Note that this means that Perl expects other software to work, too:
if Perl has been led to believe that STDIN should be UTF-8, but then
STDIN coming in from another command is not UTF-8, Perl will complain
about the malformed UTF-8.

All features that combine Unicode and I/O also require using the new
PerlIO feature. Almost all Perl 5.8 platforms do use PerlIO, though:
you can see whether yours does by running "perl -V" and looking for
C<useperlio=define>.

=head2 Unicode and EBCDIC

Perl 5.8.0 also supports Unicode on EBCDIC platforms. There,
Unicode support is somewhat more complex to implement since
additional conversions are needed at every step. Some problems
remain; see L<perlebcdic> for details.

In any case, the Unicode support on EBCDIC platforms is better than
in the 5.6 series, which didn't work much at all for EBCDIC platforms.
On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
instead of UTF-8. The difference is that UTF-8 is "ASCII-safe" in
that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
"EBCDIC-safe".

=head2 Creating Unicode

To create Unicode characters in literals for code points above C<0xFF>,
use the C<\x{...}> notation in double-quoted strings:

    my $smiley = "\x{263a}";

Similarly, it can be used in regular expression literals

    $smiley =~ /\x{263a}/;

At run-time you can use C<chr()>:

    my $hebrew_alef = chr(0x05d0);

See L</"Further Resources"> for how to find all these numeric codes.
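
The two ways of creating a character shown above should yield the same
one-character string. The following sketch checks this; the variable
names are illustrative, and C<ord()> is used only to read the code
point back.

```perl
use strict;
use warnings;

my $literal = "\x{263a}";     # compile-time literal
my $runtime = chr(0x263a);    # same character built at run time

my $same   = $literal eq $runtime ? 1 : 0;
my $length = length($literal);                  # counts characters, not bytes
my $code   = sprintf("U+%04X", ord($runtime));  # read the code point back

print "$same $length $code\n";
```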

Naturally, C<ord()> will do the reverse: it turns a character into
a code point.

Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
and C<chr(...)> for arguments less than C<0x100> (decimal 256)
generate an eight-bit character for backward compatibility with older
Perls. For arguments of C<0x100> or more, Unicode characters are
always produced. If you want to force the production of Unicode
characters regardless of the numeric value, use C<pack("U", ...)>
instead of C<\x..>, C<\x{...}>, or C<chr()>.

You can also use the C<charnames> pragma to invoke characters
by name in double-quoted strings:

    use charnames ':full';
    my $arabic_alef = "\N{ARABIC LETTER ALEF}";

And, as mentioned above, you can also C<pack()> numbers into Unicode
characters:

    my $georgian_an = pack("U", 0x10a0);

Note that both C<\x{...}> and C<\N{...}> are compile-time string
constants: you cannot use variables in them. If you want similar
run-time functionality, use C<chr()> and C<charnames::vianame()>.

If you want to force the result to Unicode characters, use the special
C<"U0"> prefix. It consumes no arguments but forces the result to be
in Unicode characters, instead of bytes.

    my $chars = pack("U0C*", 0x80, 0x42);

Likewise, you can force the result to be bytes by using the special
C<"C0"> prefix.

=head2 Handling Unicode

Handling Unicode is for the most part transparent: just use the
strings as usual. Functions like C<index()>, C<length()>, and
C<substr()> will work on the Unicode characters; regular expressions
will work on the Unicode characters (see L<perlunicode> and L<perlretut>).

Note that Perl considers the elements of combining character sequences
to be separate characters, so for example

    use charnames ':full';
    print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";

will print 2, not 1. The only exception is that regular expressions
have C<\X> for matching a combining character sequence.

Life is not quite so transparent, however, when working with legacy
encodings, I/O, and certain special cases:

=head2 Legacy Encodings

When you combine legacy data and Unicode, the legacy data needs
to be upgraded to Unicode. Normally ISO 8859-1 (or EBCDIC, if
applicable) is assumed. You can override this assumption by
using the C<encoding> pragma, for example

    use encoding 'latin2'; # ISO 8859-2

in which case literals (string or regular expressions), C<chr()>,
and C<ord()> in your whole script are assumed to produce Unicode
characters from ISO 8859-2 code points. Note that the matching for
encoding names is forgiving: instead of C<latin2> you could have
said C<Latin 2>, or C<iso8859-2>, or other variations. With just

    use encoding;

the environment variable C<PERL_ENCODING> will be consulted.
If that variable isn't set, the encoding pragma will fail.

The C<Encode> module knows about many encodings and has interfaces
for doing conversions between those encodings:

    use Encode 'decode';
    $data = decode("iso-8859-3", $data); # convert from legacy to utf-8

=head2 Unicode I/O

Normally, writing out Unicode data

    print FH $some_string_with_unicode, "\n";

produces raw bytes that Perl happens to use to internally encode the
Unicode string. Perl's internal encoding depends on the system as
well as what characters happen to be in the string at the time. If
any of the characters are at code points C<0x100> or above, you will get
a warning. To ensure that the output is explicitly rendered in the
encoding you desire--and to avoid the warning--open the stream with
the desired encoding. Some examples:

    open FH, ">:utf8", "file";

    open FH, ">:encoding(ucs2)", "file";
    open FH, ">:encoding(UTF-8)", "file";
    open FH, ">:encoding(shift_jis)", "file";

and on already open streams, use C<binmode()>:

    binmode(STDOUT, ":utf8");

    binmode(STDOUT, ":encoding(ucs2)");
    binmode(STDOUT, ":encoding(UTF-8)");
    binmode(STDOUT, ":encoding(shift_jis)");

The matching of encoding names is loose: case does not matter, and
many encodings have several aliases. Note that the C<:utf8> layer
must always be specified exactly like that; it is I<not> subject to
the loose matching of encoding names.

See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
L<Encode::PerlIO> for the C<:encoding()> layer, and
L<Encode::Supported> for many encodings supported by the C<Encode>
module.

Reading in a file that you know happens to be encoded in one of the
Unicode or legacy encodings does not magically turn the data into
Unicode in Perl's eyes. To do that, specify the appropriate
layer when opening files

    open(my $fh,'<:utf8', 'anything');
    my $line_of_unicode = <$fh>;

    open(my $fh,'<:encoding(Big5)', 'anything');
    my $line_of_unicode = <$fh>;

The I/O layers can also be specified more flexibly with
the C<open> pragma. See L<open>, or look at the following example.

    use open ':utf8'; # input and output default layer will be UTF-8
    open X, ">file";
    print X chr(0x100), "\n";
    close X;
    open Y, "<file";
    printf "%#x\n", ord(<Y>); # this should print 0x100
    close Y;

With the C<open> pragma you can use the C<:locale> layer

    BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
    # the :locale will probe the locale environment variables like LC_ALL
    use open OUT => ':locale'; # russki parusski
    open(O, ">koi8");
    print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
    close O;
    open(I, "<koi8");
    printf "%#x\n", ord(<I>); # this should print 0xc1
    close I;

or you can also use the C<':encoding(...)'> layer

    open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
    my $line_of_unicode = <$epic>;

These methods install a transparent filter on the I/O stream that
converts data from the specified encoding when it is read in from the
stream. The result is always Unicode.

The L<open> pragma affects all the C<open()> calls after the pragma by
setting default layers. If you want to affect only certain
streams, use explicit layers directly in the C<open()> call.

You can switch encodings on an already opened stream by using
C<binmode()>; see L<perlfunc/binmode>.

The C<:locale> layer does not currently (as of Perl 5.8.0) work with
C<open()> and C<binmode()>, only with the C<open> pragma. The
C<:utf8> and C<:encoding(...)> methods do work with all of C<open()>,
C<binmode()>, and the C<open> pragma.

Similarly, you may use these I/O layers on output streams to
automatically convert Unicode to the specified encoding when it is
written to the stream. For example, the following snippet copies the
contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
the file "text.utf8", encoded as UTF-8:

    open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
    open(my $unicode, '>:utf8', 'text.utf8');
    while (<$nihongo>) { print $unicode $_ }

The naming of encodings, both by the C<open()> and by the C<open>
pragma, is similar to the C<encoding> pragma in that it allows for
flexible names: C<koi8-r> and C<KOI8R> will both be understood.
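
The same decode-on-input, encode-on-output pattern can be tried
without creating any files. The sketch below assumes a Perl with
PerlIO and uses in-memory filehandles (a scalar reference passed to
C<open()>) in place of real files; the sample data and variable names
are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical input: CYRILLIC SMALL LETTER A stored as one KOI8-R byte.
my $koi8_bytes = encode('koi8-r', chr(0x430));

# Read through a decoding layer: bytes in, characters out.
open(my $in, '<:encoding(koi8-r)', \$koi8_bytes) or die "open: $!";
my $text = <$in>;
close $in;

# Write through an encoding layer: characters in, bytes out.
my $utf8_bytes = '';
open(my $out, '>:encoding(UTF-8)', \$utf8_bytes) or die "open: $!";
print {$out} $text;
close $out;

printf "%s -> U+%04X -> %s\n",
    unpack("H*", $koi8_bytes), ord($text), unpack("H*", $utf8_bytes);
```

Between the two handles the program only ever sees the character
U+0430; the legacy and UTF-8 byte sequences exist only at the edges.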

Common encodings recognised by ISO, MIME, IANA, and various other
standardisation organisations are recognised; for a more detailed
list see L<Encode::Supported>.

C<read()> reads characters and returns the number of characters.
C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
and C<sysseek()>.

Notice that because of the default behaviour of not doing any
conversion upon input if there is no default layer,
it is easy to mistakenly write code that keeps on expanding a file
by repeatedly encoding the data:

    # BAD CODE WARNING
    open F, "file";
    local $/; ## read in the whole file of 8-bit characters
    $t = <F>;
    close F;
    open F, ">:utf8", "file";
    print F $t; ## convert to UTF-8 on output
    close F;

If you run this code twice, the contents of the F<file> will be twice
UTF-8 encoded. A C<use open ':utf8'> would have avoided the bug, as
would explicitly opening the F<file> for input as UTF-8 as well.

B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
Perl has been built with the new PerlIO feature (which is the default
on most systems).
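
The double-encoding effect can be reproduced in memory. The following
sketch uses C<encode()> and C<decode()> from the C<Encode> module as a
stand-in for the file layers, so it is an analogy to the file example
rather than the same code; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical file contents: U+00E9 already stored as UTF-8 (2 bytes).
my $on_disk = encode('UTF-8', chr(0xE9));

# Bug: the bytes are not decoded on input, so each byte is treated as a
# character and encoded again on output.
my $twice = encode('UTF-8', $on_disk);

# Fix: decode on input, so that encoding on output is a true round trip.
my $round_trip = encode('UTF-8', decode('UTF-8', $on_disk));

printf "%d %d %d\n",
    length($on_disk), length($twice), length($round_trip);
```

The undecoded path doubles the byte count on every pass, while the
decoded path returns exactly the bytes it started with.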
439*0Sstevel@tonic-gate 440*0Sstevel@tonic-gate=head2 Displaying Unicode As Text 441*0Sstevel@tonic-gate 442*0Sstevel@tonic-gateSometimes you might want to display Perl scalars containing Unicode as 443*0Sstevel@tonic-gatesimple ASCII (or EBCDIC) text. The following subroutine converts 444*0Sstevel@tonic-gateits argument so that Unicode characters with code points greater than 445*0Sstevel@tonic-gate255 are displayed as C<\x{...}>, control characters (like C<\n>) are 446*0Sstevel@tonic-gatedisplayed as C<\x..>, and the rest of the characters as themselves: 447*0Sstevel@tonic-gate 448*0Sstevel@tonic-gate sub nice_string { 449*0Sstevel@tonic-gate join("", 450*0Sstevel@tonic-gate map { $_ > 255 ? # if wide character... 451*0Sstevel@tonic-gate sprintf("\\x{%04X}", $_) : # \x{...} 452*0Sstevel@tonic-gate chr($_) =~ /[[:cntrl:]]/ ? # else if control character ... 453*0Sstevel@tonic-gate sprintf("\\x%02X", $_) : # \x.. 454*0Sstevel@tonic-gate quotemeta(chr($_)) # else quoted or as themselves 455*0Sstevel@tonic-gate } unpack("U*", $_[0])); # unpack Unicode characters 456*0Sstevel@tonic-gate } 457*0Sstevel@tonic-gate 458*0Sstevel@tonic-gateFor example, 459*0Sstevel@tonic-gate 460*0Sstevel@tonic-gate nice_string("foo\x{100}bar\n") 461*0Sstevel@tonic-gate 462*0Sstevel@tonic-gatereturns the string 463*0Sstevel@tonic-gate 464*0Sstevel@tonic-gate 'foo\x{0100}bar\x0A' 465*0Sstevel@tonic-gate 466*0Sstevel@tonic-gatewhich is ready to be printed. 467*0Sstevel@tonic-gate 468*0Sstevel@tonic-gate=head2 Special Cases 469*0Sstevel@tonic-gate 470*0Sstevel@tonic-gate=over 4 471*0Sstevel@tonic-gate 472*0Sstevel@tonic-gate=item * 473*0Sstevel@tonic-gate 474*0Sstevel@tonic-gateBit Complement Operator ~ And vec() 475*0Sstevel@tonic-gate 476*0Sstevel@tonic-gateThe bit complement operator C<~> may produce surprising results if 477*0Sstevel@tonic-gateused on strings containing characters with ordinal values above 478*0Sstevel@tonic-gate255. 
In such a case, the results are consistent with the internal 479*0Sstevel@tonic-gateencoding of the characters, but not with much else. So don't do 480*0Sstevel@tonic-gatethat. Similarly for C<vec()>: you will be operating on the 481*0Sstevel@tonic-gateinternally-encoded bit patterns of the Unicode characters, not on 482*0Sstevel@tonic-gatethe code point values, which is very probably not what you want. 483*0Sstevel@tonic-gate 484*0Sstevel@tonic-gate=item * 485*0Sstevel@tonic-gate 486*0Sstevel@tonic-gatePeeking At Perl's Internal Encoding 487*0Sstevel@tonic-gate 488*0Sstevel@tonic-gateNormal users of Perl should never care how Perl encodes any particular 489*0Sstevel@tonic-gateUnicode string (because the normal ways to get at the contents of a 490*0Sstevel@tonic-gatestring with Unicode--via input and output--should always be via 491*0Sstevel@tonic-gateexplicitly-defined I/O layers). But if you must, there are two 492*0Sstevel@tonic-gateways of looking behind the scenes. 493*0Sstevel@tonic-gate 494*0Sstevel@tonic-gateOne way of peeking inside the internal encoding of Unicode characters 495*0Sstevel@tonic-gateis to use C<unpack("C*", ...> to get the bytes or C<unpack("H*", ...)> 496*0Sstevel@tonic-gateto display the bytes: 497*0Sstevel@tonic-gate 498*0Sstevel@tonic-gate # this prints c4 80 for the UTF-8 bytes 0xc4 0x80 499*0Sstevel@tonic-gate print join(" ", unpack("H*", pack("U", 0x100))), "\n"; 500*0Sstevel@tonic-gate 501*0Sstevel@tonic-gateYet another way would be to use the Devel::Peek module: 502*0Sstevel@tonic-gate 503*0Sstevel@tonic-gate perl -MDevel::Peek -e 'Dump(chr(0x100))' 504*0Sstevel@tonic-gate 505*0Sstevel@tonic-gateThat shows the C<UTF8> flag in FLAGS and both the UTF-8 bytes 506*0Sstevel@tonic-gateand Unicode characters in C<PV>. See also later in this document 507*0Sstevel@tonic-gatethe discussion about the C<utf8::is_utf8()> function. 

=back

=head2 Advanced Topics

=over 4

=item *

String Equivalence

The question of string equivalence becomes somewhat complicated
in Unicode: what do you mean by "equal"?

(Is C<LATIN CAPITAL LETTER A WITH ACUTE> equal to
C<LATIN CAPITAL LETTER A>?)

The short answer is that by default Perl compares equivalence (C<eq>,
C<ne>) based only on code points of the characters.  In the above
case, the answer is no (because 0x00C1 != 0x0041).  But sometimes, any
CAPITAL LETTER As should be considered equal, or even As of any case.

The long answer is that you need to consider character normalization
and casing issues: see L<Unicode::Normalize>, Unicode Technical
Reports #15 and #21, I<Unicode Normalization Forms> and I<Case
Mappings>, http://www.unicode.org/unicode/reports/tr15/ and
http://www.unicode.org/unicode/reports/tr21/

As of Perl 5.8.0, the "Full" case-folding of I<Case
Mappings/SpecialCasing> is implemented.

=item *

String Collation

People like to see their strings nicely sorted--or as Unicode
parlance goes, collated.  But again, what do you mean by collate?

(Does C<LATIN CAPITAL LETTER A WITH ACUTE> come before or after
C<LATIN CAPITAL LETTER A WITH GRAVE>?)

The short answer is that by default, Perl compares strings (C<lt>,
C<le>, C<cmp>, C<ge>, C<gt>) based only on the code points of the
characters.  In the above case, the answer is "after", since
C<0x00C1> > C<0x00C0>.

The long answer is that "it depends", and a good answer cannot be
given without knowing (at the very least) the language context.
See L<Unicode::Collate>, and I<Unicode Collation Algorithm>
http://www.unicode.org/unicode/reports/tr10/

=back

=head2 Miscellaneous

=over 4

=item *

Character Ranges and Classes

Character ranges in regular expression character classes (C</[a-z]/>)
and in the C<tr///> (also known as C<y///>) operator are not magically
Unicode-aware.  What this means is that C<[A-Za-z]> will not magically
start to mean "all alphabetic letters".  (It does not mean that even
for 8-bit characters; you should be using C</[[:alpha:]]/> in that case.)

For specifying character classes like that in regular expressions,
you can use the various Unicode properties--C<\pL>, or perhaps
C<\p{Alphabetic}>, in this particular case.
You can use Unicode
code points as the end points of character ranges, but there is no
magic associated with specifying a certain range.  For further
information--there are dozens of Unicode character classes--see
L<perlunicode>.

=item *

String-To-Number Conversions

Unicode does define several other decimal--and numeric--characters
besides the familiar 0 to 9, such as the Arabic and Indic digits.
Perl does not support string-to-number conversion for digits other
than ASCII 0 to 9 (and ASCII a to f for hexadecimal).

=back

=head2 Questions With Answers

=over 4

=item *

Will My Old Scripts Break?

Very probably not.  Unless you are generating Unicode characters
somehow, old behaviour should be preserved.  About the only behaviour
that has changed and which could start generating Unicode is the old
behaviour of C<chr()> where supplying an argument more than 255
produced a character modulo 255.  C<chr(300)>, for example, was equal
to C<chr(45)> or "-" (in ASCII); now it is LATIN CAPITAL LETTER I WITH
BREVE.

=item *

How Do I Make My Scripts Work With Unicode?

Very little work should be needed since nothing changes until you
generate Unicode data.  The most important thing is getting input as
Unicode; for that, see the earlier I/O discussion.

=item *

How Do I Know Whether My String Is In Unicode?

You shouldn't care.  No, you really shouldn't.  No, really.  If you
have to care--beyond the cases described above--it means that we
didn't get the transparency of Unicode quite right.

Okay, if you insist:

    print utf8::is_utf8($string) ? 1 : 0, "\n";

But note that this doesn't mean that any of the characters in the
string are necessarily UTF-8 encoded, or that any of the characters have
code points greater than 0xFF (255) or even 0x80 (128), or that the
string has any characters at all.  All that C<is_utf8()> does is
return the value of the internal "utf8ness" flag attached to the
C<$string>.  If the flag is off, the bytes in the scalar are interpreted
as a single byte encoding.  If the flag is on, the bytes in the scalar
are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
points of the characters.  Bytes added to a UTF-8 encoded string are
automatically upgraded to UTF-8.
If mixed non-UTF-8 and UTF-8 scalars
are merged (double-quoted interpolation, explicit concatenation, and
printf/sprintf parameter substitution), the result will be UTF-8 encoded
as if copies of the byte strings were upgraded to UTF-8: for example,

    $a = "ab\x80c";
    $b = "\x{100}";
    print "$a = $b\n";

the output string will be UTF-8-encoded C<ab\x80c = \x{100}\n>, but
C<$a> will stay byte-encoded.

Sometimes you might really need to know the byte length of a string
instead of the character length.  For that use either the
C<Encode::encode_utf8()> function or the C<bytes> pragma and its only
defined function C<length()>:

    my $unicode = chr(0x100);
    print length($unicode), "\n"; # will print 1
    require Encode;
    print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
    use bytes;
    print length($unicode), "\n"; # will also print 2
                                  # (the 0xC4 0x80 of the UTF-8)

=item *

How Do I Detect Data That's Not Valid In a Particular Encoding?

Use the C<Encode> package to try converting it.
For example,

    use Encode 'decode';
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps the input string unmodified
    if (eval { decode('UTF-8', $string_of_bytes_that_I_think_is_utf8,
                      Encode::FB_CROAK | Encode::LEAVE_SRC); 1 }) {
        # valid
    } else {
        # invalid
    }

For UTF-8 only, you can also use:

    use warnings;
    @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);

If invalid, a C<Malformed UTF-8 character (byte 0x##) in unpack>
warning is produced.  The "U0" means "expect strictly UTF-8 encoded
Unicode".  Without it, C<unpack("U*", ...)> would also accept
data like C<chr(0xFF)>, similarly to C<pack> as we saw earlier.

=item *

How Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa?

This probably isn't as useful as you might think.
Normally, you shouldn't need to.

In one sense, what you are asking doesn't make much sense: encodings
are for characters, and binary data are not "characters", so converting
"data" into some encoding isn't meaningful unless you know what
character set and encoding the binary data is in, in which case it's
not just binary data, now is it?
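
To illustrate the point, here is a small sketch (the variable names are
only illustrative) in which the same two bytes yield different
characters depending on which encoding is assumed when they are decoded
with C<Encode::decode()>:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA9";                        # two raw bytes

# The same bytes, interpreted under two different assumptions.
my $as_utf8   = decode("UTF-8", $bytes);       # one character, U+00E9
my $as_latin1 = decode("iso-8859-1", $bytes);  # two characters, U+00C3 U+00A9

printf "as UTF-8:   %d character(s), first is U+%04X\n",
    length($as_utf8), ord($as_utf8);
printf "as Latin-1: %d character(s), first is U+%04X\n",
    length($as_latin1), ord($as_latin1);
```

Neither result is "the" right one: the bytes alone do not say which
encoding they are in.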

If you have a raw sequence of bytes that you know should be
interpreted via a particular encoding, you can use C<Encode>:

    use Encode 'from_to';
    from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8

The call to C<from_to()> changes the bytes in C<$data>, but nothing
material about the nature of the string has changed as far as Perl is
concerned.  Both before and after the call, the string C<$data>
contains just a bunch of 8-bit bytes.  As far as Perl is concerned,
the encoding of the string remains as "system-native 8-bit bytes".

You might relate this to a fictional 'Translate' module:

    use Translate;
    my $phrase = "Yes";
    Translate::from_to($phrase, 'english', 'deutsch');
    ## phrase now contains "Ja"

The contents of the string change, but not the nature of the string.
Perl doesn't know any more after the call than before that the
contents of the string indicate the affirmative.

Back to converting data.  If you have (or want) data in your system's
native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you can use
pack/unpack to convert to/from Unicode.

    $native_string  = pack("C*", unpack("U*", $Unicode_string));
    $Unicode_string = pack("U*", unpack("C*", $native_string));

If you have a sequence of bytes you B<know> is valid UTF-8,
but Perl doesn't know it yet, you can make Perl a believer, too:

    use Encode 'decode_utf8';
    $Unicode = decode_utf8($bytes);

You can convert well-formed UTF-8 to a sequence of bytes, but if
you just want to convert random binary data into UTF-8, you can't.
B<An arbitrary collection of bytes is not necessarily well-formed UTF-8>.
You can use C<unpack("C*", $string)> for the former, and you can create
well-formed Unicode data by C<pack("U*", 0xff, ...)>.

=item *

How Do I Display Unicode?  How Do I Input Unicode?

See http://www.alanwood.net/unicode/ and
http://www.cl.cam.ac.uk/~mgk25/unicode.html

=item *

How Does Unicode Work With Traditional Locales?

In Perl, not very well.  Avoid using locales through the C<locale>
pragma.  Use only one or the other.  But see L<perlrun> for the
description of the C<-C> switch and its environment counterpart,
C<$ENV{PERL_UNICODE}> to see how to enable various Unicode features,
for example by using locale settings.

=back

=head2 Hexadecimal Notation

The Unicode standard prefers using hexadecimal notation because
it more clearly shows the division of Unicode into blocks of 256 characters.
Hexadecimal is also simply shorter than decimal.  You can use decimal
notation, too, but learning to use hexadecimal just makes life easier
with the Unicode standard.  The C<U+HHHH> notation uses hexadecimal,
for example.

The C<0x> prefix means a hexadecimal number; the digits are 0-9 I<and>
a-f (or A-F, case doesn't matter).  Each hexadecimal digit represents
four bits, or half a byte.  C<print 0x..., "\n"> will show a
hexadecimal number in decimal, and C<printf "%x\n", $decimal> will
show a decimal number in hexadecimal.  If you have just the
"hex digits" of a hexadecimal number, you can use the C<hex()> function.

    print 0x0009, "\n";    # 9
    print 0x000a, "\n";    # 10
    print 0x000f, "\n";    # 15
    print 0x0010, "\n";    # 16
    print 0x0011, "\n";    # 17
    print 0x0100, "\n";    # 256

    print 0x0041, "\n";    # 65

    printf "%x\n",  65;    # 41
    printf "%#x\n", 65;    # 0x41

    print hex("41"), "\n"; # 65

=head2 Further Resources

=over 4

=item *

Unicode Consortium

    http://www.unicode.org/

=item *

Unicode FAQ

    http://www.unicode.org/unicode/faq/

=item *

Unicode Glossary

    http://www.unicode.org/glossary/

=item *

Unicode Useful Resources

    http://www.unicode.org/unicode/onlinedat/resources.html

=item *

Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications

    http://www.alanwood.net/unicode/

=item *

UTF-8 and Unicode FAQ for Unix/Linux


    http://www.cl.cam.ac.uk/~mgk25/unicode.html

=item *

Legacy Character Sets

    http://www.czyborra.com/
    http://www.eki.ee/letter/

=item *

The Unicode support files live within the Perl installation in the
directory

    $Config{installprivlib}/unicore

in Perl 5.8.0 or newer, and

    $Config{installprivlib}/unicode

in the Perl 5.6 series.  (The renaming to F<lib/unicore> was done to
avoid naming conflicts with lib/Unicode in case-insensitive filesystems.)
The main Unicode data file is F<UnicodeData.txt> (or F<Unicode.301> in
Perl 5.6.1).  You can find the C<$Config{installprivlib}> by

    perl "-V:installprivlib"

You can explore various information from the Unicode data files using
the C<Unicode::UCD> module.

=back

=head1 UNICODE IN OLDER PERLS

If you cannot upgrade your Perl to 5.8.0 or later, you can still
do some Unicode processing by using the modules C<Unicode::String>,
C<Unicode::Map8>, and C<Unicode::Map>, available from CPAN.
If you have the GNU recode installed, you can also use the
Perl front-end C<Convert::Recode> for character conversions.

The following are fast conversions from ISO 8859-1 (Latin-1) bytes
to UTF-8 bytes and back; the code works even with older Perl 5 versions.

    # ISO 8859-1 to UTF-8
    s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

    # UTF-8 to ISO 8859-1
    s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

=head1 SEE ALSO

L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
L<Unicode::UCD>

=head1 ACKNOWLEDGMENTS

Thanks to the kind readers of the perl5-porters@perl.org,
perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
mailing lists for their valuable feedback.

=head1 AUTHOR, COPYRIGHT, AND LICENSE

Copyright 2001-2002 Jarkko Hietaniemi E<lt>jhi@iki.fiE<gt>

This document may be distributed under the same terms as Perl itself.