=head1 NAME

perluniintro - Perl Unicode introduction

=head1 DESCRIPTION

This document gives a general idea of Unicode and how to use Unicode
in Perl. See L</Further Resources> for references to more in-depth
treatments of Unicode.

=head2 Unicode

Unicode is a character set standard which plans to codify all of the
writing systems of the world, plus many other symbols.

Unicode and ISO/IEC 10646 are coordinated standards that unify
almost all other modern character set standards,
covering more than 80 writing systems and hundreds of languages,
including all commercially-important modern languages. All characters
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded. The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.
Unicode 1.0 was released in October 1991, and 6.0 in October 2010.

A Unicode I<character> is an abstract entity. It is not bound to any
particular integer width, especially not to the C language C<char>.
Unicode is language-neutral and display-neutral: it does not encode the
language of the text, and it does not generally define fonts or other graphical
layout details. Unicode operates on characters and on text built from
those characters.

Unicode defines characters like C<LATIN CAPITAL LETTER A> or C<GREEK
SMALL LETTER ALPHA> and unique numbers for the characters, in this
case 0x0041 and 0x03B1, respectively. These unique numbers are called
I<code points>. A code point is essentially the position of the
character within the set of all possible Unicode characters, and thus in
Perl, the term I<ordinal> is often used interchangeably with it.

The Unicode standard prefers using hexadecimal notation for the code
points. If numbers like C<0x0041> are unfamiliar to you, take a peek
at a later section, L</"Hexadecimal Notation">.
The Unicode standard
uses the notation C<U+0041 LATIN CAPITAL LETTER A>, to give the
hexadecimal code point and the normative name of the character.

Unicode also defines various I<properties> for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation";
these properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.

A Unicode I<logical> "character" can actually consist of more than one internal
I<actual> "character" or code point. For Western languages, this is adequately
modelled by a I<base character> (like C<LATIN CAPITAL LETTER A>) followed
by one or more I<modifiers> (like C<COMBINING ACUTE ACCENT>). This sequence of
base character and modifiers is called a I<combining character
sequence>. Some non-western languages require more complicated
models, so Unicode created the I<grapheme cluster> concept, which was
later further refined into the I<extended grapheme cluster>. For
example, a Korean Hangul syllable is considered a single logical
character, but most often consists of three actual
Unicode characters: a leading consonant followed by an interior vowel followed
by a trailing consonant.

Whether to call these extended grapheme clusters "characters" depends on your
point of view. If you are a programmer, you probably would tend towards seeing
each element in the sequences as one unit, or "character". However from
the user's point of view, the whole sequence could be seen as one
"character" since that's probably what it looks like in the context of the
user's language. In this document, we take the programmer's point of
view: one "character" is one Unicode code point.

For some combinations of base character and modifiers, there are
I<precomposed> characters.
There is a single character equivalent, for
example, for the sequence C<LATIN CAPITAL LETTER A> followed by
C<COMBINING ACUTE ACCENT>. It is called C<LATIN CAPITAL LETTER A WITH
ACUTE>. These precomposed characters are, however, only available for
some combinations, and are mainly meant to support round-trip
conversions between Unicode and legacy standards (like ISO 8859). Using
sequences, as Unicode does, allows for needing fewer basic building blocks
(code points) to express many more potential grapheme clusters. To
support conversion between equivalent forms, various I<normalization
forms> are also defined. Thus, C<LATIN CAPITAL LETTER A WITH ACUTE> is
in I<Normalization Form Composed> (abbreviated NFC), and the sequence
C<LATIN CAPITAL LETTER A> followed by C<COMBINING ACUTE ACCENT>
represents the same character in I<Normalization Form Decomposed> (NFD).

Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character". The same character could
be represented differently in several legacy encodings. The
converse is also not true: some code points do not have an assigned
character. Firstly, there are unallocated code points within
otherwise used blocks. Secondly, there are special Unicode control
characters that do not represent true characters.

When Unicode was first conceived, it was thought that all the world's
characters could be represented using a 16-bit word; that is, a maximum of
C<0x10000> (or 65536) characters from C<0x0000> to C<0xFFFF> would be
needed. This soon proved to be false, and since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
and Unicode 3.1 (March 2001) defined the first characters above C<0xFFFF>.
The first C<0x10000> characters are called the I<Plane 0>, or the
I<Basic Multilingual Plane> (BMP).
With Unicode 3.1, 17 (yes,
seventeen) planes in all were defined--but they are nowhere near full of
defined characters, yet.

When a new language is being encoded, Unicode generally will choose a
C<block> of consecutive unallocated code points for its characters. So
far, the number of code points in these blocks has always been evenly
divisible by 16. Extras in a block, not currently needed, are left
unallocated, for future growth. But there have been occasions when
a later release needed more code points than the available extras, and a
new block had to be allocated somewhere else, not contiguous to the initial
one, to handle the overflow. Thus, it became apparent early on that
"block" wasn't an adequate organizing principle, and so the C<Script>
property was created. (Later an improved script property was added as
well, the C<Script_Extensions> property.) Those code points that are in
overflow blocks can still
have the same script as the original ones. The script concept fits more
closely with natural language: there is C<Latin> script, C<Greek>
script, and so on; and there are several artificial scripts, like
C<Common> for characters that are used in multiple scripts, such as
mathematical symbols. Scripts usually span varied parts of several
blocks. For more information about scripts, see L<perlunicode/Scripts>.
The division into blocks exists, but it is almost completely
accidental--an artifact of how the characters have been and still are
allocated. (Note that this paragraph has oversimplified things for the
sake of this being an introduction. Unicode doesn't really encode
languages, but the writing systems for them--their scripts; and one
script can be used by many languages. Unicode also encodes things that
aren't really about languages, such as symbols like C<BAGGAGE CLAIM>.)

The Unicode code points are just abstract numbers.
To input and
output these abstract numbers, the numbers must be I<encoded> or
I<serialised> somehow. Unicode defines several I<character encoding
forms>, of which I<UTF-8> is perhaps the most popular. UTF-8 is a
variable length encoding that encodes Unicode characters as 1 to 4
bytes. Other encodings
include UTF-16 and UTF-32 and their big- and little-endian variants
(UTF-8 is byte-order independent). The ISO/IEC 10646 defines the UCS-2
and UCS-4 encoding forms.

For more information about encodings--for instance, to learn what
I<surrogates> and I<byte order marks> (BOMs) are--see L<perlunicode>.

=head2 Perl's Unicode Support

Starting from Perl v5.6.0, Perl has had the capacity to handle Unicode
natively. Perl v5.8.0, however, is the first recommended release for
serious Unicode work. The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but for example
regular expressions still do not work with Unicode in 5.6.1.
Perl v5.14.0 is the first release where Unicode support is
(almost) seamlessly integrable without some gotchas (the exception being
some differences in L<quotemeta|perlfunc/quotemeta>, which is fixed
starting in Perl 5.16.0). To enable this
seamless support, you should C<use feature 'unicode_strings'> (which is
automatically selected if you C<use 5.012> or higher). See L<feature>.
(5.14 also fixes a number of bugs and departures from the Unicode
standard.)

Before Perl v5.8.0, C<use utf8> was used to declare
that operations in the current block or file would be Unicode-aware.
This model was found to be wrong, or at least clumsy: the "Unicodeness"
is now carried with the data, instead of being attached to the
operations.
Starting with Perl v5.8.0, only one case remains where an explicit C<use
utf8> is needed: if your Perl script itself is encoded in UTF-8, you can
use UTF-8 in your identifier names, and in string and regular expression
literals, by saying C<use utf8>. This is not the default because
scripts with legacy 8-bit data in them would break. See L<utf8>.

=head2 Perl's Unicode Model

Perl supports both pre-5.6 strings of eight-bit native bytes, and
strings of Unicode characters. The general principle is that Perl tries
to keep its data as eight-bit bytes for as long as possible, but as soon
as Unicodeness cannot be avoided, the data is transparently upgraded
to Unicode. Prior to Perl v5.14.0, the upgrade was not completely
transparent (see L<perlunicode/The "Unicode Bug">), and for backwards
compatibility, full transparency is not gained unless C<use feature
'unicode_strings'> (see L<feature>) or C<use 5.012> (or higher) is
selected.

Internally, Perl currently uses either the native eight-bit
character set of the platform (for example Latin-1) or
UTF-8 to encode Unicode strings. Specifically, if all code points in
the string are C<0xFF> or less, Perl uses the native eight-bit
character set. Otherwise, it uses UTF-8.

A user of Perl does not normally need to know nor care how Perl
happens to encode its internal strings, but it becomes relevant when
outputting Unicode strings to a stream without a PerlIO layer (one with
the "default" encoding). In such a case, the raw bytes used internally
(the native character set or UTF-8, as appropriate for each string)
will be used, and a "Wide character" warning will be issued if those
strings contain a character beyond 0x00FF.

For example,

    perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

produces a fairly useless mixture of native bytes and UTF-8, as well
as a warning:

    Wide character in print at ...

To output UTF-8, use the C<:encoding> or C<:utf8> output layer. Prepending

    binmode(STDOUT, ":utf8");

to this sample program ensures that the output is completely UTF-8,
and removes the program's warning.

You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.

Note that this means that Perl expects other software to work the same
way:
if Perl has been led to believe that STDIN should be UTF-8, but then
STDIN coming in from another command is not UTF-8, Perl will likely
complain about the malformed UTF-8.

All features that combine Unicode and I/O also require using the new
PerlIO feature. Almost all Perl 5.8 platforms do use PerlIO, though:
you can see whether yours is by running "perl -V" and looking for
C<useperlio=define>.

=head2 Unicode and EBCDIC

Perl 5.8.0 also supports Unicode on EBCDIC platforms. There,
Unicode support is somewhat more complex to implement since
additional conversions are needed at every step.

Later Perl releases have added code that will not work on EBCDIC platforms, and
no one has complained, so the divergence has continued. If you want to run
Perl on an EBCDIC platform, send email to perlbug@perl.org.

On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
instead of UTF-8. The difference is that UTF-8 is "ASCII-safe" in
that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
"EBCDIC-safe".
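
As a rough illustration of the native-versus-Unicode distinction, the
following sketch guesses the platform character set from the native
ordinal of C<A> (65 in ASCII, 193 in EBCDIC) and then maps that ordinal
to a Unicode code point with C<utf8::native_to_unicode()>. The variable
names are illustrative only, and the check is a heuristic, not an
official test.

    use strict;
    use warnings;

    # ord() returns the native ordinal: 65 on ASCII platforms,
    # 193 on EBCDIC ones.
    my $native   = ord("A");
    my $platform = $native == 193 ? "EBCDIC" : "ASCII";

    # Map the native ordinal to its Unicode code point; on ASCII
    # platforms this is the identity.
    my $unicode = utf8::native_to_unicode($native);

    printf "%s platform: native %d is U+%04X\n",
        $platform, $native, $unicode;

Whatever the platform, C<$unicode> should come out as 0x41, since that is
the Unicode code point of C<LATIN CAPITAL LETTER A>.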

=head2 Creating Unicode

To create Unicode characters in literals for code points above C<0xFF>,
use the C<\x{...}> notation in double-quoted strings:

    my $smiley = "\x{263a}";

Similarly, it can be used in regular expression literals:

    $smiley =~ /\x{263a}/;

At run-time you can use C<chr()>:

    my $hebrew_alef = chr(0x05d0);

See L</"Further Resources"> for how to find all these numeric codes.

Naturally, C<ord()> will do the reverse: it turns a character into
a code point.

Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
and C<chr(...)> for arguments less than C<0x100> (decimal 256)
generate an eight-bit character for backward compatibility with older
Perls. For arguments of C<0x100> or more, Unicode characters are
always produced. If you want to force the production of Unicode
characters regardless of the numeric value, use C<pack("U", ...)>
instead of C<\x..>, C<\x{...}>, or C<chr()>.

You can invoke characters
by name in double-quoted strings:

    my $arabic_alef = "\N{ARABIC LETTER ALEF}";

And, as mentioned above, you can also C<pack()> numbers into Unicode
characters:

    my $georgian_an = pack("U", 0x10a0);

Note that both C<\x{...}> and C<\N{...}> are compile-time string
constants: you cannot use variables in them. If you want similar
run-time functionality, use C<chr()> and C<charnames::string_vianame()>.

If you want to force the result to Unicode characters, use the special
C<"U0"> prefix. It consumes no arguments but causes the following bytes
to be interpreted as the UTF-8 encoding of Unicode characters:

    my $chars = pack("U0W*", 0x80, 0x42);

Likewise, you can stop such UTF-8 interpretation by using the special
C<"C0"> prefix.

=head2 Handling Unicode

Handling Unicode is for the most part transparent: just use the
strings as usual.
Functions like C<index()>, C<length()>, and
C<substr()> will work on the Unicode characters; regular expressions
will work on the Unicode characters (see L<perlunicode> and L<perlretut>).

Note that Perl considers each code point in a grapheme cluster to be a
separate character, so for example

    print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"),
          "\n";

will print 2, not 1. The only exception is that regular expressions
have C<\X> for matching an extended grapheme cluster. (Thus C<\X> in a
regular expression would match the entire sequence of both the example
characters.)

Life is not quite so transparent, however, when working with legacy
encodings, I/O, and certain special cases:

=head2 Legacy Encodings

When you combine legacy data and Unicode, the legacy data needs
to be upgraded to Unicode. Normally the legacy data is assumed to be
ISO 8859-1 (or EBCDIC, if applicable).

The C<Encode> module knows about many encodings and has interfaces
for doing conversions between those encodings:

    use Encode 'decode';
    $data = decode("iso-8859-3", $data); # convert from legacy to utf-8

=head2 Unicode I/O

Normally, writing out Unicode data

    print FH $some_string_with_unicode, "\n";

produces raw bytes that Perl happens to use to internally encode the
Unicode string. Perl's internal encoding depends on the system as
well as what characters happen to be in the string at the time. If
any of the characters are at code points C<0x100> or above, you will get
a warning. To ensure that the output is explicitly rendered in the
encoding you desire--and to avoid the warning--open the stream with
the desired encoding.
Some examples:

    open FH, ">:utf8", "file";

    open FH, ">:encoding(ucs2)",      "file";
    open FH, ">:encoding(UTF-8)",     "file";
    open FH, ">:encoding(shift_jis)", "file";

and on already open streams, use C<binmode()>:

    binmode(STDOUT, ":utf8");

    binmode(STDOUT, ":encoding(ucs2)");
    binmode(STDOUT, ":encoding(UTF-8)");
    binmode(STDOUT, ":encoding(shift_jis)");

The matching of encoding names is loose: case does not matter, and
many encodings have several aliases. Note that the C<:utf8> layer
must always be specified exactly like that; it is I<not> subject to
the loose matching of encoding names. Also note that currently C<:utf8> is unsafe for
input, because it accepts the data without validating that it is indeed valid
UTF-8; you should instead use C<:encoding(utf-8)> (with or without a
hyphen).

See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
L<Encode::PerlIO> for the C<:encoding()> layer, and
L<Encode::Supported> for many encodings supported by the C<Encode>
module.

Reading in a file that you know happens to be encoded in one of the
Unicode or legacy encodings does not magically turn the data into
Unicode in Perl's eyes. To do that, specify the appropriate
layer when opening files

    open(my $fh,'<:encoding(utf8)', 'anything');
    my $line_of_unicode = <$fh>;

    open(my $fh,'<:encoding(Big5)', 'anything');
    my $line_of_unicode = <$fh>;

The I/O layers can also be specified more flexibly with
the C<open> pragma. See L<open>, or look at the following example.

    use open ':encoding(utf8)'; # input/output default encoding will be
                                # UTF-8
    open X, ">file";
    print X chr(0x100), "\n";
    close X;
    open Y, "<file";
    printf "%#x\n", ord(<Y>);   # this should print 0x100
    close Y;

With the C<open> pragma you can use the C<:locale> layer

    BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
    # the :locale will probe the locale environment variables like
    # LC_ALL
    use open OUT => ':locale'; # russki parusski
    open(O, ">koi8");
    print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
    close O;
    open(I, "<koi8");
    printf "%#x\n", ord(<I>); # this should print 0xc1
    close I;

These methods install a transparent filter on the I/O stream that
converts data from the specified encoding when it is read in from the
stream. The result is always Unicode.

The L<open> pragma affects all the C<open()> calls after the pragma by
setting default layers. If you want to affect only certain
streams, use explicit layers directly in the C<open()> call.

You can switch encodings on an already opened stream by using
C<binmode()>; see L<perlfunc/binmode>.

The C<:locale> layer does not currently work with
C<open()> and C<binmode()>, only with the C<open> pragma. The
C<:utf8> and C<:encoding(...)> methods do work with all of C<open()>,
C<binmode()>, and the C<open> pragma.

Similarly, you may use these I/O layers on output streams to
automatically convert Unicode to the specified encoding when it is
written to the stream.
For example, the following snippet copies the
contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
the file "text.utf8", encoded as UTF-8:

    open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
    open(my $unicode, '>:utf8',                  'text.utf8');
    while (<$nihongo>) { print $unicode $_ }

The naming of encodings, both by the C<open()> and by the C<open>
pragma, allows for flexible names: C<koi8-r> and C<KOI8R> will both be
understood.

Common encodings recognized by ISO, MIME, IANA, and various other
standardisation organisations are recognised; for a more detailed
list see L<Encode::Supported>.

C<read()> reads characters and returns the number of characters.
C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
and C<sysseek()>.

Notice that because of the default behaviour of not doing any
conversion upon input if there is no default layer,
it is easy to mistakenly write code that keeps on expanding a file
by repeatedly encoding the data:

    # BAD CODE WARNING
    open F, "file";
    local $/; ## read in the whole file of 8-bit characters
    $t = <F>;
    close F;
    open F, ">:encoding(utf8)", "file";
    print F $t; ## convert to UTF-8 on output
    close F;

If you run this code twice, the contents of the F<file> will be twice
UTF-8 encoded. Either a C<use open ':encoding(utf8)'> or explicitly
opening the F<file> for input as UTF-8 as well would have avoided the bug.

B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
Perl has been built with L<PerlIO>, which is the default
on most systems.

=head2 Displaying Unicode As Text

Sometimes you might want to display Perl scalars containing Unicode as
simple ASCII (or EBCDIC) text.
The following subroutine converts
its argument so that Unicode characters with code points greater than
255 are displayed as C<\x{...}>, control characters (like C<\n>) are
displayed as C<\x..>, and the rest of the characters as themselves:

    sub nice_string {
        join("",
          map { $_ > 255                    # if wide character...
                ? sprintf("\\x{%04X}", $_)  # \x{...}
                : chr($_) =~ /[[:cntrl:]]/  # else if control character...
                  ? sprintf("\\x%02X", $_)  # \x..
                  : quotemeta(chr($_))      # else quoted or as themselves
          } unpack("W*", $_[0]));           # unpack Unicode characters
    }

For example,

    nice_string("foo\x{100}bar\n")

returns the string

    'foo\x{0100}bar\x0A'

which is ready to be printed.

=head2 Special Cases

=over 4

=item *

Bit Complement Operator ~ And vec()

The bit complement operator C<~> may produce surprising results if
used on strings containing characters with ordinal values above
255. In such a case, the results are consistent with the internal
encoding of the characters, but not with much else. So don't do
that. Similarly for C<vec()>: you will be operating on the
internally-encoded bit patterns of the Unicode characters, not on
the code point values, which is very probably not what you want.

=item *

Peeking At Perl's Internal Encoding

Normal users of Perl should never care how Perl encodes any particular
Unicode string (because the normal ways to get at the contents of a
string with Unicode--via input and output--should always be via
explicitly-defined I/O layers). But if you must, there are two
ways of looking behind the scenes.

One way of peeking inside the internal encoding of Unicode characters
is to use C<unpack("C*", ...)> to get the bytes of whatever the string
encoding happens to be, or C<unpack("U0..", ...)> to get the bytes of the
UTF-8 encoding:

    # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
    print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";

Another way would be to use the Devel::Peek module:

    perl -MDevel::Peek -e 'Dump(chr(0x100))'

That shows the C<UTF8> flag in FLAGS and both the UTF-8 bytes
and Unicode characters in C<PV>. See also later in this document
the discussion about the C<utf8::is_utf8()> function.

=back

=head2 Advanced Topics

=over 4

=item *

String Equivalence

The question of string equivalence turns somewhat complicated
in Unicode: what do you mean by "equal"?

(Is C<LATIN CAPITAL LETTER A WITH ACUTE> equal to
C<LATIN CAPITAL LETTER A>?)

The short answer is that by default Perl compares equivalence (C<eq>,
C<ne>) based only on code points of the characters. In the above
case, the answer is no (because 0x00C1 != 0x0041). But sometimes, any
CAPITAL LETTER A's should be considered equal, or even A's of any case.

The long answer is that you need to consider character normalization
and casing issues: see L<Unicode::Normalize>, Unicode Technical Report #15,
L<Unicode Normalization Forms|http://www.unicode.org/unicode/reports/tr15> and
sections on case mapping in the L<Unicode Standard|http://www.unicode.org>.

As of Perl 5.8.0, the "Full" case-folding of I<Case
Mappings/SpecialCasing> is implemented, but bugs remain in C<qr//i> with them,
mostly fixed by 5.14, and essentially entirely by 5.18.

=item *

String Collation

People like to see their strings nicely sorted--or as Unicode
parlance goes, collated. But again, what do you mean by collate?

(Does C<LATIN CAPITAL LETTER A WITH ACUTE> come before or after
C<LATIN CAPITAL LETTER A WITH GRAVE>?)

The short answer is that by default, Perl compares strings (C<lt>,
C<le>, C<cmp>, C<ge>, C<gt>) based only on the code points of the
characters. In the above case, the answer is "after", since
C<0x00C1> > C<0x00C0>.

The long answer is that "it depends", and a good answer cannot be
given without knowing (at the very least) the language context.
See L<Unicode::Collate>, and I<Unicode Collation Algorithm>
L<http://www.unicode.org/unicode/reports/tr10/>

=back

=head2 Miscellaneous

=over 4

=item *

Character Ranges and Classes

Character ranges in regular expression bracketed character classes (e.g.,
C</[a-z]/>) and in the C<tr///> (also known as C<y///>) operator are not
magically Unicode-aware. What this means is that C<[A-Za-z]> will not
magically start to mean "all alphabetic letters" (not that it does mean that
even for 8-bit characters; for those, if you are using locales (L<perllocale>),
use C</[[:alpha:]]/>; and if not, use the 8-bit-aware property C<\p{alpha}>).

All the properties that begin with C<\p> (and its inverse C<\P>) are actually
character classes that are Unicode-aware. There are dozens of them; see
L<perluniprops>.

You can use Unicode code points as the end points of character ranges, and the
range will include all Unicode code points that lie between those end points.

=item *

String-To-Number Conversions

Unicode does define several other decimal--and numeric--characters
besides the familiar 0 to 9, such as the Arabic and Indic digits.
Perl does not support string-to-number conversion for digits other
than ASCII C<0> to C<9> (and ASCII C<a> to C<f> for hexadecimal).
To get safe conversions from any Unicode string, use
L<Unicode::UCD/num()>.
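
A minimal sketch of the difference, assuming your installed
C<Unicode::UCD> provides C<num()>; the string below holds
C<ARABIC-INDIC DIGIT FOUR> followed by C<ARABIC-INDIC DIGIT TWO>, and the
variable names are illustrative:

    use strict;
    use warnings;
    use Unicode::UCD 'num';

    my $digits = "\x{0664}\x{0662}";  # ARABIC-INDIC DIGITS FOUR, TWO

    # Ordinary numification only understands ASCII digits.
    my $plain = do { no warnings 'numeric'; $digits + 0 };

    # num() returns the numeric value, or undef if the string is
    # not something it considers a safe number.
    my $value = num($digits);

    print "plain: $plain\n";
    print "num(): ", (defined $value ? $value : "undef"), "\n";

Checking the result for C<undef> matters here, because C<num()> declines
to convert strings that mix digits from different scripts.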

=back

=head2 Questions With Answers

=over 4

=item *

Will My Old Scripts Break?

Very probably not. Unless you are generating Unicode characters
somehow, old behaviour should be preserved. About the only behaviour
that has changed and which could start generating Unicode is the old
behaviour of C<chr()> where supplying an argument more than 255
produced a character modulo 255. C<chr(300)>, for example, was equal
to C<chr(45)> or "-" (in ASCII), now it is LATIN CAPITAL LETTER I WITH
BREVE.

=item *

How Do I Make My Scripts Work With Unicode?

Very little work should be needed since nothing changes until you
generate Unicode data. The most important thing is getting input as
Unicode; for that, see the earlier I/O discussion.
To get full seamless Unicode support, add
C<use feature 'unicode_strings'> (or C<use 5.012> or higher) to your
script.

=item *

How Do I Know Whether My String Is In Unicode?

You shouldn't have to care. But you may if your Perl is before 5.14.0
or you haven't specified C<use feature 'unicode_strings'> or C<use
5.012> (or higher) because otherwise the semantics of the code points
in the range 128 to 255 are different depending on
whether the string they are contained within is in Unicode or not.
(See L<perlunicode/When Unicode Does Not Happen>.)

To determine if a string is in Unicode, use:

    print utf8::is_utf8($string) ? 1 : 0, "\n";

But note that this doesn't mean that any of the characters in the
string are necessarily UTF-8 encoded, or that any of the characters have
code points greater than 0xFF (255) or even 0x80 (128), or that the
string has any characters at all. All that C<is_utf8()> does is to
return the value of the internal "utf8ness" flag attached to the
C<$string>. If the flag is off, the bytes in the scalar are interpreted
as a single byte encoding.
If the flag is on, the bytes in the scalar
are interpreted as the (variable-length, potentially multi-byte) UTF-8 encoded
code points of the characters. Bytes added to a UTF-8 encoded string are
automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8 scalars
are merged (double-quoted interpolation, explicit concatenation, or
printf/sprintf parameter substitution), the result will be UTF-8 encoded
as if copies of the byte strings were upgraded to UTF-8: for example,

    $a = "ab\x80c";
    $b = "\x{100}";
    print "$a = $b\n";

the output string will be UTF-8-encoded C<ab\x80c = \x{100}\n>, but
C<$a> will stay byte-encoded.

Sometimes you might really need to know the byte length of a string
instead of the character length. For that use either the
C<Encode::encode_utf8()> function or the C<bytes> pragma
and the C<length()> function:

    my $unicode = chr(0x100);
    print length($unicode), "\n"; # will print 1
    require Encode;
    print length(Encode::encode_utf8($unicode)),"\n"; # will print 2
    use bytes;
    print length($unicode), "\n"; # will also print 2
                                  # (the 0xC4 0x80 of the UTF-8)
    no bytes;

=item *

How Do I Find Out What Encoding a File Has?

You might try L<Encode::Guess>, but it has a number of limitations.

=item *

How Do I Detect Data That's Not Valid In a Particular Encoding?

Use the C<Encode> package to try converting it.
For example,

    use Encode 'decode_utf8';

    if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
        # $string is valid utf8
    } else {
        # $string is not valid utf8
    }

Or use C<unpack> to try decoding it:

    use warnings;
    @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);

If invalid, a C<Malformed UTF-8 character> warning is produced. The "C0" means
"process the string character per character".
Without that, the
C<unpack("U*", ...)> would work in C<U0> mode (the default if the format
string starts with C<U>) and it would return the bytes making up the UTF-8
encoding of the target string, something that will always work.

=item *

How Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa?

This probably isn't as useful as you might think.
Normally, you shouldn't need to.

In one sense, what you are asking doesn't make much sense: encodings
are for characters, and binary data are not "characters", so converting
"data" into some encoding isn't meaningful unless you know what
character set and encoding the binary data is in, in which case it's
not just binary data, now is it?

If you have a raw sequence of bytes that you know should be
interpreted via a particular encoding, you can use C<Encode>:

    use Encode 'from_to';
    from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8

The call to C<from_to()> changes the bytes in C<$data>, but nothing
material about the nature of the string has changed as far as Perl is
concerned. Both before and after the call, the string C<$data>
contains just a bunch of 8-bit bytes. As far as Perl is concerned,
the encoding of the string remains as "system-native 8-bit bytes".

You might relate this to a fictional 'Translate' module:

    use Translate;
    my $phrase = "Yes";
    Translate::from_to($phrase, 'english', 'deutsch');
    ## phrase now contains "Ja"

The contents of the string change, but not the nature of the string.
Perl doesn't know any more after the call than before that the
contents of the string indicate the affirmative.

Back to converting data. If you have (or want) data in your system's
native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you can use
pack/unpack to convert to/from Unicode.
772 773 $native_string = pack("W*", unpack("U*", $Unicode_string)); 774 $Unicode_string = pack("U*", unpack("W*", $native_string)); 775 776If you have a sequence of bytes you B<know> is valid UTF-8, 777but Perl doesn't know it yet, you can make Perl a believer, too: 778 779 use Encode 'decode_utf8'; 780 $Unicode = decode_utf8($bytes); 781 782or: 783 784 $Unicode = pack("U0a*", $bytes); 785 786You can find the bytes that make up a UTF-8 sequence with 787 788 @bytes = unpack("C*", $Unicode_string) 789 790and you can create well-formed Unicode with 791 792 $Unicode_string = pack("U*", 0xff, ...) 793 794=item * 795 796How Do I Display Unicode? How Do I Input Unicode? 797 798See L<http://www.alanwood.net/unicode/> and 799L<http://www.cl.cam.ac.uk/~mgk25/unicode.html> 800 801=item * 802 803How Does Unicode Work With Traditional Locales? 804 805If your locale is a UTF-8 locale, starting in Perl v5.20, Perl works 806well for all categories except C<LC_COLLATE> dealing with sorting and 807the C<cmp> operator. 808 809For other locales, starting in Perl 5.16, you can specify 810 811 use locale ':not_characters'; 812 813to get Perl to work well with them. The catch is that you 814have to translate from the locale character set to/from Unicode 815yourself. See L</Unicode IE<sol>O> above for how to 816 817 use open ':locale'; 818 819to accomplish this, but full details are in L<perllocale/Unicode and 820UTF-8>, including gotchas that happen if you don't specify 821C<:not_characters>. 822 823=back 824 825=head2 Hexadecimal Notation 826 827The Unicode standard prefers using hexadecimal notation because 828that more clearly shows the division of Unicode into blocks of 256 characters. 829Hexadecimal is also simply shorter than decimal. You can use decimal 830notation, too, but learning to use hexadecimal just makes life easier 831with the Unicode standard. The C<U+HHHH> notation uses hexadecimal, 832for example. 
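
As a minimal sketch of moving between a character, its code point, and
the C<U+HHHH> spelling, something like the following should work.  The
variable names are only illustrative, and the four-digit zero padding is
an assumption that suits code points up to 0xFFFF:

    my $char = "A";
    my $cp   = ord($char);              # the code point as a plain number

    printf "U+%04X\n", $cp;             # prints U+0041
    printf "U+%04X\n", 0x3B1;           # prints U+03B1

    # Going the other way: strip the "U+" and hand the digits to hex()
    my $spelled = "U+03B1";
    (my $digits = $spelled) =~ s/^U\+//;
    print hex($digits), "\n";           # prints 945
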

The C<0x> prefix means a hexadecimal number; the digits are 0-9 I<and>
a-f (or A-F, case doesn't matter).  Each hexadecimal digit represents
four bits, or half a byte.  C<print 0x..., "\n"> will show a
hexadecimal number in decimal, and C<printf "%x\n", $decimal> will
show a decimal number in hexadecimal.  If you have just the
"hex digits" of a hexadecimal number, you can use the C<hex()> function.

    print 0x0009, "\n";    # 9
    print 0x000a, "\n";    # 10
    print 0x000f, "\n";    # 15
    print 0x0010, "\n";    # 16
    print 0x0011, "\n";    # 17
    print 0x0100, "\n";    # 256

    print 0x0041, "\n";    # 65

    printf "%x\n",  65;    # 41
    printf "%#x\n", 65;    # 0x41

    print hex("41"), "\n"; # 65

=head2 Further Resources

=over 4

=item *

Unicode Consortium

L<http://www.unicode.org/>

=item *

Unicode FAQ

L<http://www.unicode.org/unicode/faq/>

=item *

Unicode Glossary

L<http://www.unicode.org/glossary/>

=item *

Unicode Recommended Reading List

The Unicode Consortium has a list of articles and books, some of which
give a much more in-depth treatment of Unicode:
L<http://unicode.org/resources/readinglist.html>

=item *

Unicode Useful Resources

L<http://www.unicode.org/unicode/onlinedat/resources.html>

=item *

Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications

L<http://www.alanwood.net/unicode/>

=item *

UTF-8 and Unicode FAQ for Unix/Linux

L<http://www.cl.cam.ac.uk/~mgk25/unicode.html>

=item *

Legacy Character Sets

L<http://www.czyborra.com/>
L<http://www.eki.ee/letter/>

=item *

You can explore various information from the Unicode data files using
the C<Unicode::UCD> module.

=back

=head1 UNICODE IN OLDER PERLS

If you cannot upgrade your Perl to 5.8.0 or later, you can still
do some Unicode processing by using the modules C<Unicode::String>,
C<Unicode::Map8>, and C<Unicode::Map>, available from CPAN.
If you have the GNU recode installed, you can also use the
Perl front-end C<Convert::Recode> for character conversions.

The following are fast conversions from ISO 8859-1 (Latin-1) bytes
to UTF-8 bytes and back; the code works even with older Perl 5 versions.

    # ISO 8859-1 to UTF-8
    s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

    # UTF-8 to ISO 8859-1
    s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

=head1 SEE ALSO

L<perlunitut>, L<perlunicode>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
L<Unicode::UCD>

=head1 ACKNOWLEDGMENTS

Thanks to the kind readers of the perl5-porters@perl.org,
perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
mailing lists for their valuable feedback.

=head1 AUTHOR, COPYRIGHT, AND LICENSE

Copyright 2001-2011 Jarkko Hietaniemi E<lt>jhi@iki.fiE<gt>

This document may be distributed under the same terms as Perl itself.