=head1 NAME

perluniintro - Perl Unicode introduction

=head1 DESCRIPTION

This document gives a general idea of Unicode and how to use Unicode
in Perl.

=head2 Unicode

Unicode is a character set standard which plans to codify all of the
writing systems of the world, plus many other symbols.

Unicode and ISO/IEC 10646 are coordinated standards that provide code
points for characters in almost all modern character set standards,
covering more than 30 writing systems and hundreds of languages,
including all commercially-important modern languages. All characters
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded. The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.
Unicode 1.0 was released in October 1991, and 4.0 in April 2003.

A Unicode I<character> is an abstract entity. It is not bound to any
particular integer width, especially not to the C language C<char>.
Unicode is language-neutral and display-neutral: it does not encode the
language of the text, and it does not generally define fonts or other graphical
layout details. Unicode operates on characters and on text built from
those characters.

Unicode defines characters like C<LATIN CAPITAL LETTER A> or C<GREEK
SMALL LETTER ALPHA> and unique numbers for the characters, in this
case 0x0041 and 0x03B1, respectively. These unique numbers are called
I<code points>.

The Unicode standard prefers using hexadecimal notation for the code
points. If numbers like C<0x0041> are unfamiliar to you, take a peek
at a later section, L</"Hexadecimal Notation">. The Unicode standard
uses the notation C<U+0041 LATIN CAPITAL LETTER A>, to give the
hexadecimal code point and the normative name of the character.
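
As a minimal sketch of the C<U+HHHH> notation (not something the Unicode
standard itself provides), the following formats a character's code point
together with its name. It relies on C<ord()>, C<\N{...}>, and the
C<charnames> pragma, all of which are described later in this document;
the variable names are only illustrative.

    use strict;
    use warnings;
    use charnames ':full';

    my $alpha = "\N{GREEK SMALL LETTER ALPHA}";

    # ord() gives the code point; charnames::viacode() maps the
    # code point back to the character name.
    my $label = sprintf("U+%04X %s",
                        ord($alpha), charnames::viacode(ord($alpha)));
    print "$label\n";    # U+03B1 GREEK SMALL LETTER ALPHA

The C<%04X> format pads the hexadecimal value to at least four digits,
which matches how code points are conventionally written.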

Unicode also defines various I<properties> for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation";
these properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.

A Unicode I<logical> "character" can actually consist of more than one internal
I<actual> "character" or code point. For Western languages, this is adequately
modelled by a I<base character> (like C<LATIN CAPITAL LETTER A>) followed
by one or more I<modifiers> (like C<COMBINING ACUTE ACCENT>). This sequence of
base character and modifiers is called a I<combining character
sequence>. Some non-western languages require more complicated
models, so Unicode created the I<grapheme cluster> concept, and then the
I<extended grapheme cluster>. For example, a Korean Hangul syllable is
considered a single logical character, but most often consists of three actual
Unicode characters: a leading consonant followed by an interior vowel followed
by a trailing consonant.

Whether to call these extended grapheme clusters "characters" depends on your
point of view. If you are a programmer, you probably would tend towards seeing
each element in the sequences as one unit, or "character". The whole sequence
could be seen as one "character", however, from the user's point of view, since
that's probably what it looks like in the context of the user's language.

With this "whole sequence" view of characters, the total number of
characters is open-ended. But in the programmer's "one unit is one
character" point of view, the concept of "characters" is more
deterministic. In this document, we take that second point of view:
one "character" is one Unicode code point.

For some combinations, there are I<precomposed> characters.
C<LATIN CAPITAL LETTER A WITH ACUTE>, for example, is defined as
a single code point.
These precomposed characters are, however,
only available for some combinations, and are mainly
meant to support round-trip conversions between Unicode and legacy
standards (like the ISO 8859). In the general case, the composing
method is more extensible. To support conversion between
different compositions of the characters, various I<normalization
forms> to standardize representations are also defined.

Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character". The same character could
be represented differently in several legacy encodings. The
converse is also not true: some code points do not have an assigned
character. Firstly, there are unallocated code points within
otherwise used blocks. Secondly, there are special Unicode control
characters that do not represent true characters.

A common myth about Unicode is that it is "16-bit", that is,
Unicode is only represented as C<0x10000> (or 65536) characters from
C<0x0000> to C<0xFFFF>. B<This is untrue.> Since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
and since Unicode 3.1 (March 2001), characters have been defined
beyond C<0xFFFF>. The first C<0x10000> characters are called the
I<Plane 0>, or the I<Basic Multilingual Plane> (BMP). With Unicode
3.1, 17 (yes, seventeen) planes in all were defined--but they are
nowhere near full of defined characters, yet.

Another myth is about Unicode blocks--that they have something to
do with languages--that each block would define the characters used
by a language or a set of languages. B<This is also untrue.>
The division into blocks exists, but it is almost completely
accidental--an artifact of how the characters have been and
still are allocated.
Instead, there is a concept called I<scripts>, which is
more useful: there is C<Latin> script, C<Greek> script, and so on. Scripts
usually span varied parts of several blocks. For more information about
scripts, see L<perlunicode/Scripts>.

The Unicode code points are just abstract numbers. To input and
output these abstract numbers, the numbers must be I<encoded> or
I<serialised> somehow. Unicode defines several I<character encoding
forms>, of which I<UTF-8> is perhaps the most popular. UTF-8 is a
variable length encoding that encodes Unicode characters as 1 to 4
bytes. Other encodings
include UTF-16 and UTF-32 and their big- and little-endian variants
(UTF-8 is byte-order independent). ISO/IEC 10646 defines the UCS-2
and UCS-4 encoding forms.

For more information about encodings--for instance, to learn what
I<surrogates> and I<byte order marks> (BOMs) are--see L<perlunicode>.

=head2 Perl's Unicode Support

Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
natively. Perl 5.8.0, however, is the first recommended release for
serious Unicode work. The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but for example
regular expressions still do not work with Unicode in 5.6.1.

B<Starting from Perl 5.8.0, the use of C<use utf8> is needed only in much
more restricted circumstances.> In earlier releases the C<utf8> pragma
was used to declare
that operations in the current block or file would be Unicode-aware.
This model was found to be wrong, or at least clumsy: the "Unicodeness"
is now carried with the data, instead of being attached to the
operations. Only one case remains where an explicit C<use utf8> is
needed: if your Perl script itself is encoded in UTF-8, you can use
UTF-8 in your identifier names, and in string and regular expression
literals, by saying C<use utf8>.
This is not the default because
scripts with legacy 8-bit data in them would break. See L<utf8>.

=head2 Perl's Unicode Model

Perl supports both pre-5.6 strings of eight-bit native bytes, and
strings of Unicode characters. The principle is that Perl tries to
keep its data as eight-bit bytes for as long as possible, but as soon
as Unicodeness cannot be avoided, the data is (mostly) transparently upgraded
to Unicode. There are some problems--see L<perlunicode/The "Unicode Bug">.

Internally, Perl currently uses either the native eight-bit
character set of the platform (for example Latin-1) or
UTF-8 to encode Unicode strings. Specifically, if all code points in
the string are C<0xFF> or less, Perl uses the native eight-bit
character set. Otherwise, it uses UTF-8.

A user of Perl does not normally need to know nor care how Perl
happens to encode its internal strings, but it becomes relevant when
outputting Unicode strings to a stream without a PerlIO layer (one with
the "default" encoding). In such a case, the raw bytes used internally
(the native character set or UTF-8, as appropriate for each string)
will be used, and a "Wide character" warning will be issued if those
strings contain a character beyond 0x00FF.

For example,

    perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

produces a fairly useless mixture of native bytes and UTF-8, as well
as a warning:

    Wide character in print at ...

To output UTF-8, use the C<:encoding> or C<:utf8> output layer. Prepending

    binmode(STDOUT, ":utf8");

to this sample program ensures that the output is completely UTF-8,
and removes the program's warning.
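
To see what an output layer actually writes, one option is to print into
an in-memory file and inspect the resulting bytes. The following is only a
sketch: it assumes a Perl built with PerlIO (which in-memory files need),
and C<$buffer> and C<$hex> are illustrative names.

    use strict;
    use warnings;

    my $text = "\x{DF}\x{0100}";    # two characters

    # Print through an explicit UTF-8 layer into an in-memory file.
    my $buffer = '';
    open(my $out, '>:encoding(UTF-8)', \$buffer)
        or die "Cannot open in-memory file: $!";
    print {$out} $text;
    close($out);

    # $buffer now holds bytes; show each one in hexadecimal.
    my $hex = join(" ", map { sprintf("%02x", ord($_)) } split(//, $buffer));
    print "$hex\n";    # c3 9f c4 80

Both characters come out as two-byte UTF-8 sequences, and no "Wide
character" warning is issued because the stream has an explicit encoding.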

You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.

Note that this means that Perl expects other software to work, too:
if Perl has been led to believe that STDIN should be UTF-8, but then
STDIN coming in from another command is not UTF-8, Perl will complain
about the malformed UTF-8.

All features that combine Unicode and I/O also require using the new
PerlIO feature. Almost all Perl 5.8 platforms do use PerlIO, though:
you can see whether yours is by running "perl -V" and looking for
C<useperlio=define>.

=head2 Unicode and EBCDIC

Perl 5.8.0 also supports Unicode on EBCDIC platforms. There,
Unicode support is somewhat more complex to implement since
additional conversions are needed at every step.

Later Perl releases have added code that will not work on EBCDIC platforms, and
no one has complained, so the divergence has continued. If you want to run
Perl on an EBCDIC platform, send email to perlbug@perl.org.

On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
instead of UTF-8. The difference is that UTF-8 is "ASCII-safe" in
that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
"EBCDIC-safe".

=head2 Creating Unicode

To create Unicode characters in literals for code points above C<0xFF>,
use the C<\x{...}> notation in double-quoted strings:

    my $smiley = "\x{263a}";

Similarly, it can be used in regular expression literals:

    $smiley =~ /\x{263a}/;

At run-time you can use C<chr()>:

    my $hebrew_alef = chr(0x05d0);

See L</"Further Resources"> for how to find all these numeric codes.

Naturally, C<ord()> will do the reverse: it turns a character into
a code point.
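
A short sketch of the round trip between C<chr()> and C<ord()> follows.
The variable names are illustrative, and the three code points are simply
the first three letters of the Hebrew block.

    use strict;
    use warnings;

    my $hebrew_alef = chr(0x05d0);
    my $code_point  = ord($hebrew_alef);            # 0x05d0 again
    my $as_hex      = sprintf("%#x", $code_point);  # "0x5d0"

    # chr() takes an ordinary number, so the code point can be
    # computed at run-time.
    my @letters = map { chr($_) } 0x05d0 .. 0x05d2;
    my $word    = join("", @letters);               # three characters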

Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
and C<chr(...)> for arguments less than C<0x100> (decimal 256)
generate an eight-bit character for backward compatibility with older
Perls. For arguments of C<0x100> or more, Unicode characters are
always produced. If you want to force the production of Unicode
characters regardless of the numeric value, use C<pack("U", ...)>
instead of C<\x..>, C<\x{...}>, or C<chr()>.

You can also use the C<charnames> pragma to invoke characters
by name in double-quoted strings:

    use charnames ':full';
    my $arabic_alef = "\N{ARABIC LETTER ALEF}";

And, as mentioned above, you can also C<pack()> numbers into Unicode
characters:

    my $georgian_an = pack("U", 0x10a0);

Note that both C<\x{...}> and C<\N{...}> are compile-time string
constants: you cannot use variables in them. If you want similar
run-time functionality, use C<chr()> and C<charnames::vianame()>.

If you want to force the result to Unicode characters, use the special
C<"U0"> prefix. It consumes no arguments but causes the following bytes
to be interpreted as the UTF-8 encoding of Unicode characters:

    my $chars = pack("U0W*", 0x80, 0x42);

Likewise, you can stop such UTF-8 interpretation by using the special
C<"C0"> prefix.

=head2 Handling Unicode

Handling Unicode is for the most part transparent: just use the
strings as usual. Functions like C<index()>, C<length()>, and
C<substr()> will work on the Unicode characters; regular expressions
will work on the Unicode characters (see L<perlunicode> and L<perlretut>).

Note that Perl treats each code point of a grapheme cluster as a
separate character, so for example

    use charnames ':full';
    print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";

will print 2, not 1.
The only exception is that regular expressions
have C<\X> for matching an extended grapheme cluster.

Life is not quite so transparent, however, when working with legacy
encodings, I/O, and certain special cases:

=head2 Legacy Encodings

When you combine legacy data and Unicode, the legacy data needs
to be upgraded to Unicode. Normally ISO 8859-1 (or EBCDIC, if
applicable) is assumed.

The C<Encode> module knows about many encodings and has interfaces
for doing conversions between those encodings:

    use Encode 'decode';
    $data = decode("iso-8859-3", $data); # decode legacy bytes into Perl characters

=head2 Unicode I/O

Normally, writing out Unicode data

    print FH $some_string_with_unicode, "\n";

produces raw bytes that Perl happens to use to internally encode the
Unicode string. Perl's internal encoding depends on the system as
well as what characters happen to be in the string at the time. If
any of the characters are at code points C<0x100> or above, you will get
a warning. To ensure that the output is explicitly rendered in the
encoding you desire--and to avoid the warning--open the stream with
the desired encoding. Some examples:

    open FH, ">:utf8", "file";

    open FH, ">:encoding(ucs2)", "file";
    open FH, ">:encoding(UTF-8)", "file";
    open FH, ">:encoding(shift_jis)", "file";

and on already open streams, use C<binmode()>:

    binmode(STDOUT, ":utf8");

    binmode(STDOUT, ":encoding(ucs2)");
    binmode(STDOUT, ":encoding(UTF-8)");
    binmode(STDOUT, ":encoding(shift_jis)");

The matching of encoding names is loose: case does not matter, and
many encodings have several aliases. Note that the C<:utf8> layer
must always be specified exactly like that; it is I<not> subject to
the loose matching of encoding names.
Also note that C<:utf8> is unsafe for
input, because it accepts the data without validating that it is indeed valid
UTF-8.

See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
L<Encode::PerlIO> for the C<:encoding()> layer, and
L<Encode::Supported> for many encodings supported by the C<Encode>
module.

Reading in a file that you know happens to be encoded in one of the
Unicode or legacy encodings does not magically turn the data into
Unicode in Perl's eyes. To do that, specify the appropriate
layer when opening files:

    open(my $fh,'<:encoding(utf8)', 'anything');
    my $line_of_unicode = <$fh>;

    open(my $fh,'<:encoding(Big5)', 'anything');
    my $line_of_unicode = <$fh>;

The I/O layers can also be specified more flexibly with
the C<open> pragma. See L<open>, or look at the following example.

    use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
    open X, ">file";
    print X chr(0x100), "\n";
    close X;
    open Y, "<file";
    printf "%#x\n", ord(<Y>); # this should print 0x100
    close Y;

With the C<open> pragma you can use the C<:locale> layer:

    BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
    # the :locale will probe the locale environment variables like LC_ALL
    use open OUT => ':locale'; # russki parusski
    open(O, ">koi8");
    print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
    close O;
    open(I, "<koi8");
    printf "%#x\n", ord(<I>); # this should print 0xc1
    close I;

These methods install a transparent filter on the I/O stream that
converts data from the specified encoding when it is read in from the
stream. The result is always Unicode.

The L<open> pragma affects all the C<open()> calls after the pragma by
setting default layers. If you want to affect only certain
streams, use explicit layers directly in the C<open()> call.
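
The effect of putting an explicit layer on individual streams can be
sketched as follows. The example writes one character through an
C<:encoding(UTF-8)> layer and then reads the same file back twice, once as
raw bytes and once through the matching layer. The temporary file from
C<File::Temp> and all variable names are illustrative assumptions, not part
of the original examples.

    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close($tmp_fh);

    # Explicit layer on this stream only: characters are encoded on output.
    open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
    print {$out} chr(0x100);
    close($out);

    # Read the file as raw bytes: no decoding takes place.
    open(my $raw, '<', $path) or die "Cannot read $path: $!";
    binmode($raw);
    my $bytes = <$raw>;
    close($raw);

    # Read it again through the matching layer: the bytes are decoded.
    open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
    my $chars = <$in>;
    close($in);

    printf "%d byte(s), %d character(s)\n", length($bytes), length($chars);

Only the streams opened with a layer are affected; the raw handle still
sees the two bytes of the UTF-8 encoding.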

You can switch encodings on an already opened stream by using
C<binmode()>; see L<perlfunc/binmode>.

The C<:locale> layer does not currently (as of Perl 5.8.0) work with
C<open()> and C<binmode()>, only with the C<open> pragma. The
C<:utf8> and C<:encoding(...)> methods do work with all of C<open()>,
C<binmode()>, and the C<open> pragma.

Similarly, you may use these I/O layers on output streams to
automatically convert Unicode to the specified encoding when it is
written to the stream. For example, the following snippet copies the
contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
the file "text.utf8", encoded as UTF-8:

    open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
    open(my $unicode, '>:utf8', 'text.utf8');
    while (<$nihongo>) { print $unicode $_ }

The naming of encodings, both by C<open()> and by the C<open>
pragma, allows for flexible names: C<koi8-r> and C<KOI8R> will both be
understood.

Common encodings recognized by ISO, MIME, IANA, and various other
standardisation organisations are recognised; for a more detailed
list see L<Encode::Supported>.

C<read()> reads characters and returns the number of characters.
C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
and C<sysseek()>.

Notice that because of the default behaviour of not doing any
conversion upon input if there is no default layer,
it is easy to mistakenly write code that keeps on expanding a file
by repeatedly encoding the data:

    # BAD CODE WARNING
    open F, "file";
    local $/; ## read in the whole file of 8-bit characters
    $t = <F>;
    close F;
    open F, ">:encoding(utf8)", "file";
    print F $t; ## convert to UTF-8 on output
    close F;

If you run this code twice, the contents of the F<file> will be twice
UTF-8 encoded.
Using C<use open ':encoding(utf8)'>, or explicitly opening
F<file> for input as UTF-8 as well, would have avoided the bug.

B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
Perl has been built with the new PerlIO feature (which is the default
on most systems).

=head2 Displaying Unicode As Text

Sometimes you might want to display Perl scalars containing Unicode as
simple ASCII (or EBCDIC) text. The following subroutine converts
its argument so that Unicode characters with code points greater than
255 are displayed as C<\x{...}>, control characters (like C<\n>) are
displayed as C<\x..>, and the rest of the characters as themselves:

    sub nice_string {
        join("",
          map { $_ > 255 ?                  # if wide character...
                sprintf("\\x{%04X}", $_) :  # \x{...}
                chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
                sprintf("\\x%02X", $_) :    # \x..
                quotemeta(chr($_))          # else quoted or as themselves
          } unpack("W*", $_[0]));           # unpack Unicode characters
    }

For example,

    nice_string("foo\x{100}bar\n")

returns the string

    'foo\x{0100}bar\x0A'

which is ready to be printed.

=head2 Special Cases

=over 4

=item *

Bit Complement Operator ~ And vec()

The bit complement operator C<~> may produce surprising results if
used on strings containing characters with ordinal values above
255. In such a case, the results are consistent with the internal
encoding of the characters, but not with much else. So don't do
that. Similarly for C<vec()>: you will be operating on the
internally-encoded bit patterns of the Unicode characters, not on
the code point values, which is very probably not what you want.

=item *

Peeking At Perl's Internal Encoding

Normal users of Perl should never care how Perl encodes any particular
Unicode string (because the normal ways to get at the contents of a
string with Unicode--via input and output--should always be via
explicitly-defined I/O layers). But if you must, there are two
ways of looking behind the scenes.

One way of peeking inside the internal encoding of Unicode characters
is to use C<unpack("C*", ...)> to get the bytes of whatever the string
encoding happens to be, or C<unpack("U0..", ...)> to get the bytes of the
UTF-8 encoding:

    # this prints c4 80 for the UTF-8 bytes 0xc4 0x80
    print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";

Another way would be to use the Devel::Peek module:

    perl -MDevel::Peek -e 'Dump(chr(0x100))'

That shows the C<UTF8> flag in FLAGS and both the UTF-8 bytes
and Unicode characters in C<PV>. See also later in this document
the discussion about the C<utf8::is_utf8()> function.

=back

=head2 Advanced Topics

=over 4

=item *

String Equivalence

The question of string equivalence becomes somewhat complicated
in Unicode: what do you mean by "equal"?

(Is C<LATIN CAPITAL LETTER A WITH ACUTE> equal to
C<LATIN CAPITAL LETTER A>?)

The short answer is that by default Perl compares equivalence (C<eq>,
C<ne>) based only on code points of the characters. In the above
case, the answer is no (because 0x00C1 != 0x0041). But sometimes, any
CAPITAL LETTER As should be considered equal, or even As of any case.

The long answer is that you need to consider character normalization
and casing issues: see L<Unicode::Normalize>, Unicode Technical Report #15,
L<Unicode Normalization Forms|http://www.unicode.org/unicode/reports/tr15> and
sections on case mapping in the L<Unicode Standard|http://www.unicode.org>.

As of Perl 5.8.0, the "Full" case-folding of I<Case
Mappings/SpecialCasing> is implemented, but bugs remain in C<qr//i> with them.

=item *

String Collation

People like to see their strings nicely sorted--or as Unicode
parlance goes, collated. But again, what do you mean by collate?

(Does C<LATIN CAPITAL LETTER A WITH ACUTE> come before or after
C<LATIN CAPITAL LETTER A WITH GRAVE>?)

The short answer is that by default, Perl compares strings (C<lt>,
C<le>, C<cmp>, C<ge>, C<gt>) based only on the code points of the
characters. In the above case, the answer is "after", since
C<0x00C1> > C<0x00C0>.

The long answer is that "it depends", and a good answer cannot be
given without knowing (at the very least) the language context.
See L<Unicode::Collate>, and I<Unicode Collation Algorithm>
L<http://www.unicode.org/unicode/reports/tr10/>

=back

=head2 Miscellaneous

=over 4

=item *

Character Ranges and Classes

Character ranges in regular expression character classes (C</[a-z]/>)
and in the C<tr///> (also known as C<y///>) operator are not magically
Unicode-aware. What this means is that C<[A-Za-z]> will not magically start
to mean "all alphabetic letters" (not that it means that even for
8-bit characters; you should be using C</[[:alpha:]]/> in that case).

For specifying character classes like that in regular expressions,
you can use the various Unicode properties--C<\pL>, or perhaps
C<\p{Alphabetic}>, in this particular case. You can use Unicode
code points as the end points of character ranges, but there is no
magic associated with specifying a certain range. For further
information--there are dozens of Unicode character classes--see
L<perlunicode>.

=item *

String-To-Number Conversions

Unicode does define several other decimal--and numeric--characters
besides the familiar 0 to 9, such as the Arabic and Indic digits.
Perl does not support string-to-number conversion for digits other
than ASCII 0 to 9 (and ASCII a to f for hexadecimal).

=back

=head2 Questions With Answers

=over 4

=item *

Will My Old Scripts Break?

Very probably not. Unless you are generating Unicode characters
somehow, old behaviour should be preserved. About the only behaviour
that has changed and which could start generating Unicode is the old
behaviour of C<chr()> where supplying an argument more than 255
produced a character modulo 256. C<chr(300)>, for example, was equal
to C<chr(44)> or "," (in ASCII); now it is LATIN CAPITAL LETTER I WITH
BREVE.

=item *

How Do I Make My Scripts Work With Unicode?

Very little work should be needed since nothing changes until you
generate Unicode data. The most important thing is getting input as
Unicode; for that, see the earlier I/O discussion.

=item *

How Do I Know Whether My String Is In Unicode?

You shouldn't have to care. But you may, because currently the semantics of the
characters whose ordinals are in the range 128 to 255 is different depending on
whether the string they are contained within is in Unicode or not.
(See L<perlunicode/When Unicode Does Not Happen>.)

To determine if a string is in Unicode, use:

    print utf8::is_utf8($string) ? 1 : 0, "\n";

But note that this doesn't mean that any of the characters in the
string are necessarily UTF-8 encoded, or that any of the characters have
code points greater than 0xFF (255) or even 0x80 (128), or that the
string has any characters at all. All that C<is_utf8()> does is
return the value of the internal "utf8ness" flag attached to the
C<$string>.
If the flag is off, the bytes in the scalar are interpreted
as a single byte encoding. If the flag is on, the bytes in the scalar
are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
points of the characters. Bytes added to a UTF-8 encoded string are
automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8 scalars
are merged (double-quoted interpolation, explicit concatenation, and
printf/sprintf parameter substitution), the result will be UTF-8 encoded
as if copies of the byte strings were upgraded to UTF-8: for example,

    $a = "ab\x80c";
    $b = "\x{100}";
    print "$a = $b\n";

the output string will be UTF-8-encoded C<ab\x80c = \x{100}\n>, but
C<$a> will stay byte-encoded.

Sometimes you might really need to know the byte length of a string
instead of the character length. For that use either the
C<Encode::encode_utf8()> function or the C<bytes> pragma and
the C<length()> function:

    my $unicode = chr(0x100);
    print length($unicode), "\n"; # will print 1
    require Encode;
    print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
    use bytes;
    print length($unicode), "\n"; # will also print 2
                                  # (the 0xC4 0x80 of the UTF-8)

=item *

How Do I Detect Data That's Not Valid In a Particular Encoding?

Use the C<Encode> package to try converting it.
For example,

    use Encode 'decode_utf8';

    if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
        # $string is valid utf8
    } else {
        # $string is not valid utf8
    }

Or use C<unpack> to try decoding it:

    use warnings;
    @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);

If invalid, a C<Malformed UTF-8 character> warning is produced. The "C0" means
"process the string character per character".
Without that, the
C<unpack("U*", ...)> would work in C<U0> mode (the default if the format
string starts with C<U>) and it would return the bytes making up the UTF-8
encoding of the target string, something that will always work.

=item *

How Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa?

This probably isn't as useful as you might think.
Normally, you shouldn't need to.

In one sense, what you are asking doesn't make much sense: encodings
are for characters, and binary data are not "characters", so converting
"data" into some encoding isn't meaningful unless you know what
character set and encoding the binary data is in, in which case it's
not just binary data, now is it?

If you have a raw sequence of bytes that you know should be
interpreted via a particular encoding, you can use C<Encode>:

    use Encode 'from_to';
    from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8

The call to C<from_to()> changes the bytes in C<$data>, but nothing
material about the nature of the string has changed as far as Perl is
concerned. Both before and after the call, the string C<$data>
contains just a bunch of 8-bit bytes. As far as Perl is concerned,
the encoding of the string remains as "system-native 8-bit bytes".

You might relate this to a fictional 'Translate' module:

    use Translate;
    my $phrase = "Yes";
    Translate::from_to($phrase, 'english', 'deutsch');
    ## phrase now contains "Ja"

The contents of the string changes, but not the nature of the string.
Perl doesn't know any more after the call than before that the
contents of the string indicates the affirmative.

Back to converting data. If you have (or want) data in your system's
native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you can use
pack/unpack to convert to/from Unicode.

    $native_string  = pack("W*", unpack("U*", $Unicode_string));
    $Unicode_string = pack("U*", unpack("W*", $native_string));

If you have a sequence of bytes you B<know> is valid UTF-8,
but Perl doesn't know it yet, you can make Perl a believer, too:

    use Encode 'decode_utf8';
    $Unicode = decode_utf8($bytes);

or:

    $Unicode = pack("U0a*", $bytes);

You can find the bytes that make up a UTF-8 sequence with

    @bytes = unpack("C*", $Unicode_string)

and you can create well-formed Unicode with

    $Unicode_string = pack("U*", 0xff, ...)

=item *

How Do I Display Unicode? How Do I Input Unicode?

See L<http://www.alanwood.net/unicode/> and
L<http://www.cl.cam.ac.uk/~mgk25/unicode.html>

=item *

How Does Unicode Work With Traditional Locales?

In Perl, not very well. Avoid using locales through the C<locale>
pragma. Use only one or the other. But see L<perlrun> for the
description of the C<-C> switch and its environment counterpart,
C<$ENV{PERL_UNICODE}> to see how to enable various Unicode features,
for example by using locale settings.

=back

=head2 Hexadecimal Notation

The Unicode standard prefers using hexadecimal notation because
that more clearly shows the division of Unicode into blocks of 256 characters.
Hexadecimal is also simply shorter than decimal. You can use decimal
notation, too, but learning to use hexadecimal just makes life easier
with the Unicode standard. The C<U+HHHH> notation uses hexadecimal,
for example.

The C<0x> prefix means a hexadecimal number; the digits are 0-9 I<and>
a-f (or A-F, case doesn't matter). Each hexadecimal digit represents
four bits, or half a byte. C<print 0x..., "\n"> will show a
hexadecimal number in decimal, and C<printf "%x\n", $decimal> will
show a decimal number in hexadecimal.
If you have just the
"hex digits" of a hexadecimal number, you can use the C<hex()> function.

    print 0x0009, "\n";    # 9
    print 0x000a, "\n";    # 10
    print 0x000f, "\n";    # 15
    print 0x0010, "\n";    # 16
    print 0x0011, "\n";    # 17
    print 0x0100, "\n";    # 256

    print 0x0041, "\n";    # 65

    printf "%x\n",  65;    # 41
    printf "%#x\n", 65;    # 0x41

    print hex("41"), "\n"; # 65

=head2 Further Resources

=over 4

=item *

Unicode Consortium

L<http://www.unicode.org/>

=item *

Unicode FAQ

L<http://www.unicode.org/unicode/faq/>

=item *

Unicode Glossary

L<http://www.unicode.org/glossary/>

=item *

Unicode Useful Resources

L<http://www.unicode.org/unicode/onlinedat/resources.html>

=item *

Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications

L<http://www.alanwood.net/unicode/>

=item *

UTF-8 and Unicode FAQ for Unix/Linux

L<http://www.cl.cam.ac.uk/~mgk25/unicode.html>

=item *

Legacy Character Sets

L<http://www.czyborra.com/>
L<http://www.eki.ee/letter/>

=item *

The Unicode support files live within the Perl installation in the
directory

    $Config{installprivlib}/unicore

in Perl 5.8.0 or newer, and

    $Config{installprivlib}/unicode

in the Perl 5.6 series. (The renaming to F<lib/unicore> was done to
avoid naming conflicts with lib/Unicode in case-insensitive filesystems.)
The main Unicode data file is F<UnicodeData.txt> (or F<Unicode.301> in
Perl 5.6.1.) You can find the C<$Config{installprivlib}> by

    perl "-V:installprivlib"

You can explore various information from the Unicode data files using
the C<Unicode::UCD> module.

=back

=head1 UNICODE IN OLDER PERLS

If you cannot upgrade your Perl to 5.8.0 or later, you can still
do some Unicode processing by using the modules C<Unicode::String>,
C<Unicode::Map8>, and C<Unicode::Map>, available from CPAN.
If you have the GNU recode installed, you can also use the
Perl front-end C<Convert::Recode> for character conversions.

The following are fast conversions from ISO 8859-1 (Latin-1) bytes
to UTF-8 bytes and back; the code works even with older Perl 5 versions.

    # ISO 8859-1 to UTF-8
    s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

    # UTF-8 to ISO 8859-1
    s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

=head1 SEE ALSO

L<perlunitut>, L<perlunicode>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
L<Unicode::UCD>

=head1 ACKNOWLEDGMENTS

Thanks to the kind readers of the perl5-porters@perl.org,
perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
mailing lists for their valuable feedback.

=head1 AUTHOR, COPYRIGHT, AND LICENSE

Copyright 2001-2002 Jarkko Hietaniemi E<lt>jhi@iki.fiE<gt>

This document may be distributed under the same terms as Perl itself.