=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
L<the Perl Unicode tutorial|perlunitut> before reading this reference
document.

=over 4

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with data that is internally encoded in
UTF-8 -- or instead uses a traditional byte scheme when presented with
byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back

=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
be used to force byte semantics on Unicode data.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC. This translation is done without
regard to the system's native 8-bit encoding.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or C<use utf8>, the latter requires a BOM.)

Unicode characters can also be added to a string by using the C<\x{...}>
notation.
The Unicode code for the desired character, in hexadecimal,
should be placed in the braces. For instance, a smiley face is
C<\x{263A}>. This encoding scheme works for all characters, but
for characters under 0x100, note that Perl may use an 8 bit encoding
internally, for optimization and/or backward compatibility.

Additionally, if you

    use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.

See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character. C<\X> is equivalent to
C<(?:\PM\pM*)>.
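
As a rough illustration of the C<\X> item above (a sketch only; the
variable names are invented for this example), the following compares the
number of code points in a string with the number of combining character
sequences that C<\X> finds in it:

```perl
use strict;
use warnings;

# "e" followed by U+0301 COMBINING ACUTE ACCENT, then a plain "a".
my $str = "e\x{0301}a";

# Each \X match is one base character plus any marks that follow it.
my @clusters = $str =~ /(\X)/g;

printf "code points: %d, combining sequences: %d\n",
    length($str), scalar @clusters;
```

C<length()> counts the accent as its own character, while C<\X> keeps it
attached to the base letter it modifies.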

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see pack('U0', ...) and pack('C0', ...).

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction.

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
specifically does not switch is C<vec()>. Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Again, think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points. There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 256. Most importantly,
De Morgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

lc(), uc(), lcfirst(), and ucfirst() work for the following cases:

=over 8

=item *

the case mapping is from a single Unicode character to another
single Unicode character, or

=item *

the case mapping is from a single Unicode character to more
than one Unicode character.

=back

Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
since Perl does not understand the concept of Unicode locales.

See the Unicode Technical Report #21, Case Mappings, for more details.

But you can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).

See L</"User-Defined Case Mappings"> for more details.

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

=head2 Unicode Character Properties

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property. Braces are not
required for single letter properties, so C<\p{M}> is equivalent to
C<\pM>.
Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.

The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant. It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
becomes C<Latin1Supplement>.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

B<NOTE: the properties, scripts, and blocks listed here are as of
Unicode 5.0.0 in July 2006.>

=over 4

=item General Category

Here are the basic Unicode General Category properties, followed by their
long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.

    Short       Long

    L           Letter
    LC          CasedLetter
    Lu          UppercaseLetter
    Ll          LowercaseLetter
    Lt          TitlecaseLetter
    Lm          ModifierLetter
    Lo          OtherLetter

    M           Mark
    Mn          NonspacingMark
    Mc          SpacingMark
    Me          EnclosingMark

    N           Number
    Nd          DecimalNumber
    Nl          LetterNumber
    No          OtherNumber

    P           Punctuation
    Pc          ConnectorPunctuation
    Pd          DashPunctuation
    Ps          OpenPunctuation
    Pe          ClosePunctuation
    Pi          InitialPunctuation
                (may behave like Ps or Pe depending on usage)
    Pf          FinalPunctuation
                (may behave like Ps or Pe depending on usage)
    Po          OtherPunctuation

    S           Symbol
    Sm          MathSymbol
    Sc          CurrencySymbol
    Sk          ModifierSymbol
    So          OtherSymbol

    Z           Separator
    Zs          SpaceSeparator
    Zl          LineSeparator
    Zp          ParagraphSeparator

    C           Other
    Cc          Control
    Cf          Format
    Cs          Surrogate   (not usable)
    Co          PrivateUse
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special cases, which are aliases for the set of
C<Ll>, C<Lu>, and C<Lt>.

Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. C<Cs> is therefore not
supported.
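
The relationship between single-letter and two-letter categories can be
sketched as follows. This is only an illustration; C<major_category> and
C<@major> are hypothetical names made up for the example, not part of Perl:

```perl
use strict;
use warnings;

# The seven single-letter General Category classes, each as an
# anchored one-character pattern.
my @major = (
    [ L => qr/\A\pL\z/ ],
    [ M => qr/\A\pM\z/ ],
    [ N => qr/\A\pN\z/ ],
    [ P => qr/\A\pP\z/ ],
    [ S => qr/\A\pS\z/ ],
    [ Z => qr/\A\pZ\z/ ],
    [ C => qr/\A\pC\z/ ],
);

# Hypothetical helper: return the single-letter class of one character.
sub major_category {
    my ($ch) = @_;
    for my $pair (@major) {
        my ($name, $re) = @$pair;
        return $name if $ch =~ $re;
    }
    return;
}

for my $ch ("A", "\x{0301}", "7", ",", "\x{263A}", " ") {
    printf "U+%04X is in class %s\n", ord($ch), major_category($ch);
}
```

A two-letter property narrows the test further: U+01C5 matches both
C<\p{L}> and C<\p{Lt}>, but not C<\p{Lu}>.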

=item Bidirectional Character Types

Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties in
the BidiClass class:

    Property    Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Right-to-Left Arabic
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Number Separator
    ET          European Number Terminator
    AN          Arabic Number
    CS          Common Number Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

For example, C<\p{BidiClass:R}> matches characters that are normally
written right to left.

=item Scripts

The script names which can be used by C<\p{...}> and C<\P{...}>,
such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:

    Arabic
    Armenian
    Balinese
    Bengali
    Bopomofo
    Braille
    Buginese
    Buhid
    CanadianAboriginal
    Cherokee
    Coptic
    Cuneiform
    Cypriot
    Cyrillic
    Deseret
    Devanagari
    Ethiopic
    Georgian
    Glagolitic
    Gothic
    Greek
    Gujarati
    Gurmukhi
    Han
    Hangul
    Hanunoo
    Hebrew
    Hiragana
    Inherited
    Kannada
    Katakana
    Kharoshthi
    Khmer
    Lao
    Latin
    Limbu
    LinearB
    Malayalam
    Mongolian
    Myanmar
    NewTaiLue
    Nko
    Ogham
    OldItalic
    OldPersian
    Oriya
    Osmanya
    PhagsPa
    Phoenician
    Runic
    Shavian
    Sinhala
    SylotiNagri
    Syriac
    Tagalog
    Tagbanwa
    TaiLe
    Tamil
    Telugu
    Thaana
    Thai
    Tibetan
    Tifinagh
    Ugaritic
    Yi

=item Extended property classes

Extended property classes can supplement the basic
properties, defined by the F<PropList> Unicode database:

    ASCIIHexDigit
    BidiControl
    Dash
    Deprecated
    Diacritic
    Extender
    HexDigit
    Hyphen
    Ideographic
    IDSBinaryOperator
    IDSTrinaryOperator
    JoinControl
    LogicalOrderException
    NoncharacterCodePoint
    OtherAlphabetic
    OtherDefaultIgnorableCodePoint
    OtherGraphemeExtend
    OtherIDStart
    OtherIDContinue
    OtherLowercase
    OtherMath
    OtherUppercase
    PatternSyntax
    PatternWhiteSpace
    QuotationMark
    Radical
    SoftDotted
    STerm
    TerminalPunctuation
    UnifiedIdeograph
    VariationSelector
    WhiteSpace

and there are further derived properties:

    Alphabetic  =  Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic
    Lowercase   =  Ll + OtherLowercase
    Uppercase   =  Lu + OtherUppercase
    Math        =  Sm + OtherMath

    IDStart     =  Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart
    IDContinue  =  IDStart + Mn + Mc + Nd + Pc + OtherIDContinue

    DefaultIgnorableCodePoint
                =  OtherDefaultIgnorableCodePoint
                   + Cf + Cc + Cs + Noncharacters + VariationSelector
                   - WhiteSpace - FFF9..FFFB (Annotation Characters)

    Any         =  Any code points (i.e. U+0000 to U+10FFFF)
    Assigned    =  Any non-Cn code points (i.e. synonym for \P{Cn})
    Unassigned  =  Synonym for \p{Cn}
    ASCII       =  ASCII (i.e. U+0000 to U+007F)

    Common      =  Any character (or unassigned code point)
                   not explicitly assigned to a script

=item Use of "Is" Prefix

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
example, is equal to C<\P{Lu}>.

=item Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of 256
Unicode characters.
For example, the C<Latin> script contains letters
from many blocks but does not contain all the characters from those
blocks. It does not, for example, contain digits, because digits are
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called C<Common>.

For more about scripts, see the UAX#24 "Script Names":

    http://www.unicode.org/reports/tr24/

For more about blocks, see:

    http://www.unicode.org/Public/UNIDATA/Blocks.txt

Block names are given with the C<In> prefix. For example, the
Katakana block is referenced via C<\p{InKatakana}>. The C<In>
prefix may be omitted if there is no naming conflict with a script
or any other property, but it is recommended that C<In> always be used
for block tests to avoid confusion.

These block names are supported:

    InAegeanNumbers
    InAlphabeticPresentationForms
    InAncientGreekMusicalNotation
    InAncientGreekNumbers
    InArabic
    InArabicPresentationFormsA
    InArabicPresentationFormsB
    InArabicSupplement
    InArmenian
    InArrows
    InBalinese
    InBasicLatin
    InBengali
    InBlockElements
    InBopomofo
    InBopomofoExtended
    InBoxDrawing
    InBraillePatterns
    InBuginese
    InBuhid
    InByzantineMusicalSymbols
    InCJKCompatibility
    InCJKCompatibilityForms
    InCJKCompatibilityIdeographs
    InCJKCompatibilityIdeographsSupplement
    InCJKRadicalsSupplement
    InCJKStrokes
    InCJKSymbolsAndPunctuation
    InCJKUnifiedIdeographs
    InCJKUnifiedIdeographsExtensionA
    InCJKUnifiedIdeographsExtensionB
    InCherokee
    InCombiningDiacriticalMarks
    InCombiningDiacriticalMarksSupplement
    InCombiningDiacriticalMarksforSymbols
    InCombiningHalfMarks
    InControlPictures
    InCoptic
    InCountingRodNumerals
    InCuneiform
    InCuneiformNumbersAndPunctuation
    InCurrencySymbols
    InCypriotSyllabary
    InCyrillic
    InCyrillicSupplement
    InDeseret
    InDevanagari
    InDingbats
    InEnclosedAlphanumerics
    InEnclosedCJKLettersAndMonths
    InEthiopic
    InEthiopicExtended
    InEthiopicSupplement
    InGeneralPunctuation
    InGeometricShapes
    InGeorgian
    InGeorgianSupplement
    InGlagolitic
    InGothic
    InGreekExtended
    InGreekAndCoptic
    InGujarati
    InGurmukhi
    InHalfwidthAndFullwidthForms
    InHangulCompatibilityJamo
    InHangulJamo
    InHangulSyllables
    InHanunoo
    InHebrew
    InHighPrivateUseSurrogates
    InHighSurrogates
    InHiragana
    InIPAExtensions
    InIdeographicDescriptionCharacters
    InKanbun
    InKangxiRadicals
    InKannada
    InKatakana
    InKatakanaPhoneticExtensions
    InKharoshthi
    InKhmer
    InKhmerSymbols
    InLao
    InLatin1Supplement
    InLatinExtendedA
    InLatinExtendedAdditional
    InLatinExtendedB
    InLatinExtendedC
    InLatinExtendedD
    InLetterlikeSymbols
    InLimbu
    InLinearBIdeograms
    InLinearBSyllabary
    InLowSurrogates
    InMalayalam
    InMathematicalAlphanumericSymbols
    InMathematicalOperators
    InMiscellaneousMathematicalSymbolsA
    InMiscellaneousMathematicalSymbolsB
    InMiscellaneousSymbols
    InMiscellaneousSymbolsAndArrows
    InMiscellaneousTechnical
    InModifierToneLetters
    InMongolian
    InMusicalSymbols
    InMyanmar
    InNKo
    InNewTaiLue
    InNumberForms
    InOgham
    InOldItalic
    InOldPersian
    InOpticalCharacterRecognition
    InOriya
    InOsmanya
    InPhagspa
    InPhoenician
    InPhoneticExtensions
    InPhoneticExtensionsSupplement
    InPrivateUseArea
    InRunic
    InShavian
    InSinhala
    InSmallFormVariants
    InSpacingModifierLetters
    InSpecials
    InSuperscriptsAndSubscripts
    InSupplementalArrowsA
    InSupplementalArrowsB
    InSupplementalMathematicalOperators
    InSupplementalPunctuation
    InSupplementaryPrivateUseAreaA
    InSupplementaryPrivateUseAreaB
    InSylotiNagri
    InSyriac
    InTagalog
    InTagbanwa
    InTags
    InTaiLe
    InTaiXuanJingSymbols
    InTamil
    InTelugu
    InThaana
    InThai
    InTibetan
    InTifinagh
    InUgaritic
    InUnifiedCanadianAboriginalSyllabics
    InVariationSelectors
    InVariationSelectorsSupplement
    InVerticalForms
    InYiRadicals
    InYiSyllables
    InYijingHexagramSymbols

=back

=head2 User-Defined Character Properties

You can define your own character properties by defining subroutines
whose names begin with "In" or "Is". The subroutines can be defined in
any package. The user-defined properties can be used in the regular
expression C<\p> and C<\P> constructs; if you are using a user-defined
property from a package other than the one you are in, you must specify
its package in the C<\p> or C<\P> construct.

    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }

Note that the effect is compile-time and immutable once defined.

The subroutines must return a specially-formatted string, with one
or more newline-separated lines. Each line must be one of the following:

=over 4

=item *

A single hexadecimal number denoting a Unicode code point to include.

=item *

Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.

=item *

Something to include, prefixed by "+": a built-in character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to exclude, prefixed by "-": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters except those in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.

=item *

Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to keep only the characters that are also in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.

=back

For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define

    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }

Imagine that the here-doc end marker is at the beginning of the line.
Now you can use C<\p{InKana}> and C<\P{InKana}>.

You could also have used the existing block property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the non-characters:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

The negation is useful for defining (surprise!) negated classes.

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

Intersection is useful for getting the common characters matched by
two (or more) classes.

    sub InFooAndBar {
        return <<'END';
    +main::Foo
    &main::Bar
    END
    }

It's important to remember not to use "&" for the first set -- that
would be intersecting with nothing (resulting in an empty set).

=head2 User-Defined Case Mappings

You can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
The principle is similar to that of user-defined character
properties: define subroutines in the C<main> package
with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
the first character in ucfirst()), and C<ToUpper> (for uc(), and the
rest of the characters in ucfirst()).

The string returned by the subroutines must now be three
hexadecimal numbers separated by tabs: start of the source
range, end of the source range, and start of the destination range.
For example:

    sub ToUpper {
        return <<END;
    0061\t0063\t0041
    END
    }

defines a uc() mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", "C"; all other characters will remain
unchanged.

If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty, but the two tab characters are still needed.
For example:

    sub ToLower {
        return <<END;
    0041\t\t0061
    END
    }

defines a lc() mapping that causes only "A" to be mapped to "a"; all
other characters will remain unchanged.

(For serious hackers only) If you want to introspect the default
mappings, you can find the data in the directory
C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as
the here-document, and the C<utf8::ToSpecFoo> are special exception
mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
The C<Digit> and C<Fold> mappings that one can see in the directory
are not directly user-accessible; one can use either the
C<Unicode::UCD> module, or just match case-insensitively (that's when
the C<Fold> mapping is used).

A final note on the user-defined case mappings: they will be used
only if the scalar has been marked as having Unicode characters.
Old byte-style strings will not be affected.

=head2 Character Encodings for Input and Output

See L<Encode>.

=head2 Unicode Regular Expression Support Level

The following list of Unicode support for regular expressions describes
all the features currently supported. The references to "Level N"
and the section numbers refer to the Unicode Technical Standard #18,
"Unicode Regular Expressions", version 11, in May 2005.

=over 4

=item *

Level 1 - Basic Unicode Support

        RL1.1   Hex Notation                        - done          [1]
        RL1.2   Properties                          - done          [2][3]
        RL1.2a  Compatibility Properties            - done          [4]
        RL1.3   Subtraction and Intersection        - MISSING       [5]
        RL1.4   Simple Word Boundaries              - done          [6]
        RL1.5   Simple Loose Matches                - done          [7]
        RL1.6   Line Boundaries                     - MISSING       [8]
        RL1.7   Supplementary Code Points           - done          [9]

        [1] \x{...}
        [2] \p{...} \P{...}
        [3] supports not only minimal list (general category, scripts,
            Alphabetic, Lowercase, Uppercase, WhiteSpace,
            NoncharacterCodePoint, DefaultIgnorableCodePoint, Any,
            ASCII, Assigned), but also bidirectional types, blocks, etc.
            (see L</"Unicode Character Properties">)
        [4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
        [5] can use regular expression look-ahead [a] or
            user-defined character properties [b] to emulate set operations
        [6] \b \B
        [7] note that Perl does Full case-folding in matching, not Simple:
            for example U+1F88 is equivalent to U+1F00 U+03B9,
            not to U+1F80.
            This difference matters for certain Greek
            capital letters with certain modifiers: the Full case-folding
            decomposes the letter, while the Simple case-folding would map
            it to a single character.
        [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r),
            CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029);
            should also affect <>, $., and script line numbers;
            should not split lines within CRLF [c] (i.e. there is no empty
            line between \r and \n)
        [9] UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to U+10FFFF
            but also beyond U+10FFFF [d]

[a] You can mimic class subtraction using lookahead.
For example, what UTS#18 might write as

    [{Greek}-[{UNASSIGNED}]]

in Perl can be written as:

    (?!\p{Unassigned})\p{InGreekAndCoptic}
    (?=\p{Assigned})\p{InGreekAndCoptic}

But in this particular example, you probably really want

    \p{Greek}

which will match assigned characters known to be part of the Greek script.

Also see the Unicode::Regex::Set module; it implements the full
UTS#18 grouping, intersection, union, and removal (subtraction) syntax.

[b] '+' for union, '-' for removal (set-difference), '&' for intersection
(see L</"User-Defined Character Properties">)

[c] Try the C<:crlf> layer (see L<PerlIO>).

[d] Avoid C<use warnings 'utf8';> (or say C<no warnings 'utf8';>) to allow
U+FFFF (C<\x{FFFF}>).

=item *

Level 2 - Extended Unicode Support

        RL2.1   Canonical Equivalents           - MISSING       [10][11]
        RL2.2   Default Grapheme Clusters       - MISSING       [12][13]
        RL2.3   Default Word Boundaries         - MISSING       [14]
        RL2.4   Default Loose Matches           - MISSING       [15]
        RL2.5   Name Properties                 - MISSING       [16]
        RL2.6   Wildcard Properties             - MISSING

        [10] see UAX#15 "Unicode Normalization Forms"
        [11] have Unicode::Normalize but not integrated to regexes
        [12] have \X but at this level .
             should equal that
        [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable
             clusters as a single grapheme cluster.
        [14] see UAX#29, Word Boundaries
        [15] see UAX#21 "Case Mappings"
        [16] have \N{...} but neither compute names of CJK Ideographs
             and Hangul Syllables nor use a loose match [e]

[e] C<\N{...}> allows namespaces (see L<charnames>).

=item *

Level 3 - Tailored Support

        RL3.1   Tailored Punctuation            - MISSING
        RL3.2   Tailored Grapheme Clusters      - MISSING       [17][18]
        RL3.3   Tailored Word Boundaries        - MISSING
        RL3.4   Tailored Loose Matches          - MISSING
        RL3.5   Tailored Ranges                 - MISSING
        RL3.6   Context Matching                - MISSING       [19]
        RL3.7   Incremental Matches             - MISSING
      ( RL3.8   Unicode Set Sharing )
        RL3.9   Possible Match Sets             - MISSING
        RL3.10  Folded Matching                 - MISSING       [20]
        RL3.11  Submatchers                     - MISSING

        [17] see UTS#10 "Unicode Collation Algorithm"
        [18] have Unicode::Collate but not integrated to regexes
        [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see
             outside of the target substring
        [20] need insensitive matching for linguistic features other than case;
             for example, hiragana to katakana, wide and narrow, simplified Han
             to traditional Han (see UTR#30 "Character Foldings")

=back

=head2 Unicode Encodings

Unicode characters are assigned to I<code points>, which are abstract
numbers. To use these numbers, various encodings are needed.

=over 4

=item *

UTF-8

UTF-8 is a variable-length (1 to 6 bytes, current character allocations
require 4 bytes), byte-order independent encoding. For ASCII (and we
really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
transparent.

The following table is from Unicode 3.2.

 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF       C2..DF    80..BF
   U+0800..U+0FFF       E0        A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       ******* ill-formed *******
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
C<80..8F> in C<U+100000..U+10FFFF>.  The "gaps" are caused by legal
UTF-8 avoiding non-shortest encodings: it is technically possible to
UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always
be used.  So that's what Perl does.

Another way to look at it is via bits:

 Code Points                    1st Byte   2nd Byte  3rd Byte  4th Byte

                    0aaaaaaa     0aaaaaaa
            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa

As you can see, the continuation bytes all begin with C<10>, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.

=item *

UTF-EBCDIC

Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.

=item *

UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)

The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.

UTF-16 is a 2 or 4 byte encoding.  The Unicode code points
C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
points C<U+10000..U+10FFFF> in two 16-bit units.
The latter case uses I<surrogates>, the first 16-bit unit being the
I<high surrogate>, and the second being the I<low surrogate>.

Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
range of Unicode code points in pairs of 16-bit units.  The I<high
surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
are the range C<U+DC00..U+DFFF>.  The surrogate encoding is

    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

and the decoding is

    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);

If you try to generate surrogates (for example by using chr()), you
will get a warning if warnings are turned on, because those code
points are not valid for a Unicode character.

Because of the 16-bitness, UTF-16 is byte-order dependent.  UTF-16
itself can be used for in-memory computations, but if storage or
transfer is required, either the UTF-16BE (big-endian) or the UTF-16LE
(little-endian) encoding must be chosen.

This introduces another problem: what if you just know that your data
is UTF-16, but you don't know which endianness?  Byte Order Marks, or
BOMs, are a solution to this.  A special character has been reserved
in Unicode to function as a byte order marker: the character with the
code point C<U+FEFF> is the BOM.

The trick is that if you read a BOM, you will know the byte order,
since if it was written on a big-endian platform, you will read the
bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
you will read the bytes C<0xFF 0xFE>.  (And if the originating platform
was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
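
As a rough illustration of the byte sequences just described, here is a
minimal sketch of a BOM sniffer.  The name C<guess_bom> is hypothetical
(it is not part of Perl or of any module), and the sketch assumes that
its argument is a byte string holding the start of the data.  It checks
only the three signatures mentioned above.

    use strict;
    use warnings;

    # guess_bom() is a made-up helper for illustration only.
    # It looks at nothing but the leading bytes of $bytes.
    sub guess_bom {
        my ($bytes) = @_;
        return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
        return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
        return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
        return;    # no BOM recognized
    }

    my $guess = guess_bom("\xFF\xFEA\x00");    # BOM, then "A" in UTF-16LE
    print defined $guess ? $guess : 'no BOM', "\n";

A real detector would probably need to test for the longer UTF-32
signatures first, because the UTF-32LE one also begins with
C<0xFF 0xFE>.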

The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in
big-endian format".

=item *

UTF-32, UTF-32BE, UTF-32LE

The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed.  The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
C<0xFF 0xFE 0x00 0x00> for LE.

=item *

UCS-2, UCS-4

Encodings defined by the ISO 10646 standard.  UCS-2 is a 16-bit
encoding.  Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
because it does not use surrogates.  UCS-4 is a 32-bit encoding,
functionally identical to UTF-32.

=item *

UTF-7

A seven-bit safe (non-eight-bit) encoding, which is useful if the
transport or storage is not eight-bit safe.  Defined by RFC 2152.

=back

=head2 Security Implications of Unicode

=over 4

=item *

Malformed UTF-8

Unfortunately, the specification of UTF-8 leaves some room for
interpretation of how many bytes of encoded output one should generate
from one input Unicode character.  Strictly speaking, the shortest
possible sequence of UTF-8 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a UTF-8 connection.  Perl always generates the
shortest length UTF-8, and with warnings on Perl will warn about
non-shortest length UTF-8 along with other malformations, such as the
surrogates, which are not real Unicode code points.

=item *

Regular expressions behave slightly differently between byte data and
character (Unicode) data.
For example, the "word character" character
class C<\w> will work differently depending on whether the data is
eight-bit bytes or Unicode.

In the first case, the set of C<\w> characters is either small--the
default set of alphabetic characters, digits, and the "_"--or, if you
are using a locale (see L<perllocale>), the C<\w> might contain a few
more letters according to your language and country.

In the second case, the C<\w> set of characters is much, much larger.
Most importantly, even in the set of the first 256 characters, it will
probably match different characters: unlike most locales, which are
specific to a language and country pair, Unicode classifies all the
characters that are letters I<somewhere> as C<\w>.  For example, your
locale might not think that LATIN SMALL LETTER ETH is a letter (unless
you happen to speak Icelandic), but Unicode does.

As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of bytes and the new world of
characters, upgrading from bytes to characters when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to characters should happen.  Characters shouldn't get
downgraded to bytes, either.  It is possible to accidentally mix bytes
and characters, however (see L<perluniintro>), in which case C<\w> in
regular expressions might start behaving differently.  Review your
code.  Use warnings and the C<strict> pragma.

=back

=head2 Unicode in Perl on EBCDIC

The way Unicode is handled on EBCDIC platforms is still
experimental.  On such platforms, references to UTF-8 encoding in this
document and elsewhere should be read as meaning the UTF-EBCDIC
specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
are specifically discussed.
There is no C<utfebcdic> pragma or
":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
the platform's "natural" 8-bit encoding of Unicode.  See L<perlebcdic>
for more discussion of the issues.

=head2 Locales

Usually locale settings and Unicode do not affect each other, but
there are a couple of exceptions:

=over 4

=item *

You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.

=item *

Perl tries really hard to work both with Unicode and the old
byte-oriented world.  Most often this is nice, but sometimes Perl's
straddling of the proverbial fence causes problems.

=back

=head2 When Unicode Does Not Happen

While Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like C<@ARGV> which can be interpreted
as Unicode (UTF-8), there are still many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.

The following are such interfaces.  For all of these interfaces Perl
currently (as of 5.8.3) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the C<encoding> pragma has been used.

One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept.  The same goes for C<qx> and C<system>: how well will
the 'command line interface' (and which of them?) handle Unicode?

=over 4

=item *

chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X

=item *

%ENV

=item *

glob (aka the <*>)

=item *

open, opendir, sysopen

=item *

qx (aka the backtick operator), system

=item *

readdir, readlink

=back

=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen">) there are
situations where you simply need to force Perl to believe that a byte
string is UTF-8, or vice versa.  The low-level calls
utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
the answers.

Do not use them without careful thought, though: Perl may easily get
very confused, angry, or even crash, if you suddenly change the 'nature'
of a scalar like that.  You have to be especially careful if you use
utf8::upgrade(): not every byte string is valid UTF-8.

=head2 Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the
following C APIs useful.  See also L<perlguts/"Unicode Support"> for an
explanation about Unicode at the XS level, and L<perlapi> for the API
details.

=over 4

=item *

C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect.  C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored.  The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar.  What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string.
The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string.  Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.

=item *

C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes.

=item *

C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
returns the Unicode character code point and, optionally, the length of
the UTF-8 byte sequence.

=item *

C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
in characters.  C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
scalar.

=item *

C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form.  C<sv_utf8_downgrade(sv)> does the opposite, if
possible.  C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag.  C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>.  Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that.  C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).

=item *

C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
character.

=item *

C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
are valid UTF-8.

=item *

C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer.  C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point.
C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C<utf8_distance(a, b)> will tell the distance in characters between the
two pointers pointing to the same UTF-8 encoded buffer.

=item *

C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>.  Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars.  By default they are useful
only for debugging--they display B<all> characters as hexadecimal code
points--but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=item *

C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode.  For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.

=back

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

=head1 BUGS

=head2 Interaction with Locales

Use of locales with Unicode data may lead to odd results.  Currently,
Perl attempts to attach 8-bit locale info to characters in the range
0..255, but this technique is demonstrably incorrect for locales that
use characters above that range when mapped into Unicode.  Perl's
Unicode support will also tend to run slower.  Use of locales with
Unicode is discouraged.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly.  If the
extension doesn't know about the flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange.  If the documentation does not talk about Unicode
at all, suspect the worst and probably look at the source to learn how
the module is implemented.  Modules written completely in Perl shouldn't
cause problems.  Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit.  Choose an
encoding that you know the extension can handle.  Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding.  Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular Foo::Bar::escape_html
function doesn't deal with Unicode data yet.  The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    use Encode ();

    sub my_escape_html ($) {
      my($what) = shift;
      return unless defined $what;
      Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
    }

Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be in a position to use the otherwise
dangerous Encode::_utf8_on() function.
Let's say the popular
C<Foo::Bar> extension, written in C, provides a C<param> method that
lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    use Encode ();

    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }

Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family.  Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings.  All functions that need to hop over
characters such as length(), substr() or index(), or matching regular
expressions can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations.  In general,
operations with UTF-8 encoded strings are still slower.  As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).
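
A rough way to see the difference on your own build is a sketch like
the following, which uses the standard C<Benchmark> module.  The string
length and the search offset are arbitrary assumptions chosen for
illustration, and the timings depend heavily on the Perl version and
platform, so no particular numbers should be expected.

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $bytes = 'x' x 100_000;    # byte-encoded string
    my $chars = $bytes;
    utf8::upgrade($chars);        # same characters, UTF-8 encoded internally

    # index() has to locate character offset 50_000 before searching;
    # 'y' never occurs, so both searches scan to the end of the string.
    cmpthese(-1, {
        bytes => sub { my $pos = index($bytes, 'y', 50_000) },
        utf8  => sub { my $pos = index($chars, 'y', 50_000) },
    });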

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6.  In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope.  If you have code that is
working with 5.6, you will need some of the following adjustments to
your code.  The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 4

=item *

A filehandle that should read or write UTF-8

  if ($] > 5.007) {
    binmode $fh, ":encoding(utf8)";
  }

=item *

A scalar that is going to be passed to some extension

Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off.  Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware.  Please
check the documentation to verify if this is still true.

  if ($] > 5.007) {
    require Encode;
    $val = Encode::encode_utf8($val); # make octets
  }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

  if ($] > 5.007) {
    require Encode;
    $val = Encode::decode_utf8($val);
  }

=item *

Same thing, if you are really sure it is UTF-8

  if ($] > 5.007) {
    require Encode;
    Encode::_utf8_on($val);
  }

=item *

A wrapper for fetchrow_array and fetchrow_hashref

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your fetchrow_array and
fetchrow_hashref calls.
A wrapper function will also make it easier to
adapt to future enhancements in your database driver.  Note that at the
time of this writing (October 2002), the DBI has no standardized way
to deal with UTF-8 data.  Please check the documentation to verify if
that is still true.

  sub fetchrow {
    my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
    if ($] < 5.007) {
      return $sth->$what;
    } else {
      require Encode;
      if (wantarray) {
        my @arr = $sth->$what;
        for (@arr) {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_);
        }
        return @arr;
      } else {
        my $ret = $sth->$what;
        if (ref $ret) {
          for my $k (keys %$ret) {
            defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
          }
          return $ret;
        } else {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
          return $ret;
        }
      }
    }
  }

=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program.  If you recognize such a situation, just remove
the UTF8 flag:

  utf8::downgrade($val) if $] > 5.007;

=back

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">

=cut