=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

=over 4

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is using a particular encoding,
see L<encoding>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with Unicode data--or instead uses
a traditional byte scheme when presented with byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
codepoints in Unicode happen to agree with Latin-1.

If you wish to interpret byte strings as UTF-8 instead, use the
C<encoding> pragma:

    use encoding 'utf8';

See L</"Byte and Character Semantics"> for more details.

=back

=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
be used to force byte semantics on Unicode data.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC. This translation is done without
regard to the system's native 8-bit encoding. To change this for
systems with non-Latin-1 and non-EBCDIC native encodings, use the
C<encoding> pragma. See L<encoding>.
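
The concatenation rule can be illustrated with a minimal sketch. It
assumes an ASCII-based platform and no C<encoding> pragma in effect;
the variable names are illustrative only.

```perl
use strict;
use warnings;

# A byte string holding the single byte 0xE9 (e-acute in Latin-1).
my $bytes = "\xE9";

# A character string containing a code point above 255.
my $chars = "\x{263A}";

# Concatenation upgrades $bytes as if it were Latin-1, so the byte
# 0xE9 becomes the single character U+00E9 in the result.
my $joined = $bytes . $chars;

my $len   = length $joined;               # counted in characters
my $first = ord substr($joined, 0, 1);    # code point of first character
my $last  = ord substr($joined, 1, 1);    # code point of second character

printf "length=%d first=U+%04X last=U+%04X\n", $len, $first, $last;
```

The byte is carried over as code point 0xE9 rather than being
interpreted as part of a UTF-8 sequence, which is the asymmetry
described above.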

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters
may occur directly within the literal strings in one of the various
Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
as such and converted to Perl's internal representation only if the
appropriate L<encoding> is specified.

Unicode characters can also be added to a string by using the
C<\x{...}> notation. The Unicode code for the desired character, in
hexadecimal, should be placed in the braces. For instance, a smiley
face is C<\x{263A}>. This encoding scheme only works for characters
with a code of 0x100 or above.
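
As a small sketch of the C<\x{...}> notation in use (the variable
names are illustrative, and the C<binmode> call is only there to keep
the output step quiet):

```perl
use strict;
use warnings;

# U+263A WHITE SMILING FACE, written with the \x{...} notation.
my $smiley = "\x{263A}";

# The escape can be mixed with ordinary text in a double-quoted string.
my $greeting = "hi $smiley";

# length() counts characters, so the smiley counts as one.
my $len = length $greeting;

# ord() returns the code point that was written inside the braces.
my $code = sprintf "U+%04X", ord $smiley;

# Characters above 255 may also be used as hash keys.
my %seen  = ($smiley => 1);
my $found = exists $seen{"\x{263A}"} ? 1 : 0;

binmode STDOUT, ':utf8';    # avoid "Wide character" warnings on output
print "$greeting len=$len code=$code found=$found\n";
```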

Additionally, if you

    use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.


=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte. The C<\C> pattern is provided to force
a match of a single byte--a C<char> in C, hence C<\C>.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

(However, and as a limitation of the current implementation, using
C<\w> or C<\W> I<inside> a C<[...]> character class will still match
with byte semantics.)

=item *

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property. Brackets are not
required for single letter properties, so C<\p{M}> is equivalent to
C<\pM>. Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.

The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant. It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
becomes C<Latin1Supplement>.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.
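
The constructs described so far can be tried out with a short sketch.
The sample characters and variable names are chosen here purely for
illustration.

```perl
use strict;
use warnings;

my $upper = "\x{0391}";    # GREEK CAPITAL LETTER ALPHA, category Lu
my $mark  = "\x{0301}";    # COMBINING ACUTE ACCENT, category Mn
my $tamil = "\x{0B85}";    # TAMIL LETTER A

# Each test yields 1 on a match and 0 otherwise.
my $is_upper  = $upper =~ /^\p{Lu}$/     ? 1 : 0;
my $is_mark   = $mark  =~ /^\pM$/        ? 1 : 0;    # same as \p{M}
my $is_tamil  = $tamil =~ /^\p{Tamil}$/  ? 1 : 0;

# The caret form and the capital-P form negate in the same way.
my $caret_neg = "a" =~ /^\p{^Tamil}$/    ? 1 : 0;
my $big_p_neg = "a" =~ /^\P{Tamil}$/     ? 1 : 0;

print "Lu=$is_upper M=$is_mark Tamil=$is_tamil ",
      "caret=$caret_neg bigP=$big_p_neg\n";
```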

B<NOTE: the properties, scripts, and blocks listed here are as of
Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0
came out in April 2003, and Perl 5.8.1 in September 2003.>

Here are the basic Unicode General Category properties, followed by their
long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.

    Short       Long

    L           Letter
    Lu          UppercaseLetter
    Ll          LowercaseLetter
    Lt          TitlecaseLetter
    Lm          ModifierLetter
    Lo          OtherLetter

    M           Mark
    Mn          NonspacingMark
    Mc          SpacingMark
    Me          EnclosingMark

    N           Number
    Nd          DecimalNumber
    Nl          LetterNumber
    No          OtherNumber

    P           Punctuation
    Pc          ConnectorPunctuation
    Pd          DashPunctuation
    Ps          OpenPunctuation
    Pe          ClosePunctuation
    Pi          InitialPunctuation
                (may behave like Ps or Pe depending on usage)
    Pf          FinalPunctuation
                (may behave like Ps or Pe depending on usage)
    Po          OtherPunctuation

    S           Symbol
    Sm          MathSymbol
    Sc          CurrencySymbol
    Sk          ModifierSymbol
    So          OtherSymbol

    Z           Separator
    Zs          SpaceSeparator
    Zl          LineSeparator
    Zp          ParagraphSeparator

    C           Other
    Cc          Control
    Cf          Format
    Cs          Surrogate   (not usable)
    Co          PrivateUse
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>.

Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. C<Cs> is therefore not
supported.
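
The relationship between single-letter properties, two-letter
sub-properties, and C<L&> can be checked with a brief sketch. The two
sample characters and the hash layout are illustrative choices, not
anything prescribed by Perl.

```perl
use strict;
use warnings;

my $lt = "\x{01C5}";    # a titlecase digraph, category Lt
my $lo = "\x{05D0}";    # HEBREW LETTER ALEF, category Lo

# Record 1 for a match and 0 for a non-match.
my %match = (
    'Lt vs L'  => ($lt =~ /^\p{L}$/  ? 1 : 0),    # L covers every L? category
    'Lt vs Lu' => ($lt =~ /^\p{Lu}$/ ? 1 : 0),    # Lt is not Lu
    'Lt vs L&' => ($lt =~ /^\p{L&}$/ ? 1 : 0),    # L& covers Ll, Lu, and Lt
    'Lo vs L'  => ($lo =~ /^\p{L}$/  ? 1 : 0),
    'Lo vs L&' => ($lo =~ /^\p{L&}$/ ? 1 : 0),    # Lo is outside L&
);

print "$_: $match{$_}\n" for sort keys %match;
```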

Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties:

    Property    Meaning

    BidiL       Left-to-Right
    BidiLRE     Left-to-Right Embedding
    BidiLRO     Left-to-Right Override
    BidiR       Right-to-Left
    BidiAL      Right-to-Left Arabic
    BidiRLE     Right-to-Left Embedding
    BidiRLO     Right-to-Left Override
    BidiPDF     Pop Directional Format
    BidiEN      European Number
    BidiES      European Number Separator
    BidiET      European Number Terminator
    BidiAN      Arabic Number
    BidiCS      Common Number Separator
    BidiNSM     Non-Spacing Mark
    BidiBN      Boundary Neutral
    BidiB       Paragraph Separator
    BidiS       Segment Separator
    BidiWS      Whitespace
    BidiON      Other Neutrals

For example, C<\p{BidiR}> matches characters that are normally
written right to left.

=back

=head2 Scripts

The script names which can be used by C<\p{...}> and C<\P{...}>,
such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:

    Arabic
    Armenian
    Bengali
    Bopomofo
    Buhid
    CanadianAboriginal
    Cherokee
    Cyrillic
    Deseret
    Devanagari
    Ethiopic
    Georgian
    Gothic
    Greek
    Gujarati
    Gurmukhi
    Han
    Hangul
    Hanunoo
    Hebrew
    Hiragana
    Inherited
    Kannada
    Katakana
    Khmer
    Lao
    Latin
    Malayalam
    Mongolian
    Myanmar
    Ogham
    OldItalic
    Oriya
    Runic
    Sinhala
    Syriac
    Tagalog
    Tagbanwa
    Tamil
    Telugu
    Thaana
    Thai
    Tibetan
    Yi

Extended property classes can supplement the basic
properties, defined by the F<PropList> Unicode database:

    ASCIIHexDigit
    BidiControl
    Dash
    Deprecated
    Diacritic
    Extender
    GraphemeLink
    HexDigit
    Hyphen
    Ideographic
    IDSBinaryOperator
    IDSTrinaryOperator
    JoinControl
    LogicalOrderException
    NoncharacterCodePoint
    OtherAlphabetic
    OtherDefaultIgnorableCodePoint
    OtherGraphemeExtend
    OtherLowercase
    OtherMath
    OtherUppercase
    QuotationMark
    Radical
    SoftDotted
    TerminalPunctuation
    UnifiedIdeograph
    WhiteSpace

and there are further derived properties:

    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
    Lowercase       Ll + OtherLowercase
    Uppercase       Lu + OtherUppercase
    Math            Sm + OtherMath

    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
    ID_Continue     ID_Start + Mn + Mc + Nd + Pc

    Any             Any character
    Assigned        Any non-Cn character (i.e. synonym for \P{Cn})
    Unassigned      Synonym for \p{Cn}
    Common          Any character (or unassigned code point)
                    not explicitly assigned to a script

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
example, is equal to C<\P{Lu}>.

=head2 Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of 256
Unicode characters. For example, the C<Latin> script contains letters
from many blocks but does not contain all the characters from those
blocks. It does not, for example, contain digits, because digits are
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called C<Common>.

For more about scripts, see the UTR #24:

    http://www.unicode.org/unicode/reports/tr24/

For more about blocks, see:

    http://www.unicode.org/Public/UNIDATA/Blocks.txt

Block names are given with the C<In> prefix. For example, the
Katakana block is referenced via C<\p{InKatakana}>.
The C<In> prefix may be omitted if there is no naming conflict with a script
or any other property, but it is recommended that C<In> always be used
for block tests to avoid confusion.

These block names are supported:

    InAlphabeticPresentationForms
    InArabic
    InArabicPresentationFormsA
    InArabicPresentationFormsB
    InArmenian
    InArrows
    InBasicLatin
    InBengali
    InBlockElements
    InBopomofo
    InBopomofoExtended
    InBoxDrawing
    InBraillePatterns
    InBuhid
    InByzantineMusicalSymbols
    InCJKCompatibility
    InCJKCompatibilityForms
    InCJKCompatibilityIdeographs
    InCJKCompatibilityIdeographsSupplement
    InCJKRadicalsSupplement
    InCJKSymbolsAndPunctuation
    InCJKUnifiedIdeographs
    InCJKUnifiedIdeographsExtensionA
    InCJKUnifiedIdeographsExtensionB
    InCherokee
    InCombiningDiacriticalMarks
    InCombiningDiacriticalMarksforSymbols
    InCombiningHalfMarks
    InControlPictures
    InCurrencySymbols
    InCyrillic
    InCyrillicSupplementary
    InDeseret
    InDevanagari
    InDingbats
    InEnclosedAlphanumerics
    InEnclosedCJKLettersAndMonths
    InEthiopic
    InGeneralPunctuation
    InGeometricShapes
    InGeorgian
    InGothic
    InGreekExtended
    InGreekAndCoptic
    InGujarati
    InGurmukhi
    InHalfwidthAndFullwidthForms
    InHangulCompatibilityJamo
    InHangulJamo
    InHangulSyllables
    InHanunoo
    InHebrew
    InHighPrivateUseSurrogates
    InHighSurrogates
    InHiragana
    InIPAExtensions
    InIdeographicDescriptionCharacters
    InKanbun
    InKangxiRadicals
    InKannada
    InKatakana
    InKatakanaPhoneticExtensions
    InKhmer
    InLao
    InLatin1Supplement
    InLatinExtendedA
    InLatinExtendedAdditional
    InLatinExtendedB
    InLetterlikeSymbols
    InLowSurrogates
    InMalayalam
    InMathematicalAlphanumericSymbols
    InMathematicalOperators
    InMiscellaneousMathematicalSymbolsA
    InMiscellaneousMathematicalSymbolsB
    InMiscellaneousSymbols
    InMiscellaneousTechnical
    InMongolian
    InMusicalSymbols
    InMyanmar
    InNumberForms
    InOgham
    InOldItalic
    InOpticalCharacterRecognition
    InOriya
    InPrivateUseArea
    InRunic
    InSinhala
    InSmallFormVariants
    InSpacingModifierLetters
    InSpecials
    InSuperscriptsAndSubscripts
    InSupplementalArrowsA
    InSupplementalArrowsB
    InSupplementalMathematicalOperators
    InSupplementaryPrivateUseAreaA
    InSupplementaryPrivateUseAreaB
    InSyriac
    InTagalog
    InTagbanwa
    InTags
    InTamil
    InTelugu
    InThaana
    InThai
    InTibetan
    InUnifiedCanadianAboriginalSyllabics
    InVariationSelectors
    InYiRadicals
    InYiSyllables

=over 4

=item *

The special pattern C<\X> matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character. C<\X> is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see pack('U0', ...) and pack('C0', ...).
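
A minimal sketch of C<tr///> working on characters rather than bytes
follows. The Greek sample text and the variable names are illustrative
assumptions.

```perl
use strict;
use warnings;

# Greek small letters alpha, beta, gamma.
my $word = "\x{03B1}\x{03B2}\x{03B3}";

# Map the range of small letters onto the matching capitals,
# leaving the original string untouched.
(my $capitals = $word) =~ tr/\x{03B1}-\x{03B3}/\x{0391}-\x{0393}/;

# With an empty replacement list tr/// only counts matching characters.
my $count = ($word =~ tr/\x{03B1}-\x{03C9}//);

printf "count=%d length=%d first=U+%04X\n",
    $count, length $capitals, ord $capitals;
```

Each Greek letter is counted and translated as one character, even
though it occupies more than one byte in Perl's internal encoding.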
557*0Sstevel@tonic-gate 558*0Sstevel@tonic-gate=item * 559*0Sstevel@tonic-gate 560*0Sstevel@tonic-gateCase translation operators use the Unicode case translation tables 561*0Sstevel@tonic-gatewhen character input is provided. Note that C<uc()>, or C<\U> in 562*0Sstevel@tonic-gateinterpolated strings, translates to uppercase, while C<ucfirst>, 563*0Sstevel@tonic-gateor C<\u> in interpolated strings, translates to titlecase in languages 564*0Sstevel@tonic-gatethat make the distinction. 565*0Sstevel@tonic-gate 566*0Sstevel@tonic-gate=item * 567*0Sstevel@tonic-gate 568*0Sstevel@tonic-gateMost operators that deal with positions or lengths in a string will 569*0Sstevel@tonic-gateautomatically switch to using character positions, including 570*0Sstevel@tonic-gateC<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>, 571*0Sstevel@tonic-gateC<sprintf()>, C<write()>, and C<length()>. Operators that 572*0Sstevel@tonic-gatespecifically do not switch include C<vec()>, C<pack()>, and 573*0Sstevel@tonic-gateC<unpack()>. Operators that really don't care include 574*0Sstevel@tonic-gateoperators that treats strings as a bucket of bits such as C<sort()>, 575*0Sstevel@tonic-gateand operators dealing with filenames. 576*0Sstevel@tonic-gate 577*0Sstevel@tonic-gate=item * 578*0Sstevel@tonic-gate 579*0Sstevel@tonic-gateThe C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change, 580*0Sstevel@tonic-gatesince they are often used for byte-oriented formats. Again, think 581*0Sstevel@tonic-gateC<char> in the C language. 582*0Sstevel@tonic-gate 583*0Sstevel@tonic-gateThere is a new C<U> specifier that converts between Unicode characters 584*0Sstevel@tonic-gateand code points. 585*0Sstevel@tonic-gate 586*0Sstevel@tonic-gate=item * 587*0Sstevel@tonic-gate 588*0Sstevel@tonic-gateThe C<chr()> and C<ord()> functions work on characters, similar to 589*0Sstevel@tonic-gateC<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and 590*0Sstevel@tonic-gateC<unpack("C")>. 
C<pack("C")> and C<unpack("C")> are methods for 591*0Sstevel@tonic-gateemulating byte-oriented C<chr()> and C<ord()> on Unicode strings. 592*0Sstevel@tonic-gateWhile these methods reveal the internal encoding of Unicode strings, 593*0Sstevel@tonic-gatethat is not something one normally needs to care about at all. 594*0Sstevel@tonic-gate 595*0Sstevel@tonic-gate=item * 596*0Sstevel@tonic-gate 597*0Sstevel@tonic-gateThe bit string operators, C<& | ^ ~>, can operate on character data. 598*0Sstevel@tonic-gateHowever, for backward compatibility, such as when using bit string 599*0Sstevel@tonic-gateoperations when characters are all less than 256 in ordinal value, one 600*0Sstevel@tonic-gateshould not use C<~> (the bit complement) with characters of both 601*0Sstevel@tonic-gatevalues less than 256 and values greater than 256. Most importantly, 602*0Sstevel@tonic-gateDeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) 603*0Sstevel@tonic-gatewill not hold. The reason for this mathematical I<faux pas> is that 604*0Sstevel@tonic-gatethe complement cannot return B<both> the 8-bit (byte-wide) bit 605*0Sstevel@tonic-gatecomplement B<and> the full character-wide bit complement. 606*0Sstevel@tonic-gate 607*0Sstevel@tonic-gate=item * 608*0Sstevel@tonic-gate 609*0Sstevel@tonic-gatelc(), uc(), lcfirst(), and ucfirst() work for the following cases: 610*0Sstevel@tonic-gate 611*0Sstevel@tonic-gate=over 8 612*0Sstevel@tonic-gate 613*0Sstevel@tonic-gate=item * 614*0Sstevel@tonic-gate 615*0Sstevel@tonic-gatethe case mapping is from a single Unicode character to another 616*0Sstevel@tonic-gatesingle Unicode character, or 617*0Sstevel@tonic-gate 618*0Sstevel@tonic-gate=item * 619*0Sstevel@tonic-gate 620*0Sstevel@tonic-gatethe case mapping is from a single Unicode character to more 621*0Sstevel@tonic-gatethan one Unicode character. 

=back

Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
since Perl does not understand the concept of Unicode locales.

See the Unicode Technical Report #21, Case Mappings, for more details.

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

=head2 User-Defined Character Properties

You can define your own character properties by defining subroutines
whose names begin with "In" or "Is".  The subroutines must be defined
in the C<main> package.  The user-defined properties can be used in the
regular expression C<\p> and C<\P> constructs.  Note that the effect
is compile-time and immutable once defined.

The subroutines must return a specially-formatted string, with one
or more newline-separated lines.  Each line must be one of the following:

=over 4

=item *

Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.

=item *

Something to include, prefixed by "+": a built-in character
property (prefixed by "utf8::"), to represent all the characters in that
property; two hexadecimal code points for a range; or a single
hexadecimal code point.

=item *

Something to exclude, prefixed by "-": an existing character
property (prefixed by "utf8::"), for all the characters in that
property; two hexadecimal code points for a range; or a single
hexadecimal code point.

=item *

Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") for all the characters except the
characters in the property; two hexadecimal code points for a range;
or a single hexadecimal code point.

=back

For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define

    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }

Imagine that the here-doc end marker is at the beginning of the line.
Now you can use C<\p{InKana}> and C<\P{InKana}>.
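
As a rough usage sketch, the property can be exercised as follows.  The
sample characters and the C<%sample>/C<%is_kana> hashes are our own
illustration, and the subroutine is written with the here-doc end
marker at the start of the line so that it runs as shown.

```perl
use strict;
use warnings;

# User-defined property, defined in package main as the text requires.
sub InKana {
    return <<"END";
3040\t309F
30A0\t30FF
END
}

# Hypothetical sample data: hiragana A, katakana KA, and an ASCII letter.
my %sample = (
    'U+3042' => "\x{3042}",
    'U+30AB' => "\x{30AB}",
    'U+0041' => "A",
);

my %is_kana;
for my $name (sort keys %sample) {
    # Anchor the match so exactly one character is tested.
    $is_kana{$name} = $sample{$name} =~ /\A\p{InKana}\z/ ? 1 : 0;
    print "$name ", ($is_kana{$name} ? "matches" : "does not match"),
        " \\p{InKana}\n";
}
```

The two kana characters fall inside the listed ranges and match; the
ASCII letter does not.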

You could also have used the existing block property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the unassigned code points:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

The negation is useful for defining (surprise!) negated classes.

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

You can also define your own mappings to be used in lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
The principle is the same: define subroutines in the C<main> package
with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
the first character in ucfirst()), and C<ToUpper> (for uc(), and the
rest of the characters in ucfirst()).

The string returned by the subroutines now needs to be three
hexadecimal numbers separated by tabs: start of the source
range, end of the source range, and start of the destination range.
For example:

    sub ToUpper {
        return <<END;
    0061\t0063\t0041
    END
    }

defines a uc() mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", "C"; all other characters will remain
unchanged.

If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty, but the two tab characters are still needed.
For example:

    sub ToLower {
        return <<END;
    0041\t\t0061
    END
    }

defines a lc() mapping that causes only "A" to be mapped to "a"; all
other characters will remain unchanged.

(For serious hackers only)  If you want to introspect the default
mappings, you can find the data in the directory
C<$Config{privlib}>/F<unicore/To/>.  The mapping data is returned as
the here-document, and the C<utf8::ToSpecFoo> are special exception
mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
The C<Digit> and C<Fold> mappings that one can see in the directory
are not directly user-accessible; one can use either the
C<Unicode::UCD> module, or just match case-insensitively (that's when
the C<Fold> mapping is used).

A final note on the user-defined property tests and mappings: they
will be used only if the scalar has been marked as having Unicode
characters.  Old byte-style strings will not be affected.

=head2 Character Encodings for Input and Output

See L<Encode>.

=head2 Unicode Regular Expression Support Level

The following list of Unicode support for regular expressions describes
all the features currently supported.  The references to "Level N"
and the section numbers refer to the Unicode Technical Report 18,
"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
Perl 5.8.0).

=over 4

=item *

Level 1 - Basic Unicode Support

        2.1 Hex Notation                        - done          [1]
            Named Notation                      - done          [2]
        2.2 Categories                          - done          [3][4]
        2.3 Subtraction                         - MISSING       [5][6]
        2.4 Simple Word Boundaries              - done          [7]
        2.5 Simple Loose Matches                - done          [8]
        2.6 End of Line                         - MISSING       [9][10]

        [ 1] \x{...}
        [ 2] \N{...}
        [ 3] . \p{...} \P{...}
        [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
        [ 5] have negation
        [ 6] can use regular expression look-ahead [a]
             or user-defined character properties [b] to emulate subtraction
        [ 7] include Letters in word characters
        [ 8] note that Perl does Full case-folding in matching, not Simple:
             for example U+1F88 is equivalent with U+1F00 U+03B9,
             not with 1F80.  This difference matters for certain Greek
             capital letters with certain modifiers: the Full case-folding
             decomposes the letter, while the Simple case-folding would map
             it to a single character.
        [ 9] see UTR #13 Unicode Newline Guidelines
        [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
             (should also affect <>, $., and script line numbers)
             (the \x{85}, \x{2028} and \x{2029} do match \s)

[a] You can mimic class subtraction using lookahead.
For example, what UTR #18 might write as

    [{Greek}-[{UNASSIGNED}]]

in Perl can be written as:

    (?!\p{Unassigned})\p{InGreekAndCoptic}
    (?=\p{Assigned})\p{InGreekAndCoptic}

But in this particular example, you probably really want

    \p{Greek}

which will match assigned characters known to be part of the Greek script.

Also see the Unicode::Regex::Set module; it implements the full
UTR #18 grouping, intersection, union, and removal (subtraction) syntax.

[b] See L</"User-Defined Character Properties">.

=item *

Level 2 - Extended Unicode Support

        3.1 Surrogates                          - MISSING       [11]
        3.2 Canonical Equivalents               - MISSING       [12][13]
        3.3 Locale-Independent Graphemes        - MISSING       [14]
        3.4 Locale-Independent Words            - MISSING       [15]
        3.5 Locale-Independent Loose Matches    - MISSING       [16]

        [11] Surrogates are solely a UTF-16 concept and Perl's internal
             representation is UTF-8.  The Encode module does UTF-16, though.
        [12] see UTR#15 Unicode Normalization
        [13] have Unicode::Normalize but not integrated to regexes
        [14] have \X but at this level . should equal that
        [15] need three classes, not just \w and \W
        [16] see UTR#21 Case Mappings

=item *

Level 3 - Locale-Sensitive Support

        4.1 Locale-Dependent Categories         - MISSING
        4.2 Locale-Dependent Graphemes          - MISSING       [17][18]
        4.3 Locale-Dependent Words              - MISSING
        4.4 Locale-Dependent Loose Matches      - MISSING
        4.5 Locale-Dependent Ranges             - MISSING

        [17] see UTR#10 Unicode Collation Algorithms
        [18] have Unicode::Collate but not integrated to regexes

=back

=head2 Unicode Encodings

Unicode characters are assigned to I<code points>, which are abstract
numbers.  To use these numbers, various encodings are needed.

=over 4

=item *

UTF-8

UTF-8 is a variable-length (1 to 6 bytes, current character allocations
require 4 bytes), byte-order independent encoding.  For ASCII (and we
really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
transparent.

The following table is from Unicode 3.2.

   Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

     U+0000..U+007F       00..7F
     U+0080..U+07FF       C2..DF    80..BF
     U+0800..U+0FFF       E0        A0..BF    80..BF
     U+1000..U+CFFF       E1..EC    80..BF    80..BF
     U+D000..U+D7FF       ED        80..9F    80..BF
     U+D800..U+DFFF       ******* ill-formed *******
     U+E000..U+FFFF       EE..EF    80..BF    80..BF
    U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
    U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
   U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
C<80..8F> in C<U+100000..U+10FFFF>.  Two of these "gaps" (C<A0..BF>
and C<90..BF>) are caused by legal UTF-8 avoiding non-shortest
encodings: it is technically possible to
UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always
be used.  So that's what Perl does.  The other two keep out the
surrogates (C<80..9F>) and everything above C<U+10FFFF> (C<80..8F>).
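
To see a few rows of the table in practice, the sketch below encodes
sample code points with the core C<utf8::encode()> function and prints
the resulting bytes in hex.  The helper name C<utf8_hex> and the sample
code points are ours, and the output assumes an ASCII-based platform
(on EBCDIC the bytes would be UTF-EBCDIC instead).

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 bytes of one code point as hex.
sub utf8_hex {
    my ($code_point) = @_;
    my $string = chr($code_point);
    utf8::encode($string);    # characters -> UTF-8 octets, in place
    return join ' ', map { sprintf '%02X', ord } split //, $string;
}

# One sample from each byte-length class of the table.
for my $code_point (0x41, 0xE9, 0x20AC, 0x10400) {
    printf "U+%04X => %s\n", $code_point, utf8_hex($code_point);
}
```

On an ASCII platform the four lines show C<41>, C<C3 A9>, C<E2 82 AC>,
and C<F0 90 90 80>, that is, one, two, three, and four bytes.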

Another way to look at it is via bits:

 Code Points                    1st Byte   2nd Byte  3rd Byte  4th Byte

                    0aaaaaaa     0aaaaaaa
            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa

As you can see, the continuation bytes all begin with C<10>, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.

=item *

UTF-EBCDIC

Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.

=item *

UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)

The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.

UTF-16 is a 2 or 4 byte encoding.  The Unicode code points
C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
points C<U+10000..U+10FFFF> in two 16-bit units.  The latter case is
using I<surrogates>, the first 16-bit unit being the I<high
surrogate>, and the second being the I<low surrogate>.

Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
range of Unicode code points in pairs of 16-bit units.
The I<high
surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
are the range C<U+DC00..U+DFFF>.  The surrogate encoding is

    # int() is needed because Perl's "/" is not integer division
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

and the decoding is

    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);

If you try to generate surrogates (for example by using chr()), you
will get a warning if warnings are turned on, because those code
points are not valid for a Unicode character.

Because of the 16-bitness, UTF-16 is byte-order dependent.  UTF-16
itself can be used for in-memory computations, but if storage or
transfer is required, either UTF-16BE (big-endian) or UTF-16LE
(little-endian) encodings must be chosen.

This introduces another problem: what if you just know that your data
is UTF-16, but you don't know which endianness?  Byte Order Marks, or
BOMs, are a solution to this.  A special character has been reserved
in Unicode to function as a byte order marker: the character with the
code point C<U+FEFF> is the BOM.

The trick is that if you read a BOM, you will know the byte order,
since if it was written on a big-endian platform, you will read the
bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
you will read the bytes C<0xFF 0xFE>.
(And if the originating platform
was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)

The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".

=item *

UTF-32, UTF-32BE, UTF-32LE

The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed.  The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
C<0xFF 0xFE 0x00 0x00> for LE.

=item *

UCS-2, UCS-4

Encodings defined by the ISO 10646 standard.  UCS-2 is a 16-bit
encoding.  Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
because it does not use surrogates.  UCS-4 is a 32-bit encoding,
functionally identical to UTF-32.

=item *

UTF-7

A seven-bit safe (non-eight-bit) encoding, which is useful if the
transport or storage is not eight-bit safe.  Defined by RFC 2152.

=back

=head2 Security Implications of Unicode

=over 4

=item *

Malformed UTF-8

Unfortunately, the specification of UTF-8 leaves some room for
interpretation of how many bytes of encoded output one should generate
from one input Unicode character.  Strictly speaking, the shortest
possible sequence of UTF-8 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a UTF-8 connection.  Perl always generates the
shortest length UTF-8, and with warnings on Perl will warn about
non-shortest length UTF-8 along with other malformations, such as the
surrogates, which are not real Unicode code points.

=item *

Regular expressions behave slightly differently between byte data and
character (Unicode) data.  For example, the "word character" character
class C<\w> will work differently depending on whether the data is
eight-bit bytes or Unicode.

In the first case, the set of C<\w> characters is either small--the
default set of alphabetic characters, digits, and the "_"--or, if you
are using a locale (see L<perllocale>), the C<\w> might contain a few
more letters according to your language and country.

In the second case, the C<\w> set of characters is much, much larger.
Most importantly, even in the set of the first 256 characters, it will
probably match different characters: unlike most locales, which are
specific to a language and country pair, Unicode classifies all the
characters that are letters I<somewhere> as C<\w>.  For example, your
locale might not think that LATIN SMALL LETTER ETH is a letter (unless
you happen to speak Icelandic), but Unicode does.

As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of bytes and the new world of
characters, upgrading from bytes to characters when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to characters should happen.  Characters shouldn't get
downgraded to bytes, either.  It is possible to accidentally mix bytes
and characters, however (see L<perluniintro>), in which case C<\w> in
regular expressions might start behaving differently.  Review your
code.  Use warnings and the C<strict> pragma.

=back

=head2 Unicode in Perl on EBCDIC

The way Unicode is handled on EBCDIC platforms is still
experimental.  On such platforms, references to UTF-8 encoding in this
document and elsewhere should be read as meaning the UTF-EBCDIC
specified in Unicode Technical Report 16, unless ASCII vs.
EBCDIC issues
are specifically discussed.  There is no C<utfebcdic> pragma or
":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
the platform's "natural" 8-bit encoding of Unicode.  See L<perlebcdic>
for more discussion of the issues.

=head2 Locales

Usually locale settings and Unicode do not affect each other, but
there are a couple of exceptions:

=over 4

=item *

You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.

=item *

Perl tries really hard to work both with Unicode and the old
byte-oriented world.  Most often this is nice, but sometimes Perl's
straddling of the proverbial fence causes problems.

=back

=head2 When Unicode Does Not Happen

While Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like the @ARGV which can be interpreted
as Unicode (UTF-8), there still are many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.
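
When a byte-oriented interface hands back bytes that you happen to
know are UTF-8, the conversion has to be requested explicitly.  A
minimal sketch using the core L<Encode> module follows; the byte string
is a made-up stand-in for, say, a directory entry, and whether real
filenames are UTF-8 at all depends on the platform.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for bytes returned by a byte-oriented interface:
# "cafe" with an e-acute, spelled out as UTF-8 octets.
my $bytes    = "caf\xC3\xA9";
my $byte_len = length($bytes);

# Decode explicitly; Perl will not guess the encoding for you.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d  characters: %d  last code point: U+%04X\n",
    $byte_len, length($chars), ord(substr($chars, -1));
```

The byte string has five elements while the decoded string has four
characters, the last being U+00E9.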

The following are such interfaces.  For all of these interfaces Perl
currently (as of 5.8.3) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the C<encoding> pragma has been used.

One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept.  Similarly for qx and system: how well will the
'command line interface' (and which of them?) handle Unicode?

=over 4

=item *

chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X

=item *

%ENV

=item *

glob (aka the <*>)

=item *

open, opendir, sysopen

=item *

qx (aka the backtick operator), system

=item *

readdir, readlink

=back

=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen">) there are
situations where you simply need to force Perl to believe that a byte
string is UTF-8, or vice versa.  The low-level calls
C<utf8::upgrade($bytestring)> and C<utf8::downgrade($utf8string)> are
the answers.

Do not use them without careful thought, though: Perl may easily get
very confused, angry, or even crash, if you suddenly change the 'nature'
of a scalar like that.  You have to be especially careful if you use
C<utf8::upgrade()>: not every byte string is valid UTF-8.

=head2 Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the
following C APIs useful.  See also L<perlguts/"Unicode Support"> for an
explanation about Unicode at the XS level, and L<perlapi> for the API
details.

=over 4

=item *

C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect.  C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored.  The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar.
What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string.  The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string.  Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.

=item *

C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes.

=item *

C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
returns the Unicode character code point and, optionally, the length of
the UTF-8 byte sequence.

=item *

C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
in characters.  C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
scalar.

=item *

C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form.  C<sv_utf8_downgrade(sv)> does the opposite, if
possible.  C<sv_utf8_encode(sv)> is like sv_utf8_upgrade except that
it does not set the C<UTF8> flag.  C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>.
Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that.  C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).

=item *

C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
character.

=item *

C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
are valid UTF-8.

=item *

C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer.  C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point.  C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C<utf8_distance(a, b)> will tell the distance in characters between the
two pointers pointing to the same UTF-8 encoded buffer.

=item *

C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>.
Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars.  By default they are useful
only for debugging--they display B<all> characters as hexadecimal code
points--but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=item *

C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode.  For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.

=back

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

=head1 BUGS

=head2 Interaction with Locales

Use of locales with Unicode data may lead to odd results.  Currently,
Perl attempts to attach 8-bit locale info to characters in the range
0..255, but this technique is demonstrably incorrect for locales that
use characters above that range when mapped into Unicode.
Perl's
Unicode support will also tend to run slower.  Use of locales with
Unicode is discouraged.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF-8 flag and act accordingly.  If the
extension doesn't know about the flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange.  If the documentation does not talk about
Unicode at all, suspect the worst and probably look at the source to
learn how the module is implemented.  Modules written completely in
Perl shouldn't cause problems.  Modules that directly or indirectly
access code written in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit.  Choose an
encoding that you know the extension can handle.  Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding.  Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.
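
The strategy just described can be sketched generically.  The following
is only an illustration under stated assumptions, not an interface
provided by Perl or by any extension: C<wrap_octet_function> is a
hypothetical helper, and C<legacy_reverse_words> merely stands in for an
extension function that understands octets only.  Only
C<Encode::encode> and C<Encode::decode> are real calls.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Perl or Encode): wrap a function that
# only understands octets so that callers can pass and receive Perl
# character strings.  The wrapped function is always called in list
# context.
sub wrap_octet_function {
    my ($func, $encoding) = @_;
    $encoding = 'UTF-8' unless defined $encoding;
    return sub {
        # Characters -> octets on the way in.
        my @octets  = map { defined $_ ? Encode::encode($encoding, $_) : undef } @_;
        my @results = $func->(@octets);
        # Octets -> characters on the way out.
        my @chars   = map { defined $_ ? Encode::decode($encoding, $_) : undef } @results;
        return wantarray ? @chars : $chars[0];
    };
}

# Stand-in for an extension function that is not Unicode-aware.
sub legacy_reverse_words { return join ' ', reverse split / /, $_[0] }

my $safe = wrap_octet_function(\&legacy_reverse_words);
my $out  = $safe->("caf\x{e9} \x{263A}");   # two words, one above U+00FF
```

Keeping the encoding name in one place means that only the wrapper has
to change once the extension itself learns about Unicode.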

To provide an example, let's say the popular Foo::Bar::escape_html
function doesn't deal with Unicode data yet.  The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    sub my_escape_html ($) {
      my($what) = shift;
      return unless defined $what;
      Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
    }

Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be in a position to use the otherwise
dangerous Encode::_utf8_on() function.  Let's say the popular
C<Foo::Bar> extension, written in C, provides a C<param> method that
lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }

Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family.  Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings.  All functions that need to hop over
characters, such as length(), substr() or index(), or matching regular
expressions, can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations.  In general,
operations with UTF-8 encoded strings are still slower.  As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6.
In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
was expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope.  If you have code that is
working with 5.6, you will need some of the following adjustments to
your code.  The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 4

=item *

A filehandle that should read or write UTF-8

  if ($] > 5.007) {
    binmode $fh, ":utf8";
  }

=item *

A scalar that is going to be passed to some extension

Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF-8 flag is stripped off.  Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware.  Please
check the documentation to verify if this is still true.

  if ($] > 5.007) {
    require Encode;
    $val = Encode::encode_utf8($val); # make octets
  }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF-8 flag restored:

  if ($] > 5.007) {
    require Encode;
    $val = Encode::decode_utf8($val);
  }

=item *

Same thing, if you are really sure it is UTF-8

  if ($] > 5.007) {
    require Encode;
    Encode::_utf8_on($val);
  }

=item *

A wrapper for fetchrow_array and fetchrow_hashref

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your fetchrow_array and
fetchrow_hashref calls.  A wrapper function will also make it easier to
adapt to future enhancements in your database driver.  Note that at the
time of this writing (October 2002), the DBI has no standardized way
to deal with UTF-8 data.  Please check the documentation to verify if
that is still true.

  sub fetchrow {
    my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
    if ($] < 5.007) {
      return $sth->$what;
    } else {
      require Encode;
      if (wantarray) {
        my @arr = $sth->$what;
        for (@arr) {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_);
        }
        return @arr;
      } else {
        my $ret = $sth->$what;
        if (ref $ret) {
          for my $k (keys %$ret) {
            defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
          }
          return $ret;
        } else {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
          return $ret;
        }
      }
    }
  }


=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag on your program.
If you recognize such a situation, just remove
the UTF-8 flag:

  utf8::downgrade($val) if $] > 5.007;

=back

=head1 SEE ALSO

L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">

=cut