=head1 NAME

perlunicode - Unicode support in Perl (EXPERIMENTAL, subject to change)

=head1 DESCRIPTION

=head2 Important Caveat

 WARNING: As of the 5.6.1 release, the implementation of Unicode
 support in Perl is incomplete, and continues to be highly experimental.

The following areas need further work. They are being rapidly addressed
in the 5.7.x development branch.

=over 4

=item Input and Output Disciplines

There is currently no easy way to mark data read from a file or other
external source as being utf8. This will be one of the major areas of
focus in the near future.

=item Regular Expressions

The existing regular expression compiler does not produce polymorphic
opcodes. This means that the decision on whether to match Unicode
characters is made when the pattern is compiled, based on whether the
pattern contains Unicode characters, and not when the matching happens
at run time. This needs to be changed to adaptively match Unicode if
the string to be matched is Unicode.

=item C<use utf8> still needed to enable a few features

The C<utf8> pragma implements the tables used for Unicode support. These
tables are automatically loaded on demand, so the C<utf8> pragma need not
normally be used.

However, as a compatibility measure, this pragma must be explicitly used
to enable recognition of UTF-8 encoded literals and identifiers in the
source text.

=back

=head2 Byte and Character semantics

Beginning with version 5.6, Perl uses logically wide characters to
represent strings internally. This internal representation of strings
uses the UTF-8 encoding.

In future, Perl-level operations can be expected to work with characters
rather than bytes, in general.

However, as strictly an interim compatibility measure, Perl v5.6 aims to
provide a safe migration path from byte semantics to character semantics
for programs. For operations where Perl can unambiguously decide that the
input data is characters, Perl now switches to character semantics.
For operations where this determination cannot be made without additional
information from the user, Perl decides in favor of compatibility, and
chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations, but only as long as
none of the program's inputs are marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

If the C<-C> command line switch is used (or the C<${^WIDE_SYSTEM_CALLS}>
global flag is set to C<1>), all system calls will use the
corresponding wide character APIs. This is currently only implemented
on Windows.

Regardless of the above, the C<bytes> pragma can always be used to force
byte semantics in a particular lexical scope. See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-8 in literals encountered by the parser. It may also
be used for enabling some of the more experimental Unicode support features.
Note that this pragma is only required until a future version of Perl
in which character semantics will become the default. This pragma may
then become a no-op. See L<utf8>.

Unless mentioned otherwise, Perl operators will use character semantics
when they are dealing with Unicode data, and byte semantics otherwise.
Thus, character semantics for these operations apply transparently; if
the input data came from a Unicode source (for example, by adding a
character encoding discipline to the filehandle whence it came, or a
literal UTF-8 string constant in the program), character semantics
apply; otherwise, byte semantics are in effect. To force byte semantics
on Unicode data, the C<bytes> pragma should be used.

Under character semantics, many operations that formerly operated on
bytes change to operating on characters. For ASCII data this makes
no difference, because UTF-8 stores ASCII in single bytes, but for
any character greater than C<chr(127)>, the character may be stored in
a sequence of two or more bytes, all of which have the high bit set.
But by and large, the user need not worry about this, because Perl
hides it from the user. A character in Perl is logically just a number
ranging from 0 to 2**32 or so. Larger characters encode to longer
sequences of bytes internally, but again, this is just an internal
detail which is hidden at the Perl level.

=head2 Effects of character semantics

Character semantics have the following effects:

=over 4

=item *

Strings and patterns may contain characters that have an ordinal value
larger than 255.

Presuming you use a Unicode editor to edit your program, such characters
will typically occur directly within the literal strings as UTF-8
characters, but you can also specify a particular character with an
extension of the C<\x> notation. UTF-8 characters are specified by
putting the hexadecimal code within curlies after the C<\x>. For instance,
a Unicode smiley face is C<\x{263A}>.

=item *

Identifiers within the Perl script may contain Unicode alphanumeric
characters, including ideographs. (You are currently on your own when
it comes to using the canonical forms of characters--Perl doesn't (yet)
attempt to canonicalize variable names for you.)

=item *

Regular expressions match characters instead of bytes. For instance,
"." matches a character instead of a byte. (However, the C<\C> pattern
is provided to force a match of a single byte ("C<char>" in C, hence
C<\C>).)

=item *

Character classes in regular expressions match characters instead of
bytes, and match against the character properties specified in the
Unicode properties database. So C<\w> can be used to match an ideograph,
for instance.

=item *

Named Unicode properties and block ranges may be used as character
classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
match property) constructs. For instance, C<\p{Lu}> matches any
character with the Unicode uppercase property, while C<\p{M}> matches
any mark character. Single letter properties may omit the brackets, so
the latter can also be written C<\pM>. Many predefined character classes
are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.

=item *

The special pattern C<\X> matches any extended Unicode sequence
(a "combining character sequence" in Standardese), where the first
character is a base character and subsequent characters are mark
characters that apply to the base character. It is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed, as the interface
was a mistake. For similar functionality see C<pack('U0', ...)> and
C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when provided character input. Note that C<uc()> translates to
uppercase, while C<ucfirst> translates to titlecase (for languages
that make the distinction). Naturally the corresponding backslash
sequences have the same semantics.

=item *

Most operators that deal with positions or lengths in the string will
automatically switch to using character positions, including C<chop()>,
C<substr()>, C<pos()>, C<index()>, C<rindex()>, C<sprintf()>,
C<write()>, and C<length()>. Operators that specifically don't switch
include C<vec()>, C<pack()>, and C<unpack()>. Operators that really
don't care include C<chomp()>, as well as any other operator that
treats a string as a bucket of bits, such as C<sort()>, and the
operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change,
since they're often used for byte-oriented formats. (Again, think
"C<char>" in the C language.) However, there is a new "C<U>" specifier
that will convert between UTF-8 characters and integers. (It works
outside of the utf8 pragma too.)

=item *

The C<chr()> and C<ord()> functions work on characters. This is like
C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and
C<unpack("C")>. In fact, the latter are how you now emulate
byte-oriented C<chr()> and C<ord()> under utf8.

=item *

The bit string operators C<& | ^ ~> can operate on character data.
However, for backward compatibility reasons (bit string operations
when the characters all are less than 256 in ordinal value) one cannot
mix C<~> (the bit complement) and characters both less than 256 and
equal to or greater than 256. Most importantly, DeMorgan's laws
(C<~($x|$y) eq ~$x&~$y>, C<~($x&$y) eq ~$x|~$y>) won't hold.
Another way to look at this is that the complement cannot return
B<both> the 8-bit (byte) wide bit complement, and the full character
wide bit complement.

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

=head2 Character encodings for input and output

[XXX: This feature is not yet implemented.]

=head1 CAVEATS

As of yet, there is no method for automatically coercing input and
output to some encoding other than UTF-8. This is planned in the near
future, however.

Whether an arbitrary piece of data will be treated as "characters" or
"bytes" by internal operations cannot be divined at the current time.

Use of locales with utf8 may lead to odd results. Currently there is
some attempt to apply 8-bit locale info to characters in the range
0..255, but this is demonstrably incorrect for locales that use
characters above that range (when mapped into Unicode). It will also
tend to run slower. Avoidance of locales is strongly encouraged.

=head1 SEE ALSO

L<bytes>, L<utf8>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">

=cut
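
=head1 EXAMPLES

The following is a minimal sketch of several points listed under
"Effects of character semantics", not a definitive test of the
implementation. The helper C<check()> and the variable names are
illustrative only and are not part of Perl. All non-ASCII characters are
written with the C<\x{...}> notation so that the source stays plain ASCII.
The checks assume a Perl in which the features described above are fully
implemented; because Unicode support in 5.6.1 is experimental, that
release may behave differently.

    use strict;
    use warnings;

    # Hypothetical helper: print a label and whether the check held.
    sub check {
        my ($label, $ok) = @_;
        printf "%-44s %s\n", $label, $ok ? "ok" : "not ok";
    }

    my $smiley = "\x{263A}";        # the smiley face mentioned above
    my $text   = "a${smiley}b";

    # Lengths and positions count characters, not bytes.
    check("length() counts one character", length($smiley) == 1);
    check("index() uses character positions", index($text, "b") == 2);
    check("substr() extracts one character",
          substr($text, 1, 1) eq $smiley);

    # The bytes pragma forces byte semantics in its lexical scope;
    # U+263A occupies three bytes in UTF-8.
    {
        use bytes;
        check("three bytes under 'use bytes'", length($smiley) == 3);
    }

    # chr() and ord() behave like pack("U") and unpack("U").
    check("chr() matches pack('U')", chr(0x263A) eq pack("U", 0x263A));
    check("ord() matches unpack('U')",
          ord($smiley) == unpack("U", $smiley));

    # "." matches one character; \w matches an ideograph (U+4E2D).
    check("'.' matches a whole character", scalar($smiley =~ /^.$/));
    check("\\w matches an ideograph", scalar("\x{4E2D}" =~ /^\w$/));

    # \p{Lu} matches an uppercase letter, \pM a mark, and \X a base
    # character together with the marks that follow it.
    check("\\p{Lu} matches U+0410", scalar("\x{410}" =~ /^\p{Lu}$/));
    check("\\pM matches U+0301", scalar("\x{301}" =~ /^\pM$/));
    check("\\X matches base plus mark", scalar("e\x{301}" =~ /^\X$/));

    # uc() maps to uppercase and ucfirst() to titlecase:
    # U+01C6 has uppercase U+01C4 and titlecase U+01C5.
    check("uc() gives uppercase", uc("\x{1C6}") eq "\x{1C4}");
    check("ucfirst() gives titlecase", ucfirst("\x{1C6}") eq "\x{1C5}");

    # scalar reverse() reverses by character rather than by byte.
    check("reverse() works by character",
          scalar(reverse($text)) eq "b${smiley}a");

On a Perl that meets the assumption above, every line of output should
end in C<ok>. The script never prints the wide characters themselves,
which sidesteps the question of output encoding noted under L</CAVEATS>.

=cut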