=head1 NAME

perlunicode - Unicode support in Perl (EXPERIMENTAL, subject to change)

=head1 DESCRIPTION

=head2 Important Caveat

    WARNING:  As of the 5.6.1 release, the implementation of Unicode
    support in Perl is incomplete, and continues to be highly experimental.

The following areas need further work.  They are being rapidly addressed
in the 5.7.x development branch.

=over 4

=item Input and Output Disciplines

There is currently no easy way to mark data read from a file or other
external source as being utf8.  This will be one of the major areas of
focus in the near future.

=item Regular Expressions

The existing regular expression compiler does not produce polymorphic
opcodes.  This means that the determination of whether to match Unicode
characters is made when the pattern is compiled, based on whether the
pattern contains Unicode characters, and not when the matching happens
at run time.  This needs to be changed to adaptively match Unicode if
the string to be matched is Unicode.

=item C<use utf8> still needed to enable a few features

The C<utf8> pragma implements the tables used for Unicode support.  These
tables are automatically loaded on demand, so the C<utf8> pragma need not
normally be used.

However, as a compatibility measure, this pragma must be explicitly used
to enable recognition of UTF-8 encoded literals and identifiers in the
source text.

=back

=head2 Byte and Character semantics

Beginning with version 5.6, Perl uses logically wide characters to
represent strings internally.  This internal representation of strings
uses the UTF-8 encoding.

In future, Perl-level operations can be expected to work with characters
rather than bytes, in general.

However, as strictly an interim compatibility measure, Perl v5.6 aims to
provide a safe migration path from byte semantics to character semantics
for programs.  For operations where Perl can unambiguously decide that the
input data is characters, Perl now switches to character semantics.
For operations where this determination cannot be made without additional
information from the user, Perl decides in favor of compatibility, and
chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations, but only as long as
none of the program's inputs are marked as being a source of Unicode
character data.  Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as C<%ENV>),
or from literals and constants in the source text.

If the C<-C> command line switch is used (or the C<${^WIDE_SYSTEM_CALLS}>
global flag is set to C<1>), all system calls will use the
corresponding wide character APIs.  This is currently only implemented
on Windows.

Regardless of the above, the C<bytes> pragma can always be used to force
byte semantics in a particular lexical scope.  See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-8 in literals encountered by the parser.  It may also
be used for enabling some of the more experimental Unicode support features.
Note that this pragma is only required until a future version of Perl
in which character semantics will become the default.  This pragma may
then become a no-op.  See L<utf8>.
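
As a minimal sketch of the literal handling described above, assuming the
source file is saved as UTF-8 and a Perl release newer than 5.6.1 (where
this area was still in flux), the same literal is seen as one character
under C<use utf8> and as two bytes under C<no utf8>.  The variable names
are illustrative only.

    use strict;
    use warnings;

    my ($chars, $bytes);
    {
        use utf8;                  # literal below is decoded as UTF-8
        $chars = length("é");      # counted as one character
    }
    {
        no utf8;                   # same literal is taken as raw bytes
        $bytes = length("é");      # counted as two bytes (0xC3 0xA9)
    }
    print "with utf8: $chars, without utf8: $bytes\n";

Only the two lengths are printed, so the example does not depend on how
the output filehandle treats wide characters.
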

Unless mentioned otherwise, Perl operators will use character semantics
when they are dealing with Unicode data, and byte semantics otherwise.
Thus, character semantics for these operations apply transparently; if
the input data came from a Unicode source (for example, by adding a
character encoding discipline to the filehandle whence it came, or a
literal UTF-8 string constant in the program), character semantics
apply; otherwise, byte semantics are in effect.  To force byte semantics
on Unicode data, the C<bytes> pragma should be used.
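
The difference between the two semantics can be sketched as follows.
This is an illustration rather than a specification: it assumes a Perl
release newer than 5.6.1, and the variable names are made up for the
example.  The string holds the single character U+263A, which UTF-8
stores in three bytes.

    use strict;
    use warnings;

    my $smiley = "\x{263A}";           # one character, U+263A

    my $char_len = length($smiley);    # character semantics apply
    my $byte_len;
    {
        use bytes;                     # byte semantics, this scope only
        $byte_len = length($smiley);   # counts the UTF-8 bytes
    }
    print "characters: $char_len, bytes: $byte_len\n";

Because C<bytes> is lexically scoped, C<length()> goes back to counting
characters as soon as the enclosing block ends.
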

Under character semantics, many operations that formerly operated on
bytes change to operating on characters.  For ASCII data this makes
no difference, because UTF-8 stores ASCII in single bytes, but for
any character greater than C<chr(127)>, the character may be stored in
a sequence of two or more bytes, all of which have the high bit set.
But by and large, the user need not worry about this, because Perl
hides it from the user.  A character in Perl is logically just a number
ranging from 0 to 2**32 or so.  Larger characters encode to longer
sequences of bytes internally, but again, this is just an internal
detail which is hidden at the Perl level.

=head2 Effects of character semantics

Character semantics have the following effects:

=over 4

=item *

Strings and patterns may contain characters that have an ordinal value
larger than 255.

Presuming you use a Unicode editor to edit your program, such characters
will typically occur directly within the literal strings as UTF-8
characters, but you can also specify a particular character with an
extension of the C<\x> notation.  UTF-8 characters are specified by
putting the hexadecimal code within curlies after the C<\x>.  For instance,
a Unicode smiley face is C<\x{263A}>.

=item *

Identifiers within the Perl script may contain Unicode alphanumeric
characters, including ideographs.  (You are currently on your own when
it comes to using the canonical forms of characters--Perl doesn't (yet)
attempt to canonicalize variable names for you.)

=item *

Regular expressions match characters instead of bytes.  For instance,
"." matches a character instead of a byte.  (However, the C<\C> pattern
is provided to force a match of a single byte ("C<char>" in C, hence
C<\C>).)

=item *

Character classes in regular expressions match characters instead of
bytes, and match against the character properties specified in the
Unicode properties database.  So C<\w> can be used to match an ideograph,
for instance.

=item *

Named Unicode properties and block ranges may be used as character
classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
match property) constructs.  For instance, C<\p{Lu}> matches any
character with the Unicode uppercase property, while C<\p{M}> matches
any mark character.  Single-letter properties may omit the braces, so
that can be written C<\pM> also.  Many predefined character classes are
available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.

=item *

The special pattern C<\X> matches any extended Unicode sequence
(a "combining character sequence" in Standardese), where the first
character is a base character and subsequent characters are mark
characters that apply to the base character.  It is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes.  Note
that the C<tr///CU> functionality has been removed, as the interface
was a mistake.  For similar functionality see C<pack('U0', ...)> and
C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when provided character input.  Note that C<uc()> translates to
uppercase, while C<ucfirst()> translates to titlecase (for languages
that make the distinction).  Naturally the corresponding backslash
sequences have the same semantics.

=item *

Most operators that deal with positions or lengths in the string will
automatically switch to using character positions, including C<chop()>,
C<substr()>, C<pos()>, C<index()>, C<rindex()>, C<sprintf()>,
C<write()>, and C<length()>.  Operators that specifically don't switch
include C<vec()>, C<pack()>, and C<unpack()>.  Operators that really
don't care include C<chomp()>, as well as any other operator that
treats a string as a bucket of bits, such as C<sort()>, and the
operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change,
since they're often used for byte-oriented formats.  (Again, think
"C<char>" in the C language.)  However, there is a new "C<U>" specifier
that will convert between UTF-8 characters and integers.  (It works
outside of the utf8 pragma too.)

=item *

The C<chr()> and C<ord()> functions work on characters.  This is like
C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and
C<unpack("C")>.  In fact, the latter are how you now emulate
byte-oriented C<chr()> and C<ord()> under utf8.

=item *

The bit string operators C<& | ^ ~> can operate on character data.
However, for backward compatibility reasons (bit string operations
when the characters all are less than 256 in ordinal value) one cannot
apply C<~> (the bit complement) to data that mixes characters less than
256 with characters equal to or greater than 256.  Most importantly,
De Morgan's laws (C<~($x|$y) eq ~$x&~$y>, C<~($x&$y) eq ~$x|~$y>) won't hold.
Another way to look at this is that the complement cannot return
B<both> the 8-bit (byte) wide bit complement, and the full character
wide bit complement.

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back
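
Several of the effects listed above can be sketched in one short
script.  It is an illustration, not a definition of the behavior: it
assumes a Perl release newer than 5.6.1 (given the caveats at the top
of this document, 5.6.1 itself may differ), and the sample strings and
variable names are invented for the example.

    use strict;
    use warnings;

    # Illustrative data only: "a", the smiley U+263A, "b".
    my $str = "a\x{263A}b";

    # Lengths and positions are counted in characters.
    my $len = length($str);               # 3
    my $mid = substr($str, 1, 1);         # the smiley
    my $pos = index($str, "b");           # 2

    # "." matches one whole character.
    my ($dot) = $str =~ /^a(.)b$/;

    # \w and \p{...} consult the Unicode property tables.
    my $word  = "\x{4E00}" =~ /^\w$/     ? 1 : 0;   # a CJK ideograph
    my $upper = "\x{0410}" =~ /^\p{Lu}$/ ? 1 : 0;   # CYRILLIC CAPITAL LETTER A

    # \X takes a base character together with its combining marks.
    my ($seq) = "e\x{301}x" =~ /^(\X)/;   # "e" plus COMBINING ACUTE ACCENT

    # tr/// translates characters.
    (my $swapped = $str) =~ tr/\x{263A}/\x{263B}/;

    # uc() gives uppercase, ucfirst() gives titlecase.
    my $uc = uc("\x{1C6}");               # U+01C4
    my $tc = ucfirst("\x{1C6}");          # U+01C5

    # chr()/ord() agree with pack("U")/unpack("U").
    my $ord    = ord("\x{263A}");
    my $packed = pack("U", 0x263A);
    my ($code) = unpack("U", $packed);

    # scalar reverse() reverses by character.
    my $rev = scalar reverse($str);

    printf "length=%d index=%d dot=U+%04X\n", $len, $pos, ord($dot);
    printf "word=%d upper=%d seq_length=%d\n", $word, $upper, length($seq);
    printf "uc=U+%04X ucfirst=U+%04X ord=U+%04X code=U+%04X\n",
        ord($uc), ord($tc), $ord, $code;
    printf "reversed: %s\n",
        join " ", map { sprintf("U+%04X", ord($_)) } split //, $rev;

The script prints code points rather than the characters themselves, so
it does not depend on any input or output encoding support.
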

=head2 Character encodings for input and output

[XXX: This feature is not yet implemented.]

=head1 CAVEATS

As yet, there is no method for automatically coercing input and
output to some encoding other than UTF-8.  This is planned in the near
future, however.

Whether an arbitrary piece of data will be treated as "characters" or
"bytes" by internal operations cannot be divined at the current time.

Use of locales with utf8 may lead to odd results.  Currently there is
some attempt to apply 8-bit locale info to characters in the range
0..255, but this is demonstrably incorrect for locales that use
characters above that range (when mapped into Unicode).  It will also
tend to run slower.  Avoidance of locales is strongly encouraged.

=head1 SEE ALSO

L<bytes>, L<utf8>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">

=cut