=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

=over 4

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer.  Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer.  See L<open>.

To indicate that Perl source itself is using a particular encoding,
see L<encoding>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes.  That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with Unicode data--or instead uses
a traditional byte scheme when presented with byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines.  B<These are the only times when an explicit C<use utf8>
is needed.>  See L<utf8>.

You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding.  This happens because the first 256
codepoints in Unicode happen to agree with Latin-1.

If you wish to interpret byte strings as UTF-8 instead, use the
C<encoding> pragma:

    use encoding 'utf8';

See L</"Byte and Character Semantics"> for more details.

=back
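
The layer behavior described under "Input and Output Layers" can be
sketched with an in-memory filehandle.  This is only an illustration,
assuming an ASCII-based platform; the variable names are made up for
the example, and C<:encoding(UTF-8)> stands in for whichever encoding
your data actually uses.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";    # one character, WHITE SMILING FACE

# Write the character through an encoding layer into an in-memory file.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} $smiley;
close($out) or die "close: $!";

# The buffer now holds the encoded bytes, not the character itself.
my $byte_count = length($buffer);

# Reading back through the same layer recovers the single character.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $roundtrip = <$in>;
close($in);

printf "bytes=%d chars=%d same=%s\n",
    $byte_count, length($roundtrip),
    ($roundtrip eq $smiley ? 'yes' : 'no');
# prints "bytes=3 chars=1 same=yes"
```

The layer does the conversion at the filehandle boundary, so the rest
of the program only ever sees characters.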

=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs.  For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics.  For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data.  Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as
C<%ENV>), or from literals and constants in the source text.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope.  See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op.  See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently.  If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect.  The C<bytes> pragma should
be used to force byte semantics on Unicode data.
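
As a small sketch of the difference, the following compares
C<length()> under the default character semantics with the same call
inside a C<use bytes> scope.  The byte count shown assumes an
ASCII-based platform where the internal encoding is UTF-8; the
variable names are illustrative only.

```perl
use strict;
use warnings;

my $string = "\x{263A}";          # a single character above 255

my $char_len = length($string);   # character semantics apply here

my $byte_len;
{
    use bytes;                    # byte semantics, this block only
    $byte_len = length($string);  # counts bytes of the internal encoding
}

print "characters: $char_len, bytes: $byte_len\n";
# prints "characters: 1, bytes: 3"
```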

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.  This translation is done without
regard to the system's native 8-bit encoding.  To change this for
systems with non-Latin-1 and non-EBCDIC native encodings, use the
C<encoding> pragma.  See L<encoding>.
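
The concatenation rule can be illustrated as follows.  This is a
sketch only, with hypothetical variable names: a one-byte string is
joined to a string holding a character above 255, and the byte is
carried over as the Latin-1 character with the same ordinal value.

```perl
use strict;
use warnings;

my $byte_str = "\xE9";        # one byte; U+00E9 when read as Latin-1
my $uni_str  = "\x{263A}";    # Unicode character data

my $joined = $byte_str . $uni_str;

my $length     = length($joined);               # counted in characters
my $first_code = ord(substr($joined, 0, 1));    # unchanged by the upgrade

printf "length=%d first=U+%04X\n", $length, $first_code;
# prints "length=2 first=U+00E9"
```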

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters
may occur directly within the literal strings in one of the various
Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
as such and converted to Perl's internal representation only if the
appropriate L<encoding> is specified.

Unicode characters can also be added to a string by using the
C<\x{...}> notation.  The Unicode code point of the desired character,
in hexadecimal, should be placed in the braces. For instance, a smiley
face is C<\x{263A}>.  This encoding scheme only works for characters
with a code of 0x100 or above.

Additionally, if you

   use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.
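
A brief sketch showing that the two notations yield the same
character.  The variable names are illustrative, and
C<charnames::viacode()> is used here only to look the name up again
from the code point.

```perl
use strict;
use warnings;
use charnames ':full';

my $by_code = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";

my $same = $by_code eq $by_name ? 1 : 0;
my $code = sprintf('U+%04X', ord($by_name));
my $name = charnames::viacode(0x263A);

print "same=$same code=$code name=$name\n";
# prints "same=1 code=U+263A name=WHITE SMILING FACE"
```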

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs.  Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes.  "." matches
a character instead of a byte.  The C<\C> pattern is provided to force
a match of a single byte--a C<char> in C, hence C<\C>.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database.  C<\w> can be used to match a Japanese
ideograph, for instance.

(However, and as a limitation of the current implementation, using
C<\w> or C<\W> I<inside> a C<[...]> character class will still match
with byte semantics.)
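
The following sketch illustrates "." and C<\w> (outside a bracketed
class) operating on characters rather than bytes.  The two code points
are ideographs picked for the example, and the variable names are
illustrative.

```perl
use strict;
use warnings;

my $text = "\x{65E5}\x{672C}";    # two ideographs

# "." consumes one whole character per match, not one byte.
my $dot_count = () = $text =~ /./g;

# \w matches the ideographs, so the whole string is word characters.
my $all_word = $text =~ /^\w+$/ ? 1 : 0;

print "dot matches: $dot_count, all word characters: $all_word\n";
# prints "dot matches: 2, all word characters: 1"
```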

=item *

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property.  Braces are not
required for single letter properties, so C<\p{M}> is equivalent to
C<\pM>. Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.

The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant. It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
becomes C<Latin1Supplement>.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equivalent to C<\P{Tamil}>.
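
A minimal sketch of these forms follows.  The sample characters and
the hash keys are chosen only for illustration.

```perl
use strict;
use warnings;

my $upper = "A";
my $mark  = "\x{0301}";    # COMBINING ACUTE ACCENT, a nonspacing mark
my $tamil = "\x{0B95}";    # TAMIL LETTER KA

my %result = (
    upper_is_Lu    => ($upper =~ /^\p{Lu}$/     ? 1 : 0),
    mark_braced    => ($mark  =~ /^\p{M}$/      ? 1 : 0),
    mark_unbraced  => ($mark  =~ /^\pM$/        ? 1 : 0),
    tamil_is_Tamil => ($tamil =~ /^\p{Tamil}$/  ? 1 : 0),
    tamil_caret    => ($tamil =~ /^\p{^Tamil}$/ ? 1 : 0),
    tamil_big_P    => ($tamil =~ /^\P{Tamil}$/  ? 1 : 0),
    upper_caret    => ($upper =~ /^\p{^Tamil}$/ ? 1 : 0),
);

printf "%-15s %d\n", $_, $result{$_} for sort keys %result;
```

The caret form and the capital C<\P> form give the same answer for
every input, which is the point of the last three entries.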

B<NOTE: the properties, scripts, and blocks listed here are as of
Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002.  Unicode 4.0.0
came out in April 2003, and Perl 5.8.1 in September 2003.>

Here are the basic Unicode General Category properties, followed by their
long form.  You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.

    Short       Long

    L           Letter
    Lu          UppercaseLetter
    Ll          LowercaseLetter
    Lt          TitlecaseLetter
    Lm          ModifierLetter
    Lo          OtherLetter

    M           Mark
    Mn          NonspacingMark
    Mc          SpacingMark
    Me          EnclosingMark

    N           Number
    Nd          DecimalNumber
    Nl          LetterNumber
    No          OtherNumber

    P           Punctuation
    Pc          ConnectorPunctuation
    Pd          DashPunctuation
    Ps          OpenPunctuation
    Pe          ClosePunctuation
    Pi          InitialPunctuation
                (may behave like Ps or Pe depending on usage)
    Pf          FinalPunctuation
                (may behave like Ps or Pe depending on usage)
    Po          OtherPunctuation

    S           Symbol
    Sm          MathSymbol
    Sc          CurrencySymbol
    Sk          ModifierSymbol
    So          OtherSymbol

    Z           Separator
    Zs          SpaceSeparator
    Zl          LineSeparator
    Zp          ParagraphSeparator

    C           Other
    Cc          Control
    Cf          Format
    Cs          Surrogate   (not usable)
    Co          PrivateUse
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>.
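
To illustrate how the single-letter, two-letter, and C<L&> forms
relate, this sketch counts matches in a short string that holds one
character from each of several categories.  The sample string and
variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# a (Ll), B (Lu), U+01C5 (Lt), U+4E00 (Lo), 1 (Nd)
my $text = "aB\x{01C5}\x{4E00}1";

my $letters = () = $text =~ /\p{L}/g;                # every L* category
my $cased   = () = $text =~ /\p{L&}/g;               # Ll, Lu, and Lt only
my $other   = () = $text =~ /\p{Lo}/g;
my $digits  = () = $text =~ /\p{Nd}/g;
my $upper   = () = $text =~ /\p{UppercaseLetter}/g;  # long form of Lu

print "L=$letters L&=$cased Lo=$other Nd=$digits Lu=$upper\n";
# prints "L=4 L&=3 Lo=1 Nd=1 Lu=1"
```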

Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. C<Cs> is therefore not
supported.

Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties:

    Property    Meaning

    BidiL       Left-to-Right
    BidiLRE     Left-to-Right Embedding
    BidiLRO     Left-to-Right Override
    BidiR       Right-to-Left
    BidiAL      Right-to-Left Arabic
    BidiRLE     Right-to-Left Embedding
    BidiRLO     Right-to-Left Override
    BidiPDF     Pop Directional Format
    BidiEN      European Number
    BidiES      European Number Separator
    BidiET      European Number Terminator
    BidiAN      Arabic Number
    BidiCS      Common Number Separator
    BidiNSM     Non-Spacing Mark
    BidiBN      Boundary Neutral
    BidiB       Paragraph Separator
    BidiS       Segment Separator
    BidiWS      Whitespace
    BidiON      Other Neutrals

For example, C<\p{BidiR}> matches characters that are normally
written right to left.

=back

=head2 Scripts

The script names which can be used by C<\p{...}> and C<\P{...}>,
such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:

    Arabic
    Armenian
    Bengali
    Bopomofo
    Buhid
    CanadianAboriginal
    Cherokee
    Cyrillic
    Deseret
    Devanagari
    Ethiopic
    Georgian
    Gothic
    Greek
    Gujarati
    Gurmukhi
    Han
    Hangul
    Hanunoo
    Hebrew
    Hiragana
    Inherited
    Kannada
    Katakana
    Khmer
    Lao
    Latin
    Malayalam
    Mongolian
    Myanmar
    Ogham
    OldItalic
    Oriya
    Runic
    Sinhala
    Syriac
    Tagalog
    Tagbanwa
    Tamil
    Telugu
    Thaana
    Thai
    Tibetan
    Yi

Extended property classes can supplement the basic
properties, defined by the F<PropList> Unicode database:

    ASCIIHexDigit
    BidiControl
    Dash
    Deprecated
    Diacritic
    Extender
    GraphemeLink
    HexDigit
    Hyphen
    Ideographic
    IDSBinaryOperator
    IDSTrinaryOperator
    JoinControl
    LogicalOrderException
    NoncharacterCodePoint
    OtherAlphabetic
    OtherDefaultIgnorableCodePoint
    OtherGraphemeExtend
    OtherLowercase
    OtherMath
    OtherUppercase
    QuotationMark
    Radical
    SoftDotted
    TerminalPunctuation
    UnifiedIdeograph
    WhiteSpace

and there are further derived properties:

    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
    Lowercase       Ll + OtherLowercase
    Uppercase       Lu + OtherUppercase
    Math            Sm + OtherMath

    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
    ID_Continue     ID_Start + Mn + Mc + Nd + Pc

    Any             Any character
    Assigned        Any non-Cn character (i.e. synonym for \P{Cn})
    Unassigned      Synonym for \p{Cn}
    Common          Any character (or unassigned code point)
                    not explicitly assigned to a script

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
example, is equivalent to C<\P{Lu}>.
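
As an illustration of script names used as properties, the sketch
below classifies a few sample characters.  The helper
C<script_of()> is hypothetical, written just for this example, and
checks only three of the scripts listed above.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of three scripts a character is in.
sub script_of {
    my ($char) = @_;
    return 'Latin'    if $char =~ /^\p{Latin}$/;
    return 'Cyrillic' if $char =~ /^\p{Cyrillic}$/;
    return 'Greek'    if $char =~ /^\p{Greek}$/;
    return 'other';
}

my @scripts = map { script_of($_) } (
    "A",
    "\x{0416}",    # CYRILLIC CAPITAL LETTER ZHE
    "\x{03A9}",    # GREEK CAPITAL LETTER OMEGA
    "5",           # a digit belongs to none of the three
);

# The Is prefix names the same property as the bare form.
my $is_prefix_agrees = ("A" =~ /^\p{IsLu}$/ && "A" =~ /^\p{Lu}$/) ? 1 : 0;

print join(' ', @scripts), " IsLu=$is_prefix_agrees\n";
# prints "Latin Cyrillic Greek other IsLu=1"
```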

=head2 Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of
characters.  The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of 256
Unicode characters. For example, the C<Latin> script contains letters
from many blocks but does not contain all the characters from those
blocks. It does not, for example, contain digits, because digits are
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called C<Common>.

For more about scripts, see UTR #24:

   http://www.unicode.org/unicode/reports/tr24/

For more about blocks, see:

   http://www.unicode.org/Public/UNIDATA/Blocks.txt

Block names are given with the C<In> prefix. For example, the
Katakana block is referenced via C<\p{InKatakana}>.  The C<In>
prefix may be omitted if there is no naming conflict with a script
or any other property, but it is recommended that C<In> always be used
for block tests to avoid confusion.
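
The distinction between a block test and a script test can be
sketched as follows.  The sample characters and variable names are
chosen for illustration only.

```perl
use strict;
use warnings;

my $digit = "5";
my $kana  = "\x{30A2}";    # KATAKANA LETTER A

# A digit sits in the Basic Latin block but is not in the Latin script.
my $digit_in_block  = $digit =~ /^\p{InBasicLatin}$/ ? 1 : 0;
my $digit_is_latin  = $digit =~ /^\p{Latin}$/        ? 1 : 0;
my $digit_is_common = $digit =~ /^\p{Common}$/       ? 1 : 0;

# This letter is in both the Katakana block and the Katakana script.
my $kana_in_block  = $kana =~ /^\p{InKatakana}$/ ? 1 : 0;
my $kana_is_script = $kana =~ /^\p{Katakana}$/   ? 1 : 0;

print "digit: block=$digit_in_block latin=$digit_is_latin ",
      "common=$digit_is_common\n";
print "kana:  block=$kana_in_block script=$kana_is_script\n";
```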

These block names are supported:

    InAlphabeticPresentationForms
    InArabic
    InArabicPresentationFormsA
    InArabicPresentationFormsB
    InArmenian
    InArrows
    InBasicLatin
    InBengali
    InBlockElements
    InBopomofo
    InBopomofoExtended
    InBoxDrawing
    InBraillePatterns
    InBuhid
    InByzantineMusicalSymbols
    InCJKCompatibility
    InCJKCompatibilityForms
    InCJKCompatibilityIdeographs
    InCJKCompatibilityIdeographsSupplement
    InCJKRadicalsSupplement
    InCJKSymbolsAndPunctuation
    InCJKUnifiedIdeographs
    InCJKUnifiedIdeographsExtensionA
    InCJKUnifiedIdeographsExtensionB
    InCherokee
    InCombiningDiacriticalMarks
    InCombiningDiacriticalMarksforSymbols
    InCombiningHalfMarks
    InControlPictures
    InCurrencySymbols
    InCyrillic
    InCyrillicSupplementary
    InDeseret
    InDevanagari
    InDingbats
    InEnclosedAlphanumerics
    InEnclosedCJKLettersAndMonths
    InEthiopic
    InGeneralPunctuation
    InGeometricShapes
    InGeorgian
    InGothic
    InGreekExtended
    InGreekAndCoptic
    InGujarati
    InGurmukhi
    InHalfwidthAndFullwidthForms
    InHangulCompatibilityJamo
    InHangulJamo
    InHangulSyllables
    InHanunoo
    InHebrew
    InHighPrivateUseSurrogates
    InHighSurrogates
    InHiragana
    InIPAExtensions
    InIdeographicDescriptionCharacters
    InKanbun
    InKangxiRadicals
    InKannada
    InKatakana
    InKatakanaPhoneticExtensions
    InKhmer
    InLao
    InLatin1Supplement
    InLatinExtendedA
    InLatinExtendedAdditional
    InLatinExtendedB
    InLetterlikeSymbols
    InLowSurrogates
    InMalayalam
    InMathematicalAlphanumericSymbols
    InMathematicalOperators
    InMiscellaneousMathematicalSymbolsA
    InMiscellaneousMathematicalSymbolsB
    InMiscellaneousSymbols
    InMiscellaneousTechnical
    InMongolian
    InMusicalSymbols
    InMyanmar
    InNumberForms
    InOgham
    InOldItalic
    InOpticalCharacterRecognition
    InOriya
    InPrivateUseArea
    InRunic
    InSinhala
    InSmallFormVariants
    InSpacingModifierLetters
    InSpecials
    InSuperscriptsAndSubscripts
    InSupplementalArrowsA
    InSupplementalArrowsB
    InSupplementalMathematicalOperators
    InSupplementaryPrivateUseAreaA
    InSupplementaryPrivateUseAreaB
    InSyriac
    InTagalog
    InTagbanwa
    InTags
    InTamil
    InTelugu
    InThaana
    InThai
    InTibetan
    InUnifiedCanadianAboriginalSyllabics
    InVariationSelectors
    InYiRadicals
    InYiSyllables

=over 4

=item *

The special pattern C<\X> matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character.  C<\X> is equivalent to
C<(?:\PM\pM*)>.
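
A small sketch of C<\X> splitting a string into base-plus-marks
sequences.  The sample string and variable names are illustrative.

```perl
use strict;
use warnings;

# "e" followed by COMBINING ACUTE ACCENT, then a plain "a"
my $text = "e\x{0301}a";

my $char_count = length($text);         # counts individual characters
my @sequences  = $text =~ /(\X)/g;      # base character plus its marks
my $first_len  = length($sequences[0]); # "e" and its accent together

printf "characters=%d sequences=%d first=%d\n",
    $char_count, scalar(@sequences), $first_len;
# prints "characters=3 sequences=2 first=2"
```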

=item *

The C<tr///> operator translates characters instead of bytes.  Note
that the C<tr///CU> functionality has been removed.  For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.
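
For instance, a range of characters above 255 can be used directly in
C<tr///>.  The sketch below maps the basic lowercase Cyrillic letters
to their uppercase counterparts; the sample word and variable names
are made up for the example.

```perl
use strict;
use warnings;

my $word = "\x{043F}\x{0440}";    # two lowercase Cyrillic letters

# Translate U+0430..U+044F to U+0410..U+042F, character by character.
(my $shifted = $word) =~ tr/\x{0430}-\x{044F}/\x{0410}-\x{042F}/;

my @codes = map { sprintf 'U+%04X', ord } split //, $shifted;
print "@codes\n";
# prints "U+041F U+0420"
```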

=item *

Case translation operators use the Unicode case translation tables
when character input is provided.  Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction.
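
The uppercase/titlecase distinction can be seen with a digraph letter
that has three case forms.  This is a sketch with illustrative
variable names.

```perl
use strict;
use warnings;

my $small = "\x{01C6}";    # LATIN SMALL LETTER DZ WITH CARON

my $upper = uc($small);        # full uppercase form
my $title = ucfirst($small);   # titlecase form, distinct from uppercase

printf "uc=U+%04X ucfirst=U+%04X\n", ord($upper), ord($title);
# prints "uc=U+01C4 ucfirst=U+01C5"
```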

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>.  Operators that
specifically do not switch include C<vec()>, C<pack()>, and
C<unpack()>.  Operators that really don't care include
operators that treat strings as a bucket of bits such as C<sort()>,
and operators dealing with filenames.
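
A short sketch of a few of these operators counting in characters.
The sample string and variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "a\x{263A}b";    # the middle character is above 255

my $len = length($text);            # counted in characters
my $pos = index($text, "b");        # character offset, not byte offset
my $mid = substr($text, 1, 1);      # extracts the whole middle character

my $chopped = $text;
chop($chopped);                     # removes one character, the "b"

printf "length=%d index=%d mid=U+%04X chopped=%d\n",
    $len, $pos, ord($mid), length($chopped);
# prints "length=3 index=2 mid=U+263A chopped=2"
```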

=item *

The C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change,
since they are often used for byte-oriented formats.  Again, think
C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and
C<unpack("C")>.  C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.
594*0Sstevel@tonic-gate
595*0Sstevel@tonic-gate=item *
596*0Sstevel@tonic-gate
597*0Sstevel@tonic-gateThe bit string operators, C<& | ^ ~>, can operate on character data.
598*0Sstevel@tonic-gateHowever, for backward compatibility, such as when using bit string
599*0Sstevel@tonic-gateoperations when characters are all less than 256 in ordinal value, one
600*0Sstevel@tonic-gateshould not use C<~> (the bit complement) with characters of both
601*0Sstevel@tonic-gatevalues less than 256 and values greater than 256.  Most importantly,
602*0Sstevel@tonic-gateDeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
603*0Sstevel@tonic-gatewill not hold.  The reason for this mathematical I<faux pas> is that
604*0Sstevel@tonic-gatethe complement cannot return B<both> the 8-bit (byte-wide) bit
605*0Sstevel@tonic-gatecomplement B<and> the full character-wide bit complement.
606*0Sstevel@tonic-gate
607*0Sstevel@tonic-gate=item *
608*0Sstevel@tonic-gate
609*0Sstevel@tonic-gatelc(), uc(), lcfirst(), and ucfirst() work for the following cases:
610*0Sstevel@tonic-gate
611*0Sstevel@tonic-gate=over 8
612*0Sstevel@tonic-gate
613*0Sstevel@tonic-gate=item *
614*0Sstevel@tonic-gate
615*0Sstevel@tonic-gatethe case mapping is from a single Unicode character to another
616*0Sstevel@tonic-gatesingle Unicode character, or
617*0Sstevel@tonic-gate
618*0Sstevel@tonic-gate=item *
619*0Sstevel@tonic-gate
620*0Sstevel@tonic-gatethe case mapping is from a single Unicode character to more
621*0Sstevel@tonic-gatethan one Unicode character.
622*0Sstevel@tonic-gate
623*0Sstevel@tonic-gate=back
624*0Sstevel@tonic-gate
625*0Sstevel@tonic-gateThings to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
626*0Sstevel@tonic-gatesince Perl does not understand the concept of Unicode locales.
627*0Sstevel@tonic-gate
628*0Sstevel@tonic-gateSee the Unicode Technical Report #21, Case Mappings, for more details.
629*0Sstevel@tonic-gate
630*0Sstevel@tonic-gate=back
631*0Sstevel@tonic-gate
632*0Sstevel@tonic-gate=over 4
633*0Sstevel@tonic-gate
634*0Sstevel@tonic-gate=item *
635*0Sstevel@tonic-gate
636*0Sstevel@tonic-gateAnd finally, C<scalar reverse()> reverses by character rather than by byte.
637*0Sstevel@tonic-gate
638*0Sstevel@tonic-gate=back
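
The character-oriented behaviour listed above can be checked with a small
script.  This is only a sketch: the sample string is arbitrary, and
C<utf8::encode()> is used purely to obtain the UTF-8 byte count for
comparison.

```perl
use strict;
use warnings;

# A three-character string: 'a', U+263A (WHITE SMILING FACE), 'b'.
my $str = "a\x{263A}b";

# length(), substr(), and index() count characters, not bytes.
my $chars    = length($str);          # 3
my $middle   = substr($str, 1, 1);    # the single character U+263A
my $pos_of_b = index($str, "b");      # 2

# scalar reverse() reverses by character as well.
my $reversed = scalar reverse($str);  # 'b', U+263A, 'a'

# chr()/ord() agree with pack("U")/unpack("U").
my $same_char = chr(0x263A) eq pack("U", 0x263A) ? 1 : 0;
my $codepoint = unpack("U", chr(0x263A));

# For comparison: the UTF-8 encoded form of the same string is 5 bytes.
my $encoded = $str;
utf8::encode($encoded);
my $bytes = length($encoded);

printf "%d characters, %d bytes when encoded as UTF-8\n", $chars, $bytes;
```

The byte count differs from the character count only because U+263A needs
three bytes in UTF-8; for pure ASCII data the two numbers are the same.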
639*0Sstevel@tonic-gate
640*0Sstevel@tonic-gate=head2 User-Defined Character Properties
641*0Sstevel@tonic-gate
642*0Sstevel@tonic-gateYou can define your own character properties by defining subroutines
643*0Sstevel@tonic-gatewhose names begin with "In" or "Is".  The subroutines must be defined
644*0Sstevel@tonic-gatein the C<main> package.  The user-defined properties can be used in the
645*0Sstevel@tonic-gateregular expression C<\p> and C<\P> constructs.  Note that the effect
646*0Sstevel@tonic-gateis compile-time and immutable once defined.
647*0Sstevel@tonic-gate
648*0Sstevel@tonic-gateThe subroutines must return a specially-formatted string, with one
649*0Sstevel@tonic-gateor more newline-separated lines.  Each line must be one of the following:
650*0Sstevel@tonic-gate
651*0Sstevel@tonic-gate=over 4
652*0Sstevel@tonic-gate
653*0Sstevel@tonic-gate=item *
654*0Sstevel@tonic-gate
Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.
657*0Sstevel@tonic-gate
658*0Sstevel@tonic-gate=item *
659*0Sstevel@tonic-gate
660*0Sstevel@tonic-gateSomething to include, prefixed by "+": a built-in character
661*0Sstevel@tonic-gateproperty (prefixed by "utf8::"), to represent all the characters in that
662*0Sstevel@tonic-gateproperty; two hexadecimal code points for a range; or a single
663*0Sstevel@tonic-gatehexadecimal code point.
664*0Sstevel@tonic-gate
665*0Sstevel@tonic-gate=item *
666*0Sstevel@tonic-gate
667*0Sstevel@tonic-gateSomething to exclude, prefixed by "-": an existing character
668*0Sstevel@tonic-gateproperty (prefixed by "utf8::"), for all the characters in that
669*0Sstevel@tonic-gateproperty; two hexadecimal code points for a range; or a single
670*0Sstevel@tonic-gatehexadecimal code point.
671*0Sstevel@tonic-gate
672*0Sstevel@tonic-gate=item *
673*0Sstevel@tonic-gate
Something to negate, prefixed by "!": an existing character
675*0Sstevel@tonic-gateproperty (prefixed by "utf8::") for all the characters except the
676*0Sstevel@tonic-gatecharacters in the property; two hexadecimal code points for a range;
677*0Sstevel@tonic-gateor a single hexadecimal code point.
678*0Sstevel@tonic-gate
679*0Sstevel@tonic-gate=back
680*0Sstevel@tonic-gate
681*0Sstevel@tonic-gateFor example, to define a property that covers both the Japanese
682*0Sstevel@tonic-gatesyllabaries (hiragana and katakana), you can define
683*0Sstevel@tonic-gate
684*0Sstevel@tonic-gate    sub InKana {
685*0Sstevel@tonic-gate	return <<END;
686*0Sstevel@tonic-gate    3040\t309F
687*0Sstevel@tonic-gate    30A0\t30FF
688*0Sstevel@tonic-gate    END
689*0Sstevel@tonic-gate    }
690*0Sstevel@tonic-gate
691*0Sstevel@tonic-gateImagine that the here-doc end marker is at the beginning of the line.
692*0Sstevel@tonic-gateNow you can use C<\p{InKana}> and C<\P{InKana}>.
693*0Sstevel@tonic-gate
694*0Sstevel@tonic-gateYou could also have used the existing block property names:
695*0Sstevel@tonic-gate
696*0Sstevel@tonic-gate    sub InKana {
697*0Sstevel@tonic-gate	return <<'END';
698*0Sstevel@tonic-gate    +utf8::InHiragana
699*0Sstevel@tonic-gate    +utf8::InKatakana
700*0Sstevel@tonic-gate    END
701*0Sstevel@tonic-gate    }
702*0Sstevel@tonic-gate
703*0Sstevel@tonic-gateSuppose you wanted to match only the allocated characters,
704*0Sstevel@tonic-gatenot the raw block ranges: in other words, you want to remove
705*0Sstevel@tonic-gatethe non-characters:
706*0Sstevel@tonic-gate
707*0Sstevel@tonic-gate    sub InKana {
708*0Sstevel@tonic-gate	return <<'END';
709*0Sstevel@tonic-gate    +utf8::InHiragana
710*0Sstevel@tonic-gate    +utf8::InKatakana
711*0Sstevel@tonic-gate    -utf8::IsCn
712*0Sstevel@tonic-gate    END
713*0Sstevel@tonic-gate    }
714*0Sstevel@tonic-gate
715*0Sstevel@tonic-gateThe negation is useful for defining (surprise!) negated classes.
716*0Sstevel@tonic-gate
717*0Sstevel@tonic-gate    sub InNotKana {
718*0Sstevel@tonic-gate	return <<'END';
719*0Sstevel@tonic-gate    !utf8::InHiragana
720*0Sstevel@tonic-gate    -utf8::InKatakana
721*0Sstevel@tonic-gate    +utf8::IsCn
722*0Sstevel@tonic-gate    END
723*0Sstevel@tonic-gate    }
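
As a usage sketch, the following defines a property from two plain
hexadecimal ranges and tests a few characters against it.  The name
C<InKanaDemo> is hypothetical and chosen only to avoid clashing with the
C<InKana> examples; the ranges are the Hiragana and Katakana blocks quoted
earlier.

```perl
use strict;
use warnings;

# Hypothetical property name, defined in package main.
# Each line of the returned string is "start<TAB>end" in hexadecimal.
sub InKanaDemo {
    return "3040\t309F\n"    # Hiragana block
         . "30A0\t30FF\n";   # Katakana block
}

my $hiragana_a  = "\x{3042}";   # HIRAGANA LETTER A
my $katakana_ka = "\x{30AB}";   # KATAKANA LETTER KA
my $latin_a     = "A";

# \p{...} matches members of the property, \P{...} matches non-members.
my $hira_is_kana  = $hiragana_a  =~ /\p{InKanaDemo}/ ? 1 : 0;
my $kata_is_kana  = $katakana_ka =~ /\p{InKanaDemo}/ ? 1 : 0;
my $latin_is_kana = $latin_a     =~ /\p{InKanaDemo}/ ? 1 : 0;
my $latin_is_not  = $latin_a     =~ /\P{InKanaDemo}/ ? 1 : 0;

print "hiragana: $hira_is_kana, katakana: $kata_is_kana, ",
      "latin: $latin_is_kana\n";
```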
724*0Sstevel@tonic-gate
725*0Sstevel@tonic-gateYou can also define your own mappings to be used in the lc(),
726*0Sstevel@tonic-gatelcfirst(), uc(), and ucfirst() (or their string-inlined versions).
727*0Sstevel@tonic-gateThe principle is the same: define subroutines in the C<main> package
728*0Sstevel@tonic-gatewith names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
729*0Sstevel@tonic-gatethe first character in ucfirst()), and C<ToUpper> (for uc(), and the
730*0Sstevel@tonic-gaterest of the characters in ucfirst()).
731*0Sstevel@tonic-gate
The string returned by the subroutines must now consist of three
hexadecimal numbers separated by tabs: start of the source
range, end of the source range, and start of the destination range.
735*0Sstevel@tonic-gateFor example:
736*0Sstevel@tonic-gate
737*0Sstevel@tonic-gate    sub ToUpper {
738*0Sstevel@tonic-gate	return <<END;
739*0Sstevel@tonic-gate    0061\t0063\t0041
740*0Sstevel@tonic-gate    END
741*0Sstevel@tonic-gate    }
742*0Sstevel@tonic-gate
defines a uc() mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", "C"; all other characters will remain
unchanged.
746*0Sstevel@tonic-gate
If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty, but keep both tab characters.
750*0Sstevel@tonic-gateFor example:
751*0Sstevel@tonic-gate
752*0Sstevel@tonic-gate    sub ToLower {
753*0Sstevel@tonic-gate	return <<END;
754*0Sstevel@tonic-gate    0041\t\t0061
755*0Sstevel@tonic-gate    END
756*0Sstevel@tonic-gate    }
757*0Sstevel@tonic-gate
defines a lc() mapping that causes only "A" to be mapped to "a"; all
other characters will remain unchanged.
760*0Sstevel@tonic-gate
(For serious hackers only)  If you want to introspect the default
mappings, you can find the data in the directory
C<$Config{privlib}>/F<unicore/To/>.  The mapping data is returned as
the here-document, and the C<utf8::ToSpecFoo> are special exception
mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
The C<Digit> and C<Fold> mappings that one can see in the directory
are not directly user-accessible; one can use either the
C<Unicode::UCD> module, or just match case-insensitively (that's when
the C<Fold> mapping is used).
770*0Sstevel@tonic-gate
771*0Sstevel@tonic-gateA final note on the user-defined property tests and mappings: they
772*0Sstevel@tonic-gatewill be used only if the scalar has been marked as having Unicode
773*0Sstevel@tonic-gatecharacters.  Old byte-style strings will not be affected.
774*0Sstevel@tonic-gate
775*0Sstevel@tonic-gate=head2 Character Encodings for Input and Output
776*0Sstevel@tonic-gate
777*0Sstevel@tonic-gateSee L<Encode>.
778*0Sstevel@tonic-gate
779*0Sstevel@tonic-gate=head2 Unicode Regular Expression Support Level
780*0Sstevel@tonic-gate
781*0Sstevel@tonic-gateThe following list of Unicode support for regular expressions describes
782*0Sstevel@tonic-gateall the features currently supported.  The references to "Level N"
783*0Sstevel@tonic-gateand the section numbers refer to the Unicode Technical Report 18,
784*0Sstevel@tonic-gate"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
785*0Sstevel@tonic-gatePerl 5.8.0).
786*0Sstevel@tonic-gate
787*0Sstevel@tonic-gate=over 4
788*0Sstevel@tonic-gate
789*0Sstevel@tonic-gate=item *
790*0Sstevel@tonic-gate
791*0Sstevel@tonic-gateLevel 1 - Basic Unicode Support
792*0Sstevel@tonic-gate
793*0Sstevel@tonic-gate        2.1 Hex Notation                        - done          [1]
794*0Sstevel@tonic-gate            Named Notation                      - done          [2]
795*0Sstevel@tonic-gate        2.2 Categories                          - done          [3][4]
796*0Sstevel@tonic-gate        2.3 Subtraction                         - MISSING       [5][6]
797*0Sstevel@tonic-gate        2.4 Simple Word Boundaries              - done          [7]
798*0Sstevel@tonic-gate        2.5 Simple Loose Matches                - done          [8]
799*0Sstevel@tonic-gate        2.6 End of Line                         - MISSING       [9][10]
800*0Sstevel@tonic-gate
801*0Sstevel@tonic-gate        [ 1] \x{...}
802*0Sstevel@tonic-gate        [ 2] \N{...}
803*0Sstevel@tonic-gate        [ 3] . \p{...} \P{...}
804*0Sstevel@tonic-gate        [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
805*0Sstevel@tonic-gate        [ 5] have negation
806*0Sstevel@tonic-gate        [ 6] can use regular expression look-ahead [a]
807*0Sstevel@tonic-gate             or user-defined character properties [b] to emulate subtraction
808*0Sstevel@tonic-gate        [ 7] include Letters in word characters
809*0Sstevel@tonic-gate        [ 8] note that Perl does Full case-folding in matching, not Simple:
             for example U+1F88 is equivalent with U+1F00 U+03B9,
             not with U+1F80.  This difference matters for certain Greek
812*0Sstevel@tonic-gate             capital letters with certain modifiers: the Full case-folding
813*0Sstevel@tonic-gate             decomposes the letter, while the Simple case-folding would map
814*0Sstevel@tonic-gate             it to a single character.
815*0Sstevel@tonic-gate        [ 9] see UTR #13 Unicode Newline Guidelines
816*0Sstevel@tonic-gate        [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
817*0Sstevel@tonic-gate             (should also affect <>, $., and script line numbers)
818*0Sstevel@tonic-gate             (the \x{85}, \x{2028} and \x{2029} do match \s)
819*0Sstevel@tonic-gate
820*0Sstevel@tonic-gate[a] You can mimic class subtraction using lookahead.
821*0Sstevel@tonic-gateFor example, what UTR #18 might write as
822*0Sstevel@tonic-gate
823*0Sstevel@tonic-gate    [{Greek}-[{UNASSIGNED}]]
824*0Sstevel@tonic-gate
825*0Sstevel@tonic-gatein Perl can be written as:
826*0Sstevel@tonic-gate
827*0Sstevel@tonic-gate    (?!\p{Unassigned})\p{InGreekAndCoptic}
828*0Sstevel@tonic-gate    (?=\p{Assigned})\p{InGreekAndCoptic}
829*0Sstevel@tonic-gate
830*0Sstevel@tonic-gateBut in this particular example, you probably really want
831*0Sstevel@tonic-gate
    \p{Greek}
833*0Sstevel@tonic-gate
834*0Sstevel@tonic-gatewhich will match assigned characters known to be part of the Greek script.
835*0Sstevel@tonic-gate
Also see the Unicode::Regex::Set module, which implements the full
UTR #18 grouping, intersection, union, and removal (subtraction) syntax.
838*0Sstevel@tonic-gate
839*0Sstevel@tonic-gate[b] See L</"User-Defined Character Properties">.
840*0Sstevel@tonic-gate
841*0Sstevel@tonic-gate=item *
842*0Sstevel@tonic-gate
843*0Sstevel@tonic-gateLevel 2 - Extended Unicode Support
844*0Sstevel@tonic-gate
        3.1 Surrogates                          - MISSING       [11]
846*0Sstevel@tonic-gate        3.2 Canonical Equivalents               - MISSING       [12][13]
847*0Sstevel@tonic-gate        3.3 Locale-Independent Graphemes        - MISSING       [14]
848*0Sstevel@tonic-gate        3.4 Locale-Independent Words            - MISSING       [15]
849*0Sstevel@tonic-gate        3.5 Locale-Independent Loose Matches    - MISSING       [16]
850*0Sstevel@tonic-gate
851*0Sstevel@tonic-gate        [11] Surrogates are solely a UTF-16 concept and Perl's internal
852*0Sstevel@tonic-gate             representation is UTF-8.  The Encode module does UTF-16, though.
853*0Sstevel@tonic-gate        [12] see UTR#15 Unicode Normalization
854*0Sstevel@tonic-gate        [13] have Unicode::Normalize but not integrated to regexes
855*0Sstevel@tonic-gate        [14] have \X but at this level . should equal that
856*0Sstevel@tonic-gate        [15] need three classes, not just \w and \W
857*0Sstevel@tonic-gate        [16] see UTR#21 Case Mappings
858*0Sstevel@tonic-gate
859*0Sstevel@tonic-gate=item *
860*0Sstevel@tonic-gate
861*0Sstevel@tonic-gateLevel 3 - Locale-Sensitive Support
862*0Sstevel@tonic-gate
863*0Sstevel@tonic-gate        4.1 Locale-Dependent Categories         - MISSING
        4.2 Locale-Dependent Graphemes          - MISSING       [17][18]
        4.3 Locale-Dependent Words              - MISSING
        4.4 Locale-Dependent Loose Matches      - MISSING
        4.5 Locale-Dependent Ranges             - MISSING

        [17] see UTR#10 Unicode Collation Algorithms
        [18] have Unicode::Collate but not integrated to regexes
871*0Sstevel@tonic-gate
872*0Sstevel@tonic-gate=back
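
The lookahead emulation of class subtraction from note [a] can be tried
out as follows.  This is a sketch under assumptions: it spells the block
as C<\p{Block=Greek_And_Coptic}>, a compound form accepted by newer Perls,
and it relies on U+03A2 being an unassigned (reserved) code point inside
that block.

```perl
use strict;
use warnings;

# Block membership alone: also true for unassigned code points.
my $in_block = qr/\A\p{Block=Greek_And_Coptic}\z/;

# Emulated subtraction: in the block, but not unassigned.
my $assigned_in_block =
    qr/\A(?!\p{Unassigned})\p{Block=Greek_And_Coptic}\z/;

my $alpha    = "\x{03B1}";   # GREEK SMALL LETTER ALPHA (assigned)
my $reserved = "\x{03A2}";   # reserved slot in the block (unassigned)

my $alpha_in_block     = $alpha    =~ $in_block          ? 1 : 0;
my $alpha_assigned     = $alpha    =~ $assigned_in_block ? 1 : 0;
my $reserved_in_block  = $reserved =~ $in_block          ? 1 : 0;
my $reserved_assigned  = $reserved =~ $assigned_in_block ? 1 : 0;

print "alpha: $alpha_in_block/$alpha_assigned, ",
      "reserved: $reserved_in_block/$reserved_assigned\n";
```

The negative lookahead consumes nothing; it only vetoes positions where
the next character is unassigned, so the property that follows it still
matches exactly one character.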
873*0Sstevel@tonic-gate
874*0Sstevel@tonic-gate=head2 Unicode Encodings
875*0Sstevel@tonic-gate
876*0Sstevel@tonic-gateUnicode characters are assigned to I<code points>, which are abstract
877*0Sstevel@tonic-gatenumbers.  To use these numbers, various encodings are needed.
878*0Sstevel@tonic-gate
879*0Sstevel@tonic-gate=over 4
880*0Sstevel@tonic-gate
881*0Sstevel@tonic-gate=item *
882*0Sstevel@tonic-gate
883*0Sstevel@tonic-gateUTF-8
884*0Sstevel@tonic-gate
UTF-8 is a variable-length (1 to 6 bytes; current character allocations
require at most 4 bytes), byte-order independent encoding. For ASCII (and we
887*0Sstevel@tonic-gatereally do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
888*0Sstevel@tonic-gatetransparent.
889*0Sstevel@tonic-gate
890*0Sstevel@tonic-gateThe following table is from Unicode 3.2.
891*0Sstevel@tonic-gate
892*0Sstevel@tonic-gate Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte
893*0Sstevel@tonic-gate
894*0Sstevel@tonic-gate   U+0000..U+007F       00..7F
895*0Sstevel@tonic-gate   U+0080..U+07FF       C2..DF    80..BF
896*0Sstevel@tonic-gate   U+0800..U+0FFF       E0        A0..BF    80..BF
897*0Sstevel@tonic-gate   U+1000..U+CFFF       E1..EC    80..BF    80..BF
898*0Sstevel@tonic-gate   U+D000..U+D7FF       ED        80..9F    80..BF
899*0Sstevel@tonic-gate   U+D800..U+DFFF       ******* ill-formed *******
900*0Sstevel@tonic-gate   U+E000..U+FFFF       EE..EF    80..BF    80..BF
901*0Sstevel@tonic-gate  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
902*0Sstevel@tonic-gate  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
903*0Sstevel@tonic-gate U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
904*0Sstevel@tonic-gate
Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
C<80..8F> in C<U+100000..U+10FFFF>.  The "gaps" are caused by legal
908*0Sstevel@tonic-gateUTF-8 avoiding non-shortest encodings: it is technically possible to
909*0Sstevel@tonic-gateUTF-8-encode a single code point in different ways, but that is
910*0Sstevel@tonic-gateexplicitly forbidden, and the shortest possible encoding should always
911*0Sstevel@tonic-gatebe used.  So that's what Perl does.
912*0Sstevel@tonic-gate
913*0Sstevel@tonic-gateAnother way to look at it is via bits:
914*0Sstevel@tonic-gate
915*0Sstevel@tonic-gate Code Points                    1st Byte   2nd Byte  3rd Byte  4th Byte
916*0Sstevel@tonic-gate
917*0Sstevel@tonic-gate                    0aaaaaaa     0aaaaaaa
918*0Sstevel@tonic-gate            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
919*0Sstevel@tonic-gate            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
920*0Sstevel@tonic-gate  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa
921*0Sstevel@tonic-gate
As you can see, the continuation bytes all begin with C<10>, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.
925*0Sstevel@tonic-gate
926*0Sstevel@tonic-gate=item *
927*0Sstevel@tonic-gate
928*0Sstevel@tonic-gateUTF-EBCDIC
929*0Sstevel@tonic-gate
930*0Sstevel@tonic-gateLike UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
931*0Sstevel@tonic-gate
932*0Sstevel@tonic-gate=item *
933*0Sstevel@tonic-gate
934*0Sstevel@tonic-gateUTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
935*0Sstevel@tonic-gate
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
938*0Sstevel@tonic-gate
939*0Sstevel@tonic-gateUTF-16 is a 2 or 4 byte encoding.  The Unicode code points
940*0Sstevel@tonic-gateC<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
941*0Sstevel@tonic-gatepoints C<U+10000..U+10FFFF> in two 16-bit units.  The latter case is
942*0Sstevel@tonic-gateusing I<surrogates>, the first 16-bit unit being the I<high
943*0Sstevel@tonic-gatesurrogate>, and the second being the I<low surrogate>.
944*0Sstevel@tonic-gate
945*0Sstevel@tonic-gateSurrogates are code points set aside to encode the C<U+10000..U+10FFFF>
946*0Sstevel@tonic-gaterange of Unicode code points in pairs of 16-bit units.  The I<high
947*0Sstevel@tonic-gatesurrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
948*0Sstevel@tonic-gateare the range C<U+DC00..U+DFFF>.  The surrogate encoding is
949*0Sstevel@tonic-gate
	# int() is needed because Perl's "/" is not an integer division
	$hi = int(($uni - 0x10000) / 0x400) + 0xD800;
	$lo = ($uni - 0x10000) % 0x400 + 0xDC00;
952*0Sstevel@tonic-gate
953*0Sstevel@tonic-gateand the decoding is
954*0Sstevel@tonic-gate
955*0Sstevel@tonic-gate	$uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
956*0Sstevel@tonic-gate
957*0Sstevel@tonic-gateIf you try to generate surrogates (for example by using chr()), you
958*0Sstevel@tonic-gatewill get a warning if warnings are turned on, because those code
959*0Sstevel@tonic-gatepoints are not valid for a Unicode character.
960*0Sstevel@tonic-gate
961*0Sstevel@tonic-gateBecause of the 16-bitness, UTF-16 is byte-order dependent.  UTF-16
962*0Sstevel@tonic-gateitself can be used for in-memory computations, but if storage or
963*0Sstevel@tonic-gatetransfer is required either UTF-16BE (big-endian) or UTF-16LE
964*0Sstevel@tonic-gate(little-endian) encodings must be chosen.
965*0Sstevel@tonic-gate
966*0Sstevel@tonic-gateThis introduces another problem: what if you just know that your data
967*0Sstevel@tonic-gateis UTF-16, but you don't know which endianness?  Byte Order Marks, or
968*0Sstevel@tonic-gateBOMs, are a solution to this.  A special character has been reserved
969*0Sstevel@tonic-gatein Unicode to function as a byte order marker: the character with the
970*0Sstevel@tonic-gatecode point C<U+FEFF> is the BOM.
971*0Sstevel@tonic-gate
972*0Sstevel@tonic-gateThe trick is that if you read a BOM, you will know the byte order,
973*0Sstevel@tonic-gatesince if it was written on a big-endian platform, you will read the
974*0Sstevel@tonic-gatebytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
975*0Sstevel@tonic-gateyou will read the bytes C<0xFF 0xFE>.  (And if the originating platform
976*0Sstevel@tonic-gatewas writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
977*0Sstevel@tonic-gate
The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".
983*0Sstevel@tonic-gate
984*0Sstevel@tonic-gate=item *
985*0Sstevel@tonic-gate
986*0Sstevel@tonic-gateUTF-32, UTF-32BE, UTF-32LE
987*0Sstevel@tonic-gate
The UTF-32 family is pretty much like the UTF-16 family, except that
989*0Sstevel@tonic-gatethe units are 32-bit, and therefore the surrogate scheme is not
990*0Sstevel@tonic-gateneeded.  The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
991*0Sstevel@tonic-gateC<0xFF 0xFE 0x00 0x00> for LE.
992*0Sstevel@tonic-gate
993*0Sstevel@tonic-gate=item *
994*0Sstevel@tonic-gate
995*0Sstevel@tonic-gateUCS-2, UCS-4
996*0Sstevel@tonic-gate
997*0Sstevel@tonic-gateEncodings defined by the ISO 10646 standard.  UCS-2 is a 16-bit
998*0Sstevel@tonic-gateencoding.  Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
999*0Sstevel@tonic-gatebecause it does not use surrogates.  UCS-4 is a 32-bit encoding,
1000*0Sstevel@tonic-gatefunctionally identical to UTF-32.
1001*0Sstevel@tonic-gate
1002*0Sstevel@tonic-gate=item *
1003*0Sstevel@tonic-gate
1004*0Sstevel@tonic-gateUTF-7
1005*0Sstevel@tonic-gate
1006*0Sstevel@tonic-gateA seven-bit safe (non-eight-bit) encoding, which is useful if the
1007*0Sstevel@tonic-gatetransport or storage is not eight-bit safe.  Defined by RFC 2152.
1008*0Sstevel@tonic-gate
1009*0Sstevel@tonic-gate=back
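
The UTF-8 bit layout and the surrogate arithmetic described above can be
written out directly.  The following is a minimal sketch, not the code
Perl uses internally: the helper names C<utf8_bytes>, C<to_surrogates>,
and C<from_surrogates> are hypothetical, the input is assumed to be a
valid code point no larger than C<U+10FFFF>, and C<int()> is applied
because Perl's C</> operator does not truncate.

```perl
use strict;
use warnings;

# Encode one code point into UTF-8 byte values, following the
# bit layout table (1 to 4 bytes).
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;
    return (0xC0 | ($cp >> 6),
            0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

# Split a code point in U+10000..U+10FFFF into a surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    my $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
    return ($hi, $lo);
}

# Recombine a surrogate pair into the original code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my @euro  = utf8_bytes(0x20AC);       # 0xE2 0x82 0xAC
my @pair  = to_surrogates(0x1D11E);   # 0xD834 0xDD1E
my $round = from_surrogates(@pair);   # 0x1D11E again

printf "U+20AC  -> %s\n", join " ", map { sprintf "%02X", $_ } @euro;
printf "U+1D11E -> %04X %04X -> U+%X\n", @pair, $round;
```

Because each branch is chosen by the smallest range that contains the
code point, the sketch always produces the shortest encoding, which is
the only form legal UTF-8 permits.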
1010*0Sstevel@tonic-gate
1011*0Sstevel@tonic-gate=head2 Security Implications of Unicode
1012*0Sstevel@tonic-gate
1013*0Sstevel@tonic-gate=over 4
1014*0Sstevel@tonic-gate
1015*0Sstevel@tonic-gate=item *
1016*0Sstevel@tonic-gate
1017*0Sstevel@tonic-gateMalformed UTF-8
1018*0Sstevel@tonic-gate
1019*0Sstevel@tonic-gateUnfortunately, the specification of UTF-8 leaves some room for
1020*0Sstevel@tonic-gateinterpretation of how many bytes of encoded output one should generate
1021*0Sstevel@tonic-gatefrom one input Unicode character.  Strictly speaking, the shortest
1022*0Sstevel@tonic-gatepossible sequence of UTF-8 bytes should be generated,
1023*0Sstevel@tonic-gatebecause otherwise there is potential for an input buffer overflow at
1024*0Sstevel@tonic-gatethe receiving end of a UTF-8 connection.  Perl always generates the
1025*0Sstevel@tonic-gateshortest length UTF-8, and with warnings on Perl will warn about
1026*0Sstevel@tonic-gatenon-shortest length UTF-8 along with other malformations, such as the
1027*0Sstevel@tonic-gatesurrogates, which are not real Unicode code points.
1028*0Sstevel@tonic-gate
1029*0Sstevel@tonic-gate=item *
1030*0Sstevel@tonic-gate
1031*0Sstevel@tonic-gateRegular expressions behave slightly differently between byte data and
1032*0Sstevel@tonic-gatecharacter (Unicode) data.  For example, the "word character" character
class C<\w> will work differently depending on whether the data is eight-bit bytes
1034*0Sstevel@tonic-gateor Unicode.
1035*0Sstevel@tonic-gate
1036*0Sstevel@tonic-gateIn the first case, the set of C<\w> characters is either small--the
1037*0Sstevel@tonic-gatedefault set of alphabetic characters, digits, and the "_"--or, if you
1038*0Sstevel@tonic-gateare using a locale (see L<perllocale>), the C<\w> might contain a few
1039*0Sstevel@tonic-gatemore letters according to your language and country.
1040*0Sstevel@tonic-gate
1041*0Sstevel@tonic-gateIn the second case, the C<\w> set of characters is much, much larger.
1042*0Sstevel@tonic-gateMost importantly, even in the set of the first 256 characters, it will
1043*0Sstevel@tonic-gateprobably match different characters: unlike most locales, which are
1044*0Sstevel@tonic-gatespecific to a language and country pair, Unicode classifies all the
1045*0Sstevel@tonic-gatecharacters that are letters I<somewhere> as C<\w>.  For example, your
1046*0Sstevel@tonic-gatelocale might not think that LATIN SMALL LETTER ETH is a letter (unless
1047*0Sstevel@tonic-gateyou happen to speak Icelandic), but Unicode does.
1048*0Sstevel@tonic-gate
1049*0Sstevel@tonic-gateAs discussed elsewhere, Perl has one foot (two hooves?) planted in
1050*0Sstevel@tonic-gateeach of two worlds: the old world of bytes and the new world of
1051*0Sstevel@tonic-gatecharacters, upgrading from bytes to characters when necessary.
1052*0Sstevel@tonic-gateIf your legacy code does not explicitly use Unicode, no automatic
1053*0Sstevel@tonic-gateswitch-over to characters should happen.  Characters shouldn't get
1054*0Sstevel@tonic-gatedowngraded to bytes, either.  It is possible to accidentally mix bytes
1055*0Sstevel@tonic-gateand characters, however (see L<perluniintro>), in which case C<\w> in
1056*0Sstevel@tonic-gateregular expressions might start behaving differently.  Review your
1057*0Sstevel@tonic-gatecode.  Use warnings and the C<strict> pragma.
1058*0Sstevel@tonic-gate
1059*0Sstevel@tonic-gate=back
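
The C<\w> difference can be observed with LATIN SMALL LETTER ETH, which
has code point 0xF0.  The sketch below assumes that no locale is in
effect and that nothing has switched the whole program to Unicode string
semantics; C<utf8::upgrade()> is used only to mark the copy as character
data without changing its contents.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER ETH, held as a single byte.
my $as_bytes = "\xF0";

# The same character, now marked as Unicode character data.
my $as_chars = $as_bytes;
utf8::upgrade($as_chars);

# The two strings still compare equal ...
my $equal = $as_bytes eq $as_chars ? 1 : 0;

# ... but \w can treat them differently.
my $bytes_match = $as_bytes =~ /\w/ ? 1 : 0;
my $chars_match = $as_chars =~ /\w/ ? 1 : 0;

print "equal: $equal, \\w on bytes: $bytes_match, ",
      "\\w on characters: $chars_match\n";
```

Under those assumptions the byte string is not expected to match C<\w>
while the character string is, even though C<eq> considers the two
strings identical.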
1060*0Sstevel@tonic-gate
1061*0Sstevel@tonic-gate=head2 Unicode in Perl on EBCDIC
1062*0Sstevel@tonic-gate
1063*0Sstevel@tonic-gateThe way Unicode is handled on EBCDIC platforms is still
1064*0Sstevel@tonic-gateexperimental.  On such platforms, references to UTF-8 encoding in this
1065*0Sstevel@tonic-gatedocument and elsewhere should be read as meaning the UTF-EBCDIC
1066*0Sstevel@tonic-gatespecified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
1067*0Sstevel@tonic-gateare specifically discussed. There is no C<utfebcdic> pragma or
1068*0Sstevel@tonic-gate":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
1069*0Sstevel@tonic-gatethe platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1070*0Sstevel@tonic-gatefor more discussion of the issues.
1071*0Sstevel@tonic-gate
1072*0Sstevel@tonic-gate=head2 Locales
1073*0Sstevel@tonic-gate
1074*0Sstevel@tonic-gateUsually locale settings and Unicode do not affect each other, but
1075*0Sstevel@tonic-gatethere are a couple of exceptions:
1076*0Sstevel@tonic-gate
1077*0Sstevel@tonic-gate=over 4
1078*0Sstevel@tonic-gate
1079*0Sstevel@tonic-gate=item *
1080*0Sstevel@tonic-gate
1081*0Sstevel@tonic-gateYou can enable automatic UTF-8-ification of your standard file
1082*0Sstevel@tonic-gatehandles, default C<open()> layer, and C<@ARGV> by using either
1083*0Sstevel@tonic-gatethe C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.
1085*0Sstevel@tonic-gate
1086*0Sstevel@tonic-gate=item *
1087*0Sstevel@tonic-gate
1088*0Sstevel@tonic-gatePerl tries really hard to work both with Unicode and the old
1089*0Sstevel@tonic-gatebyte-oriented world. Most often this is nice, but sometimes Perl's
1090*0Sstevel@tonic-gatestraddling of the proverbial fence causes problems.
1091*0Sstevel@tonic-gate
1092*0Sstevel@tonic-gate=back
1093*0Sstevel@tonic-gate
1094*0Sstevel@tonic-gate=head2 When Unicode Does Not Happen
1095*0Sstevel@tonic-gate
1096*0Sstevel@tonic-gateWhile Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like C<@ARGV> which can be interpreted
1098*0Sstevel@tonic-gateas Unicode (UTF-8), there still are many places where Unicode (in some
1099*0Sstevel@tonic-gateencoding or another) could be given as arguments or received as
1100*0Sstevel@tonic-gateresults, or both, but it is not.
1101*0Sstevel@tonic-gate
1102*0Sstevel@tonic-gateThe following are such interfaces.  For all of these interfaces Perl
1103*0Sstevel@tonic-gatecurrently (as of 5.8.3) simply assumes byte strings both as arguments
1104*0Sstevel@tonic-gateand results, or UTF-8 strings if the C<encoding> pragma has been used.
1105*0Sstevel@tonic-gate
1106*0Sstevel@tonic-gateOne reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept.  Similarly for C<qx> and C<system>: how well will the
'command line interface' (and which of them?) handle Unicode?
1112*0Sstevel@tonic-gate
1113*0Sstevel@tonic-gate=over 4
1114*0Sstevel@tonic-gate
1115*0Sstevel@tonic-gate=item *
1116*0Sstevel@tonic-gate
chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
1118*0Sstevel@tonic-gaterename, rmdir, stat, symlink, truncate, unlink, utime, -X
1119*0Sstevel@tonic-gate
1120*0Sstevel@tonic-gate=item *
1121*0Sstevel@tonic-gate
1122*0Sstevel@tonic-gate%ENV
1123*0Sstevel@tonic-gate
1124*0Sstevel@tonic-gate=item *
1125*0Sstevel@tonic-gate
1126*0Sstevel@tonic-gateglob (aka the <*>)
1127*0Sstevel@tonic-gate
1128*0Sstevel@tonic-gate=item *
1129*0Sstevel@tonic-gate
1130*0Sstevel@tonic-gateopen, opendir, sysopen
1131*0Sstevel@tonic-gate
1132*0Sstevel@tonic-gate=item *
1133*0Sstevel@tonic-gate
1134*0Sstevel@tonic-gateqx (aka the backtick operator), system
1135*0Sstevel@tonic-gate
1136*0Sstevel@tonic-gate=item *
1137*0Sstevel@tonic-gate
1138*0Sstevel@tonic-gatereaddir, readlink
1139*0Sstevel@tonic-gate
1140*0Sstevel@tonic-gate=back
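
Since these interfaces work with byte strings, one possible approach is
to convert names explicitly with the L<Encode> module before passing them
in, and after getting them back.  The sketch below only computes the byte
form and does not touch the file system; the file name is hypothetical,
and the choice of UTF-8 is an assumption about what the file system
expects.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical file name containing U+00E9 (e with acute accent).
my $name = "caf\x{E9}.txt";

# Pick the encoding explicitly; UTF-8 is an assumption about the
# file system, not something Perl determines for you.
my $name_bytes = encode("UTF-8", $name);

# $name_bytes is what would be handed to open(), stat(), unlink(), ...
# A name returned by readdir() or glob() would be decoded the same way.
my $roundtrip = decode("UTF-8", $name_bytes);

printf "%d characters, %d bytes\n", length($name), length($name_bytes);
```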
1141*0Sstevel@tonic-gate
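Because Perl passes plain byte strings to these interfaces, one way to
stay in control is to encode names yourself before they reach the
operating system and to decode what comes back.  The following is a
minimal sketch that assumes the file system stores names as UTF-8
octets, which is common on Unix-like systems but not guaranteed.  The
helpers C<name_to_octets> and C<octets_to_name> are hypothetical names
made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Assumption for this sketch: the file system stores names as UTF-8 octets.
my $FS_ENCODING = 'UTF-8';

# Hypothetical helpers (not part of Perl): convert between character
# strings and the octets handed to open, opendir, readdir, and friends.
sub name_to_octets { return encode($FS_ENCODING, $_[0]) }
sub octets_to_name { return decode($FS_ENCODING, $_[0]) }

my $dir  = tempdir(CLEANUP => 1);
my $name = "price-\x{20AC}.txt";          # a character string

# Encode explicitly before the name crosses into the operating system.
my $path = "$dir/" . name_to_octets($name);
open(my $fh, '>', $path) or die "open: $!";
close($fh) or die "close: $!";

# readdir returns octets; decode them to get characters back.
opendir(my $dh, $dir) or die "opendir: $!";
my @names = map { octets_to_name($_) } grep { !/^\.\.?$/ } readdir($dh);
closedir($dh);
```

Keeping the conversion in two small functions means only one place has
to change if the assumption about the file system turns out to be wrong.
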
=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen">) there are
situations where you simply need to force Perl to believe that a byte
string is UTF-8, or vice versa.  The low-level calls
C<utf8::upgrade($bytestring)> and C<utf8::downgrade($utf8string)> are
the answers.

Do not use them without careful thought, though: Perl may easily get
very confused, angry, or even crash, if you suddenly change the 'nature'
of a scalar like that.  You have to be especially careful if you use
C<utf8::upgrade()>: not every byte string is valid UTF-8.

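The effect of the two calls can be observed from Perl itself.  The
sketch below assumes a Perl that provides C<utf8::is_utf8()> (5.8.1 or
later) and uses it only to peek at the internal flag.  It also shows
the optional second argument of C<utf8::downgrade()>, which makes the
call return false instead of dying when a character does not fit into
a byte.

```perl
use strict;
use warnings;

my $s    = "caf\xE9";              # four characters, stored as four bytes
my $copy = $s;

utf8::upgrade($s);                 # switch to the UTF-8 internal form
# The flag is now on, yet no character is above 255 and the string
# still compares equal to the untouched copy.
my $flag_after_upgrade = utf8::is_utf8($s) ? 1 : 0;
my $same_after_upgrade = ($s eq $copy)     ? 1 : 0;

utf8::downgrade($s);               # back to one byte per character
my $flag_after_downgrade = utf8::is_utf8($s) ? 1 : 0;

# A character above 255 cannot be downgraded.  With a true second
# argument, utf8::downgrade returns false instead of dying.
my $wide       = "\x{263A}";
my $downgraded = utf8::downgrade($wide, 1) ? 1 : 0;
```
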
=head2 Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the
following C APIs useful.  See also L<perlguts/"Unicode Support"> for an
explanation about Unicode at the XS level, and L<perlapi> for the API
details.

=over 4

=item *

C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect.  C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored.  The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar.  What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string.  The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string.  Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.

=item *

C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes.

=item *

C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
returns the Unicode character code point and, optionally, the length of
the UTF-8 byte sequence.

=item *

C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
in characters.  C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
scalar, also in characters.

=item *

C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form.  C<sv_utf8_downgrade(sv)> does the opposite, if
possible.  C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag.  C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>.  Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that.  C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).

=item *

C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
character.

=item *

C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
are valid UTF-8.

=item *

C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer.  C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point.  C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C<utf8_distance(a, b)> will tell the distance in characters between the
two pointers pointing to the same UTF-8 encoded buffer.

=item *

C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>.  Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars.  By default they are useful
only for debugging--they display B<all> characters as hexadecimal code
points--but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=item *

C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode.  For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.

=back

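To make the idea behind C<UTF8SKIP()> and C<UNISKIP()> concrete, here
is a rough Perl-level analogue.  It is only an illustration of the
technique, not the implementation found in F<utf8.h>: the functions
C<utf8_skip>, C<uni_skip>, and C<char_lengths> are hypothetical, and
they cover standard UTF-8 sequences of one to four bytes only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical analogue of UTF8SKIP: number of bytes in the sequence
# that starts with this lead byte (standard UTF-8 only).
sub utf8_skip {
    my ($lead) = @_;               # lead byte as an integer 0..255
    return 1 if $lead < 0x80;
    return 2 if $lead >= 0xC0 && $lead < 0xE0;
    return 3 if $lead >= 0xE0 && $lead < 0xF0;
    return 4 if $lead >= 0xF0 && $lead < 0xF8;
    die sprintf("not a UTF-8 lead byte: 0x%02X", $lead);
}

# Hypothetical analogue of UNISKIP: bytes needed to encode a code point.
sub uni_skip {
    my ($cp) = @_;
    return $cp < 0x80 ? 1 : $cp < 0x800 ? 2 : $cp < 0x10000 ? 3 : 4;
}

# Walk a buffer of UTF-8 octets one character at a time.
sub char_lengths {
    my ($octets) = @_;
    my @lengths;
    my $pos = 0;
    while ($pos < length($octets)) {
        my $n = utf8_skip(ord(substr($octets, $pos, 1)));
        push @lengths, $n;
        $pos += $n;
    }
    return @lengths;
}

my @cps     = (0x41, 0xE9, 0x20AC, 0x1F600);
my $octets  = encode('UTF-8', join('', map { chr($_) } @cps));
my @lengths = char_lengths($octets);          # derived from the bytes
my @needed  = map { uni_skip($_) } @cps;      # derived from the code points
```

Both lists agree: the lead byte alone tells how far to hop, and the
code point alone tells how much buffer space to reserve.
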
For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

=head1 BUGS

=head2 Interaction with Locales

Use of locales with Unicode data may lead to odd results.  Currently,
Perl attempts to attach 8-bit locale info to characters in the range
0..255, but this technique is demonstrably incorrect for locales that
use characters above that range when mapped into Unicode.  Perl's
Unicode support will also tend to run slower.  Use of locales with
Unicode is discouraged.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF-8 flag and act accordingly. If the
extension doesn't know about the flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange. If the documentation does not talk about Unicode
at all, suspect the worst and probably look at the source to learn how
the module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular C<Foo::Bar::escape_html>
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    sub my_escape_html ($) {
      my($what) = shift;
      return unless defined $what;
      require Encode;   # load Encode on first use
      Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
    }

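Since C<Foo::Bar> is only a made-up name, the wrapper cannot be run as
it stands.  The following sketch substitutes a hypothetical
byte-oriented routine, C<bytes_escape_html>, for the extension so that
the round trip from characters to octets and back can be tried out.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for Foo::Bar::escape_html: a byte-oriented
# routine that expects octets and returns octets.
sub bytes_escape_html {
    my ($octets) = @_;
    my %ent = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $octets =~ s/([&<>"])/$ent{$1}/g;
    return $octets;
}

sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    # characters -> UTF-8 octets -> extension -> octets -> characters
    return Encode::decode_utf8(bytes_escape_html(Encode::encode_utf8($what)));
}

my $escaped = my_escape_html("<b>\x{263A} & caf\x{E9}</b>");
```

The non-ASCII characters pass through untouched because the bytes of a
multi-byte UTF-8 sequence never collide with the ASCII characters being
escaped.
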
Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be in a position to use the otherwise
dangerous C<Encode::_utf8_on()> function. Let's say the popular
C<Foo::Bar> extension, written in C, provides a C<param> method that
lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }

Some extensions provide filters on data entry/exit points, such as
C<DB_File::filter_store_key> and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings.  All functions that need to hop over
characters, such as length(), substr() or index(), or matching regular
expressions, can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations.  In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
was expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 4

=item *

A filehandle that should read or write UTF-8

  if ($] > 5.007) {
    binmode $fh, ":utf8";
  }

=item *

A scalar that is going to be passed to some extension

Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF-8 flag is stripped off. Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.

  if ($] > 5.007) {
    require Encode;
    $val = Encode::encode_utf8($val); # make octets
  }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF-8 flag restored:

  if ($] > 5.007) {
    require Encode;
    $val = Encode::decode_utf8($val);
  }

=item *

Same thing, if you are really sure it is UTF-8

  if ($] > 5.007) {
    require Encode;
    Encode::_utf8_on($val);
  }

=item *

A wrapper for fetchrow_array and fetchrow_hashref

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your fetchrow_array and
fetchrow_hashref calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (October 2002), the DBI has no standardized way
to deal with UTF-8 data. Please check the documentation to verify if
that is still true.

  sub fetchrow {
    my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
    if ($] < 5.007) {
      return $sth->$what;
    } else {
      require Encode;
      if (wantarray) {
        my @arr = $sth->$what;
        for (@arr) {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_);
        }
        return @arr;
      } else {
        my $ret = $sth->$what;
        if (ref $ret) {
          for my $k (keys %$ret) {
            defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
          }
          return $ret;
        } else {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
          return $ret;
        }
      }
    }
  }

=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag on your program. If you recognize such a situation, just remove
the UTF-8 flag:

  utf8::downgrade($val) if $] > 5.007;

=back

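As a variation on the database wrapper above, the sketch below decodes
each column with C<Encode::decode_utf8()> rather than switching the flag
on with C<Encode::_utf8_on()>.  This is a swapped-in technique, not the
one shown in the list: it costs a little more but does not take the
octets on trust.  The C<MockSth> class is a hypothetical stand-in for a
DBI statement handle, assumed to return UTF-8 octets without the UTF-8
flag, and C<fetchrow_decoded> is likewise an invented name.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for a DBI statement handle whose driver returns
# UTF-8 octets without setting the UTF-8 flag.
package MockSth;
sub new { my ($class, @rows) = @_; return bless { rows => [@rows] }, $class }
sub fetchrow_array {
    my ($self) = @_;
    my $row = shift @{ $self->{rows} } or return;
    return @$row;
}

package main;

# Decode every defined column of the next row; undef (NULL) stays undef.
sub fetchrow_decoded {
    my ($sth) = @_;
    my @row = $sth->fetchrow_array;
    return map { defined($_) ? Encode::decode_utf8($_) : undef } @row;
}

my $sth = MockSth->new([ Encode::encode_utf8("na\x{EF}ve"), undef, "42" ]);
my @row = fetchrow_decoded($sth);
```
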
=head1 SEE ALSO

L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">

=cut