package utf8;

$utf8::hint_bits = 0x00800000;

our $VERSION = '1.22';

sub import {
    $^H |= $utf8::hint_bits;
}

sub unimport {
    $^H &= ~$utf8::hint_bits;
}

sub AUTOLOAD {
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    require Carp;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

1;
__END__

=head1 NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

=head1 SYNOPSIS

 use utf8;
 no utf8;

 # Convert the internal representation of a Perl scalar to/from UTF-8.

 $num_octets = utf8::upgrade($string);
 $success    = utf8::downgrade($string[, $fail_ok]);

 # Change each character of a Perl scalar to/from a series of
 # characters that represent the UTF-8 bytes of each original character.

 utf8::encode($string);  # "\x{100}"  becomes "\xc4\x80"
 utf8::decode($string);  # "\xc4\x80" becomes "\x{100}"

 # Convert a code point from the platform native character set to
 # Unicode, and vice-versa.
 $unicode = utf8::native_to_unicode(ord('A')); # returns 65 on both
                                               # ASCII and EBCDIC
                                               # platforms
 $native = utf8::unicode_to_native(65);        # returns 65 on ASCII
                                               # platforms; 193 on
                                               # EBCDIC

 $flag = utf8::is_utf8($string); # since Perl 5.8.1
 $flag = utf8::valid($string);

=head1 DESCRIPTION

The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope.  The C<no utf8> pragma tells Perl
to switch back to treating the source text as literal bytes in the current
lexical scope.  (On EBCDIC platforms, technically it is allowing UTF-EBCDIC,
and not UTF-8, but this distinction is academic, so in this document the term
UTF-8 is used to mean both).

B<Do not use this pragma for anything other than telling Perl that your
script is written in UTF-8.> The utility functions described below are
directly usable without C<use utf8;>.

Because it is not possible to reliably tell UTF-8 from native 8-bit
encodings, you need either a Byte Order Mark at the beginning of your
source code, or C<use utf8;>, to instruct perl.

When UTF-8 becomes the standard source format, this pragma will
effectively become a no-op.

See also the effects of the C<-C> switch and its cousin, the
C<PERL_UNICODE> environment variable, in L<perlrun>.

Enabling the C<utf8> pragma has the following effect:

=over 4

=item *

Bytes in the source text that are not in the ASCII character set will be
treated as being part of a literal UTF-8 sequence.  This includes most
literals such as identifier names, string constants, and constant
regular expression patterns.

=back

Note that if you have non-ASCII, non-UTF-8 bytes in your script (for example
embedded Latin-1 in your string literals), C<use utf8> will be unhappy,
since those bytes are malformed as UTF-8.  If you want to have such bytes
under C<use utf8>, you can disable this pragma until the end of the block
(or file, if at top level) by C<no utf8;>.

=head2 Utility functions

The following functions are defined in the C<utf8::> package by the
Perl core.  You do not need to say C<use utf8> to use these and in fact
you should not say that unless you really want to have UTF-8 source code.

=over 4

=item * C<$num_octets = utf8::upgrade($string)>

(Since Perl v5.8.0)
Converts in-place the internal representation of the string from an octet
sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8.  The
logical character sequence itself is unchanged.  If I<$string> is already
upgraded, then this is a no-op.  Returns the
number of octets necessary to represent the string as UTF-8.
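
As a small illustration of the return value, the following sketch
upgrades a four-character string.  The variable names are made up for
the example, and the octet count shown assumes an ASCII platform:

    my $s = "caf\xE9";                   # 4 characters, 1 non-ASCII
    my $num_octets = utf8::upgrade($s);
    # $num_octets is 5: "\xE9" takes two octets in UTF-8
    # length($s) is still 4, since the characters are unchanged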

If your code needs to be compatible with versions of perl without
C<use feature 'unicode_strings';>, you can force Unicode semantics on
a given string:

    # force unicode semantics for $string without the
    # "unicode_strings" feature
    utf8::upgrade($string);

For example:

    # without explicit or implicit use feature 'unicode_strings'
    my $x = "\xDF";    # LATIN SMALL LETTER SHARP S
    $x =~ /ss/i;       # won't match
    my $y = uc($x);    # won't convert
    utf8::upgrade($x);
    $x =~ /ss/i;       # matches
    my $z = uc($x);    # converts to "SS"

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.

=item * C<$success = utf8::downgrade($string[, $fail_ok])>

(Since Perl v5.8.0)
Converts in-place the internal representation of the string from UTF-8 to the
equivalent octet sequence in the native encoding (Latin-1 or EBCDIC).  The
logical character sequence itself is unchanged.  If I<$string> is already
stored as native 8-bit, then this is a no-op.  Can be used to make sure that
the UTF-8 flag is off, e.g. when you want to make sure that the substr() or
length() function works with the usually faster byte algorithm.

Fails if the original UTF-8 sequence cannot be represented in the
native 8-bit encoding.  On failure dies or, if the value of I<$fail_ok> is
true, returns false.

Returns true on success.

If your code expects an octet sequence this can be used to validate
that you've received one:

    # throw an exception if not representable as octets
    utf8::downgrade($string);

    # or do your own error handling
    utf8::downgrade($string, 1) or die "string must be octets";

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.
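
A minimal sketch of the success and failure cases described above,
using made-up variable names and nothing beyond the core functions:

    my $narrow = "\xE9";            # fits in the native 8-bit encoding
    utf8::upgrade($narrow);
    my $ok = utf8::downgrade($narrow, 1);      # true; flag is now off

    my $wide = "\x{100}";           # cannot be stored as a single octet
    my $failed = !utf8::downgrade($wide, 1);   # true; $wide is unchanged

    # Without $fail_ok the same failure is an exception instead
    my $died = !eval { utf8::downgrade($wide); 1 };   # true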

=item * C<utf8::encode($string)>

(Since Perl v5.8.0)
Converts in-place the character sequence to the corresponding octet
sequence in Perl's extended UTF-8.  That is, every (possibly wide) character
gets replaced with a sequence of one or more characters that represent the
individual UTF-8 bytes of the character.  The UTF8 flag is turned off.
Returns nothing.

    my $x = "\x{100}"; # $x contains one character, with ord 0x100
    utf8::encode($x);  # $x contains two characters, with ords (on
                       # ASCII platforms) 0xc4 and 0x80.  On EBCDIC
                       # 1047, this would instead be 0x8C and 0x41.

Similar to:

    use Encode;
    $x = Encode::encode("utf8", $x);

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.

=item * C<$success = utf8::decode($string)>

(Since Perl v5.8.0)
Attempts to convert in-place the octet sequence encoded in Perl's extended
UTF-8 to the corresponding character sequence.  That is, it replaces each
sequence of characters in the string whose ords represent a valid (extended)
UTF-8 byte sequence, with the corresponding single character.  The UTF-8 flag
is turned on only if the source string contains multiple-byte UTF-8
characters.  If I<$string> is invalid as extended UTF-8, returns false;
otherwise returns true.

    my $x = "\xc4\x80"; # $x contains two characters, with ords
                        # 0xc4 and 0x80
    utf8::decode($x);   # On ASCII platforms, $x contains one char,
                        # with ord 0x100.   Since these bytes aren't
                        # legal UTF-EBCDIC, on EBCDIC platforms, $x is
                        # unchanged and the function returns FALSE.

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.
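
C<utf8::encode> and C<utf8::decode> undo each other, which the
following sketch illustrates.  The variable names are hypothetical,
and the octet values in the comments assume an ASCII platform:

    my $text = "\x{100}\x{263A}";     # two wide characters
    my $copy = $text;

    utf8::encode($copy);              # now 5 octets: 2 + 3
    my $octets = length($copy);

    my $ok = utf8::decode($copy);     # true; $copy eq $text again

    # decode() returns false for input that is not valid UTF-8
    my $bad = "\x80";                 # a lone continuation byte
    my $bad_ok = utf8::decode($bad);  # false; $bad is unchanged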

=item * C<$unicode = utf8::native_to_unicode($code_point)>

(Since Perl v5.8.0)
This takes an unsigned integer (which represents the ordinal number of a
character (or a code point) on the platform the program is being run on) and
returns its Unicode equivalent value.  Since ASCII platforms natively use the
Unicode code points, this function returns its input on them.  On EBCDIC
platforms it converts from EBCDIC to Unicode.

A meaningless value will currently be returned if the input is not an unsigned
integer.

Since Perl v5.22.0, calls to this function are optimized out on ASCII
platforms, so there is no performance hit in using it there.

=item * C<$native = utf8::unicode_to_native($code_point)>

(Since Perl v5.8.0)
This is the inverse of C<utf8::native_to_unicode()>, converting the other
direction.  Again, on ASCII platforms, this returns its input, but on EBCDIC
platforms it will find the native platform code point, given any Unicode one.

A meaningless value will currently be returned if the input is not an unsigned
integer.

Since Perl v5.22.0, calls to this function are optimized out on ASCII
platforms, so there is no performance hit in using it there.

=item * C<$flag = utf8::is_utf8($string)>

(Since Perl 5.8.1)  Test whether I<$string> is marked internally as encoded in
UTF-8.  Functionally the same as C<Encode::is_utf8($string)>.

Typically only necessary for debugging and testing.  If you need to
dump the internals of an SV, L<Devel::Peek's|Devel::Peek> Dump()
provides more detail in a compact form.

If you still think you need this outside of debugging, testing or
dealing with filenames, you should probably read L<perlunitut> and
L<perlunifaq/What is "the UTF8 flag"?>.

Don't use this flag as a marker to distinguish character and binary
data: that should be decided for each variable when you write your
code.
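
When debugging, the flag can be observed with a sketch like the
following (the variable names are made up).  The flag differs between
the two scalars while the strings themselves still compare equal,
which is why it is a poor marker for the kind of data held:

    my $s = "\xE9";
    my $before = utf8::is_utf8($s);     # false: one native octet
    my $copy = $s;
    utf8::upgrade($copy);
    my $after = utf8::is_utf8($copy);   # true: stored as UTF-8
    # $s eq $copy is still true, and length($copy) is still 1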

To force Unicode semantics in code portable to perl 5.8 and 5.10, call
C<utf8::upgrade($string)> unconditionally.

=item * C<$flag = utf8::valid($string)>

[INTERNAL] Test whether I<$string> is in a consistent state regarding
UTF-8.  Will return true if it is well-formed Perl extended UTF-8 and has the
UTF-8 flag on B<or> if I<$string> is held as bytes (both these states are
'consistent').  The main reason for this routine is to allow Perl's test
suite to check that operations have left strings in a consistent state.

=back

C<utf8::encode> is like C<utf8::upgrade>, but the UTF8 flag is
cleared.  See L<perlunicode>, and the C API
functions C<L<sv_utf8_upgrade|perlapi/sv_utf8_upgrade>>,
C<L<perlapi/sv_utf8_downgrade>>, C<L<perlapi/sv_utf8_encode>>,
and C<L<perlapi/sv_utf8_decode>>, which are wrapped by the Perl functions
C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
C<utf8::decode>.  Also, the functions C<utf8::is_utf8>, C<utf8::valid>,
C<utf8::encode>, C<utf8::decode>, C<utf8::upgrade>, and C<utf8::downgrade> are
actually internal, and thus always available, without a C<require utf8>
statement.

=head1 BUGS

Some filesystems may not support UTF-8 file names, or they may be supported
incompatibly with Perl.  Therefore UTF-8 names that are visible to the
filesystem, such as module names, may not work.

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perlrun>, L<bytes>, L<perlunicode>

=cut