package utf8;

$utf8::hint_bits = 0x00800000;

our $VERSION = '1.19';

sub import {
    $^H |= $utf8::hint_bits;
}

sub unimport {
    $^H &= ~$utf8::hint_bits;
}

sub AUTOLOAD {
    require "utf8_heavy.pl";
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    require Carp;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

1;
__END__

=head1 NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

=head1 SYNOPSIS

 use utf8;
 no utf8;

 # Convert the internal representation of a Perl scalar to/from UTF-8.

 $num_octets = utf8::upgrade($string);
 $success    = utf8::downgrade($string[, $fail_ok]);

 # Change each character of a Perl scalar to/from a series of
 # characters that represent the UTF-8 bytes of each original character.

 utf8::encode($string);  # "\x{100}"  becomes "\xc4\x80"
 utf8::decode($string);  # "\xc4\x80" becomes "\x{100}"

 # Convert a code point from the platform native character set to
 # Unicode, and vice-versa.
 $unicode = utf8::native_to_unicode(ord('A')); # returns 65 on both
                                               # ASCII and EBCDIC
                                               # platforms
 $native = utf8::unicode_to_native(65);        # returns 65 on ASCII
                                               # platforms; 193 on
                                               # EBCDIC

 $flag = utf8::is_utf8($string); # since Perl 5.8.1
 $flag = utf8::valid($string);

=head1 DESCRIPTION

The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope.  The C<no utf8> pragma tells Perl
to switch back to treating the source text as literal bytes in the current
lexical scope.  (On EBCDIC platforms, technically it is allowing UTF-EBCDIC,
and not UTF-8, but this distinction is academic, so in this document the term
UTF-8 is used to mean both.)

B<Do not use this pragma for anything other than telling Perl that your
script is written in UTF-8.> The utility functions described below are
directly usable without C<use utf8;>.
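
As a minimal sketch of that last point, the following script never loads
the pragma yet still calls C<utf8::is_utf8>.  The variable names are
illustrative only.

```perl
use strict;
use warnings;
# Deliberately no "use utf8;": the source below is plain ASCII.

# A character above 0xFF cannot be held as a single native octet, so
# Perl stores this string internally as UTF-8.
my $string = "\x{100}";

# Normalize the boolean result to 1 or 0 for printing.
my $flag = utf8::is_utf8($string) ? 1 : 0;
print "internally UTF-8: $flag\n";
```

The pragma is only needed when the source file itself contains UTF-8 text,
which this script does not.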

Because it is not possible to reliably tell UTF-8 from native 8-bit
encodings, you need either a Byte Order Mark at the beginning of your
source code, or C<use utf8;>, to instruct perl.

When UTF-8 becomes the standard source format, this pragma will
effectively become a no-op.

See also the effects of the C<-C> switch and its cousin, the
C<PERL_UNICODE> environment variable, in L<perlrun>.

Enabling the C<utf8> pragma has the following effect:

=over 4

=item *

Bytes in the source text that are not in the ASCII character set will be
treated as being part of a literal UTF-8 sequence.  This includes most
literals such as identifier names, string constants, and constant
regular expression patterns.

=back

Note that if you have non-ASCII, non-UTF-8 bytes in your script (for example
embedded Latin-1 in your string literals), C<use utf8> will be unhappy.  If
you want to have such bytes under C<use utf8>, you can disable this pragma
until the end of the block (or file, if at top level) by C<no utf8;>.

=head2 Utility functions

The following functions are defined in the C<utf8::> package by the
Perl core.  You do not need to say C<use utf8> to use these and in fact
you should not say that unless you really want to have UTF-8 source code.

=over 4

=item * C<$num_octets = utf8::upgrade($string)>

(Since Perl v5.8.0)
Converts in-place the internal representation of the string from an octet
sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8.  The
logical character sequence itself is unchanged.  If I<$string> is already
stored as UTF-8, then this is a no-op.  Returns the
number of octets necessary to represent the string as UTF-8.  Can be
used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
work as Unicode on strings containing non-ASCII characters whose code points
are below 256.
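
The following sketch illustrates the behaviour described above on an ASCII
platform.  The sample string and the variable names are illustrative
assumptions.

```perl
use strict;
use warnings;

# On ASCII platforms "\xE9" is LATIN SMALL LETTER E WITH ACUTE.  Without
# "use utf8", Perl initially stores this one-character string as a
# single native octet, with the UTF-8 flag off.
my $string      = "\xE9";
my $flag_before = utf8::is_utf8($string) ? 1 : 0;

# Upgrade in place.  The return value is the number of octets the
# string occupies once it is stored as UTF-8.
my $num_octets = utf8::upgrade($string);
my $flag_after = utf8::is_utf8($string) ? 1 : 0;

printf "flag before: %d, flag after: %d\n", $flag_before, $flag_after;
printf "octets: %d, characters: %d, ord: 0x%X\n",
    $num_octets, length($string), ord($string);
```

On an ASCII platform the string is still one character with ord 0xE9 after
the call.  Only its internal storage has grown, from one octet to two.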

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.

=item * C<$success = utf8::downgrade($string[, $fail_ok])>

(Since Perl v5.8.0)
Converts in-place the internal representation of the string from
UTF-8 to the equivalent octet sequence in the native encoding (Latin-1
or EBCDIC).  The logical character sequence itself is unchanged.  If
I<$string> is already stored as native 8-bit, then this is a no-op.  Can
be used to make sure that the UTF-8 flag is off, e.g. when you want to
make sure that the substr() or length() function works with the usually
faster byte algorithm.

Fails if the original UTF-8 sequence cannot be represented in the
native 8-bit encoding.  On failure dies or, if the value of I<$fail_ok> is
true, returns false.

Returns true on success.

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.

=item * C<utf8::encode($string)>

(Since Perl v5.8.0)
Converts in-place the character sequence to the corresponding octet
sequence in UTF-8.  That is, every (possibly wide) character gets
replaced with a sequence of one or more characters that represent the
individual UTF-8 bytes of the character.  The UTF-8 flag is turned off.
Returns nothing.

    my $a = "\x{100}"; # $a contains one character, with ord 0x100
    utf8::encode($a);  # $a contains two characters, with ords (on
                       # ASCII platforms) 0xc4 and 0x80.  On EBCDIC
                       # 1047, this would instead be 0x8C and 0x41.

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.

=item * C<$success = utf8::decode($string)>

(Since Perl v5.8.0)
Attempts to convert in-place the octet sequence encoded as UTF-8 to the
corresponding character sequence.
That is, it replaces each sequence of
characters in the string whose ords represent a valid UTF-8 byte
sequence, with the corresponding single character.  The UTF-8 flag is
turned on only if the source string contains multiple-byte UTF-8
characters.  If I<$string> is invalid as UTF-8, returns false;
otherwise returns true.

    my $a = "\xc4\x80"; # $a contains two characters, with ords
                        # 0xc4 and 0x80
    utf8::decode($a);   # On ASCII platforms, $a contains one char,
                        # with ord 0x100.   Since these bytes aren't
                        # legal UTF-EBCDIC, on EBCDIC platforms, $a is
                        # unchanged and the function returns FALSE.

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.

=item * C<$unicode = utf8::native_to_unicode($code_point)>

(Since Perl v5.8.0)
This takes an unsigned integer (which represents the ordinal number of a
character (or a code point) on the platform the program is being run on) and
returns its Unicode equivalent value.  Since ASCII platforms natively use the
Unicode code points, this function returns its input on them.  On EBCDIC
platforms it converts from EBCDIC to Unicode.

A meaningless value will currently be returned if the input is not an unsigned
integer.

Since Perl v5.22.0, calls to this function are optimized out on ASCII
platforms, so there is no performance hit in using it there.

=item * C<$native = utf8::unicode_to_native($code_point)>

(Since Perl v5.8.0)
This is the inverse of C<utf8::native_to_unicode()>, converting the other
direction.  Again, on ASCII platforms, this returns its input, but on EBCDIC
platforms it will find the native platform code point, given any Unicode one.

A meaningless value will currently be returned if the input is not an unsigned
integer.

Since Perl v5.22.0, calls to this function are optimized out on ASCII
platforms, so there is no performance hit in using it there.
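
A small round-trip sketch of the two conversion functions, using the same
character as the SYNOPSIS.  The variable names are illustrative
assumptions.

```perl
use strict;
use warnings;

# The native code point of 'A' depends on the platform:
# 65 on ASCII platforms, 193 on EBCDIC ones.
my $native = ord('A');

# The Unicode code point of 'A' is 65 on every platform.
my $unicode = utf8::native_to_unicode($native);

# Converting back should recover the platform's own code point.
my $back = utf8::unicode_to_native($unicode);

printf "native %d -> unicode %d -> native %d\n", $native, $unicode, $back;
print "round trip ok\n" if $back == $native;
```

Code that routes code points through these two functions instead of
hard-coding native ordinals stays portable between ASCII and EBCDIC
platforms.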

=item * C<$flag = utf8::is_utf8($string)>

(Since Perl 5.8.1)  Test whether I<$string> is marked internally as encoded in
UTF-8.  Functionally the same as C<Encode::is_utf8()>.

=item * C<$flag = utf8::valid($string)>

[INTERNAL] Test whether I<$string> is in a consistent state regarding
UTF-8.  Will return true if it is well-formed UTF-8 and has the UTF-8 flag
on B<or> if I<$string> is held as bytes (both these states are 'consistent').
The main reason for this routine is to allow Perl's test suite to check
that operations have left strings in a consistent state.  You most
probably want to use C<utf8::is_utf8()> instead.

=back

C<utf8::encode> is like C<utf8::upgrade>, but the UTF-8 flag is
cleared.  See L<perlunicode>, and the C API
functions C<L<sv_utf8_upgrade|perlapi/sv_utf8_upgrade>>,
C<L<perlapi/sv_utf8_downgrade>>, C<L<perlapi/sv_utf8_encode>>,
and C<L<perlapi/sv_utf8_decode>>, which are wrapped by the Perl functions
C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
C<utf8::decode>.  Also, the functions C<utf8::is_utf8>, C<utf8::valid>,
C<utf8::encode>, C<utf8::decode>, C<utf8::upgrade>, and C<utf8::downgrade> are
actually internal, and thus always available, without a C<require utf8>
statement.

=head1 BUGS

Some filesystems may not support UTF-8 file names, or they may be supported
incompatibly with Perl.  Therefore UTF-8 names that are visible to the
filesystem, such as module names, may not work.

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perlrun>, L<bytes>, L<perlunicode>

=cut