=head1 NAME

perluniintro - Perl Unicode introduction

=head1 DESCRIPTION

This document gives a general idea of Unicode and how to use Unicode
in Perl.

=head2 Unicode

Unicode is a character set standard which plans to codify all of the
writing systems of the world, plus many other symbols.

Unicode and ISO/IEC 10646 are coordinated standards that provide code
points for characters in almost all modern character set standards,
covering more than 30 writing systems and hundreds of languages,
including all commercially-important modern languages.  All characters
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded. The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.
Unicode 1.0 was released in October 1991, and 4.0 in April 2003.

A Unicode I<character> is an abstract entity.  It is not bound to any
particular integer width, especially not to the C language C<char>.
Unicode is language-neutral and display-neutral: it does not encode the
language of the text and it does not define fonts or other graphical
layout details.  Unicode operates on characters and on text built from
those characters.

Unicode defines characters like C<LATIN CAPITAL LETTER A> or C<GREEK
SMALL LETTER ALPHA> and unique numbers for the characters, in this
case 0x0041 and 0x03B1, respectively.  These unique numbers are called
I<code points>.

The Unicode standard prefers using hexadecimal notation for the code
points.  If numbers like C<0x0041> are unfamiliar to you, take a peek
at a later section, L</"Hexadecimal Notation">.  The Unicode standard
uses the notation C<U+0041 LATIN CAPITAL LETTER A>, to give the
hexadecimal code point and the normative name of the character.

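As a small illustration of the notation, the following sketch turns
characters into code points with C<ord()> and formats them in the
C<U+XXXX> style.  The helper name C<u_plus> is made up for this
example; it is not part of Perl or of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: format a code point the way the Unicode
# standard writes it (at least four uppercase hexadecimal digits).
sub u_plus {
    my ($code_point) = @_;
    return sprintf("U+%04X", $code_point);
}

my $latin_a     = "A";           # LATIN CAPITAL LETTER A
my $greek_alpha = chr(0x03B1);   # GREEK SMALL LETTER ALPHA

# ord() gives the code point of the first character of a string.
my @labels = map { u_plus(ord($_)) } ($latin_a, $greek_alpha);
```

With these inputs C<@labels> should hold C<U+0041> and C<U+03B1>.
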
Unicode also defines various I<properties> for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation";
these properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.

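In Perl, properties can be tested with C<\p{...}> in regular
expressions, and the case operations are available through C<uc()> and
C<lc()>.  The sketch below is only an illustration; the variable names
are arbitrary.

```perl
use strict;
use warnings;

my $alpha = "\x{03B1}";   # GREEK SMALL LETTER ALPHA
my $digit = "\x{0663}";   # ARABIC-INDIC DIGIT THREE

# General category tests: Ll = lowercase letter, Lu = uppercase
# letter, Nd = decimal digit.
my $is_lower = $alpha =~ /\A\p{Ll}\z/ ? 1 : 0;
my $is_upper = $alpha =~ /\A\p{Lu}\z/ ? 1 : 0;
my $is_digit = $digit =~ /\A\p{Nd}\z/ ? 1 : 0;

# Uppercasing follows the Unicode case mappings, not just ASCII.
my $upper = uc($alpha);   # GREEK CAPITAL LETTER ALPHA, U+0391
```

Note that the property tests do not depend on the character names at
all: the digit above is a decimal digit although it is not in ASCII.
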
A Unicode character consists either of a single code point, or a
I<base character> (like C<LATIN CAPITAL LETTER A>), followed by one or
more I<modifiers> (like C<COMBINING ACUTE ACCENT>).  This sequence of
base character and modifiers is called a I<combining character
sequence>.

Whether to call these combining character sequences "characters"
depends on your point of view. If you are a programmer, you probably
would tend towards seeing each element in the sequences as one unit,
or "character".  The whole sequence could be seen as one "character",
however, from the user's point of view, since that's probably what it
looks like in the context of the user's language.

With this "whole sequence" view of characters, the total number of
characters is open-ended. But in the programmer's "one unit is one
character" point of view, the concept of "characters" is more
deterministic.  In this document, we take that second point of view:
one "character" is one Unicode code point, be it a base character or
a combining character.

For some combinations, there are I<precomposed> characters.
C<LATIN CAPITAL LETTER A WITH ACUTE>, for example, is defined as
a single code point.  These precomposed characters are, however,
only available for some combinations, and are mainly
meant to support round-trip conversions between Unicode and legacy
standards (like ISO 8859).  In the general case, the composing
method is more extensible.  To support conversion between
different compositions of the characters, various I<normalization
forms> to standardize representations are also defined.

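A minimal sketch of what normalization does, assuming the
C<Unicode::Normalize> module (bundled with Perl since 5.8) is
available.  It compares the precomposed and the combining spelling of
the same letter; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{00C1}";          # LATIN CAPITAL LETTER A WITH ACUTE
my $combining   = "\x{0041}\x{0301}";  # A followed by COMBINING ACUTE ACCENT

# Compared code point by code point, the two spellings differ.
my $same_raw = ($precomposed eq $combining) ? 1 : 0;

# After normalizing both to the same form (here NFC, the composed
# form), they compare equal.
my $same_nfc = (NFC($precomposed) eq NFC($combining)) ? 1 : 0;

# NFD decomposes the precomposed character into base + modifier.
my $nfd_length = length(NFD($precomposed));
```

Which form to normalize to is a design decision of the application;
the important part is to use the same form on both sides of a
comparison.
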
Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character".  The same character could
be represented differently in several legacy encodings.  The
converse is also not true: some code points do not have an assigned
character.  Firstly, there are unallocated code points within
otherwise used blocks.  Secondly, there are special Unicode control
characters that do not represent true characters.

A common myth about Unicode is that it would be "16-bit", that is,
Unicode is only represented as C<0x10000> (or 65536) characters from
C<0x0000> to C<0xFFFF>.  B<This is untrue.>  Since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
and since Unicode 3.1 (March 2001), characters have been defined
beyond C<0xFFFF>.  The first C<0x10000> characters are called the
I<Plane 0>, or the I<Basic Multilingual Plane> (BMP).  With Unicode
3.1, 17 (yes, seventeen) planes in all were defined--but they are
nowhere near full of defined characters, yet.

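Since each plane holds C<0x10000> code points, the plane of a code
point is simply the code point shifted right by 16 bits.  The
following sketch shows the arithmetic; C<plane_of> is a hypothetical
helper written for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the plane number (0..16) of a code point.
sub plane_of {
    my ($code_point) = @_;
    die "code point out of range\n"
        if $code_point < 0 || $code_point > 0x10FFFF;
    return $code_point >> 16;    # 0x10000 code points per plane
}

my $bmp_plane  = plane_of(0xFFFF);     # last code point of Plane 0
my $next_plane = plane_of(0x10000);    # first code point of Plane 1
my $last_plane = plane_of(0x10FFFF);   # highest code point of all
```

The highest code point lands in plane 16, which together with plane 0
gives the seventeen planes mentioned above.
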
Another myth is that the 256-character blocks have something to
do with languages--that each block would define the characters used
by a language or a set of languages.  B<This is also untrue.>
The division into blocks exists, but it is almost completely
accidental--an artifact of how the characters have been and
still are allocated.  Instead, there is a concept called I<scripts>,
which is more useful: there is C<Latin> script, C<Greek> script, and
so on.  Scripts usually span varied parts of several blocks.
For further information see L<Unicode::UCD>.

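The difference between blocks and scripts can be seen with the
C<charscript()> and C<charblock()> functions of C<Unicode::UCD>.  This
is only a sketch; the two code points were picked for illustration,
and block names may differ between Unicode versions, so the example
only checks that the blocks differ.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charblock);

my $alpha       = 0x03B1;   # GREEK SMALL LETTER ALPHA
my $alpha_psili = 0x1F00;   # GREEK SMALL LETTER ALPHA WITH PSILI

# Both characters belong to the same script ...
my @scripts = map { charscript($_) } ($alpha, $alpha_psili);

# ... although they were allocated in different blocks.
my @blocks = map { charblock($_) } ($alpha, $alpha_psili);
my $same_block = ($blocks[0] eq $blocks[1]) ? 1 : 0;
```
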
The Unicode code points are just abstract numbers.  To input and
output these abstract numbers, the numbers must be I<encoded> or
I<serialised> somehow.  Unicode defines several I<character encoding
forms>, of which I<UTF-8> is perhaps the most popular.  UTF-8 is a
variable length encoding that encodes Unicode characters as 1 to 6
bytes (only 4 with the currently defined characters).  Other encodings
include UTF-16 and UTF-32 and their big- and little-endian variants
(UTF-8 is byte-order independent).  ISO/IEC 10646 defines the UCS-2
and UCS-4 encoding forms.

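To make the idea of encoding forms concrete, the following sketch
serialises one code point in several encodings using the C<Encode>
module and shows the resulting bytes in hexadecimal.  The helper
C<hex_bytes> is hypothetical and exists only for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join(" ", map { sprintf("%02X", $_) } unpack("C*", $bytes));
}

my $smiley = "\x{263A}";   # WHITE SMILING FACE

# The same abstract code point, serialised four different ways.
my %encoded;
for my $name (qw(UTF-8 UTF-16BE UTF-16LE UTF-32BE)) {
    $encoded{$name} = hex_bytes(encode($name, $smiley));
}
```

The UTF-8 form takes three bytes here, the UTF-16 forms two bytes in
opposite orders, and the UTF-32 form four bytes.
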
For more information about encodings--for instance, to learn what
I<surrogates> and I<byte order marks> (BOMs) are--see L<perlunicode>.

=head2 Perl's Unicode Support

Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
natively.  Perl 5.8.0, however, is the first recommended release for
serious Unicode work.  The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but, for example,
regular expressions still do not work with Unicode in 5.6.1.

B<Starting from Perl 5.8.0, the use of C<use utf8> is no longer
necessary.> In earlier releases the C<utf8> pragma was used to declare
that operations in the current block or file would be Unicode-aware.
This model was found to be wrong, or at least clumsy: the "Unicodeness"
is now carried with the data, instead of being attached to the
operations.  Only one case remains where an explicit C<use utf8> is
needed: if your Perl script itself is encoded in UTF-8, you can use
UTF-8 in your identifier names, and in string and regular expression
literals, by saying C<use utf8>.  This is not the default because
scripts with legacy 8-bit data in them would break.  See L<utf8>.

=head2 Perl's Unicode Model

Perl supports both pre-5.6 strings of eight-bit native bytes, and
strings of Unicode characters.  The principle is that Perl tries to
keep its data as eight-bit bytes for as long as possible, but as soon
as Unicodeness cannot be avoided, the data is transparently upgraded
to Unicode.

Internally, Perl currently uses either the native eight-bit character
set of the platform (for example Latin-1) or UTF-8 to encode Unicode
strings. Specifically, if all code points in
the string are C<0xFF> or less, Perl uses the native eight-bit
character set.  Otherwise, it uses UTF-8.

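The transparent upgrade can be observed with C<utf8::is_utf8()>, which
reports whether a string is currently held as UTF-8 internally.  The
sketch below assumes an ASCII-based platform and a Perl that provides
C<utf8::is_utf8()> (5.8.1 or later); ordinary programs should not need
to look at this flag.

```perl
use strict;
use warnings;

my $narrow = "\x{DF}";    # a single code point below 0x100

# Held in the native eight-bit character set: no UTF-8 flag expected.
my $flag_before = utf8::is_utf8($narrow) ? 1 : 0;

# Appending a code point above 0xFF forces the upgrade to UTF-8.
my $wide = $narrow . "\x{0100}";
my $flag_after = utf8::is_utf8($wide) ? 1 : 0;

# Whatever the internal form, the string is still two characters,
# and the first one is still code point 0xDF.
my $length      = length($wide);
my $first_point = ord(substr($wide, 0, 1));
```
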
A user of Perl does not normally need to know nor care how Perl
happens to encode its internal strings, but it becomes relevant when
outputting Unicode strings to a stream without a PerlIO layer -- one with
the "default" encoding.  In such a case, the raw bytes used internally
(the native character set or UTF-8, as appropriate for each string)
will be used, and a "Wide character" warning will be issued if those
strings contain a character beyond 0x00FF.

For example,

      perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

produces a fairly useless mixture of native bytes and UTF-8, as well
as a warning:

     Wide character in print at ...

To output UTF-8, use the C<:utf8> output layer.  Prepending

      binmode(STDOUT, ":utf8");

to this sample program ensures that the output is completely UTF-8,
and removes the program's warning.

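The effect of the C<:utf8> layer can be checked without a terminal by
printing the same two strings to an in-memory file.  This is a sketch
under the assumption that in-memory filehandles are available (PerlIO,
Perl 5.8 or later); the variable names are arbitrary.

```perl
use strict;
use warnings;

# Print to a scalar instead of STDOUT, through the :utf8 layer.
my $buffer = '';
open(my $out, '>:utf8', \$buffer)
    or die "cannot open in-memory file: $!";
print {$out} "\x{DF}\n", "\x{0100}\x{DF}\n";
close($out) or die "cannot close in-memory file: $!";

# Every character, narrow or wide, has been written as UTF-8.
my $hex = join(" ", map { sprintf("%02X", $_) } unpack("C*", $buffer));
```

Both occurrences of C<0xDF> come out as the same two-byte sequence,
so the output is consistently UTF-8 rather than a mixture.
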
You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.

Note that this means that Perl expects other software to work, too:
if Perl has been led to believe that STDIN should be UTF-8, but then
STDIN coming in from another command is not UTF-8, Perl will complain
about the malformed UTF-8.

All features that combine Unicode and I/O also require using the new
PerlIO feature.  Almost all Perl 5.8 platforms do use PerlIO, though:
you can see whether yours is by running "perl -V" and looking for
C<useperlio=define>.

=head2 Unicode and EBCDIC

Perl 5.8.0 also supports Unicode on EBCDIC platforms.  There,
Unicode support is somewhat more complex to implement since
additional conversions are needed at every step.  Some problems
remain; see L<perlebcdic> for details.

In any case, the Unicode support on EBCDIC platforms is better than
in the 5.6 series, which hardly worked at all on EBCDIC platforms.
On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
instead of UTF-8.  The difference is that UTF-8 is "ASCII-safe", in
that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
"EBCDIC-safe".

=head2 Creating Unicode

To create Unicode characters in literals for code points above C<0xFF>,
use the C<\x{...}> notation in double-quoted strings:

    my $smiley = "\x{263a}";

Similarly, it can be used in regular expression literals:

    $smiley =~ /\x{263a}/;

At run-time you can use C<chr()>:

    my $hebrew_alef = chr(0x05d0);

See L</"Further Resources"> for how to find all these numeric codes.

Naturally, C<ord()> will do the reverse: it turns a character into
a code point.

Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
and C<chr(...)> for arguments less than C<0x100> (decimal 256)
generate an eight-bit character for backward compatibility with older
Perls.  For arguments of C<0x100> or more, Unicode characters are
always produced. If you want to force the production of Unicode
characters regardless of the numeric value, use C<pack("U", ...)>
instead of C<\x..>, C<\x{...}>, or C<chr()>.

You can also use the C<charnames> pragma to invoke characters
by name in double-quoted strings:

    use charnames ':full';
    my $arabic_alef = "\N{ARABIC LETTER ALEF}";

And, as mentioned above, you can also C<pack()> numbers into Unicode
characters:

   my $georgian_an  = pack("U", 0x10a0);

Note that both C<\x{...}> and C<\N{...}> are compile-time string
constants: you cannot use variables in them.  If you want similar
run-time functionality, use C<chr()> and C<charnames::vianame()>.

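The following sketch puts the compile-time and run-time forms side by
side for one character.  It assumes a Perl where
C<charnames::vianame()> returns the code point for a full character
name; the variable names are only for illustration.

```perl
use strict;
use warnings;
use charnames ':full';

# Compile-time forms: the code point or name is fixed in the source.
my $from_escape = "\x{03B1}";
my $from_name   = "\N{GREEK SMALL LETTER ALPHA}";

# Run-time forms: the code point or name may come from a variable.
my $code_point   = 0x03B1;
my $name         = "GREEK SMALL LETTER ALPHA";
my $from_chr     = chr($code_point);
my $from_pack    = pack("U", $code_point);
my $from_vianame = chr(charnames::vianame($name));

# All five strings hold the same single character.
my @all = ($from_escape, $from_name, $from_chr, $from_pack, $from_vianame);
my $all_equal = (grep { $_ eq $from_escape } @all) == @all ? 1 : 0;
```
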
If you want to force the result to Unicode characters, use the special
C<"U0"> prefix.  It consumes no arguments but forces the result to be
in Unicode characters, instead of bytes.

   my $chars = pack("U0C*", 0x80, 0x42);

Likewise, you can force the result to be bytes by using the special
C<"C0"> prefix.

=head2 Handling Unicode

Handling Unicode is for the most part transparent: just use the
strings as usual.  Functions like C<index()>, C<length()>, and
C<substr()> will work on the Unicode characters; regular expressions
will work on the Unicode characters (see L<perlunicode> and L<perlretut>).

Note that Perl considers the elements of a combining character
sequence to be separate characters, so for example

    use charnames ':full';
    print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";

will print 2, not 1.  The only exception is that regular expressions
have C<\X> for matching a combining character sequence.

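As a sketch of the difference, the following code counts the same
string once in code points with C<length()> and once in combining
character sequences by matching C<\X> repeatedly.  The variable names
are arbitrary.

```perl
use strict;
use warnings;
use charnames ':full';

my $text = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}bc";

# length() counts code points: A, the combining accent, b, and c.
my $code_points = length($text);

# Each \X match consumes one whole combining character sequence, so
# counting the matches counts what a user would call "characters".
my $sequences = () = $text =~ /\X/g;
```
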
Life is not quite so transparent, however, when working with legacy
encodings, I/O, and certain special cases:

=head2 Legacy Encodings

When you combine legacy data and Unicode, the legacy data needs
to be upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if
applicable) is assumed.  You can override this assumption by
using the C<encoding> pragma, for example

    use encoding 'latin2'; # ISO 8859-2

in which case literals (string or regular expressions), C<chr()>,
and C<ord()> in your whole script are assumed to produce Unicode
characters from ISO 8859-2 code points.  Note that the matching for
encoding names is forgiving: instead of C<latin2> you could have
said C<Latin 2>, or C<iso8859-2>, or other variations.  With just

    use encoding;

the environment variable C<PERL_ENCODING> will be consulted.
If that variable isn't set, the encoding pragma will fail.

The C<Encode> module knows about many encodings and has interfaces
for doing conversions between those encodings:

    use Encode 'decode';
    $data = decode("iso-8859-3", $data); # convert from legacy to utf-8

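A slightly fuller sketch of such a conversion, using C<decode()> to
turn legacy bytes into Unicode characters and C<encode()> to turn the
characters into UTF-8 bytes.  The sample byte and the variable names
were chosen for this example only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One byte of ISO 8859-2 data: 0xB1 is LATIN SMALL LETTER A WITH OGONEK.
my $latin2_bytes = "\xB1";

# decode(): legacy bytes in, Unicode characters out.
my $chars = decode("iso-8859-2", $latin2_bytes);
my $code_point = ord($chars);

# encode(): Unicode characters in, bytes of the target encoding out.
my $utf8_bytes = encode("UTF-8", $chars);
my $byte_count = length($utf8_bytes);
```

Keeping decoded characters and encoded bytes in separate variables, as
here, makes it harder to decode or encode the same data twice by
accident.
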
=head2 Unicode I/O

Normally, writing out Unicode data

    print FH $some_string_with_unicode, "\n";

produces raw bytes that Perl happens to use to internally encode the
Unicode string.  Perl's internal encoding depends on the system as
well as what characters happen to be in the string at the time. If
any of the characters are at code points C<0x100> or above, you will get
a warning.  To ensure that the output is explicitly rendered in the
encoding you desire--and to avoid the warning--open the stream with
the desired encoding. Some examples:

    open FH, ">:utf8", "file";

    open FH, ">:encoding(ucs2)",      "file";
    open FH, ">:encoding(UTF-8)",     "file";
    open FH, ">:encoding(shift_jis)", "file";

and on already open streams, use C<binmode()>:

    binmode(STDOUT, ":utf8");

    binmode(STDOUT, ":encoding(ucs2)");
    binmode(STDOUT, ":encoding(UTF-8)");
    binmode(STDOUT, ":encoding(shift_jis)");

The matching of encoding names is loose: case does not matter, and
many encodings have several aliases.  Note that the C<:utf8> layer
must always be specified exactly like that; it is I<not> subject to
the loose matching of encoding names.

See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
L<Encode::PerlIO> for the C<:encoding()> layer, and
L<Encode::Supported> for many encodings supported by the C<Encode>
module.

Reading in a file that you know happens to be encoded in one of the
Unicode or legacy encodings does not magically turn the data into
Unicode in Perl's eyes.  To do that, specify the appropriate
layer when opening files:

    open(my $fh,'<:utf8', 'anything');
    my $line_of_unicode = <$fh>;

    open(my $fh,'<:encoding(Big5)', 'anything');
    my $line_of_unicode = <$fh>;

The I/O layers can also be specified more flexibly with
the C<open> pragma.  See L<open>, or look at the following example.

    use open ':utf8'; # input and output default layer will be UTF-8
    open X, ">file";
    print X chr(0x100), "\n";
    close X;
    open Y, "<file";
    printf "%#x\n", ord(<Y>); # this should print 0x100
    close Y;

With the C<open> pragma you can use the C<:locale> layer:

    BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
    # the :locale will probe the locale environment variables like LC_ALL
    use open OUT => ':locale'; # russki parusski
    open(O, ">koi8");
    print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
    close O;
    open(I, "<koi8");
    printf "%#x\n", ord(<I>); # this should print 0xc1
    close I;

or you can also use the C<':encoding(...)'> layer:

    open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
    my $line_of_unicode = <$epic>;

These methods install a transparent filter on the I/O stream that
converts data from the specified encoding when it is read in from the
stream.  The result is always Unicode.

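The conversion on input can be tried without a real file by reading
from an in-memory file.  The sketch below assumes in-memory
filehandles and the C<:encoding()> layer are available (PerlIO, Perl
5.8 or later); the three sample bytes stand in for the contents of a
Greek text file.

```perl
use strict;
use warnings;

# Three ISO 8859-7 bytes: small alpha, beta, and gamma, then a newline.
my $greek_bytes = "\xE1\xE2\xE3\n";

# Read the bytes through the :encoding(iso-8859-7) layer.
open(my $in, '<:encoding(iso-8859-7)', \$greek_bytes)
    or die "cannot open in-memory file: $!";
my $line = <$in>;
close($in);
chomp($line);

# The layer has already converted each byte to a Unicode character.
my @code_points = map { sprintf("U+%04X", ord($_)) } split(//, $line);
```
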
The L<open> pragma affects all the C<open()> calls after the pragma by
setting default layers.  If you want to affect only certain
streams, use explicit layers directly in the C<open()> call.

You can switch encodings on an already opened stream by using
C<binmode()>; see L<perlfunc/binmode>.

The C<:locale> layer does not currently (as of Perl 5.8.0) work with
C<open()> and C<binmode()>, only with the C<open> pragma.  The
C<:utf8> and C<:encoding(...)> methods do work with all of C<open()>,
C<binmode()>, and the C<open> pragma.

Similarly, you may use these I/O layers on output streams to
automatically convert Unicode to the specified encoding when it is
written to the stream. For example, the following snippet copies the
contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
the file "text.utf8", encoded as UTF-8:

    open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
    open(my $unicode, '>:utf8',                  'text.utf8');
    while (<$nihongo>) { print $unicode $_ }

The naming of encodings, both by the C<open()> and by the C<open>
pragma, is similar to the C<encoding> pragma in that it allows for
flexible names: C<koi8-r> and C<KOI8R> will both be understood.

Common encodings recognised by ISO, MIME, IANA, and various other
standardisation organisations are recognised; for a more detailed
list see L<Encode::Supported>.

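To see which canonical encoding a given name resolves to, you can ask
C<Encode::find_encoding()>.  This is a sketch only: the spellings
below are a few that are expected to resolve with a standard C<Encode>
installation, not a list of everything that is accepted.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Several spellings that should name the same two encodings.
my @spellings = ("latin1", "LATIN1", "ISO-8859-1", "latin2", "ISO-8859-2");

# Map each spelling to the canonical name, or to undef if unknown.
my %canonical;
for my $spelling (@spellings) {
    my $encoding = find_encoding($spelling);
    $canonical{$spelling} = defined $encoding ? $encoding->name : undef;
}
```

A spelling that C<find_encoding()> does not know yields C<undef>,
which is a cheap way to validate an encoding name that comes from a
configuration file or from the user before passing it to C<open()>.
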
C<read()> reads characters and returns the number of characters.
C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
and C<sysseek()>.

Notice that because of the default behaviour of not doing any
conversion upon input if there is no default layer,
it is easy to mistakenly write code that keeps on expanding a file
by repeatedly encoding the data:

    # BAD CODE WARNING
    open F, "file";
    local $/; ## read in the whole file of 8-bit characters
    $t = <F>;
    close F;
    open F, ">:utf8", "file";
    print F $t; ## convert to UTF-8 on output
    close F;

If you run this code twice, the contents of F<file> will be twice
UTF-8 encoded.  A C<use open ':utf8'> would have avoided the bug, as
would explicitly opening F<file> for input as UTF-8 as well.

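The problem and the fix can be reproduced on in-memory files.  In the
sketch below, C<rewrite_as_utf8> is a hypothetical helper that plays
the role of the read-then-write code above, with the input layer as a
parameter; it assumes in-memory filehandles are available (PerlIO,
Perl 5.8 or later).

```perl
use strict;
use warnings;

# Hypothetical helper: read $bytes through $input_layer ('' for none),
# write the text out through :utf8, and return the bytes written.
sub rewrite_as_utf8 {
    my ($bytes, $input_layer) = @_;

    open(my $in, "<$input_layer", \$bytes) or die "cannot read: $!";
    my $text = do { local $/; <$in> };   # slurp the whole "file"
    close($in);

    my $result = '';
    open(my $out, '>:utf8', \$result) or die "cannot write: $!";
    print {$out} $text;
    close($out) or die "cannot close: $!";
    return $result;
}

# LATIN SMALL LETTER E WITH ACUTE, already encoded as UTF-8.
my $original = "\xC3\xA9";

# No input layer: each byte is taken as a character and encoded again.
my $doubled = rewrite_as_utf8($original, '');

# Matching input layer: the data is decoded first, so nothing grows.
my $stable = rewrite_as_utf8($original, ':utf8');
```

With no input layer the two bytes grow to four; with the C<:utf8>
input layer the data comes back unchanged.
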
B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
Perl has been built with the new PerlIO feature (which is the default
on most systems).

=head2 Displaying Unicode As Text

Sometimes you might want to display Perl scalars containing Unicode as
simple ASCII (or EBCDIC) text.  The following subroutine converts
its argument so that Unicode characters with code points greater than
255 are displayed as C<\x{...}>, control characters (like C<\n>) are
displayed as C<\x..>, and the rest of the characters as themselves:

   sub nice_string {
       join("",
         map { $_ > 255 ?                  # if wide character...
               sprintf("\\x{%04X}", $_) :  # \x{...}
               chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
               sprintf("\\x%02X", $_) :    # \x..
               quotemeta(chr($_))          # else quoted or as themselves
         } unpack("U*", $_[0]));           # unpack Unicode characters
   }

For example,

   nice_string("foo\x{100}bar\n")

returns the string

   'foo\x{0100}bar\x0A'

which is ready to be printed.

=head2 Special Cases

=over 4

=item *

Bit Complement Operator ~ And vec()

The bit complement operator C<~> may produce surprising results if
used on strings containing characters with ordinal values above
255. In such a case, the results are consistent with the internal
encoding of the characters, but not with much else. So don't do
that. Similarly for C<vec()>: you will be operating on the
internally-encoded bit patterns of the Unicode characters, not on
the code point values, which is very probably not what you want.

=item *

Peeking At Perl's Internal Encoding

Normal users of Perl should never care how Perl encodes any particular
Unicode string (because the normal ways to get at the contents of a
string with Unicode--via input and output--should always be via
explicitly-defined I/O layers). But if you must, there are two
ways of looking behind the scenes.

One way of peeking inside the internal encoding of Unicode characters
is to use C<unpack("C*", ...)> to get the bytes or C<unpack("H*", ...)>
to display the bytes:

    # this prints  c480  for the UTF-8 bytes 0xc4 0x80
    print join(" ", unpack("H*", pack("U", 0x100))), "\n";

Another way would be to use the Devel::Peek module:

    perl -MDevel::Peek -e 'Dump(chr(0x100))'

That shows the C<UTF8> flag in FLAGS and both the UTF-8 bytes
and Unicode characters in C<PV>.  See also later in this document
the discussion about the C<utf8::is_utf8()> function.

=back

*0Sstevel@tonic-gate=head2 Advanced Topics
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=over 4
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateString Equivalence
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateThe question of string equivalence turns somewhat complicated
*0Sstevel@tonic-gatein Unicode: what do you mean by "equal"?
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate(Is C<LATIN CAPITAL LETTER A WITH ACUTE> equal to
*0Sstevel@tonic-gateC<LATIN CAPITAL LETTER A>?)
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateThe short answer is that by default Perl compares equivalence (C<eq>,
*0Sstevel@tonic-gateC<ne>) based only on code points of the characters.  In the above
*0Sstevel@tonic-gatecase, the answer is no (because 0x00C1 != 0x0041).  But sometimes, any
*0Sstevel@tonic-gateCAPITAL LETTER As should be considered equal, or even As of any case.
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateThe long answer is that you need to consider character normalization
*0Sstevel@tonic-gateand casing issues: see L<Unicode::Normalize>, Unicode Technical
*0Sstevel@tonic-gateReports #15 and #21, I<Unicode Normalization Forms> and I<Case
*0Sstevel@tonic-gateMappings>, http://www.unicode.org/unicode/reports/tr15/ and
*0Sstevel@tonic-gatehttp://www.unicode.org/unicode/reports/tr21/
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateAs of Perl 5.8.0, the "Full" case-folding of I<Case
*0Sstevel@tonic-gateMappings/SpecialCasing> is implemented.
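
A small sketch may make the difference between code point equality and
the looser notions concrete.  It assumes the C<Unicode::Normalize>
module is installed; the variable names and the "drop the combining
marks" approach are illustrative choices, not the only way to do it.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{C1}";      # LATIN CAPITAL LETTER A WITH ACUTE
my $decomposed  = "A\x{301}";    # A followed by COMBINING ACUTE ACCENT

# eq compares code points, so the two spellings differ...
my $raw_equal = ($precomposed eq $decomposed) ? 1 : 0;

# ...but they agree once both are in the same normalization form
my $nfc_equal = (NFC($precomposed) eq NFC($decomposed)) ? 1 : 0;

# one possible "ignore accents and case" test: decompose, drop the
# combining marks, then compare the lowercased strings
(my $base = NFD($precomposed)) =~ s/\p{M}//g;
my $loose_equal = (lc($base) eq lc("a")) ? 1 : 0;

print "$raw_equal $nfc_equal $loose_equal\n";    # 0 1 1
```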
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateString Collation
*0Sstevel@tonic-gate
*0Sstevel@tonic-gatePeople like to see their strings nicely sorted--or as Unicode
*0Sstevel@tonic-gateparlance goes, collated.  But again, what do you mean by collate?
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate(Does C<LATIN CAPITAL LETTER A WITH ACUTE> come before or after
*0Sstevel@tonic-gateC<LATIN CAPITAL LETTER A WITH GRAVE>?)
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateThe short answer is that by default, Perl compares strings (C<lt>,
*0Sstevel@tonic-gateC<le>, C<cmp>, C<ge>, C<gt>) based only on the code points of the
*0Sstevel@tonic-gatecharacters.  In the above case, the answer is "after", since
*0Sstevel@tonic-gateC<0x00C1> > C<0x00C0>.
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateThe long answer is that "it depends", and a good answer cannot be
*0Sstevel@tonic-gategiven without knowing (at the very least) the language context.
*0Sstevel@tonic-gateSee L<Unicode::Collate>, and I<Unicode Collation Algorithm>
*0Sstevel@tonic-gatehttp://www.unicode.org/unicode/reports/tr10/
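
To see the two orderings side by side, here is a minimal sketch that
assumes C<Unicode::Collate> is installed together with its default
collation data.  The word list is invented for the example, and no
language-specific tailoring is applied.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = ("b", "\x{C1}", "a", "B");    # \x{C1} is A WITH ACUTE

# plain sort uses cmp, that is, code point order
my @by_code_point = sort @words;

# Unicode::Collate applies the default Unicode Collation Algorithm
my $collator = Unicode::Collate->new;
my @collated = $collator->sort(@words);

# show the results as code points to avoid printing wide characters
print join(" ", map { sprintf("U+%04X", ord($_)) } @by_code_point), "\n";
print join(" ", map { sprintf("U+%04X", ord($_)) } @collated), "\n";
```

With code point order the capital "B" sorts first and the accented
letter last; the collator instead groups all the As before the Bs.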
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=back
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=head2 Miscellaneous
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=over 4
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateCharacter Ranges and Classes
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateCharacter ranges in regular expression character classes (C</[a-z]/>)
*0Sstevel@tonic-gateand in the C<tr///> (also known as C<y///>) operator are not magically
Unicode-aware.  What this means is that C<[A-Za-z]> will not magically start
to mean "all alphabetic letters" (not that it means that even for
8-bit characters; you should be using C</[[:alpha:]]/> in that case).
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateFor specifying character classes like that in regular expressions,
*0Sstevel@tonic-gateyou can use the various Unicode properties--C<\pL>, or perhaps
*0Sstevel@tonic-gateC<\p{Alphabetic}>, in this particular case.  You can use Unicode
*0Sstevel@tonic-gatecode points as the end points of character ranges, but there is no
*0Sstevel@tonic-gatemagic associated with specifying a certain range.  For further
*0Sstevel@tonic-gateinformation--there are dozens of Unicode character classes--see
*0Sstevel@tonic-gateL<perlunicode>.
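
As an illustration, the following sketch tests a short Greek string
(chosen arbitrarily for the example) against an ASCII range, two
Unicode properties, and an explicit code point range.

```perl
use strict;
use warnings;

my $greek = "\x{3B1}\x{3B2}\x{3B3}";    # GREEK SMALL LETTERS ALPHA, BETA, GAMMA

# an explicit ASCII range matches only what it lists
my $ascii_range = ($greek =~ /^[A-Za-z]+$/) ? 1 : 0;

# Unicode properties match letters of any script
my $letter_prop = ($greek =~ /^\pL+$/) ? 1 : 0;
my $alpha_prop  = ($greek =~ /^\p{Alphabetic}+$/) ? 1 : 0;

# code points can still be range end points, with no extra magic
my $greek_range = ($greek =~ /^[\x{3B1}-\x{3C9}]+$/) ? 1 : 0;

print "$ascii_range $letter_prop $alpha_prop $greek_range\n";    # 0 1 1 1
```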
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateString-To-Number Conversions
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateUnicode does define several other decimal--and numeric--characters
*0Sstevel@tonic-gatebesides the familiar 0 to 9, such as the Arabic and Indic digits.
*0Sstevel@tonic-gatePerl does not support string-to-number conversion for digits other
*0Sstevel@tonic-gatethan ASCII 0 to 9 (and ASCII a to f for hexadecimal).
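
If you do need the numeric value of a non-ASCII digit, one option is to
ask the Unicode character database.  The sketch below assumes the
C<Unicode::UCD> module and its C<charinfo()> function are available;
the sample character is arbitrary.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

my $arabic_three = "\x{663}";    # ARABIC-INDIC DIGIT THREE

# \d matches it, because Unicode rules apply to this string...
my $matches_digit = ($arabic_three =~ /^\d$/) ? 1 : 0;

# ...but numeric conversion only understands ASCII digits
my $numified = do { no warnings 'numeric'; $arabic_three + 0 };

# the Unicode character database records its decimal value
my $decimal = charinfo(ord($arabic_three))->{decimal};

print "$matches_digit $numified $decimal\n";    # 1 0 3
```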
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=back
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=head2 Questions With Answers
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=over 4
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateWill My Old Scripts Break?
*0Sstevel@tonic-gate
Very probably not.  Unless you are generating Unicode characters
somehow, old behaviour should be preserved.  About the only behaviour
that has changed and which could start generating Unicode is the old
behaviour of C<chr()> where supplying an argument greater than 255
produced a character modulo 256.  C<chr(300)>, for example, was equal
to C<chr(44)> or "," (in ASCII); now it is LATIN CAPITAL LETTER I WITH
BREVE.
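
You can check the current behaviour yourself.  This short sketch uses
C<charnames::viacode()> to look up the character name; the variable
names are only illustrative.

```perl
use strict;
use warnings;
use charnames ();

my $char = chr(300);

my $length = length($char);                   # one character
my $code   = sprintf("U+%04X", ord($char));   # its code point
my $name   = charnames::viacode(300);         # its Unicode name

print "$length $code $name\n";    # 1 U+012C LATIN CAPITAL LETTER I WITH BREVE
```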
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateHow Do I Make My Scripts Work With Unicode?
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateVery little work should be needed since nothing changes until you
*0Sstevel@tonic-gategenerate Unicode data.  The most important thing is getting input as
*0Sstevel@tonic-gateUnicode; for that, see the earlier I/O discussion.
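
As a reminder of what that looks like, here is a minimal sketch of
reading UTF-8 input through an explicit I/O layer.  An in-memory file
stands in for a real input file, and the variable names are
hypothetical.

```perl
use strict;
use warnings;

# two octets, the UTF-8 encoding of U+0100, followed by a newline
my $utf8_octets = "\xC4\x80\n";

# an in-memory file stands in for a real input file here
open(my $in, '<:encoding(UTF-8)', \$utf8_octets)
    or die "cannot open in-memory file: $!";
my $line = <$in>;
close($in);
chomp($line);

printf "%d character(s), first is U+%04X\n", length($line), ord($line);
# 1 character(s), first is U+0100
```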
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateHow Do I Know Whether My String Is In Unicode?
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateYou shouldn't care.  No, you really shouldn't.  No, really.  If you
*0Sstevel@tonic-gatehave to care--beyond the cases described above--it means that we
*0Sstevel@tonic-gatedidn't get the transparency of Unicode quite right.
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateOkay, if you insist:
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    print utf8::is_utf8($string) ? 1 : 0, "\n";
*0Sstevel@tonic-gate
But note that this doesn't mean that any of the characters in the
string are necessarily UTF-8 encoded, or that any of the characters have
code points greater than 0xFF (255) or even 0x80 (128), or that the
string has any characters at all.  All C<is_utf8()> does is
return the value of the internal "utf8ness" flag attached to the
C<$string>.  If the flag is off, the bytes in the scalar are interpreted
as a single byte encoding.  If the flag is on, the bytes in the scalar
are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
points of the characters.  Bytes added to a UTF-8 encoded string are
automatically upgraded to UTF-8.  If mixed non-UTF-8 and UTF-8 scalars
*0Sstevel@tonic-gateare merged (double-quoted interpolation, explicit concatenation, and
*0Sstevel@tonic-gateprintf/sprintf parameter substitution), the result will be UTF-8 encoded
*0Sstevel@tonic-gateas if copies of the byte strings were upgraded to UTF-8: for example,
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    $a = "ab\x80c";
*0Sstevel@tonic-gate    $b = "\x{100}";
*0Sstevel@tonic-gate    print "$a = $b\n";
*0Sstevel@tonic-gate
*0Sstevel@tonic-gatethe output string will be UTF-8-encoded C<ab\x80c = \x{100}\n>, but
*0Sstevel@tonic-gateC<$a> will stay byte-encoded.
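
The flag values can be inspected directly.  This sketch repeats the
example with different variable names (C<$a> and C<$b> are best left to
C<sort>) and reports the flag of each scalar.

```perl
use strict;
use warnings;

my $bytes  = "ab\x80c";     # all code points below 0x100
my $wide   = "\x{100}";     # needs UTF-8 internally
my $joined = "$bytes = $wide\n";

my @flags = map { utf8::is_utf8($_) ? 1 : 0 } $bytes, $wide, $joined;

# the character at index 2 is still 0x80, whatever the internal encoding
my $third = sprintf("0x%02X", ord(substr($joined, 2, 1)));

print "@flags ", length($joined), " $third\n";    # 0 1 1 9 0x80
```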
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateSometimes you might really need to know the byte length of a string
*0Sstevel@tonic-gateinstead of the character length. For that use either the
*0Sstevel@tonic-gateC<Encode::encode_utf8()> function or the C<bytes> pragma and its only
*0Sstevel@tonic-gatedefined function C<length()>:
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    my $unicode = chr(0x100);
*0Sstevel@tonic-gate    print length($unicode), "\n"; # will print 1
*0Sstevel@tonic-gate    require Encode;
*0Sstevel@tonic-gate    print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
*0Sstevel@tonic-gate    use bytes;
*0Sstevel@tonic-gate    print length($unicode), "\n"; # will also print 2
*0Sstevel@tonic-gate                                  # (the 0xC4 0x80 of the UTF-8)
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateHow Do I Detect Data That's Not Valid In a Particular Encoding?
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateUse the C<Encode> package to try converting it.
*0Sstevel@tonic-gateFor example,
*0Sstevel@tonic-gate
    use Encode 'decode_utf8';
    # decode a copy: with a CHECK argument Encode may modify its input
    my $copy = $string_of_bytes_that_I_think_is_utf8;
    if (eval { decode_utf8($copy, Encode::FB_CROAK); 1 }) {
        # valid
    } else {
        # invalid
    }
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateFor UTF-8 only, you can use:
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    use warnings;
*0Sstevel@tonic-gate    @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);
*0Sstevel@tonic-gate
If invalid, a C<Malformed UTF-8 character (byte 0x##) in unpack>
warning is produced.  The "U0" means "expect strictly UTF-8 encoded
Unicode".  Without that, C<unpack("U*", ...)> would also accept
data like C<chr(0xFF)>, similarly to C<pack> as we saw earlier.
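
If you perform this check in several places, it may be convenient to
wrap it in a small function.  The helper below is hypothetical (its name
is not part of any module); it assumes C<Encode::decode()> with the
C<Encode::FB_CROAK> check, which dies on malformed input.

```perl
use strict;
use warnings;
use Encode ();

# hypothetical helper: true if the octets decode as strict UTF-8
sub looks_like_utf8 {
    my ($octets) = @_;    # a copy, since Encode may modify its input
    return eval { Encode::decode('UTF-8', $octets, Encode::FB_CROAK); 1 }
        ? 1 : 0;
}

print looks_like_utf8("\xC4\x80"), "\n";    # 1, the encoding of U+0100
print looks_like_utf8("\xFF"), "\n";        # 0, 0xFF never occurs in UTF-8
```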
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateHow Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa?
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateThis probably isn't as useful as you might think.
*0Sstevel@tonic-gateNormally, you shouldn't need to.
*0Sstevel@tonic-gate
In one sense, what you are asking doesn't make much sense: encodings
are for characters, and binary data are not "characters", so converting
"data" into some encoding isn't meaningful unless you know what
character set and encoding the binary data is in, in which case it's
not just binary data, now is it?
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateIf you have a raw sequence of bytes that you know should be
*0Sstevel@tonic-gateinterpreted via a particular encoding, you can use C<Encode>:
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    use Encode 'from_to';
*0Sstevel@tonic-gate    from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateThe call to C<from_to()> changes the bytes in C<$data>, but nothing
*0Sstevel@tonic-gatematerial about the nature of the string has changed as far as Perl is
*0Sstevel@tonic-gateconcerned.  Both before and after the call, the string C<$data>
*0Sstevel@tonic-gatecontains just a bunch of 8-bit bytes. As far as Perl is concerned,
*0Sstevel@tonic-gatethe encoding of the string remains as "system-native 8-bit bytes".
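
A short sketch shows both halves of that statement: the bytes change,
the flag does not.  The sample byte is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(from_to);

my $data = "\xE9";    # e with acute, as one Latin-1 byte

my $flag_before = utf8::is_utf8($data) ? 1 : 0;
from_to($data, "iso-8859-1", "utf-8");
my $flag_after  = utf8::is_utf8($data) ? 1 : 0;

# the bytes changed (now 0xC3 0xA9), the flag did not
my $hex = join(" ", map { sprintf("%02x", ord($_)) } split(//, $data));

print "$flag_before $flag_after $hex\n";    # 0 0 c3 a9
```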
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateYou might relate this to a fictional 'Translate' module:
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate   use Translate;
*0Sstevel@tonic-gate   my $phrase = "Yes";
*0Sstevel@tonic-gate   Translate::from_to($phrase, 'english', 'deutsch');
*0Sstevel@tonic-gate   ## phrase now contains "Ja"
*0Sstevel@tonic-gate
The contents of the string change, but not the nature of the string.
Perl doesn't know any more after the call than before that the
contents of the string indicate the affirmative.
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateBack to converting data.  If you have (or want) data in your system's
*0Sstevel@tonic-gatenative 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you can use
*0Sstevel@tonic-gatepack/unpack to convert to/from Unicode.
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    $native_string  = pack("C*", unpack("U*", $Unicode_string));
*0Sstevel@tonic-gate    $Unicode_string = pack("U*", unpack("C*", $native_string));
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateIf you have a sequence of bytes you B<know> is valid UTF-8,
*0Sstevel@tonic-gatebut Perl doesn't know it yet, you can make Perl a believer, too:
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    use Encode 'decode_utf8';
*0Sstevel@tonic-gate    $Unicode = decode_utf8($bytes);
*0Sstevel@tonic-gate
You can convert well-formed UTF-8 to a sequence of bytes, but if
you just want to convert random binary data into UTF-8, you can't.
B<Not every random collection of bytes is well-formed UTF-8>.  You can
use C<unpack("C*", $string)> for the former, and you can create
well-formed Unicode data with C<pack("U*", 0xff, ...)>.
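
For bytes that are known to be valid UTF-8, the round trip through
C<Encode> can be sketched as follows; the variable names are
illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

my $octets = "\xC4\x80";               # two bytes of UTF-8
my $string = decode_utf8($octets);     # one character, U+0100
my $back   = encode_utf8($string);     # two bytes again

printf "%d %d U+%04X %s\n",
    length($octets), length($string), ord($string),
    ($back eq $octets ? "same" : "different");
# 2 1 U+0100 same
```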
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateHow Do I Display Unicode?  How Do I Input Unicode?
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateSee http://www.alanwood.net/unicode/ and
*0Sstevel@tonic-gatehttp://www.cl.cam.ac.uk/~mgk25/unicode.html
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateHow Does Unicode Work With Traditional Locales?
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateIn Perl, not very well.  Avoid using locales through the C<locale>
*0Sstevel@tonic-gatepragma.  Use only one or the other.  But see L<perlrun> for the
*0Sstevel@tonic-gatedescription of the C<-C> switch and its environment counterpart,
*0Sstevel@tonic-gateC<$ENV{PERL_UNICODE}> to see how to enable various Unicode features,
*0Sstevel@tonic-gatefor example by using locale settings.
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=back
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=head2 Hexadecimal Notation
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateThe Unicode standard prefers using hexadecimal notation because
*0Sstevel@tonic-gatethat more clearly shows the division of Unicode into blocks of 256 characters.
*0Sstevel@tonic-gateHexadecimal is also simply shorter than decimal.  You can use decimal
*0Sstevel@tonic-gatenotation, too, but learning to use hexadecimal just makes life easier
*0Sstevel@tonic-gatewith the Unicode standard.  The C<U+HHHH> notation uses hexadecimal,
*0Sstevel@tonic-gatefor example.
*0Sstevel@tonic-gate
The C<0x> prefix means a hexadecimal number; the digits are 0-9 I<and>
*0Sstevel@tonic-gatea-f (or A-F, case doesn't matter).  Each hexadecimal digit represents
*0Sstevel@tonic-gatefour bits, or half a byte.  C<print 0x..., "\n"> will show a
*0Sstevel@tonic-gatehexadecimal number in decimal, and C<printf "%x\n", $decimal> will
*0Sstevel@tonic-gateshow a decimal number in hexadecimal.  If you have just the
*0Sstevel@tonic-gate"hex digits" of a hexadecimal number, you can use the C<hex()> function.
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    print 0x0009, "\n";    # 9
*0Sstevel@tonic-gate    print 0x000a, "\n";    # 10
*0Sstevel@tonic-gate    print 0x000f, "\n";    # 15
*0Sstevel@tonic-gate    print 0x0010, "\n";    # 16
*0Sstevel@tonic-gate    print 0x0011, "\n";    # 17
*0Sstevel@tonic-gate    print 0x0100, "\n";    # 256
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    print 0x0041, "\n";    # 65
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    printf "%x\n",  65;    # 41
*0Sstevel@tonic-gate    printf "%#x\n", 65;    # 0x41
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    print hex("41"), "\n"; # 65
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=head2 Further Resources
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=over 4
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateUnicode Consortium
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    http://www.unicode.org/
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateUnicode FAQ
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    http://www.unicode.org/unicode/faq/
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateUnicode Glossary
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    http://www.unicode.org/glossary/
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateUnicode Useful Resources
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    http://www.unicode.org/unicode/onlinedat/resources.html
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateUnicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    http://www.alanwood.net/unicode/
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateUTF-8 and Unicode FAQ for Unix/Linux
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    http://www.cl.cam.ac.uk/~mgk25/unicode.html
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateLegacy Character Sets
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    http://www.czyborra.com/
*0Sstevel@tonic-gate    http://www.eki.ee/letter/
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=item *
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateThe Unicode support files live within the Perl installation in the
*0Sstevel@tonic-gatedirectory
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    $Config{installprivlib}/unicore
*0Sstevel@tonic-gate
*0Sstevel@tonic-gatein Perl 5.8.0 or newer, and
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    $Config{installprivlib}/unicode
*0Sstevel@tonic-gate
*0Sstevel@tonic-gatein the Perl 5.6 series.  (The renaming to F<lib/unicore> was done to
*0Sstevel@tonic-gateavoid naming conflicts with lib/Unicode in case-insensitive filesystems.)
*0Sstevel@tonic-gateThe main Unicode data file is F<UnicodeData.txt> (or F<Unicode.301> in
*0Sstevel@tonic-gatePerl 5.6.1.)  You can find the C<$Config{installprivlib}> by
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    perl "-V:installprivlib"
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateYou can explore various information from the Unicode data files using
*0Sstevel@tonic-gatethe C<Unicode::UCD> module.
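
For instance, the following sketch prints the location of the support
files and looks up one character with C<charinfo()>.  It assumes a Perl
5.8.0 or newer layout (hence F<unicore>); the chosen code point is
arbitrary.

```perl
use strict;
use warnings;
use Config;
use Unicode::UCD qw(charinfo);

# where this Perl keeps its Unicode support files (5.8.0 or newer)
my $unicore_dir = "$Config{installprivlib}/unicore";

# look up U+0041 in the Unicode character database
my $info = charinfo(0x41);

print "$unicore_dir\n";
print "$info->{code} $info->{name} ($info->{category})\n";
# 0041 LATIN CAPITAL LETTER A (Lu)
```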
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=back
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=head1 UNICODE IN OLDER PERLS
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateIf you cannot upgrade your Perl to 5.8.0 or later, you can still
*0Sstevel@tonic-gatedo some Unicode processing by using the modules C<Unicode::String>,
*0Sstevel@tonic-gateC<Unicode::Map8>, and C<Unicode::Map>, available from CPAN.
*0Sstevel@tonic-gateIf you have the GNU recode installed, you can also use the
*0Sstevel@tonic-gatePerl front-end C<Convert::Recode> for character conversions.
*0Sstevel@tonic-gate
The following are fast conversions from ISO 8859-1 (Latin-1) bytes
to UTF-8 bytes and back; the code works even with older Perl 5 versions.
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    # ISO 8859-1 to UTF-8
*0Sstevel@tonic-gate    s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate    # UTF-8 to ISO 8859-1
*0Sstevel@tonic-gate    s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
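
The two substitutions operate on C<$_>.  Wrapped in functions, they
might look like the sketch below; the function names are hypothetical,
and the input is assumed to be a plain byte string.

```perl
use strict;
use warnings;

# hypothetical wrapper: ISO 8859-1 bytes to UTF-8 bytes
sub latin1_to_utf8 {
    my ($s) = @_;
    # each high byte becomes a lead byte (0xC2 or 0xC3) plus a trail byte
    $s =~ s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
    return $s;
}

# hypothetical wrapper: UTF-8 bytes to ISO 8859-1 bytes
sub utf8_to_latin1 {
    my ($s) = @_;
    # only two-byte sequences led by 0xC2 or 0xC3 map back to Latin-1
    $s =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
    return $s;
}

my $latin1 = "caf\xE9";                   # "cafe" with an acute accent
my $utf8   = latin1_to_utf8($latin1);     # "caf\xC3\xA9"
my $back   = utf8_to_latin1($utf8);

print length($latin1), " ", length($utf8), " ", length($back), "\n";    # 4 5 4
```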
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=head1 SEE ALSO
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateL<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
*0Sstevel@tonic-gateL<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
*0Sstevel@tonic-gateL<Unicode::UCD>
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=head1 ACKNOWLEDGMENTS
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateThanks to the kind readers of the perl5-porters@perl.org,
*0Sstevel@tonic-gateperl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
*0Sstevel@tonic-gatemailing lists for their valuable feedback.
*0Sstevel@tonic-gate
*0Sstevel@tonic-gate=head1 AUTHOR, COPYRIGHT, AND LICENSE
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateCopyright 2001-2002 Jarkko Hietaniemi E<lt>jhi@iki.fiE<gt>
*0Sstevel@tonic-gate
*0Sstevel@tonic-gateThis document may be distributed under the same terms as Perl itself.