Lines Matching defs:Unicode
3 perluniintro - Perl Unicode introduction
7 This document gives a general idea of Unicode and how to use Unicode
9 treatments of Unicode.
11 =head2 Unicode
13 Unicode is a character set standard which plans to codify all of the
16 Unicode and ISO/IEC 10646 are coordinated standards that unify
23 Unicode 1.0 was released in October 1991, and 6.0 in October 2010.
25 A Unicode I<character> is an abstract entity. It is not bound to any
27 Unicode is language-neutral and display-neutral: it does not encode the
29 layout details. Unicode operates on characters and on text built from
32 Unicode defines characters like C<LATIN CAPITAL LETTER A> or C<GREEK
36 character within the set of all possible Unicode characters, and thus in
39 The Unicode standard prefers using hexadecimal notation for the code
41 at a later section, L</"Hexadecimal Notation">. The Unicode standard
45 Unicode also defines various I<properties> for the characters, like
51 A Unicode I<logical> "character" can actually consist of more than one internal
57 models, so Unicode created the I<grapheme cluster> concept, which was
61 Unicode characters: a leading consonant followed by an interior vowel followed
70 view: one "character" is one Unicode code point.
78 conversions between Unicode and legacy standards (like ISO 8859). Using
79 sequences, as Unicode does, allows for needing fewer basic building blocks
93 otherwise used blocks. Secondly, there are special Unicode control
96 When Unicode was first conceived, it was thought that all the world's
99 C<0xFFFF>. This soon proved to be wrong, and since Unicode 2.0 (July
100 1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
101 and Unicode 3.1 (March 2001) defined the first characters above C<0xFFFF>.
103 I<Basic Multilingual Plane> (BMP). With Unicode 3.1, 17 (yes,
107 When a new language is being encoded, Unicode generally will choose a
128 sake of this being an introduction. Unicode doesn't really encode
130 script can be used by many languages. Unicode also encodes things that
133 The Unicode code points are just abstract numbers. To input and
135 I<serialised> somehow. Unicode defines several I<character encoding
137 variable length encoding that encodes Unicode characters as 1 to 4
146 =head2 Perl's Unicode Support
148 Starting from Perl v5.6.0, Perl has had the capacity to handle Unicode
150 serious Unicode work. The maintenance release 5.6.1 fixed many of the
151 problems of the initial Unicode implementation, but for example
152 regular expressions still do not work with Unicode in 5.6.1.
153 Perl v5.14.0 is the first release where Unicode support is
164 (5.14 also fixes a number of bugs and departures from the Unicode
168 that operations in the current block or file would be Unicode-aware.
178 =head2 Perl's Unicode Model
181 strings of Unicode characters. The general principle is that Perl tries
184 to Unicode. Prior to Perl v5.14.0, the upgrade was not completely
185 transparent (see L<perlunicode/The "Unicode Bug">), and for backwards
192 UTF-8, to encode Unicode strings. Specifically, if all code points in
198 outputting Unicode strings to a stream without a PerlIO layer (one with
232 All features that combine Unicode and I/O also require using the new
237 =head2 Unicode and EBCDIC
239 Perl 5.8.0 added support for Unicode on EBCDIC platforms. This support
241 Unicode support is somewhat more complex to implement since additional
244 On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
253 =head2 Creating Unicode
259 To create Unicode characters in literals,
330 form C<"U+...">. Your best bet there for runtime Unicode by character
337 =head2 Handling Unicode
339 Handling Unicode is for the most part transparent: just use the
341 C<substr()> will work on the Unicode characters; regular expressions
342 will work on the Unicode characters (see L<perlunicode> and L<perlretut>).
360 When you combine legacy data and Unicode, the legacy data needs
361 to be upgraded to Unicode. Normally the legacy data is assumed to be
370 =head2 Unicode I/O
372 Normally, writing out Unicode data
377 Unicode string. Perl's internal encoding depends on the system as
412 Unicode or legacy encodings does not magically turn the data into
413 Unicode in Perl's eyes. To do that, specify the appropriate
441 print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
449 stream. The result is always Unicode.
464 automatically convert Unicode to the specified encoding when it is
510 =head2 Displaying Unicode As Text
512 Sometimes you might want to display Perl scalars containing Unicode as
514 its argument so that Unicode characters with code points greater than
525 } unpack("W*", $_[0])); # unpack Unicode characters
564 Unicode string (because the normal ways to get at the contents of a
565 string with Unicode--via input and output--should always be via
569 One way of peeking inside the internal encoding of Unicode characters
582 and Unicode characters in C<PV>. See also later in this document
596 in Unicode: what do you mean by "equal"?
607 and casing issues: see L<Unicode::Normalize>, Unicode Technical Report #15,
608 L<Unicode Normalization Forms|https://www.unicode.org/reports/tr15> and
609 sections on case mapping in the L<Unicode Standard|https://www.unicode.org>.
619 People like to see their strings nicely sorted--or as Unicode
632 See L<Unicode::Collate>, and I<Unicode Collation Algorithm>
647 magically Unicode-aware. What this means is that C<[A-Za-z]> will not
653 character classes that are Unicode-aware. There are dozens of them, see
656 Starting in v5.22, you can use Unicode code points as the end points of
658 all Unicode code points that lie between those end points, inclusive.
671 Unicode does define several other decimal--and numeric--characters
675 To get safe conversions from any Unicode string, use
676 L<Unicode::UCD/num()>.
688 Very probably not. Unless you are generating Unicode characters
690 that has changed and which could start generating Unicode is the old
698 How Do I Make My Scripts Work With Unicode?
701 generate Unicode data. The most important thing is getting input as
702 Unicode; for that, see the earlier I/O discussion.
703 To get full seamless Unicode support, add
709 How Do I Know Whether My String Is In Unicode?
715 whether the string they are contained within is in Unicode or not.
716 (See L<perlunicode/When Unicode Does Not Happen>.)
718 To determine if a string is in Unicode, use:
824 pack/unpack to convert to/from Unicode.
832 $Unicode = $bytes;
833 utf8::decode($Unicode);
837 $Unicode = pack("U0a*", $bytes);
843 and you can create well-formed Unicode with
849 How Do I Display Unicode? How Do I Input Unicode?
856 How Does Unicode Work With Traditional Locales?
862 C<L<Unicode::Collate>> and C<L<Unicode::Collate::Locale>> modules offer
871 have to translate from the locale character set to/from Unicode
872 yourself. See L</Unicode IE<sol>O> above for how to
876 to accomplish this, but full details are in L<perllocale/Unicode and
884 The Unicode standard prefers using hexadecimal notation because
885 that more clearly shows the division of Unicode into blocks of 256 characters.
888 with the Unicode standard. The C<U+HHHH> notation uses hexadecimal,
918 Unicode Consortium
924 Unicode FAQ
930 Unicode Glossary
936 Unicode Recommended Reading List
938 The Unicode Consortium has a list of articles and books, some of which
939 give a much more in depth treatment of Unicode:
944 Unicode Useful Resources
950 Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
956 UTF-8 and Unicode FAQ for Unix/Linux
969 You can explore various information from the Unicode data files using
970 the C<Unicode::UCD> module.
977 do some Unicode processing by using the modules C<Unicode::String>,
978 C<Unicode::Map8>, and C<Unicode::Map>, available from CPAN.
994 L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
995 L<Unicode::UCD>