utf8.h - OpenGrok cross reference for /openbsd-src/gnu/usr.bin/perl/utf8.h

Lines Matching defs:UTF
3  * This file contains definitions for use with the UTF-8 encoding.  It
4  * actually also works with the variant UTF-8 encoding called UTF-EBCDIC, and
15  * A note on nomenclature:  The term UTF-8 is used loosely and inconsistently
16  * in Perl documentation.  For one, perl uses an extension of UTF-8 to
18  * platform UTF-8 is usually conflated with EBCDIC platform UTF-EBCDIC, because
22  * UTF-EBCDIC has an isomorphic translation named I8 (for "Intermediate eight")
23  * which differs from UTF-8 only in a few details.  It is often useful to
24  * translate UTF-EBCDIC into this form for processing.  In general, macros and
25  * functions that are expecting their inputs to be either in I8 or UTF-8 are
39 indicate the UTF-8ness of those strings.
42 SV with the UTF-8 flag of the SV properly set, rather than use this mechanism.)
51 UTF-8-encoded characters.
56      treat as utf8;  // like turning on an SV UTF-8 flag
62 encoded as UTF-8.
66 This means it is equally valid to treat the string as bytes, or as UTF-8
69 UTF-8 or not.
80     do something for UTF-8;
95 the string may be treated in code as encoded in UTF-8
103     UTF8NESS_NO               =  0,  /* Definitely not UTF-8 */
104     UTF8NESS_IMMATERIAL       =  1,  /* Representation is the same in UTF-8 as
107     UTF8NESS_YES              =  2,  /* Definitely is UTF-8, wideness
112 /* Use UTF-8 as the default script encoding?
113  * Turning this on will break scripts having non-UTF-8 binary
134 are exactly the UTF-8 invariants.  But EBCDIC machines have more invariants
183  * fundamental difference between UTF-8 and UTF-EBCDIC is that the former has
196 /* The equivalent of the next few macros but implementing UTF-EBCDIC are in the
225            /* Perl extended (never was official UTF-8).  Up to 36 bit */
282 /* I8 is an intermediate version of UTF-8 used only in UTF-EBCDIC.  We thus
283  * consider it to be identical to UTF-8 on ASCII platforms.  Strictly speaking
284  * UTF-8 and UTF-EBCDIC are two different things, but we often conflate them
322 caused by legal UTF-8 avoiding non-shortest encodings: it is technically
323 possible to UTF-8-encode a single code point in different ways, but that is
340 Perl's extended UTF-8 means we can have start bytes up through FF, though any
346 /* This is the number of low-order bits a continuation byte in a UTF-8 encoded
370  * information in a continuation byte.  This turns out to be 0x3F in UTF-8,
371  * 0x1F in UTF-EBCDIC. */
375 /* For use in UTF8_IS_CONTINUATION().  This turns out to be 0xC0 in UTF-8,
376  * 0xE0 in UTF-EBCDIC */
381  * multi-byte UTF-8 encoded character that mark it as a continuation byte.
382  * This turns out to be 0x80 in UTF-8, 0xA0 in UTF-EBCDIC.  (khw doesn't know
397  * being encoded in UTF-8 or not? This is a fundamental property of
398  * UTF-8,EBCDIC */
406 not it is encoded in UTF-8; otherwise evaluates to 0.  UTF-8 invariant
407 characters can be copied as-is when converting to/from UTF-8, saving time.
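
The three constants quoted at lines 370-382 (0x3F, 0xC0, 0x80) and the invariant test described at line 406 can be illustrated with a minimal sketch for ASCII platforms only. The `sketch_` and `SKETCH_` names are hypothetical and are not the macros defined in utf8.h, and the invariant rule used here (every byte below 0x80) is an assumption that holds for ASCII-platform UTF-8, not for UTF-EBCDIC.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the listing for ASCII-platform UTF-8.
 * All names here are hypothetical, not the real utf8.h macros. */
#define SKETCH_CONT_INFO_MASK 0x3F /* payload bits of a continuation byte */
#define SKETCH_CONT_TEST_MASK 0xC0 /* bits examined to spot a continuation */
#define SKETCH_CONT_MARK      0x80 /* value of those bits in a continuation */

/* True if b has the form 10xx_xxxx. */
static int sketch_is_continuation(uint8_t b)
{
    return (b & SKETCH_CONT_TEST_MASK) == SKETCH_CONT_MARK;
}

/* The six payload bits carried by a continuation byte. */
static uint8_t sketch_continuation_payload(uint8_t b)
{
    assert(sketch_is_continuation(b));
    return (uint8_t)(b & SKETCH_CONT_INFO_MASK);
}

/* Assumption: on ASCII platforms a byte is invariant (same representation
 * whether or not the string is UTF-8) exactly when it is below 0x80. */
static int sketch_is_invariant(uint8_t b)
{
    return b < 0x80;
}
```

For example, `sketch_is_continuation(0xA9)` is true and its payload is 0x29, while 0xC3 is a start byte and 0x41 is an invariant.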
420  * UTF-8 encoded character that mark it as a start byte and give the number of
450 The maximum width of a single UTF-8 encoded character, in bytes.
452 NOTE: Strictly speaking Perl's UTF-8 should not be called UTF-8 since UTF-8
454 expressed with 4 bytes.  However, Perl thinks of UTF-8 as a way to encode
459 The start byte 0xFE, never used in any ASCII platform UTF-8 specification, has
461 sequence of 7 bytes.  And in fact, this is exactly what standard UTF-EBCDIC
466   1) The meaning in standard UTF-EBCDIC, namely as an FE start byte, with the
472      There are published UTF-8 extensions that do this, some string together
478 The goal is to be able to represent 64-bit values in UTF-8 or UTF-EBCDIC.  That
481 sequence.  So in Perl, a start byte of FF indicates a UTF-8 string consisting
483 This turns out to be 13 total bytes in UTF-8 and 14 in UTF-EBCDIC.  This is
486                 UTF-8 (khw knows not why 11, which would encode 66 bits wasn't
488     13 * 5 bits of info per byte (could encode 65-bit numbers) on UTF-EBCDIC
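
The byte counts at lines 483-488 follow from simple arithmetic, sketched below under the assumption that an FF start byte carries no payload bits, so all information sits in the continuation bytes. The function name is hypothetical.

```c
#include <assert.h>

/* Payload bits available in a sequence of total_bytes bytes whose start
 * byte (FF) is assumed to carry no payload of its own. */
static unsigned sketch_payload_bits(unsigned total_bytes,
                                    unsigned bits_per_continuation)
{
    assert(total_bytes >= 1);
    return (total_bytes - 1) * bits_per_continuation;
}
```

With 6 bits per continuation byte, 13 total bytes give 72 bits and 12 total bytes (11 continuations) give 66. With 5 bits per continuation byte, 14 total bytes give 65. Each of these is enough for a 64-bit value.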
529 /* Compute the number of UTF-8 bytes required for representing the input uv,
560 encoded as UTF-8.  C<cp> is a native (ASCII or EBCDIC) code point if less than
645 /* The largest code point representable by two UTF-8 bytes on this platform.
647  *      1101_1111 10xx_xxxx in UTF-8, and
648  *      1101_1111 101y_yyyy in UTF-EBCDIC I8.
653  *      1_1111 xx_xxxx in UTF-8, and
654  *      1_1111 y_yyyy in UTF-EBCDIC I8.
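
The bit patterns at lines 647-654 determine the largest two-byte code point once every payload bit is set. A minimal sketch, using a hypothetical helper that takes the payload widths of the start byte and the continuation byte:

```c
#include <assert.h>
#include <stdint.h>

/* Largest code point whose encoding is one start byte carrying
 * start_bits payload bits plus one continuation byte carrying cont_bits. */
static uint32_t sketch_max_two_byte_cp(unsigned start_bits, unsigned cont_bits)
{
    uint32_t start_max, cont_max;

    assert(start_bits + cont_bits < 32);
    start_max = (1u << start_bits) - 1; /* e.g. 1_1111 */
    cont_max  = (1u << cont_bits) - 1;  /* e.g. xx_xxxx with every x set */
    return (start_max << cont_bits) | cont_max;
}
```

For UTF-8 (5 + 6 payload bits) this gives 0x7FF; for UTF-EBCDIC I8 (5 + 5 payload bits) it gives 0x3FF.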
658 /* The largest code point representable by two UTF-8 bytes on any platform that
668 The maximum number of UTF-8 bytes a single Unicode character can
677  * Perl's extended UTF-8.  We have to make it large enough to fit any single
704  * used in a loop to convert from UTF-8 to the code point represented.  Note
706  * the UTF-EBCDIC byte, whereas the 'old' parameter is a Unicode (not EBCDIC)
713 /* This works in the face of malformed UTF-8. */
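
The accumulation loop mentioned at line 704 can be sketched as follows for ASCII-platform UTF-8. This is an illustration with hypothetical names, not the macro from utf8.h. It handles only standard sequences of 1 to 4 bytes and, as a simplification, does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One loop step: shift the running value left by six bits and add the
 * six payload bits of the next continuation byte. */
static uint32_t sketch_accumulate(uint32_t old, uint8_t cont)
{
    return (old << 6) | (uint32_t)(cont & 0x3F);
}

/* Decode one character starting at s, reading at most len bytes.
 * Returns 0 and stores the code point in *cp, or -1 if the bytes are
 * malformed or use a start byte outside the standard 1..4 byte forms. */
static int sketch_decode(const uint8_t *s, size_t len, uint32_t *cp)
{
    size_t need, i;
    uint32_t value;

    assert(s != NULL && cp != NULL);
    if (len == 0)
        return -1;

    if (s[0] < 0x80) { /* invariant: one byte */
        *cp = s[0];
        return 0;
    } else if ((s[0] & 0xE0) == 0xC0) {
        need = 2; value = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        need = 3; value = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        need = 4; value = s[0] & 0x07;
    } else {
        return -1; /* stray continuation or extended start byte */
    }

    if (len < need)
        return -1; /* never read past the end of the buffer */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1; /* not a continuation byte */
        value = sketch_accumulate(value, s[i]);
    }
    *cp = value;
    return 0;
}
```

Decoding the bytes C3 A9 yields 0xE9, and E2 82 AC yields 0x20AC.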
722 /* Convert a UTF-8 variant Latin1 character to a native code point value.
724  * that the code point is < 256, and is not UTF-8 invariant.  Use the slower
756 returns the number of bytes a non-malformed UTF-8 encoded character whose first
799 beyond the end of the input buffer, even if it is malformed UTF-8.
815 UTF-8 encoded character whose first byte is pointed to by C<s>.  But it never
833 UTF-8 as when not; otherwise evaluates to 0.  UTF-8 invariant characters can be
834 copied as-is when converting to/from UTF-8, saving time.
837 from which C<c> comes is not encoded in UTF-8.
843 The reason it works on both UTF-8 encoded strings and non-UTF-8 encoded, is
855  * in UTF-8?  This is the inverse of UTF8_IS_INVARIANT. */
890  * code point whose UTF-8 is known to occupy 2 bytes; they are less efficient
904 /* This is illegal in any well-formed UTF-8 in both EBCDIC and ASCII
909  * 'UTF' is whether or not p is encoded in UTF8.  The names 'foo_lazy_if' stem
915 #define isIDFIRST_lazy_if_safe(p, e, UTF)                                   \
916                    ((IN_BYTES || !UTF)                                      \
919 #define isWORDCHAR_lazy_if_safe(p, e, UTF)                                  \
920                    ((IN_BYTES || !UTF)                                      \
923 #define isALNUM_lazy_if_safe(p, e, UTF) isWORDCHAR_lazy_if_safe(p, e, UTF)
936 encoded in UTF-8.
939 case any call to string overloading updates the internal UTF-8 encoding flag.
945 /* Should all strings be treated as Unicode, and not just UTF-8 encoded ones?
947  * within 'use bytes'.  UTF-8 locales are not tested for here, because it gets
968 looking no further than S<C<e - 1>> are well-formed UTF-8 that represents one
994 looking no further than S<C<e - 1>> are well-formed UTF-8 that represents the
1027 Recall that Perl recognizes an extension to UTF-8 that can encode code
1031 at C<s> and looking no further than S<C<e - 1>> are from this UTF-8 extension;
1035 0 is returned if the bytes are not well-formed extended UTF-8, or if they
1090 looking no further than S<C<e - 1>> are well-formed UTF-8 that represents one
1115 /* Perl extends Unicode so that it is possible to encode (as extended UTF-8 or
1116  * UTF-EBCDIC) any 64-bit value.  No standard known to khw ever encoded higher
1178 /* The original UTF-8 standard did not define UTF-8 with start bytes of 0xFE or
1179  * 0xFF, though UTF-EBCDIC did.  This allowed both versions to represent code
1180  * points up to 2 ** 31 - 1.  Perl extends UTF-8 so that 0xFE and 0xFF are
1182  * UTF-EBCDIC defines.  These changes allow code points of 64 bits (actually
1219 /* This is typically used for code that processes UTF-8 input and doesn't want
1229 /* Accept any Perl-extended UTF-8 that evaluates to any UV on the platform, but
1234 #define UNICODE_WARN_SURROGATE         0x0001	/* UTF-16 surrogates */