Lines Matching defs:UTF

3 * This file contains definitions for use with the UTF-8 encoding.  It
4 * actually also works with the variant UTF-8 encoding called UTF-EBCDIC, and
15 * A note on nomenclature: The term UTF-8 is used loosely and inconsistently
16 * in Perl documentation. For one, perl uses an extension of UTF-8 to
18 * platform UTF-8 is usually conflated with EBCDIC platform UTF-EBCDIC, because
22 * UTF-EBCDIC has an isomorphic translation named I8 (for "Intermediate eight")
23 * which differs from UTF-8 only in a few details. It is often useful to
24 * translate UTF-EBCDIC into this form for processing. In general, macros and
25 * functions that are expecting their inputs to be either in I8 or UTF-8 are
39 indicate the UTF-8ness of those strings.
42 SV with the UTF-8 flag of the SV properly set, rather than use this mechanism.)
51 UTF-8-encoded characters.
56 treat as utf8; // like turning on an SV UTF-8 flag
62 encoded as UTF-8.
66 This means it is equally valid to treat the string as bytes, or as UTF-8
69 UTF-8 or not.
80 do something for UTF-8;
95 the string may be treated in code as encoded in UTF-8
103 UTF8NESS_NO = 0, /* Definitely not UTF-8 */
104 UTF8NESS_IMMATERIAL = 1, /* Representation is the same in UTF-8 as
107 UTF8NESS_YES = 2, /* Definitely is UTF-8, wideness
112 /* Use UTF-8 as the default script encoding?
113 * Turning this on will break scripts having non-UTF-8 binary
134 are exactly the UTF-8 invariants. But EBCDIC machines have more invariants
183 * fundamental difference between UTF-8 and UTF-EBCDIC is that the former has
196 /* The equivalent of the next few macros but implementing UTF-EBCDIC are in the
225 /* Perl extended (never was official UTF-8). Up to 36 bit */
282 /* I8 is an intermediate version of UTF-8 used only in UTF-EBCDIC. We thus
283 * consider it to be identical to UTF-8 on ASCII platforms. Strictly speaking
284 * UTF-8 and UTF-EBCDIC are two different things, but we often conflate them
322 caused by legal UTF-8 avoiding non-shortest encodings: it is technically
323 possible to UTF-8-encode a single code point in different ways, but that is
340 Perl's extended UTF-8 means we can have start bytes up through FF, though any
346 /* This is the number of low-order bits a continuation byte in a UTF-8 encoded
370 * information in a continuation byte. This turns out to be 0x3F in UTF-8,
371 * 0x1F in UTF-EBCDIC. */
375 /* For use in UTF8_IS_CONTINUATION(). This turns out to be 0xC0 in UTF-8,
376 * E0 in UTF-EBCDIC */
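As a rough illustration of how the ASCII-platform values quoted in these comments fit together (payload mask 0x3F, test mask 0xC0, and the standard UTF-8 continuation marker 0x80), here is a minimal sketch. All `sketch_` and `SKETCH_` names are hypothetical and are not the header's own macros; the UTF-EBCDIC values (0x1F, 0xE0, 0xA0) are not covered.

```c
#include <stdint.h>

/* Hypothetical names for illustration only; not the header's macros.
 * Values are the ASCII-platform UTF-8 ones quoted in the comments. */
#define SKETCH_CONT_TEST_MASK 0xC0u /* bits examined to classify a byte */
#define SKETCH_CONT_MARK      0x80u /* 10xx_xxxx marks a continuation   */
#define SKETCH_CONT_PAYLOAD   0x3Fu /* low 6 bits carry code point data */

/* Nonzero if 'b' has the continuation-byte bit pattern. */
static int sketch_is_continuation(uint8_t b)
{
    return (b & SKETCH_CONT_TEST_MASK) == SKETCH_CONT_MARK;
}

/* Fold one continuation byte into a partially decoded code point. */
static uint32_t sketch_accumulate(uint32_t partial, uint8_t cont)
{
    return (partial << 6) | (cont & SKETCH_CONT_PAYLOAD);
}
```

For the two-byte sequence C3 A9, the start byte contributes `0xC3 & 0x1F`, and `sketch_accumulate()` then folds in the six payload bits of A9 to give U+00E9.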
381 * multi-byte UTF-8 encoded character that mark it as a continuation byte.
382 * This turns out to be 0x80 in UTF-8, 0xA0 in UTF-EBCDIC. (khw doesn't know
397 * being encoded in UTF-8 or not? This is a fundamental property of
398 * UTF-8,EBCDIC */
406 not it is encoded in UTF-8; otherwise evaluates to 0. UTF-8 invariant
407 characters can be copied as-is when converting to/from UTF-8, saving time.
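The point about invariants being copied as-is can be sketched as a Latin-1 to UTF-8 conversion. This is an illustration under stated assumptions, not the header's implementation: it covers ASCII platforms only, where a byte is invariant when it is below 0x80, and the `sketch_` function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Perl API. ASCII platforms only: a byte is
 * treated as invariant when it is below 0x80. */
static int sketch_is_invariant(uint8_t c)
{
    return c < 0x80;
}

/* Convert 'len' Latin-1 bytes to UTF-8. 'dst' must have room for
 * 2 * len bytes. Returns the number of bytes written. */
static size_t sketch_latin1_to_utf8(const uint8_t *src, size_t len,
                                    uint8_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t c = src[i];
        if (sketch_is_invariant(c)) {
            dst[out++] = c;                            /* copied as-is  */
        } else {
            dst[out++] = (uint8_t)(0xC0 | (c >> 6));   /* start byte    */
            dst[out++] = (uint8_t)(0x80 | (c & 0x3F)); /* continuation  */
        }
    }
    return out;
}
```

Only the variant bytes (0x80 through 0xFF) take the two-byte path, which is where the time saving for mostly-ASCII input comes from.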
420 * UTF-8 encoded character that mark it as a start byte and give the number of
450 The maximum width of a single UTF-8 encoded character, in bytes.
452 NOTE: Strictly speaking Perl's UTF-8 should not be called UTF-8 since UTF-8
454 expressed with 4 bytes. However, Perl thinks of UTF-8 as a way to encode
459 The start byte 0xFE, never used in any ASCII platform UTF-8 specification, has
461 sequence of 7 bytes. And in fact, this is exactly what standard UTF-EBCDIC
466 1) The meaning in standard UTF-EBCDIC, namely as an FE start byte, with the
472 There are published UTF-8 extensions that do this, some string together
478 The goal is to be able to represent 64-bit values in UTF-8 or UTF-EBCDIC. That
481 sequence. So in Perl, a start byte of FF indicates a UTF-8 string consisting
483 This turns out to be 13 total bytes in UTF-8 and 14 in UTF-EBCDIC. This is
486 UTF-8 (khw knows not why 11, which would encode 66 bits wasn't
488 13 * 5 bits of info per byte (could encode 65-bit numbers) on UTF-EBCDIC
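The bit counts mentioned in these comment fragments can be checked with a little arithmetic. The sketch below assumes that the FF start byte itself carries no code point bits, so every payload bit comes from the continuation bytes (6 bits each in UTF-8, 5 in UTF-EBCDIC). The helper name is hypothetical.

```c
/* Hypothetical helper: payload bits in a sequence of 'total_bytes' bytes
 * whose start byte (assumed here to be FF) carries no code point bits
 * and whose remaining bytes each carry 'bits_per_cont' bits. */
static unsigned sketch_payload_bits(unsigned total_bytes,
                                    unsigned bits_per_cont)
{
    return (total_bytes - 1) * bits_per_cont;
}
```

Under that assumption, 13 total bytes in UTF-8 give 12 * 6 = 72 bits, 14 total bytes in UTF-EBCDIC give 13 * 5 = 65 bits, and the 11 continuation bytes the comment wonders about would give 66 bits.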
529 /* Compute the number of UTF-8 bytes required for representing the input uv,
560 encoded as UTF-8. C<cp> is a native (ASCII or EBCDIC) code point if less than
645 /* The largest code point representable by two UTF-8 bytes on this platform.
647 * 1101_1111 10xx_xxxx in UTF-8, and
648 * 1101_1111 101y_yyyy in UTF-EBCDIC I8.
653 * 1_1111 xx_xxxx in UTF-8, and
654 * 1_1111 y_yyyy in UTF-EBCDIC I8.
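The bit patterns above translate directly into the largest two-byte code point: five payload bits from the start byte followed by six (UTF-8) or five (UTF-EBCDIC I8) from the continuation byte. A minimal sketch, with a hypothetical helper name that is not part of the header:

```c
#include <stdint.h>

/* Hypothetical helper: largest value formed by a 5-bit start-byte
 * payload (1_1111) followed by one continuation byte carrying
 * 'cont_bits' payload bits, all set. */
static uint32_t sketch_max_two_byte(unsigned cont_bits)
{
    uint32_t cont_mask = (1u << cont_bits) - 1;
    return (0x1Fu << cont_bits) | cont_mask;
}
```

With six continuation bits this yields 0x7FF, and with five it yields 0x3FF.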
658 /* The largest code point representable by two UTF-8 bytes on any platform that
668 The maximum number of UTF-8 bytes a single Unicode character can
677 * Perl's extended UTF-8. We have to make it large enough to fit any single
704 * used in a loop to convert from UTF-8 to the code point represented. Note
706 * the UTF-EBCDIC byte, whereas the 'old' parameter is a Unicode (not EBCDIC)
713 /* This works in the face of malformed UTF-8. */
722 /* Convert a UTF-8 variant Latin1 character to a native code point value.
724 * that the code point is < 256, and is not UTF-8 invariant. Use the slower
756 returns the number of bytes a non-malformed UTF-8 encoded character whose first
799 beyond the end of the input buffer, even if it is malformed UTF-8.
815 UTF-8 encoded character whose first byte is pointed to by C<s>. But it never
833 UTF-8 as when not; otherwise evaluates to 0. UTF-8 invariant characters can be
834 copied as-is when converting to/from UTF-8, saving time.
837 from which C<c> comes is not encoded in UTF-8.
843 The reason it works on both UTF-8 encoded strings and non-UTF-8 encoded, is
855 * in UTF-8? This is the inverse of UTF8_IS_INVARIANT. */
890 * code point whose UTF-8 is known to occupy 2 bytes; they are less efficient
904 /* This is illegal in any well-formed UTF-8 in both EBCDIC and ASCII
909 * 'UTF' is whether or not p is encoded in UTF8. The names 'foo_lazy_if' stem
915 #define isIDFIRST_lazy_if_safe(p, e, UTF) \
916 ((IN_BYTES || !UTF) \
919 #define isWORDCHAR_lazy_if_safe(p, e, UTF) \
920 ((IN_BYTES || !UTF) \
923 #define isALNUM_lazy_if_safe(p, e, UTF) isWORDCHAR_lazy_if_safe(p, e, UTF)
936 encoded in UTF-8.
939 case any call to string overloading updates the internal UTF-8 encoding flag.
945 /* Should all strings be treated as Unicode, and not just UTF-8 encoded ones?
947 * within 'use bytes'. UTF-8 locales are not tested for here, because it gets
968 looking no further than S<C<e - 1>> are well-formed UTF-8 that represents one
994 looking no further than S<C<e - 1>> are well-formed UTF-8 that represents the
1027 Recall that Perl recognizes an extension to UTF-8 that can encode code
1031 at C<s> and looking no further than S<C<e - 1>> are from this UTF-8 extension;
1035 0 is returned if the bytes are not well-formed extended UTF-8, or if they
1090 looking no further than S<C<e - 1>> are well-formed UTF-8 that represents one
1115 /* Perl extends Unicode so that it is possible to encode (as extended UTF-8 or
1116 * UTF-EBCDIC) any 64-bit value. No standard known to khw ever encoded higher
1178 /* The original UTF-8 standard did not define UTF-8 with start bytes of 0xFE or
1179 * 0xFF, though UTF-EBCDIC did. This allowed both versions to represent code
1180 * points up to 2 ** 31 - 1. Perl extends UTF-8 so that 0xFE and 0xFF are
1182 * UTF-EBCDIC defines. These changes allow code points of 64 bits (actually
1219 /* This is typically used for code that processes UTF-8 input and doesn't want
1229 /* Accept any Perl-extended UTF-8 that evaluates to any UV on the platform, but
1234 #define UNICODE_WARN_SURROGATE 0x0001 /* UTF-16 surrogates */