.\"	$OpenBSD: utf8.7,v 1.9 2022/02/18 10:24:32 jsg Exp $
.\"
.\" Copyright (c) 2017 Ted Unangst <tedu@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: February 18 2022 $
.Dt UTF8 7
.Os
.Sh NAME
.Nm utf8
.Nd UTF-8 text encoding
.Sh DESCRIPTION
UTF-8 is a multibyte character encoding for Unicode text.
It is the preferred format for non-ASCII text.
.Pp
Unicode codepoints are encoded as follows:
.Bl -tag -width Ds
.It U+0000 \(en U+007F:
One byte: 0....... (compatible with ASCII)
.It U+0080 \(en U+07FF:
Two bytes: 110..... 10......
.It U+0800 \(en U+D7FF and U+E000 \(en U+FFFF:
Three bytes: 1110.... 10...... 10......
.It U+10000 \(en U+10FFFF:
Four bytes: 11110... 10...... 10...... 10......
.El
.Pp
The bits shown as dots contain the codepoint represented as a binary
integer.
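.Pp
As an illustration of the table above, the following C sketch packs
the bits of a codepoint into one to four bytes.
The function name utf8_encode is hypothetical and not part of any
system library; the caller is assumed to supply a buffer of at least
four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative encoder (hypothetical name, not a libc function).
 * Writes the UTF-8 form of cp into buf, which must hold 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * (U+D800 to U+DFFF) or lies beyond U+10FFFF.
 */
size_t
utf8_encode(uint32_t cp, unsigned char *buf)
{
	if (cp <= 0x7f) {
		/* 0....... */
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if (cp <= 0x7ff) {
		/* 110..... 10...... */
		buf[0] = (unsigned char)(0xc0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp <= 0xffff) {
		if (cp >= 0xd800 && cp <= 0xdfff)
			return 0;
		/* 1110.... 10...... 10...... */
		buf[0] = (unsigned char)(0xe0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3f));
		return 3;
	}
	if (cp <= 0x10ffff) {
		/* 11110... 10...... 10...... 10...... */
		buf[0] = (unsigned char)(0xf0 | (cp >> 18));
		buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
		buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		buf[3] = (unsigned char)(0x80 | (cp & 0x3f));
		return 4;
	}
	return 0;
}
```

For example, U+00E9 becomes the two bytes c3 a9, and U+20AC becomes
the three bytes e2 82 ac.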
.Pp
Bytes starting with the bit pattern 11...... are called UTF-8 start
bytes, and those starting with 10...... UTF-8 continuation bytes.
The number of leading 1 bits in a start byte indicates the total
number of bytes used to encode the codepoint, including the start
byte.
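.Pp
The classification by leading bits can be sketched in C as follows.
The function name utf8_seq_len and its return convention are
assumptions made for this example.
The check is purely structural: it does not reject start bytes that
are invalid for other reasons.

```c
/*
 * Illustrative classifier (hypothetical name and return values).
 * Returns 1 for an ASCII byte, 0 for a continuation byte,
 * 2 to 4 for a start byte (the total sequence length),
 * and -1 for a byte with five or more leading 1 bits.
 */
int
utf8_seq_len(unsigned char b)
{
	if ((b & 0x80) == 0x00)
		return 1;	/* 0....... */
	if ((b & 0xc0) == 0x80)
		return 0;	/* 10...... */
	if ((b & 0xe0) == 0xc0)
		return 2;	/* 110..... */
	if ((b & 0xf0) == 0xe0)
		return 3;	/* 1110.... */
	if ((b & 0xf8) == 0xf0)
		return 4;	/* 11110... */
	return -1;		/* 11111... */
}
```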
.Pp
Encodings using more bytes than required are invalid.
In particular, 11000000 and 11000001 are not valid start bytes,
the byte after 11100000 must be at least 10100000,
and the byte after 11110000 must be at least 10010000.
.Pp
The ranges U+D800 to U+DFFF and U+110000 to U+1FFFFF
do not contain valid Unicode codepoints.
Consequently, the corresponding three- and four-byte UTF-8 sequences
are invalid.
The highest valid byte after 11101101 is 10011111,
the highest valid byte of the form 1111.... is 11110100,
and the highest valid byte after 11110100 is 10001111.
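.Pp
One way to enforce these rules is to decode the sequence first and
then test the resulting value, rather than testing individual bytes.
The following C sketch takes that approach.
The name utf8_decode and its interface are hypothetical, and the
function handles a single codepoint only.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative decoder (hypothetical name and interface).
 * Decodes one codepoint from the n bytes at s into *cp.
 * Returns the number of bytes consumed, or 0 if the input is
 * truncated, malformed, overlong, a surrogate, or beyond U+10FFFF.
 */
size_t
utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	/* Smallest codepoint that needs a sequence of each length. */
	static const uint32_t min[] = { 0, 0, 0x80, 0x800, 0x10000 };
	uint32_t c;
	size_t i, len;

	if (n == 0)
		return 0;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	}
	if ((s[0] & 0xe0) == 0xc0) {
		len = 2;
		c = s[0] & 0x1f;
	} else if ((s[0] & 0xf0) == 0xe0) {
		len = 3;
		c = s[0] & 0x0f;
	} else if ((s[0] & 0xf8) == 0xf0) {
		len = 4;
		c = s[0] & 0x07;
	} else
		return 0;	/* continuation byte or 11111... */
	if (n < len)
		return 0;	/* truncated sequence */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xc0) != 0x80)
			return 0;	/* not a continuation byte */
		c = (c << 6) | (s[i] & 0x3f);
	}
	if (c < min[len])
		return 0;	/* more bytes than required */
	if (c >= 0xd800 && c <= 0xdfff)
		return 0;	/* UTF-16 surrogate */
	if (c > 0x10ffff)
		return 0;	/* beyond the Unicode range */
	*cp = c;
	return len;
}
```

With this sketch, the sequence ed 9f bf decodes to U+D7FF, while
ed a0 80 is rejected as a surrogate and c0 80 is rejected as overlong.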
.Pp
To summarize, the following is a complete list of bytes
that are invalid in all contexts:
.Pp
.Bl -tag -width 5n -offset 4n -compact
.It c0\(enc1
two-byte sequence that has to be encoded as a single byte
.It f5\(enf7
four-byte sequence beyond the Unicode range
.It f8\(enff
invalid sequence of five or more bytes
.El
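.Pp
The list above reduces to a short test on a single byte.
A minimal C sketch, using the hypothetical name
utf8_byte_never_valid:

```c
/*
 * Illustrative check (hypothetical name).
 * Returns nonzero if b can never appear in valid UTF-8:
 * c0 and c1 (overlong two-byte forms) and f5 through ff.
 */
int
utf8_byte_never_valid(unsigned char b)
{
	return b == 0xc0 || b == 0xc1 || b >= 0xf5;
}
```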
.Pp
The following is a complete list of invalid two-byte combinations
of the form 11...... 10...... that consist of two valid bytes:
.Pp
.Bl -tag -width 9n -offset 4n -compact
.It e080\(ene09f
three-byte sequence that has to be encoded as two bytes
.It eda0\(enedbf
start of a UTF-16 surrogate, which is not valid UTF-8
.It f080\(enf08f
four-byte sequence that has to be encoded as three bytes
.It f490\(enf4bf
four-byte sequence beyond the Unicode range
.El
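.Pp
The four ranges above can be tested directly on the first two bytes
of a sequence, without decoding it.
The following C sketch uses the hypothetical name utf8_pair_invalid
and assumes that both bytes have already passed the single-byte
checks, with a being a start byte and b a continuation byte.

```c
/*
 * Illustrative check (hypothetical name).
 * Given a start byte a and the continuation byte b that follows it,
 * returns nonzero if the pair is one of the invalid combinations:
 * e080-e09f, eda0-edbf, f080-f08f, or f490-f4bf.
 */
int
utf8_pair_invalid(unsigned char a, unsigned char b)
{
	switch (a) {
	case 0xe0:	/* overlong three-byte sequence */
		return b >= 0x80 && b <= 0x9f;
	case 0xed:	/* UTF-16 surrogate */
		return b >= 0xa0 && b <= 0xbf;
	case 0xf0:	/* overlong four-byte sequence */
		return b >= 0x80 && b <= 0x8f;
	case 0xf4:	/* beyond U+10FFFF */
		return b >= 0x90 && b <= 0xbf;
	default:
		return 0;
	}
}
```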
.Sh SEE ALSO
.Xr locale 1 ,
.Xr ascii 7
.Sh STANDARDS
.Rs
.%A F. Yergeau
.%D November 2003
.%R RFC 3629
.%T UTF-8, a transformation format of ISO 10646
.Re
.Pp
.Lk https://www.unicode.org/versions/latest/ "The Unicode Standard"
.Pp
.Lk https://www.unicode.org/reports/tr44/ "The Unicode Character Database"