.\"	$OpenBSD: utf8.7,v 1.5 2017/05/31 17:16:48 schwarze Exp $
.\"
.\" Copyright (c) 2017 Ted Unangst <tedu@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: May 31 2017 $
.Dt UTF8 7
.Os
.Sh NAME
.Nm utf8
.Nd UTF-8 text encoding
.Sh DESCRIPTION
UTF-8 is a multibyte encoding for Unicode text.
It is the preferred format for non-ASCII text.
.Pp
The length of a UTF-8 sequence depends on the encoded value.
If the high bit of the first byte is zero, the sequence length is one and
the value is the remaining seven bits.
Otherwise, the number of leading one bits in the first byte, terminated by
a zero bit, gives the length of the sequence in bytes, and the value is
formed by concatenating the low bits of each byte.
Continuation bytes all have the same format: the top bit set, the next bit
unset, and six value bits.
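.Pp
As a minimal sketch of the rules above, the following C fragment classifies
a byte by its high bits.
The names
.Fn utf8_seqlen
and
.Fn utf8_iscont
are hypothetical and not part of any system library.
The check looks at the bit pattern only; it does not reject lead bytes such
as 0xc0 that can only begin an overlong sequence.
.Bd -literal -offset indent
```c
#include <stddef.h>

/*
 * Return the sequence length implied by a lead byte, or 0 if the
 * byte cannot start a sequence (a continuation byte, or more than
 * four leading one bits).
 */
size_t
utf8_seqlen(unsigned char c)
{
	if ((c & 0x80) == 0x00)		/* 0....... */
		return 1;
	if ((c & 0xe0) == 0xc0)		/* 110..... */
		return 2;
	if ((c & 0xf0) == 0xe0)		/* 1110.... */
		return 3;
	if ((c & 0xf8) == 0xf0)		/* 11110... */
		return 4;
	return 0;
}

/* Return nonzero if c is a continuation byte (10......). */
int
utf8_iscont(unsigned char c)
{
	return (c & 0xc0) == 0x80;
}
```
.Ed
.Pp
For example,
.Fn utf8_seqlen 0xe2
returns 3, while
.Fn utf8_seqlen 0x80
returns 0 because 0x80 is a continuation byte.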
.Pp
Unicode ranges and their encoding formats:
.Bl -tag -width Ds
.It 0x0 - 0x7f
One byte.
0.......
.It 0x80 - 0x7ff
Two bytes.
110..... 10......
.It 0x800 - 0xffff
Three bytes.
1110.... 10...... 10......
.It 0x10000 - 0x10ffff
Four bytes.
11110... 10...... 10...... 10......
.El
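.Pp
The ranges above map directly onto an encoder.
The following is a sketch only:
.Fn utf8_encode
is a hypothetical name, and the rejection of the surrogate range
0xd800 - 0xdfff is an addition that follows RFC 3629 rather than
anything stated in the table.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode code point cp into buf, which must hold at least four bytes.
 * Return the number of bytes written, or 0 if cp is a surrogate
 * or lies above 0x10ffff.
 */
size_t
utf8_encode(uint32_t cp, unsigned char *buf)
{
	if (cp <= 0x7f) {
		buf[0] = cp;
		return 1;
	}
	if (cp <= 0x7ff) {
		buf[0] = 0xc0 | (cp >> 6);
		buf[1] = 0x80 | (cp & 0x3f);
		return 2;
	}
	if (cp <= 0xffff) {
		if (cp >= 0xd800 && cp <= 0xdfff)
			return 0;	/* surrogates are not encodable */
		buf[0] = 0xe0 | (cp >> 12);
		buf[1] = 0x80 | ((cp >> 6) & 0x3f);
		buf[2] = 0x80 | (cp & 0x3f);
		return 3;
	}
	if (cp <= 0x10ffff) {
		buf[0] = 0xf0 | (cp >> 18);
		buf[1] = 0x80 | ((cp >> 12) & 0x3f);
		buf[2] = 0x80 | ((cp >> 6) & 0x3f);
		buf[3] = 0x80 | (cp & 0x3f);
		return 4;
	}
	return 0;
}
```
.Ed
.Pp
With this sketch, the value 0x20ac encodes as the three bytes
0xe2 0x82 0xac.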
.Sh SEE ALSO
.Xr ascii 7
.Sh STANDARDS
.Rs
.%A F. Yergeau
.%D November 2003
.%R RFC 3629
.%T UTF-8, a transformation format of ISO 10646
.Re
.Pp
The Unicode Standard.
.Sh CAVEATS
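Beware of overlong encodings, in which a value is encoded in more bytes
than it needs, such as the two byte sequence 0xc0 0xaf for the slash
character 0x2f.
Decoders should reject such sequences, since accepting them lets
characters slip past checks that inspect the raw bytes.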
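.Pp
One way to guard against overlong input is to compare each decoded value
with the smallest value that needs that sequence length.
The following decoder is a sketch under that assumption;
.Fn utf8_decode
and the
.Va minval
table are hypothetical names, and the surrogate and upper bound checks
follow RFC 3629.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one sequence from s, reading at most n bytes.
 * Store the value in *cp and return the sequence length, or return 0
 * for truncated, malformed, overlong, surrogate, or out of range input.
 */
size_t
utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	/* Smallest value that legitimately needs each length. */
	static const uint32_t minval[] = { 0, 0x0, 0x80, 0x800, 0x10000 };
	uint32_t v;
	size_t i, len;

	if (n == 0)
		return 0;
	if (s[0] < 0x80) {
		len = 1;
		v = s[0];
	} else if ((s[0] & 0xe0) == 0xc0) {
		len = 2;
		v = s[0] & 0x1f;
	} else if ((s[0] & 0xf0) == 0xe0) {
		len = 3;
		v = s[0] & 0x0f;
	} else if ((s[0] & 0xf8) == 0xf0) {
		len = 4;
		v = s[0] & 0x07;
	} else
		return 0;
	if (len > n)
		return 0;
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xc0) != 0x80)
			return 0;
		v = v << 6 | (s[i] & 0x3f);
	}
	if (v < minval[len])
		return 0;		/* overlong */
	if (v > 0x10ffff || (v >= 0xd800 && v <= 0xdfff))
		return 0;
	*cp = v;
	return len;
}
```
.Ed
.Pp
Given the bytes 0xc0 0xaf, this sketch returns 0, because the decoded
value 0x2f is below the two byte minimum of 0x80.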