.\" $OpenBSD: utf8.7,v 1.4 2017/05/31 12:46:30 jmc Exp $ .\" .\" Copyright (c) 2017 Ted Unangst .\" All rights reserved. .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" .\" THIS SOFTWARE IS PROVIDED BY THE DEVELOPERS ``AS IS'' AND ANY EXPRESS OR .\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES .\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. .\" IN NO EVENT SHALL THE DEVELOPERS BE LIABLE FOR ANY DIRECT, INDIRECT, .\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT .\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, .\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY .\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT .\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF .\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. .\" .Dd $Mdocdate: May 31 2017 $ .Dt UTF8 7 .Os .Sh NAME .Nm utf8 .Nd UTF-8 text encoding .Sh DESCRIPTION UTF-8 is a multibyte encoding for Unicode text. It is the preferred format for non ASCII text. .Pp The length of a UTF-8 sequence varies depending on the encoded value. If the high bit of the first byte is zero, the sequence length is one and the value is the remaining seven bits. If the high bit is set, then the number of high bits set, followed by a zero bit, indicates the length of the sequence and the value is formed by combining the low bits of each byte. 
Continuation bytes all have the same format:
the top bit set, the next bit clear, and six value bits.
.Pp
Unicode ranges and their encoding formats:
.Bl -tag -width Ds
.It 0x0 - 0x7f
One byte.
0.......
.It 0x80 - 0x7ff
Two bytes.
110..... 10......
.It 0x800 - 0xffff
Three bytes.
1110.... 10...... 10......
.It 0x10000 - 0x10ffff
Four bytes.
11110... 10...... 10...... 10......
.El
.Sh SEE ALSO
.Xr ascii 7
.Sh STANDARDS
.Rs
.%A F. Yergeau
.%D November 2003
.%R RFC 3629
.%T UTF-8, a transformation format of ISO 10646
.Re
.Pp
The Unicode Standard.
.Sh CAVEATS
Beware of overlong encodings, in which a value is encoded using more
bytes than necessary, such as the two bytes 0xc0 0x80 for the value 0x0.
RFC 3629 forbids them, and a decoder that accepts them may let
unexpected values slip past checks made on the undecoded bytes.
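.Sh EXAMPLES
The following is a minimal sketch in C of the bit layout described in
DESCRIPTION, not a reference implementation.
The function names
.Fn utf8_encode
and
.Fn utf8_decode
are hypothetical and are not part of any system library.
The sketch assumes that the caller supplies a buffer of at least four bytes
to the encoder, and it treats surrogates (0xd800 - 0xdfff) and values above
0x10ffff as invalid, as RFC 3629 requires.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode code point cp into buf (at least 4 bytes).
 * Returns the sequence length, or 0 if cp cannot be encoded.
 */
static size_t
utf8_encode(uint32_t cp, unsigned char *buf)
{
	if (cp <= 0x7f) {
		buf[0] = cp;				/* 0....... */
		return 1;
	}
	if (cp <= 0x7ff) {
		buf[0] = 0xc0 | (cp >> 6);		/* 110..... */
		buf[1] = 0x80 | (cp & 0x3f);		/* 10...... */
		return 2;
	}
	if (cp <= 0xffff) {
		if (cp >= 0xd800 && cp <= 0xdfff)
			return 0;			/* surrogate */
		buf[0] = 0xe0 | (cp >> 12);		/* 1110.... */
		buf[1] = 0x80 | ((cp >> 6) & 0x3f);
		buf[2] = 0x80 | (cp & 0x3f);
		return 3;
	}
	if (cp <= 0x10ffff) {
		buf[0] = 0xf0 | (cp >> 18);		/* 11110... */
		buf[1] = 0x80 | ((cp >> 12) & 0x3f);
		buf[2] = 0x80 | ((cp >> 6) & 0x3f);
		buf[3] = 0x80 | (cp & 0x3f);
		return 4;
	}
	return 0;					/* out of range */
}

/*
 * Hypothetical helper: decode one sequence from s (len bytes available).
 * Stores the value in *cp and returns the sequence length,
 * or returns 0 on truncated, malformed, or overlong input.
 */
static size_t
utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
	/* Smallest value that legitimately needs n bytes, indexed by n. */
	static const uint32_t min[] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t i, n;
	uint32_t v;

	if (len == 0)
		return 0;
	if (s[0] < 0x80) {
		n = 1;
		v = s[0];
	} else if ((s[0] & 0xe0) == 0xc0) {
		n = 2;
		v = s[0] & 0x1f;
	} else if ((s[0] & 0xf0) == 0xe0) {
		n = 3;
		v = s[0] & 0x0f;
	} else if ((s[0] & 0xf8) == 0xf0) {
		n = 4;
		v = s[0] & 0x07;
	} else
		return 0;	/* stray continuation or invalid first byte */
	if (len < n)
		return 0;
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xc0) != 0x80)
			return 0;	/* not a continuation byte */
		v = (v << 6) | (s[i] & 0x3f);
	}
	if (v < min[n] || v > 0x10ffff || (v >= 0xd800 && v <= 0xdfff))
		return 0;	/* overlong, out of range, or surrogate */
	*cp = v;
	return n;
}
```
.Ed
.Pp
The comparison against the minimum value for each length is what rejects
overlong encodings: the two bytes 0xc0 0x80 decode to 0x0, which is below
the two byte minimum of 0x80, so the decoder reports an error.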