.\"	$NetBSD: mbrtoc8.3,v 1.7 2024/08/23 12:59:49 riastradh Exp $
.\"
.\" Copyright (c) 2024 The NetBSD Foundation, Inc.
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
.\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
.\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
.\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
.\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
.\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
.\" POSSIBILITY OF SUCH DAMAGE.
.\"
.Dd August 15, 2024
.Dt MBRTOC8 3
.Os
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh NAME
.Nm mbrtoc8
.Nd Restartable multibyte to UTF-8 conversion
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh LIBRARY
.Lb libc
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh SYNOPSIS
.
.In uchar.h
.
.Ft size_t
.Fo mbrtoc8
.Fa "char8_t * restrict pc8"
.Fa "const char * restrict s"
.Fa "size_t n"
.Fa "mbstate_t * restrict ps"
.Fc
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh DESCRIPTION
The
.Nm
function decodes multibyte characters in the current locale and
converts them to UTF-8, keeping state so it can restart after
incremental progress.
.Pp
Each call to
.Nm :
.Bl -enum -compact
.It
examines up to
.Fa n
bytes starting at
.Fa s ,
.It
yields a UTF-8 code unit if available by storing it at
.Li * Ns Fa pc8 ,
.It
saves state at
.Fa ps ,
and
.It
returns either the number of bytes consumed if any or a special return
value.
.El
.Pp
Specifically:
.Bl -bullet
.It
If the multibyte sequence at
.Fa s
is invalid after any previous input saved at
.Fa ps ,
or if an error occurs in decoding,
.Nm
returns
.Li (size_t)-1
and sets
.Xr errno 2
to indicate the error.
.It
If the multibyte sequence at
.Fa s
is still incomplete after
.Fa n
bytes, including any previous input saved in
.Fa ps ,
.Nm
saves its state in
.Fa ps
after all the input so far and returns
.Li "(size_t)-2" .
.Sy All
.Fa n
bytes of input are consumed in this case.
.It
If
.Nm
had previously decoded a multibyte character but has not yet yielded
all the code units of its UTF-8 encoding, it stores the next UTF-8 code
unit at
.Li * Ns Fa pc8
and returns
.Li "(size_t)-3" .
.Sy \&No
input is consumed in this case.
.It
If
.Nm
decodes the null multibyte character, then it stores zero at
.Li * Ns Fa pc8
and returns zero.
.It
Otherwise,
.Nm
decodes a single multibyte character, stores the first (and possibly
only) code unit in its UTF-8 encoding at
.Li * Ns Fa pc8 ,
and returns the number of bytes consumed to decode the first multibyte
character.
.El
.Pp
If
.Fa pc8
is a null pointer, nothing is stored, but the effects on
.Fa ps
and the return value are unchanged.
.Pp
If
.Fa s
is a null pointer, the
.Nm
call is equivalent to:
.Bd -ragged -offset indent
.Fo mbrtoc8
.Li NULL ,
.Li \*q\*q ,
.Li 1 ,
.Fa ps
.Fc
.Ed
.Pp
This always returns zero, and has the effect of resetting
.Fa ps
to the initial conversion state, without writing to
.Fa pc8 ,
even if it is nonnull.
.Pp
If
.Fa ps
is a null pointer,
.Nm
uses an internal
.Vt mbstate_t
object with static storage duration, distinct from all other
.Vt mbstate_t
objects
.Po
including those used by
.Xr mbrtoc16 3 ,
.Xr mbrtoc32 3 ,
.Xr c8rtomb 3 ,
.Xr c16rtomb 3 ,
and
.Xr c32rtomb 3
.Pc ,
which is initialized at program startup to the initial conversion
state.
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh IMPLEMENTATION NOTES
On well-formed input, the
.Nm
function yields either a Unicode scalar value in US-ASCII range, i.e.,
a 7-bit Unicode code point, or, over two to four successive calls, the
leading and trailing code units in order of the UTF-8 encoding of a
Unicode scalar value outside the US-ASCII range.
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh RETURN VALUES
The
.Nm
function returns:
.Bl -tag -width Li
.It Li 0
.Bq null
if
.Nm
decoded a null multibyte character.
.It Ar i
.Bq code unit
where
.Li 1
\*(Le
.Ar i
\*(Le
.Fa n ,
if
.Nm
consumed
.Ar i
bytes of input to decode the next multibyte character, yielding a
UTF-8 code unit.
.It Li (size_t)-3
.Bq continuation
if
.Nm
consumed no new bytes of input but yielded a UTF-8 code unit that was
pending from previous input.
.It Li (size_t)-2
.Bq incomplete
if
.Nm
found only an incomplete multibyte sequence after all
.Fa n
bytes of input and any previous input, and saved its state to restart
in the next call with
.Fa ps .
.It Li (size_t)-1
.Bq error
if any encoding error was detected;
.Xr errno 2
is set to reflect the error.
.El
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh EXAMPLES
Print the UTF-8 code units of a multibyte string in hexadecimal text:
.Bd -literal -offset indent
char *s = ...;
size_t n = ...;
mbstate_t mbs = {0};    /* initial conversion state */

while (n) {
	char8_t c8;
	size_t len;

	len = mbrtoc8(&c8, s, n, &mbs);
	switch (len) {
	case 0:			/* NUL terminator */
		assert(c8 == 0);
		goto out;
	default:		/* consumed input and yielded a byte c8 */
		printf("0x%02hhx\en", c8);
		break;
	case (size_t)-3:	/* yielded a pending byte c8 */
		printf("continue 0x%02hhx\en", c8);
		continue;	/* no input consumed: don't advance s or n */
	case (size_t)-2:	/* incomplete */
		printf("incomplete\en");
		goto readmore;
	case (size_t)-1:	/* error */
		printf("error: %d\en", errno);
		goto out;
	}
	s += len;
	n -= len;
}
.Ed
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh ERRORS
.Bl -tag -width Bq
.It Bq Er EILSEQ
The multibyte sequence cannot be decoded in the current locale as a
Unicode scalar value.
.It Bq Er EIO
An error occurred in loading the locale's character conversions.
.El
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh SEE ALSO
.Xr c8rtomb 3 ,
.Xr c16rtomb 3 ,
.Xr c32rtomb 3 ,
.Xr mbrtoc16 3 ,
.Xr mbrtoc32 3 ,
.Xr uchar 3
.Rs
.%B The Unicode Standard
.%O Version 15.0 \(em Core Specification
.%Q The Unicode Consortium
.%D September 2022
.%U https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf
.Re
.Rs
.%A F. Yergeau
.%T UTF-8, a transformation format of ISO 10646
.%R RFC 3629
.%D November 2003
.%I Internet Engineering Task Force
.%U https://datatracker.ietf.org/doc/html/rfc3629
.Re
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\" .Sh STANDARDS
.\" The
.\" .Nm
.\" function conforms to
.\" .St -isoC-2023 .
.\" .\" XXX PR misc/58600: man pages lack C17, C23, C++98, C++03, C++11, C++17, C++20, C++23 citation syntax
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh HISTORY
The
.Nm
function first appeared in
.Nx 11.0 .