.\"	$NetBSD: mbrtoc8.3,v 1.7 2024/08/23 12:59:49 riastradh Exp $
.\"
.\" Copyright (c) 2024 The NetBSD Foundation, Inc.
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
.\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
.\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
.\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
.\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
.\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
.\" POSSIBILITY OF SUCH DAMAGE.
.\"
.Dd August 15, 2024
.Dt MBRTOC8 3
.Os
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh NAME
.Nm mbrtoc8
.Nd Restartable multibyte to UTF-8 conversion
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh LIBRARY
.Lb libc
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh SYNOPSIS
.
.In uchar.h
.
.Ft size_t
.Fo mbrtoc8
.Fa "char8_t * restrict pc8"
.Fa "const char * restrict s"
.Fa "size_t n"
.Fa "mbstate_t * restrict ps"
.Fc
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh DESCRIPTION
The
.Nm
function decodes multibyte characters in the current locale and
converts them to UTF-8, keeping state so it can restart after
incremental progress.
.Pp
Each call to
.Nm :
.Bl -enum -compact
.It
examines up to
.Fa n
bytes starting at
.Fa s ,
.It
yields a UTF-8 code unit if available by storing it at
.Li * Ns Fa pc8 ,
.It
saves state at
.Fa ps ,
and
.It
returns either the number of bytes consumed if any or a special return
value.
.El
.Pp
Specifically:
.Bl -bullet
.It
If the multibyte sequence at
.Fa s
is invalid after any previous input saved at
.Fa ps ,
or if an error occurs in decoding,
.Nm
returns
.Li (size_t)-1
and sets
.Xr errno 2
to indicate the error.
.It
If the multibyte sequence at
.Fa s
is still incomplete after
.Fa n
bytes, including any previous input saved in
.Fa ps ,
.Nm
saves its state in
.Fa ps
after all the input so far and returns
.Li "(size_t)-2" .
.Sy All
.Fa n
bytes of input are consumed in this case.
.It
If
.Nm
had previously decoded a multibyte character but has not yet yielded
all the code units of its UTF-8 encoding, it stores the next UTF-8 code
unit at
.Li * Ns Fa pc8
and returns
.Li "(size_t)-3" .
.Sy \&No
input is consumed in this case.
.It
If
.Nm
decodes the null multibyte character, then it stores zero at
.Li * Ns Fa pc8
and returns zero.
.It
Otherwise,
.Nm
decodes a single multibyte character, stores the first (and possibly
only) code unit in its UTF-8 encoding at
.Li * Ns Fa pc8 ,
and returns the number of bytes consumed to decode the first multibyte
character.
.El
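.Pp
The cases above determine how far a caller should advance its input
pointer after each call.
The following is a minimal sketch of that bookkeeping, not part of any
library; the helper name
.Fn mb_advance
and its interface are hypothetical and chosen only for illustration.
.Bd -literal -offset indent
```c
#include <stddef.h>

/*
 * Hypothetical helper, not a libc interface.
 * Given the value len returned by mbrtoc8 for an n-byte buffer,
 * store in *skip the number of input bytes that were consumed.
 * Returns 0 if the caller may keep going, or -1 if it should stop
 * (null multibyte character decoded, or error).
 */
static int
mb_advance(size_t len, size_t n, size_t *skip)
{
        switch (len) {
        case 0:                 /* null multibyte character */
        case (size_t)-1:        /* error, errno has been set */
                return -1;
        case (size_t)-2:        /* incomplete: all n bytes consumed */
                *skip = n;
                return 0;
        case (size_t)-3:        /* pending code unit: nothing consumed */
                *skip = 0;
                return 0;
        default:                /* len bytes consumed */
                *skip = len;
                return 0;
        }
}
```
.Ed
.Pp
In particular, the special values
.Li (size_t)-2
and
.Li (size_t)-3
must never be added to the input pointer directly.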
.Pp
If
.Fa pc8
is a null pointer, nothing is stored, but the effects on
.Fa ps
and the return value are unchanged.
.Pp
If
.Fa s
is a null pointer, the
.Nm
call is equivalent to:
.Bd -ragged -offset indent
.Fo mbrtoc8
.Li NULL ,
.Li \*q\*q ,
.Li 1 ,
.Fa ps
.Fc
.Ed
.Pp
This always returns zero, and has the effect of resetting
.Fa ps
to the initial conversion state, without writing to
.Fa pc8 ,
even if it is nonnull.
.Pp
If
.Fa ps
is a null pointer,
.Nm
uses an internal
.Vt mbstate_t
object with static storage duration, distinct from all other
.Vt mbstate_t
objects
.Po
including those used by
.Xr mbrtoc16 3 ,
.Xr mbrtoc32 3 ,
.Xr c8rtomb 3 ,
.Xr c16rtomb 3 ,
and
.Xr c32rtomb 3
.Pc ,
which is initialized at program startup to the initial conversion
state.
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh IMPLEMENTATION NOTES
On well-formed input, the
.Nm
function yields either a Unicode scalar value in the US-ASCII range,
i.e., a 7-bit Unicode code point, or, over two to four successive
calls, the leading and trailing code units in order of the UTF-8
encoding of a Unicode scalar value outside the US-ASCII range.
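.Pp
As a rough illustration of these call counts, the following sketch
computes how many UTF-8 code units, and hence how many successive
.Nm
calls, a given Unicode scalar value requires.
The helper name
.Fn utf8_units
is hypothetical and is not a libc interface; the ranges are those of
the UTF-8 encoding form.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a libc interface.
 * Number of UTF-8 code units in the encoding of cp, or 0 if cp is
 * not a Unicode scalar value (a surrogate, or above U+10FFFF).
 */
static size_t
utf8_units(uint32_t cp)
{
        if (cp < 0x80)
                return 1;       /* US-ASCII: a single call */
        if (cp < 0x800)
                return 2;
        if (cp >= 0xd800 && cp <= 0xdfff)
                return 0;       /* surrogates are not scalar values */
        if (cp < 0x10000)
                return 3;
        if (cp <= 0x10ffff)
                return 4;
        return 0;               /* beyond the Unicode code space */
}
```
.Ed
.Pp
For example, U+00E9 needs two code units, so decoding it takes one
call that consumes input followed by one call that returns
.Li (size_t)-3 .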
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh RETURN VALUES
The
.Nm
function returns:
.Bl -tag -width Li
.It Li 0
.Bq null
if
.Nm
decoded a null multibyte character.
.It Ar i
.Bq code unit
where
.Li 1
\*(Le
.Ar i
\*(Le
.Fa n ,
if
.Nm
consumed
.Ar i
bytes of input to decode the next multibyte character, yielding a
UTF-8 code unit.
.It Li (size_t)-3
.Bq continuation
if
.Nm
consumed no new bytes of input but yielded a UTF-8 code unit that was
pending from previous input.
.It Li (size_t)-2
.Bq incomplete
if
.Nm
found only an incomplete multibyte sequence after all
.Fa n
bytes of input and any previous input, and saved its state to restart
in the next call with
.Fa ps .
.It Li (size_t)-1
.Bq error
if any encoding error was detected;
.Xr errno 2
is set to reflect the error.
.El
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh EXAMPLES
Print the UTF-8 code units of a multibyte string in hexadecimal text:
.Bd -literal -offset indent
const char *s = ...;
size_t n = ...;
mbstate_t mbs = {0};    /* initial conversion state */

while (n) {
        char8_t c8;
        size_t len;

        len = mbrtoc8(&c8, s, n, &mbs);
        switch (len) {
        case 0:         /* NUL terminator */
                assert(c8 == 0);
                goto out;
        default:        /* consumed input and yielded a byte c8 */
                printf("0x%02hhx\en", c8);
                break;
        case (size_t)-3: /* yielded a pending byte c8 */
                printf("continue 0x%02hhx\en", c8);
                continue;       /* no input consumed: don't advance s */
        case (size_t)-2: /* incomplete */
                printf("incomplete\en");
                goto readmore;
        case (size_t)-1: /* error */
                printf("error: %d\en", errno);
                goto out;
        }
        s += len;       /* only reached with 1 <= len <= n */
        n -= len;
}
.Ed
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh ERRORS
.Bl -tag -width Bq
.It Bq Er EILSEQ
The multibyte sequence cannot be decoded in the current locale as a
Unicode scalar value.
.It Bq Er EIO
An error occurred in loading the locale's character conversions.
.El
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh SEE ALSO
.Xr c8rtomb 3 ,
.Xr c16rtomb 3 ,
.Xr c32rtomb 3 ,
.Xr mbrtoc16 3 ,
.Xr mbrtoc32 3 ,
.Xr uchar 3
.Rs
.%B The Unicode Standard
.%O Version 15.0 \(em Core Specification
.%Q The Unicode Consortium
.%D September 2022
.%U https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf
.Re
.Rs
.%A F. Yergeau
.%T UTF-8, a transformation format of ISO 10646
.%R RFC 3629
.%D November 2003
.%I Internet Engineering Task Force
.%U https://datatracker.ietf.org/doc/html/rfc3629
.Re
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\" .Sh STANDARDS
.\" The
.\" .Nm
.\" function conforms to
.\" .St -isoC-2023 .
.\" .\" XXX PR misc/58600: man pages lack C17, C23, C++98, C++03, C++11, C++17, C++20, C++23 citation syntax
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh HISTORY
The
.Nm
function first appeared in
.Nx 11.0 .