.\" $OpenBSD: mbrtoc16.3,v 1.1 2023/08/20 15:02:51 schwarze Exp $
.\"
.\" Copyright 2023 Ingo Schwarze <schwarze@openbsd.org>
.\" Copyright 2010 Stefan Sperling <stsp@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: August 20 2023 $
.Dt MBRTOC16 3
.Os
.Sh NAME
.Nm mbrtoc16
.Nd convert one UTF-8 encoded character to UTF-16
.Sh SYNOPSIS
.In uchar.h
.Ft size_t
.Fo mbrtoc16
.Fa "char16_t * restrict pc16"
.Fa "const char * restrict s"
.Fa "size_t n"
.Fa "mbstate_t * restrict mbs"
.Fc
.Sh DESCRIPTION
The
.Fn mbrtoc16
function examines at most
.Fa n
bytes of the multibyte character byte string pointed to by
.Fa s ,
converts those bytes to a wide character,
and encodes the wide character using UTF-16.
In some cases, it is necessary to call this function
twice to convert a single character.
.Pp
Conversion happens in accordance with the conversion state
.Pf * Fa mbs ,
which must be initialized to zero before the application's first call to
.Fn mbrtoc16 .
For this function,
.Pf * Fa mbs
stores information about both the state of the UTF-8 input encoding
and the state of the UTF-16 output encoding.
If the previous call did not return
.Po Vt size_t Pc Ns \-1 ,
.Fa mbs
can safely be reused without reinitialization.
.Pp
The input encoding that
.Fn mbrtoc16
uses for
.Fa s
is determined by the
.Dv LC_CTYPE
category of the current locale.
If the locale is changed without reinitialization of
.Pf * Fa mbs ,
the behaviour is undefined.
.Pp
Unlike
.Xr mbtowc 3 ,
.Fn mbrtoc16
accepts an incomplete byte sequence pointed to by
.Fa s
which does not form a complete character but is potentially part of
a valid character.
In this case, the function consumes all such bytes.
The conversion state saved in
.Pf * Fa mbs
will be used to restart the suspended conversion during the next call.
.Pp
On systems other than
.Ox
that support state-dependent encodings,
.Fa s
may point to a special sequence of bytes called a
.Dq shift sequence ;
see
.Xr mbrtowc 3
for details.
.Pp
The following arguments cause special processing:
.Bl -tag -width 012345678901
.It Fa pc16 No == Dv NULL
The conversion from a multibyte character to a wide character is performed
and the conversion state may be affected, but the resulting wide character
is discarded.
.It Fa s No == Dv NULL
The arguments
.Fa pc16
and
.Fa n
are ignored and starting or continuing the conversion with an empty string
is attempted, discarding the conversion result.
.It Fa mbs No == Dv NULL
An internal
.Vt mbstate_t
object specific to the
.Fn mbrtoc16
function is used instead of the
.Fa mbs
argument.
This internal object is automatically initialized at program startup
and never changed by any
.Em libc
function except
.Fn mbrtoc16 .
.Pp
If
.Fn mbrtoc16
is called with a
.Dv NULL
.Fa mbs
argument and that call returns
.Po Vt size_t Pc Ns \-1 ,
the internal conversion state of
.Fn mbrtoc16
becomes permanently undefined and there is no way
to reset it to any defined state.
Consequently, after such a mishap, it is not safe to call
.Fn mbrtoc16
with a
.Dv NULL
.Fa mbs
argument ever again until the program is terminated.
.El
.Sh RETURN VALUES
.Bl -tag -width 012345678901
.It 0
The bytes pointed to by
.Fa s
form a terminating NUL character.
If
.Fa pc16
is not
.Dv NULL ,
a NUL wide character has been stored in
.Pf * Fa pc16 .
.It positive
.Fa s
points to a valid character, and the value returned is the number of
bytes completing the character.
If
.Fa pc16
is not
.Dv NULL ,
the first UTF-16 code unit of the corresponding wide character
has been stored in
.Pf * Fa pc16 .
If it is a UTF-16 high surrogate, the function needs to be called
again to retrieve a second UTF-16 code unit, the low surrogate.
On
.Ox ,
this happens if and only if the return value is 4,
but this equivalence does not hold on other operating systems
that support input encodings other than UTF-8.
.It Po Vt size_t Pc Ns \-1
.Fa s
points to an illegal byte sequence which does not form a valid multibyte
character in the current locale, or
.Fa mbs
points to an invalid or uninitialized object.
.Va errno
is set to
.Er EILSEQ
or
.Er EINVAL ,
respectively.
The conversion state object pointed to by
.Fa mbs
is left in an undefined state and must be reinitialized before being
used again.
.It Po Vt size_t Pc Ns \-2
.Fa s
points to an incomplete byte sequence of length
.Fa n
which has been consumed and contains part of a valid multibyte character.
The character may be completed by calling the same function again with
.Fa s
pointing to one or more subsequent bytes of the multibyte character and
.Fa mbs
pointing to the conversion state object used during conversion of the
incomplete byte sequence.
.It Po Vt size_t Pc Ns \-3
The second 16-bit code unit resulting from a previous call
has been stored into
.Pf * Fa pc16 ,
without consuming any additional bytes from
.Fa s .
.El
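.Sh EXAMPLES
The following is a minimal sketch rather than a definitive implementation.
It converts a NUL-terminated multibyte string to UTF-16 by calling
.Fn mbrtoc16
until it returns 0, and it does not advance the input pointer when the
return value is
.Po Vt size_t Pc Ns \-3 .
The function name
.Fn utf8_to_utf16
and its interface are hypothetical and chosen for illustration only.
The caller is assumed to have selected a UTF-8 locale with
.Xr setlocale 3
beforehand.
.Bd -literal -offset indent
#include <string.h>
#include <uchar.h>

/*
 * Hypothetical helper: convert the NUL-terminated string s to UTF-16.
 * Store at most outlen code units, including the terminating NUL,
 * in out.  Return the number of code units stored before the NUL,
 * or (size_t)-1 on invalid input or insufficient space.
 */
size_t
utf8_to_utf16(char16_t *out, size_t outlen, const char *s)
{
	mbstate_t	 mbs;
	char16_t	 c16;
	size_t		 len, nout, rc;

	memset(&mbs, 0, sizeof(mbs));
	len = strlen(s) + 1;	/* include the terminating NUL */
	nout = 0;
	for (;;) {
		rc = mbrtoc16(&c16, s, len, &mbs);
		if (rc == (size_t)-1 || rc == (size_t)-2)
			return (size_t)-1;	/* invalid or truncated */
		if (nout >= outlen)
			return (size_t)-1;	/* out is too small */
		out[nout] = c16;
		if (rc == 0)
			return nout;		/* NUL stored */
		nout++;
		if (rc != (size_t)-3) {		/* rc bytes were consumed */
			s += rc;
			len -= rc;
		}
	}
}
.Ed
.Pp
Because the terminating NUL byte is part of the input,
.Fa n
is never zero when the second code unit of a surrogate pair is retrieved,
and the loop does not depend on the return value being 4.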
.Sh ERRORS
.Fn mbrtoc16
causes an error in the following cases:
.Bl -tag -width Er
.It Bq Er EILSEQ
.Fa s
points to an invalid multibyte character.
.It Bq Er EINVAL
.Fa mbs
points to an invalid or uninitialized
.Vt mbstate_t
object.
.El
.Sh SEE ALSO
.Xr c16rtomb 3 ,
.Xr mbrtowc 3 ,
.Xr setlocale 3
.Sh STANDARDS
.Fn mbrtoc16
conforms to
.St -isoC-2011 .
.Sh HISTORY
.Fn mbrtoc16
has been available since
.Ox 7.4 .
.Sh CAVEATS
On operating systems other than
.Ox
that support input encodings other than UTF-8, inspecting the return value
is insufficient to tell whether the function needs to be called again.
If the return value is positive, inspecting
.Pf * Fa pc16
is also required to make that decision.
Consequently, passing a
.Dv NULL
pointer for the
.Fa pc16
argument is discouraged because it can result
in a well-defined but unknown output encoding state.
The simplest way to recover from such an unknown state is to
reinitialize the object pointed to by
.Fa mbs .
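.Pp
As a sketch of that additional check, a portable caller could test the
stored code unit against the UTF-16 high surrogate range,
0xD800 to 0xDBFF, after a positive return value.
The function name
.Fn is_high_surrogate
is hypothetical and not provided by any header:
.Bd -literal -offset indent
#include <uchar.h>

/* Hypothetical helper: non-zero if c is a UTF-16 high surrogate. */
int
is_high_surrogate(char16_t c)
{
	return c >= 0xD800 && c <= 0xDBFF;
}
.Ed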
.Pp
The C11 standard only requires the code units stored in
.Pf * Fa pc16
to be encoded according to UTF-16
if the predefined environment macro
.Dv __STDC_UTF_16__
is defined with a value of 1.
On
.Ox ,
.In uchar.h
provides this definition.
Other operating systems which do not define
.Dv __STDC_UTF_16__
could theoretically use a different,
implementation-defined output encoding for
.Pf * Fa pc16
instead of UTF-16.
Writing portable code for an arbitrary output encoding is impossible
because the rules for when and how often the function needs to be called
again depend on the output encoding; the rules explained above are
specific to UTF-16.
Using UTF-16 as the output encoding of
.Fn mbrtoc16
becomes mandatory in C23.