.\" $OpenBSD: mbrtoc16.3,v 1.1 2023/08/20 15:02:51 schwarze Exp $
.\"
.\" Copyright 2023 Ingo Schwarze <schwarze@openbsd.org>
.\" Copyright 2010 Stefan Sperling <stsp@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: August 20 2023 $
.Dt MBRTOC16 3
.Os
.Sh NAME
.Nm mbrtoc16
.Nd convert one UTF-8 encoded character to UTF-16
.Sh SYNOPSIS
.In uchar.h
.Ft size_t
.Fo mbrtoc16
.Fa "char16_t * restrict pc16"
.Fa "const char * restrict s"
.Fa "size_t n"
.Fa "mbstate_t * restrict mbs"
.Fc
.Sh DESCRIPTION
The
.Fn mbrtoc16
function examines at most
.Fa n
bytes of the multibyte character string pointed to by
.Fa s ,
converts those bytes to a wide character,
and encodes the wide character using UTF-16.
If the character lies outside the Basic Multilingual Plane,
its UTF-16 encoding is a surrogate pair,
and the function must be called twice to retrieve both code units.
.Pp
Conversion happens in accordance with the conversion state
.Pf * Fa mbs ,
which must be initialized to zero before the application's first call to
.Fn mbrtoc16 .
For this function,
.Pf * Fa mbs
stores information about both the state of the UTF-8 input encoding
and the state of the UTF-16 output encoding.
If the previous call did not return
.Po Vt size_t Pc Ns \-1 ,
.Fa mbs
can safely be reused without reinitialization.
.Pp
The input encoding that
.Fn mbrtoc16
uses for
.Fa s
is determined by the
.Dv LC_CTYPE
category of the current locale.
If the locale is changed without reinitialization of
.Pf * Fa mbs ,
the behaviour is undefined.
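.Pp
As a minimal sketch of these two requirements, the following fragment
selects a UTF-8
.Dv LC_CTYPE
and zeroes the conversion state before the first call.
The helper names
.Fn select_utf8_ctype
and
.Fn first_code_unit
are hypothetical, and the locale names tried are assumptions
about what the system provides.
.Bd -literal -offset indent
```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/*
 * Hypothetical helper: select a UTF-8 LC_CTYPE.  The locale names
 * are assumptions; which names exist depends on the system.
 * Returns nonzero on success.
 */
static int
select_utf8_ctype(void)
{
	return setlocale(LC_CTYPE, "C.UTF-8") != NULL ||
	    setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}

/*
 * Hypothetical helper: convert the first character of s, starting
 * from a freshly zeroed conversion state.
 */
static size_t
first_code_unit(const char *s, size_t n, char16_t *out)
{
	mbstate_t mbs;

	memset(&mbs, 0, sizeof(mbs));	/* required before first use */
	return mbrtoc16(out, s, n, &mbs);
}
```
.Ed
.Pp
Because the state object is created after the locale has been selected,
the two cannot get out of step in this sketch.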
.Pp
Unlike
.Xr mbtowc 3 ,
.Fn mbrtoc16
accepts an incomplete byte sequence pointed to by
.Fa s
which does not form a complete character but is potentially part of
a valid character.
In this case, the function consumes all such bytes.
The conversion state saved in
.Pf * Fa mbs
will be used to restart the suspended conversion during the next call.
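.Pp
The restart behaviour can be illustrated with a small sketch that
feeds the two bytes of U+00E9 in separate calls.
The function name
.Fn split_input_demo
is hypothetical, and the locale name
.Qq C.UTF-8
is an assumption that may not hold on every system.
.Bd -literal -offset indent
```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/*
 * Hypothetical demo: feed the two bytes of U+00E9 (0xC3 0xA9) to
 * mbrtoc16() in separate calls.  Returns the decoded code unit,
 * or 0 if the assumed locale is unavailable or a step does not
 * behave as described.
 */
static char16_t
split_input_demo(void)
{
	const unsigned char bytes[] = { 0xC3, 0xA9 };
	char16_t c16 = 0;
	mbstate_t mbs;

	if (setlocale(LC_CTYPE, "C.UTF-8") == NULL)
		return 0;
	memset(&mbs, 0, sizeof(mbs));

	/* First byte alone: consumed, but the character is incomplete. */
	if (mbrtoc16(&c16, (const char *)&bytes[0], 1, &mbs) !=
	    (size_t)-2)
		return 0;

	/* Same state object, next byte: it completes the character. */
	if (mbrtoc16(&c16, (const char *)&bytes[1], 1, &mbs) != 1)
		return 0;

	return c16;
}
```
.Ed
.Pp
The second call only succeeds because it passes the same
.Vt mbstate_t
object that recorded the first byte.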
.Pp
On systems other than
.Ox
that support state-dependent encodings,
.Fa s
may point to a special sequence of bytes called a
.Dq shift sequence ;
see
.Xr mbrtowc 3
for details.
.Pp
The following arguments cause special processing:
.Bl -tag -width 012345678901
.It Fa pc16 No == Dv NULL
The conversion from a multibyte character to a wide character is performed
and the conversion state may be affected, but the resulting wide character
is discarded.
.It Fa s No == Dv NULL
The arguments
.Fa pc16
and
.Fa n
are ignored, and the function attempts to start or continue the
conversion with an empty string, discarding the conversion result.
.It Fa mbs No == Dv NULL
An internal
.Vt mbstate_t
object specific to the
.Fn mbrtoc16
function is used instead of the
.Fa mbs
argument.
This internal object is automatically initialized at program startup
and never changed by any
.Em libc
function except
.Fn mbrtoc16 .
.Pp
If
.Fn mbrtoc16
is called with a
.Dv NULL
.Fa mbs
argument and that call returns
.Po Vt size_t Pc Ns \-1 ,
the internal conversion state of
.Fn mbrtoc16
becomes permanently undefined and there is no way
to reset it to any defined state.
Consequently, after such a mishap, it is not safe to call
.Fn mbrtoc16
with a
.Dv NULL
.Fa mbs
argument ever again until the program is terminated.
.El
.Sh RETURN VALUES
.Bl -tag -width 012345678901
.It 0
The bytes pointed to by
.Fa s
form a terminating NUL character.
If
.Fa pc16
is not
.Dv NULL ,
a NUL wide character has been stored in
.Pf * Fa pc16 .
.It positive
.Fa s
points to a valid character, and the value returned is the number of
bytes completing the character.
If
.Fa pc16
is not
.Dv NULL ,
the first UTF-16 code unit of the corresponding wide character
has been stored in
.Pf * Fa pc16 .
If it is a UTF-16 high surrogate, the function needs to be called
again to retrieve a second UTF-16 code unit, the low surrogate.
On
.Ox ,
this happens if and only if the return value is 4,
but this equivalence does not hold on other operating systems
that support input encodings other than UTF-8.
.It Po Vt size_t Pc Ns \-1
.Fa s
points to an illegal byte sequence which does not form a valid multibyte
character in the current locale, or
.Fa mbs
points to an invalid or uninitialized object.
.Va errno
is set to
.Er EILSEQ
or
.Er EINVAL ,
respectively.
The conversion state object pointed to by
.Fa mbs
is left in an undefined state and must be reinitialized before being
used again.
.It Po Vt size_t Pc Ns \-2
.Fa s
points to an incomplete byte sequence of length
.Fa n
which has been consumed and contains part of a valid multibyte character.
The character may be completed by calling the same function again with
.Fa s
pointing to one or more subsequent bytes of the multibyte character and
.Fa mbs
pointing to the conversion state object used during conversion of the
incomplete byte sequence.
.It Po Vt size_t Pc Ns \-3
The second 16-bit code unit resulting from a previous call
has been stored into
.Pf * Fa pc16 ,
without consuming any additional bytes from
.Fa s .
.El
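.Pp
The return values listed above can be handled in a single loop.
The following is a sketch under stated assumptions, not code from the
.Ox
sources: the names
.Fn mb_to_utf16
and
.Fn emoji_demo
are hypothetical, the locale name
.Qq C.UTF-8
is an assumption, and treating a truncated character or a full output
buffer as failure is a policy chosen for illustration.
.Bd -literal -offset indent
```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert the n bytes at src to UTF-16 code
 * units in dst (capacity cap).  Returns the number of code units
 * stored, or (size_t)-1 on invalid input, truncated input, or
 * insufficient space.  Conversion stops at a NUL character,
 * which is not stored.
 */
static size_t
mb_to_utf16(const char *src, size_t n, char16_t *dst, size_t cap)
{
	mbstate_t mbs;
	size_t rc, used = 0;
	char16_t c16;

	memset(&mbs, 0, sizeof(mbs));
	for (;;) {
		/*
		 * Call even when n is 0: a pending low surrogate is
		 * delivered with (size_t)-3 without reading input.
		 */
		rc = mbrtoc16(&c16, src, n, &mbs);
		if (rc == 0)
			return used;		/* NUL character: stop */
		if (rc == (size_t)-1)
			return (size_t)-1;	/* EILSEQ or EINVAL */
		if (rc == (size_t)-2)		/* nothing complete left */
			return n == 0 ? used : (size_t)-1;
		if (used == cap)
			return (size_t)-1;	/* dst is full */
		dst[used++] = c16;
		if (rc != (size_t)-3) {		/* positive: bytes consumed */
			src += rc;
			n -= rc;
		}
	}
}

/*
 * Hypothetical usage: U+1F600 is F0 9F 98 80 in UTF-8 and needs
 * the surrogate pair D83D DE00.  Returns the number of code units,
 * or (size_t)-1 if the assumed locale name is unavailable.
 */
static size_t
emoji_demo(char16_t out[2])
{
	const unsigned char in[] = { 0xF0, 0x9F, 0x98, 0x80 };

	if (setlocale(LC_CTYPE, "C.UTF-8") == NULL)
		return (size_t)-1;
	return mb_to_utf16((const char *)in, sizeof(in), out, 2);
}
```
.Ed
.Pp
The input pointer only advances for positive return values; after
.Po Vt size_t Pc Ns \-3
the same position is passed again.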
.Sh ERRORS
.Fn mbrtoc16
causes an error in the following cases:
.Bl -tag -width Er
.It Bq Er EILSEQ
.Fa s
points to an invalid multibyte character.
.It Bq Er EINVAL
.Fa mbs
points to an invalid or uninitialized
.Vt mbstate_t
object.
.El
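.Pp
One possible way to recover from
.Er EILSEQ
is sketched below.
Replacing each offending byte with U+FFFD is a policy assumed for this
example and is not something
.Fn mbrtoc16
does by itself; the names
.Fn convert_lossy
and
.Fn lossy_demo
are hypothetical, and the locale name
.Qq C.UTF-8
is an assumption.
The essential step is that the state object is reinitialized after
every failed call.
.Bd -literal -offset indent
```c
#include <errno.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert src to UTF-16, but on EILSEQ store
 * U+FFFD, reinitialize the state, and skip one input byte.
 * Returns the number of code units stored; *nbad counts the bytes
 * that were replaced.  If dst fills up, output is simply cut off,
 * possibly after a high surrogate.
 */
static size_t
convert_lossy(const char *src, size_t n, char16_t *dst, size_t cap,
    size_t *nbad)
{
	mbstate_t mbs;
	size_t rc, used = 0;
	char16_t c16;

	memset(&mbs, 0, sizeof(mbs));
	*nbad = 0;
	while (used < cap) {
		errno = 0;
		rc = mbrtoc16(&c16, src, n, &mbs);
		if (rc == 0 || rc == (size_t)-2)
			break;	/* NUL, or no complete character left */
		if (rc == (size_t)-1) {
			if (errno != EILSEQ)
				break;	/* EINVAL: do not retry */
			/* The state is undefined now: start over. */
			memset(&mbs, 0, sizeof(mbs));
			c16 = 0xFFFD;
			rc = 1;		/* skip the offending byte */
			(*nbad)++;
		}
		dst[used++] = c16;
		if (rc != (size_t)-3) {
			src += rc;
			n -= rc;
		}
	}
	return used;
}

/*
 * Hypothetical usage: the byte 0xFF never occurs in UTF-8.
 * Returns (size_t)-1 if the assumed locale name is unavailable.
 */
static size_t
lossy_demo(char16_t out[3], size_t *nbad)
{
	const unsigned char in[] = { 0x61, 0xFF, 0x62 };

	if (setlocale(LC_CTYPE, "C.UTF-8") == NULL)
		return (size_t)-1;
	return convert_lossy((const char *)in, sizeof(in), out, 3, nbad);
}
```
.Ed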
.Sh SEE ALSO
.Xr c16rtomb 3 ,
.Xr mbrtowc 3 ,
.Xr setlocale 3
.Sh STANDARDS
.Fn mbrtoc16
conforms to
.St -isoC-2011 .
.Sh HISTORY
.Fn mbrtoc16
has been available since
.Ox 7.4 .
.Sh CAVEATS
On operating systems other than
.Ox
that support input encodings other than UTF-8, inspecting the return value
is insufficient to tell whether the function needs to be called again.
If the return value is positive, inspecting
.Pf * Fa pc16
is also required to make that decision.
Consequently, passing a
.Dv NULL
pointer for the
.Fa pc16
argument is discouraged because it can result
in a well-defined but unknown output encoding state.
The simplest way to recover from such an unknown state is to
reinitialize the object pointed to by
.Fa mbs .
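.Pp
A sketch of the portable decision, assuming UTF-16 output:
the stored code unit, rather than the return value, determines
whether a second call is made.
The names
.Fn is_high_surrogate
and
.Fn decode_one
are hypothetical, and the caller is assumed to pass a state object
with no pending output.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>

/* A UTF-16 high surrogate lies in the range 0xD800 to 0xDBFF. */
static int
is_high_surrogate(char16_t c)
{
	return c >= 0xD800 && c <= 0xDBFF;
}

/*
 * Hypothetical helper: decode one character from s into units[]
 * and report in *nunits how many code units (0, 1 or 2) were
 * stored.  Returns the mbrtoc16() result of the first call.
 */
static size_t
decode_one(const char *s, size_t n, mbstate_t *mbs, char16_t units[2],
    int *nunits)
{
	size_t rc;

	*nunits = 0;
	rc = mbrtoc16(&units[0], s, n, mbs);
	if (rc == (size_t)-1 || rc == (size_t)-2)
		return rc;
	*nunits = 1;
	/* Look at the code unit, not at rc, to decide on a second call. */
	if (is_high_surrogate(units[0]) &&
	    mbrtoc16(&units[1], s + rc, n - rc, mbs) == (size_t)-3)
		*nunits = 2;
	return rc;
}
```
.Ed
.Pp
Always passing a real
.Fa pc16
pointer, as this sketch does, keeps the output encoding state known
to the caller.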
.Pp
The C11 standard only requires the
.Fa pc16
argument to be encoded according to UTF-16
if the predefined environment macro
.Dv __STDC_UTF_16__
is defined with a value of 1.
On
.Ox ,
.In uchar.h
provides this definition.
Other operating systems which do not define
.Dv __STDC_UTF_16__
could theoretically use a different,
implementation-defined output encoding for
.Fa pc16
instead of UTF-16.
Writing portable code for an arbitrary output encoding is impossible
because the rules for when and how often the function needs to be called
again depend on the output encoding; the rules explained above are
specific to UTF-16.
Using UTF-16 as the output encoding of
.Fn mbrtoc16
becomes mandatory in C23.