.\" $OpenBSD: mbrtoc16.3,v 1.1 2023/08/20 15:02:51 schwarze Exp $
.\"
.\" Copyright 2023 Ingo Schwarze <schwarze@openbsd.org>
.\" Copyright 2010 Stefan Sperling <stsp@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: August 20 2023 $
.Dt MBRTOC16 3
.Os
.Sh NAME
.Nm mbrtoc16
.Nd convert one UTF-8 encoded character to UTF-16
.Sh SYNOPSIS
.In uchar.h
.Ft size_t
.Fo mbrtoc16
.Fa "char16_t * restrict pc16"
.Fa "const char * restrict s"
.Fa "size_t n"
.Fa "mbstate_t * restrict mbs"
.Fc
.Sh DESCRIPTION
The
.Fn mbrtoc16
function examines at most
.Fa n
bytes of the multibyte character byte string pointed to by
.Fa s ,
converts those bytes to a wide character,
and encodes the wide character using UTF-16.
In some cases, it is necessary to call this function
twice to convert a single character.
.Pp
Conversion happens in accordance with the conversion state
.Pf * Fa mbs ,
which must be initialized to zero before the application's first call to
.Fn mbrtoc16 .
For this function,
.Pf * Fa mbs
stores information about both the state of the UTF-8 input encoding
and the state of the UTF-16 output encoding.
If the previous call did not return
.Po Vt size_t Pc Ns \-1 ,
.Fa mbs
can safely be reused without reinitialization.
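.Pp
As a minimal sketch of the initialization requirement described above,
the following fragment zeroes a local
.Vt mbstate_t
object before its first use and converts one character.
The helper name
.Fn first_unit
is hypothetical and not part of any library.

```c
#include <string.h>
#include <uchar.h>

/*
 * Hypothetical helper, not part of libc: convert the first character
 * of the n bytes at s using a freshly zeroed conversion state and
 * pass the result of mbrtoc16() back to the caller unchanged.
 */
static size_t
first_unit(const char *s, size_t n, char16_t *out)
{
	mbstate_t mbs;

	memset(&mbs, 0, sizeof(mbs));	/* initial conversion state */
	return mbrtoc16(out, s, n, &mbs);
}
```

Because the state object is local to the helper, it cannot be used to
continue an incomplete sequence or to fetch a low surrogate; a caller
that needs either has to keep the
.Vt mbstate_t
object alive across calls.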
.Pp
The input encoding that
.Fn mbrtoc16
uses for
.Fa s
is determined by the
.Dv LC_CTYPE
category of the current locale.
If the locale is changed without reinitialization of
.Pf * Fa mbs ,
the behaviour is undefined.
.Pp
Unlike
.Xr mbtowc 3 ,
.Fn mbrtoc16
accepts an incomplete byte sequence pointed to by
.Fa s
which does not form a complete character but is potentially part of
a valid character.
In this case, the function consumes all such bytes.
The conversion state saved in
.Pf * Fa mbs
will be used to restart the suspended conversion during the next call.
.Pp
On systems other than
.Ox
that support state-dependent encodings,
.Fa s
may point to a special sequence of bytes called a
.Dq shift sequence ;
see
.Xr mbrtowc 3
for details.
.Pp
The following arguments cause special processing:
.Bl -tag -width 012345678901
.It Fa pc16 No == Dv NULL
The conversion from a multibyte character to a wide character is performed
and the conversion state may be affected, but the resulting wide character
is discarded.
.It Fa s No == Dv NULL
The arguments
.Fa pc16
and
.Fa n
are ignored and starting or continuing the conversion with an empty string
is attempted, discarding the conversion result.
.It Fa mbs No == Dv NULL
An internal
.Vt mbstate_t
object specific to the
.Fn mbrtoc16
function is used instead of the
.Fa mbs
argument.
This internal object is automatically initialized at program startup
and never changed by any
.Em libc
function except
.Fn mbrtoc16 .
.Pp
If
.Fn mbrtoc16
is called with a
.Dv NULL
.Fa mbs
argument and that call returns
.Po Vt size_t Pc Ns \-1 ,
the internal conversion state of
.Fn mbrtoc16
becomes permanently undefined and there is no way
to reset it to any defined state.
Consequently, after such a mishap, it is not safe to call
.Fn mbrtoc16
with a
.Dv NULL
.Fa mbs
argument ever again until the program is terminated.
.El
.Sh RETURN VALUES
.Bl -tag -width 012345678901
.It 0
The bytes pointed to by
.Fa s
form a terminating NUL character.
If
.Fa pc16
is not
.Dv NULL ,
a NUL wide character has been stored in
.Pf * Fa pc16 .
.It positive
.Fa s
points to a valid character, and the value returned is the number of
bytes completing the character.
If
.Fa pc16
is not
.Dv NULL ,
the first UTF-16 code unit of the corresponding wide character
has been stored in
.Pf * Fa pc16 .
If it is a UTF-16 high surrogate, the function needs to be called
again to retrieve a second UTF-16 code unit, the low surrogate.
On
.Ox ,
this happens if and only if the return value is 4,
but this equivalence does not hold on other operating systems
that support input encodings other than UTF-8.
.It Po Vt size_t Pc Ns \-1
.Fa s
points to an illegal byte sequence which does not form a valid multibyte
character in the current locale, or
.Fa mbs
points to an invalid or uninitialized object.
.Va errno
is set to
.Er EILSEQ
or
.Er EINVAL ,
respectively.
The conversion state object pointed to by
.Fa mbs
is left in an undefined state and must be reinitialized before being
used again.
.It Po Vt size_t Pc Ns \-2
.Fa s
points to an incomplete byte sequence of length
.Fa n
which has been consumed and contains part of a valid multibyte character.
The character may be completed by calling the same function again with
.Fa s
pointing to one or more subsequent bytes of the multibyte character and
.Fa mbs
pointing to the conversion state object used during conversion of the
incomplete byte sequence.
.It Po Vt size_t Pc Ns \-3
The second 16-bit code unit resulting from a previous call
has been stored into
.Pf * Fa pc16 ,
without consuming any additional bytes from
.Fa s .
.El
.Sh ERRORS
.Fn mbrtoc16
causes an error in the following cases:
.Bl -tag -width Er
.It Bq Er EILSEQ
.Fa s
points to an invalid multibyte character.
.It Bq Er EINVAL
.Fa mbs
points to an invalid or uninitialized
.Vt mbstate_t
object.
.El
.Sh SEE ALSO
.Xr c16rtomb 3 ,
.Xr mbrtowc 3 ,
.Xr setlocale 3
.Sh STANDARDS
.Fn mbrtoc16
conforms to
.St -isoC-2011 .
.Sh HISTORY
.Fn mbrtoc16
has been available since
.Ox 7.4 .
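.Sh EXAMPLES
The return values listed under
.Sx RETURN VALUES
can be handled in a single loop.
The following is a sketch only, not a reference implementation:
the helper name
.Fn mbs_to_utf16 ,
its buffer-size convention, and the use of
.Er ERANGE
for a full output buffer are assumptions made for illustration.

```c
#include <errno.h>
#include <string.h>
#include <uchar.h>

/*
 * Hypothetical helper, not part of libc: convert the NUL-terminated
 * multibyte string src into at most outlen UTF-16 code units in out.
 * The output is not NUL-terminated.
 * Returns the number of code units stored, or (size_t)-1 on error.
 */
static size_t
mbs_to_utf16(char16_t *out, size_t outlen, const char *src)
{
	mbstate_t mbs;
	char16_t c16;
	size_t len, rc, used;

	memset(&mbs, 0, sizeof(mbs));	/* initial conversion state */
	len = strlen(src) + 1;		/* include the terminating NUL */
	used = 0;
	for (;;) {
		rc = mbrtoc16(&c16, src, len, &mbs);
		if (rc == 0)
			return used;	/* terminating NUL converted */
		if (rc == (size_t)-1)
			return rc;	/* errno was set by mbrtoc16() */
		if (rc == (size_t)-2) {
			/* Not expected here: the NUL is part of the input. */
			errno = EILSEQ;
			return (size_t)-1;
		}
		if (used == outlen) {
			errno = ERANGE;	/* choice made by this sketch */
			return (size_t)-1;
		}
		out[used++] = c16;
		if (rc != (size_t)-3) {
			/* Positive result: rc input bytes were consumed. */
			src += rc;
			len -= rc;
		}
	}
}
```

A result of
.Po Vt size_t Pc Ns \-3
stores a code unit without advancing
.Fa s ,
which is why the input pointer is only moved for positive results.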
.Sh CAVEATS
On operating systems other than
.Ox
that support input encodings other than UTF-8, inspecting the return value
is insufficient to tell whether the function needs to be called again.
If the return value is positive, inspecting
.Pf * Fa pc16
is also required to make that decision.
Consequently, passing a
.Dv NULL
pointer for the
.Fa pc16
argument is discouraged because it can result
in a well-defined but unknown output encoding state.
The simplest way to recover from such an unknown state is to
reinitialize the object pointed to by
.Fa mbs .
.Pp
The C11 standard only requires the
.Fa pc16
argument to be encoded according to UTF-16
if the predefined environment macro
.Dv __STDC_UTF_16__
is defined with a value of 1.
On
.Ox ,
.In uchar.h
provides this definition.
Other operating systems which do not define
.Dv __STDC_UTF_16__
could theoretically use a different,
implementation-defined output encoding for
.Fa pc16
instead of UTF-16.
Writing portable code for an arbitrary output encoding is impossible
because the rules when and how often the function needs to be called
again depend on the output encoding; the rules explained above are
specific to UTF-16.
Using UTF-16 as the output encoding of
.Fn mbrtoc16
becomes mandatory in C23.
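.Pp
Assuming UTF-16 output, the decision whether another call is needed
can be based on the stored code unit rather than on the byte count.
The sketch below illustrates that check; the helper names
.Fn is_high_surrogate
and
.Fn needs_second_call
are hypothetical, and the range 0xD800 to 0xDBFF is the UTF-16
high surrogate range.

```c
#include <stdbool.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: true if c is a UTF-16 high surrogate. */
static bool
is_high_surrogate(char16_t c)
{
	return c >= 0xD800 && c <= 0xDBFF;
}

/*
 * Hypothetical helper: given the return value rc of mbrtoc16() and
 * the code unit it stored, report whether a second call is needed
 * to fetch the low surrogate.  Only positive results qualify;
 * 0 and the special values (size_t)-1, -2 and -3 never do.
 */
static bool
needs_second_call(size_t rc, char16_t c16)
{
	if (rc == 0 || rc >= (size_t)-3)
		return false;
	return is_high_surrogate(c16);
}
```

This check requires a non-NULL
.Fa pc16
argument in the preceding call, since otherwise no code unit is
available for inspection.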