.HTML "Hello World or Καλημέρα κόσμε or こんにちは 世界
.TL
Hello World
.br
or
.br
.ft R
Καλημέρα κόσμε
.ft
.br
or
.br
\f(Jpこんにちは 世界\fP
.AU
Rob Pike
Ken Thompson
.sp
rob,ken@plan9.bell-labs.com
.AB
.FS
Originally appeared, in a slightly different form, in
.I
Proc. of the Winter 1993 USENIX Conf.,
.R
pp. 43-50,
San Diego.
It has been revised to reflect the move to 21-bit Unicode.
.FE
Plan 9 from Bell Labs has recently been converted from ASCII
to an ASCII-compatible variant of the Unicode Standard,
a 16-bit (now 21-bit) character set.
In this paper we explain the reasons for the change,
describe the character set and representation we chose,
and present the programming models and software changes
that support the new text format.
Although we stopped short of full internationalization\(emfor
example, system error messages are in Unixese, not Japanese\(emwe
believe Plan 9 is the first system to treat the representation
of all major languages on a uniform, equal footing throughout all its
software.
.AE
.SH
Introduction
.PP
The world is multilingual but most computer systems
are based on English and ASCII.
The first release of Plan 9 [Pike90], a new distributed operating
system from Bell Laboratories, seemed a good occasion
to correct this chauvinism.
It is easier to make such deep changes when building new systems than
by refitting old ones.
.PP
The ANSI C standard [ANSIC] contains some guidance on the matter of
`wide' and `multi-byte' characters but falls far short of
solving the myriad associated problems.
We could find no literature on how to convert a
.I system
to larger character sets, although some individual
.I programs
had been converted.
This paper reports what we discovered as we
explored the problem of representing multilingual
text at all levels of an operating system,
from the file system and kernel through
the applications and up to the window system
and display.
.PP
Plan 9 has not been `internationalized':
its manuals are in English,
its error messages are in English,
and it can display text that goes from left to right only.
But before we can address these other problems,
we need to handle, uniformly and comfortably,
the textual representation of all the major written languages.
That subproblem is richer than we had anticipated.
.SH
Standards
.PP
Our first step was to select a standard.
At the time (January 1992),
there were only two viable options:
ISO 10646 [ISO10646] and Unicode [Unicode].
The documents describing both proposals were still in the draft stage.
.PP
The draft of ISO 10646 was not
very attractive to us.
It defined a sparse set of 32-bit characters,
which would be
hard to implement
and have punitive storage requirements.
Also, the draft attempted to
mollify national interests by allocating
16-bit subspaces to national committees
to partition individually.
The suggested mode of use was to
``flip'' between separate national
standards to implement the international standard.
This did not strike us as a sound basis for a character set.
As well, transmitting 32-bit values in a byte stream,
such as in pipes, would be expensive and hard to implement.
Since the standard does not define a byte order for such
transmission, the byte stream would also have to carry
state to enable the values to be recovered.
.PP
The Unicode Standard is a proposal by a consortium of mostly American
computer companies formed
to protest the technical
failings of ISO 10646.
It defines a uniform 16-bit code based on the
principle of unification:
two characters are the same if they look the
same even though they are from different
languages.
This principle, called Han unification,
allows the large Japanese, Chinese, and Korean
character sets to be packed comfortably into a 16-bit representation.
.PP
We chose the Unicode Standard for its technical merits and because its
code space was better defined.
Moreover,
the Unicode Consortium was derailing the
ISO 10646 standard.
(Now, in 1995,
ISO 10646 is a standard
with one 16-bit group defined,
which is almost exactly the Unicode Standard.
As most people expected, the two standards bodies
reached a détente and
ISO 10646 and Unicode represent the same character set.)
.PP
The Unicode Standard defines an adequate character set
but an unreasonable representation.
It states that all characters
are 16 bits wide and are communicated and stored in
16-bit units.
It also reserves a pair of characters
(hexadecimal FFFE and FEFF) to detect byte order
in transmitted text, requiring state in the byte stream.
(The Unicode Consortium was thinking of files, not pipes.)
To adopt this encoding,
we would have had to convert all text going
into and out of Plan 9 between ASCII and Unicode, which cannot be done.
Within a single program, in command of all its input and output,
it is possible to define characters as 16-bit quantities;
in the context of a networked system with
hundreds of applications on diverse machines
by different manufacturers,
it is impossible.
.PP
We needed a way to adapt the Unicode Standard to the tools-and-pipes
model of text processing embodied by the Unix system.
To do that, we
needed an ASCII-compatible textual
representation of Unicode characters for transmission
and storage.
In the draft ISO standard there was an informative
(non-required)
Annex
called UTF
that provided a byte stream encoding
of the 32-bit ISO code.
The encoding uses multibyte sequences composed
from the 190 printable characters of Latin-1
to represent character values larger
than 159.
.PP
The UTF encoding has several good properties.
By far the most important is that
a byte in the ASCII range 0-127 represents
itself in UTF.
Thus UTF is backward compatible with ASCII.
.PP
UTF has other advantages.
It is a byte encoding and is
therefore byte-order independent.
ASCII control characters appear in the byte stream
only as themselves, never as an element of a sequence
encoding another character,
so newline bytes separate lines of UTF text.
Finally, ANSI C's
.CW strcmp
function applied to UTF strings preserves the ordering of Unicode characters.
.PP
To encode and decode UTF is expensive (involving multiplication,
division, and modulo operations) but workable.
UTF's major disadvantage is that the encoding
is not self-synchronizing.
It is in general impossible to find the character
boundaries in a UTF string without reading from
the beginning of the string, although in practice
control characters such as newlines,
tabs, and blanks provide synchronization points.
.PP
In August 1992,
X/Open circulated a proposal for another UTF-like
byte encoding of Unicode characters.
Their major concern was that an embedded character
in a file name
(in particular a slash)
could be part of an escape sequence in UTF and
therefore confuse a traditional file system.
Their proposal would allow all 7-bit ASCII characters
to represent themselves
.I "and only themselves"
in text.
Multibyte sequences would contain only characters
with the high bit set.
We proposed a modification to the new UTF that
would address our synchronization problem.
Our proposal, which was originally known informally as UTF-2 and FSS-UTF,
is now referred to as UTF-8 and has been approved by ISO to become
Annex P to ISO 10646.
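.PP
The synchronization property can be illustrated with a short sketch in portable C.
It assumes the UTF-8 bit layout as it is commonly specified, in which every byte
after the first byte of a multibyte sequence has the form 10xxxxxx;
the function name
utf8_sync
is hypothetical and is not part of any Plan 9 library.
Starting from an arbitrary byte offset, the start of the enclosing character
is found by backing up over such bytes, without reading from the beginning of the string.

```c
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b)
{
	return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: given a byte offset i into the n-byte buffer s,
 * return the offset of the first byte of the character containing it.
 * Offsets at or beyond n are returned as n.
 */
size_t utf8_sync(const char *s, size_t n, size_t i)
{
	if (i >= n)
		return n;
	/* back up over continuation bytes until a lead byte or ASCII byte */
	while (i > 0 && is_continuation((unsigned char)s[i]))
		i--;
	return i;
}
```

In valid text the loop backs up at most three bytes, since no encoding is longer than four.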
.PP
The model for text in Plan 9 is chosen from these
three standards*:
.FS
* ``That's the nice thing about standards\(emthere's so many to choose from.'' \- Andy Tannenbaum (no, the other one)
.FE
the Unicode character set encoded as a byte stream by
UTF-8, from
(soon to be) Annex P of ISO 10646.
Although this mixture may seem like a precarious position for us to adopt,
it is not as bad as it sounds.
ISO 10646 and the Unicode Standard have converged,
other systems such as Linux have adopted the same character set and encoding,
and the general feeling seems to be that Unicode and UTF-8 will be accepted as the way
to exchange text between systems.
The prognosis for wide acceptance is good.
.PP
There are a couple of aspects of the Unicode Standard we have not faced.
One is the issue of right-to-left text such as Hebrew or Arabic.
Since that is an issue of display, not representation, we believe
we can defer that problem for the moment without affecting our
ability to solve it later.
Another issue is diacriticals and `combining characters',
which cause overstriking of multiple Unicode characters.
Although necessary for some scripts, such as Thai, Arabic, and Hebrew,
such characters confuse the issues for Latin languages because they
generate multiple representations for accented characters.
ISO 10646 describes three levels of implementation;
in Plan 9 we decided not to address the issue.
Again, this can be labeled as a display issue and its finer points are still being debated,
so we felt comfortable deferring.  Mañana.
.PP
Although we converted Plan 9 in the altruistic interests of
serving foreign languages, we have found the large character
set attractive for other reasons.  The Unicode Standard includes many
characters\(emmathematical symbols, scientific notation,
more general punctuation, and more\(emthat we now use
daily in our work.  We no longer test our imaginations
to find ways to include non-ASCII symbols in our text;
why type
.CW :-)
when you can use the character ☺?
Most compelling is the ability to absorb documents
and data that contain non-ASCII characters; our browser for the
Oxford English Dictionary
lets us see the dictionary as it really is, with pronunciation
in the IPA font, foreign phrases properly rendered, and so on,
.I "in plain text.
.PP
As of Unicode 4.0,
characters are now 21 bits wide and the longest UTF-8 encoding of a character
requires 4 bytes.
We are adapting the system to match.
.PP
In the rest of this paper, except when
stated otherwise, the term `UTF' refers to the UTF-8 encoding
of Unicode characters as adopted by Plan 9.
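.PP
As a rough illustration of why four bytes suffice, the following sketch computes
the encoded length of a code point.
It assumes the standard UTF-8 ranges (7, 11, 16, and 21 payload bits for
one to four bytes);
the name
utf8_len
is hypothetical.

```c
/*
 * Hypothetical helper: number of bytes UTF-8 needs for code point c,
 * or 0 if c lies beyond the 21-bit Unicode limit of 0x10FFFF.
 */
int utf8_len(unsigned long c)
{
	if (c < 0x80)
		return 1;	/* 7 bits: ASCII */
	if (c < 0x800)
		return 2;	/* 11 bits */
	if (c < 0x10000)
		return 3;	/* 16 bits: the original Unicode range */
	if (c <= 0x10FFFF)
		return 4;	/* up to 21 bits */
	return 0;
}
```

Under these assumptions every character of the original 16-bit Unicode fits in three bytes; the fourth byte is needed only above FFFF.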
.SH
C Compiler
.PP
The first program to be converted to UTF
was the C Compiler.
There are two levels of conversion.
On the syntactic level,
input to the C compiler
is UTF; on the semantic level,
the C language needs to define
how compiled programs manipulate
the UTF set.
.PP
The syntactic part is simple.
The ANSI C language standard defines the
source character set to be ASCII.
Since UTF is backward compatible with ASCII,
the compiler needs little change.
The only places where a larger character set
is allowed are in character constants, strings, and comments.
Since 7-bit ASCII characters can represent only
themselves in UTF,
the compiler does not have to be careful while looking
for the termination of a string or comment.
.PP
The Plan 9 compiler extends ANSI C to treat any Unicode
character with a value outside of the ASCII range as
an alphabetic.
To a Greek programmer or an English mathematician,
α is a sensible and now valid variable name.
.PP
On the semantic level, ANSI C allows,
but does not tie down,
the notion of a
.I "wide character
and admits string and character constants
of this type.
We chose the wide character type to be
.CW unsigned
.CW short
(now
.CW unsigned
.CW long ).
In the libraries, the word
.CW Rune
is now defined by a
.CW typedef
to be equivalent to
.CW unsigned
.CW long
and is
used to signify a Unicode character.
.PP
There are surprises; for example:
.P1
L'x'	\f1is 120\fP
\&'x'	\f1is 120\fP
L'ÿ'	\f1is 255\fP
\&'ÿ'	\f1is -1, stdio \fPEOF\f1 (if \fPchar\f1 is signed)\fP
L'\f1α\fP'	\f1is 945\fP
\&'\f1α\fP'	\f1is illegal\fP
.P2
In the string constants,
.P1
"\f(Jpこんにちは 世界\fP"
L"\f(Jpこんにちは 世界\fP",
.P2
the former is an array of
.CW chars
with 22 elements
and a null byte,
while the latter is an array of
.CW unsigned
.CW long s
.CW Runes ) (
with 8 elements and a null
.CW Rune .
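.PP
The 22-versus-8 distinction can be checked with a small sketch in portable C.
It assumes the string is stored as UTF-8 and counts every byte that is not a
continuation byte (bit pattern 10xxxxxx) as the start of a character;
the name
utf8_count
is hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper: count the characters in a NUL-terminated UTF-8
 * string by counting the bytes that begin a character.
 */
size_t utf8_count(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

For the greeting above, five hiragana and two kanji at three bytes each plus one ASCII space give 22 bytes, while the count of characters is 8.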
.PP
The Plan 9 library provides an output conversion function,
.CW print
(analogous to
.CW printf ),
with formats
.CW %c ,
.CW %C ,
.CW %s ,
and
.CW %S .
Since
.CW print
produces text, its output is always UTF.
The character conversion
.CW %c
(lower case) masks its argument
to 8 bits before converting to UTF.
Thus
.CW L'ÿ'
and
.CW 'ÿ'
printed under
.CW %c
will be identical,
but
.CW L'\f1α\fP'
will print as the Unicode
character with decimal value 177.
The character conversion
.CW %C
(upper case) masks its argument
to 16 bits before converting to UTF.
Thus
.CW L'ÿ'
and
.CW L'\f1α\fP'
will print correctly under
.CW %C ,
but
.CW 'ÿ'
will not.
The conversion
.CW %s
(lower case)
expects a pointer to
.CW char
and copies UTF sequences up to a null byte.
The conversion
.CW %S
(upper case) expects a pointer to
.CW Rune
and
performs sequential
.CW %C
conversions until a null
.CW Rune
is encountered.
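.PP
The masking arithmetic can be worked through directly.
The sketch below only models the 8-bit and 16-bit masks described in the text;
the names
mask8
and
mask16
are hypothetical and do not correspond to anything inside
print.

```c
/* Model of the 8-bit mask the text describes for the lower-case conversion. */
unsigned long mask8(unsigned long r)
{
	return r & 0xFF;
}

/* Model of the 16-bit mask the text describes for the upper-case conversion. */
unsigned long mask16(unsigned long r)
{
	return r & 0xFFFF;
}
```

With these, 945 (hexadecimal 3B1) masks to B1, which is 177.
A signed
char
holding ÿ has the value -1; masked to 8 bits it is 255, the same as the wide constant,
but masked to 16 bits it is FFFF rather than 255, which is why it does not print correctly under the upper-case conversion.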
.PP
Another problem in format conversion
is the definition of
.CW %10s :
does the number refer to bytes or characters?
We decided that such formats were most
often used to align output columns and
so made the number count characters.
Some programs, however, use the count
to place blank-padded strings
in fixed-sized arrays.
These programs must be found and corrected.
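.PP
A sketch of the character-counting interpretation, in portable C, shows why the
fixed-sized-array idiom breaks.
The helper
pad_left
is hypothetical; it right-justifies a UTF-8 string in a field measured in
characters and refuses to write if the byte capacity of the buffer is too small.

```c
#include <stddef.h>
#include <string.h>

/* Count characters: bytes that are not continuation bytes (10xxxxxx). */
static size_t count_chars(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/*
 * Hypothetical helper: right-justify s in a field `width` characters wide,
 * writing into out, which holds cap bytes.  Returns the number of bytes
 * written (excluding the NUL), or 0 if the result does not fit.
 */
size_t pad_left(char *out, size_t cap, const char *s, size_t width)
{
	size_t nchar = count_chars(s);
	size_t nbyte = strlen(s);
	size_t pad = nchar < width ? width - nchar : 0;

	if (pad + nbyte + 1 > cap)
		return 0;
	memset(out, ' ', pad);
	memcpy(out + pad, s, nbyte + 1);	/* copies the terminating NUL too */
	return pad + nbyte;
}
```

A two-character Greek string padded to ten characters occupies twelve bytes, so an array sized for ten ASCII characters is no longer large enough.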
.PP
Here is a complete example:
.P1
#include <u.h>
#include <libc.h>

char c[] = "\f(Jpこんにちは 世界\fP";
Rune s[] = L"\f(Jpこんにちは 世界\fP";

void
main(void)
{
	print("%d, %d\en", sizeof(c), sizeof(s));
	print("%s\en", c);
	print("%S\en", s);
	exits(nil);
}
.P2
.PP
This program prints
.CW 23,
.CW 18
and then two identical lines of
UTF text.
(The
.CW 18
dates from the 16-bit
.CW Rune ;
where a
.CW Rune
occupies four bytes, the second number is
.CW 36 .)
In practice,
.CW %S
and
.CW L"..."
are rare in programs; one reason is
that most formatted I/O is done in unconverted UTF.
.SH
Ramifications
.PP
All programs in Plan 9 now read and write text as UTF, not ASCII.
This change breaks two deep-rooted symmetries implicit in most C programs:
.IP 1.
A character is no longer a
.CW char .
.IP 2.
The internal representation (Rune) of a character now differs from its
external representation (UTF).
.PP
In the sections that follow,
we show how these issues were faced in the layers of
system software from the operating system up to the applications.
The effects are wide-reaching and often surprising.
.SH
Operating system
.PP
Since UTF is the only format for text in Plan 9,
the interface to the operating system had to be converted to UTF.
Text strings cross the interface in several places:
command arguments,
file names,
user names (people can log in using their native name),
error messages,
and miscellaneous minor places such as commands to the I/O system.
Little change was required: null-terminated UTF strings
are equivalent to null-terminated ASCII strings for most purposes
of the operating system.
The library routines described in the next section made that
change straightforward.
.PP
The window system, once called
.CW 8.5 ,
is now rightfully called
.CW 8½ .
.SH
Libraries
.PP
A header file included by all programs (see [Pike92]) declares
the
.CW Rune
type to hold 21-bit character values:
.P1
typedef unsigned long Rune;
.P2
Also defined are several constants relevant to UTF:
.P1
enum
{
    UTFmax	= 4,	/* maximum bytes per rune */
    Runesync	= 0x80,	/* cannot be in a UTF sequence (<) */
    Runeself	= 0x80,	/* rune==UTF sequence (<) */
    Runeerror	= 0xFFFD,	/* decoding error in UTF */
    Runemax	= 0x10FFFF,	/* largest 21-bit rune */
    Runemask	= 0x1FFFFF,	/* bits used by runes (see grep) */
};
.P2
(With the original UTF,
.CW Runesync
was hexadecimal 21 and
.CW Runeself
was A0.)
.CW UTFmax
bytes are sufficient
to hold the UTF encoding of any Unicode character.
Characters of value less than
.CW Runesync
only appear in a UTF string as
themselves, never as part of a sequence encoding another character.
Characters of value less than
.CW Runeself
encode into single bytes
of the same value.
Finally, when the library detects errors in UTF input\(embyte sequences
that are not valid UTF sequences\(emit converts the first byte of the
error sequence to the character
.CW Runeerror .
There is little a rune-oriented program can do when given bad data
except exit, which is unreasonable, or carry on.
Originally the conversion routines, described below,
returned errors when given invalid UTF,
but we found ourselves repeatedly checking for errors and ignoring them.
We therefore decided to convert a bad sequence to a valid rune
and continue processing.
(The ANSI C routines, on the other hand, return errors.)
.PP
This technique does have the unfortunate property that converting
invalid UTF byte strings in and out of runes does not preserve the input,
but this circumstance only occurs when non-textual input is
given to a textual program.
The Unicode Standard defines an error character, value FFFD, to stand for
characters from other sets that it does not represent.
The
.CW Runeerror
character is a different concept, related to the encoding rather than the character set.
.PP
The Plan 9 C library contains a number of routines for
manipulating runes.
The first set converts between runes and UTF strings:
.P1
extern	int	runetochar(char*, Rune*);
extern	int	chartorune(Rune*, char*);
extern	int	runelen(long);
extern	int	fullrune(char*, int);
.P2
.CW Runetochar
translates a single
.CW Rune
to a UTF sequence and returns the number of bytes produced.
.CW Chartorune
goes the other way, reporting how many bytes were consumed.
.CW Runelen
returns the number of bytes in the UTF encoding of a rune.
.CW Fullrune
examines a UTF string up to a specified number of bytes
and reports whether the string begins with a complete UTF encoding.
All these routines use the
.CW Runeerror
character to work around encoding problems.
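.PP
The error convention can be illustrated in portable C.
What follows is a minimal sketch, not the Plan 9 library source:
.CW sketch_chartorune
and the
.CW Sketch
constants are hypothetical names, the explicit length argument is an addition,
.CW Rune
is assumed to be a 32-bit unsigned integer, and the error rune is assumed to have the value FFFD.
A malformed or truncated sequence yields the error rune and consumes one byte,
so the caller always makes progress.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Rune;              /* assumption: wide enough for 21 bits */

enum {
    SketchRuneself  = 0x80,         /* bytes below this encode themselves */
    SketchRuneerror = 0xFFFD,       /* assumed value of the error rune */
    SketchRunemax   = 0x10FFFF      /* largest 21-bit Unicode value */
};

/*
 * Decode one rune from the n bytes at s (n must be at least 1).
 * Returns the number of bytes consumed.  A malformed or truncated
 * sequence yields SketchRuneerror and consumes exactly one byte.
 */
static int
sketch_chartorune(Rune *r, const unsigned char *s, size_t n)
{
    /* smallest value that legitimately needs a sequence of each length */
    static const Rune minval[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    Rune c;
    int len, i;

    assert(r != NULL && s != NULL && n > 0);
    c = s[0];
    if (c < SketchRuneself) {       /* ASCII: one byte, same value */
        *r = c;
        return 1;
    }
    if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; }
    else goto bad;                  /* stray continuation or invalid lead byte */

    if ((size_t)len > n)
        goto bad;                   /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            goto bad;               /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < minval[len] || c > SketchRunemax)
        goto bad;                   /* overlong form or out of range */
    *r = c;
    return len;

bad:
    *r = SketchRuneerror;
    return 1;
}
```

A loop that adds the return value to its input pointer visits every byte of the input,
whether or not the input is valid UTF.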
.PP
There is also a set of routines for examining null-terminated UTF strings,
based on the model of the ANSI standard
.CW str
routines, but with
.CW utf
substituted for
.CW str
and
.CW rune
for
.CW chr :
.P1
extern	int	utflen(char*);
extern	char*	utfrune(char*, long);
extern	char*	utfrrune(char*, long);
extern	char*	utfutf(char*, char*);
.P2
.CW Utflen
returns the number of runes in a UTF string;
.CW utfrune
returns a pointer to the first occurrence of a rune in a UTF string;
and
.CW utfrrune
a pointer to the last.
.CW Utfutf
searches for the first occurrence of a UTF string in another UTF string.
Given the synchronizing property of UTF-8,
.CW utfutf
is the same as
.CW strstr
if the arguments point to valid UTF strings.
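.PP
Both properties are easy to demonstrate in portable C.
The sketch below is not the library code:
.CW sketch_utflen
and
.CW sketch_utfutf
are hypothetical names, and both assume their arguments are valid UTF-8.
Under that assumption, counting runes reduces to counting the bytes that are not continuation bytes,
and substring search reduces to
.CW strstr .

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count the runes in a null-terminated string, assuming it is valid
 * UTF-8: every rune contributes exactly one byte that is not a
 * continuation byte (10xxxxxx).
 */
static size_t
sketch_utflen(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/*
 * Substring search on valid UTF-8.  Because no rune's encoding can
 * begin in the middle of another's, a byte-level match from strstr
 * always falls on a rune boundary.
 */
static const char *
sketch_utfutf(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    return strstr(s, t);
}
```

With invalid input the shortcut in
.CW sketch_utflen
no longer holds; a routine that must tolerate bad bytes has to decode each sequence instead.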
.PP
It is a mistake to use
.CW strchr
or
.CW strrchr
unless searching for a 7-bit ASCII character, that is, a character
less than
.CW Runeself .
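.PP
One way to honor that rule is sketched below in portable C.
It is an illustration under stated assumptions, not the library's implementation:
.CW sketch_runetochar
and
.CW sketch_utfrune
are hypothetical names, the input is assumed to be valid UTF-8,
and the non-ASCII case is handled by encoding the rune and searching for its bytes,
which is only one of several possible strategies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encode rune c (at most 0x10FFFF) as UTF-8 into buf; return byte count. */
static int
sketch_runetochar(char *buf, unsigned long c)
{
    if (c < 0x80) {
        buf[0] = (char)c;
        return 1;
    }
    if (c < 0x800) {
        buf[0] = (char)(0xC0 | (c >> 6));
        buf[1] = (char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        buf[0] = (char)(0xE0 | (c >> 12));
        buf[1] = (char)(0x80 | ((c >> 6) & 0x3F));
        buf[2] = (char)(0x80 | (c & 0x3F));
        return 3;
    }
    buf[0] = (char)(0xF0 | (c >> 18));
    buf[1] = (char)(0x80 | ((c >> 12) & 0x3F));
    buf[2] = (char)(0x80 | ((c >> 6) & 0x3F));
    buf[3] = (char)(0x80 | (c & 0x3F));
    return 4;
}

/*
 * Find the first occurrence of rune c in the valid UTF-8 string s.
 * strchr is safe only for single-byte (ASCII) runes; anything larger
 * is encoded and located as a byte sequence.
 */
static const char *
sketch_utfrune(const char *s, unsigned long c)
{
    char buf[5];
    int n;

    assert(s != NULL && c <= 0x10FFFF);
    if (c < 0x80)
        return strchr(s, (int)c);
    n = sketch_runetochar(buf, c);
    buf[n] = '\0';
    return strstr(s, buf);
}
```

Passing a value of 0x80 or above directly to
.CW strchr
would instead look for a single byte with that value,
which in UTF-8 can only be a fragment of some multi-byte encoding.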
.PP
We have no routines for manipulating null-terminated arrays of
.CW Runes .
Although they should probably exist for completeness, we have
found no need for them, for the same reason that
.CW %S
and
.CW L"..."
are rarely used.
.PP
Most Plan 9 programs use a new buffered I/O library, BIO, in place of
Standard I/O.
BIO contains routines to read and write UTF streams, converting to and from
runes.
.CW Bgetrune
returns, as a
.CW Rune
within a
.CW long ,
the next character in the UTF input stream;
.CW Bputrune
takes a rune and writes its UTF representation.
.CW Bungetrune
puts a rune back into the input stream for rereading.
.PP
Plan 9 programs use a simple set of macros to process command line arguments.
Converting these macros to UTF automatically updated the
argument processing of most programs.
In general,
argument flag names can no longer be held in bytes and
arrays of 256 bytes cannot be used to hold a set of flags.
.PP
We have done nothing analogous to ANSI C's locales, partly because
we do not feel qualified to define locales and partly because we remain
unconvinced of that model for dealing with the problems.
That is really more an issue of internationalization than conversion
to a larger character set; on the other hand,
because we have chosen a single character set that encompasses
most languages, some of the need for
locales is eliminated.
(We have a utility,
.CW tcs ,
that translates between UTF and other character sets.)
.PP
There are several reasons why our library does not follow the ANSI design
for wide and multi-byte characters.
The ANSI model was designed by a committee, untried, almost
as an afterthought, whereas
we wanted to design as we built.
(We made several major changes to the interface
as we became familiar with the problems involved.)
We disagree with ANSI C's handling of invalid multi-byte sequences.
Also, the ANSI C library is incomplete:
although it contains some crucial routines for handling
wide and multi-byte characters, there are some serious omissions.
For example, our software can exploit
the fact that UTF preserves ASCII characters in the byte stream.
We could remove that assumption by replacing all
calls to
.CW strchr
with
.CW utfrune
and so on.
(Because of the weaker properties of the original UTF,
we have actually done so.)
ANSI C can do neither:
the standard says nothing about the representation, so portable code should
.I never
call
.CW strchr ,
yet there is no ANSI equivalent to
.CW utfrune .
ANSI C simultaneously invalidates
.CW strchr
and offers no replacement.
.PP
Finally, ANSI did nothing to integrate wide characters
into the I/O system: it gives no method for printing
wide characters.
We therefore needed to invent some things and decided to invent
everything.
In the end, some of our entry points do correspond closely to
ANSI routines\(emfor example
.CW chartorune
and
.CW runetochar
are similar to
.CW mbtowc
and
.CW wctomb \(embut
Plan 9's library defines more functionality, enough
to write real applications comfortably.
.SH
Converting the tools
.PP
The source for our tools and applications had already been converted to
work with Latin-1, so it was `8-bit safe', but the conversion to the Unicode
Standard and UTF is more involved.
Some programs needed no change at all:
.CW cat ,
for instance,
interprets its argument strings, delivered in UTF,
as file names that it passes uninterpreted to the
.CW open
system call,
and then just copies bytes from its input to its output;
it never makes decisions based on the values of the bytes.
(Plan 9
.CW cat
has no options such as
.CW -v
to complicate matters.)
Most programs, however, needed modest change.
.PP
It is difficult to
find automatically the places that need attention,
but
.CW grep
helps.
Software that uses the libraries conscientiously can be searched
for calls to library routines that examine bytes as characters:
.CW strchr ,
.CW strrchr ,
.CW strstr ,
etc.
Replacing these by calls to
.CW utfrune ,
.CW utfrrune ,
and
.CW utfutf
is enough to fix many programs.
Few tools actually need to operate on runes internally;
more typically they need only look for the final slash in a file
name or perform similar trivial tasks.
Of the 170 C source programs in the top levels of
.CW /sys/src/cmd ,
only 23 now contain the word
.CW Rune .
.PP
The programs that
.I do
store runes internally
are mostly those whose
.I raison
.I d'être
is character manipulation:
.CW sam
(the text editor),
.CW sed ,
.CW sort ,
.CW tr ,
.CW troff ,
.CW 8½
(the window system and terminal emulator),
and so on.
To decide whether to compute using runes
or UTF-encoded byte strings requires balancing the cost of converting
the data when read and written
against the cost of converting relevant text on demand.
For programs such as editors that run a long time with a relatively
constant dataset, runes are the better choice.
There are space considerations too, but they are more complicated:
plain ASCII text grows when converted to runes; UTF-encoded Japanese
shrinks.
.PP
Again, it is hard to automate the conversion of a program from
.CW chars
to
.CW Runes .
It is not enough just to change the type of variables; the assumption
that bytes and characters are equivalent can be insidious.
For instance, to clear a character array by
.P1
memset(buf, 0, BUFSIZE)
.P2
becomes wrong if
.CW buf
is changed from an array of
.CW chars
to an array of
.CW Runes ,
because a count of elements is no longer a count of bytes.
Any program that indexes tables based on character values needs
rethinking.
Consider
.CW tr ,
which originally used multiple 256-byte arrays for the mapping.
The naïve conversion would yield multiple 1,114,112-rune arrays.
Instead Plan 9
.CW tr
saves space by building in effect
a run-encoded version of the map.
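.PP
The paper does not show the data structure
.CW tr
uses, so the following portable C is only a sketch of what a run-encoded map might look like.
The names
.CW Run
and
.CW sketch_maprune
are hypothetical, and
.CW Rune
is assumed to be an unsigned long.
Each entry maps a contiguous run of source runes onto a contiguous run of targets,
so the table grows with the number of runs rather than with the size of the alphabet.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long Rune;   /* assumption: at least 21 bits */

typedef struct Run Run;
struct Run {
    Rune lo;    /* first rune of the source run */
    Rune hi;    /* last rune of the source run, inclusive */
    Rune to;    /* rune that lo maps to; lo+k maps to to+k */
};

/*
 * Translate r through a table of nrun runs, sorted by lo and
 * non-overlapping, using binary search.  Runes outside every
 * run map to themselves.
 */
static Rune
sketch_maprune(const Run *tab, size_t nrun, Rune r)
{
    size_t lo = 0, hi = nrun;

    assert(tab != NULL || nrun == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (r < tab[mid].lo)
            hi = mid;
        else if (r > tab[mid].hi)
            lo = mid + 1;
        else
            return tab[mid].to + (r - tab[mid].lo);
    }
    return r;
}
```

A table of two runs, one for a to z and one for the Cyrillic lower-case letters 0430 to 044F,
is enough to fold both alphabets to upper case; the equivalent direct-indexed array would need an entry for every rune.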
.PP
.CW Sort
has related problems.
The cooperation of UTF and
.CW strcmp
means that a simple sort\(emone with no options\(emcan be done
on the original UTF strings using
.CW strcmp .
With sorting options enabled, however,
.CW sort
may need to convert its input to runes: for example,
option
.CW -t\f1α\fP
requires searching for alphas in the input text to
crack the input into fields.
The field specifier
.CW +3.2
refers to 2 runes beyond the third field.
Some of the other options are hopelessly provincial:
consider the case-folding and dictionary order options
(Japanese doesn't even have an official dictionary order) or
.CW -M
which compares by case-insensitive English month name.
Handling these options involves the
larger issues of internationalization and is beyond the scope
of this paper and our expertise.
Plan 9
.CW sort
works sensibly with options that make sense relative to the input.
The simple and most important options are, however, usually meaningful.
In particular,
.CW sort
sorts UTF into the same order that
.CW look
expects.
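.PP
The option-free case can be sketched in portable C.
The names
.CW sketch_cmp
and
.CW sketch_sort
are hypothetical and the strings are assumed to be valid UTF-8.
Because
.CW strcmp
compares bytes as unsigned values and UTF-8 preserves rune order in its byte order,
no decoding is needed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: order strings by their bytes, as strcmp does. */
static int
sketch_cmp(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcmp(*sa, *sb);
}

/* Sort an array of n valid UTF-8 strings into rune-value order. */
static void
sketch_sort(const char **v, size_t n)
{
    assert(v != NULL);
    qsort(v, n, sizeof v[0], sketch_cmp);
}
```

Sorting the four strings for 世, z, α and a this way puts them in the order a, z, α, 世,
which is also the order of their rune values 61, 7A, 3B1 and 4E16.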
.PP
Regular expression-matching algorithms need rethinking to
be applied to UTF text.
Deterministic automata are usually applied to bytes;
converting them to operate on variable-sized byte sequences is awkward.
On the other hand, converting the input stream to runes adds measurable
expense
and the state tables expand
from size 256 to 1,114,112; it can be expensive just to generate them.
For simple string searching,
the Boyer-Moore algorithm works with UTF provided the input is
guaranteed to be only valid UTF strings; however, it does not work
with the old UTF encoding.
At a more mundane level, even character classes are harder:
the usual bit-vector representation within a non-deterministic automaton
is unwieldy with 1,114,112 characters in the alphabet.
.PP
We compromised.
An existing library for compiling and executing regular expressions
was adapted to work on runes, with two entry points for searching
in arrays of runes and arrays of chars (the pattern is always UTF text).
Character classes are represented internally as runs of runes;
the reserved value
.CW FFFF
marks the end of the class.
Then
.I all
utilities that use regular expressions\(emeditors,
.CW grep ,
.CW awk ,
etc.\(emexcept the shell, whose notation
was grandfathered, were converted to use the library.
For some programs, there was a concomitant loss of performance,
but there was also a strong advantage.
To our knowledge, Plan 9 is the only Unix-like system
that has a single definition and implementation of
regular expressions; patterns are written and interpreted
identically by all the programs in the system.
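.PP
The class representation can be pictured with a short sketch in portable C.
The layout below, pairs of first and last runes followed by the end marker, is an assumption made for illustration;
the names
.CW sketch_inclass
and
.CW SketchClassEnd
are hypothetical and the regular expression library's own structures may differ.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long Rune;        /* assumption: at least 21 bits */

enum { SketchClassEnd = 0xFFFF };  /* end marker, the reserved value from the text */

/*
 * A class is a sequence of (first, last) rune pairs followed by
 * SketchClassEnd.  Report whether r lies in any of the runs.
 */
static int
sketch_inclass(const Rune *cls, Rune r)
{
    const Rune *p;

    assert(cls != NULL);
    for (p = cls; p[0] != SketchClassEnd; p += 2)
        if (p[0] <= r && r <= p[1])
            return 1;
    return 0;
}
```

A class such as the digits, the lower-case ASCII letters and the lower-case Greek letters then costs seven array elements,
where a bit vector would need one bit for every rune in the alphabet.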
.PP
A handful of programs have the notion of character built into them
so strongly as to confuse the issue of what they should do with UTF input.
Such programs were treated as individual special cases.
For example,
.CW wc
is, by default, unchanged in behavior and output; a new option,
.CW -r ,
counts the number of correctly encoded runes\(emvalid UTF sequences\(emin
its input;
.CW -b
the number of invalid sequences.
.PP
It took us several months to convert all the software in the system
to the Unicode Standard and the old UTF.
When we decided to convert from that to the new UTF,
only three things needed to be done.
First, we rewrote the library routines to encode and decode the
new UTF.  This took an evening.
Next, we converted all the files containing UTF
to the new encoding.
We wrote a trivial program to look for non-ASCII bytes in
text files and used a Plan 9 program called
.CW tcs
(translate character set) to change encodings.
Finally, we recompiled all the system software;
the library interface was unchanged, so recompilation was sufficient
to effect the transformation.
The last two steps were done concurrently and took an afternoon.
We concluded that the actual encoding is relatively unimportant to the
software; the adoption of large characters and a byte-stream encoding
.I per
.I se
are much deeper issues.
.SH
Graphics and fonts
.PP
Plan 9 provides only minimal support for plain text terminals.
It is instead designed to be used with all character input and
output mediated by a window system such as
.CW 8½ .
The window system and related software are responsible for the
display of UTF text as Unicode character images.
For plain text, the window system must provide a user-settable
.I font
that provides a (possibly empty) picture for each Unicode character.
Fancier applications that use bold and Italic characters
need multiple fonts storing multiple pictures for each
Unicode value.
All the issues are apparent, though,
in just the problem of
displaying a single image for each character, that is, the
Unicode equivalent of a plain text terminal.
With 128 or even 256 characters, a font can be just
an array of bitmaps.  With 1,114,112 characters,
a more sophisticated design is necessary.  To store the ideographs
for just Japanese as 16×16×1 bit images,
the smallest they can reasonably be, takes over a quarter of a
megabyte.  Make the images a little larger, store more bits per
pixel, and hold a copy in every running application, and the
memory cost becomes unreasonable.
.PP
The structure of the bitmap graphics services is described at length elsewhere
[Pike91].
In summary, the memory holding the bitmaps resides in the same machine that has
the display, mouse, and keyboard: the terminal in Plan 9 terminology,
the workstation in others'.
Access to that memory and associated services is provided
by device files served by system
software on the terminal.  One of those files,
.CW /dev/bitblt ,
interprets messages written to it as requests for actions
corresponding to entry points in the graphics library:
allocate a bitmap, execute a raster operation, draw a text string, etc.
The window system
acts as a multiplexer that mediates access to the services
and resources of the terminal by simulating in each client window
a set of files mirroring those provided by the system.
That is, each window has a distinct
.CW /dev/mouse ,
.CW /dev/bitblt ,
and so on through which applications drive graphical
input and output.
.PP
One of the resources managed by
.CW 8½
and the terminal is the set of active
.I subfonts.
Each subfont holds the
bitmaps and associated data structures for a sequential set of Unicode
characters.
Subfonts are stored in files and loaded into the terminal by
.CW 8½
or an application.
For example, one subfont
might hold the images of the first 256 characters of the Unicode space,
corresponding to the Latin-1 character set;
another might hold the standard phonetic character set, Unicode characters
with value 0250 to 02E9.
These files are collected in directories corresponding to typefaces:
.CW /lib/font/bit/pelm
contains the Pellucida Monospace character set, with subfonts holding
the Latin-1, Greek, Cyrillic and other components of the typeface.
A suffix on subfont files encodes (in a subfont-specific
way) the size of the images:
.CW /lib/font/bit/pelm/latin1.9
contains the Latin-1 Pellucida Monospace characters with lower
case letters 9 pixels high;
.CW /lib/font/bit/jis/jis5400.16
contains 16-pixel high
ideographs starting at Unicode value 5400.
.PP
The subfonts do not identify which portion of the Unicode space
they cover.  Instead, a
font file, in plain text,
describes how to assemble subfonts into a complete
character set.
The font file is presented as an argument to the window system
to determine how plain text is displayed in text windows and
applications.
Here is the beginning of the font file
.CW /lib/font/bit/pelm/jis.9.font ,
which describes the layout of a font covering that portion of
the Unicode Standard for which we have characters of typical
display size, using Japanese characters
to cover the Han space:
.P1
18	14
0x0000	0x00FF	latin1.9
0x0100	0x017E	latineur.9
0x0250	0x02E9	ipa.9
0x0386	0x03F5	greek.9
0x0400	0x0475	cyrillic.9
0x2000	0x2044	../misc/genpunc.9
0x2070	0x208E	supsub.9
0x20A0	0x20AA	currency.9
0x2100	0x2138	../misc/letterlike.9
0x2190	0x21EA	../misc/arrows
0x2200	0x227F	../misc/math1
0x2280	0x22F1	../misc/math2
0x2300	0x232C	../misc/tech
0x2500	0x257F	../misc/chart
0x2600	0x266F	../misc/ding
.P2
.P1
0x3000	0x303f	../jis/jis3000.16
0x30a1	0x30fe	../jis/katakana.16
0x3041	0x309e	../jis/hiragana.16
0x4e00	0x4fff	../jis/jis4e00.16
0x5000	0x51ff	../jis/jis5000.16
\&...
.P2
The first two numbers set the interline spacing of the font (18
pixels) and the distance from the baseline to the top of the
line (14 pixels).
When characters are displayed, they are placed so as best
to fit within those constraints; characters
too large to fit will be truncated.
The rest of the file associates subfont files
with portions of Unicode space.
Files named without a directory prefix, such as the first five,
are in the Pellucida Monospace typeface
and directory; others reside in other directories.  The file names
are relative to the font file's own location.
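.PP
Reading such a file takes little code.
The portable C below is a sketch based only on the layout shown above, not the graphics library's parser:
.CW Fontdesc ,
.CW Subrange ,
.CW sketch_parsefont
and
.CW sketch_subfontfor
are hypothetical names, the fixed table sizes are arbitrary,
and the file is assumed to have already been read into a string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { SketchMaxsub = 64, SketchNamelen = 128 };   /* arbitrary limits */

typedef struct Subrange Subrange;
struct Subrange {
    long lo, hi;                    /* inclusive range of rune values */
    char name[SketchNamelen];       /* subfont file, relative to the font file */
};

typedef struct Fontdesc Fontdesc;
struct Fontdesc {
    int height;                     /* interline spacing, pixels */
    int ascent;                     /* baseline to top of line, pixels */
    int nsub;                       /* number of entries used in sub */
    Subrange sub[SketchMaxsub];
};

/*
 * Parse font-file text: a "height ascent" line followed by
 * "lo hi subfont" lines.  Returns 0 on success, -1 on a bad or
 * missing header.  Later lines that do not parse are skipped.
 */
static int
sketch_parsefont(Fontdesc *f, const char *text)
{
    char line[256];
    int first = 1;

    assert(f != NULL && text != NULL);
    memset(f, 0, sizeof *f);
    while (*text != '\0') {
        size_t n = strcspn(text, "\n");
        size_t m = n < sizeof line - 1 ? n : sizeof line - 1;

        memcpy(line, text, m);      /* copy one line, truncating if too long */
        line[m] = '\0';
        text += n;
        if (*text == '\n')
            text++;
        if (first) {
            if (sscanf(line, "%d %d", &f->height, &f->ascent) != 2)
                return -1;
            first = 0;
        } else if (f->nsub < SketchMaxsub) {
            Subrange *s = &f->sub[f->nsub];

            /* %li infers the base from the prefix, so 0x... reads as hex */
            if (sscanf(line, "%li %li %127s", &s->lo, &s->hi, s->name) == 3)
                f->nsub++;
        }
    }
    return first ? -1 : 0;
}

/* Return the subfont file covering rune r, or NULL if none does. */
static const char *
sketch_subfontfor(const Fontdesc *f, long r)
{
    int i;

    assert(f != NULL);
    for (i = 0; i < f->nsub; i++)
        if (f->sub[i].lo <= r && r <= f->sub[i].hi)
            return f->sub[i].name;
    return NULL;
}
```

The linear search mirrors the order of the file; a rune that falls in no listed range simply has no picture.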
.PP
There are several advantages to this two-level structure.
First, it simultaneously breaks the huge Unicode space into manageable
components and provides a unifying architecture for
assembling fonts from disjoint pieces.
Second, the structure promotes sharing.
For example, we have only one set of Japanese
characters but dozens of typefaces for the Latin-1 characters,
and this structure permits us to store only one copy of the
Japanese set but use it with any Roman typeface.
Also, customization is easy.
English-speaking users who don't need Japanese characters
but may want to read an on-line Oxford English Dictionary can
assemble a custom font with the
Latin-1 (or even just ASCII) characters and the International
Phonetic Alphabet (IPA).
Moreover, to do so requires just editing a plain text file,
not using a special font editing tool.
Finally, the structure guides the design of
caching protocols to improve performance and memory usage.
.PP
To load a complete Unicode character set into each application
would consume too
much memory and, particularly on slow terminal lines, would take
unreasonably long.
Instead, Plan 9 assembles a multi-level cache structure for
each font.
An application opens a font file, reads and parses it,
and allocates a data structure.
A message written to
.CW /dev/bitblt
allocates an associated structure held in the terminal: in particular,
a bitmap to act as a cache
for recently used character images.
Other messages draw text on bitmaps such as the screen:
characters are loaded from subfonts into the cache on demand
and copied from there to the destination bitmap.
The protocol to draw characters is in terms of cache indices,
not Unicode character numbers or UTF sequences.
These details are hidden from the application, which instead
sees only a subroutine to draw a string in a bitmap from a
given font, functions to discover character size information,
and routines to allocate and to free fonts.
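.PP
The idea of drawing by cache index can be sketched as follows.
This is only an illustration in portable C: the names
.CW Ccache ,
.CW cacheinit ,
and
.CW cachelookup
are hypothetical, the cache is tiny and of fixed size, and
least-recently-used eviction is an assumption rather than a
description of the actual replacement policy.
A real implementation would also load the character image from
its subfont on a miss.

```c
enum { NCACHE = 4 };	/* deliberately tiny, for illustration only */

typedef struct Cacheent {
	long rune;		/* character value in this slot, -1 if empty */
	unsigned long age;	/* clock value when the slot was last used */
} Cacheent;

typedef struct Ccache {
	Cacheent ent[NCACHE];
	unsigned long clock;	/* incremented on every lookup */
	unsigned long nmiss;	/* number of lookups that missed */
} Ccache;

void
cacheinit(Ccache *c)
{
	int i;

	c->clock = 0;
	c->nmiss = 0;
	for(i = 0; i < NCACHE; i++){
		c->ent[i].rune = -1;
		c->ent[i].age = 0;
	}
}

/*
 * Return the cache index for character r.  On a miss, reuse the
 * least recently used slot (assumed policy) and count the miss.
 */
int
cachelookup(Ccache *c, long r)
{
	int i, victim = 0;

	c->clock++;
	for(i = 0; i < NCACHE; i++){
		if(c->ent[i].rune == r){
			c->ent[i].age = c->clock;	/* hit */
			return i;
		}
		if(c->ent[i].age < c->ent[victim].age)
			victim = i;
	}
	/* miss: the image would be loaded from a subfont here */
	c->nmiss++;
	c->ent[victim].rune = r;
	c->ent[victim].age = c->clock;
	return victim;
}
```

.PP
A string-drawing routine built on such a cache would translate each
character to an index with the lookup routine and send only the
indices to the terminal.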
.PP
As needed, whole
subfonts are opened by the graphics library, read, and then downloaded
to the terminal.
They are held open by the library in an LRU-replacement list.
Even when the program closes a subfont, it is retained
in the terminal for later use.
When an application opens a subfont, it first asks the terminal
whether it already has a copy, to avoid reading it from the file
server if possible.
This level of cache has the property that the bitmaps for, say,
all the Japanese characters are stored only once, in the terminal;
the applications read only size and width information from the terminal
and share the images.
.PP
The sizes of the character and subfont caches held by the
application are adaptive.
A simple algorithm monitors the cache miss rate to enlarge and
shrink the caches as required.
The size of the character cache is limited to 2048 images,
which in practice seems enough even for Japanese text.
For plain ASCII-like text it naturally stays around 128 images.
.PP
This mechanism sounds complicated but is implemented by only about
500 lines in the library and considerably fewer in each of the
terminal's graphics driver and
.CW 8½ .
It has the advantage that only characters that are
being used are loaded into memory.
It is also efficient: if the characters being drawn
are in the cache the extra overhead is negligible.
It works particularly well for alphabetic character sets,
but also adapts on demand for ideographic sets.
When a user first looks at Japanese text, it takes a few
seconds to read all the font data, but thereafter the
text is drawn almost as fast as regular text (the images
are larger, so they draw a little more slowly).
Also, because the bitmaps are remembered by the terminal,
if a second application then looks at Japanese text
it starts faster than the first.
.PP
We considered
building a `font server'
to cache character images and associated data
for the applications, the window system, and the terminal.
We rejected this design because, although it isolated
many of the problems of font management in a separate program,
it didn't simplify the applications.
Moreover, in a distributed system such as Plan 9 it is easy
to have too many special purpose servers.
Making the management of the fonts the concern of only
the essential components simplifies the system and makes
bootstrapping less intricate.
.SH
Input
.PP
A completely different problem is how to type Unicode characters
as input to the system.
We selected an unused key on our ASCII keyboards
to serve as a prefix for multi-keystroke
sequences that generate Unicode characters.
For example, the character
.CW ü
is generated by the prefix key
(typically
.CW ALT
or
.CW Compose )
followed by a double quote and a lower-case
.CW u .
When that character is read by the application, from the file
.CW /dev/cons ,
it is of course presented as its UTF encoding.
Such sequences generate characters from an arbitrary set that
includes all of Latin-1 plus a selection of mathematical
and technical characters.
An arbitrary Unicode character may be generated by typing the prefix,
an upper-case X, and four hexadecimal digits that identify
the Unicode value.
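.PP
A minimal sketch of such a translation, in portable C, might look
like the following.
The table holds only the two sequences mentioned in this paper
(double quote then u, and 1 then 2), and the names
.CW Compose ,
.CW composetab ,
and
.CW compose
are hypothetical.
The sketch also assumes the whole sequence after the prefix key is
available as a string, whereas a keyboard driver would consume
keystrokes one at a time.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef struct Compose {
	const char *keys;	/* keystrokes typed after the prefix key */
	long rune;		/* Unicode value generated */
} Compose;

/* Illustrative table; a real one would cover all of Latin-1 and more. */
static const Compose composetab[] = {
	{ "\"u", 0x00FC },	/* u with diaeresis */
	{ "12",  0x00BD },	/* one half */
};

/*
 * Translate the keys typed after the prefix into a Unicode value.
 * Returns -1 if the sequence is not recognized.
 */
long
compose(const char *keys)
{
	size_t i;

	if(keys[0] == 'X'){
		/* X followed by exactly four hexadecimal digits */
		if(strlen(keys) != 5)
			return -1;
		for(i = 1; i < 5; i++)
			if(!isxdigit((unsigned char)keys[i]))
				return -1;
		return strtol(keys+1, NULL, 16);
	}
	for(i = 0; i < sizeof composetab / sizeof composetab[0]; i++)
		if(strcmp(keys, composetab[i].keys) == 0)
			return composetab[i].rune;
	return -1;
}
```

.PP
The value returned would then be converted to UTF before being
presented to the application.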
.PP
These simple mechanisms are adequate for most of our day-to-day needs:
it's easy to remember to type `ALT 1 2' for ½\^ or `ALT accent letter'
for accented Latin letters.
For the occasional unusual character, the cut and paste features of
.CW 8½
serve well.  A program called (perhaps misleadingly)
.CW unicode
takes as argument a hexadecimal value, and prints the UTF representation of that character,
which may then be picked up with the mouse and used as input.
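.PP
The conversion such a program performs can be sketched in portable C.
This is not the source of the
.CW unicode
program; the names
.CW utfencode
and
.CW hextoutf
are hypothetical, and the sketch assumes the modern UTF-8 encoding of
values up to 0x10FFFF without checking for surrogate values.

```c
#include <stdlib.h>

/*
 * Encode character value r as UTF-8 into buf, which must hold at
 * least 4 bytes.  Returns the number of bytes written, or 0 if r
 * is out of range.
 */
int
utfencode(long r, unsigned char *buf)
{
	if(r < 0 || r > 0x10FFFF)
		return 0;
	if(r < 0x80){
		buf[0] = (unsigned char)r;
		return 1;
	}
	if(r < 0x800){
		buf[0] = (unsigned char)(0xC0 | (r >> 6));
		buf[1] = (unsigned char)(0x80 | (r & 0x3F));
		return 2;
	}
	if(r < 0x10000){
		buf[0] = (unsigned char)(0xE0 | (r >> 12));
		buf[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (r & 0x3F));
		return 3;
	}
	buf[0] = (unsigned char)(0xF0 | (r >> 18));
	buf[1] = (unsigned char)(0x80 | ((r >> 12) & 0x3F));
	buf[2] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
	buf[3] = (unsigned char)(0x80 | (r & 0x3F));
	return 4;
}

/*
 * Convert a hexadecimal string such as "263a" to UTF-8 bytes.
 * Returns the byte count, or 0 if the string is not a valid value.
 */
int
hextoutf(const char *hex, unsigned char *buf)
{
	char *end;
	long r;

	r = strtol(hex, &end, 16);
	if(end == hex || *end != '\0')
		return 0;
	return utfencode(r, buf);
}
```

.PP
A command built around these routines would write the resulting bytes
to standard output, where they could be selected with the mouse.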
.PP
These methods
are clearly unsatisfactory when working in a non-English language.
In the native country of such a language
the appropriate keyboard is likely to be at hand.
But it's also reasonable\(emespecially now that the system handles Unicode characters\(emto
work in a language foreign to the keyboard.
.PP
For alphabetic languages such as Greek or Russian, it is
straightforward to construct a program that does phonetic substitution,
so that, for example, typing a Latin `a' yields the Greek `α'.
Within Plan 9, such a program can be inserted transparently
between the real keyboard and a program such as the window system,
providing a manageable input device for such languages.
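.PP
The core of a phonetic substitution filter could be as small as the
following sketch in portable C.
Only the mapping from `a' to alpha comes from the example above; the
other three table entries, and the names
.CW phonetic
and
.CW greekfilter ,
are assumptions made for illustration.
The sketch works on strings in memory; an actual filter would sit
between the keyboard and the window system and translate characters
as they arrive.

```c
#include <stddef.h>

/* Latin key to Greek letter; entries other than 'a' are illustrative. */
static const struct {
	char latin;
	unsigned short greek;
} phonetic[] = {
	{ 'a', 0x03B1 },	/* alpha */
	{ 'b', 0x03B2 },	/* beta */
	{ 'g', 0x03B3 },	/* gamma */
	{ 'd', 0x03B4 },	/* delta */
};

/*
 * Copy the ASCII string in to out, replacing each mapped letter by
 * the UTF-8 encoding of its Greek counterpart.  out must have room
 * for 2*strlen(in)+1 bytes.  Returns the number of bytes written,
 * not counting the terminating NUL.
 */
size_t
greekfilter(const char *in, char *out)
{
	size_t i, n = 0;

	for(; *in != '\0'; in++){
		unsigned r = 0;

		for(i = 0; i < sizeof phonetic / sizeof phonetic[0]; i++)
			if(phonetic[i].latin == *in){
				r = phonetic[i].greek;
				break;
			}
		if(r == 0)
			out[n++] = *in;		/* unmapped: pass through */
		else{
			/* Greek letters lie below 0x800: two-byte encoding */
			out[n++] = (char)(0xC0 | (r >> 6));
			out[n++] = (char)(0x80 | (r & 0x3F));
		}
	}
	out[n] = '\0';
	return n;
}
```

.PP
Because the table is ordinary data, supporting another alphabetic
language would mostly be a matter of supplying a different table.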
.PP
For ideographic languages such as Chinese or Japanese the problem is harder.
Native users of such languages have adopted methods for dealing with
Latin keyboards that involve a hybrid technique based on phonetics
to generate a list of possible symbols followed by menu selection to
choose the desired one.
Such methods can be
effective, but their design must be rooted in information about
the language unknown to non-native speakers.
.CW Cxterm , (
a Chinese terminal emulator built by and for
Chinese programmers,
employs such a technique
[Pong and Zhang].)
Although the technical problem of implementing such a device
is easy in Plan 9\(emit is just an elaboration of the technique for
alphabetic languages\(emour lack of familiarity with such languages
has restrained our enthusiasm for building one.
.PP
The input problem is technically the least interesting but perhaps
emotionally the most important of the problems of converting a system
to an international character set.
Beyond that remain the deeper problems of internationalization
such as multi-lingual error messages and command names,
problems we are not qualified to solve.
With the ability to treat text of most languages on an equal
footing, though, we can begin down that path.
Perhaps people in non-English speaking countries will
consider adopting Plan 9, solving the input problem locally\(emperhaps
just by plugging in their local terminals\(emand begin to use
a system with at least the capacity to be international.
.SH
Acknowledgements
.PP
Dennis Ritchie provided consultation and encouragement.
Bob Flandrena converted most of the standard tools to UTF.
Brian Kernighan suffered cheerfully with several
inadequate implementations and converted
.CW troff
to UTF.
Rich Drechsler converted his PostScript driver to UTF.
John Hobby built the PostScript ☺.
We thank them all.
.SH
References
.LP
[ANSIC] \f2American National Standard for Information Systems \-
Programming Language C\f1, American National Standards Institute, Inc.,
New York, 1990.
.LP
[ISO10646]
ISO/IEC DIS 10646-1:1993
\f2Information technology \-
Universal Multiple-Octet Coded Character Set (UCS) \(em
Part 1: Architecture and Basic Multilingual Plane\fP.
.LP
[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
``Plan 9 from Bell Labs'',
UKUUG Proc. of the Summer 1990 Conf.,
London, England,
1990.
.LP
[Pike91] R. Pike, ``8½, The Plan 9 Window System'', USENIX Summer
Conf. Proc., Nashville, 1991, reprinted in this volume.
.LP
[Pike92] R. Pike, ``How to Use the Plan 9 C Compiler'', this volume.
.LP
[Pong and Zhang] Man-Chi Pong and Yongguang Zhang, ``cxterm:
A Chinese Terminal Emulator for the X Window System'',
.I
Software\(emPractice and Experience,
.R
Vol 22(10), 809-826, October 1992.
.LP
[Unicode]
\f2The Unicode Standard,
Worldwide Character Encoding,
Version 1.0, Volume 1\f1,
The Unicode Consortium,
Addison Wesley,
New York,
1991.