xref: /plan9-contrib/sys/doc/utf.ms (revision 219b2ee8daee37f4aad58d63f21287faa8e4ffdc)
1.TL
2Hello World
3.br
4or
5.br
6.ft R
7Καλημέρα κόσμε
8.ft
9.br
10or
11.br
12\f(Jpこんにちは 世界\fP
13.AU
14Rob Pike
15Ken Thompson
16.sp
17rob,ken@plan9.att.com
18.AB
19.FS
20Originally appeared, in a slightly different form, in
21.I
22Proc. of the Winter 1993 USENIX Conf.,
23.R
24pp. 43-50,
25San Diego
26.FE
27Plan 9 from Bell Labs has recently been converted from ASCII
28to an ASCII-compatible variant of the Unicode Standard, a 16-bit character set.
29In this paper we explain the reasons for the change,
30describe the character set and representation we chose,
31and present the programming models and software changes
32that support the new text format.
33Although we stopped short of full internationalization\(emfor
34example, system error messages are in Unixese, not Japanese\(emwe
35believe Plan 9 is the first system to treat the representation
36of all major languages on a uniform, equal footing throughout all its
37software.
38.AE
39.SH
40Introduction
41.PP
42The world is multilingual but most computer systems
43are based on English and ASCII.
44The first release of Plan 9 [Pike90], a new distributed operating
45system from Bell Laboratories, seemed a good occasion
46to correct this chauvinism.
47It is easier to make such deep changes when building new systems than
48by refitting old ones.
49.PP
50The ANSI C standard [ANSIC] contains some guidance on the matter of
51`wide' and `multi-byte' characters but falls far short of
52solving the myriad associated problems.
53We could find no literature on how to convert a
54.I system
55to larger character sets, although some individual
56.I programs
57had been converted.
58This paper reports what we discovered as we
59explored the problem of representing multilingual
60text at all levels of an operating system,
61from the file system and kernel through
62the applications and up to the window system
63and display.
64.PP
65Plan 9 has not been `internationalized':
66its manuals are in English,
67its error messages are in English,
68and it can display text that goes from left to right only.
69But before we can address these other problems,
70we need to handle, uniformly and comfortably,
71the textual representation of all the major written languages.
72That subproblem is richer than we had anticipated.
73.SH
74Standards
75.PP
76Our first step was to select a standard.
77At the time (January 1992),
78there were only two viable options:
79ISO 10646 [ISO10646] and Unicode [Unicode].
80The documents describing both proposals were still in the draft stage.
81.PP
82The draft of ISO 10646 was not
83very attractive to us.
84It defined a sparse set of 32-bit characters,
85which would be
86hard to implement
87and have punitive storage requirements.
88Also, the draft attempted to
89mollify national interests by allocating
9016-bit subspaces to national committees
91to partition individually.
92The suggested mode of use was to
93``flip'' between separate national
94standards to implement the international standard.
95This did not strike us as a sound basis for a character set.
96As well, transmitting 32-bit values in a byte stream,
97such as in pipes, would be expensive and hard to implement.
98Since the standard does not define a byte order for such
99transmission, the byte stream would also have to carry
100state to enable the values to be recovered.
101.PP
102The Unicode Standard is a proposal by a consortium of mostly American
103computer companies formed
104to protest the technical
105failings of ISO 10646.
106It defines a uniform 16-bit code based on the
107principle of unification:
108two characters are the same if they look the
109same even though they are from different
110languages.
111This principle, called Han unification,
112allows the large Japanese, Chinese, and Korean
113character sets to be packed comfortably into a 16-bit representation.
114.PP
115We chose the Unicode Standard for its technical merits and because its
116code space was better defined.
117Moreover,
118the Unicode Consortium was derailing the
119ISO 10646 standard.
120(Now, in 1995,
121ISO 10646 is a standard
122with one 16-bit group defined,
123which is almost exactly the Unicode Standard.
124As most people expected, the two standards bodies
125reached a détente and
126ISO 10646 and Unicode represent the same character set.)
127.PP
128The Unicode Standard defines an adequate character set
129but an unreasonable representation.
130It states that all characters
131are 16 bits wide and are communicated and stored in
13216-bit units.
133It also reserves a pair of characters
134(hexadecimal FFFE and FEFF) to detect byte order
135in transmitted text, requiring state in the byte stream.
136(The Unicode Consortium was thinking of files, not pipes.)
137To adopt this encoding,
138we would have had to convert all text going
139into and out of Plan 9 between ASCII and Unicode, which cannot be done.
140Within a single program, in command of all its input and output,
141it is possible to define characters as 16-bit quantities;
142in the context of a networked system with
143hundreds of applications on diverse machines
144by different manufacturers,
145it is impossible.
146.PP
147We needed a way to adapt the Unicode Standard to the tools-and-pipes
148model of text processing embodied by the Unix system.
149To do that, we
150needed an ASCII-compatible textual
151representation of Unicode characters for transmission
152and storage.
153In the draft ISO standard there was an informative
154(non-required)
155Annex
156called UTF
157that provided a byte stream encoding
158of the 32-bit ISO code.
159The encoding uses multibyte sequences composed
160from the 190 printable characters of Latin-1
161to represent character values larger
162than 159.
163.PP
164The UTF encoding has several good properties.
165By far the most important is that
166a byte in the ASCII range 0-127 represents
167itself in UTF.
168Thus UTF is backward compatible with ASCII.
169.PP
170UTF has other advantages.
171It is a byte encoding and is
172therefore byte-order independent.
173ASCII control characters appear in the byte stream
174only as themselves, never as an element of a sequence
175encoding another character,
176so newline bytes separate lines of UTF text.
177Finally, ANSI C's
178.CW strcmp
179function applied to UTF strings preserves the ordering of Unicode characters.
180.PP
181To encode and decode UTF is expensive (involving multiplication,
182division, and modulo operations) but workable.
183UTF's major disadvantage is that the encoding
184is not self-synchronizing.
185It is in general impossible to find the character
186boundaries in a UTF string without reading from
187the beginning of the string, although in practice
188control characters such as newlines,
189tabs, and blanks provide synchronization points.
190.PP
191In August 1992,
192X-Open circulated a proposal for another UTF-like
193byte encoding of Unicode characters.
194Their major concern was that an embedded character
195in a file name
196(in particular a slash)
197could be part of an escape sequence in UTF and
198therefore confuse a traditional file system.
199Their proposal would allow all 7-bit ASCII characters
200to represent themselves
201.I "and only themselves"
202in text.
203Multibyte sequences would contain only characters
204with the high bit set.
205We proposed a modification to the new UTF that
206would address our synchronization problem.
207Our proposal, which was  originally known informally as UTF-2 and FSS-UTF,
208is now referred to as UTF-8 and has been approved by ISO to become
209Annex P to ISO 10646.
210.PP
211The model for text in Plan 9 is chosen from these
212three standards*:
213.FS
214* ``That's the nice thing about standards\(emthere's so many to choose from.'' \- Andy Tannenbaum (no, the other one)
215.FE
216the Unicode character set encoded as a byte stream by
217UTF-8, from
218(soon to be) Annex P of ISO 10646.
219Although this mixture may seem like a precarious position for us to adopt,
220it is not as bad as it sounds.
221ISO 10646 and the Unicode Standard have converged,
222other systems such as Linux have adopted the same character set and encoding,
223and the general feeling seems to be that Unicode and UTF-8 will be accepted as the way
224to exchange text between systems.
225The prognosis for wide acceptance is good.
226.PP
227There are a couple of aspects of the Unicode Standard we have not faced.
228One is the issue of right-to-left text such as Hebrew or Arabic.
229Since that is an issue of display, not representation, we believe
230we can defer that problem for the moment without affecting our
231ability to solve it later.
232Another issue is diacriticals and `combining characters',
233which cause overstriking of multiple Unicode characters.
234Although necessary for some scripts, such as Thai, Arabic, and Hebrew,
235such characters confuse the issues for Latin languages because they
236generate multiple representations for accented characters.
237ISO 10646 describes three levels of implementation;
238in Plan 9 we decided not to address the issue.
239Again, this can be labeled as a display issue and its finer points are still being debated,
240so we felt comfortable deferring.  Mañana.
241.PP
242Although we converted Plan 9 in the altruistic interests of
243serving foreign languages, we have found the large character
244set attractive for other reasons.  The Unicode Standard includes many
245characters\(emmathematical symbols, scientific notation,
246more general punctuation, and more\(emthat we now use
247daily in our work.  We no longer test our imaginations
248to find ways to include non-ASCII symbols in our text;
249why type
250.CW :-)
251when you can use the character ☺?
252Most compelling is the ability to absorb documents
253and data that contain non-ASCII characters; our browser for the
254Oxford English Dictionary
255lets us see the dictionary as it really is, with pronunciation
256in the IPA font, foreign phrases properly rendered, and so on,
257.I "in plain text.
258.PP
259In the rest of this paper, except when
260stated otherwise, the term `UTF' refers to the UTF-8 encoding
261of Unicode characters as adopted by Plan 9.
262.SH
263C Compiler
264.PP
265The first program to be converted to UTF
266was the C Compiler.
267There are two levels of conversion.
268On the syntactic level,
269input to the C compiler
270is UTF; on the semantic level,
271the C language needs to define
272how compiled programs manipulate
273the UTF set.
274.PP
275The syntactic part is simple.
276The ANSI C language standard defines the
277source character set to be ASCII.
278Since UTF is backward compatible with ASCII,
279the compiler needs little change.
280The only places where a larger character set
281is allowed are in character constants, strings, and comments.
282Since 7-bit ASCII characters can represent only
283themselves in UTF,
284the compiler does not have to be careful while looking
285for the termination of a string or comment.
286.PP
287The Plan 9 compiler extends ANSI C to treat any Unicode
288character with a value outside of the ASCII range as
289an alphabetic.
290To a Greek programmer or an English mathematician,
291α is a sensible and now valid variable name.
292.PP
293On the semantic level, ANSI C allows,
294but does not tie down,
295the notion of a
296.I "wide character
297and admits string and character constants
298of this type.
299We chose the wide character type to be
300.CW unsigned
301.CW short .
302In the libraries, the word
303.CW Rune
304is defined by a
305.CW typedef
306to be equivalent to
307.CW unsigned
308.CW short
309and is
310used to signify a Unicode character.
311.PP
312There are surprises; for example:
313.P1
314L'x'	\f1is 120\fP
315\&'x'	\f1is 120\fP
316L'ÿ'	\f1is 255\fP
317\&'ÿ'	\f1is -1, stdio \fPEOF\f1 (if \fPchar\f1 is signed)\fP
318L'\f1α\fP'	\f1is 945\fP
319\&'\f1α\fP'	\f1is illegal\fP
320.P2
321In the string constants,
322.P1
323"\f(Jpこんにちは 世界\fP"
324L"\f(Jpこんにちは 世界\fP",
325.P2
326the former is an array of
327.CW chars
328with 22 elements
329and a null byte,
330while the latter is an array of
331.CW unsigned
332.CW shorts
333.CW Runes ) (
334with 8 elements and a null
335.CW Rune .
336.PP
337The Plan 9 library provides an output conversion function,
338.CW print
339(analogous to
340.CW printf ),
341with formats
342.CW %c ,
343.CW %C ,
344.CW %s ,
345and
346.CW %S .
347Since
348.CW print
349produces text, its output is always UTF.
350The character conversion
351.CW %c
352(lower case) masks its argument
353to 8 bits before converting to UTF.
354Thus
355.CW L'ÿ'
356and
357.CW 'ÿ'
358printed under
359.CW %c
360will be identical,
361but
362.CW L'\f1α\fP'
363will print as the Unicode
364character with decimal value 177.
365The character conversion
366.CW %C
367(upper case) masks its argument
368to 16 bits before converting to UTF.
369Thus
370.CW L'ÿ'
371and
372.CW L'\f1α\fP'
373will print correctly under
374.CW %C ,
375but
376.CW 'ÿ'
377will not.
378The conversion
379.CW %s
380(lower case)
381expects a pointer to
382.CW char
383and copies UTF sequences up to a null byte.
384The conversion
385.CW %S
386(upper case) expects a pointer to
387.CW Rune
388and
389performs sequential
390.CW %C
391conversions until a null
392.CW Rune
393is encountered.
394.PP
395Another problem in format conversion
396is the definition of
397.CW %10s :
398does the number refer to bytes or characters?
399We decided that such formats were most
400often used to align output columns and
401so made the number count characters.
402Some programs, however, use the count
403to place blank-padded strings
404in fixed-sized arrays.
405These programs must be found and corrected.
406.PP
407Here is a complete example:
408.P1
409#include <u.h>
410
411char c[] = "\f(Jpこんにちは 世界\fP";
412Rune s[] = L"\f(Jpこんにちは 世界\fP";
413
414main(void)
415{
416	print("%d, %d\en", sizeof(c), sizeof(s));
417	print("%s\en", c);
418	print("%S\en", s);
419}
420.P2
421.PP
422This program prints
423.CW 23,
424.CW 18
425and then two identical lines of
426UTF text.
427In practice,
428.CW %S
429and
430.CW L"..."
431are rare in programs; one reason is
432that most formatted I/O is done in unconverted UTF.
433.SH
434Ramifications
435.PP
436All programs in Plan 9 now read and write text as UTF, not ASCII.
437This change breaks two deep-rooted symmetries implicit in most C programs:
438.IP 1.
439A character is no longer a
440.CW char .
441.IP 2.
442The internal representation (Rune) of a character now differs from its
443external representation (UTF).
444.PP
445In the sections that follow,
446we show how these issues were faced in the layers of
447system software from the operating system up to the applications.
448The effects are wide-reaching and often surprising.
449.SH
450Operating system
451.PP
452Since UTF is the only format for text in Plan 9,
453the interface to the operating system had to be converted to UTF.
454Text strings cross the interface in several places:
455command arguments,
456file names,
457user names (people can log in using their native name),
458error messages,
459and miscellaneous minor places such as commands to the I/O system.
460Little change was required: null-terminated UTF strings
461are equivalent to null-terminated ASCII strings for most purposes
462of the operating system.
463The library routines described in the next section made that
464change straightforward.
465.PP
466The window system, once called
467.CW 8.5 ,
468is now rightfully called
469.CW 8½ .
470.SH
471Libraries
472.PP
473A header file included by all programs (see [Pike92]) declares
474the
475.CW Rune
476type to hold 16-bit character values:
477.P1
478typedef unsigned short Rune;
479.P2
480Also defined are several constants relevant to UTF:
481.P1
482enum
483{
484    UTFmax    = 3,    /* maximum bytes per rune */
485    Runesync  = 0x80, /* can't appear in UTF sequence (<) */
486    Runeself  = 0x80, /* rune==UTF sequence (<) */
487    Runeerror = 0x80, /* decoding error in UTF */
488};
489.P2
490(With the original UTF,
491.CW Runesync
492was hexadecimal 21 and
493.CW Runeself
494was A0.)
495.CW UTFmax
496bytes are sufficient
497to hold the UTF encoding of any Unicode character.
498Characters of value less than
499.CW Runesync
500only appear in a UTF string as
501themselves, never as part of a sequence encoding another character.
502Characters of value less than
503.CW Runeself
504encode into single bytes
505of the same value.
506Finally, when the library detects errors in UTF input\(embyte sequences
507that are not valid UTF sequences\(emit converts the first byte of the
508error sequence to the character
509.CW Runeerror .
510There is little a rune-oriented program can do when given bad data
511except exit, which is unreasonable, or carry on.
512Originally the conversion routines, described below,
513returned errors when given invalid UTF,
514but we found ourselves repeatedly checking for errors and ignoring them.
515We therefore decided to convert a bad sequence to a valid rune
516and continue processing.
517(The ANSI C routines, on the other hand, return errors.)
518.PP
519This technique does have the unfortunate property that converting
520invalid UTF byte strings in and out of runes does not preserve the input,
521but this circumstance only occurs when non-textual input is
522given to a textual program.
523The Unicode Standard defines an error character, value FFFD, to stand for
524characters from other sets that it does not represent.
525The
526.CW Runeerror
527character is a different concept, related to the encoding rather than the character set, so we
528chose a different character for it.
529.PP
530The Plan 9 C library contains a number of routines for
531manipulating runes.
532The first set converts between runes and UTF strings:
533.P1
534extern	int	runetochar(char*, Rune*);
535extern	int	chartorune(Rune*, char*);
536extern	int	runelen(long);
537extern	int	fullrune(char*, int);
538.P2
539.CW Runetochar
540translates a single
541.CW Rune
542to a UTF sequence and returns the number of bytes produced.
543.CW Chartorune
544goes the other way, reporting how many bytes were consumed.
545.CW Runelen
546returns the number of bytes in the UTF encoding of a rune.
547.CW Fullrune
548examines a UTF string up to a specified number of bytes
549and reports whether the string begins with a complete UTF encoding.
550All these routines use the
551.CW Runeerror
552character to work around encoding problems.
553.PP
554There is also a set of routines for examining null-terminated UTF strings,
555based on the model of the ANSI standard
556.CW str
557routines, but with
558.CW utf
559substituted for
560.CW str
561and
562.CW rune
563for
564.CW chr :
565.P1
566extern	int	utflen(char*);
567extern	char*	utfrune(char*, long);
568extern	char*	utfrrune(char*, long);
569extern	char*	utfutf(char*, char*);
570.P2
571.CW Utflen
572returns the number of runes in a UTF string;
573.CW utfrune
574returns a pointer to the first occurrence of a rune in a UTF string;
575and
576.CW utfrrune
577a pointer to the last.
578.CW Utfutf
579searches for the first occurrence of a UTF string in another UTF string.
580Given the synchronizing property of UTF-8,
581.CW utfutf
582is the same as
583.CW strstr
584if the arguments point to valid UTF strings.
585.PP
586It is a mistake to use
587.CW strchr
588or
589.CW strrchr
590unless searching for a 7-bit ASCII character, that is, a character
591less than
592.CW Runeself .
593.PP
594We have no routines for manipulating null-terminated arrays of
595.CW Runes .
596Although they should probably exist for completeness, we have
597found no need for them, for the same reason that
598.CW %S
599and
600.CW L"..."
601are rarely used.
602.PP
603Most Plan 9 programs use a new buffered I/O library, BIO, in place of
604Standard I/O.
605BIO contains routines to read and write UTF streams, converting to and from
606runes.
607.CW Bgetrune
608returns, as a
609.CW Rune
610within a
611.CW long ,
612the next character in the UTF input stream;
613.CW Bputrune
614takes a rune and writes its UTF representation.
615.CW Bungetrune
616puts a rune back into the input stream for rereading.
617.PP
618Plan 9 programs use a simple set of macros to process command line arguments.
619Converting these macros to UTF automatically updated the
620argument processing of most programs.
621In general,
622argument flag names can no longer be held in bytes and
623arrays of 256 bytes cannot be used to hold a set of flags.
624.PP
625We have done nothing analogous to ANSI C's locales, partly because
626we do not feel qualified to define locales and partly because we remain
627unconvinced of that model for dealing with the problems.
628That is really more an issue of internationalization than conversion
629to a larger character set; on the other hand,
630because we have chosen a single character set that encompasses
631most languages, some of the need for
632locales is eliminated.
633(We have a utility,
634.CW tcs ,
635that translates between UTF and other character sets.)
636.PP
637There are several reasons why our library does not follow the ANSI design
638for wide and multi-byte characters.
639The ANSI model was designed by a committee, untried, almost
640as an afterthought, whereas
641we wanted to design as we built.
642(We made several major changes to the interface
643as we became familiar with the problems involved.)
644We disagree with ANSI C's handling of invalid multi-byte sequences.
645Also, the ANSI C library is incomplete:
646although it contains some crucial routines for handling
647wide and multi-byte characters, there are some serious omissions.
648For example, our software can exploit
649the fact that UTF preserves ASCII characters in the byte stream.
650We could remove that assumption by replacing all
651calls to
652.CW strchr
653with
654.CW utfrune
655and so on.
656(Because of the weaker properties of the original UTF,
657we have actually done so.)
658ANSI C cannot:
659the standard says nothing about the representation, so portable code should
660.I never
661call
662.CW strchr ,
663yet there is no ANSI equivalent to
664.CW utfrune .
665ANSI C simultaneously invalidates
666.CW strchr
667and offers no replacement.
668.PP
669Finally, ANSI did nothing to integrate wide characters
670into the I/O system: it gives no method for printing
671wide characters.
672We therefore needed to invent some things and decided to invent
673everything.
674In the end, some of our entry points do correspond closely to
675ANSI routines\(emfor example
676.CW chartorune
677and
678.CW runetochar
679are similar to
680.CW mbtowc
681and
682.CW wctomb \(embut
683Plan 9's library defines more functionality, enough
684to write real applications comfortably.
685.SH
686Converting the tools
687.PP
688The source for our tools and applications had already been converted to
689work with Latin-1, so it was `8-bit safe', but the conversion to the Unicode
690Standard and UTF is more involved.
691Some programs needed no change at all:
692.CW cat ,
693for instance,
694interprets its argument strings, delivered in UTF,
695as file names that it passes uninterpreted to the
696.CW open
697system call,
698and then just copies bytes from its input to its output;
699it never makes decisions based on the values of the bytes.
700(Plan 9
701.CW cat
702has no options such as
703.CW -v
704to complicate matters.)
705Most programs, however, needed modest change.
706.PP
707It is difficult to
708find automatically the places that need attention,
709but
710.CW grep
711helps.
712Software that uses the libraries conscientiously can be searched
713for calls to library routines that examine bytes as characters:
714.CW strchr ,
715.CW strrchr ,
716.CW strstr ,
717etc.
718Replacing these by calls to
719.CW utfrune ,
720.CW utfrrune ,
721and
722.CW utfutf
723is enough to fix many programs.
724Few tools actually need to operate on runes internally;
725more typically they need only to look for the final slash in a file
726name and similar trivial tasks.
727Of the 170 C source programs in the top levels of
728.CW /sys/src/cmd ,
729only 23 now contain the word
730.CW Rune .
731.PP
732The programs that
733.I do
734store runes internally
735are mostly those whose
736.I raison
737.I d'être
738is character manipulation:
739.CW sam
740(the text editor),
741.CW sed ,
742.CW sort ,
743.CW tr ,
744.CW troff ,
745.CW 8½
746(the window system and terminal emulator),
747and so on.
748To decide whether to compute using runes
749or UTF-encoded byte strings requires balancing the cost of converting
750the data when read and written
751against the cost of converting relevant text on demand.
752For programs such as editors that run a long time with a relatively
753constant dataset, runes are the better choice.
754There are space considerations too, but they are more complicated:
755plain ASCII text grows when converted to runes; UTF-encoded Japanese
756shrinks.
757.PP
758Again, it is hard to automate the conversion of a program from
759.CW chars
760to
761.CW Runes .
762It is not enough just to change the type of variables; the assumption
763that bytes and characters are equivalent can be insidious.
764For instance, to clear a character array by
765.P1
766memset(buf, 0, BUFSIZE)
767.P2
768becomes wrong if
769.CW buf
770is changed from an array of
771.CW chars
772to an array of
773.CW Runes .
774Any program that indexes tables based on character values needs
775rethinking.
776Consider
777.CW tr ,
778which originally used multiple 256-byte arrays for the mapping.
779The naïve conversion would yield multiple 65536-rune arrays.
780Instead Plan 9
781.CW tr
782saves space by building in effect
783a run-encoded version of the map.
784.PP
785.CW Sort
786has related problems.
787The cooperation of UTF and
788.CW strcmp
789means that a simple sort\(emone with no options\(emcan be done
790on the original UTF strings using
791.CW strcmp .
792With sorting options enabled, however,
793.CW sort
794may need to convert its input to runes: for example,
795option
796.CW -t\f1α\fP
797requires searching for alphas in the input text to
798crack the input into fields.
799The field specifier
800.CW +3.2
801refers to 2 runes beyond the third field.
802Some of the other options are hopelessly provincial:
803consider the case-folding and dictionary order options
804(Japanese doesn't even have an official dictionary order) or
805.CW -M
806which compares by case-insensitive English month name.
807Handling these options involves the
808larger issues of internationalization and is beyond the scope
809of this paper and our expertise.
810Plan 9
811.CW sort
812works sensibly with options that make sense relative to the input.
813The simple and most important options are, however, usually meaningful.
814In particular,
815.CW sort
816sorts UTF into the same order that
817.CW look
818expects.
819.PP
820Regular expression-matching algorithms need rethinking to
821be applied to UTF text.
822Deterministic automata are usually applied to bytes;
823converting them to operate on variable-sized byte sequences is awkward.
824On the other hand, converting the input stream to runes adds measurable
825expense
826and the state tables expand
827from size 256 to 65536; it can be expensive just to generate them.
828For simple string searching,
829the Boyer-Moore algorithm works with UTF provided the input is
830guaranteed to be only valid UTF strings; however, it does not work
831with the old UTF encoding.
832At a more mundane level, even character classes are harder:
833the usual bit-vector representation within a non-deterministic automaton
834is unwieldy with 65536 characters in the alphabet.
835.PP
836We compromised.
837An existing library for compiling and executing regular expressions
838was adapted to work on runes, with two entry points for searching
839in arrays of runes and arrays of chars (the pattern is always UTF text).
840Character classes are represented internally as runs of runes;
841the reserved value
842.CW FFFF
843marks the end of the class.
844Then
845.I all
846utilities that use regular expressions\(emeditors,
847.CW grep ,
848.CW awk ,
849etc.\(emexcept the shell, whose notation
850was grandfathered, were converted to use the library.
851For some programs, there was a concomitant loss of performance,
852but there was also a strong advantage.
853To our knowledge, Plan 9 is the only Unix-like system
854that has a single definition and implementation of
855regular expressions; patterns are written and interpreted
856identically by all the programs in the system.
857.PP
858A handful of programs have the notion of character built into them
859so strongly as to confuse the issue of what they should do with UTF input.
860Such programs were treated as individual special cases.
861For example,
862.CW wc
863is, by default, unchanged in behavior and output; a new option,
864.CW -r ,
865counts the number of correctly encoded runes\(emvalid UTF sequences\(emin
866its input;
867.CW -b
868the number of invalid sequences.
869.PP
870It took us several months to convert all the software in the system
871to the Unicode Standard and the old UTF.
872When we decided to convert from that to the new UTF,
873only three things needed to be done.
874First, we rewrote the library routines to encode and decode the
875new UTF.  This took an evening.
876Next, we converted all the files containing UTF
877to the new encoding.
878We wrote a trivial program to look for non-ASCII bytes in
879text files and used a Plan 9 program called
880.CW tcs
881(translate character set) to change encodings.
882Finally, we recompiled all the system software;
883the library interface was unchanged, so recompilation was sufficient
884to effect the transformation.
885The second two steps were done concurrently and took an afternoon.
886We concluded that the actual encoding is relatively unimportant to the
887software; the adoption of large characters and a byte-stream encoding
888.I per
889.I se
890are much deeper issues.
891.SH
892Graphics and fonts
893.PP
894Plan 9 provides only minimal support for plain text terminals.
895It is instead designed to be used with all character input and
896output mediated by a window system such as
897.CW 8½ .
898The window system and related software are responsible for the
899display of UTF text as Unicode character images.
900For plain text, the window system must provide a user-settable
901.I font
902that provides a (possibly empty) picture for each Unicode character.
903Fancier applications that use bold and Italic characters
904need multiple fonts storing multiple pictures for each
905Unicode value.
906All the issues are apparent, though,
907in just the problem of
908displaying a single image for each character, that is, the
909Unicode equivalent of a plain text terminal.
910With 128 or even 256 characters, a font can be just
911an array of bitmaps.  With 65536 characters,
912a more sophisticated design is necessary.  To store the ideographs
913for just Japanese as 16×16×1 bit images,
914the smallest they can reasonably be, takes over a quarter of a
915megabyte.  Make the images a little larger, store more bits per
916pixel, and hold a copy in every running application, and the
917memory cost becomes unreasonable.
918.PP
919The structure of the bitmap graphics services is described at length elsewhere
920[Pike91].
921In summary, the memory holding the bitmaps is stored in the same machine that has
922the display, mouse, and keyboard: the terminal in Plan 9 terminology,
923the workstation in others'.
924Access to that memory and associated services is provided
925by device files served by system
926software on the terminal.  One of those files,
927.CW /dev/bitblt ,
928interprets messages written upon it as requests for actions
929corresponding to entry points in the graphics library:
930allocate a bitmap, execute a raster operation, draw a text string, etc.
931The window system
932acts as a multiplexer that mediates access to the services
933and resources of the terminal by simulating in each client window
934a set of files mirroring those provided by the system.
935That is, each window has a distinct
936.CW /dev/mouse ,
937.CW /dev/bitblt ,
938and so on through which applications drive graphical
939input and output.
940.PP
941One of the resources managed by
942.CW 8½
943and the terminal is the set of active
944.I subfonts.
945Each subfont holds the
946bitmaps and associated data structures for a sequential set of Unicode
947characters.
948Subfonts are stored in files and loaded into the terminal by
949.CW 8½
950or an application.
951For example, one subfont
952might hold the images of the first 256 characters of the Unicode space,
953corresponding to the Latin-1 character set;
954another might hold the standard phonetic character set, Unicode characters
955with value 0250 to 02E9.
956These files are collected in directories corresponding to typefaces:
957.CW /lib/font/bit/pelm
958contains the Pellucida Monospace character set, with subfonts holding
959the Latin-1, Greek, Cyrillic and other components of the typeface.
960A suffix on subfont files encodes (in a subfont-specific
961way) the size of the images:
962.CW /lib/font/bit/pelm/latin1.9
963contains the Latin-1 Pellucida Monospace characters with lower
964case letters 9 pixels high;
965.CW /lib/font/bit/jis/jis5400.16
966contains 16-pixel high
967ideographs starting at Unicode value 5400.
968.PP
969The subfonts do not identify which portion of the Unicode space
970they cover.  Instead, a
971font file, in plain text,
972describes how to assemble subfonts into a complete
973character set.
974The font file is presented as an argument to the window system
975to determine how plain text is displayed in text windows and
976applications.
977Here is the beginning of the font file
978.CW /lib/font/bit/pelm/jis.9.font ,
979which describes the layout of a font covering that portion of
980the Unicode Standard for which we have characters of typical
981display size, using Japanese characters
982to cover the Han space:
983.P1
98418	14
9850x0000	0x00FF	latin1.9
9860x0100	0x017E	latineur.9
9870x0250	0x02E9	ipa.9
9880x0386	0x03F5	greek.9
9890x0400	0x0475	cyrillic.9
9900x2000	0x2044	../misc/genpunc.9
9910x2070	0x208E	supsub.9
9920x20A0	0x20AA	currency.9
9930x2100	0x2138	../misc/letterlike.9
9940x2190	0x21EA	../misc/arrows
9950x2200	0x227F	../misc/math1
9960x2280	0x22F1	../misc/math2
9970x2300	0x232C	../misc/tech
9980x2500	0x257F	../misc/chart
9990x2600	0x266F	../misc/ding
1000.P2
1001.P1
10020x3000	0x303f	../jis/jis3000.16
10030x30a1	0x30fe	../jis/katakana.16
10040x3041	0x309e	../jis/hiragana.16
10050x4e00	0x4fff	../jis/jis4e00.16
10060x5000	0x51ff	../jis/jis5000.16
1007\&...
1008.P2
1009The first two numbers set the interline spacing of the font (18
1010pixels) and the distance from the baseline to the top of the
1011line (14 pixels).
1012When characters are displayed, they are placed so as best
1013to fit within those constraints; characters
1014too large to fit will be truncated.
1015The rest of the file associates subfont files
1016with portions of Unicode space.
1017The first four such files are in the Pellucida Monospace typeface
1018and directory; others reside in other directories.  The file names
1019are relative to the font file's own location.
1020.PP
1021There are several advantages to this two-level structure.
1022First, it simultaneously breaks the huge Unicode space into manageable
1023components and provides a unifying architecture for
1024assembling fonts from disjoint pieces.
1025Second, the structure promotes sharing.
1026For example, we have only one set of Japanese
1027characters but dozens of typefaces for the Latin-1 characters,
1028and this structure permits us to store only one copy of the
1029Japanese set but use it with any Roman typeface.
1030Also, customization is easy.
1031English-speaking users who don't need Japanese characters
1032but may want to read an on-line Oxford English Dictionary can
1033assemble a custom font with the
1034Latin-1 (or even just ASCII) characters and the International
1035Phonetic Alphabet (IPA).
1036Moreover, to do so requires just editing a plain text file,
1037not using a special font editing tool.
1038Finally, the structure guides the design of
1039caching protocols to improve performance and memory usage.
1040.PP
1041To load a complete Unicode character set into each application
1042would consume too
1043much memory and, particularly on slow terminal lines, would take
1044unreasonably long.
1045Instead, Plan 9 assembles a multi-level cache structure for
1046each font.
1047An application opens a font file, reads and parses it,
1048and allocates a data structure.
1049A message written to
1050.CW /dev/bitblt
1051allocates an associated structure held in the terminal, in particular,
1052a bitmap to act as a cache
1053for recently used character images.
1054Other messages copy these images to bitmaps such as the screen
1055by loading characters from subfonts into the cache on demand and
1056from there to the destination bitmap.
1057The protocol to draw characters is in terms of cache indices,
1058not Unicode character number or UTF sequences.
1059These details are hidden from the application, which instead
1060sees only a subroutine to draw a string in a bitmap from a
1061given font, functions to discover character size information,
1062and routines to allocate and to free fonts.
1063.PP
1064As needed, whole
1065subfonts are opened by the graphics library, read, and then downloaded
1066to the terminal.
1067They are held open by the library in an LRU-replacement list.
1068Even when the program closes a subfont, it is retained
1069in the terminal for later use.
1070When the application opens the subfont, it asks the terminal
1071if it already has a copy to avoid reading it from the file
1072server if possible.
1073This level of cache has the property that the bitmaps for, say,
1074all the Japanese characters are stored only once, in the terminal;
1075the applications read only size and width information from the terminal
1076and share the images.
1077.PP
1078The sizes of the character and subfont caches held by the
1079application are adaptive.
1080A simple algorithm monitors the cache miss rate to enlarge and
1081shrink the caches as required.
1082The size of the character cache is limited to 2048 images maximum,
1083which in practice seems enough even for Japanese text.
1084For plain ASCII-like text it naturally stays around 128 images.
1085.PP
1086This mechanism sounds complicated but is implemented by only about
1087500 lines in the library and considerably less in each of the
1088terminal's graphics driver and
1089.CW 8½ .
1090It has the advantage that only characters that are
1091being used are loaded into memory.
1092It is also efficient: if the characters being drawn
1093are in the cache the extra overhead is negligible.
1094It works particularly well for alphabetic character sets,
1095but also adapts on demand for ideographic sets.
1096When a user first looks at Japanese text, it takes a few
1097seconds to read all the font data, but thereafter the
1098text is drawn almost as fast as regular text (the images
1099are larger, so draw a little slower).
1100Also, because the bitmaps are remembered by the terminal,
1101if a second application then looks at Japanese text
1102it starts faster than the first.
1103.PP
1104We considered
1105building a `font server'
1106to cache character images and associated data
1107for the applications, the window system, and the terminal.
1108We rejected this design because, although isolating
1109many of the problems of font management into a separate program,
1110it didn't simplify the applications.
1111Moreover, in a distributed system such as Plan 9 it is easy
1112to have too many special purpose servers.
1113Making the management of the fonts the concern of only
1114the essential components simplifies the system and makes
1115bootstrapping less intricate.
1116.SH
1117Input
1118.PP
1119A completely different problem is how to type Unicode characters
1120as input to the system.
1121We selected an unused key on our ASCII keyboards
1122to serve as a prefix for multi-keystroke
1123sequences that generate Unicode characters.
1124For example, the character
1125.CW ü
1126is generated by the prefix key
1127(typically
1128.CW ALT
1129or
1130.CW Compose )
1131followed by a double quote and a lower-case
1132.CW u .
1133When that character is read by the application, from the file
1134.CW /dev/cons ,
1135it is of course presented as its UTF encoding.
1136Such sequences generate characters from an arbitrary set that
1137includes all of Latin-1 plus a selection of mathematical
1138and technical characters.
1139An arbitrary Unicode character may be generated by typing the prefix,
1140an upper case X, and four hexadecimal digits that identify
1141the Unicode value.
1142.PP
1143These simple mechanisms are adequate for most of our day-to-day needs:
1144it's easy to remember to type `ALT 1 2' for ½\^ or `ALT accent letter'
1145for accented Latin letters.
1146For the occasional unusual character, the cut and paste features of
1147.CW 8½
1148serve well.  A program called (perhaps misleadingly)
1149.CW unicode
1150takes as argument a hexadecimal value, and prints the UTF representation of that character,
1151which may then be picked up with the mouse and used as input.
1152.PP
1153These methods
1154are clearly unsatisfactory when working in a non-English language.
1155In the native country of such a language
1156the appropriate keyboard is likely to be at hand.
1157But it's also reasonable\(emespecially now that the system handles Unicode characters\(emto
1158work in a language foreign to the keyboard.
1159.PP
1160For alphabetic languages such as Greek or Russian, it is
1161straightforward to construct a program that does phonetic substitution,
1162so that, for example, typing a Latin `a' yields the Greek `α'.
1163Within Plan 9, such a program can be inserted transparently
1164between the real keyboard and a program such as the window system,
1165providing a manageable input device for such languages.
1166.PP
1167For ideographic languages such as Chinese or Japanese the problem is harder.
1168Native users of such languages have adopted methods for dealing with
1169Latin keyboards that involve a hybrid technique based on phonetics
1170to generate a list of possible symbols followed by menu selection to
1171choose the desired one.
1172Such methods can be
1173effective, but their design must be rooted in information about
1174the language unknown to non-native speakers.
1175.CW Cxterm , (
1176a Chinese terminal emulator built by and for
1177Chinese programmers,
1178employs such a technique
1179[Pong and Zhang].)
1180Although the technical problem of implementing such a device
1181is easy in Plan 9\(emit is just an elaboration of the technique for
1182alphabetic languages\(emour lack of familiarity with such languages
1183has restrained our enthusiasm for building one.
1184.PP
1185The input problem is technically the least interesting but perhaps
1186emotionally the most important of the problems of converting a system
1187to an international character set.
1188Beyond that remain the deeper problems of internationalization
1189such as multi-lingual error messages and command names,
1190problems we are not qualified to solve.
1191With the ability to treat text of most languages on an equal
1192footing, though, we can begin down that path.
1193Perhaps people in non-English speaking countries will
1194consider adopting Plan 9, solving the input problem locally\(emperhaps
1195just by plugging in their local terminals\(emand begin to use
1196a system with at least the capacity to be international.
1197.SH
1198Acknowledgements
1199.PP
1200Dennis Ritchie provided consultation and encouragement.
1201Bob Flandrena converted most of the standard tools to UTF.
1202Brian Kernighan suffered cheerfully with several
1203inadequate implementations and converted
1204.CW troff
1205to UTF.
1206Rich Drechsler converted his Postscript driver to UTF.
1207John Hobby built the Postscript ☺.
1208We thank them all.
1209.SH
1210References
1211.LP
1212[ANSIC] \f2American National Standard for Information Systems \-
1213Programming Language C\f1, American National Standards Institute, Inc.,
1214New York, 1990.
1215.LP
1216[ISO10646]
1217ISO/IEC DIS 10646-1:1993
1218\f2Information technology \-
1219Universal Multiple-Octet Coded Character Set (UCS) \(em
1220Part 1: Architecture and Basic Multilingual Plane\fP.
1221.LP
1222[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
1223``Plan 9 from Bell Labs'',
1224UKUUG Proc. of the Summer 1990 Conf.,
1225London, England,
12261990.
1227.LP
1228[Pike91] R. Pike, ``8½, The Plan 9 Window System'', USENIX Summer
1229Conf. Proc., Nashville, 1991, reprinted in this volume.
1230.LP
1231[Pike92] R. Pike, ``How to Use the Plan 9 C Compiler'', this volume.
1232.LP
1233[Pong and Zhang] Man-Chi Pong and Yongguang Zhang, ``cxterm:
1234A Chinese Terminal Emulator for the X Window System'',
1235.I
1236Software\(emPractice and Experience,
1237.R
1238Vol 22(1), 809-926, October 1992.
1239.LP
1240[Unicode]
1241\f2The Unicode Standard,
1242Worldwide Character Encoding,
1243Version 1.0, Volume 1\f1,
1244The Unicode Consortium,
1245Addison Wesley,
1246New York,
12471991.
1248