xref: /plan9/sys/doc/utf.ms (revision b9e364c446c00cfa6b1164b4648b126624c464b2)
1.HTML "Hello World or Καλημέρα κόσμε or こんにちは 世界
2.TL
3Hello World
4.br
5or
6.br
7.ft R
8Καλημέρα κόσμε
9.ft
10.br
11or
12.br
13\f(Jpこんにちは 世界\fP
14.AU
15Rob Pike
16Ken Thompson
17.sp
18rob,ken@plan9.bell-labs.com
19.AB
20.FS
21Originally appeared, in a slightly different form, in
22.I
23Proc. of the Winter 1993 USENIX Conf.,
24.R
25pp. 43-50,
26San Diego.
27It has been revised to reflect the move to 21-bit Unicode.
28.FE
29Plan 9 from Bell Labs has recently been converted from ASCII
30to an ASCII-compatible variant of the Unicode Standard,
31a 16-bit (now 21-bit) character set.
32In this paper we explain the reasons for the change,
33describe the character set and representation we chose,
34and present the programming models and software changes
35that support the new text format.
36Although we stopped short of full internationalization\(emfor
37example, system error messages are in Unixese, not Japanese\(emwe
38believe Plan 9 is the first system to treat the representation
39of all major languages on a uniform, equal footing throughout all its
40software.
41.AE
42.SH
43Introduction
44.PP
45The world is multilingual but most computer systems
46are based on English and ASCII.
47The first release of Plan 9 [Pike90], a new distributed operating
48system from Bell Laboratories, seemed a good occasion
49to correct this chauvinism.
50It is easier to make such deep changes when building new systems than
51by refitting old ones.
52.PP
53The ANSI C standard [ANSIC] contains some guidance on the matter of
54`wide' and `multi-byte' characters but falls far short of
55solving the myriad associated problems.
56We could find no literature on how to convert a
57.I system
58to larger character sets, although some individual
59.I programs
60had been converted.
61This paper reports what we discovered as we
62explored the problem of representing multilingual
63text at all levels of an operating system,
64from the file system and kernel through
65the applications and up to the window system
66and display.
67.PP
68Plan 9 has not been `internationalized':
69its manuals are in English,
70its error messages are in English,
71and it can display text that goes from left to right only.
72But before we can address these other problems,
73we need to handle, uniformly and comfortably,
74the textual representation of all the major written languages.
75That subproblem is richer than we had anticipated.
76.SH
77Standards
78.PP
79Our first step was to select a standard.
80At the time (January 1992),
81there were only two viable options:
82ISO 10646 [ISO10646] and Unicode [Unicode].
83The documents describing both proposals were still in the draft stage.
84.PP
85The draft of ISO 10646 was not
86very attractive to us.
87It defined a sparse set of 32-bit characters,
88which would be
89hard to implement
90and have punitive storage requirements.
91Also, the draft attempted to
92mollify national interests by allocating
9316-bit subspaces to national committees
94to partition individually.
95The suggested mode of use was to
96``flip'' between separate national
97standards to implement the international standard.
98This did not strike us as a sound basis for a character set.
99As well, transmitting 32-bit values in a byte stream,
100such as in pipes, would be expensive and hard to implement.
101Since the standard does not define a byte order for such
102transmission, the byte stream would also have to carry
103state to enable the values to be recovered.
104.PP
105The Unicode Standard is a proposal by a consortium of mostly American
106computer companies formed
107to protest the technical
108failings of ISO 10646.
109It defines a uniform 16-bit code based on the
110principle of unification:
111two characters are the same if they look the
112same even though they are from different
113languages.
114This principle, called Han unification,
115allows the large Japanese, Chinese, and Korean
116character sets to be packed comfortably into a 16-bit representation.
117.PP
118We chose the Unicode Standard for its technical merits and because its
119code space was better defined.
120Moreover,
121the Unicode Consortium was derailing the
122ISO 10646 standard.
123(Now, in 1995,
124ISO 10646 is a standard
125with one 16-bit group defined,
126which is almost exactly the Unicode Standard.
127As most people expected, the two standards bodies
128reached a détente and
129ISO 10646 and Unicode represent the same character set.)
130.PP
131The Unicode Standard defines an adequate character set
132but an unreasonable representation.
133It states that all characters
134are 16 bits wide and are communicated and stored in
13516-bit units.
136It also reserves a pair of characters
137(hexadecimal FFFE and FEFF) to detect byte order
138in transmitted text, requiring state in the byte stream.
139(The Unicode Consortium was thinking of files, not pipes.)
140To adopt this encoding,
141we would have had to convert all text going
142into and out of Plan 9 between ASCII and Unicode, which cannot be done.
143Within a single program, in command of all its input and output,
144it is possible to define characters as 16-bit quantities;
145in the context of a networked system with
146hundreds of applications on diverse machines
147by different manufacturers,
148it is impossible.
149.PP
150We needed a way to adapt the Unicode Standard to the tools-and-pipes
151model of text processing embodied by the Unix system.
152To do that, we
153needed an ASCII-compatible textual
154representation of Unicode characters for transmission
155and storage.
156In the draft ISO standard there was an informative
157(non-required)
158Annex
159called UTF
160that provided a byte stream encoding
161of the 32-bit ISO code.
162The encoding uses multibyte sequences composed
163from the 190 printable characters of Latin-1
164to represent character values larger
165than 159.
166.PP
167The UTF encoding has several good properties.
168By far the most important is that
169a byte in the ASCII range 0-127 represents
170itself in UTF.
171Thus UTF is backward compatible with ASCII.
172.PP
173UTF has other advantages.
174It is a byte encoding and is
175therefore byte-order independent.
176ASCII control characters appear in the byte stream
177only as themselves, never as an element of a sequence
178encoding another character,
179so newline bytes separate lines of UTF text.
180Finally, ANSI C's
181.CW strcmp
182function applied to UTF strings preserves the ordering of Unicode characters.
183.PP
184To encode and decode UTF is expensive (involving multiplication,
185division, and modulo operations) but workable.
186UTF's major disadvantage is that the encoding
187is not self-synchronizing.
188It is in general impossible to find the character
189boundaries in a UTF string without reading from
190the beginning of the string, although in practice
191control characters such as newlines,
192tabs, and blanks provide synchronization points.
193.PP
194In August 1992,
195X-Open circulated a proposal for another UTF-like
196byte encoding of Unicode characters.
197Their major concern was that an embedded character
198in a file name
199(in particular a slash)
200could be part of an escape sequence in UTF and
201therefore confuse a traditional file system.
202Their proposal would allow all 7-bit ASCII characters
203to represent themselves
204.I "and only themselves"
205in text.
206Multibyte sequences would contain only characters
207with the high bit set.
208We proposed a modification to the new UTF that
209would address our synchronization problem.
210Our proposal, which was  originally known informally as UTF-2 and FSS-UTF,
211is now referred to as UTF-8 and has been approved by ISO to become
212Annex P to ISO 10646.
213.PP
214The model for text in Plan 9 is chosen from these
215three standards*:
216.FS
217* ``That's the nice thing about standards\(emthere's so many to choose from.'' \- Andy Tannenbaum (no, the other one)
218.FE
219the Unicode character set encoded as a byte stream by
220UTF-8, from
221(soon to be) Annex P of ISO 10646.
222Although this mixture may seem like a precarious position for us to adopt,
223it is not as bad as it sounds.
224ISO 10646 and the Unicode Standard have converged,
225other systems such as Linux have adopted the same character set and encoding,
226and the general feeling seems to be that Unicode and UTF-8 will be accepted as the way
227to exchange text between systems.
228The prognosis for wide acceptance is good.
229.PP
230There are a couple of aspects of the Unicode Standard we have not faced.
231One is the issue of right-to-left text such as Hebrew or Arabic.
232Since that is an issue of display, not representation, we believe
233we can defer that problem for the moment without affecting our
234ability to solve it later.
235Another issue is diacriticals and `combining characters',
236which cause overstriking of multiple Unicode characters.
237Although necessary for some scripts, such as Thai, Arabic, and Hebrew,
238such characters confuse the issues for Latin languages because they
239generate multiple representations for accented characters.
240ISO 10646 describes three levels of implementation;
241in Plan 9 we decided not to address the issue.
242Again, this can be labeled as a display issue and its finer points are still being debated,
243so we felt comfortable deferring.  Mañana.
244.PP
245Although we converted Plan 9 in the altruistic interests of
246serving foreign languages, we have found the large character
247set attractive for other reasons.  The Unicode Standard includes many
248characters\(emmathematical symbols, scientific notation,
249more general punctuation, and more\(emthat we now use
250daily in our work.  We no longer test our imaginations
251to find ways to include non-ASCII symbols in our text;
252why type
253.CW :-)
254when you can use the character ☺?
255Most compelling is the ability to absorb documents
256and data that contain non-ASCII characters; our browser for the
257Oxford English Dictionary
258lets us see the dictionary as it really is, with pronunciation
259in the IPA font, foreign phrases properly rendered, and so on,
260.I "in plain text.
261.PP
262As of Unicode 4.0,
263characters are now 21 bits wide and the longest UTF-8 encoding of a character
264requires 4 bytes.
265We are adapting the system to match.
266.PP
267In the rest of this paper, except when
268stated otherwise, the term `UTF' refers to the UTF-8 encoding
269of Unicode characters as adopted by Plan 9.
270.SH
271C Compiler
272.PP
273The first program to be converted to UTF
274was the C Compiler.
275There are two levels of conversion.
276On the syntactic level,
277input to the C compiler
278is UTF; on the semantic level,
279the C language needs to define
280how compiled programs manipulate
281the UTF set.
282.PP
283The syntactic part is simple.
284The ANSI C language standard defines the
285source character set to be ASCII.
286Since UTF is backward compatible with ASCII,
287the compiler needs little change.
288The only places where a larger character set
289is allowed are in character constants, strings, and comments.
290Since 7-bit ASCII characters can represent only
291themselves in UTF,
292the compiler does not have to be careful while looking
293for the termination of a string or comment.
294.PP
295The Plan 9 compiler extends ANSI C to treat any Unicode
296character with a value outside of the ASCII range as
297an alphabetic.
298To a Greek programmer or an English mathematician,
299α is a sensible and now valid variable name.
300.PP
301On the semantic level, ANSI C allows,
302but does not tie down,
303the notion of a
304.I "wide character
305and admits string and character constants
306of this type.
307We chose the wide character type to be
308.CW unsigned
309.CW short
310(now
311.CW unsigned
312.CW long) .
313In the libraries, the word
314.CW Rune
315is now defined by a
316.CW typedef
317to be equivalent to
318.CW unsigned
319.CW long
320and is
321used to signify a Unicode character.
322.PP
323There are surprises; for example:
324.P1
325L'x'	\f1is 120\fP
326\&'x'	\f1is 120\fP
327L'ÿ'	\f1is 255\fP
328\&'ÿ'	\f1is -1, stdio \fPEOF\f1 (if \fPchar\f1 is signed)\fP
329L'\f1α\fP'	\f1is 945\fP
330\&'\f1α\fP'	\f1is illegal\fP
331.P2
332In the string constants,
333.P1
334"\f(Jpこんにちは 世界\fP"
335L"\f(Jpこんにちは 世界\fP",
336.P2
337the former is an array of
338.CW chars
339with 22 elements
340and a null byte,
341while the latter is an array of
342.CW unsigned
343.CW long s
344.CW Runes ) (
345with 8 elements and a null
346.CW Rune .
347.PP
348The Plan 9 library provides an output conversion function,
349.CW print
350(analogous to
351.CW printf ),
352with formats
353.CW %c ,
354.CW %C ,
355.CW %s ,
356and
357.CW %S .
358Since
359.CW print
360produces text, its output is always UTF.
361The character conversion
362.CW %c
363(lower case) masks its argument
364to 8 bits before converting to UTF.
365Thus
366.CW L'ÿ'
367and
368.CW 'ÿ'
369printed under
370.CW %c
371will be identical,
372but
373.CW L'\f1α\fP'
374will print as the Unicode
375character with decimal value 177.
376The character conversion
377.CW %C
378(upper case) masks its argument
379to 16 bits before converting to UTF.
380Thus
381.CW L'ÿ'
382and
383.CW L'\f1α\fP'
384will print correctly under
385.CW %C ,
386but
387.CW 'ÿ'
388will not.
389The conversion
390.CW %s
391(lower case)
392expects a pointer to
393.CW char
394and copies UTF sequences up to a null byte.
395The conversion
396.CW %S
397(upper case) expects a pointer to
398.CW Rune
399and
400performs sequential
401.CW %C
402conversions until a null
403.CW Rune
404is encountered.
405.PP
406Another problem in format conversion
407is the definition of
408.CW %10s :
409does the number refer to bytes or characters?
410We decided that such formats were most
411often used to align output columns and
412so made the number count characters.
413Some programs, however, use the count
414to place blank-padded strings
415in fixed-sized arrays.
416These programs must be found and corrected.
417.PP
418Here is a complete example:
419.P1
420#include <u.h>
421
422char c[] = "\f(Jpこんにちは 世界\fP";
423Rune s[] = L"\f(Jpこんにちは 世界\fP";
424
425main(void)
426{
427	print("%d, %d\en", sizeof(c), sizeof(s));
428	print("%s\en", c);
429	print("%S\en", s);
430}
431.P2
432.PP
433This program prints
434.CW 23,
435.CW 18
436and then two identical lines of
437UTF text.
438In practice,
439.CW %S
440and
441.CW L"..."
442are rare in programs; one reason is
443that most formatted I/O is done in unconverted UTF.
444.SH
445Ramifications
446.PP
447All programs in Plan 9 now read and write text as UTF, not ASCII.
448This change breaks two deep-rooted symmetries implicit in most C programs:
449.IP 1.
450A character is no longer a
451.CW char .
452.IP 2.
453The internal representation (Rune) of a character now differs from its
454external representation (UTF).
455.PP
456In the sections that follow,
457we show how these issues were faced in the layers of
458system software from the operating system up to the applications.
459The effects are wide-reaching and often surprising.
460.SH
461Operating system
462.PP
463Since UTF is the only format for text in Plan 9,
464the interface to the operating system had to be converted to UTF.
465Text strings cross the interface in several places:
466command arguments,
467file names,
468user names (people can log in using their native name),
469error messages,
470and miscellaneous minor places such as commands to the I/O system.
471Little change was required: null-terminated UTF strings
472are equivalent to null-terminated ASCII strings for most purposes
473of the operating system.
474The library routines described in the next section made that
475change straightforward.
476.PP
477The window system, once called
478.CW 8.5 ,
479is now rightfully called
480.CW 8½ .
481.SH
482Libraries
483.PP
484A header file included by all programs (see [Pike92]) declares
485the
486.CW Rune
487type to hold 21-bit character values:
488.P1
489typedef unsigned long Rune;
490.P2
491Also defined are several constants relevant to UTF:
492.P1
493enum
494{
495    UTFmax	= 4,	/* maximum bytes per rune */
496    Runesync	= 0x80,	/* cannot be in a UTF sequence (<) */
497    Runeself	= 0x80,	/* rune==UTF sequence (<) */
498    Runeerror	= 0xFFFD,	/* decoding error in UTF */
499    Runemax	= 0x10FFFF,	/* largest 21-bit rune */
500    Runemask	= 0x1FFFFF,	/* bits used by runes (see grep) */
501};
502.P2
503(With the original UTF,
504.CW Runesync
505was hexadecimal 21 and
506.CW Runeself
507was A0.)
508.CW UTFmax
509bytes are sufficient
510to hold the UTF encoding of any Unicode character.
511Characters of value less than
512.CW Runesync
513only appear in a UTF string as
514themselves, never as part of a sequence encoding another character.
515Characters of value less than
516.CW Runeself
517encode into single bytes
518of the same value.
519Finally, when the library detects errors in UTF input\(embyte sequences
520that are not valid UTF sequences\(emit converts the first byte of the
521error sequence to the character
522.CW Runeerror .
523There is little a rune-oriented program can do when given bad data
524except exit, which is unreasonable, or carry on.
525Originally the conversion routines, described below,
526returned errors when given invalid UTF,
527but we found ourselves repeatedly checking for errors and ignoring them.
528We therefore decided to convert a bad sequence to a valid rune
529and continue processing.
530(The ANSI C routines, on the other hand, return errors.)
531.PP
532This technique does have the unfortunate property that converting
533invalid UTF byte strings in and out of runes does not preserve the input,
534but this circumstance only occurs when non-textual input is
535given to a textual program.
536The Unicode Standard defines an error character, value FFFD, to stand for
537characters from other sets that it does not represent.
538The
539.CW Runeerror
540character is a different concept, related to the encoding rather than the character set.
541.PP
542The Plan 9 C library contains a number of routines for
543manipulating runes.
544The first set converts between runes and UTF strings:
545.P1
546extern	int	runetochar(char*, Rune*);
547extern	int	chartorune(Rune*, char*);
548extern	int	runelen(long);
549extern	int	fullrune(char*, int);
550.P2
551.CW Runetochar
552translates a single
553.CW Rune
554to a UTF sequence and returns the number of bytes produced.
555.CW Chartorune
556goes the other way, reporting how many bytes were consumed.
557.CW Runelen
558returns the number of bytes in the UTF encoding of a rune.
559.CW Fullrune
560examines a UTF string up to a specified number of bytes
561and reports whether the string begins with a complete UTF encoding.
562All these routines use the
563.CW Runeerror
564character to work around encoding problems.
565.PP
566There is also a set of routines for examining null-terminated UTF strings,
567based on the model of the ANSI standard
568.CW str
569routines, but with
570.CW utf
571substituted for
572.CW str
573and
574.CW rune
575for
576.CW chr :
577.P1
578extern	int	utflen(char*);
579extern	char*	utfrune(char*, long);
580extern	char*	utfrrune(char*, long);
581extern	char*	utfutf(char*, char*);
582.P2
583.CW Utflen
584returns the number of runes in a UTF string;
585.CW utfrune
586returns a pointer to the first occurrence of a rune in a UTF string;
587and
588.CW utfrrune
589a pointer to the last.
590.CW Utfutf
591searches for the first occurrence of a UTF string in another UTF string.
592Given the synchronizing property of UTF-8,
593.CW utfutf
594is the same as
595.CW strstr
596if the arguments point to valid UTF strings.
597.PP
598It is a mistake to use
599.CW strchr
600or
601.CW strrchr
602unless searching for a 7-bit ASCII character, that is, a character
603less than
604.CW Runeself .
605.PP
606We have no routines for manipulating null-terminated arrays of
607.CW Runes .
608Although they should probably exist for completeness, we have
609found no need for them, for the same reason that
610.CW %S
611and
612.CW L"..."
613are rarely used.
614.PP
615Most Plan 9 programs use a new buffered I/O library, BIO, in place of
616Standard I/O.
617BIO contains routines to read and write UTF streams, converting to and from
618runes.
619.CW Bgetrune
620returns, as a
621.CW Rune
622within a
623.CW long ,
624the next character in the UTF input stream;
625.CW Bputrune
626takes a rune and writes its UTF representation.
627.CW Bungetrune
628puts a rune back into the input stream for rereading.
629.PP
630Plan 9 programs use a simple set of macros to process command line arguments.
631Converting these macros to UTF automatically updated the
632argument processing of most programs.
633In general,
634argument flag names can no longer be held in bytes and
635arrays of 256 bytes cannot be used to hold a set of flags.
636.PP
637We have done nothing analogous to ANSI C's locales, partly because
638we do not feel qualified to define locales and partly because we remain
639unconvinced of that model for dealing with the problems.
640That is really more an issue of internationalization than conversion
641to a larger character set; on the other hand,
642because we have chosen a single character set that encompasses
643most languages, some of the need for
644locales is eliminated.
645(We have a utility,
646.CW tcs ,
647that translates between UTF and other character sets.)
648.PP
649There are several reasons why our library does not follow the ANSI design
650for wide and multi-byte characters.
651The ANSI model was designed by a committee, untried, almost
652as an afterthought, whereas
653we wanted to design as we built.
654(We made several major changes to the interface
655as we became familiar with the problems involved.)
656We disagree with ANSI C's handling of invalid multi-byte sequences.
657Also, the ANSI C library is incomplete:
658although it contains some crucial routines for handling
659wide and multi-byte characters, there are some serious omissions.
660For example, our software can exploit
661the fact that UTF preserves ASCII characters in the byte stream.
662We could remove that assumption by replacing all
663calls to
664.CW strchr
665with
666.CW utfrune
667and so on.
668(Because of the weaker properties of the original UTF,
669we have actually done so.)
670ANSI C cannot:
671the standard says nothing about the representation, so portable code should
672.I never
673call
674.CW strchr ,
675yet there is no ANSI equivalent to
676.CW utfrune .
677ANSI C simultaneously invalidates
678.CW strchr
679and offers no replacement.
680.PP
681Finally, ANSI did nothing to integrate wide characters
682into the I/O system: it gives no method for printing
683wide characters.
684We therefore needed to invent some things and decided to invent
685everything.
686In the end, some of our entry points do correspond closely to
687ANSI routines\(emfor example
688.CW chartorune
689and
690.CW runetochar
691are similar to
692.CW mbtowc
693and
694.CW wctomb \(embut
695Plan 9's library defines more functionality, enough
696to write real applications comfortably.
697.SH
698Converting the tools
699.PP
700The source for our tools and applications had already been converted to
701work with Latin-1, so it was `8-bit safe', but the conversion to the Unicode
702Standard and UTF is more involved.
703Some programs needed no change at all:
704.CW cat ,
705for instance,
706interprets its argument strings, delivered in UTF,
707as file names that it passes uninterpreted to the
708.CW open
709system call,
710and then just copies bytes from its input to its output;
711it never makes decisions based on the values of the bytes.
712(Plan 9
713.CW cat
714has no options such as
715.CW -v
716to complicate matters.)
717Most programs, however, needed modest change.
718.PP
719It is difficult to
720find automatically the places that need attention,
721but
722.CW grep
723helps.
724Software that uses the libraries conscientiously can be searched
725for calls to library routines that examine bytes as characters:
726.CW strchr ,
727.CW strrchr ,
728.CW strstr ,
729etc.
730Replacing these by calls to
731.CW utfrune ,
732.CW utfrrune ,
733and
734.CW utfutf
735is enough to fix many programs.
736Few tools actually need to operate on runes internally;
737more typically they need only to look for the final slash in a file
738name and similar trivial tasks.
739Of the 170 C source programs in the top levels of
740.CW /sys/src/cmd ,
741only 23 now contain the word
742.CW Rune .
743.PP
744The programs that
745.I do
746store runes internally
747are mostly those whose
748.I raison
749.I d'être
750is character manipulation:
751.CW sam
752(the text editor),
753.CW sed ,
754.CW sort ,
755.CW tr ,
756.CW troff ,
757.CW 8½
758(the window system and terminal emulator),
759and so on.
760To decide whether to compute using runes
761or UTF-encoded byte strings requires balancing the cost of converting
762the data when read and written
763against the cost of converting relevant text on demand.
764For programs such as editors that run a long time with a relatively
765constant dataset, runes are the better choice.
766There are space considerations too, but they are more complicated:
767plain ASCII text grows when converted to runes; UTF-encoded Japanese
768shrinks.
769.PP
770Again, it is hard to automate the conversion of a program from
771.CW chars
772to
773.CW Runes .
774It is not enough just to change the type of variables; the assumption
775that bytes and characters are equivalent can be insidious.
776For instance, to clear a character array by
777.P1
778memset(buf, 0, BUFSIZE)
779.P2
780becomes wrong if
781.CW buf
782is changed from an array of
783.CW chars
784to an array of
785.CW Runes .
786Any program that indexes tables based on character values needs
787rethinking.
788Consider
789.CW tr ,
790which originally used multiple 256-byte arrays for the mapping.
791The naïve conversion would yield multiple 1,114,112-rune arrays.
792Instead Plan 9
793.CW tr
794saves space by building in effect
795a run-encoded version of the map.
796.PP
797.CW Sort
798has related problems.
799The cooperation of UTF and
800.CW strcmp
801means that a simple sort\(emone with no options\(emcan be done
802on the original UTF strings using
803.CW strcmp .
804With sorting options enabled, however,
805.CW sort
806may need to convert its input to runes: for example,
807option
808.CW -t\f1α\fP
809requires searching for alphas in the input text to
810crack the input into fields.
811The field specifier
812.CW +3.2
813refers to 2 runes beyond the third field.
814Some of the other options are hopelessly provincial:
815consider the case-folding and dictionary order options
816(Japanese doesn't even have an official dictionary order) or
817.CW -M
818which compares by case-insensitive English month name.
819Handling these options involves the
820larger issues of internationalization and is beyond the scope
821of this paper and our expertise.
822Plan 9
823.CW sort
824works sensibly with options that make sense relative to the input.
825The simple and most important options are, however, usually meaningful.
826In particular,
827.CW sort
828sorts UTF into the same order that
829.CW look
830expects.
831.PP
832Regular expression-matching algorithms need rethinking to
833be applied to UTF text.
834Deterministic automata are usually applied to bytes;
835converting them to operate on variable-sized byte sequences is awkward.
836On the other hand, converting the input stream to runes adds measurable
837expense
838and the state tables expand
839from size 256 to 1,114,112; it can be expensive just to generate them.
840For simple string searching,
841the Boyer-Moore algorithm works with UTF provided the input is
842guaranteed to be only valid UTF strings; however, it does not work
843with the old UTF encoding.
844At a more mundane level, even character classes are harder:
845the usual bit-vector representation within a non-deterministic automaton
846is unwieldy with 1,114,112 characters in the alphabet.
847.PP
848We compromised.
849An existing library for compiling and executing regular expressions
850was adapted to work on runes, with two entry points for searching
851in arrays of runes and arrays of chars (the pattern is always UTF text).
852Character classes are represented internally as runs of runes;
853the reserved value
854.CW FFFF
855marks the end of the class.
856Then
857.I all
858utilities that use regular expressions\(emeditors,
859.CW grep ,
860.CW awk ,
861etc.\(emexcept the shell, whose notation
862was grandfathered, were converted to use the library.
863For some programs, there was a concomitant loss of performance,
864but there was also a strong advantage.
865To our knowledge, Plan 9 is the only Unix-like system
866that has a single definition and implementation of
867regular expressions; patterns are written and interpreted
868identically by all the programs in the system.
869.PP
870A handful of programs have the notion of character built into them
871so strongly as to confuse the issue of what they should do with UTF input.
872Such programs were treated as individual special cases.
873For example,
874.CW wc
875is, by default, unchanged in behavior and output; a new option,
876.CW -r ,
877counts the number of correctly encoded runes\(emvalid UTF sequences\(emin
878its input;
879.CW -b
880the number of invalid sequences.
881.PP
882It took us several months to convert all the software in the system
883to the Unicode Standard and the old UTF.
884When we decided to convert from that to the new UTF,
885only three things needed to be done.
886First, we rewrote the library routines to encode and decode the
887new UTF.  This took an evening.
888Next, we converted all the files containing UTF
889to the new encoding.
890We wrote a trivial program to look for non-ASCII bytes in
891text files and used a Plan 9 program called
892.CW tcs
893(translate character set) to change encodings.
894Finally, we recompiled all the system software;
895the library interface was unchanged, so recompilation was sufficient
896to effect the transformation.
897The second two steps were done concurrently and took an afternoon.
898We concluded that the actual encoding is relatively unimportant to the
899software; the adoption of large characters and a byte-stream encoding
900.I per
901.I se
902are much deeper issues.
903.SH
904Graphics and fonts
905.PP
906Plan 9 provides only minimal support for plain text terminals.
907It is instead designed to be used with all character input and
908output mediated by a window system such as
909.CW 8½ .
910The window system and related software are responsible for the
911display of UTF text as Unicode character images.
912For plain text, the window system must provide a user-settable
913.I font
914that provides a (possibly empty) picture for each Unicode character.
915Fancier applications that use bold and Italic characters
916need multiple fonts storing multiple pictures for each
917Unicode value.
918All the issues are apparent, though,
919in just the problem of
920displaying a single image for each character, that is, the
921Unicode equivalent of a plain text terminal.
922With 128 or even 256 characters, a font can be just
923an array of bitmaps.  With 1,114,112 characters,
924a more sophisticated design is necessary.  To store the ideographs
925for just Japanese as 16×16×1 bit images,
926the smallest they can reasonably be, takes over a quarter of a
927megabyte.  Make the images a little larger, store more bits per
928pixel, and hold a copy in every running application, and the
929memory cost becomes unreasonable.
930.PP
931The structure of the bitmap graphics services is described at length elsewhere
932[Pike91].
933In summary, the memory holding the bitmaps is stored in the same machine that has
934the display, mouse, and keyboard: the terminal in Plan 9 terminology,
935the workstation in others'.
936Access to that memory and associated services is provided
937by device files served by system
938software on the terminal.  One of those files,
939.CW /dev/bitblt ,
940interprets messages written upon it as requests for actions
941corresponding to entry points in the graphics library:
942allocate a bitmap, execute a raster operation, draw a text string, etc.
943The window system
944acts as a multiplexer that mediates access to the services
945and resources of the terminal by simulating in each client window
946a set of files mirroring those provided by the system.
947That is, each window has a distinct
948.CW /dev/mouse ,
949.CW /dev/bitblt ,
950and so on through which applications drive graphical
951input and output.
952.PP
953One of the resources managed by
954.CW 8½
955and the terminal is the set of active
956.I subfonts.
957Each subfont holds the
958bitmaps and associated data structures for a sequential set of Unicode
959characters.
960Subfonts are stored in files and loaded into the terminal by
961.CW 8½
962or an application.
963For example, one subfont
964might hold the images of the first 256 characters of the Unicode space,
965corresponding to the Latin-1 character set;
966another might hold the standard phonetic character set, Unicode characters
967with value 0250 to 02E9.
968These files are collected in directories corresponding to typefaces:
969.CW /lib/font/bit/pelm
970contains the Pellucida Monospace character set, with subfonts holding
971the Latin-1, Greek, Cyrillic and other components of the typeface.
972A suffix on subfont files encodes (in a subfont-specific
973way) the size of the images:
974.CW /lib/font/bit/pelm/latin1.9
975contains the Latin-1 Pellucida Monospace characters with lower
976case letters 9 pixels high;
977.CW /lib/font/bit/jis/jis5400.16
978contains 16-pixel high
979ideographs starting at Unicode value 5400.
980.PP
981The subfonts do not identify which portion of the Unicode space
982they cover.  Instead, a
983font file, in plain text,
984describes how to assemble subfonts into a complete
985character set.
986The font file is presented as an argument to the window system
987to determine how plain text is displayed in text windows and
988applications.
989Here is the beginning of the font file
990.CW /lib/font/bit/pelm/jis.9.font ,
991which describes the layout of a font covering that portion of
992the Unicode Standard for which we have characters of typical
993display size, using Japanese characters
994to cover the Han space:
995.P1
99618	14
9970x0000	0x00FF	latin1.9
9980x0100	0x017E	latineur.9
9990x0250	0x02E9	ipa.9
10000x0386	0x03F5	greek.9
10010x0400	0x0475	cyrillic.9
10020x2000	0x2044	../misc/genpunc.9
10030x2070	0x208E	supsub.9
10040x20A0	0x20AA	currency.9
10050x2100	0x2138	../misc/letterlike.9
10060x2190	0x21EA	../misc/arrows
10070x2200	0x227F	../misc/math1
10080x2280	0x22F1	../misc/math2
10090x2300	0x232C	../misc/tech
10100x2500	0x257F	../misc/chart
10110x2600	0x266F	../misc/ding
1012.P2
1013.P1
10140x3000	0x303f	../jis/jis3000.16
10150x30a1	0x30fe	../jis/katakana.16
10160x3041	0x309e	../jis/hiragana.16
10170x4e00	0x4fff	../jis/jis4e00.16
10180x5000	0x51ff	../jis/jis5000.16
1019\&...
1020.P2
1021The first two numbers set the interline spacing of the font (18
1022pixels) and the distance from the baseline to the top of the
1023line (14 pixels).
1024When characters are displayed, they are placed so as best
1025to fit within those constraints; characters
1026too large to fit will be truncated.
1027The rest of the file associates subfont files
1028with portions of Unicode space.
1029The first four such files are in the Pellucida Monospace typeface
1030and directory; others reside in other directories.  The file names
1031are relative to the font file's own location.
1032.PP
1033There are several advantages to this two-level structure.
1034First, it simultaneously breaks the huge Unicode space into manageable
1035components and provides a unifying architecture for
1036assembling fonts from disjoint pieces.
1037Second, the structure promotes sharing.
1038For example, we have only one set of Japanese
1039characters but dozens of typefaces for the Latin-1 characters,
1040and this structure permits us to store only one copy of the
1041Japanese set but use it with any Roman typeface.
1042Also, customization is easy.
1043English-speaking users who don't need Japanese characters
1044but may want to read an on-line Oxford English Dictionary can
1045assemble a custom font with the
1046Latin-1 (or even just ASCII) characters and the International
1047Phonetic Alphabet (IPA).
1048Moreover, to do so requires just editing a plain text file,
1049not using a special font editing tool.
1050Finally, the structure guides the design of
1051caching protocols to improve performance and memory usage.
1052.PP
1053To load a complete Unicode character set into each application
1054would consume too
1055much memory and, particularly on slow terminal lines, would take
1056unreasonably long.
1057Instead, Plan 9 assembles a multi-level cache structure for
1058each font.
1059An application opens a font file, reads and parses it,
1060and allocates a data structure.
1061A message written to
1062.CW /dev/bitblt
1063allocates an associated structure held in the terminal, in particular,
1064a bitmap to act as a cache
1065for recently used character images.
1066Other messages copy these images to bitmaps such as the screen
1067by loading characters from subfonts into the cache on demand and
1068from there to the destination bitmap.
1069The protocol to draw characters is in terms of cache indices,
1070not Unicode character number or UTF sequences.
1071These details are hidden from the application, which instead
1072sees only a subroutine to draw a string in a bitmap from a
1073given font, functions to discover character size information,
1074and routines to allocate and to free fonts.
1075.PP
1076As needed, whole
1077subfonts are opened by the graphics library, read, and then downloaded
1078to the terminal.
1079They are held open by the library in an LRU-replacement list.
1080Even when the program closes a subfont, it is retained
1081in the terminal for later use.
1082When the application opens the subfont, it asks the terminal
1083if it already has a copy to avoid reading it from the file
1084server if possible.
1085This level of cache has the property that the bitmaps for, say,
1086all the Japanese characters are stored only once, in the terminal;
1087the applications read only size and width information from the terminal
1088and share the images.
1089.PP
1090The sizes of the character and subfont caches held by the
1091application are adaptive.
1092A simple algorithm monitors the cache miss rate to enlarge and
1093shrink the caches as required.
1094The size of the character cache is limited to 2048 images maximum,
1095which in practice seems enough even for Japanese text.
1096For plain ASCII-like text it naturally stays around 128 images.
1097.PP
1098This mechanism sounds complicated but is implemented by only about
1099500 lines in the library and considerably less in each of the
1100terminal's graphics driver and
1101.CW 8½ .
1102It has the advantage that only characters that are
1103being used are loaded into memory.
1104It is also efficient: if the characters being drawn
1105are in the cache the extra overhead is negligible.
1106It works particularly well for alphabetic character sets,
1107but also adapts on demand for ideographic sets.
1108When a user first looks at Japanese text, it takes a few
1109seconds to read all the font data, but thereafter the
1110text is drawn almost as fast as regular text (the images
1111are larger, so draw a little slower).
1112Also, because the bitmaps are remembered by the terminal,
1113if a second application then looks at Japanese text
1114it starts faster than the first.
1115.PP
1116We considered
1117building a `font server'
1118to cache character images and associated data
1119for the applications, the window system, and the terminal.
1120We rejected this design because, although isolating
1121many of the problems of font management into a separate program,
1122it didn't simplify the applications.
1123Moreover, in a distributed system such as Plan 9 it is easy
1124to have too many special purpose servers.
1125Making the management of the fonts the concern of only
1126the essential components simplifies the system and makes
1127bootstrapping less intricate.
1128.SH
1129Input
1130.PP
1131A completely different problem is how to type Unicode characters
1132as input to the system.
1133We selected an unused key on our ASCII keyboards
1134to serve as a prefix for multi-keystroke
1135sequences that generate Unicode characters.
1136For example, the character
1137.CW ü
1138is generated by the prefix key
1139(typically
1140.CW ALT
1141or
1142.CW Compose )
1143followed by a double quote and a lower-case
1144.CW u .
1145When that character is read by the application, from the file
1146.CW /dev/cons ,
1147it is of course presented as its UTF encoding.
1148Such sequences generate characters from an arbitrary set that
1149includes all of Latin-1 plus a selection of mathematical
1150and technical characters.
1151An arbitrary Unicode character may be generated by typing the prefix,
1152an upper case X, and four hexadecimal digits that identify
1153the Unicode value.
1154.PP
1155These simple mechanisms are adequate for most of our day-to-day needs:
1156it's easy to remember to type `ALT 1 2' for ½\^ or `ALT accent letter'
1157for accented Latin letters.
1158For the occasional unusual character, the cut and paste features of
1159.CW 8½
1160serve well.  A program called (perhaps misleadingly)
1161.CW unicode
1162takes as argument a hexadecimal value, and prints the UTF representation of that character,
1163which may then be picked up with the mouse and used as input.
1164.PP
1165These methods
1166are clearly unsatisfactory when working in a non-English language.
1167In the native country of such a language
1168the appropriate keyboard is likely to be at hand.
1169But it's also reasonable\(emespecially now that the system handles Unicode characters\(emto
1170work in a language foreign to the keyboard.
1171.PP
1172For alphabetic languages such as Greek or Russian, it is
1173straightforward to construct a program that does phonetic substitution,
1174so that, for example, typing a Latin `a' yields the Greek `α'.
1175Within Plan 9, such a program can be inserted transparently
1176between the real keyboard and a program such as the window system,
1177providing a manageable input device for such languages.
1178.PP
1179For ideographic languages such as Chinese or Japanese the problem is harder.
1180Native users of such languages have adopted methods for dealing with
1181Latin keyboards that involve a hybrid technique based on phonetics
1182to generate a list of possible symbols followed by menu selection to
1183choose the desired one.
1184Such methods can be
1185effective, but their design must be rooted in information about
1186the language unknown to non-native speakers.
1187.CW Cxterm , (
1188a Chinese terminal emulator built by and for
1189Chinese programmers,
1190employs such a technique
1191[Pong and Zhang].)
1192Although the technical problem of implementing such a device
1193is easy in Plan 9\(emit is just an elaboration of the technique for
1194alphabetic languages\(emour lack of familiarity with such languages
1195has restrained our enthusiasm for building one.
1196.PP
1197The input problem is technically the least interesting but perhaps
1198emotionally the most important of the problems of converting a system
1199to an international character set.
1200Beyond that remain the deeper problems of internationalization
1201such as multi-lingual error messages and command names,
1202problems we are not qualified to solve.
1203With the ability to treat text of most languages on an equal
1204footing, though, we can begin down that path.
1205Perhaps people in non-English speaking countries will
1206consider adopting Plan 9, solving the input problem locally\(emperhaps
1207just by plugging in their local terminals\(emand begin to use
1208a system with at least the capacity to be international.
1209.SH
1210Acknowledgements
1211.PP
1212Dennis Ritchie provided consultation and encouragement.
1213Bob Flandrena converted most of the standard tools to UTF.
1214Brian Kernighan suffered cheerfully with several
1215inadequate implementations and converted
1216.CW troff
1217to UTF.
1218Rich Drechsler converted his Postscript driver to UTF.
1219John Hobby built the Postscript ☺.
1220We thank them all.
1221.SH
1222References
1223.LP
1224[ANSIC] \f2American National Standard for Information Systems \-
1225Programming Language C\f1, American National Standards Institute, Inc.,
1226New York, 1990.
1227.LP
1228[ISO10646]
1229ISO/IEC DIS 10646-1:1993
1230\f2Information technology \-
1231Universal Multiple-Octet Coded Character Set (UCS) \(em
1232Part 1: Architecture and Basic Multilingual Plane\fP.
1233.LP
1234[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
1235``Plan 9 from Bell Labs'',
1236UKUUG Proc. of the Summer 1990 Conf.,
1237London, England,
12381990.
1239.LP
1240[Pike91] R. Pike, ``8½, The Plan 9 Window System'', USENIX Summer
1241Conf. Proc., Nashville, 1991, reprinted in this volume.
1242.LP
1243[Pike92] R. Pike, ``How to Use the Plan 9 C Compiler'', this volume.
1244.LP
1245[Pong and Zhang] Man-Chi Pong and Yongguang Zhang, ``cxterm:
1246A Chinese Terminal Emulator for the X Window System'',
1247.I
1248Software\(emPractice and Experience,
1249.R
1250Vol 22(1), 809-926, October 1992.
1251.LP
1252[Unicode]
1253\f2The Unicode Standard,
1254Worldwide Character Encoding,
1255Version 1.0, Volume 1\f1,
1256The Unicode Consortium,
1257Addison Wesley,
1258New York,
12591991.
1260