1.HTML "Hello World or Καλημέρα κόσμε or こんにちは 世界 2.TL 3Hello World 4.br 5or 6.br 7.ft R 8Καλημέρα κόσμε 9.ft 10.br 11or 12.br 13\f(Jpこんにちは 世界\fP 14.AU 15Rob Pike 16Ken Thompson 17.sp 18rob,ken@plan9.bell-labs.com 19.AB 20.FS 21Originally appeared, in a slightly different form, in 22.I 23Proc. of the Winter 1993 USENIX Conf., 24.R 25pp. 43-50, 26San Diego. 27It has been revised to reflect the move to 21-bit Unicode. 28.FE 29Plan 9 from Bell Labs has recently been converted from ASCII 30to an ASCII-compatible variant of the Unicode Standard, 31a 16-bit (now 21-bit) character set. 32In this paper we explain the reasons for the change, 33describe the character set and representation we chose, 34and present the programming models and software changes 35that support the new text format. 36Although we stopped short of full internationalization\(emfor 37example, system error messages are in Unixese, not Japanese\(emwe 38believe Plan 9 is the first system to treat the representation 39of all major languages on a uniform, equal footing throughout all its 40software. 41.AE 42.SH 43Introduction 44.PP 45The world is multilingual but most computer systems 46are based on English and ASCII. 47The first release of Plan 9 [Pike90], a new distributed operating 48system from Bell Laboratories, seemed a good occasion 49to correct this chauvinism. 50It is easier to make such deep changes when building new systems than 51by refitting old ones. 52.PP 53The ANSI C standard [ANSIC] contains some guidance on the matter of 54`wide' and `multi-byte' characters but falls far short of 55solving the myriad associated problems. 56We could find no literature on how to convert a 57.I system 58to larger character sets, although some individual 59.I programs 60had been converted. 61This paper reports what we discovered as we 62explored the problem of representing multilingual 63text at all levels of an operating system, 64from the file system and kernel through 65the applications and up to the window system 66and display. 67.PP 68Plan 9 has not been `internationalized': 69its manuals are in English, 70its error messages are in English, 71and it can display text that goes from left to right only. 72But before we can address these other problems, 73we need to handle, uniformly and comfortably, 74the textual representation of all the major written languages. 75That subproblem is richer than we had anticipated. 76.SH 77Standards 78.PP 79Our first step was to select a standard. 80At the time (January 1992), 81there were only two viable options: 82ISO 10646 [ISO10646] and Unicode [Unicode]. 83The documents describing both proposals were still in the draft stage. 84.PP 85The draft of ISO 10646 was not 86very attractive to us. 87It defined a sparse set of 32-bit characters, 88which would be 89hard to implement 90and have punitive storage requirements. 91Also, the draft attempted to 92mollify national interests by allocating 9316-bit subspaces to national committees 94to partition individually. 95The suggested mode of use was to 96``flip'' between separate national 97standards to implement the international standard. 98This did not strike us as a sound basis for a character set. 99As well, transmitting 32-bit values in a byte stream, 100such as in pipes, would be expensive and hard to implement. 101Since the standard does not define a byte order for such 102transmission, the byte stream would also have to carry 103state to enable the values to be recovered. 104.PP 105The Unicode Standard is a proposal by a consortium of mostly American 106computer companies formed 107to protest the technical 108failings of ISO 10646. 109It defines a uniform 16-bit code based on the 110principle of unification: 111two characters are the same if they look the 112same even though they are from different 113languages. 114This principle, called Han unification, 115allows the large Japanese, Chinese, and Korean 116character sets to be packed comfortably into a 16-bit representation. 117.PP 118We chose the Unicode Standard for its technical merits and because its 119code space was better defined. 120Moreover, 121the Unicode Consortium was derailing the 122ISO 10646 standard. 123(Now, in 1995, 124ISO 10646 is a standard 125with one 16-bit group defined, 126which is almost exactly the Unicode Standard. 127As most people expected, the two standards bodies 128reached a détente and 129ISO 10646 and Unicode represent the same character set.) 130.PP 131The Unicode Standard defines an adequate character set 132but an unreasonable representation. 133It states that all characters 134are 16 bits wide and are communicated and stored in 13516-bit units. 136It also reserves a pair of characters 137(hexadecimal FFFE and FEFF) to detect byte order 138in transmitted text, requiring state in the byte stream. 139(The Unicode Consortium was thinking of files, not pipes.) 140To adopt this encoding, 141we would have had to convert all text going 142into and out of Plan 9 between ASCII and Unicode, which cannot be done. 143Within a single program, in command of all its input and output, 144it is possible to define characters as 16-bit quantities; 145in the context of a networked system with 146hundreds of applications on diverse machines 147by different manufacturers, 148it is impossible. 149.PP 150We needed a way to adapt the Unicode Standard to the tools-and-pipes 151model of text processing embodied by the Unix system. 152To do that, we 153needed an ASCII-compatible textual 154representation of Unicode characters for transmission 155and storage. 156In the draft ISO standard there was an informative 157(non-required) 158Annex 159called UTF 160that provided a byte stream encoding 161of the 32-bit ISO code. 162The encoding uses multibyte sequences composed 163from the 190 printable characters of Latin-1 164to represent character values larger 165than 159. 166.PP 167The UTF encoding has several good properties. 168By far the most important is that 169a byte in the ASCII range 0-127 represents 170itself in UTF. 171Thus UTF is backward compatible with ASCII. 172.PP 173UTF has other advantages. 174It is a byte encoding and is 175therefore byte-order independent. 176ASCII control characters appear in the byte stream 177only as themselves, never as an element of a sequence 178encoding another character, 179so newline bytes separate lines of UTF text. 180Finally, ANSI C's 181.CW strcmp 182function applied to UTF strings preserves the ordering of Unicode characters. 183.PP 184To encode and decode UTF is expensive (involving multiplication, 185division, and modulo operations) but workable. 186UTF's major disadvantage is that the encoding 187is not self-synchronizing. 188It is in general impossible to find the character 189boundaries in a UTF string without reading from 190the beginning of the string, although in practice 191control characters such as newlines, 192tabs, and blanks provide synchronization points. 193.PP 194In August 1992, 195X-Open circulated a proposal for another UTF-like 196byte encoding of Unicode characters. 197Their major concern was that an embedded character 198in a file name 199(in particular a slash) 200could be part of an escape sequence in UTF and 201therefore confuse a traditional file system. 202Their proposal would allow all 7-bit ASCII characters 203to represent themselves 204.I "and only themselves" 205in text. 206Multibyte sequences would contain only characters 207with the high bit set. 208We proposed a modification to the new UTF that 209would address our synchronization problem. 210Our proposal, which was originally known informally as UTF-2 and FSS-UTF, 211is now referred to as UTF-8 and has been approved by ISO to become 212Annex P to ISO 10646. 213.PP 214The model for text in Plan 9 is chosen from these 215three standards*: 216.FS 217* ``That's the nice thing about standards\(emthere's so many to choose from.'' \- Andy Tannenbaum (no, the other one) 218.FE 219the Unicode character set encoded as a byte stream by 220UTF-8, from 221(soon to be) Annex P of ISO 10646. 222Although this mixture may seem like a precarious position for us to adopt, 223it is not as bad as it sounds. 224ISO 10646 and the Unicode Standard have converged, 225other systems such as Linux have adopted the same character set and encoding, 226and the general feeling seems to be that Unicode and UTF-8 will be accepted as the way 227to exchange text between systems. 228The prognosis for wide acceptance is good. 229.PP 230There are a couple of aspects of the Unicode Standard we have not faced. 231One is the issue of right-to-left text such as Hebrew or Arabic. 232Since that is an issue of display, not representation, we believe 233we can defer that problem for the moment without affecting our 234ability to solve it later. 235Another issue is diacriticals and `combining characters', 236which cause overstriking of multiple Unicode characters. 237Although necessary for some scripts, such as Thai, Arabic, and Hebrew, 238such characters confuse the issues for Latin languages because they 239generate multiple representations for accented characters. 240ISO 10646 describes three levels of implementation; 241in Plan 9 we decided not to address the issue. 242Again, this can be labeled as a display issue and its finer points are still being debated, 243so we felt comfortable deferring. Mañana. 244.PP 245Although we converted Plan 9 in the altruistic interests of 246serving foreign languages, we have found the large character 247set attractive for other reasons. The Unicode Standard includes many 248characters\(emmathematical symbols, scientific notation, 249more general punctuation, and more\(emthat we now use 250daily in our work. We no longer test our imaginations 251to find ways to include non-ASCII symbols in our text; 252why type 253.CW :-) 254when you can use the character ☺? 255Most compelling is the ability to absorb documents 256and data that contain non-ASCII characters; our browser for the 257Oxford English Dictionary 258lets us see the dictionary as it really is, with pronunciation 259in the IPA font, foreign phrases properly rendered, and so on, 260.I "in plain text. 261.PP 262As of Unicode 4.0, 263characters are now 21 bits wide and the longest UTF-8 encoding of a character 264requires 4 bytes. 265We are adapting the system to match. 266.PP 267In the rest of this paper, except when 268stated otherwise, the term `UTF' refers to the UTF-8 encoding 269of Unicode characters as adopted by Plan 9. 270.SH 271C Compiler 272.PP 273The first program to be converted to UTF 274was the C Compiler. 275There are two levels of conversion. 276On the syntactic level, 277input to the C compiler 278is UTF; on the semantic level, 279the C language needs to define 280how compiled programs manipulate 281the UTF set. 282.PP 283The syntactic part is simple. 284The ANSI C language standard defines the 285source character set to be ASCII. 286Since UTF is backward compatible with ASCII, 287the compiler needs little change. 288The only places where a larger character set 289is allowed are in character constants, strings, and comments. 290Since 7-bit ASCII characters can represent only 291themselves in UTF, 292the compiler does not have to be careful while looking 293for the termination of a string or comment. 294.PP 295The Plan 9 compiler extends ANSI C to treat any Unicode 296character with a value outside of the ASCII range as 297an alphabetic. 298To a Greek programmer or an English mathematician, 299α is a sensible and now valid variable name. 300.PP 301On the semantic level, ANSI C allows, 302but does not tie down, 303the notion of a 304.I "wide character 305and admits string and character constants 306of this type. 307We chose the wide character type to be 308.CW unsigned 309.CW short 310(now 311.CW unsigned 312.CW long) . 313In the libraries, the word 314.CW Rune 315is now defined by a 316.CW typedef 317to be equivalent to 318.CW unsigned 319.CW long 320and is 321used to signify a Unicode character. 322.PP 323There are surprises; for example: 324.P1 325L'x' \f1is 120\fP 326\&'x' \f1is 120\fP 327L'ÿ' \f1is 255\fP 328\&'ÿ' \f1is -1, stdio \fPEOF\f1 (if \fPchar\f1 is signed)\fP 329L'\f1α\fP' \f1is 945\fP 330\&'\f1α\fP' \f1is illegal\fP 331.P2 332In the string constants, 333.P1 334"\f(Jpこんにちは 世界\fP" 335L"\f(Jpこんにちは 世界\fP", 336.P2 337the former is an array of 338.CW chars 339with 22 elements 340and a null byte, 341while the latter is an array of 342.CW unsigned 343.CW long s 344.CW Runes ) ( 345with 8 elements and a null 346.CW Rune . 347.PP 348The Plan 9 library provides an output conversion function, 349.CW print 350(analogous to 351.CW printf ), 352with formats 353.CW %c , 354.CW %C , 355.CW %s , 356and 357.CW %S . 358Since 359.CW print 360produces text, its output is always UTF. 361The character conversion 362.CW %c 363(lower case) masks its argument 364to 8 bits before converting to UTF. 365Thus 366.CW L'ÿ' 367and 368.CW 'ÿ' 369printed under 370.CW %c 371will be identical, 372but 373.CW L'\f1α\fP' 374will print as the Unicode 375character with decimal value 177. 376The character conversion 377.CW %C 378(upper case) masks its argument 379to 16 bits before converting to UTF. 380Thus 381.CW L'ÿ' 382and 383.CW L'\f1α\fP' 384will print correctly under 385.CW %C , 386but 387.CW 'ÿ' 388will not. 389The conversion 390.CW %s 391(lower case) 392expects a pointer to 393.CW char 394and copies UTF sequences up to a null byte. 395The conversion 396.CW %S 397(upper case) expects a pointer to 398.CW Rune 399and 400performs sequential 401.CW %C 402conversions until a null 403.CW Rune 404is encountered. 405.PP 406Another problem in format conversion 407is the definition of 408.CW %10s : 409does the number refer to bytes or characters? 410We decided that such formats were most 411often used to align output columns and 412so made the number count characters. 413Some programs, however, use the count 414to place blank-padded strings 415in fixed-sized arrays. 416These programs must be found and corrected. 417.PP 418Here is a complete example: 419.P1 420#include <u.h> 421 422char c[] = "\f(Jpこんにちは 世界\fP"; 423Rune s[] = L"\f(Jpこんにちは 世界\fP"; 424 425main(void) 426{ 427 print("%d, %d\en", sizeof(c), sizeof(s)); 428 print("%s\en", c); 429 print("%S\en", s); 430} 431.P2 432.PP 433This program prints 434.CW 23, 435.CW 18 436and then two identical lines of 437UTF text. 438In practice, 439.CW %S 440and 441.CW L"..." 442are rare in programs; one reason is 443that most formatted I/O is done in unconverted UTF. 444.SH 445Ramifications 446.PP 447All programs in Plan 9 now read and write text as UTF, not ASCII. 448This change breaks two deep-rooted symmetries implicit in most C programs: 449.IP 1. 450A character is no longer a 451.CW char . 452.IP 2. 453The internal representation (Rune) of a character now differs from its 454external representation (UTF). 455.PP 456In the sections that follow, 457we show how these issues were faced in the layers of 458system software from the operating system up to the applications. 459The effects are wide-reaching and often surprising. 460.SH 461Operating system 462.PP 463Since UTF is the only format for text in Plan 9, 464the interface to the operating system had to be converted to UTF. 465Text strings cross the interface in several places: 466command arguments, 467file names, 468user names (people can log in using their native name), 469error messages, 470and miscellaneous minor places such as commands to the I/O system. 471Little change was required: null-terminated UTF strings 472are equivalent to null-terminated ASCII strings for most purposes 473of the operating system. 474The library routines described in the next section made that 475change straightforward. 476.PP 477The window system, once called 478.CW 8.5 , 479is now rightfully called 480.CW 8½ . 481.SH 482Libraries 483.PP 484A header file included by all programs (see [Pike92]) declares 485the 486.CW Rune 487type to hold 21-bit character values: 488.P1 489typedef unsigned long Rune; 490.P2 491Also defined are several constants relevant to UTF: 492.P1 493enum 494{ 495 UTFmax = 4, /* maximum bytes per rune */ 496 Runesync = 0x80, /* cannot be in a UTF sequence (<) */ 497 Runeself = 0x80, /* rune==UTF sequence (<) */ 498 Runeerror = 0xFFFD, /* decoding error in UTF */ 499 Runemax = 0x10FFFF, /* largest 21-bit rune */ 500 Runemask = 0x1FFFFF, /* bits used by runes (see grep) */ 501}; 502.P2 503(With the original UTF, 504.CW Runesync 505was hexadecimal 21 and 506.CW Runeself 507was A0.) 508.CW UTFmax 509bytes are sufficient 510to hold the UTF encoding of any Unicode character. 511Characters of value less than 512.CW Runesync 513only appear in a UTF string as 514themselves, never as part of a sequence encoding another character. 515Characters of value less than 516.CW Runeself 517encode into single bytes 518of the same value. 519Finally, when the library detects errors in UTF input\(embyte sequences 520that are not valid UTF sequences\(emit converts the first byte of the 521error sequence to the character 522.CW Runeerror . 523There is little a rune-oriented program can do when given bad data 524except exit, which is unreasonable, or carry on. 525Originally the conversion routines, described below, 526returned errors when given invalid UTF, 527but we found ourselves repeatedly checking for errors and ignoring them. 528We therefore decided to convert a bad sequence to a valid rune 529and continue processing. 530(The ANSI C routines, on the other hand, return errors.) 531.PP 532This technique does have the unfortunate property that converting 533invalid UTF byte strings in and out of runes does not preserve the input, 534but this circumstance only occurs when non-textual input is 535given to a textual program. 536The Unicode Standard defines an error character, value FFFD, to stand for 537characters from other sets that it does not represent. 538The 539.CW Runeerror 540character is a different concept, related to the encoding rather than the character set. 541.PP 542The Plan 9 C library contains a number of routines for 543manipulating runes. 544The first set converts between runes and UTF strings: 545.P1 546extern int runetochar(char*, Rune*); 547extern int chartorune(Rune*, char*); 548extern int runelen(long); 549extern int fullrune(char*, int); 550.P2 551.CW Runetochar 552translates a single 553.CW Rune 554to a UTF sequence and returns the number of bytes produced. 555.CW Chartorune 556goes the other way, reporting how many bytes were consumed. 557.CW Runelen 558returns the number of bytes in the UTF encoding of a rune. 559.CW Fullrune 560examines a UTF string up to a specified number of bytes 561and reports whether the string begins with a complete UTF encoding. 562All these routines use the 563.CW Runeerror 564character to work around encoding problems. 565.PP 566There is also a set of routines for examining null-terminated UTF strings, 567based on the model of the ANSI standard 568.CW str 569routines, but with 570.CW utf 571substituted for 572.CW str 573and 574.CW rune 575for 576.CW chr : 577.P1 578extern int utflen(char*); 579extern char* utfrune(char*, long); 580extern char* utfrrune(char*, long); 581extern char* utfutf(char*, char*); 582.P2 583.CW Utflen 584returns the number of runes in a UTF string; 585.CW utfrune 586returns a pointer to the first occurrence of a rune in a UTF string; 587and 588.CW utfrrune 589a pointer to the last. 590.CW Utfutf 591searches for the first occurrence of a UTF string in another UTF string. 592Given the synchronizing property of UTF-8, 593.CW utfutf 594is the same as 595.CW strstr 596if the arguments point to valid UTF strings. 597.PP 598It is a mistake to use 599.CW strchr 600or 601.CW strrchr 602unless searching for a 7-bit ASCII character, that is, a character 603less than 604.CW Runeself . 605.PP 606We have no routines for manipulating null-terminated arrays of 607.CW Runes . 608Although they should probably exist for completeness, we have 609found no need for them, for the same reason that 610.CW %S 611and 612.CW L"..." 613are rarely used. 614.PP 615Most Plan 9 programs use a new buffered I/O library, BIO, in place of 616Standard I/O. 617BIO contains routines to read and write UTF streams, converting to and from 618runes. 619.CW Bgetrune 620returns, as a 621.CW Rune 622within a 623.CW long , 624the next character in the UTF input stream; 625.CW Bputrune 626takes a rune and writes its UTF representation. 627.CW Bungetrune 628puts a rune back into the input stream for rereading. 629.PP 630Plan 9 programs use a simple set of macros to process command line arguments. 631Converting these macros to UTF automatically updated the 632argument processing of most programs. 633In general, 634argument flag names can no longer be held in bytes and 635arrays of 256 bytes cannot be used to hold a set of flags. 636.PP 637We have done nothing analogous to ANSI C's locales, partly because 638we do not feel qualified to define locales and partly because we remain 639unconvinced of that model for dealing with the problems. 640That is really more an issue of internationalization than conversion 641to a larger character set; on the other hand, 642because we have chosen a single character set that encompasses 643most languages, some of the need for 644locales is eliminated. 645(We have a utility, 646.CW tcs , 647that translates between UTF and other character sets.) 648.PP 649There are several reasons why our library does not follow the ANSI design 650for wide and multi-byte characters. 651The ANSI model was designed by a committee, untried, almost 652as an afterthought, whereas 653we wanted to design as we built. 654(We made several major changes to the interface 655as we became familiar with the problems involved.) 656We disagree with ANSI C's handling of invalid multi-byte sequences. 657Also, the ANSI C library is incomplete: 658although it contains some crucial routines for handling 659wide and multi-byte characters, there are some serious omissions. 660For example, our software can exploit 661the fact that UTF preserves ASCII characters in the byte stream. 662We could remove that assumption by replacing all 663calls to 664.CW strchr 665with 666.CW utfrune 667and so on. 668(Because of the weaker properties of the original UTF, 669we have actually done so.) 670ANSI C cannot: 671the standard says nothing about the representation, so portable code should 672.I never 673call 674.CW strchr , 675yet there is no ANSI equivalent to 676.CW utfrune . 677ANSI C simultaneously invalidates 678.CW strchr 679and offers no replacement. 680.PP 681Finally, ANSI did nothing to integrate wide characters 682into the I/O system: it gives no method for printing 683wide characters. 684We therefore needed to invent some things and decided to invent 685everything. 686In the end, some of our entry points do correspond closely to 687ANSI routines\(emfor example 688.CW chartorune 689and 690.CW runetochar 691are similar to 692.CW mbtowc 693and 694.CW wctomb \(embut 695Plan 9's library defines more functionality, enough 696to write real applications comfortably. 697.SH 698Converting the tools 699.PP 700The source for our tools and applications had already been converted to 701work with Latin-1, so it was `8-bit safe', but the conversion to the Unicode 702Standard and UTF is more involved. 703Some programs needed no change at all: 704.CW cat , 705for instance, 706interprets its argument strings, delivered in UTF, 707as file names that it passes uninterpreted to the 708.CW open 709system call, 710and then just copies bytes from its input to its output; 711it never makes decisions based on the values of the bytes. 712(Plan 9 713.CW cat 714has no options such as 715.CW -v 716to complicate matters.) 717Most programs, however, needed modest change. 718.PP 719It is difficult to 720find automatically the places that need attention, 721but 722.CW grep 723helps. 724Software that uses the libraries conscientiously can be searched 725for calls to library routines that examine bytes as characters: 726.CW strchr , 727.CW strrchr , 728.CW strstr , 729etc. 730Replacing these by calls to 731.CW utfrune , 732.CW utfrrune , 733and 734.CW utfutf 735is enough to fix many programs. 736Few tools actually need to operate on runes internally; 737more typically they need only to look for the final slash in a file 738name and similar trivial tasks. 739Of the 170 C source programs in the top levels of 740.CW /sys/src/cmd , 741only 23 now contain the word 742.CW Rune . 743.PP 744The programs that 745.I do 746store runes internally 747are mostly those whose 748.I raison 749.I d'être 750is character manipulation: 751.CW sam 752(the text editor), 753.CW sed , 754.CW sort , 755.CW tr , 756.CW troff , 757.CW 8½ 758(the window system and terminal emulator), 759and so on. 760To decide whether to compute using runes 761or UTF-encoded byte strings requires balancing the cost of converting 762the data when read and written 763against the cost of converting relevant text on demand. 764For programs such as editors that run a long time with a relatively 765constant dataset, runes are the better choice. 766There are space considerations too, but they are more complicated: 767plain ASCII text grows when converted to runes; UTF-encoded Japanese 768shrinks. 769.PP 770Again, it is hard to automate the conversion of a program from 771.CW chars 772to 773.CW Runes . 774It is not enough just to change the type of variables; the assumption 775that bytes and characters are equivalent can be insidious. 776For instance, to clear a character array by 777.P1 778memset(buf, 0, BUFSIZE) 779.P2 780becomes wrong if 781.CW buf 782is changed from an array of 783.CW chars 784to an array of 785.CW Runes . 786Any program that indexes tables based on character values needs 787rethinking. 788Consider 789.CW tr , 790which originally used multiple 256-byte arrays for the mapping. 791The naïve conversion would yield multiple 1,114,112-rune arrays. 792Instead Plan 9 793.CW tr 794saves space by building in effect 795a run-encoded version of the map. 796.PP 797.CW Sort 798has related problems. 799The cooperation of UTF and 800.CW strcmp 801means that a simple sort\(emone with no options\(emcan be done 802on the original UTF strings using 803.CW strcmp . 804With sorting options enabled, however, 805.CW sort 806may need to convert its input to runes: for example, 807option 808.CW -t\f1α\fP 809requires searching for alphas in the input text to 810crack the input into fields. 811The field specifier 812.CW +3.2 813refers to 2 runes beyond the third field. 814Some of the other options are hopelessly provincial: 815consider the case-folding and dictionary order options 816(Japanese doesn't even have an official dictionary order) or 817.CW -M 818which compares by case-insensitive English month name. 819Handling these options involves the 820larger issues of internationalization and is beyond the scope 821of this paper and our expertise. 822Plan 9 823.CW sort 824works sensibly with options that make sense relative to the input. 825The simple and most important options are, however, usually meaningful. 826In particular, 827.CW sort 828sorts UTF into the same order that 829.CW look 830expects. 831.PP 832Regular expression-matching algorithms need rethinking to 833be applied to UTF text. 834Deterministic automata are usually applied to bytes; 835converting them to operate on variable-sized byte sequences is awkward. 836On the other hand, converting the input stream to runes adds measurable 837expense 838and the state tables expand 839from size 256 to 1,114,112; it can be expensive just to generate them. 840For simple string searching, 841the Boyer-Moore algorithm works with UTF provided the input is 842guaranteed to be only valid UTF strings; however, it does not work 843with the old UTF encoding. 844At a more mundane level, even character classes are harder: 845the usual bit-vector representation within a non-deterministic automaton 846is unwieldy with 1,114,112 characters in the alphabet. 847.PP 848We compromised. 849An existing library for compiling and executing regular expressions 850was adapted to work on runes, with two entry points for searching 851in arrays of runes and arrays of chars (the pattern is always UTF text). 852Character classes are represented internally as runs of runes; 853the reserved value 854.CW FFFF 855marks the end of the class. 856Then 857.I all 858utilities that use regular expressions\(emeditors, 859.CW grep , 860.CW awk , 861etc.\(emexcept the shell, whose notation 862was grandfathered, were converted to use the library. 863For some programs, there was a concomitant loss of performance, 864but there was also a strong advantage. 865To our knowledge, Plan 9 is the only Unix-like system 866that has a single definition and implementation of 867regular expressions; patterns are written and interpreted 868identically by all the programs in the system. 869.PP 870A handful of programs have the notion of character built into them 871so strongly as to confuse the issue of what they should do with UTF input. 872Such programs were treated as individual special cases. 873For example, 874.CW wc 875is, by default, unchanged in behavior and output; a new option, 876.CW -r , 877counts the number of correctly encoded runes\(emvalid UTF sequences\(emin 878its input; 879.CW -b 880the number of invalid sequences. 881.PP 882It took us several months to convert all the software in the system 883to the Unicode Standard and the old UTF. 884When we decided to convert from that to the new UTF, 885only three things needed to be done. 886First, we rewrote the library routines to encode and decode the 887new UTF. This took an evening. 888Next, we converted all the files containing UTF 889to the new encoding. 890We wrote a trivial program to look for non-ASCII bytes in 891text files and used a Plan 9 program called 892.CW tcs 893(translate character set) to change encodings. 894Finally, we recompiled all the system software; 895the library interface was unchanged, so recompilation was sufficient 896to effect the transformation. 897The second two steps were done concurrently and took an afternoon. 898We concluded that the actual encoding is relatively unimportant to the 899software; the adoption of large characters and a byte-stream encoding 900.I per 901.I se 902are much deeper issues. 903.SH 904Graphics and fonts 905.PP 906Plan 9 provides only minimal support for plain text terminals. 907It is instead designed to be used with all character input and 908output mediated by a window system such as 909.CW 8½ . 910The window system and related software are responsible for the 911display of UTF text as Unicode character images. 912For plain text, the window system must provide a user-settable 913.I font 914that provides a (possibly empty) picture for each Unicode character. 915Fancier applications that use bold and Italic characters 916need multiple fonts storing multiple pictures for each 917Unicode value. 918All the issues are apparent, though, 919in just the problem of 920displaying a single image for each character, that is, the 921Unicode equivalent of a plain text terminal. 922With 128 or even 256 characters, a font can be just 923an array of bitmaps. With 1,114,112 characters, 924a more sophisticated design is necessary. To store the ideographs 925for just Japanese as 16×16×1 bit images, 926the smallest they can reasonably be, takes over a quarter of a 927megabyte. Make the images a little larger, store more bits per 928pixel, and hold a copy in every running application, and the 929memory cost becomes unreasonable. 930.PP 931The structure of the bitmap graphics services is described at length elsewhere 932[Pike91]. 933In summary, the memory holding the bitmaps is stored in the same machine that has 934the display, mouse, and keyboard: the terminal in Plan 9 terminology, 935the workstation in others'. 936Access to that memory and associated services is provided 937by device files served by system 938software on the terminal. One of those files, 939.CW /dev/bitblt , 940interprets messages written upon it as requests for actions 941corresponding to entry points in the graphics library: 942allocate a bitmap, execute a raster operation, draw a text string, etc. 943The window system 944acts as a multiplexer that mediates access to the services 945and resources of the terminal by simulating in each client window 946a set of files mirroring those provided by the system. 947That is, each window has a distinct 948.CW /dev/mouse , 949.CW /dev/bitblt , 950and so on through which applications drive graphical 951input and output. 952.PP 953One of the resources managed by 954.CW 8½ 955and the terminal is the set of active 956.I subfonts. 957Each subfont holds the 958bitmaps and associated data structures for a sequential set of Unicode 959characters. 960Subfonts are stored in files and loaded into the terminal by 961.CW 8½ 962or an application. 963For example, one subfont 964might hold the images of the first 256 characters of the Unicode space, 965corresponding to the Latin-1 character set; 966another might hold the standard phonetic character set, Unicode characters 967with value 0250 to 02E9. 968These files are collected in directories corresponding to typefaces: 969.CW /lib/font/bit/pelm 970contains the Pellucida Monospace character set, with subfonts holding 971the Latin-1, Greek, Cyrillic and other components of the typeface. 972A suffix on subfont files encodes (in a subfont-specific 973way) the size of the images: 974.CW /lib/font/bit/pelm/latin1.9 975contains the Latin-1 Pellucida Monospace characters with lower 976case letters 9 pixels high; 977.CW /lib/font/bit/jis/jis5400.16 978contains 16-pixel high 979ideographs starting at Unicode value 5400. 980.PP 981The subfonts do not identify which portion of the Unicode space 982they cover. Instead, a 983font file, in plain text, 984describes how to assemble subfonts into a complete 985character set. 986The font file is presented as an argument to the window system 987to determine how plain text is displayed in text windows and 988applications. 989Here is the beginning of the font file 990.CW /lib/font/bit/pelm/jis.9.font , 991which describes the layout of a font covering that portion of 992the Unicode Standard for which we have characters of typical 993display size, using Japanese characters 994to cover the Han space: 995.P1 99618 14 9970x0000 0x00FF latin1.9 9980x0100 0x017E latineur.9 9990x0250 0x02E9 ipa.9 10000x0386 0x03F5 greek.9 10010x0400 0x0475 cyrillic.9 10020x2000 0x2044 ../misc/genpunc.9 10030x2070 0x208E supsub.9 10040x20A0 0x20AA currency.9 10050x2100 0x2138 ../misc/letterlike.9 10060x2190 0x21EA ../misc/arrows 10070x2200 0x227F ../misc/math1 10080x2280 0x22F1 ../misc/math2 10090x2300 0x232C ../misc/tech 10100x2500 0x257F ../misc/chart 10110x2600 0x266F ../misc/ding 1012.P2 1013.P1 10140x3000 0x303f ../jis/jis3000.16 10150x30a1 0x30fe ../jis/katakana.16 10160x3041 0x309e ../jis/hiragana.16 10170x4e00 0x4fff ../jis/jis4e00.16 10180x5000 0x51ff ../jis/jis5000.16 1019\&... 1020.P2 1021The first two numbers set the interline spacing of the font (18 1022pixels) and the distance from the baseline to the top of the 1023line (14 pixels). 1024When characters are displayed, they are placed so as best 1025to fit within those constraints; characters 1026too large to fit will be truncated. 1027The rest of the file associates subfont files 1028with portions of Unicode space. 1029The first four such files are in the Pellucida Monospace typeface 1030and directory; others reside in other directories. The file names 1031are relative to the font file's own location. 1032.PP 1033There are several advantages to this two-level structure. 1034First, it simultaneously breaks the huge Unicode space into manageable 1035components and provides a unifying architecture for 1036assembling fonts from disjoint pieces. 1037Second, the structure promotes sharing. 1038For example, we have only one set of Japanese 1039characters but dozens of typefaces for the Latin-1 characters, 1040and this structure permits us to store only one copy of the 1041Japanese set but use it with any Roman typeface. 1042Also, customization is easy. 1043English-speaking users who don't need Japanese characters 1044but may want to read an on-line Oxford English Dictionary can 1045assemble a custom font with the 1046Latin-1 (or even just ASCII) characters and the International 1047Phonetic Alphabet (IPA). 1048Moreover, to do so requires just editing a plain text file, 1049not using a special font editing tool. 1050Finally, the structure guides the design of 1051caching protocols to improve performance and memory usage. 1052.PP 1053To load a complete Unicode character set into each application 1054would consume too 1055much memory and, particularly on slow terminal lines, would take 1056unreasonably long. 1057Instead, Plan 9 assembles a multi-level cache structure for 1058each font. 1059An application opens a font file, reads and parses it, 1060and allocates a data structure. 1061A message written to 1062.CW /dev/bitblt 1063allocates an associated structure held in the terminal, in particular, 1064a bitmap to act as a cache 1065for recently used character images. 1066Other messages copy these images to bitmaps such as the screen 1067by loading characters from subfonts into the cache on demand and 1068from there to the destination bitmap. 1069The protocol to draw characters is in terms of cache indices, 1070not Unicode character number or UTF sequences. 1071These details are hidden from the application, which instead 1072sees only a subroutine to draw a string in a bitmap from a 1073given font, functions to discover character size information, 1074and routines to allocate and to free fonts. 1075.PP 1076As needed, whole 1077subfonts are opened by the graphics library, read, and then downloaded 1078to the terminal. 1079They are held open by the library in an LRU-replacement list. 1080Even when the program closes a subfont, it is retained 1081in the terminal for later use. 1082When the application opens the subfont, it asks the terminal 1083if it already has a copy to avoid reading it from the file 1084server if possible. 1085This level of cache has the property that the bitmaps for, say, 1086all the Japanese characters are stored only once, in the terminal; 1087the applications read only size and width information from the terminal 1088and share the images. 1089.PP 1090The sizes of the character and subfont caches held by the 1091application are adaptive. 1092A simple algorithm monitors the cache miss rate to enlarge and 1093shrink the caches as required. 1094The size of the character cache is limited to 2048 images maximum, 1095which in practice seems enough even for Japanese text. 1096For plain ASCII-like text it naturally stays around 128 images. 1097.PP 1098This mechanism sounds complicated but is implemented by only about 1099500 lines in the library and considerably less in each of the 1100terminal's graphics driver and 1101.CW 8½ . 1102It has the advantage that only characters that are 1103being used are loaded into memory. 1104It is also efficient: if the characters being drawn 1105are in the cache the extra overhead is negligible. 1106It works particularly well for alphabetic character sets, 1107but also adapts on demand for ideographic sets. 1108When a user first looks at Japanese text, it takes a few 1109seconds to read all the font data, but thereafter the 1110text is drawn almost as fast as regular text (the images 1111are larger, so draw a little slower). 1112Also, because the bitmaps are remembered by the terminal, 1113if a second application then looks at Japanese text 1114it starts faster than the first. 1115.PP 1116We considered 1117building a `font server' 1118to cache character images and associated data 1119for the applications, the window system, and the terminal. 1120We rejected this design because, although isolating 1121many of the problems of font management into a separate program, 1122it didn't simplify the applications. 1123Moreover, in a distributed system such as Plan 9 it is easy 1124to have too many special purpose servers. 1125Making the management of the fonts the concern of only 1126the essential components simplifies the system and makes 1127bootstrapping less intricate. 1128.SH 1129Input 1130.PP 1131A completely different problem is how to type Unicode characters 1132as input to the system. 1133We selected an unused key on our ASCII keyboards 1134to serve as a prefix for multi-keystroke 1135sequences that generate Unicode characters. 1136For example, the character 1137.CW ü 1138is generated by the prefix key 1139(typically 1140.CW ALT 1141or 1142.CW Compose ) 1143followed by a double quote and a lower-case 1144.CW u . 1145When that character is read by the application, from the file 1146.CW /dev/cons , 1147it is of course presented as its UTF encoding. 1148Such sequences generate characters from an arbitrary set that 1149includes all of Latin-1 plus a selection of mathematical 1150and technical characters. 1151An arbitrary Unicode character may be generated by typing the prefix, 1152an upper case X, and four hexadecimal digits that identify 1153the Unicode value. 1154.PP 1155These simple mechanisms are adequate for most of our day-to-day needs: 1156it's easy to remember to type `ALT 1 2' for ½\^ or `ALT accent letter' 1157for accented Latin letters. 1158For the occasional unusual character, the cut and paste features of 1159.CW 8½ 1160serve well. A program called (perhaps misleadingly) 1161.CW unicode 1162takes as argument a hexadecimal value, and prints the UTF representation of that character, 1163which may then be picked up with the mouse and used as input. 1164.PP 1165These methods 1166are clearly unsatisfactory when working in a non-English language. 1167In the native country of such a language 1168the appropriate keyboard is likely to be at hand. 1169But it's also reasonable\(emespecially now that the system handles Unicode characters\(emto 1170work in a language foreign to the keyboard. 1171.PP 1172For alphabetic languages such as Greek or Russian, it is 1173straightforward to construct a program that does phonetic substitution, 1174so that, for example, typing a Latin `a' yields the Greek `α'. 1175Within Plan 9, such a program can be inserted transparently 1176between the real keyboard and a program such as the window system, 1177providing a manageable input device for such languages. 1178.PP 1179For ideographic languages such as Chinese or Japanese the problem is harder. 1180Native users of such languages have adopted methods for dealing with 1181Latin keyboards that involve a hybrid technique based on phonetics 1182to generate a list of possible symbols followed by menu selection to 1183choose the desired one. 1184Such methods can be 1185effective, but their design must be rooted in information about 1186the language unknown to non-native speakers. 1187.CW Cxterm , ( 1188a Chinese terminal emulator built by and for 1189Chinese programmers, 1190employs such a technique 1191[Pong and Zhang].) 1192Although the technical problem of implementing such a device 1193is easy in Plan 9\(emit is just an elaboration of the technique for 1194alphabetic languages\(emour lack of familiarity with such languages 1195has restrained our enthusiasm for building one. 1196.PP 1197The input problem is technically the least interesting but perhaps 1198emotionally the most important of the problems of converting a system 1199to an international character set. 1200Beyond that remain the deeper problems of internationalization 1201such as multi-lingual error messages and command names, 1202problems we are not qualified to solve. 1203With the ability to treat text of most languages on an equal 1204footing, though, we can begin down that path. 1205Perhaps people in non-English speaking countries will 1206consider adopting Plan 9, solving the input problem locally\(emperhaps 1207just by plugging in their local terminals\(emand begin to use 1208a system with at least the capacity to be international. 1209.SH 1210Acknowledgements 1211.PP 1212Dennis Ritchie provided consultation and encouragement. 1213Bob Flandrena converted most of the standard tools to UTF. 1214Brian Kernighan suffered cheerfully with several 1215inadequate implementations and converted 1216.CW troff 1217to UTF. 1218Rich Drechsler converted his Postscript driver to UTF. 1219John Hobby built the Postscript ☺. 1220We thank them all. 1221.SH 1222References 1223.LP 1224[ANSIC] \f2American National Standard for Information Systems \- 1225Programming Language C\f1, American National Standards Institute, Inc., 1226New York, 1990. 1227.LP 1228[ISO10646] 1229ISO/IEC DIS 10646-1:1993 1230\f2Information technology \- 1231Universal Multiple-Octet Coded Character Set (UCS) \(em 1232Part 1: Architecture and Basic Multilingual Plane\fP. 1233.LP 1234[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey, 1235``Plan 9 from Bell Labs'', 1236UKUUG Proc. of the Summer 1990 Conf., 1237London, England, 12381990. 1239.LP 1240[Pike91] R. Pike, ``8½, The Plan 9 Window System'', USENIX Summer 1241Conf. Proc., Nashville, 1991, reprinted in this volume. 1242.LP 1243[Pike92] R. Pike, ``How to Use the Plan 9 C Compiler'', this volume. 1244.LP 1245[Pong and Zhang] Man-Chi Pong and Yongguang Zhang, ``cxterm: 1246A Chinese Terminal Emulator for the X Window System'', 1247.I 1248Software\(emPractice and Experience, 1249.R 1250Vol 22(1), 809-926, October 1992. 1251.LP 1252[Unicode] 1253\f2The Unicode Standard, 1254Worldwide Character Encoding, 1255Version 1.0, Volume 1\f1, 1256The Unicode Consortium, 1257Addison Wesley, 1258New York, 12591991. 1260