1.TL 2Hello World 3.br 4or 5.br 6.ft R 7Καλημέρα κόσμε 8.ft 9.br 10or 11.br 12\f(Jpこんにちは 世界\fP 13.AU 14Rob Pike 15Ken Thompson 16.sp 17rob,ken@plan9.att.com 18.AB 19.FS 20Originally appeared, in a slightly different form, in 21.I 22Proc. of the Winter 1993 USENIX Conf., 23.R 24pp. 43-50, 25San Diego 26.FE 27Plan 9 from Bell Labs has recently been converted from ASCII 28to an ASCII-compatible variant of the Unicode Standard, a 16-bit character set. 29In this paper we explain the reasons for the change, 30describe the character set and representation we chose, 31and present the programming models and software changes 32that support the new text format. 33Although we stopped short of full internationalization\(emfor 34example, system error messages are in Unixese, not Japanese\(emwe 35believe Plan 9 is the first system to treat the representation 36of all major languages on a uniform, equal footing throughout all its 37software. 38.AE 39.SH 40Introduction 41.PP 42The world is multilingual but most computer systems 43are based on English and ASCII. 44The first release of Plan 9 [Pike90], a new distributed operating 45system from Bell Laboratories, seemed a good occasion 46to correct this chauvinism. 47It is easier to make such deep changes when building new systems than 48by refitting old ones. 49.PP 50The ANSI C standard [ANSIC] contains some guidance on the matter of 51`wide' and `multi-byte' characters but falls far short of 52solving the myriad associated problems. 53We could find no literature on how to convert a 54.I system 55to larger character sets, although some individual 56.I programs 57had been converted. 58This paper reports what we discovered as we 59explored the problem of representing multilingual 60text at all levels of an operating system, 61from the file system and kernel through 62the applications and up to the window system 63and display. 64.PP 65Plan 9 has not been `internationalized': 66its manuals are in English, 67its error messages are in English, 68and it can display text that goes from left to right only. 69But before we can address these other problems, 70we need to handle, uniformly and comfortably, 71the textual representation of all the major written languages. 72That subproblem is richer than we had anticipated. 73.SH 74Standards 75.PP 76Our first step was to select a standard. 77At the time (January 1992), 78there were only two viable options: 79ISO 10646 [ISO10646] and Unicode [Unicode]. 80The documents describing both proposals were still in the draft stage. 81.PP 82The draft of ISO 10646 was not 83very attractive to us. 84It defined a sparse set of 32-bit characters, 85which would be 86hard to implement 87and have punitive storage requirements. 88Also, the draft attempted to 89mollify national interests by allocating 9016-bit subspaces to national committees 91to partition individually. 92The suggested mode of use was to 93``flip'' between separate national 94standards to implement the international standard. 95This did not strike us as a sound basis for a character set. 96As well, transmitting 32-bit values in a byte stream, 97such as in pipes, would be expensive and hard to implement. 98Since the standard does not define a byte order for such 99transmission, the byte stream would also have to carry 100state to enable the values to be recovered. 101.PP 102The Unicode Standard is a proposal by a consortium of mostly American 103computer companies formed 104to protest the technical 105failings of ISO 10646. 106It defines a uniform 16-bit code based on the 107principle of unification: 108two characters are the same if they look the 109same even though they are from different 110languages. 111This principle, called Han unification, 112allows the large Japanese, Chinese, and Korean 113character sets to be packed comfortably into a 16-bit representation. 114.PP 115We chose the Unicode Standard for its technical merits and because its 116code space was better defined. 117Moreover, 118the Unicode Consortium was derailing the 119ISO 10646 standard. 120(Now, in 1995, 121ISO 10646 is a standard 122with one 16-bit group defined, 123which is almost exactly the Unicode Standard. 124As most people expected, the two standards bodies 125reached a détente and 126ISO 10646 and Unicode represent the same character set.) 127.PP 128The Unicode Standard defines an adequate character set 129but an unreasonable representation. 130It states that all characters 131are 16 bits wide and are communicated and stored in 13216-bit units. 133It also reserves a pair of characters 134(hexadecimal FFFE and FEFF) to detect byte order 135in transmitted text, requiring state in the byte stream. 136(The Unicode Consortium was thinking of files, not pipes.) 137To adopt this encoding, 138we would have had to convert all text going 139into and out of Plan 9 between ASCII and Unicode, which cannot be done. 140Within a single program, in command of all its input and output, 141it is possible to define characters as 16-bit quantities; 142in the context of a networked system with 143hundreds of applications on diverse machines 144by different manufacturers, 145it is impossible. 146.PP 147We needed a way to adapt the Unicode Standard to the tools-and-pipes 148model of text processing embodied by the Unix system. 149To do that, we 150needed an ASCII-compatible textual 151representation of Unicode characters for transmission 152and storage. 153In the draft ISO standard there was an informative 154(non-required) 155Annex 156called UTF 157that provided a byte stream encoding 158of the 32-bit ISO code. 159The encoding uses multibyte sequences composed 160from the 190 printable characters of Latin-1 161to represent character values larger 162than 159. 163.PP 164The UTF encoding has several good properties. 165By far the most important is that 166a byte in the ASCII range 0-127 represents 167itself in UTF. 168Thus UTF is backward compatible with ASCII. 169.PP 170UTF has other advantages. 171It is a byte encoding and is 172therefore byte-order independent. 173ASCII control characters appear in the byte stream 174only as themselves, never as an element of a sequence 175encoding another character, 176so newline bytes separate lines of UTF text. 177Finally, ANSI C's 178.CW strcmp 179function applied to UTF strings preserves the ordering of Unicode characters. 180.PP 181To encode and decode UTF is expensive (involving multiplication, 182division, and modulo operations) but workable. 183UTF's major disadvantage is that the encoding 184is not self-synchronizing. 185It is in general impossible to find the character 186boundaries in a UTF string without reading from 187the beginning of the string, although in practice 188control characters such as newlines, 189tabs, and blanks provide synchronization points. 190.PP 191In August 1992, 192X-Open circulated a proposal for another UTF-like 193byte encoding of Unicode characters. 194Their major concern was that an embedded character 195in a file name 196(in particular a slash) 197could be part of an escape sequence in UTF and 198therefore confuse a traditional file system. 199Their proposal would allow all 7-bit ASCII characters 200to represent themselves 201.I "and only themselves" 202in text. 203Multibyte sequences would contain only characters 204with the high bit set. 205We proposed a modification to the new UTF that 206would address our synchronization problem. 207Our proposal, which was originally known informally as UTF-2 and FSS-UTF, 208is now referred to as UTF-8 and has been approved by ISO to become 209Annex P to ISO 10646. 210.PP 211The model for text in Plan 9 is chosen from these 212three standards*: 213.FS 214* ``That's the nice thing about standards\(emthere's so many to choose from.'' \- Andy Tannenbaum (no, the other one) 215.FE 216the Unicode character set encoded as a byte stream by 217UTF-8, from 218(soon to be) Annex P of ISO 10646. 219Although this mixture may seem like a precarious position for us to adopt, 220it is not as bad as it sounds. 221ISO 10646 and the Unicode Standard have converged, 222other systems such as Linux have adopted the same character set and encoding, 223and the general feeling seems to be that Unicode and UTF-8 will be accepted as the way 224to exchange text between systems. 225The prognosis for wide acceptance is good. 226.PP 227There are a couple of aspects of the Unicode Standard we have not faced. 228One is the issue of right-to-left text such as Hebrew or Arabic. 229Since that is an issue of display, not representation, we believe 230we can defer that problem for the moment without affecting our 231ability to solve it later. 232Another issue is diacriticals and `combining characters', 233which cause overstriking of multiple Unicode characters. 234Although necessary for some scripts, such as Thai, Arabic, and Hebrew, 235such characters confuse the issues for Latin languages because they 236generate multiple representations for accented characters. 237ISO 10646 describes three levels of implementation; 238in Plan 9 we decided not to address the issue. 239Again, this can be labeled as a display issue and its finer points are still being debated, 240so we felt comfortable deferring. Mañana. 241.PP 242Although we converted Plan 9 in the altruistic interests of 243serving foreign languages, we have found the large character 244set attractive for other reasons. The Unicode Standard includes many 245characters\(emmathematical symbols, scientific notation, 246more general punctuation, and more\(emthat we now use 247daily in our work. We no longer test our imaginations 248to find ways to include non-ASCII symbols in our text; 249why type 250.CW :-) 251when you can use the character ☺? 252Most compelling is the ability to absorb documents 253and data that contain non-ASCII characters; our browser for the 254Oxford English Dictionary 255lets us see the dictionary as it really is, with pronunciation 256in the IPA font, foreign phrases properly rendered, and so on, 257.I "in plain text. 258.PP 259In the rest of this paper, except when 260stated otherwise, the term `UTF' refers to the UTF-8 encoding 261of Unicode characters as adopted by Plan 9. 262.SH 263C Compiler 264.PP 265The first program to be converted to UTF 266was the C Compiler. 267There are two levels of conversion. 268On the syntactic level, 269input to the C compiler 270is UTF; on the semantic level, 271the C language needs to define 272how compiled programs manipulate 273the UTF set. 274.PP 275The syntactic part is simple. 276The ANSI C language standard defines the 277source character set to be ASCII. 278Since UTF is backward compatible with ASCII, 279the compiler needs little change. 280The only places where a larger character set 281is allowed are in character constants, strings, and comments. 282Since 7-bit ASCII characters can represent only 283themselves in UTF, 284the compiler does not have to be careful while looking 285for the termination of a string or comment. 286.PP 287The Plan 9 compiler extends ANSI C to treat any Unicode 288character with a value outside of the ASCII range as 289an alphabetic. 290To a Greek programmer or an English mathematician, 291α is a sensible and now valid variable name. 292.PP 293On the semantic level, ANSI C allows, 294but does not tie down, 295the notion of a 296.I "wide character 297and admits string and character constants 298of this type. 299We chose the wide character type to be 300.CW unsigned 301.CW short . 302In the libraries, the word 303.CW Rune 304is defined by a 305.CW typedef 306to be equivalent to 307.CW unsigned 308.CW short 309and is 310used to signify a Unicode character. 311.PP 312There are surprises; for example: 313.P1 314L'x' \f1is 120\fP 315\&'x' \f1is 120\fP 316L'ÿ' \f1is 255\fP 317\&'ÿ' \f1is -1, stdio \fPEOF\f1 (if \fPchar\f1 is signed)\fP 318L'\f1α\fP' \f1is 945\fP 319\&'\f1α\fP' \f1is illegal\fP 320.P2 321In the string constants, 322.P1 323"\f(Jpこんにちは 世界\fP" 324L"\f(Jpこんにちは 世界\fP", 325.P2 326the former is an array of 327.CW chars 328with 22 elements 329and a null byte, 330while the latter is an array of 331.CW unsigned 332.CW shorts 333.CW Runes ) ( 334with 8 elements and a null 335.CW Rune . 336.PP 337The Plan 9 library provides an output conversion function, 338.CW print 339(analogous to 340.CW printf ), 341with formats 342.CW %c , 343.CW %C , 344.CW %s , 345and 346.CW %S . 347Since 348.CW print 349produces text, its output is always UTF. 350The character conversion 351.CW %c 352(lower case) masks its argument 353to 8 bits before converting to UTF. 354Thus 355.CW L'ÿ' 356and 357.CW 'ÿ' 358printed under 359.CW %c 360will be identical, 361but 362.CW L'\f1α\fP' 363will print as the Unicode 364character with decimal value 177. 365The character conversion 366.CW %C 367(upper case) masks its argument 368to 16 bits before converting to UTF. 369Thus 370.CW L'ÿ' 371and 372.CW L'\f1α\fP' 373will print correctly under 374.CW %C , 375but 376.CW 'ÿ' 377will not. 378The conversion 379.CW %s 380(lower case) 381expects a pointer to 382.CW char 383and copies UTF sequences up to a null byte. 384The conversion 385.CW %S 386(upper case) expects a pointer to 387.CW Rune 388and 389performs sequential 390.CW %C 391conversions until a null 392.CW Rune 393is encountered. 394.PP 395Another problem in format conversion 396is the definition of 397.CW %10s : 398does the number refer to bytes or characters? 399We decided that such formats were most 400often used to align output columns and 401so made the number count characters. 402Some programs, however, use the count 403to place blank-padded strings 404in fixed-sized arrays. 405These programs must be found and corrected. 406.PP 407Here is a complete example: 408.P1 409#include <u.h> 410 411char c[] = "\f(Jpこんにちは 世界\fP"; 412Rune s[] = L"\f(Jpこんにちは 世界\fP"; 413 414main(void) 415{ 416 print("%d, %d\en", sizeof(c), sizeof(s)); 417 print("%s\en", c); 418 print("%S\en", s); 419} 420.P2 421.PP 422This program prints 423.CW 23, 424.CW 18 425and then two identical lines of 426UTF text. 427In practice, 428.CW %S 429and 430.CW L"..." 431are rare in programs; one reason is 432that most formatted I/O is done in unconverted UTF. 433.SH 434Ramifications 435.PP 436All programs in Plan 9 now read and write text as UTF, not ASCII. 437This change breaks two deep-rooted symmetries implicit in most C programs: 438.IP 1. 439A character is no longer a 440.CW char . 441.IP 2. 442The internal representation (Rune) of a character now differs from its 443external representation (UTF). 444.PP 445In the sections that follow, 446we show how these issues were faced in the layers of 447system software from the operating system up to the applications. 448The effects are wide-reaching and often surprising. 449.SH 450Operating system 451.PP 452Since UTF is the only format for text in Plan 9, 453the interface to the operating system had to be converted to UTF. 454Text strings cross the interface in several places: 455command arguments, 456file names, 457user names (people can log in using their native name), 458error messages, 459and miscellaneous minor places such as commands to the I/O system. 460Little change was required: null-terminated UTF strings 461are equivalent to null-terminated ASCII strings for most purposes 462of the operating system. 463The library routines described in the next section made that 464change straightforward. 465.PP 466The window system, once called 467.CW 8.5 , 468is now rightfully called 469.CW 8½ . 470.SH 471Libraries 472.PP 473A header file included by all programs (see [Pike92]) declares 474the 475.CW Rune 476type to hold 16-bit character values: 477.P1 478typedef unsigned short Rune; 479.P2 480Also defined are several constants relevant to UTF: 481.P1 482enum 483{ 484 UTFmax = 3, /* maximum bytes per rune */ 485 Runesync = 0x80, /* can't appear in UTF sequence (<) */ 486 Runeself = 0x80, /* rune==UTF sequence (<) */ 487 Runeerror = 0x80, /* decoding error in UTF */ 488}; 489.P2 490(With the original UTF, 491.CW Runesync 492was hexadecimal 21 and 493.CW Runeself 494was A0.) 495.CW UTFmax 496bytes are sufficient 497to hold the UTF encoding of any Unicode character. 498Characters of value less than 499.CW Runesync 500only appear in a UTF string as 501themselves, never as part of a sequence encoding another character. 502Characters of value less than 503.CW Runeself 504encode into single bytes 505of the same value. 506Finally, when the library detects errors in UTF input\(embyte sequences 507that are not valid UTF sequences\(emit converts the first byte of the 508error sequence to the character 509.CW Runeerror . 510There is little a rune-oriented program can do when given bad data 511except exit, which is unreasonable, or carry on. 512Originally the conversion routines, described below, 513returned errors when given invalid UTF, 514but we found ourselves repeatedly checking for errors and ignoring them. 515We therefore decided to convert a bad sequence to a valid rune 516and continue processing. 517(The ANSI C routines, on the other hand, return errors.) 518.PP 519This technique does have the unfortunate property that converting 520invalid UTF byte strings in and out of runes does not preserve the input, 521but this circumstance only occurs when non-textual input is 522given to a textual program. 523The Unicode Standard defines an error character, value FFFD, to stand for 524characters from other sets that it does not represent. 525The 526.CW Runeerror 527character is a different concept, related to the encoding rather than the character set, so we 528chose a different character for it. 529.PP 530The Plan 9 C library contains a number of routines for 531manipulating runes. 532The first set converts between runes and UTF strings: 533.P1 534extern int runetochar(char*, Rune*); 535extern int chartorune(Rune*, char*); 536extern int runelen(long); 537extern int fullrune(char*, int); 538.P2 539.CW Runetochar 540translates a single 541.CW Rune 542to a UTF sequence and returns the number of bytes produced. 543.CW Chartorune 544goes the other way, reporting how many bytes were consumed. 545.CW Runelen 546returns the number of bytes in the UTF encoding of a rune. 547.CW Fullrune 548examines a UTF string up to a specified number of bytes 549and reports whether the string begins with a complete UTF encoding. 550All these routines use the 551.CW Runeerror 552character to work around encoding problems. 553.PP 554There is also a set of routines for examining null-terminated UTF strings, 555based on the model of the ANSI standard 556.CW str 557routines, but with 558.CW utf 559substituted for 560.CW str 561and 562.CW rune 563for 564.CW chr : 565.P1 566extern int utflen(char*); 567extern char* utfrune(char*, long); 568extern char* utfrrune(char*, long); 569extern char* utfutf(char*, char*); 570.P2 571.CW Utflen 572returns the number of runes in a UTF string; 573.CW utfrune 574returns a pointer to the first occurrence of a rune in a UTF string; 575and 576.CW utfrrune 577a pointer to the last. 578.CW Utfutf 579searches for the first occurrence of a UTF string in another UTF string. 580Given the synchronizing property of UTF-8, 581.CW utfutf 582is the same as 583.CW strstr 584if the arguments point to valid UTF strings. 585.PP 586It is a mistake to use 587.CW strchr 588or 589.CW strrchr 590unless searching for a 7-bit ASCII character, that is, a character 591less than 592.CW Runeself . 593.PP 594We have no routines for manipulating null-terminated arrays of 595.CW Runes . 596Although they should probably exist for completeness, we have 597found no need for them, for the same reason that 598.CW %S 599and 600.CW L"..." 601are rarely used. 602.PP 603Most Plan 9 programs use a new buffered I/O library, BIO, in place of 604Standard I/O. 605BIO contains routines to read and write UTF streams, converting to and from 606runes. 607.CW Bgetrune 608returns, as a 609.CW Rune 610within a 611.CW long , 612the next character in the UTF input stream; 613.CW Bputrune 614takes a rune and writes its UTF representation. 615.CW Bungetrune 616puts a rune back into the input stream for rereading. 617.PP 618Plan 9 programs use a simple set of macros to process command line arguments. 619Converting these macros to UTF automatically updated the 620argument processing of most programs. 621In general, 622argument flag names can no longer be held in bytes and 623arrays of 256 bytes cannot be used to hold a set of flags. 624.PP 625We have done nothing analogous to ANSI C's locales, partly because 626we do not feel qualified to define locales and partly because we remain 627unconvinced of that model for dealing with the problems. 628That is really more an issue of internationalization than conversion 629to a larger character set; on the other hand, 630because we have chosen a single character set that encompasses 631most languages, some of the need for 632locales is eliminated. 633(We have a utility, 634.CW tcs , 635that translates between UTF and other character sets.) 636.PP 637There are several reasons why our library does not follow the ANSI design 638for wide and multi-byte characters. 639The ANSI model was designed by a committee, untried, almost 640as an afterthought, whereas 641we wanted to design as we built. 642(We made several major changes to the interface 643as we became familiar with the problems involved.) 644We disagree with ANSI C's handling of invalid multi-byte sequences. 645Also, the ANSI C library is incomplete: 646although it contains some crucial routines for handling 647wide and multi-byte characters, there are some serious omissions. 648For example, our software can exploit 649the fact that UTF preserves ASCII characters in the byte stream. 650We could remove that assumption by replacing all 651calls to 652.CW strchr 653with 654.CW utfrune 655and so on. 656(Because of the weaker properties of the original UTF, 657we have actually done so.) 658ANSI C cannot: 659the standard says nothing about the representation, so portable code should 660.I never 661call 662.CW strchr , 663yet there is no ANSI equivalent to 664.CW utfrune . 665ANSI C simultaneously invalidates 666.CW strchr 667and offers no replacement. 668.PP 669Finally, ANSI did nothing to integrate wide characters 670into the I/O system: it gives no method for printing 671wide characters. 672We therefore needed to invent some things and decided to invent 673everything. 674In the end, some of our entry points do correspond closely to 675ANSI routines\(emfor example 676.CW chartorune 677and 678.CW runetochar 679are similar to 680.CW mbtowc 681and 682.CW wctomb \(embut 683Plan 9's library defines more functionality, enough 684to write real applications comfortably. 685.SH 686Converting the tools 687.PP 688The source for our tools and applications had already been converted to 689work with Latin-1, so it was `8-bit safe', but the conversion to the Unicode 690Standard and UTF is more involved. 691Some programs needed no change at all: 692.CW cat , 693for instance, 694interprets its argument strings, delivered in UTF, 695as file names that it passes uninterpreted to the 696.CW open 697system call, 698and then just copies bytes from its input to its output; 699it never makes decisions based on the values of the bytes. 700(Plan 9 701.CW cat 702has no options such as 703.CW -v 704to complicate matters.) 705Most programs, however, needed modest change. 706.PP 707It is difficult to 708find automatically the places that need attention, 709but 710.CW grep 711helps. 712Software that uses the libraries conscientiously can be searched 713for calls to library routines that examine bytes as characters: 714.CW strchr , 715.CW strrchr , 716.CW strstr , 717etc. 718Replacing these by calls to 719.CW utfrune , 720.CW utfrrune , 721and 722.CW utfutf 723is enough to fix many programs. 724Few tools actually need to operate on runes internally; 725more typically they need only to look for the final slash in a file 726name and similar trivial tasks. 727Of the 170 C source programs in the top levels of 728.CW /sys/src/cmd , 729only 23 now contain the word 730.CW Rune . 731.PP 732The programs that 733.I do 734store runes internally 735are mostly those whose 736.I raison 737.I d'être 738is character manipulation: 739.CW sam 740(the text editor), 741.CW sed , 742.CW sort , 743.CW tr , 744.CW troff , 745.CW 8½ 746(the window system and terminal emulator), 747and so on. 748To decide whether to compute using runes 749or UTF-encoded byte strings requires balancing the cost of converting 750the data when read and written 751against the cost of converting relevant text on demand. 752For programs such as editors that run a long time with a relatively 753constant dataset, runes are the better choice. 754There are space considerations too, but they are more complicated: 755plain ASCII text grows when converted to runes; UTF-encoded Japanese 756shrinks. 757.PP 758Again, it is hard to automate the conversion of a program from 759.CW chars 760to 761.CW Runes . 762It is not enough just to change the type of variables; the assumption 763that bytes and characters are equivalent can be insidious. 764For instance, to clear a character array by 765.P1 766memset(buf, 0, BUFSIZE) 767.P2 768becomes wrong if 769.CW buf 770is changed from an array of 771.CW chars 772to an array of 773.CW Runes . 774Any program that indexes tables based on character values needs 775rethinking. 776Consider 777.CW tr , 778which originally used multiple 256-byte arrays for the mapping. 779The naïve conversion would yield multiple 65536-rune arrays. 780Instead Plan 9 781.CW tr 782saves space by building in effect 783a run-encoded version of the map. 784.PP 785.CW Sort 786has related problems. 787The cooperation of UTF and 788.CW strcmp 789means that a simple sort\(emone with no options\(emcan be done 790on the original UTF strings using 791.CW strcmp . 792With sorting options enabled, however, 793.CW sort 794may need to convert its input to runes: for example, 795option 796.CW -t\f1α\fP 797requires searching for alphas in the input text to 798crack the input into fields. 799The field specifier 800.CW +3.2 801refers to 2 runes beyond the third field. 802Some of the other options are hopelessly provincial: 803consider the case-folding and dictionary order options 804(Japanese doesn't even have an official dictionary order) or 805.CW -M 806which compares by case-insensitive English month name. 807Handling these options involves the 808larger issues of internationalization and is beyond the scope 809of this paper and our expertise. 810Plan 9 811.CW sort 812works sensibly with options that make sense relative to the input. 813The simple and most important options are, however, usually meaningful. 814In particular, 815.CW sort 816sorts UTF into the same order that 817.CW look 818expects. 819.PP 820Regular expression-matching algorithms need rethinking to 821be applied to UTF text. 822Deterministic automata are usually applied to bytes; 823converting them to operate on variable-sized byte sequences is awkward. 824On the other hand, converting the input stream to runes adds measurable 825expense 826and the state tables expand 827from size 256 to 65536; it can be expensive just to generate them. 828For simple string searching, 829the Boyer-Moore algorithm works with UTF provided the input is 830guaranteed to be only valid UTF strings; however, it does not work 831with the old UTF encoding. 832At a more mundane level, even character classes are harder: 833the usual bit-vector representation within a non-deterministic automaton 834is unwieldy with 65536 characters in the alphabet. 835.PP 836We compromised. 837An existing library for compiling and executing regular expressions 838was adapted to work on runes, with two entry points for searching 839in arrays of runes and arrays of chars (the pattern is always UTF text). 840Character classes are represented internally as runs of runes; 841the reserved value 842.CW FFFF 843marks the end of the class. 844Then 845.I all 846utilities that use regular expressions\(emeditors, 847.CW grep , 848.CW awk , 849etc.\(emexcept the shell, whose notation 850was grandfathered, were converted to use the library. 851For some programs, there was a concomitant loss of performance, 852but there was also a strong advantage. 853To our knowledge, Plan 9 is the only Unix-like system 854that has a single definition and implementation of 855regular expressions; patterns are written and interpreted 856identically by all the programs in the system. 857.PP 858A handful of programs have the notion of character built into them 859so strongly as to confuse the issue of what they should do with UTF input. 860Such programs were treated as individual special cases. 861For example, 862.CW wc 863is, by default, unchanged in behavior and output; a new option, 864.CW -r , 865counts the number of correctly encoded runes\(emvalid UTF sequences\(emin 866its input; 867.CW -b 868the number of invalid sequences. 869.PP 870It took us several months to convert all the software in the system 871to the Unicode Standard and the old UTF. 872When we decided to convert from that to the new UTF, 873only three things needed to be done. 874First, we rewrote the library routines to encode and decode the 875new UTF. This took an evening. 876Next, we converted all the files containing UTF 877to the new encoding. 878We wrote a trivial program to look for non-ASCII bytes in 879text files and used a Plan 9 program called 880.CW tcs 881(translate character set) to change encodings. 882Finally, we recompiled all the system software; 883the library interface was unchanged, so recompilation was sufficient 884to effect the transformation. 885The second two steps were done concurrently and took an afternoon. 886We concluded that the actual encoding is relatively unimportant to the 887software; the adoption of large characters and a byte-stream encoding 888.I per 889.I se 890are much deeper issues. 891.SH 892Graphics and fonts 893.PP 894Plan 9 provides only minimal support for plain text terminals. 895It is instead designed to be used with all character input and 896output mediated by a window system such as 897.CW 8½ . 898The window system and related software are responsible for the 899display of UTF text as Unicode character images. 900For plain text, the window system must provide a user-settable 901.I font 902that provides a (possibly empty) picture for each Unicode character. 903Fancier applications that use bold and Italic characters 904need multiple fonts storing multiple pictures for each 905Unicode value. 906All the issues are apparent, though, 907in just the problem of 908displaying a single image for each character, that is, the 909Unicode equivalent of a plain text terminal. 910With 128 or even 256 characters, a font can be just 911an array of bitmaps. With 65536 characters, 912a more sophisticated design is necessary. To store the ideographs 913for just Japanese as 16×16×1 bit images, 914the smallest they can reasonably be, takes over a quarter of a 915megabyte. Make the images a little larger, store more bits per 916pixel, and hold a copy in every running application, and the 917memory cost becomes unreasonable. 918.PP 919The structure of the bitmap graphics services is described at length elsewhere 920[Pike91]. 921In summary, the memory holding the bitmaps is stored in the same machine that has 922the display, mouse, and keyboard: the terminal in Plan 9 terminology, 923the workstation in others'. 924Access to that memory and associated services is provided 925by device files served by system 926software on the terminal. One of those files, 927.CW /dev/bitblt , 928interprets messages written upon it as requests for actions 929corresponding to entry points in the graphics library: 930allocate a bitmap, execute a raster operation, draw a text string, etc. 931The window system 932acts as a multiplexer that mediates access to the services 933and resources of the terminal by simulating in each client window 934a set of files mirroring those provided by the system. 935That is, each window has a distinct 936.CW /dev/mouse , 937.CW /dev/bitblt , 938and so on through which applications drive graphical 939input and output. 940.PP 941One of the resources managed by 942.CW 8½ 943and the terminal is the set of active 944.I subfonts. 945Each subfont holds the 946bitmaps and associated data structures for a sequential set of Unicode 947characters. 948Subfonts are stored in files and loaded into the terminal by 949.CW 8½ 950or an application. 951For example, one subfont 952might hold the images of the first 256 characters of the Unicode space, 953corresponding to the Latin-1 character set; 954another might hold the standard phonetic character set, Unicode characters 955with value 0250 to 02E9. 956These files are collected in directories corresponding to typefaces: 957.CW /lib/font/bit/pelm 958contains the Pellucida Monospace character set, with subfonts holding 959the Latin-1, Greek, Cyrillic and other components of the typeface. 960A suffix on subfont files encodes (in a subfont-specific 961way) the size of the images: 962.CW /lib/font/bit/pelm/latin1.9 963contains the Latin-1 Pellucida Monospace characters with lower 964case letters 9 pixels high; 965.CW /lib/font/bit/jis/jis5400.16 966contains 16-pixel high 967ideographs starting at Unicode value 5400. 968.PP 969The subfonts do not identify which portion of the Unicode space 970they cover. Instead, a 971font file, in plain text, 972describes how to assemble subfonts into a complete 973character set. 974The font file is presented as an argument to the window system 975to determine how plain text is displayed in text windows and 976applications. 977Here is the beginning of the font file 978.CW /lib/font/bit/pelm/jis.9.font , 979which describes the layout of a font covering that portion of 980the Unicode Standard for which we have characters of typical 981display size, using Japanese characters 982to cover the Han space: 983.P1 98418 14 9850x0000 0x00FF latin1.9 9860x0100 0x017E latineur.9 9870x0250 0x02E9 ipa.9 9880x0386 0x03F5 greek.9 9890x0400 0x0475 cyrillic.9 9900x2000 0x2044 ../misc/genpunc.9 9910x2070 0x208E supsub.9 9920x20A0 0x20AA currency.9 9930x2100 0x2138 ../misc/letterlike.9 9940x2190 0x21EA ../misc/arrows 9950x2200 0x227F ../misc/math1 9960x2280 0x22F1 ../misc/math2 9970x2300 0x232C ../misc/tech 9980x2500 0x257F ../misc/chart 9990x2600 0x266F ../misc/ding 1000.P2 1001.P1 10020x3000 0x303f ../jis/jis3000.16 10030x30a1 0x30fe ../jis/katakana.16 10040x3041 0x309e ../jis/hiragana.16 10050x4e00 0x4fff ../jis/jis4e00.16 10060x5000 0x51ff ../jis/jis5000.16 1007\&... 1008.P2 1009The first two numbers set the interline spacing of the font (18 1010pixels) and the distance from the baseline to the top of the 1011line (14 pixels). 1012When characters are displayed, they are placed so as best 1013to fit within those constraints; characters 1014too large to fit will be truncated. 1015The rest of the file associates subfont files 1016with portions of Unicode space. 1017The first four such files are in the Pellucida Monospace typeface 1018and directory; others reside in other directories. The file names 1019are relative to the font file's own location. 1020.PP 1021There are several advantages to this two-level structure. 1022First, it simultaneously breaks the huge Unicode space into manageable 1023components and provides a unifying architecture for 1024assembling fonts from disjoint pieces. 1025Second, the structure promotes sharing. 1026For example, we have only one set of Japanese 1027characters but dozens of typefaces for the Latin-1 characters, 1028and this structure permits us to store only one copy of the 1029Japanese set but use it with any Roman typeface. 1030Also, customization is easy. 1031English-speaking users who don't need Japanese characters 1032but may want to read an on-line Oxford English Dictionary can 1033assemble a custom font with the 1034Latin-1 (or even just ASCII) characters and the International 1035Phonetic Alphabet (IPA). 1036Moreover, to do so requires just editing a plain text file, 1037not using a special font editing tool. 1038Finally, the structure guides the design of 1039caching protocols to improve performance and memory usage. 1040.PP 1041To load a complete Unicode character set into each application 1042would consume too 1043much memory and, particularly on slow terminal lines, would take 1044unreasonably long. 1045Instead, Plan 9 assembles a multi-level cache structure for 1046each font. 1047An application opens a font file, reads and parses it, 1048and allocates a data structure. 1049A message written to 1050.CW /dev/bitblt 1051allocates an associated structure held in the terminal, in particular, 1052a bitmap to act as a cache 1053for recently used character images. 1054Other messages copy these images to bitmaps such as the screen 1055by loading characters from subfonts into the cache on demand and 1056from there to the destination bitmap. 1057The protocol to draw characters is in terms of cache indices, 1058not Unicode character number or UTF sequences. 1059These details are hidden from the application, which instead 1060sees only a subroutine to draw a string in a bitmap from a 1061given font, functions to discover character size information, 1062and routines to allocate and to free fonts. 1063.PP 1064As needed, whole 1065subfonts are opened by the graphics library, read, and then downloaded 1066to the terminal. 1067They are held open by the library in an LRU-replacement list. 1068Even when the program closes a subfont, it is retained 1069in the terminal for later use. 1070When the application opens the subfont, it asks the terminal 1071if it already has a copy to avoid reading it from the file 1072server if possible. 1073This level of cache has the property that the bitmaps for, say, 1074all the Japanese characters are stored only once, in the terminal; 1075the applications read only size and width information from the terminal 1076and share the images. 1077.PP 1078The sizes of the character and subfont caches held by the 1079application are adaptive. 1080A simple algorithm monitors the cache miss rate to enlarge and 1081shrink the caches as required. 1082The size of the character cache is limited to 2048 images maximum, 1083which in practice seems enough even for Japanese text. 1084For plain ASCII-like text it naturally stays around 128 images. 1085.PP 1086This mechanism sounds complicated but is implemented by only about 1087500 lines in the library and considerably less in each of the 1088terminal's graphics driver and 1089.CW 8½ . 1090It has the advantage that only characters that are 1091being used are loaded into memory. 1092It is also efficient: if the characters being drawn 1093are in the cache the extra overhead is negligible. 1094It works particularly well for alphabetic character sets, 1095but also adapts on demand for ideographic sets. 1096When a user first looks at Japanese text, it takes a few 1097seconds to read all the font data, but thereafter the 1098text is drawn almost as fast as regular text (the images 1099are larger, so draw a little slower). 1100Also, because the bitmaps are remembered by the terminal, 1101if a second application then looks at Japanese text 1102it starts faster than the first. 1103.PP 1104We considered 1105building a `font server' 1106to cache character images and associated data 1107for the applications, the window system, and the terminal. 1108We rejected this design because, although isolating 1109many of the problems of font management into a separate program, 1110it didn't simplify the applications. 1111Moreover, in a distributed system such as Plan 9 it is easy 1112to have too many special purpose servers. 1113Making the management of the fonts the concern of only 1114the essential components simplifies the system and makes 1115bootstrapping less intricate. 1116.SH 1117Input 1118.PP 1119A completely different problem is how to type Unicode characters 1120as input to the system. 1121We selected an unused key on our ASCII keyboards 1122to serve as a prefix for multi-keystroke 1123sequences that generate Unicode characters. 1124For example, the character 1125.CW ü 1126is generated by the prefix key 1127(typically 1128.CW ALT 1129or 1130.CW Compose ) 1131followed by a double quote and a lower-case 1132.CW u . 1133When that character is read by the application, from the file 1134.CW /dev/cons , 1135it is of course presented as its UTF encoding. 1136Such sequences generate characters from an arbitrary set that 1137includes all of Latin-1 plus a selection of mathematical 1138and technical characters. 1139An arbitrary Unicode character may be generated by typing the prefix, 1140an upper case X, and four hexadecimal digits that identify 1141the Unicode value. 1142.PP 1143These simple mechanisms are adequate for most of our day-to-day needs: 1144it's easy to remember to type `ALT 1 2' for ½\^ or `ALT accent letter' 1145for accented Latin letters. 1146For the occasional unusual character, the cut and paste features of 1147.CW 8½ 1148serve well. A program called (perhaps misleadingly) 1149.CW unicode 1150takes as argument a hexadecimal value, and prints the UTF representation of that character, 1151which may then be picked up with the mouse and used as input. 1152.PP 1153These methods 1154are clearly unsatisfactory when working in a non-English language. 1155In the native country of such a language 1156the appropriate keyboard is likely to be at hand. 1157But it's also reasonable\(emespecially now that the system handles Unicode characters\(emto 1158work in a language foreign to the keyboard. 1159.PP 1160For alphabetic languages such as Greek or Russian, it is 1161straightforward to construct a program that does phonetic substitution, 1162so that, for example, typing a Latin `a' yields the Greek `α'. 1163Within Plan 9, such a program can be inserted transparently 1164between the real keyboard and a program such as the window system, 1165providing a manageable input device for such languages. 1166.PP 1167For ideographic languages such as Chinese or Japanese the problem is harder. 1168Native users of such languages have adopted methods for dealing with 1169Latin keyboards that involve a hybrid technique based on phonetics 1170to generate a list of possible symbols followed by menu selection to 1171choose the desired one. 1172Such methods can be 1173effective, but their design must be rooted in information about 1174the language unknown to non-native speakers. 1175.CW Cxterm , ( 1176a Chinese terminal emulator built by and for 1177Chinese programmers, 1178employs such a technique 1179[Pong and Zhang].) 1180Although the technical problem of implementing such a device 1181is easy in Plan 9\(emit is just an elaboration of the technique for 1182alphabetic languages\(emour lack of familiarity with such languages 1183has restrained our enthusiasm for building one. 1184.PP 1185The input problem is technically the least interesting but perhaps 1186emotionally the most important of the problems of converting a system 1187to an international character set. 1188Beyond that remain the deeper problems of internationalization 1189such as multi-lingual error messages and command names, 1190problems we are not qualified to solve. 1191With the ability to treat text of most languages on an equal 1192footing, though, we can begin down that path. 1193Perhaps people in non-English speaking countries will 1194consider adopting Plan 9, solving the input problem locally\(emperhaps 1195just by plugging in their local terminals\(emand begin to use 1196a system with at least the capacity to be international. 1197.SH 1198Acknowledgements 1199.PP 1200Dennis Ritchie provided consultation and encouragement. 1201Bob Flandrena converted most of the standard tools to UTF. 1202Brian Kernighan suffered cheerfully with several 1203inadequate implementations and converted 1204.CW troff 1205to UTF. 1206Rich Drechsler converted his Postscript driver to UTF. 1207John Hobby built the Postscript ☺. 1208We thank them all. 1209.SH 1210References 1211.LP 1212[ANSIC] \f2American National Standard for Information Systems \- 1213Programming Language C\f1, American National Standards Institute, Inc., 1214New York, 1990. 1215.LP 1216[ISO10646] 1217ISO/IEC DIS 10646-1:1993 1218\f2Information technology \- 1219Universal Multiple-Octet Coded Character Set (UCS) \(em 1220Part 1: Architecture and Basic Multilingual Plane\fP. 1221.LP 1222[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey, 1223``Plan 9 from Bell Labs'', 1224UKUUG Proc. of the Summer 1990 Conf., 1225London, England, 12261990. 1227.LP 1228[Pike91] R. Pike, ``8½, The Plan 9 Window System'', USENIX Summer 1229Conf. Proc., Nashville, 1991, reprinted in this volume. 1230.LP 1231[Pike92] R. Pike, ``How to Use the Plan 9 C Compiler'', this volume. 1232.LP 1233[Pong and Zhang] Man-Chi Pong and Yongguang Zhang, ``cxterm: 1234A Chinese Terminal Emulator for the X Window System'', 1235.I 1236Software\(emPractice and Experience, 1237.R 1238Vol 22(1), 809-926, October 1992. 1239.LP 1240[Unicode] 1241\f2The Unicode Standard, 1242Worldwide Character Encoding, 1243Version 1.0, Volume 1\f1, 1244The Unicode Consortium, 1245Addison Wesley, 1246New York, 12471991. 1248