.HTML "How to Use the Plan 9 C Compiler
.TL
How to Use the Plan 9 C Compiler*
.AU
Rob Pike
rob@plan9.bell-labs.com
.SH
Introduction
.FS
* This paper has been revised to reflect the move to 21-bit Unicode.
.FE
.PP
The C compiler on Plan 9 is a wholly new program; in fact
it was the first piece of software written for what would
eventually become Plan 9 from Bell Labs.
Programmers familiar with existing C compilers will find
a number of differences both in the language the Plan 9 compiler
accepts and in how the compiler is used.
.PP
The compiler is really a set of compilers, one for each
architecture \(em MIPS, SPARC, Intel 386, Power PC, ARM, etc. \(em
that accept a dialect of ANSI C and efficiently produce
fairly good code for the target machine.
There is a packaging of the compiler that accepts strict ANSI C for
a POSIX environment, but this document focuses on the
native Plan 9 environment, that in which all the system source and
almost all the utilities are written.
.SH
Source
.PP
The language accepted by the compilers is the core 1989 ANSI C language
with some modest extensions,
a greatly simplified preprocessor,
a smaller library that includes system calls and related facilities,
and a completely different structure for include files.
.PP
Official ANSI C accepts the old (K&R) style of declarations for
functions; the Plan 9 compilers
are more demanding.
Without an explicit command-line flag
.CW -B ) (
whose use is discouraged, the compilers insist
on new-style function declarations, that is, prototypes for
function arguments.
The function declarations in the libraries' include files are
all in the new style so the interfaces are checked at compile time.
For C programmers who have not yet switched to function prototypes
the clumsy syntax may seem repellent but the payoff in stronger typing
is substantial.
Those who wish to import existing software to Plan 9 are urged
to use the opportunity to update their code.
.PP
The compilers include an integrated preprocessor that accepts the familiar
.CW #include ,
.CW #define
for macros both with and without arguments,
.CW #undef ,
.CW #line ,
.CW #ifdef ,
.CW #ifndef ,
and
.CW #endif .
It
supports neither
.CW #if
nor
.CW ## ,
although it does
honor a few
.CW #pragmas .
The
.CW #if
directive was omitted because it greatly complicates the
preprocessor, is never necessary, and is usually abused.
Conditional compilation in general makes code hard to understand;
the Plan 9 source uses it sparingly.
Also, because the compilers remove dead code, regular
.CW if
statements with constant conditions are more readable equivalents to many
.CW #ifs .
To compile imported code ineluctably fouled by
.CW #if
there is a separate command,
.CW /bin/cpp ,
that implements the complete ANSI C preprocessor specification.
.PP
Include files fall into two groups: machine-dependent and machine-independent.
The machine-independent files occupy the directory
.CW /sys/include ;
the others are placed in a directory appropriate to the machine, such as
.CW /mips/include .
The compiler searches for include files
first in the machine-dependent directory and then
in the machine-independent directory.
At the time of writing there are thirty-one machine-independent include
files and two (per machine) machine-dependent ones:
.CW <ureg.h>
and
.CW <u.h> .
The first describes the layout of registers on the system stack,
for use by the debugger.
The second defines some
architecture-dependent types such as
.CW jmp_buf
for
.CW setjmp
and the
.CW va_arg
and
.CW va_list
macros for handling arguments to variadic functions,
as well as a set of
.CW typedef
abbreviations for
.CW unsigned
.CW short
and so on.
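.PP
The earlier remark that an ordinary
.CW if
with a constant condition can stand in for many uses of
.CW #if
can be sketched as follows.
This is a minimal illustration rather than code from the Plan 9 source:
the names
.CW Debug ,
.CW nprobe ,
and
.CW checksum
are hypothetical, and it is assumed that the compiler drops the branch that can never be taken.

```c
/*
 * Sketch: a compile-time constant replaces #ifdef DEBUG.
 * Debug, nprobe and checksum are hypothetical names.
 */
enum
{
	Debug = 0,	/* set to 1 to enable the extra bookkeeping */
};

int nprobe;	/* counts calls when Debug is on */

int
checksum(unsigned char *p, int n)
{
	int i, sum;

	sum = 0;
	for(i = 0; i < n; i++)
		sum += p[i];
	if(Debug)	/* constant condition; the body is still parsed and type-checked */
		nprobe++;
	return sum & 0xFF;
}
```

Unlike text hidden by the preprocessor, the disabled branch is still seen by the compiler, so it cannot silently rot.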
.PP
Here is an excerpt from
.CW /386/include/u.h :
.P1
#define nil ((void*)0)
typedef unsigned short ushort;
typedef unsigned char uchar;
typedef unsigned long ulong;
typedef unsigned int uint;
typedef signed char schar;
typedef long long vlong;

typedef long jmp_buf[2];
#define JMPBUFSP 0
#define JMPBUFPC 1
#define JMPBUFDPC 0
.P2
Plan 9 programs use
.CW nil
for the name of the zero-valued pointer.
The type
.CW vlong
is the largest integer type available; on most architectures it
is a 64-bit value.
A couple of other types in
.CW <u.h>
are
.CW u32int ,
which is guaranteed to have exactly 32 bits (a possibility on all the supported architectures) and
.CW mpdigit ,
which is used by the multiprecision math package
.CW <mp.h> .
The
.CW #define
constants permit an architecture-independent (but compiler-dependent)
implementation of stack-switching using
.CW setjmp
and
.CW longjmp .
.PP
Every Plan 9 C program begins
.P1
#include <u.h>
.P2
because all the other installed header files use the
.CW typedefs
declared in
.CW <u.h> .
.PP
In strict ANSI C, include files are grouped to collect related functions
in a single file: one for string functions, one for memory functions,
one for I/O, and none for system calls.
Each include file is protected by an
.CW #ifdef
to guarantee its contents are seen by the compiler only once.
Plan 9 takes a different approach. Other than a few include
files that define external formats such as archives, the files in
.CW /sys/include
correspond to
.I libraries.
If a program is using a library, it includes the corresponding header.
The default C library comprises string functions, memory functions, and
so on, largely as in ANSI C, some formatted I/O routines,
plus all the system calls and related functions.
To use these functions, one must
.CW #include
the file
.CW <libc.h> ,
which in turn must follow
.CW <u.h> ,
to define their prototypes for the compiler.
Here is the complete source to the traditional first C program:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	print("hello world\en");
	exits(0);
}
.P2
The
.CW print
routine and its relatives
.CW fprint
and
.CW sprint
resemble the similarly-named functions in Standard I/O but are not
attached to a specific I/O library.
In Plan 9
.CW main
is not integer-valued; it should call
.CW exits ,
which takes a string argument (or null; here ANSI C promotes the 0 to a
.CW char* ).
All these functions are, of course, documented in the Programmer's Manual.
.PP
To use
.CW printf ,
.CW <stdio.h>
must be included to define the function prototype for
.CW printf :
.P1
#include <u.h>
#include <libc.h>
#include <stdio.h>

void
main(int argc, char *argv[])
{
	printf("%s: hello world; argc = %d\en", argv[0], argc);
	exits(0);
}
.P2
In practice, Standard I/O is not used much in Plan 9. I/O libraries are
discussed in a later section of this document.
.PP
There are libraries for handling regular expressions, raster graphics,
windows, and so on, and each has an associated include file.
The manual for each library states which include files are needed.
The files are not protected against multiple inclusion and themselves
contain no nested
.CW #includes .
Instead the
programmer is expected to sort out the requirements
and to
.CW #include
the necessary files once at the top of each source file. In practice this is
trivial: this way of handling include files is so straightforward
that it is rare for a source file to contain more than half a dozen
.CW #includes .
.PP
The compilers do their own register allocation so the
.CW register
keyword is ignored.
For different reasons,
.CW volatile
and
.CW const
are also ignored.
.PP
To make it easier to share code with other systems, Plan 9 has a version
of the compiler,
.CW pcc ,
that provides the standard ANSI C preprocessor, headers, and libraries
with POSIX extensions.
.CW Pcc
is recommended only
when broad external portability is mandated. It compiles slower,
produces slower code (it takes extra work to simulate POSIX on Plan 9),
eliminates those parts of the Plan 9 interface
not related to POSIX, and illustrates the clumsiness of an environment
designed by committee.
.CW Pcc
is described in more detail in
.I
APE\(emThe ANSI/POSIX Environment,
.R
by Howard Trickey.
.SH
Process
.PP
Each CPU architecture supported by Plan 9 is identified by a single,
arbitrary, alphanumeric character:
.CW k
for SPARC,
.CW q
for 32-bit Power PC,
.CW v
for MIPS,
.CW 0
for little-endian MIPS,
.CW 5
for ARM v5 and later 32-bit architectures,
.CW 6
for AMD64,
.CW 8
for Intel 386, and
.CW 9
for 64-bit Power PC.
The character labels the support tools and files for that architecture.
For instance, for the 386 the compiler is
.CW 8c ,
the assembler is
.CW 8a ,
the link editor/loader is
.CW 8l ,
the object files are suffixed
.CW \&.8 ,
and the default name for an executable file is
.CW 8.out .
Before we can use the compiler we therefore need to know which
machine we are compiling for.
The next section explains how this decision is made; for the moment
assume we are building 386 binaries and make the mental substitution for
.CW 8
appropriate to the machine you are actually using.
.PP
To convert source to an executable binary is a two-step process.
First run the compiler,
.CW 8c ,
on the source, say
.CW file.c ,
to generate an object file
.CW file.8 .
Then run the loader,
.CW 8l ,
to generate an executable
.CW 8.out
that may be run (on a 386 machine):
.P1
8c file.c
8l file.8
8.out
.P2
The loader automatically links with whatever libraries the program
needs, usually including the standard C library as defined by
.CW <libc.h> .
Of course the compiler and loader have lots of options, both familiar and new;
see the manual for details.
The compiler does not generate an executable automatically;
the output of the compiler must be given to the loader.
Since most compilation is done under the control of
.CW mk
(see below), this is rarely an inconvenience.
.PP
The distribution of work between the compiler and loader is unusual.
The compiler integrates preprocessing, parsing, register allocation,
code generation and some assembly.
Combining these tasks in a single program is part of the reason for
the compiler's efficiency.
The loader does instruction selection, branch folding,
instruction scheduling,
and writes the final executable.
There is no separate C preprocessor and no assembler in the usual pipeline.
Instead the intermediate object file
(here a
.CW \&.8
file) is a type of binary assembly language.
The instructions in the intermediate format are not exactly those in
the machine. For example, on the 68020 the object file may specify
a MOVE instruction but the loader will decide just which variant of
the MOVE instruction \(em MOVE immediate, MOVE quick, MOVE address,
etc. \(em is most efficient.
.PP
The assembler,
.CW 8a ,
is just a translator between the textual and binary
representations of the object file format.
It is not an assembler in the traditional sense.
It has limited
macro capabilities (the same as the integral C preprocessor in the compiler),
clumsy syntax, and minimal error checking. For instance, the assembler
will accept an instruction (such as memory-to-memory MOVE on the MIPS) that the
machine does not actually support; only when the output of the assembler
is passed to the loader will the error be discovered.
The assembler is intended only for writing things that need access to instructions
invisible from C,
such as the machine-dependent
part of an operating system;
very little code in Plan 9 is in assembly language.
.PP
The compilers take an option
.CW -S
that causes them to print on their standard output the generated code
in a format acceptable as input to the assemblers.
This is of course merely a formatting of the
data in the object file; therefore the assembler is just
an
ASCII-to-binary converter for this format.
Other than the specific instructions, the input to the assemblers
is largely architecture-independent; see
``A Manual for the Plan 9 Assembler'',
by Rob Pike,
for more information.
.PP
The loader is an integral part of the compilation process.
Each library header file contains a
.CW #pragma
that tells the loader the name of the associated archive; it is
not necessary to tell the loader which libraries a program uses.
The C run-time startup is found, by default, in the C library.
The loader starts with an undefined
symbol,
.CW _main ,
that is resolved by pulling in the run-time startup code from the library.
(The loader undefines
.CW _mainp
when profiling is enabled, to force loading of the profiling start-up
instead.)
.PP
Unlike its counterpart on other systems, the Plan 9 loader rearranges
data to optimize access. This means the order of variables in the
loaded program is unrelated to its order in the source.
Most programs don't care, but some assume that, for example, the
variables declared by
.P1
int a;
int b;
.P2
will appear at adjacent addresses in memory. On Plan 9, they won't.
.SH
Heterogeneity
.PP
When the system starts or a user logs in the environment is configured
so the appropriate binaries are available in
.CW /bin .
The configuration process is controlled by an environment variable,
.CW $cputype ,
with a value such as
.CW mips ,
.CW 386 ,
.CW arm ,
or
.CW sparc .
For each architecture there is a directory in the root,
with the appropriate name,
that holds the binary and library files for that architecture.
Thus
.CW /mips/lib
contains the object code libraries for MIPS programs,
.CW /mips/include
holds MIPS-specific include files, and
.CW /mips/bin
has the MIPS binaries.
These binaries are attached to
.CW /bin
at boot time by binding
.CW /$cputype/bin
to
.CW /bin ,
so
.CW /bin
always contains the correct files.
.PP
The MIPS compiler,
.CW vc ,
by definition
produces object files for the MIPS architecture,
regardless of the architecture of the machine on which the compiler is running.
There is a version of
.CW vc
compiled for each architecture:
.CW /mips/bin/vc ,
.CW /arm/bin/vc ,
.CW /sparc/bin/vc ,
and so on,
each capable of producing MIPS object files regardless of the native
instruction set.
If one is running on a SPARC,
.CW /sparc/bin/vc
will compile programs for the MIPS;
if one is running on machine
.CW $cputype ,
.CW /$cputype/bin/vc
will compile programs for the MIPS.
.PP
Because of the bindings that assemble
.CW /bin ,
the shell always looks for a command, say
.CW date ,
in
.CW /bin
and automatically finds the file
.CW /$cputype/bin/date .
Therefore the MIPS compiler is known as just
.CW vc ;
the shell will invoke
.CW /bin/vc
and that is guaranteed to be the version of the MIPS compiler
appropriate for the machine running the command.
Regardless of the architecture of the compiling machine,
.CW /bin/vc
is
.I always
the MIPS compiler.
.PP
Also, the output of
.CW vc
and
.CW vl
is completely independent of the machine type on which they are executed:
.CW \&.v
files compiled (with
.CW vc )
on a SPARC may be linked (with
.CW vl )
on a 386.
(The resulting
.CW v.out
will run, of course, only on a MIPS.)
Similarly, the MIPS libraries in
.CW /mips/lib
are suitable for loading with
.CW vl
on any machine; there is only one set of MIPS libraries, not one
set for each architecture that supports the MIPS compiler.
.SH
Heterogeneity and \f(CWmk\fP
.PP
Most software on Plan 9 is compiled under the control of
.CW mk ,
a descendant of
.CW make
that is documented in the Programmer's Manual.
A convention used throughout the
.CW mkfiles
makes it easy to compile the source into binary suitable for any architecture.
.PP
The variable
.CW $cputype
is advisory: it reports the architecture of the current environment, and should
not be modified. A second variable,
.CW $objtype ,
is used to set which architecture is being
.I compiled
for.
The value of
.CW $objtype
can be used by a
.CW mkfile
to configure the compilation environment.
.PP
In each machine's root directory there is a short
.CW mkfile
that defines a set of macros for the compiler, loader, etc.
Here is
.CW /mips/mkfile :
.P1
</sys/src/mkfile.proto

CC=vc
LD=vl
O=v
AS=va
.P2
The line
.P1
</sys/src/mkfile.proto
.P2
causes
.CW mk
to include the file
.CW /sys/src/mkfile.proto ,
which contains general definitions:
.P1
#
# common mkfile parameters shared by all architectures
#

OS=5689qv
CPUS=arm amd64 386 power mips
CFLAGS=-FTVw
LEX=lex
YACC=yacc
MK=/bin/mk
.P2
.CW CC
is obviously the compiler,
.CW AS
the assembler, and
.CW LD
the loader.
.CW O
is the suffix for the object files and
.CW CPUS
and
.CW OS
are used in special rules described below.
.PP
Here is a
.CW mkfile
to build the installed source for
.CW sam :
.P1
</$objtype/mkfile
OBJ=sam.$O address.$O buffer.$O cmd.$O disc.$O error.$O \e
	file.$O io.$O list.$O mesg.$O moveto.$O multi.$O \e
	plan9.$O rasp.$O regexp.$O string.$O sys.$O xec.$O

$O.out: $OBJ
	$LD $OBJ

install: $O.out
	cp $O.out /$objtype/bin/sam

installall:
	for(objtype in $CPUS) mk install

%.$O: %.c
	$CC $CFLAGS $stem.c

$OBJ: sam.h errors.h mesg.h
address.$O cmd.$O parse.$O xec.$O unix.$O: parse.h

clean:V:
	rm -f [$OS].out *.[$OS] y.tab.?
.P2
(The actual
.CW mkfile
imports most of its rules from other secondary files, but
this example works and is not misleading.)
The first line causes
.CW mk
to include the contents of
.CW /$objtype/mkfile
in the current
.CW mkfile .
If
.CW $objtype
is
.CW mips ,
this inserts the MIPS macro definitions into the
.CW mkfile .
In this case the rule for
.CW $O.out
uses the MIPS tools to build
.CW v.out .
The
.CW %.$O
rule in the file uses
.CW mk 's
pattern matching facilities to convert the source files to the object
files through the compiler.
(The text of the rules is passed directly to the shell,
.CW rc ,
without further translation.
See the
.CW mk
manual if any of this is unfamiliar.)
Because the default rule builds
.CW $O.out
rather than
.CW sam ,
it is possible to maintain binaries for multiple machines in the
same source directory without conflict.
This is also, of course, why the output files from the various
compilers and loaders
have distinct names.
.PP
The rest of the
.CW mkfile
should be easy to follow; notice how the rules for
.CW clean
and
.CW installall
(that is, install versions for all architectures) use other macros
defined in
.CW /$objtype/mkfile .
In Plan 9,
.CW mkfiles
for commands conventionally contain rules to
.CW install
(compile and install the version for
.CW $objtype ),
.CW installall
(compile and install for all
.CW $objtypes ),
and
.CW clean
(remove all object files, binaries, etc.).
.PP
The
.CW mkfile
is easy to use. To build a MIPS binary,
.CW v.out :
.P1
% objtype=mips
% mk
.P2
To build and install a MIPS binary:
.P1
% objtype=mips
% mk install
.P2
To build and install all versions:
.P1
% mk installall
.P2
These conventions make cross-compilation as easy to manage
as traditional native compilation.
Plan 9 programs compile and run without change on machines from
large multiprocessors to laptops. For more information about this process, see
``Plan 9 Mkfiles'',
by Bob Flandrena.
.SH
Portability
.PP
Within Plan 9, it is painless to write portable programs, programs whose
source is independent of the machine on which they execute.
The operating system is fixed and the compiler, headers and libraries
are constant so most of the stumbling blocks to portability are removed.
Attention to a few details can avoid those that remain.
.PP
Plan 9 is a heterogeneous environment, so programs must
.I expect
that external files will be written by programs on machines of different
architectures.
The compilers, for instance, must handle without confusion
object files written by other machines.
The traditional approach to this problem is to pepper the source with
.CW #ifdefs
to turn byte-swapping on and off.
Plan 9 takes a different approach: of the handful of machine-dependent
.CW #ifdefs
in all the source, almost all are deep in the libraries.
Instead programs read and write files in a defined format,
either (for low volume applications) as formatted text, or
(for high volume applications) as binary in a known byte order.
If the external data were written with the most significant
byte first, the following code reads a 4-byte integer correctly
regardless of the architecture of the executing machine (assuming
an unsigned long holds 4 bytes):
.P1
ulong
getlong(void)
{
	ulong l;

	l = (getchar()&0xFF)<<24;
	l |= (getchar()&0xFF)<<16;
	l |= (getchar()&0xFF)<<8;
	l |= (getchar()&0xFF)<<0;
	return l;
}
.P2
Note that this code does not `swap' the bytes; instead it just reads
them in the correct order.
Variations of this code will handle any binary format
and also avoid problems
involving how structures are padded, how words are aligned,
and other impediments to portability.
Be aware, though, that extra care is needed to handle floating point data.
.PP
Efficiency hounds will argue that this method is unnecessarily slow and clumsy
when the executing machine has the same byte order (and padding and alignment)
as the data.
The CPU cost of I/O processing
is rarely the bottleneck for an application, however,
and the gain in simplicity of porting and maintaining the code greatly outweighs
the minor speed loss from handling data in this general way.
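.PP
The writing side of the same convention can be sketched in the same spirit.
This is an illustration under stated assumptions, not the code used by any Plan 9 program:
.CW putlong
and
.CW getlongbuf
are hypothetical names, the routines work on a byte buffer rather than on standard input,
and
.CW uint32_t
from ISO C stands in for
.CW ulong
so that the 4-byte width is explicit.

```c
#include <stdint.h>

/*
 * Sketch: store and fetch a 32-bit value, most significant byte first.
 * putlong and getlongbuf are hypothetical names; uint32_t replaces the
 * document's ulong so the width does not depend on the machine.
 */
void
putlong(unsigned char *p, uint32_t v)
{
	p[0] = v>>24;	/* conversion to unsigned char keeps the low 8 bits */
	p[1] = v>>16;
	p[2] = v>>8;
	p[3] = v;
}

uint32_t
getlongbuf(unsigned char *p)
{
	return (uint32_t)p[0]<<24 | (uint32_t)p[1]<<16
		| (uint32_t)p[2]<<8 | (uint32_t)p[3];
}
```

Writing with one routine and reading with the other fixes the byte order of the external data
regardless of the machine at either end.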
This method is how the Plan 9 compilers, the window system, and even the file
servers transmit data between programs.
.PP
To port programs beyond Plan 9, where the system interface is more variable,
it is probably necessary to use
.CW pcc
and hope that the target machine supports ANSI C and POSIX.
.SH
I/O
.PP
The default C library, defined by the include file
.CW <libc.h> ,
contains no buffered I/O package.
It does have several entry points for printing formatted text:
.CW print
outputs text to the standard output,
.CW fprint
outputs text to a specified integer file descriptor, and
.CW sprint
places text in a character array.
To access library routines for buffered I/O, a program must
explicitly include the header file associated with an appropriate library.
.PP
The recommended I/O library, used by most Plan 9 utilities, is
.CW bio
(buffered I/O), defined by
.CW <bio.h> .
There also exists an implementation of ANSI Standard I/O,
.CW stdio .
.PP
.CW Bio
is small and efficient, particularly for buffer-at-a-time or
line-at-a-time I/O.
Even for character-at-a-time I/O, however, it is significantly faster than
the Standard I/O library,
.CW stdio .
Its interface is compact and regular, although it lacks a few conveniences.
The most noticeable is that one must explicitly define buffers for standard
input and output;
.CW bio
does not predefine them.
Here is a program to copy input to output a byte
at a time using
.CW bio :
.P1
#include <u.h>
#include <libc.h>
#include <bio.h>

Biobuf bin;
Biobuf bout;

void
main(void)
{
	int c;

	Binit(&bin, 0, OREAD);
	Binit(&bout, 1, OWRITE);

	while((c=Bgetc(&bin)) != Beof)
		Bputc(&bout, c);
	exits(0);
}
.P2
For peak performance, we could replace
.CW Bgetc
and
.CW Bputc
by their equivalent in-line macros
.CW BGETC
and
.CW BPUTC
but
the performance gain would be modest.
For more information on
.CW bio ,
see the Programmer's Manual.
.PP
Perhaps the most dramatic difference in the I/O interface of Plan 9 from other
systems' is that text is not ASCII.
The format for
text in Plan 9 is a byte-stream encoding of 21-bit characters.
The character set is based on the Unicode Standard and is backward compatible with
ASCII:
characters with value 0 through 127 are the same in both sets.
The 21-bit characters, called
.I runes
in Plan 9, are encoded using a representation called
UTF,
an encoding that is becoming accepted as a standard.
(ISO calls it UTF-8;
throughout Plan 9 it's just called
UTF.)
UTF
defines multibyte sequences to
represent character values from 0 to 1,114,111.
In
UTF,
character values up to 127 decimal, 7F hexadecimal, represent themselves,
so straight
ASCII
files are also valid
UTF.
Also,
UTF
guarantees that bytes with values 0 to 127 (NUL to DEL, inclusive)
will appear only when they represent themselves, so programs that read bytes
looking for plain ASCII characters will continue to work.
Any program that expects a one-to-one correspondence between bytes and
characters will, however, need to be modified.
An example is parsing file names.
File names, like all text, are in
UTF,
so it is incorrect to search for a character in a string by
.CW strchr(filename,
.CW c)
because the character might have a multi-byte encoding.
The correct method is to call
.CW utfrune(filename,
.CW c) ,
defined in
.I rune (2),
which interprets the file name as a sequence of encoded characters
rather than bytes.
In fact, even when you know the character is a single byte
that can represent only itself,
it is safer to use
.CW utfrune
because that assumes nothing about the character set
and its representation.
.PP
The library defines several symbols relevant to the representation of characters.
Any byte with unsigned value less than
.CW Runesync
will not appear in any multi-byte encoding of a character.
.CW Utfrune
compares the character being searched against
.CW Runesync
to see if it is sufficient to call
.CW strchr
or if the byte stream must be interpreted.
Any character with value less than
.CW Runeself
is represented by a single byte with the same value.
Finally, when errors are encountered converting
to runes from a byte stream, the library returns the rune value
.CW Runeerror
and advances a single byte. This permits programs to find runes
embedded in binary data.
.PP
.CW Bio
includes routines
.CW Bgetrune
and
.CW Bputrune
to transform the external byte stream
UTF
format to and from
internal 21-bit runes.
Also, the
.CW %s
format to
.CW print
accepts
UTF;
.CW %c
prints a character after narrowing it to 8 bits.
The
.CW %S
format prints a null-terminated sequence of runes;
.CW %C
prints a character after narrowing it to 21 bits.
For more information, see the Programmer's Manual, in particular
.I utf (6)
and
.I rune (2),
and the paper,
``Hello world, or
Καλημέρα κόσμε, or\
\f(Jpこんにちは 世界\f1'',
by Rob Pike and
Ken Thompson;
there is not room for the full story here.
.PP
These issues affect the compiler in several ways.
First, the C source is in
UTF.
ANSI says C variables are formed from
ASCII
alphanumerics, but comments and literal strings may contain any characters
encoded in the native encoding, here
UTF.
The declaration
.P1
char *cp = "abcÿ";
.P2
initializes the variable
.CW cp
to point to an array of bytes holding the
UTF
representation of the characters
.CW abcÿ.
The type
.CW Rune
is defined in
.CW <u.h>
to be
.CW uint ,
which is also the `wide character' type in the compiler.
Therefore the declaration
.P1
Rune *rp = L"abcÿ";
.P2
initializes the variable
.CW rp
to point to an array of unsigned integers holding the 21-bit
values of the characters
.CW abcÿ .
Note that in both these declarations the characters in the source
that represent
.CW "abcÿ"
are the same; what changes is how those characters are represented
in memory in the program.
The following two lines:
.P1
print("%s\en", "abcÿ");
print("%S\en", L"abcÿ");
.P2
produce the same
UTF
string on their output, the first by copying the bytes, the second
by converting from runes to bytes.
.PP
In C, character constants are integers but narrowed through the
.CW char
type.
The Unicode character
.CW ÿ
has value 255, so if the
.CW char
type is signed,
the constant
.CW 'ÿ'
has value \-1 (which is equal to EOF).
On the other hand,
.CW L'ÿ'
narrows through the wide character type,
.CW uint ,
and therefore has value 255.
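.PP
The difference between bytes and characters discussed in this section can be made concrete with a small decoder.
The following is a sketch in portable C, written independently of the Plan 9 library;
the library's own routines are those documented in
.I rune (2).
The names
.CW decoderune ,
.CW utflength ,
and
.CW Bad
are hypothetical, and the value chosen for
.CW Bad
is an assumption of this sketch.

```c
/*
 * Sketch of decoding one UTF-8 sequence, independent of the Plan 9 library.
 * decoderune, utflength and Bad are hypothetical names. Malformed input
 * yields Bad and consumes one byte, mirroring the recovery rule in the text.
 * Surrogate values are not rejected here.
 */
enum
{
	Bad = 0xFFFD,	/* assumed error value for this sketch */
};

int
decoderune(unsigned long *r, const char *s)
{
	const unsigned char *p;
	unsigned long c, min;
	int i, n;

	p = (const unsigned char*)s;
	c = p[0];
	if(c < 0x80){	/* a byte below 128 represents itself */
		*r = c;
		return 1;
	}
	if((c & 0xE0) == 0xC0){
		n = 2; c &= 0x1F; min = 0x80;
	}else if((c & 0xF0) == 0xE0){
		n = 3; c &= 0x0F; min = 0x800;
	}else if((c & 0xF8) == 0xF0){
		n = 4; c &= 0x07; min = 0x10000;
	}else
		goto bad;
	for(i = 1; i < n; i++){
		if((p[i] & 0xC0) != 0x80)	/* also stops at the terminating NUL */
			goto bad;
		c = c<<6 | (p[i] & 0x3F);
	}
	if(c < min || c > 0x10FFFF)	/* overlong encoding or out of range */
		goto bad;
	*r = c;
	return n;
bad:
	*r = Bad;
	return 1;
}

/* count characters, not bytes, in a NUL-terminated string */
int
utflength(const char *s)
{
	unsigned long r;
	int n;

	for(n = 0; *s != '\0'; n++)
		s += decoderune(&r, s);
	return n;
}
```

Applied to the five bytes that encode
.CW abcÿ ,
.CW utflength
counts four characters, which is the distinction a byte-oriented routine such as
.CW strlen
cannot make.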
.PP
Finally, although it's not ANSI C, the Plan 9 C compilers
assume any character with value above
.CW Runeself
is an alphanumeric,
so α is a legal, if non-portable, variable name.
.SH
Arguments
.PP
Some macros are defined
in
.CW <libc.h>
for parsing the arguments to
.CW main() .
They are described in
.I ARG (2)
but are fairly self-explanatory.
There are four macros:
.CW ARGBEGIN
and
.CW ARGEND
are used to bracket a hidden
.CW switch
statement within which
.CW ARGC
returns the current option character (rune) being processed and
.CW ARGF
returns the argument to the option, as in the loader option
.CW -o
.CW file .
Here, for example, is the code at the beginning of
.CW main()
in
.CW ramfs.c
(see
.I ramfs (4))
that cracks its arguments:
.P1
void
main(int argc, char *argv[])
{
	char *defmnt;
	int p[2];
	int mfd[2];
	int stdio = 0;

	defmnt = "/tmp";
	ARGBEGIN{
	case 'i':
		defmnt = 0;
		stdio = 1;
		mfd[0] = 0;
		mfd[1] = 1;
		break;
	case 's':
		defmnt = 0;
		break;
	case 'm':
		defmnt = ARGF();
		break;
	default:
		usage();
	}ARGEND
.P2
.SH
Extensions
.PP
The compiler has several extensions to 1989 ANSI C, all of which are used
extensively in the system source.
Some of these have been adopted in later ANSI C standards.
First,
.I structure
.I displays
permit
.CW struct
expressions to be formed dynamically.
Given these declarations:
.P1
typedef struct Point Point;
typedef struct Rectangle Rectangle;

struct Point
{
	int x, y;
};

struct Rectangle
{
	Point min, max;
};

Point p, q, add(Point, Point);
Rectangle r;
int x, y;
.P2
this assignment may appear anywhere an assignment is legal:
.P1
r = (Rectangle){add(p, q), (Point){x, y+3}};
.P2
The syntax is the same as for initializing a structure but with
a leading cast.
.PP
If an
.I anonymous
.I structure
or
.I union
is declared within another structure or union, the members of the internal
structure or union are addressable without prefix in the outer structure.
This feature eliminates the clumsy naming of nested structures and,
particularly, unions.
For example, after these declarations,
.P1
struct Lock
{
	int locked;
};

struct Node
{
	int type;
	union{
		double dval;
		float fval;
		long lval;
	}; /* anonymous union */
	struct Lock; /* anonymous structure */
} *node;

void lock(struct Lock*);
.P2
one may refer to
.CW node->type ,
.CW node->dval ,
.CW node->fval ,
.CW node->lval ,
and
.CW node->locked .
Moreover, the address of a
.CW struct
.CW Node
may be used without a cast anywhere that the address of a
.CW struct
.CW Lock
is used, such as in argument lists.
The compiler automatically promotes the type and adjusts the address.
Thus one may invoke
.CW lock(node) .
.PP
Anonymous structures and unions may be accessed by type name
if (and only if) they are declared using a
.CW typedef
name.
For example, using the above declaration for
.CW Point ,
one may declare
.P1
struct
{
	int type;
	Point;
} p;
.P2
and refer to
.CW p.Point .
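.PP
For readers working outside Plan 9, the first two extensions have close counterparts in later ISO C:
compound literals (C99) and anonymous unions (C11).
The sketch below is written for an ISO C11 compiler, not for the Plan 9 compilers.
The functions
.CW bbox
and
.CW setlval
are hypothetical, and the automatic conversion of a
.CW Node
pointer to a
.CW Lock
pointer described above has no ISO C equivalent, so it is not shown.

```c
/*
 * Sketch in ISO C11 of structure displays (compound literals) and an
 * anonymous union. bbox and setlval are hypothetical names; the types
 * follow the declarations in the text, minus the anonymous struct Lock.
 */
typedef struct Point Point;
typedef struct Rectangle Rectangle;
typedef struct Node Node;

struct Point
{
	int x, y;
};

struct Rectangle
{
	Point min, max;
};

struct Node
{
	int type;
	union{
		double dval;
		float fval;
		long lval;
	};	/* anonymous union: members are named without a prefix */
};

Point
add(Point p, Point q)
{
	return (Point){p.x+q.x, p.y+q.y};
}

Rectangle
bbox(Point p, Point q, int x, int y)
{
	/* a structure value built in an expression, as in the assignment to r above */
	return (Rectangle){add(p, q), (Point){x, y+3}};
}

long
setlval(Node *n, long v)
{
	n->type = 1;
	n->lval = v;	/* no intermediate member name is needed */
	return n->lval;
}
```

The compound literal avoids a named temporary, and the anonymous union removes the extra member name that a named union would force into every access.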
.PP
In the initialization of arrays, a number in square brackets before an
element sets the index for the initialization. For example, to initialize
some elements in
a table of function pointers indexed by
ASCII
character,
.P1
void percent(void), slash(void);

void (*func[128])(void) =
{
	['%'] percent,
	['/'] slash,
};
.P2
.LP
A similar syntax allows one to initialize structure elements:
.P1
Point p =
{
	.y 100,
	.x 200
};
.P2
These initialization syntaxes were later added to ANSI C, with the addition of an
equals sign between the index or tag and the value.
The Plan 9 compiler accepts either form.
.PP
Finally, the declaration
.P1
extern register reg;
.P2
.I this "" (
appearance of the register keyword is not ignored)
allocates a global register to hold the variable
.CW reg .
External registers must be used carefully: they need to be declared in
.I all
source files and libraries in the program to guarantee the register
is not allocated temporarily for other purposes.
Especially on machines with few registers, such as the i386,
it is easy to link accidentally with code that has already usurped
the global registers and there is no diagnostic when this happens.
Used wisely, though, external registers are powerful.
The Plan 9 operating system uses them to access per-process and
per-machine data structures on a multiprocessor. The storage class they provide
is hard to create in other ways.
.SH
The compile-time environment
.PP
The code generated by the compilers is `optimized' by default:
variables are placed in registers and peephole optimizations are
performed.
The compiler flag
.CW -N
disables these optimizations.
Registerization is done locally rather than throughout a function:
whether a variable occupies a register or
the memory location identified in the symbol
table depends on the activity of the variable and may change
throughout the life of the variable.
The
.CW -N
flag is rarely needed;
its main use is to simplify debugging.
There is no information in the symbol table to identify the
registerization of a variable, so
.CW -N
guarantees the variable is always where the symbol table says it is.
.PP
Another flag,
.CW -w ,
turns
.I on
warnings about portability and problems detected in flow analysis.
Most code in Plan 9 is compiled with warnings enabled;
these warnings plus the type checking offered by function prototypes
provide most of the support of the Unix tool
.CW lint
more accurately and with less chatter.
Two of the warnings,
`used and not set' and `set and not used', are almost always accurate but
may be triggered spuriously by code with invisible control flow,
such as in routines that call
.CW longjmp .
The compiler statements
.P1
SET(v1);
USED(v2);
.P2
decorate the flow graph to silence the compiler.
Either statement accepts a comma-separated list of variables.
Use them carefully: they may silence real errors.
For the common case of unused parameters to a function,
leaving the name off the declaration silences the warnings.
That is, listing the type of a parameter but giving it no
associated variable name does the trick.
.SH
Debugging
.PP
There are two debuggers available on Plan 9.
The first, and older, is
.CW db ,
a revision of Unix
.CW adb .
The other,
.CW acid ,
is a source-level debugger whose commands are statements in
a true programming language.
.CW Acid
is the preferred debugger, but since it
borrows some elements of
.CW db ,
notably the formats for displaying values, it is worth knowing a little bit about
.CW db .
.PP
Both debuggers support multiple architectures in a single program; that is,
the programs are
.CW db
and
.CW acid ,
not for example
.CW vdb
and
.CW vacid .
They also support cross-architecture debugging comfortably:
one may debug a 386 binary on a MIPS.
.PP
Imagine a program has crashed mysteriously:
.P1
% X11/X
Fatal server bug!
failed to create default stipple
X 106: suicide: sys: trap: fault read addr=0x0 pc=0x00105fb8
%
.P2
When a process dies on Plan 9 it hangs in the `broken' state
for debugging.
Attach a debugger to the process by naming its process id:
.P1
% acid 106
/proc/106/text:mips plan 9 executable

/sys/lib/acid/port
/sys/lib/acid/mips
acid:
.P2
The
.CW acid
function
.CW stk()
reports the stack traceback:
.P1
acid: stk()
At pc:0x105fb8:abort+0x24 /sys/src/ape/lib/ap/stdio/abort.c:6
abort() /sys/src/ape/lib/ap/stdio/abort.c:4
	called from FatalError+#4e
		/sys/src/X/mit/server/dix/misc.c:421
FatalError(s9=#e02, s8=#4901d200, s7=#2, s6=#72701, s5=#1,
	s4=#7270d, s3=#6, s2=#12, s1=#ff37f1c, s0=#6, f=#7270f)
	/sys/src/X/mit/server/dix/misc.c:416
	called from gnotscreeninit+#4ce
		/sys/src/X/mit/server/ddx/gnot/gnot.c:792
gnotscreeninit(snum=#0, sc=#80db0)
	/sys/src/X/mit/server/ddx/gnot/gnot.c:766
	called from AddScreen+#16e
		/n/bootes/sys/src/X/mit/server/dix/main.c:610
AddScreen(pfnInit=0x0000129c,argc=0x00000001,argv=0x7fffffe4)
	/sys/src/X/mit/server/dix/main.c:530
	called from InitOutput+0x80
		/sys/src/X/mit/server/ddx/brazil/brddx.c:522
InitOutput(argc=0x00000001,argv=0x7fffffe4)
	/sys/src/X/mit/server/ddx/brazil/brddx.c:511
	called from main+0x294
		/sys/src/X/mit/server/dix/main.c:225
main(argc=0x00000001,argv=0x7fffffe4)
	/sys/src/X/mit/server/dix/main.c:136
	called from _main+0x24
		/sys/src/ape/lib/ap/mips/main9.s:8
.P2
The function
.CW lstk()
is similar but
also reports the values of local variables.
Note that the traceback includes full file names; this is a boon to debugging,
although it makes the output much noisier.
.PP
To use
.CW acid
well you will need to learn its input language; see the
``Acid Manual'',
by Phil Winterbottom,
for details. For simple debugging, however, the information in the manual page is
sufficient. In particular, it describes the most useful functions
for examining a process.
.PP
The compiler does not place
information describing the types of variables in the executable,
but a compile-time flag provides crude support for symbolic debugging.
The
.CW -a
flag to the compiler suppresses code generation
and instead emits source text in the
.CW acid
language to format and display data structure types defined in the program.
The easiest way to use this feature is to put a rule in the
.CW mkfile :
.P1
syms: main.$O
	$CC -a main.c > syms
.P2
Then from within
.CW acid ,
.P1
acid: include("sourcedirectory/syms")
.P2
to read in the relevant definitions.
(For multi-file source, you need to be a little fancier;
see
.I 8c (1)).
This text includes, for each defined compound
type, a function with that name that may be called with the address of a structure
of that type to display its contents.
For example, if
.CW rect
is a global variable of type
.CW Rectangle ,
one may execute
.P1
Rectangle(*rect)
.P2
to display it.
The
.CW *
(indirection) operator is necessary because
of the way
.CW acid
works: each global symbol in the program is defined as a variable by
.CW acid ,
with value equal to the
.I address
of the symbol.
.PP
Another common technique is to write by hand special
.CW acid
code to define functions to aid debugging, initialize the debugger, and so on.
Conventionally, this is placed in a file called
.CW acid
in the source directory; it has a line
.P1
include("sourcedirectory/syms");
.P2
to load the compiler-produced symbols. One may edit the compiler output directly but
it is wiser to keep the hand-generated
.CW acid
separate from the machine-generated.
.PP
To make things simple, the default rules in the system
.CW mkfiles
include entries to make
.CW foo.acid
from
.CW foo.c ,
so one may use
.CW mk
to automate the production of
.CW acid
definitions for a given C source file.
.PP
There is much more to say here. See the
.CW acid
manual page, the reference manual, or the paper
``Acid: A Debugger Built From A Language'',
also by Phil Winterbottom.