.HTML "Plan 9 C Compilers
.TL
Plan 9 C Compilers
.AU
Ken Thompson
ken@plan9.bell-labs.com
.AB
.FS
Originally appeared, in a different form, in
.I
Proceedings of the Summer 1990 UKUUG Conference,
.R
pp. 41-51,
London, 1990.
.FE
This paper describes the overall structure and function of the Plan 9 C compilers.
A more detailed implementation document
for any one of the compilers
is yet to be written.
.AE
.NH
Introduction
.LP
There are many compilers in the series.
Six of the compilers
(Intel 386, AMD64, PowerPC, PowerPC 64-bit, ARM, MIPS R3000)
are considered active and are used to compile
current versions of Plan 9.
One of the compilers (SPARC)
is maintained but is for older machines
for which we have no current ports of Plan 9;
we are unlikely to port to any SPARC machines.
The DEC Alpha and Motorola 68020 compilers have been retired.
Several others (Motorola 68000, Intel 960, AMD 29000)
have had only limited use, such as
to program peripherals or experimental devices.
Any of the disused compilers could be restored if needed.
.NH
Structure
.LP
The compiler is a single program that produces an
object file.
Combined in the compiler are the traditional
roles of preprocessor, lexical analyzer, parser, code generator,
local optimizer,
and first half of the assembler.
The object files are binary forms of assembly
language,
similar to what might be passed between
the first and second passes of an assembler.
.LP
Object files and libraries
are combined by a loader
program to produce the executable binary.
The loader combines the roles of second half
of the assembler, global optimizer, and loader.
The names of the compilers, loaders, and assemblers
are as follows:
.DS
.ta 1.5i
.de Ta
\\$1	\f(CW\\$2\fP  \f(CW\\$3\fP  \f(CW\\$4\fP
..
.Ta SPARC kc kl ka
.Ta PowerPC qc ql qa
.Ta MIPS vc vl va
.Ta MIPS\ little-endian 0c 0l 0a
.Ta ARM 5c 5l 5a
.Ta AMD64 6c 6l 6a
.Ta Intel\ 386 8c 8l 8a
.Ta PowerPC\ 64-bit 9c 9l 9a
.DE
There is a further breakdown
in the source of the compilers into
object-independent and
object-dependent
parts.
All of the object-independent parts
are combined into source files in the
directory
.CW /sys/src/cmd/cc .
The object-dependent parts are collected
in a separate directory for each compiler,
for example
.CW /sys/src/cmd/vc .
All of the code,
both object-independent and
object-dependent,
is machine-independent
and may be cross-compiled and executed on any
of the architectures.
.NH
The Language
.LP
The compiler implements ANSI C with some
restrictions and extensions
[ANSI90].
Most of the restrictions are due to
personal preference, while
most of the extensions were to help in
the implementation of Plan 9.
There are other departures from the standard,
particularly in the libraries,
that are beyond the scope of this
paper.
.NH 2
Register, volatile, const
.LP
The keyword
.CW register
is recognized syntactically
but is semantically ignored.
Thus taking the address of a
.CW register
variable is not diagnosed.
The keyword
.CW volatile
disables all optimizations, in particular registerization, of the corresponding variable.
The keyword
.CW const
generates warnings (if warnings are enabled by the compiler's
.CW -w
option) of non-constant use of the variable,
but does not affect the generated code.
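.LP
As a small sketch of these rules
(the variables here are hypothetical),
the following fragment compiles without complaint,
except that the final assignment draws a warning when
.CW -w
is given:
.DS
.CW
.ta .1i 2.3i
	register int i;	/* accepted; treated as an ordinary automatic */
	int *p = &i;	/* not diagnosed, since register is ignored */
	volatile int ticks;	/* never registerized */
	const int max = 10;

	max = 11;	/* warned about under -w; code is generated anyway */
.DE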
.NH 2
The preprocessor
.LP
The C preprocessor is probably the
biggest departure from the ANSI standard.
.LP
The preprocessor built into the Plan 9 compilers does not support
.CW #if ,
although it does handle
.CW #ifdef
and
.CW #include .
If it is necessary to be more standard,
the source text can first be run through the separate ANSI C
preprocessor,
.CW cpp .
.NH 2
Unnamed substructures
.LP
The most important and most heavily used of the
extensions is the declaration of an
unnamed substructure or subunion.
For example:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	typedef
	struct lock
	{
		int	locked;
	} Lock;

	typedef
	struct node
	{
		int	type;
		union
		{
			double	dval;
			float	fval;
			long	lval;
		};
		Lock;
	} Node;

	Lock*	lock;
	Node*	node;
.DE
The declaration of
.CW Node
has an unnamed substructure of type
.CW Lock
and an unnamed subunion.
One use of this feature allows elements of the
subunit to be accessed as if they were in
the outer structure.
Thus
.CW node->dval
and
.CW node->locked
are legitimate references.
.LP
When an outer structure is used
in a context that is only legal for
an unnamed substructure,
the compiler promotes the reference to the
unnamed substructure.
This is true for references to structures and
for references to pointers to structures.
This happens in assignment statements and
in argument passing where prototypes have been
declared.
Thus, continuing with the example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	lock = node;
.DE
would assign a pointer to the unnamed
.CW Lock
in
the
.CW Node
to the variable
.CW lock .
Another example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	extern void lock(Lock*);
	func(...)
	{
		...
		lock(node);
		...
	}
.DE
will pass a pointer to the
.CW Lock
substructure.
.LP
Finally, in places where context is insufficient to identify the unnamed structure,
the type name (it must be a
.CW typedef )
of the unnamed structure can be used as an identifier.
In our example,
.CW &node->Lock
gives the address of the anonymous
.CW Lock
structure.
.NH 2
Structure displays
.LP
A structure cast followed by a list of expressions in braces is
an expression with the type of the structure and elements assigned from
the corresponding list.
Structures are now almost first-class citizens of the language.
It is common to see code like this:
.DS
.CW
.ta .1i
	r = (Rectangle){point1, (Point){x,y+2}};
.DE
.NH 2
Initialization indexes
.LP
In initializers of arrays,
one may place a constant expression
in square brackets before an initializer.
This causes the next initializer to assign
the indicated element.
For example:
.DS
.CW
.ta .1i .6i 1.6i
	enum	errors
	{
		Etoobig,
		Ealarm,
		Egreg
	};
	char*	errstrings[] =
	{
		[Ealarm]	"Alarm call",
		[Egreg]	"Panic: out of mbufs",
		[Etoobig]	"Arg list too long",
	};
.DE
In the same way,
individual structure members may
be initialized in any order by preceding the initialization with
.CW .tagname .
Both forms allow an optional
.CW = ,
to be compatible with a proposed
extension to ANSI C.
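.LP
As a sketch of the member form
(the
.CW Point
type here is hypothetical):
.DS
.CW
.ta .1i .6i 1.6i
	typedef
	struct point
	{
		int	x;
		int	y;
	} Point;

	Point p =
	{
		.y	10,
		.x	= 5,
	};
.DE
The two member initializers are written once without and once
with the optional equal sign.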
.NH 2
External register
.LP
The declaration
.CW extern
.CW register
will dedicate a register to
a variable on a global basis.
It can be used only under special circumstances.
External register variables must be identically
declared in all modules and
libraries.
The feature is not intended for efficiency,
although it can produce efficient code;
rather it represents a unique storage class that
would be hard to get any other way.
On a shared-memory multi-processor,
an external register is
one-per-processor and neither one-per-procedure (automatic)
nor one-per-system (external).
It is used for two variables in the Plan 9 kernel,
.CW u
and
.CW m .
.CW U
is a pointer to the structure representing the currently running process
and
.CW m
is a pointer to the per-machine data structure.
.NH 2
Long long
.LP
The compilers accept
.CW long
.CW long
as a basic type meaning 64-bit integer.
On some of the machines
this type is synthesized from 32-bit instructions.
.NH 2
Pragma
.LP
The compilers accept
.CW #pragma
.CW lib
.I libname
and pass the
library name string uninterpreted
to the loader.
The loader uses the library name to
find libraries to load.
If the name contains
.CW $O ,
it is replaced with
the single character object type of the compiler
(e.g.,
.CW v
for the MIPS).
If the name contains
.CW $M ,
it is replaced with
the architecture type for the compiler
(e.g.,
.CW mips
for the MIPS).
If the name starts with
.CW /
it is an absolute pathname;
if it starts with
.CW .
then it is searched for in the loader's current directory.
Otherwise, the name is searched from
.CW /$M/lib .
Such
.CW #pragma
statements in header files guarantee that the correct
libraries are always linked with a program without the
need to specify them explicitly at link time.
.LP
They also accept
.CW #pragma
.CW packed
.CW on
(or
.CW yes
or
.CW 1 )
to cause subsequently declared data, until
.CW #pragma
.CW packed
.CW off
(or
.CW no
or
.CW 0 ),
to be laid out in memory tightly packed in successive bytes, disregarding
the usual alignment rules.
Accessing such data can cause faults.
.LP
Similarly,
.CW #pragma
.CW profile
.CW off
(or
.CW no
or
.CW 0 )
causes subsequently declared functions, until
.CW #pragma
.CW profile
.CW on
(or
.CW yes
or
.CW 1 ),
to be marked as unprofiled.
Such functions will not be profiled when
profiling is enabled for the rest of the program.
.LP
Two
.CW #pragma
statements allow type-checking of
.CW print -like
functions.
The first, of the form
.P1
#pragma varargck argpos error 2
.P2
tells the compiler that the second argument to
.CW error
is a
.CW print
format string (see the manual page
.I print (2))
that specifies how to format
.CW error 's
subsequent arguments.
The second, of the form
.P1
#pragma varargck type "s" char*
.P2
says that the
.CW print
format verb
.CW s
processes an argument of
type
.CW char* .
If the compiler's
.CW -F
option is enabled, the compiler will use this information
to report type violations in the arguments to
.CW print ,
.CW error ,
and similar routines.
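.LP
Taken together, these directives let a library's header carry both
its link-time and its type-checking information.
The following sketch of a hypothetical
.CW error.h
assumes a library archive named
.CW liberror.a :
.P1
#pragma lib "liberror.a"
#pragma varargck argpos error 2
#pragma varargck type "s" char*

void error(int code, char *fmt, ...);
.P2
Any program that includes the header is then linked with the
library automatically, and the arguments of calls to
.CW error
are checked whenever the compiler's
.CW -F
option is given.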
.NH
Object module conventions
.LP
The overall conventions of the runtime environment
are important
to runtime efficiency.
In this section,
several of these conventions are discussed.
.NH 2
Register saving
.LP
In the Plan 9 compilers,
the caller of a procedure saves the registers.
With caller-saves,
the leaf procedures can use all the
registers and never save them.
If you spend a lot of time at the leaves,
this seems preferable.
With callee-saves,
the saving of the registers is done
in the single point of entry and return.
If you are interested in space,
this seems preferable.
In both,
there is a degree of uncertainty
about what registers need to be saved.
Callee-saved registers make it difficult to
find variables in registers in debuggers.
Callee-saved registers also complicate
the implementation of
.CW longjmp .
The convincing argument is
that with caller-saves,
the decision to registerize a variable
can include the cost of saving the register
across calls.
For a further discussion of caller- vs. callee-saves,
see the paper by Davidson and Whalley [Dav91].
.LP
In the Plan 9 operating system,
calls to the kernel look like normal procedure
calls, which means
the caller
has saved the registers and the system
entry does not have to.
This makes system calls considerably faster.
Since this is a potential security hole,
and can lead to non-determinism,
the system may eventually save the registers
on entry,
or more likely clear the registers on return.
.NH 2
Calling convention
.LP
Older C compilers maintain a frame pointer, which is at a known constant
offset from the stack pointer within each function.
For machines where the stack grows towards zero,
the argument pointer is at a known constant offset
from the frame pointer.
Since the stack grows down in Plan 9,
the Plan 9 compilers
keep neither an
explicit frame pointer nor
an explicit argument pointer;
instead they generate addresses relative to the stack pointer.
.LP
On some architectures, the first argument to a subroutine is passed in a register.
.NH 2
Functions returning structures
.LP
Structures longer than one word are awkward to implement
since they do not fit in registers and must
be passed around in memory.
Functions that return structures
are particularly clumsy.
The Plan 9 compilers pass the return address of
a structure as the first argument of a
function that has a structure return value.
Thus
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	x = f(...)
.DE
is rewritten as
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	f(&x, ...)\f1.
.DE
This saves a copy and makes the compilation
much less clumsy.
A disadvantage is that if you call this
function without an assignment,
a dummy location must be invented.
.LP
There is also a danger of calling a function
that returns a structure without declaring
it as such.
With ANSI C function prototypes,
this error need never occur.
.NH
Implementation
.LP
The compiler is divided internally into
four machine-independent passes,
four machine-dependent passes,
and an output pass.
The next nine sections describe each pass in order.
.NH 2
Parsing
.LP
The first pass is a YACC-based parser
[Joh79].
Declarations are interpreted immediately,
building a block structured symbol table.
Executable statements are put into a parse tree
and collected,
without interpretation.
At the end of each procedure,
the parse tree for the function is
examined by the other passes of the compiler.
.LP
The input stream of the parser is
a pushdown list of input activations.
The preprocessor
expansions of
macros
and
.CW #include
are implemented as pushdowns.
Thus there is no separate
pass for preprocessing.
.NH 2
Typing
.LP
The next pass distributes typing information
to every node of the tree.
Implicit operations on the tree are added,
such as type promotions and taking the
address of arrays and functions.
.NH 2
Machine-independent optimization
.LP
The next pass performs optimizations
and transformations of the tree, such as converting
.CW &*x
and
.CW *&x
into
.CW x .
Constant expressions are converted to constants in this pass.
.NH 2
Arithmetic rewrites
.LP
This is another machine-independent optimization.
Subtrees of add, subtract, and multiply of integers are
rewritten for easier compilation.
The major transformation is factoring:
.CW 4+8*a+16*b+5
is transformed into
.CW 9+8*(a+2*b) .
Such expressions arise from address
manipulation and array indexing.
.NH 2
Addressability
.LP
This is the first of the machine-dependent passes.
The addressability of a processor is defined as the set of
expressions that is legal in the address field
of a machine language instruction.
The addressability of different processors varies widely.
At one end of the spectrum are the 68020 and VAX,
which allow a complex mix of incrementing,
decrementing,
indexing, and relative addressing.
At the other end is the MIPS,
which allows only registers and constant offsets from the
contents of a register.
The addressability can be different for different instructions
within the same processor.
.LP
It is important to the code generator to know when a
subtree represents an address of a particular type.
This is done with a bottom-up walk of the tree.
In this pass, the leaves are labeled with small integers.
When an internal node is encountered,
it is labeled by consulting a table indexed by the
labels on the left and right subtrees.
For example,
on the 68020 processor,
it is possible to address an
offset from a named location.
In C, this is represented by the expression
.CW *(&name+constant) .
This is marked addressable by the following table.
In the table,
a node represented by the left column is marked
with a small integer from the right column.
Marks of the form
.CW A\s-2\di\u\s0
are addressable while
marks of the form
.CW N\s-2\di\u\s0
are not addressable.
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	name	A\s-2\d1\u\s0
	const	A\s-2\d2\u\s0
	&A\s-2\d1\u\s0	A\s-2\d3\u\s0
	A\s-2\d3\u\s0+A\s-2\d1\u\s0	N\s-2\d1\u\s0 \fR(note that this is not addressable)\fP
	*N\s-2\d1\u\s0	A\s-2\d4\u\s0
.DE
Here there is a distinction between
a node marked
.CW A\s-2\d1\u\s0
and a node marked
.CW A\s-2\d4\u\s0
because the address operator of an
.CW A\s-2\d4\u\s0
node is not addressable.
So to extend the table:
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	&A\s-2\d4\u\s0	N\s-2\d2\u\s0
	N\s-2\d2\u\s0+N\s-2\d1\u\s0	N\s-2\d1\u\s0
.DE
The full addressability of the 68020 is expressed
in 18 rules like this,
while the addressability of the MIPS is expressed
in 11 rules.
When one ports the compiler,
this table is usually initialized
so that leaves are labeled as addressable and nothing else.
The code produced is poor,
but porting is easy.
The table can be extended later.
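.LP
In outline, the marking walk is a few lines of recursion around a
table lookup.
Everything in this sketch (the node fields, the operator names,
and the rule table) is a hypothetical stand-in for the compiler's
own structures:
.P1
typedef struct Node Node;
struct Node
{
	int	op;	/* ONAME, OCONST, OADDR, ... */
	int	alab;	/* addressability mark */
	int	su;	/* Sethi-Ullman mark, used below */
	Node	*left;
	Node	*right;	/* 0 for leaves and unary operators */
};

enum { ONAME, OCONST, OADDR, OIND, OADD, OCALL, NOPS };
enum { A1 = 1, A2, A3, A4, N1, N2, NMARK };

int	atab[NOPS][NMARK][NMARK];	/* the machine's rule table */

int
amark(Node *n)
{
	int l, r;

	switch(n->op){
	case ONAME:
		return n->alab = A1;
	case OCONST:
		return n->alab = A2;
	}
	l = amark(n->left);
	r = n->right != 0? amark(n->right): 0;
	return n->alab = atab[n->op][l][r];
}
.P2
A port then begins with a nearly empty table and grows it
rule by rule.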
.LP
This pass also rewrites some complex operators
into procedure calls.
Examples include 64-bit multiply and divide.
.LP
In the same bottom-up pass of the tree,
the nodes are labeled with a Sethi-Ullman complexity
[Set70].
This number is roughly the number of registers required
to compile the tree on an ideal machine.
An addressable node is marked 0.
A function call is marked infinite.
A unary operator is marked as the
maximum of 1 and the mark of its subtree.
A binary operator with equal marks on its subtrees is
marked with a subtree mark plus 1.
A binary operator with unequal marks on its subtrees is
marked with the maximum mark of its subtrees.
The actual values of the marks are not too important,
but the relative values are.
The goal is to compile the harder
(larger mark)
subtree first.
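.LP
These rules reduce to a short recursive marking function.
The sketch below reuses the hypothetical
.CW Node
and marks of the previous sketch, with a large constant
standing in for infinity:
.P1
enum { SUINF = 1000 };	/* stands in for infinity */

int
sumark(Node *n)
{
	int l, r;

	if(n->alab >= A1 && n->alab <= A4)
		return n->su = 0;	/* addressable nodes cost nothing */
	if(n->op == OCALL)
		return n->su = SUINF;	/* function calls are infinite */
	l = sumark(n->left);
	if(n->right == 0)
		return n->su = l > 1? l: 1;	/* unary: max(1, subtree) */
	r = sumark(n->right);
	if(l == r)
		return n->su = l + 1;	/* equal subtrees need one more register */
	return n->su = l > r? l: r;	/* unequal: the larger mark */
}
.P2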
.NH 2
Code generation
.LP
Code is generated by recursive
descent.
The Sethi-Ullman complexity completely guides the
order.
The addressability defines the leaves.
The only difficult part is compiling a tree
that has two infinite (function call)
subtrees.
In this case,
one subtree is compiled into the return register
(usually the most convenient place for a function call)
and then stored on the stack.
The other subtree is compiled into the return register
and then the operation is compiled with
operands from the stack and the return register.
.LP
There is a separate boolean code generator that compiles
conditional expressions.
This is fundamentally different from compiling an arithmetic expression.
The result of the boolean code generator is the
position of the program counter and not an expression.
The boolean code generator makes extensive use of De Morgan's rule.
The boolean code generator is an expanded version of that described
in chapter 8 of Aho, Sethi, and Ullman
[Aho87].
.LP
There is a considerable amount of talk in the literature
about automating this part of a compiler with a machine
description.
Since this code generator is so small
(less than 500 lines of C)
and easy,
it hardly seems worth the effort.
.NH 2
Registerization
.LP
Up to now,
the compiler has operated on syntax trees
that are roughly equivalent to the original source language.
The previous pass has produced machine language in an internal
format.
The next two passes operate on the internal machine language
structures.
The purpose of the next pass is to reintroduce
registers for heavily used variables.
.LP
All of the variables that can be
potentially registerized within a procedure are
placed in a table.
(Suitable variables are any automatic or external
scalars that do not have their addresses extracted.
Some constants that are hard to reference are also
considered for registerization.)
Four separate data flow equations are evaluated
over the procedure on all of these variables.
Two of the equations are the normal set-behind
and used-ahead
bits that define the life of a variable.
The two new bits tell if a variable life
crosses a function call ahead or behind.
By examining a variable over its lifetime,
it is possible to get a cost
for registerizing.
Loops are detected and the costs are multiplied
by three for every level of loop nesting.
Costs are sorted and the variables
are replaced by available registers on a greedy basis.
.LP
The 68020 has two different
types of registers.
For the 68020,
two different costs are calculated for
each variable life and the register type that
affords the better cost is used.
Ties are broken by counting the number of available
registers of each type.
.LP
Note that externals are registerized together with automatics.
This is done by evaluating the semantics of a ``call'' instruction
differently for externals and automatics.
Since a call goes outside the local procedure,
it is assumed that a call references all externals.
Similarly,
externals are assumed to be set before an ``entry'' instruction
and assumed to be referenced after a ``return'' instruction.
This makes sure that externals are in memory across calls.
.LP
The overall results are satisfactory.
It would be nice to be able to do this processing in
a machine-independent way,
but it is impossible to get all of the costs and
side effects of different choices by examining the parse tree.
.LP
Most of the code in the registerization pass is machine-independent.
The major machine-dependency is in
examining a machine instruction to ask if it sets or references
a variable.
.NH 2
Machine code optimization
.LP
The next pass walks the machine code
for opportunistic optimizations.
For the most part,
this is highly specific to a particular
processor.
One optimization that is performed
on all of the processors is the
removal of unnecessary ``move''
instructions.
Ironically,
most of these instructions were inserted by
the previous pass.
There are two patterns that are repetitively
matched and replaced until no more matches are
found.
The first tries to remove ``move'' instructions
by relabeling variables.
.LP
When a ``move'' instruction is encountered,
if the destination variable is set before the
source variable is referenced,
then all of the references to the destination
variable can be renamed to the source and the ``move''
can be deleted.
This transformation uses the reverse data flow
set up in the previous pass.
.LP
An example of this pattern is depicted in the following
table.
The pattern is in the left column and the
replacement action is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	MOVE	a->b	\fR(remove)\fP
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	USE	b	USE	a
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	SET	b	SET	b
.DE
.LP
Experiments have shown that it is marginally
worthwhile to rename uses of the destination variable
with uses of the source variable up to
the first use of the source variable.
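.LP
Restricted to straight-line code, the first pattern reduces to a
short scan.
The instruction representation below is a hypothetical stand-in
for the compiler's internal one, and the real pass consults the
data-flow bits rather than rescanning:
.P1
enum { SET, USE, MOVE, NOP };

typedef struct Ins Ins;
struct Ins
{
	int	op;
	int	src;	/* variable read by USE or MOVE */
	int	dst;	/* variable written by SET or MOVE */
};

int
mentions(Ins *p, int v)
{
	switch(p->op){
	case SET:
		return p->dst == v;
	case USE:
		return p->src == v;
	case MOVE:
		return p->src == v || p->dst == v;
	}
	return 0;
}

/* ins[i] is MOVE a->b; rewrite and return 1 if the pattern matches */
int
elidemove(Ins *ins, int n, int i)
{
	int a, b, j, k;

	a = ins[i].src;
	b = ins[i].dst;
	for(j = i+1; j < n; j++){
		if(mentions(&ins[j], a))
			return 0;	/* a is mentioned: no match */
		if(ins[j].op != USE && ins[j].dst == b)
			break;		/* b set before a is referenced */
	}
	if(j == n)
		return 0;
	for(k = i+1; k < j; k++)
		if(ins[k].op != SET && ins[k].src == b)
			ins[k].src = a;	/* rename references to b as a */
	ins[i].op = NOP;		/* remove the move */
	return 1;
}
.P2
Repeating such a scan over every ``move'' until nothing changes
corresponds to the repeated matching described above.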
.LP
The second transform will do relabeling
without deleting instructions.
When a ``move'' instruction is encountered,
if the source variable has been set prior
to the use of the destination variable
then all of the references to the source
variable are replaced by the destination and
the ``move'' is inverted.
Typically,
this transformation will alter two ``move''
instructions and allow the first transformation
another chance to remove code.
This transformation uses the forward data flow
set up in the previous pass.
.LP
Again,
the following is a depiction of the transformation where
the pattern is in the left column and the
rewrite is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	SET	a	SET	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	USE	a	USE	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	MOVE	a->b	MOVE	b->a
.DE
Iterating these transformations
will usually get rid of all redundant ``move'' instructions.
.LP
A problem with this organization is that the costs
of registerization calculated in the previous pass
must depend on how well this pass can detect and remove
redundant instructions.
Often,
a fine candidate for registerization is rejected
because of the cost of instructions that are later
removed.
.NH 2
Writing the object file
.LP
The last pass walks the internal assembly language
and writes the object file.
The object file is reduced in size by about a factor
of three with simple compression
techniques.
The most important aspect of the object file
format is that it is independent of the compiling machine.
All integer and floating-point numbers in the object
code are converted to known formats and byte
orders.
.NH
The loader
.LP
The loader is a multiple-pass program that
reads object files and libraries and produces
an executable binary.
The loader also does some minimal
optimizations and code rewriting.
Many of the operations performed by the
loader are machine-dependent.
.LP
The first pass of the loader reads the
object modules into an internal data
structure that looks like binary assembly language.
As the instructions are read,
code is reordered to remove
unconditional branch instructions.
Conditional branch instructions are inverted
to prevent the insertion of unconditional branches.
The loader will also make a copy of a few instructions
to remove an unconditional branch.
.LP
The next pass allocates addresses for
all external data.
Typical of processors is the MIPS,
which can reference ±32K bytes from a
register.
The loader allocates the register
.CW R30
as the static pointer.
The value placed in
.CW R30
is the base of the data segment plus 32K.
It is then cheap to reference all data in the
first 64K of the data segment.
External variables are allocated to
the data segment
with the smallest variables allocated first.
If all of the data cannot fit into the first
64K of the data segment,
then usually only a few large arrays
need more expensive addressing modes.
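.LP
As a worked example
(the addresses here are invented),
if the data segment were to begin at 0x10000,
the loader would set
.CW R30
to 0x18000, and every variable in the first 64K of the segment
would lie within the signed 16-bit offset field of a MIPS
load or store:
.DS
.CW
.ta .1i 1.3i
	R30	= 0x10000 + 0x8000 = 0x18000
	0x10008	-> -32760(R30)
	0x1FFF0	-> 32752(R30)
.DE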
.LP
For the MIPS processor,
the loader makes a pass over the internal
structures,
exchanging instructions to try
to fill ``delay slots'' with useful work.
If a useful instruction cannot be found
to fill a delay slot,
the loader will insert
``noop''
instructions.
This pass is very expensive and does not
do a good job.
About 40% of all instructions are in
delay slots.
About 65% of these are useful instructions and
35% are ``noops.''
The vendor-supplied assembler does this job
more effectively,
filling about 80%
of the delay slots with useful instructions.
.LP
On the 68020 processor,
branch instructions come in a variety of
sizes depending on the relative distance
of the branch.
Thus the size of branch instructions
can be mutually dependent.
The loader uses a multiple-pass algorithm
to resolve the branch lengths
[Szy78].
Initially, all branches are assumed minimal length.
On each subsequent pass,
the branches are reassessed
and expanded if necessary.
When no more expansions occur,
the locations of the instructions in
the text segment are known.
.LP
On the MIPS processor,
all instructions are one size.
A single pass over the instructions will
determine the locations of all addresses
in the text segment.
.LP
The last pass of the loader produces the
executable binary.
A symbol table and other tables are
produced to help the debugger
interpret the binary symbolically.
.LP
The loader places absolute source line numbers in the symbol table.
The names and absolute line numbers of all
.CW #include
files are also placed in the
symbol table so that the debuggers can
associate object code to source files.
.NH
Performance
.LP
The following is a table of the source size of the MIPS
compiler.
.DS
.ta .1i .6i
	lines	module
	\0509	machine-independent headers
	1070	machine-independent YACC source
	6090	machine-independent C source

	\0545	machine-dependent headers
	6532	machine-dependent C source

	\0298	loader headers
	5215	loader C source
.DE
.LP
The following table shows timing
of a test program
that plays checkers, running on a MIPS R4000.
The test program is 26 files totaling 12600 lines of C.
The execution time does not significantly
depend on library implementation.
Since no other compiler runs on Plan 9,
the Plan 9 tests were done with the Plan 9 operating system;
the other tests were done on the vendor's operating system.
The hardware was identical in both cases.
The optimizer in the vendor's compiler
is reputed to be extremely good.
.DS
.ta .1i .9i
	\0\04.49s	Plan 9 \f(CWvc\fP \f(CW-N\fP compile time (opposite of \f(CW-O\fP)
	\0\01.72s	Plan 9 \f(CWvc\fP \f(CW-N\fP load time
	148.69s	Plan 9 \f(CWvc\fP \f(CW-N\fP run time

	\015.07s	Plan 9 \f(CWvc\fP compile time (\f(CW-O\fP implicit)
	\0\01.66s	Plan 9 \f(CWvc\fP load time
	\089.96s	Plan 9 \f(CWvc\fP run time

	\014.83s	vendor \f(CWcc\fP compile time
	\0\00.38s	vendor \f(CWcc\fP load time
	104.75s	vendor \f(CWcc\fP run time

	\043.59s	vendor \f(CWcc\fP \f(CW-O\fP compile time
	\0\00.38s	vendor \f(CWcc\fP \f(CW-O\fP load time
	\076.19s	vendor \f(CWcc\fP \f(CW-O\fP run time

	\0\08.19s	vendor \f(CWcc\fP \f(CW-O3\fP compile time
	\035.97s	vendor \f(CWcc\fP \f(CW-O3\fP load time
	\071.16s	vendor \f(CWcc\fP \f(CW-O3\fP run time
.DE
.LP
To compare the Intel compiler,
a program that is about 40% bit manipulation and
about 60% single precision floating point was
run on the same 33 MHz 486, once under Windows
compiled with the Watcom compiler, version 10.0,
in 16-bit mode and once under
Plan 9 in 32-bit mode.
The Plan 9 execution time was 27 sec while the Windows
execution time was 31 sec.
.NH
Conclusions
.LP
The new compilers compile
quickly,
load slowly,
and produce
medium quality
object code.
The compilers are relatively
portable,
requiring but a couple of weeks' work to
produce a compiler for a different computer.
For Plan 9,
where we needed several compilers
with specialized features and
our own object formats,
this project was indispensable.
It is also necessary for us to
be able to freely distribute our compilers
with the Plan 9 distribution.
.LP
Two problems have come up in retrospect.
The first has to do with the
division of labor between compiler and loader.
Plan 9 runs on multi-processors and as such
compilations are often done in parallel.
Unfortunately,
all compilations must be complete before loading
can begin.
The load is then single-threaded.
With this model,
any shift of work from compile to load
results in a significant increase in real time.
The same is true of libraries that are compiled
infrequently and loaded often.
In the future,
we may try to put some of the loader work
back into the compiler.
.LP
The second problem comes from
the various optimizations performed over several
passes.
Often optimizations in different passes depend
on each other.
Iterating the passes could compromise efficiency,
or even loop.
We see no real solution to this problem.
.NH
References
.LP
[Aho87] A. V. Aho, R. Sethi, and J. D. Ullman,
.I
Compilers \- Principles, Techniques, and Tools,
.R
Addison Wesley,
Reading, MA,
1987.
.LP
[ANSI90] \f2American National Standard for Information Systems \-
Programming Language C\f1, American National Standards Institute, Inc.,
New York, 1990.
.LP
[Dav91] J. W. Davidson and D. B. Whalley,
``Methods for Saving and Restoring Register Values across Function Calls'',
.I
Software\-Practice and Experience,
.R
Vol 21(2), pp. 149-165, February 1991.
.LP
[Joh79] S. C. Johnson,
``YACC \- Yet Another Compiler Compiler'',
.I
UNIX Programmer's Manual, Seventh Ed., Vol. 2A,
.R
AT&T Bell Laboratories,
Murray Hill, NJ,
1979.
.LP
[Set70] R. Sethi and J. D. Ullman,
``The Generation of Optimal Code for Arithmetic Expressions'',
.I
Journal of the ACM,
.R
Vol 17(4), pp. 715-728, 1970.
.LP
[Szy78] T. G. Szymanski,
``Assembling Code for Machines with Span-dependent Instructions'',
.I
Communications of the ACM,
.R
Vol 21(4), pp. 300-308, 1978.