1.HTML "Plan 9 C Compilers 2.TL 3Plan 9 C Compilers 4.AU 5Ken Thompson 6ken@plan9.bell-labs.com 7.AB 8.FS 9Originally appeared, in a different form, in 10.I 11Proceedings of the Summer 1990 UKUUG Conference, 12.R 13pp. 41-51, 14London, 1990. 15.FE 16This paper describes the overall structure and function of the Plan 9 C compilers. 17A more detailed implementation document 18for any one of the compilers 19is yet to be written. 20.AE 21.NH 22Introduction 23.LP 24There are many compilers in the series. 25Five of the compilers (Intel 386, AMD64, PowerPC, PowerPC 64-bit, ARM) 26are considered active and are used to compile 27current versions of Plan 9. 28Four of the compilers (MIPS 3000, SPARC, DEC Alpha, and Motorola 68020) 29are maintained but are for older machines 30for which we have no current ports of Plan 9; 31other than the MIPS, we are unlikely to port to any such machines. 32Several others (Motorola 68000, Intel 960, AMD 29000) have had only limited use, such as 33to program peripherals or experimental devices. 34.NH 35Structure 36.LP 37The compiler is a single program that produces an 38object file. 39Combined in the compiler are the traditional 40roles of preprocessor, lexical analyzer, parser, code generator, 41local optimizer, 42and first half of the assembler. 43The object files are binary forms of assembly 44language, 45similar to what might be passed between 46the first and second passes of an assembler. 47.LP 48Object files and libraries 49are combined by a loader 50program to produce the executable binary. 51The loader combines the roles of second half 52of the assembler, global optimizer, and loader. 53The names of the compliers, loaders, and assemblers 54are as follows: 55.DS 56.ta 1.5i 57.de Ta 58\\$1 \f(CW\\$2\fP \f(CW\\$3\fP \f(CW\\$4\fP 59.. 60.Ta SPARC kc kl ka 61.Ta PowerPC qc ql qa 62.Ta MIPS vc vl va 63.Ta MIPS\ little-endian 0c 0l 0a 64.Ta Motorola\ 68000 1c 1l 1a 65.Ta Motorola\ 68020 2c 2l 2a 66.Ta ARM 5c 5l 5a 67.Ta AMD64 6c 6l 6a 68.Ta DEC\ Alpha 7c 7l 7a 69.Ta Intel\ 386 8c 8l 8a 70.Ta PowerPC\ 64-bit 9c 9l 9a 71.DE 72There is a further breakdown 73in the source of the compilers into 74object-independent and 75object-dependent 76parts. 77All of the object-independent parts 78are combined into source files in the 79directory 80.CW /sys/src/cmd/cc . 81The object-dependent parts are collected 82in a separate directory for each compiler, 83for example 84.CW /sys/src/cmd/vc . 85All of the code, 86both object-independent and 87object-dependent, 88is machine-independent 89and may be cross-compiled and executed on any 90of the architectures. 91.NH 92The Language 93.LP 94The compiler implements ANSI C with some 95restrictions and extensions 96[ANSI90]. 97Most of the restrictions are due to 98personal preference, while 99most of the extensions were to help in 100the implementation of Plan 9. 101There are other departures from the standard, 102particularly in the libraries, 103that are beyond the scope of this 104paper. 105.NH 2 106Register, volatile, const 107.LP 108The keyword 109.CW register 110is recognized syntactically 111but is semantically ignored. 112Thus taking the address of a 113.CW register 114variable is not diagnosed. 115The keyword 116.CW volatile 117disables all optimizations, in particular registerization, of the corresponding variable. 118The keyword 119.CW const 120generates warnings (if warnings are enabled by the compiler's 121.CW -w 122option) of non-constant use of the variable, 123but does not affect the generated code. 124.NH 2 125The preprocessor 126.LP 127The C preprocessor is probably the 128biggest departure from the ANSI standard. 129.LP 130The preprocessor built into the Plan 9 compilers does not support 131.CW #if , 132although it does handle 133.CW #ifdef 134and 135.CW #include . 136If it is necessary to be more standard, 137the source text can first be run through the separate ANSI C 138preprocessor, 139.CW cpp . 140.NH 2 141Unnamed substructures 142.LP 143The most important and most heavily used of the 144extensions is the declaration of an 145unnamed substructure or subunion. 146For example: 147.DS 148.CW 149.ta .1i .6i 1.1i 1.6i 150 typedef 151 struct lock 152 { 153 int locked; 154 } Lock; 155 156 typedef 157 struct node 158 { 159 int type; 160 union 161 { 162 double dval; 163 float fval; 164 long lval; 165 }; 166 Lock; 167 } Node; 168 169 Lock* lock; 170 Node* node; 171.DE 172The declaration of 173.CW Node 174has an unnamed substructure of type 175.CW Lock 176and an unnamed subunion. 177One use of this feature allows references to elements of the 178subunit to be accessed as if they were in 179the outer structure. 180Thus 181.CW node->dval 182and 183.CW node->locked 184are legitimate references. 185.LP 186When an outer structure is used 187in a context that is only legal for 188an unnamed substructure, 189the compiler promotes the reference to the 190unnamed substructure. 191This is true for references to structures and 192to references to pointers to structures. 193This happens in assignment statements and 194in argument passing where prototypes have been 195declared. 196Thus, continuing with the example, 197.DS 198.CW 199.ta .1i .6i 1.1i 1.6i 200 lock = node; 201.DE 202would assign a pointer to the unnamed 203.CW Lock 204in 205the 206.CW Node 207to the variable 208.CW lock . 209Another example, 210.DS 211.CW 212.ta .1i .6i 1.1i 1.6i 213 extern void lock(Lock*); 214 func(...) 215 { 216 ... 217 lock(node); 218 ... 219 } 220.DE 221will pass a pointer to the 222.CW Lock 223substructure. 224.LP 225Finally, in places where context is insufficient to identify the unnamed structure, 226the type name (it must be a 227.CW typedef ) 228of the unnamed structure can be used as an identifier. 229In our example, 230.CW &node->Lock 231gives the address of the anonymous 232.CW Lock 233structure. 234.NH 2 235Structure displays 236.LP 237A structure cast followed by a list of expressions in braces is 238an expression with the type of the structure and elements assigned from 239the corresponding list. 240Structures are now almost first-class citizens of the language. 241It is common to see code like this: 242.DS 243.CW 244.ta .1i 245 r = (Rectangle){point1, (Point){x,y+2}}; 246.DE 247.NH 2 248Initialization indexes 249.LP 250In initializers of arrays, 251one may place a constant expression 252in square brackets before an initializer. 253This causes the next initializer to assign 254the indicated element. 255For example: 256.DS 257.CW 258.ta .1i .6i 1.6i 259 enum errors 260 { 261 Etoobig, 262 Ealarm, 263 Egreg 264 }; 265 char* errstrings[] = 266 { 267 [Ealarm] "Alarm call", 268 [Egreg] "Panic: out of mbufs", 269 [Etoobig] "Arg list too long", 270 }; 271.DE 272In the same way, 273individual structures members may 274be initialized in any order by preceding the initialization with 275.CW .tagname . 276Both forms allow an optional 277.CW = , 278to be compatible with a proposed 279extension to ANSI C. 280.NH 2 281External register 282.LP 283The declaration 284.CW extern 285.CW register 286will dedicate a register to 287a variable on a global basis. 288It can be used only under special circumstances. 289External register variables must be identically 290declared in all modules and 291libraries. 292The feature is not intended for efficiency, 293although it can produce efficient code; 294rather it represents a unique storage class that 295would be hard to get any other way. 296On a shared-memory multi-processor, 297an external register is 298one-per-processor and neither one-per-procedure (automatic) 299or one-per-system (external). 300It is used for two variables in the Plan 9 kernel, 301.CW u 302and 303.CW m . 304.CW U 305is a pointer to the structure representing the currently running process 306and 307.CW m 308is a pointer to the per-machine data structure. 309.NH 2 310Long long 311.LP 312The compilers accept 313.CW long 314.CW long 315as a basic type meaning 64-bit integer. 316On all of the machines 317this type is synthesized from 32-bit instructions. 318.NH 2 319Pragma 320.LP 321The compilers accept 322.CW #pragma 323.CW lib 324.I libname 325and pass the 326library name string uninterpreted 327to the loader. 328The loader uses the library name to 329find libraries to load. 330If the name contains 331.CW $O , 332it is replaced with 333the single character object type of the compiler 334(e.g., 335.CW v 336for the MIPS). 337If the name contains 338.CW $M , 339it is replaced with 340the architecture type for the compiler 341(e.g., 342.CW mips 343for the MIPS). 344If the name starts with 345.CW / 346it is an absolute pathname; 347if it starts with 348.CW . 349then it is searched for in the loader's current directory. 350Otherwise, the name is searched from 351.CW /$M/lib . 352Such 353.CW #pragma 354statements in header files guarantee that the correct 355libraries are always linked with a program without the 356need to specify them explicitly at link time. 357.LP 358They also accept 359.CW #pragma 360.CW packed 361.CW on 362(or 363.CW yes 364or 365.CW 1 ) 366to cause subsequently declared data, until 367.CW #pragma 368.CW packed 369.CW off 370(or 371.CW no 372or 373.CW 0 ), 374to be laid out in memory tightly packed in successive bytes, disregarding 375the usual alignment rules. 376Accessing such data can cause faults. 377.LP 378Similarly, 379.CW #pragma 380.CW profile 381.CW off 382(or 383.CW no 384or 385.CW 0 ) 386causes subsequently declared functions, until 387.CW #pragma 388.CW profile 389.CW on 390(or 391.CW yes 392or 393.CW 1 ), 394to be marked as unprofiled. 395Such functions will not be profiled when 396profiling is enabled for the rest of the program. 397.LP 398Two 399.CW #pragma 400statements allow type-checking of 401.CW print -like 402functions. 403The first, of the form 404.P1 405#pragma varargck argpos error 2 406.P2 407tells the compiler that the second argument to 408.CW error 409is a 410.CW print 411format string (see the manual page 412.I print (2)) 413that specifies how to format 414.CW error 's 415subsequent arguments. 416The second, of the form 417.P1 418#pragma varargck type "s" char* 419.P2 420says that the 421.CW print 422format verb 423.CW s 424processes an argument of 425type 426.CW char* . 427If the compiler's 428.CW -F 429option is enabled, the compiler will use this information 430to report type violations in the arguments to 431.CW print , 432.CW error , 433and similar routines. 434.NH 435Object module conventions 436.LP 437The overall conventions of the runtime environment 438are important 439to runtime efficiency. 440In this section, 441several of these conventions are discussed. 442.NH 2 443Register saving 444.LP 445In the Plan 9 compilers, 446the caller of a procedure saves the registers. 447With caller-saves, 448the leaf procedures can use all the 449registers and never save them. 450If you spend a lot of time at the leaves, 451this seems preferable. 452With callee-saves, 453the saving of the registers is done 454in the single point of entry and return. 455If you are interested in space, 456this seems preferable. 457In both, 458there is a degree of uncertainty 459about what registers need to be saved. 460Callee-saved registers make it difficult to 461find variables in registers in debuggers. 462Callee-saved registers also complicate 463the implementation of 464.CW longjmp . 465The convincing argument is 466that with caller-saves, 467the decision to registerize a variable 468can include the cost of saving the register 469across calls. 470For a further discussion of caller- vs. callee-saves, 471see the paper by Davidson and Whalley [Dav91]. 472.LP 473In the Plan 9 operating system, 474calls to the kernel look like normal procedure 475calls, which means 476the caller 477has saved the registers and the system 478entry does not have to. 479This makes system calls considerably faster. 480Since this is a potential security hole, 481and can lead to non-determinism, 482the system may eventually save the registers 483on entry, 484or more likely clear the registers on return. 485.NH 2 486Calling convention 487.LP 488Older C compilers maintain a frame pointer, which is at a known constant 489offset from the stack pointer within each function. 490For machines where the stack grows towards zero, 491the argument pointer is at a known constant offset 492from the frame pointer. 493Since the stack grows down in Plan 9, 494the Plan 9 compilers 495keep neither an 496explicit frame pointer nor 497an explicit argument pointer; 498instead they generate addresses relative to the stack pointer. 499.LP 500On some architectures, the first argument to a subroutine is passed in a register. 501.NH 2 502Functions returning structures 503.LP 504Structures longer than one word are awkward to implement 505since they do not fit in registers and must 506be passed around in memory. 507Functions that return structures 508are particularly clumsy. 509The Plan 9 compilers pass the return address of 510a structure as the first argument of a 511function that has a structure return value. 512Thus 513.DS 514.CW 515.ta .1i .6i 1.1i 1.6i 516 x = f(...) 517.DE 518is rewritten as 519.DS 520.CW 521.ta .1i .6i 1.1i 1.6i 522 f(&x, ...)\f1. 523.DE 524This saves a copy and makes the compilation 525much less clumsy. 526A disadvantage is that if you call this 527function without an assignment, 528a dummy location must be invented. 529.LP 530There is also a danger of calling a function 531that returns a structure without declaring 532it as such. 533With ANSI C function prototypes, 534this error need never occur. 535.NH 536Implementation 537.LP 538The compiler is divided internally into 539four machine-independent passes, 540four machine-dependent passes, 541and an output pass. 542The next nine sections describe each pass in order. 543.NH 2 544Parsing 545.LP 546The first pass is a YACC-based parser 547[Joh79]. 548Declarations are interpreted immediately, 549building a block structured symbol table. 550Executable statements are put into a parse tree 551and collected, 552without interpretation. 553At the end of each procedure, 554the parse tree for the function is 555examined by the other passes of the compiler. 556.LP 557The input stream of the parser is 558a pushdown list of input activations. 559The preprocessor 560expansions of 561macros 562and 563.CW #include 564are implemented as pushdowns. 565Thus there is no separate 566pass for preprocessing. 567.NH 2 568Typing 569.LP 570The next pass distributes typing information 571to every node of the tree. 572Implicit operations on the tree are added, 573such as type promotions and taking the 574address of arrays and functions. 575.NH 2 576Machine-independent optimization 577.LP 578The next pass performs optimizations 579and transformations of the tree, such as converting 580.CW &*x 581and 582.CW *&x 583into 584.CW x . 585Constant expressions are converted to constants in this pass. 586.NH 2 587Arithmetic rewrites 588.LP 589This is another machine-independent optimization. 590Subtrees of add, subtract, and multiply of integers are 591rewritten for easier compilation. 592The major transformation is factoring: 593.CW 4+8*a+16*b+5 594is transformed into 595.CW 9+8*(a+2*b) . 596Such expressions arise from address 597manipulation and array indexing. 598.NH 2 599Addressability 600.LP 601This is the first of the machine-dependent passes. 602The addressability of a processor is defined as the set of 603expressions that is legal in the address field 604of a machine language instruction. 605The addressability of different processors varies widely. 606At one end of the spectrum are the 68020 and VAX, 607which allow a complex mix of incrementing, 608decrementing, 609indexing, and relative addressing. 610At the other end is the MIPS, 611which allows only registers and constant offsets from the 612contents of a register. 613The addressability can be different for different instructions 614within the same processor. 615.LP 616It is important to the code generator to know when a 617subtree represents an address of a particular type. 618This is done with a bottom-up walk of the tree. 619In this pass, the leaves are labeled with small integers. 620When an internal node is encountered, 621it is labeled by consulting a table indexed by the 622labels on the left and right subtrees. 623For example, 624on the 68020 processor, 625it is possible to address an 626offset from a named location. 627In C, this is represented by the expression 628.CW *(&name+constant) . 629This is marked addressable by the following table. 630In the table, 631a node represented by the left column is marked 632with a small integer from the right column. 633Marks of the form 634.CW A\s-2\di\u\s0 635are addressable while 636marks of the form 637.CW N\s-2\di\u\s0 638are not addressable. 639.DS 640.B 641.ta .1i 1.1i 642 Node Marked 643.CW 644 name A\s-2\d1\u\s0 645 const A\s-2\d2\u\s0 646 &A\s-2\d1\u\s0 A\s-2\d3\u\s0 647 A\s-2\d3\u\s0+A\s-2\d1\u\s0 N\s-2\d1\u\s0 \fR(note that this is not addressable)\fP 648 *N\s-2\d1\u\s0 A\s-2\d4\u\s0 649.DE 650Here there is a distinction between 651a node marked 652.CW A\s-2\d1\u\s0 653and a node marked 654.CW A\s-2\d4\u\s0 655because the address operator of an 656.CW A\s-2\d4\u\s0 657node is not addressable. 658So to extend the table: 659.DS 660.B 661.ta .1i 1.1i 662 Node Marked 663.CW 664 &A\s-2\d4\u\s0 N\s-2\d2\u\s0 665 N\s-2\d2\u\s0+N\s-2\d1\u\s0 N\s-2\d1\u\s0 666.DE 667The full addressability of the 68020 is expressed 668in 18 rules like this, 669while the addressability of the MIPS is expressed 670in 11 rules. 671When one ports the compiler, 672this table is usually initialized 673so that leaves are labeled as addressable and nothing else. 674The code produced is poor, 675but porting is easy. 676The table can be extended later. 677.LP 678This pass also rewrites some complex operators 679into procedure calls. 680Examples include 64-bit multiply and divide. 681.LP 682In the same bottom-up pass of the tree, 683the nodes are labeled with a Sethi-Ullman complexity 684[Set70]. 685This number is roughly the number of registers required 686to compile the tree on an ideal machine. 687An addressable node is marked 0. 688A function call is marked infinite. 689A unary operator is marked as the 690maximum of 1 and the mark of its subtree. 691A binary operator with equal marks on its subtrees is 692marked with a subtree mark plus 1. 693A binary operator with unequal marks on its subtrees is 694marked with the maximum mark of its subtrees. 695The actual values of the marks are not too important, 696but the relative values are. 697The goal is to compile the harder 698(larger mark) 699subtree first. 700.NH 2 701Code generation 702.LP 703Code is generated by recursive 704descent. 705The Sethi-Ullman complexity completely guides the 706order. 707The addressability defines the leaves. 708The only difficult part is compiling a tree 709that has two infinite (function call) 710subtrees. 711In this case, 712one subtree is compiled into the return register 713(usually the most convenient place for a function call) 714and then stored on the stack. 715The other subtree is compiled into the return register 716and then the operation is compiled with 717operands from the stack and the return register. 718.LP 719There is a separate boolean code generator that compiles 720conditional expressions. 721This is fundamentally different from compiling an arithmetic expression. 722The result of the boolean code generator is the 723position of the program counter and not an expression. 724The boolean code generator makes extensive use of De Morgan's rule. 725The boolean code generator is an expanded version of that described 726in chapter 8 of Aho, Sethi, and Ullman 727[Aho87]. 728.LP 729There is a considerable amount of talk in the literature 730about automating this part of a compiler with a machine 731description. 732Since this code generator is so small 733(less than 500 lines of C) 734and easy, 735it hardly seems worth the effort. 736.NH 2 737Registerization 738.LP 739Up to now, 740the compiler has operated on syntax trees 741that are roughly equivalent to the original source language. 742The previous pass has produced machine language in an internal 743format. 744The next two passes operate on the internal machine language 745structures. 746The purpose of the next pass is to reintroduce 747registers for heavily used variables. 748.LP 749All of the variables that can be 750potentially registerized within a procedure are 751placed in a table. 752(Suitable variables are any automatic or external 753scalars that do not have their addresses extracted. 754Some constants that are hard to reference are also 755considered for registerization.) 756Four separate data flow equations are evaluated 757over the procedure on all of these variables. 758Two of the equations are the normal set-behind 759and used-ahead 760bits that define the life of a variable. 761The two new bits tell if a variable life 762crosses a function call ahead or behind. 763By examining a variable over its lifetime, 764it is possible to get a cost 765for registerizing. 766Loops are detected and the costs are multiplied 767by three for every level of loop nesting. 768Costs are sorted and the variables 769are replaced by available registers on a greedy basis. 770.LP 771The 68020 has two different 772types of registers. 773For the 68020, 774two different costs are calculated for 775each variable life and the register type that 776affords the better cost is used. 777Ties are broken by counting the number of available 778registers of each type. 779.LP 780Note that externals are registerized together with automatics. 781This is done by evaluating the semantics of a ``call'' instruction 782differently for externals and automatics. 783Since a call goes outside the local procedure, 784it is assumed that a call references all externals. 785Similarly, 786externals are assumed to be set before an ``entry'' instruction 787and assumed to be referenced after a ``return'' instruction. 788This makes sure that externals are in memory across calls. 789.LP 790The overall results are satisfactory. 791It would be nice to be able to do this processing in 792a machine-independent way, 793but it is impossible to get all of the costs and 794side effects of different choices by examining the parse tree. 795.LP 796Most of the code in the registerization pass is machine-independent. 797The major machine-dependency is in 798examining a machine instruction to ask if it sets or references 799a variable. 800.NH 2 801Machine code optimization 802.LP 803The next pass walks the machine code 804for opportunistic optimizations. 805For the most part, 806this is highly specific to a particular 807processor. 808One optimization that is performed 809on all of the processors is the 810removal of unnecessary ``move'' 811instructions. 812Ironically, 813most of these instructions were inserted by 814the previous pass. 815There are two patterns that are repetitively 816matched and replaced until no more matches are 817found. 818The first tries to remove ``move'' instructions 819by relabeling variables. 820.LP 821When a ``move'' instruction is encountered, 822if the destination variable is set before the 823source variable is referenced, 824then all of the references to the destination 825variable can be renamed to the source and the ``move'' 826can be deleted. 827This transformation uses the reverse data flow 828set up in the previous pass. 829.LP 830An example of this pattern is depicted in the following 831table. 832The pattern is in the left column and the 833replacement action is in the right column. 834.DS 835.CW 836.ta .1i .6i 1.6i 2.1i 2.6i 837 MOVE a->b \fR(remove)\fP 838.R 839 (sequence with no mention of \f(CWa\fP) 840.CW 841 USE b USE a 842.R 843 (sequence with no mention of \f(CWa\fP) 844.CW 845 SET b SET b 846.DE 847.LP 848Experiments have shown that it is marginally 849worthwhile to rename uses of the destination variable 850with uses of the source variable up to 851the first use of the source variable. 852.LP 853The second transform will do relabeling 854without deleting instructions. 855When a ``move'' instruction is encountered, 856if the source variable has been set prior 857to the use of the destination variable 858then all of the references to the source 859variable are replaced by the destination and 860the ``move'' is inverted. 861Typically, 862this transformation will alter two ``move'' 863instructions and allow the first transformation 864another chance to remove code. 865This transformation uses the forward data flow 866set up in the previous pass. 867.LP 868Again, 869the following is a depiction of the transformation where 870the pattern is in the left column and the 871rewrite is in the right column. 872.DS 873.CW 874.ta .1i .6i 1.6i 2.1i 2.6i 875 SET a SET b 876.R 877 (sequence with no use of \f(CWb\fP) 878.CW 879 USE a USE b 880.R 881 (sequence with no use of \f(CWb\fP) 882.CW 883 MOVE a->b MOVE b->a 884.DE 885Iterating these transformations 886will usually get rid of all redundant ``move'' instructions. 887.LP 888A problem with this organization is that the costs 889of registerization calculated in the previous pass 890must depend on how well this pass can detect and remove 891redundant instructions. 892Often, 893a fine candidate for registerization is rejected 894because of the cost of instructions that are later 895removed. 896.NH 2 897Writing the object file 898.LP 899The last pass walks the internal assembly language 900and writes the object file. 901The object file is reduced in size by about a factor 902of three with simple compression 903techniques. 904The most important aspect of the object file 905format is that it is independent of the compiling machine. 906All integer and floating numbers in the object 907code are converted to known formats and byte 908orders. 909.NH 910The loader 911.LP 912The loader is a multiple pass program that 913reads object files and libraries and produces 914an executable binary. 915The loader also does some minimal 916optimizations and code rewriting. 917Many of the operations performed by the 918loader are machine-dependent. 919.LP 920The first pass of the loader reads the 921object modules into an internal data 922structure that looks like binary assembly language. 923As the instructions are read, 924code is reordered to remove 925unconditional branch instructions. 926Conditional branch instructions are inverted 927to prevent the insertion of unconditional branches. 928The loader will also make a copy of a few instructions 929to remove an unconditional branch. 930.LP 931The next pass allocates addresses for 932all external data. 933Typical of processors is the MIPS, 934which can reference ±32K bytes from a 935register. 936The loader allocates the register 937.CW R30 938as the static pointer. 939The value placed in 940.CW R30 941is the base of the data segment plus 32K. 942It is then cheap to reference all data in the 943first 64K of the data segment. 944External variables are allocated to 945the data segment 946with the smallest variables allocated first. 947If all of the data cannot fit into the first 94864K of the data segment, 949then usually only a few large arrays 950need more expensive addressing modes. 951.LP 952For the MIPS processor, 953the loader makes a pass over the internal 954structures, 955exchanging instructions to try 956to fill ``delay slots'' with useful work. 957If a useful instruction cannot be found 958to fill a delay slot, 959the loader will insert 960``noop'' 961instructions. 962This pass is very expensive and does not 963do a good job. 964About 40% of all instructions are in 965delay slots. 966About 65% of these are useful instructions and 96735% are ``noops.'' 968The vendor-supplied assembler does this job 969more effectively, 970filling about 80% 971of the delay slots with useful instructions. 972.LP 973On the 68020 processor, 974branch instructions come in a variety of 975sizes depending on the relative distance 976of the branch. 977Thus the size of branch instructions 978can be mutually dependent. 979The loader uses a multiple pass algorithm 980to resolve the branch lengths 981[Szy78]. 982Initially, all branches are assumed minimal length. 983On each subsequent pass, 984the branches are reassessed 985and expanded if necessary. 986When no more expansions occur, 987the locations of the instructions in 988the text segment are known. 989.LP 990On the MIPS processor, 991all instructions are one size. 992A single pass over the instructions will 993determine the locations of all addresses 994in the text segment. 995.LP 996The last pass of the loader produces the 997executable binary. 998A symbol table and other tables are 999produced to help the debugger to 1000interpret the binary symbolically. 1001.LP 1002The loader places absolute source line numbers in the symbol table. 1003The name and absolute line number of all 1004.CW #include 1005files is also placed in the 1006symbol table so that the debuggers can 1007associate object code to source files. 1008.NH 1009Performance 1010.LP 1011The following is a table of the source size of the MIPS 1012compiler. 1013.DS 1014.ta .1i .6i 1015 lines module 1016 \0509 machine-independent headers 1017 1070 machine-independent YACC source 1018 6090 machine-independent C source 1019 1020 \0545 machine-dependent headers 1021 6532 machine-dependent C source 1022 1023 \0298 loader headers 1024 5215 loader C source 1025.DE 1026.LP 1027The following table shows timing 1028of a test program 1029that plays checkers, running on a MIPS R4000. 1030The test program is 26 files totaling 12600 lines of C. 1031The execution time does not significantly 1032depend on library implementation. 1033Since no other compiler runs on Plan 9, 1034the Plan 9 tests were done with the Plan 9 operating system; 1035the other tests were done on the vendor's operating system. 1036The hardware was identical in both cases. 1037The optimizer in the vendor's compiler 1038is reputed to be extremely good. 1039.DS 1040.ta .1i .9i 1041 \0\04.49s Plan 9 \f(CWvc\fP \f(CW-N\fP compile time (opposite of \f(CW-O\fP) 1042 \0\01.72s Plan 9 \f(CWvc\fP \f(CW-N\fP load time 1043 148.69s Plan 9 \f(CWvc\fP \f(CW-N\fP run time 1044 1045 \015.07s Plan 9 \f(CWvc\fP compile time (\f(CW-O\fP implicit) 1046 \0\01.66s Plan 9 \f(CWvc\fP load time 1047 \089.96s Plan 9 \f(CWvc\fP run time 1048 1049 \014.83s vendor \f(CWcc\fP compile time 1050 \0\00.38s vendor \f(CWcc\fP load time 1051 104.75s vendor \f(CWcc\fP run time 1052 1053 \043.59s vendor \f(CWcc\fP \f(CW-O\fP compile time 1054 \0\00.38s vendor \f(CWcc\fP \f(CW-O\fP load time 1055 \076.19s vendor \f(CWcc\fP \f(CW-O\fP run time 1056 1057 \0\08.19s vendor \f(CWcc\fP \f(CW-O3\fP compile time 1058 \035.97s vendor \f(CWcc\fP \f(CW-O3\fP load time 1059 \071.16s vendor \f(CWcc\fP \f(CW-O3\fP run time 1060.DE 1061.LP 1062To compare the Intel compiler, 1063a program that is about 40% bit manipulation and 1064about 60% single precision floating point was 1065run on the same 33 MHz 486, once under Windows 1066compiled with the Watcom compiler, version 10.0, 1067in 16-bit mode and once under 1068Plan 9 in 32-bit mode. 1069The Plan 9 execution time was 27 sec while the Windows 1070execution time was 31 sec. 1071.NH 1072Conclusions 1073.LP 1074The new compilers compile 1075quickly, 1076load slowly, 1077and produce 1078medium quality 1079object code. 1080The compilers are relatively 1081portable, 1082requiring but a couple of weeks' work to 1083produce a compiler for a different computer. 1084For Plan 9, 1085where we needed several compilers 1086with specialized features and 1087our own object formats, 1088this project was indispensable. 1089It is also necessary for us to 1090be able to freely distribute our compilers 1091with the Plan 9 distribution. 1092.LP 1093Two problems have come up in retrospect. 1094The first has to do with the 1095division of labor between compiler and loader. 1096Plan 9 runs on multi-processors and as such 1097compilations are often done in parallel. 1098Unfortunately, 1099all compilations must be complete before loading 1100can begin. 1101The load is then single-threaded. 1102With this model, 1103any shift of work from compile to load 1104results in a significant increase in real time. 1105The same is true of libraries that are compiled 1106infrequently and loaded often. 1107In the future, 1108we may try to put some of the loader work 1109back into the compiler. 1110.LP 1111The second problem comes from 1112the various optimizations performed over several 1113passes. 1114Often optimizations in different passes depend 1115on each other. 1116Iterating the passes could compromise efficiency, 1117or even loop. 1118We see no real solution to this problem. 1119.NH 1120References 1121.LP 1122[Aho87] A. V. Aho, R. Sethi, and J. D. Ullman, 1123.I 1124Compilers \- Principles, Techniques, and Tools, 1125.R 1126Addison Wesley, 1127Reading, MA, 11281987. 1129.LP 1130[ANSI90] \f2American National Standard for Information Systems \- 1131Programming Language C\f1, American National Standards Institute, Inc., 1132New York, 1990. 1133.LP 1134[Dav91] J. W. Davidson and D. B. Whalley, 1135``Methods for Saving and Restoring Register Values across Function Calls'', 1136.I 1137Software\-Practice and Experience, 1138.R 1139Vol 21(2), pp. 149-165, February 1991. 1140.LP 1141[Joh79] S. C. Johnson, 1142``YACC \- Yet Another Compiler Compiler'', 1143.I 1144UNIX Programmer's Manual, Seventh Ed., Vol. 2A, 1145.R 1146AT&T Bell Laboratories, 1147Murray Hill, NJ, 11481979. 1149.LP 1150[Set70] R. Sethi and J. D. Ullman, 1151``The Generation of Optimal Code for Arithmetic Expressions'', 1152.I 1153Journal of the ACM, 1154.R 1155Vol 17(4), pp. 715-728, 1970. 1156.LP 1157[Szy78] T. G. Szymanski, 1158``Assembling Code for Machines with Span-dependent Instructions'', 1159.I 1160Communications of the ACM, 1161.R 1162Vol 21(4), pp. 300-308, 1978. 1163