.TL
Plan 9 C Compilers \(dg
.AU
.I "Ken Thompson"
.AI
ken@plan9.bell-labs.com
.AB
.FS
\l'1i'
.br
\(dg Originally appeared, in a different form, in
.I
Proceedings of the Summer 1990 UKUUG Conference,
.R
pp. 41-51,
London, 1990.
This version first appeared in
.I "Plan 9 Programmer's Manual, Volume 2 (Second Edition)" .
The Plan 9 compiler suite forms the basis for the portable Inferno compiler suite,
making this paper still relevant.
.FE
This paper describes the overall structure and function of the Plan 9 C compilers.
A more detailed implementation document
for any one of the compilers
is yet to be written.
.AE
.NH
Introduction
.LP
There are many compilers in the series.
Eight of the compilers (MIPS 3000, SPARC, Intel 386, AMD64, Power PC, ARM, DEC Alpha, and Motorola 68020)
are considered active and are used to compile
current versions of Plan 9 or Inferno.
Several others (Motorola 68000, Intel 960, AMD 29000) have had only limited use, such as
to program peripherals or experimental devices.
.NH
Structure
.LP
The compiler is a single program that produces an
object file.
Combined in the compiler are the traditional
roles of preprocessor, lexical analyzer, parser, code generator,
local optimizer,
and first half of the assembler.
The object files are binary forms of assembly
language,
similar to what might be passed between
the first and second passes of an assembler.
.LP
Object files and libraries
are combined by a loader
program to produce the executable binary.
The loader combines the roles of second half
of the assembler, global optimizer, and loader.
The names of the compilers, loaders, and assemblers
are as follows:
.DS
.ta 1.5i
.de Ta
\\$1	\f(CW\\$2\fP	\f(CW\\$3\fP	\f(CW\\$4\fP
..
.Ta SPARC kc kl ka
.Ta PowerPC qc ql qa
.Ta MIPS vc vl va
.Ta Motorola\ 68000 1c 1l 1a
.Ta Motorola\ 68020 2c 2l 2a
.Ta ARM 5c 5l 5a
.Ta AMD64 6c 6l 6a
.Ta DEC\ Alpha 7c 7l 7a
.Ta Intel\ 386 8c 8l 8a
.Ta AMD\ 29000 9c 9l 9a
.DE
There is a further breakdown
in the source of the compilers into
object-independent and
object-dependent
parts.
All of the object-independent parts
are combined into source files in the
directory
.CW /sys/src/cmd/cc .
The object-dependent parts are collected
in a separate directory for each compiler,
for example
.CW /sys/src/cmd/vc .
All of the code,
both object-independent and
object-dependent,
is machine-independent
and may be cross-compiled and executed on any
of the architectures.
.NH
The Language
.LP
The compiler implements ANSI C with some
restrictions and extensions
[ANSI90].
Most of the restrictions are due to
personal preference, while
most of the extensions were to help in
the implementation of Plan 9.
There are other departures from the standard,
particularly in the libraries,
that are beyond the scope of this
paper.
.NH 2
Register, volatile, const
.LP
The keyword
.CW register
is recognized syntactically
but is semantically ignored.
Thus taking the address of a
.CW register
variable is not diagnosed.
The keyword
.CW volatile
disables all optimizations, in particular registerization, of the corresponding variable.
The keyword
.CW const
generates warnings (if warnings are enabled by the compiler's
.CW -w
option) of non-constant use of the variable,
but does not affect the generated code.
.NH 2
The preprocessor
.LP
The C preprocessor is probably the
biggest departure from the ANSI standard.
.LP
The preprocessor built into the Plan 9 compilers does not support
.CW #if ,
although it does handle
.CW #ifdef
and
.CW #include .
If it is necessary to be more standard,
the source text can first be run through the separate ANSI C
preprocessor,
.CW cpp .
.NH 2
Unnamed substructures
.LP
The most important and most heavily used of the
extensions is the declaration of an
unnamed substructure or subunion.
For example:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	typedef
	struct lock
	{
		int locked;
	} Lock;

	typedef
	struct node
	{
		int type;
		union
		{
			double dval;
			float fval;
			long lval;
		};
		Lock;
	} Node;

	Lock* lock;
	Node* node;
.R
.DE
The declaration of
.CW Node
has an unnamed substructure of type
.CW Lock
and an unnamed subunion.
One use of this feature allows elements of the
subunit to be accessed as if they were in
the outer structure.
Thus
.CW node->dval
and
.CW node->locked
are legitimate references.
.LP
When an outer structure is used
in a context that is only legal for
an unnamed substructure,
the compiler promotes the reference to the
unnamed substructure.
This is true for references to structures and
for references to pointers to structures.
This happens in assignment statements and
in argument passing where prototypes have been
declared.
Thus, continuing with the example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	lock = node;
.R
.DE
would assign a pointer to the unnamed
.CW Lock
in
the
.CW Node
to the variable
.CW lock .
Another example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	extern void lock(Lock*);
	func(...)
	{
		...
		lock(node);
		...
	}
.R
.DE
will pass a pointer to the
.CW Lock
substructure.
.LP
Finally, in places where context is insufficient to identify the unnamed structure,
the type name (it must be a
.CW typedef )
of the unnamed structure can be used as an identifier.
In our example,
.CW &node->Lock
gives the address of the anonymous
.CW Lock
structure.
.NH 2
Structure displays
.LP
A structure cast followed by a list of expressions in braces is
an expression with the type of the structure and elements assigned from
the corresponding list.
Structures are now almost first-class citizens of the language.
It is common to see code like this:
.DS
.CW
.ta .1i
	r = (Rectangle){point1, (Point){x,y+2}};
.R
.DE
.NH 2
Initialization indexes
.LP
In initializers of arrays,
one may place a constant expression
in square brackets before an initializer.
This causes the next initializer to assign
the indicated element.
For example:
.DS
.CW
.ta .1i .6i 1.6i
	enum errors
	{
		Etoobig,
		Ealarm,
		Egreg
	};
	char* errstrings[] =
	{
		[Ealarm]	"Alarm call",
		[Egreg]	"Panic: out of mbufs",
		[Etoobig]	"Arg list too long",
	};
.R
.DE
In the same way,
individual structure members may
be initialized in any order by preceding the initialization with
.CW .tagname .
Both forms allow an optional
.CW = ,
to be compatible with a proposed
extension to ANSI C.
.NH 2
External register
.LP
The declaration
.CW extern
.CW register
will dedicate a register to
a variable on a global basis.
It can be used only under special circumstances.
External register variables must be identically
declared in all modules and
libraries.
The feature is not intended for efficiency,
although it can produce efficient code;
rather it represents a unique storage class that
would be hard to get any other way.
On a shared-memory multi-processor,
an external register is
one-per-processor and neither one-per-procedure (automatic)
nor one-per-system (external).
It is used for two variables in the Plan 9 kernel,
.CW u
and
.CW m .
The variable
.CW u
is a pointer to the structure representing the currently running process
and
.CW m
is a pointer to the per-machine data structure.
.NH 2
Long long
.LP
The compilers accept
.CW long
.CW long
as a basic type meaning 64-bit integer.
On all of the machines
this type is synthesized from 32-bit instructions.
.NH 2
Pragma
.LP
The compilers accept
.CW #pragma
.CW lib
.I libname
and pass the
library name string uninterpreted
to the loader.
The loader uses the library name to
find libraries to load.
If the name contains
.CW $O ,
it is replaced with
the single character object type of the compiler
(e.g.,
.CW v
for the MIPS).
If the name contains
.CW $M ,
it is replaced with
the architecture type for the compiler
(e.g.,
.CW mips
for the MIPS).
If the name starts with
.CW /
it is an absolute pathname;
if it starts with
.CW .
then it is searched for in the loader's current directory.
Otherwise, the name is searched from
.CW /$M/lib .
Such
.CW #pragma
statements in header files guarantee that the correct
libraries are always linked with a program without the
need to specify them explicitly at link time.
.LP
They also accept
.CW #pragma
.CW packed
.CW on
(or
.CW yes
or
.CW 1 )
to cause subsequently declared data, until
.CW #pragma
.CW packed
.CW off
(or
.CW no
or
.CW 0 ),
to be laid out in memory tightly packed in successive bytes, disregarding
the usual alignment rules.
Accessing such data can cause faults.
.LP
Similarly,
.CW #pragma
.CW profile
.CW off
(or
.CW no
or
.CW 0 )
causes subsequently declared functions, until
.CW #pragma
.CW profile
.CW on
(or
.CW yes
or
.CW 1 ),
to be marked as unprofiled.
Such functions will not be profiled when
profiling is enabled for the rest of the program.
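.LP
As an illustration, a hypothetical header fragment (the library name is chosen for illustration, after the convention used by the Plan 9 headers):

```c
/* With this line in a header, the loader links the library
   automatically for any program that includes the header.
   The name has no leading / or . here, so it is searched
   from /$M/lib, e.g. /mips/lib/libbio.a when compiled
   with vc for the MIPS. */
#pragma lib "libbio.a"
```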
.LP
Two
.CW #pragma
statements allow type-checking of
.CW print -like
functions.
The first, of the form
.P1
#pragma varargck argpos error 2
.P2
tells the compiler that the second argument to
.CW error
is a
.CW print
format string (see the manual page
.I print (2))
that specifies how to format
.CW error 's
subsequent arguments.
The second, of the form
.P1
#pragma varargck type "s" char*
.P2
says that the
.CW print
format verb
.CW s
processes an argument of
type
.CW char* .
If the compiler's
.CW -F
option is enabled, the compiler will use this information
to report type violations in the arguments to
.CW print ,
.CW error ,
and similar routines.
.NH
Object module conventions
.LP
The overall conventions of the runtime environment
are important
to runtime efficiency.
In this section,
several of these conventions are discussed.
.NH 2
Register saving
.LP
In the Plan 9 compilers,
the caller of a procedure saves the registers.
With caller-saves,
the leaf procedures can use all the
registers and never save them.
If you spend a lot of time at the leaves,
this seems preferable.
With callee-saves,
the saving of the registers is done
in the single point of entry and return.
If you are interested in space,
this seems preferable.
In both,
there is a degree of uncertainty
about what registers need to be saved.
Callee-saved registers make it difficult to
find variables in registers in debuggers.
Callee-saved registers also complicate
the implementation of
.CW longjmp .
The convincing argument is
that with caller-saves,
the decision to registerize a variable
can include the cost of saving the register
across calls.
For a further discussion of caller- vs. callee-saves,
see the paper by Davidson and Whalley [Dav91].
.LP
In the Plan 9 operating system,
calls to the kernel look like normal procedure
calls, which means
the caller
has saved the registers and the system
entry does not have to.
This makes system calls considerably faster.
Since this is a potential security hole,
and can lead to non-determinism,
the system may eventually save the registers
on entry,
or more likely clear the registers on return.
.NH 2
Calling convention
.LP
Older C compilers maintain a frame pointer, which is at a known constant
offset from the stack pointer within each function.
For machines where the stack grows towards zero,
the argument pointer is at a known constant offset
from the frame pointer.
Since the stack grows down in Plan 9,
the Plan 9 compilers
keep neither an
explicit frame pointer nor
an explicit argument pointer;
instead they generate addresses relative to the stack pointer.
.LP
On some architectures, the first argument to a subroutine is passed in a register.
.NH 2
Functions returning structures
.LP
Structures longer than one word are awkward to implement
since they do not fit in registers and must
be passed around in memory.
Functions that return structures
are particularly clumsy.
The Plan 9 compilers pass the address of
the returned structure as the first argument of a
function that has a structure return value.
Thus
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	x = f(...)
.R
.DE
is rewritten as
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	f(&x, ...)\f1.
.R
.DE
This saves a copy and makes the compilation
much less clumsy.
A disadvantage is that if you call this
function without an assignment,
a dummy location must be invented.
.LP
There is also a danger of calling a function
that returns a structure without declaring
it as such.
With ANSI C function prototypes,
this error need never occur.
.NH
Implementation
.LP
The compiler is divided internally into
four machine-independent passes,
four machine-dependent passes,
and an output pass.
The next nine sections describe each pass in order.
.NH 2
Parsing
.LP
The first pass is a YACC-based parser
[Joh79].
Declarations are interpreted immediately,
building a block structured symbol table.
Executable statements are put into a parse tree
and collected,
without interpretation.
At the end of each procedure,
the parse tree for the function is
examined by the other passes of the compiler.
.LP
The input stream of the parser is
a pushdown list of input activations.
The preprocessor
expansions of
macros
and
.CW #include
are implemented as pushdowns.
Thus there is no separate
pass for preprocessing.
.NH 2
Typing
.LP
The next pass distributes typing information
to every node of the tree.
Implicit operations on the tree are added,
such as type promotions and taking the
address of arrays and functions.
.NH 2
Machine-independent optimization
.LP
The next pass performs optimizations
and transformations of the tree, such as converting
.CW &*x
and
.CW *&x
into
.CW x .
Constant expressions are converted to constants in this pass.
.NH 2
Arithmetic rewrites
.LP
This is another machine-independent optimization.
Subtrees of add, subtract, and multiply of integers are
rewritten for easier compilation.
The major transformation is factoring:
.CW 4+8*a+16*b+5
is transformed into
.CW 9+8*(a+2*b) .
Such expressions arise from address
manipulation and array indexing.
.NH 2
Addressability
.LP
This is the first of the machine-dependent passes.
The addressability of a processor is defined as the set of
expressions that is legal in the address field
of a machine language instruction.
The addressability of different processors varies widely.
At one end of the spectrum are the 68020 and VAX,
which allow a complex mix of incrementing,
decrementing,
indexing, and relative addressing.
At the other end is the MIPS,
which allows only registers and constant offsets from the
contents of a register.
The addressability can be different for different instructions
within the same processor.
.LP
It is important to the code generator to know when a
subtree represents an address of a particular type.
This is done with a bottom-up walk of the tree.
In this pass, the leaves are labeled with small integers.
When an internal node is encountered,
it is labeled by consulting a table indexed by the
labels on the left and right subtrees.
For example,
on the 68020 processor,
it is possible to address an
offset from a named location.
In C, this is represented by the expression
.CW *(&name+constant) .
This is marked addressable by the following table.
In the table,
a node represented by the left column is marked
with a small integer from the right column.
Marks of the form
.CW A\s-2\di\u\s0
are addressable while
marks of the form
.CW N\s-2\di\u\s0
are not addressable.
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	name	A\s-2\d1\u\s0
	const	A\s-2\d2\u\s0
	&A\s-2\d1\u\s0	A\s-2\d3\u\s0
	A\s-2\d3\u\s0+A\s-2\d1\u\s0	N\s-2\d1\u\s0 \fR(note that this is not addressable)\fP
	*N\s-2\d1\u\s0	A\s-2\d4\u\s0
.R
.DE
Here there is a distinction between
a node marked
.CW A\s-2\d1\u\s0
and a node marked
.CW A\s-2\d4\u\s0
because the address operator of an
.CW A\s-2\d4\u\s0
node is not addressable.
So to extend the table:
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	&A\s-2\d4\u\s0	N\s-2\d2\u\s0
	N\s-2\d2\u\s0+N\s-2\d1\u\s0	N\s-2\d1\u\s0
.R
.DE
The full addressability of the 68020 is expressed
in 18 rules like this,
while the addressability of the MIPS is expressed
in 11 rules.
When one ports the compiler,
this table is usually initialized
so that leaves are labeled as addressable and nothing else.
The code produced is poor,
but porting is easy.
The table can be extended later.
.LP
This pass also rewrites some complex operators
into procedure calls.
Examples include 64-bit multiply and divide.
.LP
In the same bottom-up pass of the tree,
the nodes are labeled with a Sethi-Ullman complexity
[Set70].
This number is roughly the number of registers required
to compile the tree on an ideal machine.
An addressable node is marked 0.
A function call is marked infinite.
A unary operator is marked as the
maximum of 1 and the mark of its subtree.
A binary operator with equal marks on its subtrees is
marked with a subtree mark plus 1.
A binary operator with unequal marks on its subtrees is
marked with the maximum mark of its subtrees.
The actual values of the marks are not too important,
but the relative values are.
The goal is to compile the harder
(larger mark)
subtree first.
.NH 2
Code generation
.LP
Code is generated by recursive
descent.
The Sethi-Ullman complexity completely guides the
order.
The addressability defines the leaves.
The only difficult part is compiling a tree
that has two infinite (function call)
subtrees.
In this case,
one subtree is compiled into the return register
(usually the most convenient place for a function call)
and then stored on the stack.
The other subtree is compiled into the return register
and then the operation is compiled with
operands from the stack and the return register.
.LP
There is a separate boolean code generator that compiles
conditional expressions.
This is fundamentally different from compiling an arithmetic expression.
The result of the boolean code generator is the
position of the program counter and not an expression.
The boolean code generator makes extensive use of De Morgan's rule.
The boolean code generator is an expanded version of that described
in chapter 8 of Aho, Sethi, and Ullman
[Aho87].
.LP
There is a considerable amount of talk in the literature
about automating this part of a compiler with a machine
description.
Since this code generator is so small
(less than 500 lines of C)
and easy,
it hardly seems worth the effort.
.NH 2
Registerization
.LP
Up to now,
the compiler has operated on syntax trees
that are roughly equivalent to the original source language.
The previous pass has produced machine language in an internal
format.
The next two passes operate on the internal machine language
structures.
The purpose of the next pass is to reintroduce
registers for heavily used variables.
.LP
All of the variables that can be
potentially registerized within a procedure are
placed in a table.
(Suitable variables are any automatic or external
scalars that do not have their addresses extracted.
Some constants that are hard to reference are also
considered for registerization.)
Four separate data flow equations are evaluated
over the procedure on all of these variables.
Two of the equations are the normal set-behind
and used-ahead
bits that define the life of a variable.
The two new bits tell if a variable life
crosses a function call ahead or behind.
By examining a variable over its lifetime,
it is possible to get a cost
for registerizing.
Loops are detected and the costs are multiplied
by three for every level of loop nesting.
Costs are sorted and the variables
are replaced by available registers on a greedy basis.
.LP
The 68020 has two different
types of registers.
For the 68020,
two different costs are calculated for
each variable life and the register type that
affords the better cost is used.
Ties are broken by counting the number of available
registers of each type.
.LP
Note that externals are registerized together with automatics.
This is done by evaluating the semantics of a ``call'' instruction
differently for externals and automatics.
Since a call goes outside the local procedure,
it is assumed that a call references all externals.
Similarly,
externals are assumed to be set before an ``entry'' instruction
and assumed to be referenced after a ``return'' instruction.
This makes sure that externals are in memory across calls.
.LP
The overall results are satisfactory.
It would be nice to be able to do this processing in
a machine-independent way,
but it is impossible to get all of the costs and
side effects of different choices by examining the parse tree.
.LP
Most of the code in the registerization pass is machine-independent.
The major machine-dependency is in
examining a machine instruction to ask if it sets or references
a variable.
.NH 2
Machine code optimization
.LP
The next pass walks the machine code
for opportunistic optimizations.
For the most part,
this is highly specific to a particular
processor.
One optimization that is performed
on all of the processors is the
removal of unnecessary ``move''
instructions.
Ironically,
most of these instructions were inserted by
the previous pass.
There are two patterns that are repetitively
matched and replaced until no more matches are
found.
The first tries to remove ``move'' instructions
by relabeling variables.
.LP
When a ``move'' instruction is encountered,
if the destination variable is set before the
source variable is referenced,
then all of the references to the destination
variable can be renamed to the source and the ``move''
can be deleted.
This transformation uses the reverse data flow
set up in the previous pass.
.LP
An example of this pattern is depicted in the following
table.
The pattern is in the left column and the
replacement action is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	MOVE a->b		\fR(remove)\fP
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	USE b		USE a
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	SET b		SET b
.R
.DE
.LP
Experiments have shown that it is marginally
worthwhile to rename uses of the destination variable
with uses of the source variable up to
the first use of the source variable.
.LP
The second transform will do relabeling
without deleting instructions.
When a ``move'' instruction is encountered,
if the source variable has been set prior
to the use of the destination variable
then all of the references to the source
variable are replaced by the destination and
the ``move'' is inverted.
Typically,
this transformation will alter two ``move''
instructions and allow the first transformation
another chance to remove code.
This transformation uses the forward data flow
set up in the previous pass.
.LP
Again,
the following is a depiction of the transformation where
the pattern is in the left column and the
rewrite is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	SET a		SET b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	USE a		USE b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	MOVE a->b		MOVE b->a
.R
.DE
Iterating these transformations
will usually get rid of all redundant ``move'' instructions.
.LP
A problem with this organization is that the costs
of registerization calculated in the previous pass
must depend on how well this pass can detect and remove
redundant instructions.
Often,
a fine candidate for registerization is rejected
because of the cost of instructions that are later
removed.
.NH 2
Writing the object file
.LP
The last pass walks the internal assembly language
and writes the object file.
The object file is reduced in size by about a factor
of three with simple compression
techniques.
The most important aspect of the object file
format is that it is independent of the compiling machine.
All integer and floating numbers in the object
code are converted to known formats and byte
orders.
.NH
The loader
.LP
The loader is a multiple pass program that
reads object files and libraries and produces
an executable binary.
The loader also does some minimal
optimizations and code rewriting.
Many of the operations performed by the
loader are machine-dependent.
.LP
The first pass of the loader reads the
object modules into an internal data
structure that looks like binary assembly language.
As the instructions are read,
code is reordered to remove
unconditional branch instructions.
Conditional branch instructions are inverted
to prevent the insertion of unconditional branches.
The loader will also make a copy of a few instructions
to remove an unconditional branch.
.LP
The next pass allocates addresses for
all external data.
Typical of processors is the MIPS,
which can reference ±32K bytes from a
register.
The loader allocates the register
.CW R30
as the static pointer.
The value placed in
.CW R30
is the base of the data segment plus 32K.
It is then cheap to reference all data in the
first 64K of the data segment.
External variables are allocated to
the data segment
with the smallest variables allocated first.
If all of the data cannot fit into the first
64K of the data segment,
then usually only a few large arrays
need more expensive addressing modes.
.LP
For the MIPS processor,
the loader makes a pass over the internal
structures,
exchanging instructions to try
to fill ``delay slots'' with useful work.
If a useful instruction cannot be found
to fill a delay slot,
the loader will insert
``noop''
instructions.
This pass is very expensive and does not
do a good job.
About 40% of all instructions are in
delay slots.
About 65% of these are useful instructions and
35% are ``noops.''
The vendor-supplied assembler does this job
more effectively,
filling about 80%
of the delay slots with useful instructions.
.LP
On the 68020 processor,
branch instructions come in a variety of
sizes depending on the relative distance
of the branch.
Thus the size of branch instructions
can be mutually dependent.
The loader uses a multiple pass algorithm
to resolve the branch lengths
[Szy78].
Initially, all branches are assumed minimal length.
On each subsequent pass,
the branches are reassessed
and expanded if necessary.
When no more expansions occur,
the locations of the instructions in
the text segment are known.
.LP
On the MIPS processor,
all instructions are one size.
A single pass over the instructions will
determine the locations of all addresses
in the text segment.
.LP
The last pass of the loader produces the
executable binary.
A symbol table and other tables are
produced to help the debugger to
interpret the binary symbolically.
.LP
The loader places absolute source line numbers in the symbol table.
The name and absolute line number of all
.CW #include
files are also placed in the
symbol table so that the debuggers can
associate object code to source files.
.NH
Performance
.LP
The following is a table of the source size of the MIPS
compiler.
.DS
.ta .1i .6i
	lines	module
	\0509	machine-independent headers
	1070	machine-independent YACC source
	6090	machine-independent C source

	\0545	machine-dependent headers
	6532	machine-dependent C source

	\0298	loader headers
	5215	loader C source
.DE
.LP
The following table shows timing
of a test program
that plays checkers, running on a MIPS R4000.
The test program is 26 files totaling 12600 lines of C.
The execution time does not significantly
depend on library implementation.
Since no other compiler runs on Plan 9,
the Plan 9 tests were done with the Plan 9 operating system;
the other tests were done on the vendor's operating system.
The hardware was identical in both cases.
The optimizer in the vendor's compiler
is reputed to be extremely good.
.DS
.ta .1i .9i
	\0\04.49s	Plan 9 \f(CWvc\fP \f(CW-N\fP compile time (opposite of \f(CW-O\fP)
	\0\01.72s	Plan 9 \f(CWvc\fP \f(CW-N\fP load time
	148.69s	Plan 9 \f(CWvc\fP \f(CW-N\fP run time

	\015.07s	Plan 9 \f(CWvc\fP compile time (\f(CW-O\fP implicit)
	\0\01.66s	Plan 9 \f(CWvc\fP load time
	\089.96s	Plan 9 \f(CWvc\fP run time

	\014.83s	vendor \f(CWcc\fP compile time
	\0\00.38s	vendor \f(CWcc\fP load time
	104.75s	vendor \f(CWcc\fP run time

	\043.59s	vendor \f(CWcc\fP \f(CW-O\fP compile time
	\0\00.38s	vendor \f(CWcc\fP \f(CW-O\fP load time
	\076.19s	vendor \f(CWcc\fP \f(CW-O\fP run time

	\0\08.19s	vendor \f(CWcc\fP \f(CW-O3\fP compile time
	\035.97s	vendor \f(CWcc\fP \f(CW-O3\fP load time
	\071.16s	vendor \f(CWcc\fP \f(CW-O3\fP run time
.DE
.LP
To compare the Intel compiler,
a program that is about 40% bit manipulation and
about 60% single precision floating point was
run on the same 33 MHz 486, once under Windows
compiled with the Watcom compiler, version 10.0,
in 16-bit mode and once under
Plan 9 in 32-bit mode.
The Plan 9 execution time was 27 sec while the Windows
execution time was 31 sec.
.NH
Conclusions
.LP
The new compilers compile
quickly,
load slowly,
and produce
medium quality
object code.
The compilers are relatively
portable,
requiring but a couple of weeks' work to
produce a compiler for a different computer.
For Plan 9,
where we needed several compilers
with specialized features and
our own object formats,
this project was indispensable.
It is also necessary for us to
be able to freely distribute our compilers
with the Plan 9 distribution.
.LP
Two problems have come up in retrospect.
The first has to do with the
division of labor between compiler and loader.
Plan 9 runs on multi-processors and as such
compilations are often done in parallel.
Unfortunately,
all compilations must be complete before loading
can begin.
The load is then single-threaded.
With this model,
any shift of work from compile to load
results in a significant increase in real time.
The same is true of libraries that are compiled
infrequently and loaded often.
In the future,
we may try to put some of the loader work
back into the compiler.
.LP
The second problem comes from
the various optimizations performed over several
passes.
Often optimizations in different passes depend
on each other.
Iterating the passes could compromise efficiency,
or even loop.
We see no real solution to this problem.
.NH
References
.LP
[Aho87] A. V. Aho, R. Sethi, and J. D. Ullman,
.I
Compilers \- Principles, Techniques, and Tools,
.R
Addison Wesley,
Reading, MA,
1987.
.LP
[ANSI90] \f2American National Standard for Information Systems \-
Programming Language C\f1, American National Standards Institute, Inc.,
New York, 1990.
.LP
[Dav91] J. W. Davidson and D. B. Whalley,
``Methods for Saving and Restoring Register Values across Function Calls'',
.I
Software\-Practice and Experience,
.R
Vol 21(2), pp. 149-165, February 1991.
.LP
[Joh79] S. C. Johnson,
``YACC \- Yet Another Compiler Compiler'',
.I
UNIX Programmer's Manual, Seventh Ed., Vol. 2A,
.R
AT&T Bell Laboratories,
Murray Hill, NJ,
1979.
.LP
[Set70] R. Sethi and J. D. Ullman,
``The Generation of Optimal Code for Arithmetic Expressions'',
.I
Journal of the ACM,
.R
Vol 17(4), pp. 715-728, 1970.
.LP
[Szy78] T. G. Szymanski,
``Assembling Code for Machines with Span-dependent Instructions'',
.I
Communications of the ACM,
.R
Vol 21(4), pp. 300-308, 1978.