1.HTML "Plan 9 C Compilers 2.TL 3Plan 9 C Compilers 4.AU 5Ken Thompson 6ken@plan9.bell-labs.com 7.AB 8.FS 9Originally appeared, in a different form, in 10.I 11Proceedings of the Summer 1990 UKUUG Conference, 12.R 13pp. 41-51, 14London, 1990. 15.FE 16This paper describes the overall structure and function of the Plan 9 C compilers. 17A more detailed implementation document 18for any one of the compilers 19is yet to be written. 20.AE 21.NH 22Introduction 23.LP 24There are many compilers in the series. 25Six of the compilers (MIPS 3000, SPARC, Intel 386, Power PC, DEC Alpha, and Motorola 68020) 26are considered active and are used to compile 27current versions of Plan 9. 28Several others (Motorola 68000, Intel 960, ARM 7500, AMD 29000) have had only limited use, such as 29to program peripherals or experimental devices. 30.NH 31Structure 32.LP 33The compiler is a single program that produces an 34object file. 35Combined in the compiler are the traditional 36roles of preprocessor, lexical analyzer, parser, code generator, 37local optimizer, 38and first half of the assembler. 39The object files are binary forms of assembly 40language, 41similar to what might be passed between 42the first and second passes of an assembler. 43.LP 44Object files and libraries 45are combined by a loader 46program to produce the executable binary. 47The loader combines the roles of second half 48of the assembler, global optimizer, and loader. 49The names of the compliers, loaders, and assemblers 50are as follows: 51.DS 52.ta 1.5i 53.de Ta 54\\$1 \f(CW\\$2\fP \f(CW\\$3\fP \f(CW\\$4\fP 55.. 56.Ta SPARC kc kl ka 57.Ta Power\ PC qc ql qa 58.Ta MIPS vc vl va 59.Ta Motorola\ 68000 1c 1l 1a 60.Ta Motorola\ 68020 2c 2l 2a 61.Ta ARM\ 7500 5c 5l 5a 62.Ta Intel\ 960 6c 6l 6a 63.Ta DEC\ Alpha 7c 7l 7a 64.Ta Intel\ 386 8c 8l 8a 65.Ta AMD\ 29000 9c 9l 9a 66.DE 67There is a further breakdown 68in the source of the compilers into 69object-independent and 70object-dependent 71parts. 72All of the object-independent parts 73are combined into source files in the 74directory 75.CW /sys/src/cmd/cc . 76The object-dependent parts are collected 77in a separate directory for each compiler, 78for example 79.CW /sys/src/cmd/vc . 80All of the code, 81both object-independent and 82object-dependent, 83is machine-independent 84and may be cross-compiled and executed on any 85of the architectures. 86.NH 87The Language 88.LP 89The compiler implements ANSI C with some 90restrictions and extensions 91[ANSI90]. 92Most of the restrictions are due to 93personal preference, while 94most of the extensions were to help in 95the implementation of Plan 9. 96There are other departures from the standard, 97particularly in the libraries, 98that are beyond the scope of this 99paper. 100.NH 2 101Register, volatile, const 102.LP 103The keyword 104.CW register 105is recognized syntactically 106but is semantically ignored. 107Thus taking the address of a 108.CW register 109variable is not diagnosed. 110The keyword 111.CW volatile 112disables all optimizations, in particular registerization, of the corresponding variable. 113The keyword 114.CW const 115generates warnings (if warnings are enabled by the compiler's 116.CW -w 117option) of non-constant use of the variable, 118but does not affect the generated code. 119.NH 2 120The preprocessor 121.LP 122The C preprocessor is probably the 123biggest departure from the ANSI standard. 
.NH 2
The preprocessor
.LP
The C preprocessor is probably the
biggest departure from the ANSI standard.
.LP
The preprocessor built into the Plan 9 compilers does not support
.CW #if ,
although it does handle
.CW #ifdef
and
.CW #include .
If it is necessary to be more standard,
the source text can first be run through the separate ANSI C
preprocessor,
.CW cpp .
.NH 2
Unnamed substructures
.LP
The most important and most heavily used of the
extensions is the declaration of an
unnamed substructure or subunion.
For example:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	typedef
	struct lock
	{
		int	locked;
	} Lock;

	typedef
	struct node
	{
		int	type;
		union
		{
			double	dval;
			float	fval;
			long	lval;
		};
		Lock;
	} Node;

	Lock*	lock;
	Node*	node;
.DE
The declaration of
.CW Node
has an unnamed substructure of type
.CW Lock
and an unnamed subunion.
One use of this feature allows references to elements of the
subunit to be accessed as if they were in
the outer structure.
Thus
.CW node->dval
and
.CW node->locked
are legitimate references.
.LP
When an outer structure is used
in a context that is only legal for
an unnamed substructure,
the compiler promotes the reference to the
unnamed substructure.
This is true for references to structures and
for references to pointers to structures.
This happens in assignment statements and
in argument passing where prototypes have been
declared.
Thus, continuing with the example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	lock = node;
.DE
would assign a pointer to the unnamed
.CW Lock
in
the
.CW Node
to the variable
.CW lock .
Another example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	extern void lock(Lock*);
	func(...)
	{
		...
		lock(node);
		...
	}
.DE
will pass a pointer to the
.CW Lock
substructure.
.LP
Finally, in places where context is insufficient to identify the unnamed structure,
the type name (it must be a
.CW typedef )
of the unnamed structure can be used as an identifier.
In our example,
.CW &node->Lock
gives the address of the anonymous
.CW Lock
structure.
.NH 2
Structure displays
.LP
A structure cast followed by a list of expressions in braces is
an expression with the type of the structure and elements assigned from
the corresponding list.
Structures are now almost first-class citizens of the language.
It is common to see code like this:
.DS
.CW
.ta .1i
	r = (Rectangle){point1, (Point){x,y+2}};
.DE
.NH 2
Initialization indexes
.LP
In initializers of arrays,
one may place a constant expression
in square brackets before an initializer.
This causes the next initializer to assign
the indicated element.
For example:
.DS
.CW
.ta .1i .6i 1.6i
	enum	errors
	{
		Etoobig,
		Ealarm,
		Egreg
	};
	char*	errstrings[] =
	{
		[Ealarm]	"Alarm call",
		[Egreg]	"Panic: out of mbufs",
		[Etoobig]	"Arg list too long",
	};
.DE
In the same way,
individual structure members may
be initialized in any order by preceding the initialization with
.CW .tagname .
Both forms allow an optional
.CW = ,
to be compatible with a proposed
extension to ANSI C.
.NH 2
External register
.LP
The declaration
.CW extern
.CW register
will dedicate a register to
a variable on a global basis.
It can be used only under special circumstances.
External register variables must be identically
declared in all modules and
libraries.
The feature is not intended for efficiency,
although it can produce efficient code;
rather it represents a unique storage class that
would be hard to get any other way.
On a shared-memory multi-processor,
an external register is
one-per-processor and neither one-per-procedure (automatic)
nor one-per-system (external).
It is used for two variables in the Plan 9 kernel,
.CW u
and
.CW m .
.CW U
is a pointer to the structure representing the currently running process
and
.CW m
is a pointer to the per-machine data structure.
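.LP
As a sketch
(the structure tags here stand in for the kernel's real declarations),
every module and library linked into the kernel
would carry exactly these lines:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	extern register Proc*	u;	/* current process; one per processor */
	extern register Mach*	m;	/* per-machine data; one per processor */
.DE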
.NH 2
Long long
.LP
The compilers accept
.CW long
.CW long
as a basic type meaning 64-bit integer.
On all of the machines
this type is synthesized from 32-bit instructions.
.NH 2
Pragma
.LP
The compilers accept
.CW #pragma
.CW lib
.I libname
and pass the
library name string uninterpreted
to the loader.
The loader uses the library name to
find libraries to load.
If the name contains
.CW %O ,
it is replaced with
the single character object type of the compiler
(e.g.,
.CW v
for the MIPS).
If the name contains
.CW %M ,
it is replaced with
the architecture type for the compiler
(e.g.,
.CW mips
for the MIPS).
If the name starts with
.CW /
it is an absolute pathname;
if it starts with
.CW .
then it is searched for in the loader's current directory.
Otherwise, the name is searched from
.CW /%M/lib .
Such
.CW #pragma
statements in header files guarantee that the correct
libraries are always linked with a program without the
need to specify them explicitly at link time.
.LP
They also accept
.CW #pragma
.CW hjdicks
.CW on
(or
.CW yes
or
.CW 1 )
to cause subsequently declared data, until
.CW #pragma
.CW hjdicks
.CW off
(or
.CW no
or
.CW 0 ),
to be laid out in memory tightly packed in successive bytes, disregarding
the usual alignment rules.
Accessing such data can cause faults.
.LP
Similarly,
.CW #pragma
.CW profile
.CW off
(or
.CW no
or
.CW 0 )
causes subsequently declared functions, until
.CW #pragma
.CW profile
.CW on
(or
.CW yes
or
.CW 1 ),
to be marked as unprofiled.
Such functions will not be profiled when
profiling is enabled for the rest of the program.
.LP
Two
.CW #pragma
statements allow type-checking of
.CW print -like
functions.
The first, of the form
.P1
#pragma varargck argpos error 2
.P2
tells the compiler that the second argument to
.CW error
is a
.CW print
format string (see the manual page
.I print (2))
that specifies how to format
.CW error 's
subsequent arguments.
The second, of the form
.P1
#pragma varargck type "s" char*
.P2
says that the
.CW print
format verb
.CW s
processes an argument of
type
.CW char* .
If the compiler's
.CW -F
option is enabled, the compiler will use this information
to report type violations in the arguments to
.CW print ,
.CW error ,
and similar routines.
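.LP
A typical use pairs the pragma with the prototype in a header
(the
.CW error
function here is hypothetical):
.P1
void error(int level, char* fmt, ...);
#pragma varargck argpos error 2
.P2
With the pragmas above in effect and
.CW -F
enabled,
a call such as
\f(CWerror(1, "%s", 42)\fP
would be flagged,
since the
.CW s
verb was declared to take a
.CW char* .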
.NH
Object module conventions
.LP
The overall conventions of the runtime environment
are important
to runtime efficiency.
In this section,
several of these conventions are discussed.
.NH 2
Register saving
.LP
In the Plan 9 compilers,
the caller of a procedure saves the registers.
With caller-saves,
the leaf procedures can use all the
registers and never save them.
If you spend a lot of time at the leaves,
this seems preferable.
With callee-saves,
the saving of the registers is done
in the single point of entry and return.
If you are interested in space,
this seems preferable.
In both,
there is a degree of uncertainty
about what registers need to be saved.
Callee-saved registers make it difficult to
find variables in registers in debuggers.
Callee-saved registers also complicate
the implementation of
.CW longjmp .
The convincing argument is
that with caller-saves,
the decision to registerize a variable
can include the cost of saving the register
across calls.
For a further discussion of caller- vs. callee-saves,
see the paper by Davidson and Whalley [Dav91].
.LP
In the Plan 9 operating system,
calls to the kernel look like normal procedure
calls, which means
the caller
has saved the registers and the system
entry does not have to.
This makes system calls considerably faster.
Since this is a potential security hole,
and can lead to non-determinism,
the system may eventually save the registers
on entry,
or more likely clear the registers on return.
.NH 2
Calling convention
.LP
Older C compilers maintain a frame pointer, which is at a known constant
offset from the stack pointer within each function.
For machines where the stack grows towards zero,
the argument pointer is at a known constant offset
from the frame pointer.
Since the stack grows down in Plan 9,
the Plan 9 compilers
keep neither an
explicit frame pointer nor
an explicit argument pointer;
instead they generate addresses relative to the stack pointer.
.LP
On some architectures, the first argument to a subroutine is passed in a register.
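.LP
For illustration
(the offsets are invented and differ between architectures),
a function might fetch an argument and store into a local
with fixed offsets from the stack pointer:
.DS
.CW
.ta .1i 1.6i
	MOVW	40(SP), R1	\fR(an argument)\fP
	MOVW	R1, 8(SP)	\fR(a local)\fP
.DE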
.NH 2
Functions returning structures
.LP
Structures longer than one word are awkward to implement
since they do not fit in registers and must
be passed around in memory.
Functions that return structures
are particularly clumsy.
The Plan 9 compilers pass the return address of
a structure as the first argument of a
function that has a structure return value.
Thus
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	x = f(...)
.DE
is rewritten as
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	f(&x, ...)\f1.
.DE
This saves a copy and makes the compilation
much less clumsy.
A disadvantage is that if you call this
function without an assignment,
a dummy location must be invented.
.LP
There is also a danger of calling a function
that returns a structure without declaring
it as such.
With ANSI C function prototypes,
this error need never occur.
.NH
Implementation
.LP
The compiler is divided internally into
four machine-independent passes,
four machine-dependent passes,
and an output pass.
The next nine sections describe each pass in order.
.NH 2
Parsing
.LP
The first pass is a YACC-based parser
[Joh79].
Declarations are interpreted immediately,
building a block structured symbol table.
Executable statements are put into a parse tree
and collected,
without interpretation.
At the end of each procedure,
the parse tree for the function is
examined by the other passes of the compiler.
.LP
The input stream of the parser is
a pushdown list of input activations.
The preprocessor
expansions of
macros
and
.CW #include
are implemented as pushdowns.
Thus there is no separate
pass for preprocessing.
.NH 2
Typing
.LP
The next pass distributes typing information
to every node of the tree.
Implicit operations on the tree are added,
such as type promotions and taking the
address of arrays and functions.
.NH 2
Machine-independent optimization
.LP
The next pass performs optimizations
and transformations of the tree, such as converting
.CW &*x
and
.CW *&x
into
.CW x .
Constant expressions are converted to constants in this pass.
.NH 2
Arithmetic rewrites
.LP
This is another machine-independent optimization.
Subtrees of add, subtract, and multiply of integers are
rewritten for easier compilation.
The major transformation is factoring:
.CW 4+8*a+16*b+5
is transformed into
.CW 9+8*(a+2*b) .
Such expressions arise from address
manipulation and array indexing.
.NH 2
Addressability
.LP
This is the first of the machine-dependent passes.
The addressability of a processor is defined as the set of
expressions that is legal in the address field
of a machine language instruction.
The addressability of different processors varies widely.
At one end of the spectrum are the 68020 and VAX,
which allow a complex mix of incrementing,
decrementing,
indexing, and relative addressing.
At the other end is the MIPS,
which allows only registers and constant offsets from the
contents of a register.
The addressability can be different for different instructions
within the same processor.
.LP
It is important to the code generator to know when a
subtree represents an address of a particular type.
This is done with a bottom-up walk of the tree.
In this pass, the leaves are labeled with small integers.
When an internal node is encountered,
it is labeled by consulting a table indexed by the
labels on the left and right subtrees.
For example,
on the 68020 processor,
it is possible to address an
offset from a named location.
In C, this is represented by the expression
.CW *(&name+constant) .
This is marked addressable by the following table.
In the table,
a node represented by the left column is marked
with a small integer from the right column.
Marks of the form
.CW A\s-2\di\u\s0
are addressable while
marks of the form
.CW N\s-2\di\u\s0
are not addressable.
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	name	A\s-2\d1\u\s0
	const	A\s-2\d2\u\s0
	&A\s-2\d1\u\s0	A\s-2\d3\u\s0
	A\s-2\d3\u\s0+A\s-2\d1\u\s0	N\s-2\d1\u\s0 \fR(note that this is not addressable)\fP
	*N\s-2\d1\u\s0	A\s-2\d4\u\s0
.DE
Here there is a distinction between
a node marked
.CW A\s-2\d1\u\s0
and a node marked
.CW A\s-2\d4\u\s0
because the address operator of an
.CW A\s-2\d4\u\s0
node is not addressable.
So to extend the table:
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	&A\s-2\d4\u\s0	N\s-2\d2\u\s0
	N\s-2\d2\u\s0+N\s-2\d1\u\s0	N\s-2\d1\u\s0
.DE
The full addressability of the 68020 is expressed
in 18 rules like this,
while the addressability of the MIPS is expressed
in 11 rules.
When one ports the compiler,
this table is usually initialized
so that leaves are labeled as addressable and nothing else.
The code produced is poor,
but porting is easy.
The table can be extended later.
.LP
This pass also rewrites some complex operators
into procedure calls.
Examples include 64-bit multiply and divide.
.LP
In the same bottom-up pass of the tree,
the nodes are labeled with a Sethi-Ullman complexity
[Set70].
This number is roughly the number of registers required
to compile the tree on an ideal machine.
An addressable node is marked 0.
A function call is marked infinite.
A unary operator is marked as the
maximum of 1 and the mark of its subtree.
A binary operator with equal marks on its subtrees is
marked with a subtree mark plus 1.
A binary operator with unequal marks on its subtrees is
marked with the maximum mark of its subtrees.
The actual values of the marks are not too important,
but the relative values are.
The goal is to compile the harder
(larger mark)
subtree first.
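.LP
These rules can be sketched in a few lines of C
(the names
.CW sucomp ,
.CW addressable ,
and
.CW OFUNC
are invented for this description;
the compiler's own code differs in detail):
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	enum { SUINF = 1000 };	/* effectively infinite */

	int
	sucomp(Node* n)
	{
		int l, r;

		if(addressable(n))
			return 0;
		if(n->op == OFUNC)
			return SUINF;	/* function call */
		if(n->right == nil){	/* unary operator */
			l = sucomp(n->left);
			return l > 1? l: 1;
		}
		l = sucomp(n->left);
		r = sucomp(n->right);
		if(l == r)
			return l+1;
		return l > r? l: r;
	}
.DE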
.NH 2
Code generation
.LP
Code is generated by recursive
descent.
The Sethi-Ullman complexity completely guides the
order.
The addressability defines the leaves.
The only difficult part is compiling a tree
that has two infinite (function call)
subtrees.
In this case,
one subtree is compiled into the return register
(usually the most convenient place for a function call)
and then stored on the stack.
The other subtree is compiled into the return register
and then the operation is compiled with
operands from the stack and the return register.
.LP
There is a separate boolean code generator that compiles
conditional expressions.
This is fundamentally different from compiling an arithmetic expression.
The result of the boolean code generator is the
position of the program counter and not an expression.
The boolean code generator makes extensive use of De Morgan's rule.
The boolean code generator is an expanded version of that described
in chapter 8 of Aho, Sethi, and Ullman
[Aho87].
.LP
There is a considerable amount of talk in the literature
about automating this part of a compiler with a machine
description.
Since this code generator is so small
(less than 500 lines of C)
and easy,
it hardly seems worth the effort.
.NH 2
Registerization
.LP
Up to now,
the compiler has operated on syntax trees
that are roughly equivalent to the original source language.
The previous pass has produced machine language in an internal
format.
The next two passes operate on the internal machine language
structures.
The purpose of the next pass is to reintroduce
registers for heavily used variables.
.LP
All of the variables that can be
potentially registerized within a procedure are
placed in a table.
(Suitable variables are any automatic or external
scalars that do not have their addresses extracted.
Some constants that are hard to reference are also
considered for registerization.)
Four separate data flow equations are evaluated
over the procedure on all of these variables.
Two of the equations are the normal set-behind
and used-ahead
bits that define the life of a variable.
The two new bits tell if a variable life
crosses a function call ahead or behind.
By examining a variable over its lifetime,
it is possible to get a cost
for registerizing.
Loops are detected and the costs are multiplied
by three for every level of loop nesting.
Costs are sorted and the variables
are replaced by available registers on a greedy basis.
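.LP
A sketch of the cost calculation
(every name here is invented;
the real pass derives its numbers from the data flow bits described above):
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	long
	regcost(Var* v)
	{
		long c;
		Ref* r;

		c = 0;
		for(r = v->refs; r != nil; r = r->next)
			c += pow3(r->loopdepth);	/* triple per loop level */
		if(v->crossescall)
			c -= CALLSAVE;	/* charge for saving across calls */
		return c;
	}
.DE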
.LP
The 68020 has two different
types of registers.
For the 68020,
two different costs are calculated for
each variable life and the register type that
affords the better cost is used.
Ties are broken by counting the number of available
registers of each type.
.LP
Note that externals are registerized together with automatics.
This is done by evaluating the semantics of a ``call'' instruction
differently for externals and automatics.
Since a call goes outside the local procedure,
it is assumed that a call references all externals.
Similarly,
externals are assumed to be set before an ``entry'' instruction
and assumed to be referenced after a ``return'' instruction.
This makes sure that externals are in memory across calls.
.LP
The overall results are satisfactory.
It would be nice to be able to do this processing in
a machine-independent way,
but it is impossible to get all of the costs and
side effects of different choices by examining the parse tree.
.LP
Most of the code in the registerization pass is machine-independent.
The major machine-dependency is in
examining a machine instruction to ask if it sets or references
a variable.
.NH 2
Machine code optimization
.LP
The next pass walks the machine code
for opportunistic optimizations.
For the most part,
this is highly specific to a particular
processor.
One optimization that is performed
on all of the processors is the
removal of unnecessary ``move''
instructions.
Ironically,
most of these instructions were inserted by
the previous pass.
There are two patterns that are repetitively
matched and replaced until no more matches are
found.
The first tries to remove ``move'' instructions
by relabeling variables.
.LP
When a ``move'' instruction is encountered,
if the destination variable is set before the
source variable is referenced,
then all of the references to the destination
variable can be renamed to the source and the ``move''
can be deleted.
This transformation uses the reverse data flow
set up in the previous pass.
.LP
An example of this pattern is depicted in the following
table.
The pattern is in the left column and the
replacement action is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	MOVE a->b		\fR(remove)\fP
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	USE b		USE a
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	SET b		SET b
.DE
.LP
Experiments have shown that it is marginally
worthwhile to rename uses of the destination variable
with uses of the source variable up to
the first use of the source variable.
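.LP
As a concrete (and invented) instance,
the sequence on the left becomes the one on the right
and the ``move'' disappears:
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	MOVW R1,R2		\fR(removed)\fP
	ADDW $1,R2		ADDW $1,R1
	MOVW R2,cnt		MOVW R1,cnt
	MOVW lim,R2		MOVW lim,R2
.DE
The rewrite is legal because
.CW R2
is set again before
.CW R1
is referenced.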
.LP
The second transform will do relabeling
without deleting instructions.
When a ``move'' instruction is encountered,
if the source variable has been set prior
to the use of the destination variable
then all of the references to the source
variable are replaced by the destination and
the ``move'' is inverted.
Typically,
this transformation will alter two ``move''
instructions and allow the first transformation
another chance to remove code.
This transformation uses the forward data flow
set up in the previous pass.
.LP
Again,
the following is a depiction of the transformation where
the pattern is in the left column and the
rewrite is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	SET a		SET b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	USE a		USE b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	MOVE a->b		MOVE b->a
.DE
Iterating these transformations
will usually get rid of all redundant ``move'' instructions.
.LP
A problem with this organization is that the costs
of registerization calculated in the previous pass
must depend on how well this pass can detect and remove
redundant instructions.
Often,
a fine candidate for registerization is rejected
because of the cost of instructions that are later
removed.
.NH 2
Writing the object file
.LP
The last pass walks the internal assembly language
and writes the object file.
The object file is reduced in size by about a factor
of three with simple compression
techniques.
The most important aspect of the object file
format is that it is independent of the compiling machine.
All integer and floating numbers in the object
code are converted to known formats and byte
orders.
.NH
The loader
.LP
The loader is a multiple pass program that
reads object files and libraries and produces
an executable binary.
The loader also does some minimal
optimizations and code rewriting.
Many of the operations performed by the
loader are machine-dependent.
.LP
The first pass of the loader reads the
object modules into an internal data
structure that looks like binary assembly language.
As the instructions are read,
code is reordered to remove
unconditional branch instructions.
Conditional branch instructions are inverted
to prevent the insertion of unconditional branches.
The loader will also make a copy of a few instructions
to remove an unconditional branch.
.LP
The next pass allocates addresses for
all external data.
Typical of processors is the MIPS,
which can reference ±32K bytes from a
register.
The loader allocates the register
.CW R30
as the static pointer.
The value placed in
.CW R30
is the base of the data segment plus 32K.
It is then cheap to reference all data in the
first 64K of the data segment.
External variables are allocated to
the data segment
with the smallest variables allocated first.
If all of the data cannot fit into the first
64K of the data segment,
then usually only a few large arrays
need more expensive addressing modes.
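.LP
For example
(the variables and offsets are invented),
a variable near the base of the data segment costs one instruction,
while a large array past the first 64K must have its
address built first:
.DS
.CW
.ta .1i .6i 1.6i
	MOVW	-32000(R30), R1		\fR(near variable)\fP
	MOVW	$fararray, R2		\fR(build the far address)\fP
	MOVW	(R2), R1
.DE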
.LP
For the MIPS processor,
the loader makes a pass over the internal
structures,
exchanging instructions to try
to fill ``delay slots'' with useful work.
If a useful instruction cannot be found
to fill a delay slot,
the loader will insert
``noop''
instructions.
This pass is very expensive and does not
do a good job.
About 40% of all instructions are in
delay slots.
About 65% of these are useful instructions and
35% are ``noops.''
The vendor-supplied assembler does this job
more effectively,
filling about 80%
of the delay slots with useful instructions.
.LP
On the 68020 processor,
branch instructions come in a variety of
sizes depending on the relative distance
of the branch.
Thus the size of branch instructions
can be mutually dependent.
The loader uses a multiple pass algorithm
to resolve the branch lengths
[Szy78].
Initially, all branches are assumed minimal length.
On each subsequent pass,
the branches are reassessed
and expanded if necessary.
When no more expansions occur,
the locations of the instructions in
the text segment are known.
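.LP
In outline
(the names are invented),
this is a fixed point iteration over the instruction sizes:
.DS
.CW
.ta .1i .6i 1.1i 1.6i 2.1i
	do{
		again = 0;
		pc = 0;
		for(p = text; p != nil; p = p->next){
			p->pc = pc;	/* assign tentative addresses */
			pc += p->size;
		}
		for(p = text; p != nil; p = p->next)
			if(p->isbranch && !fits(p->to->pc - p->pc, p->size)){
				p->size = bigger(p->size);	/* widen the branch */
				again = 1;
			}
	}while(again);
.DE
Since instructions only grow,
the iteration must terminate.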
.LP
On the MIPS processor,
all instructions are one size.
A single pass over the instructions will
determine the locations of all addresses
in the text segment.
.LP
The last pass of the loader produces the
executable binary.
A symbol table and other tables are
produced to help the debugger to
interpret the binary symbolically.
.LP
The loader places absolute source line numbers in the symbol table.
The name and absolute line number of all
.CW #include
files is also placed in the
symbol table so that the debuggers can
associate object code to source files.
.NH
Performance
.LP
The following is a table of the source size of the MIPS
compiler.
.DS
.ta .1i .6i
	lines	module
	\0509	machine-independent headers
	1070	machine-independent YACC source
	6090	machine-independent C source

	\0545	machine-dependent headers
	6532	machine-dependent C source

	\0298	loader headers
	5215	loader C source
.DE
.LP
The following table shows timing
of a test program
that plays checkers, running on a MIPS R4000.
The test program is 26 files totaling 12600 lines of C.
The execution time does not significantly
depend on library implementation.
Since no other compiler runs on Plan 9,
the Plan 9 tests were done with the Plan 9 operating system;
the other tests were done on the vendor's operating system.
The hardware was identical in both cases.
The optimizer in the vendor's compiler
is reputed to be extremely good.
.DS
.ta .1i .9i
	\0\04.49s	Plan 9 \f(CWvc\fP \f(CW-N\fP compile time (opposite of \f(CW-O\fP)
	\0\01.72s	Plan 9 \f(CWvc\fP \f(CW-N\fP load time
	148.69s	Plan 9 \f(CWvc\fP \f(CW-N\fP run time

	\015.07s	Plan 9 \f(CWvc\fP compile time (\f(CW-O\fP implicit)
	\0\01.66s	Plan 9 \f(CWvc\fP load time
	\089.96s	Plan 9 \f(CWvc\fP run time

	\014.83s	vendor \f(CWcc\fP compile time
	\0\00.38s	vendor \f(CWcc\fP load time
	104.75s	vendor \f(CWcc\fP run time

	\043.59s	vendor \f(CWcc\fP \f(CW-O\fP compile time
	\0\00.38s	vendor \f(CWcc\fP \f(CW-O\fP load time
	\076.19s	vendor \f(CWcc\fP \f(CW-O\fP run time

	\0\08.19s	vendor \f(CWcc\fP \f(CW-O3\fP compile time
	\035.97s	vendor \f(CWcc\fP \f(CW-O3\fP load time
	\071.16s	vendor \f(CWcc\fP \f(CW-O3\fP run time
.DE
.LP
To compare the Intel compiler,
a program that is about 40% bit manipulation and
about 60% single precision floating point was
run on the same 33 MHz 486, once under Windows
compiled with the Watcom compiler, version 10.0,
in 16-bit mode and once under
Plan 9 in 32-bit mode.
The Plan 9 execution time was 27 sec while the Windows
execution time was 31 sec.
.NH
Conclusions
.LP
The new compilers compile
quickly,
load slowly,
and produce
medium quality
object code.
The compilers are relatively
portable,
requiring but a couple of weeks' work to
produce a compiler for a different computer.
For Plan 9,
where we needed several compilers
with specialized features and
our own object formats,
this project was indispensable.
It is also necessary for us to
be able to freely distribute our compilers
with the Plan 9 distribution.
.LP
Two problems have come up in retrospect.
The first has to do with the
division of labor between compiler and loader.
Plan 9 runs on multi-processors and as such
compilations are often done in parallel.
Unfortunately,
all compilations must be complete before loading
can begin.
The load is then single-threaded.
With this model,
any shift of work from compile to load
results in a significant increase in real time.
The same is true of libraries that are compiled
infrequently and loaded often.
In the future,
we may try to put some of the loader work
back into the compiler.
.LP
The second problem comes from
the various optimizations performed over several
passes.
Often optimizations in different passes depend
on each other.
Iterating the passes could compromise efficiency,
or even loop.
We see no real solution to this problem.
.NH
References
.LP
[Aho87] A. V. Aho, R. Sethi, and J. D. Ullman,
.I
Compilers \- Principles, Techniques, and Tools,
.R
Addison Wesley,
Reading, MA,
1987.
.LP
[ANSI90] \f2American National Standard for Information Systems \-
Programming Language C\f1, American National Standards Institute, Inc.,
New York, 1990.
.LP
[Dav91] J. W. Davidson and D. B. Whalley,
``Methods for Saving and Restoring Register Values across Function Calls'',
.I
Software\-Practice and Experience,
.R
Vol 21(2), pp. 149-165, February 1991.
.LP
[Joh79] S. C. Johnson,
``YACC \- Yet Another Compiler Compiler'',
.I
UNIX Programmer's Manual, Seventh Ed., Vol. 2A,
.R
AT&T Bell Laboratories,
Murray Hill, NJ,
1979.
.LP
[Set70] R. Sethi and J. D. Ullman,
``The Generation of Optimal Code for Arithmetic Expressions'',
.I
Journal of the ACM,
.R
Vol 17(4), pp. 715-728, 1970.
.LP
[Szy78] T. G. Szymanski,
``Assembling Code for Machines with Span-dependent Instructions'',
.I
Communications of the ACM,
.R
Vol 21(4), pp. 300-308, 1978.