.TL
Plan 9 C Compilers \(dg
.AU
.I "Ken Thompson"
.AI
ken@plan9.bell-labs.com
.AB
.FS
\l'1i'
.br
\(dg Originally appeared, in a different form, in
.I
Proceedings of the Summer 1990 UKUUG Conference,
.R
pp. 41-51,
London, 1990.
This version first appeared in
.I "Plan 9 Programmer's Manual, Volume 2 (Second Edition)" .
The Plan 9 compiler suite forms the basis for the portable Inferno compiler suite,
making this paper still relevant.
.FE
This paper describes the overall structure and function of the Plan 9 C compilers.
A more detailed implementation document
for any one of the compilers
is yet to be written.
.AE
.NH
Introduction
.LP
There are many compilers in the series.
Eight of the compilers (MIPS 3000, SPARC, Intel 386, AMD64, Power PC, ARM, DEC Alpha, and Motorola 68020)
are considered active and are used to compile
current versions of Plan 9 or Inferno.
Several others (Motorola 68000, Intel 960, AMD 29000) have had only limited use, such as
to program peripherals or experimental devices.
.NH
Structure
.LP
The compiler is a single program that produces an
object file.
Combined in the compiler are the traditional
roles of preprocessor, lexical analyzer, parser, code generator,
local optimizer,
and first half of the assembler.
The object files are binary forms of assembly
language,
similar to what might be passed between
the first and second passes of an assembler.
.LP
Object files and libraries
are combined by a loader
program to produce the executable binary.
The loader combines the roles of second half
of the assembler, global optimizer, and loader.
The names of the compilers, loaders, and assemblers
are as follows:
.DS
.ta 1.5i
.de Ta
\\$1	\f(CW\\$2\fP  \f(CW\\$3\fP  \f(CW\\$4\fP
..
.Ta SPARC kc kl ka
.Ta PowerPC qc ql qa
.Ta MIPS vc vl va
.Ta Motorola\ 68000 1c 1l 1a
.Ta Motorola\ 68020 2c 2l 2a
.Ta ARM 5c 5l 5a
.Ta AMD64 6c 6l 6a
.Ta DEC\ Alpha 7c 7l 7a
.Ta Intel\ 386 8c 8l 8a
.Ta AMD\ 29000 9c 9l 9a
.DE
There is a further breakdown
in the source of the compilers into
object-independent and
object-dependent
parts.
All of the object-independent parts
are combined into source files in the
directory
.CW /sys/src/cmd/cc .
The object-dependent parts are collected
in a separate directory for each compiler,
for example
.CW /sys/src/cmd/vc .
All of the code,
both object-independent and
object-dependent,
is machine-independent
and may be cross-compiled and executed on any
of the architectures.
.NH
The Language
.LP
The compiler implements ANSI C with some
restrictions and extensions
[ANSI90].
Most of the restrictions are due to
personal preference, while
most of the extensions were to help in
the implementation of Plan 9.
There are other departures from the standard,
particularly in the libraries,
that are beyond the scope of this
paper.
.NH 2
Register, volatile, const
.LP
The keyword
.CW register
is recognized syntactically
but is semantically ignored.
Thus taking the address of a
.CW register
variable is not diagnosed.
The keyword
.CW volatile
disables all optimizations, in particular registerization, of the corresponding variable.
The keyword
.CW const
generates warnings (if warnings are enabled by the compiler's
.CW -w
option) of non-constant use of the variable,
but does not affect the generated code.
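.LP
For example, a flag set by an interrupt handler and polled by the main
program (a hypothetical declaration) should be declared
.CW volatile
so that every test rereads memory:
.DS
.CW
.ta .1i
	volatile int	done;	/* never cached in a register */
.R
.DE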
.NH 2
The preprocessor
.LP
The C preprocessor is probably the
biggest departure from the ANSI standard.
.LP
The preprocessor built into the Plan 9 compilers does not support
.CW #if ,
although it does handle
.CW #ifdef
and
.CW #include .
If it is necessary to be more standard,
the source text can first be run through the separate ANSI C
preprocessor,
.CW cpp .
.NH 2
Unnamed substructures
.LP
The most important and most heavily used of the
extensions is the declaration of an
unnamed substructure or subunion.
For example:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	typedef
	struct	lock
	{
		int    locked;
	} Lock;

	typedef
	struct	node
	{
		int	type;
		union
		{
			double dval;
			float  fval;
			long   lval;
		};
		Lock;
	} Node;

	Lock*	lock;
	Node*	node;
.R
.DE
The declaration of
.CW Node
has an unnamed substructure of type
.CW Lock
and an unnamed subunion.
One use of this feature allows references to elements of the
subunit to be accessed as if they were in
the outer structure.
Thus
.CW node->dval
and
.CW node->locked
are legitimate references.
.LP
When an outer structure is used
in a context that is only legal for
an unnamed substructure,
the compiler promotes the reference to the
unnamed substructure.
This is true for references to structures and
for references to pointers to structures.
This happens in assignment statements and
in argument passing where prototypes have been
declared.
Thus, continuing with the example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	lock = node;
.R
.DE
would assign a pointer to the unnamed
.CW Lock
in
the
.CW Node
to the variable
.CW lock .
Another example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	extern void lock(Lock*);
	func(...)
	{
		...
		lock(node);
		...
	}
.R
.DE
will pass a pointer to the
.CW Lock
substructure.
.LP
Finally, in places where context is insufficient to identify the unnamed structure,
the type name (it must be a
.CW typedef )
of the unnamed structure can be used as an identifier.
In our example,
.CW &node->Lock
gives the address of the anonymous
.CW Lock
structure.
.NH 2
Structure displays
.LP
A structure cast followed by a list of expressions in braces is
an expression with the type of the structure and elements assigned from
the corresponding list.
Structures are now almost first-class citizens of the language.
It is common to see code like this:
.DS
.CW
.ta .1i
	r = (Rectangle){point1, (Point){x,y+2}};
.R
.DE
.NH 2
Initialization indexes
.LP
In initializers of arrays,
one may place a constant expression
in square brackets before an initializer.
This causes the next initializer to assign
the indicated element.
For example:
.DS
.CW
.ta .1i .6i 1.6i
	enum	errors
	{
		Etoobig,
		Ealarm,
		Egreg
	};
	char* errstrings[] =
	{
		[Ealarm]	"Alarm call",
		[Egreg]	"Panic: out of mbufs",
		[Etoobig]	"Arg list too long",
	};
.R
.DE
In the same way,
individual structure members may
be initialized in any order by preceding the initialization with
.CW .tagname .
Both forms allow an optional
.CW = ,
to be compatible with a proposed
extension to ANSI C.
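.LP
For example, given a hypothetical structure
.CW Point
with members
.CW x
and
.CW y ,
the members may be named in the initializer in any order:
.DS
.CW
.ta .1i
	Point p = { .y 2, .x 1, };
.R
.DE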
.NH 2
External register
.LP
The declaration
.CW extern
.CW register
will dedicate a register to
a variable on a global basis.
It can be used only under special circumstances.
External register variables must be identically
declared in all modules and
libraries.
The feature is not intended for efficiency,
although it can produce efficient code;
rather it represents a unique storage class that
would be hard to get any other way.
On a shared-memory multi-processor,
an external register is
one-per-processor and neither one-per-procedure (automatic)
nor one-per-system (external).
It is used for two variables in the Plan 9 kernel,
.CW u
and
.CW m :
.CW u
is a pointer to the structure representing the currently running process
and
.CW m
is a pointer to the per-machine data structure.
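.LP
A minimal sketch of such a declaration, as it might appear in a header
included by every kernel module (the structure name is illustrative):
.DS
.CW
.ta .1i
	extern register Mach* m;	/* per-processor machine data */
.R
.DE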
.NH 2
Long long
.LP
The compilers accept
.CW long
.CW long
as a basic type meaning 64-bit integer.
On all of the machines
this type is synthesized from 32-bit instructions.
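.LP
Plan 9 source conventionally refers to this type through a typedef:
.DS
.CW
.ta .1i
	typedef long long vlong;
.R
.DE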
.NH 2
Pragma
.LP
The compilers accept
.CW #pragma
.CW lib
.I libname
and pass the
library name string uninterpreted
to the loader.
The loader uses the library name to
find libraries to load.
If the name contains
.CW $O ,
it is replaced with
the single character object type of the compiler
(e.g.,
.CW v
for the MIPS).
If the name contains
.CW $M ,
it is replaced with
the architecture type for the compiler
(e.g.,
.CW mips
for the MIPS).
If the name starts with
.CW /
it is an absolute pathname;
if it starts with
.CW .
then it is searched for in the loader's current directory.
Otherwise, the name is searched from
.CW /$M/lib .
Such
.CW #pragma
statements in header files guarantee that the correct
libraries are always linked with a program without the
need to specify them explicitly at link time.
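.LP
For example, the header of a hypothetical library
.CW libfoo
might contain
.P1
#pragma lib "libfoo.a"
.P2
so that any program including the header is automatically linked with
.CW /$M/lib/libfoo.a .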
.LP
They also accept
.CW #pragma
.CW packed
.CW on
(or
.CW yes
or
.CW 1 )
to cause subsequently declared data, until
.CW #pragma
.CW packed
.CW off
(or
.CW no
or
.CW 0 ),
to be laid out in memory tightly packed in successive bytes, disregarding
the usual alignment rules.
Accessing such data can cause faults.
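.LP
A sketch of a typical use, describing an externally imposed layout
(the structure is hypothetical):
.P1
#pragma packed on
typedef struct Hdr
{
	uchar	type;
	ulong	length;	/* byte-aligned; no padding before it */
} Hdr;
#pragma packed off
.P2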
.LP
Similarly,
.CW #pragma
.CW profile
.CW off
(or
.CW no
or
.CW 0 )
causes subsequently declared functions, until
.CW #pragma
.CW profile
.CW on
(or
.CW yes
or
.CW 1 ),
to be marked as unprofiled.
Such functions will not be profiled when
profiling is enabled for the rest of the program.
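.LP
For example, a routine too time-critical to carry the profiling preamble
(a hypothetical interrupt handler) could be bracketed as follows:
.P1
#pragma profile off
void
clockintr(void)
{
	/* ... */
}
#pragma profile on
.P2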
.LP
Two
.CW #pragma
statements allow type-checking of
.CW print -like
functions.
The first, of the form
.P1
#pragma varargck argpos error 2
.P2
tells the compiler that the second argument to
.CW error
is a
.CW print
format string (see the manual page
.I print (2))
that specifies how to format
.CW error 's
subsequent arguments.
The second, of the form
.P1
#pragma varargck type "s" char*
.P2
says that the
.CW print
format verb
.CW s
processes an argument of
type
.CW char* .
If the compiler's
.CW -F
option is enabled, the compiler will use this information
to report type violations in the arguments to
.CW print ,
.CW error ,
and similar routines.
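.LP
With
.CW -F
enabled, a mismatched call such as
.P1
print("%s", 42);
.P2
draws a compile-time diagnostic, because the
.CW s
verb is declared to take a
.CW char*
and the argument is an
.CW int .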
.NH
Object module conventions
.LP
The overall conventions of the runtime environment
are important
to runtime efficiency.
In this section,
several of these conventions are discussed.
.NH 2
Register saving
.LP
In the Plan 9 compilers,
the caller of a procedure saves the registers.
With caller-saves,
the leaf procedures can use all the
registers and never save them.
If you spend a lot of time at the leaves,
this seems preferable.
With callee-saves,
the saving of the registers is done
in the single point of entry and return.
If you are interested in space,
this seems preferable.
In both,
there is a degree of uncertainty
about what registers need to be saved.
Callee-saved registers make it difficult to
find variables in registers in debuggers.
Callee-saved registers also complicate
the implementation of
.CW longjmp .
The convincing argument is
that with caller-saves,
the decision to registerize a variable
can include the cost of saving the register
across calls.
For a further discussion of caller- vs. callee-saves,
see the paper by Davidson and Whalley [Dav91].
.LP
In the Plan 9 operating system,
calls to the kernel look like normal procedure
calls, which means
the caller
has saved the registers and the system
entry does not have to.
This makes system calls considerably faster.
Since this is a potential security hole,
and can lead to non-determinism,
the system may eventually save the registers
on entry,
or more likely clear the registers on return.
.NH 2
Calling convention
.LP
Older C compilers maintain a frame pointer, which is at a known constant
offset from the stack pointer within each function.
For machines where the stack grows towards zero,
the argument pointer is at a known constant offset
from the frame pointer.
Since the stack grows down in Plan 9,
the Plan 9 compilers
keep neither an
explicit frame pointer nor
an explicit argument pointer;
instead they generate addresses relative to the stack pointer.
.LP
On some architectures, the first argument to a subroutine is passed in a register.
.NH 2
Functions returning structures
.LP
Structures longer than one word are awkward to implement
since they do not fit in registers and must
be passed around in memory.
Functions that return structures
are particularly clumsy.
The Plan 9 compilers pass the address of
the returned structure as the first argument of a
function that has a structure return value.
Thus
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	x = f(...)
.R
.DE
is rewritten as
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	f(&x, ...)\f1.
.R
.DE
This saves a copy and makes the compilation
much less clumsy.
A disadvantage is that if you call this
function without an assignment,
a dummy location must be invented.
.LP
There is also a danger of calling a function
that returns a structure without declaring
it as such.
With ANSI C function prototypes,
this error need never occur.
.NH
Implementation
.LP
The compiler is divided internally into
four machine-independent passes,
four machine-dependent passes,
and an output pass.
The next nine sections describe each pass in order.
.NH 2
Parsing
.LP
The first pass is a YACC-based parser
[Joh79].
Declarations are interpreted immediately,
building a block structured symbol table.
Executable statements are put into a parse tree
and collected,
without interpretation.
At the end of each procedure,
the parse tree for the function is
examined by the other passes of the compiler.
.LP
The input stream of the parser is
a pushdown list of input activations.
The preprocessor
expansions of
macros
and
.CW #include
are implemented as pushdowns.
Thus there is no separate
pass for preprocessing.
.NH 2
Typing
.LP
The next pass distributes typing information
to every node of the tree.
Implicit operations on the tree are added,
such as type promotions and taking the
address of arrays and functions.
.NH 2
Machine-independent optimization
.LP
The next pass performs optimizations
and transformations of the tree, such as converting
.CW &*x
and
.CW *&x
into
.CW x .
Constant expressions are converted to constants in this pass.
.NH 2
Arithmetic rewrites
.LP
This is another machine-independent optimization.
Subtrees of add, subtract, and multiply of integers are
rewritten for easier compilation.
The major transformation is factoring:
.CW 4+8*a+16*b+5
is transformed into
.CW 9+8*(a+2*b) .
Such expressions arise from address
manipulation and array indexing.
.NH 2
Addressability
.LP
This is the first of the machine-dependent passes.
The addressability of a processor is defined as the set of
expressions that is legal in the address field
of a machine language instruction.
The addressability of different processors varies widely.
At one end of the spectrum are the 68020 and VAX,
which allow a complex mix of incrementing,
decrementing,
indexing, and relative addressing.
At the other end is the MIPS,
which allows only registers and constant offsets from the
contents of a register.
The addressability can be different for different instructions
within the same processor.
.LP
It is important to the code generator to know when a
subtree represents an address of a particular type.
This is done with a bottom-up walk of the tree.
In this pass, the leaves are labeled with small integers.
When an internal node is encountered,
it is labeled by consulting a table indexed by the
labels on the left and right subtrees.
For example,
on the 68020 processor,
it is possible to address an
offset from a named location.
In C, this is represented by the expression
.CW *(&name+constant) .
This is marked addressable by the following table.
In the table,
a node represented by the left column is marked
with a small integer from the right column.
Marks of the form
.CW A\s-2\di\u\s0
are addressable while
marks of the form
.CW N\s-2\di\u\s0
are not addressable.
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	name	A\s-2\d1\u\s0
	const	A\s-2\d2\u\s0
	&A\s-2\d1\u\s0	A\s-2\d3\u\s0
	A\s-2\d3\u\s0+A\s-2\d1\u\s0	N\s-2\d1\u\s0 \fR(note that this is not addressable)\fP
	*N\s-2\d1\u\s0	A\s-2\d4\u\s0
.R
.DE
Here there is a distinction between
a node marked
.CW A\s-2\d1\u\s0
and a node marked
.CW A\s-2\d4\u\s0
because the address operator of an
.CW A\s-2\d4\u\s0
node is not addressable.
So to extend the table:
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	&A\s-2\d4\u\s0	N\s-2\d2\u\s0
	N\s-2\d2\u\s0+N\s-2\d1\u\s0	N\s-2\d1\u\s0
.R
.DE
The full addressability of the 68020 is expressed
in 18 rules like this,
while the addressability of the MIPS is expressed
in 11 rules.
When one ports the compiler,
this table is usually initialized
so that leaves are labeled as addressable and nothing else.
The code produced is poor,
but porting is easy.
The table can be extended later.
.LP
This pass also rewrites some complex operators
into procedure calls.
Examples include 64-bit multiply and divide.
.LP
In the same bottom-up pass of the tree,
the nodes are labeled with a Sethi-Ullman complexity
[Set70].
This number is roughly the number of registers required
to compile the tree on an ideal machine.
An addressable node is marked 0.
A function call is marked infinite.
A unary operator is marked as the
maximum of 1 and the mark of its subtree.
A binary operator with equal marks on its subtrees is
marked with a subtree mark plus 1.
A binary operator with unequal marks on its subtrees is
marked with the maximum mark of its subtrees.
The actual values of the marks are not too important,
but the relative values are.
The goal is to compile the harder
(larger mark)
subtree first.
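.LP
The numbering can be sketched as follows (the node structure and operator
names are hypothetical; the compiler's own representation differs):
.P1
enum { INF = 1000 };	/* effectively infinite */

int
sucomp(Node *n)
{
	int l, r;

	if(n->addressable)	/* addressable: no registers needed */
		return 0;
	if(n->op == OFUNC)	/* function call */
		return INF;
	l = sucomp(n->left);
	if(n->right == nil)	/* unary: at least one register */
		return l > 1? l: 1;
	r = sucomp(n->right);
	if(l == r)
		return l+1;
	return l > r? l: r;
}
.P2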
.NH 2
Code generation
.LP
Code is generated by recursive
descent.
The Sethi-Ullman complexity completely guides the
order.
The addressability defines the leaves.
The only difficult part is compiling a tree
that has two infinite (function call)
subtrees.
In this case,
one subtree is compiled into the return register
(usually the most convenient place for a function call)
and then stored on the stack.
The other subtree is compiled into the return register
and then the operation is compiled with
operands from the stack and the return register.
.LP
There is a separate boolean code generator that compiles
conditional expressions.
This is fundamentally different from compiling an arithmetic expression.
The result of the boolean code generator is the
position of the program counter and not an expression.
The boolean code generator makes extensive use of De Morgan's rule.
The boolean code generator is an expanded version of that described
in chapter 8 of Aho, Sethi, and Ullman
[Aho87].
.LP
There is a considerable amount of talk in the literature
about automating this part of a compiler with a machine
description.
Since this code generator is so small
(less than 500 lines of C)
and easy,
it hardly seems worth the effort.
.NH 2
Registerization
.LP
Up to now,
the compiler has operated on syntax trees
that are roughly equivalent to the original source language.
The previous pass has produced machine language in an internal
format.
The next two passes operate on the internal machine language
structures.
The purpose of the next pass is to reintroduce
registers for heavily used variables.
.LP
All of the variables that can be
potentially registerized within a procedure are
placed in a table.
(Suitable variables are any automatic or external
scalars that do not have their addresses extracted.
Some constants that are hard to reference are also
considered for registerization.)
Four separate data flow equations are evaluated
over the procedure on all of these variables.
Two of the equations are the normal set-behind
and used-ahead
bits that define the life of a variable.
The two new bits tell if a variable life
crosses a function call ahead or behind.
By examining a variable over its lifetime,
it is possible to get a cost
for registerizing.
Loops are detected and the costs are multiplied
by three for every level of loop nesting.
Costs are sorted and the variables
are replaced by available registers on a greedy basis.
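.LP
The cost calculation can be sketched as follows (the variable and
reference structures are hypothetical; the real pass also weighs the
price of saving a register across calls):
.P1
long
regcost(Var *v)
{
	long c, w;
	int d;
	Ref *r;

	c = 0;
	for(r = v->refs; r != nil; r = r->next){
		w = 1;
		for(d = 0; d < r->loopdepth; d++)
			w *= 3;	/* triple the weight per loop level */
		c += w;
	}
	return c;
}
.P2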
.LP
The 68020 has two different
types of registers.
For the 68020,
two different costs are calculated for
each variable life and the register type that
affords the better cost is used.
Ties are broken by counting the number of available
registers of each type.
.LP
Note that externals are registerized together with automatics.
This is done by evaluating the semantics of a ``call'' instruction
differently for externals and automatics.
Since a call goes outside the local procedure,
it is assumed that a call references all externals.
Similarly,
externals are assumed to be set before an ``entry'' instruction
and assumed to be referenced after a ``return'' instruction.
This makes sure that externals are in memory across calls.
.LP
The overall results are satisfactory.
It would be nice to be able to do this processing in
a machine-independent way,
but it is impossible to get all of the costs and
side effects of different choices by examining the parse tree.
.LP
Most of the code in the registerization pass is machine-independent.
The major machine-dependency is in
examining a machine instruction to ask if it sets or references
a variable.
.NH 2
Machine code optimization
.LP
The next pass walks the machine code
for opportunistic optimizations.
For the most part,
this is highly specific to a particular
processor.
One optimization that is performed
on all of the processors is the
removal of unnecessary ``move''
instructions.
Ironically,
most of these instructions were inserted by
the previous pass.
There are two patterns that are repetitively
matched and replaced until no more matches are
found.
The first tries to remove ``move'' instructions
by relabeling variables.
.LP
When a ``move'' instruction is encountered,
if the destination variable is set before the
source variable is referenced,
then all of the references to the destination
variable can be renamed to the source and the ``move''
can be deleted.
This transformation uses the reverse data flow
set up in the previous pass.
.LP
An example of this pattern is depicted in the following
table.
The pattern is in the left column and the
replacement action is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	MOVE	a->b		\fR(remove)\fP
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	USE	b		USE	a
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	SET	b		SET	b
.R
.DE
.LP
Experiments have shown that it is marginally
worthwhile to rename uses of the destination variable
with uses of the source variable up to
the first use of the source variable.
.LP
The second transform will do relabeling
without deleting instructions.
When a ``move'' instruction is encountered,
if the source variable has been set prior
to the use of the destination variable
then all of the references to the source
variable are replaced by the destination and
the ``move'' is inverted.
Typically,
this transformation will alter two ``move''
instructions and allow the first transformation
another chance to remove code.
This transformation uses the forward data flow
set up in the previous pass.
.LP
Again,
the following is a depiction of the transformation where
the pattern is in the left column and the
rewrite is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	SET	a		SET	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	USE	a		USE	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	MOVE	a->b		MOVE	b->a
.R
.DE
Iterating these transformations
will usually get rid of all redundant ``move'' instructions.
.LP
A problem with this organization is that the costs
of registerization calculated in the previous pass
must depend on how well this pass can detect and remove
redundant instructions.
Often,
a fine candidate for registerization is rejected
because of the cost of instructions that are later
removed.
.NH 2
Writing the object file
.LP
The last pass walks the internal assembly language
and writes the object file.
The object file is reduced in size by about a factor
of three with simple compression
techniques.
The most important aspect of the object file
format is that it is independent of the compiling machine.
All integer and floating numbers in the object
code are converted to known formats and byte
orders.
.NH
The loader
.LP
The loader is a multiple pass program that
reads object files and libraries and produces
an executable binary.
The loader also does some minimal
optimizations and code rewriting.
Many of the operations performed by the
loader are machine-dependent.
.LP
The first pass of the loader reads the
object modules into an internal data
structure that looks like binary assembly language.
As the instructions are read,
code is reordered to remove
unconditional branch instructions.
Conditional branch instructions are inverted
to prevent the insertion of unconditional branches.
The loader will also make a copy of a few instructions
to remove an unconditional branch.
.LP
The next pass allocates addresses for
all external data.
Typical of processors is the MIPS,
which can reference ±32K bytes from a
register.
The loader allocates the register
.CW R30
as the static pointer.
The value placed in
.CW R30
is the base of the data segment plus 32K.
It is then cheap to reference all data in the
first 64K of the data segment.
External variables are allocated to
the data segment
with the smallest variables allocated first.
If all of the data cannot fit into the first
64K of the data segment,
then usually only a few large arrays
need more expensive addressing modes.
.LP
For the MIPS processor,
the loader makes a pass over the internal
structures,
exchanging instructions to try
to fill ``delay slots'' with useful work.
If a useful instruction cannot be found
to fill a delay slot,
the loader will insert
``noop''
instructions.
This pass is very expensive and does not
do a good job.
About 40% of all instructions are in
delay slots.
About 65% of these are useful instructions and
35% are ``noops.''
The vendor-supplied assembler does this job
more effectively,
filling about 80%
of the delay slots with useful instructions.
.LP
On the 68020 processor,
branch instructions come in a variety of
sizes depending on the relative distance
of the branch.
Thus the size of branch instructions
can be mutually dependent.
The loader uses a multiple pass algorithm
to resolve the branch lengths
[Szy78].
Initially, all branches are assumed minimal length.
On each subsequent pass,
the branches are reassessed
and expanded if necessary.
When no more expansions occur,
the locations of the instructions in
the text segment are known.
.LP
On the MIPS processor,
all instructions are one size.
A single pass over the instructions will
determine the locations of all addresses
in the text segment.
.LP
The last pass of the loader produces the
executable binary.
A symbol table and other tables are
produced to help the debugger to
interpret the binary symbolically.
.LP
The loader places absolute source line numbers in the symbol table.
The names and absolute line numbers of all
.CW #include
files are also placed in the
symbol table so that the debuggers can
associate object code with source files.
.NH
Performance
.LP
The following is a table of the source size of the MIPS
compiler.
.DS
.ta .1i .6i
	lines	module
	\0509	machine-independent headers
	1070	machine-independent YACC source
	6090	machine-independent C source

	\0545	machine-dependent headers
	6532	machine-dependent C source

	\0298	loader headers
	5215	loader C source
.DE
.LP
The following table shows timing
of a test program
that plays checkers, running on a MIPS R4000.
The test program is 26 files totaling 12600 lines of C.
The execution time does not significantly
depend on library implementation.
Since no other compiler runs on Plan 9,
the Plan 9 tests were done with the Plan 9 operating system;
the other tests were done on the vendor's operating system.
The hardware was identical in both cases.
The optimizer in the vendor's compiler
is reputed to be extremely good.
.DS
.ta .1i .9i
	\0\04.49s	Plan 9 \f(CWvc\fP \f(CW-N\fP compile time (opposite of \f(CW-O\fP)
	\0\01.72s	Plan 9 \f(CWvc\fP \f(CW-N\fP load time
	148.69s	Plan 9 \f(CWvc\fP \f(CW-N\fP run time

	\015.07s	Plan 9 \f(CWvc\fP compile time (\f(CW-O\fP implicit)
	\0\01.66s	Plan 9 \f(CWvc\fP load time
	\089.96s	Plan 9 \f(CWvc\fP run time

	\014.83s	vendor \f(CWcc\fP compile time
	\0\00.38s	vendor \f(CWcc\fP load time
	104.75s	vendor \f(CWcc\fP run time

	\043.59s	vendor \f(CWcc\fP \f(CW-O\fP compile time
	\0\00.38s	vendor \f(CWcc\fP \f(CW-O\fP load time
	\076.19s	vendor \f(CWcc\fP \f(CW-O\fP run time

	\0\08.19s	vendor \f(CWcc\fP \f(CW-O3\fP compile time
	\035.97s	vendor \f(CWcc\fP \f(CW-O3\fP load time
	\071.16s	vendor \f(CWcc\fP \f(CW-O3\fP run time
.DE
.LP
To compare the Intel compiler,
a program that is about 40% bit manipulation and
about 60% single precision floating point was
run on the same 33 MHz 486, once under Windows
compiled with the Watcom compiler, version 10.0,
in 16-bit mode and once under
Plan 9 in 32-bit mode.
The Plan 9 execution time was 27 sec while the Windows
execution time was 31 sec.
.NH
Conclusions
.LP
The new compilers compile
quickly,
load slowly,
and produce
medium quality
object code.
The compilers are relatively
portable,
requiring but a couple of weeks' work to
produce a compiler for a different computer.
For Plan 9,
where we needed several compilers
with specialized features and
our own object formats,
this project was indispensable.
It is also necessary for us to
be able to freely distribute our compilers
with the Plan 9 distribution.
.LP
Two problems have come up in retrospect.
The first has to do with the
division of labor between compiler and loader.
Plan 9 runs on multi-processors and as such
compilations are often done in parallel.
Unfortunately,
all compilations must be complete before loading
can begin.
The load is then single-threaded.
With this model,
any shift of work from compile to load
results in a significant increase in real time.
The same is true of libraries that are compiled
infrequently and loaded often.
In the future,
we may try to put some of the loader work
back into the compiler.
.LP
The second problem comes from
the various optimizations performed over several
passes.
Often optimizations in different passes depend
on each other.
Iterating the passes could compromise efficiency,
or even loop.
We see no real solution to this problem.
.NH
References
.LP
[Aho87] A. V. Aho, R. Sethi, and J. D. Ullman,
.I
Compilers \- Principles, Techniques, and Tools,
.R
Addison Wesley,
Reading, MA,
1987.
.LP
[ANSI90] \f2American National Standard for Information Systems \-
Programming Language C\f1, American National Standards Institute, Inc.,
New York, 1990.
.LP
[Dav91] J. W. Davidson and D. B. Whalley,
``Methods for Saving and Restoring Register Values across Function Calls'',
.I
Software\-Practice and Experience,
.R
Vol 21(2), pp. 149-165, February 1991.
.LP
[Joh79] S. C. Johnson,
``YACC \- Yet Another Compiler Compiler'',
.I
UNIX Programmer's Manual, Seventh Ed., Vol. 2A,
.R
AT&T Bell Laboratories,
Murray Hill, NJ,
1979.
.LP
[Set70] R. Sethi and J. D. Ullman,
``The Generation of Optimal Code for Arithmetic Expressions'',
.I
Journal of the ACM,
.R
Vol 17(4), pp. 715-728, 1970.
.LP
[Szy78] T. G. Szymanski,
``Assembling Code for Machines with Span-dependent Instructions'',
.I
Communications of the ACM,
.R
Vol 21(4), pp. 300-308, 1978.