xref: /plan9/sys/doc/compiler.ms (revision ff8c3af2f44d95267f67219afa20ba82ff6cf7e4)
1.TL
2Plan 9 C Compilers
3.AU
4Ken Thompson
5ken@plan9.bell-labs.com
6.AB
7.FS
8Originally appeared, in a different form, in
9.I
10Proceedings of the Summer 1990 UKUUG Conference,
11.R
12pp. 41-51,
13London, 1990.
14.FE
15This paper describes the overall structure and function of the Plan 9 C compilers.
16A more detailed implementation document
17for any one of the compilers
18is yet to be written.
19.AE
20.NH
21Introduction
22.LP
23There are many compilers in the series.
24Six of the compilers (MIPS 3000, SPARC, Intel 386, Power PC, DEC Alpha, and Motorola 68020)
25are considered active and are used to compile
26current versions of Plan 9.
27Several others (Motorola 68000, Intel 960, ARM 7500, AMD 29000) have had only limited use, such as
28to program peripherals or experimental devices.
29.NH
30Structure
31.LP
32The compiler is a single program that produces an
33object file.
34Combined in the compiler are the traditional
35roles of preprocessor, lexical analyzer, parser, code generator,
36local optimizer,
37and first half of the assembler.
38The object files are binary forms of assembly
39language,
40similar to what might be passed between
41the first and second passes of an assembler.
42.LP
43Object files and libraries
44are combined by a loader
45program to produce the executable binary.
46The loader combines the roles of second half
47of the assembler, global optimizer, and loader.
48The names of the compliers, loaders, and assemblers
49are as follows:
50.DS
51.ta 1.5i
52.de Ta
53\\$1	\f(CW\\$2\fP  \f(CW\\$3\fP  \f(CW\\$4\fP
54..
55.Ta SPARC kc kl ka
56.Ta Power\ PC qc ql qa
57.Ta MIPS vc vl va
58.Ta Motorola\ 68000 1c 1l 1a
59.Ta Motorola\ 68020 2c 2l 2a
60.Ta ARM\ 7500 5c 5l 5a
61.Ta Intel\ 960 6c 6l 6a
62.Ta DEC\ Alpha 7c 7l 7a
63.Ta Intel\ 386 8c 8l 8a
64.Ta AMD\ 29000 9c 9l 9a
65.DE
66There is a further breakdown
67in the source of the compilers into
68object-independent and
69object-dependent
70parts.
71All of the object-independent parts
72are combined into source files in the
73directory
74.CW /sys/src/cmd/cc .
75The object-dependent parts are collected
76in a separate directory for each compiler,
77for example
78.CW /sys/src/cmd/vc .
79All of the code,
80both object-independent and
81object-dependent,
82is machine-independent
83and may be cross-compiled and executed on any
84of the architectures.
85.NH
86The Language
87.LP
88The compiler implements ANSI C with some
89restrictions and extensions
90[ANSI90].
91Most of the restrictions are due to
92personal preference, while
93most of the extensions were to help in
94the implementation of Plan 9.
95There are other departures from the standard,
96particularly in the libraries,
97that are beyond the scope of this
98paper.
99.NH 2
100Register, volatile, const
101.LP
102The keyword
103.CW register
104is recognized syntactically
105but is semantically ignored.
106Thus taking the address of a
107.CW register
108variable is not diagnosed.
109The keyword
110.CW volatile
111disables all optimizations, in particular registerization, of the corresponding variable.
112The keyword
113.CW const
114generates warnings (if warnings are enabled by the compiler's
115.CW -w
116option) of non-constant use of the variable,
117but does not affect the generated code.
118.NH 2
119The preprocessor
120.LP
121The C preprocessor is probably the
122biggest departure from the ANSI standard.
123.LP
124The preprocessor built into the Plan 9 compilers does not support
125.CW #if ,
126although it does handle
127.CW #ifdef
128and
129.CW #include .
130If it is necessary to be more standard,
131the source text can first be run through the separate ANSI C
132preprocessor,
133.CW cpp .
134.NH 2
135Unnamed substructures
136.LP
137The most important and most heavily used of the
138extensions is the declaration of an
139unnamed substructure or subunion.
140For example:
141.DS
142.CW
143.ta .1i .6i 1.1i 1.6i
144	typedef
145	struct	lock
146	{
147		int    locked;
148	} Lock;
149
150	typedef
151	struct	node
152	{
153		int	type;
154		union
155		{
156			double dval;
157			float  fval;
158			long   lval;
159		};
160		Lock;
161	} Node;
162
163	Lock*	lock;
164	Node*	node;
165.DE
166The declaration of
167.CW Node
168has an unnamed substructure of type
169.CW Lock
170and an unnamed subunion.
171One use of this feature allows references to elements of the
172subunit to be accessed as if they were in
173the outer structure.
174Thus
175.CW node->dval
176and
177.CW node->locked
178are legitimate references.
179.LP
180When an outer structure is used
181in a context that is only legal for
182an unnamed substructure,
183the compiler promotes the reference to the
184unnamed substructure.
185This is true for references to structures and
186to references to pointers to structures.
187This happens in assignment statements and
188in argument passing where prototypes have been
189declared.
190Thus, continuing with the example,
191.DS
192.CW
193.ta .1i .6i 1.1i 1.6i
194	lock = node;
195.DE
196would assign a pointer to the unnamed
197.CW Lock
198in
199the
200.CW Node
201to the variable
202.CW lock .
203Another example,
204.DS
205.CW
206.ta .1i .6i 1.1i 1.6i
207	extern void lock(Lock*);
208	func(...)
209	{
210		...
211		lock(node);
212		...
213	}
214.DE
215will pass a pointer to the
216.CW Lock
217substructure.
218.LP
219Finally, in places where context is insufficient to identify the unnamed structure,
220the type name (it must be a
221.CW typedef )
222of the unnamed structure can be used as an identifier.
223In our example,
224.CW &node->Lock
225gives the address of the anonymous
226.CW Lock
227structure.
228.NH 2
229Structure displays
230.LP
231A structure cast followed by a list of expressions in braces is
232an expression with the type of the structure and elements assigned from
233the corresponding list.
234Structures are now almost first-class citizens of the language.
235It is common to see code like this:
236.DS
237.CW
238.ta .1i
239	r = (Rectangle){point1, (Point){x,y+2}};
240.DE
241.NH 2
242Initialization indexes
243.LP
244In initializers of arrays,
245one may place a constant expression
246in square brackets before an initializer.
247This causes the next initializer to assign
248the indicated element.
249For example:
250.DS
251.CW
252.ta .1i .6i 1.6i
253	enum	errors
254	{
255		Etoobig,
256		Ealarm,
257		Egreg
258	};
259	char* errstrings[] =
260	{
261		[Ealarm]	"Alarm call",
262		[Egreg]	"Panic: out of mbufs",
263		[Etoobig]	"Arg list too long",
264	};
265.DE
266In the same way,
267individual structures members may
268be initialized in any order by preceding the initialization with
269.CW .tagname .
270Both forms allow an optional
271.CW = ,
272to be compatible with a proposed
273extension to ANSI C.
274.NH 2
275External register
276.LP
277The declaration
278.CW extern
279.CW register
280will dedicate a register to
281a variable on a global basis.
282It can be used only under special circumstances.
283External register variables must be identically
284declared in all modules and
285libraries.
286The feature is not intended for efficiency,
287although it can produce efficient code;
288rather it represents a unique storage class that
289would be hard to get any other way.
290On a shared-memory multi-processor,
291an external register is
292one-per-processor and neither one-per-procedure (automatic)
293or one-per-system (external).
294It is used for two variables in the Plan 9 kernel,
295.CW u
296and
297.CW m .
298.CW U
299is a pointer to the structure representing the currently running process
300and
301.CW m
302is a pointer to the per-machine data structure.
303.NH 2
304Long long
305.LP
306The compilers accept
307.CW long
308.CW long
309as a basic type meaning 64-bit integer.
310On all of the machines
311this type is synthesized from 32-bit instructions.
312.NH 2
313Pragma
314.LP
315The compilers accept
316.CW #pragma
317.CW lib
318.I libname
319and pass the
320library name string uninterpreted
321to the loader.
322The loader uses the library name to
323find libraries to load.
324If the name contains
325.CW %O ,
326it is replaced with
327the single character object type of the compiler
328(e.g.,
329.CW v
330for the MIPS).
331If the name contains
332.CW %M ,
333it is replaced with
334the architecture type for the compiler
335(e.g.,
336.CW mips
337for the MIPS).
338If the name starts with
339.CW /
340it is an absolute pathname;
341if it starts with
342.CW .
343then it is searched for in the loader's current directory.
344Otherwise, the name is searched from
345.CW /%M/lib .
346Such
347.CW #pragma
348statements in header files guarantee that the correct
349libraries are always linked with a program without the
350need to specify them explicitly at link time.
351.LP
352They also accept
353.CW #pragma
354.CW hjdicks
355.CW on
356(or
357.CW yes
358or
359.CW 1 )
360to cause subsequently declared data, until
361.CW #pragma
362.CW hjdicks
363.CW off
364(or
365.CW no
366or
367.CW 0 ),
368to be laid out in memory tightly packed in successive bytes, disregarding
369the usual alignment rules.
370Accessing such data can cause faults.
371.LP
372Similarly,
373.CW #pragma
374.CW profile
375.CW off
376(or
377.CW no
378or
379.CW 0 )
380causes subsequently declared functions, until
381.CW #pragma
382.CW profile
383.CW on
384(or
385.CW yes
386or
387.CW 1 ),
388to be marked as unprofiled.
389Such functions will not be profiled when
390profiling is enabled for the rest of the program.
391.LP
392Two
393.CW #pragma
394statements allow type-checking of
395.CW print -like
396functions.
397The first, of the form
398.P1
399#pragma varargck argpos error 2
400.P2
401tells the compiler that the second argument to
402.CW error
403is a
404.CW print
405format string (see the manual page
406.I print (2))
407that specifies how to format
408.CW error 's
409subsequent arguments.
410The second, of the form
411.P1
412#pragma varargck type "s" char*
413.P2
414says that the
415.CW print
416format verb
417.CW s
418processes an argument of
419type
420.CW char* .
421If the compiler's
422.CW -F
423option is enabled, the compiler will use this information
424to report type violations in the arguments to
425.CW print ,
426.CW error ,
427and similar routines.
428.NH
429Object module conventions
430.LP
431The overall conventions of the runtime environment
432are important
433to runtime efficiency.
434In this section,
435several of these conventions are discussed.
436.NH 2
437Register saving
438.LP
439In the Plan 9 compilers,
440the caller of a procedure saves the registers.
441With caller-saves,
442the leaf procedures can use all the
443registers and never save them.
444If you spend a lot of time at the leaves,
445this seems preferable.
446With callee-saves,
447the saving of the registers is done
448in the single point of entry and return.
449If you are interested in space,
450this seems preferable.
451In both,
452there is a degree of uncertainty
453about what registers need to be saved.
454Callee-saved registers make it difficult to
455find variables in registers in debuggers.
456Callee-saved registers also complicate
457the implementation of
458.CW longjmp .
459The convincing argument is
460that with caller-saves,
461the decision to registerize a variable
462can include the cost of saving the register
463across calls.
464For a further discussion of caller- vs. callee-saves,
465see the paper by Davidson and Whalley [Dav91].
466.LP
467In the Plan 9 operating system,
468calls to the kernel look like normal procedure
469calls, which means
470the caller
471has saved the registers and the system
472entry does not have to.
473This makes system calls considerably faster.
474Since this is a potential security hole,
475and can lead to non-determinism,
476the system may eventually save the registers
477on entry,
478or more likely clear the registers on return.
479.NH 2
480Calling convention
481.LP
482Older C compilers maintain a frame pointer, which is at a known constant
483offset from the stack pointer within each function.
484For machines where the stack grows towards zero,
485the argument pointer is at a known constant offset
486from the frame pointer.
487Since the stack grows down in Plan 9,
488the Plan 9 compilers
489keep neither an
490explicit frame pointer nor
491an explicit argument pointer;
492instead they generate addresses relative to the stack pointer.
493.LP
494On some architectures, the first argument to a subroutine is passed in a register.
495.NH 2
496Functions returning structures
497.LP
498Structures longer than one word are awkward to implement
499since they do not fit in registers and must
500be passed around in memory.
501Functions that return structures
502are particularly clumsy.
503The Plan 9 compilers pass the return address of
504a structure as the first argument of a
505function that has a structure return value.
506Thus
507.DS
508.CW
509.ta .1i .6i 1.1i 1.6i
510	x = f(...)
511.DE
512is rewritten as
513.DS
514.CW
515.ta .1i .6i 1.1i 1.6i
516	f(&x, ...)\f1.
517.DE
518This saves a copy and makes the compilation
519much less clumsy.
520A disadvantage is that if you call this
521function without an assignment,
522a dummy location must be invented.
523.LP
524There is also a danger of calling a function
525that returns a structure without declaring
526it as such.
527With ANSI C function prototypes,
528this error need never occur.
529.NH
530Implementation
531.LP
532The compiler is divided internally into
533four machine-independent passes,
534four machine-dependent passes,
535and an output pass.
536The next nine sections describe each pass in order.
537.NH 2
538Parsing
539.LP
540The first pass is a YACC-based parser
541[Joh79].
542Declarations are interpreted immediately,
543building a block structured symbol table.
544Executable statements are put into a parse tree
545and collected,
546without interpretation.
547At the end of each procedure,
548the parse tree for the function is
549examined by the other passes of the compiler.
550.LP
551The input stream of the parser is
552a pushdown list of input activations.
553The preprocessor
554expansions of
555macros
556and
557.CW #include
558are implemented as pushdowns.
559Thus there is no separate
560pass for preprocessing.
561.NH 2
562Typing
563.LP
564The next pass distributes typing information
565to every node of the tree.
566Implicit operations on the tree are added,
567such as type promotions and taking the
568address of arrays and functions.
569.NH 2
570Machine-independent optimization
571.LP
572The next pass performs optimizations
573and transformations of the tree, such as converting
574.CW &*x
575and
576.CW *&x
577into
578.CW x .
579Constant expressions are converted to constants in this pass.
580.NH 2
581Arithmetic rewrites
582.LP
583This is another machine-independent optimization.
584Subtrees of add, subtract, and multiply of integers are
585rewritten for easier compilation.
586The major transformation is factoring:
587.CW 4+8*a+16*b+5
588is transformed into
589.CW 9+8*(a+2*b) .
590Such expressions arise from address
591manipulation and array indexing.
592.NH 2
593Addressability
594.LP
595This is the first of the machine-dependent passes.
596The addressability of a processor is defined as the set of
597expressions that is legal in the address field
598of a machine language instruction.
599The addressability of different processors varies widely.
600At one end of the spectrum are the 68020 and VAX,
601which allow a complex mix of incrementing,
602decrementing,
603indexing, and relative addressing.
604At the other end is the MIPS,
605which allows only registers and constant offsets from the
606contents of a register.
607The addressability can be different for different instructions
608within the same processor.
609.LP
610It is important to the code generator to know when a
611subtree represents an address of a particular type.
612This is done with a bottom-up walk of the tree.
613In this pass, the leaves are labeled with small integers.
614When an internal node is encountered,
615it is labeled by consulting a table indexed by the
616labels on the left and right subtrees.
617For example,
618on the 68020 processor,
619it is possible to address an
620offset from a named location.
621In C, this is represented by the expression
622.CW *(&name+constant) .
623This is marked addressable by the following table.
624In the table,
625a node represented by the left column is marked
626with a small integer from the right column.
627Marks of the form
628.CW A\s-2\di\u\s0
629are addressable while
630marks of the form
631.CW N\s-2\di\u\s0
632are not addressable.
633.DS
634.B
635.ta .1i 1.1i
636	Node	Marked
637.CW
638	name	A\s-2\d1\u\s0
639	const	A\s-2\d2\u\s0
640	&A\s-2\d1\u\s0	A\s-2\d3\u\s0
641	A\s-2\d3\u\s0+A\s-2\d1\u\s0	N\s-2\d1\u\s0 \fR(note that this is not addressable)\fP
642	*N\s-2\d1\u\s0	A\s-2\d4\u\s0
643.DE
644Here there is a distinction between
645a node marked
646.CW A\s-2\d1\u\s0
647and a node marked
648.CW A\s-2\d4\u\s0
649because the address operator of an
650.CW A\s-2\d4\u\s0
651node is not addressable.
652So to extend the table:
653.DS
654.B
655.ta .1i 1.1i
656	Node	Marked
657.CW
658	&A\s-2\d4\u\s0	N\s-2\d2\u\s0
659	N\s-2\d2\u\s0+N\s-2\d1\u\s0	N\s-2\d1\u\s0
660.DE
661The full addressability of the 68020 is expressed
662in 18 rules like this,
663while the addressability of the MIPS is expressed
664in 11 rules.
665When one ports the compiler,
666this table is usually initialized
667so that leaves are labeled as addressable and nothing else.
668The code produced is poor,
669but porting is easy.
670The table can be extended later.
671.LP
672This pass also rewrites some complex operators
673into procedure calls.
674Examples include 64-bit multiply and divide.
675.LP
676In the same bottom-up pass of the tree,
677the nodes are labeled with a Sethi-Ullman complexity
678[Set70].
679This number is roughly the number of registers required
680to compile the tree on an ideal machine.
681An addressable node is marked 0.
682A function call is marked infinite.
683A unary operator is marked as the
684maximum of 1 and the mark of its subtree.
685A binary operator with equal marks on its subtrees is
686marked with a subtree mark plus 1.
687A binary operator with unequal marks on its subtrees is
688marked with the maximum mark of its subtrees.
689The actual values of the marks are not too important,
690but the relative values are.
691The goal is to compile the harder
692(larger mark)
693subtree first.
694.NH 2
695Code generation
696.LP
697Code is generated by recursive
698descent.
699The Sethi-Ullman complexity completely guides the
700order.
701The addressability defines the leaves.
702The only difficult part is compiling a tree
703that has two infinite (function call)
704subtrees.
705In this case,
706one subtree is compiled into the return register
707(usually the most convenient place for a function call)
708and then stored on the stack.
709The other subtree is compiled into the return register
710and then the operation is compiled with
711operands from the stack and the return register.
712.LP
713There is a separate boolean code generator that compiles
714conditional expressions.
715This is fundamentally different from compiling an arithmetic expression.
716The result of the boolean code generator is the
717position of the program counter and not an expression.
718The boolean code generator makes extensive use of De Morgan's rule.
719The boolean code generator is an expanded version of that described
720in chapter 8 of Aho, Sethi, and Ullman
721[Aho87].
722.LP
723There is a considerable amount of talk in the literature
724about automating this part of a compiler with a machine
725description.
726Since this code generator is so small
727(less than 500 lines of C)
728and easy,
729it hardly seems worth the effort.
730.NH 2
731Registerization
732.LP
733Up to now,
734the compiler has operated on syntax trees
735that are roughly equivalent to the original source language.
736The previous pass has produced machine language in an internal
737format.
738The next two passes operate on the internal machine language
739structures.
740The purpose of the next pass is to reintroduce
741registers for heavily used variables.
742.LP
743All of the variables that can be
744potentially registerized within a procedure are
745placed in a table.
746(Suitable variables are any automatic or external
747scalars that do not have their addresses extracted.
748Some constants that are hard to reference are also
749considered for registerization.)
750Four separate data flow equations are evaluated
751over the procedure on all of these variables.
752Two of the equations are the normal set-behind
753and used-ahead
754bits that define the life of a variable.
755The two new bits tell if a variable life
756crosses a function call ahead or behind.
757By examining a variable over its lifetime,
758it is possible to get a cost
759for registerizing.
760Loops are detected and the costs are multiplied
761by three for every level of loop nesting.
762Costs are sorted and the variables
763are replaced by available registers on a greedy basis.
764.LP
765The 68020 has two different
766types of registers.
767For the 68020,
768two different costs are calculated for
769each variable life and the register type that
770affords the better cost is used.
771Ties are broken by counting the number of available
772registers of each type.
773.LP
774Note that externals are registerized together with automatics.
775This is done by evaluating the semantics of a ``call'' instruction
776differently for externals and automatics.
777Since a call goes outside the local procedure,
778it is assumed that a call references all externals.
779Similarly,
780externals are assumed to be set before an ``entry'' instruction
781and assumed to be referenced after a ``return'' instruction.
782This makes sure that externals are in memory across calls.
783.LP
784The overall results are satisfactory.
785It would be nice to be able to do this processing in
786a machine-independent way,
787but it is impossible to get all of the costs and
788side effects of different choices by examining the parse tree.
789.LP
790Most of the code in the registerization pass is machine-independent.
791The major machine-dependency is in
792examining a machine instruction to ask if it sets or references
793a variable.
794.NH 2
795Machine code optimization
796.LP
797The next pass walks the machine code
798for opportunistic optimizations.
799For the most part,
800this is highly specific to a particular
801processor.
802One optimization that is performed
803on all of the processors is the
804removal of unnecessary ``move''
805instructions.
806Ironically,
807most of these instructions were inserted by
808the previous pass.
809There are two patterns that are repetitively
810matched and replaced until no more matches are
811found.
812The first tries to remove ``move'' instructions
813by relabeling variables.
814.LP
815When a ``move'' instruction is encountered,
816if the destination variable is set before the
817source variable is referenced,
818then all of the references to the destination
819variable can be renamed to the source and the ``move''
820can be deleted.
821This transformation uses the reverse data flow
822set up in the previous pass.
823.LP
824An example of this pattern is depicted in the following
825table.
826The pattern is in the left column and the
827replacement action is in the right column.
828.DS
829.CW
830.ta .1i .6i 1.6i 2.1i 2.6i
831	MOVE	a->b		\fR(remove)\fP
832.R
833	(sequence with no mention of \f(CWa\fP)
834.CW
835	USE	b		USE	a
836.R
837	(sequence with no mention of \f(CWa\fP)
838.CW
839	SET	b		SET	b
840.DE
841.LP
842Experiments have shown that it is marginally
843worthwhile to rename uses of the destination variable
844with uses of the source variable up to
845the first use of the source variable.
846.LP
847The second transform will do relabeling
848without deleting instructions.
849When a ``move'' instruction is encountered,
850if the source variable has been set prior
851to the use of the destination variable
852then all of the references to the source
853variable are replaced by the destination and
854the ``move'' is inverted.
855Typically,
856this transformation will alter two ``move''
857instructions and allow the first transformation
858another chance to remove code.
859This transformation uses the forward data flow
860set up in the previous pass.
861.LP
862Again,
863the following is a depiction of the transformation where
864the pattern is in the left column and the
865rewrite is in the right column.
866.DS
867.CW
868.ta .1i .6i 1.6i 2.1i 2.6i
869	SET	a		SET	b
870.R
871	(sequence with no use of \f(CWb\fP)
872.CW
873	USE	a		USE	b
874.R
875	(sequence with no use of \f(CWb\fP)
876.CW
877	MOVE	a->b		MOVE	b->a
878.DE
879Iterating these transformations
880will usually get rid of all redundant ``move'' instructions.
881.LP
882A problem with this organization is that the costs
883of registerization calculated in the previous pass
884must depend on how well this pass can detect and remove
885redundant instructions.
886Often,
887a fine candidate for registerization is rejected
888because of the cost of instructions that are later
889removed.
890.NH 2
891Writing the object file
892.LP
893The last pass walks the internal assembly language
894and writes the object file.
895The object file is reduced in size by about a factor
896of three with simple compression
897techniques.
898The most important aspect of the object file
899format is that it is independent of the compiling machine.
900All integer and floating numbers in the object
901code are converted to known formats and byte
902orders.
903.NH
904The loader
905.LP
906The loader is a multiple pass program that
907reads object files and libraries and produces
908an executable binary.
909The loader also does some minimal
910optimizations and code rewriting.
911Many of the operations performed by the
912loader are machine-dependent.
913.LP
914The first pass of the loader reads the
915object modules into an internal data
916structure that looks like binary assembly language.
917As the instructions are read,
918code is reordered to remove
919unconditional branch instructions.
920Conditional branch instructions are inverted
921to prevent the insertion of unconditional branches.
922The loader will also make a copy of a few instructions
923to remove an unconditional branch.
924.LP
925The next pass allocates addresses for
926all external data.
927Typical of processors is the MIPS,
928which can reference ±32K bytes from a
929register.
930The loader allocates the register
931.CW R30
932as the static pointer.
933The value placed in
934.CW R30
935is the base of the data segment plus 32K.
936It is then cheap to reference all data in the
937first 64K of the data segment.
938External variables are allocated to
939the data segment
940with the smallest variables allocated first.
941If all of the data cannot fit into the first
94264K of the data segment,
943then usually only a few large arrays
944need more expensive addressing modes.
945.LP
946For the MIPS processor,
947the loader makes a pass over the internal
948structures,
949exchanging instructions to try
950to fill ``delay slots'' with useful work.
951If a useful instruction cannot be found
952to fill a delay slot,
953the loader will insert
954``noop''
955instructions.
956This pass is very expensive and does not
957do a good job.
958About 40% of all instructions are in
959delay slots.
960About 65% of these are useful instructions and
96135% are ``noops.''
962The vendor-supplied assembler does this job
963more effectively,
964filling about 80%
965of the delay slots with useful instructions.
966.LP
967On the 68020 processor,
968branch instructions come in a variety of
969sizes depending on the relative distance
970of the branch.
971Thus the size of branch instructions
972can be mutually dependent.
973The loader uses a multiple pass algorithm
974to resolve the branch lengths
975[Szy78].
976Initially, all branches are assumed minimal length.
977On each subsequent pass,
978the branches are reassessed
979and expanded if necessary.
980When no more expansions occur,
981the locations of the instructions in
982the text segment are known.
983.LP
984On the MIPS processor,
985all instructions are one size.
986A single pass over the instructions will
987determine the locations of all addresses
988in the text segment.
989.LP
990The last pass of the loader produces the
991executable binary.
992A symbol table and other tables are
993produced to help the debugger to
994interpret the binary symbolically.
995.LP
996The loader places absolute source line numbers in the symbol table.
997The name and absolute line number of all
998.CW #include
999files is also placed in the
1000symbol table so that the debuggers can
1001associate object code to source files.
1002.NH
1003Performance
1004.LP
1005The following is a table of the source size of the MIPS
1006compiler.
1007.DS
1008.ta .1i .6i
1009	lines	module
1010	\0509	machine-independent headers
1011	1070	machine-independent YACC source
1012	6090	machine-independent C source
1013
1014	\0545	machine-dependent headers
1015	6532	machine-dependent C source
1016
1017	\0298	loader headers
1018	5215	loader C source
1019.DE
1020.LP
1021The following table shows timing
1022of a test program
1023that plays checkers, running on a MIPS R4000.
1024The test program is 26 files totaling 12600 lines of C.
1025The execution time does not significantly
1026depend on library implementation.
1027Since no other compiler runs on Plan 9,
1028the Plan 9 tests were done with the Plan 9 operating system;
1029the other tests were done on the vendor's operating system.
1030The hardware was identical in both cases.
1031The optimizer in the vendor's compiler
1032is reputed to be extremely good.
1033.DS
1034.ta .1i .9i
1035	\0\04.49s	Plan 9 \f(CWvc\fP \f(CW-N\fP compile time (opposite of \f(CW-O\fP)
1036	\0\01.72s	Plan 9 \f(CWvc\fP \f(CW-N\fP load time
1037	148.69s	Plan 9 \f(CWvc\fP \f(CW-N\fP run time
1038
1039	\015.07s	Plan 9 \f(CWvc\fP compile time (\f(CW-O\fP implicit)
1040	\0\01.66s	Plan 9 \f(CWvc\fP load time
1041	\089.96s	Plan 9 \f(CWvc\fP run time
1042
1043	\014.83s	vendor \f(CWcc\fP compile time
1044	\0\00.38s	vendor \f(CWcc\fP load time
1045	104.75s	vendor \f(CWcc\fP run time
1046
1047	\043.59s	vendor \f(CWcc\fP \f(CW-O\fP compile time
1048	\0\00.38s	vendor \f(CWcc\fP \f(CW-O\fP load time
1049	\076.19s	vendor \f(CWcc\fP \f(CW-O\fP run time
1050
1051	\0\08.19s	vendor \f(CWcc\fP \f(CW-O3\fP compile time
1052	\035.97s	vendor \f(CWcc\fP \f(CW-O3\fP load time
1053	\071.16s	vendor \f(CWcc\fP \f(CW-O3\fP run time
1054.DE
1055.LP
1056To compare the Intel compiler,
1057a program that is about 40% bit manipulation and
1058about 60% single precision floating point was
1059run on the same 33 MHz 486, once under Windows
1060compiled with the Watcom compiler, version 10.0,
1061in 16-bit mode and once under
1062Plan 9 in 32-bit mode.
1063The Plan 9 execution time was 27 sec while the Windows
1064execution time was 31 sec.
1065.NH
1066Conclusions
1067.LP
1068The new compilers compile
1069quickly,
1070load slowly,
1071and produce
1072medium quality
1073object code.
1074The compilers are relatively
1075portable,
1076requiring but a couple of weeks' work to
1077produce a compiler for a different computer.
1078For Plan 9,
1079where we needed several compilers
1080with specialized features and
1081our own object formats,
1082this project was indispensable.
1083It is also necessary for us to
1084be able to freely distribute our compilers
1085with the Plan 9 distribution.
1086.LP
1087Two problems have come up in retrospect.
1088The first has to do with the
1089division of labor between compiler and loader.
1090Plan 9 runs on multi-processors and as such
1091compilations are often done in parallel.
1092Unfortunately,
1093all compilations must be complete before loading
1094can begin.
1095The load is then single-threaded.
1096With this model,
1097any shift of work from compile to load
1098results in a significant increase in real time.
1099The same is true of libraries that are compiled
1100infrequently and loaded often.
1101In the future,
1102we may try to put some of the loader work
1103back into the compiler.
1104.LP
1105The second problem comes from
1106the various optimizations performed over several
1107passes.
1108Often optimizations in different passes depend
1109on each other.
1110Iterating the passes could compromise efficiency,
1111or even loop.
1112We see no real solution to this problem.
1113.NH
1114References
1115.LP
1116[Aho87] A. V. Aho, R. Sethi, and J. D. Ullman,
1117.I
1118Compilers \- Principles, Techniques, and Tools,
1119.R
1120Addison Wesley,
1121Reading, MA,
11221987.
1123.LP
1124[ANSI90] \f2American National Standard for Information Systems \-
1125Programming Language C\f1, American National Standards Institute, Inc.,
1126New York, 1990.
1127.LP
1128[Dav91] J. W. Davidson and D. B. Whalley,
1129``Methods for Saving and Restoring Register Values across Function Calls'',
1130.I
1131Software\-Practice and Experience,
1132.R
1133Vol 21(2), pp. 149-165, February 1991.
1134.LP
1135[Joh79] S. C. Johnson,
1136``YACC \- Yet Another Compiler Compiler'',
1137.I
1138UNIX Programmer's Manual, Seventh Ed., Vol. 2A,
1139.R
1140AT&T Bell Laboratories,
1141Murray Hill, NJ,
11421979.
1143.LP
1144[Set70] R. Sethi and J. D. Ullman,
1145``The Generation of Optimal Code for Arithmetic Expressions'',
1146.I
1147Journal of the ACM,
1148.R
1149Vol 17(4), pp. 715-728, 1970.
1150.LP
1151[Szy78] T. G. Szymanski,
1152``Assembling Code for Machines with Span-dependent Instructions'',
1153.I
1154Communications of the ACM,
1155.R
1156Vol 21(4), pp. 300-308, 1978.
1157