.HTML "Plan 9 C Compilers
.TL
Plan 9 C Compilers
.AU
Ken Thompson
ken@plan9.bell-labs.com
.AB
.FS
Originally appeared, in a different form, in
.I
Proceedings of the Summer 1990 UKUUG Conference,
.R
pp. 41-51,
London, 1990.
.FE
This paper describes the overall structure and function of the Plan 9 C compilers.
A more detailed implementation document
for any one of the compilers
is yet to be written.
.AE
.NH
Introduction
.LP
There are many compilers in the series.
Six of the compilers
(Intel 386, AMD64, PowerPC, PowerPC 64-bit, ARM, MIPS R3000)
are considered active and are used to compile
current versions of Plan 9.
One of the compilers (SPARC)
is maintained but is for older machines
for which we have no current ports of Plan 9;
we are unlikely to port to any SPARC machines.
The DEC Alpha and Motorola 68020 compilers have been retired.
Several others (Motorola 68000, Intel 960, AMD 29000)
have had only limited use, such as
to program peripherals or experimental devices.
Any of the disused compilers could be restored if needed.
.NH
Structure
.LP
The compiler is a single program that produces an
object file.
Combined in the compiler are the traditional
roles of preprocessor, lexical analyzer, parser, code generator,
local optimizer,
and first half of the assembler.
The object files are binary forms of assembly
language,
similar to what might be passed between
the first and second passes of an assembler.
.LP
Object files and libraries
are combined by a loader
program to produce the executable binary.
The loader combines the roles of second half
of the assembler, global optimizer, and loader.
The names of the compilers, loaders, and assemblers
are as follows:
.DS
.ta 1.5i
.de Ta
\\$1	\f(CW\\$2\fP  \f(CW\\$3\fP  \f(CW\\$4\fP
..
.Ta SPARC kc kl ka
.Ta PowerPC qc ql qa
.Ta MIPS vc vl va
.Ta MIPS\ little-endian 0c 0l 0a
.Ta ARM 5c 5l 5a
.Ta AMD64 6c 6l 6a
.Ta Intel\ 386 8c 8l 8a
.Ta PowerPC\ 64-bit 9c 9l 9a
.DE
There is a further breakdown
in the source of the compilers into
object-independent and
object-dependent
parts.
All of the object-independent parts
are combined into source files in the
directory
.CW /sys/src/cmd/cc .
The object-dependent parts are collected
in a separate directory for each compiler,
for example
.CW /sys/src/cmd/vc .
All of the code,
both object-independent and
object-dependent,
is machine-independent
and may be cross-compiled and executed on any
of the architectures.
.NH
The Language
.LP
The compiler implements ANSI C with some
restrictions and extensions
[ANSI90].
Most of the restrictions are due to
personal preference, while
most of the extensions were to help in
the implementation of Plan 9.
There are other departures from the standard,
particularly in the libraries,
that are beyond the scope of this
paper.
.NH 2
Register, volatile, const
.LP
The keyword
.CW register
is recognized syntactically
but is semantically ignored.
Thus taking the address of a
.CW register
variable is not diagnosed.
The keyword
.CW volatile
disables all optimizations, in particular registerization, of the corresponding variable.
The keyword
.CW const
generates warnings (if warnings are enabled by the compiler's
.CW -w
option) of non-constant use of the variable,
but does not affect the generated code.
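.LP
For example (a minimal sketch; the exact wording of the
diagnostic varies), assigning to a
.CW const -qualified
variable draws a warning under
.CW -w
but the assignment is compiled anyway:
.P1
const int maxfd = 100;

void
f(void)
{
	maxfd = 200;	/* warned about when -w is given; code still generated */
}
.P2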
.NH 2
The preprocessor
.LP
The C preprocessor is probably the
biggest departure from the ANSI standard.
.LP
The preprocessor built into the Plan 9 compilers does not support
.CW #if ,
although it does handle
.CW #ifdef
and
.CW #include .
If it is necessary to be more standard,
the source text can first be run through the separate ANSI C
preprocessor,
.CW cpp .
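.LP
For example, using a hypothetical
.CW DEBUG
macro, only the first of these two conditionals is understood
by the built-in preprocessor:
.P1
#ifdef DEBUG		/* handled by the built-in preprocessor */
int debug = 1;
#endif

#if DEBUG > 1		/* not handled; run cpp over the source first */
int verbose = 1;
#endif
.P2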
.NH 2
Unnamed substructures
.LP
The most important and most heavily used of the
extensions is the declaration of an
unnamed substructure or subunion.
For example:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	typedef
	struct	lock
	{
		int    locked;
	} Lock;

	typedef
	struct	node
	{
		int	type;
		union
		{
			double dval;
			float  fval;
			long   lval;
		};
		Lock;
	} Node;

	Lock*	lock;
	Node*	node;
.DE
The declaration of
.CW Node
has an unnamed substructure of type
.CW Lock
and an unnamed subunion.
One use of this feature is that it allows elements of the
subunit to be accessed as if they were
members of the outer structure.
Thus
.CW node->dval
and
.CW node->locked
are legitimate references.
.LP
When an outer structure is used
in a context that is only legal for
an unnamed substructure,
the compiler promotes the reference to the
unnamed substructure.
This is true for references to structures and
for references to pointers to structures.
This happens in assignment statements and
in argument passing where prototypes have been
declared.
Thus, continuing with the example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	lock = node;
.DE
would assign a pointer to the unnamed
.CW Lock
in
the
.CW Node
to the variable
.CW lock .
Another example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	extern void lock(Lock*);
	func(...)
	{
		...
		lock(node);
		...
	}
.DE
will pass a pointer to the
.CW Lock
substructure.
.LP
Finally, in places where context is insufficient to identify the unnamed structure,
the type name (it must be a
.CW typedef )
of the unnamed structure can be used as an identifier.
In our example,
.CW &node->Lock
gives the address of the anonymous
.CW Lock
structure.
.NH 2
Structure displays
.LP
A structure cast followed by a list of expressions in braces is
an expression with the type of the structure and elements assigned from
the corresponding list.
Structures are now almost first-class citizens of the language.
It is common to see code like this:
.DS
.CW
.ta .1i
	r = (Rectangle){point1, (Point){x,y+2}};
.DE
.NH 2
Initialization indexes
.LP
In initializers of arrays,
one may place a constant expression
in square brackets before an initializer.
This causes the next initializer to assign
the indicated element.
For example:
.DS
.CW
.ta .1i .6i 1.6i
	enum	errors
	{
		Etoobig,
		Ealarm,
		Egreg
	};
	char* errstrings[] =
	{
		[Ealarm]	"Alarm call",
		[Egreg]	"Panic: out of mbufs",
		[Etoobig]	"Arg list too long",
	};
.DE
In the same way,
individual structure members may
be initialized in any order by preceding the initialization with
.CW .tagname .
Both forms allow an optional
.CW = ,
to be compatible with a proposed
extension to ANSI C.
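.LP
As an illustration (the
.CW Point
type here is hypothetical), members may be named out of order:
.P1
typedef struct { int x; int y; } Point;

Point p =
{
	.y	3,	/* the optional = form is .y = 3 */
	.x	4,
};
.P2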
.NH 2
External register
.LP
The declaration
.CW extern
.CW register
will dedicate a register to
a variable on a global basis.
It can be used only under special circumstances.
External register variables must be identically
declared in all modules and
libraries.
The feature is not intended for efficiency,
although it can produce efficient code;
rather it represents a unique storage class that
would be hard to get any other way.
On a shared-memory multi-processor,
an external register is
one-per-processor and neither one-per-procedure (automatic)
nor one-per-system (external).
It is used for two variables in the Plan 9 kernel,
.CW u
and
.CW m ;
.CW u
is a pointer to the structure representing the currently running process
and
.CW m
is a pointer to the per-machine data structure.
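.LP
Schematically, and with hypothetical type names,
the shared declaration looks like:
.P1
extern register User* u;	/* the current process */
extern register Mach* m;	/* this processor's private data */
.P2
Every module loaded into the kernel sees the same declaration,
satisfying the rule that external registers be declared
identically everywhere.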
.NH 2
Long long
.LP
The compilers accept
.CW long
.CW long
as a basic type meaning 64-bit integer.
On some of the machines
this type is synthesized from 32-bit instructions.
.NH 2
Pragma
.LP
The compilers accept
.CW #pragma
.CW lib
.I libname
and pass the
library name string uninterpreted
to the loader.
The loader uses the library name to
find libraries to load.
If the name contains
.CW $O ,
it is replaced with
the single character object type of the compiler
(e.g.,
.CW v
for the MIPS).
If the name contains
.CW $M ,
it is replaced with
the architecture type for the compiler
(e.g.,
.CW mips
for the MIPS).
If the name starts with
.CW /
it is an absolute pathname;
if it starts with
.CW .
then it is searched for in the loader's current directory.
Otherwise, the name is searched for in
.CW /$M/lib .
Such
.CW #pragma
statements in header files guarantee that the correct
libraries are always linked with a program without the
need to specify them explicitly at link time.
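.LP
For example, a header such as
.CW bio.h
can declare
.P1
#pragma	lib	"libbio.a"
.P2
so that a program compiled with
.CW vc
automatically links against
.CW /mips/lib/libbio.a .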
.LP
They also accept
.CW #pragma
.CW packed
.CW on
(or
.CW yes
or
.CW 1 )
to cause subsequently declared data, until
.CW #pragma
.CW packed
.CW off
(or
.CW no
or
.CW 0 ),
to be laid out in memory tightly packed in successive bytes, disregarding
the usual alignment rules.
Accessing such data can cause faults.
.LP
Similarly,
.CW #pragma
.CW profile
.CW off
(or
.CW no
or
.CW 0 )
causes subsequently declared functions, until
.CW #pragma
.CW profile
.CW on
(or
.CW yes
or
.CW 1 ),
to be marked as unprofiled.
Such functions will not be profiled when
profiling is enabled for the rest of the program.
.LP
Two
.CW #pragma
statements allow type-checking of
.CW print -like
functions.
The first, of the form
.P1
#pragma varargck argpos error 2
.P2
tells the compiler that the second argument to
.CW error
is a
.CW print
format string (see the manual page
.I print (2))
that specifies how to format
.CW error 's
subsequent arguments.
The second, of the form
.P1
#pragma varargck type "s" char*
.P2
says that the
.CW print
format verb
.CW s
processes an argument of
type
.CW char* .
If the compiler's
.CW -F
option is enabled, the compiler will use this information
to report type violations in the arguments to
.CW print ,
.CW error ,
and similar routines.
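.LP
For example, assuming a source file
.CW prog.c ,
the command
.P1
vc -Fw prog.c
.P2
enables both the format checks and the warnings described above.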
.NH
Object module conventions
.LP
The overall conventions of the runtime environment
are important
to runtime efficiency.
In this section,
several of these conventions are discussed.
.NH 2
Register saving
.LP
In the Plan 9 compilers,
the caller of a procedure saves the registers.
With caller-saves,
the leaf procedures can use all the
registers and never save them.
If you spend a lot of time at the leaves,
this seems preferable.
With callee-saves,
the saving of the registers is done
in the single point of entry and return.
If you are interested in space,
this seems preferable.
In both schemes,
there is a degree of uncertainty
about what registers need to be saved.
Callee-saved registers make it difficult to
find variables in registers in debuggers.
Callee-saved registers also complicate
the implementation of
.CW longjmp .
The convincing argument is
that with caller-saves,
the decision to registerize a variable
can include the cost of saving the register
across calls.
For a further discussion of caller- vs. callee-saves,
see the paper by Davidson and Whalley [Dav91].
.LP
In the Plan 9 operating system,
calls to the kernel look like normal procedure
calls, which means
the caller
has saved the registers and the system
entry does not have to.
This makes system calls considerably faster.
Since this is a potential security hole,
and can lead to non-determinism,
the system may eventually save the registers
on entry,
or more likely clear the registers on return.
.NH 2
Calling convention
.LP
Older C compilers maintain a frame pointer, which is at a known constant
offset from the stack pointer within each function.
For machines where the stack grows towards zero,
the argument pointer is at a known constant offset
from the frame pointer.
Since the stack grows down in Plan 9,
the Plan 9 compilers
keep neither an
explicit frame pointer nor
an explicit argument pointer;
instead they generate addresses relative to the stack pointer.
.LP
On some architectures, the first argument to a subroutine is passed in a register.
.NH 2
Functions returning structures
.LP
Structures longer than one word are awkward to implement
since they do not fit in registers and must
be passed around in memory.
Functions that return structures
are particularly clumsy.
The Plan 9 compilers pass the address of
the returned structure as the first argument of a
function that has a structure return value.
Thus
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	x = f(...)
.DE
is rewritten as
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	f(&x, ...)\f1.
.DE
This saves a copy and makes the compilation
much less clumsy.
A disadvantage is that if you call this
function without an assignment,
a dummy location must be invented.
.LP
There is also a danger of calling a function
that returns a structure without declaring
it as such.
With ANSI C function prototypes,
this error need never occur.
.NH
Implementation
.LP
The compiler is divided internally into
four machine-independent passes,
four machine-dependent passes,
and an output pass.
The next nine sections describe each pass in order.
.NH 2
Parsing
.LP
The first pass is a YACC-based parser
[Joh79].
Declarations are interpreted immediately,
building a block structured symbol table.
Executable statements are put into a parse tree
and collected,
without interpretation.
At the end of each procedure,
the parse tree for the function is
examined by the other passes of the compiler.
.LP
The input stream of the parser is
a pushdown list of input activations.
The preprocessor
expansions of
macros
and
.CW #include
are implemented as pushdowns.
Thus there is no separate
pass for preprocessing.
.NH 2
Typing
.LP
The next pass distributes typing information
to every node of the tree.
Implicit operations on the tree are added,
such as type promotions and taking the
address of arrays and functions.
.NH 2
Machine-independent optimization
.LP
The next pass performs optimizations
and transformations of the tree, such as converting
.CW &*x
and
.CW *&x
into
.CW x .
Constant expressions are converted to constants in this pass.
.NH 2
Arithmetic rewrites
.LP
This is another machine-independent optimization.
Subtrees of add, subtract, and multiply of integers are
rewritten for easier compilation.
The major transformation is factoring:
.CW 4+8*a+16*b+5
is transformed into
.CW 9+8*(a+2*b) .
Such expressions arise from address
manipulation and array indexing.
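.LP
For instance, given a hypothetical declaration
.CW "int a[10][10]"
and 4-byte integers, the address of
.CW a[i][j]
is
.CW a+40*i+4*j ,
which the factoring rewrite turns into
.CW a+4*(j+10*i) .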
.NH 2
Addressability
.LP
This is the first of the machine-dependent passes.
The addressability of a processor is defined as the set of
expressions that are legal in the address field
of a machine language instruction.
The addressability of different processors varies widely.
At one end of the spectrum are the 68020 and VAX,
which allow a complex mix of incrementing,
decrementing,
indexing, and relative addressing.
At the other end is the MIPS,
which allows only registers and constant offsets from the
contents of a register.
The addressability can be different for different instructions
within the same processor.
.LP
It is important to the code generator to know when a
subtree represents an address of a particular type.
This is done with a bottom-up walk of the tree.
In this pass, the leaves are labeled with small integers.
When an internal node is encountered,
it is labeled by consulting a table indexed by the
labels on the left and right subtrees.
For example,
on the 68020 processor,
it is possible to address an
offset from a named location.
In C, this is represented by the expression
.CW *(&name+constant) .
This is marked addressable by the following table.
In the table,
a node represented by the left column is marked
with a small integer from the right column.
Marks of the form
.CW A\s-2\di\u\s0
are addressable while
marks of the form
.CW N\s-2\di\u\s0
are not addressable.
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	name	A\s-2\d1\u\s0
	const	A\s-2\d2\u\s0
	&A\s-2\d1\u\s0	A\s-2\d3\u\s0
	A\s-2\d3\u\s0+A\s-2\d1\u\s0	N\s-2\d1\u\s0 \fR(note that this is not addressable)\fP
	*N\s-2\d1\u\s0	A\s-2\d4\u\s0
.DE
Here there is a distinction between
a node marked
.CW A\s-2\d1\u\s0
and a node marked
.CW A\s-2\d4\u\s0
because the address operator of an
.CW A\s-2\d4\u\s0
node is not addressable.
So to extend the table:
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	&A\s-2\d4\u\s0	N\s-2\d2\u\s0
	N\s-2\d2\u\s0+N\s-2\d1\u\s0	N\s-2\d1\u\s0
.DE
The full addressability of the 68020 is expressed
in 18 rules like this,
while the addressability of the MIPS is expressed
in 11 rules.
When one ports the compiler,
this table is usually initialized
so that leaves are labeled as addressable and nothing else.
The code produced is poor,
but porting is easy.
The table can be extended later.
.LP
This pass also rewrites some complex operators
into procedure calls.
Examples include 64-bit multiply and divide.
.LP
In the same bottom-up pass of the tree,
the nodes are labeled with a Sethi-Ullman complexity
[Set70].
This number is roughly the number of registers required
to compile the tree on an ideal machine.
An addressable node is marked 0.
A function call is marked infinite.
A unary operator is marked as the
maximum of 1 and the mark of its subtree.
A binary operator with equal marks on its subtrees is
marked with a subtree mark plus 1.
A binary operator with unequal marks on its subtrees is
marked with the maximum mark of its subtrees.
The actual values of the marks are not too important,
but the relative values are.
The goal is to compile the harder
(larger mark)
subtree first.
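.LP
The marking rule can be sketched in C as follows;
the node fields here are hypothetical and the real
compiler's data structures differ:
.P1
enum { SUINF = 1000 };	/* effectively infinite */

typedef struct Tnode Tnode;
struct Tnode
{
	int	addressable;	/* marked addressable? */
	int	call;		/* function call? */
	Tnode	*left, *right;	/* right is 0 for unary operators */
};

int
sethi(Tnode *n)
{
	int l, r;

	if(n->addressable)
		return 0;	/* addressable nodes cost no registers */
	if(n->call)
		return SUINF;	/* function calls are marked infinite */
	l = sethi(n->left);
	if(n->right == 0)	/* unary: max(1, mark of subtree) */
		return l > 1? l: 1;
	r = sethi(n->right);
	if(l == r)
		return l+1;	/* equal subtrees need one extra register */
	return l > r? l: r;	/* unequal: the larger mark dominates */
}
.P2
The code generator then visits the subtree with the larger mark first.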
.NH 2
Code generation
.LP
Code is generated by recursive
descent.
The Sethi-Ullman complexity completely guides the
order.
The addressability defines the leaves.
The only difficult part is compiling a tree
that has two infinite (function call)
subtrees.
In this case,
one subtree is compiled into the return register
(usually the most convenient place for a function call)
and then stored on the stack.
The other subtree is compiled into the return register
and then the operation is compiled with
operands from the stack and the return register.
.LP
There is a separate boolean code generator that compiles
conditional expressions.
This is fundamentally different from compiling an arithmetic expression.
The result of the boolean code generator is the
position of the program counter and not an expression.
The boolean code generator makes extensive use of De Morgan's rule.
The boolean code generator is an expanded version of that described
in chapter 8 of Aho, Sethi, and Ullman
[Aho87].
.LP
There is a considerable amount of talk in the literature
about automating this part of a compiler with a machine
description.
Since this code generator is so small
(less than 500 lines of C)
and easy,
it hardly seems worth the effort.
.NH 2
Registerization
.LP
Up to now,
the compiler has operated on syntax trees
that are roughly equivalent to the original source language.
The previous pass has produced machine language in an internal
format.
The next two passes operate on the internal machine language
structures.
The purpose of the next pass is to reintroduce
registers for heavily used variables.
.LP
All of the variables that can be
potentially registerized within a procedure are
placed in a table.
(Suitable variables are any automatic or external
scalars that do not have their addresses extracted.
Some constants that are hard to reference are also
considered for registerization.)
Four separate data flow equations are evaluated
over the procedure on all of these variables.
Two of the equations are the normal set-behind
and used-ahead
bits that define the life of a variable.
The two new bits tell if a variable life
crosses a function call ahead or behind.
By examining a variable over its lifetime,
it is possible to get a cost
for registerizing.
Loops are detected and the costs are multiplied
by three for every level of loop nesting.
Costs are sorted and the variables
are replaced by available registers on a greedy basis.
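Thus a reference inside a doubly nested loop weighs
nine times as much toward registerization as the same
reference outside any loop.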
.LP
The 68020 has two different
types of registers.
For the 68020,
two different costs are calculated for
each variable life and the register type that
affords the better cost is used.
Ties are broken by counting the number of available
registers of each type.
.LP
Note that externals are registerized together with automatics.
This is done by evaluating the semantics of a ``call'' instruction
differently for externals and automatics.
Since a call goes outside the local procedure,
it is assumed that a call references all externals.
Similarly,
externals are assumed to be set before an ``entry'' instruction
and assumed to be referenced after a ``return'' instruction.
This makes sure that externals are in memory across calls.
.LP
The overall results are satisfactory.
It would be nice to be able to do this processing in
a machine-independent way,
but it is impossible to get all of the costs and
side effects of different choices by examining the parse tree.
.LP
Most of the code in the registerization pass is machine-independent.
The major machine-dependency is in
examining a machine instruction to ask if it sets or references
a variable.
.NH 2
Machine code optimization
.LP
The next pass walks the machine code
for opportunistic optimizations.
For the most part,
this is highly specific to a particular
processor.
One optimization that is performed
on all of the processors is the
removal of unnecessary ``move''
instructions.
Ironically,
most of these instructions were inserted by
the previous pass.
There are two patterns that are repetitively
matched and replaced until no more matches are
found.
The first tries to remove ``move'' instructions
by relabeling variables.
.LP
When a ``move'' instruction is encountered,
if the destination variable is set before the
source variable is referenced,
then all of the references to the destination
variable can be renamed to the source and the ``move''
can be deleted.
This transformation uses the reverse data flow
set up in the previous pass.
.LP
An example of this pattern is depicted in the following
table.
The pattern is in the left column and the
replacement action is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	MOVE	a->b		\fR(remove)\fP
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	USE	b		USE	a
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	SET	b		SET	b
.DE
.LP
Experiments have shown that it is marginally
worthwhile to rename uses of the destination variable
with uses of the source variable up to
the first use of the source variable.
.LP
The second transform will do relabeling
without deleting instructions.
When a ``move'' instruction is encountered,
if the source variable has been set prior
to the use of the destination variable
then all of the references to the source
variable are replaced by the destination and
the ``move'' is inverted.
Typically,
this transformation will alter two ``move''
instructions and allow the first transformation
another chance to remove code.
This transformation uses the forward data flow
set up in the previous pass.
.LP
Again,
the following is a depiction of the transformation where
the pattern is in the left column and the
rewrite is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	SET	a		SET	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	USE	a		USE	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	MOVE	a->b		MOVE	b->a
.DE
Iterating these transformations
will usually get rid of all redundant ``move'' instructions.
.LP
A problem with this organization is that the costs
of registerization calculated in the previous pass
must depend on how well this pass can detect and remove
redundant instructions.
Often,
a fine candidate for registerization is rejected
because of the cost of instructions that are later
removed.
.NH 2
Writing the object file
.LP
The last pass walks the internal assembly language
and writes the object file.
The object file is reduced in size by about a factor
of three with simple compression
techniques.
The most important aspect of the object file
format is that it is independent of the compiling machine.
All integer and floating numbers in the object
code are converted to known formats and byte
orders.
.NH
The loader
.LP
The loader is a multiple pass program that
reads object files and libraries and produces
an executable binary.
The loader also does some minimal
optimizations and code rewriting.
Many of the operations performed by the
loader are machine-dependent.
.LP
The first pass of the loader reads the
object modules into an internal data
structure that looks like binary assembly language.
As the instructions are read,
code is reordered to remove
unconditional branch instructions.
Conditional branch instructions are inverted
to prevent the insertion of unconditional branches.
The loader will also make a copy of a few instructions
to remove an unconditional branch.
.LP
The next pass allocates addresses for
all external data.
Typical of processors is the MIPS,
which can reference ±32K bytes from a
register.
The loader allocates the register
.CW R30
as the static pointer.
The value placed in
.CW R30
is the base of the data segment plus 32K.
It is then cheap to reference all data in the
first 64K of the data segment.
External variables are allocated to
the data segment
with the smallest variables allocated first.
If all of the data cannot fit into the first
64K of the data segment,
then usually only a few large arrays
need more expensive addressing modes.
.LP
For the MIPS processor,
the loader makes a pass over the internal
structures,
exchanging instructions to try
to fill ``delay slots'' with useful work.
If a useful instruction cannot be found
to fill a delay slot,
the loader will insert
``noop''
instructions.
This pass is very expensive and does not
do a good job.
About 40% of all instructions are in
delay slots.
About 65% of these are useful instructions and
35% are ``noops.''
The vendor-supplied assembler does this job
more effectively,
filling about 80%
of the delay slots with useful instructions.
.LP
On the 68020 processor,
branch instructions come in a variety of
sizes depending on the relative distance
of the branch.
Thus the size of branch instructions
can be mutually dependent.
The loader uses a multiple pass algorithm
to resolve the branch lengths
[Szy78].
Initially, all branches are assumed minimal length.
On each subsequent pass,
the branches are reassessed
and expanded if necessary.
When no more expansions occur,
the locations of the instructions in
the text segment are known.
.LP
On the MIPS processor,
all instructions are one size.
A single pass over the instructions will
determine the locations of all addresses
in the text segment.
.LP
The last pass of the loader produces the
executable binary.
A symbol table and other tables are
produced to help the debugger to
interpret the binary symbolically.
.LP
The loader places absolute source line numbers in the symbol table.
The names and absolute line numbers of all
.CW #include
files are also placed in the
symbol table so that the debuggers can
associate object code with source files.
.NH
Performance
.LP
The following is a table of the source size of the MIPS
compiler.
.DS
.ta .1i .6i
	lines	module
	\0509	machine-independent headers
	1070	machine-independent YACC source
	6090	machine-independent C source

	\0545	machine-dependent headers
	6532	machine-dependent C source

	\0298	loader headers
	5215	loader C source
.DE
.LP
The following table shows timing
of a test program
that plays checkers, running on a MIPS R4000.
The test program is 26 files totaling 12600 lines of C.
The execution time does not significantly
depend on library implementation.
Since no other compiler runs on Plan 9,
the Plan 9 tests were done with the Plan 9 operating system;
the other tests were done on the vendor's operating system.
The hardware was identical in both cases.
The optimizer in the vendor's compiler
is reputed to be extremely good.
.DS
.ta .1i .9i
	\0\04.49s	Plan 9 \f(CWvc\fP \f(CW-N\fP compile time (opposite of \f(CW-O\fP)
	\0\01.72s	Plan 9 \f(CWvc\fP \f(CW-N\fP load time
	148.69s	Plan 9 \f(CWvc\fP \f(CW-N\fP run time

	\015.07s	Plan 9 \f(CWvc\fP compile time (\f(CW-O\fP implicit)
	\0\01.66s	Plan 9 \f(CWvc\fP load time
	\089.96s	Plan 9 \f(CWvc\fP run time

	\014.83s	vendor \f(CWcc\fP compile time
	\0\00.38s	vendor \f(CWcc\fP load time
	104.75s	vendor \f(CWcc\fP run time

	\043.59s	vendor \f(CWcc\fP \f(CW-O\fP compile time
	\0\00.38s	vendor \f(CWcc\fP \f(CW-O\fP load time
	\076.19s	vendor \f(CWcc\fP \f(CW-O\fP run time

	\0\08.19s	vendor \f(CWcc\fP \f(CW-O3\fP compile time
	\035.97s	vendor \f(CWcc\fP \f(CW-O3\fP load time
	\071.16s	vendor \f(CWcc\fP \f(CW-O3\fP run time
.DE
.LP
As a comparison for the Intel compiler,
a program that is about 40% bit manipulation and
about 60% single precision floating point was
run on the same 33 MHz 486:
once under Windows, compiled in 16-bit mode
with the Watcom compiler, version 10.0,
and once under Plan 9 in 32-bit mode.
The Plan 9 execution time was 27 sec, while the Windows
execution time was 31 sec.
.NH
Conclusions
.LP
The new compilers compile
quickly,
load slowly,
and produce
medium quality
object code.
The compilers are relatively
portable,
requiring but a couple of weeks' work to
produce a compiler for a different computer.
For Plan 9,
where we needed several compilers
with specialized features and
our own object formats,
this project was indispensable.
It is also necessary for us to
be able to freely distribute our compilers
with the Plan 9 distribution.
.LP
Two problems have come up in retrospect.
The first has to do with the
division of labor between compiler and loader.
Plan 9 runs on multi-processors and as such
compilations are often done in parallel.
Unfortunately,
all compilations must be complete before loading
can begin.
The load is then single-threaded.
With this model,
any shift of work from compile to load
results in a significant increase in real time.
The same is true of libraries that are compiled
infrequently and loaded often.
In the future,
we may try to put some of the loader work
back into the compiler.
.LP
The second problem comes from
the various optimizations performed over several
passes.
Often optimizations in different passes depend
on each other.
Iterating the passes could compromise efficiency,
or even loop.
We see no real solution to this problem.
.NH
References
.LP
[Aho87] A. V. Aho, R. Sethi, and J. D. Ullman,
.I
Compilers \- Principles, Techniques, and Tools,
.R
Addison Wesley,
Reading, MA,
1987.
.LP
[ANSI90] \f2American National Standard for Information Systems \-
Programming Language C\f1, American National Standards Institute, Inc.,
New York, 1990.
.LP
[Dav91] J. W. Davidson and D. B. Whalley,
``Methods for Saving and Restoring Register Values across Function Calls'',
.I
Software\-Practice and Experience,
.R
Vol 21(2), pp. 149-165, February 1991.
.LP
[Joh79] S. C. Johnson,
``YACC \- Yet Another Compiler Compiler'',
.I
UNIX Programmer's Manual, Seventh Ed., Vol. 2A,
.R
AT&T Bell Laboratories,
Murray Hill, NJ,
1979.
.LP
[Set70] R. Sethi and J. D. Ullman,
``The Generation of Optimal Code for Arithmetic Expressions'',
.I
Journal of the ACM,
.R
Vol 17(4), pp. 715-728, 1970.
.LP
[Szy78] T. G. Szymanski,
``Assembling Code for Machines with Span-dependent Instructions'',
.I
Communications of the ACM,
.R
Vol 21(4), pp. 300-308, 1978.