xref: /plan9/sys/doc/compiler.ms (revision ec59a3ddbfceee0efe34584c2c9981a5e5ff1ec4)
.HTML "Plan 9 C Compilers
.TL
Plan 9 C Compilers
.AU
Ken Thompson
ken@plan9.bell-labs.com
.AB
.FS
Originally appeared, in a different form, in
.I
Proceedings of the Summer 1990 UKUUG Conference,
.R
pp. 41-51,
London, 1990.
.FE
This paper describes the overall structure and function of the Plan 9 C compilers.
A more detailed implementation document
for any one of the compilers
is yet to be written.
.AE
.NH
Introduction
.LP
There are many compilers in the series.
Six of the compilers (MIPS 3000, SPARC, Intel 386, Power PC, DEC Alpha, and Motorola 68020)
are considered active and are used to compile
current versions of Plan 9.
Several others (Motorola 68000, Intel 960, ARM 7500, AMD 29000) have had only limited use, such as
to program peripherals or experimental devices.
.NH
Structure
.LP
The compiler is a single program that produces an
object file.
Combined in the compiler are the traditional
roles of preprocessor, lexical analyzer, parser, code generator,
local optimizer,
and first half of the assembler.
The object files are binary forms of assembly
language,
similar to what might be passed between
the first and second passes of an assembler.
.LP
Object files and libraries
are combined by a loader
program to produce the executable binary.
The loader combines the roles of second half
of the assembler, global optimizer, and loader.
The names of the compilers, loaders, and assemblers
are as follows:
.DS
.ta 1.5i
.de Ta
\\$1	\f(CW\\$2\fP  \f(CW\\$3\fP  \f(CW\\$4\fP
..
.Ta SPARC kc kl ka
.Ta Power\ PC qc ql qa
.Ta MIPS vc vl va
.Ta Motorola\ 68000 1c 1l 1a
.Ta Motorola\ 68020 2c 2l 2a
.Ta ARM\ 7500 5c 5l 5a
.Ta Intel\ 960 6c 6l 6a
.Ta DEC\ Alpha 7c 7l 7a
.Ta Intel\ 386 8c 8l 8a
.Ta AMD\ 29000 9c 9l 9a
.DE
There is a further breakdown
in the source of the compilers into
object-independent and
object-dependent
parts.
All of the object-independent parts
are combined into source files in the
directory
.CW /sys/src/cmd/cc .
The object-dependent parts are collected
in a separate directory for each compiler,
for example
.CW /sys/src/cmd/vc .
All of the code,
both object-independent and
object-dependent,
is machine-independent
and may be cross-compiled and executed on any
of the architectures.
.NH
The Language
.LP
The compiler implements ANSI C with some
restrictions and extensions
[ANSI90].
Most of the restrictions are due to
personal preference, while
most of the extensions were to help in
the implementation of Plan 9.
There are other departures from the standard,
particularly in the libraries,
that are beyond the scope of this
paper.
.NH 2
Register, volatile, const
.LP
The keyword
.CW register
is recognized syntactically
but is semantically ignored.
Thus taking the address of a
.CW register
variable is not diagnosed.
The keyword
.CW volatile
disables all optimizations, in particular registerization, of the corresponding variable.
The keyword
.CW const
generates warnings (if warnings are enabled by the compiler's
.CW -w
option) of non-constant use of the variable,
but does not affect the generated code.
.NH 2
The preprocessor
.LP
The C preprocessor is probably the
biggest departure from the ANSI standard.
.LP
The preprocessor built into the Plan 9 compilers does not support
.CW #if ,
although it does handle
.CW #ifdef
and
.CW #include .
If it is necessary to be more standard,
the source text can first be run through the separate ANSI C
preprocessor,
.CW cpp .
.NH 2
Unnamed substructures
.LP
The most important and most heavily used of the
extensions is the declaration of an
unnamed substructure or subunion.
For example:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	typedef
	struct	lock
	{
		int    locked;
	} Lock;

	typedef
	struct	node
	{
		int	type;
		union
		{
			double dval;
			float  fval;
			long   lval;
		};
		Lock;
	} Node;

	Lock*	lock;
	Node*	node;
.DE
The declaration of
.CW Node
has an unnamed substructure of type
.CW Lock
and an unnamed subunion.
One use of this feature allows references to elements of the
subunit to be accessed as if they were in
the outer structure.
Thus
.CW node->dval
and
.CW node->locked
are legitimate references.
.LP
When an outer structure is used
in a context that is only legal for
an unnamed substructure,
the compiler promotes the reference to the
unnamed substructure.
This is true for references to structures and
for references to pointers to structures.
This happens in assignment statements and
in argument passing where prototypes have been
declared.
Thus, continuing with the example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	lock = node;
.DE
would assign a pointer to the unnamed
.CW Lock
in
the
.CW Node
to the variable
.CW lock .
Another example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	extern void lock(Lock*);
	func(...)
	{
		...
		lock(node);
		...
	}
.DE
will pass a pointer to the
.CW Lock
substructure.
.LP
Finally, in places where context is insufficient to identify the unnamed structure,
the type name (it must be a
.CW typedef )
of the unnamed structure can be used as an identifier.
In our example,
.CW &node->Lock
gives the address of the anonymous
.CW Lock
structure.
.NH 2
Structure displays
.LP
A structure cast followed by a list of expressions in braces is
an expression with the type of the structure and elements assigned from
the corresponding list.
Structures are now almost first-class citizens of the language.
It is common to see code like this:
.DS
.CW
.ta .1i
	r = (Rectangle){point1, (Point){x,y+2}};
.DE
.NH 2
Initialization indexes
.LP
In initializers of arrays,
one may place a constant expression
in square brackets before an initializer.
This causes the next initializer to assign
the indicated element.
For example:
.DS
.CW
.ta .1i .6i 1.6i
	enum	errors
	{
		Etoobig,
		Ealarm,
		Egreg
	};
	char* errstrings[] =
	{
		[Ealarm]	"Alarm call",
		[Egreg]	"Panic: out of mbufs",
		[Etoobig]	"Arg list too long",
	};
.DE
In the same way,
individual structure members may
be initialized in any order by preceding the initialization with
.CW .tagname .
Both forms allow an optional
.CW = ,
to be compatible with a proposed
extension to ANSI C.
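.LP
Both extensions later appeared, in near-identical form, as the designated
initializers standardized in C99, where the
.CW =
is required.
A minimal sketch of the two forms above in standard C:

```c
#include <string.h>

enum errors { Etoobig, Ealarm, Egreg };

/* Array elements picked out by index, in any order. */
static const char *errstrings[] = {
	[Ealarm]  = "Alarm call",
	[Egreg]   = "Panic: out of mbufs",
	[Etoobig] = "Arg list too long",
};

/* Individual structure members picked out by .tagname. */
struct point { int x, y; };
static const struct point pt = { .y = 2, .x = 1 };

const char *errstr(int e) { return errstrings[e]; }
```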
.NH 2
External register
.LP
The declaration
.CW extern
.CW register
will dedicate a register to
a variable on a global basis.
It can be used only under special circumstances.
External register variables must be identically
declared in all modules and
libraries.
The feature is not intended for efficiency,
although it can produce efficient code;
rather it represents a unique storage class that
would be hard to get any other way.
On a shared-memory multi-processor,
an external register is
one-per-processor and neither one-per-procedure (automatic)
nor one-per-system (external).
It is used for two variables in the Plan 9 kernel,
.CW u
and
.CW m .
.CW U
is a pointer to the structure representing the currently running process
and
.CW m
is a pointer to the per-machine data structure.
.NH 2
Long long
.LP
The compilers accept
.CW long
.CW long
as a basic type meaning 64-bit integer.
On all of the machines
this type is synthesized from 32-bit instructions.
.NH 2
Pragma
.LP
The compilers accept
.CW #pragma
.CW lib
.I libname
and pass the
library name string uninterpreted
to the loader.
The loader uses the library name to
find libraries to load.
If the name contains
.CW %O ,
it is replaced with
the single character object type of the compiler
(e.g.,
.CW v
for the MIPS).
If the name contains
.CW %M ,
it is replaced with
the architecture type for the compiler
(e.g.,
.CW mips
for the MIPS).
If the name starts with
.CW /
it is an absolute pathname;
if it starts with
.CW .
then it is searched for in the loader's current directory.
Otherwise, the name is searched from
.CW /%M/lib .
Such
.CW #pragma
statements in header files guarantee that the correct
libraries are always linked with a program without the
need to specify them explicitly at link time.
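.LP
The substitution itself is simple string rewriting.
The following is a hypothetical sketch of it; the function
.CW expandlib
and its interface are invented for illustration and are not the loader's
actual code.

```c
#include <string.h>

/* Expand a #pragma lib name: %O becomes the single-character
 * object type, %M the architecture name.  Illustrative only. */
static void
expandlib(const char *name, char objtype, const char *mach,
	char *out, size_t n)
{
	size_t i = 0;

	for (const char *p = name; *p && i + 1 < n; p++) {
		if (p[0] == '%' && p[1] == 'O') {
			out[i++] = objtype;
			p++;
		} else if (p[0] == '%' && p[1] == 'M') {
			size_t m = strlen(mach);
			if (i + m + 1 >= n)
				break;
			memcpy(out + i, mach, m);
			i += m;
			p++;
		} else
			out[i++] = *p;
	}
	out[i] = '\0';
}
```

For the MIPS, a name such as /%M/lib/lib%O.a would expand to /mips/lib/libv.a.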
.LP
They also accept
.CW #pragma
.CW hjdicks
.CW on
(or
.CW yes
or
.CW 1 )
to cause subsequently declared data, until
.CW #pragma
.CW hjdicks
.CW off
(or
.CW no
or
.CW 0 ),
to be laid out in memory tightly packed in successive bytes, disregarding
the usual alignment rules.
Accessing such data can cause faults.
.LP
Similarly,
.CW #pragma
.CW profile
.CW off
(or
.CW no
or
.CW 0 )
causes subsequently declared functions, until
.CW #pragma
.CW profile
.CW on
(or
.CW yes
or
.CW 1 ),
to be marked as unprofiled.
Such functions will not be profiled when
profiling is enabled for the rest of the program.
.LP
Two
.CW #pragma
statements allow type-checking of
.CW print -like
functions.
The first, of the form
.P1
#pragma varargck argpos error 2
.P2
tells the compiler that the second argument to
.CW error
is a
.CW print
format string (see the manual page
.I print (2))
that specifies how to format
.CW error 's
subsequent arguments.
The second, of the form
.P1
#pragma varargck type "s" char*
.P2
says that the
.CW print
format verb
.CW s
processes an argument of
type
.CW char* .
If the compiler's
.CW -F
option is enabled, the compiler will use this information
to report type violations in the arguments to
.CW print ,
.CW error ,
and similar routines.
.NH
Object module conventions
.LP
The overall conventions of the runtime environment
are important
to runtime efficiency.
In this section,
several of these conventions are discussed.
.NH 2
Register saving
.LP
In the Plan 9 compilers,
the caller of a procedure saves the registers.
With caller-saves,
the leaf procedures can use all the
registers and never save them.
If you spend a lot of time at the leaves,
this seems preferable.
With callee-saves,
the saving of the registers is done
in the single point of entry and return.
If you are interested in space,
this seems preferable.
In both,
there is a degree of uncertainty
about what registers need to be saved.
Callee-saved registers make it difficult to
find variables in registers in debuggers.
Callee-saved registers also complicate
the implementation of
.CW longjmp .
The convincing argument is
that with caller-saves,
the decision to registerize a variable
can include the cost of saving the register
across calls.
For a further discussion of caller- vs. callee-saves,
see the paper by Davidson and Whalley [Dav91].
.LP
In the Plan 9 operating system,
calls to the kernel look like normal procedure
calls, which means
the caller
has saved the registers and the system
entry does not have to.
This makes system calls considerably faster.
Since this is a potential security hole,
and can lead to non-determinism,
the system may eventually save the registers
on entry,
or more likely clear the registers on return.
.NH 2
Calling convention
.LP
Older C compilers maintain a frame pointer, which is at a known constant
offset from the stack pointer within each function.
For machines where the stack grows towards zero,
the argument pointer is at a known constant offset
from the frame pointer.
Since the stack grows down in Plan 9,
the Plan 9 compilers
keep neither an
explicit frame pointer nor
an explicit argument pointer;
instead they generate addresses relative to the stack pointer.
.LP
On some architectures, the first argument to a subroutine is passed in a register.
.NH 2
Functions returning structures
.LP
Structures longer than one word are awkward to implement
since they do not fit in registers and must
be passed around in memory.
Functions that return structures
are particularly clumsy.
The Plan 9 compilers pass the return address of
a structure as the first argument of a
function that has a structure return value.
Thus
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	x = f(...)
.DE
is rewritten as
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	f(&x, ...)\f1.
.DE
This saves a copy and makes the compilation
much less clumsy.
A disadvantage is that if you call this
function without an assignment,
a dummy location must be invented.
.LP
There is also a danger of calling a function
that returns a structure without declaring
it as such.
With ANSI C function prototypes,
this error need never occur.
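.LP
The rewrite can be imitated by hand in portable C.
In this sketch (the names are invented for illustration),
the second function is the compiled form of the first:

```c
typedef struct Point { int x, y; } Point;

/* Source form:   Point mkpoint(int x, int y) { ... }
 * Compiled form: the result address arrives as a hidden first
 * argument, so no structure copy is needed on return. */
static void
mkpoint(Point *ret, int x, int y)
{
	ret->x = x;
	ret->y = y;
}
```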
.NH
Implementation
.LP
The compiler is divided internally into
four machine-independent passes,
four machine-dependent passes,
and an output pass.
The next nine sections describe each pass in order.
.NH 2
Parsing
.LP
The first pass is a YACC-based parser
[Joh79].
Declarations are interpreted immediately,
building a block structured symbol table.
Executable statements are put into a parse tree
and collected,
without interpretation.
At the end of each procedure,
the parse tree for the function is
examined by the other passes of the compiler.
.LP
The input stream of the parser is
a pushdown list of input activations.
The preprocessor
expansions of
macros
and
.CW #include
are implemented as pushdowns.
Thus there is no separate
pass for preprocessing.
.NH 2
Typing
.LP
The next pass distributes typing information
to every node of the tree.
Implicit operations on the tree are added,
such as type promotions and taking the
address of arrays and functions.
.NH 2
Machine-independent optimization
.LP
The next pass performs optimizations
and transformations of the tree, such as converting
.CW &*x
and
.CW *&x
into
.CW x .
Constant expressions are converted to constants in this pass.
.NH 2
Arithmetic rewrites
.LP
This is another machine-independent optimization.
Subtrees of add, subtract, and multiply of integers are
rewritten for easier compilation.
The major transformation is factoring:
.CW 4+8*a+16*b+5
is transformed into
.CW 9+8*(a+2*b) .
Such expressions arise from address
manipulation and array indexing.
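.LP
The rewrite preserves the value exactly for integer operands.
A quick check of the example, with the two forms written out as C functions:

```c
/* 4+8*a+16*b+5 factored as 9+8*(a+2*b): the same function of
 * a and b, computed with fewer multiplications. */
static long unfactored(long a, long b) { return 4 + 8*a + 16*b + 5; }
static long factored(long a, long b)   { return 9 + 8*(a + 2*b); }
```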
.NH 2
Addressability
.LP
This is the first of the machine-dependent passes.
The addressability of a processor is defined as the set of
expressions that is legal in the address field
of a machine language instruction.
The addressability of different processors varies widely.
At one end of the spectrum are the 68020 and VAX,
which allow a complex mix of incrementing,
decrementing,
indexing, and relative addressing.
At the other end is the MIPS,
which allows only registers and constant offsets from the
contents of a register.
The addressability can be different for different instructions
within the same processor.
.LP
It is important to the code generator to know when a
subtree represents an address of a particular type.
This is done with a bottom-up walk of the tree.
In this pass, the leaves are labeled with small integers.
When an internal node is encountered,
it is labeled by consulting a table indexed by the
labels on the left and right subtrees.
For example,
on the 68020 processor,
it is possible to address an
offset from a named location.
In C, this is represented by the expression
.CW *(&name+constant) .
This is marked addressable by the following table.
In the table,
a node represented by the left column is marked
with a small integer from the right column.
Marks of the form
.CW A\s-2\di\u\s0
are addressable while
marks of the form
.CW N\s-2\di\u\s0
are not addressable.
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	name	A\s-2\d1\u\s0
	const	A\s-2\d2\u\s0
	&A\s-2\d1\u\s0	A\s-2\d3\u\s0
	A\s-2\d3\u\s0+A\s-2\d1\u\s0	N\s-2\d1\u\s0 \fR(note that this is not addressable)\fP
	*N\s-2\d1\u\s0	A\s-2\d4\u\s0
.DE
Here there is a distinction between
a node marked
.CW A\s-2\d1\u\s0
and a node marked
.CW A\s-2\d4\u\s0
because the address operator of an
.CW A\s-2\d4\u\s0
node is not addressable.
So to extend the table:
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	&A\s-2\d4\u\s0	N\s-2\d2\u\s0
	N\s-2\d2\u\s0+N\s-2\d1\u\s0	N\s-2\d1\u\s0
.DE
The full addressability of the 68020 is expressed
in 18 rules like this,
while the addressability of the MIPS is expressed
in 11 rules.
When one ports the compiler,
this table is usually initialized
so that leaves are labeled as addressable and nothing else.
The code produced is poor,
but porting is easy.
The table can be extended later.
.LP
This pass also rewrites some complex operators
into procedure calls.
Examples include 64-bit multiply and divide.
.LP
In the same bottom-up pass of the tree,
the nodes are labeled with a Sethi-Ullman complexity
[Set70].
This number is roughly the number of registers required
to compile the tree on an ideal machine.
An addressable node is marked 0.
A function call is marked infinite.
A unary operator is marked as the
maximum of 1 and the mark of its subtree.
A binary operator with equal marks on its subtrees is
marked with a subtree mark plus 1.
A binary operator with unequal marks on its subtrees is
marked with the maximum mark of its subtrees.
The actual values of the marks are not too important,
but the relative values are.
The goal is to compile the harder
(larger mark)
subtree first.
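.LP
The marking rules fit in a few lines of code.
This sketch computes the mark bottom-up; the tree representation is
invented for illustration and is not the compiler's own.

```c
#include <limits.h>

enum { INF = INT_MAX };		/* stands in for "infinite" */

typedef struct Tree Tree;
struct Tree { int iscall, isleaf; Tree *l, *r; };

/* Sethi-Ullman mark: addressable leaves 0, calls infinite,
 * unary max(1, sub), binary sub+1 on equal marks, else the max. */
static int
sumark(Tree *t)
{
	if (t->iscall)
		return INF;
	if (t->isleaf)
		return 0;		/* addressable node */
	if (t->r == 0) {		/* unary operator */
		int m = sumark(t->l);
		return m < 1 ? 1 : m;
	}
	int l = sumark(t->l), r = sumark(t->r);
	if (l == r)
		return l == INF ? INF : l + 1;
	return l > r ? l : r;
}
```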
.NH 2
Code generation
.LP
Code is generated by recursive
descent.
The Sethi-Ullman complexity completely guides the
order.
The addressability defines the leaves.
The only difficult part is compiling a tree
that has two infinite (function call)
subtrees.
In this case,
one subtree is compiled into the return register
(usually the most convenient place for a function call)
and then stored on the stack.
The other subtree is compiled into the return register
and then the operation is compiled with
operands from the stack and the return register.
.LP
There is a separate boolean code generator that compiles
conditional expressions.
This is fundamentally different from compiling an arithmetic expression.
The result of the boolean code generator is the
position of the program counter and not an expression.
The boolean code generator makes extensive use of De Morgan's rule.
The boolean code generator is an expanded version of that described
in chapter 8 of Aho, Sethi, and Ullman
[Aho87].
.LP
There is a considerable amount of talk in the literature
about automating this part of a compiler with a machine
description.
Since this code generator is so small
(less than 500 lines of C)
and easy,
it hardly seems worth the effort.
.NH 2
Registerization
.LP
Up to now,
the compiler has operated on syntax trees
that are roughly equivalent to the original source language.
The previous pass has produced machine language in an internal
format.
The next two passes operate on the internal machine language
structures.
The purpose of the next pass is to reintroduce
registers for heavily used variables.
.LP
All of the variables that can be
potentially registerized within a procedure are
placed in a table.
(Suitable variables are any automatic or external
scalars that do not have their addresses extracted.
Some constants that are hard to reference are also
considered for registerization.)
Four separate data flow equations are evaluated
over the procedure on all of these variables.
Two of the equations are the normal set-behind
and used-ahead
bits that define the life of a variable.
The two new bits tell if a variable life
crosses a function call ahead or behind.
By examining a variable over its lifetime,
it is possible to get a cost
for registerizing.
Loops are detected and the costs are multiplied
by three for every level of loop nesting.
Costs are sorted and the variables
are replaced by available registers on a greedy basis.
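.LP
The weighting is easy to state.
This sketch counts each use at weight 3 to the power of its loop depth;
the structures are invented for illustration, since the real pass derives
uses from the internal machine code and its data flow bits.

```c
typedef struct Use Use;
struct Use { int loopdepth; };		/* one record per reference */

/* Cost of keeping a variable in memory: each use counts 1,
 * multiplied by 3 for every level of loop nesting. */
static long
regcost(Use *uses, int nuses)
{
	long cost = 0;

	for (int i = 0; i < nuses; i++) {
		long w = 1;
		for (int d = 0; d < uses[i].loopdepth; d++)
			w *= 3;
		cost += w;
	}
	return cost;
}
```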
.LP
The 68020 has two different
types of registers.
For the 68020,
two different costs are calculated for
each variable life and the register type that
affords the better cost is used.
Ties are broken by counting the number of available
registers of each type.
.LP
Note that externals are registerized together with automatics.
This is done by evaluating the semantics of a ``call'' instruction
differently for externals and automatics.
Since a call goes outside the local procedure,
it is assumed that a call references all externals.
Similarly,
externals are assumed to be set before an ``entry'' instruction
and assumed to be referenced after a ``return'' instruction.
This makes sure that externals are in memory across calls.
.LP
The overall results are satisfactory.
It would be nice to be able to do this processing in
a machine-independent way,
but it is impossible to get all of the costs and
side effects of different choices by examining the parse tree.
.LP
Most of the code in the registerization pass is machine-independent.
The major machine-dependency is in
examining a machine instruction to ask if it sets or references
a variable.
.NH 2
Machine code optimization
.LP
The next pass walks the machine code
for opportunistic optimizations.
For the most part,
this is highly specific to a particular
processor.
One optimization that is performed
on all of the processors is the
removal of unnecessary ``move''
instructions.
Ironically,
most of these instructions were inserted by
the previous pass.
There are two patterns that are repetitively
matched and replaced until no more matches are
found.
The first tries to remove ``move'' instructions
by relabeling variables.
.LP
When a ``move'' instruction is encountered,
if the destination variable is set before the
source variable is referenced,
then all of the references to the destination
variable can be renamed to the source and the ``move''
can be deleted.
This transformation uses the reverse data flow
set up in the previous pass.
.LP
An example of this pattern is depicted in the following
table.
The pattern is in the left column and the
replacement action is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	MOVE	a->b		\fR(remove)\fP
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	USE	b		USE	a
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	SET	b		SET	b
.DE
.LP
Experiments have shown that it is marginally
worthwhile to rename uses of the destination variable
with uses of the source variable up to
the first use of the source variable.
.LP
The second transform will do relabeling
without deleting instructions.
When a ``move'' instruction is encountered,
if the source variable has been set prior
to the use of the destination variable
then all of the references to the source
variable are replaced by the destination and
the ``move'' is inverted.
Typically,
this transformation will alter two ``move''
instructions and allow the first transformation
another chance to remove code.
This transformation uses the forward data flow
set up in the previous pass.
.LP
Again,
the following is a depiction of the transformation where
the pattern is in the left column and the
rewrite is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	SET	a		SET	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	USE	a		USE	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	MOVE	a->b		MOVE	b->a
.DE
Iterating these transformations
will usually get rid of all redundant ``move'' instructions.
.LP
A problem with this organization is that the costs
of registerization calculated in the previous pass
must depend on how well this pass can detect and remove
redundant instructions.
Often,
a fine candidate for registerization is rejected
because of the cost of instructions that are later
removed.
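.LP
The first transformation can be sketched over a toy instruction list.
The representation below is invented for illustration; the real pass
works on the compiler's internal machine language and its data flow bits.

```c
enum { MOVE, USE, SET };	/* toy opcodes */

typedef struct Ins Ins;
struct Ins { int op; int src, dst; };	/* -1 marks an unused field */

/* First rewrite from the table above, in miniature: starting at
 * ins[i], a MOVE a->b, rename following uses of b to a and delete
 * the MOVE, provided a is not mentioned before b is set again. */
static int
elidemove(Ins *ins, int n, int i)
{
	int a = ins[i].src, b = ins[i].dst;

	for (int j = i + 1; j < n; j++) {
		if (ins[j].src == a || ins[j].dst == a)
			return 0;	/* a mentioned: give up */
		if (ins[j].op == SET && ins[j].dst == b)
			break;		/* b is set: pattern matched */
		if (ins[j].op == USE && ins[j].src == b)
			ins[j].src = a;	/* rename b to a */
	}
	ins[i].op = -1;			/* delete the MOVE */
	return 1;
}
```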
.NH 2
Writing the object file
.LP
The last pass walks the internal assembly language
and writes the object file.
The object file is reduced in size by about a factor
of three with simple compression
techniques.
The most important aspect of the object file
format is that it is independent of the compiling machine.
All integer and floating numbers in the object
code are converted to known formats and byte
orders.
.NH
The loader
.LP
The loader is a multiple pass program that
reads object files and libraries and produces
an executable binary.
The loader also does some minimal
optimizations and code rewriting.
Many of the operations performed by the
loader are machine-dependent.
.LP
The first pass of the loader reads the
object modules into an internal data
structure that looks like binary assembly language.
As the instructions are read,
code is reordered to remove
unconditional branch instructions.
Conditional branch instructions are inverted
to prevent the insertion of unconditional branches.
The loader will also make a copy of a few instructions
to remove an unconditional branch.
.LP
The next pass allocates addresses for
all external data.
Typical of processors is the MIPS,
which can reference ±32K bytes from a
register.
The loader allocates the register
.CW R30
as the static pointer.
The value placed in
.CW R30
is the base of the data segment plus 32K.
It is then cheap to reference all data in the
first 64K of the data segment.
External variables are allocated to
the data segment
with the smallest variables allocated first.
If all of the data cannot fit into the first
64K of the data segment,
then usually only a few large arrays
need more expensive addressing modes.
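.LP
The policy is a simple greedy one.
This sketch (illustrative only, not the loader's code) allocates
smallest first and reports how many variables land in the cheap window:

```c
#include <stdlib.h>

static int
cmpsize(const void *a, const void *b)
{
	long x = *(const long *)a, y = *(const long *)b;
	return x < y ? -1 : x > y;
}

/* Allocate smallest variables first; return how many fit in the
 * cheap 64K window reachable from the static base register. */
static int
nfit(long *sizes, int n, long window)
{
	long used = 0;
	int i;

	qsort(sizes, n, sizeof sizes[0], cmpsize);
	for (i = 0; i < n; i++) {
		if (used + sizes[i] > window)
			break;
		used += sizes[i];
	}
	return i;
}
```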
.LP
For the MIPS processor,
the loader makes a pass over the internal
structures,
exchanging instructions to try
to fill ``delay slots'' with useful work.
If a useful instruction cannot be found
to fill a delay slot,
the loader will insert
``noop''
instructions.
This pass is very expensive and does not
do a good job.
About 40% of all instructions are in
delay slots.
About 65% of these are useful instructions and
35% are ``noops.''
The vendor-supplied assembler does this job
more effectively,
filling about 80%
of the delay slots with useful instructions.
.LP
On the 68020 processor,
branch instructions come in a variety of
sizes depending on the relative distance
of the branch.
Thus the size of branch instructions
can be mutually dependent.
The loader uses a multiple pass algorithm
to resolve the branch lengths
[Szy78].
Initially, all branches are assumed minimal length.
On each subsequent pass,
the branches are reassessed
and expanded if necessary.
When no more expansions occur,
the locations of the instructions in
the text segment are known.
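.LP
The algorithm converges because branch sizes only ever grow.
A sketch with invented sizes (2-byte short branch, 6-byte long branch,
4-byte ordinary instruction); the representation is illustrative,
not the loader's own:

```c
typedef struct B B;
struct B { int isbranch, target; int size, addr; };	/* target: index */

/* Span-dependent branch sizing [Szy78]: start every branch short,
 * assign addresses, widen any branch whose displacement is out of
 * range, and repeat until no more expansions occur. */
static void
sizebranches(B *b, int n, int shortrange)
{
	int changed, i, pc;

	for (i = 0; i < n; i++)
		b[i].size = b[i].isbranch ? 2 : 4;	/* assume minimal */
	do {
		changed = 0;
		for (pc = 0, i = 0; i < n; i++) {	/* assign addresses */
			b[i].addr = pc;
			pc += b[i].size;
		}
		for (i = 0; i < n; i++) {
			if (!b[i].isbranch || b[i].size == 6)
				continue;
			int d = b[b[i].target].addr - b[i].addr;
			if (d > shortrange || d < -shortrange) {
				b[i].size = 6;		/* widen */
				changed = 1;
			}
		}
	} while (changed);
}
```

After the loop terminates, the addr fields give the final instruction locations.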
.LP
On the MIPS processor,
all instructions are one size.
A single pass over the instructions will
determine the locations of all addresses
in the text segment.
.LP
The last pass of the loader produces the
executable binary.
A symbol table and other tables are
produced to help the debugger to
interpret the binary symbolically.
.LP
The loader places absolute source line numbers in the symbol table.
The name and absolute line number of all
.CW #include
files is also placed in the
symbol table so that the debuggers can
associate object code to source files.
.NH
Performance
.LP
The following is a table of the source size of the MIPS
compiler.
.DS
.ta .1i .6i
	lines	module
	\0509	machine-independent headers
	1070	machine-independent YACC source
	6090	machine-independent C source

	\0545	machine-dependent headers
	6532	machine-dependent C source

	\0298	loader headers
	5215	loader C source
.DE
.LP
The following table shows timing
of a test program
that plays checkers, running on a MIPS R4000.
The test program is 26 files totaling 12600 lines of C.
The execution time does not significantly
depend on library implementation.
Since no other compiler runs on Plan 9,
the Plan 9 tests were done with the Plan 9 operating system;
the other tests were done on the vendor's operating system.
The hardware was identical in both cases.
The optimizer in the vendor's compiler
is reputed to be extremely good.
.DS
.ta .1i .9i
	\0\04.49s	Plan 9 \f(CWvc\fP \f(CW-N\fP compile time (opposite of \f(CW-O\fP)
	\0\01.72s	Plan 9 \f(CWvc\fP \f(CW-N\fP load time
	148.69s	Plan 9 \f(CWvc\fP \f(CW-N\fP run time

	\015.07s	Plan 9 \f(CWvc\fP compile time (\f(CW-O\fP implicit)
	\0\01.66s	Plan 9 \f(CWvc\fP load time
	\089.96s	Plan 9 \f(CWvc\fP run time

	\014.83s	vendor \f(CWcc\fP compile time
	\0\00.38s	vendor \f(CWcc\fP load time
	104.75s	vendor \f(CWcc\fP run time

	\043.59s	vendor \f(CWcc\fP \f(CW-O\fP compile time
	\0\00.38s	vendor \f(CWcc\fP \f(CW-O\fP load time
	\076.19s	vendor \f(CWcc\fP \f(CW-O\fP run time

	\0\08.19s	vendor \f(CWcc\fP \f(CW-O3\fP compile time
	\035.97s	vendor \f(CWcc\fP \f(CW-O3\fP load time
	\071.16s	vendor \f(CWcc\fP \f(CW-O3\fP run time
.DE
.LP
To compare the Intel compiler,
a program that is about 40% bit manipulation and
about 60% single precision floating point was
run on the same 33 MHz 486, once under Windows
compiled with the Watcom compiler, version 10.0,
in 16-bit mode and once under
Plan 9 in 32-bit mode.
The Plan 9 execution time was 27 sec while the Windows
execution time was 31 sec.
.NH
Conclusions
.LP
The new compilers compile
quickly,
load slowly,
and produce
medium quality
object code.
The compilers are relatively
portable,
requiring but a couple of weeks' work to
produce a compiler for a different computer.
For Plan 9,
where we needed several compilers
with specialized features and
our own object formats,
this project was indispensable.
It is also necessary for us to
be able to freely distribute our compilers
with the Plan 9 distribution.
.LP
Two problems have come up in retrospect.
The first has to do with the
division of labor between compiler and loader.
Plan 9 runs on multi-processors and as such
compilations are often done in parallel.
Unfortunately,
all compilations must be complete before loading
can begin.
The load is then single-threaded.
With this model,
any shift of work from compile to load
results in a significant increase in real time.
The same is true of libraries that are compiled
infrequently and loaded often.
In the future,
we may try to put some of the loader work
back into the compiler.
.LP
The second problem comes from
the various optimizations performed over several
passes.
Often optimizations in different passes depend
on each other.
Iterating the passes could compromise efficiency,
or even loop.
We see no real solution to this problem.
.NH
References
.LP
[Aho87] A. V. Aho, R. Sethi, and J. D. Ullman,
.I
Compilers \- Principles, Techniques, and Tools,
.R
Addison Wesley,
Reading, MA,
1987.
.LP
[ANSI90] \f2American National Standard for Information Systems \-
Programming Language C\f1, American National Standards Institute, Inc.,
New York, 1990.
.LP
[Dav91] J. W. Davidson and D. B. Whalley,
``Methods for Saving and Restoring Register Values across Function Calls'',
.I
Software\-Practice and Experience,
.R
Vol 21(2), pp. 149-165, February 1991.
.LP
[Joh79] S. C. Johnson,
``YACC \- Yet Another Compiler Compiler'',
.I
UNIX Programmer's Manual, Seventh Ed., Vol. 2A,
.R
AT&T Bell Laboratories,
Murray Hill, NJ,
1979.
.LP
[Set70] R. Sethi and J. D. Ullman,
``The Generation of Optimal Code for Arithmetic Expressions'',
.I
Journal of the ACM,
.R
Vol 17(4), pp. 715-728, 1970.
.LP
[Szy78] T. G. Szymanski,
``Assembling Code for Machines with Span-dependent Instructions'',
.I
Communications of the ACM,
.R
Vol 21(4), pp. 300-308, 1978.