.TL
Plan 9 C Compilers
.AU
Ken Thompson
ken@plan9.bell-labs.com
.AB
.FS
Originally appeared, in a different form, in
.I
Proceedings of the Summer 1990 UKUUG Conference,
.R
pp. 41-51,
London, 1990.
.FE
This paper describes the overall structure and function of the Plan 9 C compilers.
A more detailed implementation document
for any one of the compilers
is yet to be written.
.AE
.NH
Introduction
.LP
There are many compilers in the series.
Six of the compilers (MIPS 3000, SPARC, Intel 386, Power PC, DEC Alpha, and Motorola 68020)
are considered active and are used to compile
current versions of Plan 9.
Several others (Motorola 68000, Intel 960, ARM 7500, AMD 29000) have had only limited use, such as
to program peripherals or experimental devices.
.NH
Structure
.LP
The compiler is a single program that produces an
object file.
Combined in the compiler are the traditional
roles of preprocessor, lexical analyzer, parser, code generator,
local optimizer,
and first half of the assembler.
The object files are binary forms of assembly
language,
similar to what might be passed between
the first and second passes of an assembler.
.LP
Object files and libraries
are combined by a loader
program to produce the executable binary.
The loader combines the roles of second half
of the assembler, global optimizer, and loader.
The names of the compilers, loaders, and assemblers
are as follows:
.DS
.ta 1.5i
.de Ta
\\$1	\f(CW\\$2\fP  \f(CW\\$3\fP  \f(CW\\$4\fP
..
.Ta SPARC kc kl ka
.Ta Power\ PC qc ql qa
.Ta MIPS vc vl va
.Ta Motorola\ 68000 1c 1l 1a
.Ta Motorola\ 68020 2c 2l 2a
.Ta ARM\ 7500 5c 5l 5a
.Ta Intel\ 960 6c 6l 6a
.Ta DEC\ Alpha 7c 7l 7a
.Ta Intel\ 386 8c 8l 8a
.Ta AMD\ 29000 9c 9l 9a
.DE
There is a further breakdown
in the source of the compilers into
object-independent and
object-dependent
parts.
All of the object-independent parts
are combined into source files in the
directory
.CW /sys/src/cmd/cc .
The object-dependent parts are collected
in a separate directory for each compiler,
for example
.CW /sys/src/cmd/vc .
All of the code,
both object-independent and
object-dependent,
is machine-independent
and may be cross-compiled and executed on any
of the architectures.
.NH
The Language
.LP
The compiler implements ANSI C with some
restrictions and extensions
[ANSI90].
Most of the restrictions are due to
personal preference, while
most of the extensions were to help in
the implementation of Plan 9.
There are other departures from the standard,
particularly in the libraries,
that are beyond the scope of this
paper.
.NH 2
Register, volatile, const
.LP
The keyword
.CW register
is recognized syntactically
but is semantically ignored.
Thus taking the address of a
.CW register
variable is not diagnosed.
The keyword
.CW volatile
disables all optimizations, in particular registerization, of the corresponding variable.
The keyword
.CW const
generates warnings (if warnings are enabled by the compiler's
.CW -w
option) of non-constant use of the variable,
but does not affect the generated code.
.NH 2
The preprocessor
.LP
The C preprocessor is probably the
biggest departure from the ANSI standard.
.LP
The preprocessor built into the Plan 9 compilers does not support
.CW #if ,
although it does handle
.CW #ifdef
and
.CW #include .
If it is necessary to be more standard,
the source text can first be run through the separate ANSI C
preprocessor,
.CW cpp .
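.LP
For example, a guard that standard C might write with
.CW #if
must be recast to use
.CW #ifdef .
The following fragment is illustrative;
the macro name is invented:
.DS
.CW
.ta .1i
	/* instead of: #if defined(DEBUG) */
	#ifdef DEBUG
	print("debugging");
	#endif
.DE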
.NH 2
Unnamed substructures
.LP
The most important and most heavily used of the
extensions is the declaration of an
unnamed substructure or subunion.
For example:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	typedef
	struct	lock
	{
		int    locked;
	} Lock;

	typedef
	struct	node
	{
		int	type;
		union
		{
			double dval;
			float  fval;
			long   lval;
		};
		Lock;
	} Node;

	Lock*	lock;
	Node*	node;
.DE
The declaration of
.CW Node
has an unnamed substructure of type
.CW Lock
and an unnamed subunion.
One use of this feature is that elements of the
subunit can be referenced as if they were members of
the outer structure.
Thus
.CW node->dval
and
.CW node->locked
are legitimate references.
.LP
When an outer structure is used
in a context that is only legal for
an unnamed substructure,
the compiler promotes the reference to the
unnamed substructure.
This is true for references to structures and
for references to pointers to structures.
This happens in assignment statements and
in argument passing where prototypes have been
declared.
Thus, continuing with the example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	lock = node;
.DE
would assign a pointer to the unnamed
.CW Lock
in
the
.CW Node
to the variable
.CW lock .
Another example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	extern void lock(Lock*);
	func(...)
	{
		...
		lock(node);
		...
	}
.DE
will pass a pointer to the
.CW Lock
substructure.
.LP
Finally, in places where context is insufficient to identify the unnamed structure,
the type name (it must be a
.CW typedef )
of the unnamed structure can be used as an identifier.
In our example,
.CW &node->Lock
gives the address of the anonymous
.CW Lock
structure.
.NH 2
Structure displays
.LP
A structure cast followed by a list of expressions in braces is
an expression with the type of the structure and elements assigned from
the corresponding list.
Structures are now almost first-class citizens of the language.
It is common to see code like this:
.DS
.CW
.ta .1i
	r = (Rectangle){point1, (Point){x,y+2}};
.DE
.NH 2
Initialization indexes
.LP
In initializers of arrays,
one may place a constant expression
in square brackets before an initializer.
This causes the next initializer to assign
the indicated element.
For example:
.DS
.CW
.ta .1i .6i 1.6i
	enum	errors
	{
		Etoobig,
		Ealarm,
		Egreg
	};
	char* errstrings[] =
	{
		[Ealarm]	"Alarm call",
		[Egreg]	"Panic: out of mbufs",
		[Etoobig]	"Arg list too long",
	};
.DE
In the same way,
individual structure members may
be initialized in any order by preceding the initialization with
.CW .tagname .
Both forms allow an optional
.CW = ,
to be compatible with a proposed
extension to ANSI C.
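.LP
For example, assuming a structure
.CW Point
with members
.CW x
and
.CW y ,
the members can be initialized by name in either order:
.DS
.CW
.ta .1i .6i 1.6i
	Point p =
	{
		.y	5,
		.x	10,
	};
.DE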
.NH 2
External register
.LP
The declaration
.CW extern
.CW register
will dedicate a register to
a variable on a global basis.
It can be used only under special circumstances.
External register variables must be identically
declared in all modules and
libraries.
The feature is not intended for efficiency,
although it can produce efficient code;
rather it represents a unique storage class that
would be hard to get any other way.
On a shared-memory multi-processor,
an external register is
one-per-processor and neither one-per-procedure (automatic)
nor one-per-system (external).
It is used for two variables in the Plan 9 kernel,
.CW u
and
.CW m .
The variable
.CW u
is a pointer to the structure representing the currently running process
and
.CW m
is a pointer to the per-machine data structure.
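.LP
The declarations have the following form
(the structure type names here are illustrative):
.DS
.CW
.ta .1i
	extern register User* u;
	extern register Mach* m;
.DE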
.NH 2
Long long
.LP
The compilers accept
.CW long
.CW long
as a basic type meaning 64-bit integer.
On all of the machines
this type is synthesized from 32-bit instructions.
.NH 2
Pragma
.LP
The compilers accept
.CW #pragma
.CW lib
.I libname
and pass the
library name string uninterpreted
to the loader.
The loader uses the library name to
find libraries to load.
If the name contains
.CW %O ,
it is replaced with
the single character object type of the compiler
(e.g.,
.CW v
for the MIPS).
If the name contains
.CW %M ,
it is replaced with
the architecture type for the compiler
(e.g.,
.CW mips
for the MIPS).
If the name starts with
.CW /
it is an absolute pathname;
if it starts with
.CW .
then it is searched for in the loader's current directory.
Otherwise, the name is searched from
.CW /%M/lib .
Such
.CW #pragma
statements in header files guarantee that the correct
libraries are always linked with a program without the
need to specify them explicitly at link time.
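.LP
For example, a library's header file might contain a line like
.P1
#pragma lib "libbio.a"
.P2
so that any program including the header is linked with
.CW libbio.a ,
found by the rules above as
.CW /%M/lib/libbio.a .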
.LP
They also accept
.CW #pragma
.CW hjdicks
.CW on
(or
.CW yes
or
.CW 1 )
to cause subsequently declared data, until
.CW #pragma
.CW hjdicks
.CW off
(or
.CW no
or
.CW 0 ),
to be laid out in memory tightly packed in successive bytes, disregarding
the usual alignment rules.
Accessing such data can cause faults.
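.LP
For example, the following illustrative declaration produces
a structure with no padding between the members:
.DS
.CW
.ta .1i .6i 1.6i
	#pragma hjdicks on
	typedef
	struct
	{
		char	type;
		unsigned long	addr;
	} Wire;
	#pragma hjdicks off
.DE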
.LP
Two
.CW #pragma
statements allow type-checking of
.CW print -like
functions.
The first, of the form
.P1
#pragma varargck argpos error 2
.P2
tells the compiler that the second argument to
.CW error
is a
.CW print
format string (see the manual page
.I print (2))
that specifies how to format
.CW error 's
subsequent arguments.
The second, of the form
.P1
#pragma varargck type "s" char*
.P2
says that the
.CW print
format verb
.CW s
processes an argument of
type
.CW char* .
If the compiler's
.CW -F
option is enabled, the compiler will use this information
to report type violations in the arguments to
.CW print ,
.CW error ,
and similar routines.
.NH
Object module conventions
.LP
The overall conventions of the runtime environment
are important
to runtime efficiency.
In this section,
several of these conventions are discussed.
.NH 2
Register saving
.LP
In the Plan 9 compilers,
the caller of a procedure saves the registers.
With caller-saves,
the leaf procedures can use all the
registers and never save them.
If you spend a lot of time at the leaves,
this seems preferable.
With callee-saves,
the saving of the registers is done
in the single point of entry and return.
If you are interested in space,
this seems preferable.
In both,
there is a degree of uncertainty
about what registers need to be saved.
Callee-saved registers make it difficult to
find variables in registers in debuggers.
Callee-saved registers also complicate
the implementation of
.CW longjmp .
The convincing argument is
that with caller-saves,
the decision to registerize a variable
can include the cost of saving the register
across calls.
For a further discussion of caller- vs. callee-saves,
see the paper by Davidson and Whalley [Dav91].
.LP
In the Plan 9 operating system,
calls to the kernel look like normal procedure
calls, which means
the caller
has saved the registers and the system
entry does not have to.
This makes system calls considerably faster.
Since this is a potential security hole,
and can lead to non-determinism,
the system may eventually save the registers
on entry,
or more likely clear the registers on return.
.NH 2
Calling convention
.LP
Older C compilers maintain a frame pointer, which is at a known constant
offset from the stack pointer within each function.
For machines where the stack grows towards zero,
the argument pointer is at a known constant offset
from the frame pointer.
Since the stack grows down in Plan 9,
the Plan 9 compilers
keep neither an
explicit frame pointer nor
an explicit argument pointer;
instead they generate addresses relative to the stack pointer.
.LP
On some architectures, the first argument to a subroutine is passed in a register.
.NH 2
Functions returning structures
.LP
Structures longer than one word are awkward to implement
since they do not fit in registers and must
be passed around in memory.
Functions that return structures
are particularly clumsy.
The Plan 9 compilers pass the return address of
a structure as the first argument of a
function that has a structure return value.
Thus
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	x = f(...)
.DE
is rewritten as
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	f(&x, ...)\f1.
.DE
This saves a copy and makes the compilation
much less clumsy.
A disadvantage is that if you call this
function without an assignment,
a dummy location must be invented.
.LP
There is also a danger of calling a function
that returns a structure without declaring
it as such.
With ANSI C function prototypes,
this error need never occur.
.NH
Implementation
.LP
The compiler is divided internally into
four machine-independent passes,
four machine-dependent passes,
and an output pass.
The next nine sections describe each pass in order.
.NH 2
Parsing
.LP
The first pass is a YACC-based parser
[Joh79].
Declarations are interpreted immediately,
building a block structured symbol table.
Executable statements are put into a parse tree
and collected,
without interpretation.
At the end of each procedure,
the parse tree for the function is
examined by the other passes of the compiler.
.LP
The input stream of the parser is
a pushdown list of input activations.
The preprocessor
expansions of
macros
and
.CW #include
are implemented as pushdowns.
Thus there is no separate
pass for preprocessing.
.NH 2
Typing
.LP
The next pass distributes typing information
to every node of the tree.
Implicit operations on the tree are added,
such as type promotions and taking the
address of arrays and functions.
.NH 2
Machine-independent optimization
.LP
The next pass performs optimizations
and transformations of the tree, such as converting
.CW &*x
and
.CW *&x
into
.CW x .
Constant expressions are converted to constants in this pass.
.NH 2
Arithmetic rewrites
.LP
This is another machine-independent optimization.
Subtrees of add, subtract, and multiply of integers are
rewritten for easier compilation.
The major transformation is factoring:
.CW 4+8*a+16*b+5
is transformed into
.CW 9+8*(a+2*b) .
Such expressions arise from address
manipulation and array indexing.
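.LP
The two forms compute the same value:
.DS
.CW
.ta .1i
	4+8*a+16*b+5  =  9+8*a+16*b  =  9+8*(a+2*b)
.DE
The factored form folds the constants together and
nests the multiplications,
so the expression can be evaluated with fewer registers
and with multiplications by smaller constants,
which often reduce to shifts or indexed addressing.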
.NH 2
Addressability
.LP
This is the first of the machine-dependent passes.
The addressability of a processor is defined as the set of
expressions that is legal in the address field
of a machine language instruction.
The addressability of different processors varies widely.
At one end of the spectrum are the 68020 and VAX,
which allow a complex mix of incrementing,
decrementing,
indexing, and relative addressing.
At the other end is the MIPS,
which allows only registers and constant offsets from the
contents of a register.
The addressability can be different for different instructions
within the same processor.
.LP
It is important to the code generator to know when a
subtree represents an address of a particular type.
This is done with a bottom-up walk of the tree.
In this pass, the leaves are labeled with small integers.
When an internal node is encountered,
it is labeled by consulting a table indexed by the
labels on the left and right subtrees.
For example,
on the 68020 processor,
it is possible to address an
offset from a named location.
In C, this is represented by the expression
.CW *(&name+constant) .
This is marked addressable by the following table.
In the table,
a node represented by the left column is marked
with a small integer from the right column.
Marks of the form
.CW A\s-2\di\u\s0
are addressable while
marks of the form
.CW N\s-2\di\u\s0
are not addressable.
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	name	A\s-2\d1\u\s0
	const	A\s-2\d2\u\s0
	&A\s-2\d1\u\s0	A\s-2\d3\u\s0
	A\s-2\d3\u\s0+A\s-2\d1\u\s0	N\s-2\d1\u\s0 \fR(note that this is not addressable)\fP
	*N\s-2\d1\u\s0	A\s-2\d4\u\s0
.DE
Here there is a distinction between
a node marked
.CW A\s-2\d1\u\s0
and a node marked
.CW A\s-2\d4\u\s0
because the address operator of an
.CW A\s-2\d4\u\s0
node is not addressable.
So to extend the table:
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	&A\s-2\d4\u\s0	N\s-2\d2\u\s0
	N\s-2\d2\u\s0+N\s-2\d1\u\s0	N\s-2\d1\u\s0
.DE
The full addressability of the 68020 is expressed
in 18 rules like this,
while the addressability of the MIPS is expressed
in 11 rules.
When one ports the compiler,
this table is usually initialized
so that leaves are labeled as addressable and nothing else.
The code produced is poor,
but porting is easy.
The table can be extended later.
.LP
This pass also rewrites some complex operators
into procedure calls.
Examples include 64-bit multiply and divide.
.LP
In the same bottom-up pass of the tree,
the nodes are labeled with a Sethi-Ullman complexity
[Set70].
This number is roughly the number of registers required
to compile the tree on an ideal machine.
An addressable node is marked 0.
A function call is marked infinite.
A unary operator is marked as the
maximum of 1 and the mark of its subtree.
A binary operator with equal marks on its subtrees is
marked with a subtree mark plus 1.
A binary operator with unequal marks on its subtrees is
marked with the maximum mark of its subtrees.
The actual values of the marks are not too important,
but the relative values are.
The goal is to compile the harder
(larger mark)
subtree first.
673subtree first.
674.NH 2
675Code generation
676.LP
677Code is generated by recursive
678descent.
679The Sethi-Ullman complexity completely guides the
680order.
681The addressability defines the leaves.
682The only difficult part is compiling a tree
683that has two infinite (function call)
684subtrees.
685In this case,
686one subtree is compiled into the return register
687(usually the most convenient place for a function call)
688and then stored on the stack.
689The other subtree is compiled into the return register
690and then the operation is compiled with
691operands from the stack and the return register.
692.LP
693There is a separate boolean code generator that compiles
694conditional expressions.
695This is fundamentally different from compiling an arithmetic expression.
696The result of the boolean code generator is the
697position of the program counter and not an expression.
698The boolean code generator makes extensive use of De Morgan's rule.
699The boolean code generator is an expanded version of that described
700in chapter 8 of Aho, Sethi, and Ullman
701[Aho87].
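.LP
The effect can be seen on a short-circuit conditional.
In the following illustrative fragment,
the generated code is sketched in an invented assembly;
note that no boolean value is ever stored:
.DS
.CW
.ta .1i .6i 1.1i
	if(a > 0 && b < 10)
		f();
.DE
becomes a pair of conditional branches:
.DS
.CW
.ta .1i .6i 1.6i
	CMP	a, $0
	BLE	skip
	CMP	b, $10
	BGE	skip
	CALL	f
skip:
.DE
A negated test is handled with De Morgan's rule:
each comparison is inverted and the conjunction
becomes a disjunction.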
.LP
There is a considerable amount of talk in the literature
about automating this part of a compiler with a machine
description.
Since this code generator is so small
(less than 500 lines of C)
and easy,
it hardly seems worth the effort.
.NH 2
Registerization
.LP
Up to now,
the compiler has operated on syntax trees
that are roughly equivalent to the original source language.
The previous pass has produced machine language in an internal
format.
The next two passes operate on the internal machine language
structures.
The purpose of the next pass is to reintroduce
registers for heavily used variables.
.LP
All of the variables that can be
potentially registerized within a procedure are
placed in a table.
(Suitable variables are any automatic or external
scalars that do not have their addresses extracted.
Some constants that are hard to reference are also
considered for registerization.)
Four separate data flow equations are evaluated
over the procedure on all of these variables.
Two of the equations are the normal set-behind
and used-ahead
bits that define the life of a variable.
The two new bits tell if a variable life
crosses a function call ahead or behind.
By examining a variable over its lifetime,
it is possible to get a cost
for registerizing.
Loops are detected and the costs are multiplied
by three for every level of loop nesting.
Costs are sorted and the variables
are replaced by available registers on a greedy basis.
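.LP
The cost computation can be sketched as follows.
The types, field names, and weighting details here are
illustrative and are not the compiler's own:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	long
	regcost(Var *v)
	{
		long cost, w;
		Ref *r;
		int i;

		cost = 0;
		for(r = v->refs; r != nil; r = r->next){
			w = 1;
			for(i = 0; i < r->loopdepth; i++)
				w *= 3;	/* times 3 per loop level */
			cost += w;
		}
		return cost;
	}
.DE
A life that crosses a function call must also charge
the save and restore against this benefit.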
.LP
The 68020 has two different
types of registers.
For the 68020,
two different costs are calculated for
each variable life and the register type that
affords the better cost is used.
Ties are broken by counting the number of available
registers of each type.
.LP
Note that externals are registerized together with automatics.
This is done by evaluating the semantics of a ``call'' instruction
differently for externals and automatics.
Since a call goes outside the local procedure,
it is assumed that a call references all externals.
Similarly,
externals are assumed to be set before an ``entry'' instruction
and assumed to be referenced after a ``return'' instruction.
This makes sure that externals are in memory across calls.
.LP
The overall results are satisfactory.
It would be nice to be able to do this processing in
a machine-independent way,
but it is impossible to get all of the costs and
side effects of different choices by examining the parse tree.
.LP
Most of the code in the registerization pass is machine-independent.
The major machine-dependency is in
examining a machine instruction to ask if it sets or references
a variable.
.NH 2
Machine code optimization
.LP
The next pass walks the machine code
for opportunistic optimizations.
For the most part,
this is highly specific to a particular
processor.
One optimization that is performed
on all of the processors is the
removal of unnecessary ``move''
instructions.
Ironically,
most of these instructions were inserted by
the previous pass.
There are two patterns that are repetitively
matched and replaced until no more matches are
found.
The first tries to remove ``move'' instructions
by relabeling variables.
.LP
When a ``move'' instruction is encountered,
if the destination variable is set before the
source variable is referenced,
then all of the references to the destination
variable can be renamed to the source and the ``move''
can be deleted.
This transformation uses the reverse data flow
set up in the previous pass.
.LP
An example of this pattern is depicted in the following
table.
The pattern is in the left column and the
replacement action is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	MOVE	a->b		\fR(remove)\fP
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	USE	b		USE	a
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	SET	b		SET	b
.DE
.LP
Experiments have shown that it is marginally
worthwhile to rename uses of the destination variable
with uses of the source variable up to
the first use of the source variable.
.LP
The second transform will do relabeling
without deleting instructions.
When a ``move'' instruction is encountered,
if the source variable has been set prior
to the use of the destination variable
then all of the references to the source
variable are replaced by the destination and
the ``move'' is inverted.
Typically,
this transformation will alter two ``move''
instructions and allow the first transformation
another chance to remove code.
This transformation uses the forward data flow
set up in the previous pass.
.LP
Again,
the following is a depiction of the transformation where
the pattern is in the left column and the
rewrite is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	SET	a		SET	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	USE	a		USE	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	MOVE	a->b		MOVE	b->a
.DE
Iterating these transformations
will usually get rid of all redundant ``move'' instructions.
.LP
A problem with this organization is that the costs
of registerization calculated in the previous pass
must depend on how well this pass can detect and remove
redundant instructions.
Often,
a fine candidate for registerization is rejected
because of the cost of instructions that are later
removed.
.NH 2
Writing the object file
.LP
The last pass walks the internal assembly language
and writes the object file.
The object file is reduced in size by about a factor
of three with simple compression
techniques.
The most important aspect of the object file
format is that it is independent of the compiling machine.
All integer and floating numbers in the object
code are converted to known formats and byte
orders.
.NH
The loader
.LP
The loader is a multiple pass program that
reads object files and libraries and produces
an executable binary.
The loader also does some minimal
optimizations and code rewriting.
Many of the operations performed by the
loader are machine-dependent.
.LP
The first pass of the loader reads the
object modules into an internal data
structure that looks like binary assembly language.
As the instructions are read,
code is reordered to remove
unconditional branch instructions.
Conditional branch instructions are inverted
to prevent the insertion of unconditional branches.
The loader will also make a copy of a few instructions
to remove an unconditional branch.
.LP
The next pass allocates addresses for
all external data.
Typical of processors is the MIPS,
which can reference ±32K bytes from a
register.
The loader allocates the register
.CW R30
as the static pointer.
The value placed in
.CW R30
is the base of the data segment plus 32K.
It is then cheap to reference all data in the
first 64K of the data segment.
External variables are allocated to
the data segment
with the smallest variables allocated first.
If all of the data cannot fit into the first
64K of the data segment,
then usually only a few large arrays
need more expensive addressing modes.
.LP
For the MIPS processor,
the loader makes a pass over the internal
structures,
exchanging instructions to try
to fill ``delay slots'' with useful work.
If a useful instruction cannot be found
to fill a delay slot,
the loader will insert
``noop''
instructions.
This pass is very expensive and does not
do a good job.
About 40% of all instructions are in
delay slots.
About 65% of these are useful instructions and
35% are ``noops.''
The vendor-supplied assembler does this job
more effectively,
filling about 80%
of the delay slots with useful instructions.
.LP
On the 68020 processor,
branch instructions come in a variety of
sizes depending on the relative distance
of the branch.
Thus the size of branch instructions
can be mutually dependent.
The loader uses a multiple pass algorithm
to resolve the branch lengths
[Szy78].
Initially, all branches are assumed minimal length.
On each subsequent pass,
the branches are reassessed
and expanded if necessary.
When no more expansions occur,
the locations of the instructions in
the text segment are known.
.LP
On the MIPS processor,
all instructions are one size.
A single pass over the instructions will
determine the locations of all addresses
in the text segment.
.LP
The last pass of the loader produces the
executable binary.
A symbol table and other tables are
produced to help the debugger to
interpret the binary symbolically.
.LP
The loader places absolute source line numbers in the symbol table.
The name and absolute line number of all
.CW #include
files is also placed in the
symbol table so that the debuggers can
associate object code to source files.
.NH
Performance
.LP
The following is a table of the source size of the MIPS
compiler.
.DS
.ta .1i .6i
	lines	module
	\0509	machine-independent headers
	1070	machine-independent YACC source
	6090	machine-independent C source

	\0545	machine-dependent headers
	6532	machine-dependent C source

	\0298	loader headers
	5215	loader C source
.DE
.LP
The following table shows timing
of a test program
that plays checkers, running on a MIPS R4000.
The test program is 26 files totaling 12600 lines of C.
The execution time does not significantly
depend on library implementation.
Since no other compiler runs on Plan 9,
the Plan 9 tests were done with the Plan 9 operating system;
the other tests were done on the vendor's operating system.
The hardware was identical in both cases.
The optimizer in the vendor's compiler
is reputed to be extremely good.
.DS
.ta .1i .9i
	\0\04.49s	Plan 9 \f(CWvc\fP \f(CW-N\fP compile time (opposite of \f(CW-O\fP)
	\0\01.72s	Plan 9 \f(CWvc\fP \f(CW-N\fP load time
	148.69s	Plan 9 \f(CWvc\fP \f(CW-N\fP run time

	\015.07s	Plan 9 \f(CWvc\fP compile time (\f(CW-O\fP implicit)
	\0\01.66s	Plan 9 \f(CWvc\fP load time
	\089.96s	Plan 9 \f(CWvc\fP run time

	\014.83s	vendor \f(CWcc\fP compile time
	\0\00.38s	vendor \f(CWcc\fP load time
	104.75s	vendor \f(CWcc\fP run time

	\043.59s	vendor \f(CWcc\fP \f(CW-O\fP compile time
	\0\00.38s	vendor \f(CWcc\fP \f(CW-O\fP load time
	\076.19s	vendor \f(CWcc\fP \f(CW-O\fP run time

	\0\08.19s	vendor \f(CWcc\fP \f(CW-O3\fP compile time
	\035.97s	vendor \f(CWcc\fP \f(CW-O3\fP load time
	\071.16s	vendor \f(CWcc\fP \f(CW-O3\fP run time
.DE
.LP
To compare with a compiler for the Intel architecture,
a program that is about 40% bit manipulation and
about 60% single precision floating point was
run twice on the same 33 MHz 486: once under Windows,
compiled with the Watcom compiler, version 10.0,
in 16-bit mode, and once under
Plan 9 in 32-bit mode.
The Plan 9 execution time was 27 sec while the Windows
execution time was 31 sec.
.NH
Conclusions
.LP
The new compilers compile
quickly,
load slowly,
and produce
medium quality
object code.
The compilers are relatively
portable,
requiring but a couple of weeks' work to
produce a compiler for a different computer.
For Plan 9,
where we needed several compilers
with specialized features and
our own object formats,
this project was indispensable.
It is also necessary for us to
be able to freely distribute our compilers
with the Plan 9 distribution.
.LP
Two problems have come up in retrospect.
The first has to do with the
division of labor between compiler and loader.
Plan 9 runs on multi-processors and as such
compilations are often done in parallel.
Unfortunately,
all compilations must be complete before loading
can begin.
The load is then single-threaded.
With this model,
any shift of work from compile to load
results in a significant increase in real time.
The same is true of libraries that are compiled
infrequently and loaded often.
In the future,
we may try to put some of the loader work
back into the compiler.
.LP
The second problem comes from
the various optimizations performed over several
passes.
Often optimizations in different passes depend
on each other.
Iterating the passes could compromise efficiency,
or even loop.
We see no real solution to this problem.
.NH
References
.LP
[Aho87] A. V. Aho, R. Sethi, and J. D. Ullman,
.I
Compilers \- Principles, Techniques, and Tools,
.R
Addison Wesley,
Reading, MA,
1987.
.LP
[ANSI90] \f2American National Standard for Information Systems \-
Programming Language C\f1, American National Standards Institute, Inc.,
New York, 1990.
.LP
[Dav91] J. W. Davidson and D. B. Whalley,
``Methods for Saving and Restoring Register Values across Function Calls'',
.I
Software\-Practice and Experience,
.R
Vol 21(2), pp. 149-165, February 1991.
.LP
[Joh79] S. C. Johnson,
``YACC \- Yet Another Compiler Compiler'',
.I
UNIX Programmer's Manual, Seventh Ed., Vol. 2A,
.R
AT&T Bell Laboratories,
Murray Hill, NJ,
1979.
.LP
[Set70] R. Sethi and J. D. Ullman,
``The Generation of Optimal Code for Arithmetic Expressions'',
.I
Journal of the ACM,
.R
Vol 17(4), pp. 715-728, 1970.
.LP
[Szy78] T. G. Szymanski,
``Assembling Code for Machines with Span-dependent Instructions'',
.I
Communications of the ACM,
.R
Vol 21(4), pp. 300-308, 1978.