xref: /plan9-contrib/sys/doc/compiler.ms (revision b94bb474148e9d24a82a427863d9c9eb4c20f4ae)
1.HTML "Plan 9 C Compilers
2.TL
3Plan 9 C Compilers
4.AU
5Ken Thompson
6ken@plan9.bell-labs.com
7.AB
8.FS
9Originally appeared, in a different form, in
10.I
11Proceedings of the Summer 1990 UKUUG Conference,
12.R
13pp. 41-51,
14London, 1990.
15.FE
16This paper describes the overall structure and function of the Plan 9 C compilers.
17A more detailed implementation document
18for any one of the compilers
19is yet to be written.
20.AE
21.NH
22Introduction
23.LP
24There are many compilers in the series.
25Five of the compilers (Intel 386, AMD64, PowerPC, PowerPC 64-bit, ARM)
26are considered active and are used to compile
27current versions of Plan 9.
28Four of the compilers (MIPS 3000, SPARC, DEC Alpha, and Motorola 68020)
29are maintained but are for older machines
30for which we have no current ports of Plan 9;
31other than the MIPS, we are unlikely to port to any such machines.
32Several others (Motorola 68000, Intel 960, AMD 29000) have had only limited use, such as
33to program peripherals or experimental devices.
34.NH
35Structure
36.LP
37The compiler is a single program that produces an
38object file.
39Combined in the compiler are the traditional
40roles of preprocessor, lexical analyzer, parser, code generator,
41local optimizer,
42and first half of the assembler.
43The object files are binary forms of assembly
44language,
45similar to what might be passed between
46the first and second passes of an assembler.
47.LP
48Object files and libraries
49are combined by a loader
50program to produce the executable binary.
51The loader combines the roles of second half
52of the assembler, global optimizer, and loader.
53The names of the compliers, loaders, and assemblers
54are as follows:
55.DS
56.ta 1.5i
57.de Ta
58\\$1	\f(CW\\$2\fP  \f(CW\\$3\fP  \f(CW\\$4\fP
59..
60.Ta SPARC kc kl ka
61.Ta PowerPC qc ql qa
62.Ta MIPS vc vl va
63.Ta MIPS\ little-endian 0c 0l 0a
64.Ta Motorola\ 68000 1c 1l 1a
65.Ta Motorola\ 68020 2c 2l 2a
66.Ta ARM 5c 5l 5a
67.Ta AMD64 6c 6l 6a
68.Ta DEC\ Alpha 7c 7l 7a
69.Ta Intel\ 386 8c 8l 8a
70.Ta PowerPC\ 64-bit 9c 9l 9a
71.DE
72There is a further breakdown
73in the source of the compilers into
74object-independent and
75object-dependent
76parts.
77All of the object-independent parts
78are combined into source files in the
79directory
80.CW /sys/src/cmd/cc .
81The object-dependent parts are collected
82in a separate directory for each compiler,
83for example
84.CW /sys/src/cmd/vc .
85All of the code,
86both object-independent and
87object-dependent,
88is machine-independent
89and may be cross-compiled and executed on any
90of the architectures.
91.NH
92The Language
93.LP
94The compiler implements ANSI C with some
95restrictions and extensions
96[ANSI90].
97Most of the restrictions are due to
98personal preference, while
99most of the extensions were to help in
100the implementation of Plan 9.
101There are other departures from the standard,
102particularly in the libraries,
103that are beyond the scope of this
104paper.
105.NH 2
106Register, volatile, const
107.LP
108The keyword
109.CW register
110is recognized syntactically
111but is semantically ignored.
112Thus taking the address of a
113.CW register
114variable is not diagnosed.
115The keyword
116.CW volatile
117disables all optimizations, in particular registerization, of the corresponding variable.
118The keyword
119.CW const
120generates warnings (if warnings are enabled by the compiler's
121.CW -w
122option) of non-constant use of the variable,
123but does not affect the generated code.
124.NH 2
125The preprocessor
126.LP
127The C preprocessor is probably the
128biggest departure from the ANSI standard.
129.LP
130The preprocessor built into the Plan 9 compilers does not support
131.CW #if ,
132although it does handle
133.CW #ifdef
134and
135.CW #include .
136If it is necessary to be more standard,
137the source text can first be run through the separate ANSI C
138preprocessor,
139.CW cpp .
140.NH 2
141Unnamed substructures
142.LP
143The most important and most heavily used of the
144extensions is the declaration of an
145unnamed substructure or subunion.
146For example:
147.DS
148.CW
149.ta .1i .6i 1.1i 1.6i
150	typedef
151	struct	lock
152	{
153		int    locked;
154	} Lock;
155
156	typedef
157	struct	node
158	{
159		int	type;
160		union
161		{
162			double dval;
163			float  fval;
164			long   lval;
165		};
166		Lock;
167	} Node;
168
169	Lock*	lock;
170	Node*	node;
171.DE
172The declaration of
173.CW Node
174has an unnamed substructure of type
175.CW Lock
176and an unnamed subunion.
177One use of this feature allows references to elements of the
178subunit to be accessed as if they were in
179the outer structure.
180Thus
181.CW node->dval
182and
183.CW node->locked
184are legitimate references.
185.LP
186When an outer structure is used
187in a context that is only legal for
188an unnamed substructure,
189the compiler promotes the reference to the
190unnamed substructure.
191This is true for references to structures and
192to references to pointers to structures.
193This happens in assignment statements and
194in argument passing where prototypes have been
195declared.
196Thus, continuing with the example,
197.DS
198.CW
199.ta .1i .6i 1.1i 1.6i
200	lock = node;
201.DE
202would assign a pointer to the unnamed
203.CW Lock
204in
205the
206.CW Node
207to the variable
208.CW lock .
209Another example,
210.DS
211.CW
212.ta .1i .6i 1.1i 1.6i
213	extern void lock(Lock*);
214	func(...)
215	{
216		...
217		lock(node);
218		...
219	}
220.DE
221will pass a pointer to the
222.CW Lock
223substructure.
224.LP
225Finally, in places where context is insufficient to identify the unnamed structure,
226the type name (it must be a
227.CW typedef )
228of the unnamed structure can be used as an identifier.
229In our example,
230.CW &node->Lock
231gives the address of the anonymous
232.CW Lock
233structure.
234.NH 2
235Structure displays
236.LP
237A structure cast followed by a list of expressions in braces is
238an expression with the type of the structure and elements assigned from
239the corresponding list.
240Structures are now almost first-class citizens of the language.
241It is common to see code like this:
242.DS
243.CW
244.ta .1i
245	r = (Rectangle){point1, (Point){x,y+2}};
246.DE
247.NH 2
248Initialization indexes
249.LP
250In initializers of arrays,
251one may place a constant expression
252in square brackets before an initializer.
253This causes the next initializer to assign
254the indicated element.
255For example:
256.DS
257.CW
258.ta .1i .6i 1.6i
259	enum	errors
260	{
261		Etoobig,
262		Ealarm,
263		Egreg
264	};
265	char* errstrings[] =
266	{
267		[Ealarm]	"Alarm call",
268		[Egreg]	"Panic: out of mbufs",
269		[Etoobig]	"Arg list too long",
270	};
271.DE
272In the same way,
273individual structures members may
274be initialized in any order by preceding the initialization with
275.CW .tagname .
276Both forms allow an optional
277.CW = ,
278to be compatible with a proposed
279extension to ANSI C.
280.NH 2
281External register
282.LP
283The declaration
284.CW extern
285.CW register
286will dedicate a register to
287a variable on a global basis.
288It can be used only under special circumstances.
289External register variables must be identically
290declared in all modules and
291libraries.
292The feature is not intended for efficiency,
293although it can produce efficient code;
294rather it represents a unique storage class that
295would be hard to get any other way.
296On a shared-memory multi-processor,
297an external register is
298one-per-processor and neither one-per-procedure (automatic)
299or one-per-system (external).
300It is used for two variables in the Plan 9 kernel,
301.CW u
302and
303.CW m .
304.CW U
305is a pointer to the structure representing the currently running process
306and
307.CW m
308is a pointer to the per-machine data structure.
309.NH 2
310Long long
311.LP
312The compilers accept
313.CW long
314.CW long
315as a basic type meaning 64-bit integer.
316On all of the machines
317this type is synthesized from 32-bit instructions.
318.NH 2
319Pragma
320.LP
321The compilers accept
322.CW #pragma
323.CW lib
324.I libname
325and pass the
326library name string uninterpreted
327to the loader.
328The loader uses the library name to
329find libraries to load.
330If the name contains
331.CW $O ,
332it is replaced with
333the single character object type of the compiler
334(e.g.,
335.CW v
336for the MIPS).
337If the name contains
338.CW $M ,
339it is replaced with
340the architecture type for the compiler
341(e.g.,
342.CW mips
343for the MIPS).
344If the name starts with
345.CW /
346it is an absolute pathname;
347if it starts with
348.CW .
349then it is searched for in the loader's current directory.
350Otherwise, the name is searched from
351.CW /$M/lib .
352Such
353.CW #pragma
354statements in header files guarantee that the correct
355libraries are always linked with a program without the
356need to specify them explicitly at link time.
357.LP
358They also accept
359.CW #pragma
360.CW packed
361.CW on
362(or
363.CW yes
364or
365.CW 1 )
366to cause subsequently declared data, until
367.CW #pragma
368.CW packed
369.CW off
370(or
371.CW no
372or
373.CW 0 ),
374to be laid out in memory tightly packed in successive bytes, disregarding
375the usual alignment rules.
376Accessing such data can cause faults.
377.LP
378Similarly,
379.CW #pragma
380.CW profile
381.CW off
382(or
383.CW no
384or
385.CW 0 )
386causes subsequently declared functions, until
387.CW #pragma
388.CW profile
389.CW on
390(or
391.CW yes
392or
393.CW 1 ),
394to be marked as unprofiled.
395Such functions will not be profiled when
396profiling is enabled for the rest of the program.
397.LP
398Two
399.CW #pragma
400statements allow type-checking of
401.CW print -like
402functions.
403The first, of the form
404.P1
405#pragma varargck argpos error 2
406.P2
407tells the compiler that the second argument to
408.CW error
409is a
410.CW print
411format string (see the manual page
412.I print (2))
413that specifies how to format
414.CW error 's
415subsequent arguments.
416The second, of the form
417.P1
418#pragma varargck type "s" char*
419.P2
420says that the
421.CW print
422format verb
423.CW s
424processes an argument of
425type
426.CW char* .
427If the compiler's
428.CW -F
429option is enabled, the compiler will use this information
430to report type violations in the arguments to
431.CW print ,
432.CW error ,
433and similar routines.
434.NH
435Object module conventions
436.LP
437The overall conventions of the runtime environment
438are important
439to runtime efficiency.
440In this section,
441several of these conventions are discussed.
442.NH 2
443Register saving
444.LP
445In the Plan 9 compilers,
446the caller of a procedure saves the registers.
447With caller-saves,
448the leaf procedures can use all the
449registers and never save them.
450If you spend a lot of time at the leaves,
451this seems preferable.
452With callee-saves,
453the saving of the registers is done
454in the single point of entry and return.
455If you are interested in space,
456this seems preferable.
457In both,
458there is a degree of uncertainty
459about what registers need to be saved.
460Callee-saved registers make it difficult to
461find variables in registers in debuggers.
462Callee-saved registers also complicate
463the implementation of
464.CW longjmp .
465The convincing argument is
466that with caller-saves,
467the decision to registerize a variable
468can include the cost of saving the register
469across calls.
470For a further discussion of caller- vs. callee-saves,
471see the paper by Davidson and Whalley [Dav91].
472.LP
473In the Plan 9 operating system,
474calls to the kernel look like normal procedure
475calls, which means
476the caller
477has saved the registers and the system
478entry does not have to.
479This makes system calls considerably faster.
480Since this is a potential security hole,
481and can lead to non-determinism,
482the system may eventually save the registers
483on entry,
484or more likely clear the registers on return.
485.NH 2
486Calling convention
487.LP
488Older C compilers maintain a frame pointer, which is at a known constant
489offset from the stack pointer within each function.
490For machines where the stack grows towards zero,
491the argument pointer is at a known constant offset
492from the frame pointer.
493Since the stack grows down in Plan 9,
494the Plan 9 compilers
495keep neither an
496explicit frame pointer nor
497an explicit argument pointer;
498instead they generate addresses relative to the stack pointer.
499.LP
500On some architectures, the first argument to a subroutine is passed in a register.
501.NH 2
502Functions returning structures
503.LP
504Structures longer than one word are awkward to implement
505since they do not fit in registers and must
506be passed around in memory.
507Functions that return structures
508are particularly clumsy.
509The Plan 9 compilers pass the return address of
510a structure as the first argument of a
511function that has a structure return value.
512Thus
513.DS
514.CW
515.ta .1i .6i 1.1i 1.6i
516	x = f(...)
517.DE
518is rewritten as
519.DS
520.CW
521.ta .1i .6i 1.1i 1.6i
522	f(&x, ...)\f1.
523.DE
524This saves a copy and makes the compilation
525much less clumsy.
526A disadvantage is that if you call this
527function without an assignment,
528a dummy location must be invented.
529.LP
530There is also a danger of calling a function
531that returns a structure without declaring
532it as such.
533With ANSI C function prototypes,
534this error need never occur.
535.NH
536Implementation
537.LP
538The compiler is divided internally into
539four machine-independent passes,
540four machine-dependent passes,
541and an output pass.
542The next nine sections describe each pass in order.
543.NH 2
544Parsing
545.LP
546The first pass is a YACC-based parser
547[Joh79].
548Declarations are interpreted immediately,
549building a block structured symbol table.
550Executable statements are put into a parse tree
551and collected,
552without interpretation.
553At the end of each procedure,
554the parse tree for the function is
555examined by the other passes of the compiler.
556.LP
557The input stream of the parser is
558a pushdown list of input activations.
559The preprocessor
560expansions of
561macros
562and
563.CW #include
564are implemented as pushdowns.
565Thus there is no separate
566pass for preprocessing.
567.NH 2
568Typing
569.LP
570The next pass distributes typing information
571to every node of the tree.
572Implicit operations on the tree are added,
573such as type promotions and taking the
574address of arrays and functions.
575.NH 2
576Machine-independent optimization
577.LP
578The next pass performs optimizations
579and transformations of the tree, such as converting
580.CW &*x
581and
582.CW *&x
583into
584.CW x .
585Constant expressions are converted to constants in this pass.
586.NH 2
587Arithmetic rewrites
588.LP
589This is another machine-independent optimization.
590Subtrees of add, subtract, and multiply of integers are
591rewritten for easier compilation.
592The major transformation is factoring:
593.CW 4+8*a+16*b+5
594is transformed into
595.CW 9+8*(a+2*b) .
596Such expressions arise from address
597manipulation and array indexing.
598.NH 2
599Addressability
600.LP
601This is the first of the machine-dependent passes.
602The addressability of a processor is defined as the set of
603expressions that is legal in the address field
604of a machine language instruction.
605The addressability of different processors varies widely.
606At one end of the spectrum are the 68020 and VAX,
607which allow a complex mix of incrementing,
608decrementing,
609indexing, and relative addressing.
610At the other end is the MIPS,
611which allows only registers and constant offsets from the
612contents of a register.
613The addressability can be different for different instructions
614within the same processor.
615.LP
616It is important to the code generator to know when a
617subtree represents an address of a particular type.
618This is done with a bottom-up walk of the tree.
619In this pass, the leaves are labeled with small integers.
620When an internal node is encountered,
621it is labeled by consulting a table indexed by the
622labels on the left and right subtrees.
623For example,
624on the 68020 processor,
625it is possible to address an
626offset from a named location.
627In C, this is represented by the expression
628.CW *(&name+constant) .
629This is marked addressable by the following table.
630In the table,
631a node represented by the left column is marked
632with a small integer from the right column.
633Marks of the form
634.CW A\s-2\di\u\s0
635are addressable while
636marks of the form
637.CW N\s-2\di\u\s0
638are not addressable.
639.DS
640.B
641.ta .1i 1.1i
642	Node	Marked
643.CW
644	name	A\s-2\d1\u\s0
645	const	A\s-2\d2\u\s0
646	&A\s-2\d1\u\s0	A\s-2\d3\u\s0
647	A\s-2\d3\u\s0+A\s-2\d1\u\s0	N\s-2\d1\u\s0 \fR(note that this is not addressable)\fP
648	*N\s-2\d1\u\s0	A\s-2\d4\u\s0
649.DE
650Here there is a distinction between
651a node marked
652.CW A\s-2\d1\u\s0
653and a node marked
654.CW A\s-2\d4\u\s0
655because the address operator of an
656.CW A\s-2\d4\u\s0
657node is not addressable.
658So to extend the table:
659.DS
660.B
661.ta .1i 1.1i
662	Node	Marked
663.CW
664	&A\s-2\d4\u\s0	N\s-2\d2\u\s0
665	N\s-2\d2\u\s0+N\s-2\d1\u\s0	N\s-2\d1\u\s0
666.DE
667The full addressability of the 68020 is expressed
668in 18 rules like this,
669while the addressability of the MIPS is expressed
670in 11 rules.
671When one ports the compiler,
672this table is usually initialized
673so that leaves are labeled as addressable and nothing else.
674The code produced is poor,
675but porting is easy.
676The table can be extended later.
677.LP
678This pass also rewrites some complex operators
679into procedure calls.
680Examples include 64-bit multiply and divide.
681.LP
682In the same bottom-up pass of the tree,
683the nodes are labeled with a Sethi-Ullman complexity
684[Set70].
685This number is roughly the number of registers required
686to compile the tree on an ideal machine.
687An addressable node is marked 0.
688A function call is marked infinite.
689A unary operator is marked as the
690maximum of 1 and the mark of its subtree.
691A binary operator with equal marks on its subtrees is
692marked with a subtree mark plus 1.
693A binary operator with unequal marks on its subtrees is
694marked with the maximum mark of its subtrees.
695The actual values of the marks are not too important,
696but the relative values are.
697The goal is to compile the harder
698(larger mark)
699subtree first.
700.NH 2
701Code generation
702.LP
703Code is generated by recursive
704descent.
705The Sethi-Ullman complexity completely guides the
706order.
707The addressability defines the leaves.
708The only difficult part is compiling a tree
709that has two infinite (function call)
710subtrees.
711In this case,
712one subtree is compiled into the return register
713(usually the most convenient place for a function call)
714and then stored on the stack.
715The other subtree is compiled into the return register
716and then the operation is compiled with
717operands from the stack and the return register.
718.LP
719There is a separate boolean code generator that compiles
720conditional expressions.
721This is fundamentally different from compiling an arithmetic expression.
722The result of the boolean code generator is the
723position of the program counter and not an expression.
724The boolean code generator makes extensive use of De Morgan's rule.
725The boolean code generator is an expanded version of that described
726in chapter 8 of Aho, Sethi, and Ullman
727[Aho87].
728.LP
729There is a considerable amount of talk in the literature
730about automating this part of a compiler with a machine
731description.
732Since this code generator is so small
733(less than 500 lines of C)
734and easy,
735it hardly seems worth the effort.
736.NH 2
737Registerization
738.LP
739Up to now,
740the compiler has operated on syntax trees
741that are roughly equivalent to the original source language.
742The previous pass has produced machine language in an internal
743format.
744The next two passes operate on the internal machine language
745structures.
746The purpose of the next pass is to reintroduce
747registers for heavily used variables.
748.LP
749All of the variables that can be
750potentially registerized within a procedure are
751placed in a table.
752(Suitable variables are any automatic or external
753scalars that do not have their addresses extracted.
754Some constants that are hard to reference are also
755considered for registerization.)
756Four separate data flow equations are evaluated
757over the procedure on all of these variables.
758Two of the equations are the normal set-behind
759and used-ahead
760bits that define the life of a variable.
761The two new bits tell if a variable life
762crosses a function call ahead or behind.
763By examining a variable over its lifetime,
764it is possible to get a cost
765for registerizing.
766Loops are detected and the costs are multiplied
767by three for every level of loop nesting.
768Costs are sorted and the variables
769are replaced by available registers on a greedy basis.
770.LP
771The 68020 has two different
772types of registers.
773For the 68020,
774two different costs are calculated for
775each variable life and the register type that
776affords the better cost is used.
777Ties are broken by counting the number of available
778registers of each type.
779.LP
780Note that externals are registerized together with automatics.
781This is done by evaluating the semantics of a ``call'' instruction
782differently for externals and automatics.
783Since a call goes outside the local procedure,
784it is assumed that a call references all externals.
785Similarly,
786externals are assumed to be set before an ``entry'' instruction
787and assumed to be referenced after a ``return'' instruction.
788This makes sure that externals are in memory across calls.
789.LP
790The overall results are satisfactory.
791It would be nice to be able to do this processing in
792a machine-independent way,
793but it is impossible to get all of the costs and
794side effects of different choices by examining the parse tree.
795.LP
796Most of the code in the registerization pass is machine-independent.
797The major machine-dependency is in
798examining a machine instruction to ask if it sets or references
799a variable.
800.NH 2
801Machine code optimization
802.LP
803The next pass walks the machine code
804for opportunistic optimizations.
805For the most part,
806this is highly specific to a particular
807processor.
808One optimization that is performed
809on all of the processors is the
810removal of unnecessary ``move''
811instructions.
812Ironically,
813most of these instructions were inserted by
814the previous pass.
815There are two patterns that are repetitively
816matched and replaced until no more matches are
817found.
818The first tries to remove ``move'' instructions
819by relabeling variables.
820.LP
821When a ``move'' instruction is encountered,
822if the destination variable is set before the
823source variable is referenced,
824then all of the references to the destination
825variable can be renamed to the source and the ``move''
826can be deleted.
827This transformation uses the reverse data flow
828set up in the previous pass.
829.LP
830An example of this pattern is depicted in the following
831table.
832The pattern is in the left column and the
833replacement action is in the right column.
834.DS
835.CW
836.ta .1i .6i 1.6i 2.1i 2.6i
837	MOVE	a->b		\fR(remove)\fP
838.R
839	(sequence with no mention of \f(CWa\fP)
840.CW
841	USE	b		USE	a
842.R
843	(sequence with no mention of \f(CWa\fP)
844.CW
845	SET	b		SET	b
846.DE
847.LP
848Experiments have shown that it is marginally
849worthwhile to rename uses of the destination variable
850with uses of the source variable up to
851the first use of the source variable.
852.LP
853The second transform will do relabeling
854without deleting instructions.
855When a ``move'' instruction is encountered,
856if the source variable has been set prior
857to the use of the destination variable
858then all of the references to the source
859variable are replaced by the destination and
860the ``move'' is inverted.
861Typically,
862this transformation will alter two ``move''
863instructions and allow the first transformation
864another chance to remove code.
865This transformation uses the forward data flow
866set up in the previous pass.
867.LP
868Again,
869the following is a depiction of the transformation where
870the pattern is in the left column and the
871rewrite is in the right column.
872.DS
873.CW
874.ta .1i .6i 1.6i 2.1i 2.6i
875	SET	a		SET	b
876.R
877	(sequence with no use of \f(CWb\fP)
878.CW
879	USE	a		USE	b
880.R
881	(sequence with no use of \f(CWb\fP)
882.CW
883	MOVE	a->b		MOVE	b->a
884.DE
885Iterating these transformations
886will usually get rid of all redundant ``move'' instructions.
887.LP
888A problem with this organization is that the costs
889of registerization calculated in the previous pass
890must depend on how well this pass can detect and remove
891redundant instructions.
892Often,
893a fine candidate for registerization is rejected
894because of the cost of instructions that are later
895removed.
896.NH 2
897Writing the object file
898.LP
899The last pass walks the internal assembly language
900and writes the object file.
901The object file is reduced in size by about a factor
902of three with simple compression
903techniques.
904The most important aspect of the object file
905format is that it is independent of the compiling machine.
906All integer and floating numbers in the object
907code are converted to known formats and byte
908orders.
909.NH
910The loader
911.LP
912The loader is a multiple pass program that
913reads object files and libraries and produces
914an executable binary.
915The loader also does some minimal
916optimizations and code rewriting.
917Many of the operations performed by the
918loader are machine-dependent.
919.LP
920The first pass of the loader reads the
921object modules into an internal data
922structure that looks like binary assembly language.
923As the instructions are read,
924code is reordered to remove
925unconditional branch instructions.
926Conditional branch instructions are inverted
927to prevent the insertion of unconditional branches.
928The loader will also make a copy of a few instructions
929to remove an unconditional branch.
930.LP
931The next pass allocates addresses for
932all external data.
933Typical of processors is the MIPS,
934which can reference ±32K bytes from a
935register.
936The loader allocates the register
937.CW R30
938as the static pointer.
939The value placed in
940.CW R30
941is the base of the data segment plus 32K.
942It is then cheap to reference all data in the
943first 64K of the data segment.
944External variables are allocated to
945the data segment
946with the smallest variables allocated first.
947If all of the data cannot fit into the first
94864K of the data segment,
949then usually only a few large arrays
950need more expensive addressing modes.
951.LP
952For the MIPS processor,
953the loader makes a pass over the internal
954structures,
955exchanging instructions to try
956to fill ``delay slots'' with useful work.
957If a useful instruction cannot be found
958to fill a delay slot,
959the loader will insert
960``noop''
961instructions.
962This pass is very expensive and does not
963do a good job.
964About 40% of all instructions are in
965delay slots.
966About 65% of these are useful instructions and
96735% are ``noops.''
968The vendor-supplied assembler does this job
969more effectively,
970filling about 80%
971of the delay slots with useful instructions.
972.LP
973On the 68020 processor,
974branch instructions come in a variety of
975sizes depending on the relative distance
976of the branch.
977Thus the size of branch instructions
978can be mutually dependent.
979The loader uses a multiple pass algorithm
980to resolve the branch lengths
981[Szy78].
982Initially, all branches are assumed minimal length.
983On each subsequent pass,
984the branches are reassessed
985and expanded if necessary.
986When no more expansions occur,
987the locations of the instructions in
988the text segment are known.
989.LP
990On the MIPS processor,
991all instructions are one size.
992A single pass over the instructions will
993determine the locations of all addresses
994in the text segment.
995.LP
996The last pass of the loader produces the
997executable binary.
998A symbol table and other tables are
999produced to help the debugger to
1000interpret the binary symbolically.
1001.LP
1002The loader places absolute source line numbers in the symbol table.
1003The name and absolute line number of all
1004.CW #include
1005files is also placed in the
1006symbol table so that the debuggers can
1007associate object code to source files.
1008.NH
1009Performance
1010.LP
1011The following is a table of the source size of the MIPS
1012compiler.
1013.DS
1014.ta .1i .6i
1015	lines	module
1016	\0509	machine-independent headers
1017	1070	machine-independent YACC source
1018	6090	machine-independent C source
1019
1020	\0545	machine-dependent headers
1021	6532	machine-dependent C source
1022
1023	\0298	loader headers
1024	5215	loader C source
1025.DE
1026.LP
1027The following table shows timing
1028of a test program
1029that plays checkers, running on a MIPS R4000.
1030The test program is 26 files totaling 12600 lines of C.
1031The execution time does not significantly
1032depend on library implementation.
1033Since no other compiler runs on Plan 9,
1034the Plan 9 tests were done with the Plan 9 operating system;
1035the other tests were done on the vendor's operating system.
1036The hardware was identical in both cases.
1037The optimizer in the vendor's compiler
1038is reputed to be extremely good.
1039.DS
1040.ta .1i .9i
1041	\0\04.49s	Plan 9 \f(CWvc\fP \f(CW-N\fP compile time (opposite of \f(CW-O\fP)
1042	\0\01.72s	Plan 9 \f(CWvc\fP \f(CW-N\fP load time
1043	148.69s	Plan 9 \f(CWvc\fP \f(CW-N\fP run time
1044
1045	\015.07s	Plan 9 \f(CWvc\fP compile time (\f(CW-O\fP implicit)
1046	\0\01.66s	Plan 9 \f(CWvc\fP load time
1047	\089.96s	Plan 9 \f(CWvc\fP run time
1048
1049	\014.83s	vendor \f(CWcc\fP compile time
1050	\0\00.38s	vendor \f(CWcc\fP load time
1051	104.75s	vendor \f(CWcc\fP run time
1052
1053	\043.59s	vendor \f(CWcc\fP \f(CW-O\fP compile time
1054	\0\00.38s	vendor \f(CWcc\fP \f(CW-O\fP load time
1055	\076.19s	vendor \f(CWcc\fP \f(CW-O\fP run time
1056
1057	\0\08.19s	vendor \f(CWcc\fP \f(CW-O3\fP compile time
1058	\035.97s	vendor \f(CWcc\fP \f(CW-O3\fP load time
1059	\071.16s	vendor \f(CWcc\fP \f(CW-O3\fP run time
1060.DE
1061.LP
1062To compare the Intel compiler,
1063a program that is about 40% bit manipulation and
1064about 60% single precision floating point was
1065run on the same 33 MHz 486, once under Windows
1066compiled with the Watcom compiler, version 10.0,
1067in 16-bit mode and once under
1068Plan 9 in 32-bit mode.
1069The Plan 9 execution time was 27 sec while the Windows
1070execution time was 31 sec.
1071.NH
1072Conclusions
1073.LP
1074The new compilers compile
1075quickly,
1076load slowly,
1077and produce
1078medium quality
1079object code.
1080The compilers are relatively
1081portable,
1082requiring but a couple of weeks' work to
1083produce a compiler for a different computer.
1084For Plan 9,
1085where we needed several compilers
1086with specialized features and
1087our own object formats,
1088this project was indispensable.
1089It is also necessary for us to
1090be able to freely distribute our compilers
1091with the Plan 9 distribution.
1092.LP
1093Two problems have come up in retrospect.
1094The first has to do with the
1095division of labor between compiler and loader.
1096Plan 9 runs on multi-processors and as such
1097compilations are often done in parallel.
1098Unfortunately,
1099all compilations must be complete before loading
1100can begin.
1101The load is then single-threaded.
1102With this model,
1103any shift of work from compile to load
1104results in a significant increase in real time.
1105The same is true of libraries that are compiled
1106infrequently and loaded often.
1107In the future,
1108we may try to put some of the loader work
1109back into the compiler.
1110.LP
1111The second problem comes from
1112the various optimizations performed over several
1113passes.
1114Often optimizations in different passes depend
1115on each other.
1116Iterating the passes could compromise efficiency,
1117or even loop.
1118We see no real solution to this problem.
1119.NH
1120References
1121.LP
1122[Aho87] A. V. Aho, R. Sethi, and J. D. Ullman,
1123.I
1124Compilers \- Principles, Techniques, and Tools,
1125.R
1126Addison Wesley,
1127Reading, MA,
11281987.
1129.LP
1130[ANSI90] \f2American National Standard for Information Systems \-
1131Programming Language C\f1, American National Standards Institute, Inc.,
1132New York, 1990.
1133.LP
1134[Dav91] J. W. Davidson and D. B. Whalley,
1135``Methods for Saving and Restoring Register Values across Function Calls'',
1136.I
1137Software\-Practice and Experience,
1138.R
1139Vol 21(2), pp. 149-165, February 1991.
1140.LP
1141[Joh79] S. C. Johnson,
1142``YACC \- Yet Another Compiler Compiler'',
1143.I
1144UNIX Programmer's Manual, Seventh Ed., Vol. 2A,
1145.R
1146AT&T Bell Laboratories,
1147Murray Hill, NJ,
11481979.
1149.LP
1150[Set70] R. Sethi and J. D. Ullman,
1151``The Generation of Optimal Code for Arithmetic Expressions'',
1152.I
1153Journal of the ACM,
1154.R
1155Vol 17(4), pp. 715-728, 1970.
1156.LP
1157[Szy78] T. G. Szymanski,
1158``Assembling Code for Machines with Span-dependent Instructions'',
1159.I
1160Communications of the ACM,
1161.R
1162Vol 21(4), pp. 300-308, 1978.
1163