.TL
Plan 9 C Compilers
.AU
Ken Thompson
ken@plan9.att.com
.AB
.FS
Originally appeared, in a different form, in
.I
Proceedings of the Summer 1990 UKUUG Conference,
.R
pp. 41-51,
London, 1990.
.FE
This paper describes the overall structure and function of the Plan 9 C compilers.
A more detailed implementation document
for any one of the compilers
is yet to be written.
.AE
.NH
Introduction
.LP
There are currently six compilers in the series.
Four of the compilers (MIPS 3000, SPARC, Intel 386, and Motorola 68020)
are considered active and are used to compile
current versions of Plan 9.
The other two (AT&T 3210, Intel 960) have been used
only to program peripherals.
.NH
Structure
.LP
The compiler is a single program that produces an
object file.
Combined in the compiler are the traditional
roles of preprocessor, lexical analyzer, parser, code generator,
local optimizer,
and first half of the assembler.
The object files are binary forms of assembly
language,
similar to what might be passed between
the first and second passes of an assembler.
.LP
Object files and libraries
are combined by a loader
program to produce the executable binary.
The loader combines the roles of second half
of the assembler, global optimizer, and loader.
The names of the compilers, loaders, and assemblers
are as follows:
.DS
.ta 1.5i
.de Ta
\\$1	\f(CW\\$2\fP  \f(CW\\$3\fP  \f(CW\\$4\fP
..
.Ta MIPS vc vl va
.Ta SPARC kc kl ka
.Ta Intel\ 386 8c 8l 8a
.Ta Motorola\ 68020 2c 2l 2a
.Ta AT&T\ 3210 xc xl xa
.Ta Intel\ 960 9c 9l 9a
.DE
There is a further breakdown
in the source of the compilers into
object-independent and
object-dependent
parts.
All of the object-independent parts
are combined into source files in the
directory
.CW /sys/src/cmd/cc .
The object-dependent parts are collected
in a separate directory for each compiler,
for example
.CW /sys/src/cmd/vc .
All of the code,
both object-independent and
object-dependent,
is machine-independent
and may be cross-compiled and executed on any
of the architectures.
.NH
The Language
.LP
The compiler implements ANSI C with some
restrictions and extensions
[ANSI90].
Most of the restrictions are due to
personal preference, while
most of the extensions were to help in
the implementation of Plan 9.
There are other departures from the standard,
particularly in the libraries,
that are beyond the scope of this
paper.
.NH 2
Register, volatile, const
.LP
The keywords
.CW register ,
.CW volatile ,
and
.CW const
are recognized syntactically
but are semantically ignored.
Thus
assigning to a
.CW const
variable is not diagnosed, nor is
taking the address of a
.CW register .
.NH 2
The preprocessor
.LP
The C preprocessor is probably the
biggest departure from the ANSI standard.
.LP
The preprocessor built into the Plan 9 compilers does not support
.CW #if ,
although it does handle
.CW #ifdef
and
.CW #include .
If it is necessary to be more standard,
the source text can first be run through the separate ANSI C
preprocessor,
.CW cpp .
.NH 2
Unnamed substructures
.LP
The most important and most heavily used of the
extensions is the declaration of an
unnamed substructure or subunion.
For example:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	typedef
	struct	lock
	{
		int    locked;
	} Lock;

	typedef
	struct	node
	{
		int	type;
		union
		{
			double dval;
			float  fval;
			long   lval;
		};
		Lock;
	} Node;

	Lock*	lock;
	Node*	node;
.DE
The declaration of
.CW Node
has an unnamed substructure of type
.CW Lock
and an unnamed subunion.
One use of this feature allows elements of the
subunit to be accessed as if they were in
the outer structure.
Thus
.CW node->dval
and
.CW node->locked
are legitimate references.
.LP
When an outer structure is used
in a context that is only legal for
an unnamed substructure,
the compiler promotes the reference to the
unnamed substructure.
This is true for references to structures and
for references to pointers to structures.
This happens in assignment statements and
in argument passing where prototypes have been
declared.
Thus, continuing with the example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	lock = node;
.DE
would assign a pointer to the unnamed
.CW Lock
in
the
.CW Node
to the variable
.CW lock .
Another example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	extern void lock(Lock*);
	func(...)
	{
		...
		lock(node);
		...
	}
.DE
will pass a pointer to the
.CW Lock
substructure.
.NH 2
Structure displays
.LP
A structure cast followed by a list of expressions in braces is
an expression with the type of the structure and elements assigned from
the corresponding list.
Structures are now almost first-class citizens of the language.
It is common to see code like this:
.DS
.CW
.ta .1i
	r = (Rectangle){point1, (Point){x,y+2}};
.DE
.NH 2
Initialization indexes
.LP
In initializers of arrays,
one may place a constant expression
in square brackets before an initializer.
This causes the next initializer to assign
the indicated element.
For example:
.DS
.CW
.ta .1i .6i 1.6i
	enum	errors
	{
		Etoobig,
		Ealarm,
		Egreg
	};
	char* errstrings[] =
	{
		[Ealarm]	"Alarm call",
		[Egreg]	"Panic: out of mbufs",
		[Etoobig]	"Arg list too long",
	};
.DE
In the same way,
individual structure members may
be initialized in any order by preceding the initialization with
.CW .tagname .
Both forms allow an optional
.CW = ,
to be compatible with a proposed
extension to ANSI C.
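.LP
For example, the members of a structure may be
initialized by tag rather than by position;
the structure declared here is illustrative,
not taken from the Plan 9 sources:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	typedef
	struct	point
	{
		int	x;
		int	y;
	} Point;

	Point p =
	{
		.y	3,
		.x	4,
	};
.DE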
.NH 2
External register
.LP
The declaration
.CW extern
.CW register
will dedicate a register to
a variable on a global basis.
It can be used only under special circumstances.
External register variables must be identically
declared in all modules and
libraries.
The feature is not intended for efficiency,
although it can produce efficient code;
rather it represents a unique storage class that
would be hard to get any other way.
On a shared-memory multi-processor,
an external register is
one-per-processor and neither one-per-procedure (automatic)
nor one-per-system (external).
It is used for two variables in the Plan 9 kernel,
.CW u
and
.CW m .
.CW U
is a pointer to the structure representing the currently running process
and
.CW m
is a pointer to the per-machine data structure.
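.LP
A sketch of how such declarations might look;
the structure names here are hypothetical stand-ins
for the kernel's actual definitions:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	extern	register	User*	u;	/* current process */
	extern	register	Mach*	m;	/* per-machine data */
.DE
Since every module and library must repeat the declaration identically,
it naturally belongs in a header file included everywhere.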
.NH 2
Long long
.LP
The compilers accept
.CW long
.CW long
as a basic type meaning 64-bit integer.
On all of the machines
this type is synthesized from 32-bit instructions.
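.LP
As an illustration of the synthesis,
a 64-bit add can be built from two 32-bit adds and a carry test.
The following is C code demonstrating the technique,
with hypothetical names,
not the compiler's generated code:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	typedef struct Vlong Vlong;
	struct Vlong
	{
		unsigned long	lo;
		unsigned long	hi;
	};

	Vlong
	vadd(Vlong a, Vlong b)
	{
		Vlong r;

		r.lo = a.lo + b.lo;
		r.hi = a.hi + b.hi;
		if(r.lo < a.lo)		/* unsigned wraparound means a carry */
			r.hi++;
		return r;
	}
.DE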
.NH 2
Pragma
.LP
The compilers accept
.CW #pragma
.CW lib
.I libname
and pass the
library name string uninterpreted
to the loader.
The loader uses the library name to
find libraries to load.
If the name contains
.CW %O ,
it is replaced with
the single character object type of the compiler
(e.g.,
.CW v
for the MIPS).
If the name contains
.CW %M ,
it is replaced with
the architecture type for the compiler
(e.g.,
.CW mips
for the MIPS).
If the name starts with
.CW /
it is an absolute pathname;
if it starts with
.CW .
then it is searched for in the loader's current directory.
Otherwise, the name is searched from
.CW /%M/lib .
Such
.CW #pragma
statements in header files guarantee that the correct
libraries are always linked with a program without the
need to specify them explicitly at link time.
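.LP
For example, a header file might contain the line
.DS
.CW
.ta .1i
	#pragma	lib	"libbio.a"
.DE
which, by the search rule above, causes the MIPS loader to load
.CW /mips/lib/libbio.a
without the library being named on the command line.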
.NH
Object module conventions
.LP
The overall conventions of the runtime environment
are important
to runtime efficiency.
In this section,
several of these conventions are discussed.
.NH 2
Register saving
.LP
In the Plan 9 compilers,
the caller of a procedure saves the registers.
With caller-saves,
the leaf procedures can use all the
registers and never save them.
If you spend a lot of time at the leaves,
this seems preferable.
With callee-saves,
the saving of the registers is done
in the single point of entry and return.
If you are interested in space,
this seems preferable.
In both,
there is a degree of uncertainty
about what registers need to be saved.
Callee-saved registers make it difficult to
find variables in registers in debuggers.
Callee-saved registers also complicate
the implementation of
.CW longjmp .
The convincing argument is
that with caller-saves,
the decision to registerize a variable
can include the cost of saving the register
across calls.
For a further discussion of caller- vs. callee-saves,
see the paper by Davidson and Whalley [Dav91].
.LP
In the Plan 9 operating system,
calls to the kernel look like normal procedure
calls, which means
the caller
has saved the registers and the system
entry does not have to.
This makes system calls considerably faster.
Since this is a potential security hole,
and can lead to non-determinism,
the system may eventually save the registers
on entry,
or more likely clear the registers on return.
.NH 2
Calling convention
.LP
Older C compilers maintain a frame pointer, which is at a known constant
offset from the stack pointer within each function.
For machines where the stack grows towards zero,
the argument pointer is at a known constant offset
from the frame pointer.
Since the stack grows down in Plan 9,
the Plan 9 compilers
keep neither an
explicit frame pointer nor
an explicit argument pointer;
instead they generate addresses relative to the stack pointer.
.LP
On some architectures, the first argument to a subroutine is passed in a register.
.NH 2
Functions returning structures
.LP
Structures longer than one word are awkward to implement
since they do not fit in registers and must
be passed around in memory.
Functions that return structures
are particularly clumsy.
The Plan 9 compilers pass the address of
the structure to be returned as the first argument of a
function that has a structure return value.
Thus
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	x = f(...)
.DE
is rewritten as
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	f(&x, ...)\f1.
.DE
This saves a copy and makes the compilation
much less clumsy.
A disadvantage is that if you call this
function without an assignment,
a dummy location must be invented.
.LP
There is also a danger of calling a function
that returns a structure without declaring
it as such.
With ANSI C function prototypes,
this error need never occur.
.NH
Implementation
.LP
The compiler is divided internally into
four machine-independent passes,
four machine-dependent passes,
and an output pass.
The next nine sections describe each pass in order.
.NH 2
Parsing
.LP
The first pass is a YACC-based parser
[Joh79].
Declarations are interpreted immediately,
building a block structured symbol table.
Executable statements are put into a parse tree
and collected,
without interpretation.
At the end of each procedure,
the parse tree for the function is
examined by the other passes of the compiler.
.LP
The input stream of the parser is
a pushdown list of input activations.
The preprocessor
expansions of
macros
and
.CW #include
are implemented as pushdowns.
Thus there is no separate
pass for preprocessing.
.NH 2
Typing
.LP
The next pass distributes typing information
to every node of the tree.
Implicit operations on the tree are added,
such as type promotions and taking the
address of arrays and functions.
.NH 2
Machine-independent optimization
.LP
The next pass performs optimizations
and transformations of the tree, such as converting
.CW &*x
and
.CW *&x
into
.CW x .
Constant expressions are converted to constants in this pass.
.NH 2
Arithmetic rewrites
.LP
This is another machine-independent optimization.
Subtrees of add, subtract, and multiply of integers are
rewritten for easier compilation.
The major transformation is factoring:
.CW 4+8*a+16*b+5
is transformed into
.CW 9+8*(a+2*b) .
Such expressions arise from address
manipulation and array indexing.
.NH 2
Addressability
.LP
This is the first of the machine-dependent passes.
The addressability of a processor is defined as the set of
expressions that is legal in the address field
of a machine language instruction.
The addressability of different processors varies widely.
At one end of the spectrum are the 68020 and VAX,
which allow a complex mix of incrementing,
decrementing,
indexing, and relative addressing.
At the other end is the MIPS,
which allows only registers and constant offsets from the
contents of a register.
The addressability can be different for different instructions
within the same processor.
.LP
It is important to the code generator to know when a
subtree represents an address of a particular type.
This is done with a bottom-up walk of the tree.
In this pass, the leaves are labeled with small integers.
When an internal node is encountered,
it is labeled by consulting a table indexed by the
labels on the left and right subtrees.
For example,
on the 68020 processor,
it is possible to address an
offset from a named location.
In C, this is represented by the expression
.CW *(&name+constant) .
This is marked addressable by the following table.
In the table,
a node represented by the left column is marked
with a small integer from the right column.
Marks of the form
.CW A\s-2\di\u\s0
are addressable while
marks of the form
.CW N\s-2\di\u\s0
are not addressable.
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	name	A\s-2\d1\u\s0
	const	A\s-2\d2\u\s0
	&A\s-2\d1\u\s0	A\s-2\d3\u\s0
	A\s-2\d3\u\s0+A\s-2\d1\u\s0	N\s-2\d1\u\s0 \fR(note that this is not addressable)\fP
	*N\s-2\d1\u\s0	A\s-2\d4\u\s0
.DE
Here there is a distinction between
a node marked
.CW A\s-2\d1\u\s0
and a node marked
.CW A\s-2\d4\u\s0
because the address operator of an
.CW A\s-2\d4\u\s0
node is not addressable.
So to extend the table:
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	&A\s-2\d4\u\s0	N\s-2\d2\u\s0
	N\s-2\d2\u\s0+N\s-2\d1\u\s0	N\s-2\d1\u\s0
.DE
The full addressability of the 68020 is expressed
in 18 rules like this,
while the addressability of the MIPS is expressed
in 11 rules.
When one ports the compiler,
this table is usually initialized
so that leaves are labeled as addressable and nothing else.
The code produced is poor,
but porting is easy.
The table can be extended later.
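.LP
The bottom-up labeling can be sketched as follows;
the node representation and the table names are hypothetical,
not the compiler's actual data structures:
.DS
.CW
.ta .1i .6i 1.1i 1.6i 2.1i
	int
	mark(Node *n)
	{
		int l, r;

		if(n->left == nil)
			return leafmark(n);	/* name, const, ... */
		l = mark(n->left);
		if(n->right == nil)
			return untab[n->op][l];
		r = mark(n->right);
		return bintab[n->op][l][r];
	}
.DE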
.LP
This pass also rewrites some complex operators
into procedure calls.
Examples include 64-bit multiply and divide.
.LP
In the same bottom-up pass of the tree,
the nodes are labeled with a Sethi-Ullman complexity
[Set70].
This number is roughly the number of registers required
to compile the tree on an ideal machine.
An addressable node is marked 0.
A function call is marked infinite.
A unary operator is marked as the
maximum of 1 and the mark of its subtree.
A binary operator with equal marks on its subtrees is
marked with a subtree mark plus 1.
A binary operator with unequal marks on its subtrees is
marked with the maximum mark of its subtrees.
The actual values of the marks are not too important,
but the relative values are.
The goal is to compile the harder
(larger mark)
subtree first.
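.LP
The rules above translate directly into a small recursive walk.
The following sketch uses hypothetical names;
.CW INF
stands for the infinite mark:
.DS
.CW
.ta .1i .6i 1.1i 1.6i 2.1i
	enum { INF = 10000 };	/* effectively infinite */

	int
	sucomp(Node *n)
	{
		int l, r;

		if(addressable(n))
			return 0;
		if(n->op == OFUNC)	/* function call */
			return INF;
		l = sucomp(n->left);
		if(n->right == nil)	/* unary operator */
			return l > 1 ? l : 1;
		r = sucomp(n->right);
		if(l == r)
			return l + 1;
		return l > r ? l : r;
	}
.DE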
.NH 2
Code generation
.LP
Code is generated by recursive
descent.
The Sethi-Ullman complexity completely guides the
order.
The addressability defines the leaves.
The only difficult part is compiling a tree
that has two infinite (function call)
subtrees.
In this case,
one subtree is compiled into the return register
(usually the most convenient place for a function call)
and then stored on the stack.
The other subtree is compiled into the return register
and then the operation is compiled with
operands from the stack and the return register.
.LP
There is a separate boolean code generator that compiles
conditional expressions.
This is fundamentally different from compiling an arithmetic expression.
The result of the boolean code generator is the
position of the program counter and not an expression.
The boolean code generator makes extensive use of De Morgan's rule.
The boolean code generator is an expanded version of that described
in chapter 8 of Aho, Sethi, and Ullman
[Aho87].
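.LP
As an illustration of the De Morgan rewriting,
a negated conjunction such as
.DS
.CW
.ta .1i .6i 1.1i
	if(!(a < b && c == d))
		goto out;
.DE
is compiled as if it had been written
.DS
.CW
.ta .1i .6i 1.1i
	if(a >= b || c != d)
		goto out;
.DE
so that each comparison turns directly into a conditional branch
and no boolean value is ever materialized.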
.LP
There is a considerable amount of talk in the literature
about automating this part of a compiler with a machine
description.
Since this code generator is so small
(less than 500 lines of C)
and easy,
it hardly seems worth the effort.
.NH 2
Registerization
.LP
Up to now,
the compiler has operated on syntax trees
that are roughly equivalent to the original source language.
The previous pass has produced machine language in an internal
format.
The next two passes operate on the internal machine language
structures.
The purpose of the next pass is to reintroduce
registers for heavily used variables.
.LP
All of the variables that can be
potentially registerized within a procedure are
placed in a table.
(Suitable variables are any automatic or external
scalars that do not have their addresses extracted.
Some constants that are hard to reference are also
considered for registerization.)
Four separate data flow equations are evaluated
over the procedure on all of these variables.
Two of the equations are the normal set-behind
and used-ahead
bits that define the life of a variable.
The two new bits tell if a variable life
crosses a function call ahead or behind.
By examining a variable over its lifetime,
it is possible to get a cost
for registerizing.
Loops are detected and the costs are multiplied
by three for every level of loop nesting.
Costs are sorted and the variables
are replaced by available registers on a greedy basis.
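.LP
The cost computation might be sketched as follows;
the types, fields, and the save-cost constant are hypothetical,
and the real pass derives this information from the data flow bits:
.DS
.CW
.ta .1i .6i 1.1i 1.6i 2.1i
	enum { SAVECOST = 2 };	/* machine-dependent in practice */

	long
	cost(Var *v)
	{
		Ref *r;
		long c, w;
		int i;

		c = 0;
		for(r = v->refs; r != nil; r = r->next){
			w = 1;
			for(i = 0; i < r->loopdepth; i++)
				w *= 3;	/* three per level of nesting */
			c += w;
		}
		if(v->crossescall)
			c -= SAVECOST;	/* caller must save across calls */
		return c;
	}
.DE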
.LP
The 68020 has two different
types of registers.
For the 68020,
two different costs are calculated for
each variable life and the register type that
affords the better cost is used.
Ties are broken by counting the number of available
registers of each type.
.LP
Note that externals are registerized together with automatics.
This is done by evaluating the semantics of a ``call'' instruction
differently for externals and automatics.
Since a call goes outside the local procedure,
it is assumed that a call references all externals.
Similarly,
externals are assumed to be set before an ``entry'' instruction
and assumed to be referenced after a ``return'' instruction.
This makes sure that externals are in memory across calls.
.LP
The overall results are satisfactory.
It would be nice to be able to do this processing in
a machine-independent way,
but it is impossible to get all of the costs and
side effects of different choices by examining the parse tree.
.LP
Most of the code in the registerization pass is machine-independent.
The major machine-dependency is in
examining a machine instruction to ask if it sets or references
a variable.
.NH 2
Machine code optimization
.LP
The next pass walks the machine code
for opportunistic optimizations.
For the most part,
this is highly specific to a particular
processor.
One optimization that is performed
on all of the processors is the
removal of unnecessary ``move''
instructions.
Ironically,
most of these instructions were inserted by
the previous pass.
There are two patterns that are repetitively
matched and replaced until no more matches are
found.
The first tries to remove ``move'' instructions
by relabeling variables.
.LP
When a ``move'' instruction is encountered,
if the destination variable is set before the
source variable is referenced,
then all of the references to the destination
variable can be renamed to the source and the ``move''
can be deleted.
This transformation uses the reverse data flow
set up in the previous pass.
.LP
An example of this pattern is depicted in the following
table.
The pattern is in the left column and the
replacement action is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	MOVE	a->b		\fR(remove)\fP
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	USE	b		USE	a
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	SET	b		SET	b
.DE
.LP
Experiments have shown that it is marginally
worthwhile to rename uses of the destination variable
with uses of the source variable up to
the first use of the source variable.
.LP
The second transform will do relabeling
without deleting instructions.
When a ``move'' instruction is encountered,
if the source variable has been set prior
to the use of the destination variable
then all of the references to the source
variable are replaced by the destination and
the ``move'' is inverted.
Typically,
this transformation will alter two ``move''
instructions and allow the first transformation
another chance to remove code.
This transformation uses the forward data flow
set up in the previous pass.
.LP
Again,
the following is a depiction of the transformation where
the pattern is in the left column and the
rewrite is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	SET	a		SET	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	USE	a		USE	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	MOVE	a->b		MOVE	b->a
.DE
Iterating these transformations
will usually get rid of all redundant ``move'' instructions.
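.LP
A concrete instance of the first pattern,
written as schematic instructions rather than actual compiler output,
with the original sequence on the left and the result on the right:
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	MOVW	a, b		\fR(removed)\fP
	MOVW	b, c		MOVW	a, c
	MOVW	$0, b		MOVW	$0, b
.DE
Here
.CW b
is set again before
.CW a
is referenced,
so the intervening use of
.CW b
can be renamed to
.CW a
and the first ``move'' deleted.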
.LP
A problem with this organization is that the costs
of registerization calculated in the previous pass
must depend on how well this pass can detect and remove
redundant instructions.
Often,
a fine candidate for registerization is rejected
because of the cost of instructions that are later
removed.
.NH 2
Writing the object file
.LP
The last pass walks the internal assembly language
and writes the object file.
The object file is reduced in size by about a factor
of three with simple compression
techniques.
The most important aspect of the object file
format is that it is independent of the compiling machine.
All integer and floating numbers in the object
code are converted to known formats and byte
orders.
.NH
The loader
.LP
The loader is a multiple pass program that
reads object files and libraries and produces
an executable binary.
The loader also does some minimal
optimizations and code rewriting.
Many of the operations performed by the
loader are machine-dependent.
.LP
The first pass of the loader reads the
object modules into an internal data
structure that looks like binary assembly language.
As the instructions are read,
code is reordered to remove
unconditional branch instructions.
Conditional branch instructions are inverted
to prevent the insertion of unconditional branches.
The loader will also make a copy of a few instructions
to remove an unconditional branch.
.LP
The next pass allocates addresses for
all external data.
Typical of processors is the MIPS,
which can reference ±32K bytes from a
register.
The loader allocates the register
.CW R30
as the static pointer.
The value placed in
.CW R30
is the base of the data segment plus 32K.
It is then cheap to reference all data in the
first 64K of the data segment.
External variables are allocated to
the data segment
with the smallest variables allocated first.
If all of the data cannot fit into the first
64K of the data segment,
then usually only a few large arrays
need more expensive addressing modes.
.LP
For the MIPS processor,
the loader makes a pass over the internal
structures,
exchanging instructions to try
to fill ``delay slots'' with useful work.
If a useful instruction cannot be found
to fill a delay slot,
the loader will insert
``noop''
instructions.
This pass is very expensive and does not
do a good job.
About 40% of all instructions are in
delay slots.
About 65% of these are useful instructions and
35% are ``noops.''
The vendor-supplied assembler does this job
more effectively,
filling about 80%
of the delay slots with useful instructions.
.LP
On the 68020 processor,
branch instructions come in a variety of
sizes depending on the relative distance
of the branch.
Thus the size of branch instructions
can be mutually dependent.
The loader uses a multiple pass algorithm
to resolve the branch lengths
[Szy78].
Initially, all branches are assumed minimal length.
On each subsequent pass,
the branches are reassessed
and expanded if necessary.
When no more expansions occur,
the locations of the instructions in
the text segment are known.
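.LP
A sketch of the expansion loop;
the instruction list representation and the helper functions
are hypothetical:
.DS
.CW
.ta .1i .6i 1.1i 1.6i 2.1i
	void
	span(void)
	{
		Prog *p;
		long pc;
		int again;

		for(p = firstp; p != nil; p = p->link)
			p->size = shortsize(p);	/* assume minimal length */
		do{
			again = 0;
			pc = 0;
			for(p = firstp; p != nil; p = p->link){
				p->pc = pc;
				if(isbranch(p) && branchsize(p) > p->size){
					p->size = branchsize(p);
					again = 1;
				}
				pc += p->size;
			}
		}while(again);
	}
.DE
Because a branch only ever grows, the loop must terminate.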
.LP
On the MIPS processor,
all instructions are one size.
A single pass over the instructions will
determine the locations of all addresses
in the text segment.
.LP
The last pass of the loader produces the
executable binary.
A symbol table and other tables are
produced to help the debugger to
interpret the binary symbolically.
.LP
The loader places absolute source line numbers in the symbol table.
The name and absolute line number of all
.CW #include
files are also placed in the
symbol table so that the debuggers can
associate object code with source files.
.NH
Performance
.LP
The following is a table of the source size of the MIPS
compiler.
.DS
.ta .1i .6i
	lines	module
	\0509	machine-independent headers
	1070	machine-independent YACC source
	6090	machine-independent C source

	\0545	machine-dependent headers
	6532	machine-dependent C source

	\0298	loader headers
	5215	loader C source
.DE
.LP
The following table shows timing
of a test program
that plays checkers, running on a MIPS R4000.
The test program is 26 files totaling 12600 lines of C.
The execution time does not significantly
depend on library implementation.
Since no other compiler runs on Plan 9,
the Plan 9 tests were done with the Plan 9 operating system;
the other tests were done on the vendor's operating system.
The hardware was identical in both cases.
The optimizer in the vendor's compiler
is reputed to be extremely good.
.DS
.ta .1i .9i
	\0\04.49s	Plan 9 \f(CWvc\fP \f(CW-N\fP compile time (opposite of \f(CW-O\fP)
	\0\01.72s	Plan 9 \f(CWvc\fP \f(CW-N\fP load time
	148.69s	Plan 9 \f(CWvc\fP \f(CW-N\fP run time

	\015.07s	Plan 9 \f(CWvc\fP compile time (\f(CW-O\fP implicit)
	\0\01.66s	Plan 9 \f(CWvc\fP load time
	\089.96s	Plan 9 \f(CWvc\fP run time

	\014.83s	vendor \f(CWcc\fP compile time
	\0\00.38s	vendor \f(CWcc\fP load time
	104.75s	vendor \f(CWcc\fP run time

	\043.59s	vendor \f(CWcc\fP \f(CW-O\fP compile time
	\0\00.38s	vendor \f(CWcc\fP \f(CW-O\fP load time
	\076.19s	vendor \f(CWcc\fP \f(CW-O\fP run time

	\0\08.19s	vendor \f(CWcc\fP \f(CW-O3\fP compile time
	\035.97s	vendor \f(CWcc\fP \f(CW-O3\fP load time
	\071.16s	vendor \f(CWcc\fP \f(CW-O3\fP run time
.DE
.LP
To compare the Intel compiler,
a program that is about 40% bit manipulation and
about 60% single precision floating point was
run on the same 33 MHz 486, once under Windows
compiled with the Watcom compiler, version 10.0,
in 16-bit mode and once under
Plan 9 in 32-bit mode.
The Plan 9 execution time was 27 sec while the Windows
execution time was 31 sec.
.NH
Conclusions
.LP
The new compilers compile
quickly,
load slowly,
and produce
medium quality
object code.
The compilers are relatively
portable,
requiring but a couple of weeks' work to
produce a compiler for a different computer.
For Plan 9,
where we needed several compilers
with specialized features and
our own object formats,
this project was indispensable.
It is also necessary for us to
be able to freely distribute our compilers
with the Plan 9 distribution.
.LP
Two problems have come up in retrospect.
The first has to do with the
division of labor between compiler and loader.
Plan 9 runs on multi-processors and as such
compilations are often done in parallel.
Unfortunately,
all compilations must be complete before loading
can begin.
The load is then single-threaded.
With this model,
any shift of work from compile to load
results in a significant increase in real time.
The same is true of libraries that are compiled
infrequently and loaded often.
In the future,
we may try to put some of the loader work
back into the compiler.
.LP
The second problem comes from
the various optimizations performed over several
passes.
Often optimizations in different passes depend
on each other.
Iterating the passes could compromise efficiency,
or even loop.
We see no real solution to this problem.
.NH
References
.LP
[Aho87] A. V. Aho, R. Sethi, and J. D. Ullman,
.I
Compilers \- Principles, Techniques, and Tools,
.R
Addison Wesley,
Reading, MA,
1987.
.LP
[ANSI90] \f2American National Standard for Information Systems \-
Programming Language C\f1, American National Standards Institute, Inc.,
New York, 1990.
.LP
[Dav91] J. W. Davidson and D. B. Whalley,
``Methods for Saving and Restoring Register Values across Function Calls'',
.I
Software\-Practice and Experience,
.R
Vol 21(2), pp. 149-165, February 1991.
.LP
[Joh79] S. C. Johnson,
``YACC \- Yet Another Compiler Compiler'',
.I
UNIX Programmer's Manual, Seventh Ed., Vol. 2A,
.R
AT&T Bell Laboratories,
Murray Hill, NJ,
1979.
.LP
[Set70] R. Sethi and J. D. Ullman,
``The Generation of Optimal Code for Arithmetic Expressions'',
.I
Journal of the ACM,
.R
Vol 17(4), pp. 715-728, 1970.
.LP
[Szy78] T. G. Szymanski,
``Assembling Code for Machines with Span-dependent Instructions'',
.I
Communications of the ACM,
.R
Vol 21(4), pp. 300-308, 1978.