.TL
How to Use the Plan 9 C Compiler
.AU
Rob Pike
rob@plan9.att.com
.SH
Introduction
8.PP
9The C compiler on Plan 9 is a wholly new program; in fact
10it was the first piece of software written for what would
11eventually become Plan 9 from Bell Labs.
12Programmers familiar with existing C compilers will find
13a number of differences in both the language the Plan 9 compiler
14accepts and in how the compiler is used.
15.PP
16The compiler is really a set of compilers, one for each
17architecture \(em MIPS, SPARC, Motorola 68020, Intel 386, etc. \(em
18that accept a dialect of ANSI C and efficiently produce
19fairly good code for the target machine.
20There is a packaging of the compiler that accepts strict ANSI C for
21a POSIX environment, but this document focuses on the
22native Plan 9 environment, that in which all the system source and
23almost all the utilities are written.
24.SH
25Source
26.PP
27The language accepted by the compilers is the core ANSI C language
28with some modest extensions,
29a greatly simplified preprocessor,
30a smaller library that includes system calls and related facilities,
31and a completely different structure for include files.
32.PP
33Official ANSI C accepts the old (K&R) style of declarations for
34functions; the Plan 9 compilers
35are more demanding.
Without an explicit command-line flag
37.CW -B ) (
38whose use is discouraged, the compilers insist
39on new-style function declarations, that is, prototypes for
40function arguments.
41The function declarations in the libraries' include files are
42all in the new style so the interfaces are checked at compile time.
43For C programmers who have not yet switched to function prototypes
44the clumsy syntax may seem repellent but the payoff in stronger typing
45is substantial.
46Those who wish to import existing software to Plan 9 are urged
47to use the opportunity to update their code.
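.PP
As a rough illustration of the difference, here is a portable sketch, not taken from the Plan 9 source.
The function name
.CW scale
is hypothetical.
The old style appears only in the comment; the new style is what the compilers expect.

```c
/*
 * Old (K&R) style, rejected by the Plan 9 compilers unless -B is given:
 *
 *	double scale(x, n)
 *	double x; int n;
 *	{ ... }
 *
 * New style: the prototype carries the argument types, so in a call
 * such as scale(3, 2) the first argument is converted to double and
 * a call with the wrong number of arguments is diagnosed.
 */
double	scale(double x, int n);

/* multiply x by two, n times (illustrative only) */
double
scale(double x, int n)
{
	double r;

	r = x;
	while(n-- > 0)
		r *= 2.0;
	return r;
}
```

With the prototype in scope a caller may write
.CW "scale(3, 2)"
and rely on the compiler to convert the integer constant; under the old style the same call would pass an integer where a double is expected.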
48.PP
49The compilers include an integrated preprocessor that accepts the familiar
50.CW #include ,
51.CW #define
52for macros both with and without arguments,
53.CW #undef ,
54.CW #line ,
55.CW #ifdef ,
56.CW #ifndef ,
57and
58.CW #endif .
59It
60supports neither
61.CW #if
62nor
63.CW ##
64and honors a single
65.CW #pragma .
66The
67.CW #if
68directive was omitted because it greatly complicates the
69preprocessor, is never necessary, and is usually abused.
70Conditional compilation in general makes code hard to understand;
71the Plan 9 source uses it sparingly.
72Also, because the compilers remove dead code, regular
73.CW if
74statements with constant conditions are more readable equivalents to many
75.CW #ifs .
76To compile imported code ineluctably fouled by
77.CW #if
78there is a separate command,
79.CW /bin/cpp ,
80that implements the complete ANSI C preprocessor specification.
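.PP
The point about constant conditions can be sketched as follows.
This is an illustrative fragment in portable C, not code from the system; the names
.CW Debug
and
.CW sum
are assumptions made for the example.

```c
#include <stdio.h>

enum
{
	Debug = 0,	/* hypothetical switch; set to 1 to enable tracing */
};

/* add up the n integers in v, tracing each one when Debug is set */
long
sum(int *v, int n)
{
	long s;
	int i;

	s = 0;
	for(i = 0; i < n; i++){
		if(Debug)	/* constant condition: dead code when Debug is 0 */
			fprintf(stderr, "v[%d] = %d\n", i, v[i]);
		s += v[i];
	}
	return s;
}
```

Unlike the body of an
.CW #ifdef ,
the tracing statement is parsed and type-checked on every compilation, so it cannot silently rot while the switch is off.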
81.PP
82Include files fall into two groups: machine-dependent and machine-independent.
83The machine-independent files occupy the directory
84.CW /sys/include ;
85the others are placed in a directory appropriate to the machine, such as
86.CW /mips/include .
87The compiler searches for include files
88first in the machine-dependent directory and then
89in the machine-independent directory.
90At the time of writing there are twenty-two machine-independent include
91files and three (per machine) machine-dependent ones:
92.CW <ureg.h> ,
93.CW <stdarg.h> ,
94and
95.CW <u.h> .
96The first describes the layout of registers on the system stack,
97for use by the debugger;
98the second, as in ANSI C, defines a portable way to declare variadic
99functions.
100The third defines some
101architecture-dependent types such as
102.CW jmp_buf
103for
104.CW setjmp
105and
106also a set of
107.CW typedef
108abbreviations for
109.CW unsigned
110.CW short
111and so on.
112.PP
113Here is an excerpt from
114.CW /68020/include/u.h :
.P1
typedef	unsigned short	ushort;
typedef	unsigned char	uchar;
typedef unsigned long	ulong;
typedef unsigned int	uint;
typedef   signed char	schar;
typedef	long		vlong;

typedef long	jmp_buf[2];
#define	JMPBUFSP	0
#define	JMPBUFPC	1
#define	JMPBUFDPC	0
.P2
128The type
129.CW vlong
130is the largest integer type available; on some architectures it
131is a 64-bit value.
132The
133.CW #define
134constants permit an architecture-independent (but compiler-dependent)
135implementation of stack-switching using
136.CW setjmp
137and
138.CW longjmp .
139.PP
140Every Plan 9 C program begins
.P1
#include <u.h>
.P2
144because all the other installed header files use the
145.CW typedefs
146declared in
147.CW <u.h> .
148.PP
149In strict ANSI C, include files are grouped to collect related functions
150in a single file: one for string functions, one for memory functions,
151one for I/O, and none for system calls.
152Each include file is protected by an
153.CW #ifdef
154to guarantee its contents are seen by the compiler only once.
155Plan 9 takes a different approach.  Other than a few include
156files that define external formats such as archives, the files in
157.CW /sys/include
158correspond to
159.I libraries.
160If a program is using a library, it includes the corresponding header.
161The default C library comprises string functions, memory functions, and
162so on, largely as in ANSI C, some formatted I/O routines,
163plus all the system calls and related functions.
164To use these functions, one must
165.CW #include
166the file
167.CW <libc.h> ,
168which in turn must follow
169.CW <u.h> ,
170to define their prototypes for the compiler.
171Here is the complete source to the traditional first C program:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	print("hello world\en");
	exits(0);
}
.P2
183The
184.CW print
185routine and its relatives
186.CW fprint
187and
188.CW sprint
189resemble the similarly-named functions in Standard I/O but are not
190attached to a specific I/O library.
191In Plan 9
192.CW main
193is not integer-valued; it should call
194.CW exits ,
195which takes a string argument (or null; here ANSI C promotes the 0 to a
196.CW char* ).
197All these functions are, of course, documented in the Programmer's Manual.
198.PP
199To use
200.CW printf ,
201.CW <stdio.h>
202must be included to define the function prototype for
203.CW printf :
.P1
#include <u.h>
#include <libc.h>
#include <stdio.h>

void
main(int argc, char *argv[])
{
	printf("%s: hello world; argc = %d\en", argv[0], argc);
	exits(0);
}
.P2
216In practice, Standard I/O is not used much in Plan 9.  I/O libraries are
217discussed in a later section of this document.
218.PP
219There are libraries for handling regular expressions, bitmap graphics,
220windows, and so on, and each has an associated include file.
221The manual for each library states which include files are needed.
222The files are not protected against multiple inclusion and themselves
223contain no nested
224.CW #includes .
225Instead the
226programmer is expected to sort out the requirements
227and to
228.CW #include
229the necessary files once at the top of each source file.  In practice this is
230trivial: this way of handling include files is so straightforward
231that it is rare for a source file to contain more than half a dozen
232.CW #includes .
233.PP
234The compilers do their own register allocation so the
235.CW register
236keyword is ignored.
237For different reasons,
238.CW volatile
239and
240.CW const
241are also ignored.
242.PP
243To make it easier to share code with other systems, Plan 9 has a version
244of the compiler,
245.CW pcc ,
246that provides the standard ANSI C preprocessor, headers, and libraries
247with POSIX extensions.
248.CW Pcc
249is recommended only
when broad external portability is mandated.  It compiles more slowly,
251produces slower code (it takes extra work to simulate POSIX on Plan 9),
252eliminates those parts of the Plan 9 interface
253not related to POSIX, and illustrates the clumsiness of an environment
254designed by committee.
255.CW Pcc
256is described in more detail in
257.I
258APE\(emThe ANSI/POSIX Environment,
259.R
260by Howard Trickey.
261.SH
262Process
263.PP
264Each CPU architecture supported by Plan 9 is identified by a single,
265arbitrary, alphanumeric character:
266.CW v
267for MIPS,
268.CW k
269for SPARC,
270.CW x
271for AT&T DSP3210,
272.CW 2
273for Motorola 68020 and 68040,
274.CW 8
275for Intel 386, and
276.CW 6
277for Intel 960.
278The character labels the support tools and files for that architecture.
279For instance, for the 68020 the compiler is
280.CW 2c ,
281the assembler is
282.CW 2a ,
283the link editor/loader is
284.CW 2l ,
285the object files are suffixed
286.CW \&.2 ,
287and the default name for an executable file is
288.CW 2.out .
289Before we can use the compiler we therefore need to know which
290machine we are compiling for.
291The next section explains how this decision is made; for the moment
292assume we are building 68020 binaries and make the mental substitution for
293.CW 2
294appropriate to the machine you are actually using.
295.PP
Converting source to an executable binary is a two-step process.
297First run the compiler,
298.CW 2c ,
299on the source, say
300.CW file.c ,
301to generate an object file
302.CW file.2 .
303Then run the loader,
304.CW 2l ,
305to generate an executable
306.CW 2.out
307that may be run (on a 680X0 machine):
.P1
2c file.c
2l file.2
2.out
.P2
313The loader automatically links with whatever libraries the program
314needs, usually including the standard C library as defined by
315.CW <libc.h> .
316Of course the compiler and loader have lots of options, both familiar and new;
317see the manual for details.
318The compiler does not generate an executable automatically;
319the output of the compiler must be given to the loader.
320Since most compilation is done under the control of
321.CW mk
322(see below), this is rarely an inconvenience.
323.PP
324The distribution of work between the compiler and loader is unusual.
325The compiler integrates preprocessing, parsing, register allocation,
326code generation and some assembly.
327Combining these tasks in a single program is part of the reason for
328the compiler's efficiency.
329The loader does instruction selection, branch folding,
330instruction scheduling,
331and writes the final executable.
332There is no separate C preprocessor and no assembler in the usual pipeline.
333Instead the intermediate object file
334(here a
335.CW \&.2
336file) is a type of binary assembly language.
337The instructions in the intermediate format are not exactly those in
338the machine.  For example, on the 68020 the object file may specify
339a MOVE instruction but the loader will decide just which variant of
340the MOVE instruction \(em MOVE immediate, MOVE quick, MOVE address,
341etc. \(em is most efficient.
342.PP
343The assembler,
344.CW 2a ,
345is just a translator between the textual and binary
346representations of the object file format.
347It is not an assembler in the traditional sense.  It has limited
348macro capabilities (the same as the integral C preprocessor in the compiler),
349clumsy syntax, and minimal error checking.  For instance, the assembler
350will accept an instruction (such as memory-to-memory MOVE on the MIPS) that the
351machine does not actually support; only when the output of the assembler
352is passed to the loader will the error be discovered.
353The assembler is intended only for writing things that need access to instructions
354invisible from C,
355such as the machine-dependent
356part of an operating system;
357very little code in Plan 9 is in assembly language.
358.PP
359The compilers take an option
360.CW -S
361that causes them to print on their standard output the generated code
362in a format acceptable as input to the assemblers.
363This is of course merely a formatting of the
364data in the object file; therefore the assembler is just
365an
366ASCII-to-binary converter for this format.
367Other than the specific instructions, the input to the assemblers
368is largely architecture-independent; see
369``A Manual for the Plan 9 Assembler'',
370by Rob Pike,
371for more information.
372.PP
373The loader is an integral part of the compilation process.
374Each library header file contains a
375.CW #pragma
376that tells the loader the name of the associated archive; it is
377not necessary to tell the loader which libraries a program uses.
378The C run-time startup is found, by default, in the C library.
379The loader starts with an undefined
380symbol,
381.CW _main ,
382that is resolved by pulling in the run-time startup code from the library.
383(The loader undefines
384.CW _mainp
385when profiling is enabled, to force loading of the profiling start-up
386instead.)
387.PP
388Unlike its counterpart on other systems, the Plan 9 loader rearranges
389data to optimize access.  This means the order of variables in the
loaded program is unrelated to their order in the source.
391Most programs don't care, but some assume that, for example, the
392variables declared by
.P1
int a;
int b;
.P2
397will appear at adjacent addresses in memory.  On Plan 9, they won't.
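.PP
A program that really does need two values side by side can say so explicitly.
The following is a minimal sketch in portable C, relying only on the ISO C rules that structure members are laid out in declaration order and array elements are contiguous; the names
.CW Pair ,
.CW ab ,
and
.CW pairgap
are hypothetical.

```c
#include <stddef.h>

/* members of one structure are laid out in declaration order */
struct Pair
{
	int	a;
	int	b;
};

struct Pair pair;

/* elements of an array are contiguous */
int	ab[2];

/* distance in bytes from member a to member b */
long
pairgap(void)
{
	return (long)(offsetof(struct Pair, b) - offsetof(struct Pair, a));
}

/* distance in bytes between the two array elements */
long
arraygap(void)
{
	return (long)((char*)&ab[1] - (char*)&ab[0]);
}
```

Either form states the layout requirement in the source, so it no longer depends on what the loader does with separately declared variables.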
398.SH
399Heterogeneity
400.PP
When the system starts or a user logs in, the environment is configured
402so the appropriate binaries are available in
403.CW /bin .
404The configuration process is controlled by an environment variable,
405.CW $cputype ,
with a value such as
407.CW mips ,
408.CW 68020 ,
409or
410.CW sparc .
411For each architecture there is a directory in the root,
412with the appropriate name,
413that holds the binary and library files for that architecture.
414Thus
415.CW /mips/lib
416contains the object code libraries for MIPS programs,
417.CW /mips/include
418holds MIPS-specific include files, and
419.CW /mips/bin
420has the MIPS binaries.
421These binaries are attached to
422.CW /bin
423at boot time by binding
424.CW /$cputype/bin
425to
426.CW /bin ,
427so
428.CW /bin
429always contains the correct files.
430.PP
431The MIPS compiler,
432.CW vc ,
433by definition
434produces object files for the MIPS architecture,
435regardless of the architecture of the machine on which the compiler is running.
436There is a version of
437.CW vc
438compiled for each architecture:
439.CW /mips/bin/vc ,
440.CW /68020/bin/vc ,
441.CW /sparc/bin/vc ,
442and so on,
443each capable of producing MIPS object files regardless of the native
444instruction set.
445If one is running on a SPARC,
446.CW /sparc/bin/vc
447will compile programs for the MIPS;
448if one is running on machine
449.CW $cputype ,
450.CW /$cputype/bin/vc
451will compile programs for the MIPS.
452.PP
453Because of the bindings that assemble
454.CW /bin ,
455the shell always looks for a command, say
456.CW date ,
457in
458.CW /bin
459and automatically finds the file
460.CW /$cputype/bin/date .
461Therefore the MIPS compiler is known as just
462.CW vc ;
463the shell will invoke
464.CW /bin/vc
465and that is guaranteed to be the version of the MIPS compiler
466appropriate for the machine running the command.
467Regardless of the architecture of the compiling machine,
468.CW /bin/vc
469is
470.I always
471the MIPS compiler.
472.PP
473Also, the output of
474.CW vc
475and
476.CW vl
477is completely independent of the machine type on which they are executed:
478.CW \&.v
479files compiled (with
480.CW vc )
481on a SPARC may be linked (with
482.CW vl )
483on a 386.
484(The resulting
485.CW v.out
486will run, of course, only on a MIPS.)
487Similarly, the MIPS libraries in
488.CW /mips/lib
489are suitable for loading with
490.CW vl
491on any machine; there is only one set of MIPS libraries, not one
492set for each architecture that supports the MIPS compiler.
493.SH
494Heterogeneity and \f(CWmk\fP
495.PP
496Most software on Plan 9 is compiled under the control of
497.CW mk ,
498a descendant of
499.CW make
500that is documented in the Programmer's Manual.
501A convention used throughout the
502.CW mkfiles
503makes it easy to compile the source into binary suitable for any architecture.
504.PP
505The variable
506.CW $cputype
507is advisory: it reports the architecture of the current environment, and should
508not be modified.  A second variable,
509.CW $objtype ,
510is used to set which architecture is being
511.I compiled
512for.
513The value of
514.CW $objtype
515can be used by a
516.CW mkfile
517to configure the compilation environment.
518.PP
519In each machine's root directory there is a short
520.CW mkfile
521that defines a set of macros for the compiler, loader, etc.
522Here is
523.CW /mips/mkfile :
.P1
CC=vc
ALEF=val
LD=vl
O=v
AS=va
OS=2kv86x
CPUS=mips 68020 sparc 386
CFLAGS=
LEX=lex
YACC=yacc
MK=/bin/mk
.P2
537.CW CC
538is obviously the compiler,
539.CW AS
540the assembler, and
541.CW LD
542the loader.
543.CW ALEF
544identifies the Alef compiler, described below.
545.CW O
546is the suffix for the object files and
547.CW CPUS
548and
549.CW OS
550are used in special rules described below.
551.PP
552Here is a
553.CW mkfile
554to build the installed source for
555.CW sam :
.P1
</$objtype/mkfile
OBJ=sam.$O address.$O buffer.$O cmd.$O disc.$O error.$O \e
	file.$O io.$O list.$O mesg.$O moveto.$O multi.$O \e
	plan9.$O rasp.$O regexp.$O string.$O sys.$O xec.$O

$O.out:	$OBJ
	$LD $OBJ

install:	$O.out
	cp $O.out /$objtype/bin/sam

installall:
	for(objtype in $CPUS) mk install

%.$O:	%.c
	$CC $CFLAGS $stem.c

$OBJ:	sam.h errors.h mesg.h
address.$O cmd.$O parse.$O xec.$O unix.$O:	parse.h

clean:V:
	rm -f [$OS].out *.[$OS] y.tab.?
.P2
580(The actual
581.CW mkfile
582imports most of its rules from other secondary files, but
583this example works and is not misleading.)
584The first line causes
585.CW mk
586to include the contents of
587.CW /$objtype/mkfile
588in the current
589.CW mkfile .
590If
591.CW $objtype
592is
593.CW mips ,
594this inserts the MIPS macro definitions into the
595.CW mkfile .
596In this case the rule for
597.CW $O.out
598uses the MIPS tools to build
599.CW v.out .
600The
601.CW %.$O
602rule in the file uses
603.CW mk 's
604pattern matching facilities to convert the source files to the object
605files through the compiler.
606(The text of the rules is passed directly to the shell,
607.CW rc ,
608without further translation.
609See the
610.CW mk
611manual if any of this is unfamiliar.)
612Because the default rule builds
613.CW $O.out
614rather than
615.CW sam ,
616it is possible to maintain binaries for multiple machines in the
617same source directory without conflict.
618This is also, of course, why the output files from the various
619compilers and loaders
620have distinct names.
621.PP
622The rest of the
623.CW mkfile
624should be easy to follow; notice how the rules for
625.CW clean
626and
627.CW installall
628(that is, install versions for all architectures) use other macros
629defined in
630.CW /$objtype/mkfile .
631In Plan 9,
632.CW mkfiles
633for commands conventionally contain rules to
634.CW install
635(compile and install the version for
636.CW $objtype ),
637.CW installall
638(compile and install for all
639.CW $objtypes ),
640and
641.CW clean
642(remove all object files, binaries, etc.).
643.PP
644The
645.CW mkfile
646is easy to use.  To build a MIPS binary,
647.CW v.out :
.P1
% objtype=mips
% mk
.P2
652To build and install a MIPS binary:
.P1
% objtype=mips
% mk install
.P2
657To build and install all versions:
.P1
% mk installall
.P2
661These conventions make cross-compilation as easy to manage
662as traditional native compilation.
663Plan 9 programs compile and run without change on machines from
664large multiprocessors to laptops.  For more information about this process, see
665``Plan 9 Mkfiles'',
666by Bob Flandrena.
667.SH
668Portability
669.PP
670Within Plan 9, it is painless to write portable programs, programs whose
671source is independent of the machine on which they execute.
672The operating system is fixed and the compiler, headers and libraries
673are constant so most of the stumbling blocks to portability are removed.
674Attention to a few details can avoid those that remain.
675.PP
676Plan 9 is a heterogeneous environment, so programs must
677.I expect
678that external files will be written by programs on machines of different
679architectures.
680The compilers, for instance, must handle without confusion
681object files written by other machines.
682The traditional approach to this problem is to pepper the source with
683.CW #ifdefs
684to turn byte-swapping on and off.
685Plan 9 takes a different approach: of the handful of machine-dependent
686.CW #ifdefs
687in all the source, almost all are deep in the libraries.
688Instead programs read and write files in a defined format,
689either (for low volume applications) as formatted text, or
690(for high volume applications) as binary in a known byte order.
691If the external data were written with the most significant
692byte first, the following code reads a 4-byte integer correctly
693regardless of the architecture of the executing machine (assuming
694an unsigned long holds 4 bytes):
.P1
ulong
getlong(void)
{
	ulong l;

	l = (getchar()&0xFF)<<24;
	l |= (getchar()&0xFF)<<16;
	l |= (getchar()&0xFF)<<8;
	l |= (getchar()&0xFF)<<0;
	return l;
}
.P2
708Note that this code does not `swap' the bytes; instead it just reads
709them in the correct order.
710Variations of this code will handle any binary format
711and also avoid problems
712involving how structures are padded, how words are aligned,
713and other impediments to portability.
714Be aware, though, that extra care is needed to handle floating point data.
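.PP
Two such variations might look like this.
The sketch is portable C working on a byte buffer rather than on standard input; the names
.CW putlong ,
.CW getlongbuf ,
and
.CW getlonglittle
are hypothetical, and the types are spelled out instead of using the
.CW <u.h>
abbreviations so the fragment stands alone.

```c
/* store the low 32 bits of v at p, most significant byte first */
void
putlong(unsigned char *p, unsigned long v)
{
	p[0] = (v>>24) & 0xFF;
	p[1] = (v>>16) & 0xFF;
	p[2] = (v>>8) & 0xFF;
	p[3] = v & 0xFF;
}

/* fetch a 4-byte integer stored most significant byte first */
unsigned long
getlongbuf(unsigned char *p)
{
	return ((unsigned long)p[0]<<24) | ((unsigned long)p[1]<<16)
		| ((unsigned long)p[2]<<8) | (unsigned long)p[3];
}

/* the same for data written least significant byte first */
unsigned long
getlonglittle(unsigned char *p)
{
	return (unsigned long)p[0] | ((unsigned long)p[1]<<8)
		| ((unsigned long)p[2]<<16) | ((unsigned long)p[3]<<24);
}
```

None of these routines asks which byte order the executing machine uses; the order is a property of the external format alone.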
715.PP
716Efficiency hounds will argue that this method is unnecessarily slow and clumsy
717when the executing machine has the same byte order (and padding and alignment)
718as the data.
719I/O speed is rarely the bottleneck for an application, however,
720and the gain in simplicity of porting and maintaining the code greatly outweighs
721the minor speed loss from handling data in this general way.
722This method is how the Plan 9 compilers, the window system, and even the file
723servers transmit data between programs.
724.PP
725To port programs beyond Plan 9, where the system interface is more variable,
726it is probably necessary to use
727.CW pcc
728and hope that the target machine supports ANSI C and POSIX.
729.SH
730I/O
731.PP
732The default C library, defined by the include file
733.CW <libc.h> ,
734contains no buffered I/O package.
735It does have several entry points for printing formatted text:
736.CW print
737outputs text to the standard output,
738.CW fprint
739outputs text to a specified integer file descriptor, and
740.CW sprint
741places text in a character array.
742To access library routines for buffered I/O, a program must
743explicitly include the header file associated with an appropriate library.
744.PP
745The recommended I/O library, used by most Plan 9 utilities, is
746.CW bio
747(buffered I/O), defined by
748.CW <bio.h> .
749There also exists an implementation of ANSI Standard I/O,
750.CW stdio .
751.PP
752.CW Bio
753is small and efficient, particularly for buffer-at-a-time or
754line-at-a-time I/O.
755Even for character-at-a-time I/O, however, it is significantly faster than
756the Standard I/O library,
757.CW stdio .
758Its interface is compact and regular, although it lacks a few conveniences.
759The most noticeable is that one must explicitly define buffers for standard
760input and output;
761.CW bio
762does not predefine them.  Here is a program to copy input to output a character
763at a time using
764.CW bio :
.P1
#include <u.h>
#include <libc.h>
#include <bio.h>

Biobuf	bin;
Biobuf	bout;

void
main(void)
{
	int c;

	Binit(&bin, 0, OREAD);
	Binit(&bout, 1, OWRITE);

	while((c=Bgetc(&bin)) != Beof)
		Bputc(&bout, c);
	exits(0);
}
.P2
785For peak performance, we could replace
786.CW Bgetc
787and
788.CW Bputc
789by their equivalent in-line macros
790.CW BGETC
791and
792.CW BPUTC
793but
794the performance gain would be modest.
795For more information on
796.CW bio ,
797see the Programmer's Manual.
798.PP
Perhaps the most dramatic difference between the I/O interface of Plan 9 and that of
other systems is that text is not ASCII.
801The format for
802text in Plan 9 is a byte-stream encoding of 16-bit characters.
803The character set is based on the Unicode Standard and is backward compatible with
804ASCII:
805characters with value 0 through 127 are the same in both sets.
806The 16-bit characters, called
807.I runes
808in Plan 9, are encoded using a representation called
809UTF,
810an encoding that is becoming accepted as a standard.
811(ISO calls it UTF-8;
812throughout Plan 9 it's just called
813UTF.)
814UTF
815defines multibyte sequences to
816represent character values from 0 to 65535.
817In
818UTF,
819character values up to 127 decimal, 7F hexadecimal, represent themselves,
820so straight
821ASCII
822files are also valid
823UTF.
824Also,
825UTF
826guarantees that bytes with values 0 to 127 (NUL to DEL, inclusive)
827will appear only when they represent themselves, so programs that read bytes
828looking for plain ASCII characters will continue to work.
829Any program that expects a one-to-one correspondence between bytes and
830characters will, however, need to be modified.
831An example is parsing file names.
832File names, like all text, are in
833UTF,
834so it is incorrect to search for a character in a string by
835.CW strchr(filename,
836.CW c)
837because the character might have a multi-byte encoding.
838The correct method is to call
839.CW utfrune(filename,
840.CW c) ,
841defined in
842.I rune (2),
843which interprets the file name as a sequence of encoded characters
844rather than bytes.
845In fact, even when you know the character is a single byte
846that can represent only itself,
847it is safer to use
848.CW utfrune
849because that assumes nothing about the character set
850and its representation.
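.PP
The gap between bytes and characters is easy to demonstrate.
The sketch below is portable C, not the Plan 9 library; the name
.CW countrunes
is hypothetical and the routine assumes well-formed UTF input.
It counts characters by skipping continuation bytes, which in UTF have the bit pattern 10xxxxxx.

```c
#include <string.h>

/*
 * Count the characters in a NUL-terminated UTF string by counting
 * the bytes that begin a character.  Continuation bytes, of the
 * form 10xxxxxx, are skipped.  Assumes the input is well-formed.
 */
long
countrunes(const char *s)
{
	long n;

	n = 0;
	for(; *s != '\0'; s++)
		if(((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

For the string holding the UTF encoding of abcÿ,
.CW strlen
reports five bytes while
.CW countrunes
reports four characters.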
851.PP
852The library defines several symbols relevant to the representation of characters.
853Any byte with unsigned value less than
854.CW Runesync
855will not appear in any multi-byte encoding of a character.
856.CW Utfrune
857compares the character being searched against
858.CW Runesync
859to see if it is sufficient to call
860.CW strchr
861or if the byte stream must be interpreted.
Any character with value less than
863.CW Runeself
864is represented by a single byte with the same value.
865Finally, when errors are encountered converting
866to runes from a byte stream, the library returns the rune value
867.CW Runeerror
868and advances a single byte.  This permits programs to find runes
869embedded in binary data.
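.PP
The error rule can be sketched as a small decoder.
This is an illustration in portable C, not the library routine described in
.I rune (2);
the names
.CW Rune16
and
.CW decoderune
are hypothetical, and the numeric values given to
.CW Runeself
and
.CW Runeerror
are assumptions made for the example.

```c
enum
{
	Runeself	= 0x80,		/* assumed: values below this encode as themselves */
	Runeerror	= 0xFFFD,	/* assumed: value reported for a bad sequence */
};

typedef unsigned short Rune16;	/* stand-in for the 16-bit rune type */

/*
 * Decode one character from the NUL-terminated string s into *r and
 * return the number of bytes consumed.  Malformed input yields
 * Runeerror and consumes exactly one byte.
 */
int
decoderune(Rune16 *r, const unsigned char *s)
{
	int c0, c1, c2;
	long v;

	c0 = s[0];
	if(c0 < Runeself){		/* 0xxxxxxx */
		*r = c0;
		return 1;
	}
	if((c0 & 0xE0) == 0xC0){	/* 110xxxxx 10xxxxxx */
		c1 = s[1];
		if((c1 & 0xC0) == 0x80){
			v = ((c0 & 0x1F) << 6) | (c1 & 0x3F);
			if(v >= 0x80){		/* reject overlong forms */
				*r = v;
				return 2;
			}
		}
	}else if((c0 & 0xF0) == 0xE0){	/* 1110xxxx 10xxxxxx 10xxxxxx */
		c1 = s[1];
		if((c1 & 0xC0) == 0x80){
			c2 = s[2];
			if((c2 & 0xC0) == 0x80){
				v = ((c0 & 0x0F) << 12) | ((c1 & 0x3F) << 6) | (c2 & 0x3F);
				if(v >= 0x800){
					*r = v;
					return 3;
				}
			}
		}
	}
	*r = Runeerror;		/* bad sequence: report it and step one byte */
	return 1;
}
```

Because a bad sequence costs only one byte, a caller that loops over a buffer resynchronizes at the next byte that starts a valid character.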
870.PP
871.CW Bio
872includes routines
873.CW Bgetrune
874and
875.CW Bputrune
876to transform the external byte stream
877UTF
878format to and from
879internal 16-bit runes.
880Also, the
881.CW %s
882format to
883.CW print
884accepts
885UTF;
886.CW %c
887prints a character after narrowing it to 8 bits.
888The
889.CW %S
890format prints a null-terminated sequence of runes;
891.CW %C
892prints a character after narrowing it to 16 bits.
893For more information, see the Programmer's Manual, in particular
894.I utf (6)
895and
896.I rune (2),
897and the paper,
898``Hello world, or
899Καλημέρα κόσμε, or\
900\f(Jpこんにちは 世界\f1'',
901by Rob Pike and
902Ken Thompson;
903there is not room for the full story here.
904.PP
905These issues affect the compiler in several ways.
906First, the C source is in
907UTF.
908ANSI says C variables are formed from
909ASCII
910alphanumerics, but comments and literal strings may contain any characters
911encoded in the native encoding, here
912UTF.
913The declaration
.P1
char *cp = "abcÿ";
.P2
917initializes the variable
918.CW cp
919to point to an array of bytes holding the
920UTF
921representation of the characters
.CW abcÿ .
923The type
924.CW Rune
925is defined in
926.CW <u.h>
927to be
928.CW ushort ,
which is also the `wide character' type in the compiler.
930Therefore the declaration
.P1
Rune *rp = L"abcÿ";
.P2
934initializes the variable
935.CW rp
936to point to an array of unsigned short integers holding the 16-bit
937values of the characters
938.CW abcÿ .
939Note that in both these declarations the characters in the source
940that represent
941.CW "abcÿ"
942are the same; what changes is how those characters are represented
943in memory in the program.
944The following two lines:
.P1
print("%s\en", "abcÿ");
print("%S\en", L"abcÿ");
.P2
949produce the same
950UTF
951string on their output, the first by copying the bytes, the second
952by converting from runes to bytes.
953.PP
954In C, character constants are integers but narrowed through the
955.CW char
956type.
957The Unicode character
958.CW ÿ
959has value 255, so if the
960.CW char
961type is signed,
962the constant
963.CW 'ÿ'
964has value \-1 (which is equal to EOF).
965On the other hand,
966.CW L'ÿ'
967narrows through the wide character type,
968.CW ushort ,
969and therefore has value 255.
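.PP
The two narrowings can be imitated with explicit casts.
This is a portable sketch with hypothetical function names; it assumes an 8-bit
.CW char
and a 16-bit
.CW "unsigned short" ,
and the conversion of 255 to a signed 8-bit type is implementation-defined in ISO C, with \-1 being the result on two's-complement machines.

```c
/* value of character code c after narrowing through a signed 8-bit char */
int
narrowchar(int c)
{
	return (signed char)c;
}

/* value of c after narrowing through a 16-bit unsigned wide character */
int
narrowwide(int c)
{
	return (unsigned short)c;
}
```

Applied to 255, the first function yields \-1 and the second yields 255, matching the values described above for the two forms of the constant.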
970.PP
971Finally, although it's not ANSI C, the Plan 9 C compilers
972assume any character with value above
973.CW Runeself
974is an alphanumeric,
975so α is a legal, if non-portable, variable name.
976.SH
977Arguments
978.PP
979Some macros are defined
980in
981.CW <libc.h>
982for parsing the arguments to
983.CW main() .
984They are described in
985.I ARG (2)
986but are fairly self-explanatory.
987There are four macros:
988.CW ARGBEGIN
989and
990.CW ARGEND
991are used to bracket a hidden
992.CW switch
993statement within which
994.CW ARGC
995returns the current option character (rune) being processed and
996.CW ARGF
997returns the argument to the option, as in the loader option
998.CW -o
999.CW file .
1000Here, for example, is the code at the beginning of
1001.CW main()
1002in
1003.CW ramfs.c
1004(see
.I ramfs (4))
1006that cracks its arguments:
.P1
void
main(int argc, char *argv[])
{
	char *defmnt;
	int p[2];
	int mfd[2];
	int stdio = 0;

	defmnt = "/tmp";
	ARGBEGIN{
	case 'i':
		defmnt = 0;
		stdio = 1;
		mfd[0] = 0;
		mfd[1] = 1;
		break;
	case 's':
		defmnt = 0;
		break;
	case 'm':
		defmnt = ARGF();
		break;
	default:
		usage();
	}ARGEND
.P2
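.PP
For readers without the macros at hand, the loop they hide can be approximated in plain C.
This is a sketch of the general idea only, not the expansion of
.CW ARGBEGIN ;
the names
.CW Opts
and
.CW crackargs
are hypothetical, unknown options set a flag instead of calling
.CW usage ,
and option characters are treated as single bytes rather than runes.

```c
#include <stddef.h>

typedef struct Opts Opts;
struct Opts
{
	char	*defmnt;	/* mount point, or NULL */
	int	stdio;
	int	bad;		/* set when an option is unknown or incomplete */
};

/*
 * Process leading options of the form -i, -s and -m dir.
 * Returns the index of the first non-option argument.
 */
int
crackargs(int argc, char *argv[], Opts *o)
{
	int i;
	char *p;

	o->defmnt = "/tmp";
	o->stdio = 0;
	o->bad = 0;
	for(i = 1; i < argc && argv[i][0] == '-' && argv[i][1] != '\0'; i++){
		if(argv[i][1] == '-' && argv[i][2] == '\0'){	/* "--" ends the options */
			i++;
			break;
		}
		for(p = argv[i]+1; *p != '\0'; p++){
			switch(*p){
			case 'i':
				o->defmnt = NULL;
				o->stdio = 1;
				break;
			case 's':
				o->defmnt = NULL;
				break;
			case 'm':
				/* the argument is the rest of this word or the next word */
				if(p[1] != '\0')
					o->defmnt = p+1;
				else if(i+1 < argc)
					o->defmnt = argv[++i];
				else
					o->bad = 1;
				goto nextword;
			default:
				o->bad = 1;
				break;
			}
		}
	nextword:;
	}
	return i;
}
```

The inner
.CW switch
corresponds to the cases written between
.CW ARGBEGIN
and
.CW ARGEND ,
and the code under
.CW "case 'm'"
plays the part of
.CW ARGF .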
1034.SH
1035Extensions
1036.PP
1037The compiler has several extensions to ANSI C, all of which are used
1038extensively in the system source.
1039First,
1040.I structure
1041.I displays
1042permit
1043.CW struct
1044expressions to be formed dynamically.
1045Given these declarations:
.P1
typedef struct Point Point;
typedef struct Rectangle Rectangle;

struct Point
{
	int x, y;
};

struct Rectangle
{
	Point min, max;
};

Point	p, q, add(Point, Point);
Rectangle r;
int	x, y;
.P2
1064this assignment may appear anywhere an assignment is legal:
.P1
r = (Rectangle){add(p, q), (Point){x, y+3}};
.P2
1068The syntax is the same as for initializing a structure but with
1069a leading cast.
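.PP
The 1999 revision of ISO C later standardized the same notation under the name compound literals, so the construct can be tried with any C99 compiler.
The following self-contained sketch repeats the declarations above; the body of
.CW add
and the wrapper
.CW mkrect
are assumptions added so the fragment is complete.

```c
typedef struct Point Point;
typedef struct Rectangle Rectangle;

struct Point
{
	int x, y;
};

struct Rectangle
{
	Point min, max;
};

/* component-wise sum of two points */
Point
add(Point a, Point b)
{
	return (Point){a.x+b.x, a.y+b.y};
}

/* rectangle whose min is p+q and whose max is (x, y+3) */
Rectangle
mkrect(Point p, Point q, int x, int y)
{
	Rectangle r;

	r = (Rectangle){add(p, q), (Point){x, y+3}};
	return r;
}
```

The display is an ordinary expression, so it can equally be passed as an argument or returned directly, as
.CW add
does.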
1070.PP
1071If an
1072.I anonymous
1073.I structure
1074or
1075.I union
1076is declared within another structure or union, the members of the internal
1077structure or union are addressable without prefix in the outer structure.
1078This feature eliminates the clumsy naming of nested structures and,
1079particularly, unions.
1080For example, after these declarations,
.P1
struct Lock
{
	int	locked;
};

struct Node
{
	int	type;
	union{
		double  dval;
		float   fval;
		long    lval;
	};		/* anonymous union */
	struct Lock;	/* anonymous structure */
} *node;

void	lock(struct Lock*);
.P2
1100one may refer to
1101.CW node->type ,
1102.CW node->dval ,
1103.CW node->fval ,
1104.CW node->lval ,
1105and
1106.CW node->locked .
1107Moreover, the address of a
1108.CW struct
1109.CW Node
1110may be used without a cast anywhere that the address of a
1111.CW struct
1112.CW Lock
1113is used, such as in argument lists.
1114The compiler automatically promotes the type and adjusts the address.
1115Thus one may invoke
1116.CW lock(node) .
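.PP
ISO C11 standardized anonymous unions and untagged anonymous structures,
but not the tagged form
.CW "struct Lock;"
nor the automatic promotion of the address.
A portable approximation, offered only as a sketch, gives the embedded lock a name
.CW lk "" (
is a hypothetical member name) and passes its address explicitly.
.P1
```c
#include <assert.h>

struct Lock
{
	int	locked;
};

struct Node
{
	int	type;
	union{
		double	dval;
		long	lval;
	};			/* anonymous union, standard since C11 */
	struct Lock lk;		/* named member replaces the Plan 9 anonymous structure */
};

/* Stand-in body for illustration; a real lock would do more. */
void
lock(struct Lock *l)
{
	l->locked = 1;
}

void
example(void)
{
	struct Node n = {0};

	n.type = 1;
	n.lval = 42;		/* union member reached without a prefix */
	lock(&n.lk);		/* Plan 9 C would accept lock(&n) here */
	assert(n.lval == 42);
	assert(n.lk.locked == 1);
}
```
.P2
The cost of the portable form is the extra
.CW .lk
at each use, which is exactly the naming the Plan 9 extension removes.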
.PP
Anonymous structures and unions may be accessed by type name
if (and only if) they are declared using a
.CW typedef
name.
For example, using the above declaration for
.CW Point ,
one may declare
.P1
struct
{
	int	type;
	Point;
} p;
.P2
and refer to
.CW p.Point .
.PP
In the initialization of arrays, a number in square brackets before an
element sets the index for the initialization.  For example, to initialize
some elements in
a table of function pointers indexed by
ASCII
character,
.P1
void	percent(void), slash(void);

void	(*func[128])(void) =
{
	['%']	percent,
	['/']	slash,
};
.P2
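.PP
ISO C99 offers the same facility as designated initializers, written with an
.CW =
between the index and the value.
A minimal sketch in portable C follows;
.CW dispatch ,
.CW last ,
and the bodies of the two handlers are assumptions made for the example.
.P1
```c
#include <assert.h>
#include <stddef.h>

static int last;	/* records which handler ran, for illustration */

void
percent(void)
{
	last = '%';
}

void
slash(void)
{
	last = '/';
}

/* Elements that are not named are initialized to null pointers. */
void	(*func[128])(void) =
{
	['%'] = percent,
	['/'] = slash,
};

/* Call the handler for c, if any; return 1 if one was called. */
int
dispatch(int c)
{
	if(c < 0 || c >= 128 || func[c] == NULL)
		return 0;
	func[c]();
	return 1;
}

void
example(void)
{
	assert(dispatch('/') == 1 && last == '/');
	assert(dispatch('a') == 0);
}
```
.P2
Because unnamed elements start out null, the table can be sparse and the dispatcher only needs a null check.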
.PP
Finally, the declaration
.P1
extern register reg;
.P2
.I this "" (
appearance of the register keyword is not ignored)
allocates a global register to hold the variable
.CW reg .
External registers must be used carefully: they need to be declared in
.I all
source files and libraries in the program to guarantee the register
is not allocated temporarily for other purposes.
Especially on machines with few registers, such as the i386,
it is easy to link accidentally with code that has already usurped
the global registers and there is no diagnostic when this happens.
Used wisely, though, external registers are powerful.
The Plan 9 operating system uses them to access per-process and
per-machine data structures on a multiprocessor.  The storage class they provide
is hard to create in other ways.
.SH
The compile-time environment
.PP
The code generated by the compilers is `optimized' by default:
variables are placed in registers and peephole optimizations are
performed.
The compiler flag
.CW -N
disables these optimizations.
Registerization is done locally rather than throughout a function:
whether a variable occupies a register or
the memory location identified in the symbol
table depends on the activity of the variable and may change
throughout the life of the variable.
The
.CW -N
flag is rarely needed;
its main use is to simplify debugging.
There is no information in the symbol table to identify the
registerization of a variable, so
.CW -N
guarantees the variable is always where the symbol table says it is.
.PP
Another flag,
.CW -w ,
turns
.I on
warnings about portability and problems detected in flow analysis.
Most code in Plan 9 is compiled with warnings enabled;
these warnings plus the type checking offered by function prototypes
provide most of the support of the Unix tool
.CW lint
more accurately and with less chatter.
Two of the warnings,
`used and not set' and `set and not used', are almost always accurate but
may be triggered spuriously by code with invisible control flow,
such as in routines that call
.CW longjmp .
The compiler statements
.P1
SET(v1);
USED(v2);
.P2
decorate the flow graph to silence the compiler.
Either statement accepts a comma-separated list of variables.
Use them carefully: they may silence real errors.
For the common case of unused parameters to a function,
leaving the name off the declaration silences the warnings.
That is, listing the type of a parameter but giving it no
associated variable name does the trick.
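.PP
Compilers that lack
.CW SET
and
.CW USED
are commonly quieted with a cast to
.CW void .
The sketch below assumes an ordinary ISO C compiler;
.CW IGNORE
is a hypothetical macro, not the Plan 9 statement, and unlike
.CW USED
it takes a single variable.
The function
.CW twice
is likewise invented for the example.
.P1
```c
#include <assert.h>

/*
 * Hypothetical stand-in for USED(): evaluating the variable and
 * discarding the result counts as a use for most ISO C compilers.
 */
#define IGNORE(x)	((void)(x))

/* A callback whose signature is fixed but whose second argument is unneeded. */
int
twice(int n, void *aux)
{
	IGNORE(aux);
	return 2*n;
}

void
example(void)
{
	assert(twice(21, 0) == 42);
}
```
.P2
As with the Plan 9 statements, the cast hides the warning whether or not the variable really ought to have been used.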
.SH
Debugging
.PP
There are two debuggers available on Plan 9.
The first, and older, is
.CW db ,
a revision of Unix
.CW adb .
The other,
.CW acid ,
is a source-level debugger whose commands are statements in
a true programming language.
.CW Acid
is the preferred debugger, but since it
borrows some elements of
.CW db ,
notably the formats for displaying values, it is worth knowing a little bit about
.CW db .
.PP
Both debuggers support multiple architectures in a single program; that is,
the programs are
.CW db
and
.CW acid ,
not for example
.CW vdb
and
.CW vacid .
They also support cross-architecture debugging comfortably:
one may debug a 68020 binary on a MIPS.
.PP
Imagine a program has crashed mysteriously:
.P1
% X11/X
Fatal server bug!
failed to create default stipple
X 106: suicide: sys: trap: fault read addr=0x0 pc=0x00105fb8
%
.P2
When a process dies on Plan 9 it hangs in the `broken' state
for debugging.
Attach a debugger to the process by naming its process id:
.P1
% acid 106
/proc/106/text:mips plan 9 executable

/sys/lib/acid/port
/sys/lib/acid/mips
acid:
.P2
The
.CW acid
function
.CW stk()
reports the stack traceback:
.P1
acid: stk()
At pc:0x105fb8:abort+0x24 /sys/src/ape/lib/ap/stdio/abort.c:6
abort() /sys/src/ape/lib/ap/stdio/abort.c:4
	called from FatalError+#4e
		/sys/src/X/mit/server/dix/misc.c:421
FatalError(s9=#e02, s8=#4901d200, s7=#2, s6=#72701, s5=#1,
    s4=#7270d, s3=#6, s2=#12, s1=#ff37f1c, s0=#6, f=#7270f)
    /sys/src/X/mit/server/dix/misc.c:416
	called from gnotscreeninit+#4ce
		/sys/src/X/mit/server/ddx/gnot/gnot.c:792
gnotscreeninit(snum=#0, sc=#80db0)
    /sys/src/X/mit/server/ddx/gnot/gnot.c:766
	called from AddScreen+#16e
		/n/bootes/sys/src/X/mit/server/dix/main.c:610
AddScreen(pfnInit=0x0000129c,argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/dix/main.c:530
	called from InitOutput+0x80
		/sys/src/X/mit/server/ddx/brazil/brddx.c:522
InitOutput(argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/ddx/brazil/brddx.c:511
	called from main+0x294
		/sys/src/X/mit/server/dix/main.c:225
main(argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/dix/main.c:136
	called from _main+0x24
		/sys/src/ape/lib/ap/mips/main9.s:8
.P2
The function
.CW lstk()
is similar but
also reports the values of local variables.
Note that the traceback includes full file names; this is a boon to debugging,
although it makes the output much noisier.
.PP
To use
.CW acid
well you will need to learn its input language; see the
``Acid Manual'',
by Phil Winterbottom,
for details.  For simple debugging, however, the information in the manual page is
sufficient.  In particular, it describes the most useful functions
for examining a process.
.PP
The compiler does not place
information describing the types of variables in the executable,
but a compile-time flag provides crude support for symbolic debugging.
The
.CW -a
flag to the compiler suppresses code generation
and instead emits source text in the
.CW acid
language to format and display data structure types defined in the program.
The easiest way to use this feature is to put a rule in the
.CW mkfile :
.P1
syms:   main.$O
        $CC -a main.c > syms
.P2
Then from within
.CW acid ,
.P1
acid: include("sourcedirectory/syms")
.P2
to read in the relevant definitions.
(For multi-file source, you need to be a little fancier;
see
.I 2c (1)).
This text includes, for each defined compound
type, a function with that name that may be called with the address of a structure
of that type to display its contents.
For example, if
.CW rect
is a global variable of type
.CW Rectangle ,
one may execute
.P1
Rectangle(*rect)
.P2
to display it.
The
.CW *
(indirection) operator is necessary because
of the way
.CW acid
works: each global symbol in the program is defined as a variable by
.CW acid ,
with value equal to the
.I address
of the symbol.
.PP
Another common technique is to write by hand special
.CW acid
code to define functions to aid debugging, initialize the debugger, and so on.
Conventionally, this is placed in a file called
.CW acid
in the source directory; it has a line
.P1
include("sourcedirectory/syms");
.P2
to load the compiler-produced symbols.  One may edit the compiler output directly but
it is wiser to keep the hand-generated
.CW acid
separate from the machine-generated.
.PP
There is much more to say here.  See the
.CW acid
manual page, the reference manual, or the paper
``Acid: A Debugger Built From A Language'',
also by Phil Winterbottom.
.SH
Alef
.PP
With minor substitutions, most of this document applies to Alef.
The compilers are
.CW val ,
.CW kal ,
and
.CW 8al ;
they work with the usual assemblers and loaders.
There is no Alef compiler for the 68020.
The directory of machine-independent include files is
.CW /sys/include/alef ;
there are no machine-dependent Alef include files.
The libraries are in
.CW /$objtype/lib/alef .
Alef uses
.CW /bin/cpp ,
which is a full ANSI C preprocessor.
Our style of use, however, is the same as in Plan 9 C.
.PP
The Alef compilers don't have the
.CW USED(v)
and
.CW SET(v)
operators; instead say something like
.P1
if(v);
.P2
for
.CW USED
and just set the variable to something benign to silence `used and not set' warnings.
The compilers also permit leaving unused parameters unnamed.
.PP
The compilers support UTF,
although variable names must be plain alphanumeric.
UTF
strings have syntax
.CW $"string"
rather than
.CW L"string" .
.PP
Finally, when debugging, some helpful
.CW acid
may be loaded by supplying the flag
.CW -lalef
when starting
.CW acid .
This code defines
functions to help analyze the state of the run-time system.
For example,
.CW pchan(c)
reports the state of a channel.
Because Alef programs are multi-threaded, they have multiple stacks.
To print the stack trace for a
.CW proc ,
do
.P1
setproc(pid);
stk();
.P2
where
.CW pid
is the Plan 9 process id of the
.CW proc .
Printing the stack trace for a task is clumsier.
In the program, get the `task id'
by calling the run-time function
.CW ALEF_tid
in each task and recording it in a global:
.P1
taskid = ALEF_tid();
.P2
When the program is debugged, the task id
may be passed to an
.CW acid
function to print the stack:
.P1
labstk(*taskid);
.P2
This is of course best done in the private, program-specific
.CW acid
code.