.HTML "How to Use the Plan 9 C Compiler
.TL
How to Use the Plan 9 C Compiler
.AU
Rob Pike
rob@plan9.bell-labs.com
.SH
Introduction
.PP
The C compiler on Plan 9 is a wholly new program; in fact
it was the first piece of software written for what would
eventually become Plan 9 from Bell Labs.
Programmers familiar with existing C compilers will find
a number of differences in both the language the Plan 9 compiler
accepts and in how the compiler is used.
.PP
The compiler is really a set of compilers, one for each
architecture \(em MIPS, SPARC, Motorola 68020, Intel 386, etc. \(em
that accept a dialect of ANSI C and efficiently produce
fairly good code for the target machine.
There is a packaging of the compiler that accepts strict ANSI C for
a POSIX environment, but this document focuses on the
native Plan 9 environment, that in which all the system source and
almost all the utilities are written.
.SH
Source
.PP
The language accepted by the compilers is the core 1989 ANSI C language
with some modest extensions,
a greatly simplified preprocessor,
a smaller library that includes system calls and related facilities,
and a completely different structure for include files.
.PP
Official ANSI C accepts the old (K&R) style of declarations for
functions; the Plan 9 compilers
are more demanding.
Without an explicit command-line flag
.CW -B ) (
whose use is discouraged, the compilers insist
on new-style function declarations, that is, prototypes for
function arguments.
The function declarations in the libraries' include files are
all in the new style so the interfaces are checked at compile time.
For C programmers who have not yet switched to function prototypes
the clumsy syntax may seem repellent but the payoff in stronger typing
is substantial.
Those who wish to import existing software to Plan 9 are urged
to use the opportunity to update their code.
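.PP
As a rough illustration of the difference, here is a small sketch.
The function
.CW scale
and its caller are invented for this example and do not come from the Plan 9 source.
The old-style definition appears only in a comment; the new-style
definition follows it.
With the prototype in scope the compiler converts the integer argument 2 to
.CW double
at the call and can reject a call with the wrong number of arguments.

```c
/*
 * Old (K&R) style, which the Plan 9 compilers reject without -B:
 *
 *	double
 *	scale(v, k)
 *	int v;
 *	double k;
 *	{
 *		return v*k;
 *	}
 */

/* New style: the parameter types are part of the declaration. */
double	scale(int v, double k);

double
scale(int v, double k)
{
	return v*k;
}

/* With the prototype visible, the 2 below is passed as 2.0. */
double
twice3(void)
{
	return scale(3, 2);
}
```

Under the old style the same call would pass an
.CW int
where a
.CW double
is expected, and nothing would complain until the program ran.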
.PP
The compilers include an integrated preprocessor that accepts the familiar
.CW #include ,
.CW #define
for macros both with and without arguments,
.CW #undef ,
.CW #line ,
.CW #ifdef ,
.CW #ifndef ,
and
.CW #endif .
It
supports neither
.CW #if
nor
.CW ## ,
although it does
honor a few
.CW #pragmas .
The
.CW #if
directive was omitted because it greatly complicates the
preprocessor, is never necessary, and is usually abused.
Conditional compilation in general makes code hard to understand;
the Plan 9 source uses it sparingly.
Also, because the compilers remove dead code, regular
.CW if
statements with constant conditions are more readable equivalents to many
.CW #ifs .
To compile imported code ineluctably fouled by
.CW #if
there is a separate command,
.CW /bin/cpp ,
that implements the complete ANSI C preprocessor specification.
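.PP
The following sketch shows the style of a constant condition replacing
.CW #if .
It is written in portable C so it can be tried anywhere; the names
.CW Debug
and
.CW sum
are invented for the illustration.
When
.CW Debug
is zero the tracing call is dead code, yet it is still parsed and
type-checked on every compilation.

```c
#include <stdio.h>

enum
{
	Debug = 0,	/* set to 1 to compile the tracing in */
};

/* Sum the bytes of buf; trace each step only when Debug is set. */
unsigned long
sum(const unsigned char *buf, int n)
{
	unsigned long s;
	int i;

	s = 0;
	for(i = 0; i < n; i++){
		s += buf[i];
		if(Debug)	/* constant condition: dead code when Debug is 0 */
			fprintf(stderr, "sum: i=%d s=%lu\n", i, s);
	}
	return s;
}
```

Unlike text hidden behind
.CW #if ,
the tracing code cannot silently rot when the surrounding declarations change.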
.PP
Include files fall into two groups: machine-dependent and machine-independent.
The machine-independent files occupy the directory
.CW /sys/include ;
the others are placed in a directory appropriate to the machine, such as
.CW /mips/include .
The compiler searches for include files
first in the machine-dependent directory and then
in the machine-independent directory.
At the time of writing there are thirty-one machine-independent include
files and two (per machine) machine-dependent ones:
.CW <ureg.h>
and
.CW <u.h> .
The first describes the layout of registers on the system stack,
for use by the debugger.
The second defines some
architecture-dependent types such as
.CW jmp_buf
for
.CW setjmp
and the
.CW va_arg
and
.CW va_list
macros for handling arguments to variadic functions,
as well as a set of
.CW typedef
abbreviations for
.CW unsigned
.CW short
and so on.
.PP
Here is an excerpt from
.CW /68020/include/u.h :
.P1
#define nil		((void*)0)
typedef	unsigned short	ushort;
typedef	unsigned char	uchar;
typedef unsigned long	ulong;
typedef unsigned int	uint;
typedef   signed char	schar;
typedef	long long       vlong;

typedef long	jmp_buf[2];
#define	JMPBUFSP	0
#define	JMPBUFPC	1
#define	JMPBUFDPC	0
.P2
Plan 9 programs use
.CW nil
for the name of the zero-valued pointer.
The type
.CW vlong
is the largest integer type available; on most architectures it
is a 64-bit value.
A couple of other types in
.CW <u.h>
are
.CW u32int ,
which is guaranteed to have exactly 32 bits (a possibility on all the supported architectures) and
.CW mpdigit ,
which is used by the multiprecision math package
.CW <mp.h> .
The
.CW #define
constants permit an architecture-independent (but compiler-dependent)
implementation of stack-switching using
.CW setjmp
and
.CW longjmp .
.PP
Every Plan 9 C program begins
.P1
#include <u.h>
.P2
because all the other installed header files use the
.CW typedefs
declared in
.CW <u.h> .
.PP
In strict ANSI C, include files are grouped to collect related functions
in a single file: one for string functions, one for memory functions,
one for I/O, and none for system calls.
Each include file is protected by an
.CW #ifdef
to guarantee its contents are seen by the compiler only once.
Plan 9 takes a different approach.  Other than a few include
files that define external formats such as archives, the files in
.CW /sys/include
correspond to
.I libraries.
If a program is using a library, it includes the corresponding header.
The default C library comprises string functions, memory functions, and
so on, largely as in ANSI C, some formatted I/O routines,
plus all the system calls and related functions.
To use these functions, one must
.CW #include
the file
.CW <libc.h> ,
which in turn must follow
.CW <u.h> ,
to define their prototypes for the compiler.
Here is the complete source to the traditional first C program:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	print("hello world\en");
	exits(0);
}
.P2
The
.CW print
routine and its relatives
.CW fprint
and
.CW sprint
resemble the similarly-named functions in Standard I/O but are not
attached to a specific I/O library.
In Plan 9
.CW main
is not integer-valued; it should call
.CW exits ,
which takes a string argument (or null; here ANSI C promotes the 0 to a
.CW char* ).
All these functions are, of course, documented in the Programmer's Manual.
.PP
To use
.CW printf ,
.CW <stdio.h>
must be included to define the function prototype for
.CW printf :
.P1
#include <u.h>
#include <libc.h>
#include <stdio.h>

void
main(int argc, char *argv[])
{
	printf("%s: hello world; argc = %d\en", argv[0], argc);
	exits(0);
}
.P2
In practice, Standard I/O is not used much in Plan 9.  I/O libraries are
discussed in a later section of this document.
.PP
There are libraries for handling regular expressions, raster graphics,
windows, and so on, and each has an associated include file.
The manual for each library states which include files are needed.
The files are not protected against multiple inclusion and themselves
contain no nested
.CW #includes .
Instead the
programmer is expected to sort out the requirements
and to
.CW #include
the necessary files once at the top of each source file.  In practice this is
trivial: this way of handling include files is so straightforward
that it is rare for a source file to contain more than half a dozen
.CW #includes .
.PP
The compilers do their own register allocation so the
.CW register
keyword is ignored.
For different reasons,
.CW volatile
and
.CW const
are also ignored.
.PP
To make it easier to share code with other systems, Plan 9 has a version
of the compiler,
.CW pcc ,
that provides the standard ANSI C preprocessor, headers, and libraries
with POSIX extensions.
.CW Pcc
is recommended only
when broad external portability is mandated.  It compiles more slowly,
produces slower code (it takes extra work to simulate POSIX on Plan 9),
eliminates those parts of the Plan 9 interface
not related to POSIX, and illustrates the clumsiness of an environment
designed by committee.
.CW Pcc
is described in more detail in
.I
APE\(emThe ANSI/POSIX Environment,
.R
by Howard Trickey.
.SH
Process
.PP
Each CPU architecture supported by Plan 9 is identified by a single,
arbitrary, alphanumeric character:
.CW k
for SPARC,
.CW q
for Motorola Power PC 630 and 640,
.CW v
for MIPS,
.CW 0
for little-endian MIPS,
.CW 1
for Motorola 68000,
.CW 2
for Motorola 68020 and 68040,
.CW 5
for Acorn ARM 7500,
.CW 6
for AMD 64,
.CW 7
for DEC Alpha,
.CW 8
for Intel 386, and
.CW 9
for AMD 29000.
The character labels the support tools and files for that architecture.
For instance, for the 68020 the compiler is
.CW 2c ,
the assembler is
.CW 2a ,
the link editor/loader is
.CW 2l ,
the object files are suffixed
.CW \&.2 ,
and the default name for an executable file is
.CW 2.out .
Before we can use the compiler we therefore need to know which
machine we are compiling for.
The next section explains how this decision is made; for the moment
assume we are building 68020 binaries and make the mental substitution for
.CW 2
appropriate to the machine you are actually using.
.PP
Converting source to an executable binary is a two-step process.
First run the compiler,
.CW 2c ,
on the source, say
.CW file.c ,
to generate an object file
.CW file.2 .
Then run the loader,
.CW 2l ,
to generate an executable
.CW 2.out
that may be run (on a 680X0 machine):
.P1
2c file.c
2l file.2
2.out
.P2
The loader automatically links with whatever libraries the program
needs, usually including the standard C library as defined by
.CW <libc.h> .
Of course the compiler and loader have lots of options, both familiar and new;
see the manual for details.
The compiler does not generate an executable automatically;
the output of the compiler must be given to the loader.
Since most compilation is done under the control of
.CW mk
(see below), this is rarely an inconvenience.
.PP
The distribution of work between the compiler and loader is unusual.
The compiler integrates preprocessing, parsing, register allocation,
code generation and some assembly.
Combining these tasks in a single program is part of the reason for
the compiler's efficiency.
The loader does instruction selection, branch folding,
instruction scheduling,
and writes the final executable.
There is no separate C preprocessor and no assembler in the usual pipeline.
Instead the intermediate object file
(here a
.CW \&.2
file) is a type of binary assembly language.
The instructions in the intermediate format are not exactly those in
the machine.  For example, on the 68020 the object file may specify
a MOVE instruction but the loader will decide just which variant of
the MOVE instruction \(em MOVE immediate, MOVE quick, MOVE address,
etc. \(em is most efficient.
.PP
The assembler,
.CW 2a ,
is just a translator between the textual and binary
representations of the object file format.
It is not an assembler in the traditional sense.  It has limited
macro capabilities (the same as the integral C preprocessor in the compiler),
clumsy syntax, and minimal error checking.  For instance, the assembler
will accept an instruction (such as memory-to-memory MOVE on the MIPS) that the
machine does not actually support; only when the output of the assembler
is passed to the loader will the error be discovered.
The assembler is intended only for writing things that need access to instructions
invisible from C,
such as the machine-dependent
part of an operating system;
very little code in Plan 9 is in assembly language.
.PP
The compilers take an option
.CW -S
that causes them to print on their standard output the generated code
in a format acceptable as input to the assemblers.
This is of course merely a formatting of the
data in the object file; therefore the assembler is just
an
ASCII-to-binary converter for this format.
Other than the specific instructions, the input to the assemblers
is largely architecture-independent; see
``A Manual for the Plan 9 Assembler'',
by Rob Pike,
for more information.
.PP
The loader is an integral part of the compilation process.
Each library header file contains a
.CW #pragma
that tells the loader the name of the associated archive; it is
not necessary to tell the loader which libraries a program uses.
The C run-time startup is found, by default, in the C library.
The loader starts with an undefined
symbol,
.CW _main ,
that is resolved by pulling in the run-time startup code from the library.
(The loader undefines
.CW _mainp
when profiling is enabled, to force loading of the profiling start-up
instead.)
.PP
Unlike its counterpart on other systems, the Plan 9 loader rearranges
data to optimize access.  This means the order of variables in the
loaded program is unrelated to their order in the source.
Most programs don't care, but some assume that, for example, the
variables declared by
.P1
int a;
int b;
.P2
will appear at adjacent addresses in memory.  On Plan 9, they won't.
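.PP
A program that really needs two values to sit side by side can say so
by declaring them as one object: the elements of an array and the members of a
structure keep their declared order in standard C, whatever the loader does with separate variables.
The sketch below uses invented names
.CW Pair , (
.CW tab ,
.CW total )
and is an illustration, not code from the system.

```c
/* Hypothetical names; the point is the grouping, not the contents. */
struct Pair
{
	int a;
	int b;		/* b follows a within the structure */
};

struct Pair pair = { 1, 2 };
int tab[2] = { 1, 2 };	/* tab[1] is always found at tab+1 */

/* Walk n adjacent ints starting at p. */
int
total(int *p, int n)
{
	int i, t;

	t = 0;
	for(i = 0; i < n; i++)
		t += p[i];
	return t;
}
```

A call such as
.CW total(tab,
.CW 2)
is then well defined, whereas walking from the address of a separate variable
.CW a
in the hope of reaching
.CW b
is not.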
.SH
Heterogeneity
.PP
When the system starts or a user logs in, the environment is configured
so the appropriate binaries are available in
.CW /bin .
The configuration process is controlled by an environment variable,
.CW $cputype ,
with a value such as
.CW mips ,
.CW 68020 ,
.CW 386 ,
or
.CW sparc .
For each architecture there is a directory in the root,
with the appropriate name,
that holds the binary and library files for that architecture.
Thus
.CW /mips/lib
contains the object code libraries for MIPS programs,
.CW /mips/include
holds MIPS-specific include files, and
.CW /mips/bin
has the MIPS binaries.
These binaries are attached to
.CW /bin
at boot time by binding
.CW /$cputype/bin
to
.CW /bin ,
so
.CW /bin
always contains the correct files.
.PP
The MIPS compiler,
.CW vc ,
by definition
produces object files for the MIPS architecture,
regardless of the architecture of the machine on which the compiler is running.
There is a version of
.CW vc
compiled for each architecture:
.CW /mips/bin/vc ,
.CW /68020/bin/vc ,
.CW /sparc/bin/vc ,
and so on,
each capable of producing MIPS object files regardless of the native
instruction set.
If one is running on a SPARC,
.CW /sparc/bin/vc
will compile programs for the MIPS;
in general, if one is running on a machine of type
.CW $cputype ,
.CW /$cputype/bin/vc
will compile programs for the MIPS.
.PP
Because of the bindings that assemble
.CW /bin ,
the shell always looks for a command, say
.CW date ,
in
.CW /bin
and automatically finds the file
.CW /$cputype/bin/date .
Therefore the MIPS compiler is known as just
.CW vc ;
the shell will invoke
.CW /bin/vc
and that is guaranteed to be the version of the MIPS compiler
appropriate for the machine running the command.
Regardless of the architecture of the compiling machine,
.CW /bin/vc
is
.I always
the MIPS compiler.
.PP
Also, the output of
.CW vc
and
.CW vl
is completely independent of the machine type on which they are executed:
.CW \&.v
files compiled (with
.CW vc )
on a SPARC may be linked (with
.CW vl )
on a 386.
(The resulting
.CW v.out
will run, of course, only on a MIPS.)
Similarly, the MIPS libraries in
.CW /mips/lib
are suitable for loading with
.CW vl
on any machine; there is only one set of MIPS libraries, not one
set for each architecture that supports the MIPS compiler.
.SH
Heterogeneity and \f(CWmk\fP
.PP
Most software on Plan 9 is compiled under the control of
.CW mk ,
a descendant of
.CW make
that is documented in the Programmer's Manual.
A convention used throughout the
.CW mkfiles
makes it easy to compile the source into binary suitable for any architecture.
.PP
The variable
.CW $cputype
is advisory: it reports the architecture of the current environment, and should
not be modified.  A second variable,
.CW $objtype ,
is used to set which architecture is being
.I compiled
for.
The value of
.CW $objtype
can be used by a
.CW mkfile
to configure the compilation environment.
.PP
In each machine's root directory there is a short
.CW mkfile
that defines a set of macros for the compiler, loader, etc.
Here is
.CW /mips/mkfile :
.P1
</sys/src/mkfile.proto

CC=vc
LD=vl
O=v
AS=va
.P2
The line
.P1
</sys/src/mkfile.proto
.P2
causes
.CW mk
to include the file
.CW /sys/src/mkfile.proto ,
which contains general definitions:
.P1
#
# common mkfile parameters shared by all architectures
#

OS=v486xq7
CPUS=mips 386 power alpha
CFLAGS=-FVw
LEX=lex
YACC=yacc
MK=/bin/mk
.P2
.CW CC
is obviously the compiler,
.CW AS
the assembler, and
.CW LD
the loader.
.CW O
is the suffix for the object files and
.CW CPUS
and
.CW OS
are used in special rules described below.
.PP
Here is a
.CW mkfile
to build the installed source for
.CW sam :
.P1
</$objtype/mkfile
OBJ=sam.$O address.$O buffer.$O cmd.$O disc.$O error.$O \e
	file.$O io.$O list.$O mesg.$O moveto.$O multi.$O \e
	plan9.$O rasp.$O regexp.$O string.$O sys.$O xec.$O

$O.out:	$OBJ
	$LD $OBJ

install:	$O.out
	cp $O.out /$objtype/bin/sam

installall:
	for(objtype in $CPUS) mk install

%.$O:	%.c
	$CC $CFLAGS $stem.c

$OBJ:	sam.h errors.h mesg.h
address.$O cmd.$O parse.$O xec.$O unix.$O:	parse.h

clean:V:
	rm -f [$OS].out *.[$OS] y.tab.?
.P2
(The actual
.CW mkfile
imports most of its rules from other secondary files, but
this example works and is not misleading.)
The first line causes
.CW mk
to include the contents of
.CW /$objtype/mkfile
in the current
.CW mkfile .
If
.CW $objtype
is
.CW mips ,
this inserts the MIPS macro definitions into the
.CW mkfile .
In this case the rule for
.CW $O.out
uses the MIPS tools to build
.CW v.out .
The
.CW %.$O
rule in the file uses
.CW mk 's
pattern matching facilities to convert the source files to the object
files through the compiler.
(The text of the rules is passed directly to the shell,
.CW rc ,
without further translation.
See the
.CW mk
manual if any of this is unfamiliar.)
Because the default rule builds
.CW $O.out
rather than
.CW sam ,
it is possible to maintain binaries for multiple machines in the
same source directory without conflict.
This is also, of course, why the output files from the various
compilers and loaders
have distinct names.
.PP
The rest of the
.CW mkfile
should be easy to follow; notice how the rules for
.CW clean
and
.CW installall
(that is, install versions for all architectures) use other macros
defined in
.CW /$objtype/mkfile .
In Plan 9,
.CW mkfiles
for commands conventionally contain rules to
.CW install
(compile and install the version for
.CW $objtype ),
.CW installall
(compile and install for all
.CW $objtypes ),
and
.CW clean
(remove all object files, binaries, etc.).
.PP
The
.CW mkfile
is easy to use.  To build a MIPS binary,
.CW v.out :
.P1
% objtype=mips
% mk
.P2
To build and install a MIPS binary:
.P1
% objtype=mips
% mk install
.P2
To build and install all versions:
.P1
% mk installall
.P2
These conventions make cross-compilation as easy to manage
as traditional native compilation.
Plan 9 programs compile and run without change on machines from
large multiprocessors to laptops.  For more information about this process, see
``Plan 9 Mkfiles'',
by Bob Flandrena.
.SH
Portability
.PP
Within Plan 9, it is painless to write portable programs, programs whose
source is independent of the machine on which they execute.
The operating system is fixed and the compiler, headers and libraries
are constant so most of the stumbling blocks to portability are removed.
Attention to a few details can avoid those that remain.
.PP
Plan 9 is a heterogeneous environment, so programs must
.I expect
that external files will be written by programs on machines of different
architectures.
The compilers, for instance, must handle without confusion
object files written by other machines.
The traditional approach to this problem is to pepper the source with
.CW #ifdefs
to turn byte-swapping on and off.
Plan 9 takes a different approach: of the handful of machine-dependent
.CW #ifdefs
in all the source, almost all are deep in the libraries.
Instead programs read and write files in a defined format,
either (for low volume applications) as formatted text, or
(for high volume applications) as binary in a known byte order.
If the external data were written with the most significant
byte first, the following code reads a 4-byte integer correctly
regardless of the architecture of the executing machine (assuming
an unsigned long holds 4 bytes):
.P1
ulong
getlong(void)
{
	ulong l;

	l = (getchar()&0xFF)<<24;
	l |= (getchar()&0xFF)<<16;
	l |= (getchar()&0xFF)<<8;
	l |= (getchar()&0xFF)<<0;
	return l;
}
.P2
Note that this code does not `swap' the bytes; instead it just reads
them in the correct order.
Variations of this code will handle any binary format
and also avoid problems
involving how structures are padded, how words are aligned,
and other impediments to portability.
Be aware, though, that extra care is needed to handle floating point data.
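.PP
Two such variations are sketched here: the matching routine for writing,
and a reader for data stored least significant byte first.
They work on a buffer rather than on standard input so they are easy to test.
The names
.CW putbe32 ,
.CW getbe32 ,
and
.CW getle32
are invented for this sketch, and each byte is widened to
.CW unsigned
.CW long
before shifting so the code does not depend on the size of
.CW int .

```c
/* Store v in buf as 4 bytes, most significant byte first. */
void
putbe32(unsigned char *buf, unsigned long v)
{
	buf[0] = (v>>24) & 0xFF;
	buf[1] = (v>>16) & 0xFF;
	buf[2] = (v>>8) & 0xFF;
	buf[3] = v & 0xFF;
}

/* Read 4 bytes stored most significant byte first. */
unsigned long
getbe32(const unsigned char *buf)
{
	return ((unsigned long)buf[0]<<24)
		| ((unsigned long)buf[1]<<16)
		| ((unsigned long)buf[2]<<8)
		| (unsigned long)buf[3];
}

/* Read 4 bytes stored least significant byte first. */
unsigned long
getle32(const unsigned char *buf)
{
	return ((unsigned long)buf[3]<<24)
		| ((unsigned long)buf[2]<<16)
		| ((unsigned long)buf[1]<<8)
		| (unsigned long)buf[0];
}
```

None of these routines asks what the byte order of the executing machine is;
only the order of the external format appears in the code.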
.PP
Efficiency hounds will argue that this method is unnecessarily slow and clumsy
when the executing machine has the same byte order (and padding and alignment)
as the data.
The CPU cost of I/O processing
is rarely the bottleneck for an application, however,
and the gain in simplicity of porting and maintaining the code greatly outweighs
the minor speed loss from handling data in this general way.
This method is how the Plan 9 compilers, the window system, and even the file
servers transmit data between programs.
.PP
To port programs beyond Plan 9, where the system interface is more variable,
it is probably necessary to use
.CW pcc
and hope that the target machine supports ANSI C and POSIX.
.SH
I/O
.PP
The default C library, defined by the include file
.CW <libc.h> ,
contains no buffered I/O package.
It does have several entry points for printing formatted text:
.CW print
outputs text to the standard output,
.CW fprint
outputs text to a specified integer file descriptor, and
.CW sprint
places text in a character array.
To access library routines for buffered I/O, a program must
explicitly include the header file associated with an appropriate library.
.PP
The recommended I/O library, used by most Plan 9 utilities, is
.CW bio
(buffered I/O), defined by
.CW <bio.h> .
There also exists an implementation of ANSI Standard I/O,
.CW stdio .
.PP
.CW Bio
is small and efficient, particularly for buffer-at-a-time or
line-at-a-time I/O.
Even for character-at-a-time I/O, however, it is significantly faster than
the Standard I/O library,
.CW stdio .
Its interface is compact and regular, although it lacks a few conveniences.
The most noticeable is that one must explicitly define buffers for standard
input and output;
.CW bio
does not predefine them.  Here is a program to copy input to output a byte
at a time using
.CW bio :
.P1
#include <u.h>
#include <libc.h>
#include <bio.h>

Biobuf	bin;
Biobuf	bout;

void
main(void)
{
	int c;

	Binit(&bin, 0, OREAD);
	Binit(&bout, 1, OWRITE);

	while((c=Bgetc(&bin)) != Beof)
		Bputc(&bout, c);
	exits(0);
}
.P2
For peak performance, we could replace
.CW Bgetc
and
.CW Bputc
by their equivalent in-line macros
.CW BGETC
and
.CW BPUTC
but
the performance gain would be modest.
For more information on
.CW bio ,
see the Programmer's Manual.
.PP
Perhaps the most dramatic difference between the I/O interface of Plan 9
and those of other systems is that text is not ASCII.
The format for
text in Plan 9 is a byte-stream encoding of 16-bit characters.
The character set is based on the Unicode Standard and is backward compatible with
ASCII:
characters with value 0 through 127 are the same in both sets.
The 16-bit characters, called
.I runes
in Plan 9, are encoded using a representation called
UTF,
an encoding that is becoming accepted as a standard.
(ISO calls it UTF-8;
throughout Plan 9 it's just called
UTF.)
UTF
defines multibyte sequences to
represent character values from 0 to 65535.
In
UTF,
character values up to 127 decimal, 7F hexadecimal, represent themselves,
so straight
ASCII
files are also valid
UTF.
Also,
UTF
guarantees that bytes with values 0 to 127 (NUL to DEL, inclusive)
will appear only when they represent themselves, so programs that read bytes
looking for plain ASCII characters will continue to work.
Any program that expects a one-to-one correspondence between bytes and
characters will, however, need to be modified.
An example is parsing file names.
File names, like all text, are in
UTF,
so it is incorrect to search for a character in a string by
.CW strchr(filename,
.CW c)
because the character might have a multi-byte encoding.
The correct method is to call
.CW utfrune(filename,
.CW c) ,
defined in
.I rune (2),
which interprets the file name as a sequence of encoded characters
rather than bytes.
In fact, even when you know the character is a single byte
that can represent only itself,
it is safer to use
.CW utfrune
because that assumes nothing about the character set
and its representation.
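.PP
To make the distinction concrete, here is a portable sketch of a character search
that respects the encoding.
It is not the library's
.CW utfrune ;
the names
.CW encode
and
.CW findrune
are invented, the string is assumed to be valid UTF, and only 16-bit character values are handled.
The idea is to encode the wanted character and then search for that byte sequence,
which is safe because an encoded character cannot begin in the middle of another one.

```c
#include <string.h>

/*
 * Encode the 16-bit character value c as UTF-8 in buf,
 * add a terminating NUL, and return the number of bytes used.
 */
int
encode(char *buf, unsigned c)
{
	int n;

	n = 0;
	if(c < 0x80)
		buf[n++] = (char)c;
	else if(c < 0x800){
		buf[n++] = (char)(0xC0 | (c>>6));
		buf[n++] = (char)(0x80 | (c&0x3F));
	}else{
		buf[n++] = (char)(0xE0 | ((c>>12)&0x0F));
		buf[n++] = (char)(0x80 | ((c>>6)&0x3F));
		buf[n++] = (char)(0x80 | (c&0x3F));
	}
	buf[n] = 0;
	return n;
}

/* Find the first occurrence of character c in the UTF-8 string s. */
char*
findrune(const char *s, unsigned c)
{
	char buf[4];

	if(c < 0x80)		/* single byte: strchr is enough */
		return strchr(s, (int)c);
	encode(buf, c);
	return strstr(s, buf);
}
```

Searching the name
.CW café/x
for é (value E9 hexadecimal) finds the two-byte sequence that encodes it;
a byte-at-a-time search for the value E9 would find nothing, because that byte never appears in the string.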
.PP
The library defines several symbols relevant to the representation of characters.
Any byte with unsigned value less than
.CW Runesync
will not appear in any multi-byte encoding of a character.
.CW Utfrune
compares the character being searched against
.CW Runesync
to see if it is sufficient to call
.CW strchr
or if the byte stream must be interpreted.
Any character with value less than
.CW Runeself
is represented by a single byte with the same value.
Finally, when errors are encountered converting
to runes from a byte stream, the library returns the rune value
.CW Runeerror
and advances a single byte.  This permits programs to find runes
embedded in binary data.
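.PP
The behavior on errors is easiest to see in a decoder.
The sketch below is a portable stand-in, not the library source:
the names
.CW decode
and
.CW countrunes
are invented,
.CW Rself
plays the part of
.CW Runeself ,
and the value given to
.CW Rerror
is an assumption made for the example, since this document does not state the value of
.CW Runeerror .
Only 16-bit characters are handled, and the input is assumed to be a NUL-terminated string.

```c
enum
{
	Rself	= 0x80,		/* values below this are one byte, unchanged */
	Rerror	= 0xFFFD,	/* assumed value reported for bad input */
};

/* Decode one character from str into *r; return the bytes consumed. */
int
decode(unsigned short *r, const char *str)
{
	const unsigned char *s = (const unsigned char*)str;
	unsigned long c;

	if(s[0] < Rself){			/* one byte */
		*r = s[0];
		return 1;
	}
	if((s[0]&0xE0) == 0xC0 && (s[1]&0xC0) == 0x80){	/* two bytes */
		c = ((s[0]&0x1F)<<6) | (s[1]&0x3F);
		if(c >= 0x80){			/* reject overlong forms */
			*r = (unsigned short)c;
			return 2;
		}
	}else if((s[0]&0xF0) == 0xE0 && (s[1]&0xC0) == 0x80
	      && (s[2]&0xC0) == 0x80){		/* three bytes */
		c = ((s[0]&0x0F)<<12) | ((s[1]&0x3F)<<6) | (s[2]&0x3F);
		if(c >= 0x800){
			*r = (unsigned short)c;
			return 3;
		}
	}
	*r = Rerror;				/* bad sequence: skip one byte */
	return 1;
}

/* Count characters, not bytes, in a NUL-terminated string. */
int
countrunes(const char *s)
{
	unsigned short r;
	int n;

	for(n = 0; *s; n++)
		s += decode(&r, s);
	return n;
}
```

Because a bad byte costs exactly one step, the decoder resynchronizes at the next
well-formed character instead of discarding the rest of the input.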
.PP
.CW Bio
includes routines
.CW Bgetrune
and
.CW Bputrune
to transform the external byte stream
UTF
format to and from
internal 16-bit runes.
Also, the
.CW %s
format to
.CW print
accepts
UTF;
.CW %c
prints a character after narrowing it to 8 bits.
The
.CW %S
format prints a null-terminated sequence of runes;
.CW %C
prints a character after narrowing it to 16 bits.
For more information, see the Programmer's Manual, in particular
.I utf (6)
and
.I rune (2),
and the paper,
``Hello world, or
Καλημέρα κόσμε, or\
\f(Jpこんにちは 世界\f1'',
by Rob Pike and
Ken Thompson;
there is not room for the full story here.
.PP
These issues affect the compiler in several ways.
First, the C source is in
UTF.
ANSI says C variables are formed from
ASCII
alphanumerics, but comments and literal strings may contain any characters
encoded in the native encoding, here
UTF.
The declaration
.P1
char *cp = "abcÿ";
.P2
initializes the variable
.CW cp
to point to an array of bytes holding the
UTF
representation of the characters
.CW abcÿ .
The type
.CW Rune
is defined in
.CW <u.h>
to be
.CW ushort ,
which is also the `wide character' type in the compiler.
Therefore the declaration
.P1
Rune *rp = L"abcÿ";
.P2
initializes the variable
.CW rp
to point to an array of unsigned short integers holding the 16-bit
values of the characters
.CW abcÿ .
Note that in both these declarations the characters in the source
that represent
.CW "abcÿ"
are the same; what changes is how those characters are represented
in memory in the program.
The following two lines:
.P1
print("%s\en", "abcÿ");
print("%S\en", L"abcÿ");
.P2
produce the same
UTF
string on their output, the first by copying the bytes, the second
by converting from runes to bytes.
.PP
In C, character constants are integers but narrowed through the
.CW char
type.
The Unicode character
.CW ÿ
has value 255, so if the
.CW char
type is signed,
the constant
.CW 'ÿ'
has value \-1 (which is equal to EOF).
On the other hand,
.CW L'ÿ'
narrows through the wide character type,
.CW ushort ,
and therefore has value 255.
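.PP
The effect of the narrowing can be reproduced in portable C.
The functions below are invented for the illustration; they assume an 8-bit
.CW char ,
a two's-complement machine, and an EOF of \-1, all of which hold on the usual systems but none of which ANSI C promises.

```c
/* The byte as a signed 8-bit type would hold it. */
int
assigned(int byte)
{
	return (signed char)byte;
}

/* The same byte as an unsigned 8-bit type would hold it. */
int
asunsigned(int byte)
{
	return (unsigned char)byte;
}

/*
 * A loop that stores the result of getchar() in a signed char
 * cannot tell byte 0xFF from end of file; keeping it in an int can.
 */
int
iseof(int c)
{
	return c == -1;		/* EOF is assumed to be -1 here */
}
```

The value 255 survives only when it passes through a type wide enough, or unsigned, which is why
.CW L'ÿ'
and
.CW 'ÿ'
differ.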
.PP
Finally, although it's not ANSI C, the Plan 9 C compilers
assume any character with value above
.CW Runeself
is an alphanumeric,
so α is a legal, if non-portable, variable name.
.SH
Arguments
.PP
Some macros are defined
in
.CW <libc.h>
for parsing the arguments to
.CW main() .
They are described in
.I ARG (2)
but are fairly self-explanatory.
There are four macros:
.CW ARGBEGIN
and
.CW ARGEND
are used to bracket a hidden
.CW switch
statement within which
.CW ARGC
returns the current option character (rune) being processed and
.CW ARGF
returns the argument to the option, as in the loader option
.CW -o
.CW file .
Here, for example, is the code at the beginning of
.CW main()
in
.CW ramfs.c
(see
.I ramfs (1))
that cracks its arguments:
.P1
void
main(int argc, char *argv[])
{
	char *defmnt;
	int p[2];
	int mfd[2];
	int stdio = 0;

	defmnt = "/tmp";
	ARGBEGIN{
	case 'i':
		defmnt = 0;
		stdio = 1;
		mfd[0] = 0;
		mfd[1] = 1;
		break;
	case 's':
		defmnt = 0;
		break;
	case 'm':
		defmnt = ARGF();
		break;
	default:
		usage();
	}ARGEND
.P2
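.PP
For readers curious how a
.CW switch
can hide between two macros, here is a simplified sketch of the technique in portable C.
It is not the definition found in
.CW <libc.h> :
the names
.CW OPTBEGIN ,
.CW OPTEND ,
and
.CW OPTF
are invented, option characters are treated as single bytes rather than runes,
and the structure
.CW Opts
and function
.CW parse
exist only to exercise the macros.

```c
#include <stddef.h>

/* Open a loop over the option words and a loop over each word's letters. */
#define OPTBEGIN \
	for(argv++, argc--; \
	    argv[0] && argv[0][0] == '-' && argv[0][1]; \
	    argc--, argv++){ \
		char *args_, *argt_ = NULL; \
		char optc_; \
		args_ = &argv[0][1]; \
		if(args_[0] == '-' && args_[1] == 0){	/* "--" ends options */ \
			argc--; argv++; break; \
		} \
		while((optc_ = *args_) != 0 && args_++) \
			switch(optc_)
#define OPTEND	(void)argt_; }
/* The option's argument: the rest of this word, else the next word. */
#define OPTF()	(argt_ = args_, args_ = "", \
		(*argt_ ? argt_ : argv[1] ? (argc--, *++argv) : NULL))

struct Opts
{
	int	verbose;
	char	*out;
};

/* Accept -v (repeatable) and -o file; return the number of words left, or -1. */
int
parse(int argc, char *argv[], struct Opts *o)
{
	o->verbose = 0;
	o->out = NULL;
	OPTBEGIN{
	case 'v':
		o->verbose++;
		break;
	case 'o':
		o->out = OPTF();
		break;
	default:
		return -1;
	}OPTEND
	return argc;
}
```

The braces written by the programmer after
.CW OPTBEGIN
become the body of the hidden
.CW switch ,
so
.CW -v
.CW -o
.CW file
and
.CW -vofile
are handled by the same cases.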
1075.SH
1076Extensions
1077.PP
1078The compiler has several extensions to 1989 ANSI C, all of which are used
1079extensively in the system source.
1080Some of these have been adopted in later ANSI C standards.
1081First,
1082.I structure
1083.I displays
1084permit
1085.CW struct
1086expressions to be formed dynamically.
1087Given these declarations:
.P1
typedef struct Point Point;
typedef struct Rectangle Rectangle;

struct Point
{
	int x, y;
};

struct Rectangle
{
	Point min, max;
};

Point	p, q, add(Point, Point);
Rectangle r;
int	x, y;
.P2
this assignment may appear anywhere an assignment is legal:
.P1
r = (Rectangle){add(p, q), (Point){x, y+3}};
.P2
The syntax is the same as for initializing a structure but with
a leading cast.
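.PP
The same notation is available in C99 under the name compound literal,
so the assignment can be tried with other compilers.
The sketch below fills in a body for
.CW add
and wraps the assignment in a function;
.CW mkrect
is a hypothetical name and the body of
.CW add
is an assumption, since the text gives only its prototype.

```c
typedef struct Point Point;
typedef struct Rectangle Rectangle;

struct Point
{
	int x, y;
};

struct Rectangle
{
	Point min, max;
};

/* assumed definition: componentwise sum, itself built with a display */
Point
add(Point a, Point b)
{
	return (Point){a.x+b.x, a.y+b.y};
}

/* build a Rectangle in one expression, as in the assignment in the text */
Rectangle
mkrect(Point p, Point q, int x, int y)
{
	Rectangle r;

	r = (Rectangle){add(p, q), (Point){x, y+3}};
	return r;
}
```

.LP
Displays may also appear directly as function arguments, for example
.CW "mkrect((Point){1, 2}, (Point){3, 4}, 10, 20)" .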
.PP
If an
.I anonymous
.I structure
or
.I union
is declared within another structure or union, the members of the internal
structure or union are addressable without prefix in the outer structure.
This feature eliminates the clumsy naming of nested structures and,
particularly, unions.
For example, after these declarations,
.P1
struct Lock
{
	int	locked;
};

struct Node
{
	int	type;
	union{
		double  dval;
		double  fval;
		long    lval;
	};		/* anonymous union */
	struct Lock;	/* anonymous structure */
} *node;

void	lock(struct Lock*);
.P2
one may refer to
.CW node->type ,
.CW node->dval ,
.CW node->fval ,
.CW node->lval ,
and
.CW node->locked .
Moreover, the address of a
.CW struct
.CW Node
may be used without a cast anywhere that the address of a
.CW struct
.CW Lock
is used, such as in argument lists.
The compiler automatically promotes the type and adjusts the address.
Thus one may invoke
.CW lock(node) .
.PP
Anonymous structures and unions may be accessed by type name
if (and only if) they are declared using a
.CW typedef
name.
For example, using the above declaration for
.CW Point ,
one may declare
.P1
struct
{
	int	type;
	Point;
} p;
.P2
and refer to
.CW p.Point .
.PP
In the initialization of arrays, a number in square brackets before an
element sets the index for the initialization.  For example, to initialize
some elements in
a table of function pointers indexed by
ASCII
character,
.P1
void	percent(void), slash(void);

void	(*func[128])(void) =
{
	['%']	percent,
	['/']	slash,
};
.P2
.LP
A similar syntax allows one to initialize structure elements:
.P1
Point p =
{
	.y 100,
	.x 200
};
.P2
These initialization syntaxes were later added to ANSI C, with the addition of an
equals sign between the index or tag and the value.
The Plan 9 compiler accepts either form.
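.PP
Written with the equals sign, the two initializers above are ordinary C99
and can be exercised with any compiler.
In this sketch the bodies of
.CW percent
and
.CW slash ,
the variable
.CW lastcalled ,
the structure
.CW corner ,
and the function
.CW dispatch
are hypothetical additions that make the table observable.

```c
#include <stddef.h>

typedef struct Point Point;
struct Point
{
	int x, y;
};

int	lastcalled;	/* records which handler ran */

void
percent(void)
{
	lastcalled = '%';
}

void
slash(void)
{
	lastcalled = '/';
}

/* elements not named are zero, that is, null pointers */
void	(*func[128])(void) =
{
	['%'] = percent,
	['/'] = slash,
};

/* members may be named in any order */
Point corner =
{
	.y = 100,
	.x = 200
};

/* call the handler for c, if there is one; report whether it ran */
int
dispatch(int c)
{
	if(c < 0 || c >= 128 || func[c] == NULL)
		return 0;
	func[c]();
	return 1;
}
```

.LP
Because unnamed elements are zero,
.CW dispatch
can tell an unset slot from a handler without a separate table of flags.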
.PP
Finally, the declaration
.P1
extern register reg;
.P2
.I this "" (
appearance of the register keyword is not ignored)
allocates a global register to hold the variable
.CW reg .
External registers must be used carefully: they need to be declared in
.I all
source files and libraries in the program to guarantee the register
is not allocated temporarily for other purposes.
Especially on machines with few registers, such as the i386,
it is easy to link accidentally with code that has already usurped
the global registers and there is no diagnostic when this happens.
Used wisely, though, external registers are powerful.
The Plan 9 operating system uses them to access per-process and
per-machine data structures on a multiprocessor.  The storage class they provide
is hard to create in other ways.
.SH
The compile-time environment
.PP
The code generated by the compilers is `optimized' by default:
variables are placed in registers and peephole optimizations are
performed.
The compiler flag
.CW -N
disables these optimizations.
Registerization is done locally rather than throughout a function:
whether a variable occupies a register or
the memory location identified in the symbol
table depends on the activity of the variable and may change
throughout the life of the variable.
The
.CW -N
flag is rarely needed;
its main use is to simplify debugging.
There is no information in the symbol table to identify the
registerization of a variable, so
.CW -N
guarantees the variable is always where the symbol table says it is.
.PP
Another flag,
.CW -w ,
turns
.I on
warnings about portability and problems detected in flow analysis.
Most code in Plan 9 is compiled with warnings enabled;
these warnings plus the type checking offered by function prototypes
provide most of the support of the Unix tool
.CW lint
more accurately and with less chatter.
Two of the warnings,
`used and not set' and `set and not used', are almost always accurate but
may be triggered spuriously by code with invisible control flow,
such as in routines that call
.CW longjmp .
The compiler statements
.P1
SET(v1);
USED(v2);
.P2
decorate the flow graph to silence the compiler.
Either statement accepts a comma-separated list of variables.
Use them carefully: they may silence real errors.
For the common case of unused parameters to a function,
leaving the name off the declaration silences the warnings.
That is, listing the type of a parameter but giving it no
associated variable name does the trick.
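.PP
Code that must also build with a compiler lacking these statements
sometimes supplies macros of the same names.
The definitions below are hypothetical stand-ins, not the Plan 9 built-ins:
each takes a single variable rather than a list, and this
.CW SET
really stores a zero.
The function
.CW handler
is likewise invented for the illustration.

```c
/* hypothetical stand-ins for compilers without the built-in statements */
#define USED(x)	((void)(x))	/* evaluate and discard: counts as a use */
#define SET(x)	((x) = 0)	/* unlike the built-in, stores a value */

int
handler(int fd, int flag)
{
	int n;

	USED(flag);	/* parameter exists only to match an interface */
	SET(n);		/* n is now defined on every path */
	if(fd >= 0)
		n = fd + 1;
	return n;
}
```

.LP
As with the built-in statements, such macros hide the warning rather than
the cause, so they are best confined to variables that are known to be harmless.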
.SH
Debugging
.PP
There are two debuggers available on Plan 9.
The first, and older, is
.CW db ,
a revision of Unix
.CW adb .
The other,
.CW acid ,
is a source-level debugger whose commands are statements in
a true programming language.
.CW Acid
is the preferred debugger, but since it
borrows some elements of
.CW db ,
notably the formats for displaying values, it is worth knowing a little bit about
.CW db .
.PP
Both debuggers support multiple architectures in a single program; that is,
the programs are
.CW db
and
.CW acid ,
not for example
.CW vdb
and
.CW vacid .
They also support cross-architecture debugging comfortably:
one may debug a 68020 binary on a MIPS.
.PP
Imagine a program has crashed mysteriously:
.P1
% X11/X
Fatal server bug!
failed to create default stipple
X 106: suicide: sys: trap: fault read addr=0x0 pc=0x00105fb8
%
.P2
When a process dies on Plan 9 it hangs in the `broken' state
for debugging.
Attach a debugger to the process by naming its process id:
.P1
% acid 106
/proc/106/text:mips plan 9 executable

/sys/lib/acid/port
/sys/lib/acid/mips
acid:
.P2
The
.CW acid
function
.CW stk()
reports the stack traceback:
.P1
acid: stk()
At pc:0x105fb8:abort+0x24 /sys/src/ape/lib/ap/stdio/abort.c:6
abort() /sys/src/ape/lib/ap/stdio/abort.c:4
	called from FatalError+#4e
		/sys/src/X/mit/server/dix/misc.c:421
FatalError(s9=#e02, s8=#4901d200, s7=#2, s6=#72701, s5=#1,
    s4=#7270d, s3=#6, s2=#12, s1=#ff37f1c, s0=#6, f=#7270f)
    /sys/src/X/mit/server/dix/misc.c:416
	called from gnotscreeninit+#4ce
		/sys/src/X/mit/server/ddx/gnot/gnot.c:792
gnotscreeninit(snum=#0, sc=#80db0)
    /sys/src/X/mit/server/ddx/gnot/gnot.c:766
	called from AddScreen+#16e
		/n/bootes/sys/src/X/mit/server/dix/main.c:610
AddScreen(pfnInit=0x0000129c,argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/dix/main.c:530
	called from InitOutput+0x80
		/sys/src/X/mit/server/ddx/brazil/brddx.c:522
InitOutput(argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/ddx/brazil/brddx.c:511
	called from main+0x294
		/sys/src/X/mit/server/dix/main.c:225
main(argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/dix/main.c:136
	called from _main+0x24
		/sys/src/ape/lib/ap/mips/main9.s:8
.P2
The function
.CW lstk()
is similar but
also reports the values of local variables.
Note that the traceback includes full file names; this is a boon to debugging,
although it makes the output much noisier.
.PP
To use
.CW acid
well you will need to learn its input language; see the
``Acid Manual'',
by Phil Winterbottom,
for details.  For simple debugging, however, the information in the manual page is
sufficient.  In particular, it describes the most useful functions
for examining a process.
.PP
The compiler does not place
information describing the types of variables in the executable,
but a compile-time flag provides crude support for symbolic debugging.
The
.CW -a
flag to the compiler suppresses code generation
and instead emits source text in the
.CW acid
language to format and display data structure types defined in the program.
The easiest way to use this feature is to put a rule in the
.CW mkfile :
.P1
syms:   main.$O
        $CC -a main.c > syms
.P2
Then from within
.CW acid ,
.P1
acid: include("sourcedirectory/syms")
.P2
to read in the relevant definitions.
(For multi-file source, you need to be a little fancier;
see
.I 2c (1)).
This text includes, for each defined compound
type, a function with that name that may be called with the address of a structure
of that type to display its contents.
For example, if
.CW rect
is a global variable of type
.CW Rectangle ,
one may execute
.P1
Rectangle(*rect)
.P2
to display it.
The
.CW *
(indirection) operator is necessary because
of the way
.CW acid
works: each global symbol in the program is defined as a variable by
.CW acid ,
with value equal to the
.I address
of the symbol.
.PP
Another common technique is to write by hand special
.CW acid
code to define functions to aid debugging, initialize the debugger, and so on.
Conventionally, this is placed in a file called
.CW acid
in the source directory; it has a line
.P1
include("sourcedirectory/syms");
.P2
to load the compiler-produced symbols.  One may edit the compiler output directly but
it is wiser to keep the hand-generated
.CW acid
separate from the machine-generated.
.PP
To make things simple, the default rules in the system
.CW mkfiles
include entries to make
.CW foo.acid
from
.CW foo.c ,
so one may use
.CW mk
to automate the production of
.CW acid
definitions for a given C source file.
.PP
There is much more to say here.  See the
.CW acid
manual page, the reference manual, or the paper
``Acid: A Debugger Built From A Language'',
also by Phil Winterbottom.