.HTML "How to Use the Plan 9 C Compiler
.TL
How to Use the Plan 9 C Compiler
.AU
Rob Pike
rob@plan9.bell-labs.com
.SH
Introduction
.PP
The C compiler on Plan 9 is a wholly new program; in fact
it was the first piece of software written for what would
eventually become Plan 9 from Bell Labs.
Programmers familiar with existing C compilers will find
a number of differences in both the language the Plan 9 compiler
accepts and in how the compiler is used.
.PP
The compiler is really a set of compilers, one for each
architecture \(em MIPS, SPARC, Motorola 68020, Intel 386, etc. \(em
that accept a dialect of ANSI C and efficiently produce
fairly good code for the target machine.
There is a packaging of the compiler that accepts strict ANSI C for
a POSIX environment, but this document focuses on the
native Plan 9 environment, that in which all the system source and
almost all the utilities are written.
.SH
Source
.PP
The language accepted by the compilers is the core ANSI C language
with some modest extensions,
a greatly simplified preprocessor,
a smaller library that includes system calls and related facilities,
and a completely different structure for include files.
.PP
Official ANSI C accepts the old (K&R) style of declarations for
functions; the Plan 9 compilers
are more demanding.
Without an explicit run-time flag
.CW -B ) (
whose use is discouraged, the compilers insist
on new-style function declarations, that is, prototypes for
function arguments.
The function declarations in the libraries' include files are
all in the new style so the interfaces are checked at compile time.
For C programmers who have not yet switched to function prototypes
the clumsy syntax may seem repellent but the payoff in stronger typing
is substantial.
Those who wish to import existing software to Plan 9 are urged
to use the opportunity to update their code.
.PP
The compilers include an integrated preprocessor that accepts the familiar
.CW #include ,
.CW #define
for macros both with and without arguments,
.CW #undef ,
.CW #line ,
.CW #ifdef ,
.CW #ifndef ,
and
.CW #endif .
It
supports neither
.CW #if
nor
.CW ## ,
although it does
honor a few
.CW #pragmas .
The
.CW #if
directive was omitted because it greatly complicates the
preprocessor, is never necessary, and is usually abused.
Conditional compilation in general makes code hard to understand;
the Plan 9 source uses it sparingly.
Also, because the compilers remove dead code, regular
.CW if
statements with constant conditions are more readable equivalents to many
.CW #ifs .
To compile imported code ineluctably fouled by
.CW #if
there is a separate command,
.CW /bin/cpp ,
that implements the complete ANSI C preprocessor specification.
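.PP
As a rough illustration of a constant
.CW if
standing in for
.CW #if ,
here is a sketch in portable ISO C rather than Plan 9 C, so it can be tried anywhere.
The names
.CW Debug
and
.CW tracedadd
are invented for the example and appear nowhere in the Plan 9 source.

```c
#include <stdio.h>

/* Hypothetical compile-time switch: an enum constant, not a macro. */
enum { Debug = 0 };

/* Hypothetical helper: add two ints, tracing the call when Debug is set. */
int
tracedadd(int a, int b)
{
	if(Debug)	/* constant condition; a compiler that removes dead code drops this branch */
		fprintf(stderr, "tracedadd(%d, %d)\n", a, b);
	return a + b;
}
```

.LP
Unlike text hidden behind a false
.CW #if ,
the disabled branch is still parsed and type-checked, so it cannot quietly rot.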
.PP
Include files fall into two groups: machine-dependent and machine-independent.
The machine-independent files occupy the directory
.CW /sys/include ;
the others are placed in a directory appropriate to the machine, such as
.CW /mips/include .
The compiler searches for include files
first in the machine-dependent directory and then
in the machine-independent directory.
At the time of writing there are thirty-one machine-independent include
files and two (per machine) machine-dependent ones:
.CW <ureg.h>
and
.CW <u.h> .
The first describes the layout of registers on the system stack,
for use by the debugger.
The second defines some
architecture-dependent types such as
.CW jmp_buf
for
.CW setjmp
and the
.CW va_arg
and
.CW va_list
macros for handling arguments to variadic functions,
as well as a set of
.CW typedef
abbreviations for
.CW unsigned
.CW short
and so on.
.PP
Here is an excerpt from
.CW /68020/include/u.h :
.P1
#define nil		((void*)0)
typedef	unsigned short	ushort;
typedef	unsigned char	uchar;
typedef unsigned long	ulong;
typedef unsigned int	uint;
typedef   signed char	schar;
typedef	long long       vlong;

typedef long	jmp_buf[2];
#define	JMPBUFSP	0
#define	JMPBUFPC	1
#define	JMPBUFDPC	0
.P2
Plan 9 programs use
.CW nil
for the name of the zero-valued pointer.
The type
.CW vlong
is the largest integer type available; on most architectures it
is a 64-bit value.
A couple of other types in
.CW <u.h>
are
.CW u32int ,
which is guaranteed to have exactly 32 bits (a possibility on all the supported architectures) and
.CW mpdigit ,
which is used by the multiprecision math package
.CW <mp.h> .
The
.CW #define
constants permit an architecture-independent (but compiler-dependent)
implementation of stack-switching using
.CW setjmp
and
.CW longjmp .
.PP
Every Plan 9 C program begins
.P1
#include <u.h>
.P2
because all the other installed header files use the
.CW typedefs
declared in
.CW <u.h> .
.PP
In strict ANSI C, include files are grouped to collect related functions
in a single file: one for string functions, one for memory functions,
one for I/O, and none for system calls.
Each include file is protected by an
.CW #ifdef
to guarantee its contents are seen by the compiler only once.
Plan 9 takes a different approach.  Other than a few include
files that define external formats such as archives, the files in
.CW /sys/include
correspond to
.I libraries.
If a program is using a library, it includes the corresponding header.
The default C library comprises string functions, memory functions, and
so on, largely as in ANSI C, some formatted I/O routines,
plus all the system calls and related functions.
To use these functions, one must
.CW #include
the file
.CW <libc.h> ,
which in turn must follow
.CW <u.h> ,
to define their prototypes for the compiler.
Here is the complete source to the traditional first C program:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	print("hello world\en");
	exits(0);
}
.P2
The
.CW print
routine and its relatives
.CW fprint
and
.CW sprint
resemble the similarly-named functions in Standard I/O but are not
attached to a specific I/O library.
In Plan 9
.CW main
is not integer-valued; it should call
.CW exits ,
which takes a string argument (or null; here ANSI C promotes the 0 to a
.CW char* ).
All these functions are, of course, documented in the Programmer's Manual.
.PP
To use
.CW printf ,
.CW <stdio.h>
must be included to define the function prototype for
.CW printf :
.P1
#include <u.h>
#include <libc.h>
#include <stdio.h>

void
main(int argc, char *argv[])
{
	printf("%s: hello world; argc = %d\en", argv[0], argc);
	exits(0);
}
.P2
In practice, Standard I/O is not used much in Plan 9.  I/O libraries are
discussed in a later section of this document.
.PP
There are libraries for handling regular expressions, raster graphics,
windows, and so on, and each has an associated include file.
The manual for each library states which include files are needed.
The files are not protected against multiple inclusion and themselves
contain no nested
.CW #includes .
Instead the
programmer is expected to sort out the requirements
and to
.CW #include
the necessary files once at the top of each source file.  In practice this is
trivial: this way of handling include files is so straightforward
that it is rare for a source file to contain more than half a dozen
.CW #includes .
.PP
The compilers do their own register allocation so the
.CW register
keyword is ignored.
For different reasons,
.CW volatile
and
.CW const
are also ignored.
.PP
To make it easier to share code with other systems, Plan 9 has a version
of the compiler,
.CW pcc ,
that provides the standard ANSI C preprocessor, headers, and libraries
with POSIX extensions.
.CW Pcc
is recommended only
when broad external portability is mandated.  It compiles slower,
produces slower code (it takes extra work to simulate POSIX on Plan 9),
eliminates those parts of the Plan 9 interface
not related to POSIX, and illustrates the clumsiness of an environment
designed by committee.
.CW Pcc
is described in more detail in
.I
APE\(emThe ANSI/POSIX Environment,
.R
by Howard Trickey.
.SH
Process
.PP
Each CPU architecture supported by Plan 9 is identified by a single,
arbitrary, alphanumeric character:
.CW k
for SPARC,
.CW q
for Motorola Power PC 630 and 640,
.CW v
for MIPS,
.CW 1
for Motorola 68000,
.CW 2
for Motorola 68020 and 68040,
.CW 5
for Acorn ARM 7500,
.CW 6
for Intel 960,
.CW 7
for DEC Alpha,
.CW 8
for Intel 386, and
.CW 9
for AMD 29000.
The character labels the support tools and files for that architecture.
For instance, for the 68020 the compiler is
.CW 2c ,
the assembler is
.CW 2a ,
the link editor/loader is
.CW 2l ,
the object files are suffixed
.CW \&.2 ,
and the default name for an executable file is
.CW 2.out .
Before we can use the compiler we therefore need to know which
machine we are compiling for.
The next section explains how this decision is made; for the moment
assume we are building 68020 binaries and make the mental substitution for
.CW 2
appropriate to the machine you are actually using.
.PP
To convert source to an executable binary is a two-step process.
First run the compiler,
.CW 2c ,
on the source, say
.CW file.c ,
to generate an object file
.CW file.2 .
Then run the loader,
.CW 2l ,
to generate an executable
.CW 2.out
that may be run (on a 680X0 machine):
.P1
2c file.c
2l file.2
2.out
.P2
The loader automatically links with whatever libraries the program
needs, usually including the standard C library as defined by
.CW <libc.h> .
Of course the compiler and loader have lots of options, both familiar and new;
see the manual for details.
The compiler does not generate an executable automatically;
the output of the compiler must be given to the loader.
Since most compilation is done under the control of
.CW mk
(see below), this is rarely an inconvenience.
.PP
The distribution of work between the compiler and loader is unusual.
The compiler integrates preprocessing, parsing, register allocation,
code generation and some assembly.
Combining these tasks in a single program is part of the reason for
the compiler's efficiency.
The loader does instruction selection, branch folding,
instruction scheduling,
and writes the final executable.
There is no separate C preprocessor and no assembler in the usual pipeline.
Instead the intermediate object file
(here a
.CW \&.2
file) is a type of binary assembly language.
The instructions in the intermediate format are not exactly those in
the machine.  For example, on the 68020 the object file may specify
a MOVE instruction but the loader will decide just which variant of
the MOVE instruction \(em MOVE immediate, MOVE quick, MOVE address,
etc. \(em is most efficient.
.PP
The assembler,
.CW 2a ,
is just a translator between the textual and binary
representations of the object file format.
It is not an assembler in the traditional sense.  It has limited
macro capabilities (the same as the integral C preprocessor in the compiler),
clumsy syntax, and minimal error checking.  For instance, the assembler
will accept an instruction (such as memory-to-memory MOVE on the MIPS) that the
machine does not actually support; only when the output of the assembler
is passed to the loader will the error be discovered.
The assembler is intended only for writing things that need access to instructions
invisible from C,
such as the machine-dependent
part of an operating system;
very little code in Plan 9 is in assembly language.
.PP
The compilers take an option
.CW -S
that causes them to print on their standard output the generated code
in a format acceptable as input to the assemblers.
This is of course merely a formatting of the
data in the object file; therefore the assembler is just
an
ASCII-to-binary converter for this format.
Other than the specific instructions, the input to the assemblers
is largely architecture-independent; see
``A Manual for the Plan 9 Assembler'',
by Rob Pike,
for more information.
.PP
The loader is an integral part of the compilation process.
Each library header file contains a
.CW #pragma
that tells the loader the name of the associated archive; it is
not necessary to tell the loader which libraries a program uses.
The C run-time startup is found, by default, in the C library.
The loader starts with an undefined
symbol,
.CW _main ,
that is resolved by pulling in the run-time startup code from the library.
(The loader undefines
.CW _mainp
when profiling is enabled, to force loading of the profiling start-up
instead.)
.PP
Unlike its counterpart on other systems, the Plan 9 loader rearranges
data to optimize access.  This means the order of variables in the
loaded program is unrelated to their order in the source.
Most programs don't care, but some assume that, for example, the
variables declared by
.P1
int a;
int b;
.P2
will appear at adjacent addresses in memory.  On Plan 9, they won't.
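.PP
A program that really needs two values side by side can say so by putting them in one object,
whose layout is fixed by the language rather than by any loader.
The following is a sketch in portable ISO C; the names
.CW Pair ,
.CW vec ,
and
.CW vecstride
are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical type: members of one struct keep their declared order. */
typedef struct Pair Pair;
struct Pair
{
	int a;
	int b;
};

Pair pair;	/* pair.a precedes pair.b, whatever happens to separate variables */
int vec[2];	/* array elements are always contiguous */

/* Distance in bytes from vec[0] to vec[1]; this is sizeof(int) by definition. */
size_t
vecstride(void)
{
	return (size_t)((char*)&vec[1] - (char*)&vec[0]);
}
```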
.SH
Heterogeneity
.PP
When the system starts or a user logs in, the environment is configured
so the appropriate binaries are available in
.CW /bin .
The configuration process is controlled by an environment variable,
.CW $cputype ,
with a value such as
.CW mips ,
.CW 68020 ,
.CW 386 ,
or
.CW sparc .
For each architecture there is a directory in the root,
with the appropriate name,
that holds the binary and library files for that architecture.
Thus
.CW /mips/lib
contains the object code libraries for MIPS programs,
.CW /mips/include
holds MIPS-specific include files, and
.CW /mips/bin
has the MIPS binaries.
These binaries are attached to
.CW /bin
at boot time by binding
.CW /$cputype/bin
to
.CW /bin ,
so
.CW /bin
always contains the correct files.
.PP
The MIPS compiler,
.CW vc ,
by definition
produces object files for the MIPS architecture,
regardless of the architecture of the machine on which the compiler is running.
There is a version of
.CW vc
compiled for each architecture:
.CW /mips/bin/vc ,
.CW /68020/bin/vc ,
.CW /sparc/bin/vc ,
and so on,
each capable of producing MIPS object files regardless of the native
instruction set.
If one is running on a SPARC,
.CW /sparc/bin/vc
will compile programs for the MIPS;
if one is running on machine
.CW $cputype ,
.CW /$cputype/bin/vc
will compile programs for the MIPS.
.PP
Because of the bindings that assemble
.CW /bin ,
the shell always looks for a command, say
.CW date ,
in
.CW /bin
and automatically finds the file
.CW /$cputype/bin/date .
Therefore the MIPS compiler is known as just
.CW vc ;
the shell will invoke
.CW /bin/vc
and that is guaranteed to be the version of the MIPS compiler
appropriate for the machine running the command.
Regardless of the architecture of the compiling machine,
.CW /bin/vc
is
.I always
the MIPS compiler.
.PP
Also, the output of
.CW vc
and
.CW vl
is completely independent of the machine type on which they are executed:
.CW \&.v
files compiled (with
.CW vc )
on a SPARC may be linked (with
.CW vl )
on a 386.
(The resulting
.CW v.out
will run, of course, only on a MIPS.)
Similarly, the MIPS libraries in
.CW /mips/lib
are suitable for loading with
.CW vl
on any machine; there is only one set of MIPS libraries, not one
set for each architecture that supports the MIPS compiler.
.SH
Heterogeneity and \f(CWmk\fP
.PP
Most software on Plan 9 is compiled under the control of
.CW mk ,
a descendant of
.CW make
that is documented in the Programmer's Manual.
A convention used throughout the
.CW mkfiles
makes it easy to compile the source into binary suitable for any architecture.
.PP
The variable
.CW $cputype
is advisory: it reports the architecture of the current environment, and should
not be modified.  A second variable,
.CW $objtype ,
is used to set which architecture is being
.I compiled
for.
The value of
.CW $objtype
can be used by a
.CW mkfile
to configure the compilation environment.
.PP
In each machine's root directory there is a short
.CW mkfile
that defines a set of macros for the compiler, loader, etc.
Here is
.CW /mips/mkfile :
.P1
</sys/src/mkfile.proto

CC=vc
LD=vl
O=v
AS=va
.P2
The line
.P1
</sys/src/mkfile.proto
.P2
causes
.CW mk
to include the file
.CW /sys/src/mkfile.proto ,
which contains general definitions:
.P1
#
# common mkfile parameters shared by all architectures
#

OS=v486xq7
CPUS=mips 386 power alpha
CFLAGS=-FVw
LEX=lex
YACC=yacc
MK=/bin/mk
.P2
.CW CC
is obviously the compiler,
.CW AS
the assembler, and
.CW LD
the loader.
.CW O
is the suffix for the object files and
.CW CPUS
and
.CW OS
are used in special rules described below.
.PP
Here is a
.CW mkfile
to build the installed source for
.CW sam :
.P1
</$objtype/mkfile
OBJ=sam.$O address.$O buffer.$O cmd.$O disc.$O error.$O \e
	file.$O io.$O list.$O mesg.$O moveto.$O multi.$O \e
	plan9.$O rasp.$O regexp.$O string.$O sys.$O xec.$O

$O.out:	$OBJ
	$LD $OBJ

install:	$O.out
	cp $O.out /$objtype/bin/sam

installall:
	for(objtype in $CPUS) mk install

%.$O:	%.c
	$CC $CFLAGS $stem.c

$OBJ:	sam.h errors.h mesg.h
address.$O cmd.$O parse.$O xec.$O unix.$O:	parse.h

clean:V:
	rm -f [$OS].out *.[$OS] y.tab.?
.P2
(The actual
.CW mkfile
imports most of its rules from other secondary files, but
this example works and is not misleading.)
The first line causes
.CW mk
to include the contents of
.CW /$objtype/mkfile
in the current
.CW mkfile .
If
.CW $objtype
is
.CW mips ,
this inserts the MIPS macro definitions into the
.CW mkfile .
In this case the rule for
.CW $O.out
uses the MIPS tools to build
.CW v.out .
The
.CW %.$O
rule in the file uses
.CW mk 's
pattern matching facilities to convert the source files to the object
files through the compiler.
(The text of the rules is passed directly to the shell,
.CW rc ,
without further translation.
See the
.CW mk
manual if any of this is unfamiliar.)
Because the default rule builds
.CW $O.out
rather than
.CW sam ,
it is possible to maintain binaries for multiple machines in the
same source directory without conflict.
This is also, of course, why the output files from the various
compilers and loaders
have distinct names.
.PP
The rest of the
.CW mkfile
should be easy to follow; notice how the rules for
.CW clean
and
.CW installall
(that is, install versions for all architectures) use other macros
defined in
.CW /$objtype/mkfile .
In Plan 9,
.CW mkfiles
for commands conventionally contain rules to
.CW install
(compile and install the version for
.CW $objtype ),
.CW installall
(compile and install for all
.CW $objtypes ),
and
.CW clean
(remove all object files, binaries, etc.).
.PP
The
.CW mkfile
is easy to use.  To build a MIPS binary,
.CW v.out :
.P1
% objtype=mips
% mk
.P2
To build and install a MIPS binary:
.P1
% objtype=mips
% mk install
.P2
To build and install all versions:
.P1
% mk installall
.P2
These conventions make cross-compilation as easy to manage
as traditional native compilation.
Plan 9 programs compile and run without change on machines from
large multiprocessors to laptops.  For more information about this process, see
``Plan 9 Mkfiles'',
by Bob Flandrena.
.SH
Portability
.PP
Within Plan 9, it is painless to write portable programs, programs whose
source is independent of the machine on which they execute.
The operating system is fixed and the compiler, headers and libraries
are constant so most of the stumbling blocks to portability are removed.
Attention to a few details can avoid those that remain.
.PP
Plan 9 is a heterogeneous environment, so programs must
.I expect
that external files will be written by programs on machines of different
architectures.
The compilers, for instance, must handle without confusion
object files written by other machines.
The traditional approach to this problem is to pepper the source with
.CW #ifdefs
to turn byte-swapping on and off.
Plan 9 takes a different approach: of the handful of machine-dependent
.CW #ifdefs
in all the source, almost all are deep in the libraries.
Instead programs read and write files in a defined format,
either (for low volume applications) as formatted text, or
(for high volume applications) as binary in a known byte order.
If the external data were written with the most significant
byte first, the following code reads a 4-byte integer correctly
regardless of the architecture of the executing machine (assuming
an unsigned long holds 4 bytes):
.P1
ulong
getlong(void)
{
	ulong l;

	l = (getchar()&0xFF)<<24;
	l |= (getchar()&0xFF)<<16;
	l |= (getchar()&0xFF)<<8;
	l |= (getchar()&0xFF)<<0;
	return l;
}
.P2
Note that this code does not `swap' the bytes; instead it just reads
them in the correct order.
Variations of this code will handle any binary format
and also avoid problems
involving how structures are padded, how words are aligned,
and other impediments to portability.
Be aware, though, that extra care is needed to handle floating point data.
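.PP
One such variation, sketched here in portable ISO C, works on a byte buffer instead of standard input and supplies the matching writer.
The names
.CW putbe32
and
.CW getbe32
are hypothetical, not library routines, and the two
.CW typedefs
stand in for the ones
.CW <u.h>
would provide.

```c
typedef unsigned char uchar;	/* stand-ins for the <u.h> abbreviations */
typedef unsigned long ulong;

/* Hypothetical helper: store the low 32 bits of v at p, most significant byte first. */
void
putbe32(uchar *p, ulong v)
{
	p[0] = (uchar)(v>>24);
	p[1] = (uchar)(v>>16);
	p[2] = (uchar)(v>>8);
	p[3] = (uchar)v;
}

/* Hypothetical helper: the inverse of putbe32; widen each byte before shifting. */
ulong
getbe32(const uchar *p)
{
	return ((ulong)p[0]<<24) | ((ulong)p[1]<<16) | ((ulong)p[2]<<8) | (ulong)p[3];
}
```

.LP
Neither routine asks what byte order the executing machine uses; the shifts define the external format.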
.PP
Efficiency hounds will argue that this method is unnecessarily slow and clumsy
when the executing machine has the same byte order (and padding and alignment)
as the data.
The CPU cost of I/O processing
is rarely the bottleneck for an application, however,
and the gain in simplicity of porting and maintaining the code greatly outweighs
the minor speed loss from handling data in this general way.
This method is how the Plan 9 compilers, the window system, and even the file
servers transmit data between programs.
.PP
To port programs beyond Plan 9, where the system interface is more variable,
it is probably necessary to use
.CW pcc
and hope that the target machine supports ANSI C and POSIX.
.SH
I/O
.PP
The default C library, defined by the include file
.CW <libc.h> ,
contains no buffered I/O package.
It does have several entry points for printing formatted text:
.CW print
outputs text to the standard output,
.CW fprint
outputs text to a specified integer file descriptor, and
.CW sprint
places text in a character array.
To access library routines for buffered I/O, a program must
explicitly include the header file associated with an appropriate library.
.PP
The recommended I/O library, used by most Plan 9 utilities, is
.CW bio
(buffered I/O), defined by
.CW <bio.h> .
There also exists an implementation of ANSI Standard I/O,
.CW stdio .
.PP
.CW Bio
is small and efficient, particularly for buffer-at-a-time or
line-at-a-time I/O.
Even for character-at-a-time I/O, however, it is significantly faster than
the Standard I/O library,
.CW stdio .
Its interface is compact and regular, although it lacks a few conveniences.
The most noticeable is that one must explicitly define buffers for standard
input and output;
.CW bio
does not predefine them.  Here is a program to copy input to output a byte
at a time using
.CW bio :
.P1
#include <u.h>
#include <libc.h>
#include <bio.h>

Biobuf	bin;
Biobuf	bout;

void
main(void)
{
	int c;

	Binit(&bin, 0, OREAD);
	Binit(&bout, 1, OWRITE);

	while((c=Bgetc(&bin)) != Beof)
		Bputc(&bout, c);
	exits(0);
}
.P2
For peak performance, we could replace
.CW Bgetc
and
.CW Bputc
by their equivalent in-line macros
.CW BGETC
and
.CW BPUTC
but
the performance gain would be modest.
For more information on
.CW bio ,
see the Programmer's Manual.
.PP
Perhaps the most dramatic difference in the I/O interface of Plan 9 from other
systems' is that text is not ASCII.
The format for
text in Plan 9 is a byte-stream encoding of 16-bit characters.
The character set is based on the Unicode Standard and is backward compatible with
ASCII:
characters with value 0 through 127 are the same in both sets.
The 16-bit characters, called
.I runes
in Plan 9, are encoded using a representation called
UTF,
an encoding that is becoming accepted as a standard.
(ISO calls it UTF-8;
throughout Plan 9 it's just called
UTF.)
UTF
defines multibyte sequences to
represent character values from 0 to 65535.
In
UTF,
character values up to 127 decimal, 7F hexadecimal, represent themselves,
so straight
ASCII
files are also valid
UTF.
Also,
UTF
guarantees that bytes with values 0 to 127 (NUL to DEL, inclusive)
will appear only when they represent themselves, so programs that read bytes
looking for plain ASCII characters will continue to work.
Any program that expects a one-to-one correspondence between bytes and
characters will, however, need to be modified.
An example is parsing file names.
File names, like all text, are in
UTF,
so it is incorrect to search for a character in a string by
.CW strchr(filename,
.CW c)
because the character might have a multi-byte encoding.
The correct method is to call
.CW utfrune(filename,
.CW c) ,
defined in
.I rune (2),
which interprets the file name as a sequence of encoded characters
rather than bytes.
In fact, even when you know the character is a single byte
that can represent only itself,
it is safer to use
.CW utfrune
because that assumes nothing about the character set
and its representation.
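.PP
To see why bytes and characters must be counted separately, consider this sketch in portable ISO C.
.CW Countrunes
is a hypothetical helper, not a library routine (the real ones are described in
.I rune (2)),
and it assumes its argument is well-formed UTF.

```c
#include <stddef.h>

/*
 * Hypothetical helper: count the characters in a NUL-terminated string,
 * assuming it holds well-formed UTF.  Continuation bytes have the bit
 * pattern 10xxxxxx, so every other byte begins a new character.
 */
size_t
countrunes(const char *s)
{
	size_t n;

	n = 0;
	for(; *s != '\0'; s++)
		if(((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

.LP
The string holding the four characters a, b, c and ÿ occupies five bytes, because ÿ needs two.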
.PP
The library defines several symbols relevant to the representation of characters.
Any byte with unsigned value less than
.CW Runesync
will not appear in any multi-byte encoding of a character.
.CW Utfrune
compares the character being searched against
.CW Runesync
to see if it is sufficient to call
.CW strchr
or if the byte stream must be interpreted.
Any byte with unsigned value less than
.CW Runeself
is represented by a single byte with the same value.
Finally, when errors are encountered converting
to runes from a byte stream, the library returns the rune value
.CW Runeerror
and advances a single byte.  This permits programs to find runes
embedded in binary data.
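.PP
The error convention can be sketched in portable ISO C as follows.
This is not the library's code:
.CW decoderune
is a hypothetical routine, and
.CW Badrune ,
with its value, is an assumed stand-in for
.CW Runeerror ,
whose actual value should be taken from the library headers.
The sketch handles only the one-, two- and three-byte forms that cover 16-bit runes.

```c
typedef unsigned short Rune;	/* 16-bit, as described in the text */

enum { Badrune = 0xFFFD };	/* assumed stand-in for the library's Runeerror */

/*
 * Hypothetical decoder: convert one character at s into *r and return
 * the number of bytes consumed.  Malformed input yields Badrune and
 * consumes exactly one byte, so the caller can resynchronize.
 */
int
decoderune(Rune *r, const char *s)
{
	const unsigned char *p = (const unsigned char*)s;

	if(p[0] < 0x80){	/* 0xxxxxxx: the byte represents itself */
		*r = p[0];
		return 1;
	}
	if((p[0]&0xE0) == 0xC0 && (p[1]&0xC0) == 0x80){
		/* 110xxxxx 10xxxxxx */
		*r = (Rune)((p[0]&0x1F)<<6 | (p[1]&0x3F));
		if(*r >= 0x80)	/* reject overlong encodings */
			return 2;
	}else if((p[0]&0xF0) == 0xE0 && (p[1]&0xC0) == 0x80 && (p[2]&0xC0) == 0x80){
		/* 1110xxxx 10xxxxxx 10xxxxxx */
		*r = (Rune)((p[0]&0x0F)<<12 | (p[1]&0x3F)<<6 | (p[2]&0x3F));
		if(*r >= 0x800)
			return 3;
	}
	*r = Badrune;
	return 1;
}
```

.LP
Because a bad byte costs only one step, a loop calling such a routine will pick up the next valid character wherever it begins.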
.PP
.CW Bio
includes routines
.CW Bgetrune
and
.CW Bputrune
to transform the external byte stream
UTF
format to and from
internal 16-bit runes.
Also, the
.CW %s
format to
.CW print
accepts
UTF;
.CW %c
prints a character after narrowing it to 8 bits.
The
.CW %S
format prints a null-terminated sequence of runes;
.CW %C
prints a character after narrowing it to 16 bits.
For more information, see the Programmer's Manual, in particular
.I utf (6)
and
.I rune (2),
and the paper,
``Hello world, or
Καλημέρα κόσμε, or\
\f(Jpこんにちは 世界\f1'',
by Rob Pike and
Ken Thompson;
there is not room for the full story here.
.PP
These issues affect the compiler in several ways.
First, the C source is in
UTF.
ANSI says C variables are formed from
ASCII
alphanumerics, but comments and literal strings may contain any characters
encoded in the native encoding, here
UTF.
The declaration
.P1
char *cp = "abcÿ";
.P2
initializes the variable
.CW cp
to point to an array of bytes holding the
UTF
representation of the characters
.CW abcÿ .
The type
.CW Rune
is defined in
.CW <u.h>
to be
.CW ushort ,
which is also the `wide character' type in the compiler.
Therefore the declaration
.P1
Rune *rp = L"abcÿ";
.P2
initializes the variable
.CW rp
to point to an array of unsigned short integers holding the 16-bit
values of the characters
.CW abcÿ .
Note that in both these declarations the characters in the source
that represent
.CW "abcÿ"
are the same; what changes is how those characters are represented
in memory in the program.
The following two lines:
.P1
print("%s\en", "abcÿ");
print("%S\en", L"abcÿ");
.P2
produce the same
UTF
string on their output, the first by copying the bytes, the second
by converting from runes to bytes.
.PP
In C, character constants are integers but narrowed through the
.CW char
type.
The Unicode character
.CW ÿ
has value 255, so if the
.CW char
type is signed,
the constant
.CW 'ÿ'
has value \-1 (which is equal to EOF).
On the other hand,
.CW L'ÿ'
narrows through the wide character type,
.CW ushort ,
and therefore has value 255.
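.PP
The two narrowings can be imitated in portable ISO C with explicit casts.
The function names are hypothetical, and the signed conversion is formally implementation-defined in ISO C, although common two's-complement compilers produce \-1 for 255.

```c
/* Hypothetical helper: narrow through a signed 8-bit type, as a signed char would. */
int
narrowchar(int c)
{
	/* implementation-defined for c > 127; -1 for 255 on two's-complement compilers */
	return (signed char)c;
}

/* Hypothetical helper: narrow through an unsigned short, as the wide character type does. */
int
narrowwide(int c)
{
	return (unsigned short)c;
}
```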
.PP
Finally, although it's not ANSI C, the Plan 9 C compilers
assume any character with value above
.CW Runeself
is an alphanumeric,
so α is a legal, if non-portable, variable name.
.SH
Arguments
.PP
Some macros are defined
in
.CW <libc.h>
for parsing the arguments to
.CW main() .
They are described in
.I ARG (2)
but are fairly self-explanatory.
There are four macros:
.CW ARGBEGIN
and
.CW ARGEND
are used to bracket a hidden
.CW switch
statement within which
.CW ARGC
returns the current option character (rune) being processed and
.CW ARGF
returns the argument to the option, as in the loader option
.CW -o
.CW file .
Here, for example, is the code at the beginning of
.CW main()
in
.CW ramfs.c
(see
.I ramfs (4))
that cracks its arguments:
.P1
void
main(int argc, char *argv[])
{
	char *defmnt;
	int p[2];
	int mfd[2];
	int stdio = 0;

	defmnt = "/tmp";
	ARGBEGIN{
	case 'i':
		defmnt = 0;
		stdio = 1;
		mfd[0] = 0;
		mfd[1] = 1;
		break;
	case 's':
		defmnt = 0;
		break;
	case 'm':
		defmnt = ARGF();
		break;
	default:
		usage();
	}ARGEND
.P2
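.PP
For readers without the macros at hand, the following hand-written approximation in portable ISO C shows roughly the kind of loop they hide.
It is a sketch, not the expansion of
.CW ARGBEGIN :
the names
.CW Opts
and
.CW parseopts
are invented, it treats option characters as bytes rather than runes, it does not handle
.CW --
specially, and it leaves
.CW argc
and
.CW argv
untouched.

```c
#include <stddef.h>

/* Hypothetical result record; the field names echo the ramfs excerpt. */
typedef struct Opts Opts;
struct Opts
{
	char *defmnt;
	int stdio;
	int bad;	/* set when an option is unknown or lacks its argument */
};

/* Parse leading option words; return the index of the first non-option argument. */
int
parseopts(int argc, char *argv[], Opts *o)
{
	int i;
	char *p;

	o->defmnt = "/tmp";
	o->stdio = 0;
	o->bad = 0;
	for(i = 1; i < argc && argv[i][0] == '-' && argv[i][1] != '\0'; i++){
		for(p = argv[i]+1; *p != '\0'; p++){
			switch(*p){	/* *p plays the role of ARGC() */
			case 'i':
				o->defmnt = NULL;
				o->stdio = 1;
				break;
			case 's':
				o->defmnt = NULL;
				break;
			case 'm':	/* like ARGF(): rest of this word, else the next word */
				if(p[1] != '\0')
					o->defmnt = p+1;
				else if(i+1 < argc)
					o->defmnt = argv[++i];
				else
					o->bad = 1;
				goto nextarg;
			default:
				o->bad = 1;
				break;
			}
		}
	nextarg:;
	}
	return i;
}
```

.LP
Given the arguments
.CW "-s -m /n/ram file" ,
the sketch leaves
.CW defmnt
pointing at
.CW /n/ram
and returns the index of
.CW file .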
1073.SH
1074Extensions
1075.PP
1076The compiler has several extensions to ANSI C, all of which are used
1077extensively in the system source.
1078First,
1079.I structure
1080.I displays
1081permit
1082.CW struct
1083expressions to be formed dynamically.
1084Given these declarations:
1085.P1
1086typedef struct Point Point;
1087typedef struct Rectangle Rectangle;
1088
1089struct Point
1090{
1091	int x, y;
1092};
1093
1094struct Rectangle
1095{
1096	Point min, max;
1097};
1098
1099Point	p, q, add(Point, Point);
1100Rectangle r;
1101int	x, y;
1102.P2
this assignment may appear anywhere an assignment is legal:
.P1
r = (Rectangle){add(p, q), (Point){x, y+3}};
.P2
The syntax is the same as for initializing a structure but with
a leading cast.
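.PP
The same notation is accepted by compilers for the 1999 C standard,
which calls it a compound literal, so a display can be tried outside Plan 9.
The following is a minimal sketch under that assumption; the helpers
.CW shift
and
.CW area
are hypothetical and exist only to show displays used as return
values and as operands.

```c
typedef struct Point Point;
typedef struct Rectangle Rectangle;

struct Point
{
	int x, y;
};

struct Rectangle
{
	Point min, max;
};

Point
add(Point a, Point b)
{
	/* a display used directly as the returned value */
	return (Point){a.x+b.x, a.y+b.y};
}

/* Hypothetical helper: the rectangle r moved by the offset d. */
Rectangle
shift(Rectangle r, Point d)
{
	/* a display whose elements are themselves struct expressions */
	return (Rectangle){add(r.min, d), add(r.max, d)};
}

/* Hypothetical helper: area of r, assuming min is the upper left corner. */
int
area(Rectangle r)
{
	return (r.max.x - r.min.x) * (r.max.y - r.min.y);
}
```

.LP
No temporary variable has to be declared to build the argument or the result;
the display is an ordinary expression of structure type.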
.PP
If an
.I anonymous
.I structure
or
.I union
is declared within another structure or union, the members of the internal
structure or union are addressable without prefix in the outer structure.
This feature eliminates the clumsy naming of nested structures and,
particularly, unions.
For example, after these declarations,
.P1
struct Lock
{
	int	locked;
};

struct Node
{
	int	type;
	union{
		double  dval;
		float   fval;
		long    lval;
	};		/* anonymous union */
	struct Lock;	/* anonymous structure */
} *node;

void	lock(struct Lock*);
.P2
one may refer to
.CW node->type ,
.CW node->dval ,
.CW node->fval ,
.CW node->lval ,
and
.CW node->locked .
Moreover, the address of a
.CW struct
.CW Node
may be used without a cast anywhere that the address of a
.CW struct
.CW Lock
is used, such as in argument lists.
The compiler automatically promotes the type and adjusts the address.
Thus one may invoke
.CW lock(node) .
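.PP
For comparison, here is a sketch of how the same layout might be written
for a compiler without the Plan 9 extensions.
It assumes a compiler for the 2011 C standard, which accepts untagged
anonymous unions; the embedded lock needs a member name there (the name
.CW lk
and the wrapper
.CW locknode
are hypothetical), and the address adjustment that Plan 9 performs
automatically is spelled out by hand.

```c
struct Lock
{
	int	locked;
};

struct Node
{
	int	type;
	union{
		double	dval;
		float	fval;
		long	lval;
	};			/* untagged anonymous union */
	struct Lock lk;		/* named member standing in for a bare "struct Lock;" */
};

void
lock(struct Lock *l)
{
	l->locked = 1;
}

/*
 * Roughly what lock(node) means in Plan 9 C:
 * pass the address of the Lock embedded in the Node.
 */
void
locknode(struct Node *n)
{
	lock(&n->lk);
}
```

.LP
The union members are reached without a prefix in both dialects;
only the lock needs the extra member name and the explicit
.CW &n->lk .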
.PP
Anonymous structures and unions may be accessed by type name
if (and only if) they are declared using a
.CW typedef
name.
For example, using the above declaration for
.CW Point ,
one may declare
.P1
struct
{
	int	type;
	Point;
} p;
.P2
and refer to
.CW p.Point .
.PP
In the initialization of arrays, a number in square brackets before an
element sets the index for the initialization.  For example, to initialize
some elements in
a table of function pointers indexed by
ASCII
character,
.P1
void	percent(void), slash(void);

void	(*func[128])(void) =
{
	['%']	percent,
	['/']	slash,
};
.P2
.LP
A similar syntax allows one to initialize structure elements:
.P1
Point p =
{
	.y 100,
	.x 200
};
.P2
These initialization syntaxes were later added to ANSI C,
in the 1999 revision of the standard, with the addition of an
equals sign between the index or tag and the value.
The Plan 9 compiler accepts either form.
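.PP
A short sketch in the form with equals signs, which should compile
both on Plan 9 and with other compilers for the 1999 standard.
The counter
.CW hits
and the function
.CW dispatch
are hypothetical additions that make the effect of the table visible.

```c
#include <stddef.h>

typedef struct Point Point;
struct Point
{
	int x, y;
};

static int hits;	/* hypothetical counter bumped by the handlers */

void
percent(void)
{
	hits += 1;
}

void
slash(void)
{
	hits += 10;
}

/* Entries that are not named stay null pointers. */
void	(*func[128])(void) =
{
	['%'] = percent,
	['/'] = slash,
};

/* Members may be listed in any order. */
Point p =
{
	.y = 100,
	.x = 200
};

/* Call the handler for character c, if any; report whether one ran. */
int
dispatch(int c)
{
	if(c < 0 || c >= 128 || func[c] == NULL)
		return 0;
	func[c]();
	return 1;
}
```

.LP
Only two of the 128 slots are written out; the rest of the table is
zeroed, which is what lets
.CW dispatch
test for a missing handler.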
.PP
Finally, the declaration
.P1
extern register reg;
.P2
.I this "" (
appearance of the register keyword is not ignored)
allocates a global register to hold the variable
.CW reg .
External registers must be used carefully: they need to be declared in
.I all
source files and libraries in the program to guarantee the register
is not allocated temporarily for other purposes.
Especially on machines with few registers, such as the i386,
it is easy to link accidentally with code that has already usurped
the global registers and there is no diagnostic when this happens.
Used wisely, though, external registers are powerful.
The Plan 9 operating system uses them to access per-process and
per-machine data structures on a multiprocessor.  The storage class they provide
is hard to create in other ways.
.SH
The compile-time environment
.PP
The code generated by the compilers is `optimized' by default:
variables are placed in registers and peephole optimizations are
performed.
The compiler flag
.CW -N
disables these optimizations.
Registerization is done locally rather than throughout a function:
whether a variable occupies a register or
the memory location identified in the symbol
table depends on the activity of the variable and may change
throughout the life of the variable.
The
.CW -N
flag is rarely needed;
its main use is to simplify debugging.
There is no information in the symbol table to identify the
registerization of a variable, so
.CW -N
guarantees the variable is always where the symbol table says it is.
.PP
Another flag,
.CW -w ,
turns
.I on
warnings about portability and problems detected in flow analysis.
Most code in Plan 9 is compiled with warnings enabled;
these warnings plus the type checking offered by function prototypes
provide most of the support of the Unix tool
.CW lint
more accurately and with less chatter.
Two of the warnings,
`used and not set' and `set and not used', are almost always accurate but
may be triggered spuriously by code with invisible control flow,
such as in routines that call
.CW longjmp .
The compiler statements
.P1
SET(v1);
USED(v2);
.P2
decorate the flow graph to silence the compiler.
Either statement accepts a comma-separated list of variables.
Use them carefully: they may silence real errors.
For the common case of unused parameters to a function,
leaving the name off the declaration silences the warnings.
That is, listing the type of a parameter but giving it no
associated variable name does the trick.
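.PP
To try the idiom with a compiler that lacks these statements, one
possible stand-in is a macro that merely mentions its argument.
The definition below is an assumption made for illustration, not the
Plan 9 implementation, and unlike the real statement it takes a single
variable; the names
.CW Handler ,
.CW twice
and
.CW apply
are hypothetical.

```c
#include <stddef.h>

/* Stand-in only: evaluates x and discards the value, which counts as a use. */
#define USED(x)	((void)(x))

/* A callback type whose second argument some handlers do not need. */
typedef int (*Handler)(int value, void *aux);

int
twice(int value, void *aux)
{
	USED(aux);	/* required by the Handler type, not by this code */
	return 2*value;
}

/* Invoke a handler with no auxiliary data. */
int
apply(Handler h, int value)
{
	return h(value, NULL);
}
```

.LP
The parameter must keep its name here because the macro refers to it;
on Plan 9 the name could simply be left off the declaration instead.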
.SH
Debugging
.PP
There are two debuggers available on Plan 9.
The first, and older, is
.CW db ,
a revision of Unix
.CW adb .
The other,
.CW acid ,
is a source-level debugger whose commands are statements in
a true programming language.
.CW Acid
is the preferred debugger, but since it
borrows some elements of
.CW db ,
notably the formats for displaying values, it is worth knowing a little bit about
.CW db .
.PP
Both debuggers support multiple architectures in a single program; that is,
the programs are
.CW db
and
.CW acid ,
not for example
.CW vdb
and
.CW vacid .
They also support cross-architecture debugging comfortably:
one may debug a 68020 binary on a MIPS.
.PP
Imagine a program has crashed mysteriously:
.P1
% X11/X
Fatal server bug!
failed to create default stipple
X 106: suicide: sys: trap: fault read addr=0x0 pc=0x00105fb8
%
.P2
When a process dies on Plan 9 it hangs in the `broken' state
for debugging.
Attach a debugger to the process by naming its process id:
.P1
% acid 106
/proc/106/text:mips plan 9 executable

/sys/lib/acid/port
/sys/lib/acid/mips
acid:
.P2
The
.CW acid
function
.CW stk()
reports the stack traceback:
.P1
acid: stk()
At pc:0x105fb8:abort+0x24 /sys/src/ape/lib/ap/stdio/abort.c:6
abort() /sys/src/ape/lib/ap/stdio/abort.c:4
	called from FatalError+#4e
		/sys/src/X/mit/server/dix/misc.c:421
FatalError(s9=#e02, s8=#4901d200, s7=#2, s6=#72701, s5=#1,
    s4=#7270d, s3=#6, s2=#12, s1=#ff37f1c, s0=#6, f=#7270f)
    /sys/src/X/mit/server/dix/misc.c:416
	called from gnotscreeninit+#4ce
		/sys/src/X/mit/server/ddx/gnot/gnot.c:792
gnotscreeninit(snum=#0, sc=#80db0)
    /sys/src/X/mit/server/ddx/gnot/gnot.c:766
	called from AddScreen+#16e
		/n/bootes/sys/src/X/mit/server/dix/main.c:610
AddScreen(pfnInit=0x0000129c,argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/dix/main.c:530
	called from InitOutput+0x80
		/sys/src/X/mit/server/ddx/brazil/brddx.c:522
InitOutput(argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/ddx/brazil/brddx.c:511
	called from main+0x294
		/sys/src/X/mit/server/dix/main.c:225
main(argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/dix/main.c:136
	called from _main+0x24
		/sys/src/ape/lib/ap/mips/main9.s:8
.P2
The function
.CW lstk()
is similar but
also reports the values of local variables.
Note that the traceback includes full file names; this is a boon to debugging,
although it makes the output much noisier.
.PP
To use
.CW acid
well you will need to learn its input language; see the
``Acid Manual'',
by Phil Winterbottom,
for details.  For simple debugging, however, the information in the manual page is
sufficient.  In particular, it describes the most useful functions
for examining a process.
.PP
The compiler does not place
information describing the types of variables in the executable,
but a compile-time flag provides crude support for symbolic debugging.
The
.CW -a
flag to the compiler suppresses code generation
and instead emits source text in the
.CW acid
language to format and display data structure types defined in the program.
The easiest way to use this feature is to put a rule in the
.CW mkfile :
.P1
syms:   main.$O
        $CC -a main.c > syms
.P2
Then from within
.CW acid ,
.P1
acid: include("sourcedirectory/syms")
.P2
to read in the relevant definitions.
(For multi-file source, you need to be a little fancier;
see
.I 2c (1)).
This text includes, for each defined compound
type, a function with that name that may be called with the address of a structure
of that type to display its contents.
For example, if
.CW rect
is a global variable of type
.CW Rectangle ,
one may execute
.P1
Rectangle(*rect)
.P2
to display it.
The
.CW *
(indirection) operator is necessary because
of the way
.CW acid
works: each global symbol in the program is defined as a variable by
.CW acid ,
with value equal to the
.I address
of the symbol.
.PP
Another common technique is to write by hand special
.CW acid
code to define functions to aid debugging, initialize the debugger, and so on.
Conventionally, this is placed in a file called
.CW acid
in the source directory; it has a line
.P1
include("sourcedirectory/syms");
.P2
to load the compiler-produced symbols.  One may edit the compiler output directly but
it is wiser to keep the hand-generated
.CW acid
separate from the machine-generated.
.PP
To make things simple, the default rules in the system
.CW mkfiles
include entries to make
.CW foo.acid
from
.CW foo.c ,
so one may use
.CW mk
to automate the production of
.CW acid
definitions for a given C source file.
.PP
There is much more to say here.  See the
.CW acid
manual page, the reference manual, or the paper
``Acid: A Debugger Built From A Language'',
also by Phil Winterbottom.