.HTML "How to Use the Plan 9 C Compiler
.TL
How to Use the Plan 9 C Compiler
.AU
Rob Pike
rob@plan9.bell-labs.com
.SH
Introduction
.PP
The C compiler on Plan 9 is a wholly new program; in fact
it was the first piece of software written for what would
eventually become Plan 9 from Bell Labs.
Programmers familiar with existing C compilers will find
a number of differences in both the language the Plan 9 compiler
accepts and in how the compiler is used.
.PP
The compiler is really a set of compilers, one for each
architecture \(em MIPS, SPARC, Intel 386, Power PC, ARM, etc. \(em
that accept a dialect of ANSI C and efficiently produce
fairly good code for the target machine.
There is a packaging of the compiler that accepts strict ANSI C for
a POSIX environment, but this document focuses on the
native Plan 9 environment, that in which all the system source and
almost all the utilities are written.
.SH
Source
.PP
The language accepted by the compilers is the core 1989 ANSI C language
with some modest extensions,
a greatly simplified preprocessor,
a smaller library that includes system calls and related facilities,
and a completely different structure for include files.
.PP
Official ANSI C accepts the old (K&R) style of declarations for
functions; the Plan 9 compilers
are more demanding.
Without an explicit compiler option
.CW -B ) (
whose use is discouraged, the compilers insist
on new-style function declarations, that is, prototypes for
function arguments.
The function declarations in the libraries' include files are
all in the new style so the interfaces are checked at compile time.
For C programmers who have not yet switched to function prototypes
the clumsy syntax may seem repellent but the payoff in stronger typing
is substantial.
Those who wish to import existing software to Plan 9 are urged
to use the opportunity to update their code.
.PP
The compilers include an integrated preprocessor that accepts the familiar
.CW #include ,
.CW #define
for macros both with and without arguments,
.CW #undef ,
.CW #line ,
.CW #ifdef ,
.CW #ifndef ,
and
.CW #endif .
It
supports neither
.CW #if
nor
.CW ## ,
although it does
honor a few
.CW #pragmas .
The
.CW #if
directive was omitted because it greatly complicates the
preprocessor, is never necessary, and is usually abused.
Conditional compilation in general makes code hard to understand;
the Plan 9 source uses it sparingly.
Also, because the compilers remove dead code, regular
.CW if
statements with constant conditions are more readable equivalents to many
.CW #ifs .
To compile imported code ineluctably fouled by
.CW #if
there is a separate command,
.CW /bin/cpp ,
that implements the complete ANSI C preprocessor specification.
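.PP
As a rough illustration of the remark about constant conditions,
the following portable C sketch uses an ordinary
.CW if
on an enumeration constant where other code might reach for
.CW #if .
The names
.CW Debug ,
.CW ntrace ,
and
.CW checksum
are invented for this example and do not come from the Plan 9 source.

```c
enum { Debug = 0 };	/* hypothetical switch; set to 1 to enable tracing */

static int ntrace;	/* hypothetical counter bumped by the tracing branch */

/* Sum the first n bytes of buf. */
unsigned
checksum(const unsigned char *buf, int n)
{
	unsigned sum;
	int i;

	sum = 0;
	for(i = 0; i < n; i++){
		if(Debug)	/* constant condition: a compiler may discard this branch */
			ntrace++;
		sum += buf[i];
	}
	return sum;
}
```

.PP
Unlike text hidden by a preprocessor conditional, the tracing branch is
still parsed and type-checked whether or not it is enabled.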
.PP
Include files fall into two groups: machine-dependent and machine-independent.
The machine-independent files occupy the directory
.CW /sys/include ;
the others are placed in a directory appropriate to the machine, such as
.CW /mips/include .
The compiler searches for include files
first in the machine-dependent directory and then
in the machine-independent directory.
At the time of writing there are thirty-one machine-independent include
files and two (per machine) machine-dependent ones:
.CW <ureg.h>
and
.CW <u.h> .
The first describes the layout of registers on the system stack,
for use by the debugger.
The second defines some
architecture-dependent types such as
.CW jmp_buf
for
.CW setjmp
and the
.CW va_arg
and
.CW va_list
macros for handling arguments to variadic functions,
as well as a set of
.CW typedef
abbreviations for
.CW unsigned
.CW short
and so on.
.PP
Here is an excerpt from
.CW /386/include/u.h :
.P1
#define nil		((void*)0)
typedef	unsigned short	ushort;
typedef	unsigned char	uchar;
typedef unsigned long	ulong;
typedef unsigned int	uint;
typedef   signed char	schar;
typedef	long long       vlong;

typedef long	jmp_buf[2];
#define	JMPBUFSP	0
#define	JMPBUFPC	1
#define	JMPBUFDPC	0
.P2
Plan 9 programs use
.CW nil
for the name of the zero-valued pointer.
The type
.CW vlong
is the largest integer type available; on most architectures it
is a 64-bit value.
A couple of other types in
.CW <u.h>
are
.CW u32int ,
which is guaranteed to have exactly 32 bits (a possibility on all the supported architectures) and
.CW mpdigit ,
which is used by the multiprecision math package
.CW <mp.h> .
The
.CW #define
constants permit an architecture-independent (but compiler-dependent)
implementation of stack-switching using
.CW setjmp
and
.CW longjmp .
.PP
Every Plan 9 C program begins
.P1
#include <u.h>
.P2
because all the other installed header files use the
.CW typedefs
declared in
.CW <u.h> .
.PP
In strict ANSI C, include files are grouped to collect related functions
in a single file: one for string functions, one for memory functions,
one for I/O, and none for system calls.
Each include file is protected by an
.CW #ifdef
to guarantee its contents are seen by the compiler only once.
Plan 9 takes a different approach.  Other than a few include
files that define external formats such as archives, the files in
.CW /sys/include
correspond to
.I libraries.
If a program is using a library, it includes the corresponding header.
The default C library comprises string functions, memory functions, and
so on, largely as in ANSI C, some formatted I/O routines,
plus all the system calls and related functions.
To use these functions, one must
.CW #include
the file
.CW <libc.h> ,
which in turn must follow
.CW <u.h> ,
to define their prototypes for the compiler.
Here is the complete source to the traditional first C program:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	print("hello world\en");
	exits(0);
}
.P2
The
.CW print
routine and its relatives
.CW fprint
and
.CW sprint
resemble the similarly-named functions in Standard I/O but are not
attached to a specific I/O library.
In Plan 9
.CW main
is not integer-valued; it should call
.CW exits ,
which takes a string argument (or null; here ANSI C promotes the 0 to a
.CW char* ).
All these functions are, of course, documented in the Programmer's Manual.
.PP
To use
.CW printf ,
.CW <stdio.h>
must be included to define the function prototype for
.CW printf :
.P1
#include <u.h>
#include <libc.h>
#include <stdio.h>

void
main(int argc, char *argv[])
{
	printf("%s: hello world; argc = %d\en", argv[0], argc);
	exits(0);
}
.P2
In practice, Standard I/O is not used much in Plan 9.  I/O libraries are
discussed in a later section of this document.
.PP
There are libraries for handling regular expressions, raster graphics,
windows, and so on, and each has an associated include file.
The manual for each library states which include files are needed.
The files are not protected against multiple inclusion and themselves
contain no nested
.CW #includes .
Instead the
programmer is expected to sort out the requirements
and to
.CW #include
the necessary files once at the top of each source file.  In practice this is
trivial: this way of handling include files is so straightforward
that it is rare for a source file to contain more than half a dozen
.CW #includes .
.PP
The compilers do their own register allocation so the
.CW register
keyword is ignored.
For different reasons,
.CW volatile
and
.CW const
are also ignored.
.PP
To make it easier to share code with other systems, Plan 9 has a version
of the compiler,
.CW pcc ,
that provides the standard ANSI C preprocessor, headers, and libraries
with POSIX extensions.
.CW Pcc
is recommended only
when broad external portability is mandated.  It compiles slower,
produces slower code (it takes extra work to simulate POSIX on Plan 9),
eliminates those parts of the Plan 9 interface
not related to POSIX, and illustrates the clumsiness of an environment
designed by committee.
.CW Pcc
is described in more detail in
.I
APE\(emThe ANSI/POSIX Environment,
.R
by Howard Trickey.
.SH
Process
.PP
Each CPU architecture supported by Plan 9 is identified by a single,
arbitrary, alphanumeric character:
.CW k
for SPARC,
.CW q
for 32-bit Power PC,
.CW v
for MIPS,
.CW 0
for little-endian MIPS,
.CW 5
for ARM v5 and later 32-bit architectures,
.CW 6
for AMD64,
.CW 8
for Intel 386, and
.CW 9
for 64-bit Power PC.
The character labels the support tools and files for that architecture.
For instance, for the 386 the compiler is
.CW 8c ,
the assembler is
.CW 8a ,
the link editor/loader is
.CW 8l ,
the object files are suffixed
.CW \&.8 ,
and the default name for an executable file is
.CW 8.out .
Before we can use the compiler we therefore need to know which
machine we are compiling for.
The next section explains how this decision is made; for the moment
assume we are building 386 binaries and make the mental substitution for
.CW 8
appropriate to the machine you are actually using.
.PP
To convert source to an executable binary is a two-step process.
First run the compiler,
.CW 8c ,
on the source, say
.CW file.c ,
to generate an object file
.CW file.8 .
Then run the loader,
.CW 8l ,
to generate an executable
.CW 8.out
that may be run (on a 386 machine):
.P1
8c file.c
8l file.8
8.out
.P2
The loader automatically links with whatever libraries the program
needs, usually including the standard C library as defined by
.CW <libc.h> .
Of course the compiler and loader have lots of options, both familiar and new;
see the manual for details.
The compiler does not generate an executable automatically;
the output of the compiler must be given to the loader.
Since most compilation is done under the control of
.CW mk
(see below), this is rarely an inconvenience.
.PP
The distribution of work between the compiler and loader is unusual.
The compiler integrates preprocessing, parsing, register allocation,
code generation and some assembly.
Combining these tasks in a single program is part of the reason for
the compiler's efficiency.
The loader does instruction selection, branch folding,
instruction scheduling,
and writes the final executable.
There is no separate C preprocessor and no assembler in the usual pipeline.
Instead the intermediate object file
(here a
.CW \&.8
file) is a type of binary assembly language.
The instructions in the intermediate format are not exactly those in
the machine.  For example, on the 68020 the object file may specify
a MOVE instruction but the loader will decide just which variant of
the MOVE instruction \(em MOVE immediate, MOVE quick, MOVE address,
etc. \(em is most efficient.
.PP
The assembler,
.CW 8a ,
is just a translator between the textual and binary
representations of the object file format.
It is not an assembler in the traditional sense.  It has limited
macro capabilities (the same as the integral C preprocessor in the compiler),
clumsy syntax, and minimal error checking.  For instance, the assembler
will accept an instruction (such as memory-to-memory MOVE on the MIPS) that the
machine does not actually support; only when the output of the assembler
is passed to the loader will the error be discovered.
The assembler is intended only for writing things that need access to instructions
invisible from C,
such as the machine-dependent
part of an operating system;
very little code in Plan 9 is in assembly language.
.PP
The compilers take an option
.CW -S
that causes them to print on their standard output the generated code
in a format acceptable as input to the assemblers.
This is of course merely a formatting of the
data in the object file; therefore the assembler is just
an
ASCII-to-binary converter for this format.
Other than the specific instructions, the input to the assemblers
is largely architecture-independent; see
``A Manual for the Plan 9 Assembler'',
by Rob Pike,
for more information.
.PP
The loader is an integral part of the compilation process.
Each library header file contains a
.CW #pragma
that tells the loader the name of the associated archive; it is
not necessary to tell the loader which libraries a program uses.
The C run-time startup is found, by default, in the C library.
The loader starts with an undefined
symbol,
.CW _main ,
that is resolved by pulling in the run-time startup code from the library.
(When profiling is enabled the loader starts instead with the undefined symbol
.CW _mainp ,
to force loading of the profiling start-up code.)
.PP
Unlike its counterpart on other systems, the Plan 9 loader rearranges
data to optimize access.  This means the order of variables in the
loaded program is unrelated to their order in the source.
Most programs don't care, but some assume that, for example, the
variables declared by
.P1
int a;
int b;
.P2
will appear at adjacent addresses in memory.  On Plan 9, they won't.
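.PP
A program that genuinely needs two values side by side can say so
in the language rather than rely on the loader.
The sketch below is one way to do that in portable C; the names
.CW pair ,
.CW Pair ,
and
.CW pairgap
are invented for the example.

```c
/* Two values that must be adjacent belong in one object. */
int pair[2];	/* pair[0] and pair[1] are contiguous by definition */

struct Pair	/* hypothetical type for this example */
{
	int a;
	int b;	/* follows a within the same object, possibly after padding */
};

/* Return how many ints separate the two elements of pair; always 1. */
int
pairgap(void)
{
	return (int)(&pair[1] - &pair[0]);
}
```

.PP
Array elements are always contiguous, and structure members keep their
declared order, so neither form depends on how separate variables are placed.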
.SH
Heterogeneity
.PP
When the system starts or a user logs in, the environment is configured
so the appropriate binaries are available in
.CW /bin .
The configuration process is controlled by an environment variable,
.CW $cputype ,
with a value such as
.CW mips ,
.CW 386 ,
.CW arm ,
or
.CW sparc .
For each architecture there is a directory in the root,
with the appropriate name,
that holds the binary and library files for that architecture.
Thus
.CW /mips/lib
contains the object code libraries for MIPS programs,
.CW /mips/include
holds MIPS-specific include files, and
.CW /mips/bin
has the MIPS binaries.
These binaries are attached to
.CW /bin
at boot time by binding
.CW /$cputype/bin
to
.CW /bin ,
so
.CW /bin
always contains the correct files.
.PP
The MIPS compiler,
.CW vc ,
by definition
produces object files for the MIPS architecture,
regardless of the architecture of the machine on which the compiler is running.
There is a version of
.CW vc
compiled for each architecture:
.CW /mips/bin/vc ,
.CW /arm/bin/vc ,
.CW /sparc/bin/vc ,
and so on,
each capable of producing MIPS object files regardless of the native
instruction set.
If one is running on a SPARC,
.CW /sparc/bin/vc
will compile programs for the MIPS;
if one is running on machine
.CW $cputype ,
.CW /$cputype/bin/vc
will compile programs for the MIPS.
.PP
Because of the bindings that assemble
.CW /bin ,
the shell always looks for a command, say
.CW date ,
in
.CW /bin
and automatically finds the file
.CW /$cputype/bin/date .
Therefore the MIPS compiler is known as just
.CW vc ;
the shell will invoke
.CW /bin/vc
and that is guaranteed to be the version of the MIPS compiler
appropriate for the machine running the command.
Regardless of the architecture of the compiling machine,
.CW /bin/vc
is
.I always
the MIPS compiler.
.PP
Also, the output of
.CW vc
and
.CW vl
is completely independent of the machine type on which they are executed:
.CW \&.v
files compiled (with
.CW vc )
on a SPARC may be linked (with
.CW vl )
on a 386.
(The resulting
.CW v.out
will run, of course, only on a MIPS.)
Similarly, the MIPS libraries in
.CW /mips/lib
are suitable for loading with
.CW vl
on any machine; there is only one set of MIPS libraries, not one
set for each architecture that supports the MIPS compiler.
.SH
Heterogeneity and \f(CWmk\fP
.PP
Most software on Plan 9 is compiled under the control of
.CW mk ,
a descendant of
.CW make
that is documented in the Programmer's Manual.
A convention used throughout the
.CW mkfiles
makes it easy to compile the source into binary suitable for any architecture.
.PP
The variable
.CW $cputype
is advisory: it reports the architecture of the current environment, and should
not be modified.  A second variable,
.CW $objtype ,
is used to set which architecture is being
.I compiled
for.
The value of
.CW $objtype
can be used by a
.CW mkfile
to configure the compilation environment.
.PP
In each machine's root directory there is a short
.CW mkfile
that defines a set of macros for the compiler, loader, etc.
Here is
.CW /mips/mkfile :
.P1
</sys/src/mkfile.proto

CC=vc
LD=vl
O=v
AS=va
.P2
The line
.P1
</sys/src/mkfile.proto
.P2
causes
.CW mk
to include the file
.CW /sys/src/mkfile.proto ,
which contains general definitions:
.P1
#
# common mkfile parameters shared by all architectures
#

OS=5689qv
CPUS=arm amd64 386 power mips
CFLAGS=-FTVw
LEX=lex
YACC=yacc
MK=/bin/mk
.P2
.CW CC
is obviously the compiler,
.CW AS
the assembler, and
.CW LD
the loader.
.CW O
is the suffix for the object files and
.CW CPUS
and
.CW OS
are used in special rules described below.
.PP
Here is a
.CW mkfile
to build the installed source for
.CW sam :
.P1
</$objtype/mkfile
OBJ=sam.$O address.$O buffer.$O cmd.$O disc.$O error.$O \e
	file.$O io.$O list.$O mesg.$O moveto.$O multi.$O \e
	plan9.$O rasp.$O regexp.$O string.$O sys.$O xec.$O

$O.out:	$OBJ
	$LD $OBJ

install:	$O.out
	cp $O.out /$objtype/bin/sam

installall:
	for(objtype in $CPUS) mk install

%.$O:	%.c
	$CC $CFLAGS $stem.c

$OBJ:	sam.h errors.h mesg.h
address.$O cmd.$O parse.$O xec.$O unix.$O:	parse.h

clean:V:
	rm -f [$OS].out *.[$OS] y.tab.?
.P2
(The actual
.CW mkfile
imports most of its rules from other secondary files, but
this example works and is not misleading.)
The first line causes
.CW mk
to include the contents of
.CW /$objtype/mkfile
in the current
.CW mkfile .
If
.CW $objtype
is
.CW mips ,
this inserts the MIPS macro definitions into the
.CW mkfile .
In this case the rule for
.CW $O.out
uses the MIPS tools to build
.CW v.out .
The
.CW %.$O
rule in the file uses
.CW mk 's
pattern matching facilities to convert the source files to the object
files through the compiler.
(The text of the rules is passed directly to the shell,
.CW rc ,
without further translation.
See the
.CW mk
manual if any of this is unfamiliar.)
Because the default rule builds
.CW $O.out
rather than
.CW sam ,
it is possible to maintain binaries for multiple machines in the
same source directory without conflict.
This is also, of course, why the output files from the various
compilers and loaders
have distinct names.
.PP
The rest of the
.CW mkfile
should be easy to follow; notice how the rules for
.CW clean
and
.CW installall
(that is, install versions for all architectures) use other macros
defined in
.CW /$objtype/mkfile .
In Plan 9,
.CW mkfiles
for commands conventionally contain rules to
.CW install
(compile and install the version for
.CW $objtype ),
.CW installall
(compile and install for all
.CW $objtypes ),
and
.CW clean
(remove all object files, binaries, etc.).
.PP
The
.CW mkfile
is easy to use.  To build a MIPS binary,
.CW v.out :
.P1
% objtype=mips
% mk
.P2
To build and install a MIPS binary:
.P1
% objtype=mips
% mk install
.P2
To build and install all versions:
.P1
% mk installall
.P2
These conventions make cross-compilation as easy to manage
as traditional native compilation.
Plan 9 programs compile and run without change on machines from
large multiprocessors to laptops.  For more information about this process, see
``Plan 9 Mkfiles'',
by Bob Flandrena.
.SH
Portability
.PP
Within Plan 9, it is painless to write portable programs, programs whose
source is independent of the machine on which they execute.
The operating system is fixed and the compiler, headers and libraries
are constant so most of the stumbling blocks to portability are removed.
Attention to a few details can avoid those that remain.
.PP
Plan 9 is a heterogeneous environment, so programs must
.I expect
that external files will be written by programs on machines of different
architectures.
The compilers, for instance, must handle without confusion
object files written by other machines.
The traditional approach to this problem is to pepper the source with
.CW #ifdefs
to turn byte-swapping on and off.
Plan 9 takes a different approach: of the handful of machine-dependent
.CW #ifdefs
in all the source, almost all are deep in the libraries.
Instead programs read and write files in a defined format,
either (for low volume applications) as formatted text, or
(for high volume applications) as binary in a known byte order.
If the external data were written with the most significant
byte first, the following code reads a 4-byte integer correctly
regardless of the architecture of the executing machine (assuming
an unsigned long holds 4 bytes):
.P1
ulong
getlong(void)
{
	ulong l;

	l = (getchar()&0xFF)<<24;
	l |= (getchar()&0xFF)<<16;
	l |= (getchar()&0xFF)<<8;
	l |= (getchar()&0xFF)<<0;
	return l;
}
.P2
Note that this code does not `swap' the bytes; instead it just reads
them in the correct order.
Variations of this code will handle any binary format
and also avoid problems
involving how structures are padded, how words are aligned,
and other impediments to portability.
Be aware, though, that extra care is needed to handle floating point data.
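.PP
One such variation, sketched here in portable C, works on a byte buffer
instead of standard input and includes the matching routine for writing.
The names
.CW putlong
and
.CW getlongbuf
are invented for the example, and the
.CW typedefs
stand in for those of
.CW <u.h> .

```c
typedef unsigned long ulong;	/* stand-in for <u.h>; may be wider than 4 bytes elsewhere */
typedef unsigned char uchar;

/* Store the low 32 bits of l in p[0..3], most significant byte first. */
void
putlong(uchar *p, ulong l)
{
	p[0] = (l>>24) & 0xFF;
	p[1] = (l>>16) & 0xFF;
	p[2] = (l>>8) & 0xFF;
	p[3] = l & 0xFF;
}

/* Read the 4 bytes at p, most significant byte first. */
ulong
getlongbuf(uchar *p)
{
	return ((ulong)p[0]<<24) | ((ulong)p[1]<<16) | ((ulong)p[2]<<8) | (ulong)p[3];
}
```

.PP
Because each byte is placed by shifting, the same source gives the same
external format on big-endian and little-endian machines alike.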
748Be aware, though, that extra care is needed to handle floating point data.
749.PP
750Efficiency hounds will argue that this method is unnecessarily slow and clumsy
751when the executing machine has the same byte order (and padding and alignment)
752as the data.
753The CPU cost of I/O processing
754is rarely the bottleneck for an application, however,
755and the gain in simplicity of porting and maintaining the code greatly outweighs
756the minor speed loss from handling data in this general way.
757This method is how the Plan 9 compilers, the window system, and even the file
758servers transmit data between programs.
759.PP
760To port programs beyond Plan 9, where the system interface is more variable,
761it is probably necessary to use
762.CW pcc
763and hope that the target machine supports ANSI C and POSIX.
.SH
I/O
.PP
The default C library, defined by the include file
.CW <libc.h> ,
contains no buffered I/O package.
It does have several entry points for printing formatted text:
.CW print
outputs text to the standard output,
.CW fprint
outputs text to a specified integer file descriptor, and
.CW sprint
places text in a character array.
To access library routines for buffered I/O, a program must
explicitly include the header file associated with an appropriate library.
.PP
The recommended I/O library, used by most Plan 9 utilities, is
.CW bio
(buffered I/O), defined by
.CW <bio.h> .
There also exists an implementation of ANSI Standard I/O,
.CW stdio .
.PP
.CW Bio
is small and efficient, particularly for buffer-at-a-time or
line-at-a-time I/O.
Even for character-at-a-time I/O, however, it is significantly faster than
the Standard I/O library,
.CW stdio .
Its interface is compact and regular, although it lacks a few conveniences.
The most noticeable is that one must explicitly define buffers for standard
input and output;
.CW bio
does not predefine them.  Here is a program to copy input to output a byte
at a time using
.CW bio :
.P1
#include <u.h>
#include <libc.h>
#include <bio.h>

Biobuf	bin;
Biobuf	bout;

void
main(void)
{
	int c;

	Binit(&bin, 0, OREAD);
	Binit(&bout, 1, OWRITE);

	while((c=Bgetc(&bin)) != Beof)
		Bputc(&bout, c);
	exits(0);
}
.P2
For peak performance, we could replace
.CW Bgetc
and
.CW Bputc
by their equivalent in-line macros
.CW BGETC
and
.CW BPUTC
but
the performance gain would be modest.
For more information on
.CW bio ,
see the Programmer's Manual.
.PP
Perhaps the most dramatic difference in the I/O interface of Plan 9 from other
systems' is that text is not ASCII.
The format for
text in Plan 9 is a byte-stream encoding of 16-bit characters.
The character set is based on the Unicode Standard and is backward compatible with
ASCII:
characters with value 0 through 127 are the same in both sets.
The 16-bit characters, called
.I runes
in Plan 9, are encoded using a representation called
UTF,
an encoding that is becoming accepted as a standard.
(ISO calls it UTF-8;
throughout Plan 9 it's just called
UTF.)
UTF
defines multibyte sequences to
represent character values from 0 to 65535.
In
UTF,
character values up to 127 decimal, 7F hexadecimal, represent themselves,
so straight
ASCII
files are also valid
UTF.
Also,
UTF
guarantees that bytes with values 0 to 127 (NUL to DEL, inclusive)
will appear only when they represent themselves, so programs that read bytes
looking for plain ASCII characters will continue to work.
Any program that expects a one-to-one correspondence between bytes and
characters will, however, need to be modified.
An example is parsing file names.
File names, like all text, are in
UTF,
so it is incorrect to search for a character in a string by
.CW strchr(filename,
.CW c)
because the character might have a multi-byte encoding.
The correct method is to call
.CW utfrune(filename,
.CW c) ,
defined in
.I rune (2),
which interprets the file name as a sequence of encoded characters
rather than bytes.
In fact, even when you know the character is a single byte
that can represent only itself,
it is safer to use
.CW utfrune
because that assumes nothing about the character set
and its representation.
.PP
The library defines several symbols relevant to the representation of characters.
Any byte with unsigned value less than
.CW Runesync
will not appear in any multi-byte encoding of a character.
.CW Utfrune
compares the character being searched for against
.CW Runesync
to see if it is sufficient to call
.CW strchr
or if the byte stream must be interpreted.
Any character with value less than
.CW Runeself
is represented by a single byte with the same value.
Finally, when errors are encountered converting
to runes from a byte stream, the library returns the rune value
.CW Runeerror
and advances a single byte.  This permits programs to find runes
embedded in binary data.
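.PP
To make the search strategy concrete, here is a small portable C sketch
of a routine in the spirit of
.CW utfrune .
It is not the library's implementation: the name
.CW findrune
and the constant
.CW Sync
(assumed here to be 0x80) are invented for the example, and instead of
decoding the string it encodes the character and searches for the
resulting bytes.

```c
#include <string.h>

enum { Sync = 0x80 };	/* assumed: bytes below this never occur inside a multi-byte sequence */

/*
 * Return a pointer to the first occurrence of the 16-bit character c
 * in the UTF string s, or 0 if it does not occur.
 */
char*
findrune(char *s, unsigned c)
{
	char enc[4];

	if(c < Sync)	/* a single byte that represents only itself */
		return strchr(s, (int)c);
	if(c < 0x800){	/* two-byte sequence */
		enc[0] = (char)(0xC0 | (c>>6));
		enc[1] = (char)(0x80 | (c&0x3F));
		enc[2] = 0;
	}else{	/* three-byte sequence, enough for 16-bit values */
		enc[0] = (char)(0xE0 | ((c>>12)&0x0F));
		enc[1] = (char)(0x80 | ((c>>6)&0x3F));
		enc[2] = (char)(0x80 | (c&0x3F));
		enc[3] = 0;
	}
	return strstr(s, enc);
}
```

.PP
Searching for the encoded bytes is safe because the first byte of a
sequence can never be mistaken for one of its continuation bytes.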
.PP
.CW Bio
includes routines
.CW Bgetrune
and
.CW Bputrune
to transform the external byte stream
UTF
format to and from
internal 16-bit runes.
Also, the
.CW %s
format to
.CW print
accepts
UTF;
.CW %c
prints a character after narrowing it to 8 bits.
The
.CW %S
format prints a null-terminated sequence of runes;
.CW %C
prints a character after narrowing it to 16 bits.
For more information, see the Programmer's Manual, in particular
.I utf (6)
and
.I rune (2),
and the paper,
``Hello world, or
Καλημέρα κόσμε, or\
\f(Jpこんにちは 世界\f1'',
by Rob Pike and
Ken Thompson;
there is not room for the full story here.
.PP
These issues affect the compiler in several ways.
First, the C source is in
UTF.
ANSI says C variables are formed from
ASCII
alphanumerics, but comments and literal strings may contain any characters
encoded in the native encoding, here
UTF.
The declaration
.P1
char *cp = "abcÿ";
.P2
initializes the variable
.CW cp
to point to an array of bytes holding the
UTF
representation of the characters
.CW abcÿ .
The type
.CW Rune
is defined in
.CW <u.h>
to be
.CW ushort ,
which is also the `wide character' type in the compiler.
Therefore the declaration
.P1
Rune *rp = L"abcÿ";
.P2
initializes the variable
.CW rp
to point to an array of unsigned short integers holding the 16-bit
values of the characters
.CW abcÿ .
Note that in both these declarations the characters in the source
that represent
.CW "abcÿ"
are the same; what changes is how those characters are represented
in memory in the program.
The following two lines:
.P1
print("%s\en", "abcÿ");
print("%S\en", L"abcÿ");
.P2
produce the same
UTF
string on their output, the first by copying the bytes, the second
by converting from runes to bytes.
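.PP
The conversion from runes to bytes can be sketched in portable C as follows.
This is an illustration of the encoding for 16-bit values only, not the
code used by
.CW print ;
the name
.CW encoderunes
is invented for the example.

```c
typedef unsigned short Rune;	/* 16-bit character, as the text describes */

/*
 * Encode the zero-terminated rune string r as UTF into buf, which holds
 * n bytes; return the number of bytes written, not counting the final 0.
 * Output is truncated at a character boundary if buf is too small.
 */
int
encoderunes(char *buf, int n, const Rune *r)
{
	int i;
	unsigned c;

	i = 0;
	while((c = *r++) != 0){
		if(c < 0x80){
			if(i+1 >= n)
				break;
			buf[i++] = (char)c;
		}else if(c < 0x800){
			if(i+2 >= n)
				break;
			buf[i++] = (char)(0xC0 | (c>>6));
			buf[i++] = (char)(0x80 | (c&0x3F));
		}else{
			if(i+3 >= n)
				break;
			buf[i++] = (char)(0xE0 | (c>>12));
			buf[i++] = (char)(0x80 | ((c>>6)&0x3F));
			buf[i++] = (char)(0x80 | (c&0x3F));
		}
	}
	if(n > 0)
		buf[i] = 0;
	return i;
}
```

.PP
Given the four runes a, b, c, and ÿ (value 255), the routine writes five
bytes, because ÿ needs a two-byte sequence.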
.PP
In C, character constants are integers but narrowed through the
.CW char
type.
The Unicode character
.CW ÿ
has value 255, so if the
.CW char
type is signed,
the constant
.CW 'ÿ'
has value \-1 (which is equal to EOF).
On the other hand,
.CW L'ÿ'
narrows through the wide character type,
.CW ushort ,
and therefore has value 255.
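.PP
The two narrowings can be imitated with explicit casts in portable C.
The function names are invented for the example, and the result of the
first cast is formally implementation-defined in ISO C, although it is
\-1 on the usual two's-complement machines.

```c
/* Value of a character constant narrowed through a signed 8-bit char. */
int
narrowchar(int v)
{
	return (signed char)v;
}

/* Value of a character constant narrowed through an unsigned 16-bit wide character. */
int
narrowwide(int v)
{
	return (unsigned short)v;
}
```

.PP
Passing 255 to the first function mimics
.CW 'ÿ'
with a signed
.CW char ;
passing it to the second mimics
.CW L'ÿ' .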
.PP
Finally, although it's not ANSI C, the Plan 9 C compilers
assume any character with value above
.CW Runeself
is an alphanumeric,
so α is a legal, if non-portable, variable name.
.SH
Arguments
.PP
Some macros are defined
in
.CW <libc.h>
for parsing the arguments to
.CW main() .
They are described in
.I ARG (2)
but are fairly self-explanatory.
There are four macros:
.CW ARGBEGIN
and
.CW ARGEND
are used to bracket a hidden
.CW switch
statement within which
.CW ARGC
returns the current option character (rune) being processed and
.CW ARGF
returns the argument to the option, as in the loader option
.CW -o
.CW file .
Here, for example, is the code at the beginning of
.CW main()
in
.CW ramfs.c
(see
.I ramfs (4))
that cracks its arguments:
.P1
void
main(int argc, char *argv[])
{
	char *defmnt;
	int p[2];
	int mfd[2];
	int stdio = 0;

	defmnt = "/tmp";
	ARGBEGIN{
	case 'i':
		defmnt = 0;
		stdio = 1;
		mfd[0] = 0;
		mfd[1] = 1;
		break;
	case 's':
		defmnt = 0;
		break;
	case 'm':
		defmnt = ARGF();
		break;
	default:
		usage();
	}ARGEND
.P2
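.PP
For readers without the macros at hand, the following portable C sketch
spells out by hand roughly the kind of loop they hide.
It is not the expansion of the real macros, whose exact behavior is
given in
.I ARG (2);
the structure
.CW Opts
and the function
.CW crack
are invented for the example, and unknown options are reported by a
return value rather than a call to
.CW usage .

```c
/* Hypothetical holder for the settings chosen by the options. */
struct Opts
{
	char *defmnt;
	int stdio;
};

/*
 * Parse options i, s, and m (m takes an argument, attached or in the
 * next word).  Return the index of the first non-option argument,
 * or -1 for an unknown option or a missing argument.
 */
int
crack(int argc, char *argv[], struct Opts *o)
{
	int i;
	char *p;

	o->defmnt = "/tmp";
	o->stdio = 0;
	for(i = 1; i < argc && argv[i][0] == '-' && argv[i][1] != 0; i++){
		if(argv[i][1] == '-' && argv[i][2] == 0)
			return i+1;	/* "--" ends the options */
		for(p = argv[i]+1; *p != 0; p++){
			switch(*p){
			case 'i':
				o->defmnt = 0;
				o->stdio = 1;
				break;
			case 's':
				o->defmnt = 0;
				break;
			case 'm':
				if(p[1] != 0)
					o->defmnt = p+1;	/* -mdir */
				else if(i+1 < argc)
					o->defmnt = argv[++i];	/* -m dir */
				else
					return -1;
				goto nextword;
			default:
				return -1;
			}
		}
	nextword:;
	}
	return i;
}
```

.PP
The macro version is shorter mainly because the bookkeeping for the
current word and the option argument is hidden from the caller.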
1069.SH
1070Extensions
1071.PP
1072The compiler has several extensions to 1989 ANSI C, all of which are used
1073extensively in the system source.
1074Some of these have been adopted in later ANSI C standards.
1075First,
1076.I structure
1077.I displays
1078permit
1079.CW struct
1080expressions to be formed dynamically.
1081Given these declarations:
.P1
typedef struct Point Point;
typedef struct Rectangle Rectangle;

struct Point
{
	int x, y;
};

struct Rectangle
{
	Point min, max;
};

Point	p, q, add(Point, Point);
Rectangle r;
int	x, y;
.P2
this assignment may appear anywhere an assignment is legal:
.P1
r = (Rectangle){add(p, q), (Point){x, y+3}};
.P2
The syntax is the same as for initializing a structure but with
a leading cast.
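.PP
The same notation is available in C99 under the name
.I compound
.I literal ,
so the assignment can be tried with any C99 compiler.
The sketch below repeats the declarations so that it stands alone;
the body of
.CW add
and the helper
.CW mkrect
are assumptions made for the illustration, since the text above only declares
.CW add .

```c
#include <assert.h>

typedef struct Point Point;
typedef struct Rectangle Rectangle;

struct Point
{
	int x, y;
};

struct Rectangle
{
	Point min, max;
};

/* Assumed body: add is declared but not defined in the text. */
Point
add(Point a, Point b)
{
	return (Point){a.x+b.x, a.y+b.y};
}

/* Hypothetical helper: build a rectangle in a single expression. */
Rectangle
mkrect(Point p, Point q, int x, int y)
{
	Rectangle r;

	r = (Rectangle){add(p, q), (Point){x, y+3}};
	assert(r.max.y == y+3);
	return r;
}
```

Because the display is an ordinary expression, it can also appear directly as a function argument, as in
.CW "mkrect((Point){1, 2}, (Point){10, 20}, 5, 6)" .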
.PP
If an
.I anonymous
.I structure
or
.I union
is declared within another structure or union, the members of the internal
structure or union are addressable without prefix in the outer structure.
This feature eliminates the clumsy naming of nested structures and,
particularly, unions.
For example, after these declarations,
.P1
struct Lock
{
	int	locked;
};

struct Node
{
	int	type;
	union{
		double  dval;
		float   fval;
		long    lval;
	};		/* anonymous union */
	struct Lock;	/* anonymous structure */
} *node;

void	lock(struct Lock*);
.P2
one may refer to
.CW node->type ,
.CW node->dval ,
.CW node->fval ,
.CW node->lval ,
and
.CW node->locked .
Moreover, the address of a
.CW struct
.CW Node
may be used without a cast anywhere that the address of a
.CW struct
.CW Lock
is used, such as in argument lists.
The compiler automatically promotes the type and adjusts the address.
Thus one may invoke
.CW lock(node) .
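.PP
For comparison, here is roughly how the same structure looks in standard C11,
which adopted anonymous unions but has neither the tagged anonymous structure
nor the automatic pointer promotion.
This is a sketch under assumptions: the member name
.CW lk ,
the helper
.CW initnode ,
and the body of
.CW lock
are all invented for the illustration.

```c
#include <assert.h>

struct Lock
{
	int	locked;
};

struct Node
{
	int	type;
	union{
		double	dval;
		long	lval;
	};			/* anonymous union: standard since C11 */
	struct Lock lk;		/* named member; lk is a hypothetical name */
};

/* Assumed body; the text only declares lock(). */
void
lock(struct Lock *l)
{
	assert(l != 0);
	l->locked = 1;
}

/* Hypothetical helper: fill in a node, then lock it. */
void
initnode(struct Node *node, long v)
{
	node->type = 1;
	node->lval = v;		/* union member reached without a prefix */
	lock(&node->lk);	/* Plan 9 C would accept lock(node) */
}
```

The explicit
.CW &node->lk
is what the Plan 9 compiler computes on the programmer's behalf when it adjusts the address.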
.PP
Anonymous structures and unions may be accessed by type name
if (and only if) they are declared using a
.CW typedef
name.
For example, using the above declaration for
.CW Point ,
one may declare
.P1
struct
{
	int	type;
	Point;
} p;
.P2
and refer to
.CW p.Point .
.PP
In the initialization of arrays, a number in square brackets before an
element sets the index for the initialization.  For example, to initialize
some elements in
a table of function pointers indexed by
ASCII
character,
.P1
void	percent(void), slash(void);

void	(*func[128])(void) =
{
	['%']	percent,
	['/']	slash,
};
.P2
.LP
A similar syntax allows one to initialize structure elements:
.P1
Point p =
{
	.y 100,
	.x 200
};
.P2
These initialization syntaxes were later added to ANSI C, with the addition of an
equals sign between the index or tag and the value.
The Plan 9 compiler accepts either form.
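.PP
Written with the equals sign, the two initializations above are valid C99
and can be checked with any conforming compiler.
In this sketch the bodies of
.CW percent
and
.CW slash ,
the variable
.CW last ,
the name
.CW pt ,
and the function
.CW dispatch
are assumptions added only to make the fragment self-contained.

```c
#include <assert.h>

typedef struct Point Point;
struct Point
{
	int x, y;
};

/* Assumed bodies; the text only declares these functions. */
static int last;
void percent(void) { last = '%'; }
void slash(void) { last = '/'; }

/* Elements not named by an index are initialized to zero. */
void	(*func[128])(void) =
{
	['%'] = percent,
	['/'] = slash,
};

/* Tags may appear in any order. */
Point pt =
{
	.y = 100,
	.x = 200
};

/* Hypothetical dispatcher: call the handler for c, if there is one. */
int
dispatch(int c)
{
	assert(c >= 0 && c < 128);
	if(func[c] == 0)
		return 0;
	func[c]();
	return last;
}
```

The unnamed entries of
.CW func
are null pointers, which is why
.CW dispatch
can test for a missing handler before calling through the table.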
.PP
Finally, the declaration
.P1
extern register reg;
.P2
.I this "" (
appearance of the register keyword is not ignored)
allocates a global register to hold the variable
.CW reg .
External registers must be used carefully: they need to be declared in
.I all
source files and libraries in the program to guarantee the register
is not allocated temporarily for other purposes.
Especially on machines with few registers, such as the i386,
it is easy to link accidentally with code that has already usurped
the global registers and there is no diagnostic when this happens.
Used wisely, though, external registers are powerful.
The Plan 9 operating system uses them to access per-process and
per-machine data structures on a multiprocessor.  The storage class they provide
is hard to create in other ways.
.SH
The compile-time environment
.PP
The code generated by the compilers is `optimized' by default:
variables are placed in registers and peephole optimizations are
performed.
The compiler flag
.CW -N
disables these optimizations.
Registerization is done locally rather than throughout a function:
whether a variable occupies a register or
the memory location identified in the symbol
table depends on the activity of the variable and may change
throughout the life of the variable.
The
.CW -N
flag is rarely needed;
its main use is to simplify debugging.
There is no information in the symbol table to identify the
registerization of a variable, so
.CW -N
guarantees the variable is always where the symbol table says it is.
.PP
Another flag,
.CW -w ,
turns
.I on
warnings about portability and problems detected in flow analysis.
Most code in Plan 9 is compiled with warnings enabled;
these warnings plus the type checking offered by function prototypes
provide most of the support of the Unix tool
.CW lint
more accurately and with less chatter.
Two of the warnings,
`used and not set' and `set and not used', are almost always accurate but
may be triggered spuriously by code with invisible control flow,
such as in routines that call
.CW longjmp .
The compiler statements
.P1
SET(v1);
USED(v2);
.P2
decorate the flow graph to silence the compiler.
Either statement accepts a comma-separated list of variables.
Use them carefully: they may silence real errors.
For the common case of unused parameters to a function,
leaving the name off the declaration silences the warnings.
That is, listing the type of a parameter but giving it no
associated variable name does the trick.
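.PP
Code that must also build with other compilers sometimes supplies its own
definition of
.CW USED .
The macro below is a hypothetical stand-in written for this illustration,
not the Plan 9 compiler's built-in statement: it takes a single variable
rather than a list, and it works by evaluating the variable and discarding the result.
The function
.CW handler
is likewise invented.

```c
#include <assert.h>

/*
 * Hypothetical portable stand-in for the Plan 9 USED statement.
 * It accepts one variable, not a comma-separated list.
 */
#define USED(x)	((void)(x))

/* A callback whose signature is fixed but whose second argument is unneeded. */
int
handler(int fd, void *aux)
{
	USED(aux);	/* mark the parameter as deliberately unused */
	assert(fd >= 0);
	return fd+1;
}
```

Under the Plan 9 compilers the simpler course is the one described above:
declare the parameter as
.CW "void*"
with no name.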
.SH
Debugging
.PP
There are two debuggers available on Plan 9.
The first, and older, is
.CW db ,
a revision of Unix
.CW adb .
The other,
.CW acid ,
is a source-level debugger whose commands are statements in
a true programming language.
.CW Acid
is the preferred debugger, but since it
borrows some elements of
.CW db ,
notably the formats for displaying values, it is worth knowing a little bit about
.CW db .
.PP
Both debuggers support multiple architectures in a single program; that is,
the programs are
.CW db
and
.CW acid ,
not for example
.CW vdb
and
.CW vacid .
They also support cross-architecture debugging comfortably:
one may debug a 386 binary on a MIPS.
.PP
Imagine a program has crashed mysteriously:
.P1
% X11/X
Fatal server bug!
failed to create default stipple
X 106: suicide: sys: trap: fault read addr=0x0 pc=0x00105fb8
%
.P2
When a process dies on Plan 9 it hangs in the `broken' state
for debugging.
Attach a debugger to the process by naming its process id:
.P1
% acid 106
/proc/106/text:mips plan 9 executable

/sys/lib/acid/port
/sys/lib/acid/mips
acid:
.P2
The
.CW acid
function
.CW stk()
reports the stack traceback:
.P1
acid: stk()
At pc:0x105fb8:abort+0x24 /sys/src/ape/lib/ap/stdio/abort.c:6
abort() /sys/src/ape/lib/ap/stdio/abort.c:4
	called from FatalError+#4e
		/sys/src/X/mit/server/dix/misc.c:421
FatalError(s9=#e02, s8=#4901d200, s7=#2, s6=#72701, s5=#1,
    s4=#7270d, s3=#6, s2=#12, s1=#ff37f1c, s0=#6, f=#7270f)
    /sys/src/X/mit/server/dix/misc.c:416
	called from gnotscreeninit+#4ce
		/sys/src/X/mit/server/ddx/gnot/gnot.c:792
gnotscreeninit(snum=#0, sc=#80db0)
    /sys/src/X/mit/server/ddx/gnot/gnot.c:766
	called from AddScreen+#16e
		/n/bootes/sys/src/X/mit/server/dix/main.c:610
AddScreen(pfnInit=0x0000129c,argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/dix/main.c:530
	called from InitOutput+0x80
		/sys/src/X/mit/server/ddx/brazil/brddx.c:522
InitOutput(argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/ddx/brazil/brddx.c:511
	called from main+0x294
		/sys/src/X/mit/server/dix/main.c:225
main(argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/dix/main.c:136
	called from _main+0x24
		/sys/src/ape/lib/ap/mips/main9.s:8
.P2
The function
.CW lstk()
is similar but
also reports the values of local variables.
Note that the traceback includes full file names; this is a boon to debugging,
although it makes the output much noisier.
.PP
To use
.CW acid
well you will need to learn its input language; see the
``Acid Manual'',
by Phil Winterbottom,
for details.  For simple debugging, however, the information in the manual page is
sufficient.  In particular, it describes the most useful functions
for examining a process.
.PP
The compiler does not place
information describing the types of variables in the executable,
but a compile-time flag provides crude support for symbolic debugging.
The
.CW -a
flag to the compiler suppresses code generation
and instead emits source text in the
.CW acid
language to format and display data structure types defined in the program.
The easiest way to use this feature is to put a rule in the
.CW mkfile :
.P1
syms:   main.$O
        $CC -a main.c > syms
.P2
Then from within
.CW acid ,
.P1
acid: include("sourcedirectory/syms")
.P2
to read in the relevant definitions.
(For multi-file source, you need to be a little fancier;
see
.I 8c (1)).
This text includes, for each defined compound
type, a function with that name that may be called with the address of a structure
of that type to display its contents.
For example, if
.CW rect
is a global variable of type
.CW Rectangle ,
one may execute
.P1
Rectangle(*rect)
.P2
to display it.
The
.CW *
(indirection) operator is necessary because
of the way
.CW acid
works: each global symbol in the program is defined as a variable by
.CW acid ,
with value equal to the
.I address
of the symbol.
.PP
Another common technique is to write by hand special
.CW acid
code to define functions to aid debugging, initialize the debugger, and so on.
Conventionally, this is placed in a file called
.CW acid
in the source directory; it has a line
.P1
include("sourcedirectory/syms");
.P2
to load the compiler-produced symbols.  One may edit the compiler output directly but
it is wiser to keep the hand-generated
.CW acid
separate from the machine-generated.
.PP
To make things simple, the default rules in the system
.CW mkfiles
include entries to make
.CW foo.acid
from
.CW foo.c ,
so one may use
.CW mk
to automate the production of
.CW acid
definitions for a given C source file.
.PP
There is much more to say here.  See the
.CW acid
manual page, the reference manual, or the paper
``Acid: A Debugger Built From A Language'',
also by Phil Winterbottom.