.HTML "How to Use the Plan 9 C Compiler
.TL
How to Use the Plan 9 C Compiler*
.AU
Rob Pike
rob@plan9.bell-labs.com
.SH
Introduction
.FS
* This paper has been revised to reflect the move to 21-bit Unicode.
.FE
.PP
The C compiler on Plan 9 is a wholly new program; in fact
it was the first piece of software written for what would
eventually become Plan 9 from Bell Labs.
Programmers familiar with existing C compilers will find
a number of differences in both the language the Plan 9 compiler
accepts and in how the compiler is used.
.PP
The compiler is really a set of compilers, one for each
architecture \(em MIPS, SPARC, Intel 386, Power PC, ARM, etc. \(em
that accept a dialect of ANSI C and efficiently produce
fairly good code for the target machine.
There is a packaging of the compiler that accepts strict ANSI C for
a POSIX environment, but this document focuses on the
native Plan 9 environment, that in which all the system source and
almost all the utilities are written.
.SH
Source
.PP
The language accepted by the compilers is the core 1989 ANSI C language
with some modest extensions,
a greatly simplified preprocessor,
a smaller library that includes system calls and related facilities,
and a completely different structure for include files.
.PP
Official ANSI C accepts the old (K&R) style of declarations for
functions; the Plan 9 compilers
are more demanding.
Without an explicit compiler flag
.CW -B ) (
whose use is discouraged, the compilers insist
on new-style function declarations, that is, prototypes for
function arguments.
The function declarations in the libraries' include files are
all in the new style so the interfaces are checked at compile time.
For C programmers who have not yet switched to function prototypes,
the clumsy syntax may seem repellent, but the payoff in stronger typing
is substantial.
Those who wish to import existing software to Plan 9 are urged
to use the opportunity to update their code.
.PP
The compilers include an integrated preprocessor that accepts the familiar
.CW #include ,
.CW #define
for macros both with and without arguments,
.CW #undef ,
.CW #line ,
.CW #ifdef ,
.CW #ifndef ,
and
.CW #endif .
It
supports neither
.CW #if
nor
.CW ## ,
although it does
honor a few
.CW #pragmas .
The
.CW #if
directive was omitted because it greatly complicates the
preprocessor, is never necessary, and is usually abused.
Conditional compilation in general makes code hard to understand;
the Plan 9 source uses it sparingly.
Also, because the compilers remove dead code, regular
.CW if
statements with constant conditions are more readable equivalents to many
.CW #ifs .
To compile imported code ineluctably fouled by
.CW #if
there is a separate command,
.CW /bin/cpp ,
that implements the complete ANSI C preprocessor specification.
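.PP
As a rough illustration of using a constant
.CW if
in place of
.CW #if ,
consider the sketch below.
The names
.CW Debug
and
.CW checksum
are invented for this example, and the code is written in portable C
rather than the Plan 9 dialect so that it can be tried on any system.

```c
#include <assert.h>
#include <stddef.h>

enum { Debug = 0 };	/* hypothetical switch; set to 1 to enable the extra check */

/* Sum the bytes of buf; the debugging test is ordinary code, not #if. */
unsigned long
checksum(const unsigned char *buf, size_t n)
{
	unsigned long sum;
	size_t i;

	sum = 0;
	for(i = 0; i < n; i++){
		if(Debug)	/* constant condition: a compiler that removes dead code drops this */
			assert(buf[i] != 0);
		sum += buf[i];
	}
	return sum;
}
```

Unlike text hidden by
.CW #if ,
the guarded statement is still parsed and type-checked on every compilation,
so it is less likely to rot while the switch is off.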
.PP
Include files fall into two groups: machine-dependent and machine-independent.
The machine-independent files occupy the directory
.CW /sys/include ;
the others are placed in a directory appropriate to the machine, such as
.CW /mips/include .
The compiler searches for include files
first in the machine-dependent directory and then
in the machine-independent directory.
At the time of writing there are thirty-one machine-independent include
files and two (per machine) machine-dependent ones:
.CW <ureg.h>
and
.CW <u.h> .
The first describes the layout of registers on the system stack,
for use by the debugger.
The second defines some
architecture-dependent types such as
.CW jmp_buf
for
.CW setjmp
and the
.CW va_arg
and
.CW va_list
macros for handling arguments to variadic functions,
as well as a set of
.CW typedef
abbreviations for
.CW unsigned
.CW short
and so on.
.PP
Here is an excerpt from
.CW /386/include/u.h :
.P1
#define nil		((void*)0)
typedef	unsigned short	ushort;
typedef	unsigned char	uchar;
typedef unsigned long	ulong;
typedef unsigned int	uint;
typedef   signed char	schar;
typedef	long long       vlong;

typedef long	jmp_buf[2];
#define	JMPBUFSP	0
#define	JMPBUFPC	1
#define	JMPBUFDPC	0
.P2
Plan 9 programs use
.CW nil
for the name of the zero-valued pointer.
The type
.CW vlong
is the largest integer type available; on most architectures it
is a 64-bit value.
A couple of other types in
.CW <u.h>
are
.CW u32int ,
which is guaranteed to have exactly 32 bits (a possibility on all the supported architectures) and
.CW mpdigit ,
which is used by the multiprecision math package
.CW <mp.h> .
The
.CW #define
constants permit an architecture-independent (but compiler-dependent)
implementation of stack-switching using
.CW setjmp
and
.CW longjmp .
.PP
Every Plan 9 C program begins
.P1
#include <u.h>
.P2
because all the other installed header files use the
.CW typedefs
declared in
.CW <u.h> .
.PP
In strict ANSI C, include files are grouped to collect related functions
in a single file: one for string functions, one for memory functions,
one for I/O, and none for system calls.
Each include file is protected by an
.CW #ifdef
to guarantee its contents are seen by the compiler only once.
Plan 9 takes a different approach.  Other than a few include
files that define external formats such as archives, the files in
.CW /sys/include
correspond to
.I libraries.
If a program is using a library, it includes the corresponding header.
The default C library comprises string functions, memory functions, and
so on, largely as in ANSI C, some formatted I/O routines,
plus all the system calls and related functions.
To use these functions, one must
.CW #include
the file
.CW <libc.h> ,
which in turn must follow
.CW <u.h> ,
to define their prototypes for the compiler.
Here is the complete source to the traditional first C program:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	print("hello world\en");
	exits(0);
}
.P2
The
.CW print
routine and its relatives
.CW fprint
and
.CW sprint
resemble the similarly-named functions in Standard I/O but are not
attached to a specific I/O library.
In Plan 9
.CW main
is not integer-valued; it should call
.CW exits ,
which takes a string argument (or null; here ANSI C promotes the 0 to a
.CW char* ).
All these functions are, of course, documented in the Programmer's Manual.
.PP
To use
.CW printf ,
.CW <stdio.h>
must be included to define the function prototype for
.CW printf :
.P1
#include <u.h>
#include <libc.h>
#include <stdio.h>

void
main(int argc, char *argv[])
{
	printf("%s: hello world; argc = %d\en", argv[0], argc);
	exits(0);
}
.P2
In practice, Standard I/O is not used much in Plan 9.  I/O libraries are
discussed in a later section of this document.
.PP
There are libraries for handling regular expressions, raster graphics,
windows, and so on, and each has an associated include file.
The manual for each library states which include files are needed.
The files are not protected against multiple inclusion and themselves
contain no nested
.CW #includes .
Instead the
programmer is expected to sort out the requirements
and to
.CW #include
the necessary files once at the top of each source file.  In practice this is
trivial: this way of handling include files is so straightforward
that it is rare for a source file to contain more than half a dozen
.CW #includes .
.PP
The compilers do their own register allocation so the
.CW register
keyword is ignored.
For different reasons,
.CW volatile
and
.CW const
are also ignored.
.PP
To make it easier to share code with other systems, Plan 9 has a version
of the compiler,
.CW pcc ,
that provides the standard ANSI C preprocessor, headers, and libraries
with POSIX extensions.
.CW Pcc
is recommended only
when broad external portability is mandated.  It compiles slower,
produces slower code (it takes extra work to simulate POSIX on Plan 9),
eliminates those parts of the Plan 9 interface
not related to POSIX, and illustrates the clumsiness of an environment
designed by committee.
.CW Pcc
is described in more detail in
.I
APE\(emThe ANSI/POSIX Environment,
.R
by Howard Trickey.
.SH
Process
.PP
Each CPU architecture supported by Plan 9 is identified by a single,
arbitrary, alphanumeric character:
.CW k
for SPARC,
.CW q
for 32-bit Power PC,
.CW v
for MIPS,
.CW 0
for little-endian MIPS,
.CW 5
for ARM v5 and later 32-bit architectures,
.CW 6
for AMD64,
.CW 8
for Intel 386, and
.CW 9
for 64-bit Power PC.
The character labels the support tools and files for that architecture.
For instance, for the 386 the compiler is
.CW 8c ,
the assembler is
.CW 8a ,
the link editor/loader is
.CW 8l ,
the object files are suffixed
.CW \&.8 ,
and the default name for an executable file is
.CW 8.out .
Before we can use the compiler we therefore need to know which
machine we are compiling for.
The next section explains how this decision is made; for the moment
assume we are building 386 binaries and make the mental substitution for
.CW 8
appropriate to the machine you are actually using.
.PP
Converting source to an executable binary is a two-step process.
First run the compiler,
.CW 8c ,
on the source, say
.CW file.c ,
to generate an object file
.CW file.8 .
Then run the loader,
.CW 8l ,
to generate an executable
.CW 8.out
that may be run (on a 386 machine):
.P1
8c file.c
8l file.8
8.out
.P2
The loader automatically links with whatever libraries the program
needs, usually including the standard C library as defined by
.CW <libc.h> .
Of course the compiler and loader have lots of options, both familiar and new;
see the manual for details.
The compiler does not generate an executable automatically;
the output of the compiler must be given to the loader.
Since most compilation is done under the control of
.CW mk
(see below), this is rarely an inconvenience.
.PP
The distribution of work between the compiler and loader is unusual.
The compiler integrates preprocessing, parsing, register allocation,
code generation and some assembly.
Combining these tasks in a single program is part of the reason for
the compiler's efficiency.
The loader does instruction selection, branch folding,
instruction scheduling,
and writes the final executable.
There is no separate C preprocessor and no assembler in the usual pipeline.
Instead the intermediate object file
(here a
.CW \&.8
file) is a type of binary assembly language.
The instructions in the intermediate format are not exactly those in
the machine.  For example, on the 68020 the object file may specify
a MOVE instruction but the loader will decide just which variant of
the MOVE instruction \(em MOVE immediate, MOVE quick, MOVE address,
etc. \(em is most efficient.
.PP
The assembler,
.CW 8a ,
is just a translator between the textual and binary
representations of the object file format.
It is not an assembler in the traditional sense.  It has limited
macro capabilities (the same as the integral C preprocessor in the compiler),
clumsy syntax, and minimal error checking.  For instance, the assembler
will accept an instruction (such as memory-to-memory MOVE on the MIPS) that the
machine does not actually support; only when the output of the assembler
is passed to the loader will the error be discovered.
The assembler is intended only for writing things that need access to instructions
invisible from C,
such as the machine-dependent
part of an operating system;
very little code in Plan 9 is in assembly language.
.PP
The compilers take an option
.CW -S
that causes them to print on their standard output the generated code
in a format acceptable as input to the assemblers.
This is of course merely a formatting of the
data in the object file; therefore the assembler is just
an
ASCII-to-binary converter for this format.
Other than the specific instructions, the input to the assemblers
is largely architecture-independent; see
``A Manual for the Plan 9 Assembler'',
by Rob Pike,
for more information.
.PP
The loader is an integral part of the compilation process.
Each library header file contains a
.CW #pragma
that tells the loader the name of the associated archive; it is
not necessary to tell the loader which libraries a program uses.
The C run-time startup is found, by default, in the C library.
The loader starts with an undefined
symbol,
.CW _main ,
that is resolved by pulling in the run-time startup code from the library.
(The loader undefines
.CW _mainp
when profiling is enabled, to force loading of the profiling start-up
instead.)
.PP
Unlike its counterpart on other systems, the Plan 9 loader rearranges
data to optimize access.  This means the order of variables in the
loaded program is unrelated to their order in the source.
Most programs don't care, but some assume that, for example, the
variables declared by
.P1
int a;
int b;
.P2
will appear at adjacent addresses in memory.  On Plan 9, they won't.
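.PP
Code that truly needs two values side by side can say so explicitly
by placing them in one array or structure, whose layout the loader
does not disturb.
The following is a minimal sketch in portable C; the names
.CW counter ,
.CW clearcounters ,
and
.CW counterdistance
are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical example: two counters that must be contiguous.
 * Separate globals give no such guarantee; one array does,
 * because array elements are always laid out in order.
 */
int counter[2];		/* counter[0] plays the role of a, counter[1] of b */

/* Clear both counters by walking a single pointer across them. */
void
clearcounters(void)
{
	int *p;

	assert(&counter[1] == &counter[0] + 1);	/* holds for any array */
	for(p = counter; p < counter + 2; p++)
		*p = 0;
}

/* Return the distance, in ints, between the two counters. */
ptrdiff_t
counterdistance(void)
{
	return &counter[1] - &counter[0];
}
```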
.SH
Heterogeneity
.PP
When the system starts or a user logs in, the environment is configured
so the appropriate binaries are available in
.CW /bin .
The configuration process is controlled by an environment variable,
.CW $cputype ,
with a value such as
.CW mips ,
.CW 386 ,
.CW arm ,
or
.CW sparc .
For each architecture there is a directory in the root,
with the appropriate name,
that holds the binary and library files for that architecture.
Thus
.CW /mips/lib
contains the object code libraries for MIPS programs,
.CW /mips/include
holds MIPS-specific include files, and
.CW /mips/bin
has the MIPS binaries.
These binaries are attached to
.CW /bin
at boot time by binding
.CW /$cputype/bin
to
.CW /bin ,
so
.CW /bin
always contains the correct files.
.PP
The MIPS compiler,
.CW vc ,
by definition
produces object files for the MIPS architecture,
regardless of the architecture of the machine on which the compiler is running.
There is a version of
.CW vc
compiled for each architecture:
.CW /mips/bin/vc ,
.CW /arm/bin/vc ,
.CW /sparc/bin/vc ,
and so on,
each capable of producing MIPS object files regardless of the native
instruction set.
If one is running on a SPARC,
.CW /sparc/bin/vc
will compile programs for the MIPS;
if one is running on machine
.CW $cputype ,
.CW /$cputype/bin/vc
will compile programs for the MIPS.
.PP
Because of the bindings that assemble
.CW /bin ,
the shell always looks for a command, say
.CW date ,
in
.CW /bin
and automatically finds the file
.CW /$cputype/bin/date .
Therefore the MIPS compiler is known as just
.CW vc ;
the shell will invoke
.CW /bin/vc
and that is guaranteed to be the version of the MIPS compiler
appropriate for the machine running the command.
Regardless of the architecture of the compiling machine,
.CW /bin/vc
is
.I always
the MIPS compiler.
.PP
Also, the output of
.CW vc
and
.CW vl
is completely independent of the machine type on which they are executed:
.CW \&.v
files compiled (with
.CW vc )
on a SPARC may be linked (with
.CW vl )
on a 386.
(The resulting
.CW v.out
will run, of course, only on a MIPS.)
Similarly, the MIPS libraries in
.CW /mips/lib
are suitable for loading with
.CW vl
on any machine; there is only one set of MIPS libraries, not one
set for each architecture that supports the MIPS compiler.
.SH
Heterogeneity and \f(CWmk\fP
.PP
Most software on Plan 9 is compiled under the control of
.CW mk ,
a descendant of
.CW make
that is documented in the Programmer's Manual.
A convention used throughout the
.CW mkfiles
makes it easy to compile the source into binary suitable for any architecture.
.PP
The variable
.CW $cputype
is advisory: it reports the architecture of the current environment, and should
not be modified.  A second variable,
.CW $objtype ,
is used to set which architecture is being
.I compiled
for.
The value of
.CW $objtype
can be used by a
.CW mkfile
to configure the compilation environment.
.PP
In each machine's root directory there is a short
.CW mkfile
that defines a set of macros for the compiler, loader, etc.
Here is
.CW /mips/mkfile :
.P1
</sys/src/mkfile.proto

CC=vc
LD=vl
O=v
AS=va
.P2
The line
.P1
</sys/src/mkfile.proto
.P2
causes
.CW mk
to include the file
.CW /sys/src/mkfile.proto ,
which contains general definitions:
.P1
#
# common mkfile parameters shared by all architectures
#

OS=5689qv
CPUS=arm amd64 386 power mips
CFLAGS=-FTVw
LEX=lex
YACC=yacc
MK=/bin/mk
.P2
.CW CC
is obviously the compiler,
.CW AS
the assembler, and
.CW LD
the loader.
.CW O
is the suffix for the object files and
.CW CPUS
and
.CW OS
are used in special rules described below.
.PP
Here is a
.CW mkfile
to build the installed source for
.CW sam :
.P1
</$objtype/mkfile
OBJ=sam.$O address.$O buffer.$O cmd.$O disc.$O error.$O \e
	file.$O io.$O list.$O mesg.$O moveto.$O multi.$O \e
	plan9.$O rasp.$O regexp.$O string.$O sys.$O xec.$O

$O.out:	$OBJ
	$LD $OBJ

install:	$O.out
	cp $O.out /$objtype/bin/sam

installall:
	for(objtype in $CPUS) mk install

%.$O:	%.c
	$CC $CFLAGS $stem.c

$OBJ:	sam.h errors.h mesg.h
address.$O cmd.$O parse.$O xec.$O unix.$O:	parse.h

clean:V:
	rm -f [$OS].out *.[$OS] y.tab.?
.P2
(The actual
.CW mkfile
imports most of its rules from other secondary files, but
this example works and is not misleading.)
The first line causes
.CW mk
to include the contents of
.CW /$objtype/mkfile
in the current
.CW mkfile .
If
.CW $objtype
is
.CW mips ,
this inserts the MIPS macro definitions into the
.CW mkfile .
In this case the rule for
.CW $O.out
uses the MIPS tools to build
.CW v.out .
The
.CW %.$O
rule in the file uses
.CW mk 's
pattern matching facilities to convert the source files to the object
files through the compiler.
(The text of the rules is passed directly to the shell,
.CW rc ,
without further translation.
See the
.CW mk
manual if any of this is unfamiliar.)
Because the default rule builds
.CW $O.out
rather than
.CW sam ,
it is possible to maintain binaries for multiple machines in the
same source directory without conflict.
This is also, of course, why the output files from the various
compilers and loaders
have distinct names.
.PP
The rest of the
.CW mkfile
should be easy to follow; notice how the rules for
.CW clean
and
.CW installall
(that is, install versions for all architectures) use other macros
defined in
.CW /$objtype/mkfile .
In Plan 9,
.CW mkfiles
for commands conventionally contain rules to
.CW install
(compile and install the version for
.CW $objtype ),
.CW installall
(compile and install for all
.CW $objtypes ),
and
.CW clean
(remove all object files, binaries, etc.).
.PP
The
.CW mkfile
is easy to use.  To build a MIPS binary,
.CW v.out :
.P1
% objtype=mips
% mk
.P2
To build and install a MIPS binary:
.P1
% objtype=mips
% mk install
.P2
To build and install all versions:
.P1
% mk installall
.P2
These conventions make cross-compilation as easy to manage
as traditional native compilation.
Plan 9 programs compile and run without change on machines from
large multiprocessors to laptops.  For more information about this process, see
``Plan 9 Mkfiles'',
by Bob Flandrena.
.SH
Portability
.PP
Within Plan 9, it is painless to write portable programs, programs whose
source is independent of the machine on which they execute.
The operating system is fixed and the compiler, headers and libraries
are constant so most of the stumbling blocks to portability are removed.
Attention to a few details can avoid those that remain.
.PP
Plan 9 is a heterogeneous environment, so programs must
.I expect
that external files will be written by programs on machines of different
architectures.
The compilers, for instance, must handle without confusion
object files written by other machines.
The traditional approach to this problem is to pepper the source with
.CW #ifdefs
to turn byte-swapping on and off.
Plan 9 takes a different approach: of the handful of machine-dependent
.CW #ifdefs
in all the source, almost all are deep in the libraries.
Instead programs read and write files in a defined format,
either (for low volume applications) as formatted text, or
(for high volume applications) as binary in a known byte order.
If the external data were written with the most significant
byte first, the following code reads a 4-byte integer correctly
regardless of the architecture of the executing machine (assuming
an unsigned long holds 4 bytes):
.P1
ulong
getlong(void)
{
	ulong l;

	/* widen before shifting so the top byte cannot overflow an int */
	l = (ulong)(getchar()&0xFF)<<24;
	l |= (ulong)(getchar()&0xFF)<<16;
	l |= (ulong)(getchar()&0xFF)<<8;
	l |= (ulong)(getchar()&0xFF)<<0;
	return l;
}
.P2
Note that this code does not `swap' the bytes; instead it just reads
them in the correct order.
Variations of this code will handle any binary format
and also avoid problems
involving how structures are padded, how words are aligned,
and other impediments to portability.
Be aware, though, that extra care is needed to handle floating point data.
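.PP
The same idea extends to writing data and to other byte orders.
The sketch below, in portable C, works on a byte buffer rather than
standard input; the function names
.CW putlongbe ,
.CW getlongbe ,
and
.CW getlongle
are made up for this example and are not library routines.

```c
#include <assert.h>

/* Store the low 32 bits of l in p[0..3], most significant byte first. */
void
putlongbe(unsigned char *p, unsigned long l)
{
	assert(p != 0);
	p[0] = (l>>24) & 0xFF;
	p[1] = (l>>16) & 0xFF;
	p[2] = (l>>8) & 0xFF;
	p[3] = l & 0xFF;
}

/* Read a 4-byte integer stored most significant byte first. */
unsigned long
getlongbe(const unsigned char *p)
{
	assert(p != 0);
	return ((unsigned long)p[0]<<24) | ((unsigned long)p[1]<<16) |
		((unsigned long)p[2]<<8) | (unsigned long)p[3];
}

/* The same idea for data written least significant byte first. */
unsigned long
getlongle(const unsigned char *p)
{
	assert(p != 0);
	return (unsigned long)p[0] | ((unsigned long)p[1]<<8) |
		((unsigned long)p[2]<<16) | ((unsigned long)p[3]<<24);
}
```

None of these functions tests the byte order of the machine it runs on;
the order is a property of the external format alone.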
.PP
Efficiency hounds will argue that this method is unnecessarily slow and clumsy
when the executing machine has the same byte order (and padding and alignment)
as the data.
The CPU cost of I/O processing
is rarely the bottleneck for an application, however,
and the gain in simplicity of porting and maintaining the code greatly outweighs
the minor speed loss from handling data in this general way.
This method is how the Plan 9 compilers, the window system, and even the file
servers transmit data between programs.
.PP
To port programs beyond Plan 9, where the system interface is more variable,
it is probably necessary to use
.CW pcc
and hope that the target machine supports ANSI C and POSIX.
.SH
I/O
.PP
The default C library, defined by the include file
.CW <libc.h> ,
contains no buffered I/O package.
It does have several entry points for printing formatted text:
.CW print
outputs text to the standard output,
.CW fprint
outputs text to a specified integer file descriptor, and
.CW sprint
places text in a character array.
To access library routines for buffered I/O, a program must
explicitly include the header file associated with an appropriate library.
.PP
The recommended I/O library, used by most Plan 9 utilities, is
.CW bio
(buffered I/O), defined by
.CW <bio.h> .
There also exists an implementation of ANSI Standard I/O,
.CW stdio .
.PP
.CW Bio
is small and efficient, particularly for buffer-at-a-time or
line-at-a-time I/O.
Even for character-at-a-time I/O, however, it is significantly faster than
the Standard I/O library,
.CW stdio .
Its interface is compact and regular, although it lacks a few conveniences.
The most noticeable is that one must explicitly define buffers for standard
input and output;
.CW bio
does not predefine them.  Here is a program to copy input to output a byte
at a time using
.CW bio :
.P1
#include <u.h>
#include <libc.h>
#include <bio.h>

Biobuf	bin;
Biobuf	bout;

void
main(void)
{
	int c;

	Binit(&bin, 0, OREAD);
	Binit(&bout, 1, OWRITE);

	while((c=Bgetc(&bin)) != Beof)
		Bputc(&bout, c);
	exits(0);
}
.P2
For peak performance, we could replace
.CW Bgetc
and
.CW Bputc
by their equivalent in-line macros
.CW BGETC
and
.CW BPUTC
but
the performance gain would be modest.
For more information on
.CW bio ,
see the Programmer's Manual.
.PP
Perhaps the most dramatic difference in the I/O interface of Plan 9 from other
systems' is that text is not ASCII.
The format for
text in Plan 9 is a byte-stream encoding of 21-bit characters.
The character set is based on the Unicode Standard and is backward compatible with
ASCII:
characters with value 0 through 127 are the same in both sets.
The 21-bit characters, called
.I runes
in Plan 9, are encoded using a representation called
UTF,
an encoding that is becoming accepted as a standard.
(ISO calls it UTF-8;
throughout Plan 9 it's just called
UTF.)
UTF
defines multibyte sequences to
represent character values from 0 to 1,114,111.
In
UTF,
character values up to 127 decimal, 7F hexadecimal, represent themselves,
so straight
ASCII
files are also valid
UTF.
Also,
UTF
guarantees that bytes with values 0 to 127 (NUL to DEL, inclusive)
will appear only when they represent themselves, so programs that read bytes
looking for plain ASCII characters will continue to work.
Any program that expects a one-to-one correspondence between bytes and
characters will, however, need to be modified.
An example is parsing file names.
File names, like all text, are in
UTF,
so it is incorrect to search for a character in a string by
.CW strchr(filename,
.CW c)
because the character might have a multi-byte encoding.
The correct method is to call
.CW utfrune(filename,
.CW c) ,
defined in
.I rune (2),
which interprets the file name as a sequence of encoded characters
rather than bytes.
In fact, even when you know the character is a single byte
that can represent only itself,
it is safer to use
.CW utfrune
because that assumes nothing about the character set
and its representation.
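.PP
To make the difference from
.CW strchr
concrete, here is a minimal portable sketch of a rune search.
It is not the library's
.CW utfrune ;
the names
.CW runetoutf
and
.CW utffind
are invented, and the code assumes its input is valid, NUL-terminated UTF.

```c
#include <assert.h>
#include <string.h>

/*
 * Encode rune c (0 to 0x10FFFF) as UTF in buf, which needs room
 * for 4 bytes plus a terminating NUL.  Returns the number of bytes.
 */
int
runetoutf(char *buf, unsigned long c)
{
	int n;

	assert(c <= 0x10FFFF);
	if(c < 0x80){
		buf[0] = (char)c;
		n = 1;
	}else if(c < 0x800){
		buf[0] = (char)(0xC0 | (c>>6));
		buf[1] = (char)(0x80 | (c&0x3F));
		n = 2;
	}else if(c < 0x10000){
		buf[0] = (char)(0xE0 | (c>>12));
		buf[1] = (char)(0x80 | ((c>>6)&0x3F));
		buf[2] = (char)(0x80 | (c&0x3F));
		n = 3;
	}else{
		buf[0] = (char)(0xF0 | (c>>18));
		buf[1] = (char)(0x80 | ((c>>12)&0x3F));
		buf[2] = (char)(0x80 | ((c>>6)&0x3F));
		buf[3] = (char)(0x80 | (c&0x3F));
		n = 4;
	}
	buf[n] = '\0';
	return n;
}

/* Return a pointer to the first occurrence of rune c in UTF string s, or NULL. */
const char*
utffind(const char *s, unsigned long c)
{
	char enc[5];

	if(c < 0x80)		/* single byte: a plain byte search is enough */
		return strchr(s, (int)c);
	runetoutf(enc, c);
	/* lead and continuation bytes are disjoint, so a match starts on a rune boundary */
	return strstr(s, enc);
}
```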
.PP
The library defines several symbols relevant to the representation of characters.
Any byte with unsigned value less than
.CW Runesync
will not appear in any multi-byte encoding of a character.
.CW Utfrune
compares the character being searched against
.CW Runesync
to see if it is sufficient to call
.CW strchr
or if the byte stream must be interpreted.
Any character with value less than
.CW Runeself
is represented by a single byte with the same value.
Finally, when errors are encountered converting
to runes from a byte stream, the library returns the rune value
.CW Runeerror
and advances a single byte.  This permits programs to find runes
embedded in binary data.
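.PP
The error rule is easiest to see in a decoder.
What follows is a hedged sketch in portable C, not the library's own code:
the function name
.CW decoderune
is invented, and the values given to
.CW Runeself
and
.CW Runeerror
are assumptions made for the example.
It expects NUL-terminated input and does only minimal validation.

```c
#include <assert.h>

enum {
	Runeself  = 0x80,	/* assumed value: runes below this encode as themselves */
	Runeerror = 0xFFFD	/* assumed value: result of decoding a bad sequence */
};

/*
 * Decode one rune from the UTF bytes at s into *r and return the
 * number of bytes consumed.  On a malformed sequence, set *r to
 * Runeerror and consume a single byte.
 */
int
decoderune(unsigned long *r, const unsigned char *s)
{
	unsigned long c;
	int i, n;

	assert(r != 0 && s != 0);
	c = s[0];
	if(c < Runeself){
		*r = c;
		return 1;
	}
	if((c & 0xE0) == 0xC0){ n = 2; c &= 0x1F; }
	else if((c & 0xF0) == 0xE0){ n = 3; c &= 0x0F; }
	else if((c & 0xF8) == 0xF0){ n = 4; c &= 0x07; }
	else
		goto bad;	/* a stray continuation byte or invalid lead byte */
	for(i = 1; i < n; i++){
		if((s[i] & 0xC0) != 0x80)
			goto bad;	/* also stops safely at a terminating NUL */
		c = (c<<6) | (s[i] & 0x3F);
	}
	/* reject overlong forms and values beyond 21-bit Unicode */
	if((n == 2 && c < 0x80) || (n == 3 && c < 0x800) ||
	   (n == 4 && c < 0x10000) || c > 0x10FFFF)
		goto bad;
	*r = c;
	return n;
bad:
	*r = Runeerror;
	return 1;
}
```

Because a bad sequence costs only one byte, a caller looping over
arbitrary data resynchronizes at the next well-formed rune.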
.PP
.CW Bio
includes routines
.CW Bgetrune
and
.CW Bputrune
to transform the external byte stream
UTF
format to and from
internal 21-bit runes.
Also, the
.CW %s
format to
.CW print
accepts
UTF;
.CW %c
prints a character after narrowing it to 8 bits.
The
.CW %S
format prints a null-terminated sequence of runes;
.CW %C
prints a character after narrowing it to 21 bits.
For more information, see the Programmer's Manual, in particular
.I utf (6)
and
.I rune (2),
and the paper,
``Hello world, or
Καλημέρα κόσμε, or\
\f(Jpこんにちは 世界\f1'',
by Rob Pike and
Ken Thompson;
there is not room for the full story here.
.PP
These issues affect the compiler in several ways.
First, the C source is in
UTF.
ANSI says C variables are formed from
ASCII
alphanumerics, but comments and literal strings may contain any characters
encoded in the native encoding, here
UTF.
The declaration
.P1
char *cp = "abcÿ";
.P2
initializes the variable
.CW cp
to point to an array of bytes holding the
UTF
representation of the characters
.CW abcÿ .
The type
.CW Rune
is defined in
.CW <u.h>
to be a 32-bit unsigned integer type,
which is also the `wide character' type in the compiler.
Therefore the declaration
.P1
Rune *rp = L"abcÿ";
.P2
initializes the variable
.CW rp
to point to an array of unsigned long integers holding the 21-bit
values of the characters
.CW abcÿ .
Note that in both these declarations the characters in the source
that represent
.CW "abcÿ"
are the same; what changes is how those characters are represented
in memory in the program.
The following two lines:
.P1
print("%s\en", "abcÿ");
print("%S\en", L"abcÿ");
.P2
produce the same
UTF
string on their output, the first by copying the bytes, the second
by converting from runes to bytes.
.PP
In C, character constants are integers but narrowed through the
.CW char
type.
The Unicode character
.CW ÿ
has value 255, so if the
.CW char
type is signed,
the constant
.CW 'ÿ'
has value \-1 (which is equal to EOF).
On the other hand,
.CW L'ÿ'
narrows through the wide character type,
.CW Rune ,
and therefore has value 255.
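.PP
The arithmetic behind the two values can be spelled out.
This small sketch in portable C uses invented names,
.CW narrowsigned
and
.CW narrowunsigned ,
and models an 8-bit two's complement
.CW char
explicitly rather than relying on any particular compiler.

```c
#include <assert.h>

/* Narrow rune v to 8 bits as a signed two's complement char would hold it. */
int
narrowsigned(unsigned long v)
{
	assert(v <= 0x10FFFF);	/* v is a rune value */
	v &= 0xFF;
	return v >= 0x80 ? (int)v - 0x100 : (int)v;
}

/* Narrow rune v to 8 bits as an unsigned char would hold it. */
int
narrowunsigned(unsigned long v)
{
	assert(v <= 0x10FFFF);
	return (int)(v & 0xFF);
}
```

For the value 255 the first function yields \-1 and the second 255;
for ASCII values the two agree.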
.PP
Finally, although it's not ANSI C, the Plan 9 C compilers
assume any character with value above
.CW Runeself
is an alphanumeric,
so α is a legal, if non-portable, variable name.
.SH
Arguments
.PP
Some macros are defined
in
.CW <libc.h>
for parsing the arguments to
.CW main() .
They are described in
.I ARG (2)
but are fairly self-explanatory.
There are four macros:
.CW ARGBEGIN
and
.CW ARGEND
are used to bracket a hidden
.CW switch
statement within which
.CW ARGC
returns the current option character (rune) being processed and
.CW ARGF
returns the argument to the option, as in the loader option
.CW -o
.CW file .
Here, for example, is the code at the beginning of
.CW main()
in
.CW ramfs.c
(see
.I ramfs (4))
that cracks its arguments:
.P1
void
main(int argc, char *argv[])
{
	char *defmnt;
	int p[2];
	int mfd[2];
	int stdio = 0;

	defmnt = "/tmp";
	ARGBEGIN{
	case 'i':
		defmnt = 0;
		stdio = 1;
		mfd[0] = 0;
		mfd[1] = 1;
		break;
	case 's':
		defmnt = 0;
		break;
	case 'm':
		defmnt = ARGF();
		break;
	default:
		usage();
	}ARGEND
.P2
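.PP
For readers without the macros at hand, the loop they hide can be
approximated by hand.
The sketch below is portable C and only an approximation: the type
.CW Opts
and the function
.CW parseopts
are invented, options are treated as bytes rather than runes,
and errors are recorded in a flag instead of calling
.CW usage .

```c
#include <assert.h>
#include <stddef.h>

typedef struct Opts Opts;
struct Opts
{
	const char *defmnt;	/* mount point, or NULL if none */
	int stdio;
	int bad;		/* set on an unknown option or a missing argument */
};

/*
 * Walk the leading arguments that begin with '-', treat each
 * following character as an option, and let -m take its argument
 * from the rest of the word or from the next word.
 * Returns the index of the first non-option argument.
 */
int
parseopts(int argc, char *argv[], Opts *o)
{
	int i;
	char *p;

	assert(o != NULL);
	o->defmnt = "/tmp";
	o->stdio = 0;
	o->bad = 0;
	for(i = 1; i < argc && argv[i][0] == '-' && argv[i][1] != '\0'; i++){
		if(argv[i][1] == '-' && argv[i][2] == '\0')
			return i + 1;	/* "--" ends the options */
		for(p = argv[i] + 1; *p != '\0'; p++){
			switch(*p){
			case 'i':
				o->defmnt = NULL;
				o->stdio = 1;
				break;
			case 's':
				o->defmnt = NULL;
				break;
			case 'm':
				if(p[1] != '\0')
					o->defmnt = p + 1;
				else if(i + 1 < argc)
					o->defmnt = argv[++i];
				else
					o->bad = 1;
				goto nextword;	/* the argument used up the rest of the word */
			default:
				o->bad = 1;
				break;
			}
		}
	nextword:;
	}
	return i;
}
```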
1072.SH
1073Extensions
1074.PP
1075The compiler has several extensions to 1989 ANSI C, all of which are used
1076extensively in the system source.
1077Some of these have been adopted in later ANSI C standards.
1078First,
1079.I structure
1080.I displays
1081permit
1082.CW struct
1083expressions to be formed dynamically.
1084Given these declarations:
.P1
typedef struct Point Point;
typedef struct Rectangle Rectangle;

struct Point
{
	int x, y;
};

struct Rectangle
{
	Point min, max;
};

Point	p, q, add(Point, Point);
Rectangle r;
int	x, y;
.P2
this assignment may appear anywhere an assignment is legal:
.P1
r = (Rectangle){add(p, q), (Point){x, y+3}};
.P2
The syntax is the same as for initializing a structure but with
a leading cast.
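.PP
ISO C99 later standardized this construct under the name
.I compound
.I literal ,
so the example can be tried with any C99 compiler.
The following sketch completes the declarations above into a small unit; the body of
.CW add
and the helpers
.CW mkrect
and
.CW example
are assumptions made for illustration.

```c
#include <assert.h>

typedef struct Point Point;
typedef struct Rectangle Rectangle;

struct Point
{
	int x, y;
};

struct Rectangle
{
	Point min, max;
};

Point
add(Point a, Point b)
{
	/* a display used as a return value */
	return (Point){a.x + b.x, a.y + b.y};
}

/* Hypothetical helper wrapping the assignment shown in the text. */
Rectangle
mkrect(Point p, Point q, int x, int y)
{
	Rectangle r;

	r = (Rectangle){add(p, q), (Point){x, y+3}};
	return r;
}

/* Hypothetical self-check with sample values. */
void
example(void)
{
	Rectangle r;

	/* displays may also appear directly as arguments */
	r = mkrect((Point){1, 2}, (Point){3, 4}, 10, 20);
	assert(r.min.x == 4 && r.min.y == 6);
	assert(r.max.x == 10 && r.max.y == 23);
}
```

.LP
No temporary variables are needed: each display is an ordinary expression of structure type.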
.PP
If an
.I anonymous
.I structure
or
.I union
is declared within another structure or union, the members of the internal
structure or union are addressable without prefix in the outer structure.
This feature eliminates the clumsy naming of nested structures and,
particularly, unions.
For example, after these declarations,
.P1
struct Lock
{
	int	locked;
};

struct Node
{
	int	type;
	union{
		double  dval;
		float   fval;
		long    lval;
	};		/* anonymous union */
	struct Lock;	/* anonymous structure */
} *node;

void	lock(struct Lock*);
.P2
one may refer to
.CW node->type ,
.CW node->dval ,
.CW node->fval ,
.CW node->lval ,
and
.CW node->locked .
Moreover, the address of a
.CW struct
.CW Node
may be used without a cast anywhere that the address of a
.CW struct
.CW Lock
is used, such as in argument lists.
The compiler automatically promotes the type and adjusts the address.
Thus one may invoke
.CW lock(node) .
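.PP
ISO C11 adopted anonymous structures and unions only in their untagged form, so the
.CW struct
.CW Lock;
member and the automatic promotion remain Plan 9 extensions.
The sketch below shows, in standard C11, roughly what the promotion amounts to.
The member name
.CW lk
and the helper
.CW locknode
are hypothetical, as is the body given to
.CW lock .

```c
#include <assert.h>
#include <stddef.h>

struct Lock
{
	int	locked;
};

struct Node
{
	int	type;
	union{
		double	dval;
		long	lval;
	};			/* anonymous union: valid ISO C11 */
	struct Lock lk;		/* named here; Plan 9 C accepts "struct Lock;" */
};

void
lock(struct Lock *l)
{
	assert(l != NULL);
	l->locked = 1;
}

/* Hypothetical helper: the explicit equivalent of Plan 9's lock(node). */
void
locknode(struct Node *node)
{
	/* step from the Node to its embedded Lock by the member offset */
	lock((struct Lock*)((char*)node + offsetof(struct Node, lk)));
}
```

.LP
The address adjustment is simply the offset of the embedded structure within its parent,
which is why no cast is needed in the Plan 9 dialect.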
.PP
Anonymous structures and unions may be accessed by type name
if (and only if) they are declared using a
.CW typedef
name.
For example, using the above declaration for
.CW Point ,
one may declare
.P1
struct
{
	int	type;
	Point;
} p;
.P2
and refer to
.CW p.Point .
.PP
In the initialization of arrays, a number in square brackets before an
element sets the index for the initialization.  For example, to initialize
some elements in
a table of function pointers indexed by
ASCII
character,
.P1
void	percent(void), slash(void);

void	(*func[128])(void) =
{
	['%']	percent,
	['/']	slash,
};
.P2
.LP
A similar syntax allows one to initialize structure elements:
.P1
Point p =
{
	.y 100,
	.x 200
};
.P2
These initialization syntaxes were later added to ANSI C, with the addition of an
equals sign between the index or tag and the value.
The Plan 9 compiler accepts either form.
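.PP
For comparison, here are the same two initializations in the standardized form with the equals sign,
which any C99 compiler should accept.
The variable name
.CW pt ,
the bodies of
.CW percent
and
.CW slash ,
and the helper
.CW dispatch
are assumptions added so the sketch is complete.

```c
#include <assert.h>

typedef struct Point Point;
struct Point
{
	int x, y;
};

int	last;		/* records which handler ran most recently */

void	percent(void) { last = '%'; }
void	slash(void) { last = '/'; }

/* Only two slots are named; the remaining 126 are zeroed. */
void	(*func[128])(void) =
{
	['%'] = percent,
	['/'] = slash,
};

/* Members may be listed in any order. */
Point pt =
{
	.y = 100,
	.x = 200
};

/* Hypothetical dispatcher: run the handler for c, if one was set. */
int
dispatch(int c)
{
	assert(c >= 0 && c < 128);
	if(func[c] == 0)
		return 0;
	func[c]();
	return 1;
}
```

.LP
Elements that are not mentioned are initialized to zero, so a sparse table needs only its interesting entries.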
.PP
Finally, the declaration
.P1
extern register reg;
.P2
.I this "" (
appearance of the register keyword is not ignored)
allocates a global register to hold the variable
.CW reg .
External registers must be used carefully: they need to be declared in
.I all
source files and libraries in the program to guarantee the register
is not allocated temporarily for other purposes.
Especially on machines with few registers, such as the i386,
it is easy to link accidentally with code that has already usurped
the global registers and there is no diagnostic when this happens.
Used wisely, though, external registers are powerful.
The Plan 9 operating system uses them to access per-process and
per-machine data structures on a multiprocessor.  The storage class they provide
is hard to create in other ways.
.SH
The compile-time environment
.PP
The code generated by the compilers is `optimized' by default:
variables are placed in registers and peephole optimizations are
performed.
The compiler flag
.CW -N
disables these optimizations.
Registerization is done locally rather than throughout a function:
whether a variable occupies a register or
the memory location identified in the symbol
table depends on the activity of the variable and may change
throughout the life of the variable.
The
.CW -N
flag is rarely needed;
its main use is to simplify debugging.
There is no information in the symbol table to identify the
registerization of a variable, so
.CW -N
guarantees the variable is always where the symbol table says it is.
.PP
Another flag,
.CW -w ,
turns
.I on
warnings about portability and problems detected in flow analysis.
Most code in Plan 9 is compiled with warnings enabled;
these warnings plus the type checking offered by function prototypes
provide most of the support of the Unix tool
.CW lint
more accurately and with less chatter.
Two of the warnings,
`used and not set' and `set and not used', are almost always accurate but
may be triggered spuriously by code with invisible control flow,
such as in routines that call
.CW longjmp .
The compiler statements
.P1
SET(v1);
USED(v2);
.P2
decorate the flow graph to silence the compiler.
Either statement accepts a comma-separated list of variables.
Use them carefully: they may silence real errors.
For the common case of unused parameters to a function,
leaving the name off the declaration silences the warnings.
That is, listing the type of a parameter but giving it no
associated variable name does the trick.
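.PP
Code that must also build with other compilers can approximate these statements with macros.
The sketch below assumes nothing about the Plan 9 implementation: the names
.CW XSET
and
.CW XUSED
and the functions
.CW twice
and
.CW pick
are hypothetical, and each macro takes a single variable rather than a list.

```c
#include <assert.h>

/*
 * Hypothetical stand-ins for other compilers; on Plan 9 the compiler
 * itself understands SET and USED.  Each takes one variable here.
 */
#define XSET(x)		((x) = 0)
#define XUSED(x)	((void)(x))

/* A callback with a fixed signature; this one ignores its context. */
int
twice(void *ctxt, int n)
{
	XUSED(ctxt);	/* quiets the unused-parameter complaint */
	return 2*n;
}

/* Every case assigns r, but flow analysis may not be able to tell. */
int
pick(int n)
{
	int r;

	assert(n >= 0);
	XSET(r);	/* quiets the used-and-not-set complaint */
	switch(n % 2){
	case 0:
		r = 10;
		break;
	case 1:
		r = 20;
		break;
	}
	return r;
}
```

.LP
Unlike the built-in statement, this stand-in for
.CW SET
really does store a zero, which is one more reason to use it sparingly.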
.SH
Debugging
.PP
There are two debuggers available on Plan 9.
The first, and older, is
.CW db ,
a revision of Unix
.CW adb .
The other,
.CW acid ,
is a source-level debugger whose commands are statements in
a true programming language.
.CW Acid
is the preferred debugger, but since it
borrows some elements of
.CW db ,
notably the formats for displaying values, it is worth knowing a little bit about
.CW db .
.PP
Both debuggers support multiple architectures in a single program; that is,
the programs are
.CW db
and
.CW acid ,
not for example
.CW vdb
and
.CW vacid .
They also support cross-architecture debugging comfortably:
one may debug a 386 binary on a MIPS.
.PP
Imagine a program has crashed mysteriously:
.P1
% X11/X
Fatal server bug!
failed to create default stipple
X 106: suicide: sys: trap: fault read addr=0x0 pc=0x00105fb8
%
.P2
When a process dies on Plan 9 it hangs in the `broken' state
for debugging.
Attach a debugger to the process by naming its process id:
.P1
% acid 106
/proc/106/text:mips plan 9 executable

/sys/lib/acid/port
/sys/lib/acid/mips
acid:
.P2
The
.CW acid
function
.CW stk()
reports the stack traceback:
.P1
acid: stk()
At pc:0x105fb8:abort+0x24 /sys/src/ape/lib/ap/stdio/abort.c:6
abort() /sys/src/ape/lib/ap/stdio/abort.c:4
	called from FatalError+#4e
		/sys/src/X/mit/server/dix/misc.c:421
FatalError(s9=#e02, s8=#4901d200, s7=#2, s6=#72701, s5=#1,
    s4=#7270d, s3=#6, s2=#12, s1=#ff37f1c, s0=#6, f=#7270f)
    /sys/src/X/mit/server/dix/misc.c:416
	called from gnotscreeninit+#4ce
		/sys/src/X/mit/server/ddx/gnot/gnot.c:792
gnotscreeninit(snum=#0, sc=#80db0)
    /sys/src/X/mit/server/ddx/gnot/gnot.c:766
	called from AddScreen+#16e
		/n/bootes/sys/src/X/mit/server/dix/main.c:610
AddScreen(pfnInit=0x0000129c,argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/dix/main.c:530
	called from InitOutput+0x80
		/sys/src/X/mit/server/ddx/brazil/brddx.c:522
InitOutput(argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/ddx/brazil/brddx.c:511
	called from main+0x294
		/sys/src/X/mit/server/dix/main.c:225
main(argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/dix/main.c:136
	called from _main+0x24
		/sys/src/ape/lib/ap/mips/main9.s:8
.P2
The function
.CW lstk()
is similar but
also reports the values of local variables.
Note that the traceback includes full file names; this is a boon to debugging,
although it makes the output much noisier.
.PP
To use
.CW acid
well you will need to learn its input language; see the
``Acid Manual'',
by Phil Winterbottom,
for details.  For simple debugging, however, the information in the manual page is
sufficient.  In particular, it describes the most useful functions
for examining a process.
.PP
The compiler does not place
information describing the types of variables in the executable,
but a compile-time flag provides crude support for symbolic debugging.
The
.CW -a
flag to the compiler suppresses code generation
and instead emits source text in the
.CW acid
language to format and display data structure types defined in the program.
The easiest way to use this feature is to put a rule in the
.CW mkfile :
.P1
syms:   main.$O
        $CC -a main.c > syms
.P2
Then from within
.CW acid ,
.P1
acid: include("sourcedirectory/syms")
.P2
to read in the relevant definitions.
(For multi-file source, you need to be a little fancier;
see
.I 8c (1)).
This text includes, for each defined compound
type, a function with that name that may be called with the address of a structure
of that type to display its contents.
For example, if
.CW rect
is a global variable of type
.CW Rectangle ,
one may execute
.P1
Rectangle(*rect)
.P2
to display it.
The
.CW *
(indirection) operator is necessary because
of the way
.CW acid
works: each global symbol in the program is defined as a variable by
.CW acid ,
with value equal to the
.I address
of the symbol.
.PP
Another common technique is to write by hand special
.CW acid
code to define functions to aid debugging, initialize the debugger, and so on.
Conventionally, this is placed in a file called
.CW acid
in the source directory; it has a line
.P1
include("sourcedirectory/syms");
.P2
to load the compiler-produced symbols.  One may edit the compiler output directly but
it is wiser to keep the hand-generated
.CW acid
separate from the machine-generated.
.PP
To make things simple, the default rules in the system
.CW mkfiles
include entries to make
.CW foo.acid
from
.CW foo.c ,
so one may use
.CW mk
to automate the production of
.CW acid
definitions for a given C source file.
.PP
There is much more to say here.  See the
.CW acid
manual page, the reference manual, or the paper
``Acid: A Debugger Built From A Language'',
also by Phil Winterbottom.