.HTML "How to Use the Plan 9 C Compiler
.TL
How to Use the Plan 9 C Compiler
.AU
Rob Pike
rob@plan9.bell-labs.com
.SH
Introduction
.PP
The C compiler on Plan 9 is a wholly new program; in fact
it was the first piece of software written for what would
eventually become Plan 9 from Bell Labs.
Programmers familiar with existing C compilers will find
a number of differences in both the language the Plan 9 compiler
accepts and in how the compiler is used.
.PP
The compiler is really a set of compilers, one for each
architecture \(em MIPS, SPARC, Motorola 68020, Intel 386, etc. \(em
that accept a dialect of ANSI C and efficiently produce
fairly good code for the target machine.
There is a packaging of the compiler that accepts strict ANSI C for
a POSIX environment, but this document focuses on the
native Plan 9 environment, that in which all the system source and
almost all the utilities are written.
.SH
Source
.PP
The language accepted by the compilers is the core ANSI C language
with some modest extensions,
a greatly simplified preprocessor,
a smaller library that includes system calls and related facilities,
and a completely different structure for include files.
.PP
Official ANSI C accepts the old (K&R) style of declarations for
functions; the Plan 9 compilers
are more demanding.
Without an explicit run-time flag
.CW -B ) (
whose use is discouraged, the compilers insist
on new-style function declarations, that is, prototypes for
function arguments.
The function declarations in the libraries' include files are
all in the new style so the interfaces are checked at compile time.
For C programmers who have not yet switched to function prototypes
the clumsy syntax may seem repellent but the payoff in stronger typing
is substantial.
Those who wish to import existing software to Plan 9 are urged
to use the opportunity to update their code.
.PP
The compilers include an integrated preprocessor that accepts the familiar
.CW #include ,
.CW #define
for macros both with and without arguments,
.CW #undef ,
.CW #line ,
.CW #ifdef ,
.CW #ifndef ,
and
.CW #endif .
It
supports neither
.CW #if
nor
.CW ## ,
although it does
honor a few
.CW #pragmas .
The
.CW #if
directive was omitted because it greatly complicates the
preprocessor, is never necessary, and is usually abused.
Conditional compilation in general makes code hard to understand;
the Plan 9 source uses it sparingly.
Also, because the compilers remove dead code, regular
.CW if
statements with constant conditions are more readable equivalents to many
.CW #ifs .
To compile imported code ineluctably fouled by
.CW #if
there is a separate command,
.CW /bin/cpp ,
that implements the complete ANSI C preprocessor specification.
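.PP
As a minimal sketch of the constant-condition idiom described above (the names
.CW Debug
and
.CW checksum
are hypothetical, not taken from the Plan 9 source), a debugging switch can be an ordinary constant tested by an ordinary
.CW if .
A compiler that removes dead code should drop the guarded statement when the constant is zero, yet the statement is still parsed and type-checked:

```c
#include <stdio.h>

enum { Debug = 0 };	/* hypothetical switch; set to 1 to enable tracing */

/* Sum the n bytes at p; trace the result when Debug is set. */
int
checksum(unsigned char *p, int n)
{
	int i, sum;

	sum = 0;
	for(i = 0; i < n; i++)
		sum += p[i];
	if(Debug)	/* constant condition: dead code when Debug is 0 */
		fprintf(stderr, "checksum: %d bytes, sum %d\n", n, sum);
	return sum;
}
```

Unlike text hidden behind a preprocessor conditional, the tracing call cannot silently rot: it must compile every time.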
.PP
Include files fall into two groups: machine-dependent and machine-independent.
The machine-independent files occupy the directory
.CW /sys/include ;
the others are placed in a directory appropriate to the machine, such as
.CW /mips/include .
The compiler searches for include files
first in the machine-dependent directory and then
in the machine-independent directory.
At the time of writing there are thirty-one machine-independent include
files and two (per machine) machine-dependent ones:
.CW <ureg.h>
and
.CW <u.h> .
The first describes the layout of registers on the system stack,
for use by the debugger.
The second defines some
architecture-dependent types such as
.CW jmp_buf
for
.CW setjmp
and the
.CW va_arg
and
.CW va_list
macros for handling arguments to variadic functions,
as well as a set of
.CW typedef
abbreviations for
.CW unsigned
.CW short
and so on.
.PP
Here is an excerpt from
.CW /68020/include/u.h :
.P1
#define nil		((void*)0)
typedef	unsigned short	ushort;
typedef	unsigned char	uchar;
typedef unsigned long	ulong;
typedef unsigned int	uint;
typedef   signed char	schar;
typedef	long long       vlong;

typedef long	jmp_buf[2];
#define	JMPBUFSP	0
#define	JMPBUFPC	1
#define	JMPBUFDPC	0
.P2
Plan 9 programs use
.CW nil
for the name of the zero-valued pointer.
The type
.CW vlong
is the largest integer type available; on most architectures it
is a 64-bit value.
A couple of other types in
.CW <u.h>
are
.CW u32int ,
which is guaranteed to have exactly 32 bits (a possibility on all the supported architectures) and
.CW mpdigit ,
which is used by the multiprecision math package
.CW <mp.h> .
The
.CW #define
constants permit an architecture-independent (but compiler-dependent)
implementation of stack-switching using
.CW setjmp
and
.CW longjmp .
.PP
Every Plan 9 C program begins
.P1
#include <u.h>
.P2
because all the other installed header files use the
.CW typedefs
declared in
.CW <u.h> .
.PP
In strict ANSI C, include files are grouped to collect related functions
in a single file: one for string functions, one for memory functions,
one for I/O, and none for system calls.
Each include file is protected by an
.CW #ifdef
to guarantee its contents are seen by the compiler only once.
Plan 9 takes a different approach.  Other than a few include
files that define external formats such as archives, the files in
.CW /sys/include
correspond to
.I libraries.
If a program is using a library, it includes the corresponding header.
The default C library comprises string functions, memory functions, and
so on, largely as in ANSI C, some formatted I/O routines,
plus all the system calls and related functions.
To use these functions, one must
.CW #include
the file
.CW <libc.h> ,
which in turn must follow
.CW <u.h> ,
to define their prototypes for the compiler.
Here is the complete source to the traditional first C program:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	print("hello world\en");
	exits(0);
}
.P2
The
.CW print
routine and its relatives
.CW fprint
and
.CW sprint
resemble the similarly-named functions in Standard I/O but are not
attached to a specific I/O library.
In Plan 9
.CW main
is not integer-valued; it should call
.CW exits ,
which takes a string argument (or null; here ANSI C promotes the 0 to a
.CW char* ).
All these functions are, of course, documented in the Programmer's Manual.
.PP
To use
.CW printf ,
.CW <stdio.h>
must be included to define the function prototype for
.CW printf :
.P1
#include <u.h>
#include <libc.h>
#include <stdio.h>

void
main(int argc, char *argv[])
{
	printf("%s: hello world; argc = %d\en", argv[0], argc);
	exits(0);
}
.P2
In practice, Standard I/O is not used much in Plan 9.  I/O libraries are
discussed in a later section of this document.
.PP
There are libraries for handling regular expressions, raster graphics,
windows, and so on, and each has an associated include file.
The manual for each library states which include files are needed.
The files are not protected against multiple inclusion and themselves
contain no nested
.CW #includes .
Instead the
programmer is expected to sort out the requirements
and to
.CW #include
the necessary files once at the top of each source file.  In practice this is
trivial: this way of handling include files is so straightforward
that it is rare for a source file to contain more than half a dozen
.CW #includes .
.PP
The compilers do their own register allocation so the
.CW register
keyword is ignored.
For different reasons,
.CW volatile
and
.CW const
are also ignored.
.PP
To make it easier to share code with other systems, Plan 9 has a version
of the compiler,
.CW pcc ,
that provides the standard ANSI C preprocessor, headers, and libraries
with POSIX extensions.
.CW Pcc
is recommended only
when broad external portability is mandated.  It compiles slower,
produces slower code (it takes extra work to simulate POSIX on Plan 9),
eliminates those parts of the Plan 9 interface
not related to POSIX, and illustrates the clumsiness of an environment
designed by committee.
.CW Pcc
is described in more detail in
.I
APE\(emThe ANSI/POSIX Environment,
.R
by Howard Trickey.
.SH
Process
.PP
Each CPU architecture supported by Plan 9 is identified by a single,
arbitrary, alphanumeric character:
.CW k
for SPARC,
.CW q
for Motorola Power PC 630 and 640,
.CW v
for MIPS,
.CW 0
for little-endian MIPS,
.CW 1
for Motorola 68000,
.CW 2
for Motorola 68020 and 68040,
.CW 5
for Acorn ARM 7500,
.CW 6
for AMD 64,
.CW 7
for DEC Alpha,
.CW 8
for Intel 386, and
.CW 9
for AMD 29000.
The character labels the support tools and files for that architecture.
For instance, for the 68020 the compiler is
.CW 2c ,
the assembler is
.CW 2a ,
the link editor/loader is
.CW 2l ,
the object files are suffixed
.CW \&.2 ,
and the default name for an executable file is
.CW 2.out .
Before we can use the compiler we therefore need to know which
machine we are compiling for.
The next section explains how this decision is made; for the moment
assume we are building 68020 binaries and make the mental substitution for
.CW 2
appropriate to the machine you are actually using.
.PP
To convert source to an executable binary is a two-step process.
First run the compiler,
.CW 2c ,
on the source, say
.CW file.c ,
to generate an object file
.CW file.2 .
Then run the loader,
.CW 2l ,
to generate an executable
.CW 2.out
that may be run (on a 680X0 machine):
.P1
2c file.c
2l file.2
2.out
.P2
The loader automatically links with whatever libraries the program
needs, usually including the standard C library as defined by
.CW <libc.h> .
Of course the compiler and loader have lots of options, both familiar and new;
see the manual for details.
The compiler does not generate an executable automatically;
the output of the compiler must be given to the loader.
Since most compilation is done under the control of
.CW mk
(see below), this is rarely an inconvenience.
.PP
The distribution of work between the compiler and loader is unusual.
The compiler integrates preprocessing, parsing, register allocation,
code generation and some assembly.
Combining these tasks in a single program is part of the reason for
the compiler's efficiency.
The loader does instruction selection, branch folding,
instruction scheduling,
and writes the final executable.
There is no separate C preprocessor and no assembler in the usual pipeline.
Instead the intermediate object file
(here a
.CW \&.2
file) is a type of binary assembly language.
The instructions in the intermediate format are not exactly those in
the machine.  For example, on the 68020 the object file may specify
a MOVE instruction but the loader will decide just which variant of
the MOVE instruction \(em MOVE immediate, MOVE quick, MOVE address,
etc. \(em is most efficient.
.PP
The assembler,
.CW 2a ,
is just a translator between the textual and binary
representations of the object file format.
It is not an assembler in the traditional sense.  It has limited
macro capabilities (the same as the integral C preprocessor in the compiler),
clumsy syntax, and minimal error checking.  For instance, the assembler
will accept an instruction (such as memory-to-memory MOVE on the MIPS) that the
machine does not actually support; only when the output of the assembler
is passed to the loader will the error be discovered.
The assembler is intended only for writing things that need access to instructions
invisible from C,
such as the machine-dependent
part of an operating system;
very little code in Plan 9 is in assembly language.
.PP
The compilers take an option
.CW -S
that causes them to print on their standard output the generated code
in a format acceptable as input to the assemblers.
This is of course merely a formatting of the
data in the object file; therefore the assembler is just
an
ASCII-to-binary converter for this format.
Other than the specific instructions, the input to the assemblers
is largely architecture-independent; see
``A Manual for the Plan 9 Assembler'',
by Rob Pike,
for more information.
.PP
The loader is an integral part of the compilation process.
Each library header file contains a
.CW #pragma
that tells the loader the name of the associated archive; it is
not necessary to tell the loader which libraries a program uses.
The C run-time startup is found, by default, in the C library.
The loader starts with an undefined
symbol,
.CW _main ,
that is resolved by pulling in the run-time startup code from the library.
(The loader undefines
.CW _mainp
when profiling is enabled, to force loading of the profiling start-up
instead.)
.PP
Unlike its counterpart on other systems, the Plan 9 loader rearranges
data to optimize access.  This means the order of variables in the
loaded program is unrelated to their order in the source.
Most programs don't care, but some assume that, for example, the
variables declared by
.P1
int a;
int b;
.P2
will appear at adjacent addresses in memory.  On Plan 9, they won't.
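.PP
A program that truly needs neighboring storage can say so in the language rather than rely on declaration order.
The following is only an illustrative sketch (the names
.CW pair ,
.CW setpair ,
and
.CW sumpair
are hypothetical): the elements of one array are contiguous by definition, so no loader may separate them.

```c
enum { A, B, Npair };

static int pair[Npair];	/* pair[A] and pair[B] are adjacent by definition */

void
setpair(int a, int b)
{
	pair[A] = a;
	pair[B] = b;
}

/* Walk the storage with a pointer; legal because it is a single array. */
int
sumpair(void)
{
	int *p, sum;

	sum = 0;
	for(p = pair; p < pair+Npair; p++)
		sum += *p;
	return sum;
}
```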
.SH
Heterogeneity
.PP
When the system starts or a user logs in, the environment is configured
so the appropriate binaries are available in
.CW /bin .
The configuration process is controlled by an environment variable,
.CW $cputype ,
with a value such as
.CW mips ,
.CW 68020 ,
.CW 386 ,
or
.CW sparc .
For each architecture there is a directory in the root,
with the appropriate name,
that holds the binary and library files for that architecture.
Thus
.CW /mips/lib
contains the object code libraries for MIPS programs,
.CW /mips/include
holds MIPS-specific include files, and
.CW /mips/bin
has the MIPS binaries.
These binaries are attached to
.CW /bin
at boot time by binding
.CW /$cputype/bin
to
.CW /bin ,
so
.CW /bin
always contains the correct files.
.PP
The MIPS compiler,
.CW vc ,
by definition
produces object files for the MIPS architecture,
regardless of the architecture of the machine on which the compiler is running.
There is a version of
.CW vc
compiled for each architecture:
.CW /mips/bin/vc ,
.CW /68020/bin/vc ,
.CW /sparc/bin/vc ,
and so on,
each capable of producing MIPS object files regardless of the native
instruction set.
If one is running on a SPARC,
.CW /sparc/bin/vc
will compile programs for the MIPS;
if one is running on machine
.CW $cputype ,
.CW /$cputype/bin/vc
will compile programs for the MIPS.
.PP
Because of the bindings that assemble
.CW /bin ,
the shell always looks for a command, say
.CW date ,
in
.CW /bin
and automatically finds the file
.CW /$cputype/bin/date .
Therefore the MIPS compiler is known as just
.CW vc ;
the shell will invoke
.CW /bin/vc
and that is guaranteed to be the version of the MIPS compiler
appropriate for the machine running the command.
Regardless of the architecture of the compiling machine,
.CW /bin/vc
is
.I always
the MIPS compiler.
.PP
Also, the output of
.CW vc
and
.CW vl
is completely independent of the machine type on which they are executed:
.CW \&.v
files compiled (with
.CW vc )
on a SPARC may be linked (with
.CW vl )
on a 386.
(The resulting
.CW v.out
will run, of course, only on a MIPS.)
Similarly, the MIPS libraries in
.CW /mips/lib
are suitable for loading with
.CW vl
on any machine; there is only one set of MIPS libraries, not one
set for each architecture that supports the MIPS compiler.
.SH
Heterogeneity and \f(CWmk\fP
.PP
Most software on Plan 9 is compiled under the control of
.CW mk ,
a descendant of
.CW make
that is documented in the Programmer's Manual.
A convention used throughout the
.CW mkfiles
makes it easy to compile the source into binary suitable for any architecture.
.PP
The variable
.CW $cputype
is advisory: it reports the architecture of the current environment, and should
not be modified.  A second variable,
.CW $objtype ,
is used to set which architecture is being
.I compiled
for.
The value of
.CW $objtype
can be used by a
.CW mkfile
to configure the compilation environment.
.PP
In each machine's root directory there is a short
.CW mkfile
that defines a set of macros for the compiler, loader, etc.
Here is
.CW /mips/mkfile :
.P1
</sys/src/mkfile.proto

CC=vc
LD=vl
O=v
AS=va
.P2
The line
.P1
</sys/src/mkfile.proto
.P2
causes
.CW mk
to include the file
.CW /sys/src/mkfile.proto ,
which contains general definitions:
.P1
#
# common mkfile parameters shared by all architectures
#

OS=v486xq7
CPUS=mips 386 power alpha
CFLAGS=-FVw
LEX=lex
YACC=yacc
MK=/bin/mk
.P2
.CW CC
is obviously the compiler,
.CW AS
the assembler, and
.CW LD
the loader.
.CW O
is the suffix for the object files and
.CW CPUS
and
.CW OS
are used in special rules described below.
.PP
Here is a
.CW mkfile
to build the installed source for
.CW sam :
.P1
</$objtype/mkfile
OBJ=sam.$O address.$O buffer.$O cmd.$O disc.$O error.$O \e
	file.$O io.$O list.$O mesg.$O moveto.$O multi.$O \e
	plan9.$O rasp.$O regexp.$O string.$O sys.$O xec.$O

$O.out:	$OBJ
	$LD $OBJ

install:	$O.out
	cp $O.out /$objtype/bin/sam

installall:
	for(objtype in $CPUS) mk install

%.$O:	%.c
	$CC $CFLAGS $stem.c

$OBJ:	sam.h errors.h mesg.h
address.$O cmd.$O parse.$O xec.$O unix.$O:	parse.h

clean:V:
	rm -f [$OS].out *.[$OS] y.tab.?
.P2
(The actual
.CW mkfile
imports most of its rules from other secondary files, but
this example works and is not misleading.)
The first line causes
.CW mk
to include the contents of
.CW /$objtype/mkfile
in the current
.CW mkfile .
If
.CW $objtype
is
.CW mips ,
this inserts the MIPS macro definitions into the
.CW mkfile .
In this case the rule for
.CW $O.out
uses the MIPS tools to build
.CW v.out .
The
.CW %.$O
rule in the file uses
.CW mk 's
pattern matching facilities to convert the source files to the object
files through the compiler.
(The text of the rules is passed directly to the shell,
.CW rc ,
without further translation.
See the
.CW mk
manual if any of this is unfamiliar.)
Because the default rule builds
.CW $O.out
rather than
.CW sam ,
it is possible to maintain binaries for multiple machines in the
same source directory without conflict.
This is also, of course, why the output files from the various
compilers and loaders
have distinct names.
.PP
The rest of the
.CW mkfile
should be easy to follow; notice how the rules for
.CW clean
and
.CW installall
(that is, install versions for all architectures) use other macros
defined in
.CW /$objtype/mkfile .
In Plan 9,
.CW mkfiles
for commands conventionally contain rules to
.CW install
(compile and install the version for
.CW $objtype ),
.CW installall
(compile and install for all
.CW $objtypes ),
and
.CW clean
(remove all object files, binaries, etc.).
.PP
The
.CW mkfile
is easy to use.  To build a MIPS binary,
.CW v.out :
.P1
% objtype=mips
% mk
.P2
To build and install a MIPS binary:
.P1
% objtype=mips
% mk install
.P2
To build and install all versions:
.P1
% mk installall
.P2
These conventions make cross-compilation as easy to manage
as traditional native compilation.
Plan 9 programs compile and run without change on machines from
large multiprocessors to laptops.  For more information about this process, see
``Plan 9 Mkfiles'',
by Bob Flandrena.
.SH
Portability
.PP
Within Plan 9, it is painless to write portable programs, programs whose
source is independent of the machine on which they execute.
The operating system is fixed and the compiler, headers and libraries
are constant so most of the stumbling blocks to portability are removed.
Attention to a few details can avoid those that remain.
.PP
Plan 9 is a heterogeneous environment, so programs must
.I expect
that external files will be written by programs on machines of different
architectures.
The compilers, for instance, must handle without confusion
object files written by other machines.
The traditional approach to this problem is to pepper the source with
.CW #ifdefs
to turn byte-swapping on and off.
Plan 9 takes a different approach: of the handful of machine-dependent
.CW #ifdefs
in all the source, almost all are deep in the libraries.
Instead programs read and write files in a defined format,
either (for low volume applications) as formatted text, or
(for high volume applications) as binary in a known byte order.
If the external data were written with the most significant
byte first, the following code reads a 4-byte integer correctly
regardless of the architecture of the executing machine (assuming
an unsigned long holds 4 bytes):
.P1
ulong
getlong(void)
{
	ulong l;

	l = (getchar()&0xFF)<<24;
	l |= (getchar()&0xFF)<<16;
	l |= (getchar()&0xFF)<<8;
	l |= (getchar()&0xFF)<<0;
	return l;
}
.P2
Note that this code does not `swap' the bytes; instead it just reads
them in the correct order.
Variations of this code will handle any binary format
and also avoid problems
involving how structures are padded, how words are aligned,
and other impediments to portability.
Be aware, though, that extra care is needed to handle floating point data.
754Be aware, though, that extra care is needed to handle floating point data.
755.PP
756Efficiency hounds will argue that this method is unnecessarily slow and clumsy
757when the executing machine has the same byte order (and padding and alignment)
758as the data.
759The CPU cost of I/O processing
760is rarely the bottleneck for an application, however,
761and the gain in simplicity of porting and maintaining the code greatly outweighs
762the minor speed loss from handling data in this general way.
763This method is how the Plan 9 compilers, the window system, and even the file
764servers transmit data between programs.
765.PP
766To port programs beyond Plan 9, where the system interface is more variable,
767it is probably necessary to use
768.CW pcc
769and hope that the target machine supports ANSI C and POSIX.
770.SH
771I/O
772.PP
773The default C library, defined by the include file
774.CW <libc.h> ,
775contains no buffered I/O package.
776It does have several entry points for printing formatted text:
777.CW print
778outputs text to the standard output,
779.CW fprint
780outputs text to a specified integer file descriptor, and
781.CW sprint
782places text in a character array.
783To access library routines for buffered I/O, a program must
784explicitly include the header file associated with an appropriate library.
785.PP
786The recommended I/O library, used by most Plan 9 utilities, is
787.CW bio
788(buffered I/O), defined by
789.CW <bio.h> .
790There also exists an implementation of ANSI Standard I/O,
791.CW stdio .
.PP
.CW Bio
is small and efficient, particularly for buffer-at-a-time or
line-at-a-time I/O.
Even for character-at-a-time I/O, however, it is significantly faster than
the Standard I/O library,
.CW stdio .
Its interface is compact and regular, although it lacks a few conveniences.
The most noticeable is that one must explicitly define buffers for standard
input and output;
.CW bio
does not predefine them.  Here is a program to copy input to output a byte
at a time using
.CW bio :
.P1
#include <u.h>
#include <libc.h>
#include <bio.h>

Biobuf	bin;
Biobuf	bout;

void
main(void)
{
	int c;

	Binit(&bin, 0, OREAD);
	Binit(&bout, 1, OWRITE);

	while((c=Bgetc(&bin)) != Beof)
		Bputc(&bout, c);
	exits(0);
}
.P2
For peak performance, we could replace
.CW Bgetc
and
.CW Bputc
by their equivalent in-line macros
.CW BGETC
and
.CW BPUTC
but
the performance gain would be modest.
For more information on
.CW bio ,
see the Programmer's Manual.
.PP
Perhaps the most dramatic difference in the I/O interface of Plan 9 from other
systems' is that text is not ASCII.
The format for
text in Plan 9 is a byte-stream encoding of 16-bit characters.
The character set is based on the Unicode Standard and is backward compatible with
ASCII:
characters with value 0 through 127 are the same in both sets.
The 16-bit characters, called
.I runes
in Plan 9, are encoded using a representation called
UTF,
an encoding that is becoming accepted as a standard.
(ISO calls it UTF-8;
throughout Plan 9 it's just called
UTF.)
UTF
defines multibyte sequences to
represent character values from 0 to 65535.
In
UTF,
character values up to 127 decimal, 7F hexadecimal, represent themselves,
so straight
ASCII
files are also valid
UTF.
Also,
UTF
guarantees that bytes with values 0 to 127 (NUL to DEL, inclusive)
will appear only when they represent themselves, so programs that read bytes
looking for plain ASCII characters will continue to work.
Any program that expects a one-to-one correspondence between bytes and
characters will, however, need to be modified.
An example is parsing file names.
File names, like all text, are in
UTF,
so it is incorrect to search for a character in a string by
.CW strchr(filename,
.CW c)
because the character might have a multi-byte encoding.
The correct method is to call
.CW utfrune(filename,
.CW c) ,
defined in
.I rune (2),
which interprets the file name as a sequence of encoded characters
rather than bytes.
In fact, even when you know the character is a single byte
that can represent only itself,
it is safer to use
.CW utfrune
because that assumes nothing about the character set
and its representation.
.PP
The library defines several symbols relevant to the representation of characters.
Any byte with unsigned value less than
.CW Runesync
will not appear in any multi-byte encoding of a character.
.CW Utfrune
compares the character being searched against
.CW Runesync
to see if it is sufficient to call
.CW strchr
or if the byte stream must be interpreted.
Any byte with unsigned value less than
.CW Runeself
is represented by a single byte with the same value.
Finally, when errors are encountered converting
to runes from a byte stream, the library returns the rune value
.CW Runeerror
and advances a single byte.  This permits programs to find runes
embedded in binary data.
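.PP
To make the decoding concrete, here is a portable sketch of interpreting a byte stream as 16-bit characters and searching it by character rather than by byte.
It is not the library's implementation: the names
.CW decoderune ,
.CW findrune ,
and
.CW Badrune
are hypothetical, and the value chosen for
.CW Badrune
is an assumption standing in for
.CW Runeerror .
Only the one-, two-, and three-byte sequences needed for values up to 65535 are handled.

```c
enum { Badrune = 0xFFFD };	/* assumed stand-in for Runeerror */

/*
 * Decode one character from the NUL-terminated bytes at s into *r.
 * Returns the number of bytes consumed; on a malformed sequence
 * stores Badrune and consumes a single byte.
 */
int
decoderune(unsigned short *r, const unsigned char *s)
{
	unsigned long l;
	unsigned c, c1, c2;

	c = s[0];
	if(c < 0x80){			/* 0xxxxxxx: represents itself */
		*r = c;
		return 1;
	}
	if((c & 0xE0) == 0xC0){		/* 110xxxxx 10xxxxxx */
		c1 = s[1];
		if((c1 & 0xC0) == 0x80){
			l = ((c & 0x1F) << 6) | (c1 & 0x3F);
			if(l >= 0x80){	/* reject overlong forms */
				*r = l;
				return 2;
			}
		}
	}else if((c & 0xF0) == 0xE0){	/* 1110xxxx 10xxxxxx 10xxxxxx */
		c1 = s[1];
		if((c1 & 0xC0) == 0x80){
			c2 = s[2];
			if((c2 & 0xC0) == 0x80){
				l = ((c & 0x0F) << 12) | ((c1 & 0x3F) << 6) | (c2 & 0x3F);
				if(l >= 0x800){
					*r = l;
					return 3;
				}
			}
		}
	}
	*r = Badrune;
	return 1;
}

/* Return a pointer to the first occurrence of character c in s, or 0. */
const char*
findrune(const char *s, unsigned c)
{
	unsigned short r;
	int n;

	while(*s != '\0'){
		n = decoderune(&r, (const unsigned char*)s);
		if(r == c)
			return s;
		s += n;
	}
	return 0;
}
```

A byte-by-byte search for the value 255 would never match the two-byte encoding of ÿ; the character-by-character search finds it.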
.PP
.CW Bio
includes routines
.CW Bgetrune
and
.CW Bputrune
to transform the external byte stream
UTF
format to and from
internal 16-bit runes.
Also, the
.CW %s
format to
.CW print
accepts
UTF;
.CW %c
prints a character after narrowing it to 8 bits.
The
.CW %S
format prints a null-terminated sequence of runes;
.CW %C
prints a character after narrowing it to 16 bits.
For more information, see the Programmer's Manual, in particular
.I utf (6)
and
.I rune (2),
and the paper,
``Hello world, or
Καλημέρα κόσμε, or\
\f(Jpこんにちは 世界\f1'',
by Rob Pike and
Ken Thompson;
there is not room for the full story here.
.PP
These issues affect the compiler in several ways.
First, the C source is in
UTF.
ANSI says C variables are formed from
ASCII
alphanumerics, but comments and literal strings may contain any characters
encoded in the native encoding, here
UTF.
The declaration
.P1
char *cp = "abcÿ";
.P2
initializes the variable
.CW cp
to point to an array of bytes holding the
UTF
representation of the characters
.CW abcÿ .
The type
.CW Rune
is defined in
.CW <u.h>
to be
.CW ushort ,
which is also the `wide character' type in the compiler.
Therefore the declaration
.P1
Rune *rp = L"abcÿ";
.P2
initializes the variable
.CW rp
to point to an array of unsigned short integers holding the 16-bit
values of the characters
.CW abcÿ .
Note that in both these declarations the characters in the source
that represent
.CW "abcÿ"
are the same; what changes is how those characters are represented
in memory in the program.
The following two lines:
.P1
print("%s\en", "abcÿ");
print("%S\en", L"abcÿ");
.P2
produce the same
UTF
string on their output, the first by copying the bytes, the second
by converting from runes to bytes.
.PP
In C, character constants are integers but narrowed through the
.CW char
type.
The Unicode character
.CW ÿ
has value 255, so if the
.CW char
type is signed,
the constant
.CW 'ÿ'
has value \-1 (which is equal to EOF).
On the other hand,
.CW L'ÿ'
narrows through the wide character type,
.CW ushort ,
and therefore has value 255.
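.PP
The two narrowings can be imitated in portable C with explicit casts.
This is only a sketch with hypothetical names
.CW narrowchar
and
.CW narrowwide ;
strictly, ISO C leaves the result of converting 255 to a signed 8-bit type implementation-defined, although the usual two's complement compilers yield \-1.

```c
/* Narrow through a signed 8-bit type, as a signed char constant would be. */
int
narrowchar(int v)
{
	return (signed char)v;	/* 255 becomes -1 on two's complement compilers */
}

/* Narrow through the 16-bit wide character type. */
int
narrowwide(int v)
{
	return (unsigned short)v;	/* 255 stays 255 */
}
```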
.PP
Finally, although it's not ANSI C, the Plan 9 C compilers
assume any character with value above
.CW Runeself
is an alphanumeric,
so α is a legal, if non-portable, variable name.
.SH
Arguments
.PP
Some macros are defined
in
.CW <libc.h>
for parsing the arguments to
.CW main() .
They are described in
.I ARG (2)
but are fairly self-explanatory.
There are four macros:
.CW ARGBEGIN
and
.CW ARGEND
are used to bracket a hidden
.CW switch
statement within which
.CW ARGC
returns the current option character (rune) being processed and
.CW ARGF
returns the argument to the option, as in the loader option
.CW -o
.CW file .
Here, for example, is the code at the beginning of
.CW main()
in
.CW ramfs.c
(see
.I ramfs (1))
that cracks its arguments:
.P1
void
main(int argc, char *argv[])
{
	char *defmnt;
	int p[2];
	int mfd[2];
	int stdio = 0;

	defmnt = "/tmp";
	ARGBEGIN{
	case 'i':
		defmnt = 0;
		stdio = 1;
		mfd[0] = 0;
		mfd[1] = 1;
		break;
	case 's':
		defmnt = 0;
		break;
	case 'm':
		defmnt = ARGF();
		break;
	default:
		usage();
	}ARGEND
.P2
1074.P2
1075.SH
1076Extensions
1077.PP
1078The compiler has several extensions to ANSI C, all of which are used
1079extensively in the system source.
1080First,
1081.I structure
1082.I displays
1083permit
1084.CW struct
1085expressions to be formed dynamically.
1086Given these declarations:
1087.P1
1088typedef struct Point Point;
1089typedef struct Rectangle Rectangle;
1090
1091struct Point
1092{
1093	int x, y;
1094};
1095
1096struct Rectangle
1097{
1098	Point min, max;
1099};
1100
1101Point	p, q, add(Point, Point);
1102Rectangle r;
1103int	x, y;
1104.P2
1105this assignment may appear anywhere an assignment is legal:
1106.P1
1107r = (Rectangle){add(p, q), (Point){x, y+3}};
1108.P2
1109The syntax is the same as for initializing a structure but with
1110a leading cast.
.PP
If an
.I anonymous
.I structure
or
.I union
is declared within another structure or union, the members of the internal
structure or union are addressable without prefix in the outer structure.
This feature eliminates the clumsy naming of nested structures and,
particularly, unions.
For example, after these declarations,
.P1
struct Lock
{
	int	locked;
};

struct Node
{
	int	type;
	union{
		double  dval;
		float   fval;
		long    lval;
	};		/* anonymous union */
	struct Lock;	/* anonymous structure */
} *node;

void	lock(struct Lock*);
.P2
one may refer to
.CW node->type ,
.CW node->dval ,
.CW node->fval ,
.CW node->lval ,
and
.CW node->locked .
Moreover, the address of a
.CW struct
.CW Node
may be used without a cast anywhere that the address of a
.CW struct
.CW Lock
is used, such as in argument lists.
The compiler automatically promotes the type and adjusts the address.
Thus one may invoke
.CW lock(node) .
.PP
Anonymous structures and unions may be accessed by type name
if (and only if) they are declared using a
.CW typedef
name.
For example, using the above declaration for
.CW Point ,
one may declare
.P1
struct
{
	int	type;
	Point;
} p;
.P2
and refer to
.CW p.Point .
.PP
In the initialization of arrays, a number in square brackets before an
element sets the index for the initialization.  For example, to initialize
some elements in
a table of function pointers indexed by
ASCII
character,
.P1
void	percent(void), slash(void);

void	(*func[128])(void) =
{
	['%']	percent,
	['/']	slash,
};
.P2
.LP
A similar syntax allows one to initialize structure elements:
.P1
Point p =
{
	.y 100,
	.x 200
};
.P2
These initialization syntaxes were later added to ANSI C, with the addition of an
equals sign between the index or tag and the value.
The Plan 9 compiler accepts either form.
.PP
Finally, the declaration
.P1
extern register reg;
.P2
.I this "" (
appearance of the register keyword is not ignored)
allocates a global register to hold the variable
.CW reg .
External registers must be used carefully: they need to be declared in
.I all
source files and libraries in the program to guarantee the register
is not allocated temporarily for other purposes.
Especially on machines with few registers, such as the i386,
it is easy to link accidentally with code that has already usurped
the global registers and there is no diagnostic when this happens.
Used wisely, though, external registers are powerful.
The Plan 9 operating system uses them to access per-process and
per-machine data structures on a multiprocessor.  The storage class they provide
is hard to create in other ways.
.SH
The compile-time environment
.PP
The code generated by the compilers is `optimized' by default:
variables are placed in registers and peephole optimizations are
performed.
The compiler flag
.CW -N
disables these optimizations.
Registerization is done locally rather than throughout a function:
whether a variable occupies a register or
the memory location identified in the symbol
table depends on the activity of the variable and may change
throughout the life of the variable.
The
.CW -N
flag is rarely needed;
its main use is to simplify debugging.
There is no information in the symbol table to identify the
registerization of a variable, so
.CW -N
guarantees the variable is always where the symbol table says it is.
.PP
Another flag,
.CW -w ,
turns
.I on
warnings about portability and problems detected in flow analysis.
Most code in Plan 9 is compiled with warnings enabled;
these warnings plus the type checking offered by function prototypes
provide most of the support of the Unix tool
.CW lint
more accurately and with less chatter.
Two of the warnings,
`used and not set' and `set and not used', are almost always accurate but
may be triggered spuriously by code with invisible control flow,
such as in routines that call
.CW longjmp .
The compiler statements
.P1
SET(v1);
USED(v2);
.P2
decorate the flow graph to silence the compiler.
Either statement accepts a comma-separated list of variables.
Use them carefully: they may silence real errors.
For the common case of unused parameters to a function,
leaving the name off the declaration silences the warnings.
That is, listing the type of a parameter but giving it no
associated variable name does the trick.
.SH
Debugging
.PP
There are two debuggers available on Plan 9.
The first, and older, is
.CW db ,
a revision of Unix
.CW adb .
The other,
.CW acid ,
is a source-level debugger whose commands are statements in
a true programming language.
.CW Acid
is the preferred debugger, but since it
borrows some elements of
.CW db ,
notably the formats for displaying values, it is worth knowing a little bit about
.CW db .
.PP
Both debuggers support multiple architectures in a single program; that is,
the programs are
.CW db
and
.CW acid ,
not for example
.CW vdb
and
.CW vacid .
They also support cross-architecture debugging comfortably:
one may debug a 68020 binary on a MIPS.
.PP
Imagine a program has crashed mysteriously:
.P1
% X11/X
Fatal server bug!
failed to create default stipple
X 106: suicide: sys: trap: fault read addr=0x0 pc=0x00105fb8
%
.P2
When a process dies on Plan 9 it hangs in the `broken' state
for debugging.
Attach a debugger to the process by naming its process id:
.P1
% acid 106
/proc/106/text:mips plan 9 executable

/sys/lib/acid/port
/sys/lib/acid/mips
acid:
.P2
The
.CW acid
function
.CW stk()
reports the stack traceback:
.P1
acid: stk()
At pc:0x105fb8:abort+0x24 /sys/src/ape/lib/ap/stdio/abort.c:6
abort() /sys/src/ape/lib/ap/stdio/abort.c:4
	called from FatalError+#4e
		/sys/src/X/mit/server/dix/misc.c:421
FatalError(s9=#e02, s8=#4901d200, s7=#2, s6=#72701, s5=#1,
    s4=#7270d, s3=#6, s2=#12, s1=#ff37f1c, s0=#6, f=#7270f)
    /sys/src/X/mit/server/dix/misc.c:416
	called from gnotscreeninit+#4ce
		/sys/src/X/mit/server/ddx/gnot/gnot.c:792
gnotscreeninit(snum=#0, sc=#80db0)
    /sys/src/X/mit/server/ddx/gnot/gnot.c:766
	called from AddScreen+#16e
		/n/bootes/sys/src/X/mit/server/dix/main.c:610
AddScreen(pfnInit=0x0000129c,argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/dix/main.c:530
	called from InitOutput+0x80
		/sys/src/X/mit/server/ddx/brazil/brddx.c:522
InitOutput(argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/ddx/brazil/brddx.c:511
	called from main+0x294
		/sys/src/X/mit/server/dix/main.c:225
main(argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/dix/main.c:136
	called from _main+0x24
		/sys/src/ape/lib/ap/mips/main9.s:8
.P2
The function
.CW lstk()
is similar but
also reports the values of local variables.
Note that the traceback includes full file names; this is a boon to debugging,
although it makes the output much noisier.
.PP
To use
.CW acid
well you will need to learn its input language; see the
``Acid Manual'',
by Phil Winterbottom,
for details.  For simple debugging, however, the information in the manual page is
sufficient.  In particular, it describes the most useful functions
for examining a process.
.PP
The compiler does not place
information describing the types of variables in the executable,
but a compile-time flag provides crude support for symbolic debugging.
The
.CW -a
flag to the compiler suppresses code generation
and instead emits source text in the
.CW acid
language to format and display data structure types defined in the program.
The easiest way to use this feature is to put a rule in the
.CW mkfile :
.P1
syms:   main.$O
        $CC -a main.c > syms
.P2
Then from within
.CW acid ,
.P1
acid: include("sourcedirectory/syms")
.P2
to read in the relevant definitions.
(For multi-file source, you need to be a little fancier;
see
.I 2c (1)).
This text includes, for each defined compound
type, a function with that name that may be called with the address of a structure
of that type to display its contents.
For example, if
.CW rect
is a global variable of type
.CW Rectangle ,
one may execute
.P1
Rectangle(*rect)
.P2
to display it.
The
.CW *
(indirection) operator is necessary because
of the way
.CW acid
works: each global symbol in the program is defined as a variable by
.CW acid ,
with value equal to the
.I address
of the symbol.
.PP
Another common technique is to write by hand special
.CW acid
code to define functions to aid debugging, initialize the debugger, and so on.
Conventionally, this is placed in a file called
.CW acid
in the source directory; it has a line
.P1
include("sourcedirectory/syms");
.P2
to load the compiler-produced symbols.  One may edit the compiler output directly but
it is wiser to keep the hand-generated
.CW acid
separate from the machine-generated.
.PP
To make things simple, the default rules in the system
.CW mkfiles
include entries to make
.CW foo.acid
from
.CW foo.c ,
so one may use
.CW mk
to automate the production of
.CW acid
definitions for a given C source file.
.PP
There is much more to say here.  See the
.CW acid
manual page, the reference manual, or the paper
``Acid: A Debugger Built From A Language'',
also by Phil Winterbottom.