#	@(#)TOUR	8.1 (Berkeley) 05/31/93

NOTE -- This is the original TOUR paper distributed with ash and
does not represent the current state of the shell.  It is provided anyway
since it provides helpful information for how the shell is structured,
but be warned that things have changed -- the current shell is
still under development.

================================================================

                       A Tour through Ash

               Copyright 1989 by Kenneth Almquist.


DIRECTORIES:  The subdirectory bltin contains commands which can
be compiled stand-alone.  The rest of the source is in the main
ash directory.

SOURCE CODE GENERATORS:  Files whose names begin with "mk" are
programs that generate source code.  A complete list of these
programs is:

        program         input files         generates
        -------         -----------         ---------
        mkbuiltins      builtins            builtins.h builtins.c
        mkinit          *.c                 init.c
        mknodes         nodetypes           nodes.h nodes.c
        mksignames      -                   signames.h signames.c
        mksyntax        -                   syntax.h syntax.c
        mktokens        -                   token.def
        bltin/mkexpr    unary_op binary_op  operators.h operators.c

There are undoubtedly too many of these.  Mkinit searches all the
C source files for entries looking like:

        INIT {
              x = 1;    /* executed during initialization */
        }

        RESET {
              x = 2;    /* executed when the shell does a longjmp
                           back to the main command loop */
        }

        SHELLPROC {
              x = 3;    /* executed when the shell runs a shell procedure */
        }

It pulls this code out into routines which are called when
particular events occur.  The intent is to improve modularity by
isolating the information about which modules need to be
explicitly initialized/reset within the modules themselves.
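
As a rough illustration, the file generated from the three entries
above might look something like the sketch below.  The routine
names init, reset and initshellproc, and the helper current_x, are
assumptions made for this example; the text only says that the
code is collected into routines.

```c
/* Sketch of a generated init.c.  Routine names are assumed. */
static int x;

/* Bodies of all INIT entries, run once during initialization. */
void init(void)
{
      {
            x = 1;
      }
}

/* Bodies of all RESET entries, run after a longjmp back to the
   main command loop. */
void reset(void)
{
      {
            x = 2;
      }
}

/* Bodies of all SHELLPROC entries, run when a shell procedure
   is about to be interpreted. */
void initshellproc(void)
{
      {
            x = 3;
      }
}

/* Hypothetical accessor so the effect can be observed. */
int current_x(void)
{
      return x;
}
```

Each module keeps its own INIT, RESET and SHELLPROC fragments next
to the data they touch, and only the generated file needs to know
the full list.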

Mkinit recognizes several constructs for placing declarations in
the init.c file.
        INCLUDE "file.h"
includes a file.  The storage class MKINIT makes a declaration
available in the init.c file, for example:
        MKINIT int funcnest;    /* depth of function calls */
MKINIT alone on a line introduces a structure or union
declaration:
        MKINIT
        struct redirtab {
              short renamed[10];
        };
Preprocessor #define statements are copied to init.c without any
special action to request this.

INDENTATION:  The ash source is indented in multiples of six
spaces.  The only study that I have heard of on the subject
concluded that the optimal amount to indent is in the range of
four to six spaces.  I use six spaces since it is not too big a
jump from the widely used eight spaces.  If you really hate six
space indentation, use the adjind (source included) program to
change it to something else.

EXCEPTIONS:  Code for dealing with exceptions appears in
exceptions.c.  The C language doesn't include exception handling,
so I implement it using setjmp and longjmp.  The global variable
exception contains the type of exception.  EXERROR is raised by
calling error.  EXINT is an interrupt.  EXSHELLPROC is an
exception which is raised when a shell procedure is invoked.  The
purpose of EXSHELLPROC is to perform the cleanup actions
associated with other exceptions.  After these cleanup actions,
the shell can interpret a shell procedure itself without exec'ing
a new copy of the shell.
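
The general setjmp/longjmp technique can be sketched as follows.
This is a minimal illustration, not the shell's code:  the
numeric values of the exception types and the names handler,
raise_exception and run_protected are all assumptions.

```c
#include <setjmp.h>

/* Exception types named in the text; the values are assumed. */
enum { EXINT = 0, EXERROR = 1, EXSHELLPROC = 2 };

static jmp_buf *handler;      /* innermost active handler (assumed name) */
static int exception;         /* type of the exception being raised */

/* Record the exception type and jump to the current handler. */
void raise_exception(int e)
{
      exception = e;
      longjmp(*handler, 1);
}

/* Run fn with a handler installed.  Returns -1 if fn returned
   normally, otherwise the type of the exception it raised. */
int run_protected(void (*fn)(void))
{
      jmp_buf here;
      jmp_buf *saved = handler;
      int result = -1;

      if (setjmp(here) == 0) {
            handler = &here;
            fn();
      } else {
            result = exception;     /* arrived here via longjmp */
      }
      handler = saved;              /* restore the outer handler */
      return result;
}

/* Two hypothetical routines used to exercise the handler. */
void fails(void)
{
      raise_exception(EXERROR);
}

void succeeds(void)
{
}
```

Saving and restoring the previous handler is what lets protected
regions nest.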

INTERRUPTS:  In an interactive shell, an interrupt will cause an
EXINT exception to return to the main command loop.  (Exception:
EXINT is not raised if the user traps interrupts using the trap
command.)  The INTOFF and INTON macros (defined in exception.h)
provide uninterruptible critical sections.  Between the execution
of INTOFF and the execution of INTON, interrupt signals will be
held for later delivery.  INTOFF and INTON can be nested.
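
One way to get nestable critical sections is a depth counter plus
a pending flag.  The sketch below shows that idea only; the
variable names and the deliver routine are assumptions, and where
the real shell would raise EXINT this example just counts.

```c
#include <assert.h>

static int suppressint;       /* current INTOFF nesting depth (assumed name) */
static int intpending;        /* an interrupt arrived while held */
static int delivered;         /* interrupts acted on, for illustration */

/* Stand-in for raising EXINT. */
void deliver(void)
{
      intpending = 0;
      delivered++;
}

#define INTOFF  (suppressint++)
#define INTON   do { \
                      assert(suppressint > 0); \
                      if (--suppressint == 0 && intpending) \
                            deliver(); \
                } while (0)

/* Called when the interrupt signal arrives. */
void interrupt_arrived(void)
{
      if (suppressint > 0)
            intpending = 1;   /* hold for later delivery */
      else
            deliver();
}
```

Only the outermost INTON delivers a held interrupt, which is what
makes nesting safe.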

MEMALLOC.C:  Memalloc.c defines versions of malloc and realloc
which call error when there is no memory left.  It also defines a
stack oriented memory allocation scheme.  Allocating off a stack
is probably more efficient than allocation using malloc, but the
big advantage is that when an exception occurs all we have to do
to free up the memory in use at the time of the exception is to
restore the stack pointer.  The stack is implemented using a
linked list of blocks.
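
A stack built from a linked list of blocks, with a mark that can
be restored later, might be sketched like this.  The block size,
the structure layout and the routine names are assumptions for
the example, and abort stands in for the shell's error routine.

```c
#include <stddef.h>
#include <stdlib.h>

#define BLOCKSIZE 512         /* assumed size of an ordinary block */

struct stack_block {
      struct stack_block *prev;     /* block below this one */
      size_t size;                  /* usable bytes in space[] */
      char space[];
};

struct stackmark {
      struct stack_block *block;
      size_t used;
};

static struct stack_block *top;     /* current block, NULL when empty */
static size_t used;                 /* bytes used in the current block */

/* Allocate nbytes from the stack, adding a block if needed. */
void *stalloc(size_t nbytes)
{
      size_t align = sizeof(void *);
      char *p;

      /* round up so successive allocations stay pointer-aligned */
      nbytes = (nbytes + align - 1) / align * align;
      if (top == NULL || used + nbytes > top->size) {
            size_t size = nbytes > BLOCKSIZE ? nbytes : BLOCKSIZE;
            struct stack_block *b = malloc(sizeof *b + size);

            if (b == NULL)
                  abort();          /* the shell would call error */
            b->prev = top;
            b->size = size;
            top = b;
            used = 0;
      }
      p = top->space + used;
      used += nbytes;
      return p;
}

/* Remember the current stack position. */
void setstackmark(struct stackmark *m)
{
      m->block = top;
      m->used = used;
}

/* Free everything allocated since the mark was taken. */
void popstackmark(const struct stackmark *m)
{
      while (top != m->block) {
            struct stack_block *b = top;

            top = b->prev;
            free(b);
      }
      used = m->used;
}
```

An exception handler only needs to pop back to a mark taken
before the failed operation; no per-object free calls are needed.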

STPUTC:  If the stack were contiguous, it would be easy to store
strings on the stack without knowing in advance how long the
string was going to be:
        p = stackptr;
        *p++ = c;       /* repeated as many times as needed */
        stackptr = p;
The following three macros (defined in memalloc.h) perform these
operations, but grow the stack if you run off the end:
        STARTSTACKSTR(p);
        STPUTC(c, p);   /* repeated as many times as needed */
        grabstackstr(p);
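
The grow-on-demand pattern can be sketched as below.  To stay
short this version keeps the string in a single realloc'ed buffer
rather than on the shell's block stack, so it is an analogy and
not the real memalloc.h; the helper routines and the rule that
the caller frees the result are assumptions.

```c
#include <stdlib.h>
#include <string.h>

static char *stackbuf;        /* start of the string being built */
static size_t stacksize;      /* bytes available in stackbuf */

/* Make sure a buffer exists and return its start. */
char *startstackstr(void)
{
      if (stackbuf == NULL) {
            stacksize = 16;
            stackbuf = malloc(stacksize);
            if (stackbuf == NULL)
                  abort();
      }
      return stackbuf;
}

/* Double the buffer; return the position corresponding to p. */
char *growstackstr(char *p)
{
      size_t len = (size_t)(p - stackbuf);
      char *nb = realloc(stackbuf, stacksize * 2);

      if (nb == NULL)
            abort();
      stackbuf = nb;
      stacksize *= 2;
      return stackbuf + len;
}

#define STARTSTACKSTR(p)  ((p) = startstackstr())
#define STPUTC(c, p)      do { \
                                if ((p) == stackbuf + stacksize) \
                                      (p) = growstackstr(p); \
                                *(p)++ = (c); \
                          } while (0)

/* Claim the finished string.  In this sketch the caller owns the
   result and frees it; the next string starts a fresh buffer. */
char *grabstackstr(char *p)
{
      char *s = stackbuf;

      (void)p;
      stackbuf = NULL;
      stacksize = 0;
      return s;
}
```

The important detail is that STPUTC may move the buffer, so the
caller must keep using the updated p and never a saved copy of
the start address.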

We now start a top-down look at the code:

MAIN.C:  The main routine performs some initialization, executes
the user's profile if necessary, and calls cmdloop.  Cmdloop
repeatedly parses and executes commands.

OPTIONS.C:  This file contains the option processing code.  It is
called from main to parse the shell arguments when the shell is
invoked, and it also contains the set builtin.  The -i and -j
options (the latter turns on job control) require changes in
signal handling.  The routines setjobctl (in jobs.c) and
setinteractive (in trap.c) are called to handle changes to these
options.

PARSING:  The parser code is all in parser.c.  A recursive
descent parser is used.  Syntax tables (generated by mksyntax)
are used to classify characters during lexical analysis.  There
are three tables:  one for normal use, one for use when inside
single quotes, and one for use when inside double quotes.  The
tables are machine dependent because they are indexed by
character variables and the range of a char varies from machine
to machine.
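
The machine dependence comes from char being signed on some
machines and unsigned on others.  A table can be indexed safely
by adding an offset derived from CHAR_MIN, as in this sketch.
The class names, the offset name SYNBASE and the set of special
characters are assumptions for the example; only one of the
three tables is shown.

```c
#include <limits.h>

/* Character classes; names and membership are assumed. */
enum { CWORD, CNL, CSQUOTE, CDQUOTE, CBACK, CVAR, CSPCL };

/* Offset that maps the smallest char value to index 0:
   128 where char is signed and 8 bits wide, 0 where unsigned. */
#define SYNBASE (-CHAR_MIN)

static unsigned char basesyntax[CHAR_MAX - CHAR_MIN + 1];

/* Fill in the table for normal (unquoted) text. */
void init_syntax(void)
{
      const char *p;

      /* static storage starts zeroed, so every char is CWORD
         unless changed below */
      basesyntax['\n' + SYNBASE] = CNL;
      basesyntax['\'' + SYNBASE] = CSQUOTE;
      basesyntax['"' + SYNBASE] = CDQUOTE;
      basesyntax['\\' + SYNBASE] = CBACK;
      basesyntax['$' + SYNBASE] = CVAR;
      for (p = ";&|<>()"; *p != '\0'; p++)
            basesyntax[*p + SYNBASE] = CSPCL;
}

/* Classify one character; works for negative char values too. */
int classify(char c)
{
      return basesyntax[c + SYNBASE];
}
```

Because both the table size and the offset depend on the range of
char, a generator program run on the target machine is a natural
way to produce the tables.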

PARSE OUTPUT:  The output of the parser consists of a tree of
nodes.  The various types of nodes are defined in the file
nodetypes.

Nodes of type NARG are used to represent both words and the
contents of here documents.  An early version of ash kept the
contents of here documents in temporary files, but keeping here
documents in memory typically results in significantly better
performance.  It would have been nice to make it an option to use
temporary files for here documents, for the benefit of small
machines, but the code to keep track of when to delete the
temporary files was complex and I never fixed all the bugs in it.
(AT&T has been maintaining the Bourne shell for more than ten
years, and to the best of my knowledge they still haven't gotten
it to handle temporary files correctly in obscure cases.)

The text field of a NARG structure points to the text of the
word.  The text consists of ordinary characters and a number of
special codes defined in parser.h.  The special codes are:

        CTLVAR              Variable substitution
        CTLENDVAR           End of variable substitution
        CTLBACKQ            Command substitution
        CTLBACKQ|CTLQUOTE   Command substitution inside double quotes
        CTLESC              Escape next character

A variable substitution contains the following elements:

        CTLVAR type name '=' [ alternative-text CTLENDVAR ]

The type field is a single character specifying the type of
substitution.  The possible types are:

        VSNORMAL            $var
        VSMINUS             ${var-text}
        VSMINUS|VSNUL       ${var:-text}
        VSPLUS              ${var+text}
        VSPLUS|VSNUL        ${var:+text}
        VSQUESTION          ${var?text}
        VSQUESTION|VSNUL    ${var:?text}
        VSASSIGN            ${var=text}
        VSASSIGN|VSNUL      ${var:=text}

In addition, the type field will have the VSQUOTE flag set if the
variable is enclosed in double quotes.  The name of the variable
comes next, terminated by an equals sign.  If the type is not
VSNORMAL, then the text field in the substitution follows,
terminated by a CTLENDVAR byte.
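
To make the byte layout concrete, here is a sketch of a routine
that picks one encoded substitution apart.  The numeric values of
the codes and flags, the VSTYPE mask, and the varsub structure
are assumptions for the example (the real values live in
parser.h), and nested substitutions inside the alternative text
are not handled.

```c
#include <string.h>

/* Assumed values; the real definitions are in parser.h. */
#define CTLVAR      '\202'
#define CTLENDVAR   '\203'
#define VSTYPE      0x0f      /* mask for the substitution type */
#define VSNUL       0x10      /* the ':' variants */
#define VSQUOTE     0x20      /* inside double quotes */
enum { VSNORMAL = 1, VSMINUS, VSPLUS, VSQUESTION, VSASSIGN };

struct varsub {
      int type;               /* VSNORMAL, VSMINUS, ... */
      int flags;              /* VSNUL and/or VSQUOTE */
      const char *name;       /* not NUL terminated */
      size_t namelen;
      const char *alt;        /* alternative text, or NULL */
      size_t altlen;
};

/* Decode the substitution starting at p, which must point at a
   CTLVAR byte.  Returns a pointer just past the substitution,
   or NULL if the encoding is malformed. */
const char *parse_varsub(const char *p, struct varsub *v)
{
      const char *eq, *end;

      if (*p++ != CTLVAR)
            return NULL;
      v->type = *p & VSTYPE;
      v->flags = *p & ~VSTYPE;
      p++;
      eq = strchr(p, '=');
      if (eq == NULL)
            return NULL;
      v->name = p;
      v->namelen = (size_t)(eq - p);
      p = eq + 1;
      v->alt = NULL;
      v->altlen = 0;
      if (v->type != VSNORMAL) {
            end = strchr(p, CTLENDVAR);
            if (end == NULL)
                  return NULL;
            v->alt = p;
            v->altlen = (size_t)(end - p);
            p = end + 1;
      }
      return p;
}
```

For example, ${HOME:-/tmp} would be stored as CTLVAR, the type
byte VSMINUS|VSNUL, the characters "HOME=/tmp", and CTLENDVAR.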

Commands in back quotes are parsed and stored in a linked list.
The locations of these commands in the string are indicated by
CTLBACKQ and CTLBACKQ|CTLQUOTE characters, depending upon whether
the back quotes were enclosed in double quotes.

The character CTLESC escapes the next character, so that in case
any of the CTL characters mentioned above appear in the input,
they can be passed through transparently.  CTLESC is also used to
escape '*', '?', '[', and '!' characters which were quoted by the
user and thus should not be used for file name generation.

CTLESC characters have proved to be particularly tricky to get
right.  In the case of here documents which are not subject to
variable and command substitution, the parser doesn't insert any
CTLESC characters to begin with (so the contents of the text
field can be written without any processing).  Other here
documents, and words which are not subject to splitting and file
name generation, have the CTLESC characters removed during the
variable and command substitution phase.  Words which are subject
to splitting and file name generation have the CTLESC characters
removed as part of the file name generation phase.
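
Removing the escapes is a single in-place pass.  The sketch below
assumes a value for CTLESC and uses the routine name rmescapes
for illustration; it shows only the removal step, not the logic
for deciding when removal should happen.

```c
#define CTLESC '\201'         /* assumed value; see parser.h */

/* Remove CTLESC bytes in place.  The character following each
   CTLESC is kept as is, even if it is itself a CTLESC. */
void rmescapes(char *s)
{
      char *q = s;

      while (*s != '\0') {
            if (*s == CTLESC && s[1] != '\0')
                  s++;              /* skip the escape byte */
            *q++ = *s++;            /* keep the (escaped) character */
      }
      *q = '\0';
}
```

Skipping the escape and then copying unconditionally is what lets
a literal CTLESC byte in the input survive:  it is stored as two
CTLESC bytes and comes back out as one.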

EXECUTION:  Command execution is handled by the following files:
        eval.c     The top level routines.
        redir.c    Code to handle redirection of input and output.
        jobs.c     Code to handle forking, waiting, and job control.
        exec.c     Code to do path searches and the actual exec sys call.
        expand.c   Code to evaluate arguments.
        var.c      Maintains the variable symbol table.  Called from expand.c.

EVAL.C:  Evaltree recursively executes a parse tree.  The exit
status is returned in the global variable exitstatus.  The
alternative entry evalbackcmd is called to evaluate commands in
back quotes.  It saves the result in memory if the command is a
builtin; otherwise it forks off a child to execute the command
and connects the standard output of the child to a pipe.

JOBS.C:  To create a process, you call makejob to return a job
structure, and then call forkshell (passing the job structure as
an argument) to create the process.  Waitforjob waits for a job
to complete.  These routines take care of process groups if job
control is defined.

REDIR.C:  Ash allows file descriptors to be redirected and then
restored without forking off a child process.  This is
accomplished by duplicating the original file descriptors.  The
redirtab structure records where the file descriptors have been
duplicated to.
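
The save-and-restore idea can be sketched with the POSIX calls
fcntl, dup2 and close.  The redirtab layout matches the MKINIT
example earlier in this paper; the EMPTY marker, the choice of
descriptor 10 as the lowest copy target, and the routine names
are assumptions for the example.

```c
#include <fcntl.h>
#include <unistd.h>

#define EMPTY (-1)            /* slot not in use (assumed convention) */

struct redirtab {
      short renamed[10];      /* where descriptors 0..9 were copied to */
};

void clear_redirtab(struct redirtab *rt)
{
      int i;

      for (i = 0; i < 10; i++)
            rt->renamed[i] = EMPTY;
}

/* Make fd refer to the same file as newfd, first saving a copy
   of the original.  Returns 0 on success, -1 on failure. */
int redirect_fd(struct redirtab *rt, int fd, int newfd)
{
      int copy = fcntl(fd, F_DUPFD, 10);  /* lowest free fd >= 10 */

      if (copy < 0)
            return -1;
      if (dup2(newfd, fd) < 0) {
            close(copy);
            return -1;
      }
      rt->renamed[fd] = (short)copy;
      return 0;
}

/* Undo redirect_fd for fd, if it was redirected. */
void restore_fd(struct redirtab *rt, int fd)
{
      int copy = rt->renamed[fd];

      if (copy == EMPTY)
            return;
      dup2(copy, fd);
      close(copy);
      rt->renamed[fd] = EMPTY;
}
```

This is what allows a builtin command to run with redirected
output inside the shell process and then have the shell's own
descriptors put back afterwards.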

EXEC.C:  The routine find_command locates a command, and enters
the command in the hash table if it is not already there.  The
third argument specifies whether it is to print an error message
if the command is not found.  (When a pipeline is set up,
find_command is called for all the commands in the pipeline
before any forking is done, so as to get the commands into the
hash table of the parent process.  But to make command hashing as
transparent as possible, we silently ignore errors at that point
and only print error messages if the command cannot be found
later.)

The routine shellexec is the interface to the exec system call.

EXPAND.C:  Arguments are processed in three passes.  The first
(performed by the routine argstr) performs variable and command
substitution.  The second (ifsbreakup) performs word splitting
and the third (expandmeta) performs file name generation.  If the
"/u" directory is simulated, then when "/u/username" is replaced
by the user's home directory, the flag "didudir" is set.  This
tells the cd command that it should print out the directory name,
just as it would if the "/u" directory were implemented using
symbolic links.

VAR.C:  Variables are stored in a hash table.  Probably we should
switch to extensible hashing.  The variable name is stored in the
same string as the value (using the format "name=value") so that
no string copying is needed to create the environment of a
command.  Variables which the shell references internally are
preallocated so that the shell can reference the values of these
variables without doing a lookup.
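
The benefit of storing "name=value" as one string shows up when
the environment is built:  the table's strings can be handed over
as they are.  The sketch below uses an assumed table size, hash
function and routine signatures, and the table borrows the
caller's strings rather than owning them.

```c
#include <stdlib.h>
#include <string.h>

#define VTABSIZE 39           /* assumed number of hash buckets */

struct var {
      struct var *next;
      const char *text;       /* "name=value", borrowed from caller */
};

static struct var *vartab[VTABSIZE];

/* Hash the name part only (up to '=' or end of string). */
static unsigned hashname(const char *p)
{
      unsigned h = 0;

      while (*p != '\0' && *p != '=')
            h = h * 31 + (unsigned char)*p++;
      return h % VTABSIZE;
}

/* True if p and q have the same name part. */
static int samename(const char *p, const char *q)
{
      while (*p == *q && *p != '=' && *p != '\0') {
            p++;
            q++;
      }
      return (*p == '=' || *p == '\0') && (*q == '=' || *q == '\0');
}

/* Set a variable from a "name=value" string. */
void setvareq(const char *text)
{
      struct var **vpp = &vartab[hashname(text)];
      struct var *vp;

      for (vp = *vpp; vp != NULL; vp = vp->next) {
            if (samename(vp->text, text)) {
                  vp->text = text;  /* replace, no copying */
                  return;
            }
      }
      vp = malloc(sizeof *vp);
      if (vp == NULL)
            abort();
      vp->text = text;
      vp->next = *vpp;
      *vpp = vp;
}

/* Return the value of name, or NULL if it is not set. */
const char *lookupvar(const char *name)
{
      struct var *vp;
      const char *eq;

      for (vp = vartab[hashname(name)]; vp != NULL; vp = vp->next) {
            if (samename(vp->text, name)) {
                  eq = strchr(vp->text, '=');
                  return eq != NULL ? eq + 1 : NULL;
            }
      }
      return NULL;
}

/* Fill envp (capacity max, including the NULL terminator) with
   pointers to the stored strings.  Returns the number stored. */
size_t environment(const char **envp, size_t max)
{
      size_t n = 0;
      int i;
      struct var *vp;

      for (i = 0; i < VTABSIZE; i++)
            for (vp = vartab[i]; vp != NULL; vp = vp->next)
                  if (n + 1 < max)
                        envp[n++] = vp->text;
      if (max > 0)
            envp[n] = NULL;
      return n;
}
```

Setting a variable that already exists replaces its string, which
is also why inserting "PATH=xxx" from a command prefix into the
table is an easy way to strip duplicates.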

When a program is run, the code in eval.c sticks any environment
variables which precede the command (as in "PATH=xxx command") in
the variable table as the simplest way to strip duplicates, and
then calls "environment" to get the value of the environment.
There are two consequences of this.  First, if an assignment to
PATH precedes the command, the value of PATH before the
assignment must be remembered and passed to shellexec.  Second,
if the program turns out to be a shell procedure, the strings
from the environment variables which preceded the command must be
pulled out of the table and replaced with strings obtained from
malloc, since the former will automatically be freed when the
stack (see the entry on memalloc.c) is emptied.

BUILTIN COMMANDS:  The procedures for handling these are
scattered throughout the code, depending on which location
appears most appropriate.  They can be recognized because their
names always end in "cmd".  The mapping from names to procedures
is specified in the file builtins, which is processed by the
mkbuiltins command.

A builtin command is invoked with argc and argv set up like a
normal program.  A builtin command is allowed to overwrite its
arguments.  Builtin routines can call nextopt to do option
parsing.  This is kind of like getopt, but you don't pass argc
and argv to it.  Builtin routines can also call error.  This
routine normally terminates the shell (or returns to the main
command loop if the shell is interactive), but when called from a
builtin command it causes the builtin command to terminate with
an exit status of 2.

The directory bltin contains commands which can be compiled
independently but can also be built into the shell for efficiency
reasons.  The makefile in this directory compiles these programs
in the normal fashion (so that they can be run regardless of
whether the invoker is ash), but also creates a library named
bltinlib.a which can be linked with ash.  The header file bltin.h
takes care of most of the differences between the ash and the
stand-alone environment.  The user should call the main routine
"main", and #define main to be the name of the routine to use
when the program is linked into ash.  This #define should appear
before bltin.h is included; bltin.h will #undef main if the
program is to be compiled stand-alone.

CD.C:  This file defines the cd and pwd builtins.  The pwd
command runs /bin/pwd the first time it is invoked (unless the
user has already done a cd to an absolute pathname), but then
remembers the current directory and updates it when the cd
command is run, so subsequent pwd commands run very fast.  The
main complication in the cd command is in the docd routine, which
resolves symbolic links into actual names and informs the user
where the user ended up if he crossed a symbolic link.

SIGNALS:  Trap.c implements the trap command.  The routine
setsignal figures out what action should be taken when a signal
is received and invokes the signal system call to set the signal
action appropriately.  When a signal that a user has set a trap
for is caught, the routine "onsig" sets a flag.  The routine
dotrap is called at appropriate points to actually handle the
signal.  When an interrupt is caught and no trap has been set for
that signal, the routine "onint" in error.c is called.
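
The set-a-flag-now, act-later pattern might be sketched as
follows.  The routine names onsig and dotrap come from the text;
settrap, the flag arrays, the MAXSIG limit and the idea of
returning a count are assumptions for the example, and the place
where a trap action would run is only marked by a comment.

```c
#include <signal.h>

#define MAXSIG 64             /* assumed upper bound on signal numbers */

static volatile sig_atomic_t gotsig[MAXSIG];    /* per-signal flags */
static volatile sig_atomic_t pendingsigs;       /* any flag set? */

/* Signal handler:  only records that the signal arrived. */
void onsig(int signo)
{
      if (signo >= 0 && signo < MAXSIG)
            gotsig[signo] = 1;
      pendingsigs = 1;
}

/* Arrange for signo to be caught (hypothetical helper). */
void settrap(int signo)
{
      signal(signo, onsig);
}

/* Called at safe points.  Handles every signal that has been
   flagged and returns how many were handled. */
int dotrap(void)
{
      int i, n = 0;

      if (!pendingsigs)
            return 0;
      pendingsigs = 0;
      for (i = 0; i < MAXSIG; i++) {
            if (gotsig[i]) {
                  gotsig[i] = 0;
                  n++;        /* the trap action would run here */
            }
      }
      return n;
}
```

Keeping the handler down to setting flags means the trap action
itself runs in ordinary context, where it can safely allocate
memory and execute commands.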

OUTPUT:  Ash uses its own output routines.  There are three
output structures allocated.  "Output" represents the standard
output, "errout" the standard error, and "memout" contains output
which is to be stored in memory.  This last is used when a
builtin command appears in backquotes, to allow its output to be
collected without doing any I/O through the UNIX operating
system.  The variables out1 and out2 normally point to output and
errout, respectively, but they are set to point to memout when
appropriate inside backquotes.

INPUT:  The basic input routine is pgetc, which reads from the
current input file.  There is a stack of input files; the current
input file is the top file on this stack.  The code allows the
input to come from a string rather than a file.  (This is for the
-c option and the "." and eval builtin commands.)  The global
variable plinno is saved and restored when files are pushed and
popped from the stack.  The parser routines store the number of
the current line in this variable.

DEBUGGING:  If DEBUG is defined in shell.h, then the shell will
write debugging information to the file $HOME/trace.  Most of
this is done using the TRACE macro, which takes a set of printf
arguments inside two sets of parentheses.  Example:
"TRACE(("n=%d\n", n))".  The double parentheses are necessary
because the preprocessor can't handle macros with a variable
number of arguments.  Defining DEBUG also causes the shell to
generate a core dump if it is sent a quit signal.  The tracing
code is in show.c.
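
The double-parenthesis trick works because the macro takes a
single argument, the whole parenthesized list, and pastes it
after a function name.  A minimal sketch, in which the trace
routine, the tracefile variable and the local #define of DEBUG
are assumptions (the real shell opens $HOME/trace itself):

```c
#include <stdarg.h>
#include <stdio.h>

#define DEBUG 1               /* normally set in shell.h, per the text */

static FILE *tracefile;       /* set by the caller in this sketch */

/* printf-style output to the trace file, if one is open. */
void trace(const char *fmt, ...)
{
      va_list ap;

      if (tracefile == NULL)
            return;
      va_start(ap, fmt);
      vfprintf(tracefile, fmt, ap);
      va_end(ap);
      fflush(tracefile);
}

#ifdef DEBUG
#define TRACE(args)     trace args
#else
#define TRACE(args)     /* expands to nothing */
#endif
```

With DEBUG defined, TRACE(("n=%d\n", n)) expands to
trace ("n=%d\n", n); without it, the whole call disappears and
costs nothing at run time.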