#	$NetBSD: TOUR,v 1.11 2016/10/25 13:01:59 abhinav Exp $
#	@(#)TOUR	8.1 (Berkeley) 5/31/93

NOTE -- This is the original TOUR paper distributed with ash and
does not represent the current state of the shell.  It is provided anyway
since it provides helpful information for how the shell is structured,
but be warned that things have changed -- the current shell is
still under development.

================================================================

                       A Tour through Ash

               Copyright 1989 by Kenneth Almquist.


DIRECTORIES:  The subdirectory bltin contains commands which can
be compiled stand-alone.  The rest of the source is in the main
ash directory.

SOURCE CODE GENERATORS:  Files whose names begin with "mk" are
programs that generate source code.  A complete list of these
programs is:

        program         input files         generates
        -------         ------------        ---------
        mkbuiltins      builtins            builtins.h builtins.c
        mkinit          *.c                 init.c
        mknodes         nodetypes           nodes.h nodes.c
        mksignames          -               signames.h signames.c
        mksyntax            -               syntax.h syntax.c
        mktokens            -               token.h
        bltin/mkexpr    unary_op binary_op  operators.h operators.c

There are undoubtedly too many of these.  Mkinit searches all the
C source files for entries looking like:

        INIT {
              x = 1;    /* executed during initialization */
        }

        RESET {
              x = 2;    /* executed when the shell does a longjmp
                           back to the main command loop */
        }

        SHELLPROC {
              x = 3;    /* executed when the shell runs a shell procedure */
        }

It pulls this code out into routines which are called when
particular events occur.  The intent is to improve modularity by
isolating the information about which modules need to be explicitly
initialized/reset within the modules themselves.

Mkinit recognizes several constructs for placing declarations in
the init.c file.
        INCLUDE "file.h"
includes a file.  The storage class MKINIT makes a declaration
available in the init.c file, for example:
        MKINIT int funcnest;    /* depth of function calls */
MKINIT alone on a line introduces a structure or union
declaration:
        MKINIT
        struct redirtab {
              short renamed[10];
        };
Preprocessor #define statements are copied to init.c without any
special action to request this.

INDENTATION:  The ash source is indented in multiples of six
spaces.  The only study that I have heard of on the subject
concluded that the optimal amount to indent is in the range of four
to six spaces.  I use six spaces since it is not too big a jump
from the widely used eight spaces.  If you really hate six space
indentation, use the adjind (source included) program to change
it to something else.

EXCEPTIONS:  Code for dealing with exceptions appears in
exceptions.c.  The C language doesn't include exception handling,
so I implement it using setjmp and longjmp.  The global variable
exception contains the type of exception.  EXERROR is raised by
calling error.  EXINT is an interrupt.  EXSHELLPROC is an
exception which is raised when a shell procedure is invoked.  The
purpose of EXSHELLPROC is to perform the cleanup actions associated
with other exceptions.  After these cleanup actions, the shell
can interpret a shell procedure itself without exec'ing a new
copy of the shell.
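
The setjmp/longjmp pattern described above can be sketched roughly
as follows.  This is a minimal illustration, not the shell's code:
the numeric values of the exception types, the handler pointer and
the names raise_exception and run_guarded are all assumptions made
for the example.

```c
#include <setjmp.h>
#include <stddef.h>

/* Exception types named in the text; the numeric values are assumed. */
enum { EXINT, EXERROR, EXSHELLPROC };

static int exception;            /* type of the exception being raised */
static jmp_buf *handler;         /* innermost handler (hypothetical name) */

/* Record the exception type and unwind to the innermost handler. */
void raise_exception(int type)
{
      exception = type;
      longjmp(*handler, 1);
}

/* Two stand-in commands: one completes, one fails part-way through. */
void ok_command(void) { }
void failing_command(void) { raise_exception(EXERROR); }

/* Run cmd; return -1 if it completed, else the exception type. */
int run_guarded(void (*cmd)(void))
{
      jmp_buf here;
      jmp_buf *saved = handler;

      if (setjmp(here) != 0) {
            handler = saved;     /* pop our handler before returning */
            return exception;
      }
      handler = &here;
      cmd();
      handler = saved;
      return -1;
}
```

A main command loop built this way would call something like
run_guarded once per command and carry on with the next command
whenever an exception type comes back.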

INTERRUPTS:  In an interactive shell, an interrupt will cause an
EXINT exception to return to the main command loop.  (Exception:
EXINT is not raised if the user traps interrupts using the trap
command.)  The INTOFF and INTON macros (defined in exception.h)
provide uninterruptible critical sections.  Between the execution
of INTOFF and the execution of INTON, interrupt signals will be
held for later delivery.  INTOFF and INTON can be nested.
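
One plausible way to get nestable critical sections is a depth
counter plus a pending flag, sketched below.  The variable names
and the body of onint are assumptions for illustration; in
particular, onint here only counts deliveries, where the shell
would raise EXINT.

```c
#include <signal.h>

static volatile sig_atomic_t suppressint;   /* nesting depth of INTOFF */
static volatile sig_atomic_t intpending;    /* interrupt arrived while held */
static volatile sig_atomic_t delivered;     /* interrupts acted upon */

/* Act on an interrupt.  This stand-in only counts it. */
void onint(void)
{
      intpending = 0;
      delivered++;
}

/* Enter a critical section; may be nested. */
#define INTOFF  (suppressint++)

/* Leave a critical section; deliver a held interrupt at depth zero. */
#define INTON   do { \
            if (--suppressint == 0 && intpending) \
                  onint(); \
      } while (0)

/* Signal handler: hold the interrupt if inside a critical section. */
void onsigint(int signo)
{
      (void)signo;
      if (suppressint) {
            intpending = 1;
            return;
      }
      onint();
}
```

With this arrangement an interrupt that arrives inside two nested
INTOFF sections is acted upon only when the outer INTON runs.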

MEMALLOC.C:  Memalloc.c defines versions of malloc and realloc
which call error when there is no memory left.  It also defines a
stack oriented memory allocation scheme.  Allocating off a stack
is probably more efficient than allocation using malloc, but the
big advantage is that when an exception occurs all we have to do
to free up the memory in use at the time of the exception is to
restore the stack pointer.  The stack is implemented using a
linked list of blocks.
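
A stack built from a linked list of blocks, with a mark that can
be restored later, might look like the sketch below.  The block
size, the structure layouts and the function names are assumptions
for the example, and it returns NULL on failure where the text
says the shell calls error.

```c
#include <stdlib.h>
#include <stddef.h>

#define BLOCKSIZE 512            /* assumed payload size of one block */

struct stack_block {
      struct stack_block *prev;
      char space[BLOCKSIZE];
};

struct stackmark {               /* snapshot of the allocator state */
      struct stack_block *block;
      size_t used;
};

static struct stack_block *curblock;   /* top block, NULL when empty */
static size_t curused;                 /* bytes used in the top block */

/* Allocate n bytes from the stack; NULL if n is too big or memory is out. */
void *stalloc(size_t n)
{
      void *p;

      /* Round up so every allocation stays pointer-aligned. */
      n = (n + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
      if (n > BLOCKSIZE)
            return NULL;
      if (curblock == NULL || curused + n > BLOCKSIZE) {
            struct stack_block *b = malloc(sizeof *b);

            if (b == NULL)
                  return NULL;
            b->prev = curblock;
            curblock = b;
            curused = 0;
      }
      p = curblock->space + curused;
      curused += n;
      return p;
}

/* Remember the current top of the stack. */
void setstackmark(struct stackmark *m)
{
      m->block = curblock;
      m->used = curused;
}

/* Free everything allocated since the mark was set. */
void popstackmark(const struct stackmark *m)
{
      while (curblock != m->block) {
            struct stack_block *b = curblock;

            curblock = b->prev;
            free(b);
      }
      curused = m->used;
}
```

An exception handler then needs only one popstackmark call to
release whatever the interrupted command had allocated.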

STPUTC:  If the stack were contiguous, it would be easy to store
strings on the stack without knowing in advance how long the
string was going to be:
        p = stackptr;
        *p++ = c;       /* repeated as many times as needed */
        stackptr = p;
The following three macros (defined in memalloc.h) perform these
operations, but grow the stack if you run off the end:
        STARTSTACKSTR(p);
        STPUTC(c, p);   /* repeated as many times as needed */
        grabstackstr(p);
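
The grow-on-demand idea can be illustrated with a single buffer
that is enlarged with realloc, which stands in here for the block
stack.  The names startstr, growstr and PUTC are hypothetical and
are not the macros from memalloc.h; the point is only that the
put operation may move the string, so it has to update p.

```c
#include <stdlib.h>

static char *strbuf;             /* start of the string being built */
static size_t strcap;            /* bytes available in strbuf */

/* Begin a new string; returns the position for the first byte. */
char *startstr(void)
{
      if (strbuf == NULL) {
            strcap = 16;
            strbuf = malloc(strcap);
            if (strbuf == NULL)
                  abort();       /* the shell would call error instead */
      }
      return strbuf;
}

/* Double the buffer; the string may move, so return the new p. */
char *growstr(char *p)
{
      size_t len = (size_t)(p - strbuf);
      char *nb = realloc(strbuf, strcap * 2);

      if (nb == NULL)
            abort();
      strbuf = nb;
      strcap *= 2;
      return nb + len;
}

/* Store c at p, growing first if p has reached the end. */
#define PUTC(c, p) do { \
            if ((size_t)((p) - strbuf) == strcap) \
                  (p) = growstr(p); \
            *(p)++ = (c); \
      } while (0)
```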

We now start a top-down look at the code:

MAIN.C:  The main routine performs some initialization, executes
the user's profile if necessary, and calls cmdloop.  Cmdloop
repeatedly parses and executes commands.

OPTIONS.C:  This file contains the option processing code.  It is
called from main to parse the shell arguments when the shell is
invoked, and it also contains the set builtin.  The -i and -j
options (the latter turns on job control) require changes in signal
handling.  The routines setjobctl (in jobs.c) and setinteractive
(in trap.c) are called to handle changes to these options.

PARSING:  The parser code is all in parser.c.  A recursive
descent parser is used.  Syntax tables (generated by mksyntax) are
used to classify characters during lexical analysis.  There are
three tables:  one for normal use, one for use when inside single
quotes, and one for use when inside double quotes.  The tables
are machine dependent because they are indexed by character
variables and the range of a char varies from machine to machine.

PARSE OUTPUT:  The output of the parser consists of a tree of
nodes.  The various types of nodes are defined in the file
nodetypes.

Nodes of type NARG are used to represent both words and the
contents of here documents.  An early version of ash kept the
contents of here documents in temporary files, but keeping here
documents in memory typically results in significantly better
performance.  It would have been nice to make it an option to use
temporary files for here documents, for the benefit of small
machines, but the code to keep track of when to delete the
temporary files was complex and I never fixed all the bugs in it.
(AT&T has been maintaining the Bourne shell for more than ten
years, and to the best of my knowledge they still haven't gotten
it to handle temporary files correctly in obscure cases.)

The text field of a NARG structure points to the text of the
word.  The text consists of ordinary characters and a number of
special codes defined in parser.h.  The special codes are:

        CTLVAR              Variable substitution
        CTLENDVAR           End of variable substitution
        CTLBACKQ            Command substitution
        CTLBACKQ|CTLQUOTE   Command substitution inside double quotes
        CTLESC              Escape next character

A variable substitution contains the following elements:

        CTLVAR type name '=' [ alternative-text CTLENDVAR ]

The type field is a single character specifying the type of
substitution.  The possible types are:

        VSNORMAL            $var
        VSMINUS             ${var-text}
        VSMINUS|VSNUL       ${var:-text}
        VSPLUS              ${var+text}
        VSPLUS|VSNUL        ${var:+text}
        VSQUESTION          ${var?text}
        VSQUESTION|VSNUL    ${var:?text}
        VSASSIGN            ${var=text}
        VSASSIGN|VSNUL      ${var:=text}

In addition, the type field will have the VSQUOTE flag set if the
variable is enclosed in double quotes.  The name of the variable
comes next, terminated by an equals sign.  If the type is not
VSNORMAL, then the text field in the substitution follows,
terminated by a CTLENDVAR byte.
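
The byte layout described above can be made concrete with a small
encoder.  This is only a sketch: the numeric values of CTLVAR,
CTLENDVAR and the VS* constants are invented for the example (the
real ones are in parser.h), VSTYPE and encode_var are hypothetical
names, and CTLESC escaping of the text is ignored.

```c
#include <stddef.h>
#include <string.h>

/* Byte values are assumed for illustration only. */
#define CTLVAR     0x82
#define CTLENDVAR  0x83
#define VSTYPE     0x0f          /* mask selecting the basic type */
#define VSNUL      0x10          /* set for the ':' variants */
#define VSNORMAL   1
#define VSMINUS    2

/* Encode one substitution into out; returns the bytes written. */
size_t encode_var(unsigned char *out, int type, const char *name,
                  const char *alt)
{
      unsigned char *p = out;

      *p++ = CTLVAR;
      *p++ = (unsigned char)type;
      memcpy(p, name, strlen(name));
      p += strlen(name);
      *p++ = '=';
      if ((type & VSTYPE) != VSNORMAL) {
            /* Every type but VSNORMAL carries alternative text. */
            memcpy(p, alt, strlen(alt));
            p += strlen(alt);
            *p++ = CTLENDVAR;
      }
      return (size_t)(p - out);
}
```

Under these assumptions ${HOME:-/tmp} becomes CTLVAR, the type
byte VSMINUS|VSNUL, the characters "HOME=/tmp", and CTLENDVAR.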

Commands in back quotes are parsed and stored in a linked list.
The locations of these commands in the string are indicated by
CTLBACKQ and CTLBACKQ|CTLQUOTE characters, depending upon whether
the back quotes were enclosed in double quotes.

The character CTLESC escapes the next character, so that in case
any of the CTL characters mentioned above appear in the input,
they can be passed through transparently.  CTLESC is also used to
escape '*', '?', '[', and '!' characters which were quoted by the
user and thus should not be used for file name generation.

CTLESC characters have proved to be particularly tricky to get
right.  In the case of here documents which are not subject to
variable and command substitution, the parser doesn't insert any
CTLESC characters to begin with (so the contents of the text
field can be written without any processing).  Other here
documents, and words which are not subject to splitting and file
name generation, have the CTLESC characters removed during the
variable and command substitution phase.  Words which are subject
to splitting and file name generation have the CTLESC characters
removed as part of the file name phase.

EXECUTION:  Command execution is handled by the following files:
        eval.c     The top level routines.
        redir.c    Code to handle redirection of input and output.
        jobs.c     Code to handle forking, waiting, and job control.
        exec.c     Code to do path searches and the actual exec sys call.
        expand.c   Code to evaluate arguments.
        var.c      Maintains the variable symbol table.  Called from expand.c.

EVAL.C:  Evaltree recursively executes a parse tree.  The exit
status is returned in the global variable exitstatus.  The
alternative entry evalbackcmd is called to evaluate commands in back
quotes.  It saves the result in memory if the command is a
builtin; otherwise it forks off a child to execute the command and
connects the standard output of the child to a pipe.

JOBS.C:  To create a process, you call makejob to return a job
structure, and then call forkshell (passing the job structure as
an argument) to create the process.  Waitforjob waits for a job
to complete.  These routines take care of process groups if job
control is defined.

REDIR.C:  Ash allows file descriptors to be redirected and then
restored without forking off a child process.  This is
accomplished by duplicating the original file descriptors.  The
redirtab structure records where the file descriptors have been
duplicated to.
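
Redirecting and then restoring a descriptor in the same process
can be sketched with the standard dup2 and fcntl calls.  The
function names are hypothetical, only output redirection to a
file is handled, and duplicating to descriptor 10 or above is an
assumption made for the example.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Point fd at path and return a duplicate of the original
 * descriptor, or -1 on failure.  Duplicating to 10 or above
 * keeps the low descriptors free (an assumption here).
 */
int redirect_save(int fd, const char *path)
{
      int saved, newfd;

      saved = fcntl(fd, F_DUPFD, 10);
      if (saved < 0)
            return -1;
      newfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
      if (newfd < 0 || dup2(newfd, fd) < 0) {
            if (newfd >= 0)
                  close(newfd);
            close(saved);
            return -1;
      }
      if (newfd != fd)
            close(newfd);
      return saved;
}

/* Undo redirect_save: put the original back and drop the copy. */
void redirect_restore(int fd, int saved)
{
      dup2(saved, fd);
      close(saved);
}
```

The value returned by redirect_save plays the role of one entry
in a table like redirtab: it is all that needs to be remembered
in order to undo the redirection.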

EXEC.C:  The routine find_command locates a command, and enters
the command in the hash table if it is not already there.  The
third argument specifies whether it is to print an error message
if the command is not found.  (When a pipeline is set up,
find_command is called for all the commands in the pipeline
before any forking is done, so as to get the commands into the
hash table of the parent process.  But to make command hashing as
transparent as possible, we silently ignore errors at that point
and only print error messages if the command cannot be found
later.)

The routine shellexec is the interface to the exec system call.

EXPAND.C:  Arguments are processed in three passes.  The first
(performed by the routine argstr) performs variable and command
substitution.  The second (ifsbreakup) performs word splitting
and the third (expandmeta) performs file name generation.  If the
"/u" directory is simulated, then when "/u/username" is replaced
by the user's home directory, the flag "didudir" is set.  This
tells the cd command that it should print out the directory name,
just as it would if the "/u" directory were implemented using
symbolic links.

VAR.C:  Variables are stored in a hash table.  Probably we should
switch to extensible hashing.  The variable name is stored in the
same string as the value (using the format "name=value") so that
no string copying is needed to create the environment of a
command.  Variables which the shell references internally are
preallocated so that the shell can reference the values of these
variables without doing a lookup.
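
The benefit of keeping each variable as one "name=value" string
shows up in a small sketch of such a table.  The table size, the
hash function and every function name below are assumptions for
the example; preallocation of internally used variables is left
out.

```c
#include <stdlib.h>
#include <string.h>

#define VTABSIZE 39              /* assumed number of hash buckets */

struct var {
      struct var *next;
      char *text;                /* "name=value", stored by pointer */
};

static struct var *vartab[VTABSIZE];

/* Hash the name part only, stopping at '=' or end of string. */
static unsigned hashname(const char *s)
{
      unsigned h = 0;

      while (*s != '\0' && *s != '=')
            h = h * 31 + (unsigned char)*s++;
      return h % VTABSIZE;
}

/* True if text ("name=value") has the same name as name. */
static int samename(const char *text, const char *name)
{
      while (*name != '\0' && *name != '=' && *text == *name) {
            text++;
            name++;
      }
      return (*name == '\0' || *name == '=') && *text == '=';
}

/* Find the entry for name, or NULL. */
static struct var *findvar(const char *name)
{
      struct var *v;

      for (v = vartab[hashname(name)]; v != NULL; v = v->next)
            if (samename(v->text, name))
                  return v;
      return NULL;
}

/* Enter a "name=value" string; returns 0, or -1 if out of memory. */
int setvareq(char *text)
{
      struct var *v = findvar(text);

      if (v == NULL) {
            unsigned h = hashname(text);

            if ((v = malloc(sizeof *v)) == NULL)
                  return -1;
            v->next = vartab[h];
            vartab[h] = v;
      }
      v->text = text;
      return 0;
}

/* Return a pointer to the value of name, or NULL if it is unset. */
char *lookupvar(const char *name)
{
      struct var *v = findvar(name);

      return v != NULL ? strchr(v->text, '=') + 1 : NULL;
}

/* Fill envp (room for max pointers) and NULL-terminate it. */
size_t environment(char **envp, size_t max)
{
      size_t n = 0;
      int i;
      struct var *v;

      for (i = 0; i < VTABSIZE; i++)
            for (v = vartab[i]; v != NULL && n + 1 < max; v = v->next)
                  envp[n++] = v->text;
      envp[n] = NULL;
      return n;
}
```

Note that environment only collects pointers: the strings already
have the form that exec expects, so nothing is copied.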

When a program is run, the code in eval.c sticks any environment
variables which precede the command (as in "PATH=xxx command") in
the variable table as the simplest way to strip duplicates, and
then calls "environment" to get the value of the environment.
There are two consequences of this.  First, if an assignment to
PATH precedes the command, the value of PATH before the
assignment must be remembered and passed to shellexec.  Second, if
the program turns out to be a shell procedure, the strings from the
environment variables which preceded the command must be pulled
out of the table and replaced with strings obtained from malloc,
since the former will automatically be freed when the stack (see
the entry on memalloc.c) is emptied.

BUILTIN COMMANDS:  The procedures for handling these are
scattered throughout the code, depending on which location appears
most appropriate.  They can be recognized because their names
always end in "cmd".  The mapping from names to procedures is
specified in the file builtins, which is processed by the
mkbuiltins command.

A builtin command is invoked with argc and argv set up like a
normal program.  A builtin command is allowed to overwrite its
arguments.  Builtin routines can call nextopt to do option
parsing.  This is kind of like getopt, but you don't pass argc and
argv to it.  Builtin routines can also call error.  This routine
normally terminates the shell (or returns to the main command
loop if the shell is interactive), but when called from a builtin
command it causes the builtin command to terminate with an exit
status of 2.

The directory bltin contains commands which can be compiled
independently but can also be built into the shell for efficiency
reasons.  The makefile in this directory compiles these programs
in the normal fashion (so that they can be run regardless of
whether the invoker is ash), but also creates a library named
bltinlib.a which can be linked with ash.  The header file bltin.h
takes care of most of the differences between the ash and the
stand-alone environment.  The user should call the main routine
"main", and #define main to be the name of the routine to use
when the program is linked into ash.  This #define should appear
before bltin.h is included; bltin.h will #undef main if the
program is to be compiled stand-alone.

CD.C:  This file defines the cd and pwd builtins.  The pwd
command runs /bin/pwd the first time it is invoked (unless the user
has already done a cd to an absolute pathname), but then
remembers the current directory and updates it when the cd
command is run, so subsequent pwd commands run very fast.  The main
complication in the cd command is in the docd routine, which
resolves symbolic links into actual names and informs the user
where they ended up if a symbolic link was crossed.

SIGNALS:  Trap.c implements the trap command.  The routine
setsignal figures out what action should be taken when a signal is
received and invokes the signal system call to set the signal
action appropriately.  When a signal that a user has set a trap for
is caught, the routine "onsig" sets a flag.  The routine dotrap
is called at appropriate points to actually handle the signal.
When an interrupt is caught and no trap has been set for that
signal, the routine "onint" in error.c is called.
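
The split between a handler that only sets a flag and a routine
that acts on the flag later can be sketched as follows.  The
array names, the MAXSIG bound and the counter that stands in for
running the user's trap action are assumptions for the example.

```c
#include <signal.h>

#define MAXSIG 64                /* assumed bound on signal numbers */

static volatile sig_atomic_t gotsig[MAXSIG];   /* per-signal flags */
static volatile sig_atomic_t pendingsigs;      /* is any flag set? */
static int trapsrun[MAXSIG];     /* stand-in for running the trap text */

/* Signal handler: only note that the signal arrived. */
void onsig(int signo)
{
      if (signo > 0 && signo < MAXSIG) {
            gotsig[signo] = 1;
            pendingsigs = 1;
      }
}

/* Called at safe points: run the action for each noted signal. */
void dotrap(void)
{
      int i;

      while (pendingsigs) {
            pendingsigs = 0;
            for (i = 1; i < MAXSIG; i++) {
                  if (gotsig[i]) {
                        gotsig[i] = 0;
                        trapsrun[i]++;
                  }
            }
      }
}
```

Deferring the work this way keeps the handler itself trivial, so
the trap action never runs in the middle of some other operation.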

OUTPUT:  Ash uses its own output routines.  There are three
output structures allocated.  "Output" represents the standard
output, "errout" the standard error, and "memout" contains output
which is to be stored in memory.  This last is used when a
builtin command appears in backquotes, to allow its output to be
collected without doing any I/O through the UNIX operating system.
The variables out1 and out2 normally point to output and errout,
respectively, but they are set to point to memout when
appropriate inside backquotes.

INPUT:  The basic input routine is pgetc, which reads from the
current input file.  There is a stack of input files; the current
input file is the top file on this stack.  The code allows the
input to come from a string rather than a file.  (This is for the
-c option and the "." and eval builtin commands.)  The global
variable plinno is saved and restored when files are pushed and
popped from the stack.  The parser routines store the number of
the current line in this variable.

DEBUGGING:  If DEBUG is defined in shell.h, then the shell will
write debugging information to the file $HOME/trace.  Most of
this is done using the TRACE macro, which takes a set of printf
arguments inside two sets of parentheses.  Example:
"TRACE(("n=%d\n", n))".  The double parentheses are necessary
because the preprocessor can't handle functions with a variable
number of arguments.  Defining DEBUG also causes the shell to
generate a core dump if it is sent a quit signal.  The tracing
code is in show.c.
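
The double-parenthesis trick can be shown with a short sketch.
The tracefile variable and the body of trace are assumptions for
the example; DEBUG is defined inline here only so that the
fragment stands alone.

```c
#include <stdarg.h>
#include <stdio.h>

#define DEBUG 1                  /* normally set in shell.h, per the text */

static FILE *tracefile;          /* trace destination; NULL disables */

/* printf-style output to the trace file. */
void trace(const char *fmt, ...)
{
      va_list ap;

      if (tracefile == NULL)
            return;
      va_start(ap, fmt);
      vfprintf(tracefile, fmt, ap);
      va_end(ap);
      fflush(tracefile);
}

/*
 * The macro takes ONE argument: the whole parenthesized argument
 * list.  Pasting it after the name trace yields an ordinary call.
 */
#ifdef DEBUG
#define TRACE(args)  trace args
#else
#define TRACE(args)
#endif
```

With DEBUG defined, TRACE(("n=%d\n", n)) expands to a call of
trace with those arguments; without it, the whole statement
disappears.  C99 later added variadic macros, which make the
extra parentheses unnecessary in newer code.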