xref: /plan9-contrib/sys/doc/sam/sam.ms (revision 219b2ee8daee37f4aad58d63f21287faa8e4ffdc)
1.Vx 17 11 November 87 1 32 "ROB PIKE" "THE TEXT EDITOR SAM"
2.ds DY "31 May 1987
3.ds DR "Revised 1 July 1987
4.de CW		\" puts first arg in CW font, same as UL; maintains font
5\%\&\\$3\f(CW\\$1\fP\&\\$2
6..
7.de Cs
8.br
9.fi
10.ft 2
11.ps -2
12.vs -2
13..
14.de Ce
15.br
16.nf
17.ft 1
18.ps
19.vs
20.sp
21..
22.TL
23The Text Editor \&\f(CWsam\fP
24.AU
25Rob Pike
26rob@plan9.att.com
27.AB
28.LP
29.CW Sam
30is an interactive multi-file text editor intended for
31bitmap displays.
32A textual command language
33supplements the mouse-driven, cut-and-paste interface
34to make complex or
35repetitive editing tasks easy to specify.
36The language is characterized by the composition of regular expressions
37to describe the structure of the text being modified.
38The treatment of files as a database, with changes logged
39as atomic transactions, guides the implementation and
40makes a general `undo' mechanism straightforward.
41.PP
42.CW Sam
43is implemented as two processes connected by a low-bandwidth stream,
44one process handling the display and the other the editing
45algorithms.  Therefore it can run with the display process
46in a bitmap terminal and the editor on a local host,
47with both processes on a bitmap-equipped host, or with
48the display process in the terminal and the editor in a
49remote host.
50By suppressing the display process,
51it can even run without a bitmap terminal.
52.PP
53This paper is reprinted from Software\(emPractice and Experience,
54Vol 17, number 11, pp. 813-845, November 1987.
55The paper has not been updated for the Plan 9 manuals.  Although
56.CW Sam
57has not changed much since the paper was written, the system around it certainly has.
58Nonetheless, the description here still stands as the best introduction to the editor.
59.AE
60.SH
61Introduction
62.LP
63.CW Sam
64is an interactive text editor that combines cut-and-paste interactive editing with
65an unusual command language based on the composition of regular expressions.
66It is written as two programs: one, the `host part,' runs on a UNIX system
67and implements the command language and provides file access; the other, the
68`terminal part,' runs asynchronously
69on a machine with a mouse and bitmap display
70and supports the display and interactive editing.
71The host part may be even run in isolation on an ordinary terminal
72to edit text using the command
73language, much like a traditional line editor,
74without assistance from a mouse or display.
75Most often,
76the terminal part runs on a Blit\u\s-4\&1\s+4\d terminal
77(actually on a Teletype DMD 5620, the production version of the Blit), whose
78host connection is an ordinary 9600 bps RS232 link;
79on the SUN computer the host and display processes run on a single machine,
80connected by a pipe.
81.PP
82.CW Sam
83edits uninterpreted
84ASCII text.
85It has no facilities for multiple fonts, graphics or tables,
86unlike MacWrite,\u\s-4\&2\s+4\d Bravo,\u\s-4\&3\s+4\d Tioga\u\s-4\&4\s+4\d
87or Lara.\u\s-4\&5\s+4\d
88Also unlike them, it has a rich command language.
89(Throughout this paper, the phrase
90.I
91command language
92.R
93refers to
94textual commands; commands activated from the mouse form the
95.I mouse
96.I language. )
97.CW Sam
98developed as an editor for use by programmers, and tries to join
99the styles of the UNIX text editor
100.CW ed \u\s-4\&6,7\s+4\d
101with that of interactive cut-and-paste editors by
102providing a comfortable mouse-driven interface
103to a program with a solid command language driven by regular expressions.
104The command language developed more than the mouse language, and
105acquired a notation for describing the structure of files
106more richly than as a sequence of lines,
107using a dataflow-like syntax for specifying changes.
108.PP
109The interactive style was influenced by
110.CW jim ,\u\s-4\&1\s+4\d
111an early cut-and-paste editor for the Blit, and by
112.CW mux ,\u\s-4\&8\s+4\d
113the Blit window system.
114.CW Mux
115merges the original Blit window system,
116.CW mpx ,\u\s-4\&1\s+4\d
117with cut-and-paste editing, forming something like a
118multiplexed version of
119.CW jim
120that edits the output of (and input to) command sessions rather than files.
121.PP
122The first part of this paper describes the command language, then the mouse
123language, and explains how they interact.
124That is followed by a description of the implementation,
125first of the host part, then of the terminal part.
126A principle that influenced the design of
127.CW sam
128is that it should have no explicit limits, such as upper limits on
129file size or line length.
130A secondary consideration is that it be efficient.
131To honor these two goals together requires a method for efficiently
132manipulating
133huge strings (files) without breaking them into lines,
134perhaps while making thousands of changes
135under control of the command language.
136.CW Sam 's
137method is to
138treat the file as a transaction database, implementing changes as atomic
139updates.  These updates may be unwound easily to `undo' changes.
140Efficiency is achieved through a collection of caches that minimizes
141disc traffic and data motion, both within the two parts of the program
142and between them.
143.PP
144The terminal part of
145.CW sam
146is fairly straightforward.
147More interesting is how the two halves of the editor stay
148synchronized when either half may initiate a change.
149This is achieved through a data structure that organizes the
150communications and is maintained in parallel by both halves.
151.PP
152The last part of the paper chronicles the writing of
153.CW sam
154and discusses the lessons that were learned through its development and use.
155.PP
156The paper is long, but is composed largely of two papers of reasonable length:
157a description of the user interface of
158.CW sam
159and a discussion of its implementation.
160They are combined because the implementation is strongly influenced by
161the user interface, and vice versa.
162.SH
163The Interface
164.LP
165.CW Sam
166is a text editor for multiple files.
167File names may be provided when it is invoked:
168.P1
169sam file1 file2 ...
170.P2
171and there are commands
172to add new files and discard unneeded ones.
173Files are not read until necessary
174to complete some command.
175Editing operations apply to an internal copy
176made when the file is read; the UNIX file associated with the copy
177is changed only by an explicit command.
178To simplify the discussion, the internal copy is here called a
179.I file ,
180while the disc-resident original is called a
181.I
182disc file.
183.R
184.PP
185.CW Sam
186is usually connected to a bitmap display that presents a cut-and-paste
187editor driven by the mouse.
188In this mode, the command language is still available:
189text typed in a special window, called the
190.CW sam
191.I window,
192is interpreted
193as commands to be executed in the current file.
194Cut-and-paste editing may be used in any window \(em even in the
195.CW sam
196window to construct commands.
197The other mode of operation, invoked by starting
198.CW sam
199with the option
200.CW -d
201(for `no download'),
202does not use the mouse or bitmap display, but still permits
203editing using the textual command language, even on an ordinary terminal,
204interactively or from a script.
205.PP
206The following sections describe first the command language (under
207.CW sam\ -d
208and in the
209.CW sam
210window), and then the mouse interface.
211These two languages are nearly independent, but connect through the
212.I current
213.I text,
214described below.
215.SH 2
216The Command Language
217.LP
218A file consists of its contents, which are an array of characters
219(that is, a string); the
220.I name
221of the associated disc file; the
222.I
223modified bit
224.R
225that states whether the contents match those of
226the disc file;
227and a substring of the contents, called the
228.I
229current text
230.R
231or
232.I dot
233(see Figures 1 and 2).
234If the current text is a null string, dot falls between characters.
235The
236.I value
237of dot is the location of the current text; the
238.I contents
239of dot are the characters it contains.
240.CW Sam
241imparts to the text no two-dimensional interpretation such as columns
242or fields; text is always one-dimensional.
243Even the idea of a `line' of text as understood by most UNIX programs
244\(em a sequence of characters terminated by a newline character \(em
245is only weakly supported.
246.PP
247The
248.I
249current file
250.R
251is the file to which editing commands refer.
252The current text is therefore dot in the current file.
253If a command doesn't explicitly name a particular file or piece of text,
254the command is assumed to apply to the current text.
255For the moment, ignore the presence of multiple files and consider
256editing a single file.
257.KF L
258.BP fig1.ps 3.5i
259.Cs
260Figure 1. A typical
261.CW sam
262screen, with the editing menu presented.
263The
264.CW sam
265(command language) window is in the middle, with file windows above and below.
266(The user interface makes it easy to create these abutting windows.)
267The partially obscured window is a third file window.
268The uppermost window is that to which typing and mouse operations apply,
269as indicated by its heavy border.
270Each window has its current text highlighted in reverse video.
271The
272.CW sam
273window's current text is the null string on the last visible line,
274indicated by a vertical bar.
275See also Figure 2.
276.Ce
277.KE
278.PP
279Commands have one-letter names.
280Except for non-editing commands such as writing
281the file to disc, most commands make some change
282to the text in dot and leave dot set to the text resulting from the change.
283For example, the delete command,
284.CW d ,
285deletes the text in dot, replacing it by the null string and setting dot
286to the result.
287The change command,
288.CW c ,
289replaces dot by text delimited by an arbitrary punctuation character,
290conventionally
291a slash.  Thus,
292.P1
293c/Peter/
294.P2
295replaces the text in dot by the string
296.CW Peter .
297Similarly,
298.P1
299a/Peter/
300.P2
301(append) adds the string after dot, and
302.P1
303i/Peter/
304.P2
305(insert) inserts before dot.
306All three leave dot set to the new text,
307.CW Peter .
308.PP
309Newlines are part of the syntax of commands:
310the newline character lexically terminates a command.
311Within the inserted text, however, newlines are never implicit.
312But since it is often convenient to insert multiple lines of text,
313.CW sam
314has a special
315syntax for that case:
316.P1
317a
318some lines of text
319to be inserted in the file,
320terminated by a period
321on a line by itself
322\&.
323.P2
324In the one-line syntax, a newline character may be specified by a C-like
325escape, so
326.P1
327c/\en/
328.P2
329replaces dot by a single newline character.
330.PP
331.CW Sam
332also has a substitute command,
333.CW s :
334.P1
335s/\f2expression\fP/\f2replacement\fP/
336.P2
337substitutes the replacement text for the first match, in dot,
338of the regular expression.
339Thus, if dot is the string
340.CW Peter ,
341the command
342.P1
343s/t/st/
344.P2
345changes it to
346.CW Pester .
347In general,
348.CW s
349is unnecessary, but it was inherited from
350.CW ed
351and it has some convenient variations.
352For instance, the replacement text may include the matched text,
353specified by
354.CW & :
355.P1
356s/Peter/Oh, &, &, &, &!/
357.P2
358.PP
359There are also three commands that apply programs
360to text:
361.P1
362< \f2UNIX program\fP
363.P2
364replaces dot by the output of the UNIX program.
365Similarly, the
366.CW >
367command
368runs the program with dot as its standard input, and
369.CW |
370does both.  For example,
371.P1
372| sort
373.P2
374replaces dot by the result of applying the standard sorting utility to it.
375Again, newlines have no special significance for these
376.CW sam
377commands.
378The text acted upon and resulting from these commands is not necessarily
379bounded by newlines, although for connection with UNIX programs,
380newlines may be necessary to obey conventions.
381.PP
382One more command:
383.CW p
384prints the contents of dot.
385Table I summarizes
386.CW sam 's
387commands.
388.KF
389.TS
390center;
391c s
392lfCW l.
393Table I. \f(CWSam\fP commands
394.sp .4
395.ft CW
396_
397.ft
398.sp .4
399\f1Text commands\fP
400.sp .4
401_
402.sp .4
403a/\f2text\fP/	Append text after dot
404c/\f2text\fP/	Change text in dot
405i/\f2text\fP/	Insert text before dot
406d	Delete text in dot
407s/\f2regexp\fP/\f2text\fP/	Substitute text for match of regular expression in dot
408m \f2address\fP	Move text in dot after address
409t \f2address\fP	Copy text in dot after address
410.sp .4
411_
412.sp .4
413\f1Display commands\fP
414.sp .4
415_
416.sp .2
417p	Print contents of dot
418\&=	Print value (line numbers and character numbers) of dot
419.sp .4
420_
421.sp .4
422\f1File commands\fP
423.sp .4
424_
425.sp .2
426b \f2file-list\fP	Set current file to first file in list that \f(CWsam\fP has in menu
427B \f2file-list\fP	Same as \f(CWb\fP, but load new files
428n	Print menu lines of all files
429D \f2file-list\fP	Delete named files from \f(CWsam\fP
430.sp .4
431_
432.sp .4
433\f1I/O commands\fP
434.sp .4
435_
436.sp .2
437e \f2filename\fP	Replace file with named disc file
438r \f2filename\fP	Replace dot by contents of named disc file
439w \f2filename\fP	Write file to named disc file
440f \f2filename\fP	Set file name and print new menu line
441< \f2UNIX-command\fP	Replace dot by standard output of command
442> \f2UNIX-command\fP	Send dot to standard input of command
443| \f2UNIX-command\fP	Replace dot by result of command applied to dot
444! \f2UNIX-command\fP	Run the command
445.sp .4
446_
447.sp .4
448\f1Loops and conditionals\fP
449.sp .4
450_
451.sp .2
452x/\f2regexp\fP/ \f2command\fP	For each match of regexp, set dot and run command
453y/\f2regexp\fP/ \f2command\fP	Between adjacent matches of regexp, set dot and run command
454X/\f2regexp\fP/ \f2command\fP	Run command in each file whose menu line matches regexp
455Y/\f2regexp\fP/ \f2command\fP	Run command in each file whose menu line does not match
456g/\f2regexp\fP/ \f2command\fP	If dot contains a match of regexp, run command
457v/\f2regexp\fP/ \f2command\fP	If dot does not contain a match of regexp, run command
458.sp .4
459_
460.sp .4
461\f1Miscellany\fP
462.sp .4
463_
464.sp .2
465k	Set address mark to value of dot
466q	Quit
467u \f2n\fP	Undo last \f2n\fP (default 1) changes
468{ }	Braces group commands
469.sp .3
470.ft CW
471_
472.ft
473.TE
474.sp
475.KE
476.PP
477The value of dot may be changed by
478specifying an
479.I address
480for the command.
481The simplest address is a line number:
482.P1
4833
484.P2
485refers to the third line of the file, so
486.P1
4873d
488.P2
489deletes the third line of the file, and implicitly renumbers
490the lines so the old line 4 is now numbered 3.
491(This is one of the few places where
492.CW sam
493deals with lines directly.)
494Line
495.CW 0
496is the null string at the beginning of the file.
497If a command consists of only an address, a
498.CW p
499command is assumed, so typing an unadorned
500.CW 3
501prints line 3 on the terminal.
502There are a couple of other basic addresses:
503a period addresses dot itself; and
504a dollar sign
505.CW $ ) (
506addresses the null string at the end of the file.
507.PP
508An address is always a single substring of the file.
509Thus, the address
510.CW 3
511addresses the characters
512after the second newline of
513the file through the third newline of the file.
514A
515.I
516compound address
517.R
518is constructed by the comma operator
519.P1
520\f2address1\fP,\f2address2\fP
521.P2
522and addresses the substring of the file from the beginning of
523.I address1
524to the end of
525.I address2 .
526For example, the command
527.CW 3,5p
528prints the third through fifth lines of the file and
529.CW .,$d
530deletes the text from the beginning of dot to the end of the file.
531.PP
532These addresses are all absolute positions in the file, but
533.CW sam
534also has relative addresses, indicated by
535.CW +
536or
537.CW - .
538For example,
539.P1
540$-3
541.P2
542is the third line before the end of the file and
543.P1
544\&.+1
545.P2
546is the line after dot.
547If no address appears to the left of the
548.CW +
549or
550.CW - ,
551dot is assumed;
552if nothing appears to the right,
553.CW 1
554is assumed.
555Therefore,
556.CW .+1
557may be abbreviated to just a plus sign.
558.PP
559The
560.CW +
561operator acts relative to the end of its first argument, while the
562.CW -
563operator acts relative to the beginning.  Thus
564.CW .+1
565addresses the first line after dot,
566.CW .-
567addresses the first line before dot, and
568.CW +-
569refers to the line containing the end of dot.  (Dot may span multiple lines, and
570.CW +
571selects the line after the end of dot, then
572.CW -
573backs up one line.)
574.PP
575The final type of address is a regular expression, which addresses the
576text matched by the expression.  The expression is enclosed in slashes, as in
577.P1
578/\f2expression\fP/
579.P2
580The expressions are the same as those in the UNIX program
581.CW egrep ,\u\s-4\&6,7\s+4\d
582and include closures, alternations, and so on.
583They find the
584.I
585leftmost longest
586.R
587string that matches the expression, that is,
588the first match after the point where the search is started,
589and if more than one match begins at the same spot, the longest such match.
590(I assume familiarity with the syntax for regular expressions in UNIX programs.\u\s-4\&9\s+4\d)
591For example,
592.P1
593/x/
594.P2
595matches the next
596.CW x
597character in the file,
598.P1
599/xx*/
600.P2
601matches the next run of one or more
602.CW x 's,
603and
604.P1
605/x|Peter/
606.P2
607matches the next
608.CW x
609or
610.CW Peter .
611For compatibility with other UNIX programs, the `any character' operator,
612a period,
613does not match a newline, so
614.P1
615/.*/
616.P2
617matches the text from dot to the end of the line, but excludes the newline
618and so will not match across
619the line boundary.
620.PP
621Regular expressions are always relative addresses.
622The direction is forwards by default,
623so
624.CW /Peter/
625is really an abbreviation for
626.CW +/Peter/ .
627The search can be reversed with a minus sign, so
628.P1
629.CW -/Peter/
630.P2
631finds the first
632.CW Peter
633before dot.
634Regular expressions may be used with other address forms, so
635.CW 0+/Peter/
636finds the first
637.CW Peter
638in the file and
639.CW $-/Peter/
640finds the last.
641Table II summarizes
642.CW sam 's
643addresses.
644.KF
645.TS
646center;
647c s
648lfCW l.
649Table II. \f(CWSam\fP addresses
650.sp .4
651.ft CW
652_
653.ft
654.sp .4
655\f1Simple addresses\fP
656.sp .4
657_
658.sp .2
659#\f2n\fP	The empty string after character \f2n\fP
660\f2n\fP	Line \f2n\fP.
661/\f2regexp\fP/	The first following match of the regular expression
662-/\f2regexp\fP/	The first previous match of the regular expression
663$	The null string at the end of the file
664\&.	Dot
665\&'	The address mark, set by \f(CWk\fP command
666"\f2regexp\fP"	Dot in the file whose menu line matches regexp
667.sp .4
668_
669.sp .4
670\f1Compound addresses\fP
671.sp .4
672_
673.sp .2
674\f2a1\fP+\f2a2\fP	The address \f2a2\fP evaluated starting at right of \f2a1\fP
675\f2a1\fP-\f2a2\fP	\f2a2\fP evaluated in the reverse direction starting at left of \f2a1\fP
676\f2a1\fP,\f2a2\fP	From the left of \f2a1\fP to the right of \f2a2\fP (default \f(CW0,$\fP)
677\f2a1\fP;\f2a2\fP	Like \f(CW,\fP but sets dot after evaluating \f2a1\fP
678.sp .4
679_
680.sp .4
681.T&
682c s.
683T{
684The operators
685.CW +
686and
687.CW -
688are high precedence, while
689.CW ,
690and
691.CW ;
692are low precedence.
693In both
694.CW +
695and
696.CW -
697forms,
698.I a2
699defaults to 1 and
700.I a1
701defaults to dot.
702If both
703.I a1
704and
705.I a2
706are present,
707.CW +
708may be elided.
709T}
710.sp .5
711.ft CW
712_
713.ft
714.TE
715.sp
716.KE
717.PP
718The language discussed so far will not seem novel
719to people who use UNIX text editors
720such as
721.CW ed
722or
723.CW vi .\u\s-4\&9\s+4\d
724Moreover, the kinds of editing operations these commands allow, with the exception
725of regular expressions and line numbers,
726are clearly more conveniently handled by a mouse-based interface.
727Indeed,
728.CW sam 's
729mouse language (discussed at length below) is the means by which
730simple changes are usually made.
731For large or repetitive changes, however, a textual language
732outperforms a manual interface.
733.PP
734Imagine that, instead of deleting just one occurrence of the string
735.CW Peter ,
736we wanted to eliminate every
737.CW Peter .
738What's needed is an iterator that runs a command for each occurrence of some
739text.
740.CW Sam 's
741iterator is called
742.CW x ,
743for extract:
744.P1
745x/\f2expression\fP/ \f2command\fP
746.P2
747finds all matches in dot of the specified expression, and for each
748such match, sets dot to the text matched and runs the command.
749So to delete all the
750.CW Peters:
751.P1
7520,$ x/Peter/ d
753.P2
754(Blanks in these examples are to improve readability;
755.CW sam
756neither requires nor interprets them.)
757This searches the entire file
758.CW 0,$ ) (
759for occurrences of the string
760.CW Peter ,
761and runs the
762.CW d
763command with dot set to each such occurrence.
764(By contrast, the comparable
765.CW ed
766command would delete all
767.I lines
768containing
769.CW Peter ;
770.CW sam
771deletes only the
772.CW Peters .)
773The address
774.CW 0,$
775is commonly used, and may be abbreviated to just a comma.
776As another example,
777.P1
778, x/Peter/ p
779.P2
780prints a list of
781.CW Peters,
782one for each appearance in the file, with no intervening text (not even newlines
783to separate the instances).
784.PP
785Of course, the text extracted by
786.CW x
787may be selected by a regular expression,
788which complicates deciding what set of matches is chosen \(em
789matches may overlap.  This is resolved by generating the matches
790starting from the beginning of dot using the leftmost-longest rule,
791and searching for each match starting from the end of the previous one.
792Regular expressions may also match null strings, but a null match
793adjacent to a non-null match is never selected; at least one character
794must intervene.
795For example,
796.P1
797, c/AAA/
798x/B*/ c/-/
799, p
800.P2
801produces as output
802.P1
803-A-A-A-
804.P2
805because the pattern
806.CW B*
807matches the null strings separating the
808.CW A 's.
809.PP
810The
811.CW x
812command has a complement,
813.CW y ,
814with similar syntax, that executes the command with dot set to the text
815.I between
816the matches of the expression.
817For example,
818.P1
819, c/AAA/
820y/A/ c/-/
821, p
822.P2
823produces the same result as the example above.
824.PP
825The
826.CW x
827and
828.CW y
829commands are looping constructs, and
830.CW sam
831has a pair of conditional commands to go with them.
832They have similar syntax:
833.P1
834g/\f2expression\fP/ \f2command\fP
835.P2
836(guard)
837runs the command exactly once if dot contains a match of the expression.
838This is different from
839.CW x ,
840which runs the command for
841.I each
842match:
843.CW x
844loops;
845.CW g
846merely tests, without changing the value of dot.
847Thus,
848.P1
849, x/Peter/ d
850.P2
851deletes all occurrences of
852.CW Peter ,
853but
854.P1
855, g/Peter/ d
856.P2
857deletes the whole file (reduces it to a null string) if
858.CW Peter
859occurs anywhere in the text.
860The complementary conditional is
861.CW v ,
862which runs the command if there is
863.I no
864match of the expression.
865.PP
866These control-structure-like commands may be composed to construct more
867involved operations.  For example, to print those lines of text that
868contain the string
869.CW Peter :
870.P1
871, x/.*\en/ g/Peter/ p
872.P2
873The
874.CW x
875breaks the file into lines, the
876.CW g
877selects those lines containing
878.CW Peter ,
879and the
880.CW p
881prints them.
882This command gives an address for the
883.CW x
884command (the whole file), but because
885.CW g
886does not have an explicit address, it applies to the value of
887dot produced by the
888.CW x
889command, that is, to each line.
890All commands in
891.CW sam
892except for the command to write a file to disc use dot for the
893default address.
894.PP
895Composition may be continued indefinitely.
896.P1
897, x/.*\en/ g/Peter/ v/SaltPeter/ p
898.P2
899prints those lines containing
900.CW Peter
901but
902.I not
903those containing
904.CW SaltPeter .
905.SH 2
906Structural Regular Expressions
907.LP
908Unlike other UNIX text editors,
909including the non-interactive ones such as
910.CW sed
911and
912.CW awk ,\u\s-4\&7\s+4\d
913.CW sam
914is good for manipulating files with multi-line `records.'
915An example is an on-line phone book composed of records,
916separated by blank lines, of the form
917.P1
918Herbert Tic
91944 Turnip Ave., Endive, NJ
920201-5555642
921
922Norbert Twinge
92316 Potato St., Cabbagetown, NJ
924201-5553145
925
926\&...
927.P2
928The format may be encoded as a regular expression:
929.P1
930(.+\en)+
931.P2
932that is, a sequence of one or more non-blank lines.
933The command to print Mr. Tic's entire record is then
934.P1
935, x/(.+\en)+/ g/^Herbert Tic$/ p
936.P2
937and that to extract just the phone number is
938.P1
939, x/(.+\en)+/ g/^Herbert Tic$/ x/^[0-9]*-[0-9]*\en/ p
940.P2
941The latter command breaks the file into records,
942chooses Mr. Tic's record,
943extracts the phone number from the record,
944and finally prints the number.
945.PP
946A more involved problem is that of
947renaming a particular variable, say
948.CW n ,
949to
950.CW num
951in a C program.
952The obvious first attempt,
953.P1
954, x/n/ c/num/
955.P2
956is badly flawed: it changes not only the variable
957.CW n
958but any letter
959.CW n
960that appears.
961We need to extract all the variables, and select those that match
962.CW n
963and only
964.CW n :
965.P1
966, x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
967.P2
968The pattern
969.CW [A-Za-z_][A-Za-z_0-9]*
970matches C identifiers.
971Next
972.CW g/n/
973selects those containing an
974.CW n .
975Then
976.CW v/../
977rejects those containing two (or more) characters, and finally
978.CW c/num/
979changes the remainder (identifiers
980.CW n )
981to
982.CW num .
983This version clearly works much better, but there may still be problems.
984For example, in C character and string constants, the sequence
985.CW \en
986is interpreted as a newline character, and we don't want to change it to
987.CW \enum.
988This problem can be forestalled with a
989.CW y
990command:
991.P1
992, y/\e\en/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
993.P2
994(the second
995.CW \e
996is necessary because of lexical conventions in regular expressions),
997or we could even reject character constants and strings outright:
998.P1 0
999,y/'[^']*'/ y/"[^"]*"/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
1000.P2
1001The
1002.CW y
1003commands in this version exclude from consideration all character constants
1004and strings.
1005The only remaining problem is to deal with the possible occurrence of
1006.CW \e'
1007or
1008.CW \e"
1009within these sequences, but it's easy to see how to resolve this difficulty.
1010.PP
1011The point of these composed commands is successive refinement.
1012A simple version of the command is tried, and if it's not good enough,
1013it can be honed by adding a clause or two.
1014(Mistakes can be undone; see below.
1015Also, the mouse language makes it unnecessary to retype the command each time.)
1016The resulting chains of commands are somewhat reminiscent of
1017shell pipelines.\u\s-4\&7\s+4\d
1018Unlike pipelines, though, which pass along modified
1019.I data ,
1020.CW sam
1021commands pass a
1022.I view
1023of the data.
1024The text at each step of the command is the same, but which pieces
1025are selected is refined step by step until the correct piece is
1026available to the final step of the command line, which ultimately makes the change.
1027.PP
1028In other UNIX programs, regular expressions are used only for selection,
1029as in the
1030.CW sam
1031.CW g
1032command, never for extraction as in the
1033.CW x
1034or
1035.CW y
1036command.
1037For example, patterns in
1038.CW awk \u\s-4\&7\s+4\d
1039are used to select lines to be operated on, but cannot be used
1040to describe the format of the input text, or to handle newline-free text.
1041The use of regular expressions to describe the structure of a piece
1042of text rather than its contents, as in the
1043.CW x
1044command,
1045has been given a name:
1046.I
1047structural regular expressions.
1048.R
1049When they are composed, as in the above example,
1050they are pleasantly expressive.
1051Their use is discussed at greater length elsewhere.\u\s-4\&10\s+4\d
1052.PP
1053.SH 2
1054Multiple files
1055.LP
1056.CW Sam
1057has a few other commands, mostly relating to input and output.
1058.P1
1059e discfilename
1060.P2
1061replaces the contents and name of the current file with those of the named
1062disc file;
1063.P1
1064w discfilename
1065.P2
1066writes the contents to the named disc file; and
1067.P1
1068r discfilename
1069.P2
1070replaces dot with the contents of the named disc file.
1071All these commands use the current file's name if none is specified.
1072Finally,
1073.P1
1074f discfilename
1075.P2
1076changes the name associated with the file and displays the result:
1077.P1
1078\&'-. discfilename
1079.P2
1080This output is called the file's
1081.I
1082menu line,
1083.R
1084because it is the contents of the file's line in the button 3 menu (described
1085in the
1086next section).
1087The first three characters are a concise notation for the state of the file.
1088The apostrophe signifies that the file is modified.
1089The minus sign indicates the number of windows
1090open on the file (see the next section):
1091.CW -
1092means none,
1093.CW +
1094means one, and
1095.CW *
1096means more than one.
1097Finally, the period indicates that this is the current file.
1098These characters are useful for controlling the
1099.CW X
1100command, described shortly.
1101.PP
1102.CW Sam
1103may be started with a set of disc files (such as all the source for
1104a program) by invoking it with a list of file names as arguments, and
1105more may be added or deleted on demand.
1106.P1
1107B discfile1 discfile2 ...
1108.P2
1109adds the named files to
1110.CW sam 's
1111list, and
1112.P1
1113D discfile1 discfile2 ...
1114.P2
1115removes them from
1116.CW sam 's
1117memory (without effect on associated disc files).
1118Both these commands have a syntax for using the shell\u\s-4\&7\s+4\d
1119(the UNIX command interpreter) to generate the lists:
1120.P1
1121B <echo *.c
1122.P2
1123will add all C source files, and
1124.P1
1125B <grep -l variable *.c
1126.P2
1127will add all C source files referencing a particular variable
1128(the UNIX command
1129.CW grep\ -l
1130lists all files in its arguments that contain matches of
1131the specified regular expression).
1132Finally,
1133.CW D
1134without arguments deletes the current file.
1135.PP
1136There are two ways to change which file is current:
1137.P1
1138b filename
1139.P2
1140makes the named file current.
1141The
1142.CW B
1143command
1144does the same, but also adds any new files to
1145.CW sam 's
1146list.
1147(In practice, of course, the current file
1148is usually chosen by mouse actions, not by textual commands.)
1149The other way is to use a form of address that refers to files:
1150.P1
1151"\f2expression\fP" \f2address\fP
1152.P2
1153refers to the address evaluated in the file whose menu line
1154matches the expression (there must be exactly one match).
1155For example,
1156.P1
1157"peter.c" 3
1158.P2
1159refers to the third line of the file whose name matches
1160.CW peter.c .
1161This is most useful in the move
1162.CW m ) (
1163and copy
1164.CW t ) (
1165commands:
1166.P1
11670,$ t "peter.c" 0
1168.P2
1169makes a copy of the current file at the beginning of
1170.CW peter.c .
1171.PP
1172The
1173.CW X
1174command
1175is a looping construct, like
1176.CW x ,
1177that refers to files instead of strings:
1178.P1
1179X/\f2expression\fP/ \f2command\fP
1180.P2
1181runs the command in all
1182files whose menu lines match the expression.  The best example is
1183.P1
1184X/'/ w
1185.P2
1186which writes to disc all modified files.
1187.CW Y
1188is the complement of
1189.CW X :
1190it runs the command on all files whose menu lines don't match the expression:
1191.P1
1192Y/\e.c/ D
1193.P2
1194deletes all files that don't have
1195.CW \&.c
1196in their names, that is, it keeps all C source files and deletes the rest.
1197.PP
1198Braces allow commands to be grouped, so
1199.P1
1200{
1201	\f2command1\fP
1202	\f2command2\fP
1203}
1204.P2
1205is syntactically a single command that runs two commands.
1206Thus,
1207.P1
1208X/\e.c/ ,g/variable/ {
1209	f
1210	, x/.*\en/ g/variable/ p
1211}
1212.P2
1213finds all occurrences of
1214.CW variable
1215in C source files, and prints
1216out the file names and lines of each match.
1217The precise semantics of compound operations is discussed in the implementation
1218sections below.
1219.PP
1220Finally,
1221the undo command,
1222.CW u ,
1223undoes the last command,
1224no matter how many files were affected.
1225Multiple undo operations move further back in time, so
1226.P1
1227u
1228u
1229.P2
1230(which may be abbreviated
1231.CW u2 )
1232undoes the last two commands.  An undo may not be undone, however, nor
1233may any command that adds or deletes files.
1234Everything else is undoable, though, including for example
1235.CW e
1236commands:
1237.P1
1238e filename
1239u
1240.P2
1241restores the state of the file completely, including its name, dot,
1242and modified bit.  Because of the undo, potentially dangerous commands
1243are not guarded by confirmations.  Only
1244.CW D ,
1245which destroys the information necessary to restore itself, is protected.
1246It will not delete a modified file, but a second
1247.CW D
1248of the same file will succeed regardless.
1249The
1250.CW q
1251command, which exits
1252.CW sam ,
1253is similarly guarded.
1254.SH 2
1255Mouse Interface
1256.LP
1257.CW Sam
1258is most commonly run
1259connected to a bitmap display and mouse for interactive editing.
1260The only difference in the command language
1261between regular, mouse-driven
1262.CW sam
1263and
1264.CW sam\ -d
1265is that if an address
1266is provided without a command,
1267.CW sam\ -d
1268will print the text referenced by the address, but
1269regular
1270.CW sam
1271will highlight it on the screen \(em in fact,
1272dot is always highlighted (see Figure 2).
1273.WS 1
1274.KF
1275.BP fig3.ps 2.04i
1276.Cs
1277Figure 2. A
1278.CW sam
1279window.  The scroll bar down the left
1280represents the file, with the bubble showing the fraction
1281visible in the window.
1282The scroll bar may be manipulated by the mouse for convenient browsing.
1283The current text,
1284which is highlighted, need not fit on a line.  Here it consists of one partial
1285line, one complete line, and final partial line.
1286.Ce
1287.KE
1288.PP
1289Each file may have zero or more windows open on the display.
1290At any time, only one window in all of
1291.CW sam
1292is the
1293.I
1294current window,
1295.R
1296that is, the window to which typing and mouse actions refer;
1297this may be the
1298.CW sam
1299window (that in which commands may be typed)
1300or one of the file windows.
1301When a file has multiple windows, the image of the file in each window
1302is always kept up to date.
1303The current file is the last file affected by a command,
1304so if the
1305.CW sam
1306window is current,
1307the current window is not a window on the current file.
1308However, each window on a file has its own value of dot,
1309and when switching between windows on a single file,
1310the file's value of dot is changed to that of the window.
1311Thus, flipping between windows behaves in the obvious, convenient way.
1312.PP
1313The mouse on the Blit has three buttons, numbered left to right.
1314Button 3 has a list of commands to manipulate windows,
1315followed by a list of `menu lines' exactly as printed by the
1316.CW f
1317command, one per file (not one per window).
1318These menu lines are sorted by file name.
1319If the list is long, the Blit menu software will make it more manageable
1320by generating a scrolling menu instead of an unwieldy long list.
1321Using the menu to select a file from the list makes that file the current
1322file, and the most recently current window in that file the current window.
1323But if that file is already current, selecting it in the menu cycles through
1324the windows on the file; this simple trick avoids a special menu to
1325choose windows on a file.
1326If there is no window open on the file,
1327.CW sam
1328changes the mouse cursor to prompt the user to create one.
1329.PP
1330The commands on the button 3 menu are straightforward (see Figure 3), and
1331are like the commands to manipulate windows in
1332.CW mux ,\u\s-4\&8\s+4\d
1333the Blit's window system.
1334.CW New
1335makes a new file, and gives it one empty window, whose size is determined
1336by a rectangle swept by the mouse.
1337.CW Zerox
1338prompts for a window to be selected, and
1339makes a clone of that window; this is how multiple windows are created on one file.
1340.CW Reshape
1341changes the size of the indicated window, and
1342.CW close
1343deletes it.  If that is the last window open on the file,
1344.CW close
1345first does a
1346.CW D
1347command on the file.
1348.CW Write
1349is identical to a
1350.CW w
1351command on the file; it is in the menu purely for convenience.
1352Finally,
1353.CW ~~sam~~
1354is a menu item that appears between the commands and the file names.
1355Selecting it makes the
1356.CW sam
1357window the current window,
1358causing subsequent typing to be interpreted as commands.
1359.KF
1360.BP fig2.ps 2.74i
1361.Cs
1362Figure 3. The menu on button 3.
1363The black rectangle on the left is a scroll bar; the menu is limited to
1364the length shown to prevent its becoming unwieldy.
1365Above the
1366.CW ~~sam~~
1367line is a list of commands;
1368beneath it is a list of files, presented exactly as with the
1369.CW f
1370command.
1371.Ce
1372.KE
1373.PP
1374When
1375.CW sam
1376requests that a window be swept, in response to
1377.CW new ,
1378.CW zerox
1379or
1380.CW reshape ,
1381it changes the mouse cursor from the usual arrow to a box with
1382a small arrow.
1383In this state, the mouse may be used to indicate an arbitrary rectangle by
1384pressing button 3 at one corner and releasing it at the opposite corner.
1385More conveniently,
1386button 3 may simply be clicked,
1387whereupon
1388.CW sam
1389creates the maximal rectangle that contains the cursor
1390and abuts the
1391.CW sam
1392window.
1393By placing the
1394.CW sam
1395window in the middle of the screen, the user can define two regions (one above,
1396one below) in which stacked fully-overlapping
1397windows can be created with minimal fuss (see Figure 1).
1398This simple user interface trick makes window creation noticeably easier.
1399.PP
1400The cut-and-paste editor is essentially the same as that in Smalltalk-80.\u\s-4\&11\s+4\d
1401The text in dot is always highlighted on the screen.
1402When a character is typed it replaces dot, and sets dot to the null
1403string after the character.  Thus, ordinary typing inserts text.
1404Button 1 is used for selection:
1405pressing the button, moving the mouse, and lifting the button
1406selects (sets dot to) the text between the points where the
1407button was pressed and released.
1408Pressing and releasing at the same point selects a null string; this
1409is called clicking.  Clicking twice quickly, or
1410.I
1411double clicking,
1412.R
1413selects larger objects;
1414for example, double clicking in a word selects the word,
1415double clicking just inside an opening bracket selects the text
1416contained in the brackets (handling nested brackets correctly),
1417and similarly for
1418parentheses, quotes, and so on.
1419The double-clicking rules reflect a bias toward
1420programmers.
1421If
1422.CW sam
1423were intended more for word processing, double-clicks would probably
1424select linguistic structures such as sentences.
1425.PP
1426If button 1 is pressed outside the current window, it makes the indicated
1427window current.
1428This is the easiest way to switch between windows and files.
1429.PP
1430Pressing button 2 brings up a menu of editing functions (see Figure 4).
1431These mostly apply to the selected text:
1432.CW cut
1433deletes the selected text, and remembers it in a hidden buffer called the
1434.I
1435snarf buffer,
1436.R
1437.CW paste
1438replaces the selected text by the contents of the snarf buffer,
1439.CW snarf
1440just copies the selected text to the snarf buffer,
1441.CW look
1442searches forward for the next literal occurrence of the selected text, and
1443.CW <mux>
1444exchanges snarf buffers with the window system in which
1445.CW sam
1446is running.
1447Finally, the last regular expression used appears as a menu entry
1448to search
1449forward for the next occurrence of a match for the expression.
1450.WS 1
1451.KF
1452.BP fig4.ps 1.20i
1453.Cs
1454Figure 4. The menu on button 2.
1455The bottom entry tracks the most recently used regular expression, which may
1456be literal text.
1457.Ce
1458.KE
1459.PP
1460The relationship between the command language and the mouse language is
1461entirely due to the equality of dot and the selected text chosen
1462with button 1 on the mouse.
1463For example, to make a set of changes in a C subroutine, dot can be
1464set by double clicking on the left brace that begins the subroutine,
1465which sets dot for the command language.
1466An address-free command then typed in the
1467.CW sam
1468window will apply only to the text between the opening and closing
1469braces of the function.
1470The idea is to select what you want, and then say what you want
1471to do with it, whether invoked by a menu selection or by a typed command.
1472And of course, the value of dot is highlighted on
1473the display after the command completes.
1474This relationship between mouse interface and command language
1475is clumsy to explain, but comfortable, even natural, in practice.
1476.SH
1477The Implementation
1478.LP
1479The next few sections describe how
1480.CW sam
1481is put together, first the host part,
1482then the inter-component communication,
1483then the terminal part.
1484After explaining how the command language is implemented,
1485the discussion follows (roughly) the path of a character
1486from the temporary file on disc to the screen.
1487The presentation centers on the data structures,
1488because that is how the program was designed and because
1489the algorithms are easy to provide, given the right data
1490structures.
1491.SH 2
1492Parsing and execution
1493.LP
1494The command language is interpreted by parsing each command with a
1495table-driven recursive
1496descent parser, and when a complete command is assembled, invoking a top-down
1497executor.
1498Most editors instead employ a simple character-at-a-time
1499lexical scanner.
1500Use of a parser makes it
1501easy and unambiguous to detect when a command is complete,
1502which has two advantages.
1503First, escape conventions such as backslashes to quote
1504multiple-line commands are unnecessary;  if the command isn't finished,
1505the parser keeps reading.  For example, a multiple-line append driven by an
1506.CW x
1507command is straightforward:
1508.P1
1509x/.*\en/ g/Peter/ a
1510one line about Peter
1511another line about Peter
1512\&.
1513.P2
1514Other UNIX editors would require a backslash after all but the last line.
1515.PP
1516The other advantage is specific to the two-process structure of
1517.CW sam .
1518The host process must decide when a command is completed so the
1519command interpreter can be called.  This problem is easily resolved
1520by having the lexical analyzer read the single stream of events from the
1521terminal, directly executing all typing and mouse commands,
1522but passing to the parser characters typed to the
1523.CW sam
1524command window.
1525This scheme is slightly complicated by the availability of cut-and-paste
1526editing in the
1527.CW sam
1528window, but that difficulty is resolved by applying the rules
1529used in
1530.CW mux :
1531when a newline is typed to the
1532.CW sam
1533window, all text between the newline and the previously typed newline
1534is made available to the parser.
1535This permits arbitrary editing to be done to a command before
1536typing newline and thereby requesting execution.
1537.PP
1538The parser is driven by a table because the syntax of addresses
1539and commands is regular enough
1540to be encoded compactly.  There are few special cases, such as the
1541replacement text in a substitution, so the syntax of almost all commands
1542can be encoded with a few flags.
1543These include whether the command allows an address (for example,
1544.CW e
1545does not), whether it takes a regular expression (as in
1546.CW x
1547and
1548.CW s ),
1549whether it takes replacement text (as in
1550.CW c
1551or
1552.CW i ),
1553which may be multi-line, and so on.
1554The internal syntax of regular expressions is handled by a separate
1555parser; a regular expression is a leaf of the command parse tree.
1556Regular expressions are discussed fully in the next section.
1557.PP
1558The parser table also has information about defaults, so the interpreter
1559is always called with a complete tree.  For example, the parser fills in
1560the implicit
1561.CW 0
1562and
1563.CW $
1564in the abbreviated address
1565.CW ,
1566(comma),
1567inserts a
1568.CW +
1569to the left of an unadorned regular expression in an address,
1570and provides the usual default address
1571.CW .
1572(dot) for commands that expect an address but are not given one.
1573.PP
1574Once a complete command is parsed, the evaluation is easy.
1575The address is evaluated left-to-right starting from the value of dot,
1576with a mostly ordinary expression evaluator.
1577Addresses, like many of the data structures in
1578.CW sam ,
1579are held in a C structure and passed around by value:
1580.P1
1581typedef long Posn;    /* Position in a file */
1582typedef struct Range{
1583        Posn    p1, p2;
1584}Range;
1585typedef struct Address{
1586        Range   r;
1587        File    *f;
1588}Address;
1589.P2
1590An address is encoded as a substring (character positions
1591.CW p1
1592to
1593.CW p2 )
1594in a file
1595.CW f .
1596(The data type
1597.CW File
1598is described in detail below.)
1599.PP
1600The address interpreter is an
1601.CW Address -valued
1602function that traverses the parse tree describing an address (the
1603parse tree for the address has type
1604.CW Addrtree ):
1605.P1
1606Address
1607address(ap, a, sign)
1608	Addrtree *ap;
1609	Address a;
1610	int sign;
1611{
1612	Address a2;
1613	do
1614		switch(ap->type){
1615		case '.':
1616			a=a.f->dot;
1617			break;
1618		case '$':
1619			a.r.p1=a.r.p2=a.f->nbytes;
1620			break;
1621		case '"':
1622			a=matchfile(a, ap->aregexp)->dot;
1623			break;
1624		case ',':
1625			a2=address(ap->right, a, 0);
1626			a=address(ap->left, a, 0);
1627			if(a.f!=a2.f || a2.r.p2<a.r.p1)
1628				error(Eorder);
1629			a.r.p2=a2.r.p2;
1630			return a;
1631		/* and so on */
1632		}
1633	while((ap=ap->right)!=0);
1634	return a;
1635}
1636.P2
1637.PP
1638Throughout, errors are handled by a non-local
1639.CW goto
1640(a
1641.CW setjmp/longjmp
1642in C terminology)
1643hidden in a routine called
1644.CW error
1645that immediately aborts the execution, retracts any
1646partially made changes (see the section below on `undoing'), and
1647returns to the top level of the parser.
1648The argument to
1649.CW error
1650is an enumeration type that
1651is translated to a terse but possibly helpful
1652message such as `?addresses out of order.'
1653Very common messages are kept short; for example the message for
1654a failed regular expression search is `?search.'
1655.PP
1656Character addresses such as
1657.CW #3
1658are trivial to implement, as the
1659.CW File
1660data structure is accessible by character number.
1661However,
1662.CW sam
1663keeps no information about the position of newlines \(em it is too
1664expensive to track dynamically \(em so line addresses are computed by reading
1665the file, counting newlines.  Except in very large files, this has proven
1666acceptable: file access is fast enough to make the technique practical,
1667and lines are not central to the structure of the command language.
1668.PP
1669The command interpreter, called
1670.CW cmdexec ,
1671is also straightforward.  The parse table includes a
1672function to call to interpret a particular command.  That function
1673receives as arguments
1674the calculated address
1675for the command
1676and the command tree (of type
1677.CW Cmdtree ),
1678which may contain information such as the subtree for compound commands.
1679Here, for example, is the function for the
1680.CW g
1681and
1682.CW v
1683commands:
1684.P1
1685int
1686g_cmd(a, cp)
1687	Address a;
1688	Cmdtree *cp;
1689{
1690	compile(cp->regexp);
1691	if(execute(a.f, a.r.p1, a.r.p2)!=(cp->cmdchar=='v')){
1692		a.f->dot=a;
1693		return cmdexec(a, cp->subcmd);
1694	}
1695	return TRUE;	/* cause execution to continue */
1696}
1697.P2
1698.CW Compile "" (
1699and
1700.CW execute
1701are part of the regular expression code, described in the next section.)
1702Because the parser and the
1703.CW File
1704data structure do most of the work, most commands
1705are similarly brief.
1706.SH 2
1707Regular expressions
1708.LP
1709The regular expression code in
1710.CW sam
1711is an interpreted, rather than compiled on-the-fly, implementation of Thompson's
1712non-deterministic finite automaton algorithm.\u\s-4\&12\s+4\d
1713The syntax and semantics of the expressions are as in the UNIX program
1714.CW egrep ,
1715including alternation, closures, character classes, and so on.
1716The only changes in the notation are two additions:
1717.CW \en
1718is translated to, and matches, a newline character, and
1719.CW @
1720matches any character.  In
1721.CW egrep ,
1722the character
1723.CW \&.
1724matches any character except newline, and in
1725.CW sam
1726the same rule seemed safest, to prevent idioms like
1727.CW \&.*
1728from spanning newlines.
1729.CW Egrep
1730expressions are arguably too complicated for an interactive editor \(em
1731certainly it would make sense if all the special characters were two-character
1732sequences, so that most of the punctuation characters wouldn't have
1733peculiar meanings \(em but for an interesting command language, full
1734regular expressions are necessary, and
1735.CW egrep
1736defines the full regular expression syntax for UNIX programs.
1737Also, it seemed superfluous to define a new syntax, since various UNIX programs
1738.CW ed , (
1739.CW egrep
1740and
1741.CW vi )
1742define too many already.
1743.PP
1744The expressions are compiled by a routine,
1745.CW compile ,
1746that generates the description of the non-deterministic finite state machine.
1747A second routine,
1748.CW execute ,
1749interprets the machine to generate the leftmost-longest match of the
1750expression in a substring of the file.
1751The algorithm is described elsewhere.\u\s-4\&12,13\s+4\d
1752.CW Execute
1753reports
1754whether a match was found, and sets a global variable,
1755of type
1756.CW Range ,
1757to the substring matched.
1758.PP
1759A trick is required to evaluate the expression in reverse, such as when
1760searching backwards for an expression.
1761For example,
1762.P1
1763-/P.*r/
1764.P2
1765looks backwards through the file for a match of the expression.
1766The expression, however, is defined for a forward search.
1767The solution is to construct a machine identical to the machine
1768for a forward search except for a reversal of all the concatenation
1769operators (the other operators are symmetric under direction reversal),
1770to exchange the meaning of the operators
1771.CW ^
1772and
1773.CW $ ,
1774and then to read the file backwards, looking for the
1775usual earliest longest match.
1776.PP
1777.CW Execute
1778generates only one match each time it is called.
1779To interpret looping constructs such as the
1780.CW x
1781command,
1782.CW sam
1783must therefore synchronize between
1784calls of
1785.CW execute
1786to avoid
1787problems with null matches.
1788For example, even given the leftmost-longest rule,
1789the expression
1790.CW a*
1791matches three times in the string
1792.CW ab
1793(the character
1794.CW a ,
1795the null string between the
1796.CW a
1797and
1798.CW b ,
1799and the final null string).
1800After returning a match for the
1801.CW a ,
1802.CW sam
1803must not match the null string before the
1804.CW b .
1805The algorithm starts
1806.CW execute
1807at the end of its previous match, and
1808if the match it returns
1809is null and abuts the previous match, rejects the match and advances
1810the initial position one character.
1811.SH 2
1812Memory allocation
1813.LP
1814The C language has no memory allocation primitives, although a standard
1815library routine,
1816.CW malloc ,
1817provides adequate service for simple programs.
1818For specific uses, however,
1819it can be better to write a custom allocator.
1820The allocator (or rather, pair of allocators) described here
1821work in both the terminal and host parts of
1822.CW sam .
1823They are designed for efficient manipulation of strings,
1824which are allocated and freed frequently and vary in length from essentially
1825zero to 32 Kbytes (very large strings are written to disc).
1826More important, strings may be large and change size often,
1827so to minimize memory usage it is helpful to reclaim and to coalesce the
1828unused portions of strings when they are truncated.
1829.PP
1830Objects to be allocated in
1831.CW sam
1832are of two flavors:
1833the first is C
1834.CW structs ,
1835which are small and often addressed by pointer variables;
1836the second is variable-sized arrays of characters
1837or integers whose
1838base pointer is always used to access them.
1839The memory allocator in
1840.CW sam
1841is therefore in two parts:
1842first, a traditional first-fit allocator that provides fixed storage for
1843.CW structs ;
1844and second, a garbage-compacting allocator that reduces storage
1845overhead for variable-sized objects, at the cost of some bookkeeping.
1846The two types of objects are allocated from adjoining arenas, with
1847the garbage-compacting allocator controlling the arena with higher addresses.
1848Separating into two arenas simplifies compaction and prevents fragmentation due
1849to immovable objects.
1850The access rules for garbage-compactable objects
1851(discussed in the next paragraph) allow them to be relocated, so when
1852the first-fit arena needs space, it moves the garbage-compacted arena
1853to higher addresses to make room.  Storage is therefore created only
1854at successively higher addresses, either when more garbage-compacted
1855space is needed or when the first-fit arena pushes up the other arena.
1856.PP
1857Objects that may be compacted declare to the
1858allocator a cell that is guaranteed to be the sole repository of the
1859address of the object whenever a compaction can occur.
1860The compactor can then update the address when the object is moved.
1861For example, the implementation of type
1862.CW List
1863(really a variable-length array)
1864is:
1865.P1
1866typedef struct List{
1867        int     nused;
1868        long    *ptr;
1869}List;
1870.P2
1871The
1872.CW ptr
1873cell must always be used directly, and never copied.  When a
1874.CW List
1875is to be created the
1876.CW List
1877structure is allocated in the ordinary first-fit arena
1878and its
1879.CW ptr
1880is allocated in the garbage-compacted arena.
1881A similar data type for strings, called
1882.CW String ,
1883stores variable-length character arrays of up to 32767 elements.
1884.PP
1885A related matter of programming style:
1886.CW sam
1887frequently passes structures by value, which
1888simplifies the code.
1889Traditionally, C programs have
1890passed structures by reference, but implicit allocation on
1891the stack is easier to use.
1892Structure passing is a relatively new feature of C
1893(it is not in the
1894standard reference manual for C\u\s-4\&14\s+4\d), and is poorly supported in most
1895commercial C compilers.
1896It's convenient and expressive, though,
1897and simplifies memory management by
1898avoiding the allocator altogether
1899and eliminating pointer aliases.
1900.SH 2
1901Data structures for manipulating files
1902.LP
1903Experience with
1904.CW jim
1905showed that the requirements
1906of the file data structure were few, but strict.
1907First, files need to be read and written quickly;
1908adding a fresh file must be painless.
1909Second, the implementation must place no arbitrary upper limit on
1910the number or sizes of files.  (It should be practical to edit many files,
1911and files up to megabytes in length should be handled gracefully.)
1912This implies that files be stored on disc, not in main memory.
1913(Aficionados of virtual memory may argue otherwise, but the
1914implementation of virtual
1915memory in our system is not something to depend on
1916for good performance.)
1917Third, changes to files need be made by only two primitives:
1918deletion and insertion.
1919These are inverses of each other,
1920which simplifies the implementation of the undo operation.
1921Finally,
1922it must be easy and efficient to access the file, either
1923forwards or backwards, a byte at a time.
1924.PP
1925The
1926.CW File
1927data type is constructed from three simpler data structures that hold arrays
1928of characters.
1929Each of these types has an insertion and deletion operator, and the
1930insertion and deletion operators of the
1931.CW File
1932type itself are constructed from them.
1933.PP
1934The simplest type is the
1935.CW String ,
1936which is used to hold strings in main memory.
1937The code that manages
1938.CW Strings
1939guarantees that they will never be longer
1940than some moderate size, and in practice they are rarely larger than 8 Kbytes.
1941.CW Strings
1942have two purposes: they hold short strings like file names with little overhead,
1943and because they are deliberately small, they are efficient to modify.
1944They are therefore used as the data structure for in-memory caches.
1945.PP
1946The disc copy of the file is managed by a data structure called a
1947.CW Disc ,
1948which corresponds to a temporary file.  A
1949.CW Disc
1950has no storage in main memory other than bookkeeping information;
1951the actual data being held is all on the disc.
1952To reduce the number of open files needed,
1953.CW sam
1954opens a dozen temporary UNIX files and multiplexes the
1955.CW Discs
1956upon them.
1957This permits many files to
1958be edited; the entire
1959.CW sam
1960source (48 files) may be edited comfortably with a single
1961instance of
1962.CW sam .
1963Allocating one temporary file per
1964.CW Disc
1965would strain the operating system's limit on the number of open files.
1966Also, spreading the traffic among temporary files keeps the files shorter,
1967and shorter files are more efficiently implemented by the UNIX
1968I/O subsystem.
1969.PP
1970A
1971.CW Disc
1972is an array of fixed-length blocks, each of which contains
1973between 1 and 4096 characters of active data.
1974(The block size of our UNIX file system is 4096 bytes.)
1975The block addresses within the temporary file and the length of each
1976block are stored in a
1977.CW List .
1978When changes are made the live part of blocks may change size.
1979Blocks are created and coalesced when necessary to try to keep the sizes
1980between 2048 and 4096 bytes.
1981An actively changing part of the
1982.CW Disc
1983therefore typically has about a kilobyte of slop that can be
1984inserted or deleted
1985without changing more than one block or affecting the block order.
1986When an insertion would overflow a block, the block is split, a new one
1987is allocated to receive the overflow, and the memory-resident list of blocks
1988is rearranged to reflect the insertion of the new block.
1989.PP
1990Obviously, going to the disc for every modification to the file is
1991prohibitively expensive.
1992The data type
1993.CW Buffer
1994consists of a
1995.CW Disc
1996to hold the data and a
1997.CW String
1998that acts as a cache.
1999This is the first of a series of caches throughout the data structures in
2000.CW sam.
2001The caches not only improve performance, they provide a way to organize
2002the flow of data, particularly in the communication between the host
2003and terminal.
2004This idea is developed below, in the section on communications.
2005.PP
2006To reduce disc traffic, changes to a
2007.CW Buffer
2008are mediated by a variable-length string, in memory, that acts as a cache.
2009When an insertion or deletion is made to a
2010.CW Buffer ,
2011if the change can be accommodated by the cache, it is done there.
2012If the cache becomes bigger than a block because of an insertion,
2013some of it is written to the
2014.CW Disc
2015and deleted from the cache.
2016If the change does not intersect the cache, the cache is flushed.
2017The cache is only loaded at the new position if the change is smaller than a block;
2018otherwise, it is sent directly to the
2019.CW Disc .
2020This is because
2021large changes are typically sequential,
2022whereupon the next change is unlikely to overlap the current one.
2023.PP
2024A
2025.CW File
2026comprises a
2027.CW String
2028to hold the file name and some ancillary data such as dot and the modified bit.
2029The most important components, though, are a pair of
2030.CW Buffers ,
2031one called the transcript and the other the contents.
2032Their use is described in the next section.
2033.PP
2034The overall structure is shown in Figure 5.
2035Although it may seem that the data is touched many times on its
2036way from the
2037.CW Disc ,
2038it is read (by one UNIX system call) directly into the cache of the
2039associated
2040.CW Buffer ;
2041no extra copy is done.
2042Similarly, when flushing the cache, the text is written
2043directly from the cache to disc.
2044Most operations act directly on the text in the cache.
2045A principle applied throughout
2046.CW sam
2047is that the fewer times the data is copied, the faster the program will run
2048(see also the paper by Waite\u\s-4\&15\s+4\d).
2049.KF
2050.PS
2051copy "fig5.pic"
2052.PE
2053.Cs
2054Figure 5. File data structures.
2055The temporary files are stored in the standard repository for such files
2056on the host system.
2057.Ce
2058.KE
2059.PP
2060The contents of a
2061.CW File
2062are accessed by a routine that
2063copies to a buffer a substring of a file starting at a specified offset.
2064To read a byte at a time, a
2065.CW File "" per-
2066array is loaded starting from a specified initial position,
2067and bytes may then be read from the array.
2068The implementation is done by a macro similar to the C standard I/O
2069.CW getc
2070macro.\u\s-4\&14\s+4\d
2071Because the reading may be done at any address, a minor change to the
2072macro allows the file to be read backwards.
2073This array is read-only; there is no
2074.CW putc .
2075.SH 2
2076Doing and undoing
2077.LP
2078.CW Sam
2079has an unusual method for managing changes to files.
2080The command language makes it easy to specify multiple variable-length changes
2081to a file millions of bytes long, and such changes
2082must be made efficiently if the editor is to be practical.
2083The usual techniques for inserting and deleting strings
2084are inadequate under these conditions.
2085The
2086.CW Buffer
2087and
2088.CW Disc
2089data structures are designed for efficient random access to long strings,
2090but care must be taken to avoid super-linear behavior when making
2091many changes simultaneously.
2092.PP
2093.CW Sam
2094uses a two-pass algorithm for making changes, and treats each file as a database
2095against which transactions are registered.
2096Changes are not made directly to the contents.
2097Instead, when a command is started, a `mark' containing
2098a sequence number is placed in the transcript
2099.CW Buffer ,
2100and each change made to the file, either an insertion or deletion
2101or a change to the file name,
2102is appended to the end of the transcript.
2103When the command is complete, the transcript is rewound to the
2104mark and applied to the contents.
2105.PP
2106One reason for separating evaluation from
2107application in this way is to simplify tracking the addresses of changes
2108made in the middle of a long sequence.
2109The two-pass algorithm also allows all changes to apply to the
2110.I original
2111data: no change can affect another change made in the same command.
2112This is particularly important when evaluating an
2113.CW x
2114command because it prevents regular expression matches
2115from stumbling over changes made earlier in the execution.
2116Also, the two-pass
2117algorithm is cleaner than the way other UNIX editors allow changes to
2118affect each other;
2119for example,
2120.CW ed 's
2121idioms to do things like delete every other line
2122depend critically on the implementation.
2123Instead,
2124.CW sam 's
2125simple model, in which all changes in a command occur effectively
2126simultaneously, is easy to explain and to understand.
2127.PP
2128The records in the transcript are of the form ``delete substring from
2129locations
2130123 to 456'' and ``insert 11 characters `hello there' at location 789.''
2131(It is an error if the changes are not at monotonically greater
2132positions through the file.)
2133While the update is occurring, these numbers must be
2134offset by earlier changes, but that is straightforward and
2135local to the update routine;
2136moreover, all the numbers have been computed
2137before the first is examined.
2138.PP
2139Treating the file as a transaction system has another advantage:
2140undo is trivial.
2141All it takes is to invert the transcript after it has been
2142implemented, converting insertions
2143into deletions and vice versa, and saving them in a holding
2144.CW Buffer .
2145The `do' transcript can then be deleted from
2146the transcript
2147.CW Buffer
2148and replaced by the `undo' transcript.
2149If an undo is requested, the transcript is rewound and the undo transcript
2150executed.
2151Because the transcript
2152.CW Buffer
2153is not truncated after each command, it accumulates
2154successive changes.
2155A sequence of undo commands
2156can therefore back up the file arbitrarily,
2157which is more helpful than the more commonly implemented self-inverse form of undo.
2158.CW Sam "" (
2159provides no way to undo an undo, but if it were desired,
2160it would be easy to provide by re-interpreting the `do' transcript.)
2161Each mark in the transcript contains a sequence number and the offset into
2162the transcript of the previous mark, to aid in unwinding the transcript.
2163Marks also contain the value of dot and the modified bit so these can be
2164restored easily.
2165Undoing multiple files is easy; it merely demands undoing all files whose
2166latest change has the same sequence number as the current file.
2167.PP
2168Another benefit of having a transcript is that errors encountered in the middle
2169of a complicated command need not leave the files in an intermediate state.
2170By rewinding the transcript to the mark beginning the command,
2171the partial command can be trivially undone.
2172.PP
2173When the update algorithm was first implemented, it was unacceptably slow,
2174so a cache was added to coalesce nearby changes,
2175replacing multiple small changes by a single larger one.
2176This reduced the number
2177of insertions into the transaction
2178.CW Buffer ,
2179and made a dramatic improvement in performance,
2180but made it impossible
2181to handle changes in non-monotonic order in the file; the caching method
2182only works if changes don't overlap.
2183Before the cache was added, the transaction could in principle be sorted
2184if the changes were out of order, although
2185this was never done.
2186The current status is therefore acceptable performance with a minor
2187restriction on global changes, which is sometimes, but rarely, an annoyance.
2188.PP
2189The update algorithm obviously paws the data more than simpler
2190algorithms, but it is not prohibitively expensive;
2191the caches help.
2192(The principle of avoiding copying the data is still honored here,
2193although not as piously:
2194the data is moved from contents' cache to
2195the transcript's all at once and through only one internal buffer.)
2196Performance figures confirm the efficiency.
2197To read from a dead start a hundred kilobyte file on a VAX-11/750
2198takes 1.4 seconds of user time, 2.5 seconds of system time,
2199and 5 seconds of real time.
2200Reading the same file in
2201.CW ed
2202takes 6.0 seconds of user time, 1.7 seconds of system time,
2203and 8 seconds of real time.
2204.CW Sam
2205uses about half the CPU time.
2206A more interesting example is the one stated above:
2207inserting a character between every pair of characters in the file.
2208The
2209.CW sam
2210command is
2211.P1
2212,y/@/ a/x/
2213.P2
2214and takes 3 CPU seconds per kilobyte of input file, of which
2215about a third is spent in the regular expression code.
2216This translates to about 500 changes per second.
2217.CW Ed
2218takes 1.5 seconds per kilobyte to make a similar change (ignoring newlines),
2219but cannot undo it.
2220The same example in
2221.CW ex ,\u\s-4\&9\s+4\d
2222a variant of
2223.CW ed
2224done at the University of California at Berkeley,
2225which allows one level of undoing, again takes 3 seconds.
2226In summary,
2227.CW sam 's
2228performance is comparable to that of other UNIX editors, although it solves
2229a harder problem.
2230.SH 2
2231Communications
2232.LP
2233The discussion so far has described the implementation of the host part of
2234.CW sam ;
2235the next few sections explain how a machine with mouse and bitmap display
2236can be engaged to improve interaction.
2237.CW Sam
2238is not the first editor to be written as two processes,\u\s-4\&16\s+4\d
2239but its implementation
2240has some unusual aspects.
2241.PP
2242There are several ways
2243.CW sam 's
2244host and terminal parts may be connected.
2245The first and simplest is to forgo the terminal part and use the host
2246part's command language to edit text on an ordinary terminal.
2247This mode is invoked by starting
2248.CW sam
2249with the
2250.CW -d
2251option.
2252With no options,
2253.CW sam
2254runs separate host and terminal programs,
2255communicating with a message protocol over the physical
2256connection that joins them.
2257Typically, the connection is an RS-232 link between a Blit
2258(the prototypical display for
2259.CW sam )
2260and a host running
2261the Ninth Edition of the UNIX operating system.\u\s-4\&8\s+4\d
2262(This is the version of the system used in the Computing Sciences Research
2263Center at AT&T Bell Laboratories, where I work.  Its relevant
2264aspects are discussed in the Blit paper.\u\s-4\&1\s+4\d)
2265The implementation of
2266.CW sam
2267for the SUN computer runs both processes on the same machine and
2268connects them by a pipe.
2269.PP
2270The low bandwidth of an RS-232 link
2271necessitated the split between
2272the two programs.
2273The division is a mixed blessing:
2274a program in two parts is much harder to write and to debug
2275than a self-contained one,
2276but the split makes several unusual configurations possible.
2277The terminal may be physically separated from the host, allowing the conveniences
2278of a mouse and bitmap display to be taken home while leaving the files at work.
2279It is also possible to run the host part on a remote machine:
2280.P1
2281sam -r host
2282.P2
2283connects to the terminal in the usual way, and then makes a call
2284across the network to establish the host part of
2285.CW sam
2286on the named machine.
2287Finally, it cross-connects the I/O to join the two parts.
2288This allows
2289.CW sam
2290to be run on machines that do not support bitmap displays;
2291for example,
2292.CW sam
2293is the editor of choice on our Cray X-MP/24.
2294.CW Sam
2295.CW -r
2296involves
2297.I three
2298machines: the remote host, the terminal, and the local host.
2299The local host's job is simple but vital: it passes the data
2300between the remote host and terminal.
2301.PP
2302The host and terminal exchange messages asynchronously
2303(rather than, say, as remote procedure calls) but there is no
2304error detection or correction
2305because, whatever the configuration, the connection is reliable.
2306Because the terminal handles mundane interaction tasks such as
2307popping up menus and interpreting the responses, the messages are about
2308data, not actions.
2309For example, the host knows nothing about what is displayed on the screen,
2310and when the user types a character, the message sent to the host says
2311``insert a one-byte string at location 123 in file 7,'' not ``a character
2312was typed at the current position in the current file.''
2313In other words, the messages look very much like the transaction records
2314in the transcripts.
2315.PP
2316Either the host or terminal part of
2317.CW sam
2318may initiate a change to a file.
2319The command language operates on the host, while typing and some
2320mouse operations are executed directly in the terminal to optimize response.
2321Changes initiated by the host program must be transmitted to the terminal,
2322and
2323vice versa.
2324(A token is exchanged to determine which end is in control,
2325which means that characters typed while a time-consuming command runs
2326must be buffered and do not appear until the command is complete.)
2327To maintain consistent information,
2328the host and terminal track changes through a per-file
2329data structure that records what portions of the file
2330the terminal has received.
2331The data structure, called a
2332.CW Rasp
2333(a weak pun: it's a file with holes)
2334is held and updated by both the host and terminal.
2335A
2336.CW Rasp
2337is a list of
2338.CW Strings
2339holding those parts of the file known to the terminal,
2340separated by counts of the number of bytes in the interstices.
2341Of course, the host doesn't keep a separate copy of the data (it only needs
2342the lengths of the various pieces),
2343but the structure is the same on both ends.
2344.PP
2345The
2346.CW Rasp
2347in the terminal doubles as a cache.
2348Since the terminal keeps the text for portions of the file it has displayed,
2349it need not request data from the host when revisiting old parts of the file
2350or redrawing obscured windows, which speeds things up considerably
2351over low-speed links.
2352.PP
2353It's trivial for the terminal to maintain its
2354.CW Rasp ,
2355because all changes made on the terminal apply to parts of the file
2356already loaded there.
2357Changes made by the host are compared against the
2358.CW Rasp
2359during the update sequence after each command.
2360Small changes to pieces of the file loaded in the terminal
2361are sent in their entirety.
2362Larger changes, and changes that fall entirely in the holes,
2363are transmitted as messages without literal data:
2364only the lengths of the deleted and inserted strings are transmitted.
2365When a command is completed, the terminal examines its visible
2366windows to see if any holes in their
2367.CW Rasps
2368intersect the visible portion of the file.
2369It then requests the missing data from the host,
2370along with up to 512 bytes of surrounding data, to minimize
2371the number of messages when visiting a new portion of the file.
2372This technique provides a kind of two-level lazy evaluation for the terminal.
2373The first level sends a minimum of information about
2374parts of the file not being edited interactively;
2375the second level waits until a change is displayed before
2376transmitting the new data.
2377Of course,
2378performance is also helped by having the terminal respond immediately to typing
2379and simple mouse requests.
2380Except for small changes to active pieces of the file, which are
2381transmitted to the terminal without negotiation,
2382the terminal is wholly responsible for deciding what is displayed;
2383the host uses the
2384.CW Rasp
2385only to tell the terminal what might be relevant.
2386.PP
2387When a change is initiated by the host,
2388the messages to the terminal describing the change
2389are generated by the routine that applies the transcript of the changes
2390to the contents of the
2391.CW File .
2392Since changes are undone by the same update routine,
2393undoing requires
2394no extra code in the communications;
2395the usual messages describing changes to the file are sufficient
2396to back up the screen image.
2397.PP
2398The
2399.CW Rasp
2400is a particularly good example of the way caches are used in
2401.CW sam .
2402First, it facilitates access to the active portion of the text by placing
2403the busy text in main memory.
2404In so doing, it provides efficient access
2405to a large data structure that does not fit in memory.
2406Since the form of data is to be imposed by the user, not by the program,
2407and because characters will frequently be scanned sequentially,
2408files are stored as flat objects.
2409Caches help keep performance good and linear when working with such
2410data.
2411.PP
2412Second, the
2413.CW Rasp
2414and several of the other caches have some
2415.I read-ahead;
2416that is, the cache is loaded with more information than is needed for
2417the job immediately at hand.
2418When manipulating linear structures, the accesses are usually sequential,
2419and read-ahead can significantly reduce the average time to access the
2420next element of the object.
2421Sequential access is a common mode for people as well as programs;
2422consider scrolling through a document while looking for something.
2423.PP
2424Finally, like any good data structure,
2425the cache guides the algorithm, or at least the implementation.
2426The
2427.CW Rasp
2428was actually invented to control the communications between the host and
2429terminal parts, but I realized very early that it was also a form of
2430cache.  Other caches were more explicitly intended to serve a double
2431purpose: for example, the caches in
2432.CW Files
2433that coalesce updates not only reduce traffic to the
2434transcript and contents
2435.CW Buffers ,
2436they also clump screen updates so that complicated changes to the
2437screen are achieved in
2438just a few messages to the terminal.
2439This saved me considerable work: I did not need to write special
2440code to optimize the message traffic to the
2441terminal.
2442Caches pay off in surprising ways.
2443Also, they tend to be independent, so their performance improvements
2444are multiplicative.
2445.SH 2
2446Data structures in the terminal
2447.LP
2448The terminal's job is to display and to maintain a consistent image of
2449pieces of the files being edited.
2450Because the text is always in memory, the data structures are
2451considerably simpler than those in the host part.
2452.PP
2453.CW Sam
2454typically has far more windows than does
2455.CW mux ,
2456the window system within which its Blit implementation runs.
2457.CW Mux
2458has a fairly small number of asynchronously updated windows;
2459.CW sam
2460needs a large number of synchronously updated windows that are
2461usually static and often fully obscured.
2462The different tradeoffs guided
2463.CW sam
2464away from the memory-intensive implementation of windows, called
2465.CW Layers ,\u\s-4\&17\s+4\d
2466used in
2467.CW mux.
2468Rather than depending on a complete bitmap image of the display for each window,
2469.CW sam
2470regenerates the image from its in-memory text
2471(stored in the
2472.CW Rasp )
2473when necessary, although it will use such an image if it is available.
2474Like
2475.CW Layers ,
2476though,
2477.CW sam
2478uses the screen bitmap as active storage in which to update the image using
2479.CW bitblt .\u\s-4\&18,19\s+4\d
2480The resulting organization, pictured in Figure 6,
2481has a global array of windows, called
2482.CW Flayers ,
2483each of which holds an image of a piece of text held in a data structure
2484called a
2485.CW Frame ,
2486which in turn represents
2487a rectangular window full of text displayed in some
2488.CW Bitmap .
2489Each
2490.CW Flayer
2491appears in a global list that orders them all front-to-back
2492on the display, and simultaneously as an element of a per-file array
2493that holds all the open windows for that file.
2494The complement in the terminal of the
2495.CW File
2496on the host is called a
2497.CW Text ;
2498each connects its
2499.CW Flayers
2500to the associated
2501.CW Rasp .
2502.KF
2503.PS
2504copy "fig6.pic"
2505.PE
2506.Cs
2507Figure 6. Data structures in the terminal.
2508.CW Flayers
2509are also linked together into a front-to-back list.
2510.CW Boxes
2511are discussed in the next section.
2512.Ce
2513.KE
2514.PP
2515The
2516.CW Bitmap
2517for a
2518.CW Frame
2519contains the image of the text.
2520For a fully visible window, the
2521.CW Bitmap
2522will be the screen (or at least the
2523.CW Layer
2524in which
2525.CW sam
2526is being run),
2527while for partially obscured windows the
2528.CW Bitmap
2529will be off-screen.
2530If the window is fully obscured, the
2531.CW Bitmap
2532will be null.
2533.PP
2534The
2535.CW Bitmap
2536is a kind of cache.
2537When making changes to the display, most of the original image will
2538look the same in the final image, and the update algorithms exploit this.
2539The
2540.CW Frame
2541software updates the image in the
2542.CW Bitmap
2543incrementally; the
2544.CW Bitmap
2545is not just an image, it is a data structure.\u\s-4\&18,19\s+4\d
2546The job of the software that updates the display is therefore
2547to use as much as possible of the existing image (converting the
2548text from ASCII characters to pixels is expensive) in a sort of two-dimensional
2549string insertion algorithm.
2550The details of this process are described in the next section.
2551.PP
2552The
2553.CW Frame
2554software has no code to support overlapping windows;
2555its job is to keep a single
2556.CW Bitmap
2557up to date.
2558It falls to the
2559.CW Flayer
2560software to multiplex the various
2561.CW Bitmaps
2562onto the screen.
2563The problem of maintaining overlapping
2564.CW Flayers
2565is easier than for
2566.CW Layers \u\s-4\&17\s+4\d
2567because changes are made synchronously and because the contents of the window
2568can be reconstructed from the data stored in the
2569.CW Frame ;
2570the
2571.CW Layers
2572software
2573makes no such assumptions.
2574In
2575.CW sam ,
2576the window being changed is almost always fully visible, because the current
2577window is always fully visible, by construction.
2578However, when multi-file changes are being made, or when
2579more than one window is open on a file,
2580it may be necessary to update partially obscured windows.
2581.PP
2582There are three cases: the window is
2583fully visible, invisible (fully obscured), or partially visible.
2584If fully visible, the
2585.CW Bitmap
2586is part of the screen, so when the
2587.CW Flayer
2588update routine calls the
2589.CW Frame
2590update routine, the screen will be updated directly.
2591If the window is invisible,
2592there is no associated
2593.CW Bitmap ,
2594and all that is necessary is to update the
2595.CW Frame
2596data structure, not the image.
2597If the window is partially visible, the
2598.CW Frame
2599routine is called to update the image in the off-screen
2600.CW Bitmap ,
2601which may require regenerating it from the text of the window.
2602The
2603.CW Flayer
2604code then clips this
2605.CW Bitmap
2606against the
2607.CW Bitmaps
2608of all
2609.CW Frames
2610in front of the
2611.CW Frame
2612being modified, and the remainder is copied to the display.
2613.PP
2614This is much faster than recreating the image off-screen
2615for every change, or clipping all the changes made to the image
2616during its update.
2617Unfortunately, these caches can also consume prohibitive amounts of
2618memory, so they are freed fairly liberally \(em after every change to the
2619front-to-back order of the
2620.CW Flayers .
2621The result is that
2622the off-screen
2623.CW Bitmaps
2624exist only while multi-window changes are occurring,
2625which is the only time the performance improvement they provide is needed.
2626Also, the user interface causes fully-obscured windows to be the
2627easiest to make \(em
2628creating a canonically sized and placed window requires only a button click
2629\(em which reduces the need for caching still further.
2630.PP
2631.SH 2
2632Screen update
2633.LP
2634Only two low-level primitives are needed for incremental update:
2635.CW bitblt ,
2636which copies rectangles of pixels, and
2637.CW string
2638(which in turn calls
2639.CW bitblt ),
2640which draws a null-terminated character string in a
2641.CW Bitmap .
2642A
2643.CW Frame
2644contains a list of
2645.CW Boxes ,
2646each of which defines a horizontal strip of text in the window
2647(see Figure 7).
2648A
2649.CW Box
2650has a character string
2651.CW str ,
2652and a
2653.CW Rectangle
2654.CW rect
2655that defines the location of the strip in the window.
2656(The text in
2657.CW str
2658is stored in the
2659.CW Box
2660separately from the
2661.CW Rasp
2662associated with the window's file, so
2663.CW Boxes
2664are self-contained.)
2665The invariant is that
2666the image of the
2667.CW Box
2668can be reproduced by calling
2669.CW string
2670with argument
2671.CW str
2672to draw the string in
2673.CW rect ,
2674and the resulting picture fits perfectly within
2675.CW rect .
2676In other words, the
2677.CW Boxes
2678define the tiling of the window.
2679The tiling may be complicated by long lines of text, which
2680are folded onto the next line.
2681Some editors use horizontal scrolling to avoid this complication,
2682but to be comfortable this technique requires that lines not be
2683.I too
2684long;
2685.CW sam
2686has no such restriction.
2687Also, and perhaps more importantly, UNIX programs and terminals traditionally fold
2688long lines to make their contents fully visible.
2689.PP
2690Two special kinds of
2691.CW Boxes
2692contain a single
2693character: either a newline or a tab.
2694Newlines and tabs are white space.
2695A newline
2696.CW Box
2697always extends to the right edge of the window,
2698forcing the following
2699.CW Box
2700to the next line.
2701The width of a tab depends on where it is located:
2702it forces the next
2703.CW Box
2704to begin at a tab location.
2705Tabs also
2706have a minimum width equivalent to a blank (blanks are
2707drawn by
2708.CW string
2709and are not treated specially); newlines have a minimum width of zero.
2710.KF
2711.PS
2712copy "fig7.pic"
2713.PE
2714.sp .5
2715.Cs
2716Figure 7. A line of text showing its
2717.CW Boxes .
2718The first two blank
2719.CW Boxes
2720contain tabs; the last contains a newline.
2721Spaces are handled as ordinary characters.
2722.Ce
2723.KE
2724.PP
2725The update algorithms always use the
2726.CW Bitmap
2727image of the text (either the display or cache
2728.CW Bitmap );
2729they never examine the characters within a
2730.CW Box
2731except when the
2732.CW Box
2733needs to be split in two.
2734Before a change, the window consists of a tiling of
2735.CW Boxes ;
2736after the change the window is tiled differently.
2737The update algorithms rearrange the tiles in place, without
2738backup storage.
2739The algorithms are not strictly optimal \(em for example, they can
2740clear a pixel that is later going to be written upon \(em
2741but they never move a tile that doesn't need to be moved,
2742and they move each tile at most once.
2743.CW Frinsert
2744on a Blit can absorb over a thousand characters a second if the strings
2745being inserted are a few tens of characters long.
2746.PP
2747Consider
2748.CW frdelete .
2749Its job is to delete a substring from a
2750.CW Frame
2751and restore the image of the
2752.CW Frame .
2753The image of a substring has a peculiar shape (see Figure 2) comprising
2754possibly a partial line,
2755zero or more full lines,
2756and possibly a final partial line.
2757For reference, call this the
2758.I
2759Z-shape.
2760.R
2761.CW Frdelete
2762begins by splitting, if necessary, the
2763.CW Boxes
2764containing the ends of
2765the substring so the substring begins and ends on
2766.CW Box
2767boundaries.
2768Because the substring is being deleted, its image is not needed,
2769so the Z-shape is then cleared.
2770Then, tiles (that is, the images of
2771.CW Boxes )
2772are copied, using
2773.CW bitblt ,
2774from immediately after the Z-shape to
2775the beginning of the Z-shape,
2776resulting in a new Z-shape.
2777.CW Boxes "" (
2778whose contents would span two lines in the new position must first be split.)
2779.PP
2780Copying the remainder of the
2781.CW Frame
2782tile by tile
2783this way will clearly accomplish the deletion but eventually,
2784typically when the copying algorithm encounters a tab or newline,
2785the old and new
2786.CW x
2787coordinates of the tile
2788to be copied are the same.
2789This correspondence implies
2790that the Z-shape has its beginning and ending edges aligned
2791vertically, and a sequence of at most two
2792.CW bitblts
2793can be used to copy the remaining tiles.
2794The last step is to clear out the resulting empty space at the bottom
2795of the window;
2796the number of lines to be cleared is the number of complete lines in the
2797Z-shape closed by the final
2798.CW bitblts.
2799The final step is to merge horizontally adjacent
2800.CW Boxes
2801of plain text.
2802The complete source to
2803.CW frdelete
2804is less than 100 lines of C.
2805.PP
2806.CW frinsert
2807is more complicated because it must do four passes:
2808one to construct the
2809.CW Box
2810list for the inserted string,
2811one to reconnoitre,
2812one to copy (in opposite order to
2813.CW frdelete )
2814the
2815.CW Boxes
2816to make the hole for the new text,
2817and finally one to copy the new text into place.
2818Overall, though,
2819.CW frinsert
2820has a similar flavor to
2821.CW frdelete ,
2822and needn't be described further.
2823.CW Frinsert
2824and its subsidiary routines comprise 211 lines of C.
2825.PP
2826The terminal source code is 3024 lines of C,
2827and the host source is 5797 lines.
2828.SH
2829Discussion
2830.SH 2
2831History
2832.LP
2833The immediate ancestor of
2834.CW sam
2835was the original text editor for the Blit, called
2836.CW jim .
2837.CW Sam
2838inherited
2839.CW jim 's
2840two-process structure and mouse language almost unchanged, but
2841.CW jim
2842suffered from several drawbacks that were addressed in the design of
2843.CW sam .
2844The most important of these was the lack of a command language.
2845Although
2846.CW jim
2847was easy to use for simple editing, it provided no direct help with
2848large or repetitive editing tasks.  Instead, it provided a command to pass
2849selected text through a shell pipeline,
2850but this was no more satisfactory than could be expected of a stopgap measure.
2851.PP
2852.CW Jim
2853was written primarily as a vehicle for experimenting with a mouse-based
2854interface to text, and the experiment was successful.
2855.CW Jim
2856had some spin-offs:
2857.CW mux ,
2858the second window system for the Blit, is essentially a multiplexed
2859version of the terminal part of
2860.CW jim ;
2861and the debugger
2862.CW pi 's
2863user interface\u\s-4\&20\s+4\d was closely modeled on
2864.CW jim 's.
2865But after a couple of years,
2866.CW jim
2867had become difficult to maintain and limiting to use,
2868and its replacement was overdue.
2869.PP
2870I began the design of
2871.CW sam
2872by asking
2873.CW jim
2874customers what they wanted.
2875This was probably a mistake; the answers were essentially a list of features
2876to be found in other editors, which did not provide any of the
2877guiding principles I was seeking.
2878For instance, one common request was for a ``global substitute,''
2879but no one suggested how to provide it within a cut-and-paste editor.
2880I was looking for a scheme that would
2881support such specialized features comfortably in the context of some
2882general command language.
2883Ideas were not forthcoming, though, particularly given my insistence
2884on removing all limits on file sizes, line lengths and so on.
2885Even worse, I recognized that, since the mouse could easily
2886indicate a region of the screen that was not an integral number of lines,
2887the command language would best forget about newlines altogether,
2888and that meant the command language had to treat the file as a single
2889string, not an array of lines.
2890.PP
2891Eventually, I decided that thinking was not getting me very far and it was
2892time to try building.
2893I knew that the terminal part could be built easily \(em
2894that part of
2895.CW jim
2896behaved acceptably well \(em and that most of the hard work was going
2897to be in the host part: the file interface, command interpreter and so on.
2898Moreover, I had some ideas about how the architecture of
2899.CW jim
2900could be improved without destroying its basic structure, which I liked
2901in principle but which hadn't worked out as well as I had hoped.
2902So I began by designing the file data structure,
2903starting with the way
2904.CW jim
2905worked \(em comparable to a single structure merging
2906.CW Disc
2907and
2908.CW Buffer ,
2909which I split to make the cache more general
2910\(em and thinking about how global substitute could be implemented.
2911The answer was clearly that it had to be done in two passes,
2912and the transcript-oriented implementation fell out naturally.
2913.PP
2914.CW Sam
2915was written bottom-up,
2916starting from the data structures and algorithms for manipulating text,
2917through the command language and up to the code for maintaining
2918the display.
2919In retrospect, it turned out well, but this implementation method is
2920not recommended in general.
2921There were several times when I had a large body of interesting code
2922assembled and no clue how to proceed with it.
2923The command language, in particular, took almost a year to figure out,
2924but can be implemented (given what was there at the beginning of that year)
2925in a day or two.  Similarly, inventing the
2926.CW Rasp
2927data structure delayed the
2928connection of the host and terminal pieces by another few months.
2929.CW Sam
2930took about two years to write, although only about four months were
2931spent actually working on it.
2932.PP
2933Part of the design process was unusual:
2934the subset of the protocol that maintains the
2935.CW Rasp
2936was simulated, debugged
2937and verified by an automatic protocol analyzer,\u\s-4\&21\s+4\d and was bug-free
2938from the start.
2939The rest of the protocol, concerned mostly
2940with keeping menus up to date,
2941was unfortunately too unwieldy for such analysis,
2942and was debugged by more traditional methods, primarily
2943by logging in a file all messages in and out of the host.
2944.SH 2
2945Reflections
2946.LP
2947.CW Sam
2948is essentially the only interactive editor used by the sixty or so members of
2949the computing science research center in which I work.
2950The same could not be said of
2951.CW jim ;
2952the lack of a command language kept some people from adopting it.
2953The union of a user interface as comfortable as
2954.CW jim 's
2955with a command language as powerful as
2956.CW ed 's†
2957.FS
2958.vs 9
2959†The people who criticize
2960.CW ed
2961as an interactive program often forget that it and its close relative
2962.CW sed \u\s-4\&7\s+4\d
2963still thrive as programmable editors.  The strength of these programs is
2964independent of their convenience for interactive editing.
2965.br
2966.vs
2967.FE
2968is essential to
2969.CW sam 's
2970success.
2971When
2972.CW sam
2973was first made available to the
2974.CW jim
2975community,
2976almost everyone switched to it within two or three days.
2977In the months that followed, even people who had never adopted
2978.CW jim
2979started using
2980.CW sam
2981exclusively.
2982.PP
2983To be honest,
2984.CW ed
2985still gets occasional use, but usually when
2986something quick needs to be done and the overhead of
2987downloading the terminal part of
2988.CW sam
2989isn't worth the trouble.
2990Also, as a `line' editor,
2991.CW sam
2992.CW -d
2993is a bit odd;
2994when using a good old ASCII terminal, it's comforting to have
2995a true line editor.
2996But it is fair to say that
2997.CW sam 's
2998command language has displaced
2999.CW ed 's
3000for most of the complicated editing that has kept line editors
3001(that is, command-driven editors) with us.
3002.PP
3003.CW Sam 's
3004command language is even fancier than
3005.CW ed 's,
3006and most
3007.CW sam
3008customers don't come near to using all its capabilities.
3009Does it need to be so sophisticated?
3010I think the answer is yes, for two reasons.
3011.PP
3012First, the
3013.I model
3014for
3015.CW sam 's
3016command language is really relatively simple, and certainly simpler than that of
3017.CW ed .
3018For instance, there is only one kind of textual loop in
3019.CW sam
3020\(em the
3021.CW x
3022command \(em
3023while
3024.CW ed
3025has three (the
3026.CW g
3027command, the global flag on substitutions, and the implicit loop over
3028lines in multi-line substitutions).
3029Also,
3030.CW ed 's
3031substitute command is necessary to make changes within lines, but in
3032.CW sam
3033the
3034.CW s
3035command is more of a familiar convenience than a necessity;
3036.CW c
3037and
3038.CW t
3039can do all the work.
3040.PP
3041Second,
3042given a community that expects an editor to be about as powerful as
3043.CW ed ,
3044it's hard to see how
3045.CW sam
3046could really be much simpler and still satisfy that expectation.
3047People want to do ``global substitutes,'' and most are content
3048to have the recipe for that and a few other fancy changes.
3049The sophistication of the command language is really just a veneer
3050over a design that makes it possible to do global substitutes
3051in a screen editor.
3052Some people will always want something more, however, and it's gratifying to
3053be able to provide it.
3054The real power of
3055.CW sam 's
3056command language comes from composability of the operators, which is by
3057nature orthogonal to the underlying model.
3058In other words,
3059.CW sam
3060is not itself complex, but it makes complex things possible.
3061If you don't want to do anything complex, you can ignore the
3062complexity altogether, and many people do so.
3063.PP
3064Sometimes I am asked the opposite question: why didn't I just make
3065.CW sam
3066a real programmable editor, with macros and variables and so on?
3067The main reason is a matter of taste: I like the editor
3068to be the same every time I use it.
3069There is one technical reason, though:
3070programmability in editors is largely a workaround for insufficient
3071interactivity.
3072Programmable editors are used to make particular, usually short-term,
3073things easy to do, such as by providing shorthands for common actions.
3074If things are generally easy to do in the first place,
3075shorthands are not as helpful.
3076.CW Sam
3077makes common editing operations very easy, and the solutions to
3078complex editing problems seem commensurate with the problems themselves.
3079Also, the ability to edit the
3080.CW sam
3081window makes it easy to repeat commands \(em it only takes a mouse button click
3082to execute a command again.
3083.SH 2
3084Pros and cons
3085.LP
3086.CW Sam
3087has several other good points,
3088and its share of problems.
3089Among the good things is the idea of
3090structural regular expressions,
3091whose usefulness has only begun to be explored.
3092They were arrived at serendipitously when I attempted to distill the essence of
3093.CW ed 's
3094way of doing global substitution and recognized that the looping command in
3095.CW ed
3096was implicitly imposing a structure (an array of lines) on the file.
3097.PP
3098Another of
3099.CW sam 's
3100good things is its undo capability.
3101I had never before used an editor with a true undo,
3102but I would never go back now.
3103Undo
3104.I must
3105be done well, but if it is, it can be relied on.
3106For example,
3107it's safe to experiment if you're not sure how to write some intricate command,
3108because if you make a mistake, it can be fixed simply and reliably.
3109I learned two things about undo from writing
3110.CW sam :
3111first, it's easy to provide if you design it in from the beginning, and
3112second, it's necessary, particularly if the system has some subtle
3113properties that may be unfamiliar or error-prone for users.
3114.PP
3115.CW Sam 's
3116lack of internal limits and sizes is a virtue.
3117Because it avoids all fixed-size tables and data structures,
3118.CW sam
3119is able to make global changes to files that some of our other
3120tools cannot even read.
3121Moreover, the design keeps the performance linear when doing such
3122operations, although I must admit
3123.CW sam
3124does get slow when editing a huge file.
3125.PP
3126Now, the problems.
3127Externally, the most obvious is that it is poorly integrated into the
3128surrounding window system.
3129By design, the user interface in
3130.CW sam
3131feels almost identical to that of
3132.CW mux ,
3133but a thick wall separates text in
3134.CW sam
3135from the programs running in
3136.CW mux .
3137For instance, the `snarf buffer' in
3138.CW sam
3139must be maintained separately from that in
3140.CW mux .
3141This is regrettable, but probably necessary given the unusual configuration
3142of the system, with a programmable terminal on the far end of an RS-232 link.
3143.PP
3144.CW Sam
3145is reliable; otherwise, people wouldn't use it.
3146But it was written over such a long time, and has so many new (to me)
3147ideas in it, that I would like to see it done over again to clean
3148up the code and remove many of the lingering problems in the implementation.
3149The worst part is in the interconnection of the host and terminal parts,
3150which might even be able to go away in a redesign for a more
3151conventional window system.
3152The program must be split in two to use the terminal effectively,
3153but the low bandwidth of the connection forces the separation to
3154occur in an inconvenient part of the design if performance is to be acceptable.
3155A simple remote procedure call
3156protocol driven by the host, emitting only graphics
3157commands, would be easy to write but wouldn't have nearly the
3158necessary responsiveness.  On the other hand, if the terminal were in control
3159and requested much simpler file services from the host, regular expression
3160searches would require that the terminal read the entire file over its RS-232
3161link, which would be unreasonably slow.
3162A compromise in which either end can take control is necessary.
3163In retrospect, the communications protocol should have been
3164designed and verified formally, although I do not know of any tool
3165that can adequately relate the protocol to
3166its implementation.
3167.PP
3168Not all of
3169.CW sam 's
3170users are comfortable with its command language, and few are adept.
3171Some (venerable) people use a sort of
3172.CW ed \& ``
3173subset'' of
3174.CW sam 's
3175command language,
3176and even ask why
3177.CW sam 's
3178command language is not exactly
3179.CW ed 's.
3180(The reason, of course, is that
3181.CW sam 's
3182model for text does not include newlines, which are central to
3183.CW ed .
3184Making the text an array of newlines to the command language would
3185be too much of a break from the seamless model provided by the mouse.
3186Some editors, such as
3187.CW vi ,
3188are willing to make this break, though.)
3189The difficulty is that
3190.CW sam 's
3191syntax is so close to
3192.CW ed 's
3193that people believe it
3194.I should
3195be the same.
3196I thought, with some justification in hindsight,
3197that making
3198.CW sam
3199similar to
3200.CW ed
3201would make it easier to learn and to accept.
3202But I may have overstepped and raised the users'
3203expectations too much.
3204It's hard to decide which way to resolve this problem.
3205.PP
3206Finally, there is a tradeoff in
3207.CW sam
3208that was decided by the environment in which it runs:
3209.CW sam
3210is a multi-file editor, although in a different system there might instead be
3211multiple single-file editors.
3212The decision was made primarily because starting a new program in a Blit is
3213time-consuming.
3214If the choice could be made freely, however, I would
3215still choose the multi-file architecture, because it allows
3216groups of files to be handled as a unit;
3217the usefulness of the multi-file commands is incontrovertible.
3218It is delightful to have the source to an entire program
3219available at your fingertips.
3220.SH
3221Acknowledgements
3222.LP
3223Tom Cargill suggested the idea behind the
3224.CW Rasp
3225data structure.
3226Norman Wilson and Ken Thompson influenced the command language.
3227This paper was improved by comments from
3228Al Aho,
3229Jon Bentley,
3230Chris Fraser,
3231Gerard Holzmann,
3232Brian Kernighan,
3233Ted Kowalski,
3234Doug McIlroy
3235and
3236Dennis Ritchie.
3237