xref: /plan9/sys/doc/sam/sam.ms (revision 5d535f588fe0a7c7f04081512974c9c20b5d87a8)
1.HTML "The Text Editor sam
2.Vx 17 11 November 87 1 32 "ROB PIKE" "THE TEXT EDITOR SAM"
3.ds DY "31 May 1987
4.ds DR "Revised 1 July 1987
5.de CW		\" puts first arg in CW font, same as UL; maintains font
6\%\&\\$3\f(CW\\$1\fP\&\\$2
7..
8.de Cs
9.br
10.fi
11.ft 2
12.ps -2
13.vs -2
14..
15.de Ce
16.br
17.nf
18.ft 1
19.ps
20.vs
21.sp
22..
23.de XP
24.ie h .html - <center><img src="\\$1.gif" /></center>
25.el .BP \\$1.ps \\$2
26..
27.TL
28The Text Editor \&\f(CWsam\fP
29.AU
30Rob Pike
31rob@plan9.bell-labs.com
32.AB
33.LP
34.CW Sam
35is an interactive multi-file text editor intended for
36bitmap displays.
37A textual command language
38supplements the mouse-driven, cut-and-paste interface
39to make complex or
40repetitive editing tasks easy to specify.
41The language is characterized by the composition of regular expressions
42to describe the structure of the text being modified.
43The treatment of files as a database, with changes logged
44as atomic transactions, guides the implementation and
45makes a general `undo' mechanism straightforward.
46.PP
47.CW Sam
48is implemented as two processes connected by a low-bandwidth stream,
49one process handling the display and the other the editing
50algorithms.  Therefore it can run with the display process
51in a bitmap terminal and the editor on a local host,
52with both processes on a bitmap-equipped host, or with
53the display process in the terminal and the editor in a
54remote host.
55By suppressing the display process,
56it can even run without a bitmap terminal.
57.PP
58This paper is reprinted from Software\(emPractice and Experience,
59Vol 17, number 11, pp. 813-845, November 1987.
60The paper has not been updated for the Plan 9 manuals.  Although
61.CW Sam
62has not changed much since the paper was written, the system around it certainly has.
63Nonetheless, the description here still stands as the best introduction to the editor.
64.AE
65.SH
66Introduction
67.LP
68.CW Sam
69is an interactive text editor that combines cut-and-paste interactive editing with
70an unusual command language based on the composition of regular expressions.
71It is written as two programs: one, the `host part,' runs on a UNIX system
72and implements the command language and provides file access; the other, the
73`terminal part,' runs asynchronously
74on a machine with a mouse and bitmap display
75and supports the display and interactive editing.
76The host part may be even run in isolation on an ordinary terminal
77to edit text using the command
78language, much like a traditional line editor,
79without assistance from a mouse or display.
80Most often,
81the terminal part runs on a Blit\u\s-4\&1\s+4\d terminal
82(actually on a Teletype DMD 5620, the production version of the Blit), whose
83host connection is an ordinary 9600 bps RS232 link;
84on the SUN computer the host and display processes run on a single machine,
85connected by a pipe.
86.PP
87.CW Sam
88edits uninterpreted
89ASCII text.
90It has no facilities for multiple fonts, graphics or tables,
91unlike MacWrite,\u\s-4\&2\s+4\d Bravo,\u\s-4\&3\s+4\d Tioga\u\s-4\&4\s+4\d
92or Lara.\u\s-4\&5\s+4\d
93Also unlike them, it has a rich command language.
94(Throughout this paper, the phrase
95.I
96command language
97.R
98refers to
99textual commands; commands activated from the mouse form the
100.I mouse
101.I language. )
102.CW Sam
103developed as an editor for use by programmers, and tries to join
104the styles of the UNIX text editor
105.CW ed \u\s-4\&6,7\s+4\d
106with that of interactive cut-and-paste editors by
107providing a comfortable mouse-driven interface
108to a program with a solid command language driven by regular expressions.
109The command language developed more than the mouse language, and
110acquired a notation for describing the structure of files
111more richly than as a sequence of lines,
112using a dataflow-like syntax for specifying changes.
113.PP
114The interactive style was influenced by
115.CW jim ,\u\s-4\&1\s+4\d
116an early cut-and-paste editor for the Blit, and by
117.CW mux ,\u\s-4\&8\s+4\d
118the Blit window system.
119.CW Mux
120merges the original Blit window system,
121.CW mpx ,\u\s-4\&1\s+4\d
122with cut-and-paste editing, forming something like a
123multiplexed version of
124.CW jim
125that edits the output of (and input to) command sessions rather than files.
126.PP
127The first part of this paper describes the command language, then the mouse
128language, and explains how they interact.
129That is followed by a description of the implementation,
130first of the host part, then of the terminal part.
131A principle that influenced the design of
132.CW sam
133is that it should have no explicit limits, such as upper limits on
134file size or line length.
135A secondary consideration is that it be efficient.
136To honor these two goals together requires a method for efficiently
137manipulating
138huge strings (files) without breaking them into lines,
139perhaps while making thousands of changes
140under control of the command language.
141.CW Sam 's
142method is to
143treat the file as a transaction database, implementing changes as atomic
144updates.  These updates may be unwound easily to `undo' changes.
145Efficiency is achieved through a collection of caches that minimizes
146disc traffic and data motion, both within the two parts of the program
147and between them.
148.PP
149The terminal part of
150.CW sam
151is fairly straightforward.
152More interesting is how the two halves of the editor stay
153synchronized when either half may initiate a change.
154This is achieved through a data structure that organizes the
155communications and is maintained in parallel by both halves.
156.PP
157The last part of the paper chronicles the writing of
158.CW sam
159and discusses the lessons that were learned through its development and use.
160.PP
161The paper is long, but is composed largely of two papers of reasonable length:
162a description of the user interface of
163.CW sam
164and a discussion of its implementation.
165They are combined because the implementation is strongly influenced by
166the user interface, and vice versa.
167.SH
168The Interface
169.LP
170.CW Sam
171is a text editor for multiple files.
172File names may be provided when it is invoked:
173.P1
174sam file1 file2 ...
175.P2
176and there are commands
177to add new files and discard unneeded ones.
178Files are not read until necessary
179to complete some command.
180Editing operations apply to an internal copy
181made when the file is read; the UNIX file associated with the copy
182is changed only by an explicit command.
183To simplify the discussion, the internal copy is here called a
184.I file ,
185while the disc-resident original is called a
186.I
187disc file.
188.R
189.PP
190.CW Sam
191is usually connected to a bitmap display that presents a cut-and-paste
192editor driven by the mouse.
193In this mode, the command language is still available:
194text typed in a special window, called the
195.CW sam
196.I window,
197is interpreted
198as commands to be executed in the current file.
199Cut-and-paste editing may be used in any window \(em even in the
200.CW sam
201window to construct commands.
202The other mode of operation, invoked by starting
203.CW sam
204with the option
205.CW -d
206(for `no download'),
207does not use the mouse or bitmap display, but still permits
208editing using the textual command language, even on an ordinary terminal,
209interactively or from a script.
210.PP
211The following sections describe first the command language (under
212.CW sam\ -d
213and in the
214.CW sam
215window), and then the mouse interface.
216These two languages are nearly independent, but connect through the
217.I current
218.I text,
219described below.
220.SH 2
221The Command Language
222.LP
223A file consists of its contents, which are an array of characters
224(that is, a string); the
225.I name
226of the associated disc file; the
227.I
228modified bit
229.R
230that states whether the contents match those of
231the disc file;
232and a substring of the contents, called the
233.I
234current text
235.R
236or
237.I dot
238(see Figures 1 and 2).
239If the current text is a null string, dot falls between characters.
240The
241.I value
242of dot is the location of the current text; the
243.I contents
244of dot are the characters it contains.
245.CW Sam
246imparts to the text no two-dimensional interpretation such as columns
247or fields; text is always one-dimensional.
248Even the idea of a `line' of text as understood by most UNIX programs
249\(em a sequence of characters terminated by a newline character \(em
250is only weakly supported.
251.PP
252The
253.I
254current file
255.R
256is the file to which editing commands refer.
257The current text is therefore dot in the current file.
258If a command doesn't explicitly name a particular file or piece of text,
259the command is assumed to apply to the current text.
260For the moment, ignore the presence of multiple files and consider
261editing a single file.
262.KF L
263.XP fig1 3.5i
264.Cs
265Figure 1. A typical
266.CW sam
267screen, with the editing menu presented.
268The
269.CW sam
270(command language) window is in the middle, with file windows above and below.
271(The user interface makes it easy to create these abutting windows.)
272The partially obscured window is a third file window.
273The uppermost window is that to which typing and mouse operations apply,
274as indicated by its heavy border.
275Each window has its current text highlighted in reverse video.
276The
277.CW sam
278window's current text is the null string on the last visible line,
279indicated by a vertical bar.
280See also Figure 2.
281.Ce
282.KE
283.PP
284Commands have one-letter names.
285Except for non-editing commands such as writing
286the file to disc, most commands make some change
287to the text in dot and leave dot set to the text resulting from the change.
288For example, the delete command,
289.CW d ,
290deletes the text in dot, replacing it by the null string and setting dot
291to the result.
292The change command,
293.CW c ,
294replaces dot by text delimited by an arbitrary punctuation character,
295conventionally
296a slash.  Thus,
297.P1
298c/Peter/
299.P2
300replaces the text in dot by the string
301.CW Peter .
302Similarly,
303.P1
304a/Peter/
305.P2
306(append) adds the string after dot, and
307.P1
308i/Peter/
309.P2
310(insert) inserts before dot.
311All three leave dot set to the new text,
312.CW Peter .
313.PP
314Newlines are part of the syntax of commands:
315the newline character lexically terminates a command.
316Within the inserted text, however, newlines are never implicit.
317But since it is often convenient to insert multiple lines of text,
318.CW sam
319has a special
320syntax for that case:
321.P1
322a
323some lines of text
324to be inserted in the file,
325terminated by a period
326on a line by itself
327\&.
328.P2
329In the one-line syntax, a newline character may be specified by a C-like
330escape, so
331.P1
332c/\en/
333.P2
334replaces dot by a single newline character.
335.PP
336.CW Sam
337also has a substitute command,
338.CW s :
339.P1
340s/\f2expression\fP/\f2replacement\fP/
341.P2
342substitutes the replacement text for the first match, in dot,
343of the regular expression.
344Thus, if dot is the string
345.CW Peter ,
346the command
347.P1
348s/t/st/
349.P2
350changes it to
351.CW Pester .
352In general,
353.CW s
354is unnecessary, but it was inherited from
355.CW ed
356and it has some convenient variations.
357For instance, the replacement text may include the matched text,
358specified by
359.CW & :
360.P1
361s/Peter/Oh, &, &, &, &!/
362.P2
363.PP
364There are also three commands that apply programs
365to text:
366.P1
367< \f2UNIX program\fP
368.P2
369replaces dot by the output of the UNIX program.
370Similarly, the
371.CW >
372command
373runs the program with dot as its standard input, and
374.CW |
375does both.  For example,
376.P1
377| sort
378.P2
379replaces dot by the result of applying the standard sorting utility to it.
380Again, newlines have no special significance for these
381.CW sam
382commands.
383The text acted upon and resulting from these commands is not necessarily
384bounded by newlines, although for connection with UNIX programs,
385newlines may be necessary to obey conventions.
386.PP
387One more command:
388.CW p
389prints the contents of dot.
390Table I summarizes
391.CW sam 's
392commands.
393.KF
394.TS
395center;
396c s
397lfCW l.
398Table I. \f(CWSam\fP commands
399.sp .4
400.ft CW
401_
402.ft
403.sp .4
404\f1Text commands\fP
405.sp .4
406_
407.sp .4
408a/\f2text\fP/	Append text after dot
409c/\f2text\fP/	Change text in dot
410i/\f2text\fP/	Insert text before dot
411d	Delete text in dot
412s/\f2regexp\fP/\f2text\fP/	Substitute text for match of regular expression in dot
413m \f2address\fP	Move text in dot after address
414t \f2address\fP	Copy text in dot after address
415.sp .4
416_
417.sp .4
418\f1Display commands\fP
419.sp .4
420_
421.sp .2
422p	Print contents of dot
423\&=	Print value (line numbers and character numbers) of dot
424.sp .4
425_
426.sp .4
427\f1File commands\fP
428.sp .4
429_
430.sp .2
431b \f2file-list\fP	Set current file to first file in list that \f(CWsam\fP has in menu
432B \f2file-list\fP	Same as \f(CWb\fP, but load new files
433n	Print menu lines of all files
434D \f2file-list\fP	Delete named files from \f(CWsam\fP
435.sp .4
436_
437.sp .4
438\f1I/O commands\fP
439.sp .4
440_
441.sp .2
442e \f2filename\fP	Replace file with named disc file
443r \f2filename\fP	Replace dot by contents of named disc file
444w \f2filename\fP	Write file to named disc file
445f \f2filename\fP	Set file name and print new menu line
446< \f2UNIX-command\fP	Replace dot by standard output of command
447> \f2UNIX-command\fP	Send dot to standard input of command
448| \f2UNIX-command\fP	Replace dot by result of command applied to dot
449! \f2UNIX-command\fP	Run the command
450.sp .4
451_
452.sp .4
453\f1Loops and conditionals\fP
454.sp .4
455_
456.sp .2
457x/\f2regexp\fP/ \f2command\fP	For each match of regexp, set dot and run command
458y/\f2regexp\fP/ \f2command\fP	Between adjacent matches of regexp, set dot and run command
459X/\f2regexp\fP/ \f2command\fP	Run command in each file whose menu line matches regexp
460Y/\f2regexp\fP/ \f2command\fP	Run command in each file whose menu line does not match
461g/\f2regexp\fP/ \f2command\fP	If dot contains a match of regexp, run command
462v/\f2regexp\fP/ \f2command\fP	If dot does not contain a match of regexp, run command
463.sp .4
464_
465.sp .4
466\f1Miscellany\fP
467.sp .4
468_
469.sp .2
470k	Set address mark to value of dot
471q	Quit
472u \f2n\fP	Undo last \f2n\fP (default 1) changes
473{ }	Braces group commands
474.sp .3
475.ft CW
476_
477.ft
478.TE
479.sp
480.KE
481.PP
482The value of dot may be changed by
483specifying an
484.I address
485for the command.
486The simplest address is a line number:
487.P1
4883
489.P2
490refers to the third line of the file, so
491.P1
4923d
493.P2
494deletes the third line of the file, and implicitly renumbers
495the lines so the old line 4 is now numbered 3.
496(This is one of the few places where
497.CW sam
498deals with lines directly.)
499Line
500.CW 0
501is the null string at the beginning of the file.
502If a command consists of only an address, a
503.CW p
504command is assumed, so typing an unadorned
505.CW 3
506prints line 3 on the terminal.
507There are a couple of other basic addresses:
508a period addresses dot itself; and
509a dollar sign
510.CW $ ) (
511addresses the null string at the end of the file.
512.PP
513An address is always a single substring of the file.
514Thus, the address
515.CW 3
516addresses the characters
517after the second newline of
518the file through the third newline of the file.
519A
520.I
521compound address
522.R
523is constructed by the comma operator
524.P1
525\f2address1\fP,\f2address2\fP
526.P2
527and addresses the substring of the file from the beginning of
528.I address1
529to the end of
530.I address2 .
531For example, the command
532.CW 3,5p
533prints the third through fifth lines of the file and
534.CW .,$d
535deletes the text from the beginning of dot to the end of the file.
536.PP
537These addresses are all absolute positions in the file, but
538.CW sam
539also has relative addresses, indicated by
540.CW +
541or
542.CW - .
543For example,
544.P1
545$-3
546.P2
547is the third line before the end of the file and
548.P1
549\&.+1
550.P2
551is the line after dot.
552If no address appears to the left of the
553.CW +
554or
555.CW - ,
556dot is assumed;
557if nothing appears to the right,
558.CW 1
559is assumed.
560Therefore,
561.CW .+1
562may be abbreviated to just a plus sign.
563.PP
564The
565.CW +
566operator acts relative to the end of its first argument, while the
567.CW -
568operator acts relative to the beginning.  Thus
569.CW .+1
570addresses the first line after dot,
571.CW .-
572addresses the first line before dot, and
573.CW +-
574refers to the line containing the end of dot.  (Dot may span multiple lines, and
575.CW +
576selects the line after the end of dot, then
577.CW -
578backs up one line.)
579.PP
580The final type of address is a regular expression, which addresses the
581text matched by the expression.  The expression is enclosed in slashes, as in
582.P1
583/\f2expression\fP/
584.P2
585The expressions are the same as those in the UNIX program
586.CW egrep ,\u\s-4\&6,7\s+4\d
587and include closures, alternations, and so on.
588They find the
589.I
590leftmost longest
591.R
592string that matches the expression, that is,
593the first match after the point where the search is started,
594and if more than one match begins at the same spot, the longest such match.
595(I assume familiarity with the syntax for regular expressions in UNIX programs.\u\s-4\&9\s+4\d)
596For example,
597.P1
598/x/
599.P2
600matches the next
601.CW x
602character in the file,
603.P1
604/xx*/
605.P2
606matches the next run of one or more
607.CW x 's,
608and
609.P1
610/x|Peter/
611.P2
612matches the next
613.CW x
614or
615.CW Peter .
616For compatibility with other UNIX programs, the `any character' operator,
617a period,
618does not match a newline, so
619.P1
620/.*/
621.P2
622matches the text from dot to the end of the line, but excludes the newline
623and so will not match across
624the line boundary.
625.PP
626Regular expressions are always relative addresses.
627The direction is forwards by default,
628so
629.CW /Peter/
630is really an abbreviation for
631.CW +/Peter/ .
632The search can be reversed with a minus sign, so
633.P1
634.CW -/Peter/
635.P2
636finds the first
637.CW Peter
638before dot.
639Regular expressions may be used with other address forms, so
640.CW 0+/Peter/
641finds the first
642.CW Peter
643in the file and
644.CW $-/Peter/
645finds the last.
646Table II summarizes
647.CW sam 's
648addresses.
649.KF
650.TS
651center;
652c s
653lfCW l.
654Table II. \f(CWSam\fP addresses
655.sp .4
656.ft CW
657_
658.ft
659.sp .4
660\f1Simple addresses\fP
661.sp .4
662_
663.sp .2
664#\f2n\fP	The empty string after character \f2n\fP
665\f2n\fP	Line \f2n\fP.
666/\f2regexp\fP/	The first following match of the regular expression
667-/\f2regexp\fP/	The first previous match of the regular expression
668$	The null string at the end of the file
669\&.	Dot
670\&'	The address mark, set by \f(CWk\fP command
671"\f2regexp\fP"	Dot in the file whose menu line matches regexp
672.sp .4
673_
674.sp .4
675\f1Compound addresses\fP
676.sp .4
677_
678.sp .2
679\f2a1\fP+\f2a2\fP	The address \f2a2\fP evaluated starting at right of \f2a1\fP
680\f2a1\fP-\f2a2\fP	\f2a2\fP evaluated in the reverse direction starting at left of \f2a1\fP
681\f2a1\fP,\f2a2\fP	From the left of \f2a1\fP to the right of \f2a2\fP (default \f(CW0,$\fP)
682\f2a1\fP;\f2a2\fP	Like \f(CW,\fP but sets dot after evaluating \f2a1\fP
683.sp .4
684_
685.sp .4
686.T&
687c s.
688T{
689The operators
690.CW +
691and
692.CW -
693are high precedence, while
694.CW ,
695and
696.CW ;
697are low precedence.
698In both
699.CW +
700and
701.CW -
702forms,
703.I a2
704defaults to 1 and
705.I a1
706defaults to dot.
707If both
708.I a1
709and
710.I a2
711are present,
712.CW +
713may be elided.
714T}
715.sp .5
716.ft CW
717_
718.ft
719.TE
720.sp
721.KE
722.PP
723The language discussed so far will not seem novel
724to people who use UNIX text editors
725such as
726.CW ed
727or
728.CW vi .\u\s-4\&9\s+4\d
729Moreover, the kinds of editing operations these commands allow, with the exception
730of regular expressions and line numbers,
731are clearly more conveniently handled by a mouse-based interface.
732Indeed,
733.CW sam 's
734mouse language (discussed at length below) is the means by which
735simple changes are usually made.
736For large or repetitive changes, however, a textual language
737outperforms a manual interface.
738.PP
739Imagine that, instead of deleting just one occurrence of the string
740.CW Peter ,
741we wanted to eliminate every
742.CW Peter .
743What's needed is an iterator that runs a command for each occurrence of some
744text.
745.CW Sam 's
746iterator is called
747.CW x ,
748for extract:
749.P1
750x/\f2expression\fP/ \f2command\fP
751.P2
752finds all matches in dot of the specified expression, and for each
753such match, sets dot to the text matched and runs the command.
754So to delete all the
755.CW Peters:
756.P1
7570,$ x/Peter/ d
758.P2
759(Blanks in these examples are to improve readability;
760.CW sam
761neither requires nor interprets them.)
762This searches the entire file
763.CW 0,$ ) (
764for occurrences of the string
765.CW Peter ,
766and runs the
767.CW d
768command with dot set to each such occurrence.
769(By contrast, the comparable
770.CW ed
771command would delete all
772.I lines
773containing
774.CW Peter ;
775.CW sam
776deletes only the
777.CW Peters .)
778The address
779.CW 0,$
780is commonly used, and may be abbreviated to just a comma.
781As another example,
782.P1
783, x/Peter/ p
784.P2
785prints a list of
786.CW Peters,
787one for each appearance in the file, with no intervening text (not even newlines
788to separate the instances).
789.PP
790Of course, the text extracted by
791.CW x
792may be selected by a regular expression,
793which complicates deciding what set of matches is chosen \(em
794matches may overlap.  This is resolved by generating the matches
795starting from the beginning of dot using the leftmost-longest rule,
796and searching for each match starting from the end of the previous one.
797Regular expressions may also match null strings, but a null match
798adjacent to a non-null match is never selected; at least one character
799must intervene.
800For example,
801.P1
802, c/AAA/
803x/B*/ c/-/
804, p
805.P2
806produces as output
807.P1
808-A-A-A-
809.P2
810because the pattern
811.CW B*
812matches the null strings separating the
813.CW A 's.
814.PP
815The
816.CW x
817command has a complement,
818.CW y ,
819with similar syntax, that executes the command with dot set to the text
820.I between
821the matches of the expression.
822For example,
823.P1
824, c/AAA/
825y/A/ c/-/
826, p
827.P2
828produces the same result as the example above.
829.PP
830The
831.CW x
832and
833.CW y
834commands are looping constructs, and
835.CW sam
836has a pair of conditional commands to go with them.
837They have similar syntax:
838.P1
839g/\f2expression\fP/ \f2command\fP
840.P2
841(guard)
842runs the command exactly once if dot contains a match of the expression.
843This is different from
844.CW x ,
845which runs the command for
846.I each
847match:
848.CW x
849loops;
850.CW g
851merely tests, without changing the value of dot.
852Thus,
853.P1
854, x/Peter/ d
855.P2
856deletes all occurrences of
857.CW Peter ,
858but
859.P1
860, g/Peter/ d
861.P2
862deletes the whole file (reduces it to a null string) if
863.CW Peter
864occurs anywhere in the text.
865The complementary conditional is
866.CW v ,
867which runs the command if there is
868.I no
869match of the expression.
870.PP
871These control-structure-like commands may be composed to construct more
872involved operations.  For example, to print those lines of text that
873contain the string
874.CW Peter :
875.P1
876, x/.*\en/ g/Peter/ p
877.P2
878The
879.CW x
880breaks the file into lines, the
881.CW g
882selects those lines containing
883.CW Peter ,
884and the
885.CW p
886prints them.
887This command gives an address for the
888.CW x
889command (the whole file), but because
890.CW g
891does not have an explicit address, it applies to the value of
892dot produced by the
893.CW x
894command, that is, to each line.
895All commands in
896.CW sam
897except for the command to write a file to disc use dot for the
898default address.
899.PP
900Composition may be continued indefinitely.
901.P1
902, x/.*\en/ g/Peter/ v/SaltPeter/ p
903.P2
904prints those lines containing
905.CW Peter
906but
907.I not
908those containing
909.CW SaltPeter .
910.SH 2
911Structural Regular Expressions
912.LP
913Unlike other UNIX text editors,
914including the non-interactive ones such as
915.CW sed
916and
917.CW awk ,\u\s-4\&7\s+4\d
918.CW sam
919is good for manipulating files with multi-line `records.'
920An example is an on-line phone book composed of records,
921separated by blank lines, of the form
922.P1
923Herbert Tic
92444 Turnip Ave., Endive, NJ
925201-5555642
926
927Norbert Twinge
92816 Potato St., Cabbagetown, NJ
929201-5553145
930
931\&...
932.P2
933The format may be encoded as a regular expression:
934.P1
935(.+\en)+
936.P2
937that is, a sequence of one or more non-blank lines.
938The command to print Mr. Tic's entire record is then
939.P1
940, x/(.+\en)+/ g/^Herbert Tic$/ p
941.P2
942and that to extract just the phone number is
943.P1
944, x/(.+\en)+/ g/^Herbert Tic$/ x/^[0-9]*-[0-9]*\en/ p
945.P2
946The latter command breaks the file into records,
947chooses Mr. Tic's record,
948extracts the phone number from the record,
949and finally prints the number.
950.PP
951A more involved problem is that of
952renaming a particular variable, say
953.CW n ,
954to
955.CW num
956in a C program.
957The obvious first attempt,
958.P1
959, x/n/ c/num/
960.P2
961is badly flawed: it changes not only the variable
962.CW n
963but any letter
964.CW n
965that appears.
966We need to extract all the variables, and select those that match
967.CW n
968and only
969.CW n :
970.P1
971, x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
972.P2
973The pattern
974.CW [A-Za-z_][A-Za-z_0-9]*
975matches C identifiers.
976Next
977.CW g/n/
978selects those containing an
979.CW n .
980Then
981.CW v/../
982rejects those containing two (or more) characters, and finally
983.CW c/num/
984changes the remainder (identifiers
985.CW n )
986to
987.CW num .
988This version clearly works much better, but there may still be problems.
989For example, in C character and string constants, the sequence
990.CW \en
991is interpreted as a newline character, and we don't want to change it to
992.CW \enum.
993This problem can be forestalled with a
994.CW y
995command:
996.P1
997, y/\e\en/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
998.P2
999(the second
1000.CW \e
1001is necessary because of lexical conventions in regular expressions),
1002or we could even reject character constants and strings outright:
1003.P1 0
1004,y/'[^']*'/ y/"[^"]*"/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
1005.P2
1006The
1007.CW y
1008commands in this version exclude from consideration all character constants
1009and strings.
1010The only remaining problem is to deal with the possible occurrence of
1011.CW \e'
1012or
1013.CW \e"
1014within these sequences, but it's easy to see how to resolve this difficulty.
1015.PP
1016The point of these composed commands is successive refinement.
1017A simple version of the command is tried, and if it's not good enough,
1018it can be honed by adding a clause or two.
1019(Mistakes can be undone; see below.
1020Also, the mouse language makes it unnecessary to retype the command each time.)
1021The resulting chains of commands are somewhat reminiscent of
1022shell pipelines.\u\s-4\&7\s+4\d
1023Unlike pipelines, though, which pass along modified
1024.I data ,
1025.CW sam
1026commands pass a
1027.I view
1028of the data.
1029The text at each step of the command is the same, but which pieces
1030are selected is refined step by step until the correct piece is
1031available to the final step of the command line, which ultimately makes the change.
1032.PP
1033In other UNIX programs, regular expressions are used only for selection,
1034as in the
1035.CW sam
1036.CW g
1037command, never for extraction as in the
1038.CW x
1039or
1040.CW y
1041command.
1042For example, patterns in
1043.CW awk \u\s-4\&7\s+4\d
1044are used to select lines to be operated on, but cannot be used
1045to describe the format of the input text, or to handle newline-free text.
1046The use of regular expressions to describe the structure of a piece
1047of text rather than its contents, as in the
1048.CW x
1049command,
1050has been given a name:
1051.I
1052structural regular expressions.
1053.R
1054When they are composed, as in the above example,
1055they are pleasantly expressive.
1056Their use is discussed at greater length elsewhere.\u\s-4\&10\s+4\d
1057.PP
1058.SH 2
1059Multiple files
1060.LP
1061.CW Sam
1062has a few other commands, mostly relating to input and output.
1063.P1
1064e discfilename
1065.P2
1066replaces the contents and name of the current file with those of the named
1067disc file;
1068.P1
1069w discfilename
1070.P2
1071writes the contents to the named disc file; and
1072.P1
1073r discfilename
1074.P2
1075replaces dot with the contents of the named disc file.
1076All these commands use the current file's name if none is specified.
1077Finally,
1078.P1
1079f discfilename
1080.P2
1081changes the name associated with the file and displays the result:
1082.P1
1083\&'-. discfilename
1084.P2
1085This output is called the file's
1086.I
1087menu line,
1088.R
1089because it is the contents of the file's line in the button 3 menu (described
1090in the
1091next section).
1092The first three characters are a concise notation for the state of the file.
1093The apostrophe signifies that the file is modified.
1094The minus sign indicates the number of windows
1095open on the file (see the next section):
1096.CW -
1097means none,
1098.CW +
1099means one, and
1100.CW *
1101means more than one.
1102Finally, the period indicates that this is the current file.
1103These characters are useful for controlling the
1104.CW X
1105command, described shortly.
1106.PP
1107.CW Sam
1108may be started with a set of disc files (such as all the source for
1109a program) by invoking it with a list of file names as arguments, and
1110more may be added or deleted on demand.
1111.P1
1112B discfile1 discfile2 ...
1113.P2
1114adds the named files to
1115.CW sam 's
1116list, and
1117.P1
1118D discfile1 discfile2 ...
1119.P2
1120removes them from
1121.CW sam 's
1122memory (without effect on associated disc files).
1123Both these commands have a syntax for using the shell\u\s-4\&7\s+4\d
1124(the UNIX command interpreter) to generate the lists:
1125.P1
1126B <echo *.c
1127.P2
1128will add all C source files, and
1129.P1
1130B <grep -l variable *.c
1131.P2
1132will add all C source files referencing a particular variable
1133(the UNIX command
1134.CW grep\ -l
1135lists all files in its arguments that contain matches of
1136the specified regular expression).
1137Finally,
1138.CW D
1139without arguments deletes the current file.
1140.PP
1141There are two ways to change which file is current:
1142.P1
1143b filename
1144.P2
1145makes the named file current.
1146The
1147.CW B
1148command
1149does the same, but also adds any new files to
1150.CW sam 's
1151list.
1152(In practice, of course, the current file
1153is usually chosen by mouse actions, not by textual commands.)
1154The other way is to use a form of address that refers to files:
1155.P1
1156"\f2expression\fP" \f2address\fP
1157.P2
1158refers to the address evaluated in the file whose menu line
1159matches the expression (there must be exactly one match).
1160For example,
1161.P1
1162"peter.c" 3
1163.P2
1164refers to the third line of the file whose name matches
1165.CW peter.c .
1166This is most useful in the move
1167.CW m ) (
1168and copy
1169.CW t ) (
1170commands:
1171.P1
11720,$ t "peter.c" 0
1173.P2
1174makes a copy of the current file at the beginning of
1175.CW peter.c .
1176.PP
1177The
1178.CW X
1179command
1180is a looping construct, like
1181.CW x ,
1182that refers to files instead of strings:
1183.P1
1184X/\f2expression\fP/ \f2command\fP
1185.P2
1186runs the command in all
1187files whose menu lines match the expression.  The best example is
1188.P1
1189X/'/ w
1190.P2
1191which writes to disc all modified files.
1192.CW Y
1193is the complement of
1194.CW X :
1195it runs the command on all files whose menu lines don't match the expression:
1196.P1
1197Y/\e.c/ D
1198.P2
1199deletes all files that don't have
1200.CW \&.c
1201in their names, that is, it keeps all C source files and deletes the rest.
1202.PP
1203Braces allow commands to be grouped, so
1204.P1
1205{
1206	\f2command1\fP
1207	\f2command2\fP
1208}
1209.P2
1210is syntactically a single command that runs two commands.
1211Thus,
1212.P1
1213X/\e.c/ ,g/variable/ {
1214	f
1215	, x/.*\en/ g/variable/ p
1216}
1217.P2
1218finds all occurrences of
1219.CW variable
1220in C source files, and prints
1221out the file names and lines of each match.
1222The precise semantics of compound operations is discussed in the implementation
1223sections below.
1224.PP
1225Finally,
1226the undo command,
1227.CW u ,
1228undoes the last command,
1229no matter how many files were affected.
1230Multiple undo operations move further back in time, so
1231.P1
1232u
1233u
1234.P2
1235(which may be abbreviated
1236.CW u2 )
1237undoes the last two commands.  An undo may not be undone, however, nor
1238may any command that adds or deletes files.
1239Everything else is undoable, though, including for example
1240.CW e
1241commands:
1242.P1
1243e filename
1244u
1245.P2
1246restores the state of the file completely, including its name, dot,
1247and modified bit.  Because of the undo, potentially dangerous commands
1248are not guarded by confirmations.  Only
1249.CW D ,
1250which destroys the information necessary to restore itself, is protected.
1251It will not delete a modified file, but a second
1252.CW D
1253of the same file will succeed regardless.
1254The
1255.CW q
1256command, which exits
1257.CW sam ,
1258is similarly guarded.
1259.SH 2
1260Mouse Interface
1261.LP
1262.CW Sam
1263is most commonly run
1264connected to a bitmap display and mouse for interactive editing.
1265The only difference in the command language
1266between regular, mouse-driven
1267.CW sam
1268and
1269.CW sam\ -d
1270is that if an address
1271is provided without a command,
1272.CW sam\ -d
1273will print the text referenced by the address, but
1274regular
1275.CW sam
1276will highlight it on the screen \(em in fact,
1277dot is always highlighted (see Figure 2).
1278.WS 1
1279.KF
1280.XP fig3 2.04i
1281.Cs
1282Figure 2. A
1283.CW sam
1284window.  The scroll bar down the left
1285represents the file, with the bubble showing the fraction
1286visible in the window.
1287The scroll bar may be manipulated by the mouse for convenient browsing.
1288The current text,
1289which is highlighted, need not fit on a line.  Here it consists of one partial
1290line, one complete line, and final partial line.
1291.Ce
1292.KE
1293.PP
1294Each file may have zero or more windows open on the display.
1295At any time, only one window in all of
1296.CW sam
1297is the
1298.I
1299current window,
1300.R
1301that is, the window to which typing and mouse actions refer;
1302this may be the
1303.CW sam
1304window (that in which commands may be typed)
1305or one of the file windows.
1306When a file has multiple windows, the image of the file in each window
1307is always kept up to date.
1308The current file is the last file affected by a command,
1309so if the
1310.CW sam
1311window is current,
1312the current window is not a window on the current file.
1313However, each window on a file has its own value of dot,
1314and when switching between windows on a single file,
1315the file's value of dot is changed to that of the window.
1316Thus, flipping between windows behaves in the obvious, convenient way.
1317.PP
1318The mouse on the Blit has three buttons, numbered left to right.
1319Button 3 has a list of commands to manipulate windows,
1320followed by a list of `menu lines' exactly as printed by the
1321.CW f
1322command, one per file (not one per window).
1323These menu lines are sorted by file name.
1324If the list is long, the Blit menu software will make it more manageable
1325by generating a scrolling menu instead of an unwieldy long list.
1326Using the menu to select a file from the list makes that file the current
1327file, and the most recently current window in that file the current window.
1328But if that file is already current, selecting it in the menu cycles through
1329the windows on the file; this simple trick avoids a special menu to
1330choose windows on a file.
1331If there is no window open on the file,
1332.CW sam
1333changes the mouse cursor to prompt the user to create one.
1334.PP
1335The commands on the button 3 menu are straightforward (see Figure 3), and
1336are like the commands to manipulate windows in
1337.CW mux ,\u\s-4\&8\s+4\d
1338the Blit's window system.
1339.CW New
1340makes a new file, and gives it one empty window, whose size is determined
1341by a rectangle swept by the mouse.
1342.CW Zerox
1343prompts for a window to be selected, and
1344makes a clone of that window; this is how multiple windows are created on one file.
1345.CW Reshape
1346changes the size of the indicated window, and
1347.CW close
1348deletes it.  If that is the last window open on the file,
1349.CW close
1350first does a
1351.CW D
1352command on the file.
1353.CW Write
1354is identical to a
1355.CW w
1356command on the file; it is in the menu purely for convenience.
1357Finally,
1358.CW ~~sam~~
1359is a menu item that appears between the commands and the file names.
1360Selecting it makes the
1361.CW sam
1362window the current window,
1363causing subsequent typing to be interpreted as commands.
1364.KF
1365.XP fig2 2.74i
1366.Cs
1367Figure 3. The menu on button 3.
1368The black rectangle on the left is a scroll bar; the menu is limited to
1369the length shown to prevent its becoming unwieldy.
1370Above the
1371.CW ~~sam~~
1372line is a list of commands;
1373beneath it is a list of files, presented exactly as with the
1374.CW f
1375command.
1376.Ce
1377.KE
1378.PP
1379When
1380.CW sam
1381requests that a window be swept, in response to
1382.CW new ,
1383.CW zerox
1384or
1385.CW reshape ,
1386it changes the mouse cursor from the usual arrow to a box with
1387a small arrow.
1388In this state, the mouse may be used to indicate an arbitrary rectangle by
1389pressing button 3 at one corner and releasing it at the opposite corner.
1390More conveniently,
1391button 3 may simply be clicked,
1392whereupon
1393.CW sam
1394creates the maximal rectangle that contains the cursor
1395and abuts the
1396.CW sam
1397window.
1398By placing the
1399.CW sam
1400window in the middle of the screen, the user can define two regions (one above,
1401one below) in which stacked fully-overlapping
1402windows can be created with minimal fuss (see Figure 1).
1403This simple user interface trick makes window creation noticeably easier.
1404.PP
1405The cut-and-paste editor is essentially the same as that in Smalltalk-80.\u\s-4\&11\s+4\d
1406The text in dot is always highlighted on the screen.
1407When a character is typed it replaces dot, and sets dot to the null
1408string after the character.  Thus, ordinary typing inserts text.
1409Button 1 is used for selection:
1410pressing the button, moving the mouse, and lifting the button
1411selects (sets dot to) the text between the points where the
1412button was pressed and released.
1413Pressing and releasing at the same point selects a null string; this
1414is called clicking.  Clicking twice quickly, or
1415.I
1416double clicking,
1417.R
1418selects larger objects;
1419for example, double clicking in a word selects the word,
1420double clicking just inside an opening bracket selects the text
1421contained in the brackets (handling nested brackets correctly),
1422and similarly for
1423parentheses, quotes, and so on.
1424The double-clicking rules reflect a bias toward
1425programmers.
1426If
1427.CW sam
1428were intended more for word processing, double-clicks would probably
1429select linguistic structures such as sentences.
1430.PP
1431If button 1 is pressed outside the current window, it makes the indicated
1432window current.
1433This is the easiest way to switch between windows and files.
1434.PP
1435Pressing button 2 brings up a menu of editing functions (see Figure 4).
1436These mostly apply to the selected text:
1437.CW cut
1438deletes the selected text, and remembers it in a hidden buffer called the
1439.I
1440snarf buffer,
1441.R
1442.CW paste
1443replaces the selected text by the contents of the snarf buffer,
1444.CW snarf
1445just copies the selected text to the snarf buffer,
1446.CW look
1447searches forward for the next literal occurrence of the selected text, and
1448.CW <mux>
1449exchanges snarf buffers with the window system in which
1450.CW sam
1451is running.
1452Finally, the last regular expression used appears as a menu entry
1453to search
1454forward for the next occurrence of a match for the expression.
1455.WS 1
1456.KF
1457.XP fig4 1.20i
1458.Cs
1459Figure 4. The menu on button 2.
1460The bottom entry tracks the most recently used regular expression, which may
1461be literal text.
1462.Ce
1463.KE
1464.PP
1465The relationship between the command language and the mouse language is
1466entirely due to the equality of dot and the selected text chosen
1467with button 1 on the mouse.
1468For example, to make a set of changes in a C subroutine, dot can be
1469set by double clicking on the left brace that begins the subroutine,
1470which sets dot for the command language.
1471An address-free command then typed in the
1472.CW sam
1473window will apply only to the text between the opening and closing
1474braces of the function.
1475The idea is to select what you want, and then say what you want
1476to do with it, whether invoked by a menu selection or by a typed command.
1477And of course, the value of dot is highlighted on
1478the display after the command completes.
1479This relationship between mouse interface and command language
1480is clumsy to explain, but comfortable, even natural, in practice.
1481.SH
1482The Implementation
1483.LP
1484The next few sections describe how
1485.CW sam
1486is put together, first the host part,
1487then the inter-component communication,
1488then the terminal part.
1489After explaining how the command language is implemented,
1490the discussion follows (roughly) the path of a character
1491from the temporary file on disc to the screen.
1492The presentation centers on the data structures,
1493because that is how the program was designed and because
1494the algorithms are easy to provide, given the right data
1495structures.
1496.SH 2
1497Parsing and execution
1498.LP
1499The command language is interpreted by parsing each command with a
1500table-driven recursive
1501descent parser, and when a complete command is assembled, invoking a top-down
1502executor.
1503Most editors instead employ a simple character-at-a-time
1504lexical scanner.
1505Use of a parser makes it
1506easy and unambiguous to detect when a command is complete,
1507which has two advantages.
1508First, escape conventions such as backslashes to quote
1509multiple-line commands are unnecessary;  if the command isn't finished,
1510the parser keeps reading.  For example, a multiple-line append driven by an
1511.CW x
1512command is straightforward:
1513.P1
1514x/.*\en/ g/Peter/ a
1515one line about Peter
1516another line about Peter
1517\&.
1518.P2
1519Other UNIX editors would require a backslash after all but the last line.
1520.PP
1521The other advantage is specific to the two-process structure of
1522.CW sam .
1523The host process must decide when a command is completed so the
1524command interpreter can be called.  This problem is easily resolved
1525by having the lexical analyzer read the single stream of events from the
1526terminal, directly executing all typing and mouse commands,
1527but passing to the parser characters typed to the
1528.CW sam
1529command window.
1530This scheme is slightly complicated by the availability of cut-and-paste
1531editing in the
1532.CW sam
1533window, but that difficulty is resolved by applying the rules
1534used in
1535.CW mux :
1536when a newline is typed to the
1537.CW sam
1538window, all text between the newline and the previously typed newline
1539is made available to the parser.
1540This permits arbitrary editing to be done to a command before
1541typing newline and thereby requesting execution.
1542.PP
1543The parser is driven by a table because the syntax of addresses
1544and commands is regular enough
1545to be encoded compactly.  There are few special cases, such as the
1546replacement text in a substitution, so the syntax of almost all commands
1547can be encoded with a few flags.
1548These include whether the command allows an address (for example,
1549.CW e
1550does not), whether it takes a regular expression (as in
1551.CW x
1552and
1553.CW s ),
1554whether it takes replacement text (as in
1555.CW c
1556or
1557.CW i ),
1558which may be multi-line, and so on.
1559The internal syntax of regular expressions is handled by a separate
1560parser; a regular expression is a leaf of the command parse tree.
1561Regular expressions are discussed fully in the next section.
1562.PP
1563The parser table also has information about defaults, so the interpreter
1564is always called with a complete tree.  For example, the parser fills in
1565the implicit
1566.CW 0
1567and
1568.CW $
1569in the abbreviated address
1570.CW ,
1571(comma),
1572inserts a
1573.CW +
1574to the left of an unadorned regular expression in an address,
1575and provides the usual default address
1576.CW .
1577(dot) for commands that expect an address but are not given one.
1578.PP
1579Once a complete command is parsed, the evaluation is easy.
1580The address is evaluated left-to-right starting from the value of dot,
1581with a mostly ordinary expression evaluator.
1582Addresses, like many of the data structures in
1583.CW sam ,
1584are held in a C structure and passed around by value:
1585.P1
1586typedef long Posn;    /* Position in a file */
1587typedef struct Range{
1588        Posn    p1, p2;
1589}Range;
1590typedef struct Address{
1591        Range   r;
1592        File    *f;
1593}Address;
1594.P2
1595An address is encoded as a substring (character positions
1596.CW p1
1597to
1598.CW p2 )
1599in a file
1600.CW f .
1601(The data type
1602.CW File
1603is described in detail below.)
1604.PP
1605The address interpreter is an
1606.CW Address -valued
1607function that traverses the parse tree describing an address (the
1608parse tree for the address has type
1609.CW Addrtree ):
1610.P1
1611Address
1612address(ap, a, sign)
1613	Addrtree *ap;
1614	Address a;
1615	int sign;
1616{
1617	Address a2;
1618	do
1619		switch(ap->type){
1620		case '.':
1621			a=a.f->dot;
1622			break;
1623		case '$':
1624			a.r.p1=a.r.p2=a.f->nbytes;
1625			break;
1626		case '"':
1627			a=matchfile(a, ap->aregexp)->dot;
1628			break;
1629		case ',':
1630			a2=address(ap->right, a, 0);
1631			a=address(ap->left, a, 0);
1632			if(a.f!=a2.f || a2.r.p2<a.r.p1)
1633				error(Eorder);
1634			a.r.p2=a2.r.p2;
1635			return a;
1636		/* and so on */
1637		}
1638	while((ap=ap->right)!=0);
1639	return a;
1640}
1641.P2
1642.PP
1643Throughout, errors are handled by a non-local
1644.CW goto
1645(a
1646.CW setjmp/longjmp
1647in C terminology)
1648hidden in a routine called
1649.CW error
1650that immediately aborts the execution, retracts any
1651partially made changes (see the section below on `undoing'), and
1652returns to the top level of the parser.
1653The argument to
1654.CW error
1655is an enumeration type that
1656is translated to a terse but possibly helpful
1657message such as `?addresses out of order.'
1658Very common messages are kept short; for example the message for
1659a failed regular expression search is `?search.'
1660.PP
1661Character addresses such as
1662.CW #3
1663are trivial to implement, as the
1664.CW File
1665data structure is accessible by character number.
1666However,
1667.CW sam
1668keeps no information about the position of newlines \(em it is too
1669expensive to track dynamically \(em so line addresses are computed by reading
1670the file, counting newlines.  Except in very large files, this has proven
1671acceptable: file access is fast enough to make the technique practical,
1672and lines are not central to the structure of the command language.
1673.PP
1674The command interpreter, called
1675.CW cmdexec ,
1676is also straightforward.  The parse table includes a
1677function to call to interpret a particular command.  That function
1678receives as arguments
1679the calculated address
1680for the command
1681and the command tree (of type
1682.CW Cmdtree ),
1683which may contain information such as the subtree for compound commands.
1684Here, for example, is the function for the
1685.CW g
1686and
1687.CW v
1688commands:
1689.P1
1690int
1691g_cmd(a, cp)
1692	Address a;
1693	Cmdtree *cp;
1694{
1695	compile(cp->regexp);
1696	if(execute(a.f, a.r.p1, a.r.p2)!=(cp->cmdchar=='v')){
1697		a.f->dot=a;
1698		return cmdexec(a, cp->subcmd);
1699	}
1700	return TRUE;	/* cause execution to continue */
1701}
1702.P2
1703.CW Compile "" (
1704and
1705.CW execute
1706are part of the regular expression code, described in the next section.)
1707Because the parser and the
1708.CW File
1709data structure do most of the work, most commands
1710are similarly brief.
1711.SH 2
1712Regular expressions
1713.LP
1714The regular expression code in
1715.CW sam
1716is an interpreted, rather than compiled on-the-fly, implementation of Thompson's
1717non-deterministic finite automaton algorithm.\u\s-4\&12\s+4\d
1718The syntax and semantics of the expressions are as in the UNIX program
1719.CW egrep ,
1720including alternation, closures, character classes, and so on.
1721The only changes in the notation are two additions:
1722.CW \en
1723is translated to, and matches, a newline character, and
1724.CW @
1725matches any character.  In
1726.CW egrep ,
1727the character
1728.CW \&.
1729matches any character except newline, and in
1730.CW sam
1731the same rule seemed safest, to prevent idioms like
1732.CW \&.*
1733from spanning newlines.
1734.CW Egrep
1735expressions are arguably too complicated for an interactive editor \(em
1736certainly it would make sense if all the special characters were two-character
1737sequences, so that most of the punctuation characters wouldn't have
1738peculiar meanings \(em but for an interesting command language, full
1739regular expressions are necessary, and
1740.CW egrep
1741defines the full regular expression syntax for UNIX programs.
1742Also, it seemed superfluous to define a new syntax, since various UNIX programs
1743.CW ed , (
1744.CW egrep
1745and
1746.CW vi )
1747define too many already.
1748.PP
1749The expressions are compiled by a routine,
1750.CW compile ,
1751that generates the description of the non-deterministic finite state machine.
1752A second routine,
1753.CW execute ,
1754interprets the machine to generate the leftmost-longest match of the
1755expression in a substring of the file.
1756The algorithm is described elsewhere.\u\s-4\&12,13\s+4\d
1757.CW Execute
1758reports
1759whether a match was found, and sets a global variable,
1760of type
1761.CW Range ,
1762to the substring matched.
1763.PP
1764A trick is required to evaluate the expression in reverse, such as when
1765searching backwards for an expression.
1766For example,
1767.P1
1768-/P.*r/
1769.P2
1770looks backwards through the file for a match of the expression.
1771The expression, however, is defined for a forward search.
1772The solution is to construct a machine identical to the machine
1773for a forward search except for a reversal of all the concatenation
1774operators (the other operators are symmetric under direction reversal),
1775to exchange the meaning of the operators
1776.CW ^
1777and
1778.CW $ ,
1779and then to read the file backwards, looking for the
1780usual earliest longest match.
1781.PP
1782.CW Execute
1783generates only one match each time it is called.
1784To interpret looping constructs such as the
1785.CW x
1786command,
1787.CW sam
1788must therefore synchronize between
1789calls of
1790.CW execute
1791to avoid
1792problems with null matches.
1793For example, even given the leftmost-longest rule,
1794the expression
1795.CW a*
1796matches three times in the string
1797.CW ab
1798(the character
1799.CW a ,
1800the null string between the
1801.CW a
1802and
1803.CW b ,
1804and the final null string).
1805After returning a match for the
1806.CW a ,
1807.CW sam
1808must not match the null string before the
1809.CW b .
1810The algorithm starts
1811.CW execute
1812at the end of its previous match, and
1813if the match it returns
1814is null and abuts the previous match, rejects the match and advances
1815the initial position one character.
1816.SH 2
1817Memory allocation
1818.LP
1819The C language has no memory allocation primitives, although a standard
1820library routine,
1821.CW malloc ,
1822provides adequate service for simple programs.
1823For specific uses, however,
1824it can be better to write a custom allocator.
1825The allocator (or rather, pair of allocators) described here
1826work in both the terminal and host parts of
1827.CW sam .
1828They are designed for efficient manipulation of strings,
1829which are allocated and freed frequently and vary in length from essentially
1830zero to 32 Kbytes (very large strings are written to disc).
1831More important, strings may be large and change size often,
1832so to minimize memory usage it is helpful to reclaim and to coalesce the
1833unused portions of strings when they are truncated.
1834.PP
1835Objects to be allocated in
1836.CW sam
1837are of two flavors:
1838the first is C
1839.CW structs ,
1840which are small and often addressed by pointer variables;
1841the second is variable-sized arrays of characters
1842or integers whose
1843base pointer is always used to access them.
1844The memory allocator in
1845.CW sam
1846is therefore in two parts:
1847first, a traditional first-fit allocator that provides fixed storage for
1848.CW structs ;
1849and second, a garbage-compacting allocator that reduces storage
1850overhead for variable-sized objects, at the cost of some bookkeeping.
1851The two types of objects are allocated from adjoining arenas, with
1852the garbage-compacting allocator controlling the arena with higher addresses.
1853Separating into two arenas simplifies compaction and prevents fragmentation due
1854to immovable objects.
1855The access rules for garbage-compactable objects
1856(discussed in the next paragraph) allow them to be relocated, so when
1857the first-fit arena needs space, it moves the garbage-compacted arena
1858to higher addresses to make room.  Storage is therefore created only
1859at successively higher addresses, either when more garbage-compacted
1860space is needed or when the first-fit arena pushes up the other arena.
1861.PP
1862Objects that may be compacted declare to the
1863allocator a cell that is guaranteed to be the sole repository of the
1864address of the object whenever a compaction can occur.
1865The compactor can then update the address when the object is moved.
1866For example, the implementation of type
1867.CW List
1868(really a variable-length array)
1869is:
1870.P1
1871typedef struct List{
1872        int     nused;
1873        long    *ptr;
1874}List;
1875.P2
1876The
1877.CW ptr
1878cell must always be used directly, and never copied.  When a
1879.CW List
1880is to be created the
1881.CW List
1882structure is allocated in the ordinary first-fit arena
1883and its
1884.CW ptr
1885is allocated in the garbage-compacted arena.
1886A similar data type for strings, called
1887.CW String ,
1888stores variable-length character arrays of up to 32767 elements.
1889.PP
1890A related matter of programming style:
1891.CW sam
1892frequently passes structures by value, which
1893simplifies the code.
1894Traditionally, C programs have
1895passed structures by reference, but implicit allocation on
1896the stack is easier to use.
1897Structure passing is a relatively new feature of C
1898(it is not in the
1899standard reference manual for C\u\s-4\&14\s+4\d), and is poorly supported in most
1900commercial C compilers.
1901It's convenient and expressive, though,
1902and simplifies memory management by
1903avoiding the allocator altogether
1904and eliminating pointer aliases.
1905.SH 2
1906Data structures for manipulating files
1907.LP
1908Experience with
1909.CW jim
1910showed that the requirements
1911of the file data structure were few, but strict.
1912First, files need to be read and written quickly;
1913adding a fresh file must be painless.
1914Second, the implementation must place no arbitrary upper limit on
1915the number or sizes of files.  (It should be practical to edit many files,
1916and files up to megabytes in length should be handled gracefully.)
1917This implies that files be stored on disc, not in main memory.
1918(Aficionados of virtual memory may argue otherwise, but the
1919implementation of virtual
1920memory in our system is not something to depend on
1921for good performance.)
1922Third, changes to files need be made by only two primitives:
1923deletion and insertion.
1924These are inverses of each other,
1925which simplifies the implementation of the undo operation.
1926Finally,
1927it must be easy and efficient to access the file, either
1928forwards or backwards, a byte at a time.
1929.PP
1930The
1931.CW File
1932data type is constructed from three simpler data structures that hold arrays
1933of characters.
1934Each of these types has an insertion and deletion operator, and the
1935insertion and deletion operators of the
1936.CW File
1937type itself are constructed from them.
1938.PP
1939The simplest type is the
1940.CW String ,
1941which is used to hold strings in main memory.
1942The code that manages
1943.CW Strings
1944guarantees that they will never be longer
1945than some moderate size, and in practice they are rarely larger than 8 Kbytes.
1946.CW Strings
1947have two purposes: they hold short strings like file names with little overhead,
1948and because they are deliberately small, they are efficient to modify.
1949They are therefore used as the data structure for in-memory caches.
1950.PP
1951The disc copy of the file is managed by a data structure called a
1952.CW Disc ,
1953which corresponds to a temporary file.  A
1954.CW Disc
1955has no storage in main memory other than bookkeeping information;
1956the actual data being held is all on the disc.
1957To reduce the number of open files needed,
1958.CW sam
1959opens a dozen temporary UNIX files and multiplexes the
1960.CW Discs
1961upon them.
1962This permits many files to
1963be edited; the entire
1964.CW sam
1965source (48 files) may be edited comfortably with a single
1966instance of
1967.CW sam .
1968Allocating one temporary file per
1969.CW Disc
1970would strain the operating system's limit on the number of open files.
1971Also, spreading the traffic among temporary files keeps the files shorter,
1972and shorter files are more efficiently implemented by the UNIX
1973I/O subsystem.
1974.PP
1975A
1976.CW Disc
1977is an array of fixed-length blocks, each of which contains
1978between 1 and 4096 characters of active data.
1979(The block size of our UNIX file system is 4096 bytes.)
1980The block addresses within the temporary file and the length of each
1981block are stored in a
1982.CW List .
1983When changes are made the live part of blocks may change size.
1984Blocks are created and coalesced when necessary to try to keep the sizes
1985between 2048 and 4096 bytes.
1986An actively changing part of the
1987.CW Disc
1988therefore typically has about a kilobyte of slop that can be
1989inserted or deleted
1990without changing more than one block or affecting the block order.
1991When an insertion would overflow a block, the block is split, a new one
1992is allocated to receive the overflow, and the memory-resident list of blocks
1993is rearranged to reflect the insertion of the new block.
1994.PP
1995Obviously, going to the disc for every modification to the file is
1996prohibitively expensive.
1997The data type
1998.CW Buffer
1999consists of a
2000.CW Disc
2001to hold the data and a
2002.CW String
2003that acts as a cache.
2004This is the first of a series of caches throughout the data structures in
2005.CW sam.
2006The caches not only improve performance, they provide a way to organize
2007the flow of data, particularly in the communication between the host
2008and terminal.
2009This idea is developed below, in the section on communications.
2010.PP
2011To reduce disc traffic, changes to a
2012.CW Buffer
2013are mediated by a variable-length string, in memory, that acts as a cache.
2014When an insertion or deletion is made to a
2015.CW Buffer ,
2016if the change can be accommodated by the cache, it is done there.
2017If the cache becomes bigger than a block because of an insertion,
2018some of it is written to the
2019.CW Disc
2020and deleted from the cache.
2021If the change does not intersect the cache, the cache is flushed.
2022The cache is only loaded at the new position if the change is smaller than a block;
2023otherwise, it is sent directly to the
2024.CW Disc .
2025This is because
2026large changes are typically sequential,
2027whereupon the next change is unlikely to overlap the current one.
2028.PP
2029A
2030.CW File
2031comprises a
2032.CW String
2033to hold the file name and some ancillary data such as dot and the modified bit.
2034The most important components, though, are a pair of
2035.CW Buffers ,
2036one called the transcript and the other the contents.
2037Their use is described in the next section.
2038.PP
2039The overall structure is shown in Figure 5.
2040Although it may seem that the data is touched many times on its
2041way from the
2042.CW Disc ,
2043it is read (by one UNIX system call) directly into the cache of the
2044associated
2045.CW Buffer ;
2046no extra copy is done.
2047Similarly, when flushing the cache, the text is written
2048directly from the cache to disc.
2049Most operations act directly on the text in the cache.
2050A principle applied throughout
2051.CW sam
2052is that the fewer times the data is copied, the faster the program will run
2053(see also the paper by Waite\u\s-4\&15\s+4\d).
2054.KF
2055.PS
2056copy "fig5.pic"
2057.PE
2058.Cs
2059Figure 5. File data structures.
2060The temporary files are stored in the standard repository for such files
2061on the host system.
2062.Ce
2063.KE
2064.PP
2065The contents of a
2066.CW File
2067are accessed by a routine that
2068copies to a buffer a substring of a file starting at a specified offset.
2069To read a byte at a time, a
2070.CW File "" per-
2071array is loaded starting from a specified initial position,
2072and bytes may then be read from the array.
2073The implementation is done by a macro similar to the C standard I/O
2074.CW getc
2075macro.\u\s-4\&14\s+4\d
2076Because the reading may be done at any address, a minor change to the
2077macro allows the file to be read backwards.
2078This array is read-only; there is no
2079.CW putc .
2080.SH 2
2081Doing and undoing
2082.LP
2083.CW Sam
2084has an unusual method for managing changes to files.
2085The command language makes it easy to specify multiple variable-length changes
2086to a file millions of bytes long, and such changes
2087must be made efficiently if the editor is to be practical.
2088The usual techniques for inserting and deleting strings
2089are inadequate under these conditions.
2090The
2091.CW Buffer
2092and
2093.CW Disc
2094data structures are designed for efficient random access to long strings,
2095but care must be taken to avoid super-linear behavior when making
2096many changes simultaneously.
2097.PP
2098.CW Sam
2099uses a two-pass algorithm for making changes, and treats each file as a database
2100against which transactions are registered.
2101Changes are not made directly to the contents.
2102Instead, when a command is started, a `mark' containing
2103a sequence number is placed in the transcript
2104.CW Buffer ,
2105and each change made to the file, either an insertion or deletion
2106or a change to the file name,
2107is appended to the end of the transcript.
2108When the command is complete, the transcript is rewound to the
2109mark and applied to the contents.
2110.PP
2111One reason for separating evaluation from
2112application in this way is to simplify tracking the addresses of changes
2113made in the middle of a long sequence.
2114The two-pass algorithm also allows all changes to apply to the
2115.I original
2116data: no change can affect another change made in the same command.
2117This is particularly important when evaluating an
2118.CW x
2119command because it prevents regular expression matches
2120from stumbling over changes made earlier in the execution.
2121Also, the two-pass
2122algorithm is cleaner than the way other UNIX editors allow changes to
2123affect each other;
2124for example,
2125.CW ed 's
2126idioms to do things like delete every other line
2127depend critically on the implementation.
2128Instead,
2129.CW sam 's
2130simple model, in which all changes in a command occur effectively
2131simultaneously, is easy to explain and to understand.
2132.PP
2133The records in the transcript are of the form ``delete substring from
2134locations
2135123 to 456'' and ``insert 11 characters `hello there' at location 789.''
2136(It is an error if the changes are not at monotonically greater
2137positions through the file.)
2138While the update is occurring, these numbers must be
2139offset by earlier changes, but that is straightforward and
2140local to the update routine;
2141moreover, all the numbers have been computed
2142before the first is examined.
2143.PP
2144Treating the file as a transaction system has another advantage:
2145undo is trivial.
2146All it takes is to invert the transcript after it has been
2147implemented, converting insertions
2148into deletions and vice versa, and saving them in a holding
2149.CW Buffer .
2150The `do' transcript can then be deleted from
2151the transcript
2152.CW Buffer
2153and replaced by the `undo' transcript.
2154If an undo is requested, the transcript is rewound and the undo transcript
2155executed.
2156Because the transcript
2157.CW Buffer
2158is not truncated after each command, it accumulates
2159successive changes.
2160A sequence of undo commands
2161can therefore back up the file arbitrarily,
2162which is more helpful than the more commonly implemented self-inverse form of undo.
2163.CW Sam "" (
2164provides no way to undo an undo, but if it were desired,
2165it would be easy to provide by re-interpreting the `do' transcript.)
2166Each mark in the transcript contains a sequence number and the offset into
2167the transcript of the previous mark, to aid in unwinding the transcript.
2168Marks also contain the value of dot and the modified bit so these can be
2169restored easily.
2170Undoing multiple files is easy; it merely demands undoing all files whose
2171latest change has the same sequence number as the current file.
2172.PP
2173Another benefit of having a transcript is that errors encountered in the middle
2174of a complicated command need not leave the files in an intermediate state.
2175By rewinding the transcript to the mark beginning the command,
2176the partial command can be trivially undone.
2177.PP
2178When the update algorithm was first implemented, it was unacceptably slow,
2179so a cache was added to coalesce nearby changes,
2180replacing multiple small changes by a single larger one.
2181This reduced the number
2182of insertions into the transaction
2183.CW Buffer ,
2184and made a dramatic improvement in performance,
2185but made it impossible
2186to handle changes in non-monotonic order in the file; the caching method
2187only works if changes don't overlap.
2188Before the cache was added, the transaction could in principle be sorted
2189if the changes were out of order, although
2190this was never done.
2191The current status is therefore acceptable performance with a minor
2192restriction on global changes, which is sometimes, but rarely, an annoyance.
2193.PP
2194The update algorithm obviously paws the data more than simpler
2195algorithms, but it is not prohibitively expensive;
2196the caches help.
2197(The principle of avoiding copying the data is still honored here,
2198although not as piously:
2199the data is moved from contents' cache to
2200the transcript's all at once and through only one internal buffer.)
2201Performance figures confirm the efficiency.
2202To read from a dead start a hundred kilobyte file on a VAX-11/750
2203takes 1.4 seconds of user time, 2.5 seconds of system time,
2204and 5 seconds of real time.
2205Reading the same file in
2206.CW ed
2207takes 6.0 seconds of user time, 1.7 seconds of system time,
2208and 8 seconds of real time.
2209.CW Sam
2210uses about half the CPU time.
2211A more interesting example is the one stated above:
2212inserting a character between every pair of characters in the file.
2213The
2214.CW sam
2215command is
2216.P1
2217,y/@/ a/x/
2218.P2
2219and takes 3 CPU seconds per kilobyte of input file, of which
2220about a third is spent in the regular expression code.
2221This translates to about 500 changes per second.
2222.CW Ed
2223takes 1.5 seconds per kilobyte to make a similar change (ignoring newlines),
2224but cannot undo it.
2225The same example in
2226.CW ex ,\u\s-4\&9\s+4\d
2227a variant of
2228.CW ed
2229done at the University of California at Berkeley,
2230which allows one level of undoing, again takes 3 seconds.
2231In summary,
2232.CW sam 's
2233performance is comparable to that of other UNIX editors, although it solves
2234a harder problem.
2235.SH 2
2236Communications
2237.LP
2238The discussion so far has described the implementation of the host part of
2239.CW sam ;
2240the next few sections explain how a machine with mouse and bitmap display
2241can be engaged to improve interaction.
2242.CW Sam
2243is not the first editor to be written as two processes,\u\s-4\&16\s+4\d
2244but its implementation
2245has some unusual aspects.
2246.PP
2247There are several ways
2248.CW sam 's
2249host and terminal parts may be connected.
2250The first and simplest is to forgo the terminal part and use the host
2251part's command language to edit text on an ordinary terminal.
2252This mode is invoked by starting
2253.CW sam
2254with the
2255.CW -d
2256option.
2257With no options,
2258.CW sam
2259runs separate host and terminal programs,
2260communicating with a message protocol over the physical
2261connection that joins them.
2262Typically, the connection is an RS-232 link between a Blit
2263(the prototypical display for
2264.CW sam )
2265and a host running
2266the Ninth Edition of the UNIX operating system.\u\s-4\&8\s+4\d
2267(This is the version of the system used in the Computing Sciences Research
2268Center at AT&T Bell Laboratories [now Lucent Technologies, Bell Labs], where I work.  Its relevant
2269aspects are discussed in the Blit paper.\u\s-4\&1\s+4\d)
2270The implementation of
2271.CW sam
2272for the SUN computer runs both processes on the same machine and
2273connects them by a pipe.
2274.PP
2275The low bandwidth of an RS-232 link
2276necessitated the split between
2277the two programs.
2278The division is a mixed blessing:
2279a program in two parts is much harder to write and to debug
2280than a self-contained one,
2281but the split makes several unusual configurations possible.
2282The terminal may be physically separated from the host, allowing the conveniences
2283of a mouse and bitmap display to be taken home while leaving the files at work.
2284It is also possible to run the host part on a remote machine:
2285.P1
2286sam -r host
2287.P2
2288connects to the terminal in the usual way, and then makes a call
2289across the network to establish the host part of
2290.CW sam
2291on the named machine.
2292Finally, it cross-connects the I/O to join the two parts.
2293This allows
2294.CW sam
2295to be run on machines that do not support bitmap displays;
2296for example,
2297.CW sam
2298is the editor of choice on our Cray X-MP/24.
2299.CW Sam
2300.CW -r
2301involves
2302.I three
2303machines: the remote host, the terminal, and the local host.
2304The local host's job is simple but vital: it passes the data
2305between the remote host and terminal.
2306.PP
2307The host and terminal exchange messages asynchronously
2308(rather than, say, as remote procedure calls) but there is no
2309error detection or correction
2310because, whatever the configuration, the connection is reliable.
2311Because the terminal handles mundane interaction tasks such as
2312popping up menus and interpreting the responses, the messages are about
2313data, not actions.
2314For example, the host knows nothing about what is displayed on the screen,
2315and when the user types a character, the message sent to the host says
2316``insert a one-byte string at location 123 in file 7,'' not ``a character
2317was typed at the current position in the current file.''
2318In other words, the messages look very much like the transaction records
2319in the transcripts.
2320.PP
2321Either the host or terminal part of
2322.CW sam
2323may initiate a change to a file.
2324The command language operates on the host, while typing and some
2325mouse operations are executed directly in the terminal to optimize response.
2326Changes initiated by the host program must be transmitted to the terminal,
2327and
2328vice versa.
2329(A token is exchanged to determine which end is in control,
2330which means that characters typed while a time-consuming command runs
2331must be buffered and do not appear until the command is complete.)
2332To maintain consistent information,
2333the host and terminal track changes through a per-file
2334data structure that records what portions of the file
2335the terminal has received.
2336The data structure, called a
2337.CW Rasp
2338(a weak pun: it's a file with holes)
2339is held and updated by both the host and terminal.
2340A
2341.CW Rasp
2342is a list of
2343.CW Strings
2344holding those parts of the file known to the terminal,
2345separated by counts of the number of bytes in the interstices.
2346Of course, the host doesn't keep a separate copy of the data (it only needs
2347the lengths of the various pieces),
2348but the structure is the same on both ends.
2349.PP
2350The
2351.CW Rasp
2352in the terminal doubles as a cache.
2353Since the terminal keeps the text for portions of the file it has displayed,
2354it need not request data from the host when revisiting old parts of the file
2355or redrawing obscured windows, which speeds things up considerably
2356over low-speed links.
2357.PP
2358It's trivial for the terminal to maintain its
2359.CW Rasp ,
2360because all changes made on the terminal apply to parts of the file
2361already loaded there.
2362Changes made by the host are compared against the
2363.CW Rasp
2364during the update sequence after each command.
2365Small changes to pieces of the file loaded in the terminal
2366are sent in their entirety.
2367Larger changes, and changes that fall entirely in the holes,
2368are transmitted as messages without literal data:
2369only the lengths of the deleted and inserted strings are transmitted.
2370When a command is completed, the terminal examines its visible
2371windows to see if any holes in their
2372.CW Rasps
2373intersect the visible portion of the file.
2374It then requests the missing data from the host,
2375along with up to 512 bytes of surrounding data, to minimize
2376the number of messages when visiting a new portion of the file.
2377This technique provides a kind of two-level lazy evaluation for the terminal.
2378The first level sends a minimum of information about
2379parts of the file not being edited interactively;
2380the second level waits until a change is displayed before
2381transmitting the new data.
2382Of course,
2383performance is also helped by having the terminal respond immediately to typing
2384and simple mouse requests.
2385Except for small changes to active pieces of the file, which are
2386transmitted to the terminal without negotiation,
2387the terminal is wholly responsible for deciding what is displayed;
2388the host uses the
2389.CW Rasp
2390only to tell the terminal what might be relevant.
2391.PP
2392When a change is initiated by the host,
2393the messages to the terminal describing the change
2394are generated by the routine that applies the transcript of the changes
2395to the contents of the
2396.CW File .
2397Since changes are undone by the same update routine,
2398undoing requires
2399no extra code in the communications;
2400the usual messages describing changes to the file are sufficient
2401to back up the screen image.
2402.PP
2403The
2404.CW Rasp
2405is a particularly good example of the way caches are used in
2406.CW sam .
2407First, it facilitates access to the active portion of the text by placing
2408the busy text in main memory.
2409In so doing, it provides efficient access
2410to a large data structure that does not fit in memory.
2411Since the form of data is to be imposed by the user, not by the program,
2412and because characters will frequently be scanned sequentially,
2413files are stored as flat objects.
2414Caches help keep performance good and linear when working with such
2415data.
2416.PP
2417Second, the
2418.CW Rasp
2419and several of the other caches have some
2420.I read-ahead;
2421that is, the cache is loaded with more information than is needed for
2422the job immediately at hand.
2423When manipulating linear structures, the accesses are usually sequential,
2424and read-ahead can significantly reduce the average time to access the
2425next element of the object.
2426Sequential access is a common mode for people as well as programs;
2427consider scrolling through a document while looking for something.
2428.PP
2429Finally, like any good data structure,
2430the cache guides the algorithm, or at least the implementation.
2431The
2432.CW Rasp
2433was actually invented to control the communications between the host and
2434terminal parts, but I realized very early that it was also a form of
2435cache.  Other caches were more explicitly intended to serve a double
2436purpose: for example, the caches in
2437.CW Files
2438that coalesce updates not only reduce traffic to the
2439transcript and contents
2440.CW Buffers ,
2441they also clump screen updates so that complicated changes to the
2442screen are achieved in
2443just a few messages to the terminal.
2444This saved me considerable work: I did not need to write special
2445code to optimize the message traffic to the
2446terminal.
2447Caches pay off in surprising ways.
2448Also, they tend to be independent, so their performance improvements
2449are multiplicative.
2450.SH 2
2451Data structures in the terminal
2452.LP
2453The terminal's job is to display and to maintain a consistent image of
2454pieces of the files being edited.
2455Because the text is always in memory, the data structures are
2456considerably simpler than those in the host part.
2457.PP
2458.CW Sam
2459typically has far more windows than does
2460.CW mux ,
2461the window system within which its Blit implementation runs.
2462.CW Mux
2463has a fairly small number of asynchronously updated windows;
2464.CW sam
2465needs a large number of synchronously updated windows that are
2466usually static and often fully obscured.
2467The different tradeoffs guided
2468.CW sam
2469away from the memory-intensive implementation of windows, called
2470.CW Layers ,\u\s-4\&17\s+4\d
2471used in
2472.CW mux.
2473Rather than depending on a complete bitmap image of the display for each window,
2474.CW sam
2475regenerates the image from its in-memory text
2476(stored in the
2477.CW Rasp )
2478when necessary, although it will use such an image if it is available.
2479Like
2480.CW Layers ,
2481though,
2482.CW sam
2483uses the screen bitmap as active storage in which to update the image using
2484.CW bitblt .\u\s-4\&18,19\s+4\d
2485The resulting organization, pictured in Figure 6,
2486has a global array of windows, called
2487.CW Flayers ,
2488each of which holds an image of a piece of text held in a data structure
2489called a
2490.CW Frame ,
2491which in turn represents
2492a rectangular window full of text displayed in some
2493.CW Bitmap .
2494Each
2495.CW Flayer
2496appears in a global list that orders them all front-to-back
2497on the display, and simultaneously as an element of a per-file array
2498that holds all the open windows for that file.
2499The complement in the terminal of the
2500.CW File
2501on the host is called a
2502.CW Text ;
2503each connects its
2504.CW Flayers
2505to the associated
2506.CW Rasp .
2507.KF
2508.PS
2509copy "fig6.pic"
2510.PE
2511.Cs
2512Figure 6. Data structures in the terminal.
2513.CW Flayers
2514are also linked together into a front-to-back list.
2515.CW Boxes
2516are discussed in the next section.
2517.Ce
2518.KE
2519.PP
2520The
2521.CW Bitmap
2522for a
2523.CW Frame
2524contains the image of the text.
2525For a fully visible window, the
2526.CW Bitmap
2527will be the screen (or at least the
2528.CW Layer
2529in which
2530.CW sam
2531is being run),
2532while for partially obscured windows the
2533.CW Bitmap
2534will be off-screen.
2535If the window is fully obscured, the
2536.CW Bitmap
2537will be null.
2538.PP
2539The
2540.CW Bitmap
2541is a kind of cache.
2542When making changes to the display, most of the original image will
2543look the same in the final image, and the update algorithms exploit this.
2544The
2545.CW Frame
2546software updates the image in the
2547.CW Bitmap
2548incrementally; the
2549.CW Bitmap
2550is not just an image, it is a data structure.\u\s-4\&18,19\s+4\d
2551The job of the software that updates the display is therefore
2552to use as much as possible of the existing image (converting the
2553text from ASCII characters to pixels is expensive) in a sort of two-dimensional
2554string insertion algorithm.
2555The details of this process are described in the next section.
2556.PP
2557The
2558.CW Frame
2559software has no code to support overlapping windows;
2560its job is to keep a single
2561.CW Bitmap
2562up to date.
2563It falls to the
2564.CW Flayer
2565software to multiplex the various
2566.CW Bitmaps
2567onto the screen.
2568The problem of maintaining overlapping
2569.CW Flayers
2570is easier than for
2571.CW Layers \u\s-4\&17\s+4\d
2572because changes are made synchronously and because the contents of the window
2573can be reconstructed from the data stored in the
2574.CW Frame ;
2575the
2576.CW Layers
2577software
2578makes no such assumptions.
2579In
2580.CW sam ,
2581the window being changed is almost always fully visible, because the current
2582window is always fully visible, by construction.
2583However, when multi-file changes are being made, or when
2584more than one window is open on a file,
2585it may be necessary to update partially obscured windows.
2586.PP
2587There are three cases: the window is
2588fully visible, invisible (fully obscured), or partially visible.
2589If fully visible, the
2590.CW Bitmap
2591is part of the screen, so when the
2592.CW Flayer
2593update routine calls the
2594.CW Frame
2595update routine, the screen will be updated directly.
2596If the window is invisible,
2597there is no associated
2598.CW Bitmap ,
2599and all that is necessary is to update the
2600.CW Frame
2601data structure, not the image.
2602If the window is partially visible, the
2603.CW Frame
2604routine is called to update the image in the off-screen
2605.CW Bitmap ,
2606which may require regenerating it from the text of the window.
2607The
2608.CW Flayer
2609code then clips this
2610.CW Bitmap
2611against the
2612.CW Bitmaps
2613of all
2614.CW Frames
2615in front of the
2616.CW Frame
2617being modified, and the remainder is copied to the display.
2618.PP
2619This is much faster than recreating the image off-screen
2620for every change, or clipping all the changes made to the image
2621during its update.
2622Unfortunately, these caches can also consume prohibitive amounts of
2623memory, so they are freed fairly liberally \(em after every change to the
2624front-to-back order of the
2625.CW Flayers .
2626The result is that
2627the off-screen
2628.CW Bitmaps
2629exist only while multi-window changes are occurring,
2630which is the only time the performance improvement they provide is needed.
2631Also, the user interface causes fully-obscured windows to be the
2632easiest to make \(em
2633creating a canonically sized and placed window requires only a button click
2634\(em which reduces the need for caching still further.
2635.PP
2636.SH 2
2637Screen update
2638.LP
2639Only two low-level primitives are needed for incremental update:
2640.CW bitblt ,
2641which copies rectangles of pixels, and
2642.CW string
2643(which in turn calls
2644.CW bitblt ),
2645which draws a null-terminated character string in a
2646.CW Bitmap .
2647A
2648.CW Frame
2649contains a list of
2650.CW Boxes ,
2651each of which defines a horizontal strip of text in the window
2652(see Figure 7).
2653A
2654.CW Box
2655has a character string
2656.CW str ,
2657and a
2658.CW Rectangle
2659.CW rect
2660that defines the location of the strip in the window.
2661(The text in
2662.CW str
2663is stored in the
2664.CW Box
2665separately from the
2666.CW Rasp
2667associated with the window's file, so
2668.CW Boxes
2669are self-contained.)
2670The invariant is that
2671the image of the
2672.CW Box
2673can be reproduced by calling
2674.CW string
2675with argument
2676.CW str
2677to draw the string in
2678.CW rect ,
2679and the resulting picture fits perfectly within
2680.CW rect .
2681In other words, the
2682.CW Boxes
2683define the tiling of the window.
2684The tiling may be complicated by long lines of text, which
2685are folded onto the next line.
2686Some editors use horizontal scrolling to avoid this complication,
2687but to be comfortable this technique requires that lines not be
2688.I too
2689long;
2690.CW sam
2691has no such restriction.
2692Also, and perhaps more importantly, UNIX programs and terminals traditionally fold
2693long lines to make their contents fully visible.
2694.PP
2695Two special kinds of
2696.CW Boxes
2697contain a single
2698character: either a newline or a tab.
2699Newlines and tabs are white space.
2700A newline
2701.CW Box
2702always extends to the right edge of the window,
2703forcing the following
2704.CW Box
2705to the next line.
2706The width of a tab depends on where it is located:
2707it forces the next
2708.CW Box
2709to begin at a tab location.
2710Tabs also
2711have a minimum width equivalent to a blank (blanks are
2712drawn by
2713.CW string
2714and are not treated specially); newlines have a minimum width of zero.
2715.KF
2716.PS
2717copy "fig7.pic"
2718.PE
2719.sp .5
2720.Cs
2721Figure 7. A line of text showing its
2722.CW Boxes .
2723The first two blank
2724.CW Boxes
2725contain tabs; the last contains a newline.
2726Spaces are handled as ordinary characters.
2727.Ce
2728.KE
2729.PP
2730The update algorithms always use the
2731.CW Bitmap
2732image of the text (either the display or cache
2733.CW Bitmap );
2734they never examine the characters within a
2735.CW Box
2736except when the
2737.CW Box
2738needs to be split in two.
2739Before a change, the window consists of a tiling of
2740.CW Boxes ;
2741after the change the window is tiled differently.
2742The update algorithms rearrange the tiles in place, without
2743backup storage.
2744The algorithms are not strictly optimal \(em for example, they can
2745clear a pixel that is later going to be written upon \(em
2746but they never move a tile that doesn't need to be moved,
2747and they move each tile at most once.
2748.CW Frinsert
2749on a Blit can absorb over a thousand characters a second if the strings
2750being inserted are a few tens of characters long.
2751.PP
2752Consider
2753.CW frdelete .
2754Its job is to delete a substring from a
2755.CW Frame
2756and restore the image of the
2757.CW Frame .
2758The image of a substring has a peculiar shape (see Figure 2) comprising
2759possibly a partial line,
2760zero or more full lines,
2761and possibly a final partial line.
2762For reference, call this the
2763.I
2764Z-shape.
2765.R
2766.CW Frdelete
2767begins by splitting, if necessary, the
2768.CW Boxes
2769containing the ends of
2770the substring so the substring begins and ends on
2771.CW Box
2772boundaries.
2773Because the substring is being deleted, its image is not needed,
2774so the Z-shape is then cleared.
2775Then, tiles (that is, the images of
2776.CW Boxes )
2777are copied, using
2778.CW bitblt ,
2779from immediately after the Z-shape to
2780the beginning of the Z-shape,
2781resulting in a new Z-shape.
2782.CW Boxes "" (
2783whose contents would span two lines in the new position must first be split.)
2784.PP
2785Copying the remainder of the
2786.CW Frame
2787tile by tile
2788this way will clearly accomplish the deletion but eventually,
2789typically when the copying algorithm encounters a tab or newline,
2790the old and new
2791.CW x
2792coordinates of the tile
2793to be copied are the same.
2794This correspondence implies
2795that the Z-shape has its beginning and ending edges aligned
2796vertically, and a sequence of at most two
2797.CW bitblts
2798can be used to copy the remaining tiles.
2799The last step is to clear out the resulting empty space at the bottom
2800of the window;
2801the number of lines to be cleared is the number of complete lines in the
2802Z-shape closed by the final
2803.CW bitblts.
2804The final step is to merge horizontally adjacent
2805.CW Boxes
2806of plain text.
2807The complete source to
2808.CW frdelete
2809is less than 100 lines of C.
2810.PP
2811.CW frinsert
2812is more complicated because it must do four passes:
2813one to construct the
2814.CW Box
2815list for the inserted string,
2816one to reconnoitre,
2817one to copy (in opposite order to
2818.CW frdelete )
2819the
2820.CW Boxes
2821to make the hole for the new text,
2822and finally one to copy the new text into place.
2823Overall, though,
2824.CW frinsert
2825has a similar flavor to
2826.CW frdelete ,
2827and needn't be described further.
2828.CW Frinsert
2829and its subsidiary routines comprise 211 lines of C.
2830.PP
2831The terminal source code is 3024 lines of C,
2832and the host source is 5797 lines.
2833.SH
2834Discussion
2835.SH 2
2836History
2837.LP
2838The immediate ancestor of
2839.CW sam
2840was the original text editor for the Blit, called
2841.CW jim .
2842.CW Sam
2843inherited
2844.CW jim 's
2845two-process structure and mouse language almost unchanged, but
2846.CW jim
2847suffered from several drawbacks that were addressed in the design of
2848.CW sam .
2849The most important of these was the lack of a command language.
2850Although
2851.CW jim
2852was easy to use for simple editing, it provided no direct help with
2853large or repetitive editing tasks.  Instead, it provided a command to pass
2854selected text through a shell pipeline,
2855but this was no more satisfactory than could be expected of a stopgap measure.
2856.PP
2857.CW Jim
2858was written primarily as a vehicle for experimenting with a mouse-based
2859interface to text, and the experiment was successful.
2860.CW Jim
2861had some spin-offs:
2862.CW mux ,
2863the second window system for the Blit, is essentially a multiplexed
2864version of the terminal part of
2865.CW jim ;
2866and the debugger
2867.CW pi 's
2868user interface\u\s-4\&20\s+4\d was closely modeled on
2869.CW jim 's.
2870But after a couple of years,
2871.CW jim
2872had become difficult to maintain and limiting to use,
2873and its replacement was overdue.
2874.PP
2875I began the design of
2876.CW sam
2877by asking
2878.CW jim
2879customers what they wanted.
2880This was probably a mistake; the answers were essentially a list of features
2881to be found in other editors, which did not provide any of the
2882guiding principles I was seeking.
2883For instance, one common request was for a ``global substitute,''
2884but no one suggested how to provide it within a cut-and-paste editor.
2885I was looking for a scheme that would
2886support such specialized features comfortably in the context of some
2887general command language.
2888Ideas were not forthcoming, though, particularly given my insistence
2889on removing all limits on file sizes, line lengths and so on.
2890Even worse, I recognized that, since the mouse could easily
2891indicate a region of the screen that was not an integral number of lines,
2892the command language would best forget about newlines altogether,
2893and that meant the command language had to treat the file as a single
2894string, not an array of lines.
2895.PP
2896Eventually, I decided that thinking was not getting me very far and it was
2897time to try building.
2898I knew that the terminal part could be built easily \(em
2899that part of
2900.CW jim
2901behaved acceptably well \(em and that most of the hard work was going
2902to be in the host part: the file interface, command interpreter and so on.
2903Moreover, I had some ideas about how the architecture of
2904.CW jim
2905could be improved without destroying its basic structure, which I liked
2906in principle but which hadn't worked out as well as I had hoped.
2907So I began by designing the file data structure,
2908starting with the way
2909.CW jim
2910worked \(em comparable to a single structure merging
2911.CW Disc
2912and
2913.CW Buffer ,
2914which I split to make the cache more general
2915\(em and thinking about how global substitute could be implemented.
2916The answer was clearly that it had to be done in two passes,
2917and the transcript-oriented implementation fell out naturally.
2918.PP
2919.CW Sam
2920was written bottom-up,
2921starting from the data structures and algorithms for manipulating text,
2922through the command language and up to the code for maintaining
2923the display.
2924In retrospect, it turned out well, but this implementation method is
2925not recommended in general.
2926There were several times when I had a large body of interesting code
2927assembled and no clue how to proceed with it.
2928The command language, in particular, took almost a year to figure out,
2929but can be implemented (given what was there at the beginning of that year)
2930in a day or two.  Similarly, inventing the
2931.CW Rasp
2932data structure delayed the
2933connection of the host and terminal pieces by another few months.
2934.CW Sam
2935took about two years to write, although only about four months were
2936spent actually working on it.
2937.PP
2938Part of the design process was unusual:
2939the subset of the protocol that maintains the
2940.CW Rasp
2941was simulated, debugged
2942and verified by an automatic protocol analyzer,\u\s-4\&21\s+4\d and was bug-free
2943from the start.
2944The rest of the protocol, concerned mostly
2945with keeping menus up to date,
2946was unfortunately too unwieldy for such analysis,
2947and was debugged by more traditional methods, primarily
2948by logging in a file all messages in and out of the host.
2949.SH 2
2950Reflections
2951.LP
2952.CW Sam
2953is essentially the only interactive editor used by the sixty or so members of
2954the computing science research center in which I work.
2955The same could not be said of
2956.CW jim ;
2957the lack of a command language kept some people from adopting it.
2958The union of a user interface as comfortable as
2959.CW jim 's
2960with a command language as powerful as
2961.CW ed 's†
2962.FS
2963.vs 9
2964†The people who criticize
2965.CW ed
2966as an interactive program often forget that it and its close relative
2967.CW sed \u\s-4\&7\s+4\d
2968still thrive as programmable editors.  The strength of these programs is
2969independent of their convenience for interactive editing.
2970.br
2971.vs
2972.FE
2973is essential to
2974.CW sam 's
2975success.
2976When
2977.CW sam
2978was first made available to the
2979.CW jim
2980community,
2981almost everyone switched to it within two or three days.
2982In the months that followed, even people who had never adopted
2983.CW jim
2984started using
2985.CW sam
2986exclusively.
2987.PP
2988To be honest,
2989.CW ed
2990still gets occasional use, but usually when
2991something quick needs to be done and the overhead of
2992downloading the terminal part of
2993.CW sam
2994isn't worth the trouble.
2995Also, as a `line' editor,
2996.CW sam
2997.CW -d
2998is a bit odd;
2999when using a good old ASCII terminal, it's comforting to have
3000a true line editor.
3001But it is fair to say that
3002.CW sam 's
3003command language has displaced
3004.CW ed 's
3005for most of the complicated editing that has kept line editors
3006(that is, command-driven editors) with us.
3007.PP
3008.CW Sam 's
3009command language is even fancier than
3010.CW ed 's,
3011and most
3012.CW sam
3013customers don't come near to using all its capabilities.
3014Does it need to be so sophisticated?
3015I think the answer is yes, for two reasons.
3016.PP
3017First, the
3018.I model
3019for
3020.CW sam 's
3021command language is really relatively simple, and certainly simpler than that of
3022.CW ed .
3023For instance, there is only one kind of textual loop in
3024.CW sam
3025\(em the
3026.CW x
3027command \(em
3028while
3029.CW ed
3030has three (the
3031.CW g
3032command, the global flag on substitutions, and the implicit loop over
3033lines in multi-line substitutions).
3034Also,
3035.CW ed 's
3036substitute command is necessary to make changes within lines, but in
3037.CW sam
3038the
3039.CW s
3040command is more of a familiar convenience than a necessity;
3041.CW c
3042and
3043.CW t
3044can do all the work.
3045.PP
3046Second,
3047given a community that expects an editor to be about as powerful as
3048.CW ed ,
3049it's hard to see how
3050.CW sam
3051could really be much simpler and still satisfy that expectation.
3052People want to do ``global substitutes,'' and most are content
3053to have the recipe for that and a few other fancy changes.
3054The sophistication of the command language is really just a veneer
3055over a design that makes it possible to do global substitutes
3056in a screen editor.
3057Some people will always want something more, however, and it's gratifying to
3058be able to provide it.
3059The real power of
3060.CW sam 's
3061command language comes from composability of the operators, which is by
3062nature orthogonal to the underlying model.
3063In other words,
3064.CW sam
3065is not itself complex, but it makes complex things possible.
3066If you don't want to do anything complex, you can ignore the
3067complexity altogether, and many people do so.
3068.PP
3069Sometimes I am asked the opposite question: why didn't I just make
3070.CW sam
3071a real programmable editor, with macros and variables and so on?
3072The main reason is a matter of taste: I like the editor
3073to be the same every time I use it.
3074There is one technical reason, though:
3075programmability in editors is largely a workaround for insufficient
3076interactivity.
3077Programmable editors are used to make particular, usually short-term,
3078things easy to do, such as by providing shorthands for common actions.
3079If things are generally easy to do in the first place,
3080shorthands are not as helpful.
3081.CW Sam
3082makes common editing operations very easy, and the solutions to
3083complex editing problems seem commensurate with the problems themselves.
3084Also, the ability to edit the
3085.CW sam
3086window makes it easy to repeat commands \(em it only takes a mouse button click
3087to execute a command again.
3088.SH 2
3089Pros and cons
3090.LP
3091.CW Sam
3092has several other good points,
3093and its share of problems.
3094Among the good things is the idea of
3095structural regular expressions,
3096whose usefulness has only begun to be explored.
3097They were arrived at serendipitously when I attempted to distill the essence of
3098.CW ed 's
3099way of doing global substitution and recognized that the looping command in
3100.CW ed
3101was implicitly imposing a structure (an array of lines) on the file.
3102.PP
3103Another of
3104.CW sam 's
3105good things is its undo capability.
3106I had never before used an editor with a true undo,
3107but I would never go back now.
3108Undo
3109.I must
3110be done well, but if it is, it can be relied on.
3111For example,
3112it's safe to experiment if you're not sure how to write some intricate command,
3113because if you make a mistake, it can be fixed simply and reliably.
3114I learned two things about undo from writing
3115.CW sam :
3116first, it's easy to provide if you design it in from the beginning, and
3117second, it's necessary, particularly if the system has some subtle
3118properties that may be unfamiliar or error-prone for users.
3119.PP
3120.CW Sam 's
3121lack of internal limits and sizes is a virtue.
3122Because it avoids all fixed-size tables and data structures,
3123.CW sam
3124is able to make global changes to files that some of our other
3125tools cannot even read.
3126Moreover, the design keeps the performance linear when doing such
3127operations, although I must admit
3128.CW sam
3129does get slow when editing a huge file.
3130.PP
3131Now, the problems.
3132Externally, the most obvious is that it is poorly integrated into the
3133surrounding window system.
3134By design, the user interface in
3135.CW sam
3136feels almost identical to that of
3137.CW mux ,
3138but a thick wall separates text in
3139.CW sam
3140from the programs running in
3141.CW mux .
3142For instance, the `snarf buffer' in
3143.CW sam
3144must be maintained separately from that in
3145.CW mux .
3146This is regrettable, but probably necessary given the unusual configuration
3147of the system, with a programmable terminal on the far end of an RS-232 link.
3148.PP
3149.CW Sam
3150is reliable; otherwise, people wouldn't use it.
3151But it was written over such a long time, and has so many new (to me)
3152ideas in it, that I would like to see it done over again to clean
3153up the code and remove many of the lingering problems in the implementation.
3154The worst part is in the interconnection of the host and terminal parts,
3155which might even be able to go away in a redesign for a more
3156conventional window system.
3157The program must be split in two to use the terminal effectively,
3158but the low bandwidth of the connection forces the separation to
3159occur in an inconvenient part of the design if performance is to be acceptable.
3160A simple remote procedure call
3161protocol driven by the host, emitting only graphics
3162commands, would be easy to write but wouldn't have nearly the
3163necessary responsiveness.  On the other hand, if the terminal were in control
3164and requested much simpler file services from the host, regular expression
3165searches would require that the terminal read the entire file over its RS-232
3166link, which would be unreasonably slow.
3167A compromise in which either end can take control is necessary.
3168In retrospect, the communications protocol should have been
3169designed and verified formally, although I do not know of any tool
3170that can adequately relate the protocol to
3171its implementation.
3172.PP
3173Not all of
3174.CW sam 's
3175users are comfortable with its command language, and few are adept.
3176Some (venerable) people use a sort of
3177.CW ed \& ``
3178subset'' of
3179.CW sam 's
3180command language,
3181and even ask why
3182.CW sam 's
3183command language is not exactly
3184.CW ed 's.
3185(The reason, of course, is that
3186.CW sam 's
3187model for text does not include newlines, which are central to
3188.CW ed .
3189Making the text an array of newlines to the command language would
3190be too much of a break from the seamless model provided by the mouse.
3191Some editors, such as
3192.CW vi ,
3193are willing to make this break, though.)
3194The difficulty is that
3195.CW sam 's
3196syntax is so close to
3197.CW ed 's
3198that people believe it
3199.I should
3200be the same.
3201I thought, with some justification in hindsight,
3202that making
3203.CW sam
3204similar to
3205.CW ed
3206would make it easier to learn and to accept.
3207But I may have overstepped and raised the users'
3208expectations too much.
3209It's hard to decide which way to resolve this problem.
3210.PP
3211Finally, there is a tradeoff in
3212.CW sam
3213that was decided by the environment in which it runs:
3214.CW sam
3215is a multi-file editor, although in a different system there might instead be
3216multiple single-file editors.
3217The decision was made primarily because starting a new program in a Blit is
3218time-consuming.
3219If the choice could be made freely, however, I would
3220still choose the multi-file architecture, because it allows
3221groups of files to be handled as a unit;
3222the usefulness of the multi-file commands is incontrovertible.
3223It is delightful to have the source to an entire program
3224available at your fingertips.
3225.SH
3226Acknowledgements
3227.LP
3228Tom Cargill suggested the idea behind the
3229.CW Rasp
3230data structure.
3231Norman Wilson and Ken Thompson influenced the command language.
3232This paper was improved by comments from
3233Al Aho,
3234Jon Bentley,
3235Chris Fraser,
3236Gerard Holzmann,
3237Brian Kernighan,
3238Ted Kowalski,
3239Doug McIlroy
3240and
3241Dennis Ritchie.
3242