1.HTML "The Text Editor sam 2.Vx 17 11 November 87 1 32 "ROB PIKE" "THE TEXT EDITOR SAM" 3.ds DY "31 May 1987 4.ds DR "Revised 1 July 1987 5.de CW \" puts first arg in CW font, same as UL; maintains font 6\%\&\\$3\f(CW\\$1\fP\&\\$2 7.. 8.de Cs 9.br 10.fi 11.ft 2 12.ps -2 13.vs -2 14.. 15.de Ce 16.br 17.nf 18.ft 1 19.ps 20.vs 21.sp 22.. 23.de XP 24.ie h .html - <center><img src="\\$1.gif" /></center> 25.el .BP \\$1.ps \\$2 26.. 27.TL 28The Text Editor \&\f(CWsam\fP 29.AU 30Rob Pike 31rob@plan9.bell-labs.com 32.AB 33.LP 34.CW Sam 35is an interactive multi-file text editor intended for 36bitmap displays. 37A textual command language 38supplements the mouse-driven, cut-and-paste interface 39to make complex or 40repetitive editing tasks easy to specify. 41The language is characterized by the composition of regular expressions 42to describe the structure of the text being modified. 43The treatment of files as a database, with changes logged 44as atomic transactions, guides the implementation and 45makes a general `undo' mechanism straightforward. 46.PP 47.CW Sam 48is implemented as two processes connected by a low-bandwidth stream, 49one process handling the display and the other the editing 50algorithms. Therefore it can run with the display process 51in a bitmap terminal and the editor on a local host, 52with both processes on a bitmap-equipped host, or with 53the display process in the terminal and the editor in a 54remote host. 55By suppressing the display process, 56it can even run without a bitmap terminal. 57.PP 58This paper is reprinted from Software\(emPractice and Experience, 59Vol 17, number 11, pp. 813-845, November 1987. 60The paper has not been updated for the Plan 9 manuals. Although 61.CW Sam 62has not changed much since the paper was written, the system around it certainly has. 63Nonetheless, the description here still stands as the best introduction to the editor. 64.AE 65.SH 66Introduction 67.LP 68.CW Sam 69is an interactive text editor that combines cut-and-paste interactive editing with 70an unusual command language based on the composition of regular expressions. 71It is written as two programs: one, the `host part,' runs on a UNIX system 72and implements the command language and provides file access; the other, the 73`terminal part,' runs asynchronously 74on a machine with a mouse and bitmap display 75and supports the display and interactive editing. 76The host part may be even run in isolation on an ordinary terminal 77to edit text using the command 78language, much like a traditional line editor, 79without assistance from a mouse or display. 80Most often, 81the terminal part runs on a Blit\u\s-4\&1\s+4\d terminal 82(actually on a Teletype DMD 5620, the production version of the Blit), whose 83host connection is an ordinary 9600 bps RS232 link; 84on the SUN computer the host and display processes run on a single machine, 85connected by a pipe. 86.PP 87.CW Sam 88edits uninterpreted 89ASCII text. 90It has no facilities for multiple fonts, graphics or tables, 91unlike MacWrite,\u\s-4\&2\s+4\d Bravo,\u\s-4\&3\s+4\d Tioga\u\s-4\&4\s+4\d 92or Lara.\u\s-4\&5\s+4\d 93Also unlike them, it has a rich command language. 94(Throughout this paper, the phrase 95.I 96command language 97.R 98refers to 99textual commands; commands activated from the mouse form the 100.I mouse 101.I language. ) 102.CW Sam 103developed as an editor for use by programmers, and tries to join 104the styles of the UNIX text editor 105.CW ed \u\s-4\&6,7\s+4\d 106with that of interactive cut-and-paste editors by 107providing a comfortable mouse-driven interface 108to a program with a solid command language driven by regular expressions. 109The command language developed more than the mouse language, and 110acquired a notation for describing the structure of files 111more richly than as a sequence of lines, 112using a dataflow-like syntax for specifying changes. 113.PP 114The interactive style was influenced by 115.CW jim ,\u\s-4\&1\s+4\d 116an early cut-and-paste editor for the Blit, and by 117.CW mux ,\u\s-4\&8\s+4\d 118the Blit window system. 119.CW Mux 120merges the original Blit window system, 121.CW mpx ,\u\s-4\&1\s+4\d 122with cut-and-paste editing, forming something like a 123multiplexed version of 124.CW jim 125that edits the output of (and input to) command sessions rather than files. 126.PP 127The first part of this paper describes the command language, then the mouse 128language, and explains how they interact. 129That is followed by a description of the implementation, 130first of the host part, then of the terminal part. 131A principle that influenced the design of 132.CW sam 133is that it should have no explicit limits, such as upper limits on 134file size or line length. 135A secondary consideration is that it be efficient. 136To honor these two goals together requires a method for efficiently 137manipulating 138huge strings (files) without breaking them into lines, 139perhaps while making thousands of changes 140under control of the command language. 141.CW Sam 's 142method is to 143treat the file as a transaction database, implementing changes as atomic 144updates. These updates may be unwound easily to `undo' changes. 145Efficiency is achieved through a collection of caches that minimizes 146disc traffic and data motion, both within the two parts of the program 147and between them. 148.PP 149The terminal part of 150.CW sam 151is fairly straightforward. 152More interesting is how the two halves of the editor stay 153synchronized when either half may initiate a change. 154This is achieved through a data structure that organizes the 155communications and is maintained in parallel by both halves. 156.PP 157The last part of the paper chronicles the writing of 158.CW sam 159and discusses the lessons that were learned through its development and use. 160.PP 161The paper is long, but is composed largely of two papers of reasonable length: 162a description of the user interface of 163.CW sam 164and a discussion of its implementation. 165They are combined because the implementation is strongly influenced by 166the user interface, and vice versa. 167.SH 168The Interface 169.LP 170.CW Sam 171is a text editor for multiple files. 172File names may be provided when it is invoked: 173.P1 174sam file1 file2 ... 175.P2 176and there are commands 177to add new files and discard unneeded ones. 178Files are not read until necessary 179to complete some command. 180Editing operations apply to an internal copy 181made when the file is read; the UNIX file associated with the copy 182is changed only by an explicit command. 183To simplify the discussion, the internal copy is here called a 184.I file , 185while the disc-resident original is called a 186.I 187disc file. 188.R 189.PP 190.CW Sam 191is usually connected to a bitmap display that presents a cut-and-paste 192editor driven by the mouse. 193In this mode, the command language is still available: 194text typed in a special window, called the 195.CW sam 196.I window, 197is interpreted 198as commands to be executed in the current file. 199Cut-and-paste editing may be used in any window \(em even in the 200.CW sam 201window to construct commands. 202The other mode of operation, invoked by starting 203.CW sam 204with the option 205.CW -d 206(for `no download'), 207does not use the mouse or bitmap display, but still permits 208editing using the textual command language, even on an ordinary terminal, 209interactively or from a script. 210.PP 211The following sections describe first the command language (under 212.CW sam\ -d 213and in the 214.CW sam 215window), and then the mouse interface. 216These two languages are nearly independent, but connect through the 217.I current 218.I text, 219described below. 220.SH 2 221The Command Language 222.LP 223A file consists of its contents, which are an array of characters 224(that is, a string); the 225.I name 226of the associated disc file; the 227.I 228modified bit 229.R 230that states whether the contents match those of 231the disc file; 232and a substring of the contents, called the 233.I 234current text 235.R 236or 237.I dot 238(see Figures 1 and 2). 239If the current text is a null string, dot falls between characters. 240The 241.I value 242of dot is the location of the current text; the 243.I contents 244of dot are the characters it contains. 245.CW Sam 246imparts to the text no two-dimensional interpretation such as columns 247or fields; text is always one-dimensional. 248Even the idea of a `line' of text as understood by most UNIX programs 249\(em a sequence of characters terminated by a newline character \(em 250is only weakly supported. 251.PP 252The 253.I 254current file 255.R 256is the file to which editing commands refer. 257The current text is therefore dot in the current file. 258If a command doesn't explicitly name a particular file or piece of text, 259the command is assumed to apply to the current text. 260For the moment, ignore the presence of multiple files and consider 261editing a single file. 262.KF L 263.XP fig1 3.5i 264.Cs 265Figure 1. A typical 266.CW sam 267screen, with the editing menu presented. 268The 269.CW sam 270(command language) window is in the middle, with file windows above and below. 271(The user interface makes it easy to create these abutting windows.) 272The partially obscured window is a third file window. 273The uppermost window is that to which typing and mouse operations apply, 274as indicated by its heavy border. 275Each window has its current text highlighted in reverse video. 276The 277.CW sam 278window's current text is the null string on the last visible line, 279indicated by a vertical bar. 280See also Figure 2. 281.Ce 282.KE 283.PP 284Commands have one-letter names. 285Except for non-editing commands such as writing 286the file to disc, most commands make some change 287to the text in dot and leave dot set to the text resulting from the change. 288For example, the delete command, 289.CW d , 290deletes the text in dot, replacing it by the null string and setting dot 291to the result. 292The change command, 293.CW c , 294replaces dot by text delimited by an arbitrary punctuation character, 295conventionally 296a slash. Thus, 297.P1 298c/Peter/ 299.P2 300replaces the text in dot by the string 301.CW Peter . 302Similarly, 303.P1 304a/Peter/ 305.P2 306(append) adds the string after dot, and 307.P1 308i/Peter/ 309.P2 310(insert) inserts before dot. 311All three leave dot set to the new text, 312.CW Peter . 313.PP 314Newlines are part of the syntax of commands: 315the newline character lexically terminates a command. 316Within the inserted text, however, newlines are never implicit. 317But since it is often convenient to insert multiple lines of text, 318.CW sam 319has a special 320syntax for that case: 321.P1 322a 323some lines of text 324to be inserted in the file, 325terminated by a period 326on a line by itself 327\&. 328.P2 329In the one-line syntax, a newline character may be specified by a C-like 330escape, so 331.P1 332c/\en/ 333.P2 334replaces dot by a single newline character. 335.PP 336.CW Sam 337also has a substitute command, 338.CW s : 339.P1 340s/\f2expression\fP/\f2replacement\fP/ 341.P2 342substitutes the replacement text for the first match, in dot, 343of the regular expression. 344Thus, if dot is the string 345.CW Peter , 346the command 347.P1 348s/t/st/ 349.P2 350changes it to 351.CW Pester . 352In general, 353.CW s 354is unnecessary, but it was inherited from 355.CW ed 356and it has some convenient variations. 357For instance, the replacement text may include the matched text, 358specified by 359.CW & : 360.P1 361s/Peter/Oh, &, &, &, &!/ 362.P2 363.PP 364There are also three commands that apply programs 365to text: 366.P1 367< \f2UNIX program\fP 368.P2 369replaces dot by the output of the UNIX program. 370Similarly, the 371.CW > 372command 373runs the program with dot as its standard input, and 374.CW | 375does both. For example, 376.P1 377| sort 378.P2 379replaces dot by the result of applying the standard sorting utility to it. 380Again, newlines have no special significance for these 381.CW sam 382commands. 383The text acted upon and resulting from these commands is not necessarily 384bounded by newlines, although for connection with UNIX programs, 385newlines may be necessary to obey conventions. 386.PP 387One more command: 388.CW p 389prints the contents of dot. 390Table I summarizes 391.CW sam 's 392commands. 393.KF 394.TS 395center; 396c s 397lfCW l. 398Table I. \f(CWSam\fP commands 399.sp .4 400.ft CW 401_ 402.ft 403.sp .4 404\f1Text commands\fP 405.sp .4 406_ 407.sp .4 408a/\f2text\fP/ Append text after dot 409c/\f2text\fP/ Change text in dot 410i/\f2text\fP/ Insert text before dot 411d Delete text in dot 412s/\f2regexp\fP/\f2text\fP/ Substitute text for match of regular expression in dot 413m \f2address\fP Move text in dot after address 414t \f2address\fP Copy text in dot after address 415.sp .4 416_ 417.sp .4 418\f1Display commands\fP 419.sp .4 420_ 421.sp .2 422p Print contents of dot 423\&= Print value (line numbers and character numbers) of dot 424.sp .4 425_ 426.sp .4 427\f1File commands\fP 428.sp .4 429_ 430.sp .2 431b \f2file-list\fP Set current file to first file in list that \f(CWsam\fP has in menu 432B \f2file-list\fP Same as \f(CWb\fP, but load new files 433n Print menu lines of all files 434D \f2file-list\fP Delete named files from \f(CWsam\fP 435.sp .4 436_ 437.sp .4 438\f1I/O commands\fP 439.sp .4 440_ 441.sp .2 442e \f2filename\fP Replace file with named disc file 443r \f2filename\fP Replace dot by contents of named disc file 444w \f2filename\fP Write file to named disc file 445f \f2filename\fP Set file name and print new menu line 446< \f2UNIX-command\fP Replace dot by standard output of command 447> \f2UNIX-command\fP Send dot to standard input of command 448| \f2UNIX-command\fP Replace dot by result of command applied to dot 449! \f2UNIX-command\fP Run the command 450.sp .4 451_ 452.sp .4 453\f1Loops and conditionals\fP 454.sp .4 455_ 456.sp .2 457x/\f2regexp\fP/ \f2command\fP For each match of regexp, set dot and run command 458y/\f2regexp\fP/ \f2command\fP Between adjacent matches of regexp, set dot and run command 459X/\f2regexp\fP/ \f2command\fP Run command in each file whose menu line matches regexp 460Y/\f2regexp\fP/ \f2command\fP Run command in each file whose menu line does not match 461g/\f2regexp\fP/ \f2command\fP If dot contains a match of regexp, run command 462v/\f2regexp\fP/ \f2command\fP If dot does not contain a match of regexp, run command 463.sp .4 464_ 465.sp .4 466\f1Miscellany\fP 467.sp .4 468_ 469.sp .2 470k Set address mark to value of dot 471q Quit 472u \f2n\fP Undo last \f2n\fP (default 1) changes 473{ } Braces group commands 474.sp .3 475.ft CW 476_ 477.ft 478.TE 479.sp 480.KE 481.PP 482The value of dot may be changed by 483specifying an 484.I address 485for the command. 486The simplest address is a line number: 487.P1 4883 489.P2 490refers to the third line of the file, so 491.P1 4923d 493.P2 494deletes the third line of the file, and implicitly renumbers 495the lines so the old line 4 is now numbered 3. 496(This is one of the few places where 497.CW sam 498deals with lines directly.) 499Line 500.CW 0 501is the null string at the beginning of the file. 502If a command consists of only an address, a 503.CW p 504command is assumed, so typing an unadorned 505.CW 3 506prints line 3 on the terminal. 507There are a couple of other basic addresses: 508a period addresses dot itself; and 509a dollar sign 510.CW $ ) ( 511addresses the null string at the end of the file. 512.PP 513An address is always a single substring of the file. 514Thus, the address 515.CW 3 516addresses the characters 517after the second newline of 518the file through the third newline of the file. 519A 520.I 521compound address 522.R 523is constructed by the comma operator 524.P1 525\f2address1\fP,\f2address2\fP 526.P2 527and addresses the substring of the file from the beginning of 528.I address1 529to the end of 530.I address2 . 531For example, the command 532.CW 3,5p 533prints the third through fifth lines of the file and 534.CW .,$d 535deletes the text from the beginning of dot to the end of the file. 536.PP 537These addresses are all absolute positions in the file, but 538.CW sam 539also has relative addresses, indicated by 540.CW + 541or 542.CW - . 543For example, 544.P1 545$-3 546.P2 547is the third line before the end of the file and 548.P1 549\&.+1 550.P2 551is the line after dot. 552If no address appears to the left of the 553.CW + 554or 555.CW - , 556dot is assumed; 557if nothing appears to the right, 558.CW 1 559is assumed. 560Therefore, 561.CW .+1 562may be abbreviated to just a plus sign. 563.PP 564The 565.CW + 566operator acts relative to the end of its first argument, while the 567.CW - 568operator acts relative to the beginning. Thus 569.CW .+1 570addresses the first line after dot, 571.CW .- 572addresses the first line before dot, and 573.CW +- 574refers to the line containing the end of dot. (Dot may span multiple lines, and 575.CW + 576selects the line after the end of dot, then 577.CW - 578backs up one line.) 579.PP 580The final type of address is a regular expression, which addresses the 581text matched by the expression. The expression is enclosed in slashes, as in 582.P1 583/\f2expression\fP/ 584.P2 585The expressions are the same as those in the UNIX program 586.CW egrep ,\u\s-4\&6,7\s+4\d 587and include closures, alternations, and so on. 588They find the 589.I 590leftmost longest 591.R 592string that matches the expression, that is, 593the first match after the point where the search is started, 594and if more than one match begins at the same spot, the longest such match. 595(I assume familiarity with the syntax for regular expressions in UNIX programs.\u\s-4\&9\s+4\d) 596For example, 597.P1 598/x/ 599.P2 600matches the next 601.CW x 602character in the file, 603.P1 604/xx*/ 605.P2 606matches the next run of one or more 607.CW x 's, 608and 609.P1 610/x|Peter/ 611.P2 612matches the next 613.CW x 614or 615.CW Peter . 616For compatibility with other UNIX programs, the `any character' operator, 617a period, 618does not match a newline, so 619.P1 620/.*/ 621.P2 622matches the text from dot to the end of the line, but excludes the newline 623and so will not match across 624the line boundary. 625.PP 626Regular expressions are always relative addresses. 627The direction is forwards by default, 628so 629.CW /Peter/ 630is really an abbreviation for 631.CW +/Peter/ . 632The search can be reversed with a minus sign, so 633.P1 634.CW -/Peter/ 635.P2 636finds the first 637.CW Peter 638before dot. 639Regular expressions may be used with other address forms, so 640.CW 0+/Peter/ 641finds the first 642.CW Peter 643in the file and 644.CW $-/Peter/ 645finds the last. 646Table II summarizes 647.CW sam 's 648addresses. 649.KF 650.TS 651center; 652c s 653lfCW l. 654Table II. \f(CWSam\fP addresses 655.sp .4 656.ft CW 657_ 658.ft 659.sp .4 660\f1Simple addresses\fP 661.sp .4 662_ 663.sp .2 664#\f2n\fP The empty string after character \f2n\fP 665\f2n\fP Line \f2n\fP. 666/\f2regexp\fP/ The first following match of the regular expression 667-/\f2regexp\fP/ The first previous match of the regular expression 668$ The null string at the end of the file 669\&. Dot 670\&' The address mark, set by \f(CWk\fP command 671"\f2regexp\fP" Dot in the file whose menu line matches regexp 672.sp .4 673_ 674.sp .4 675\f1Compound addresses\fP 676.sp .4 677_ 678.sp .2 679\f2a1\fP+\f2a2\fP The address \f2a2\fP evaluated starting at right of \f2a1\fP 680\f2a1\fP-\f2a2\fP \f2a2\fP evaluated in the reverse direction starting at left of \f2a1\fP 681\f2a1\fP,\f2a2\fP From the left of \f2a1\fP to the right of \f2a2\fP (default \f(CW0,$\fP) 682\f2a1\fP;\f2a2\fP Like \f(CW,\fP but sets dot after evaluating \f2a1\fP 683.sp .4 684_ 685.sp .4 686.T& 687c s. 688T{ 689The operators 690.CW + 691and 692.CW - 693are high precedence, while 694.CW , 695and 696.CW ; 697are low precedence. 698In both 699.CW + 700and 701.CW - 702forms, 703.I a2 704defaults to 1 and 705.I a1 706defaults to dot. 707If both 708.I a1 709and 710.I a2 711are present, 712.CW + 713may be elided. 714T} 715.sp .5 716.ft CW 717_ 718.ft 719.TE 720.sp 721.KE 722.PP 723The language discussed so far will not seem novel 724to people who use UNIX text editors 725such as 726.CW ed 727or 728.CW vi .\u\s-4\&9\s+4\d 729Moreover, the kinds of editing operations these commands allow, with the exception 730of regular expressions and line numbers, 731are clearly more conveniently handled by a mouse-based interface. 732Indeed, 733.CW sam 's 734mouse language (discussed at length below) is the means by which 735simple changes are usually made. 736For large or repetitive changes, however, a textual language 737outperforms a manual interface. 738.PP 739Imagine that, instead of deleting just one occurrence of the string 740.CW Peter , 741we wanted to eliminate every 742.CW Peter . 743What's needed is an iterator that runs a command for each occurrence of some 744text. 745.CW Sam 's 746iterator is called 747.CW x , 748for extract: 749.P1 750x/\f2expression\fP/ \f2command\fP 751.P2 752finds all matches in dot of the specified expression, and for each 753such match, sets dot to the text matched and runs the command. 754So to delete all the 755.CW Peters: 756.P1 7570,$ x/Peter/ d 758.P2 759(Blanks in these examples are to improve readability; 760.CW sam 761neither requires nor interprets them.) 762This searches the entire file 763.CW 0,$ ) ( 764for occurrences of the string 765.CW Peter , 766and runs the 767.CW d 768command with dot set to each such occurrence. 769(By contrast, the comparable 770.CW ed 771command would delete all 772.I lines 773containing 774.CW Peter ; 775.CW sam 776deletes only the 777.CW Peters .) 778The address 779.CW 0,$ 780is commonly used, and may be abbreviated to just a comma. 781As another example, 782.P1 783, x/Peter/ p 784.P2 785prints a list of 786.CW Peters, 787one for each appearance in the file, with no intervening text (not even newlines 788to separate the instances). 789.PP 790Of course, the text extracted by 791.CW x 792may be selected by a regular expression, 793which complicates deciding what set of matches is chosen \(em 794matches may overlap. This is resolved by generating the matches 795starting from the beginning of dot using the leftmost-longest rule, 796and searching for each match starting from the end of the previous one. 797Regular expressions may also match null strings, but a null match 798adjacent to a non-null match is never selected; at least one character 799must intervene. 800For example, 801.P1 802, c/AAA/ 803x/B*/ c/-/ 804, p 805.P2 806produces as output 807.P1 808-A-A-A- 809.P2 810because the pattern 811.CW B* 812matches the null strings separating the 813.CW A 's. 814.PP 815The 816.CW x 817command has a complement, 818.CW y , 819with similar syntax, that executes the command with dot set to the text 820.I between 821the matches of the expression. 822For example, 823.P1 824, c/AAA/ 825y/A/ c/-/ 826, p 827.P2 828produces the same result as the example above. 829.PP 830The 831.CW x 832and 833.CW y 834commands are looping constructs, and 835.CW sam 836has a pair of conditional commands to go with them. 837They have similar syntax: 838.P1 839g/\f2expression\fP/ \f2command\fP 840.P2 841(guard) 842runs the command exactly once if dot contains a match of the expression. 843This is different from 844.CW x , 845which runs the command for 846.I each 847match: 848.CW x 849loops; 850.CW g 851merely tests, without changing the value of dot. 852Thus, 853.P1 854, x/Peter/ d 855.P2 856deletes all occurrences of 857.CW Peter , 858but 859.P1 860, g/Peter/ d 861.P2 862deletes the whole file (reduces it to a null string) if 863.CW Peter 864occurs anywhere in the text. 865The complementary conditional is 866.CW v , 867which runs the command if there is 868.I no 869match of the expression. 870.PP 871These control-structure-like commands may be composed to construct more 872involved operations. For example, to print those lines of text that 873contain the string 874.CW Peter : 875.P1 876, x/.*\en/ g/Peter/ p 877.P2 878The 879.CW x 880breaks the file into lines, the 881.CW g 882selects those lines containing 883.CW Peter , 884and the 885.CW p 886prints them. 887This command gives an address for the 888.CW x 889command (the whole file), but because 890.CW g 891does not have an explicit address, it applies to the value of 892dot produced by the 893.CW x 894command, that is, to each line. 895All commands in 896.CW sam 897except for the command to write a file to disc use dot for the 898default address. 899.PP 900Composition may be continued indefinitely. 901.P1 902, x/.*\en/ g/Peter/ v/SaltPeter/ p 903.P2 904prints those lines containing 905.CW Peter 906but 907.I not 908those containing 909.CW SaltPeter . 910.SH 2 911Structural Regular Expressions 912.LP 913Unlike other UNIX text editors, 914including the non-interactive ones such as 915.CW sed 916and 917.CW awk ,\u\s-4\&7\s+4\d 918.CW sam 919is good for manipulating files with multi-line `records.' 920An example is an on-line phone book composed of records, 921separated by blank lines, of the form 922.P1 923Herbert Tic 92444 Turnip Ave., Endive, NJ 925201-5555642 926 927Norbert Twinge 92816 Potato St., Cabbagetown, NJ 929201-5553145 930 931\&... 932.P2 933The format may be encoded as a regular expression: 934.P1 935(.+\en)+ 936.P2 937that is, a sequence of one or more non-blank lines. 938The command to print Mr. Tic's entire record is then 939.P1 940, x/(.+\en)+/ g/^Herbert Tic$/ p 941.P2 942and that to extract just the phone number is 943.P1 944, x/(.+\en)+/ g/^Herbert Tic$/ x/^[0-9]*-[0-9]*\en/ p 945.P2 946The latter command breaks the file into records, 947chooses Mr. Tic's record, 948extracts the phone number from the record, 949and finally prints the number. 950.PP 951A more involved problem is that of 952renaming a particular variable, say 953.CW n , 954to 955.CW num 956in a C program. 957The obvious first attempt, 958.P1 959, x/n/ c/num/ 960.P2 961is badly flawed: it changes not only the variable 962.CW n 963but any letter 964.CW n 965that appears. 966We need to extract all the variables, and select those that match 967.CW n 968and only 969.CW n : 970.P1 971, x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/ 972.P2 973The pattern 974.CW [A-Za-z_][A-Za-z_0-9]* 975matches C identifiers. 976Next 977.CW g/n/ 978selects those containing an 979.CW n . 980Then 981.CW v/../ 982rejects those containing two (or more) characters, and finally 983.CW c/num/ 984changes the remainder (identifiers 985.CW n ) 986to 987.CW num . 988This version clearly works much better, but there may still be problems. 989For example, in C character and string constants, the sequence 990.CW \en 991is interpreted as a newline character, and we don't want to change it to 992.CW \enum. 993This problem can be forestalled with a 994.CW y 995command: 996.P1 997, y/\e\en/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/ 998.P2 999(the second 1000.CW \e 1001is necessary because of lexical conventions in regular expressions), 1002or we could even reject character constants and strings outright: 1003.P1 0 1004,y/'[^']*'/ y/"[^"]*"/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/ 1005.P2 1006The 1007.CW y 1008commands in this version exclude from consideration all character constants 1009and strings. 1010The only remaining problem is to deal with the possible occurrence of 1011.CW \e' 1012or 1013.CW \e" 1014within these sequences, but it's easy to see how to resolve this difficulty. 1015.PP 1016The point of these composed commands is successive refinement. 1017A simple version of the command is tried, and if it's not good enough, 1018it can be honed by adding a clause or two. 1019(Mistakes can be undone; see below. 1020Also, the mouse language makes it unnecessary to retype the command each time.) 1021The resulting chains of commands are somewhat reminiscent of 1022shell pipelines.\u\s-4\&7\s+4\d 1023Unlike pipelines, though, which pass along modified 1024.I data , 1025.CW sam 1026commands pass a 1027.I view 1028of the data. 1029The text at each step of the command is the same, but which pieces 1030are selected is refined step by step until the correct piece is 1031available to the final step of the command line, which ultimately makes the change. 1032.PP 1033In other UNIX programs, regular expressions are used only for selection, 1034as in the 1035.CW sam 1036.CW g 1037command, never for extraction as in the 1038.CW x 1039or 1040.CW y 1041command. 1042For example, patterns in 1043.CW awk \u\s-4\&7\s+4\d 1044are used to select lines to be operated on, but cannot be used 1045to describe the format of the input text, or to handle newline-free text. 1046The use of regular expressions to describe the structure of a piece 1047of text rather than its contents, as in the 1048.CW x 1049command, 1050has been given a name: 1051.I 1052structural regular expressions. 1053.R 1054When they are composed, as in the above example, 1055they are pleasantly expressive. 1056Their use is discussed at greater length elsewhere.\u\s-4\&10\s+4\d 1057.PP 1058.SH 2 1059Multiple files 1060.LP 1061.CW Sam 1062has a few other commands, mostly relating to input and output. 1063.P1 1064e discfilename 1065.P2 1066replaces the contents and name of the current file with those of the named 1067disc file; 1068.P1 1069w discfilename 1070.P2 1071writes the contents to the named disc file; and 1072.P1 1073r discfilename 1074.P2 1075replaces dot with the contents of the named disc file. 1076All these commands use the current file's name if none is specified. 1077Finally, 1078.P1 1079f discfilename 1080.P2 1081changes the name associated with the file and displays the result: 1082.P1 1083\&'-. discfilename 1084.P2 1085This output is called the file's 1086.I 1087menu line, 1088.R 1089because it is the contents of the file's line in the button 3 menu (described 1090in the 1091next section). 1092The first three characters are a concise notation for the state of the file. 1093The apostrophe signifies that the file is modified. 1094The minus sign indicates the number of windows 1095open on the file (see the next section): 1096.CW - 1097means none, 1098.CW + 1099means one, and 1100.CW * 1101means more than one. 1102Finally, the period indicates that this is the current file. 1103These characters are useful for controlling the 1104.CW X 1105command, described shortly. 1106.PP 1107.CW Sam 1108may be started with a set of disc files (such as all the source for 1109a program) by invoking it with a list of file names as arguments, and 1110more may be added or deleted on demand. 1111.P1 1112B discfile1 discfile2 ... 1113.P2 1114adds the named files to 1115.CW sam 's 1116list, and 1117.P1 1118D discfile1 discfile2 ... 1119.P2 1120removes them from 1121.CW sam 's 1122memory (without effect on associated disc files). 1123Both these commands have a syntax for using the shell\u\s-4\&7\s+4\d 1124(the UNIX command interpreter) to generate the lists: 1125.P1 1126B <echo *.c 1127.P2 1128will add all C source files, and 1129.P1 1130B <grep -l variable *.c 1131.P2 1132will add all C source files referencing a particular variable 1133(the UNIX command 1134.CW grep\ -l 1135lists all files in its arguments that contain matches of 1136the specified regular expression). 1137Finally, 1138.CW D 1139without arguments deletes the current file. 1140.PP 1141There are two ways to change which file is current: 1142.P1 1143b filename 1144.P2 1145makes the named file current. 1146The 1147.CW B 1148command 1149does the same, but also adds any new files to 1150.CW sam 's 1151list. 1152(In practice, of course, the current file 1153is usually chosen by mouse actions, not by textual commands.) 1154The other way is to use a form of address that refers to files: 1155.P1 1156"\f2expression\fP" \f2address\fP 1157.P2 1158refers to the address evaluated in the file whose menu line 1159matches the expression (there must be exactly one match). 1160For example, 1161.P1 1162"peter.c" 3 1163.P2 1164refers to the third line of the file whose name matches 1165.CW peter.c . 1166This is most useful in the move 1167.CW m ) ( 1168and copy 1169.CW t ) ( 1170commands: 1171.P1 11720,$ t "peter.c" 0 1173.P2 1174makes a copy of the current file at the beginning of 1175.CW peter.c . 1176.PP 1177The 1178.CW X 1179command 1180is a looping construct, like 1181.CW x , 1182that refers to files instead of strings: 1183.P1 1184X/\f2expression\fP/ \f2command\fP 1185.P2 1186runs the command in all 1187files whose menu lines match the expression. The best example is 1188.P1 1189X/'/ w 1190.P2 1191which writes to disc all modified files. 1192.CW Y 1193is the complement of 1194.CW X : 1195it runs the command on all files whose menu lines don't match the expression: 1196.P1 1197Y/\e.c/ D 1198.P2 1199deletes all files that don't have 1200.CW \&.c 1201in their names, that is, it keeps all C source files and deletes the rest. 1202.PP 1203Braces allow commands to be grouped, so 1204.P1 1205{ 1206 \f2command1\fP 1207 \f2command2\fP 1208} 1209.P2 1210is syntactically a single command that runs two commands. 1211Thus, 1212.P1 1213X/\e.c/ ,g/variable/ { 1214 f 1215 , x/.*\en/ g/variable/ p 1216} 1217.P2 1218finds all occurrences of 1219.CW variable 1220in C source files, and prints 1221out the file names and lines of each match. 1222The precise semantics of compound operations is discussed in the implementation 1223sections below. 1224.PP 1225Finally, 1226the undo command, 1227.CW u , 1228undoes the last command, 1229no matter how many files were affected. 1230Multiple undo operations move further back in time, so 1231.P1 1232u 1233u 1234.P2 1235(which may be abbreviated 1236.CW u2 ) 1237undoes the last two commands. An undo may not be undone, however, nor 1238may any command that adds or deletes files. 1239Everything else is undoable, though, including for example 1240.CW e 1241commands: 1242.P1 1243e filename 1244u 1245.P2 1246restores the state of the file completely, including its name, dot, 1247and modified bit. Because of the undo, potentially dangerous commands 1248are not guarded by confirmations. Only 1249.CW D , 1250which destroys the information necessary to restore itself, is protected. 1251It will not delete a modified file, but a second 1252.CW D 1253of the same file will succeed regardless. 1254The 1255.CW q 1256command, which exits 1257.CW sam , 1258is similarly guarded. 1259.SH 2 1260Mouse Interface 1261.LP 1262.CW Sam 1263is most commonly run 1264connected to a bitmap display and mouse for interactive editing. 1265The only difference in the command language 1266between regular, mouse-driven 1267.CW sam 1268and 1269.CW sam\ -d 1270is that if an address 1271is provided without a command, 1272.CW sam\ -d 1273will print the text referenced by the address, but 1274regular 1275.CW sam 1276will highlight it on the screen \(em in fact, 1277dot is always highlighted (see Figure 2). 1278.WS 1 1279.KF 1280.XP fig3 2.04i 1281.Cs 1282Figure 2. A 1283.CW sam 1284window. The scroll bar down the left 1285represents the file, with the bubble showing the fraction 1286visible in the window. 1287The scroll bar may be manipulated by the mouse for convenient browsing. 1288The current text, 1289which is highlighted, need not fit on a line. Here it consists of one partial 1290line, one complete line, and final partial line. 1291.Ce 1292.KE 1293.PP 1294Each file may have zero or more windows open on the display. 1295At any time, only one window in all of 1296.CW sam 1297is the 1298.I 1299current window, 1300.R 1301that is, the window to which typing and mouse actions refer; 1302this may be the 1303.CW sam 1304window (that in which commands may be typed) 1305or one of the file windows. 1306When a file has multiple windows, the image of the file in each window 1307is always kept up to date. 1308The current file is the last file affected by a command, 1309so if the 1310.CW sam 1311window is current, 1312the current window is not a window on the current file. 1313However, each window on a file has its own value of dot, 1314and when switching between windows on a single file, 1315the file's value of dot is changed to that of the window. 1316Thus, flipping between windows behaves in the obvious, convenient way. 1317.PP 1318The mouse on the Blit has three buttons, numbered left to right. 1319Button 3 has a list of commands to manipulate windows, 1320followed by a list of `menu lines' exactly as printed by the 1321.CW f 1322command, one per file (not one per window). 1323These menu lines are sorted by file name. 1324If the list is long, the Blit menu software will make it more manageable 1325by generating a scrolling menu instead of an unwieldy long list. 1326Using the menu to select a file from the list makes that file the current 1327file, and the most recently current window in that file the current window. 1328But if that file is already current, selecting it in the menu cycles through 1329the windows on the file; this simple trick avoids a special menu to 1330choose windows on a file. 1331If there is no window open on the file, 1332.CW sam 1333changes the mouse cursor to prompt the user to create one. 1334.PP 1335The commands on the button 3 menu are straightforward (see Figure 3), and 1336are like the commands to manipulate windows in 1337.CW mux ,\u\s-4\&8\s+4\d 1338the Blit's window system. 1339.CW New 1340makes a new file, and gives it one empty window, whose size is determined 1341by a rectangle swept by the mouse. 1342.CW Zerox 1343prompts for a window to be selected, and 1344makes a clone of that window; this is how multiple windows are created on one file. 1345.CW Reshape 1346changes the size of the indicated window, and 1347.CW close 1348deletes it. If that is the last window open on the file, 1349.CW close 1350first does a 1351.CW D 1352command on the file. 1353.CW Write 1354is identical to a 1355.CW w 1356command on the file; it is in the menu purely for convenience. 1357Finally, 1358.CW ~~sam~~ 1359is a menu item that appears between the commands and the file names. 1360Selecting it makes the 1361.CW sam 1362window the current window, 1363causing subsequent typing to be interpreted as commands. 1364.KF 1365.XP fig2 2.74i 1366.Cs 1367Figure 3. The menu on button 3. 1368The black rectangle on the left is a scroll bar; the menu is limited to 1369the length shown to prevent its becoming unwieldy. 1370Above the 1371.CW ~~sam~~ 1372line is a list of commands; 1373beneath it is a list of files, presented exactly as with the 1374.CW f 1375command. 1376.Ce 1377.KE 1378.PP 1379When 1380.CW sam 1381requests that a window be swept, in response to 1382.CW new , 1383.CW zerox 1384or 1385.CW reshape , 1386it changes the mouse cursor from the usual arrow to a box with 1387a small arrow. 1388In this state, the mouse may be used to indicate an arbitrary rectangle by 1389pressing button 3 at one corner and releasing it at the opposite corner. 1390More conveniently, 1391button 3 may simply be clicked, 1392whereupon 1393.CW sam 1394creates the maximal rectangle that contains the cursor 1395and abuts the 1396.CW sam 1397window. 1398By placing the 1399.CW sam 1400window in the middle of the screen, the user can define two regions (one above, 1401one below) in which stacked fully-overlapping 1402windows can be created with minimal fuss (see Figure 1). 1403This simple user interface trick makes window creation noticeably easier. 1404.PP 1405The cut-and-paste editor is essentially the same as that in Smalltalk-80.\u\s-4\&11\s+4\d 1406The text in dot is always highlighted on the screen. 1407When a character is typed it replaces dot, and sets dot to the null 1408string after the character. Thus, ordinary typing inserts text. 1409Button 1 is used for selection: 1410pressing the button, moving the mouse, and lifting the button 1411selects (sets dot to) the text between the points where the 1412button was pressed and released. 1413Pressing and releasing at the same point selects a null string; this 1414is called clicking. Clicking twice quickly, or 1415.I 1416double clicking, 1417.R 1418selects larger objects; 1419for example, double clicking in a word selects the word, 1420double clicking just inside an opening bracket selects the text 1421contained in the brackets (handling nested brackets correctly), 1422and similarly for 1423parentheses, quotes, and so on. 1424The double-clicking rules reflect a bias toward 1425programmers. 1426If 1427.CW sam 1428were intended more for word processing, double-clicks would probably 1429select linguistic structures such as sentences. 1430.PP 1431If button 1 is pressed outside the current window, it makes the indicated 1432window current. 1433This is the easiest way to switch between windows and files. 1434.PP 1435Pressing button 2 brings up a menu of editing functions (see Figure 4). 1436These mostly apply to the selected text: 1437.CW cut 1438deletes the selected text, and remembers it in a hidden buffer called the 1439.I 1440snarf buffer, 1441.R 1442.CW paste 1443replaces the selected text by the contents of the snarf buffer, 1444.CW snarf 1445just copies the selected text to the snarf buffer, 1446.CW look 1447searches forward for the next literal occurrence of the selected text, and 1448.CW <mux> 1449exchanges snarf buffers with the window system in which 1450.CW sam 1451is running. 1452Finally, the last regular expression used appears as a menu entry 1453to search 1454forward for the next occurrence of a match for the expression. 1455.WS 1 1456.KF 1457.XP fig4 1.20i 1458.Cs 1459Figure 4. The menu on button 2. 1460The bottom entry tracks the most recently used regular expression, which may 1461be literal text. 1462.Ce 1463.KE 1464.PP 1465The relationship between the command language and the mouse language is 1466entirely due to the equality of dot and the selected text chosen 1467with button 1 on the mouse. 1468For example, to make a set of changes in a C subroutine, dot can be 1469set by double clicking on the left brace that begins the subroutine, 1470which sets dot for the command language. 1471An address-free command then typed in the 1472.CW sam 1473window will apply only to the text between the opening and closing 1474braces of the function. 1475The idea is to select what you want, and then say what you want 1476to do with it, whether invoked by a menu selection or by a typed command. 1477And of course, the value of dot is highlighted on 1478the display after the command completes. 1479This relationship between mouse interface and command language 1480is clumsy to explain, but comfortable, even natural, in practice. 1481.SH 1482The Implementation 1483.LP 1484The next few sections describe how 1485.CW sam 1486is put together, first the host part, 1487then the inter-component communication, 1488then the terminal part. 1489After explaining how the command language is implemented, 1490the discussion follows (roughly) the path of a character 1491from the temporary file on disc to the screen. 1492The presentation centers on the data structures, 1493because that is how the program was designed and because 1494the algorithms are easy to provide, given the right data 1495structures. 1496.SH 2 1497Parsing and execution 1498.LP 1499The command language is interpreted by parsing each command with a 1500table-driven recursive 1501descent parser, and when a complete command is assembled, invoking a top-down 1502executor. 1503Most editors instead employ a simple character-at-a-time 1504lexical scanner. 1505Use of a parser makes it 1506easy and unambiguous to detect when a command is complete, 1507which has two advantages. 1508First, escape conventions such as backslashes to quote 1509multiple-line commands are unnecessary; if the command isn't finished, 1510the parser keeps reading. For example, a multiple-line append driven by an 1511.CW x 1512command is straightforward: 1513.P1 1514x/.*\en/ g/Peter/ a 1515one line about Peter 1516another line about Peter 1517\&. 1518.P2 1519Other UNIX editors would require a backslash after all but the last line. 1520.PP 1521The other advantage is specific to the two-process structure of 1522.CW sam . 1523The host process must decide when a command is completed so the 1524command interpreter can be called. This problem is easily resolved 1525by having the lexical analyzer read the single stream of events from the 1526terminal, directly executing all typing and mouse commands, 1527but passing to the parser characters typed to the 1528.CW sam 1529command window. 1530This scheme is slightly complicated by the availability of cut-and-paste 1531editing in the 1532.CW sam 1533window, but that difficulty is resolved by applying the rules 1534used in 1535.CW mux : 1536when a newline is typed to the 1537.CW sam 1538window, all text between the newline and the previously typed newline 1539is made available to the parser. 1540This permits arbitrary editing to be done to a command before 1541typing newline and thereby requesting execution. 1542.PP 1543The parser is driven by a table because the syntax of addresses 1544and commands is regular enough 1545to be encoded compactly. There are few special cases, such as the 1546replacement text in a substitution, so the syntax of almost all commands 1547can be encoded with a few flags. 1548These include whether the command allows an address (for example, 1549.CW e 1550does not), whether it takes a regular expression (as in 1551.CW x 1552and 1553.CW s ), 1554whether it takes replacement text (as in 1555.CW c 1556or 1557.CW i ), 1558which may be multi-line, and so on. 1559The internal syntax of regular expressions is handled by a separate 1560parser; a regular expression is a leaf of the command parse tree. 1561Regular expressions are discussed fully in the next section. 1562.PP 1563The parser table also has information about defaults, so the interpreter 1564is always called with a complete tree. For example, the parser fills in 1565the implicit 1566.CW 0 1567and 1568.CW $ 1569in the abbreviated address 1570.CW , 1571(comma), 1572inserts a 1573.CW + 1574to the left of an unadorned regular expression in an address, 1575and provides the usual default address 1576.CW . 1577(dot) for commands that expect an address but are not given one. 1578.PP 1579Once a complete command is parsed, the evaluation is easy. 1580The address is evaluated left-to-right starting from the value of dot, 1581with a mostly ordinary expression evaluator. 1582Addresses, like many of the data structures in 1583.CW sam , 1584are held in a C structure and passed around by value: 1585.P1 1586typedef long Posn; /* Position in a file */ 1587typedef struct Range{ 1588 Posn p1, p2; 1589}Range; 1590typedef struct Address{ 1591 Range r; 1592 File *f; 1593}Address; 1594.P2 1595An address is encoded as a substring (character positions 1596.CW p1 1597to 1598.CW p2 ) 1599in a file 1600.CW f . 1601(The data type 1602.CW File 1603is described in detail below.) 1604.PP 1605The address interpreter is an 1606.CW Address -valued 1607function that traverses the parse tree describing an address (the 1608parse tree for the address has type 1609.CW Addrtree ): 1610.P1 1611Address 1612address(ap, a, sign) 1613 Addrtree *ap; 1614 Address a; 1615 int sign; 1616{ 1617 Address a2; 1618 do 1619 switch(ap->type){ 1620 case '.': 1621 a=a.f->dot; 1622 break; 1623 case '$': 1624 a.r.p1=a.r.p2=a.f->nbytes; 1625 break; 1626 case '"': 1627 a=matchfile(a, ap->aregexp)->dot; 1628 break; 1629 case ',': 1630 a2=address(ap->right, a, 0); 1631 a=address(ap->left, a, 0); 1632 if(a.f!=a2.f || a2.r.p2<a.r.p1) 1633 error(Eorder); 1634 a.r.p2=a2.r.p2; 1635 return a; 1636 /* and so on */ 1637 } 1638 while((ap=ap->right)!=0); 1639 return a; 1640} 1641.P2 1642.PP 1643Throughout, errors are handled by a non-local 1644.CW goto 1645(a 1646.CW setjmp/longjmp 1647in C terminology) 1648hidden in a routine called 1649.CW error 1650that immediately aborts the execution, retracts any 1651partially made changes (see the section below on `undoing'), and 1652returns to the top level of the parser. 1653The argument to 1654.CW error 1655is an enumeration type that 1656is translated to a terse but possibly helpful 1657message such as `?addresses out of order.' 1658Very common messages are kept short; for example the message for 1659a failed regular expression search is `?search.' 1660.PP 1661Character addresses such as 1662.CW #3 1663are trivial to implement, as the 1664.CW File 1665data structure is accessible by character number. 1666However, 1667.CW sam 1668keeps no information about the position of newlines \(em it is too 1669expensive to track dynamically \(em so line addresses are computed by reading 1670the file, counting newlines. Except in very large files, this has proven 1671acceptable: file access is fast enough to make the technique practical, 1672and lines are not central to the structure of the command language. 1673.PP 1674The command interpreter, called 1675.CW cmdexec , 1676is also straightforward. The parse table includes a 1677function to call to interpret a particular command. That function 1678receives as arguments 1679the calculated address 1680for the command 1681and the command tree (of type 1682.CW Cmdtree ), 1683which may contain information such as the subtree for compound commands. 1684Here, for example, is the function for the 1685.CW g 1686and 1687.CW v 1688commands: 1689.P1 1690int 1691g_cmd(a, cp) 1692 Address a; 1693 Cmdtree *cp; 1694{ 1695 compile(cp->regexp); 1696 if(execute(a.f, a.r.p1, a.r.p2)!=(cp->cmdchar=='v')){ 1697 a.f->dot=a; 1698 return cmdexec(a, cp->subcmd); 1699 } 1700 return TRUE; /* cause execution to continue */ 1701} 1702.P2 1703.CW Compile "" ( 1704and 1705.CW execute 1706are part of the regular expression code, described in the next section.) 1707Because the parser and the 1708.CW File 1709data structure do most of the work, most commands 1710are similarly brief. 1711.SH 2 1712Regular expressions 1713.LP 1714The regular expression code in 1715.CW sam 1716is an interpreted, rather than compiled on-the-fly, implementation of Thompson's 1717non-deterministic finite automaton algorithm.\u\s-4\&12\s+4\d 1718The syntax and semantics of the expressions are as in the UNIX program 1719.CW egrep , 1720including alternation, closures, character classes, and so on. 1721The only changes in the notation are two additions: 1722.CW \en 1723is translated to, and matches, a newline character, and 1724.CW @ 1725matches any character. In 1726.CW egrep , 1727the character 1728.CW \&. 1729matches any character except newline, and in 1730.CW sam 1731the same rule seemed safest, to prevent idioms like 1732.CW \&.* 1733from spanning newlines. 1734.CW Egrep 1735expressions are arguably too complicated for an interactive editor \(em 1736certainly it would make sense if all the special characters were two-character 1737sequences, so that most of the punctuation characters wouldn't have 1738peculiar meanings \(em but for an interesting command language, full 1739regular expressions are necessary, and 1740.CW egrep 1741defines the full regular expression syntax for UNIX programs. 1742Also, it seemed superfluous to define a new syntax, since various UNIX programs 1743.CW ed , ( 1744.CW egrep 1745and 1746.CW vi ) 1747define too many already. 1748.PP 1749The expressions are compiled by a routine, 1750.CW compile , 1751that generates the description of the non-deterministic finite state machine. 1752A second routine, 1753.CW execute , 1754interprets the machine to generate the leftmost-longest match of the 1755expression in a substring of the file. 1756The algorithm is described elsewhere.\u\s-4\&12,13\s+4\d 1757.CW Execute 1758reports 1759whether a match was found, and sets a global variable, 1760of type 1761.CW Range , 1762to the substring matched. 1763.PP 1764A trick is required to evaluate the expression in reverse, such as when 1765searching backwards for an expression. 1766For example, 1767.P1 1768-/P.*r/ 1769.P2 1770looks backwards through the file for a match of the expression. 1771The expression, however, is defined for a forward search. 1772The solution is to construct a machine identical to the machine 1773for a forward search except for a reversal of all the concatenation 1774operators (the other operators are symmetric under direction reversal), 1775to exchange the meaning of the operators 1776.CW ^ 1777and 1778.CW $ , 1779and then to read the file backwards, looking for the 1780usual earliest longest match. 1781.PP 1782.CW Execute 1783generates only one match each time it is called. 1784To interpret looping constructs such as the 1785.CW x 1786command, 1787.CW sam 1788must therefore synchronize between 1789calls of 1790.CW execute 1791to avoid 1792problems with null matches. 1793For example, even given the leftmost-longest rule, 1794the expression 1795.CW a* 1796matches three times in the string 1797.CW ab 1798(the character 1799.CW a , 1800the null string between the 1801.CW a 1802and 1803.CW b , 1804and the final null string). 1805After returning a match for the 1806.CW a , 1807.CW sam 1808must not match the null string before the 1809.CW b . 1810The algorithm starts 1811.CW execute 1812at the end of its previous match, and 1813if the match it returns 1814is null and abuts the previous match, rejects the match and advances 1815the initial position one character. 1816.SH 2 1817Memory allocation 1818.LP 1819The C language has no memory allocation primitives, although a standard 1820library routine, 1821.CW malloc , 1822provides adequate service for simple programs. 1823For specific uses, however, 1824it can be better to write a custom allocator. 1825The allocator (or rather, pair of allocators) described here 1826work in both the terminal and host parts of 1827.CW sam . 1828They are designed for efficient manipulation of strings, 1829which are allocated and freed frequently and vary in length from essentially 1830zero to 32 Kbytes (very large strings are written to disc). 1831More important, strings may be large and change size often, 1832so to minimize memory usage it is helpful to reclaim and to coalesce the 1833unused portions of strings when they are truncated. 1834.PP 1835Objects to be allocated in 1836.CW sam 1837are of two flavors: 1838the first is C 1839.CW structs , 1840which are small and often addressed by pointer variables; 1841the second is variable-sized arrays of characters 1842or integers whose 1843base pointer is always used to access them. 1844The memory allocator in 1845.CW sam 1846is therefore in two parts: 1847first, a traditional first-fit allocator that provides fixed storage for 1848.CW structs ; 1849and second, a garbage-compacting allocator that reduces storage 1850overhead for variable-sized objects, at the cost of some bookkeeping. 1851The two types of objects are allocated from adjoining arenas, with 1852the garbage-compacting allocator controlling the arena with higher addresses. 1853Separating into two arenas simplifies compaction and prevents fragmentation due 1854to immovable objects. 1855The access rules for garbage-compactable objects 1856(discussed in the next paragraph) allow them to be relocated, so when 1857the first-fit arena needs space, it moves the garbage-compacted arena 1858to higher addresses to make room. Storage is therefore created only 1859at successively higher addresses, either when more garbage-compacted 1860space is needed or when the first-fit arena pushes up the other arena. 1861.PP 1862Objects that may be compacted declare to the 1863allocator a cell that is guaranteed to be the sole repository of the 1864address of the object whenever a compaction can occur. 1865The compactor can then update the address when the object is moved. 1866For example, the implementation of type 1867.CW List 1868(really a variable-length array) 1869is: 1870.P1 1871typedef struct List{ 1872 int nused; 1873 long *ptr; 1874}List; 1875.P2 1876The 1877.CW ptr 1878cell must always be used directly, and never copied. When a 1879.CW List 1880is to be created the 1881.CW List 1882structure is allocated in the ordinary first-fit arena 1883and its 1884.CW ptr 1885is allocated in the garbage-compacted arena. 1886A similar data type for strings, called 1887.CW String , 1888stores variable-length character arrays of up to 32767 elements. 1889.PP 1890A related matter of programming style: 1891.CW sam 1892frequently passes structures by value, which 1893simplifies the code. 1894Traditionally, C programs have 1895passed structures by reference, but implicit allocation on 1896the stack is easier to use. 1897Structure passing is a relatively new feature of C 1898(it is not in the 1899standard reference manual for C\u\s-4\&14\s+4\d), and is poorly supported in most 1900commercial C compilers. 1901It's convenient and expressive, though, 1902and simplifies memory management by 1903avoiding the allocator altogether 1904and eliminating pointer aliases. 1905.SH 2 1906Data structures for manipulating files 1907.LP 1908Experience with 1909.CW jim 1910showed that the requirements 1911of the file data structure were few, but strict. 1912First, files need to be read and written quickly; 1913adding a fresh file must be painless. 1914Second, the implementation must place no arbitrary upper limit on 1915the number or sizes of files. (It should be practical to edit many files, 1916and files up to megabytes in length should be handled gracefully.) 1917This implies that files be stored on disc, not in main memory. 1918(Aficionados of virtual memory may argue otherwise, but the 1919implementation of virtual 1920memory in our system is not something to depend on 1921for good performance.) 1922Third, changes to files need be made by only two primitives: 1923deletion and insertion. 1924These are inverses of each other, 1925which simplifies the implementation of the undo operation. 1926Finally, 1927it must be easy and efficient to access the file, either 1928forwards or backwards, a byte at a time. 1929.PP 1930The 1931.CW File 1932data type is constructed from three simpler data structures that hold arrays 1933of characters. 1934Each of these types has an insertion and deletion operator, and the 1935insertion and deletion operators of the 1936.CW File 1937type itself are constructed from them. 1938.PP 1939The simplest type is the 1940.CW String , 1941which is used to hold strings in main memory. 1942The code that manages 1943.CW Strings 1944guarantees that they will never be longer 1945than some moderate size, and in practice they are rarely larger than 8 Kbytes. 1946.CW Strings 1947have two purposes: they hold short strings like file names with little overhead, 1948and because they are deliberately small, they are efficient to modify. 1949They are therefore used as the data structure for in-memory caches. 1950.PP 1951The disc copy of the file is managed by a data structure called a 1952.CW Disc , 1953which corresponds to a temporary file. A 1954.CW Disc 1955has no storage in main memory other than bookkeeping information; 1956the actual data being held is all on the disc. 1957To reduce the number of open files needed, 1958.CW sam 1959opens a dozen temporary UNIX files and multiplexes the 1960.CW Discs 1961upon them. 1962This permits many files to 1963be edited; the entire 1964.CW sam 1965source (48 files) may be edited comfortably with a single 1966instance of 1967.CW sam . 1968Allocating one temporary file per 1969.CW Disc 1970would strain the operating system's limit on the number of open files. 1971Also, spreading the traffic among temporary files keeps the files shorter, 1972and shorter files are more efficiently implemented by the UNIX 1973I/O subsystem. 1974.PP 1975A 1976.CW Disc 1977is an array of fixed-length blocks, each of which contains 1978between 1 and 4096 characters of active data. 1979(The block size of our UNIX file system is 4096 bytes.) 1980The block addresses within the temporary file and the length of each 1981block are stored in a 1982.CW List . 1983When changes are made the live part of blocks may change size. 1984Blocks are created and coalesced when necessary to try to keep the sizes 1985between 2048 and 4096 bytes. 1986An actively changing part of the 1987.CW Disc 1988therefore typically has about a kilobyte of slop that can be 1989inserted or deleted 1990without changing more than one block or affecting the block order. 1991When an insertion would overflow a block, the block is split, a new one 1992is allocated to receive the overflow, and the memory-resident list of blocks 1993is rearranged to reflect the insertion of the new block. 1994.PP 1995Obviously, going to the disc for every modification to the file is 1996prohibitively expensive. 1997The data type 1998.CW Buffer 1999consists of a 2000.CW Disc 2001to hold the data and a 2002.CW String 2003that acts as a cache. 2004This is the first of a series of caches throughout the data structures in 2005.CW sam. 2006The caches not only improve performance, they provide a way to organize 2007the flow of data, particularly in the communication between the host 2008and terminal. 2009This idea is developed below, in the section on communications. 2010.PP 2011To reduce disc traffic, changes to a 2012.CW Buffer 2013are mediated by a variable-length string, in memory, that acts as a cache. 2014When an insertion or deletion is made to a 2015.CW Buffer , 2016if the change can be accommodated by the cache, it is done there. 2017If the cache becomes bigger than a block because of an insertion, 2018some of it is written to the 2019.CW Disc 2020and deleted from the cache. 2021If the change does not intersect the cache, the cache is flushed. 2022The cache is only loaded at the new position if the change is smaller than a block; 2023otherwise, it is sent directly to the 2024.CW Disc . 2025This is because 2026large changes are typically sequential, 2027whereupon the next change is unlikely to overlap the current one. 2028.PP 2029A 2030.CW File 2031comprises a 2032.CW String 2033to hold the file name and some ancillary data such as dot and the modified bit. 2034The most important components, though, are a pair of 2035.CW Buffers , 2036one called the transcript and the other the contents. 2037Their use is described in the next section. 2038.PP 2039The overall structure is shown in Figure 5. 2040Although it may seem that the data is touched many times on its 2041way from the 2042.CW Disc , 2043it is read (by one UNIX system call) directly into the cache of the 2044associated 2045.CW Buffer ; 2046no extra copy is done. 2047Similarly, when flushing the cache, the text is written 2048directly from the cache to disc. 2049Most operations act directly on the text in the cache. 2050A principle applied throughout 2051.CW sam 2052is that the fewer times the data is copied, the faster the program will run 2053(see also the paper by Waite\u\s-4\&15\s+4\d). 2054.KF 2055.PS 2056copy "fig5.pic" 2057.PE 2058.Cs 2059Figure 5. File data structures. 2060The temporary files are stored in the standard repository for such files 2061on the host system. 2062.Ce 2063.KE 2064.PP 2065The contents of a 2066.CW File 2067are accessed by a routine that 2068copies to a buffer a substring of a file starting at a specified offset. 2069To read a byte at a time, a 2070.CW File "" per- 2071array is loaded starting from a specified initial position, 2072and bytes may then be read from the array. 2073The implementation is done by a macro similar to the C standard I/O 2074.CW getc 2075macro.\u\s-4\&14\s+4\d 2076Because the reading may be done at any address, a minor change to the 2077macro allows the file to be read backwards. 2078This array is read-only; there is no 2079.CW putc . 2080.SH 2 2081Doing and undoing 2082.LP 2083.CW Sam 2084has an unusual method for managing changes to files. 2085The command language makes it easy to specify multiple variable-length changes 2086to a file millions of bytes long, and such changes 2087must be made efficiently if the editor is to be practical. 2088The usual techniques for inserting and deleting strings 2089are inadequate under these conditions. 2090The 2091.CW Buffer 2092and 2093.CW Disc 2094data structures are designed for efficient random access to long strings, 2095but care must be taken to avoid super-linear behavior when making 2096many changes simultaneously. 2097.PP 2098.CW Sam 2099uses a two-pass algorithm for making changes, and treats each file as a database 2100against which transactions are registered. 2101Changes are not made directly to the contents. 2102Instead, when a command is started, a `mark' containing 2103a sequence number is placed in the transcript 2104.CW Buffer , 2105and each change made to the file, either an insertion or deletion 2106or a change to the file name, 2107is appended to the end of the transcript. 2108When the command is complete, the transcript is rewound to the 2109mark and applied to the contents. 2110.PP 2111One reason for separating evaluation from 2112application in this way is to simplify tracking the addresses of changes 2113made in the middle of a long sequence. 2114The two-pass algorithm also allows all changes to apply to the 2115.I original 2116data: no change can affect another change made in the same command. 2117This is particularly important when evaluating an 2118.CW x 2119command because it prevents regular expression matches 2120from stumbling over changes made earlier in the execution. 2121Also, the two-pass 2122algorithm is cleaner than the way other UNIX editors allow changes to 2123affect each other; 2124for example, 2125.CW ed 's 2126idioms to do things like delete every other line 2127depend critically on the implementation. 2128Instead, 2129.CW sam 's 2130simple model, in which all changes in a command occur effectively 2131simultaneously, is easy to explain and to understand. 2132.PP 2133The records in the transcript are of the form ``delete substring from 2134locations 2135123 to 456'' and ``insert 11 characters `hello there' at location 789.'' 2136(It is an error if the changes are not at monotonically greater 2137positions through the file.) 2138While the update is occurring, these numbers must be 2139offset by earlier changes, but that is straightforward and 2140local to the update routine; 2141moreover, all the numbers have been computed 2142before the first is examined. 2143.PP 2144Treating the file as a transaction system has another advantage: 2145undo is trivial. 2146All it takes is to invert the transcript after it has been 2147implemented, converting insertions 2148into deletions and vice versa, and saving them in a holding 2149.CW Buffer . 2150The `do' transcript can then be deleted from 2151the transcript 2152.CW Buffer 2153and replaced by the `undo' transcript. 2154If an undo is requested, the transcript is rewound and the undo transcript 2155executed. 2156Because the transcript 2157.CW Buffer 2158is not truncated after each command, it accumulates 2159successive changes. 2160A sequence of undo commands 2161can therefore back up the file arbitrarily, 2162which is more helpful than the more commonly implemented self-inverse form of undo. 2163.CW Sam "" ( 2164provides no way to undo an undo, but if it were desired, 2165it would be easy to provide by re-interpreting the `do' transcript.) 2166Each mark in the transcript contains a sequence number and the offset into 2167the transcript of the previous mark, to aid in unwinding the transcript. 2168Marks also contain the value of dot and the modified bit so these can be 2169restored easily. 2170Undoing multiple files is easy; it merely demands undoing all files whose 2171latest change has the same sequence number as the current file. 2172.PP 2173Another benefit of having a transcript is that errors encountered in the middle 2174of a complicated command need not leave the files in an intermediate state. 2175By rewinding the transcript to the mark beginning the command, 2176the partial command can be trivially undone. 2177.PP 2178When the update algorithm was first implemented, it was unacceptably slow, 2179so a cache was added to coalesce nearby changes, 2180replacing multiple small changes by a single larger one. 2181This reduced the number 2182of insertions into the transaction 2183.CW Buffer , 2184and made a dramatic improvement in performance, 2185but made it impossible 2186to handle changes in non-monotonic order in the file; the caching method 2187only works if changes don't overlap. 2188Before the cache was added, the transaction could in principle be sorted 2189if the changes were out of order, although 2190this was never done. 2191The current status is therefore acceptable performance with a minor 2192restriction on global changes, which is sometimes, but rarely, an annoyance. 2193.PP 2194The update algorithm obviously paws the data more than simpler 2195algorithms, but it is not prohibitively expensive; 2196the caches help. 2197(The principle of avoiding copying the data is still honored here, 2198although not as piously: 2199the data is moved from contents' cache to 2200the transcript's all at once and through only one internal buffer.) 2201Performance figures confirm the efficiency. 2202To read from a dead start a hundred kilobyte file on a VAX-11/750 2203takes 1.4 seconds of user time, 2.5 seconds of system time, 2204and 5 seconds of real time. 2205Reading the same file in 2206.CW ed 2207takes 6.0 seconds of user time, 1.7 seconds of system time, 2208and 8 seconds of real time. 2209.CW Sam 2210uses about half the CPU time. 2211A more interesting example is the one stated above: 2212inserting a character between every pair of characters in the file. 2213The 2214.CW sam 2215command is 2216.P1 2217,y/@/ a/x/ 2218.P2 2219and takes 3 CPU seconds per kilobyte of input file, of which 2220about a third is spent in the regular expression code. 2221This translates to about 500 changes per second. 2222.CW Ed 2223takes 1.5 seconds per kilobyte to make a similar change (ignoring newlines), 2224but cannot undo it. 2225The same example in 2226.CW ex ,\u\s-4\&9\s+4\d 2227a variant of 2228.CW ed 2229done at the University of California at Berkeley, 2230which allows one level of undoing, again takes 3 seconds. 2231In summary, 2232.CW sam 's 2233performance is comparable to that of other UNIX editors, although it solves 2234a harder problem. 2235.SH 2 2236Communications 2237.LP 2238The discussion so far has described the implementation of the host part of 2239.CW sam ; 2240the next few sections explain how a machine with mouse and bitmap display 2241can be engaged to improve interaction. 2242.CW Sam 2243is not the first editor to be written as two processes,\u\s-4\&16\s+4\d 2244but its implementation 2245has some unusual aspects. 2246.PP 2247There are several ways 2248.CW sam 's 2249host and terminal parts may be connected. 2250The first and simplest is to forgo the terminal part and use the host 2251part's command language to edit text on an ordinary terminal. 2252This mode is invoked by starting 2253.CW sam 2254with the 2255.CW -d 2256option. 2257With no options, 2258.CW sam 2259runs separate host and terminal programs, 2260communicating with a message protocol over the physical 2261connection that joins them. 2262Typically, the connection is an RS-232 link between a Blit 2263(the prototypical display for 2264.CW sam ) 2265and a host running 2266the Ninth Edition of the UNIX operating system.\u\s-4\&8\s+4\d 2267(This is the version of the system used in the Computing Sciences Research 2268Center at AT&T Bell Laboratories [now Lucent Technologies, Bell Labs], where I work. Its relevant 2269aspects are discussed in the Blit paper.\u\s-4\&1\s+4\d) 2270The implementation of 2271.CW sam 2272for the SUN computer runs both processes on the same machine and 2273connects them by a pipe. 2274.PP 2275The low bandwidth of an RS-232 link 2276necessitated the split between 2277the two programs. 2278The division is a mixed blessing: 2279a program in two parts is much harder to write and to debug 2280than a self-contained one, 2281but the split makes several unusual configurations possible. 2282The terminal may be physically separated from the host, allowing the conveniences 2283of a mouse and bitmap display to be taken home while leaving the files at work. 2284It is also possible to run the host part on a remote machine: 2285.P1 2286sam -r host 2287.P2 2288connects to the terminal in the usual way, and then makes a call 2289across the network to establish the host part of 2290.CW sam 2291on the named machine. 2292Finally, it cross-connects the I/O to join the two parts. 2293This allows 2294.CW sam 2295to be run on machines that do not support bitmap displays; 2296for example, 2297.CW sam 2298is the editor of choice on our Cray X-MP/24. 2299.CW Sam 2300.CW -r 2301involves 2302.I three 2303machines: the remote host, the terminal, and the local host. 2304The local host's job is simple but vital: it passes the data 2305between the remote host and terminal. 2306.PP 2307The host and terminal exchange messages asynchronously 2308(rather than, say, as remote procedure calls) but there is no 2309error detection or correction 2310because, whatever the configuration, the connection is reliable. 2311Because the terminal handles mundane interaction tasks such as 2312popping up menus and interpreting the responses, the messages are about 2313data, not actions. 2314For example, the host knows nothing about what is displayed on the screen, 2315and when the user types a character, the message sent to the host says 2316``insert a one-byte string at location 123 in file 7,'' not ``a character 2317was typed at the current position in the current file.'' 2318In other words, the messages look very much like the transaction records 2319in the transcripts. 2320.PP 2321Either the host or terminal part of 2322.CW sam 2323may initiate a change to a file. 2324The command language operates on the host, while typing and some 2325mouse operations are executed directly in the terminal to optimize response. 2326Changes initiated by the host program must be transmitted to the terminal, 2327and 2328vice versa. 2329(A token is exchanged to determine which end is in control, 2330which means that characters typed while a time-consuming command runs 2331must be buffered and do not appear until the command is complete.) 2332To maintain consistent information, 2333the host and terminal track changes through a per-file 2334data structure that records what portions of the file 2335the terminal has received. 2336The data structure, called a 2337.CW Rasp 2338(a weak pun: it's a file with holes) 2339is held and updated by both the host and terminal. 2340A 2341.CW Rasp 2342is a list of 2343.CW Strings 2344holding those parts of the file known to the terminal, 2345separated by counts of the number of bytes in the interstices. 2346Of course, the host doesn't keep a separate copy of the data (it only needs 2347the lengths of the various pieces), 2348but the structure is the same on both ends. 2349.PP 2350The 2351.CW Rasp 2352in the terminal doubles as a cache. 2353Since the terminal keeps the text for portions of the file it has displayed, 2354it need not request data from the host when revisiting old parts of the file 2355or redrawing obscured windows, which speeds things up considerably 2356over low-speed links. 2357.PP 2358It's trivial for the terminal to maintain its 2359.CW Rasp , 2360because all changes made on the terminal apply to parts of the file 2361already loaded there. 2362Changes made by the host are compared against the 2363.CW Rasp 2364during the update sequence after each command. 2365Small changes to pieces of the file loaded in the terminal 2366are sent in their entirety. 2367Larger changes, and changes that fall entirely in the holes, 2368are transmitted as messages without literal data: 2369only the lengths of the deleted and inserted strings are transmitted. 2370When a command is completed, the terminal examines its visible 2371windows to see if any holes in their 2372.CW Rasps 2373intersect the visible portion of the file. 2374It then requests the missing data from the host, 2375along with up to 512 bytes of surrounding data, to minimize 2376the number of messages when visiting a new portion of the file. 2377This technique provides a kind of two-level lazy evaluation for the terminal. 2378The first level sends a minimum of information about 2379parts of the file not being edited interactively; 2380the second level waits until a change is displayed before 2381transmitting the new data. 2382Of course, 2383performance is also helped by having the terminal respond immediately to typing 2384and simple mouse requests. 2385Except for small changes to active pieces of the file, which are 2386transmitted to the terminal without negotiation, 2387the terminal is wholly responsible for deciding what is displayed; 2388the host uses the 2389.CW Rasp 2390only to tell the terminal what might be relevant. 2391.PP 2392When a change is initiated by the host, 2393the messages to the terminal describing the change 2394are generated by the routine that applies the transcript of the changes 2395to the contents of the 2396.CW File . 2397Since changes are undone by the same update routine, 2398undoing requires 2399no extra code in the communications; 2400the usual messages describing changes to the file are sufficient 2401to back up the screen image. 2402.PP 2403The 2404.CW Rasp 2405is a particularly good example of the way caches are used in 2406.CW sam . 2407First, it facilitates access to the active portion of the text by placing 2408the busy text in main memory. 2409In so doing, it provides efficient access 2410to a large data structure that does not fit in memory. 2411Since the form of data is to be imposed by the user, not by the program, 2412and because characters will frequently be scanned sequentially, 2413files are stored as flat objects. 2414Caches help keep performance good and linear when working with such 2415data. 2416.PP 2417Second, the 2418.CW Rasp 2419and several of the other caches have some 2420.I read-ahead; 2421that is, the cache is loaded with more information than is needed for 2422the job immediately at hand. 2423When manipulating linear structures, the accesses are usually sequential, 2424and read-ahead can significantly reduce the average time to access the 2425next element of the object. 2426Sequential access is a common mode for people as well as programs; 2427consider scrolling through a document while looking for something. 2428.PP 2429Finally, like any good data structure, 2430the cache guides the algorithm, or at least the implementation. 2431The 2432.CW Rasp 2433was actually invented to control the communications between the host and 2434terminal parts, but I realized very early that it was also a form of 2435cache. Other caches were more explicitly intended to serve a double 2436purpose: for example, the caches in 2437.CW Files 2438that coalesce updates not only reduce traffic to the 2439transcript and contents 2440.CW Buffers , 2441they also clump screen updates so that complicated changes to the 2442screen are achieved in 2443just a few messages to the terminal. 2444This saved me considerable work: I did not need to write special 2445code to optimize the message traffic to the 2446terminal. 2447Caches pay off in surprising ways. 2448Also, they tend to be independent, so their performance improvements 2449are multiplicative. 2450.SH 2 2451Data structures in the terminal 2452.LP 2453The terminal's job is to display and to maintain a consistent image of 2454pieces of the files being edited. 2455Because the text is always in memory, the data structures are 2456considerably simpler than those in the host part. 2457.PP 2458.CW Sam 2459typically has far more windows than does 2460.CW mux , 2461the window system within which its Blit implementation runs. 2462.CW Mux 2463has a fairly small number of asynchronously updated windows; 2464.CW sam 2465needs a large number of synchronously updated windows that are 2466usually static and often fully obscured. 2467The different tradeoffs guided 2468.CW sam 2469away from the memory-intensive implementation of windows, called 2470.CW Layers ,\u\s-4\&17\s+4\d 2471used in 2472.CW mux. 2473Rather than depending on a complete bitmap image of the display for each window, 2474.CW sam 2475regenerates the image from its in-memory text 2476(stored in the 2477.CW Rasp ) 2478when necessary, although it will use such an image if it is available. 2479Like 2480.CW Layers , 2481though, 2482.CW sam 2483uses the screen bitmap as active storage in which to update the image using 2484.CW bitblt .\u\s-4\&18,19\s+4\d 2485The resulting organization, pictured in Figure 6, 2486has a global array of windows, called 2487.CW Flayers , 2488each of which holds an image of a piece of text held in a data structure 2489called a 2490.CW Frame , 2491which in turn represents 2492a rectangular window full of text displayed in some 2493.CW Bitmap . 2494Each 2495.CW Flayer 2496appears in a global list that orders them all front-to-back 2497on the display, and simultaneously as an element of a per-file array 2498that holds all the open windows for that file. 2499The complement in the terminal of the 2500.CW File 2501on the host is called a 2502.CW Text ; 2503each connects its 2504.CW Flayers 2505to the associated 2506.CW Rasp . 2507.KF 2508.PS 2509copy "fig6.pic" 2510.PE 2511.Cs 2512Figure 6. Data structures in the terminal. 2513.CW Flayers 2514are also linked together into a front-to-back list. 2515.CW Boxes 2516are discussed in the next section. 2517.Ce 2518.KE 2519.PP 2520The 2521.CW Bitmap 2522for a 2523.CW Frame 2524contains the image of the text. 2525For a fully visible window, the 2526.CW Bitmap 2527will be the screen (or at least the 2528.CW Layer 2529in which 2530.CW sam 2531is being run), 2532while for partially obscured windows the 2533.CW Bitmap 2534will be off-screen. 2535If the window is fully obscured, the 2536.CW Bitmap 2537will be null. 2538.PP 2539The 2540.CW Bitmap 2541is a kind of cache. 2542When making changes to the display, most of the original image will 2543look the same in the final image, and the update algorithms exploit this. 2544The 2545.CW Frame 2546software updates the image in the 2547.CW Bitmap 2548incrementally; the 2549.CW Bitmap 2550is not just an image, it is a data structure.\u\s-4\&18,19\s+4\d 2551The job of the software that updates the display is therefore 2552to use as much as possible of the existing image (converting the 2553text from ASCII characters to pixels is expensive) in a sort of two-dimensional 2554string insertion algorithm. 2555The details of this process are described in the next section. 2556.PP 2557The 2558.CW Frame 2559software has no code to support overlapping windows; 2560its job is to keep a single 2561.CW Bitmap 2562up to date. 2563It falls to the 2564.CW Flayer 2565software to multiplex the various 2566.CW Bitmaps 2567onto the screen. 2568The problem of maintaining overlapping 2569.CW Flayers 2570is easier than for 2571.CW Layers \u\s-4\&17\s+4\d 2572because changes are made synchronously and because the contents of the window 2573can be reconstructed from the data stored in the 2574.CW Frame ; 2575the 2576.CW Layers 2577software 2578makes no such assumptions. 2579In 2580.CW sam , 2581the window being changed is almost always fully visible, because the current 2582window is always fully visible, by construction. 2583However, when multi-file changes are being made, or when 2584more than one window is open on a file, 2585it may be necessary to update partially obscured windows. 2586.PP 2587There are three cases: the window is 2588fully visible, invisible (fully obscured), or partially visible. 2589If fully visible, the 2590.CW Bitmap 2591is part of the screen, so when the 2592.CW Flayer 2593update routine calls the 2594.CW Frame 2595update routine, the screen will be updated directly. 2596If the window is invisible, 2597there is no associated 2598.CW Bitmap , 2599and all that is necessary is to update the 2600.CW Frame 2601data structure, not the image. 2602If the window is partially visible, the 2603.CW Frame 2604routine is called to update the image in the off-screen 2605.CW Bitmap , 2606which may require regenerating it from the text of the window. 2607The 2608.CW Flayer 2609code then clips this 2610.CW Bitmap 2611against the 2612.CW Bitmaps 2613of all 2614.CW Frames 2615in front of the 2616.CW Frame 2617being modified, and the remainder is copied to the display. 2618.PP 2619This is much faster than recreating the image off-screen 2620for every change, or clipping all the changes made to the image 2621during its update. 2622Unfortunately, these caches can also consume prohibitive amounts of 2623memory, so they are freed fairly liberally \(em after every change to the 2624front-to-back order of the 2625.CW Flayers . 2626The result is that 2627the off-screen 2628.CW Bitmaps 2629exist only while multi-window changes are occurring, 2630which is the only time the performance improvement they provide is needed. 2631Also, the user interface causes fully-obscured windows to be the 2632easiest to make \(em 2633creating a canonically sized and placed window requires only a button click 2634\(em which reduces the need for caching still further. 2635.PP 2636.SH 2 2637Screen update 2638.LP 2639Only two low-level primitives are needed for incremental update: 2640.CW bitblt , 2641which copies rectangles of pixels, and 2642.CW string 2643(which in turn calls 2644.CW bitblt ), 2645which draws a null-terminated character string in a 2646.CW Bitmap . 2647A 2648.CW Frame 2649contains a list of 2650.CW Boxes , 2651each of which defines a horizontal strip of text in the window 2652(see Figure 7). 2653A 2654.CW Box 2655has a character string 2656.CW str , 2657and a 2658.CW Rectangle 2659.CW rect 2660that defines the location of the strip in the window. 2661(The text in 2662.CW str 2663is stored in the 2664.CW Box 2665separately from the 2666.CW Rasp 2667associated with the window's file, so 2668.CW Boxes 2669are self-contained.) 2670The invariant is that 2671the image of the 2672.CW Box 2673can be reproduced by calling 2674.CW string 2675with argument 2676.CW str 2677to draw the string in 2678.CW rect , 2679and the resulting picture fits perfectly within 2680.CW rect . 2681In other words, the 2682.CW Boxes 2683define the tiling of the window. 2684The tiling may be complicated by long lines of text, which 2685are folded onto the next line. 2686Some editors use horizontal scrolling to avoid this complication, 2687but to be comfortable this technique requires that lines not be 2688.I too 2689long; 2690.CW sam 2691has no such restriction. 2692Also, and perhaps more importantly, UNIX programs and terminals traditionally fold 2693long lines to make their contents fully visible. 2694.PP 2695Two special kinds of 2696.CW Boxes 2697contain a single 2698character: either a newline or a tab. 2699Newlines and tabs are white space. 2700A newline 2701.CW Box 2702always extends to the right edge of the window, 2703forcing the following 2704.CW Box 2705to the next line. 2706The width of a tab depends on where it is located: 2707it forces the next 2708.CW Box 2709to begin at a tab location. 2710Tabs also 2711have a minimum width equivalent to a blank (blanks are 2712drawn by 2713.CW string 2714and are not treated specially); newlines have a minimum width of zero. 2715.KF 2716.PS 2717copy "fig7.pic" 2718.PE 2719.sp .5 2720.Cs 2721Figure 7. A line of text showing its 2722.CW Boxes . 2723The first two blank 2724.CW Boxes 2725contain tabs; the last contains a newline. 2726Spaces are handled as ordinary characters. 2727.Ce 2728.KE 2729.PP 2730The update algorithms always use the 2731.CW Bitmap 2732image of the text (either the display or cache 2733.CW Bitmap ); 2734they never examine the characters within a 2735.CW Box 2736except when the 2737.CW Box 2738needs to be split in two. 2739Before a change, the window consists of a tiling of 2740.CW Boxes ; 2741after the change the window is tiled differently. 2742The update algorithms rearrange the tiles in place, without 2743backup storage. 2744The algorithms are not strictly optimal \(em for example, they can 2745clear a pixel that is later going to be written upon \(em 2746but they never move a tile that doesn't need to be moved, 2747and they move each tile at most once. 2748.CW Frinsert 2749on a Blit can absorb over a thousand characters a second if the strings 2750being inserted are a few tens of characters long. 2751.PP 2752Consider 2753.CW frdelete . 2754Its job is to delete a substring from a 2755.CW Frame 2756and restore the image of the 2757.CW Frame . 2758The image of a substring has a peculiar shape (see Figure 2) comprising 2759possibly a partial line, 2760zero or more full lines, 2761and possibly a final partial line. 2762For reference, call this the 2763.I 2764Z-shape. 2765.R 2766.CW Frdelete 2767begins by splitting, if necessary, the 2768.CW Boxes 2769containing the ends of 2770the substring so the substring begins and ends on 2771.CW Box 2772boundaries. 2773Because the substring is being deleted, its image is not needed, 2774so the Z-shape is then cleared. 2775Then, tiles (that is, the images of 2776.CW Boxes ) 2777are copied, using 2778.CW bitblt , 2779from immediately after the Z-shape to 2780the beginning of the Z-shape, 2781resulting in a new Z-shape. 2782.CW Boxes "" ( 2783whose contents would span two lines in the new position must first be split.) 2784.PP 2785Copying the remainder of the 2786.CW Frame 2787tile by tile 2788this way will clearly accomplish the deletion but eventually, 2789typically when the copying algorithm encounters a tab or newline, 2790the old and new 2791.CW x 2792coordinates of the tile 2793to be copied are the same. 2794This correspondence implies 2795that the Z-shape has its beginning and ending edges aligned 2796vertically, and a sequence of at most two 2797.CW bitblts 2798can be used to copy the remaining tiles. 2799The last step is to clear out the resulting empty space at the bottom 2800of the window; 2801the number of lines to be cleared is the number of complete lines in the 2802Z-shape closed by the final 2803.CW bitblts. 2804The final step is to merge horizontally adjacent 2805.CW Boxes 2806of plain text. 2807The complete source to 2808.CW frdelete 2809is less than 100 lines of C. 2810.PP 2811.CW frinsert 2812is more complicated because it must do four passes: 2813one to construct the 2814.CW Box 2815list for the inserted string, 2816one to reconnoitre, 2817one to copy (in opposite order to 2818.CW frdelete ) 2819the 2820.CW Boxes 2821to make the hole for the new text, 2822and finally one to copy the new text into place. 2823Overall, though, 2824.CW frinsert 2825has a similar flavor to 2826.CW frdelete , 2827and needn't be described further. 2828.CW Frinsert 2829and its subsidiary routines comprise 211 lines of C. 2830.PP 2831The terminal source code is 3024 lines of C, 2832and the host source is 5797 lines. 2833.SH 2834Discussion 2835.SH 2 2836History 2837.LP 2838The immediate ancestor of 2839.CW sam 2840was the original text editor for the Blit, called 2841.CW jim . 2842.CW Sam 2843inherited 2844.CW jim 's 2845two-process structure and mouse language almost unchanged, but 2846.CW jim 2847suffered from several drawbacks that were addressed in the design of 2848.CW sam . 2849The most important of these was the lack of a command language. 2850Although 2851.CW jim 2852was easy to use for simple editing, it provided no direct help with 2853large or repetitive editing tasks. Instead, it provided a command to pass 2854selected text through a shell pipeline, 2855but this was no more satisfactory than could be expected of a stopgap measure. 2856.PP 2857.CW Jim 2858was written primarily as a vehicle for experimenting with a mouse-based 2859interface to text, and the experiment was successful. 2860.CW Jim 2861had some spin-offs: 2862.CW mux , 2863the second window system for the Blit, is essentially a multiplexed 2864version of the terminal part of 2865.CW jim ; 2866and the debugger 2867.CW pi 's 2868user interface\u\s-4\&20\s+4\d was closely modeled on 2869.CW jim 's. 2870But after a couple of years, 2871.CW jim 2872had become difficult to maintain and limiting to use, 2873and its replacement was overdue. 2874.PP 2875I began the design of 2876.CW sam 2877by asking 2878.CW jim 2879customers what they wanted. 2880This was probably a mistake; the answers were essentially a list of features 2881to be found in other editors, which did not provide any of the 2882guiding principles I was seeking. 2883For instance, one common request was for a ``global substitute,'' 2884but no one suggested how to provide it within a cut-and-paste editor. 2885I was looking for a scheme that would 2886support such specialized features comfortably in the context of some 2887general command language. 2888Ideas were not forthcoming, though, particularly given my insistence 2889on removing all limits on file sizes, line lengths and so on. 2890Even worse, I recognized that, since the mouse could easily 2891indicate a region of the screen that was not an integral number of lines, 2892the command language would best forget about newlines altogether, 2893and that meant the command language had to treat the file as a single 2894string, not an array of lines. 2895.PP 2896Eventually, I decided that thinking was not getting me very far and it was 2897time to try building. 2898I knew that the terminal part could be built easily \(em 2899that part of 2900.CW jim 2901behaved acceptably well \(em and that most of the hard work was going 2902to be in the host part: the file interface, command interpreter and so on. 2903Moreover, I had some ideas about how the architecture of 2904.CW jim 2905could be improved without destroying its basic structure, which I liked 2906in principle but which hadn't worked out as well as I had hoped. 2907So I began by designing the file data structure, 2908starting with the way 2909.CW jim 2910worked \(em comparable to a single structure merging 2911.CW Disc 2912and 2913.CW Buffer , 2914which I split to make the cache more general 2915\(em and thinking about how global substitute could be implemented. 2916The answer was clearly that it had to be done in two passes, 2917and the transcript-oriented implementation fell out naturally. 2918.PP 2919.CW Sam 2920was written bottom-up, 2921starting from the data structures and algorithms for manipulating text, 2922through the command language and up to the code for maintaining 2923the display. 2924In retrospect, it turned out well, but this implementation method is 2925not recommended in general. 2926There were several times when I had a large body of interesting code 2927assembled and no clue how to proceed with it. 2928The command language, in particular, took almost a year to figure out, 2929but can be implemented (given what was there at the beginning of that year) 2930in a day or two. Similarly, inventing the 2931.CW Rasp 2932data structure delayed the 2933connection of the host and terminal pieces by another few months. 2934.CW Sam 2935took about two years to write, although only about four months were 2936spent actually working on it. 2937.PP 2938Part of the design process was unusual: 2939the subset of the protocol that maintains the 2940.CW Rasp 2941was simulated, debugged 2942and verified by an automatic protocol analyzer,\u\s-4\&21\s+4\d and was bug-free 2943from the start. 2944The rest of the protocol, concerned mostly 2945with keeping menus up to date, 2946was unfortunately too unwieldy for such analysis, 2947and was debugged by more traditional methods, primarily 2948by logging in a file all messages in and out of the host. 2949.SH 2 2950Reflections 2951.LP 2952.CW Sam 2953is essentially the only interactive editor used by the sixty or so members of 2954the computing science research center in which I work. 2955The same could not be said of 2956.CW jim ; 2957the lack of a command language kept some people from adopting it. 2958The union of a user interface as comfortable as 2959.CW jim 's 2960with a command language as powerful as 2961.CW ed 's† 2962.FS 2963.vs 9 2964†The people who criticize 2965.CW ed 2966as an interactive program often forget that it and its close relative 2967.CW sed \u\s-4\&7\s+4\d 2968still thrive as programmable editors. The strength of these programs is 2969independent of their convenience for interactive editing. 2970.br 2971.vs 2972.FE 2973is essential to 2974.CW sam 's 2975success. 2976When 2977.CW sam 2978was first made available to the 2979.CW jim 2980community, 2981almost everyone switched to it within two or three days. 2982In the months that followed, even people who had never adopted 2983.CW jim 2984started using 2985.CW sam 2986exclusively. 2987.PP 2988To be honest, 2989.CW ed 2990still gets occasional use, but usually when 2991something quick needs to be done and the overhead of 2992downloading the terminal part of 2993.CW sam 2994isn't worth the trouble. 2995Also, as a `line' editor, 2996.CW sam 2997.CW -d 2998is a bit odd; 2999when using a good old ASCII terminal, it's comforting to have 3000a true line editor. 3001But it is fair to say that 3002.CW sam 's 3003command language has displaced 3004.CW ed 's 3005for most of the complicated editing that has kept line editors 3006(that is, command-driven editors) with us. 3007.PP 3008.CW Sam 's 3009command language is even fancier than 3010.CW ed 's, 3011and most 3012.CW sam 3013customers don't come near to using all its capabilities. 3014Does it need to be so sophisticated? 3015I think the answer is yes, for two reasons. 3016.PP 3017First, the 3018.I model 3019for 3020.CW sam 's 3021command language is really relatively simple, and certainly simpler than that of 3022.CW ed . 3023For instance, there is only one kind of textual loop in 3024.CW sam 3025\(em the 3026.CW x 3027command \(em 3028while 3029.CW ed 3030has three (the 3031.CW g 3032command, the global flag on substitutions, and the implicit loop over 3033lines in multi-line substitutions). 3034Also, 3035.CW ed 's 3036substitute command is necessary to make changes within lines, but in 3037.CW sam 3038the 3039.CW s 3040command is more of a familiar convenience than a necessity; 3041.CW c 3042and 3043.CW t 3044can do all the work. 3045.PP 3046Second, 3047given a community that expects an editor to be about as powerful as 3048.CW ed , 3049it's hard to see how 3050.CW sam 3051could really be much simpler and still satisfy that expectation. 3052People want to do ``global substitutes,'' and most are content 3053to have the recipe for that and a few other fancy changes. 3054The sophistication of the command language is really just a veneer 3055over a design that makes it possible to do global substitutes 3056in a screen editor. 3057Some people will always want something more, however, and it's gratifying to 3058be able to provide it. 3059The real power of 3060.CW sam 's 3061command language comes from composability of the operators, which is by 3062nature orthogonal to the underlying model. 3063In other words, 3064.CW sam 3065is not itself complex, but it makes complex things possible. 3066If you don't want to do anything complex, you can ignore the 3067complexity altogether, and many people do so. 3068.PP 3069Sometimes I am asked the opposite question: why didn't I just make 3070.CW sam 3071a real programmable editor, with macros and variables and so on? 3072The main reason is a matter of taste: I like the editor 3073to be the same every time I use it. 3074There is one technical reason, though: 3075programmability in editors is largely a workaround for insufficient 3076interactivity. 3077Programmable editors are used to make particular, usually short-term, 3078things easy to do, such as by providing shorthands for common actions. 3079If things are generally easy to do in the first place, 3080shorthands are not as helpful. 3081.CW Sam 3082makes common editing operations very easy, and the solutions to 3083complex editing problems seem commensurate with the problems themselves. 3084Also, the ability to edit the 3085.CW sam 3086window makes it easy to repeat commands \(em it only takes a mouse button click 3087to execute a command again. 3088.SH 2 3089Pros and cons 3090.LP 3091.CW Sam 3092has several other good points, 3093and its share of problems. 3094Among the good things is the idea of 3095structural regular expressions, 3096whose usefulness has only begun to be explored. 3097They were arrived at serendipitously when I attempted to distill the essence of 3098.CW ed 's 3099way of doing global substitution and recognized that the looping command in 3100.CW ed 3101was implicitly imposing a structure (an array of lines) on the file. 3102.PP 3103Another of 3104.CW sam 's 3105good things is its undo capability. 3106I had never before used an editor with a true undo, 3107but I would never go back now. 3108Undo 3109.I must 3110be done well, but if it is, it can be relied on. 3111For example, 3112it's safe to experiment if you're not sure how to write some intricate command, 3113because if you make a mistake, it can be fixed simply and reliably. 3114I learned two things about undo from writing 3115.CW sam : 3116first, it's easy to provide if you design it in from the beginning, and 3117second, it's necessary, particularly if the system has some subtle 3118properties that may be unfamiliar or error-prone for users. 3119.PP 3120.CW Sam 's 3121lack of internal limits and sizes is a virtue. 3122Because it avoids all fixed-size tables and data structures, 3123.CW sam 3124is able to make global changes to files that some of our other 3125tools cannot even read. 3126Moreover, the design keeps the performance linear when doing such 3127operations, although I must admit 3128.CW sam 3129does get slow when editing a huge file. 3130.PP 3131Now, the problems. 3132Externally, the most obvious is that it is poorly integrated into the 3133surrounding window system. 3134By design, the user interface in 3135.CW sam 3136feels almost identical to that of 3137.CW mux , 3138but a thick wall separates text in 3139.CW sam 3140from the programs running in 3141.CW mux . 3142For instance, the `snarf buffer' in 3143.CW sam 3144must be maintained separately from that in 3145.CW mux . 3146This is regrettable, but probably necessary given the unusual configuration 3147of the system, with a programmable terminal on the far end of an RS-232 link. 3148.PP 3149.CW Sam 3150is reliable; otherwise, people wouldn't use it. 3151But it was written over such a long time, and has so many new (to me) 3152ideas in it, that I would like to see it done over again to clean 3153up the code and remove many of the lingering problems in the implementation. 3154The worst part is in the interconnection of the host and terminal parts, 3155which might even be able to go away in a redesign for a more 3156conventional window system. 3157The program must be split in two to use the terminal effectively, 3158but the low bandwidth of the connection forces the separation to 3159occur in an inconvenient part of the design if performance is to be acceptable. 3160A simple remote procedure call 3161protocol driven by the host, emitting only graphics 3162commands, would be easy to write but wouldn't have nearly the 3163necessary responsiveness. On the other hand, if the terminal were in control 3164and requested much simpler file services from the host, regular expression 3165searches would require that the terminal read the entire file over its RS-232 3166link, which would be unreasonably slow. 3167A compromise in which either end can take control is necessary. 3168In retrospect, the communications protocol should have been 3169designed and verified formally, although I do not know of any tool 3170that can adequately relate the protocol to 3171its implementation. 3172.PP 3173Not all of 3174.CW sam 's 3175users are comfortable with its command language, and few are adept. 3176Some (venerable) people use a sort of 3177.CW ed \& `` 3178subset'' of 3179.CW sam 's 3180command language, 3181and even ask why 3182.CW sam 's 3183command language is not exactly 3184.CW ed 's. 3185(The reason, of course, is that 3186.CW sam 's 3187model for text does not include newlines, which are central to 3188.CW ed . 3189Making the text an array of newlines to the command language would 3190be too much of a break from the seamless model provided by the mouse. 3191Some editors, such as 3192.CW vi , 3193are willing to make this break, though.) 3194The difficulty is that 3195.CW sam 's 3196syntax is so close to 3197.CW ed 's 3198that people believe it 3199.I should 3200be the same. 3201I thought, with some justification in hindsight, 3202that making 3203.CW sam 3204similar to 3205.CW ed 3206would make it easier to learn and to accept. 3207But I may have overstepped and raised the users' 3208expectations too much. 3209It's hard to decide which way to resolve this problem. 3210.PP 3211Finally, there is a tradeoff in 3212.CW sam 3213that was decided by the environment in which it runs: 3214.CW sam 3215is a multi-file editor, although in a different system there might instead be 3216multiple single-file editors. 3217The decision was made primarily because starting a new program in a Blit is 3218time-consuming. 3219If the choice could be made freely, however, I would 3220still choose the multi-file architecture, because it allows 3221groups of files to be handled as a unit; 3222the usefulness of the multi-file commands is incontrovertible. 3223It is delightful to have the source to an entire program 3224available at your fingertips. 3225.SH 3226Acknowledgements 3227.LP 3228Tom Cargill suggested the idea behind the 3229.CW Rasp 3230data structure. 3231Norman Wilson and Ken Thompson influenced the command language. 3232This paper was improved by comments from 3233Al Aho, 3234Jon Bentley, 3235Chris Fraser, 3236Gerard Holzmann, 3237Brian Kernighan, 3238Ted Kowalski, 3239Doug McIlroy 3240and 3241Dennis Ritchie. 3242