1.Vx 17 11 November 87 1 32 "ROB PIKE" "THE TEXT EDITOR SAM" 2.ds DY "31 May 1987 3.ds DR "Revised 1 July 1987 4.de CW \" puts first arg in CW font, same as UL; maintains font 5\%\&\\$3\f(CW\\$1\fP\&\\$2 6.. 7.de Cs 8.br 9.fi 10.ft 2 11.ps -2 12.vs -2 13.. 14.de Ce 15.br 16.nf 17.ft 1 18.ps 19.vs 20.sp 21.. 22.TL 23The Text Editor \&\f(CWsam\fP 24.AU 25Rob Pike 26rob@plan9.att.com 27.AB 28.LP 29.CW Sam 30is an interactive multi-file text editor intended for 31bitmap displays. 32A textual command language 33supplements the mouse-driven, cut-and-paste interface 34to make complex or 35repetitive editing tasks easy to specify. 36The language is characterized by the composition of regular expressions 37to describe the structure of the text being modified. 38The treatment of files as a database, with changes logged 39as atomic transactions, guides the implementation and 40makes a general `undo' mechanism straightforward. 41.PP 42.CW Sam 43is implemented as two processes connected by a low-bandwidth stream, 44one process handling the display and the other the editing 45algorithms. Therefore it can run with the display process 46in a bitmap terminal and the editor on a local host, 47with both processes on a bitmap-equipped host, or with 48the display process in the terminal and the editor in a 49remote host. 50By suppressing the display process, 51it can even run without a bitmap terminal. 52.PP 53This paper is reprinted from Software\(emPractice and Experience, 54Vol 17, number 11, pp. 813-845, November 1987. 55The paper has not been updated for the Plan 9 manuals. Although 56.CW Sam 57has not changed much since the paper was written, the system around it certainly has. 58Nonetheless, the description here still stands as the best introduction to the editor. 59.AE 60.SH 61Introduction 62.LP 63.CW Sam 64is an interactive text editor that combines cut-and-paste interactive editing with 65an unusual command language based on the composition of regular expressions. 66It is written as two programs: one, the `host part,' runs on a UNIX system 67and implements the command language and provides file access; the other, the 68`terminal part,' runs asynchronously 69on a machine with a mouse and bitmap display 70and supports the display and interactive editing. 71The host part may be even run in isolation on an ordinary terminal 72to edit text using the command 73language, much like a traditional line editor, 74without assistance from a mouse or display. 75Most often, 76the terminal part runs on a Blit\u\s-4\&1\s+4\d terminal 77(actually on a Teletype DMD 5620, the production version of the Blit), whose 78host connection is an ordinary 9600 bps RS232 link; 79on the SUN computer the host and display processes run on a single machine, 80connected by a pipe. 81.PP 82.CW Sam 83edits uninterpreted 84ASCII text. 85It has no facilities for multiple fonts, graphics or tables, 86unlike MacWrite,\u\s-4\&2\s+4\d Bravo,\u\s-4\&3\s+4\d Tioga\u\s-4\&4\s+4\d 87or Lara.\u\s-4\&5\s+4\d 88Also unlike them, it has a rich command language. 89(Throughout this paper, the phrase 90.I 91command language 92.R 93refers to 94textual commands; commands activated from the mouse form the 95.I mouse 96.I language. ) 97.CW Sam 98developed as an editor for use by programmers, and tries to join 99the styles of the UNIX text editor 100.CW ed \u\s-4\&6,7\s+4\d 101with that of interactive cut-and-paste editors by 102providing a comfortable mouse-driven interface 103to a program with a solid command language driven by regular expressions. 104The command language developed more than the mouse language, and 105acquired a notation for describing the structure of files 106more richly than as a sequence of lines, 107using a dataflow-like syntax for specifying changes. 108.PP 109The interactive style was influenced by 110.CW jim ,\u\s-4\&1\s+4\d 111an early cut-and-paste editor for the Blit, and by 112.CW mux ,\u\s-4\&8\s+4\d 113the Blit window system. 114.CW Mux 115merges the original Blit window system, 116.CW mpx ,\u\s-4\&1\s+4\d 117with cut-and-paste editing, forming something like a 118multiplexed version of 119.CW jim 120that edits the output of (and input to) command sessions rather than files. 121.PP 122The first part of this paper describes the command language, then the mouse 123language, and explains how they interact. 124That is followed by a description of the implementation, 125first of the host part, then of the terminal part. 126A principle that influenced the design of 127.CW sam 128is that it should have no explicit limits, such as upper limits on 129file size or line length. 130A secondary consideration is that it be efficient. 131To honor these two goals together requires a method for efficiently 132manipulating 133huge strings (files) without breaking them into lines, 134perhaps while making thousands of changes 135under control of the command language. 136.CW Sam 's 137method is to 138treat the file as a transaction database, implementing changes as atomic 139updates. These updates may be unwound easily to `undo' changes. 140Efficiency is achieved through a collection of caches that minimizes 141disc traffic and data motion, both within the two parts of the program 142and between them. 143.PP 144The terminal part of 145.CW sam 146is fairly straightforward. 147More interesting is how the two halves of the editor stay 148synchronized when either half may initiate a change. 149This is achieved through a data structure that organizes the 150communications and is maintained in parallel by both halves. 151.PP 152The last part of the paper chronicles the writing of 153.CW sam 154and discusses the lessons that were learned through its development and use. 155.PP 156The paper is long, but is composed largely of two papers of reasonable length: 157a description of the user interface of 158.CW sam 159and a discussion of its implementation. 160They are combined because the implementation is strongly influenced by 161the user interface, and vice versa. 162.SH 163The Interface 164.LP 165.CW Sam 166is a text editor for multiple files. 167File names may be provided when it is invoked: 168.P1 169sam file1 file2 ... 170.P2 171and there are commands 172to add new files and discard unneeded ones. 173Files are not read until necessary 174to complete some command. 175Editing operations apply to an internal copy 176made when the file is read; the UNIX file associated with the copy 177is changed only by an explicit command. 178To simplify the discussion, the internal copy is here called a 179.I file , 180while the disc-resident original is called a 181.I 182disc file. 183.R 184.PP 185.CW Sam 186is usually connected to a bitmap display that presents a cut-and-paste 187editor driven by the mouse. 188In this mode, the command language is still available: 189text typed in a special window, called the 190.CW sam 191.I window, 192is interpreted 193as commands to be executed in the current file. 194Cut-and-paste editing may be used in any window \(em even in the 195.CW sam 196window to construct commands. 197The other mode of operation, invoked by starting 198.CW sam 199with the option 200.CW -d 201(for `no download'), 202does not use the mouse or bitmap display, but still permits 203editing using the textual command language, even on an ordinary terminal, 204interactively or from a script. 205.PP 206The following sections describe first the command language (under 207.CW sam\ -d 208and in the 209.CW sam 210window), and then the mouse interface. 211These two languages are nearly independent, but connect through the 212.I current 213.I text, 214described below. 215.SH 2 216The Command Language 217.LP 218A file consists of its contents, which are an array of characters 219(that is, a string); the 220.I name 221of the associated disc file; the 222.I 223modified bit 224.R 225that states whether the contents match those of 226the disc file; 227and a substring of the contents, called the 228.I 229current text 230.R 231or 232.I dot 233(see Figures 1 and 2). 234If the current text is a null string, dot falls between characters. 235The 236.I value 237of dot is the location of the current text; the 238.I contents 239of dot are the characters it contains. 240.CW Sam 241imparts to the text no two-dimensional interpretation such as columns 242or fields; text is always one-dimensional. 243Even the idea of a `line' of text as understood by most UNIX programs 244\(em a sequence of characters terminated by a newline character \(em 245is only weakly supported. 246.PP 247The 248.I 249current file 250.R 251is the file to which editing commands refer. 252The current text is therefore dot in the current file. 253If a command doesn't explicitly name a particular file or piece of text, 254the command is assumed to apply to the current text. 255For the moment, ignore the presence of multiple files and consider 256editing a single file. 257.KF L 258.BP fig1.ps 3.5i 259.Cs 260Figure 1. A typical 261.CW sam 262screen, with the editing menu presented. 263The 264.CW sam 265(command language) window is in the middle, with file windows above and below. 266(The user interface makes it easy to create these abutting windows.) 267The partially obscured window is a third file window. 268The uppermost window is that to which typing and mouse operations apply, 269as indicated by its heavy border. 270Each window has its current text highlighted in reverse video. 271The 272.CW sam 273window's current text is the null string on the last visible line, 274indicated by a vertical bar. 275See also Figure 2. 276.Ce 277.KE 278.PP 279Commands have one-letter names. 280Except for non-editing commands such as writing 281the file to disc, most commands make some change 282to the text in dot and leave dot set to the text resulting from the change. 283For example, the delete command, 284.CW d , 285deletes the text in dot, replacing it by the null string and setting dot 286to the result. 287The change command, 288.CW c , 289replaces dot by text delimited by an arbitrary punctuation character, 290conventionally 291a slash. Thus, 292.P1 293c/Peter/ 294.P2 295replaces the text in dot by the string 296.CW Peter . 297Similarly, 298.P1 299a/Peter/ 300.P2 301(append) adds the string after dot, and 302.P1 303i/Peter/ 304.P2 305(insert) inserts before dot. 306All three leave dot set to the new text, 307.CW Peter . 308.PP 309Newlines are part of the syntax of commands: 310the newline character lexically terminates a command. 311Within the inserted text, however, newlines are never implicit. 312But since it is often convenient to insert multiple lines of text, 313.CW sam 314has a special 315syntax for that case: 316.P1 317a 318some lines of text 319to be inserted in the file, 320terminated by a period 321on a line by itself 322\&. 323.P2 324In the one-line syntax, a newline character may be specified by a C-like 325escape, so 326.P1 327c/\en/ 328.P2 329replaces dot by a single newline character. 330.PP 331.CW Sam 332also has a substitute command, 333.CW s : 334.P1 335s/\f2expression\fP/\f2replacement\fP/ 336.P2 337substitutes the replacement text for the first match, in dot, 338of the regular expression. 339Thus, if dot is the string 340.CW Peter , 341the command 342.P1 343s/t/st/ 344.P2 345changes it to 346.CW Pester . 347In general, 348.CW s 349is unnecessary, but it was inherited from 350.CW ed 351and it has some convenient variations. 352For instance, the replacement text may include the matched text, 353specified by 354.CW & : 355.P1 356s/Peter/Oh, &, &, &, &!/ 357.P2 358.PP 359There are also three commands that apply programs 360to text: 361.P1 362< \f2UNIX program\fP 363.P2 364replaces dot by the output of the UNIX program. 365Similarly, the 366.CW > 367command 368runs the program with dot as its standard input, and 369.CW | 370does both. For example, 371.P1 372| sort 373.P2 374replaces dot by the result of applying the standard sorting utility to it. 375Again, newlines have no special significance for these 376.CW sam 377commands. 378The text acted upon and resulting from these commands is not necessarily 379bounded by newlines, although for connection with UNIX programs, 380newlines may be necessary to obey conventions. 381.PP 382One more command: 383.CW p 384prints the contents of dot. 385Table I summarizes 386.CW sam 's 387commands. 388.KF 389.TS 390center; 391c s 392lfCW l. 393Table I. \f(CWSam\fP commands 394.sp .4 395.ft CW 396_ 397.ft 398.sp .4 399\f1Text commands\fP 400.sp .4 401_ 402.sp .4 403a/\f2text\fP/ Append text after dot 404c/\f2text\fP/ Change text in dot 405i/\f2text\fP/ Insert text before dot 406d Delete text in dot 407s/\f2regexp\fP/\f2text\fP/ Substitute text for match of regular expression in dot 408m \f2address\fP Move text in dot after address 409t \f2address\fP Copy text in dot after address 410.sp .4 411_ 412.sp .4 413\f1Display commands\fP 414.sp .4 415_ 416.sp .2 417p Print contents of dot 418\&= Print value (line numbers and character numbers) of dot 419.sp .4 420_ 421.sp .4 422\f1File commands\fP 423.sp .4 424_ 425.sp .2 426b \f2file-list\fP Set current file to first file in list that \f(CWsam\fP has in menu 427B \f2file-list\fP Same as \f(CWb\fP, but load new files 428n Print menu lines of all files 429D \f2file-list\fP Delete named files from \f(CWsam\fP 430.sp .4 431_ 432.sp .4 433\f1I/O commands\fP 434.sp .4 435_ 436.sp .2 437e \f2filename\fP Replace file with named disc file 438r \f2filename\fP Replace dot by contents of named disc file 439w \f2filename\fP Write file to named disc file 440f \f2filename\fP Set file name and print new menu line 441< \f2UNIX-command\fP Replace dot by standard output of command 442> \f2UNIX-command\fP Send dot to standard input of command 443| \f2UNIX-command\fP Replace dot by result of command applied to dot 444! \f2UNIX-command\fP Run the command 445.sp .4 446_ 447.sp .4 448\f1Loops and conditionals\fP 449.sp .4 450_ 451.sp .2 452x/\f2regexp\fP/ \f2command\fP For each match of regexp, set dot and run command 453y/\f2regexp\fP/ \f2command\fP Between adjacent matches of regexp, set dot and run command 454X/\f2regexp\fP/ \f2command\fP Run command in each file whose menu line matches regexp 455Y/\f2regexp\fP/ \f2command\fP Run command in each file whose menu line does not match 456g/\f2regexp\fP/ \f2command\fP If dot contains a match of regexp, run command 457v/\f2regexp\fP/ \f2command\fP If dot does not contain a match of regexp, run command 458.sp .4 459_ 460.sp .4 461\f1Miscellany\fP 462.sp .4 463_ 464.sp .2 465k Set address mark to value of dot 466q Quit 467u \f2n\fP Undo last \f2n\fP (default 1) changes 468{ } Braces group commands 469.sp .3 470.ft CW 471_ 472.ft 473.TE 474.sp 475.KE 476.PP 477The value of dot may be changed by 478specifying an 479.I address 480for the command. 481The simplest address is a line number: 482.P1 4833 484.P2 485refers to the third line of the file, so 486.P1 4873d 488.P2 489deletes the third line of the file, and implicitly renumbers 490the lines so the old line 4 is now numbered 3. 491(This is one of the few places where 492.CW sam 493deals with lines directly.) 494Line 495.CW 0 496is the null string at the beginning of the file. 497If a command consists of only an address, a 498.CW p 499command is assumed, so typing an unadorned 500.CW 3 501prints line 3 on the terminal. 502There are a couple of other basic addresses: 503a period addresses dot itself; and 504a dollar sign 505.CW $ ) ( 506addresses the null string at the end of the file. 507.PP 508An address is always a single substring of the file. 509Thus, the address 510.CW 3 511addresses the characters 512after the second newline of 513the file through the third newline of the file. 514A 515.I 516compound address 517.R 518is constructed by the comma operator 519.P1 520\f2address1\fP,\f2address2\fP 521.P2 522and addresses the substring of the file from the beginning of 523.I address1 524to the end of 525.I address2 . 526For example, the command 527.CW 3,5p 528prints the third through fifth lines of the file and 529.CW .,$d 530deletes the text from the beginning of dot to the end of the file. 531.PP 532These addresses are all absolute positions in the file, but 533.CW sam 534also has relative addresses, indicated by 535.CW + 536or 537.CW - . 538For example, 539.P1 540$-3 541.P2 542is the third line before the end of the file and 543.P1 544\&.+1 545.P2 546is the line after dot. 547If no address appears to the left of the 548.CW + 549or 550.CW - , 551dot is assumed; 552if nothing appears to the right, 553.CW 1 554is assumed. 555Therefore, 556.CW .+1 557may be abbreviated to just a plus sign. 558.PP 559The 560.CW + 561operator acts relative to the end of its first argument, while the 562.CW - 563operator acts relative to the beginning. Thus 564.CW .+1 565addresses the first line after dot, 566.CW .- 567addresses the first line before dot, and 568.CW +- 569refers to the line containing the end of dot. (Dot may span multiple lines, and 570.CW + 571selects the line after the end of dot, then 572.CW - 573backs up one line.) 574.PP 575The final type of address is a regular expression, which addresses the 576text matched by the expression. The expression is enclosed in slashes, as in 577.P1 578/\f2expression\fP/ 579.P2 580The expressions are the same as those in the UNIX program 581.CW egrep ,\u\s-4\&6,7\s+4\d 582and include closures, alternations, and so on. 583They find the 584.I 585leftmost longest 586.R 587string that matches the expression, that is, 588the first match after the point where the search is started, 589and if more than one match begins at the same spot, the longest such match. 590(I assume familiarity with the syntax for regular expressions in UNIX programs.\u\s-4\&9\s+4\d) 591For example, 592.P1 593/x/ 594.P2 595matches the next 596.CW x 597character in the file, 598.P1 599/xx*/ 600.P2 601matches the next run of one or more 602.CW x 's, 603and 604.P1 605/x|Peter/ 606.P2 607matches the next 608.CW x 609or 610.CW Peter . 611For compatibility with other UNIX programs, the `any character' operator, 612a period, 613does not match a newline, so 614.P1 615/.*/ 616.P2 617matches the text from dot to the end of the line, but excludes the newline 618and so will not match across 619the line boundary. 620.PP 621Regular expressions are always relative addresses. 622The direction is forwards by default, 623so 624.CW /Peter/ 625is really an abbreviation for 626.CW +/Peter/ . 627The search can be reversed with a minus sign, so 628.P1 629.CW -/Peter/ 630.P2 631finds the first 632.CW Peter 633before dot. 634Regular expressions may be used with other address forms, so 635.CW 0+/Peter/ 636finds the first 637.CW Peter 638in the file and 639.CW $-/Peter/ 640finds the last. 641Table II summarizes 642.CW sam 's 643addresses. 644.KF 645.TS 646center; 647c s 648lfCW l. 649Table II. \f(CWSam\fP addresses 650.sp .4 651.ft CW 652_ 653.ft 654.sp .4 655\f1Simple addresses\fP 656.sp .4 657_ 658.sp .2 659#\f2n\fP The empty string after character \f2n\fP 660\f2n\fP Line \f2n\fP. 661/\f2regexp\fP/ The first following match of the regular expression 662-/\f2regexp\fP/ The first previous match of the regular expression 663$ The null string at the end of the file 664\&. Dot 665\&' The address mark, set by \f(CWk\fP command 666"\f2regexp\fP" Dot in the file whose menu line matches regexp 667.sp .4 668_ 669.sp .4 670\f1Compound addresses\fP 671.sp .4 672_ 673.sp .2 674\f2a1\fP+\f2a2\fP The address \f2a2\fP evaluated starting at right of \f2a1\fP 675\f2a1\fP-\f2a2\fP \f2a2\fP evaluated in the reverse direction starting at left of \f2a1\fP 676\f2a1\fP,\f2a2\fP From the left of \f2a1\fP to the right of \f2a2\fP (default \f(CW0,$\fP) 677\f2a1\fP;\f2a2\fP Like \f(CW,\fP but sets dot after evaluating \f2a1\fP 678.sp .4 679_ 680.sp .4 681.T& 682c s. 683T{ 684The operators 685.CW + 686and 687.CW - 688are high precedence, while 689.CW , 690and 691.CW ; 692are low precedence. 693In both 694.CW + 695and 696.CW - 697forms, 698.I a2 699defaults to 1 and 700.I a1 701defaults to dot. 702If both 703.I a1 704and 705.I a2 706are present, 707.CW + 708may be elided. 709T} 710.sp .5 711.ft CW 712_ 713.ft 714.TE 715.sp 716.KE 717.PP 718The language discussed so far will not seem novel 719to people who use UNIX text editors 720such as 721.CW ed 722or 723.CW vi .\u\s-4\&9\s+4\d 724Moreover, the kinds of editing operations these commands allow, with the exception 725of regular expressions and line numbers, 726are clearly more conveniently handled by a mouse-based interface. 727Indeed, 728.CW sam 's 729mouse language (discussed at length below) is the means by which 730simple changes are usually made. 731For large or repetitive changes, however, a textual language 732outperforms a manual interface. 733.PP 734Imagine that, instead of deleting just one occurrence of the string 735.CW Peter , 736we wanted to eliminate every 737.CW Peter . 738What's needed is an iterator that runs a command for each occurrence of some 739text. 740.CW Sam 's 741iterator is called 742.CW x , 743for extract: 744.P1 745x/\f2expression\fP/ \f2command\fP 746.P2 747finds all matches in dot of the specified expression, and for each 748such match, sets dot to the text matched and runs the command. 749So to delete all the 750.CW Peters: 751.P1 7520,$ x/Peter/ d 753.P2 754(Blanks in these examples are to improve readability; 755.CW sam 756neither requires nor interprets them.) 757This searches the entire file 758.CW 0,$ ) ( 759for occurrences of the string 760.CW Peter , 761and runs the 762.CW d 763command with dot set to each such occurrence. 764(By contrast, the comparable 765.CW ed 766command would delete all 767.I lines 768containing 769.CW Peter ; 770.CW sam 771deletes only the 772.CW Peters .) 773The address 774.CW 0,$ 775is commonly used, and may be abbreviated to just a comma. 776As another example, 777.P1 778, x/Peter/ p 779.P2 780prints a list of 781.CW Peters, 782one for each appearance in the file, with no intervening text (not even newlines 783to separate the instances). 784.PP 785Of course, the text extracted by 786.CW x 787may be selected by a regular expression, 788which complicates deciding what set of matches is chosen \(em 789matches may overlap. This is resolved by generating the matches 790starting from the beginning of dot using the leftmost-longest rule, 791and searching for each match starting from the end of the previous one. 792Regular expressions may also match null strings, but a null match 793adjacent to a non-null match is never selected; at least one character 794must intervene. 795For example, 796.P1 797, c/AAA/ 798x/B*/ c/-/ 799, p 800.P2 801produces as output 802.P1 803-A-A-A- 804.P2 805because the pattern 806.CW B* 807matches the null strings separating the 808.CW A 's. 809.PP 810The 811.CW x 812command has a complement, 813.CW y , 814with similar syntax, that executes the command with dot set to the text 815.I between 816the matches of the expression. 817For example, 818.P1 819, c/AAA/ 820y/A/ c/-/ 821, p 822.P2 823produces the same result as the example above. 824.PP 825The 826.CW x 827and 828.CW y 829commands are looping constructs, and 830.CW sam 831has a pair of conditional commands to go with them. 832They have similar syntax: 833.P1 834g/\f2expression\fP/ \f2command\fP 835.P2 836(guard) 837runs the command exactly once if dot contains a match of the expression. 838This is different from 839.CW x , 840which runs the command for 841.I each 842match: 843.CW x 844loops; 845.CW g 846merely tests, without changing the value of dot. 847Thus, 848.P1 849, x/Peter/ d 850.P2 851deletes all occurrences of 852.CW Peter , 853but 854.P1 855, g/Peter/ d 856.P2 857deletes the whole file (reduces it to a null string) if 858.CW Peter 859occurs anywhere in the text. 860The complementary conditional is 861.CW v , 862which runs the command if there is 863.I no 864match of the expression. 865.PP 866These control-structure-like commands may be composed to construct more 867involved operations. For example, to print those lines of text that 868contain the string 869.CW Peter : 870.P1 871, x/.*\en/ g/Peter/ p 872.P2 873The 874.CW x 875breaks the file into lines, the 876.CW g 877selects those lines containing 878.CW Peter , 879and the 880.CW p 881prints them. 882This command gives an address for the 883.CW x 884command (the whole file), but because 885.CW g 886does not have an explicit address, it applies to the value of 887dot produced by the 888.CW x 889command, that is, to each line. 890All commands in 891.CW sam 892except for the command to write a file to disc use dot for the 893default address. 894.PP 895Composition may be continued indefinitely. 896.P1 897, x/.*\en/ g/Peter/ v/SaltPeter/ p 898.P2 899prints those lines containing 900.CW Peter 901but 902.I not 903those containing 904.CW SaltPeter . 905.SH 2 906Structural Regular Expressions 907.LP 908Unlike other UNIX text editors, 909including the non-interactive ones such as 910.CW sed 911and 912.CW awk ,\u\s-4\&7\s+4\d 913.CW sam 914is good for manipulating files with multi-line `records.' 915An example is an on-line phone book composed of records, 916separated by blank lines, of the form 917.P1 918Herbert Tic 91944 Turnip Ave., Endive, NJ 920201-5555642 921 922Norbert Twinge 92316 Potato St., Cabbagetown, NJ 924201-5553145 925 926\&... 927.P2 928The format may be encoded as a regular expression: 929.P1 930(.+\en)+ 931.P2 932that is, a sequence of one or more non-blank lines. 933The command to print Mr. Tic's entire record is then 934.P1 935, x/(.+\en)+/ g/^Herbert Tic$/ p 936.P2 937and that to extract just the phone number is 938.P1 939, x/(.+\en)+/ g/^Herbert Tic$/ x/^[0-9]*-[0-9]*\en/ p 940.P2 941The latter command breaks the file into records, 942chooses Mr. Tic's record, 943extracts the phone number from the record, 944and finally prints the number. 945.PP 946A more involved problem is that of 947renaming a particular variable, say 948.CW n , 949to 950.CW num 951in a C program. 952The obvious first attempt, 953.P1 954, x/n/ c/num/ 955.P2 956is badly flawed: it changes not only the variable 957.CW n 958but any letter 959.CW n 960that appears. 961We need to extract all the variables, and select those that match 962.CW n 963and only 964.CW n : 965.P1 966, x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/ 967.P2 968The pattern 969.CW [A-Za-z_][A-Za-z_0-9]* 970matches C identifiers. 971Next 972.CW g/n/ 973selects those containing an 974.CW n . 975Then 976.CW v/../ 977rejects those containing two (or more) characters, and finally 978.CW c/num/ 979changes the remainder (identifiers 980.CW n ) 981to 982.CW num . 983This version clearly works much better, but there may still be problems. 984For example, in C character and string constants, the sequence 985.CW \en 986is interpreted as a newline character, and we don't want to change it to 987.CW \enum. 988This problem can be forestalled with a 989.CW y 990command: 991.P1 992, y/\e\en/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/ 993.P2 994(the second 995.CW \e 996is necessary because of lexical conventions in regular expressions), 997or we could even reject character constants and strings outright: 998.P1 0 999,y/'[^']*'/ y/"[^"]*"/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/ 1000.P2 1001The 1002.CW y 1003commands in this version exclude from consideration all character constants 1004and strings. 1005The only remaining problem is to deal with the possible occurrence of 1006.CW \e' 1007or 1008.CW \e" 1009within these sequences, but it's easy to see how to resolve this difficulty. 1010.PP 1011The point of these composed commands is successive refinement. 1012A simple version of the command is tried, and if it's not good enough, 1013it can be honed by adding a clause or two. 1014(Mistakes can be undone; see below. 1015Also, the mouse language makes it unnecessary to retype the command each time.) 1016The resulting chains of commands are somewhat reminiscent of 1017shell pipelines.\u\s-4\&7\s+4\d 1018Unlike pipelines, though, which pass along modified 1019.I data , 1020.CW sam 1021commands pass a 1022.I view 1023of the data. 1024The text at each step of the command is the same, but which pieces 1025are selected is refined step by step until the correct piece is 1026available to the final step of the command line, which ultimately makes the change. 1027.PP 1028In other UNIX programs, regular expressions are used only for selection, 1029as in the 1030.CW sam 1031.CW g 1032command, never for extraction as in the 1033.CW x 1034or 1035.CW y 1036command. 1037For example, patterns in 1038.CW awk \u\s-4\&7\s+4\d 1039are used to select lines to be operated on, but cannot be used 1040to describe the format of the input text, or to handle newline-free text. 1041The use of regular expressions to describe the structure of a piece 1042of text rather than its contents, as in the 1043.CW x 1044command, 1045has been given a name: 1046.I 1047structural regular expressions. 1048.R 1049When they are composed, as in the above example, 1050they are pleasantly expressive. 1051Their use is discussed at greater length elsewhere.\u\s-4\&10\s+4\d 1052.PP 1053.SH 2 1054Multiple files 1055.LP 1056.CW Sam 1057has a few other commands, mostly relating to input and output. 1058.P1 1059e discfilename 1060.P2 1061replaces the contents and name of the current file with those of the named 1062disc file; 1063.P1 1064w discfilename 1065.P2 1066writes the contents to the named disc file; and 1067.P1 1068r discfilename 1069.P2 1070replaces dot with the contents of the named disc file. 1071All these commands use the current file's name if none is specified. 1072Finally, 1073.P1 1074f discfilename 1075.P2 1076changes the name associated with the file and displays the result: 1077.P1 1078\&'-. discfilename 1079.P2 1080This output is called the file's 1081.I 1082menu line, 1083.R 1084because it is the contents of the file's line in the button 3 menu (described 1085in the 1086next section). 1087The first three characters are a concise notation for the state of the file. 1088The apostrophe signifies that the file is modified. 1089The minus sign indicates the number of windows 1090open on the file (see the next section): 1091.CW - 1092means none, 1093.CW + 1094means one, and 1095.CW * 1096means more than one. 1097Finally, the period indicates that this is the current file. 1098These characters are useful for controlling the 1099.CW X 1100command, described shortly. 1101.PP 1102.CW Sam 1103may be started with a set of disc files (such as all the source for 1104a program) by invoking it with a list of file names as arguments, and 1105more may be added or deleted on demand. 1106.P1 1107B discfile1 discfile2 ... 1108.P2 1109adds the named files to 1110.CW sam 's 1111list, and 1112.P1 1113D discfile1 discfile2 ... 1114.P2 1115removes them from 1116.CW sam 's 1117memory (without effect on associated disc files). 1118Both these commands have a syntax for using the shell\u\s-4\&7\s+4\d 1119(the UNIX command interpreter) to generate the lists: 1120.P1 1121B <echo *.c 1122.P2 1123will add all C source files, and 1124.P1 1125B <grep -l variable *.c 1126.P2 1127will add all C source files referencing a particular variable 1128(the UNIX command 1129.CW grep\ -l 1130lists all files in its arguments that contain matches of 1131the specified regular expression). 1132Finally, 1133.CW D 1134without arguments deletes the current file. 1135.PP 1136There are two ways to change which file is current: 1137.P1 1138b filename 1139.P2 1140makes the named file current. 1141The 1142.CW B 1143command 1144does the same, but also adds any new files to 1145.CW sam 's 1146list. 1147(In practice, of course, the current file 1148is usually chosen by mouse actions, not by textual commands.) 1149The other way is to use a form of address that refers to files: 1150.P1 1151"\f2expression\fP" \f2address\fP 1152.P2 1153refers to the address evaluated in the file whose menu line 1154matches the expression (there must be exactly one match). 1155For example, 1156.P1 1157"peter.c" 3 1158.P2 1159refers to the third line of the file whose name matches 1160.CW peter.c . 1161This is most useful in the move 1162.CW m ) ( 1163and copy 1164.CW t ) ( 1165commands: 1166.P1 11670,$ t "peter.c" 0 1168.P2 1169makes a copy of the current file at the beginning of 1170.CW peter.c . 1171.PP 1172The 1173.CW X 1174command 1175is a looping construct, like 1176.CW x , 1177that refers to files instead of strings: 1178.P1 1179X/\f2expression\fP/ \f2command\fP 1180.P2 1181runs the command in all 1182files whose menu lines match the expression. The best example is 1183.P1 1184X/'/ w 1185.P2 1186which writes to disc all modified files. 1187.CW Y 1188is the complement of 1189.CW X : 1190it runs the command on all files whose menu lines don't match the expression: 1191.P1 1192Y/\e.c/ D 1193.P2 1194deletes all files that don't have 1195.CW \&.c 1196in their names, that is, it keeps all C source files and deletes the rest. 1197.PP 1198Braces allow commands to be grouped, so 1199.P1 1200{ 1201 \f2command1\fP 1202 \f2command2\fP 1203} 1204.P2 1205is syntactically a single command that runs two commands. 1206Thus, 1207.P1 1208X/\e.c/ ,g/variable/ { 1209 f 1210 , x/.*\en/ g/variable/ p 1211} 1212.P2 1213finds all occurrences of 1214.CW variable 1215in C source files, and prints 1216out the file names and lines of each match. 1217The precise semantics of compound operations is discussed in the implementation 1218sections below. 1219.PP 1220Finally, 1221the undo command, 1222.CW u , 1223undoes the last command, 1224no matter how many files were affected. 1225Multiple undo operations move further back in time, so 1226.P1 1227u 1228u 1229.P2 1230(which may be abbreviated 1231.CW u2 ) 1232undoes the last two commands. An undo may not be undone, however, nor 1233may any command that adds or deletes files. 1234Everything else is undoable, though, including for example 1235.CW e 1236commands: 1237.P1 1238e filename 1239u 1240.P2 1241restores the state of the file completely, including its name, dot, 1242and modified bit. Because of the undo, potentially dangerous commands 1243are not guarded by confirmations. Only 1244.CW D , 1245which destroys the information necessary to restore itself, is protected. 1246It will not delete a modified file, but a second 1247.CW D 1248of the same file will succeed regardless. 1249The 1250.CW q 1251command, which exits 1252.CW sam , 1253is similarly guarded. 1254.SH 2 1255Mouse Interface 1256.LP 1257.CW Sam 1258is most commonly run 1259connected to a bitmap display and mouse for interactive editing. 1260The only difference in the command language 1261between regular, mouse-driven 1262.CW sam 1263and 1264.CW sam\ -d 1265is that if an address 1266is provided without a command, 1267.CW sam\ -d 1268will print the text referenced by the address, but 1269regular 1270.CW sam 1271will highlight it on the screen \(em in fact, 1272dot is always highlighted (see Figure 2). 1273.WS 1 1274.KF 1275.BP fig3.ps 2.04i 1276.Cs 1277Figure 2. A 1278.CW sam 1279window. The scroll bar down the left 1280represents the file, with the bubble showing the fraction 1281visible in the window. 1282The scroll bar may be manipulated by the mouse for convenient browsing. 1283The current text, 1284which is highlighted, need not fit on a line. Here it consists of one partial 1285line, one complete line, and final partial line. 1286.Ce 1287.KE 1288.PP 1289Each file may have zero or more windows open on the display. 1290At any time, only one window in all of 1291.CW sam 1292is the 1293.I 1294current window, 1295.R 1296that is, the window to which typing and mouse actions refer; 1297this may be the 1298.CW sam 1299window (that in which commands may be typed) 1300or one of the file windows. 1301When a file has multiple windows, the image of the file in each window 1302is always kept up to date. 1303The current file is the last file affected by a command, 1304so if the 1305.CW sam 1306window is current, 1307the current window is not a window on the current file. 1308However, each window on a file has its own value of dot, 1309and when switching between windows on a single file, 1310the file's value of dot is changed to that of the window. 1311Thus, flipping between windows behaves in the obvious, convenient way. 1312.PP 1313The mouse on the Blit has three buttons, numbered left to right. 1314Button 3 has a list of commands to manipulate windows, 1315followed by a list of `menu lines' exactly as printed by the 1316.CW f 1317command, one per file (not one per window). 1318These menu lines are sorted by file name. 1319If the list is long, the Blit menu software will make it more manageable 1320by generating a scrolling menu instead of an unwieldy long list. 1321Using the menu to select a file from the list makes that file the current 1322file, and the most recently current window in that file the current window. 1323But if that file is already current, selecting it in the menu cycles through 1324the windows on the file; this simple trick avoids a special menu to 1325choose windows on a file. 1326If there is no window open on the file, 1327.CW sam 1328changes the mouse cursor to prompt the user to create one. 1329.PP 1330The commands on the button 3 menu are straightforward (see Figure 3), and 1331are like the commands to manipulate windows in 1332.CW mux ,\u\s-4\&8\s+4\d 1333the Blit's window system. 1334.CW New 1335makes a new file, and gives it one empty window, whose size is determined 1336by a rectangle swept by the mouse. 1337.CW Zerox 1338prompts for a window to be selected, and 1339makes a clone of that window; this is how multiple windows are created on one file. 1340.CW Reshape 1341changes the size of the indicated window, and 1342.CW close 1343deletes it. If that is the last window open on the file, 1344.CW close 1345first does a 1346.CW D 1347command on the file. 1348.CW Write 1349is identical to a 1350.CW w 1351command on the file; it is in the menu purely for convenience. 1352Finally, 1353.CW ~~sam~~ 1354is a menu item that appears between the commands and the file names. 1355Selecting it makes the 1356.CW sam 1357window the current window, 1358causing subsequent typing to be interpreted as commands. 1359.KF 1360.BP fig2.ps 2.74i 1361.Cs 1362Figure 3. The menu on button 3. 1363The black rectangle on the left is a scroll bar; the menu is limited to 1364the length shown to prevent its becoming unwieldy. 1365Above the 1366.CW ~~sam~~ 1367line is a list of commands; 1368beneath it is a list of files, presented exactly as with the 1369.CW f 1370command. 1371.Ce 1372.KE 1373.PP 1374When 1375.CW sam 1376requests that a window be swept, in response to 1377.CW new , 1378.CW zerox 1379or 1380.CW reshape , 1381it changes the mouse cursor from the usual arrow to a box with 1382a small arrow. 1383In this state, the mouse may be used to indicate an arbitrary rectangle by 1384pressing button 3 at one corner and releasing it at the opposite corner. 1385More conveniently, 1386button 3 may simply be clicked, 1387whereupon 1388.CW sam 1389creates the maximal rectangle that contains the cursor 1390and abuts the 1391.CW sam 1392window. 1393By placing the 1394.CW sam 1395window in the middle of the screen, the user can define two regions (one above, 1396one below) in which stacked fully-overlapping 1397windows can be created with minimal fuss (see Figure 1). 1398This simple user interface trick makes window creation noticeably easier. 1399.PP 1400The cut-and-paste editor is essentially the same as that in Smalltalk-80.\u\s-4\&11\s+4\d 1401The text in dot is always highlighted on the screen. 1402When a character is typed it replaces dot, and sets dot to the null 1403string after the character. Thus, ordinary typing inserts text. 1404Button 1 is used for selection: 1405pressing the button, moving the mouse, and lifting the button 1406selects (sets dot to) the text between the points where the 1407button was pressed and released. 1408Pressing and releasing at the same point selects a null string; this 1409is called clicking. Clicking twice quickly, or 1410.I 1411double clicking, 1412.R 1413selects larger objects; 1414for example, double clicking in a word selects the word, 1415double clicking just inside an opening bracket selects the text 1416contained in the brackets (handling nested brackets correctly), 1417and similarly for 1418parentheses, quotes, and so on. 1419The double-clicking rules reflect a bias toward 1420programmers. 1421If 1422.CW sam 1423were intended more for word processing, double-clicks would probably 1424select linguistic structures such as sentences. 1425.PP 1426If button 1 is pressed outside the current window, it makes the indicated 1427window current. 1428This is the easiest way to switch between windows and files. 1429.PP 1430Pressing button 2 brings up a menu of editing functions (see Figure 4). 1431These mostly apply to the selected text: 1432.CW cut 1433deletes the selected text, and remembers it in a hidden buffer called the 1434.I 1435snarf buffer, 1436.R 1437.CW paste 1438replaces the selected text by the contents of the snarf buffer, 1439.CW snarf 1440just copies the selected text to the snarf buffer, 1441.CW look 1442searches forward for the next literal occurrence of the selected text, and 1443.CW <mux> 1444exchanges snarf buffers with the window system in which 1445.CW sam 1446is running. 1447Finally, the last regular expression used appears as a menu entry 1448to search 1449forward for the next occurrence of a match for the expression. 1450.WS 1 1451.KF 1452.BP fig4.ps 1.20i 1453.Cs 1454Figure 4. The menu on button 2. 1455The bottom entry tracks the most recently used regular expression, which may 1456be literal text. 1457.Ce 1458.KE 1459.PP 1460The relationship between the command language and the mouse language is 1461entirely due to the equality of dot and the selected text chosen 1462with button 1 on the mouse. 1463For example, to make a set of changes in a C subroutine, dot can be 1464set by double clicking on the left brace that begins the subroutine, 1465which sets dot for the command language. 1466An address-free command then typed in the 1467.CW sam 1468window will apply only to the text between the opening and closing 1469braces of the function. 1470The idea is to select what you want, and then say what you want 1471to do with it, whether invoked by a menu selection or by a typed command. 1472And of course, the value of dot is highlighted on 1473the display after the command completes. 1474This relationship between mouse interface and command language 1475is clumsy to explain, but comfortable, even natural, in practice. 1476.SH 1477The Implementation 1478.LP 1479The next few sections describe how 1480.CW sam 1481is put together, first the host part, 1482then the inter-component communication, 1483then the terminal part. 1484After explaining how the command language is implemented, 1485the discussion follows (roughly) the path of a character 1486from the temporary file on disc to the screen. 1487The presentation centers on the data structures, 1488because that is how the program was designed and because 1489the algorithms are easy to provide, given the right data 1490structures. 1491.SH 2 1492Parsing and execution 1493.LP 1494The command language is interpreted by parsing each command with a 1495table-driven recursive 1496descent parser, and when a complete command is assembled, invoking a top-down 1497executor. 1498Most editors instead employ a simple character-at-a-time 1499lexical scanner. 1500Use of a parser makes it 1501easy and unambiguous to detect when a command is complete, 1502which has two advantages. 1503First, escape conventions such as backslashes to quote 1504multiple-line commands are unnecessary; if the command isn't finished, 1505the parser keeps reading. For example, a multiple-line append driven by an 1506.CW x 1507command is straightforward: 1508.P1 1509x/.*\en/ g/Peter/ a 1510one line about Peter 1511another line about Peter 1512\&. 1513.P2 1514Other UNIX editors would require a backslash after all but the last line. 1515.PP 1516The other advantage is specific to the two-process structure of 1517.CW sam . 1518The host process must decide when a command is completed so the 1519command interpreter can be called. This problem is easily resolved 1520by having the lexical analyzer read the single stream of events from the 1521terminal, directly executing all typing and mouse commands, 1522but passing to the parser characters typed to the 1523.CW sam 1524command window. 1525This scheme is slightly complicated by the availability of cut-and-paste 1526editing in the 1527.CW sam 1528window, but that difficulty is resolved by applying the rules 1529used in 1530.CW mux : 1531when a newline is typed to the 1532.CW sam 1533window, all text between the newline and the previously typed newline 1534is made available to the parser. 1535This permits arbitrary editing to be done to a command before 1536typing newline and thereby requesting execution. 1537.PP 1538The parser is driven by a table because the syntax of addresses 1539and commands is regular enough 1540to be encoded compactly. There are few special cases, such as the 1541replacement text in a substitution, so the syntax of almost all commands 1542can be encoded with a few flags. 1543These include whether the command allows an address (for example, 1544.CW e 1545does not), whether it takes a regular expression (as in 1546.CW x 1547and 1548.CW s ), 1549whether it takes replacement text (as in 1550.CW c 1551or 1552.CW i ), 1553which may be multi-line, and so on. 1554The internal syntax of regular expressions is handled by a separate 1555parser; a regular expression is a leaf of the command parse tree. 1556Regular expressions are discussed fully in the next section. 1557.PP 1558The parser table also has information about defaults, so the interpreter 1559is always called with a complete tree. For example, the parser fills in 1560the implicit 1561.CW 0 1562and 1563.CW $ 1564in the abbreviated address 1565.CW , 1566(comma), 1567inserts a 1568.CW + 1569to the left of an unadorned regular expression in an address, 1570and provides the usual default address 1571.CW . 1572(dot) for commands that expect an address but are not given one. 1573.PP 1574Once a complete command is parsed, the evaluation is easy. 1575The address is evaluated left-to-right starting from the value of dot, 1576with a mostly ordinary expression evaluator. 1577Addresses, like many of the data structures in 1578.CW sam , 1579are held in a C structure and passed around by value: 1580.P1 1581typedef long Posn; /* Position in a file */ 1582typedef struct Range{ 1583 Posn p1, p2; 1584}Range; 1585typedef struct Address{ 1586 Range r; 1587 File *f; 1588}Address; 1589.P2 1590An address is encoded as a substring (character positions 1591.CW p1 1592to 1593.CW p2 ) 1594in a file 1595.CW f . 1596(The data type 1597.CW File 1598is described in detail below.) 1599.PP 1600The address interpreter is an 1601.CW Address -valued 1602function that traverses the parse tree describing an address (the 1603parse tree for the address has type 1604.CW Addrtree ): 1605.P1 1606Address 1607address(ap, a, sign) 1608 Addrtree *ap; 1609 Address a; 1610 int sign; 1611{ 1612 Address a2; 1613 do 1614 switch(ap->type){ 1615 case '.': 1616 a=a.f->dot; 1617 break; 1618 case '$': 1619 a.r.p1=a.r.p2=a.f->nbytes; 1620 break; 1621 case '"': 1622 a=matchfile(a, ap->aregexp)->dot; 1623 break; 1624 case ',': 1625 a2=address(ap->right, a, 0); 1626 a=address(ap->left, a, 0); 1627 if(a.f!=a2.f || a2.r.p2<a.r.p1) 1628 error(Eorder); 1629 a.r.p2=a2.r.p2; 1630 return a; 1631 /* and so on */ 1632 } 1633 while((ap=ap->right)!=0); 1634 return a; 1635} 1636.P2 1637.PP 1638Throughout, errors are handled by a non-local 1639.CW goto 1640(a 1641.CW setjmp/longjmp 1642in C terminology) 1643hidden in a routine called 1644.CW error 1645that immediately aborts the execution, retracts any 1646partially made changes (see the section below on `undoing'), and 1647returns to the top level of the parser. 1648The argument to 1649.CW error 1650is an enumeration type that 1651is translated to a terse but possibly helpful 1652message such as `?addresses out of order.' 1653Very common messages are kept short; for example the message for 1654a failed regular expression search is `?search.' 1655.PP 1656Character addresses such as 1657.CW #3 1658are trivial to implement, as the 1659.CW File 1660data structure is accessible by character number. 1661However, 1662.CW sam 1663keeps no information about the position of newlines \(em it is too 1664expensive to track dynamically \(em so line addresses are computed by reading 1665the file, counting newlines. Except in very large files, this has proven 1666acceptable: file access is fast enough to make the technique practical, 1667and lines are not central to the structure of the command language. 1668.PP 1669The command interpreter, called 1670.CW cmdexec , 1671is also straightforward. The parse table includes a 1672function to call to interpret a particular command. That function 1673receives as arguments 1674the calculated address 1675for the command 1676and the command tree (of type 1677.CW Cmdtree ), 1678which may contain information such as the subtree for compound commands. 1679Here, for example, is the function for the 1680.CW g 1681and 1682.CW v 1683commands: 1684.P1 1685int 1686g_cmd(a, cp) 1687 Address a; 1688 Cmdtree *cp; 1689{ 1690 compile(cp->regexp); 1691 if(execute(a.f, a.r.p1, a.r.p2)!=(cp->cmdchar=='v')){ 1692 a.f->dot=a; 1693 return cmdexec(a, cp->subcmd); 1694 } 1695 return TRUE; /* cause execution to continue */ 1696} 1697.P2 1698.CW Compile "" ( 1699and 1700.CW execute 1701are part of the regular expression code, described in the next section.) 1702Because the parser and the 1703.CW File 1704data structure do most of the work, most commands 1705are similarly brief. 1706.SH 2 1707Regular expressions 1708.LP 1709The regular expression code in 1710.CW sam 1711is an interpreted, rather than compiled on-the-fly, implementation of Thompson's 1712non-deterministic finite automaton algorithm.\u\s-4\&12\s+4\d 1713The syntax and semantics of the expressions are as in the UNIX program 1714.CW egrep , 1715including alternation, closures, character classes, and so on. 1716The only changes in the notation are two additions: 1717.CW \en 1718is translated to, and matches, a newline character, and 1719.CW @ 1720matches any character. In 1721.CW egrep , 1722the character 1723.CW \&. 1724matches any character except newline, and in 1725.CW sam 1726the same rule seemed safest, to prevent idioms like 1727.CW \&.* 1728from spanning newlines. 1729.CW Egrep 1730expressions are arguably too complicated for an interactive editor \(em 1731certainly it would make sense if all the special characters were two-character 1732sequences, so that most of the punctuation characters wouldn't have 1733peculiar meanings \(em but for an interesting command language, full 1734regular expressions are necessary, and 1735.CW egrep 1736defines the full regular expression syntax for UNIX programs. 1737Also, it seemed superfluous to define a new syntax, since various UNIX programs 1738.CW ed , ( 1739.CW egrep 1740and 1741.CW vi ) 1742define too many already. 1743.PP 1744The expressions are compiled by a routine, 1745.CW compile , 1746that generates the description of the non-deterministic finite state machine. 1747A second routine, 1748.CW execute , 1749interprets the machine to generate the leftmost-longest match of the 1750expression in a substring of the file. 1751The algorithm is described elsewhere.\u\s-4\&12,13\s+4\d 1752.CW Execute 1753reports 1754whether a match was found, and sets a global variable, 1755of type 1756.CW Range , 1757to the substring matched. 1758.PP 1759A trick is required to evaluate the expression in reverse, such as when 1760searching backwards for an expression. 1761For example, 1762.P1 1763-/P.*r/ 1764.P2 1765looks backwards through the file for a match of the expression. 1766The expression, however, is defined for a forward search. 1767The solution is to construct a machine identical to the machine 1768for a forward search except for a reversal of all the concatenation 1769operators (the other operators are symmetric under direction reversal), 1770to exchange the meaning of the operators 1771.CW ^ 1772and 1773.CW $ , 1774and then to read the file backwards, looking for the 1775usual earliest longest match. 1776.PP 1777.CW Execute 1778generates only one match each time it is called. 1779To interpret looping constructs such as the 1780.CW x 1781command, 1782.CW sam 1783must therefore synchronize between 1784calls of 1785.CW execute 1786to avoid 1787problems with null matches. 1788For example, even given the leftmost-longest rule, 1789the expression 1790.CW a* 1791matches three times in the string 1792.CW ab 1793(the character 1794.CW a , 1795the null string between the 1796.CW a 1797and 1798.CW b , 1799and the final null string). 1800After returning a match for the 1801.CW a , 1802.CW sam 1803must not match the null string before the 1804.CW b . 1805The algorithm starts 1806.CW execute 1807at the end of its previous match, and 1808if the match it returns 1809is null and abuts the previous match, rejects the match and advances 1810the initial position one character. 1811.SH 2 1812Memory allocation 1813.LP 1814The C language has no memory allocation primitives, although a standard 1815library routine, 1816.CW malloc , 1817provides adequate service for simple programs. 1818For specific uses, however, 1819it can be better to write a custom allocator. 1820The allocator (or rather, pair of allocators) described here 1821work in both the terminal and host parts of 1822.CW sam . 1823They are designed for efficient manipulation of strings, 1824which are allocated and freed frequently and vary in length from essentially 1825zero to 32 Kbytes (very large strings are written to disc). 1826More important, strings may be large and change size often, 1827so to minimize memory usage it is helpful to reclaim and to coalesce the 1828unused portions of strings when they are truncated. 1829.PP 1830Objects to be allocated in 1831.CW sam 1832are of two flavors: 1833the first is C 1834.CW structs , 1835which are small and often addressed by pointer variables; 1836the second is variable-sized arrays of characters 1837or integers whose 1838base pointer is always used to access them. 1839The memory allocator in 1840.CW sam 1841is therefore in two parts: 1842first, a traditional first-fit allocator that provides fixed storage for 1843.CW structs ; 1844and second, a garbage-compacting allocator that reduces storage 1845overhead for variable-sized objects, at the cost of some bookkeeping. 1846The two types of objects are allocated from adjoining arenas, with 1847the garbage-compacting allocator controlling the arena with higher addresses. 1848Separating into two arenas simplifies compaction and prevents fragmentation due 1849to immovable objects. 1850The access rules for garbage-compactable objects 1851(discussed in the next paragraph) allow them to be relocated, so when 1852the first-fit arena needs space, it moves the garbage-compacted arena 1853to higher addresses to make room. Storage is therefore created only 1854at successively higher addresses, either when more garbage-compacted 1855space is needed or when the first-fit arena pushes up the other arena. 1856.PP 1857Objects that may be compacted declare to the 1858allocator a cell that is guaranteed to be the sole repository of the 1859address of the object whenever a compaction can occur. 1860The compactor can then update the address when the object is moved. 1861For example, the implementation of type 1862.CW List 1863(really a variable-length array) 1864is: 1865.P1 1866typedef struct List{ 1867 int nused; 1868 long *ptr; 1869}List; 1870.P2 1871The 1872.CW ptr 1873cell must always be used directly, and never copied. When a 1874.CW List 1875is to be created the 1876.CW List 1877structure is allocated in the ordinary first-fit arena 1878and its 1879.CW ptr 1880is allocated in the garbage-compacted arena. 1881A similar data type for strings, called 1882.CW String , 1883stores variable-length character arrays of up to 32767 elements. 1884.PP 1885A related matter of programming style: 1886.CW sam 1887frequently passes structures by value, which 1888simplifies the code. 1889Traditionally, C programs have 1890passed structures by reference, but implicit allocation on 1891the stack is easier to use. 1892Structure passing is a relatively new feature of C 1893(it is not in the 1894standard reference manual for C\u\s-4\&14\s+4\d), and is poorly supported in most 1895commercial C compilers. 1896It's convenient and expressive, though, 1897and simplifies memory management by 1898avoiding the allocator altogether 1899and eliminating pointer aliases. 1900.SH 2 1901Data structures for manipulating files 1902.LP 1903Experience with 1904.CW jim 1905showed that the requirements 1906of the file data structure were few, but strict. 1907First, files need to be read and written quickly; 1908adding a fresh file must be painless. 1909Second, the implementation must place no arbitrary upper limit on 1910the number or sizes of files. (It should be practical to edit many files, 1911and files up to megabytes in length should be handled gracefully.) 1912This implies that files be stored on disc, not in main memory. 1913(Aficionados of virtual memory may argue otherwise, but the 1914implementation of virtual 1915memory in our system is not something to depend on 1916for good performance.) 1917Third, changes to files need be made by only two primitives: 1918deletion and insertion. 1919These are inverses of each other, 1920which simplifies the implementation of the undo operation. 1921Finally, 1922it must be easy and efficient to access the file, either 1923forwards or backwards, a byte at a time. 1924.PP 1925The 1926.CW File 1927data type is constructed from three simpler data structures that hold arrays 1928of characters. 1929Each of these types has an insertion and deletion operator, and the 1930insertion and deletion operators of the 1931.CW File 1932type itself are constructed from them. 1933.PP 1934The simplest type is the 1935.CW String , 1936which is used to hold strings in main memory. 1937The code that manages 1938.CW Strings 1939guarantees that they will never be longer 1940than some moderate size, and in practice they are rarely larger than 8 Kbytes. 1941.CW Strings 1942have two purposes: they hold short strings like file names with little overhead, 1943and because they are deliberately small, they are efficient to modify. 1944They are therefore used as the data structure for in-memory caches. 1945.PP 1946The disc copy of the file is managed by a data structure called a 1947.CW Disc , 1948which corresponds to a temporary file. A 1949.CW Disc 1950has no storage in main memory other than bookkeeping information; 1951the actual data being held is all on the disc. 1952To reduce the number of open files needed, 1953.CW sam 1954opens a dozen temporary UNIX files and multiplexes the 1955.CW Discs 1956upon them. 1957This permits many files to 1958be edited; the entire 1959.CW sam 1960source (48 files) may be edited comfortably with a single 1961instance of 1962.CW sam . 1963Allocating one temporary file per 1964.CW Disc 1965would strain the operating system's limit on the number of open files. 1966Also, spreading the traffic among temporary files keeps the files shorter, 1967and shorter files are more efficiently implemented by the UNIX 1968I/O subsystem. 1969.PP 1970A 1971.CW Disc 1972is an array of fixed-length blocks, each of which contains 1973between 1 and 4096 characters of active data. 1974(The block size of our UNIX file system is 4096 bytes.) 1975The block addresses within the temporary file and the length of each 1976block are stored in a 1977.CW List . 1978When changes are made the live part of blocks may change size. 1979Blocks are created and coalesced when necessary to try to keep the sizes 1980between 2048 and 4096 bytes. 1981An actively changing part of the 1982.CW Disc 1983therefore typically has about a kilobyte of slop that can be 1984inserted or deleted 1985without changing more than one block or affecting the block order. 1986When an insertion would overflow a block, the block is split, a new one 1987is allocated to receive the overflow, and the memory-resident list of blocks 1988is rearranged to reflect the insertion of the new block. 1989.PP 1990Obviously, going to the disc for every modification to the file is 1991prohibitively expensive. 1992The data type 1993.CW Buffer 1994consists of a 1995.CW Disc 1996to hold the data and a 1997.CW String 1998that acts as a cache. 1999This is the first of a series of caches throughout the data structures in 2000.CW sam. 2001The caches not only improve performance, they provide a way to organize 2002the flow of data, particularly in the communication between the host 2003and terminal. 2004This idea is developed below, in the section on communications. 2005.PP 2006To reduce disc traffic, changes to a 2007.CW Buffer 2008are mediated by a variable-length string, in memory, that acts as a cache. 2009When an insertion or deletion is made to a 2010.CW Buffer , 2011if the change can be accommodated by the cache, it is done there. 2012If the cache becomes bigger than a block because of an insertion, 2013some of it is written to the 2014.CW Disc 2015and deleted from the cache. 2016If the change does not intersect the cache, the cache is flushed. 2017The cache is only loaded at the new position if the change is smaller than a block; 2018otherwise, it is sent directly to the 2019.CW Disc . 2020This is because 2021large changes are typically sequential, 2022whereupon the next change is unlikely to overlap the current one. 2023.PP 2024A 2025.CW File 2026comprises a 2027.CW String 2028to hold the file name and some ancillary data such as dot and the modified bit. 2029The most important components, though, are a pair of 2030.CW Buffers , 2031one called the transcript and the other the contents. 2032Their use is described in the next section. 2033.PP 2034The overall structure is shown in Figure 5. 2035Although it may seem that the data is touched many times on its 2036way from the 2037.CW Disc , 2038it is read (by one UNIX system call) directly into the cache of the 2039associated 2040.CW Buffer ; 2041no extra copy is done. 2042Similarly, when flushing the cache, the text is written 2043directly from the cache to disc. 2044Most operations act directly on the text in the cache. 2045A principle applied throughout 2046.CW sam 2047is that the fewer times the data is copied, the faster the program will run 2048(see also the paper by Waite\u\s-4\&15\s+4\d). 2049.KF 2050.PS 2051copy "fig5.pic" 2052.PE 2053.Cs 2054Figure 5. File data structures. 2055The temporary files are stored in the standard repository for such files 2056on the host system. 2057.Ce 2058.KE 2059.PP 2060The contents of a 2061.CW File 2062are accessed by a routine that 2063copies to a buffer a substring of a file starting at a specified offset. 2064To read a byte at a time, a 2065.CW File "" per- 2066array is loaded starting from a specified initial position, 2067and bytes may then be read from the array. 2068The implementation is done by a macro similar to the C standard I/O 2069.CW getc 2070macro.\u\s-4\&14\s+4\d 2071Because the reading may be done at any address, a minor change to the 2072macro allows the file to be read backwards. 2073This array is read-only; there is no 2074.CW putc . 2075.SH 2 2076Doing and undoing 2077.LP 2078.CW Sam 2079has an unusual method for managing changes to files. 2080The command language makes it easy to specify multiple variable-length changes 2081to a file millions of bytes long, and such changes 2082must be made efficiently if the editor is to be practical. 2083The usual techniques for inserting and deleting strings 2084are inadequate under these conditions. 2085The 2086.CW Buffer 2087and 2088.CW Disc 2089data structures are designed for efficient random access to long strings, 2090but care must be taken to avoid super-linear behavior when making 2091many changes simultaneously. 2092.PP 2093.CW Sam 2094uses a two-pass algorithm for making changes, and treats each file as a database 2095against which transactions are registered. 2096Changes are not made directly to the contents. 2097Instead, when a command is started, a `mark' containing 2098a sequence number is placed in the transcript 2099.CW Buffer , 2100and each change made to the file, either an insertion or deletion 2101or a change to the file name, 2102is appended to the end of the transcript. 2103When the command is complete, the transcript is rewound to the 2104mark and applied to the contents. 2105.PP 2106One reason for separating evaluation from 2107application in this way is to simplify tracking the addresses of changes 2108made in the middle of a long sequence. 2109The two-pass algorithm also allows all changes to apply to the 2110.I original 2111data: no change can affect another change made in the same command. 2112This is particularly important when evaluating an 2113.CW x 2114command because it prevents regular expression matches 2115from stumbling over changes made earlier in the execution. 2116Also, the two-pass 2117algorithm is cleaner than the way other UNIX editors allow changes to 2118affect each other; 2119for example, 2120.CW ed 's 2121idioms to do things like delete every other line 2122depend critically on the implementation. 2123Instead, 2124.CW sam 's 2125simple model, in which all changes in a command occur effectively 2126simultaneously, is easy to explain and to understand. 2127.PP 2128The records in the transcript are of the form ``delete substring from 2129locations 2130123 to 456'' and ``insert 11 characters `hello there' at location 789.'' 2131(It is an error if the changes are not at monotonically greater 2132positions through the file.) 2133While the update is occurring, these numbers must be 2134offset by earlier changes, but that is straightforward and 2135local to the update routine; 2136moreover, all the numbers have been computed 2137before the first is examined. 2138.PP 2139Treating the file as a transaction system has another advantage: 2140undo is trivial. 2141All it takes is to invert the transcript after it has been 2142implemented, converting insertions 2143into deletions and vice versa, and saving them in a holding 2144.CW Buffer . 2145The `do' transcript can then be deleted from 2146the transcript 2147.CW Buffer 2148and replaced by the `undo' transcript. 2149If an undo is requested, the transcript is rewound and the undo transcript 2150executed. 2151Because the transcript 2152.CW Buffer 2153is not truncated after each command, it accumulates 2154successive changes. 2155A sequence of undo commands 2156can therefore back up the file arbitrarily, 2157which is more helpful than the more commonly implemented self-inverse form of undo. 2158.CW Sam "" ( 2159provides no way to undo an undo, but if it were desired, 2160it would be easy to provide by re-interpreting the `do' transcript.) 2161Each mark in the transcript contains a sequence number and the offset into 2162the transcript of the previous mark, to aid in unwinding the transcript. 2163Marks also contain the value of dot and the modified bit so these can be 2164restored easily. 2165Undoing multiple files is easy; it merely demands undoing all files whose 2166latest change has the same sequence number as the current file. 2167.PP 2168Another benefit of having a transcript is that errors encountered in the middle 2169of a complicated command need not leave the files in an intermediate state. 2170By rewinding the transcript to the mark beginning the command, 2171the partial command can be trivially undone. 2172.PP 2173When the update algorithm was first implemented, it was unacceptably slow, 2174so a cache was added to coalesce nearby changes, 2175replacing multiple small changes by a single larger one. 2176This reduced the number 2177of insertions into the transaction 2178.CW Buffer , 2179and made a dramatic improvement in performance, 2180but made it impossible 2181to handle changes in non-monotonic order in the file; the caching method 2182only works if changes don't overlap. 2183Before the cache was added, the transaction could in principle be sorted 2184if the changes were out of order, although 2185this was never done. 2186The current status is therefore acceptable performance with a minor 2187restriction on global changes, which is sometimes, but rarely, an annoyance. 2188.PP 2189The update algorithm obviously paws the data more than simpler 2190algorithms, but it is not prohibitively expensive; 2191the caches help. 2192(The principle of avoiding copying the data is still honored here, 2193although not as piously: 2194the data is moved from contents' cache to 2195the transcript's all at once and through only one internal buffer.) 2196Performance figures confirm the efficiency. 2197To read from a dead start a hundred kilobyte file on a VAX-11/750 2198takes 1.4 seconds of user time, 2.5 seconds of system time, 2199and 5 seconds of real time. 2200Reading the same file in 2201.CW ed 2202takes 6.0 seconds of user time, 1.7 seconds of system time, 2203and 8 seconds of real time. 2204.CW Sam 2205uses about half the CPU time. 2206A more interesting example is the one stated above: 2207inserting a character between every pair of characters in the file. 2208The 2209.CW sam 2210command is 2211.P1 2212,y/@/ a/x/ 2213.P2 2214and takes 3 CPU seconds per kilobyte of input file, of which 2215about a third is spent in the regular expression code. 2216This translates to about 500 changes per second. 2217.CW Ed 2218takes 1.5 seconds per kilobyte to make a similar change (ignoring newlines), 2219but cannot undo it. 2220The same example in 2221.CW ex ,\u\s-4\&9\s+4\d 2222a variant of 2223.CW ed 2224done at the University of California at Berkeley, 2225which allows one level of undoing, again takes 3 seconds. 2226In summary, 2227.CW sam 's 2228performance is comparable to that of other UNIX editors, although it solves 2229a harder problem. 2230.SH 2 2231Communications 2232.LP 2233The discussion so far has described the implementation of the host part of 2234.CW sam ; 2235the next few sections explain how a machine with mouse and bitmap display 2236can be engaged to improve interaction. 2237.CW Sam 2238is not the first editor to be written as two processes,\u\s-4\&16\s+4\d 2239but its implementation 2240has some unusual aspects. 2241.PP 2242There are several ways 2243.CW sam 's 2244host and terminal parts may be connected. 2245The first and simplest is to forgo the terminal part and use the host 2246part's command language to edit text on an ordinary terminal. 2247This mode is invoked by starting 2248.CW sam 2249with the 2250.CW -d 2251option. 2252With no options, 2253.CW sam 2254runs separate host and terminal programs, 2255communicating with a message protocol over the physical 2256connection that joins them. 2257Typically, the connection is an RS-232 link between a Blit 2258(the prototypical display for 2259.CW sam ) 2260and a host running 2261the Ninth Edition of the UNIX operating system.\u\s-4\&8\s+4\d 2262(This is the version of the system used in the Computing Sciences Research 2263Center at AT&T Bell Laboratories, where I work. Its relevant 2264aspects are discussed in the Blit paper.\u\s-4\&1\s+4\d) 2265The implementation of 2266.CW sam 2267for the SUN computer runs both processes on the same machine and 2268connects them by a pipe. 2269.PP 2270The low bandwidth of an RS-232 link 2271necessitated the split between 2272the two programs. 2273The division is a mixed blessing: 2274a program in two parts is much harder to write and to debug 2275than a self-contained one, 2276but the split makes several unusual configurations possible. 2277The terminal may be physically separated from the host, allowing the conveniences 2278of a mouse and bitmap display to be taken home while leaving the files at work. 2279It is also possible to run the host part on a remote machine: 2280.P1 2281sam -r host 2282.P2 2283connects to the terminal in the usual way, and then makes a call 2284across the network to establish the host part of 2285.CW sam 2286on the named machine. 2287Finally, it cross-connects the I/O to join the two parts. 2288This allows 2289.CW sam 2290to be run on machines that do not support bitmap displays; 2291for example, 2292.CW sam 2293is the editor of choice on our Cray X-MP/24. 2294.CW Sam 2295.CW -r 2296involves 2297.I three 2298machines: the remote host, the terminal, and the local host. 2299The local host's job is simple but vital: it passes the data 2300between the remote host and terminal. 2301.PP 2302The host and terminal exchange messages asynchronously 2303(rather than, say, as remote procedure calls) but there is no 2304error detection or correction 2305because, whatever the configuration, the connection is reliable. 2306Because the terminal handles mundane interaction tasks such as 2307popping up menus and interpreting the responses, the messages are about 2308data, not actions. 2309For example, the host knows nothing about what is displayed on the screen, 2310and when the user types a character, the message sent to the host says 2311``insert a one-byte string at location 123 in file 7,'' not ``a character 2312was typed at the current position in the current file.'' 2313In other words, the messages look very much like the transaction records 2314in the transcripts. 2315.PP 2316Either the host or terminal part of 2317.CW sam 2318may initiate a change to a file. 2319The command language operates on the host, while typing and some 2320mouse operations are executed directly in the terminal to optimize response. 2321Changes initiated by the host program must be transmitted to the terminal, 2322and 2323vice versa. 2324(A token is exchanged to determine which end is in control, 2325which means that characters typed while a time-consuming command runs 2326must be buffered and do not appear until the command is complete.) 2327To maintain consistent information, 2328the host and terminal track changes through a per-file 2329data structure that records what portions of the file 2330the terminal has received. 2331The data structure, called a 2332.CW Rasp 2333(a weak pun: it's a file with holes) 2334is held and updated by both the host and terminal. 2335A 2336.CW Rasp 2337is a list of 2338.CW Strings 2339holding those parts of the file known to the terminal, 2340separated by counts of the number of bytes in the interstices. 2341Of course, the host doesn't keep a separate copy of the data (it only needs 2342the lengths of the various pieces), 2343but the structure is the same on both ends. 2344.PP 2345The 2346.CW Rasp 2347in the terminal doubles as a cache. 2348Since the terminal keeps the text for portions of the file it has displayed, 2349it need not request data from the host when revisiting old parts of the file 2350or redrawing obscured windows, which speeds things up considerably 2351over low-speed links. 2352.PP 2353It's trivial for the terminal to maintain its 2354.CW Rasp , 2355because all changes made on the terminal apply to parts of the file 2356already loaded there. 2357Changes made by the host are compared against the 2358.CW Rasp 2359during the update sequence after each command. 2360Small changes to pieces of the file loaded in the terminal 2361are sent in their entirety. 2362Larger changes, and changes that fall entirely in the holes, 2363are transmitted as messages without literal data: 2364only the lengths of the deleted and inserted strings are transmitted. 2365When a command is completed, the terminal examines its visible 2366windows to see if any holes in their 2367.CW Rasps 2368intersect the visible portion of the file. 2369It then requests the missing data from the host, 2370along with up to 512 bytes of surrounding data, to minimize 2371the number of messages when visiting a new portion of the file. 2372This technique provides a kind of two-level lazy evaluation for the terminal. 2373The first level sends a minimum of information about 2374parts of the file not being edited interactively; 2375the second level waits until a change is displayed before 2376transmitting the new data. 2377Of course, 2378performance is also helped by having the terminal respond immediately to typing 2379and simple mouse requests. 2380Except for small changes to active pieces of the file, which are 2381transmitted to the terminal without negotiation, 2382the terminal is wholly responsible for deciding what is displayed; 2383the host uses the 2384.CW Rasp 2385only to tell the terminal what might be relevant. 2386.PP 2387When a change is initiated by the host, 2388the messages to the terminal describing the change 2389are generated by the routine that applies the transcript of the changes 2390to the contents of the 2391.CW File . 2392Since changes are undone by the same update routine, 2393undoing requires 2394no extra code in the communications; 2395the usual messages describing changes to the file are sufficient 2396to back up the screen image. 2397.PP 2398The 2399.CW Rasp 2400is a particularly good example of the way caches are used in 2401.CW sam . 2402First, it facilitates access to the active portion of the text by placing 2403the busy text in main memory. 2404In so doing, it provides efficient access 2405to a large data structure that does not fit in memory. 2406Since the form of data is to be imposed by the user, not by the program, 2407and because characters will frequently be scanned sequentially, 2408files are stored as flat objects. 2409Caches help keep performance good and linear when working with such 2410data. 2411.PP 2412Second, the 2413.CW Rasp 2414and several of the other caches have some 2415.I read-ahead; 2416that is, the cache is loaded with more information than is needed for 2417the job immediately at hand. 2418When manipulating linear structures, the accesses are usually sequential, 2419and read-ahead can significantly reduce the average time to access the 2420next element of the object. 2421Sequential access is a common mode for people as well as programs; 2422consider scrolling through a document while looking for something. 2423.PP 2424Finally, like any good data structure, 2425the cache guides the algorithm, or at least the implementation. 2426The 2427.CW Rasp 2428was actually invented to control the communications between the host and 2429terminal parts, but I realized very early that it was also a form of 2430cache. Other caches were more explicitly intended to serve a double 2431purpose: for example, the caches in 2432.CW Files 2433that coalesce updates not only reduce traffic to the 2434transcript and contents 2435.CW Buffers , 2436they also clump screen updates so that complicated changes to the 2437screen are achieved in 2438just a few messages to the terminal. 2439This saved me considerable work: I did not need to write special 2440code to optimize the message traffic to the 2441terminal. 2442Caches pay off in surprising ways. 2443Also, they tend to be independent, so their performance improvements 2444are multiplicative. 2445.SH 2 2446Data structures in the terminal 2447.LP 2448The terminal's job is to display and to maintain a consistent image of 2449pieces of the files being edited. 2450Because the text is always in memory, the data structures are 2451considerably simpler than those in the host part. 2452.PP 2453.CW Sam 2454typically has far more windows than does 2455.CW mux , 2456the window system within which its Blit implementation runs. 2457.CW Mux 2458has a fairly small number of asynchronously updated windows; 2459.CW sam 2460needs a large number of synchronously updated windows that are 2461usually static and often fully obscured. 2462The different tradeoffs guided 2463.CW sam 2464away from the memory-intensive implementation of windows, called 2465.CW Layers ,\u\s-4\&17\s+4\d 2466used in 2467.CW mux. 2468Rather than depending on a complete bitmap image of the display for each window, 2469.CW sam 2470regenerates the image from its in-memory text 2471(stored in the 2472.CW Rasp ) 2473when necessary, although it will use such an image if it is available. 2474Like 2475.CW Layers , 2476though, 2477.CW sam 2478uses the screen bitmap as active storage in which to update the image using 2479.CW bitblt .\u\s-4\&18,19\s+4\d 2480The resulting organization, pictured in Figure 6, 2481has a global array of windows, called 2482.CW Flayers , 2483each of which holds an image of a piece of text held in a data structure 2484called a 2485.CW Frame , 2486which in turn represents 2487a rectangular window full of text displayed in some 2488.CW Bitmap . 2489Each 2490.CW Flayer 2491appears in a global list that orders them all front-to-back 2492on the display, and simultaneously as an element of a per-file array 2493that holds all the open windows for that file. 2494The complement in the terminal of the 2495.CW File 2496on the host is called a 2497.CW Text ; 2498each connects its 2499.CW Flayers 2500to the associated 2501.CW Rasp . 2502.KF 2503.PS 2504copy "fig6.pic" 2505.PE 2506.Cs 2507Figure 6. Data structures in the terminal. 2508.CW Flayers 2509are also linked together into a front-to-back list. 2510.CW Boxes 2511are discussed in the next section. 2512.Ce 2513.KE 2514.PP 2515The 2516.CW Bitmap 2517for a 2518.CW Frame 2519contains the image of the text. 2520For a fully visible window, the 2521.CW Bitmap 2522will be the screen (or at least the 2523.CW Layer 2524in which 2525.CW sam 2526is being run), 2527while for partially obscured windows the 2528.CW Bitmap 2529will be off-screen. 2530If the window is fully obscured, the 2531.CW Bitmap 2532will be null. 2533.PP 2534The 2535.CW Bitmap 2536is a kind of cache. 2537When making changes to the display, most of the original image will 2538look the same in the final image, and the update algorithms exploit this. 2539The 2540.CW Frame 2541software updates the image in the 2542.CW Bitmap 2543incrementally; the 2544.CW Bitmap 2545is not just an image, it is a data structure.\u\s-4\&18,19\s+4\d 2546The job of the software that updates the display is therefore 2547to use as much as possible of the existing image (converting the 2548text from ASCII characters to pixels is expensive) in a sort of two-dimensional 2549string insertion algorithm. 2550The details of this process are described in the next section. 2551.PP 2552The 2553.CW Frame 2554software has no code to support overlapping windows; 2555its job is to keep a single 2556.CW Bitmap 2557up to date. 2558It falls to the 2559.CW Flayer 2560software to multiplex the various 2561.CW Bitmaps 2562onto the screen. 2563The problem of maintaining overlapping 2564.CW Flayers 2565is easier than for 2566.CW Layers \u\s-4\&17\s+4\d 2567because changes are made synchronously and because the contents of the window 2568can be reconstructed from the data stored in the 2569.CW Frame ; 2570the 2571.CW Layers 2572software 2573makes no such assumptions. 2574In 2575.CW sam , 2576the window being changed is almost always fully visible, because the current 2577window is always fully visible, by construction. 2578However, when multi-file changes are being made, or when 2579more than one window is open on a file, 2580it may be necessary to update partially obscured windows. 2581.PP 2582There are three cases: the window is 2583fully visible, invisible (fully obscured), or partially visible. 2584If fully visible, the 2585.CW Bitmap 2586is part of the screen, so when the 2587.CW Flayer 2588update routine calls the 2589.CW Frame 2590update routine, the screen will be updated directly. 2591If the window is invisible, 2592there is no associated 2593.CW Bitmap , 2594and all that is necessary is to update the 2595.CW Frame 2596data structure, not the image. 2597If the window is partially visible, the 2598.CW Frame 2599routine is called to update the image in the off-screen 2600.CW Bitmap , 2601which may require regenerating it from the text of the window. 2602The 2603.CW Flayer 2604code then clips this 2605.CW Bitmap 2606against the 2607.CW Bitmaps 2608of all 2609.CW Frames 2610in front of the 2611.CW Frame 2612being modified, and the remainder is copied to the display. 2613.PP 2614This is much faster than recreating the image off-screen 2615for every change, or clipping all the changes made to the image 2616during its update. 2617Unfortunately, these caches can also consume prohibitive amounts of 2618memory, so they are freed fairly liberally \(em after every change to the 2619front-to-back order of the 2620.CW Flayers . 2621The result is that 2622the off-screen 2623.CW Bitmaps 2624exist only while multi-window changes are occurring, 2625which is the only time the performance improvement they provide is needed. 2626Also, the user interface causes fully-obscured windows to be the 2627easiest to make \(em 2628creating a canonically sized and placed window requires only a button click 2629\(em which reduces the need for caching still further. 2630.PP 2631.SH 2 2632Screen update 2633.LP 2634Only two low-level primitives are needed for incremental update: 2635.CW bitblt , 2636which copies rectangles of pixels, and 2637.CW string 2638(which in turn calls 2639.CW bitblt ), 2640which draws a null-terminated character string in a 2641.CW Bitmap . 2642A 2643.CW Frame 2644contains a list of 2645.CW Boxes , 2646each of which defines a horizontal strip of text in the window 2647(see Figure 7). 2648A 2649.CW Box 2650has a character string 2651.CW str , 2652and a 2653.CW Rectangle 2654.CW rect 2655that defines the location of the strip in the window. 2656(The text in 2657.CW str 2658is stored in the 2659.CW Box 2660separately from the 2661.CW Rasp 2662associated with the window's file, so 2663.CW Boxes 2664are self-contained.) 2665The invariant is that 2666the image of the 2667.CW Box 2668can be reproduced by calling 2669.CW string 2670with argument 2671.CW str 2672to draw the string in 2673.CW rect , 2674and the resulting picture fits perfectly within 2675.CW rect . 2676In other words, the 2677.CW Boxes 2678define the tiling of the window. 2679The tiling may be complicated by long lines of text, which 2680are folded onto the next line. 2681Some editors use horizontal scrolling to avoid this complication, 2682but to be comfortable this technique requires that lines not be 2683.I too 2684long; 2685.CW sam 2686has no such restriction. 2687Also, and perhaps more importantly, UNIX programs and terminals traditionally fold 2688long lines to make their contents fully visible. 2689.PP 2690Two special kinds of 2691.CW Boxes 2692contain a single 2693character: either a newline or a tab. 2694Newlines and tabs are white space. 2695A newline 2696.CW Box 2697always extends to the right edge of the window, 2698forcing the following 2699.CW Box 2700to the next line. 2701The width of a tab depends on where it is located: 2702it forces the next 2703.CW Box 2704to begin at a tab location. 2705Tabs also 2706have a minimum width equivalent to a blank (blanks are 2707drawn by 2708.CW string 2709and are not treated specially); newlines have a minimum width of zero. 2710.KF 2711.PS 2712copy "fig7.pic" 2713.PE 2714.sp .5 2715.Cs 2716Figure 7. A line of text showing its 2717.CW Boxes . 2718The first two blank 2719.CW Boxes 2720contain tabs; the last contains a newline. 2721Spaces are handled as ordinary characters. 2722.Ce 2723.KE 2724.PP 2725The update algorithms always use the 2726.CW Bitmap 2727image of the text (either the display or cache 2728.CW Bitmap ); 2729they never examine the characters within a 2730.CW Box 2731except when the 2732.CW Box 2733needs to be split in two. 2734Before a change, the window consists of a tiling of 2735.CW Boxes ; 2736after the change the window is tiled differently. 2737The update algorithms rearrange the tiles in place, without 2738backup storage. 2739The algorithms are not strictly optimal \(em for example, they can 2740clear a pixel that is later going to be written upon \(em 2741but they never move a tile that doesn't need to be moved, 2742and they move each tile at most once. 2743.CW Frinsert 2744on a Blit can absorb over a thousand characters a second if the strings 2745being inserted are a few tens of characters long. 2746.PP 2747Consider 2748.CW frdelete . 2749Its job is to delete a substring from a 2750.CW Frame 2751and restore the image of the 2752.CW Frame . 2753The image of a substring has a peculiar shape (see Figure 2) comprising 2754possibly a partial line, 2755zero or more full lines, 2756and possibly a final partial line. 2757For reference, call this the 2758.I 2759Z-shape. 2760.R 2761.CW Frdelete 2762begins by splitting, if necessary, the 2763.CW Boxes 2764containing the ends of 2765the substring so the substring begins and ends on 2766.CW Box 2767boundaries. 2768Because the substring is being deleted, its image is not needed, 2769so the Z-shape is then cleared. 2770Then, tiles (that is, the images of 2771.CW Boxes ) 2772are copied, using 2773.CW bitblt , 2774from immediately after the Z-shape to 2775the beginning of the Z-shape, 2776resulting in a new Z-shape. 2777.CW Boxes "" ( 2778whose contents would span two lines in the new position must first be split.) 2779.PP 2780Copying the remainder of the 2781.CW Frame 2782tile by tile 2783this way will clearly accomplish the deletion but eventually, 2784typically when the copying algorithm encounters a tab or newline, 2785the old and new 2786.CW x 2787coordinates of the tile 2788to be copied are the same. 2789This correspondence implies 2790that the Z-shape has its beginning and ending edges aligned 2791vertically, and a sequence of at most two 2792.CW bitblts 2793can be used to copy the remaining tiles. 2794The last step is to clear out the resulting empty space at the bottom 2795of the window; 2796the number of lines to be cleared is the number of complete lines in the 2797Z-shape closed by the final 2798.CW bitblts. 2799The final step is to merge horizontally adjacent 2800.CW Boxes 2801of plain text. 2802The complete source to 2803.CW frdelete 2804is less than 100 lines of C. 2805.PP 2806.CW frinsert 2807is more complicated because it must do four passes: 2808one to construct the 2809.CW Box 2810list for the inserted string, 2811one to reconnoitre, 2812one to copy (in opposite order to 2813.CW frdelete ) 2814the 2815.CW Boxes 2816to make the hole for the new text, 2817and finally one to copy the new text into place. 2818Overall, though, 2819.CW frinsert 2820has a similar flavor to 2821.CW frdelete , 2822and needn't be described further. 2823.CW Frinsert 2824and its subsidiary routines comprise 211 lines of C. 2825.PP 2826The terminal source code is 3024 lines of C, 2827and the host source is 5797 lines. 2828.SH 2829Discussion 2830.SH 2 2831History 2832.LP 2833The immediate ancestor of 2834.CW sam 2835was the original text editor for the Blit, called 2836.CW jim . 2837.CW Sam 2838inherited 2839.CW jim 's 2840two-process structure and mouse language almost unchanged, but 2841.CW jim 2842suffered from several drawbacks that were addressed in the design of 2843.CW sam . 2844The most important of these was the lack of a command language. 2845Although 2846.CW jim 2847was easy to use for simple editing, it provided no direct help with 2848large or repetitive editing tasks. Instead, it provided a command to pass 2849selected text through a shell pipeline, 2850but this was no more satisfactory than could be expected of a stopgap measure. 2851.PP 2852.CW Jim 2853was written primarily as a vehicle for experimenting with a mouse-based 2854interface to text, and the experiment was successful. 2855.CW Jim 2856had some spin-offs: 2857.CW mux , 2858the second window system for the Blit, is essentially a multiplexed 2859version of the terminal part of 2860.CW jim ; 2861and the debugger 2862.CW pi 's 2863user interface\u\s-4\&20\s+4\d was closely modeled on 2864.CW jim 's. 2865But after a couple of years, 2866.CW jim 2867had become difficult to maintain and limiting to use, 2868and its replacement was overdue. 2869.PP 2870I began the design of 2871.CW sam 2872by asking 2873.CW jim 2874customers what they wanted. 2875This was probably a mistake; the answers were essentially a list of features 2876to be found in other editors, which did not provide any of the 2877guiding principles I was seeking. 2878For instance, one common request was for a ``global substitute,'' 2879but no one suggested how to provide it within a cut-and-paste editor. 2880I was looking for a scheme that would 2881support such specialized features comfortably in the context of some 2882general command language. 2883Ideas were not forthcoming, though, particularly given my insistence 2884on removing all limits on file sizes, line lengths and so on. 2885Even worse, I recognized that, since the mouse could easily 2886indicate a region of the screen that was not an integral number of lines, 2887the command language would best forget about newlines altogether, 2888and that meant the command language had to treat the file as a single 2889string, not an array of lines. 2890.PP 2891Eventually, I decided that thinking was not getting me very far and it was 2892time to try building. 2893I knew that the terminal part could be built easily \(em 2894that part of 2895.CW jim 2896behaved acceptably well \(em and that most of the hard work was going 2897to be in the host part: the file interface, command interpreter and so on. 2898Moreover, I had some ideas about how the architecture of 2899.CW jim 2900could be improved without destroying its basic structure, which I liked 2901in principle but which hadn't worked out as well as I had hoped. 2902So I began by designing the file data structure, 2903starting with the way 2904.CW jim 2905worked \(em comparable to a single structure merging 2906.CW Disc 2907and 2908.CW Buffer , 2909which I split to make the cache more general 2910\(em and thinking about how global substitute could be implemented. 2911The answer was clearly that it had to be done in two passes, 2912and the transcript-oriented implementation fell out naturally. 2913.PP 2914.CW Sam 2915was written bottom-up, 2916starting from the data structures and algorithms for manipulating text, 2917through the command language and up to the code for maintaining 2918the display. 2919In retrospect, it turned out well, but this implementation method is 2920not recommended in general. 2921There were several times when I had a large body of interesting code 2922assembled and no clue how to proceed with it. 2923The command language, in particular, took almost a year to figure out, 2924but can be implemented (given what was there at the beginning of that year) 2925in a day or two. Similarly, inventing the 2926.CW Rasp 2927data structure delayed the 2928connection of the host and terminal pieces by another few months. 2929.CW Sam 2930took about two years to write, although only about four months were 2931spent actually working on it. 2932.PP 2933Part of the design process was unusual: 2934the subset of the protocol that maintains the 2935.CW Rasp 2936was simulated, debugged 2937and verified by an automatic protocol analyzer,\u\s-4\&21\s+4\d and was bug-free 2938from the start. 2939The rest of the protocol, concerned mostly 2940with keeping menus up to date, 2941was unfortunately too unwieldy for such analysis, 2942and was debugged by more traditional methods, primarily 2943by logging in a file all messages in and out of the host. 2944.SH 2 2945Reflections 2946.LP 2947.CW Sam 2948is essentially the only interactive editor used by the sixty or so members of 2949the computing science research center in which I work. 2950The same could not be said of 2951.CW jim ; 2952the lack of a command language kept some people from adopting it. 2953The union of a user interface as comfortable as 2954.CW jim 's 2955with a command language as powerful as 2956.CW ed 's† 2957.FS 2958.vs 9 2959†The people who criticize 2960.CW ed 2961as an interactive program often forget that it and its close relative 2962.CW sed \u\s-4\&7\s+4\d 2963still thrive as programmable editors. The strength of these programs is 2964independent of their convenience for interactive editing. 2965.br 2966.vs 2967.FE 2968is essential to 2969.CW sam 's 2970success. 2971When 2972.CW sam 2973was first made available to the 2974.CW jim 2975community, 2976almost everyone switched to it within two or three days. 2977In the months that followed, even people who had never adopted 2978.CW jim 2979started using 2980.CW sam 2981exclusively. 2982.PP 2983To be honest, 2984.CW ed 2985still gets occasional use, but usually when 2986something quick needs to be done and the overhead of 2987downloading the terminal part of 2988.CW sam 2989isn't worth the trouble. 2990Also, as a `line' editor, 2991.CW sam 2992.CW -d 2993is a bit odd; 2994when using a good old ASCII terminal, it's comforting to have 2995a true line editor. 2996But it is fair to say that 2997.CW sam 's 2998command language has displaced 2999.CW ed 's 3000for most of the complicated editing that has kept line editors 3001(that is, command-driven editors) with us. 3002.PP 3003.CW Sam 's 3004command language is even fancier than 3005.CW ed 's, 3006and most 3007.CW sam 3008customers don't come near to using all its capabilities. 3009Does it need to be so sophisticated? 3010I think the answer is yes, for two reasons. 3011.PP 3012First, the 3013.I model 3014for 3015.CW sam 's 3016command language is really relatively simple, and certainly simpler than that of 3017.CW ed . 3018For instance, there is only one kind of textual loop in 3019.CW sam 3020\(em the 3021.CW x 3022command \(em 3023while 3024.CW ed 3025has three (the 3026.CW g 3027command, the global flag on substitutions, and the implicit loop over 3028lines in multi-line substitutions). 3029Also, 3030.CW ed 's 3031substitute command is necessary to make changes within lines, but in 3032.CW sam 3033the 3034.CW s 3035command is more of a familiar convenience than a necessity; 3036.CW c 3037and 3038.CW t 3039can do all the work. 3040.PP 3041Second, 3042given a community that expects an editor to be about as powerful as 3043.CW ed , 3044it's hard to see how 3045.CW sam 3046could really be much simpler and still satisfy that expectation. 3047People want to do ``global substitutes,'' and most are content 3048to have the recipe for that and a few other fancy changes. 3049The sophistication of the command language is really just a veneer 3050over a design that makes it possible to do global substitutes 3051in a screen editor. 3052Some people will always want something more, however, and it's gratifying to 3053be able to provide it. 3054The real power of 3055.CW sam 's 3056command language comes from composability of the operators, which is by 3057nature orthogonal to the underlying model. 3058In other words, 3059.CW sam 3060is not itself complex, but it makes complex things possible. 3061If you don't want to do anything complex, you can ignore the 3062complexity altogether, and many people do so. 3063.PP 3064Sometimes I am asked the opposite question: why didn't I just make 3065.CW sam 3066a real programmable editor, with macros and variables and so on? 3067The main reason is a matter of taste: I like the editor 3068to be the same every time I use it. 3069There is one technical reason, though: 3070programmability in editors is largely a workaround for insufficient 3071interactivity. 3072Programmable editors are used to make particular, usually short-term, 3073things easy to do, such as by providing shorthands for common actions. 3074If things are generally easy to do in the first place, 3075shorthands are not as helpful. 3076.CW Sam 3077makes common editing operations very easy, and the solutions to 3078complex editing problems seem commensurate with the problems themselves. 3079Also, the ability to edit the 3080.CW sam 3081window makes it easy to repeat commands \(em it only takes a mouse button click 3082to execute a command again. 3083.SH 2 3084Pros and cons 3085.LP 3086.CW Sam 3087has several other good points, 3088and its share of problems. 3089Among the good things is the idea of 3090structural regular expressions, 3091whose usefulness has only begun to be explored. 3092They were arrived at serendipitously when I attempted to distill the essence of 3093.CW ed 's 3094way of doing global substitution and recognized that the looping command in 3095.CW ed 3096was implicitly imposing a structure (an array of lines) on the file. 3097.PP 3098Another of 3099.CW sam 's 3100good things is its undo capability. 3101I had never before used an editor with a true undo, 3102but I would never go back now. 3103Undo 3104.I must 3105be done well, but if it is, it can be relied on. 3106For example, 3107it's safe to experiment if you're not sure how to write some intricate command, 3108because if you make a mistake, it can be fixed simply and reliably. 3109I learned two things about undo from writing 3110.CW sam : 3111first, it's easy to provide if you design it in from the beginning, and 3112second, it's necessary, particularly if the system has some subtle 3113properties that may be unfamiliar or error-prone for users. 3114.PP 3115.CW Sam 's 3116lack of internal limits and sizes is a virtue. 3117Because it avoids all fixed-size tables and data structures, 3118.CW sam 3119is able to make global changes to files that some of our other 3120tools cannot even read. 3121Moreover, the design keeps the performance linear when doing such 3122operations, although I must admit 3123.CW sam 3124does get slow when editing a huge file. 3125.PP 3126Now, the problems. 3127Externally, the most obvious is that it is poorly integrated into the 3128surrounding window system. 3129By design, the user interface in 3130.CW sam 3131feels almost identical to that of 3132.CW mux , 3133but a thick wall separates text in 3134.CW sam 3135from the programs running in 3136.CW mux . 3137For instance, the `snarf buffer' in 3138.CW sam 3139must be maintained separately from that in 3140.CW mux . 3141This is regrettable, but probably necessary given the unusual configuration 3142of the system, with a programmable terminal on the far end of an RS-232 link. 3143.PP 3144.CW Sam 3145is reliable; otherwise, people wouldn't use it. 3146But it was written over such a long time, and has so many new (to me) 3147ideas in it, that I would like to see it done over again to clean 3148up the code and remove many of the lingering problems in the implementation. 3149The worst part is in the interconnection of the host and terminal parts, 3150which might even be able to go away in a redesign for a more 3151conventional window system. 3152The program must be split in two to use the terminal effectively, 3153but the low bandwidth of the connection forces the separation to 3154occur in an inconvenient part of the design if performance is to be acceptable. 3155A simple remote procedure call 3156protocol driven by the host, emitting only graphics 3157commands, would be easy to write but wouldn't have nearly the 3158necessary responsiveness. On the other hand, if the terminal were in control 3159and requested much simpler file services from the host, regular expression 3160searches would require that the terminal read the entire file over its RS-232 3161link, which would be unreasonably slow. 3162A compromise in which either end can take control is necessary. 3163In retrospect, the communications protocol should have been 3164designed and verified formally, although I do not know of any tool 3165that can adequately relate the protocol to 3166its implementation. 3167.PP 3168Not all of 3169.CW sam 's 3170users are comfortable with its command language, and few are adept. 3171Some (venerable) people use a sort of 3172.CW ed \& `` 3173subset'' of 3174.CW sam 's 3175command language, 3176and even ask why 3177.CW sam 's 3178command language is not exactly 3179.CW ed 's. 3180(The reason, of course, is that 3181.CW sam 's 3182model for text does not include newlines, which are central to 3183.CW ed . 3184Making the text an array of newlines to the command language would 3185be too much of a break from the seamless model provided by the mouse. 3186Some editors, such as 3187.CW vi , 3188are willing to make this break, though.) 3189The difficulty is that 3190.CW sam 's 3191syntax is so close to 3192.CW ed 's 3193that people believe it 3194.I should 3195be the same. 3196I thought, with some justification in hindsight, 3197that making 3198.CW sam 3199similar to 3200.CW ed 3201would make it easier to learn and to accept. 3202But I may have overstepped and raised the users' 3203expectations too much. 3204It's hard to decide which way to resolve this problem. 3205.PP 3206Finally, there is a tradeoff in 3207.CW sam 3208that was decided by the environment in which it runs: 3209.CW sam 3210is a multi-file editor, although in a different system there might instead be 3211multiple single-file editors. 3212The decision was made primarily because starting a new program in a Blit is 3213time-consuming. 3214If the choice could be made freely, however, I would 3215still choose the multi-file architecture, because it allows 3216groups of files to be handled as a unit; 3217the usefulness of the multi-file commands is incontrovertible. 3218It is delightful to have the source to an entire program 3219available at your fingertips. 3220.SH 3221Acknowledgements 3222.LP 3223Tom Cargill suggested the idea behind the 3224.CW Rasp 3225data structure. 3226Norman Wilson and Ken Thompson influenced the command language. 3227This paper was improved by comments from 3228Al Aho, 3229Jon Bentley, 3230Chris Fraser, 3231Gerard Holzmann, 3232Brian Kernighan, 3233Ted Kowalski, 3234Doug McIlroy 3235and 3236Dennis Ritchie. 3237