1Network Working Group R. Rivest 2Internet Draft May 4, 1997 3Expires November 4, 1997 4 5 6 S-Expressions 7 draft-rivest-sexp-00.txt 8 9 10Status of this Memo 11 12 Distribution of this memo is unlimited. 13 14 This document is an Internet-Draft. Internet Drafts are working 15 documents of the Internet Engineering Task Force (IETF), its Areas, 16 and its Working Groups. Note that other groups may also distribute 17 working documents as Internet Drafts. 18 19 Internet Drafts are draft documents valid for a maximum of six 20 months, and may be updated, replaced, or obsoleted by other documents 21 at any time. It is not appropriate to use Internet Drafts as 22 reference material, or to cite them other than as a ``working draft'' 23 or ``work in progress.'' 24 25 To learn the current status of any Internet-Draft, please check the 26 ``1id-abstracts.txt'' listing contained in the internet-drafts Shadow 27 Directories on: ftp.is.co.za (Africa), nic.nordu.net (Europe), 28 ds.internic.net (US East Coast), ftp.isi.edu (US West Coast), 29 or munnari.oz.au (Pacific Rim) 30 31 32Abstract 33 34This memo describes a data structure called "S-expressions" that are 35suitable for representing arbitrary complex data structures. We make 36precise the encodings of S-expressions: we give a "canonical form" for 37S-expressions, described two "transport" representations, and also 38describe an "advanced" format for display to people. 39 40 41 421. Introduction 43 44S-expressions are data structures for representing complex data. They 45are either byte-strings ("octet-strings") or lists of simpler 46S-expressions. Here is a sample S-expression: 47 48 (snicker "abc" (#03# |YWJj|)) 49 50It is a list of length three: 51 52 -- the octet-string "snicker" 53 54 -- the octet-string "abc" 55 56 -- a sub-list containing two elements: 57 - the hexadecimal constant #03# 58 - the base-64 constant |YWJj| (which is the same as "abc") 59 60This note gives a specific proposal for constructing and utilizing 61S-expressions. The proposal is independent of any particular application. 62 63Here are the design goals for S-expressions: 64 65 -- generality: S-expressions should be good at representing arbitrary 66 data. 67 68 -- readability: it should be easy for someone to examine and 69 understand the structure of an S-expression. 70 71 -- economy: S-expressions should represent data compactly. 72 73 -- tranportability: S-expressions should be easy to transport 74 over communication media (such as email) that are known to be 75 less than perfect. 76 77 -- flexibility: S-expressions should make it relatively simple to 78 modify and extend data structures. 79 80 -- canonicalization: it should be easy to produce a unique 81 "canonical" form of an S-expression, for digital signature purposes. 82 83 -- efficiency: S-expressions should admit in-memory representations 84 that allow efficient processing. 85 86 87Section 2 gives an introduction to S-expressions. 88Section 3 discusses the character sets used. 89Section 4 presents the various representations of octet-strings. 90Section 5 describes how to represent lists. 91Section 6 discusses how S-expressions are represented for various uses. 92Section 7 gives a BNF syntax for S-expressions. 93Section 8 talks about how S-expressions might be represented in memory. 94Section 9 briefly describes implementations for handling S-expressions. 95Section 10 discusses how applications might utilize S-expressions. 96Section 11 gives historical notes on S-expressions. 97Section 12 gives references. 98 992. S-expressions -- informal introduction 100 101Informally, an S-expression is either: 102 -- an octet-string, or 103 -- a finite list of simpler S-expressions. 104 105An octet-string is a finite sequence of eight-bit octets. There may be 106many different but equivalent ways of representing an octet-string 107 108 abc -- as a token 109 110 "abc" -- as a quoted string 111 112 #616263# -- as a hexadecimal string 113 114 3:abc -- as a length-prefixed "verbatim" encoding 115 116 {MzphYmM=} -- as a base-64 encoding of the verbatim encoding 117 (that is, an encoding of "3:abc") 118 119 |YWJj| -- as a base-64 encoding of the octet-string "abc" 120 121These encodings are all equivalent; they all denote the same octet string. 122 123We will give details of these encodings later on, and also describe how to 124give a "display type" to a byte string. 125 126A list is a finite sequence of zero or more simpler S-expressions. A list 127may be represented by using parentheses to surround the sequence of encodings 128of its elements, as in: 129 130 (abc (de #6667#) "ghi jkl") 131 132As we see, there is variability possible in the encoding of an 133S-expression. In some cases, it is desirable to standardize or 134restrict the encodings; in other cases it is desirable to have no 135restrictions. The following are the target cases we aim to handle: 136 137 -- a "transport" encoding for transporting the S-expression between 138 computers. 139 140 -- a "canonical" encoding, used when signing the S-expression. 141 142 -- an "advanced" encoding used for input/output to people. 143 144 -- an "in-memory" encoding used for processing the S-expression in 145 the computer. 146 147These need not be different; in this proposal the canonical encoding 148is the same as the transport encoding, for example. In this note we 149propose (related) encoding techniques for each of these uses. 150 1513. Character set 152 153We will be describing encodings of S-expressions. Except when giving 154"verbatim" encodings, the character set used is limited to the following 155characters in US-ASCII: 156 Alphabetic: A B ... Z a b ... z 157 numeric: 0 1 ... 9 158 whitespace: space, horizontal tab, vertical tab, form-feed 159 carriage-return, line-feed 160 The following graphics characters, which we call "pseudo-alphabetic": 161 - hyphen or minus 162 . period 163 / slash 164 _ underscore 165 : colon 166 * asterisk 167 + plus 168 = equal 169 The following graphics characters, which are "reserved punctuation": 170 ( left parenthesis 171 ) right parenthesis 172 [ left bracket 173 ] right bracket 174 { left brace 175 } right brace 176 | vertical bar 177 # number sign 178 " double quote 179 & ampersand 180 \ backslash 181 The following characters are unused and unavailable, except in 182 "verbatim" encodings: 183 ! exclamation point 184 % percent 185 ^ circumflex 186 ~ tilde 187 ; semicolon 188 ' apostrophe 189 , comma 190 < less than 191 > greater than 192 ? question mark 193 194 1954. Octet string representations 196 197This section describes in detail the ways in which an octet-string may 198be represented. 199 200We recall that an octet-string is any finite sequence of octets, and 201that the octet-string may have length zero. 202 203 2044.1 Verbatim representation 205 206A verbatim encoding of an octet string consists of four parts: 207 208 -- the length (number of octets) of the octet-string, 209 given in decimal most significant digit first, with 210 no leading zeros. 211 212 -- a colon ":" 213 214 -- the octet string itself, verbatim. 215 216There are no blanks or whitespace separating the parts. No "escape 217sequences" are interpreted in the octet string. This encoding is also 218called a "binary" or "raw" encoding. 219 220Here are some sample verbatim encodings: 221 222 3:abc 223 7:subject 224 4::::: 225 12:hello world! 226 10:abcdefghij 227 0: 228 2294.2 Quoted-string representation 230 231The quoted-string representation of an octet-string consists of: 232 233 -- an optional decimal length field 234 235 -- an initial double-quote (") 236 237 -- the octet string with "C" escape conventions (\n,etc) 238 239 -- a final double-quote (") 240 241The specified length is the length of the resulting string after any 242escape sequences have been handled. The string does not have any 243"terminating NULL" that C includes, and the length does not count such 244a character. 245 246The length is optional. 247 248The escape conventions within the quoted string are as follows (these follow 249the "C" programming language conventions, with an extension for 250ignoring line terminators of just LF or CRLF): 251 \b -- backspace 252 \t -- horizontal tab 253 \v -- vertical tab 254 \n -- new-line 255 \f -- form-feed 256 \r -- carriage-return 257 \" -- double-quote 258 \' -- single-quote 259 \\ -- back-slash 260 \ooo -- character with octal value ooo (all three digits 261 must be present) 262 \xhh -- character with hexadecimal value hh (both digits 263 must be present) 264 \<carriage-return> -- causes carriage-return to be ignored. 265 \<line-feed> -- causes linefeed to be ignored 266 \<carriage-return><line-feed> -- causes CRLF to be ignored. 267 \<line-feed><carriage-return> -- causes LFCR to be ignored. 268 269Here are some examples of quoted-string encodings: 270 271 "subject" 272 "hi there" 273 7"subject" 274 3"\n\n\n" 275 "This has\n two lines." 276 "This has\ 277 one." 278 "" 279 2804.3 Token representation 281 282An octet string that meets the following conditions may be given 283directly as a "token". 284 285 -- it does not begin with a digit 286 287 -- it contains only characters that are 288 -- alphabetic (upper or lower case), 289 -- numeric, or 290 -- one of the eight "pseudo-alphabetic" punctuation marks: 291 - . / _ : * + = 292 (Note: upper and lower case are not equivalent.) 293 (Note: A token may begin with punctuation, including ":"). 294 295Here are some examples of token representations: 296 297 subject 298 not-before 299 class-of-1997 300 //microsoft.com/names/smith 301 * 302 303 3044.4 Hexadecimal representation 305 306An octet-string may be represented with a hexadecimal encoding consisting of: 307 308 -- an (optional) decimal length of the octet string 309 310 -- a sharp-sign "#" 311 312 -- a hexadecimal encoding of the octet string, with each octet 313 represented with two hexadecimal digits, most significant 314 digit first. 315 316 -- a sharp-sign "#" 317 318There may be whitespace inserted in the midst of the hexadecimal 319encoding arbitrarily; it is ignored. It is an error to have 320characters other than whitespace and hexadecimal digits. 321 322Here are some examples of hexadecimal encodings: 323 324 #616263# -- represents "abc" 325 3#616263# -- also represents "abc" 326 # 616 327 263 # -- also represents "abc" 328 329 3304.5 Base-64 representation 331 332An octet-string may be represented in a base-64 coding consisting of: 333 334 -- an (optional) decimal length of the octet string 335 336 -- a vertical bar "|" 337 338 -- the rfc 1521 base-64 encoding of the octet string. 339 340 -- a final vertical bar "|" 341 342The base-64 encoding uses only the characters 343 A-Z a-z 0-9 + / = 344It produces four characters of output for each three octets of input. 345If the input has one or two left-over octets of input, it produces an 346output block of length four ending in two or one equals signs, respectively. 347Output routines compliant with this standard MUST output the equals signs 348as specified. Input routines MAY accept inputs where the equals signs are 349dropped. 350 351There may be whitespace inserted in the midst of the base-64 encoding 352arbitrarily; it is ignored. It is an error to have characters other 353than whitespace and base-64 characters. 354 355Here are some examples of base-64 encodings: 356 357 |YWJj| -- represents "abc" 358 | Y W 359 J j | -- also represents "abc" 360 3|YWJj| -- also represents "abc" 361 |YWJjZA==| -- represents "abcd" 362 |YWJjZA| -- also represents "abcd" 363 364 3654.6 Display hint 366 367Any octet string may be preceded by a single "display hint". 368 369The purposes of the display hint is to provide information on how 370to display the octet string to a user. It has no other function. 371Many of the MIME types work here. 372 373A display-hint is an octet string surrounded by square brackets. 374There may be whitespace separating the octet string from the 375surrounding brackets. Any of the legal formats may be used for the 376octet string. 377 378Here are some examples of display-hints: 379 380 [image/gif] 381 [URI] 382 [charset=unicode-1-1] 383 [text/richtext] 384 [application/postscript] 385 [audio/basic] 386 ["http://abc.com/display-types/funky.html"] 387 388In applications an octet-string that is untyped may be considered to have 389a pre-specified "default" mime type. The mime type 390 "text/plain; charset=iso-8859-1" 391is the standard default. 392 393 3944.7 Equality of octet-strings 395 396Two octet strings are considered to be "equal" if and only if they 397have the same display hint and the same data octet strings. 398 399Note that octet-strings are "case-sensitive"; the octet-string "abc" 400is not equal to the octet-string "ABC". 401 402An untyped octet-string can be compared to another octet-string (typed 403or not) by considering it as a typed octet-string with the default 404mime-type. 405 406 4075. Lists 408 409Just as with octet-strings, there are several ways to represent an 410S-expression. Whitespace may be used to separate list elements, but 411they are only required to separate two octet strings when otherwise 412the two octet strings might be interpreted as one, as when one token 413follows another. Also, whitespace may follow the initial left 414parenthesis, or precede the final right parenthesis. 415 416Here are some examples of encodings of lists: 417 418 (a b c) 419 420 ( a ( b c ) ( ( d e ) ( e f ) ) ) 421 422 (11:certificate(6:issuer3:bob)(7:subject5:alice)) 423 424 ({3Rt=} "1997" murphy 3:{XC++}) 425 426 4276. Representation types 428 429There are three "types" of representations: 430 431 -- canonical 432 433 -- basic transport 434 435 -- advanced transport 436 437The first two MUST be supported by any implementation; the last is 438optional. 439 440 4416.1 Canonical representation 442 443This canonical representation is used for digital signature purposes, 444transmission, etc. It is uniquely defined for each S-expression. It 445is not particularly readable, but that is not the point. It is 446intended to be very easy to parse, to be reasonably economical, and to 447be unique for any S-expression. 448 449The "canonical" form of an S-expression represents each octet-string 450in verbatim mode, and represents each list with no blanks separating 451elements from each other or from the surrounding parentheses. 452 453Here are some examples of canonical representations of S-expressions: 454 455 (6:issuer3:bob) 456 457 (4:icon[12:image/bitmap]9:xxxxxxxxx) 458 459 (7:subject(3:ref5:alice6:mother)) 460 461 4626.2 Basic transport representation 463 464There are two forms of the "basic transport" representation: 465 466 -- the canonical representation 467 468 -- an rfc-2045 base-64 representation of the canonical representation, 469 surrounded by braces. 470 471The transport mechanism is intended to provide a universal means of 472representing S-expressions for transport from one machine to another. 473 474Here are some examples of an S-expression represented in basic 475transport mode: 476 477 (1:a1:b1:c) 478 479 {KDE6YTE6YjE6YykA} 480 481 (this is the same S-expression encoded in base-64) 482 483There is a difference between the brace notation for base-64 used here 484and the || notation for base-64'd octet-strings described above. Here 485the base-64 contents are converted to octets, and then re-scanned as 486if they were given originally as octets. With the || notation, the 487contents are just turned into an octet-string. 488 489 4906.3 Advanced transport representation 491 492The "advanced transport" representation is intended to provide more 493flexible and readable notations for documentation, design, debugging, 494and (in some cases) user interface. 495 496The advanced transport representation allows all of the representation 497forms described above, include quoted strings, base-64 and hexadecimal 498representation of strings, tokens, representations of strings with 499omitted lengths, and so on. 500 501 5027. BNF for syntax 503 504We give separate BNF's for canonical and advanced forms of S-expressions. 505We use the following notation: 506 <x>* means 0 or more occurrences of <x> 507 <x>+ means 1 or more occurrences of <x> 508 <x>? means 0 or 1 occurrences of <x> 509 parentheses are used for grouping, as in (<x> | <y>)* 510 511For canonical and basic transport: 512 513<sexpr> :: <string> | <list> 514<string> :: <display>? <simple-string> ; 515<simple-string> :: <raw> ; 516<display> :: "[" <simple-string> "]" ; 517<raw> :: <decimal> ":" <bytes> ; 518<decimal> :: <decimal-digit>+ ; 519 -- decimal numbers should have no unnecessary leading zeros 520<bytes> -- any string of bytes, of the indicated length 521<list> :: "(" <sexp>* ")" ; 522<decimal-digit> :: "0" | ... | "9" ; 523 524For advanced transport: 525 526<sexpr> :: <string> | <list> 527<string> :: <display>? <simple-string> ; 528<simple-string> :: <raw> | <token> | <base-64> | <hexadecimal> | 529 <quoted-string> ; 530<display> :: "[" <simple-string> "]" ; 531<raw> :: <decimal> ":" <bytes> ; 532<decimal> :: <decimal-digit>+ ; 533 -- decimal numbers should have no unnecessary leading zeros 534<bytes> -- any string of bytes, of the indicated length 535<token> :: <tokenchar>+ ; 536<base-64> :: <decimal>? "|" ( <base-64-char> | <whitespace> )* "|" ; 537<hexadecimal> :: "#" ( <hex-digit> | <white-space> )* "#" ; 538<quoted-string> :: <decimal>? <quoted-string-body> 539<quoted-string-body> :: "\"" <bytes> "\"" 540<list> :: "(" ( <sexp> | <whitespace> )* ")" ; 541<whitespace> :: <whitespace-char>* ; 542<token-char> :: <alpha> | <decimal-digit> | <simple-punc> ; 543<alpha> :: <upper-case> | <lower-case> | <digit> ; 544<lower-case> :: "a" | ... | "z" ; 545<upper-case> :: "A" | ... | "Z" ; 546<decimal-digit> :: "0" | ... | "9" ; 547<hex-digit> :: <decimal-digit> | "A" | ... | "F" | "a" | ... | "f" ; 548<simple-punc> :: "-" | "." | "/" | "_" | ":" | "*" | "+" | "=" ; 549<whitespace-char> :: " " | "\t" | "\r" | "\n" ; 550<base-64-char> :: <alpha> | <decimal-digit> | "+" | "/" | "=" ; 551<null> :: "" ; 552 5538. In-memory representations 554 555For processing, the S-expression would typically be parsed and represented 556in memory in a more more amenable to efficient processing. We suggest 557two alternatives: 558 559 -- "list-structure" 560 561 -- "array-layout" 562 563We only sketch these here, as they are only suggestive. The code referenced 564below illustrates these styles in more detail. 565 566 5678.1. List-structure memory representation 568 569Here there are separate records for simple-strings, strings, and 570lists. An S-expression of the form ("abc" "de") would require two 571records for the simple strings, two for the strings, and two for the 572list elements. This is a fairly conventional representation, and 573details are omitted here. 574 5758.2 Array-layout memory representation 576 577Here each S-expression is represented as a contiguous array of bytes. 578The first byte codes the "type" of the S-expression: 579 580 01 octet-string 581 582 02 octet-string with display-hint 583 584 03 beginning of list (and 00 is used for "end of list") 585 586Each of the three types is immediately followed by a k-byte integer 587indicating the size (in bytes) of the following representation. Here 588k is an integer that depends on the implementation, it might be 589anywhere from 2 to 8, but would be fixed for a given implementation; 590it determines the size of the objects that can be handled. The transport 591and canonical representations are independent of the choice of k made by 592the implementation. 593 594Although the length of lists are not given in the usual S-expression 595notations, it is easy to fill them in when parsing; when you reach a 596right-parenthesis you know how long the list representation was, and 597where to go back to fill in the missing length. 598 599 6008.2.1 Octet string 601 602This is represented as follows: 603 604 01 <length> <octet-string> 605 606For example (here k = 2) 607 608 01 0003 a b c 609 6108.2.2 Octet-string with display-hint 611 612This is represented as follows: 613 614 02 <length> 615 01 <length> <octet-string> /* for display-type */ 616 01 <length> <octet-string> /* for octet-string */ 617 618For example, the S-expression 619 620 [gif] #61626364# 621 622would be represented as (with k = 2) 623 624 02 000d 625 01 0003 g i f 626 01 0004 61 62 63 64 627 6288.2.3 List 629 630This is represented as 631 632 03 <length> <item1> <item2> <item3> ... <itemn> 00 633 634For example, the list (abc [d]ef (g)) is represented in memory as (with k=2) 635 636 03 001b 637 01 0003 a b c 638 02 0009 639 01 0001 d 640 01 0002 e f 641 03 0005 642 01 0001 g 643 00 644 00 645 6469. Code 647 648There is code available for reading and parsing the various 649S-expression formats proposed here. 650 651See http://theory.lcs.mit.edu/~rivest/sexp.html 652 653 65410. Utilization of S-expressions 655 656This note has described S-expressions in general form. Application writers 657may wish to restrict their use of S-expressions in various ways. Here are 658some possible restrictions that might be considered: 659 660 -- no display-hints 661 -- no lengths on hexadecimal, quoted-strings, or base-64 encodings 662 -- no empty lists 663 -- no empty octet-strings 664 -- no lists having another list as its first element 665 -- no base-64 or hexadecimal encodings 666 -- fixed limits on the size of octet-strings 667 66811. Historical note 669 670The S-expression technology described here was originally developed 671for ``SDSI'' (the Simple Distributed Security Infrastructure by 672Lampson and Rivest [SDSI]) in 1996, although the origins clearly date 673back to McCarthy's LISP programming language. It was further refined 674and improved during the merger of SDSI and SPKI [SPKI] during the 675first half of 1997. S-expressions are similar to, but more readable 676and flexible than, Bernstein's "net-strings" [BERN]. 677 67812. References 679 680[SDSI] "A Simple Distributed Security Architecture", by 681 Butler Lampson, and Ronald L. Rivest 682 http://theory.lcs.mit.edu/~cis/sdsi.html 683 684[SPKI] <a href="http://www.clark.net/pub/cme/html/spki.html">SPKI--A 685 Simple Public Key Infrastructure</a> 686 687[BERN] Dan Bernstein's "net-strings"; Internet Draft 688 draft-bernstein-netstrings-02.txt 689 690Author's Address 691 692 Ronald L. Rivest 693 Room 324, 545 Technology Square 694 MIT Laboratory for Computer Science 695 Cambridge, MA 02139 696 697 rivest@theory.lcs.mit.edu 698 699 700