xref: /inferno-os/lib/sexp (revision 46439007cf417cbd9ac8049bb4122c890097a0fa)
1Network  Working Group                                         R. Rivest
2Internet Draft                                               May 4, 1997
3Expires November 4, 1997
4
5
6	                      S-Expressions
7                        draft-rivest-sexp-00.txt
8
9
10Status of this Memo
11
12   Distribution of this memo is unlimited.
13
14   This document is an Internet-Draft.  Internet Drafts are working
15   documents of the Internet Engineering Task Force (IETF), its Areas,
16   and its Working Groups.  Note that other groups may also distribute
17   working documents as Internet Drafts.
18
19   Internet Drafts are draft documents valid for a maximum of six
20   months, and may be updated, replaced, or obsoleted by other documents
21   at any time.  It is not appropriate to use Internet Drafts as
22   reference material, or to cite them other than as a ``working draft''
23   or ``work in progress.''
24
25   To learn the current status of any Internet-Draft, please check the
26   ``1id-abstracts.txt'' listing contained in the internet-drafts Shadow
27   Directories on: ftp.is.co.za (Africa), nic.nordu.net (Europe),
28   ds.internic.net (US East Coast), ftp.isi.edu (US West Coast),
29   or munnari.oz.au (Pacific Rim)
30
31
32Abstract
33
34This memo describes a data structure called "S-expressions" that are
35suitable for representing arbitrary complex data structures.  We make
36precise the encodings of S-expressions: we give a "canonical form" for
37S-expressions, described two "transport" representations, and also
38describe an "advanced" format for display to people.
39
40
41
421. Introduction
43
44S-expressions are data structures for representing complex data.  They
45are either byte-strings ("octet-strings") or lists of simpler
46S-expressions.  Here is a sample S-expression:
47
48	(snicker "abc" (#03# |YWJj|))
49
50It is a list of length three:
51
52	-- the octet-string "snicker"
53
54	-- the octet-string "abc"
55
56	-- a sub-list containing two elements:
57		- the hexadecimal constant #03#
58		- the base-64 constant |YWJj| (which is the same as "abc")
59
60This note gives a specific proposal for constructing and utilizing
61S-expressions.  The proposal is independent of any particular application.
62
63Here are the design goals for S-expressions:
64
65  -- generality: S-expressions should be good at representing arbitrary
66     data.
67
68  -- readability: it should be easy for someone to examine and
69     understand the structure of an S-expression.
70
71  -- economy: S-expressions should represent data compactly.
72
73  -- tranportability: S-expressions should be easy to transport
74     over communication media (such as email) that are known to be
75     less than perfect.
76
77  -- flexibility: S-expressions should make it relatively simple to
78     modify and extend data structures.
79
80  -- canonicalization: it should be easy to produce a unique
81     "canonical" form of an S-expression, for digital signature purposes.
82
83  -- efficiency: S-expressions should admit in-memory representations
84     that allow efficient processing.
85
86
87Section 2 gives an introduction to S-expressions.
88Section 3 discusses the character sets used.
89Section 4 presents the various representations of octet-strings.
90Section 5 describes how to represent lists.
91Section 6 discusses how S-expressions are represented for various uses.
92Section 7 gives a BNF syntax for S-expressions.
93Section 8 talks about how S-expressions might be represented in memory.
94Section 9 briefly describes implementations for handling S-expressions.
95Section 10 discusses how applications might utilize S-expressions.
96Section 11 gives historical notes on S-expressions.
97Section 12 gives references.
98
992. S-expressions -- informal introduction
100
101Informally, an S-expression is either:
102	-- an octet-string, or
103	-- a finite list of simpler S-expressions.
104
105An octet-string is a finite sequence of eight-bit octets.  There may be
106many different but equivalent ways of representing an octet-string
107
108	abc		-- as a token
109
110	"abc"		-- as a quoted string
111
112	#616263#	-- as a hexadecimal string
113
114	3:abc		-- as a length-prefixed "verbatim" encoding
115
116	{MzphYmM=}	-- as a base-64 encoding of the verbatim encoding
117			   (that is, an encoding of "3:abc")
118
119	|YWJj|		-- as a base-64 encoding of the octet-string "abc"
120
121These encodings are all equivalent; they all denote the same octet string.
122
123We will give details of these encodings later on, and also describe how to
124give a "display type" to a byte string.
125
126A list is a finite sequence of zero or more simpler S-expressions.  A list
127may be represented by using parentheses to surround the sequence of encodings
128of its elements, as in:
129
130	(abc (de #6667#) "ghi jkl")
131
132As we see, there is variability possible in the encoding of an
133S-expression.  In some cases, it is desirable to standardize or
134restrict the encodings; in other cases it is desirable to have no
135restrictions.  The following are the target cases we aim to handle:
136
137	-- a "transport" encoding for transporting the S-expression between
138	   computers.
139
140	-- a "canonical" encoding, used when signing the S-expression.
141
142	-- an "advanced" encoding used for input/output to people.
143
144	-- an "in-memory" encoding used for processing the S-expression in
145	   the computer.
146
147These need not be different; in this proposal the canonical encoding
148is the same as the transport encoding, for example.  In this note we
149propose (related) encoding techniques for each of these uses.
150
1513. Character set
152
153We will be describing encodings of S-expressions.  Except when giving
154"verbatim" encodings, the character set used is limited to the following
155characters in US-ASCII:
156	Alphabetic:	A B ... Z a b ... z
157	numeric:	0 1 ... 9
158	whitespace:	space, horizontal tab, vertical tab, form-feed
159			carriage-return, line-feed
160	The following graphics characters, which we call "pseudo-alphabetic":
161			- hyphen or minus
162			. period
163			/ slash
164			_ underscore
165			: colon
166			* asterisk
167			+ plus
168			= equal
169	The following graphics characters, which are "reserved punctuation":
170			( left parenthesis
171			) right parenthesis
172			[ left bracket
173			] right bracket
174			{ left brace
175			} right brace
176			| vertical bar
177			# number sign
178			" double quote
179			& ampersand
180			\ backslash
181	The following characters are unused and unavailable, except in
182	"verbatim" encodings:
183			! exclamation point
184			% percent
185			^ circumflex
186			~ tilde
187			; semicolon
188			' apostrophe
189			, comma
190			< less than
191			> greater than
192			? question mark
193
194
1954. Octet string representations
196
197This section describes in detail the ways in which an octet-string may
198be represented.
199
200We recall that an octet-string is any finite sequence of octets, and
201that the octet-string may have length zero.
202
203
2044.1 Verbatim representation
205
206A verbatim encoding of an octet string consists of four parts:
207
208	-- the length (number of octets) of the octet-string,
209	   given in decimal most significant digit first, with
210	   no leading zeros.
211
212	-- a colon ":"
213
214	-- the octet string itself, verbatim.
215
216There are no blanks or whitespace separating the parts.  No "escape
217sequences" are interpreted in the octet string.  This encoding is also
218called a "binary" or "raw" encoding.
219
220Here are some sample verbatim encodings:
221
222	3:abc
223	7:subject
224	4:::::
225	12:hello world!
226	10:abcdefghij
227	0:
228
2294.2 Quoted-string representation
230
231The quoted-string representation of an octet-string consists of:
232
233	-- an optional decimal length field
234
235	-- an initial double-quote (")
236
237	-- the octet string with "C" escape conventions (\n,etc)
238
239	-- a final double-quote (")
240
241The specified length is the length of the resulting string after any
242escape sequences have been handled.  The string does not have any
243"terminating NULL" that C includes, and the length does not count such
244a character.
245
246The length is optional.
247
248The escape conventions within the quoted string are as follows (these follow
249the "C" programming language conventions, with an extension for
250ignoring line terminators of just LF or CRLF):
251	\b		-- backspace
252	\t		-- horizontal tab
253	\v 		-- vertical tab
254	\n		-- new-line
255	\f		-- form-feed
256	\r		-- carriage-return
257	\"		-- double-quote
258	\'		-- single-quote
259	\\		-- back-slash
260	\ooo		-- character with octal value ooo (all three digits
261			   must be present)
262	\xhh		-- character with hexadecimal value hh (both digits
263			   must be present)
264	\<carriage-return> -- causes carriage-return to be ignored.
265	\<line-feed>       -- causes linefeed to be ignored
266	\<carriage-return><line-feed> -- causes CRLF to be ignored.
267	\<line-feed><carriage-return> -- causes LFCR to be ignored.
268
269Here are some examples of quoted-string encodings:
270
271	"subject"
272	"hi there"
273	7"subject"
274	3"\n\n\n"
275	"This has\n two lines."
276	"This has\
277	one."
278	""
279
2804.3 Token representation
281
282An octet string that meets the following conditions may be given
283directly as a "token".
284
285	-- it does not begin with a digit
286
287	-- it contains only characters that are
288		-- alphabetic (upper or lower case),
289		-- numeric, or
290		-- one of the eight "pseudo-alphabetic" punctuation marks:
291			-   .   /   _   :  *  +  =
292	(Note: upper and lower case are not equivalent.)
293	(Note: A token may begin with punctuation, including ":").
294
295Here are some examples of token representations:
296
297	subject
298	not-before
299	class-of-1997
300	//microsoft.com/names/smith
301	*
302
303
3044.4 Hexadecimal representation
305
306An octet-string may be represented with a hexadecimal encoding consisting of:
307
308	-- an (optional) decimal length of the octet string
309
310	-- a sharp-sign "#"
311
312	-- a hexadecimal encoding of the octet string, with each octet
313	   represented with two hexadecimal digits, most significant
314	   digit first.
315
316	-- a sharp-sign "#"
317
318There may be whitespace inserted in the midst of the hexadecimal
319encoding arbitrarily; it is ignored.  It is an error to have
320characters other than whitespace and hexadecimal digits.
321
322Here are some examples of hexadecimal encodings:
323
324	#616263#		-- represents "abc"
325	3#616263#		-- also represents "abc"
326	# 616
327	  263 #                 -- also represents "abc"
328
329
3304.5 Base-64 representation
331
332An octet-string may be represented in a base-64 coding consisting of:
333
334	-- an (optional) decimal length of the octet string
335
336	-- a vertical bar "|"
337
338	-- the rfc 1521 base-64 encoding of the octet string.
339
340	-- a final vertical bar "|"
341
342The base-64 encoding uses only the characters
343	A-Z  a-z  0-9  +  /  =
344It produces four characters of output for each three octets of input.
345If the input has one or two left-over octets of input, it produces an
346output block of length four ending in two or one equals signs, respectively.
347Output routines compliant with this standard MUST output the equals signs
348as specified.  Input routines MAY accept inputs where the equals signs are
349dropped.
350
351There may be whitespace inserted in the midst of the base-64 encoding
352arbitrarily; it is ignored.  It is an error to have characters other
353than whitespace and base-64 characters.
354
355Here are some examples of base-64 encodings:
356
357	|YWJj|		-- represents "abc"
358	| Y W
359	  J j |		-- also represents "abc"
360	3|YWJj|		-- also represents "abc"
361	|YWJjZA==|	-- represents "abcd"
362	|YWJjZA|	-- also represents "abcd"
363
364
3654.6 Display hint
366
367Any octet string may be preceded by a single "display hint".
368
369The purposes of the display hint is to provide information on how
370to display the octet string to a user.  It has no other function.
371Many of the MIME types work here.
372
373A display-hint is an octet string surrounded by square brackets.
374There may be whitespace separating the octet string from the
375surrounding brackets.  Any of the legal formats may be used for the
376octet string.
377
378Here are some examples of display-hints:
379
380	[image/gif]
381	[URI]
382	[charset=unicode-1-1]
383	[text/richtext]
384	[application/postscript]
385	[audio/basic]
386	["http://abc.com/display-types/funky.html"]
387
388In applications an octet-string that is untyped may be considered to have
389a pre-specified "default" mime type.  The mime type
390		"text/plain; charset=iso-8859-1"
391is the standard default.
392
393
3944.7  Equality of octet-strings
395
396Two octet strings are considered to be "equal" if and only if they
397have the same display hint and the same data octet strings.
398
399Note that octet-strings are "case-sensitive"; the octet-string "abc"
400is not equal to the octet-string "ABC".
401
402An untyped octet-string can be compared to another octet-string (typed
403or not) by considering it as a typed octet-string with the default
404mime-type.
405
406
4075. Lists
408
409Just as with octet-strings, there are several ways to represent an
410S-expression.  Whitespace may be used to separate list elements, but
411they are only required to separate two octet strings when otherwise
412the two octet strings might be interpreted as one, as when one token
413follows another.  Also,	whitespace may follow the initial left
414parenthesis, or precede the final right parenthesis.
415
416Here are some examples of encodings of lists:
417
418	(a b c)
419
420	( a ( b c ) ( ( d e ) ( e f ) )  )
421
422	(11:certificate(6:issuer3:bob)(7:subject5:alice))
423
424	({3Rt=} "1997" murphy 3:{XC++})
425
426
4276. Representation types
428
429There are three "types" of representations:
430
431	-- canonical
432
433	-- basic transport
434
435	-- advanced transport
436
437The first two MUST be supported by any implementation; the last is
438optional.
439
440
4416.1  Canonical representation
442
443This canonical representation is used for digital signature purposes,
444transmission, etc.  It is uniquely defined for each S-expression.  It
445is not particularly readable, but that is not the point.  It is
446intended to be very easy to parse, to be reasonably economical, and to
447be unique for any S-expression.
448
449The "canonical" form of an S-expression represents each octet-string
450in verbatim mode, and represents each list with no blanks separating
451elements from each other or from the surrounding parentheses.
452
453Here are some examples of canonical representations of S-expressions:
454
455	(6:issuer3:bob)
456
457	(4:icon[12:image/bitmap]9:xxxxxxxxx)
458
459	(7:subject(3:ref5:alice6:mother))
460
461
4626.2 Basic transport representation
463
464There are two forms of the "basic transport" representation:
465
466	-- the canonical representation
467
468	-- an rfc-2045 base-64 representation of the canonical representation,
469           surrounded by braces.
470
471The transport mechanism is intended to provide a universal means of
472representing S-expressions for transport from one machine to another.
473
474Here are some examples of an S-expression represented in basic
475transport mode:
476
477	(1:a1:b1:c)
478
479	{KDE6YTE6YjE6YykA}
480
481		(this is the same S-expression encoded in base-64)
482
483There is a difference between the brace notation for base-64 used here
484and the || notation for base-64'd octet-strings described above.  Here
485the base-64 contents are converted to octets, and then re-scanned as
486if they were given originally as octets.  With the || notation, the
487contents are just turned into an octet-string.
488
489
4906.3 Advanced transport representation
491
492The "advanced transport" representation is intended to provide more
493flexible and readable notations for documentation, design, debugging,
494and (in some cases) user interface.
495
496The advanced transport representation allows all of the representation
497forms described above, include quoted strings, base-64 and hexadecimal
498representation of strings, tokens, representations of strings with
499omitted lengths, and so on.
500
501
5027. BNF for syntax
503
504We give separate BNF's for canonical and advanced forms of S-expressions.
505We use the following notation:
506	<x>* 		means 0 or more occurrences of <x>
507	<x>+		means 1 or more occurrences of <x>
508	<x>?		means 0 or 1 occurrences of <x>
509	parentheses	are used for grouping, as in (<x> | <y>)*
510
511For canonical and basic transport:
512
513<sexpr>    	:: <string> | <list>
514<string>   	:: <display>? <simple-string> ;
515<simple-string>	:: <raw> ;
516<display>  	:: "[" <simple-string> "]" ;
517<raw>      	:: <decimal> ":" <bytes> ;
518<decimal>  	:: <decimal-digit>+ ;
519		-- decimal numbers should have no unnecessary leading zeros
520<bytes> 	-- any string of bytes, of the indicated length
521<list>     	:: "(" <sexp>* ")" ;
522<decimal-digit> :: "0" | ... | "9" ;
523
524For advanced transport:
525
526<sexpr>    	:: <string> | <list>
527<string>   	:: <display>? <simple-string> ;
528<simple-string>	:: <raw> | <token> | <base-64> | <hexadecimal> |
529		           <quoted-string> ;
530<display>  	:: "[" <simple-string> "]" ;
531<raw>      	:: <decimal> ":" <bytes> ;
532<decimal>  	:: <decimal-digit>+ ;
533		-- decimal numbers should have no unnecessary leading zeros
534<bytes> 	-- any string of bytes, of the indicated length
535<token>    	:: <tokenchar>+ ;
536<base-64>  	:: <decimal>? "|" ( <base-64-char> | <whitespace> )* "|" ;
537<hexadecimal>   :: "#" ( <hex-digit> | <white-space> )* "#" ;
538<quoted-string> :: <decimal>? <quoted-string-body>
539<quoted-string-body> :: "\"" <bytes> "\""
540<list>     	:: "(" ( <sexp> | <whitespace> )* ")" ;
541<whitespace> 	:: <whitespace-char>* ;
542<token-char>  	:: <alpha> | <decimal-digit> | <simple-punc> ;
543<alpha>       	:: <upper-case> | <lower-case> | <digit> ;
544<lower-case>  	:: "a" | ... | "z" ;
545<upper-case>  	:: "A" | ... | "Z" ;
546<decimal-digit> :: "0" | ... | "9" ;
547<hex-digit>     :: <decimal-digit> | "A" | ... | "F" | "a" | ... | "f" ;
548<simple-punc> 	:: "-" | "." | "/" | "_" | ":" | "*" | "+" | "=" ;
549<whitespace-char> :: " " | "\t" | "\r" | "\n" ;
550<base-64-char> 	:: <alpha> | <decimal-digit> | "+" | "/" | "=" ;
551<null>        	:: "" ;
552
5538. In-memory representations
554
555For processing, the S-expression would typically be parsed and represented
556in memory in a more more amenable to efficient processing.  We suggest
557two alternatives:
558
559	-- "list-structure"
560
561	-- "array-layout"
562
563We only sketch these here, as they are only suggestive.  The code referenced
564below illustrates these styles in more detail.
565
566
5678.1. List-structure memory representation
568
569Here there are separate records for simple-strings, strings, and
570lists.  An S-expression of the form ("abc" "de") would require two
571records for the simple strings, two for the strings, and two for the
572list elements.  This is a fairly conventional representation, and
573details are omitted here.
574
5758.2 Array-layout memory representation
576
577Here each S-expression is represented as a contiguous array of bytes.
578The first byte codes the "type" of the S-expression:
579
580	01 	octet-string
581
582	02	octet-string with display-hint
583
584	03	beginning of list (and 00 is used for "end of list")
585
586Each of the three types is immediately followed by a k-byte integer
587indicating the size (in bytes) of the following representation.  Here
588k is an integer that depends on the implementation, it might be
589anywhere from 2 to 8, but would be fixed for a given implementation;
590it determines the size of the objects that can be handled.  The transport
591and canonical representations are independent of the choice of k made by
592the implementation.
593
594Although the length of lists are not given in the usual S-expression
595notations, it is easy to fill them in when parsing; when you reach a
596right-parenthesis you know how long the list representation was, and
597where to go back to fill in the missing length.
598
599
6008.2.1 Octet string
601
602This is represented as follows:
603
604	01 <length> <octet-string>
605
606For example (here k = 2)
607
608	01 0003 a b c
609
6108.2.2 Octet-string with display-hint
611
612This is represented as follows:
613
614	02 <length>
615	  01 <length> <octet-string>    /* for display-type */
616	  01 <length> <octet-string>    /* for octet-string */
617
618For example, the S-expression
619
620	[gif] #61626364#
621
622would be represented as (with k = 2)
623
624	02 000d
625	  01 0003  g  i  f
626	  01 0004 61 62 63 64
627
6288.2.3 List
629
630This is represented as
631
632	03 <length> <item1> <item2> <item3> ... <itemn> 00
633
634For example, the list (abc [d]ef (g)) is represented in memory as (with k=2)
635
636	03 001b
637	  01 0003 a b c
638          02 0009
639            01 0001 d
640            01 0002 e f
641          03 0005
642            01 0001 g
643          00
644        00
645
6469. Code
647
648There is code available for reading and parsing the various
649S-expression formats proposed here.
650
651See http://theory.lcs.mit.edu/~rivest/sexp.html
652
653
65410. Utilization of S-expressions
655
656This note has described S-expressions in general form.  Application writers
657may wish to restrict their use of S-expressions in various ways.  Here are
658some possible restrictions that might be considered:
659
660	-- no display-hints
661	-- no lengths on hexadecimal, quoted-strings, or base-64 encodings
662	-- no empty lists
663	-- no empty octet-strings
664	-- no lists having another list as its first element
665	-- no base-64 or hexadecimal encodings
666	-- fixed limits on the size of octet-strings
667
66811. Historical note
669
670The S-expression technology described here was originally developed
671for ``SDSI'' (the Simple Distributed Security Infrastructure by
672Lampson and Rivest [SDSI]) in 1996, although the origins clearly date
673back to McCarthy's LISP programming language.  It was further refined
674and improved during the merger of SDSI and SPKI [SPKI] during the
675first half of 1997.  S-expressions are similar to, but more readable
676and flexible than, Bernstein's "net-strings" [BERN].
677
67812. References
679
680[SDSI] "A Simple Distributed Security Architecture", by
681        Butler Lampson, and Ronald L. Rivest
682	http://theory.lcs.mit.edu/~cis/sdsi.html
683
684[SPKI] <a href="http://www.clark.net/pub/cme/html/spki.html">SPKI--A
685       Simple Public Key Infrastructure</a>
686
687[BERN] Dan Bernstein's "net-strings"; Internet Draft
688       draft-bernstein-netstrings-02.txt
689
690Author's Address
691
692      Ronald L. Rivest
693      Room 324, 545 Technology Square
694      MIT Laboratory for Computer Science
695      Cambridge, MA 02139
696
697      rivest@theory.lcs.mit.edu
698
699
700