xref: /inferno-os/lib/sexp (revision 46439007cf417cbd9ac8049bb4122c890097a0fa)
1*46439007SCharles.ForsythNetwork  Working Group                                         R. Rivest
2*46439007SCharles.ForsythInternet Draft                                               May 4, 1997
3*46439007SCharles.ForsythExpires November 4, 1997
4*46439007SCharles.Forsyth
5*46439007SCharles.Forsyth
6*46439007SCharles.Forsyth	                      S-Expressions
7*46439007SCharles.Forsyth                        draft-rivest-sexp-00.txt
8*46439007SCharles.Forsyth
9*46439007SCharles.Forsyth
10*46439007SCharles.ForsythStatus of this Memo
11*46439007SCharles.Forsyth
12*46439007SCharles.Forsyth   Distribution of this memo is unlimited.
13*46439007SCharles.Forsyth
14*46439007SCharles.Forsyth   This document is an Internet-Draft.  Internet Drafts are working
15*46439007SCharles.Forsyth   documents of the Internet Engineering Task Force (IETF), its Areas,
16*46439007SCharles.Forsyth   and its Working Groups.  Note that other groups may also distribute
17*46439007SCharles.Forsyth   working documents as Internet Drafts.
18*46439007SCharles.Forsyth
19*46439007SCharles.Forsyth   Internet Drafts are draft documents valid for a maximum of six
20*46439007SCharles.Forsyth   months, and may be updated, replaced, or obsoleted by other documents
21*46439007SCharles.Forsyth   at any time.  It is not appropriate to use Internet Drafts as
22*46439007SCharles.Forsyth   reference material, or to cite them other than as a ``working draft''
23*46439007SCharles.Forsyth   or ``work in progress.''
24*46439007SCharles.Forsyth
25*46439007SCharles.Forsyth   To learn the current status of any Internet-Draft, please check the
26*46439007SCharles.Forsyth   ``1id-abstracts.txt'' listing contained in the internet-drafts Shadow
27*46439007SCharles.Forsyth   Directories on: ftp.is.co.za (Africa), nic.nordu.net (Europe),
28*46439007SCharles.Forsyth   ds.internic.net (US East Coast), ftp.isi.edu (US West Coast),
29*46439007SCharles.Forsyth   or munnari.oz.au (Pacific Rim)
30*46439007SCharles.Forsyth
31*46439007SCharles.Forsyth
32*46439007SCharles.ForsythAbstract
33*46439007SCharles.Forsyth
34*46439007SCharles.ForsythThis memo describes a data structure called "S-expressions" that are
35*46439007SCharles.Forsythsuitable for representing arbitrary complex data structures.  We make
36*46439007SCharles.Forsythprecise the encodings of S-expressions: we give a "canonical form" for
37*46439007SCharles.ForsythS-expressions, described two "transport" representations, and also
38*46439007SCharles.Forsythdescribe an "advanced" format for display to people.
39*46439007SCharles.Forsyth
40*46439007SCharles.Forsyth
41*46439007SCharles.Forsyth
42*46439007SCharles.Forsyth1. Introduction
43*46439007SCharles.Forsyth
44*46439007SCharles.ForsythS-expressions are data structures for representing complex data.  They
45*46439007SCharles.Forsythare either byte-strings ("octet-strings") or lists of simpler
46*46439007SCharles.ForsythS-expressions.  Here is a sample S-expression:
47*46439007SCharles.Forsyth
48*46439007SCharles.Forsyth	(snicker "abc" (#03# |YWJj|))
49*46439007SCharles.Forsyth
50*46439007SCharles.ForsythIt is a list of length three:
51*46439007SCharles.Forsyth
52*46439007SCharles.Forsyth	-- the octet-string "snicker"
53*46439007SCharles.Forsyth
54*46439007SCharles.Forsyth	-- the octet-string "abc"
55*46439007SCharles.Forsyth
56*46439007SCharles.Forsyth	-- a sub-list containing two elements:
57*46439007SCharles.Forsyth		- the hexadecimal constant #03#
58*46439007SCharles.Forsyth		- the base-64 constant |YWJj| (which is the same as "abc")
59*46439007SCharles.Forsyth
60*46439007SCharles.ForsythThis note gives a specific proposal for constructing and utilizing
61*46439007SCharles.ForsythS-expressions.  The proposal is independent of any particular application.
62*46439007SCharles.Forsyth
63*46439007SCharles.ForsythHere are the design goals for S-expressions:
64*46439007SCharles.Forsyth
65*46439007SCharles.Forsyth  -- generality: S-expressions should be good at representing arbitrary
66*46439007SCharles.Forsyth     data.
67*46439007SCharles.Forsyth
68*46439007SCharles.Forsyth  -- readability: it should be easy for someone to examine and
69*46439007SCharles.Forsyth     understand the structure of an S-expression.
70*46439007SCharles.Forsyth
71*46439007SCharles.Forsyth  -- economy: S-expressions should represent data compactly.
72*46439007SCharles.Forsyth
73*46439007SCharles.Forsyth  -- tranportability: S-expressions should be easy to transport
74*46439007SCharles.Forsyth     over communication media (such as email) that are known to be
75*46439007SCharles.Forsyth     less than perfect.
76*46439007SCharles.Forsyth
77*46439007SCharles.Forsyth  -- flexibility: S-expressions should make it relatively simple to
78*46439007SCharles.Forsyth     modify and extend data structures.
79*46439007SCharles.Forsyth
80*46439007SCharles.Forsyth  -- canonicalization: it should be easy to produce a unique
81*46439007SCharles.Forsyth     "canonical" form of an S-expression, for digital signature purposes.
82*46439007SCharles.Forsyth
83*46439007SCharles.Forsyth  -- efficiency: S-expressions should admit in-memory representations
84*46439007SCharles.Forsyth     that allow efficient processing.
85*46439007SCharles.Forsyth
86*46439007SCharles.Forsyth
87*46439007SCharles.ForsythSection 2 gives an introduction to S-expressions.
88*46439007SCharles.ForsythSection 3 discusses the character sets used.
89*46439007SCharles.ForsythSection 4 presents the various representations of octet-strings.
90*46439007SCharles.ForsythSection 5 describes how to represent lists.
91*46439007SCharles.ForsythSection 6 discusses how S-expressions are represented for various uses.
92*46439007SCharles.ForsythSection 7 gives a BNF syntax for S-expressions.
93*46439007SCharles.ForsythSection 8 talks about how S-expressions might be represented in memory.
94*46439007SCharles.ForsythSection 9 briefly describes implementations for handling S-expressions.
95*46439007SCharles.ForsythSection 10 discusses how applications might utilize S-expressions.
96*46439007SCharles.ForsythSection 11 gives historical notes on S-expressions.
97*46439007SCharles.ForsythSection 12 gives references.
98*46439007SCharles.Forsyth
99*46439007SCharles.Forsyth2. S-expressions -- informal introduction
100*46439007SCharles.Forsyth
101*46439007SCharles.ForsythInformally, an S-expression is either:
102*46439007SCharles.Forsyth	-- an octet-string, or
103*46439007SCharles.Forsyth	-- a finite list of simpler S-expressions.
104*46439007SCharles.Forsyth
105*46439007SCharles.ForsythAn octet-string is a finite sequence of eight-bit octets.  There may be
106*46439007SCharles.Forsythmany different but equivalent ways of representing an octet-string
107*46439007SCharles.Forsyth
108*46439007SCharles.Forsyth	abc		-- as a token
109*46439007SCharles.Forsyth
110*46439007SCharles.Forsyth	"abc"		-- as a quoted string
111*46439007SCharles.Forsyth
112*46439007SCharles.Forsyth	#616263#	-- as a hexadecimal string
113*46439007SCharles.Forsyth
114*46439007SCharles.Forsyth	3:abc		-- as a length-prefixed "verbatim" encoding
115*46439007SCharles.Forsyth
116*46439007SCharles.Forsyth	{MzphYmM=}	-- as a base-64 encoding of the verbatim encoding
117*46439007SCharles.Forsyth			   (that is, an encoding of "3:abc")
118*46439007SCharles.Forsyth
119*46439007SCharles.Forsyth	|YWJj|		-- as a base-64 encoding of the octet-string "abc"
120*46439007SCharles.Forsyth
121*46439007SCharles.ForsythThese encodings are all equivalent; they all denote the same octet string.
122*46439007SCharles.Forsyth
123*46439007SCharles.ForsythWe will give details of these encodings later on, and also describe how to
124*46439007SCharles.Forsythgive a "display type" to a byte string.
125*46439007SCharles.Forsyth
126*46439007SCharles.ForsythA list is a finite sequence of zero or more simpler S-expressions.  A list
127*46439007SCharles.Forsythmay be represented by using parentheses to surround the sequence of encodings
128*46439007SCharles.Forsythof its elements, as in:
129*46439007SCharles.Forsyth
130*46439007SCharles.Forsyth	(abc (de #6667#) "ghi jkl")
131*46439007SCharles.Forsyth
132*46439007SCharles.ForsythAs we see, there is variability possible in the encoding of an
133*46439007SCharles.ForsythS-expression.  In some cases, it is desirable to standardize or
134*46439007SCharles.Forsythrestrict the encodings; in other cases it is desirable to have no
135*46439007SCharles.Forsythrestrictions.  The following are the target cases we aim to handle:
136*46439007SCharles.Forsyth
137*46439007SCharles.Forsyth	-- a "transport" encoding for transporting the S-expression between
138*46439007SCharles.Forsyth	   computers.
139*46439007SCharles.Forsyth
140*46439007SCharles.Forsyth	-- a "canonical" encoding, used when signing the S-expression.
141*46439007SCharles.Forsyth
142*46439007SCharles.Forsyth	-- an "advanced" encoding used for input/output to people.
143*46439007SCharles.Forsyth
144*46439007SCharles.Forsyth	-- an "in-memory" encoding used for processing the S-expression in
145*46439007SCharles.Forsyth	   the computer.
146*46439007SCharles.Forsyth
147*46439007SCharles.ForsythThese need not be different; in this proposal the canonical encoding
148*46439007SCharles.Forsythis the same as the transport encoding, for example.  In this note we
149*46439007SCharles.Forsythpropose (related) encoding techniques for each of these uses.
150*46439007SCharles.Forsyth
151*46439007SCharles.Forsyth3. Character set
152*46439007SCharles.Forsyth
153*46439007SCharles.ForsythWe will be describing encodings of S-expressions.  Except when giving
154*46439007SCharles.Forsyth"verbatim" encodings, the character set used is limited to the following
155*46439007SCharles.Forsythcharacters in US-ASCII:
156*46439007SCharles.Forsyth	Alphabetic:	A B ... Z a b ... z
157*46439007SCharles.Forsyth	numeric:	0 1 ... 9
158*46439007SCharles.Forsyth	whitespace:	space, horizontal tab, vertical tab, form-feed
159*46439007SCharles.Forsyth			carriage-return, line-feed
160*46439007SCharles.Forsyth	The following graphics characters, which we call "pseudo-alphabetic":
161*46439007SCharles.Forsyth			- hyphen or minus
162*46439007SCharles.Forsyth			. period
163*46439007SCharles.Forsyth			/ slash
164*46439007SCharles.Forsyth			_ underscore
165*46439007SCharles.Forsyth			: colon
166*46439007SCharles.Forsyth			* asterisk
167*46439007SCharles.Forsyth			+ plus
168*46439007SCharles.Forsyth			= equal
169*46439007SCharles.Forsyth	The following graphics characters, which are "reserved punctuation":
170*46439007SCharles.Forsyth			( left parenthesis
171*46439007SCharles.Forsyth			) right parenthesis
172*46439007SCharles.Forsyth			[ left bracket
173*46439007SCharles.Forsyth			] right bracket
174*46439007SCharles.Forsyth			{ left brace
175*46439007SCharles.Forsyth			} right brace
176*46439007SCharles.Forsyth			| vertical bar
177*46439007SCharles.Forsyth			# number sign
178*46439007SCharles.Forsyth			" double quote
179*46439007SCharles.Forsyth			& ampersand
180*46439007SCharles.Forsyth			\ backslash
181*46439007SCharles.Forsyth	The following characters are unused and unavailable, except in
182*46439007SCharles.Forsyth	"verbatim" encodings:
183*46439007SCharles.Forsyth			! exclamation point
184*46439007SCharles.Forsyth			% percent
185*46439007SCharles.Forsyth			^ circumflex
186*46439007SCharles.Forsyth			~ tilde
187*46439007SCharles.Forsyth			; semicolon
188*46439007SCharles.Forsyth			' apostrophe
189*46439007SCharles.Forsyth			, comma
190*46439007SCharles.Forsyth			< less than
191*46439007SCharles.Forsyth			> greater than
192*46439007SCharles.Forsyth			? question mark
193*46439007SCharles.Forsyth
194*46439007SCharles.Forsyth
195*46439007SCharles.Forsyth4. Octet string representations
196*46439007SCharles.Forsyth
197*46439007SCharles.ForsythThis section describes in detail the ways in which an octet-string may
198*46439007SCharles.Forsythbe represented.
199*46439007SCharles.Forsyth
200*46439007SCharles.ForsythWe recall that an octet-string is any finite sequence of octets, and
201*46439007SCharles.Forsyththat the octet-string may have length zero.
202*46439007SCharles.Forsyth
203*46439007SCharles.Forsyth
204*46439007SCharles.Forsyth4.1 Verbatim representation
205*46439007SCharles.Forsyth
206*46439007SCharles.ForsythA verbatim encoding of an octet string consists of four parts:
207*46439007SCharles.Forsyth
208*46439007SCharles.Forsyth	-- the length (number of octets) of the octet-string,
209*46439007SCharles.Forsyth	   given in decimal most significant digit first, with
210*46439007SCharles.Forsyth	   no leading zeros.
211*46439007SCharles.Forsyth
212*46439007SCharles.Forsyth	-- a colon ":"
213*46439007SCharles.Forsyth
214*46439007SCharles.Forsyth	-- the octet string itself, verbatim.
215*46439007SCharles.Forsyth
216*46439007SCharles.ForsythThere are no blanks or whitespace separating the parts.  No "escape
217*46439007SCharles.Forsythsequences" are interpreted in the octet string.  This encoding is also
218*46439007SCharles.Forsythcalled a "binary" or "raw" encoding.
219*46439007SCharles.Forsyth
220*46439007SCharles.ForsythHere are some sample verbatim encodings:
221*46439007SCharles.Forsyth
222*46439007SCharles.Forsyth	3:abc
223*46439007SCharles.Forsyth	7:subject
224*46439007SCharles.Forsyth	4:::::
225*46439007SCharles.Forsyth	12:hello world!
226*46439007SCharles.Forsyth	10:abcdefghij
227*46439007SCharles.Forsyth	0:
228*46439007SCharles.Forsyth
229*46439007SCharles.Forsyth4.2 Quoted-string representation
230*46439007SCharles.Forsyth
231*46439007SCharles.ForsythThe quoted-string representation of an octet-string consists of:
232*46439007SCharles.Forsyth
233*46439007SCharles.Forsyth	-- an optional decimal length field
234*46439007SCharles.Forsyth
235*46439007SCharles.Forsyth	-- an initial double-quote (")
236*46439007SCharles.Forsyth
237*46439007SCharles.Forsyth	-- the octet string with "C" escape conventions (\n,etc)
238*46439007SCharles.Forsyth
239*46439007SCharles.Forsyth	-- a final double-quote (")
240*46439007SCharles.Forsyth
241*46439007SCharles.ForsythThe specified length is the length of the resulting string after any
242*46439007SCharles.Forsythescape sequences have been handled.  The string does not have any
243*46439007SCharles.Forsyth"terminating NULL" that C includes, and the length does not count such
244*46439007SCharles.Forsytha character.
245*46439007SCharles.Forsyth
246*46439007SCharles.ForsythThe length is optional.
247*46439007SCharles.Forsyth
248*46439007SCharles.ForsythThe escape conventions within the quoted string are as follows (these follow
249*46439007SCharles.Forsyththe "C" programming language conventions, with an extension for
250*46439007SCharles.Forsythignoring line terminators of just LF or CRLF):
251*46439007SCharles.Forsyth	\b		-- backspace
252*46439007SCharles.Forsyth	\t		-- horizontal tab
253*46439007SCharles.Forsyth	\v 		-- vertical tab
254*46439007SCharles.Forsyth	\n		-- new-line
255*46439007SCharles.Forsyth	\f		-- form-feed
256*46439007SCharles.Forsyth	\r		-- carriage-return
257*46439007SCharles.Forsyth	\"		-- double-quote
258*46439007SCharles.Forsyth	\'		-- single-quote
259*46439007SCharles.Forsyth	\\		-- back-slash
260*46439007SCharles.Forsyth	\ooo		-- character with octal value ooo (all three digits
261*46439007SCharles.Forsyth			   must be present)
262*46439007SCharles.Forsyth	\xhh		-- character with hexadecimal value hh (both digits
263*46439007SCharles.Forsyth			   must be present)
264*46439007SCharles.Forsyth	\<carriage-return> -- causes carriage-return to be ignored.
265*46439007SCharles.Forsyth	\<line-feed>       -- causes linefeed to be ignored
266*46439007SCharles.Forsyth	\<carriage-return><line-feed> -- causes CRLF to be ignored.
267*46439007SCharles.Forsyth	\<line-feed><carriage-return> -- causes LFCR to be ignored.
268*46439007SCharles.Forsyth
269*46439007SCharles.ForsythHere are some examples of quoted-string encodings:
270*46439007SCharles.Forsyth
271*46439007SCharles.Forsyth	"subject"
272*46439007SCharles.Forsyth	"hi there"
273*46439007SCharles.Forsyth	7"subject"
274*46439007SCharles.Forsyth	3"\n\n\n"
275*46439007SCharles.Forsyth	"This has\n two lines."
276*46439007SCharles.Forsyth	"This has\
277*46439007SCharles.Forsyth	one."
278*46439007SCharles.Forsyth	""
279*46439007SCharles.Forsyth
280*46439007SCharles.Forsyth4.3 Token representation
281*46439007SCharles.Forsyth
282*46439007SCharles.ForsythAn octet string that meets the following conditions may be given
283*46439007SCharles.Forsythdirectly as a "token".
284*46439007SCharles.Forsyth
285*46439007SCharles.Forsyth	-- it does not begin with a digit
286*46439007SCharles.Forsyth
287*46439007SCharles.Forsyth	-- it contains only characters that are
288*46439007SCharles.Forsyth		-- alphabetic (upper or lower case),
289*46439007SCharles.Forsyth		-- numeric, or
290*46439007SCharles.Forsyth		-- one of the eight "pseudo-alphabetic" punctuation marks:
291*46439007SCharles.Forsyth			-   .   /   _   :  *  +  =
292*46439007SCharles.Forsyth	(Note: upper and lower case are not equivalent.)
293*46439007SCharles.Forsyth	(Note: A token may begin with punctuation, including ":").
294*46439007SCharles.Forsyth
295*46439007SCharles.ForsythHere are some examples of token representations:
296*46439007SCharles.Forsyth
297*46439007SCharles.Forsyth	subject
298*46439007SCharles.Forsyth	not-before
299*46439007SCharles.Forsyth	class-of-1997
300*46439007SCharles.Forsyth	//microsoft.com/names/smith
301*46439007SCharles.Forsyth	*
302*46439007SCharles.Forsyth
303*46439007SCharles.Forsyth
304*46439007SCharles.Forsyth4.4 Hexadecimal representation
305*46439007SCharles.Forsyth
306*46439007SCharles.ForsythAn octet-string may be represented with a hexadecimal encoding consisting of:
307*46439007SCharles.Forsyth
308*46439007SCharles.Forsyth	-- an (optional) decimal length of the octet string
309*46439007SCharles.Forsyth
310*46439007SCharles.Forsyth	-- a sharp-sign "#"
311*46439007SCharles.Forsyth
312*46439007SCharles.Forsyth	-- a hexadecimal encoding of the octet string, with each octet
313*46439007SCharles.Forsyth	   represented with two hexadecimal digits, most significant
314*46439007SCharles.Forsyth	   digit first.
315*46439007SCharles.Forsyth
316*46439007SCharles.Forsyth	-- a sharp-sign "#"
317*46439007SCharles.Forsyth
318*46439007SCharles.ForsythThere may be whitespace inserted in the midst of the hexadecimal
319*46439007SCharles.Forsythencoding arbitrarily; it is ignored.  It is an error to have
320*46439007SCharles.Forsythcharacters other than whitespace and hexadecimal digits.
321*46439007SCharles.Forsyth
322*46439007SCharles.ForsythHere are some examples of hexadecimal encodings:
323*46439007SCharles.Forsyth
324*46439007SCharles.Forsyth	#616263#		-- represents "abc"
325*46439007SCharles.Forsyth	3#616263#		-- also represents "abc"
326*46439007SCharles.Forsyth	# 616
327*46439007SCharles.Forsyth	  263 #                 -- also represents "abc"
328*46439007SCharles.Forsyth
329*46439007SCharles.Forsyth
330*46439007SCharles.Forsyth4.5 Base-64 representation
331*46439007SCharles.Forsyth
332*46439007SCharles.ForsythAn octet-string may be represented in a base-64 coding consisting of:
333*46439007SCharles.Forsyth
334*46439007SCharles.Forsyth	-- an (optional) decimal length of the octet string
335*46439007SCharles.Forsyth
336*46439007SCharles.Forsyth	-- a vertical bar "|"
337*46439007SCharles.Forsyth
338*46439007SCharles.Forsyth	-- the rfc 1521 base-64 encoding of the octet string.
339*46439007SCharles.Forsyth
340*46439007SCharles.Forsyth	-- a final vertical bar "|"
341*46439007SCharles.Forsyth
342*46439007SCharles.ForsythThe base-64 encoding uses only the characters
343*46439007SCharles.Forsyth	A-Z  a-z  0-9  +  /  =
344*46439007SCharles.ForsythIt produces four characters of output for each three octets of input.
345*46439007SCharles.ForsythIf the input has one or two left-over octets of input, it produces an
346*46439007SCharles.Forsythoutput block of length four ending in two or one equals signs, respectively.
347*46439007SCharles.ForsythOutput routines compliant with this standard MUST output the equals signs
348*46439007SCharles.Forsythas specified.  Input routines MAY accept inputs where the equals signs are
349*46439007SCharles.Forsythdropped.
350*46439007SCharles.Forsyth
351*46439007SCharles.ForsythThere may be whitespace inserted in the midst of the base-64 encoding
352*46439007SCharles.Forsytharbitrarily; it is ignored.  It is an error to have characters other
353*46439007SCharles.Forsyththan whitespace and base-64 characters.
354*46439007SCharles.Forsyth
355*46439007SCharles.ForsythHere are some examples of base-64 encodings:
356*46439007SCharles.Forsyth
357*46439007SCharles.Forsyth	|YWJj|		-- represents "abc"
358*46439007SCharles.Forsyth	| Y W
359*46439007SCharles.Forsyth	  J j |		-- also represents "abc"
360*46439007SCharles.Forsyth	3|YWJj|		-- also represents "abc"
361*46439007SCharles.Forsyth	|YWJjZA==|	-- represents "abcd"
362*46439007SCharles.Forsyth	|YWJjZA|	-- also represents "abcd"
363*46439007SCharles.Forsyth
364*46439007SCharles.Forsyth
365*46439007SCharles.Forsyth4.6 Display hint
366*46439007SCharles.Forsyth
367*46439007SCharles.ForsythAny octet string may be preceded by a single "display hint".
368*46439007SCharles.Forsyth
369*46439007SCharles.ForsythThe purposes of the display hint is to provide information on how
370*46439007SCharles.Forsythto display the octet string to a user.  It has no other function.
371*46439007SCharles.ForsythMany of the MIME types work here.
372*46439007SCharles.Forsyth
373*46439007SCharles.ForsythA display-hint is an octet string surrounded by square brackets.
374*46439007SCharles.ForsythThere may be whitespace separating the octet string from the
375*46439007SCharles.Forsythsurrounding brackets.  Any of the legal formats may be used for the
376*46439007SCharles.Forsythoctet string.
377*46439007SCharles.Forsyth
378*46439007SCharles.ForsythHere are some examples of display-hints:
379*46439007SCharles.Forsyth
380*46439007SCharles.Forsyth	[image/gif]
381*46439007SCharles.Forsyth	[URI]
382*46439007SCharles.Forsyth	[charset=unicode-1-1]
383*46439007SCharles.Forsyth	[text/richtext]
384*46439007SCharles.Forsyth	[application/postscript]
385*46439007SCharles.Forsyth	[audio/basic]
386*46439007SCharles.Forsyth	["http://abc.com/display-types/funky.html"]
387*46439007SCharles.Forsyth
388*46439007SCharles.ForsythIn applications an octet-string that is untyped may be considered to have
389*46439007SCharles.Forsytha pre-specified "default" mime type.  The mime type
390*46439007SCharles.Forsyth		"text/plain; charset=iso-8859-1"
391*46439007SCharles.Forsythis the standard default.
392*46439007SCharles.Forsyth
393*46439007SCharles.Forsyth
394*46439007SCharles.Forsyth4.7  Equality of octet-strings
395*46439007SCharles.Forsyth
396*46439007SCharles.ForsythTwo octet strings are considered to be "equal" if and only if they
397*46439007SCharles.Forsythhave the same display hint and the same data octet strings.
398*46439007SCharles.Forsyth
399*46439007SCharles.ForsythNote that octet-strings are "case-sensitive"; the octet-string "abc"
400*46439007SCharles.Forsythis not equal to the octet-string "ABC".
401*46439007SCharles.Forsyth
402*46439007SCharles.ForsythAn untyped octet-string can be compared to another octet-string (typed
403*46439007SCharles.Forsythor not) by considering it as a typed octet-string with the default
404*46439007SCharles.Forsythmime-type.
405*46439007SCharles.Forsyth
406*46439007SCharles.Forsyth
407*46439007SCharles.Forsyth5. Lists
408*46439007SCharles.Forsyth
409*46439007SCharles.ForsythJust as with octet-strings, there are several ways to represent an
410*46439007SCharles.ForsythS-expression.  Whitespace may be used to separate list elements, but
411*46439007SCharles.Forsyththey are only required to separate two octet strings when otherwise
412*46439007SCharles.Forsyththe two octet strings might be interpreted as one, as when one token
413*46439007SCharles.Forsythfollows another.  Also,	whitespace may follow the initial left
414*46439007SCharles.Forsythparenthesis, or precede the final right parenthesis.
415*46439007SCharles.Forsyth
416*46439007SCharles.ForsythHere are some examples of encodings of lists:
417*46439007SCharles.Forsyth
418*46439007SCharles.Forsyth	(a b c)
419*46439007SCharles.Forsyth
420*46439007SCharles.Forsyth	( a ( b c ) ( ( d e ) ( e f ) )  )
421*46439007SCharles.Forsyth
422*46439007SCharles.Forsyth	(11:certificate(6:issuer3:bob)(7:subject5:alice))
423*46439007SCharles.Forsyth
424*46439007SCharles.Forsyth	({3Rt=} "1997" murphy 3:{XC++})
425*46439007SCharles.Forsyth
426*46439007SCharles.Forsyth
427*46439007SCharles.Forsyth6. Representation types
428*46439007SCharles.Forsyth
429*46439007SCharles.ForsythThere are three "types" of representations:
430*46439007SCharles.Forsyth
431*46439007SCharles.Forsyth	-- canonical
432*46439007SCharles.Forsyth
433*46439007SCharles.Forsyth	-- basic transport
434*46439007SCharles.Forsyth
435*46439007SCharles.Forsyth	-- advanced transport
436*46439007SCharles.Forsyth
437*46439007SCharles.ForsythThe first two MUST be supported by any implementation; the last is
438*46439007SCharles.Forsythoptional.
439*46439007SCharles.Forsyth
440*46439007SCharles.Forsyth
441*46439007SCharles.Forsyth6.1  Canonical representation
442*46439007SCharles.Forsyth
443*46439007SCharles.ForsythThis canonical representation is used for digital signature purposes,
444*46439007SCharles.Forsythtransmission, etc.  It is uniquely defined for each S-expression.  It
445*46439007SCharles.Forsythis not particularly readable, but that is not the point.  It is
446*46439007SCharles.Forsythintended to be very easy to parse, to be reasonably economical, and to
447*46439007SCharles.Forsythbe unique for any S-expression.
448*46439007SCharles.Forsyth
449*46439007SCharles.ForsythThe "canonical" form of an S-expression represents each octet-string
450*46439007SCharles.Forsythin verbatim mode, and represents each list with no blanks separating
451*46439007SCharles.Forsythelements from each other or from the surrounding parentheses.
452*46439007SCharles.Forsyth
453*46439007SCharles.ForsythHere are some examples of canonical representations of S-expressions:
454*46439007SCharles.Forsyth
455*46439007SCharles.Forsyth	(6:issuer3:bob)
456*46439007SCharles.Forsyth
457*46439007SCharles.Forsyth	(4:icon[12:image/bitmap]9:xxxxxxxxx)
458*46439007SCharles.Forsyth
459*46439007SCharles.Forsyth	(7:subject(3:ref5:alice6:mother))
460*46439007SCharles.Forsyth
461*46439007SCharles.Forsyth
462*46439007SCharles.Forsyth6.2 Basic transport representation
463*46439007SCharles.Forsyth
464*46439007SCharles.ForsythThere are two forms of the "basic transport" representation:
465*46439007SCharles.Forsyth
466*46439007SCharles.Forsyth	-- the canonical representation
467*46439007SCharles.Forsyth
468*46439007SCharles.Forsyth	-- an rfc-2045 base-64 representation of the canonical representation,
469*46439007SCharles.Forsyth           surrounded by braces.
470*46439007SCharles.Forsyth
471*46439007SCharles.ForsythThe transport mechanism is intended to provide a universal means of
472*46439007SCharles.Forsythrepresenting S-expressions for transport from one machine to another.
473*46439007SCharles.Forsyth
474*46439007SCharles.ForsythHere are some examples of an S-expression represented in basic
475*46439007SCharles.Forsythtransport mode:
476*46439007SCharles.Forsyth
477*46439007SCharles.Forsyth	(1:a1:b1:c)
478*46439007SCharles.Forsyth
479*46439007SCharles.Forsyth	{KDE6YTE6YjE6YykA}
480*46439007SCharles.Forsyth
481*46439007SCharles.Forsyth		(this is the same S-expression encoded in base-64)
482*46439007SCharles.Forsyth
483*46439007SCharles.ForsythThere is a difference between the brace notation for base-64 used here
484*46439007SCharles.Forsythand the || notation for base-64'd octet-strings described above.  Here
485*46439007SCharles.Forsyththe base-64 contents are converted to octets, and then re-scanned as
486*46439007SCharles.Forsythif they were given originally as octets.  With the || notation, the
487*46439007SCharles.Forsythcontents are just turned into an octet-string.
488*46439007SCharles.Forsyth
489*46439007SCharles.Forsyth
490*46439007SCharles.Forsyth6.3 Advanced transport representation
491*46439007SCharles.Forsyth
492*46439007SCharles.ForsythThe "advanced transport" representation is intended to provide more
493*46439007SCharles.Forsythflexible and readable notations for documentation, design, debugging,
494*46439007SCharles.Forsythand (in some cases) user interface.
495*46439007SCharles.Forsyth
496*46439007SCharles.ForsythThe advanced transport representation allows all of the representation
497*46439007SCharles.Forsythforms described above, include quoted strings, base-64 and hexadecimal
498*46439007SCharles.Forsythrepresentation of strings, tokens, representations of strings with
499*46439007SCharles.Forsythomitted lengths, and so on.
500*46439007SCharles.Forsyth
501*46439007SCharles.Forsyth
502*46439007SCharles.Forsyth7. BNF for syntax
503*46439007SCharles.Forsyth
504*46439007SCharles.ForsythWe give separate BNF's for canonical and advanced forms of S-expressions.
505*46439007SCharles.ForsythWe use the following notation:
506*46439007SCharles.Forsyth	<x>* 		means 0 or more occurrences of <x>
507*46439007SCharles.Forsyth	<x>+		means 1 or more occurrences of <x>
508*46439007SCharles.Forsyth	<x>?		means 0 or 1 occurrences of <x>
509*46439007SCharles.Forsyth	parentheses	are used for grouping, as in (<x> | <y>)*
510*46439007SCharles.Forsyth
511*46439007SCharles.ForsythFor canonical and basic transport:
512*46439007SCharles.Forsyth
513*46439007SCharles.Forsyth<sexpr>    	:: <string> | <list>
514*46439007SCharles.Forsyth<string>   	:: <display>? <simple-string> ;
515*46439007SCharles.Forsyth<simple-string>	:: <raw> ;
516*46439007SCharles.Forsyth<display>  	:: "[" <simple-string> "]" ;
517*46439007SCharles.Forsyth<raw>      	:: <decimal> ":" <bytes> ;
518*46439007SCharles.Forsyth<decimal>  	:: <decimal-digit>+ ;
519*46439007SCharles.Forsyth		-- decimal numbers should have no unnecessary leading zeros
520*46439007SCharles.Forsyth<bytes> 	-- any string of bytes, of the indicated length
521*46439007SCharles.Forsyth<list>     	:: "(" <sexp>* ")" ;
522*46439007SCharles.Forsyth<decimal-digit> :: "0" | ... | "9" ;
523*46439007SCharles.Forsyth
524*46439007SCharles.ForsythFor advanced transport:
525*46439007SCharles.Forsyth
526*46439007SCharles.Forsyth<sexpr>    	:: <string> | <list>
527*46439007SCharles.Forsyth<string>   	:: <display>? <simple-string> ;
528*46439007SCharles.Forsyth<simple-string>	:: <raw> | <token> | <base-64> | <hexadecimal> |
529*46439007SCharles.Forsyth		           <quoted-string> ;
530*46439007SCharles.Forsyth<display>  	:: "[" <simple-string> "]" ;
531*46439007SCharles.Forsyth<raw>      	:: <decimal> ":" <bytes> ;
532*46439007SCharles.Forsyth<decimal>  	:: <decimal-digit>+ ;
533*46439007SCharles.Forsyth		-- decimal numbers should have no unnecessary leading zeros
534*46439007SCharles.Forsyth<bytes> 	-- any string of bytes, of the indicated length
535*46439007SCharles.Forsyth<token>    	:: <tokenchar>+ ;
536*46439007SCharles.Forsyth<base-64>  	:: <decimal>? "|" ( <base-64-char> | <whitespace> )* "|" ;
537*46439007SCharles.Forsyth<hexadecimal>   :: "#" ( <hex-digit> | <white-space> )* "#" ;
538*46439007SCharles.Forsyth<quoted-string> :: <decimal>? <quoted-string-body>
539*46439007SCharles.Forsyth<quoted-string-body> :: "\"" <bytes> "\""
540*46439007SCharles.Forsyth<list>     	:: "(" ( <sexp> | <whitespace> )* ")" ;
541*46439007SCharles.Forsyth<whitespace> 	:: <whitespace-char>* ;
542*46439007SCharles.Forsyth<token-char>  	:: <alpha> | <decimal-digit> | <simple-punc> ;
543*46439007SCharles.Forsyth<alpha>       	:: <upper-case> | <lower-case> | <digit> ;
544*46439007SCharles.Forsyth<lower-case>  	:: "a" | ... | "z" ;
545*46439007SCharles.Forsyth<upper-case>  	:: "A" | ... | "Z" ;
546*46439007SCharles.Forsyth<decimal-digit> :: "0" | ... | "9" ;
547*46439007SCharles.Forsyth<hex-digit>     :: <decimal-digit> | "A" | ... | "F" | "a" | ... | "f" ;
548*46439007SCharles.Forsyth<simple-punc> 	:: "-" | "." | "/" | "_" | ":" | "*" | "+" | "=" ;
549*46439007SCharles.Forsyth<whitespace-char> :: " " | "\t" | "\r" | "\n" ;
550*46439007SCharles.Forsyth<base-64-char> 	:: <alpha> | <decimal-digit> | "+" | "/" | "=" ;
551*46439007SCharles.Forsyth<null>        	:: "" ;
552*46439007SCharles.Forsyth
553*46439007SCharles.Forsyth8. In-memory representations
554*46439007SCharles.Forsyth
555*46439007SCharles.ForsythFor processing, the S-expression would typically be parsed and represented
556*46439007SCharles.Forsythin memory in a more more amenable to efficient processing.  We suggest
557*46439007SCharles.Forsythtwo alternatives:
558*46439007SCharles.Forsyth
559*46439007SCharles.Forsyth	-- "list-structure"
560*46439007SCharles.Forsyth
561*46439007SCharles.Forsyth	-- "array-layout"
562*46439007SCharles.Forsyth
563*46439007SCharles.ForsythWe only sketch these here, as they are only suggestive.  The code referenced
564*46439007SCharles.Forsythbelow illustrates these styles in more detail.
565*46439007SCharles.Forsyth
566*46439007SCharles.Forsyth
567*46439007SCharles.Forsyth8.1. List-structure memory representation
568*46439007SCharles.Forsyth
569*46439007SCharles.ForsythHere there are separate records for simple-strings, strings, and
570*46439007SCharles.Forsythlists.  An S-expression of the form ("abc" "de") would require two
571*46439007SCharles.Forsythrecords for the simple strings, two for the strings, and two for the
572*46439007SCharles.Forsythlist elements.  This is a fairly conventional representation, and
573*46439007SCharles.Forsythdetails are omitted here.
574*46439007SCharles.Forsyth
575*46439007SCharles.Forsyth8.2 Array-layout memory representation
576*46439007SCharles.Forsyth
577*46439007SCharles.ForsythHere each S-expression is represented as a contiguous array of bytes.
578*46439007SCharles.ForsythThe first byte codes the "type" of the S-expression:
579*46439007SCharles.Forsyth
580*46439007SCharles.Forsyth	01 	octet-string
581*46439007SCharles.Forsyth
582*46439007SCharles.Forsyth	02	octet-string with display-hint
583*46439007SCharles.Forsyth
584*46439007SCharles.Forsyth	03	beginning of list (and 00 is used for "end of list")
585*46439007SCharles.Forsyth
586*46439007SCharles.ForsythEach of the three types is immediately followed by a k-byte integer
587*46439007SCharles.Forsythindicating the size (in bytes) of the following representation.  Here
588*46439007SCharles.Forsythk is an integer that depends on the implementation, it might be
589*46439007SCharles.Forsythanywhere from 2 to 8, but would be fixed for a given implementation;
590*46439007SCharles.Forsythit determines the size of the objects that can be handled.  The transport
591*46439007SCharles.Forsythand canonical representations are independent of the choice of k made by
592*46439007SCharles.Forsyththe implementation.
593*46439007SCharles.Forsyth
594*46439007SCharles.ForsythAlthough the length of lists are not given in the usual S-expression
595*46439007SCharles.Forsythnotations, it is easy to fill them in when parsing; when you reach a
596*46439007SCharles.Forsythright-parenthesis you know how long the list representation was, and
597*46439007SCharles.Forsythwhere to go back to fill in the missing length.
598*46439007SCharles.Forsyth
599*46439007SCharles.Forsyth
600*46439007SCharles.Forsyth8.2.1 Octet string
601*46439007SCharles.Forsyth
602*46439007SCharles.ForsythThis is represented as follows:
603*46439007SCharles.Forsyth
604*46439007SCharles.Forsyth	01 <length> <octet-string>
605*46439007SCharles.Forsyth
606*46439007SCharles.ForsythFor example (here k = 2)
607*46439007SCharles.Forsyth
608*46439007SCharles.Forsyth	01 0003 a b c
609*46439007SCharles.Forsyth
610*46439007SCharles.Forsyth8.2.2 Octet-string with display-hint
611*46439007SCharles.Forsyth
612*46439007SCharles.ForsythThis is represented as follows:
613*46439007SCharles.Forsyth
614*46439007SCharles.Forsyth	02 <length>
615*46439007SCharles.Forsyth	  01 <length> <octet-string>    /* for display-type */
616*46439007SCharles.Forsyth	  01 <length> <octet-string>    /* for octet-string */
617*46439007SCharles.Forsyth
618*46439007SCharles.ForsythFor example, the S-expression
619*46439007SCharles.Forsyth
620*46439007SCharles.Forsyth	[gif] #61626364#
621*46439007SCharles.Forsyth
622*46439007SCharles.Forsythwould be represented as (with k = 2)
623*46439007SCharles.Forsyth
624*46439007SCharles.Forsyth	02 000d
625*46439007SCharles.Forsyth	  01 0003  g  i  f
626*46439007SCharles.Forsyth	  01 0004 61 62 63 64
627*46439007SCharles.Forsyth
628*46439007SCharles.Forsyth8.2.3 List
629*46439007SCharles.Forsyth
630*46439007SCharles.ForsythThis is represented as
631*46439007SCharles.Forsyth
632*46439007SCharles.Forsyth	03 <length> <item1> <item2> <item3> ... <itemn> 00
633*46439007SCharles.Forsyth
634*46439007SCharles.ForsythFor example, the list (abc [d]ef (g)) is represented in memory as (with k=2)
635*46439007SCharles.Forsyth
636*46439007SCharles.Forsyth	03 001b
637*46439007SCharles.Forsyth	  01 0003 a b c
638*46439007SCharles.Forsyth          02 0009
639*46439007SCharles.Forsyth            01 0001 d
640*46439007SCharles.Forsyth            01 0002 e f
641*46439007SCharles.Forsyth          03 0005
642*46439007SCharles.Forsyth            01 0001 g
643*46439007SCharles.Forsyth          00
644*46439007SCharles.Forsyth        00
645*46439007SCharles.Forsyth
646*46439007SCharles.Forsyth9. Code
647*46439007SCharles.Forsyth
648*46439007SCharles.ForsythThere is code available for reading and parsing the various
649*46439007SCharles.ForsythS-expression formats proposed here.
650*46439007SCharles.Forsyth
651*46439007SCharles.ForsythSee http://theory.lcs.mit.edu/~rivest/sexp.html
652*46439007SCharles.Forsyth
653*46439007SCharles.Forsyth
654*46439007SCharles.Forsyth10. Utilization of S-expressions
655*46439007SCharles.Forsyth
656*46439007SCharles.ForsythThis note has described S-expressions in general form.  Application writers
657*46439007SCharles.Forsythmay wish to restrict their use of S-expressions in various ways.  Here are
658*46439007SCharles.Forsythsome possible restrictions that might be considered:
659*46439007SCharles.Forsyth
660*46439007SCharles.Forsyth	-- no display-hints
661*46439007SCharles.Forsyth	-- no lengths on hexadecimal, quoted-strings, or base-64 encodings
662*46439007SCharles.Forsyth	-- no empty lists
663*46439007SCharles.Forsyth	-- no empty octet-strings
664*46439007SCharles.Forsyth	-- no lists having another list as its first element
665*46439007SCharles.Forsyth	-- no base-64 or hexadecimal encodings
666*46439007SCharles.Forsyth	-- fixed limits on the size of octet-strings
667*46439007SCharles.Forsyth
668*46439007SCharles.Forsyth11. Historical note
669*46439007SCharles.Forsyth
670*46439007SCharles.ForsythThe S-expression technology described here was originally developed
671*46439007SCharles.Forsythfor ``SDSI'' (the Simple Distributed Security Infrastructure by
672*46439007SCharles.ForsythLampson and Rivest [SDSI]) in 1996, although the origins clearly date
673*46439007SCharles.Forsythback to McCarthy's LISP programming language.  It was further refined
674*46439007SCharles.Forsythand improved during the merger of SDSI and SPKI [SPKI] during the
675*46439007SCharles.Forsythfirst half of 1997.  S-expressions are similar to, but more readable
676*46439007SCharles.Forsythand flexible than, Bernstein's "net-strings" [BERN].
677*46439007SCharles.Forsyth
678*46439007SCharles.Forsyth12. References
679*46439007SCharles.Forsyth
680*46439007SCharles.Forsyth[SDSI] "A Simple Distributed Security Architecture", by
681*46439007SCharles.Forsyth        Butler Lampson, and Ronald L. Rivest
682*46439007SCharles.Forsyth	http://theory.lcs.mit.edu/~cis/sdsi.html
683*46439007SCharles.Forsyth
684*46439007SCharles.Forsyth[SPKI] <a href="http://www.clark.net/pub/cme/html/spki.html">SPKI--A
685*46439007SCharles.Forsyth       Simple Public Key Infrastructure</a>
686*46439007SCharles.Forsyth
687*46439007SCharles.Forsyth[BERN] Dan Bernstein's "net-strings"; Internet Draft
688*46439007SCharles.Forsyth       draft-bernstein-netstrings-02.txt
689*46439007SCharles.Forsyth
690*46439007SCharles.ForsythAuthor's Address
691*46439007SCharles.Forsyth
692*46439007SCharles.Forsyth      Ronald L. Rivest
693*46439007SCharles.Forsyth      Room 324, 545 Technology Square
694*46439007SCharles.Forsyth      MIT Laboratory for Computer Science
695*46439007SCharles.Forsyth      Cambridge, MA 02139
696*46439007SCharles.Forsyth
697*46439007SCharles.Forsyth      rivest@theory.lcs.mit.edu
698*46439007SCharles.Forsyth
699*46439007SCharles.Forsyth
700