xref: /plan9/sys/doc/fossil.ms (revision 39734e7ed1eb944f5e7b41936007d0d38b560d7f)
1... .FP times
2... .fp 1 R R.nomath
3... .fp 5 CW LucidaSansCW83
4.TL
5Fossil, an Archival File Server
6.AU
7Sean Quinlan
8Jim McKie
9Russ Cox
10jmk,rsc@plan9.bell-labs.com
11.AB
12This paper describes the internals and
13operation of Fossil, an archival file server built for Plan 9.
14Fossil has not yet replaced the current Plan 9 file server
15and
16.CW kfs ,
17but that is our eventual intent.
18Both fossil and this documentation are
19works in progress.  Comments on either are most welcome.
20.AE
21.de HP
22.LP
23..
24.NH 1
25Introduction
26.HP
27Fossil is an archival file server built for Plan 9.
28In a typical configuration, it maintains a traditional file system
29in a local disk partition and periodically archives snapshots of the file system
30to a Venti server.  These archives are made available through
31a file system interface.
32Fossil can also be run without a Venti server, in which case the
33snapshots (if any) occupy local disk space.
34.PP
35The bulk of this paper explains the underlying data structures:
36Venti trees, the Venti archival file system format, and finally Fossil's
37file system format.
38The end of the paper discusses the architecture of the Fossil server.
39.PP
40The presentation of the data structures is very detailed, perhaps
41too detailed for most readers.
42The intent is to record all the details necessary to make structural
43changes to the file system format.
44Feel free to jump ahead when boredom strikes.
45.NH 1
46Venti trees and directory hierarchies
47.HP
48Venti [3] is an archival block storage server.
49Once a block is stored, it can be retrieved by presenting the 20-byte
50SHA1 hash of its contents, called a
51.I score .
52Blocks on Venti have a maximum length of about 56 kilobytes,
53though in practice smaller blocks are used.
54To store a byte stream of arbitrary length, Venti uses a hash tree.
55Conceptually, the data stream is broken into fixed-size (say,
56.I dsize -byte)
57chunks, which are stored on the Venti server.
58The resulting scores are concatenated into a new pointer stream, which is
59broken into fixed size (say,
60.I psize -byte)
61chunks, which are stored on the Venti server.
62.I Psize "" (
63is different from
64.I dsize
65so that we can ensure that each pointer block holds an
66integral number of pointers.)
67This yields a new pointer stream, and so on, until there is a single block
68and finally a single score describing the entire tree.
69The resulting structure looks like:
70.PS
71.ps 8
72.vs 10
73boxht=0.1
74boxwid=0.1
75
76B0: box invis wid 1 "\f(CWVtDataType\fP"
77move right 0.1
78L0a: box wid 0.2
79move right 0.1
80L0b: box wid 0.2
81move right 0.1
82L0c: box invis wid 0.2 "..."
83move right 0.1
84
85L0d: box wid 0.2
86move right 0.1
87L0e: box wid 0.2
88move right 0.2
89L0f: box invis wid 0.2 "..."
90move right 0.2
91
92L0g: box wid 0.2
93move right 0.1
94L0h: box wid 0.2
95move right 0.1
96L0i: box invis wid 0.2 "..."
97move right 0.1
98
99L0j: box wid 0.2
100move right 0.1
101L0k: box wid 0.2
102move right 0.1
103L0l: box invis wid 0.2 "..."
104move right 0.1
105L0m: box wid 0.2
106
107define boxppddd {
108	line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
109	line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
110	X: box invis at 0.1<$1.nw,$1.ne>
111	Y: box invis at 0.1<$1.sw,$1.se>
112	line -> from 0.5<X,Y> to $2.nw
113	X: box invis at 0.3<$1.nw,$1.ne>
114	Y: box invis at 0.3<$1.sw,$1.se>
115	line -> from 0.5<X,Y> to $3.nw
116	"..." at 0.7<$1.w,$1.e>
117}
118
119define boxppdddp {
120	line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
121	line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
122	line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
123	X: box invis at 0.1<$1.nw,$1.ne>
124	Y: box invis at 0.1<$1.sw,$1.se>
125	line -> from 0.5<X,Y> to $2.nw
126	X: box invis at 0.3<$1.nw,$1.ne>
127	Y: box invis at 0.3<$1.sw,$1.se>
128	line -> from 0.5<X,Y> to $3.nw
129	"..." at 0.6<$1.w,$1.e>
130	X: box invis at 0.9<$1.nw,$1.ne>
131	Y: box invis at 0.9<$1.sw,$1.se>
132	line -> from 0.5<X,Y> to $4.nw
133}
134
135define boxpdddp {
136	line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
137	line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
138	X: box invis at 0.1<$1.nw,$1.ne>
139	Y: box invis at 0.1<$1.sw,$1.se>
140	line -> from 0.5<X,Y> to $2.nw
141	"..." at 0.5<$1.w,$1.e>
142	X: box invis at 0.9<$1.nw,$1.ne>
143	Y: box invis at 0.9<$1.sw,$1.se>
144	line -> from 0.5<X,Y> to $3.nw
145}
146
147bhd=0.4
148L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd)
149boxppddd(L1abc, L0a, L0b)
150L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd)
151boxppddd(L1def, L0d, L0e)
152L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd)
153boxppddd(L1ghi, L0g, L0h)
154L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd)
155boxppdddp(L1jklm, L0j, L0k, L0m)
156B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd)
157
158L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd)
159boxppddd(L2abcdef, L1abc, L1def)
160L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd)
161boxpdddp(L2ghijklm, L1ghi, L1jklm)
162B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd)
163
164L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd)
165boxpdddp(L3atom, L2abcdef, L2ghijklm)
166B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd)
167.PE
168.LP
169The leaves are the original data stream.  Those blocks have type
170.CW VtDataType .
171The first pointer stream has type
172.CW VtPointerType0 ,
173the next has type
174.CW VtPointerType1 ,
175and so on.
176The figure ends with a single block of type
177.CW VtPointerType2 ,
178but in general trees can have height up to
179.CW VtPointerType6 .
180For a
181.I dsize
182of 8192 bytes
183and
184.I psize
185of 8180 bytes (409 pointers),
186this gives a maximum stream size of approximately 10 zettabytes
187(2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes).
188.PP
189Data blocks are truncated to remove trailing runs of zeros before
190storage to Venti; they are zero-filled back to
191.I dsize
192bytes after retrieval from Venti.
193Similarly, trailing runs of pointers to zero-length blocks are
194removed from and added back to pointer blocks.
195These simple rules happen to make it particularly efficient to store
196large runs of zeros, as might occur in a data stream with ``holes:''
197the zero-length block itself can be interpreted as a tree of any
198depth encoding an all-zero data stream.
199.PP
200Reconstructing the data stream requires the score and type of the
201topmost block in the tree, the data chunk size, the pointer chunk size,
202and the data stream size.
203(From the data stream size and the chunk sizes we could derive the
204depth of the tree and thus the type of the topmost block, but it is convenient
205to allow trees that are deeper than necessary.)
206This information is kept in a 40-byte structure called a
207.CW VtEntry :
208.P1
209VtEntry:
210.ta +\w'    'u +\w'            'u
211	gen[4]	\fRgeneration number\fP
212	psize[2]	\fRsize of pointer blocks\fP
213	dsize[2]	\fRsize of data blocks\fP
214	flags[1]
215	zero[5]
216	size[6]	\fRlength of file\fP
217	score[20]	\fRscore of root block in tree\fP
218.P2
219(In this notation,
220.CW name[sz]
221indicates a
222.CW sz -byte
223field called
224.CW name .
225Integers are stored in big-endian order.
226.CW Size
227really is a 48-bit field.)
228.CW Flags
229is made up of the following bit fields.
230.P1
231.ta +\w'      'u +\w'                      'u
2320x01	VtEntryActive	\fRentry is allocated\fP
2330x02	VtEntryDir	\fRentry describes a Venti directory (q.v.)\fP
2340x1C	VtEntryDepthMask	\fRmask for tree depth\fP
2350x20	VtEntryLocal	\fRreserved (q.v.)\fP
236.P2
237.LP
238The depth of the described tree is stored in the 3 bits indicated:
239a tree with a topmost node of type
240.CW VtPointerType3
241has depth 4.
242.PP
243With
244.CW VtEntry
245we can build more complicated data structures,
246ones with multiple or nested data streams.
247A data stream consisting of
248.CW VtEntry
249structures is called a Venti directory.
250It is identical in structure to the Venti data stream
251we described earlier except that the bottom-level type is
252.CW VtDirType ,
253and
254the
255.CW VtEntry
256describing a Venti directory has the
257.CW VtEntryDir
258flag bit set.
259The
260.I dsize
261for a Venti directory
262is a multiple of 40 so that each data chunk holds
263an integer number of
264.CW VtEntry
265structures.
266By analogy with Venti directories,
267we call the original data stream a
268Venti file.
269Note that Venti files are assumed
270.I not
271to contain pointers to other Venti blocks.
272The only pointers to Venti blocks occur in
273.CW VtEntry
274structures in
275Venti directories
276(and in the internal hash tree structure of the
277individual files and directories).
278Note also that these directories are nothing more than pointer lists.
279In particular, there are no names or metadata like in a file system.
280.PP
281To make it easier to pass hierarchies between applications,
282the root of a hierarchy is described in a 300-byte structure
283called a
284.CW VtRoot :
285.P1
286VtRoot:
287.ta +\w'    'u +\w'                'u
288	version[2]	\f(CW2\fP
289	name[128]	\fRname of structure (just a comment)\fP
290	type[128]	\fRstring describing structure (\f(CWvac\fR)\f(CW
291	score[20]	\fRpointer to \f(CWVtDirType\fP block\f(CW
292	blockSize[2]	\fRmaximum block size in structure\fP
293	prev[20]	\fRprevious \f(CWVtRoot\fP in chain, if any\fP
294.P2
295.LP
296This structure is stored to Venti and its score is passed
297between applications, typically in the form
298``\fItype\f(CW:\fIrootscore\fR,''
299where
300.I type
301is the type field from the
302.CW VtRoot
303structure, and
304.I rootscore
305is the score of the
306.CW VtRoot
307block.
308.CW VtRoot
309structures can be chained together using the
310.I prev
311field to encode an archival history
312of the data structure.
313.PP
314For example, a small Venti hierarchy might look like:
315.PS
316.ps 8
317.vs 10
318boxwid=0.1
319boxht=0.1
320f=0.9
321mb=0.16
322
323VtRoot: [
324	right
325	B1: box
326	move right 0.1
327	"\f(CWVtRoot\fP" ljust
328]
329
330Root: [
331	right
332	B1: box fill f
333	B2: box fill f
334	B3: box fill f
335	move right 0.1
336] with .nw at VtRoot.sw+(0.2,-.1)
337Level1: [
338	RootMeta: [
339		box wid mb
340	]
341	MetaSource: [
342		right
343		B1: box wid 5*mb
344	] with .nw at RootMeta.sw+(0,-.1)
345
346	Source: [
347		right
348		B1: box fill f
349		B2: box fill f
350		B3: box fill f
351		B4: box fill f
352		B5: box fill f
353		B6: box fill f
354		B7: box fill f
355		B8: box fill f
356	] with .nw at MetaSource.sw+(0,-.1)
357	SB1: box invis at Source.B1
358	SB2: box invis at Source.B2
359	SB3: box invis at Source.B3
360] with .nw at Root.sw+(0.4,-.1)
361Level2: [
362	MetaSource: [
363		right
364		B1: box wid 5*mb
365	]
366	Source: [
367		right
368		B1: box fill f
369		B2: box fill f
370		B3: box fill f
371		B4: box fill f
372		B5: box fill f
373		B6: box fill f
374		B7: box fill f
375		B8: box fill f
376	] with .nw at MetaSource.sw+(0,-.1)
377	File: box wid 0.8 with .nw at Source.sw+(0,-.1)
378] with .nw at Level1.sw+(0.6,-.1)
379
380line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w
381line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w
382line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w
383line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w
384
385line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w
386line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w
387line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w
388
389[
390	KEY: box wid 1.5 invis "Key"
391	line from KEY.sw to KEY.se
392	k = -.1
393	kk=0.5
394	A: [
395		box wid 4*boxwid
396		"Venti file" ljust with .w at last box .w+(kk,0)
397	] with .nw at KEY.sw+(0,2*k)
398	B: [
399		box fill f
400		"Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0)
401	] with .nw at A.sw+(0,k)
402	C: [
403		right
404		CC: box fill f
405		box fill f
406		box fill f
407		box fill f
408		"Venti directory" ljust with .w at CC.w+(kk,0)
409	] with .nw at B.sw+(0,k)
410	D: [
411		line -> right 3*boxwid
412		"Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
413	] with .nw at C.sw+(0,k)
414] with .nw at VtRoot.nw+(3,0)
415.PE
416.LP
417Venti files are shown as white boxes, while directories are shown
418as shaded boxes.  Each shaded square represents a
419.CW VtEntry .
420Arrows represent pointers from
421.CW VtEntry
422structures to other
423Venti files or directories.
424.PP
425The hierarchical structure provided by Venti files and directories
426can be used as the base for more complicated data structures.
427Because this structure captures all the information
428about pointers to other blocks, tools written to traverse
429Venti hierarchies can traverse the more complicated
430data structures as well.
431For example,
432.I venti/copy
433(see
434.I ventiaux (8))
435copies a Venti hierarchy from one Venti server to another,
436given the root
437.CW VtEntry .
438Because the traditional file system described in later sections is
439layered on a Venti hierarchy,
440.I venti/copy
441can copy it without fully understanding its structure.
442.NH 1
443Vac file system format
444.HP
445The Venti archive format
446.I vac
447builds a traditional file system using a Venti hierarchy.
448Each vac file is implemented as a Venti file;
449each vac directory is implemented as a Venti
450directory and a Venti file to provide traditional file system metadata.
451The metadata is stored in a structure called a
452.CW DirEntry :
453.P1
454DirEntry:
455.ta +\w'    'u +\w'            'u
456	magic[4]	\f(CW0x1c4d9072\fP (DirMagic)\fP
457	version[2]	\f(CW9\fP
458	elem[s]	\fRname (final path element only)\fP
459	entry[4]	\fRentry number for Venti file or directory\fP
460	gen[4]	\fRgeneration number\fP
461	mentry[4]	\fRentry number for Venti file holding metadata\fP
462	mgen[4]	\fRgeneration number\fP
463	qid[8]	\fRunique file serial number\fP
464	uid[s]	\fRowner\fP
465	gid[s]	\fRgroup\fP
466	mid[s]	\fRlast modified by\fP
467	mtime[4]	\fRlast modification time\fP
468	ctime[4]	\fRcreation time\fP
469	atime[4]	\fRlast access time\fP
470	mode[4]	\fRmode bits\fP
471.P2
472The notation
473.CW name[s]
474denotes a string stored as a two-byte length
475and then that many bytes.
476The above describes Version 9 of the
477.CW DirEntry
478format.  Versions 7 and 8 are very similar; they can be
479read by the current
480.I vac
481source code but are not written.
482Earlier versions were not widespread.
483A
484.CW DirEntry
485may be followed by optional extension sections, though none
486are currently used.
487The
488.CW mode
489bits include bits commonly used by
490Unix and Windows, in addition to those used by Plan 9.
491.PP
492The
493.CW entry
494field is an index into the parallel Venti directory.
495The
496.CW gen
497field must match the
498.CW gen
499field in the corresponding
500.CW VtEntry
501in the directory;
502it is used to detect
503stale indices.
504Similarly,
505.CW mentry
506and
507.CW mgen
508are the index and generation number
509for the metadata Venti file,
510if the
511.CW DirEntry
512describes a vac directory.
513.PP
514The relation between Venti files and directories and
515vac files and directories can be seen in this figure:
516.PS
517.ps 8
518.vs 10
519boxwid=0.1
520boxht=0.1
521f=0.9
522mb=0.16
523
524VtRoot: [
525	right
526	B1: box
527	move right 0.1
528	"\f(CWVtRoot\fP" ljust
529]
530
531SuperRoot: [
532	right
533	B1: box fill f
534	move right 0.1
535	"fs root block" ljust
536] with .nw at VtRoot.sw + (0.2, -.2)
537Root: [
538	right
539	B1: box fill f
540	B2: box fill f
541	B3: box fill f
542	move right 0.1
543	"root directory info block" ljust
544] with .nw at SuperRoot.sw+(0.2, -.2)
545Level1: [
546	RootMeta: [
547		box wid mb
548		move right 0.1
549		"root metadata" ljust
550	]
551	MetaSource: [
552		right
553		B1: box wid mb
554		B2: box wid mb
555		B3: box wid mb
556		B4: box wid mb
557		B5: box wid mb
558	] with .nw at RootMeta.sw+(0,-.2)
559	MB1: box wid mb invis at MetaSource.B1
560	MB2: box wid mb invis at MetaSource.B2
561	MB3: box wid mb invis at MetaSource.B3
562	MB4: box wid mb invis at MetaSource.B4
563	MB5: box wid mb invis at MetaSource.B5
564
565	Source: [
566		right
567		B1: box fill f
568		B2: box fill f
569		B3: box fill f
570		B4: box fill f
571		B5: box fill f
572		B6: box fill f
573		B7: box fill f
574		B8: box fill f
575	] with .nw at MetaSource.sw+(0,-.1)
576	SB1: box invis at Source.B1
577	SB2: box invis at Source.B2
578	SB3: box invis at Source.B3
579	SB4: box invis at Source.B4
580	SB5: box invis at Source.B5
581	SB6: box invis at Source.B6
582	SB7: box invis at Source.B7
583	SB8: box invis at Source.B8
584] with .nw at Root.sw+(0.4,-.2)
585Level2: [
586	MetaSource: [
587		right
588		B1: box wid mb
589		B2: box wid mb
590		B3: box wid mb
591		B4: box wid mb
592		B5: box wid mb
593	]
594	Source: [
595		right
596		B1: box fill f
597		B2: box fill f
598		B3: box fill f
599		B4: box fill f
600		B5: box fill f
601		B6: box fill f
602		B7: box fill f
603		B8: box fill f
604	] with .nw at MetaSource.sw+(0,-.1)
605	File: box wid 0.8 with .nw at Source.sw+(0,-.2)
606] with .nw at Level1.sw+(0.6,-.2)
607
608line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w
609line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w
610line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w
611line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w
612line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w
613
614line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w
615line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w
616line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w
617
618arrowwid = arrowwid/2
619arrowht = arrowht/2
620line -> from Level1.MB1 to Level1.SB1.n
621line -> from Level1.MB2 to Level1.SB2.n
622line -> from Level1.MB2 to Level1.SB3.n
623line -> from Level1.MB4 to Level1.SB7.n
624line -> from Level1.MB5 to Level1.SB5.n
625arrowwid = arrowwid * 2
626arrowht = arrowht * 2
627
628box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
629box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
630box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2
631
632[
633	KEY: box wid 1.5 invis "Key"
634	line from KEY.sw to KEY.se
635	k = -.1
636	kk=0.5
637	A: [
638		box wid 4*boxwid
639		"Venti file" ljust with .w at last box .w+(kk,0)
640	] with .nw at KEY.sw+(0,2*k)
641	B: [
642		box fill f
643		"Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0)
644	] with .nw at A.sw+(0,k)
645	C: [
646		right
647		CC: box fill f
648		box fill f
649		box fill f
650		box fill f
651		"Venti directory" ljust with .w at CC.w+(kk,0)
652	] with .nw at B.sw+(0,k)
653	D: [
654		line -> right 3*boxwid
655		"Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
656	] with .nw at C.sw+(0,k)
657	DD: [
658		box dotted wid 4*boxwid
659		"Vac file" ljust with .w at last box .w+(kk,0)
660	] with .nw at D.sw+(0,k)
661	E: [
662		box wid mb
663		"Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0)
664	] with .nw at DD.sw+(0,k)
665	G: [
666		box dashed wid 4*boxwid
667		"Vac directory" ljust with .w at last box .w+(kk,0)
668	] with .nw at E.sw+(0,k)
669	H: [
670		arrowwid = arrowwid/2
671		arrowht = arrowht/2
672		line -> right 1.5*boxwid
673		"Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0)
674		arrowwid = arrowwid * 2
675		arrowht = arrowht * 2
676	] with .nw at G.sw+(0,k)
677] with .nw at VtRoot.nw+(3,0)
678.PE
679.LP
680In reality, the story is slightly more complicated.
681The metadata file in a Vac directory
682is not just the concatenation of
683.CW DirEntry
684structures.
685Instead, it is the concatenation of
686.CW MetaBlocks .
687A
688.CW MetaBlock
689contains some number of
690.CW DirEntry
691structures along with a sorted index to make it easy
692to look for a particular
693.CW DirEntry
694by its
695.CW elem
696field.
697The details are in the source code.
698.PP
699As shown in the diagram,
700the root directory of the file system is summarized by
701three
702.CW VtEntry
703structures describing
704the Venti directory for the children of the root,
705the Venti file for the metadata describing the children of the root,
706and a Venti file holding metadata for the root directory itself.
707These
708.CW VtEntry
709structures are placed in a Venti directory of their own,
710described by the single
711.CW VtEntry
712in the
713root block.
714.NH 1
715Fossil file system format
716.HP
717Fossil uses the vac format, with some small changes.
718The changes only affect the data on the local disk; the data
719archived to Venti is exactly in vac format.
720.PP
721Blocks stored on local disk may contain scores pointing at local disk
722blocks or at Venti blocks.
723Local block addresses are stored as 20-byte scores in which the first 16 bytes
724are all zero and the last 4 bytes specify a block number in the disk.
725Before a block is archived, all the
726blocks it points to must be archived, and the local scores in the block
727must be changed to Venti scores.
728Using block addresses rather than content hashes for local data
729makes the local file system easier to manage: if a local block's contents
730change, the pointer to the block does not need to change.
731.NH 2
732Snapshots
733.HP
734Fossil is an archival file server.
735It takes periodic snapshots of the file system,
736which are made accessible through the file system.
737Specifically, the active file system is presented in
738.CW /active .
739Ephemeral snapshots (those that are kept on local disk and eventually deleted)
740are presented in
741\f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR,
742where
743.I yyyy
744is the full year,
745.I mm
746is the month number,
747.I dd
748is the day number,
749.I hh
750is the hour,
751and
752.I mm
753is the minute.
754Archival snapshots (those that are archived to Venti and persist forever)
755are presented in
756\f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR,
757where
758.I yyyy ,
759.I mm ,
760and
761.I dd
762are year, month, and day as before,
763and
764.I s
765is a sequence number if more than one
766archival snapshot is done in a day.
767For the first snapshot,
768.I s
769is null.
770For the subsequent snapshots,
771.I s
772is
773.CW .1 ,
774.CW .2 ,
775.CW .3 ,
776etc.
777.PP
778To implement the snapshots, the file server maintains a
779current
780.I epoch
781for the active file system.
782Each local block has a label that records, among other things,
783the epoch in which the block was allocated.
784If a block was allocated in an epoch earlier than the current one,
785it is immutable and treated as copy-on-write.
786Taking a snapshot can be accomplished by
787recording the address of the current root block and then
788incrementing the epoch number.
789Notice that the copy-on-write method makes
790snapshots both time efficient and space efficient.
791The only time cost is waiting for all current file system
792requests to finish and then incrementing a counter.
793After a snapshot, blocks only get copied when they are
794next modified, so the per-snapshot
795space requirement is proportional
796to the amount of new data rather than the total
797size of the file system.
798.PP
799The blocks in the archival snapshots are moved to Venti,
800but the blocks in the ephemeral snapshots take up space
801in the local disk file.
802To allow reclamation of this disk space, the file system
803maintains a
804.I low
805.I epoch ,
806which is the epoch of the earliest ephemeral snapshot
807still available.
808Fossil only allows access to snapshots with epoch numbers
809between the
810low epoch and the current epoch
811(also called the high epoch).
812Incrementing the low epoch thus makes old
813snapshots inaccessible.
814The space required to store those snapshots can then
815be reclaimed, as described below.
816.NH 2
817Local blocks
818.HP
819The bulk of the local disk file is the local blocks.
820Each block has a 14-byte label associated with it, of the format:
821.P1
822Label:
823.ta +\w'    'u +\w'                'u
824	state[1]	\fRblock state\fP
825	type[1]	\fRblock type\fP
826	epoch[4]	\fRallocation epoch\fP
827	epochClose[4]	\fRclose epoch\fP
828	tag[4]	\fRrandom tag\fP
829.P2
830.LP
831The
832.CW type
833is an analogue of the block types described earlier,
834though different names are used, to distinguish between
835pointers blocks in a hash tree for a data stream
836and pointer blocks for a directory stream.
837The
838.CW epoch
839was mentioned in the last section.
840The other fields are explained below.
841.PP
842There are two distinguished blocks states
843.CW BsFree
844.CW 0x00 ) (
845and
846.CW BsBad
847.CW 0xFF ), (
848which mark blocks that are available for allocation
849and blocks that are bad and should be avoided.
850If
851.CW state
852is not one of these values, it is a bitwise
853.I or ' `
854of the following flags:
855.P1
856.ta +\w'      'u +\w'                'u
8570x01	BsAlloc	\fRblock is in use\fP
8580x02	BsCopied	\fRblock has been copied\fP
8590x04	BsVenti	\fRblock has been stored on Venti\fP
8600x08	BsClosed	\fRblock has been unlinked from active file system\fP
861.P2
862.LP
863The flags are explained as they arise in the discussions below.
864.PP
865It is convenient to store some extra fields in the
866.CW VtEntry
867structure when it describes a Venti file or directory
868stored on local disk.
869Specifically, we set the
870.CW VtEntryLocal
871flag bit
872and then use the bytes 7-16 of the score (which would
873otherwise be zero, since it is a local score) to hold these fields:
874.P1
875.ta +\w'    'u +\w'                'u
876	archive[1]	\fRboolean: this is an archival snapshot\fP
877	snap[4]	\fRepoch number if root of snapshot\fP
878	tag[4]	\fRrandom tag\fP
879.P2
880.LP
881The extended
882.CW VtEntry
883structure is called an
884.CW Entry .
885The
886.CW tag
887field
888in the
889.CW Label
890and the
891.CW Entry
892is used to identify dangling pointers or other file system corruption:
893all the local blocks in a hash tree must
894have tags matching the tag in the
895.CW Entry .
896If this
897.CW Entry
898points at the root of a snapshot,
899the
900.CW snap
901field is the epoch of the snapshot.
902If the snapshot is intended to be archived to Venti,
903the
904.CW archive
905field is non-zero.
906.NH 2
907Block reclamation
908.HP
909The blocks in the active file system form a tree: each
910block has only one parent.
911Once a copy-on-write block
912.I b
913is replaced by its copy, it is no longer
914needed by the active file system.
915At this point,
916.I b
917is unlinked from the active file system.
918We say that
919.I b
920is now
921.I closed :
922it is needed only for snapshots.
923When a block is closed, the
924.CW BsClosed
925bit is set in its state, and the current epoch (called the block's closing epoch)
926is stored in the
927.CW epochClose
928label field.
929(Open blocks have an
930.CW epochClose
931of
932.CW ~0 ).
933.PP
934A block is referenced by snapshots with epochs
935between the block's allocation epoch and its closing epoch.
936Once the file system's low epoch grows to be greater than or equal to the block's
937closing epoch, the block is no longer needed for any snapshots
938and can be reused.
939.PP
940In a typical configuration, where nightly archival snapshots
941are taken and written to Venti, it is desirable to reclaim
942the space occupied by now-archived blocks if possible.
943To do this, Fossil keeps track of whether the pointers
944in each block are unique to that block.
945When a block
946.I bb
947is allocated, a pointer to
948.I bb
949is written into exactly one active block (say,
950.I b ).
951In the absence of snapshots, the pointer to
952.I bb
953will remain unique to
954.I b ,
955so that if the pointer is zeroed,
956.I bb
957can be immediately reused.
958Snapshots complicate this invariant:
959when
960.I b
961is copied-on-write, all its pointers
962are no longer unique to it.
963At time of the copy, the
964.CW BsCopied
965state bit in the block's label
966is set to note the duplication of the pointers contained within.
967.NH 2
968Disk layout
969.HP
970The file system header describes the file system layout and has this format:
971.P1
972.ta +\w'    'u +\w'                'u
973Header:
974	magic[4]	\fR0x3776AE89 (HeaderMagic)\fP
975	version[2]	\fR1 (HeaderVersion)\fP
976	blockSize[2]	\fIfile system block size\fP
977	super[4]	\fRblock offset of super block\fP
978	label[4]	\fRblock offset of labels\fP
979	data[4]	\fRdata blocks\fP
980	end[4]	\fRend of file system\fP
981.P2
982.LP
983The corresponding file system layout is:
984.PS
985.ps 8
986.vs 9
987boxwid=0.75
988boxht=0.15
989Empty: box "empty" ht 0.25
990Header: box "header" with .n at Empty.s
991Empty2: box "empty" with .n at Header.s
992Super: box "super block" with .n at Empty2.s
993Label: box "label" "blocks" with .n at Super.s ht 0.25
994Data: box "data" "blocks" with .n at Label.s ht 0.3
995"  0" ljust at Empty.ne
996"  128kB" ljust at Header.ne
997"  \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne
998"  \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne
999"  \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne
1000"  \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se
1001"" at (-1,0)
1002"" at (6,0)
1003.PE
1004.LP
1005The numbers to the right of the blocks are byte offsets
1006of the boundaries.
1007.LP
1008The super block describes the file system itself and looks like:
1009.P1
1010.ta +\w'    'u +\w'                'u
1011Super:
1012	magic[4]	\fR0x2340A3B1 (SuperMagic)\fP
1013	version[2]	\fR1 (SuperVersion)\fP
1014	epochLow[4]	\fRfile system low epoch\fP
1015	epochHigh[4]	\fRfile system high (active) epoch\fP
1016	qid[8]	\fRnext qid to allocate\fP
1017	active[4]	\fRdata block number: root of active file system\fP
1018	next[4]	\fRdata block number: root of next file system to archive\fP
1019	current[4]	\fRdata block number: root of file system currently being archived\fP
1020	last[20]	\fRVenti score of last successful archive\fP
1021	name[128]	\fRname of file system (just a comment)\fP
1022.P2
1023.LP
1024.NH 1
1025Fossil server
1026.HP
1027The Fossil server is a user-space program that runs on a standard Plan 9 kernel.
1028.NH 2
1029Process structure
1030.PP
1031The file server is structured as a set of processes synchronizing
1032mostly through message passing along queues.
1033The processes are given names, which can be seen in the output of
1034.CW ps
1035.CW -a .
1036.PP
1037.CW Listen
1038processes announce on various network addresses.
1039A
1040.CW con
1041process handles each incoming connection, reading 9P requests
1042and adding them to a central message queue.
1043.CW Msg
1044processes remove 9P requests from the queue,
1045handle them, and write the responses to the appropriate
1046file descriptors.
1047.PP
1048The
1049.CW disk
1050process handles disk I/O requests made by the other processes.
1051The
1052.CW flush
1053process writes dirty blocks from the in-memory block cache to disk.
1054The
1055.CW unlink
1056process frees previously linked blocks once the blocks that point at them
1057have been written to disk.
1058.PP
1059A
1060.CW consI
1061reads from each console file (typically a pipe posted in
1062.CW /srv ),
1063adding the typed characters to the input queue.
1064The
1065.CW cons
1066process echoes input and runs the commands, saving
1067output in a ring buffer.
1068Because there is only one
1069.CW cons
1070process, only one console command may be executing at a time.
1071A
1072.CW consO
1073process copies this ring buffer to each console file.
1074.PP
1075The
1076.CW periodic
1077process runs periodic events, like
1078flushing the root metadata to disk or
1079taking snapshots of the file system.
1080.NH 2
1081Block cache
1082.HP
1083Fossil maintains an in-memory block cache which
1084holds both local disk blocks and Venti blocks.
1085Cache eviction follows a least recently used policy.
1086Dirty blocks are restricted to at most half the cache.
1087This can be changed by editing
1088.CW DirtyPercentage
1089in
1090.CW dat.h .
1091.PP
1092The block cache uses soft updates [1] to ensure that the on-disk
1093file system is always self-consistent.
1094Thus there is no
1095.I halt
1096console command
1097and no need to check a file system
1098that was shut down without halting.
1099.NH 2
1100Archiving
1101.HP
1102A background process writes blocks in archival snapshots to Venti.
1103Although
1104.CW /archive/\fIyyyy\fP/\fImmdds\fR
1105is a copy of only
1106.CW /active
1107at the time of the snapshot,
1108the archival process archives the
1109entire file tree rather than just
1110the subtree rooted at
1111.CW /active .
1112The snapshots
1113.CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm
1114are stored as empty directories.
1115Once all the blocks have been archived,
1116a
1117.CW VtRoot
1118header for the file system is archived.
1119The score of that header is recorded in
1120.CW super.score
1121and also printed on the file server console.
1122The score can used by
1123.I flfmt
1124to restore a file system (see
1125.I fossil (4)).
1126.NH 2
1127Contrast with the old file server
1128.HP
1129The most obvious difference between Fossil and the
1130old Plan 9 file server [2] is that Fossil uses a Venti server as
1131its archival storage in place of a WORM juke box.
1132There are a few other architectural differences to be
1133aware of.
1134.PP
1135Fossil is a user-level program run on a standard kernel.
1136.PP
1137Fossil does not have any way to concatenate, stripe, or
1138mirror disk files.  For functionality similar to the old file server's
1139configuration strings, use the experimental file stack device
1140(see
1141.I fs (3)).
1142.PP
1143Fossil speaks only 9P2000.  Old 9P (aka 9P1) is not supported.
1144.PP
1145... XXX words about converting an old file system to fossil?
1146.NH 1
1147References
1148.LP
1149[1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules,
1150and Yale N. Patt.
1151``Soft Updates: A Solution to the Metadata Update Problem
1152in File Systems,''
1153.I "ACM Transactions on Computer Systems" ,
1154Vol 18., No. 2, May 2000, pp. 127\-153.
1155.LP
1156[2] Sean Quinlan, ``A Cached WORM File System,''
1157.I "Software\(emPractice and Experience" ,
1158Vol 21., No 12., December 1991, pp. 1289\-1299.
1159.LP
1160[3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,''
1161.I "Usenix Conference on File and Storage Technologies" ,
11622002.
1163