xref: /plan9/sys/doc/fossil.ms (revision 8e42ac6b098d2518418cf14bffe2a43f2379a084)
1.HTML "Fossil, an Archival File Server
2... .FP times
3... .fp 1 R R.nomath
4... .fp 5 CW LucidaSansCW83
5.TL
6Fossil, an Archival File Server
7.AU
8Sean Quinlan
9Jim McKie
10Russ Cox
11jmk,rsc@plan9.bell-labs.com
12.AB
13This paper describes the internals and
14operation of Fossil, an archival file server built for Plan 9.
15Fossil has not yet replaced the current Plan 9 file server
16and
17.CW kfs ,
18but that is our eventual intent.
19Both fossil and this documentation are
20works in progress.  Comments on either are most welcome.
21.AE
22.de HP
23.LP
24..
25.NH 1
26Introduction
27.HP
28Fossil is an archival file server built for Plan 9.
29In a typical configuration, it maintains a traditional file system
30in a local disk partition and periodically archives snapshots of the file system
31to a Venti server.  These archives are made available through
32a file system interface.
33Fossil can also be run without a Venti server, in which case the
34snapshots (if any) occupy local disk space.
35.PP
36The bulk of this paper explains the underlying data structures:
37Venti trees, the Venti archival file system format, and finally Fossil's
38file system format.
39The end of the paper discusses the architecture of the Fossil server.
40.PP
41The presentation of the data structures is very detailed, perhaps
42too detailed for most readers.
43The intent is to record all the details necessary to make structural
44changes to the file system format.
45Feel free to jump ahead when boredom strikes.
46.NH 1
47Venti trees and directory hierarchies
48.HP
49Venti [3] is an archival block storage server.
50Once a block is stored, it can be retrieved by presenting the 20-byte
51SHA1 hash of its contents, called a
52.I score .
53Blocks on Venti have a maximum length of about 56 kilobytes,
54though in practice smaller blocks are used.
55To store a byte stream of arbitrary length, Venti uses a hash tree.
56Conceptually, the data stream is broken into fixed-size (say,
57.I dsize -byte)
58chunks, which are stored on the Venti server.
59The resulting scores are concatenated into a new pointer stream, which is
60broken into fixed size (say,
61.I psize -byte)
62chunks, which are stored on the Venti server.
63.I Psize "" (
64is different from
65.I dsize
66so that we can ensure that each pointer block holds an
67integral number of pointers.)
68This yields a new pointer stream, and so on, until there is a single block
69and finally a single score describing the entire tree.
70The resulting structure looks like:
71.PS
72.ps 8
73.vs 10
74boxht=0.1
75boxwid=0.1
76
77B0: box invis wid 1 "\f(CWVtDataType\fP"
78move right 0.1
79L0a: box wid 0.2
80move right 0.1
81L0b: box wid 0.2
82move right 0.1
83L0c: box invis wid 0.2 "..."
84move right 0.1
85
86L0d: box wid 0.2
87move right 0.1
88L0e: box wid 0.2
89move right 0.2
90L0f: box invis wid 0.2 "..."
91move right 0.2
92
93L0g: box wid 0.2
94move right 0.1
95L0h: box wid 0.2
96move right 0.1
97L0i: box invis wid 0.2 "..."
98move right 0.1
99
100L0j: box wid 0.2
101move right 0.1
102L0k: box wid 0.2
103move right 0.1
104L0l: box invis wid 0.2 "..."
105move right 0.1
106L0m: box wid 0.2
107
108define boxppddd {
109	line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
110	line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
111	X: box invis at 0.1<$1.nw,$1.ne>
112	Y: box invis at 0.1<$1.sw,$1.se>
113	line -> from 0.5<X,Y> to $2.nw
114	X: box invis at 0.3<$1.nw,$1.ne>
115	Y: box invis at 0.3<$1.sw,$1.se>
116	line -> from 0.5<X,Y> to $3.nw
117	"..." at 0.7<$1.w,$1.e>
118}
119
120define boxppdddp {
121	line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
122	line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
123	line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
124	X: box invis at 0.1<$1.nw,$1.ne>
125	Y: box invis at 0.1<$1.sw,$1.se>
126	line -> from 0.5<X,Y> to $2.nw
127	X: box invis at 0.3<$1.nw,$1.ne>
128	Y: box invis at 0.3<$1.sw,$1.se>
129	line -> from 0.5<X,Y> to $3.nw
130	"..." at 0.6<$1.w,$1.e>
131	X: box invis at 0.9<$1.nw,$1.ne>
132	Y: box invis at 0.9<$1.sw,$1.se>
133	line -> from 0.5<X,Y> to $4.nw
134}
135
136define boxpdddp {
137	line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
138	line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
139	X: box invis at 0.1<$1.nw,$1.ne>
140	Y: box invis at 0.1<$1.sw,$1.se>
141	line -> from 0.5<X,Y> to $2.nw
142	"..." at 0.5<$1.w,$1.e>
143	X: box invis at 0.9<$1.nw,$1.ne>
144	Y: box invis at 0.9<$1.sw,$1.se>
145	line -> from 0.5<X,Y> to $3.nw
146}
147
148bhd=0.4
149L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd)
150boxppddd(L1abc, L0a, L0b)
151L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd)
152boxppddd(L1def, L0d, L0e)
153L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd)
154boxppddd(L1ghi, L0g, L0h)
155L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd)
156boxppdddp(L1jklm, L0j, L0k, L0m)
157B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd)
158
159L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd)
160boxppddd(L2abcdef, L1abc, L1def)
161L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd)
162boxpdddp(L2ghijklm, L1ghi, L1jklm)
163B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd)
164
165L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd)
166boxpdddp(L3atom, L2abcdef, L2ghijklm)
167B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd)
168.PE
169.LP
170The leaves are the original data stream.  Those blocks have type
171.CW VtDataType .
172The first pointer stream has type
173.CW VtPointerType0 ,
174the next has type
175.CW VtPointerType1 ,
176and so on.
177The figure ends with a single block of type
178.CW VtPointerType2 ,
179but in general trees can have height up to
180.CW VtPointerType6 .
181For a
182.I dsize
183of 8192 bytes
184and
185.I psize
186of 8180 bytes (409 pointers),
187this gives a maximum stream size of approximately 10 zettabytes
188(2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes).
189.PP
190Data blocks are truncated to remove trailing runs of zeros before
191storage to Venti; they are zero-filled back to
192.I dsize
193bytes after retrieval from Venti.
194Similarly, trailing runs of pointers to zero-length blocks are
195removed from and added back to pointer blocks.
196These simple rules happen to make it particularly efficient to store
197large runs of zeros, as might occur in a data stream with ``holes:''
198the zero-length block itself can be interpreted as a tree of any
199depth encoding an all-zero data stream.
200.PP
201Reconstructing the data stream requires the score and type of the
202topmost block in the tree, the data chunk size, the pointer chunk size,
203and the data stream size.
204(From the data stream size and the chunk sizes we could derive the
205depth of the tree and thus the type of the topmost block, but it is convenient
206to allow trees that are deeper than necessary.)
207This information is kept in a 40-byte structure called a
208.CW VtEntry :
209.P1
210VtEntry:
211.ta +\w'    'u +\w'            'u
212	gen[4]	\fRgeneration number\fP
213	psize[2]	\fRsize of pointer blocks\fP
214	dsize[2]	\fRsize of data blocks\fP
215	flags[1]
216	zero[5]
217	size[6]	\fRlength of file\fP
218	score[20]	\fRscore of root block in tree\fP
219.P2
220(In this notation,
221.CW name[sz]
222indicates a
223.CW sz -byte
224field called
225.CW name .
226Integers are stored in big-endian order.
227.CW Size
228really is a 48-bit field.)
229.CW Flags
230is made up of the following bit fields.
231.P1
232.ta +\w'      'u +\w'                      'u
2330x01	VtEntryActive	\fRentry is allocated\fP
2340x02	VtEntryDir	\fRentry describes a Venti directory (q.v.)\fP
2350x1C	VtEntryDepthMask	\fRmask for tree depth\fP
2360x20	VtEntryLocal	\fRreserved (q.v.)\fP
237.P2
238.LP
239The depth of the described tree is stored in the 3 bits indicated:
240a tree with a topmost node of type
241.CW VtPointerType3
242has depth 4.
243.PP
244With
245.CW VtEntry
246we can build more complicated data structures,
247ones with multiple or nested data streams.
248A data stream consisting of
249.CW VtEntry
250structures is called a Venti directory.
251It is identical in structure to the Venti data stream
252we described earlier except that the bottom-level type is
253.CW VtDirType ,
254and
255the
256.CW VtEntry
257describing a Venti directory has the
258.CW VtEntryDir
259flag bit set.
260The
261.I dsize
262for a Venti directory
263is a multiple of 40 so that each data chunk holds
264an integer number of
265.CW VtEntry
266structures.
267By analogy with Venti directories,
268we call the original data stream a
269Venti file.
270Note that Venti files are assumed
271.I not
272to contain pointers to other Venti blocks.
273The only pointers to Venti blocks occur in
274.CW VtEntry
275structures in
276Venti directories
277(and in the internal hash tree structure of the
278individual files and directories).
279Note also that these directories are nothing more than pointer lists.
280In particular, there are no names or metadata like in a file system.
281.PP
282To make it easier to pass hierarchies between applications,
283the root of a hierarchy is described in a 300-byte structure
284called a
285.CW VtRoot :
286.P1
287VtRoot:
288.ta +\w'    'u +\w'                'u
289	version[2]	\f(CW2\fP
290	name[128]	\fRname of structure (just a comment)\fP
291	type[128]	\fRstring describing structure (\f(CWvac\fR)\f(CW
292	score[20]	\fRpointer to \f(CWVtDirType\fP block\f(CW
293	blockSize[2]	\fRmaximum block size in structure\fP
294	prev[20]	\fRprevious \f(CWVtRoot\fP in chain, if any\fP
295.P2
296.LP
297This structure is stored to Venti and its score is passed
298between applications, typically in the form
299``\fItype\f(CW:\fIrootscore\fR,''
300where
301.I type
302is the type field from the
303.CW VtRoot
304structure, and
305.I rootscore
306is the score of the
307.CW VtRoot
308block.
309.CW VtRoot
310structures can be chained together using the
311.I prev
312field to encode an archival history
313of the data structure.
314.PP
315For example, a small Venti hierarchy might look like:
316.PS
317.ps 8
318.vs 10
319boxwid=0.1
320boxht=0.1
321f=0.9
322mb=0.16
323
324VtRoot: [
325	right
326	B1: box
327	move right 0.1
328	"\f(CWVtRoot\fP" ljust
329]
330
331Root: [
332	right
333	B1: box fill f
334	B2: box fill f
335	B3: box fill f
336	move right 0.1
337] with .nw at VtRoot.sw+(0.2,-.1)
338Level1: [
339	RootMeta: [
340		box wid mb
341	]
342	MetaSource: [
343		right
344		B1: box wid 5*mb
345	] with .nw at RootMeta.sw+(0,-.1)
346
347	Source: [
348		right
349		B1: box fill f
350		B2: box fill f
351		B3: box fill f
352		B4: box fill f
353		B5: box fill f
354		B6: box fill f
355		B7: box fill f
356		B8: box fill f
357	] with .nw at MetaSource.sw+(0,-.1)
358	SB1: box invis at Source.B1
359	SB2: box invis at Source.B2
360	SB3: box invis at Source.B3
361] with .nw at Root.sw+(0.4,-.1)
362Level2: [
363	MetaSource: [
364		right
365		B1: box wid 5*mb
366	]
367	Source: [
368		right
369		B1: box fill f
370		B2: box fill f
371		B3: box fill f
372		B4: box fill f
373		B5: box fill f
374		B6: box fill f
375		B7: box fill f
376		B8: box fill f
377	] with .nw at MetaSource.sw+(0,-.1)
378	File: box wid 0.8 with .nw at Source.sw+(0,-.1)
379] with .nw at Level1.sw+(0.6,-.1)
380
381line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w
382line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w
383line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w
384line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w
385
386line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w
387line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w
388line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w
389
390[
391	KEY: box wid 1.5 invis "Key"
392	line from KEY.sw to KEY.se
393	k = -.1
394	kk=0.5
395	A: [
396		box wid 4*boxwid
397		"Venti file" ljust with .w at last box .w+(kk,0)
398	] with .nw at KEY.sw+(0,2*k)
399	B: [
400		box fill f
401		"Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0)
402	] with .nw at A.sw+(0,k)
403	C: [
404		right
405		CC: box fill f
406		box fill f
407		box fill f
408		box fill f
409		"Venti directory" ljust with .w at CC.w+(kk,0)
410	] with .nw at B.sw+(0,k)
411	D: [
412		line -> right 3*boxwid
413		"Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
414	] with .nw at C.sw+(0,k)
415] with .nw at VtRoot.nw+(3,0)
416.PE
417.LP
418Venti files are shown as white boxes, while directories are shown
419as shaded boxes.  Each shaded square represents a
420.CW VtEntry .
421Arrows represent pointers from
422.CW VtEntry
423structures to other
424Venti files or directories.
425.PP
426The hierarchical structure provided by Venti files and directories
427can be used as the base for more complicated data structures.
428Because this structure captures all the information
429about pointers to other blocks, tools written to traverse
430Venti hierarchies can traverse the more complicated
431data structures as well.
432For example,
433.I venti/copy
434(see
435.I venti (1))
436copies a Venti hierarchy from one Venti server to another,
437given the root
438.CW VtEntry .
439Because the traditional file system described in later sections is
440layered on a Venti hierarchy,
441.I venti/copy
442can copy it without fully understanding its structure.
443.NH 1
444Vac file system format
445.HP
446The Venti archive format
447.I vac
448builds a traditional file system using a Venti hierarchy.
449Each vac file is implemented as a Venti file;
450each vac directory is implemented as a Venti
451directory and a Venti file to provide traditional file system metadata.
452The metadata is stored in a structure called a
453.CW DirEntry :
454.P1
455DirEntry:
456.ta +\w'    'u +\w'            'u
457	magic[4]	\f(CW0x1c4d9072\fP (DirMagic)\fP
458	version[2]	\f(CW9\fP
459	elem[s]	\fRname (final path element only)\fP
460	entry[4]	\fRentry number for Venti file or directory\fP
461	gen[4]	\fRgeneration number\fP
462	mentry[4]	\fRentry number for Venti file holding metadata\fP
463	mgen[4]	\fRgeneration number\fP
464	qid[8]	\fRunique file serial number\fP
465	uid[s]	\fRowner\fP
466	gid[s]	\fRgroup\fP
467	mid[s]	\fRlast modified by\fP
468	mtime[4]	\fRlast modification time\fP
469	ctime[4]	\fRcreation time\fP
470	atime[4]	\fRlast access time\fP
471	mode[4]	\fRmode bits\fP
472.P2
473The notation
474.CW name[s]
475denotes a string stored as a two-byte length
476and then that many bytes.
477The above describes Version 9 of the
478.CW DirEntry
479format.  Versions 7 and 8 are very similar; they can be
480read by the current
481.I vac
482source code but are not written.
483Earlier versions were not widespread.
484A
485.CW DirEntry
486may be followed by optional extension sections, though none
487are currently used.
488The
489.CW mode
490bits include bits commonly used by
491Unix and Windows, in addition to those used by Plan 9.
492.PP
493The
494.CW entry
495field is an index into the parallel Venti directory.
496The
497.CW gen
498field must match the
499.CW gen
500field in the corresponding
501.CW VtEntry
502in the directory;
503it is used to detect
504stale indices.
505Similarly,
506.CW mentry
507and
508.CW mgen
509are the index and generation number
510for the metadata Venti file,
511if the
512.CW DirEntry
513describes a vac directory.
514.PP
515The relation between Venti files and directories and
516vac files and directories can be seen in this figure:
517.PS
518.ps 8
519.vs 10
520boxwid=0.1
521boxht=0.1
522f=0.9
523mb=0.16
524
525VtRoot: [
526	right
527	B1: box
528	move right 0.1
529	"\f(CWVtRoot\fP" ljust
530]
531
532SuperRoot: [
533	right
534	B1: box fill f
535	move right 0.1
536	"fs root block" ljust
537] with .nw at VtRoot.sw + (0.2, -.2)
538Root: [
539	right
540	B1: box fill f
541	B2: box fill f
542	B3: box fill f
543	move right 0.1
544	"root directory info block" ljust
545] with .nw at SuperRoot.sw+(0.2, -.2)
546Level1: [
547	RootMeta: [
548		box wid mb
549		move right 0.1
550		"root metadata" ljust
551	]
552	MetaSource: [
553		right
554		B1: box wid mb
555		B2: box wid mb
556		B3: box wid mb
557		B4: box wid mb
558		B5: box wid mb
559	] with .nw at RootMeta.sw+(0,-.2)
560	MB1: box wid mb invis at MetaSource.B1
561	MB2: box wid mb invis at MetaSource.B2
562	MB3: box wid mb invis at MetaSource.B3
563	MB4: box wid mb invis at MetaSource.B4
564	MB5: box wid mb invis at MetaSource.B5
565
566	Source: [
567		right
568		B1: box fill f
569		B2: box fill f
570		B3: box fill f
571		B4: box fill f
572		B5: box fill f
573		B6: box fill f
574		B7: box fill f
575		B8: box fill f
576	] with .nw at MetaSource.sw+(0,-.1)
577	SB1: box invis at Source.B1
578	SB2: box invis at Source.B2
579	SB3: box invis at Source.B3
580	SB4: box invis at Source.B4
581	SB5: box invis at Source.B5
582	SB6: box invis at Source.B6
583	SB7: box invis at Source.B7
584	SB8: box invis at Source.B8
585] with .nw at Root.sw+(0.4,-.2)
586Level2: [
587	MetaSource: [
588		right
589		B1: box wid mb
590		B2: box wid mb
591		B3: box wid mb
592		B4: box wid mb
593		B5: box wid mb
594	]
595	Source: [
596		right
597		B1: box fill f
598		B2: box fill f
599		B3: box fill f
600		B4: box fill f
601		B5: box fill f
602		B6: box fill f
603		B7: box fill f
604		B8: box fill f
605	] with .nw at MetaSource.sw+(0,-.1)
606	File: box wid 0.8 with .nw at Source.sw+(0,-.2)
607] with .nw at Level1.sw+(0.6,-.2)
608
609line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w
610line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w
611line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w
612line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w
613line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w
614
615line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w
616line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w
617line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w
618
619arrowwid = arrowwid/2
620arrowht = arrowht/2
621line -> from Level1.MB1 to Level1.SB1.n
622line -> from Level1.MB2 to Level1.SB2.n
623line -> from Level1.MB2 to Level1.SB3.n
624line -> from Level1.MB4 to Level1.SB7.n
625line -> from Level1.MB5 to Level1.SB5.n
626arrowwid = arrowwid * 2
627arrowht = arrowht * 2
628
629box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
630box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
631box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2
632
633[
634	KEY: box wid 1.5 invis "Key"
635	line from KEY.sw to KEY.se
636	k = -.1
637	kk=0.5
638	A: [
639		box wid 4*boxwid
640		"Venti file" ljust with .w at last box .w+(kk,0)
641	] with .nw at KEY.sw+(0,2*k)
642	B: [
643		box fill f
644		"Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0)
645	] with .nw at A.sw+(0,k)
646	C: [
647		right
648		CC: box fill f
649		box fill f
650		box fill f
651		box fill f
652		"Venti directory" ljust with .w at CC.w+(kk,0)
653	] with .nw at B.sw+(0,k)
654	D: [
655		line -> right 3*boxwid
656		"Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
657	] with .nw at C.sw+(0,k)
658	DD: [
659		box dotted wid 4*boxwid
660		"Vac file" ljust with .w at last box .w+(kk,0)
661	] with .nw at D.sw+(0,k)
662	E: [
663		box wid mb
664		"Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0)
665	] with .nw at DD.sw+(0,k)
666	G: [
667		box dashed wid 4*boxwid
668		"Vac directory" ljust with .w at last box .w+(kk,0)
669	] with .nw at E.sw+(0,k)
670	H: [
671		arrowwid = arrowwid/2
672		arrowht = arrowht/2
673		line -> right 1.5*boxwid
674		"Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0)
675		arrowwid = arrowwid * 2
676		arrowht = arrowht * 2
677	] with .nw at G.sw+(0,k)
678] with .nw at VtRoot.nw+(3,0)
679.PE
680.LP
681In reality, the story is slightly more complicated.
682The metadata file in a Vac directory
683is not just the concatenation of
684.CW DirEntry
685structures.
686Instead, it is the concatenation of
687.CW MetaBlocks .
688A
689.CW MetaBlock
690contains some number of
691.CW DirEntry
692structures along with a sorted index to make it easy
693to look for a particular
694.CW DirEntry
695by its
696.CW elem
697field.
698The details are in the source code.
699.PP
700As shown in the diagram,
701the root directory of the file system is summarized by
702three
703.CW VtEntry
704structures describing
705the Venti directory for the children of the root,
706the Venti file for the metadata describing the children of the root,
707and a Venti file holding metadata for the root directory itself.
708These
709.CW VtEntry
710structures are placed in a Venti directory of their own,
711described by the single
712.CW VtEntry
713in the
714root block.
715.NH 1
716Fossil file system format
717.HP
718Fossil uses the vac format, with some small changes.
719The changes only affect the data on the local disk; the data
720archived to Venti is exactly in vac format.
721.PP
722Blocks stored on local disk may contain scores pointing at local disk
723blocks or at Venti blocks.
724Local block addresses are stored as 20-byte scores in which the first 16 bytes
725are all zero and the last 4 bytes specify a block number in the disk.
726Before a block is archived, all the
727blocks it points to must be archived, and the local scores in the block
728must be changed to Venti scores.
729Using block addresses rather than content hashes for local data
730makes the local file system easier to manage: if a local block's contents
731change, the pointer to the block does not need to change.
732.NH 2
733Snapshots
734.HP
735Fossil is an archival file server.
736It takes periodic snapshots of the file system,
737which are made accessible through the file system.
738Specifically, the active file system is presented in
739.CW /active .
740Ephemeral snapshots (those that are kept on local disk and eventually deleted)
741are presented in
742\f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR,
743where
744.I yyyy
745is the full year,
746.I mm
747is the month number,
748.I dd
749is the day number,
750.I hh
751is the hour,
752and
753.I mm
754is the minute.
755Archival snapshots (those that are archived to Venti and persist forever)
756are presented in
757\f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR,
758where
759.I yyyy ,
760.I mm ,
761and
762.I dd
763are year, month, and day as before,
764and
765.I s
766is a sequence number if more than one
767archival snapshot is done in a day.
768For the first snapshot,
769.I s
770is null.
771For the subsequent snapshots,
772.I s
773is
774.CW .1 ,
775.CW .2 ,
776.CW .3 ,
777etc.
778.PP
779To implement the snapshots, the file server maintains a
780current
781.I epoch
782for the active file system.
783Each local block has a label that records, among other things,
784the epoch in which the block was allocated.
785If a block was allocated in an epoch earlier than the current one,
786it is immutable and treated as copy-on-write.
787Taking a snapshot can be accomplished by
788recording the address of the current root block and then
789incrementing the epoch number.
790Notice that the copy-on-write method makes
791snapshots both time efficient and space efficient.
792The only time cost is waiting for all current file system
793requests to finish and then incrementing a counter.
794After a snapshot, blocks only get copied when they are
795next modified, so the per-snapshot
796space requirement is proportional
797to the amount of new data rather than the total
798size of the file system.
799.PP
800The blocks in the archival snapshots are moved to Venti,
801but the blocks in the ephemeral snapshots take up space
802in the local disk file.
803To allow reclamation of this disk space, the file system
804maintains a
805.I low
806.I epoch ,
807which is the epoch of the earliest ephemeral snapshot
808still available.
809Fossil only allows access to snapshots with epoch numbers
810between the
811low epoch and the current epoch
812(also called the high epoch).
813Incrementing the low epoch thus makes old
814snapshots inaccessible.
815The space required to store those snapshots can then
816be reclaimed, as described below.
817.NH 2
818Local blocks
819.HP
820The bulk of the local disk file is the local blocks.
821Each block has a 14-byte label associated with it, of the format:
822.P1
823Label:
824.ta +\w'    'u +\w'                'u
825	state[1]	\fRblock state\fP
826	type[1]	\fRblock type\fP
827	epoch[4]	\fRallocation epoch\fP
828	epochClose[4]	\fRclose epoch\fP
829	tag[4]	\fRrandom tag\fP
830.P2
831.LP
832The
833.CW type
834is an analogue of the block types described earlier,
835though different names are used, to distinguish between
836pointers blocks in a hash tree for a data stream
837and pointer blocks for a directory stream.
838The
839.CW epoch
840was mentioned in the last section.
841The other fields are explained below.
842.PP
843There are two distinguished blocks states
844.CW BsFree
845.CW 0x00 ) (
846and
847.CW BsBad
848.CW 0xFF ), (
849which mark blocks that are available for allocation
850and blocks that are bad and should be avoided.
851If
852.CW state
853is not one of these values, it is a bitwise
854.I or ' `
855of the following flags:
856.P1
857.ta +\w'      'u +\w'                'u
8580x01	BsAlloc	\fRblock is in use\fP
8590x02	BsCopied	\fRblock has been copied\fP
8600x04	BsVenti	\fRblock has been stored on Venti\fP
8610x08	BsClosed	\fRblock has been unlinked from active file system\fP
862.P2
863.LP
864The flags are explained as they arise in the discussions below.
865.PP
866It is convenient to store some extra fields in the
867.CW VtEntry
868structure when it describes a Venti file or directory
869stored on local disk.
870Specifically, we set the
871.CW VtEntryLocal
872flag bit
873and then use the bytes 7-16 of the score (which would
874otherwise be zero, since it is a local score) to hold these fields:
875.P1
876.ta +\w'    'u +\w'                'u
877	archive[1]	\fRboolean: this is an archival snapshot\fP
878	snap[4]	\fRepoch number if root of snapshot\fP
879	tag[4]	\fRrandom tag\fP
880.P2
881.LP
882The extended
883.CW VtEntry
884structure is called an
885.CW Entry .
886The
887.CW tag
888field
889in the
890.CW Label
891and the
892.CW Entry
893is used to identify dangling pointers or other file system corruption:
894all the local blocks in a hash tree must
895have tags matching the tag in the
896.CW Entry .
897If this
898.CW Entry
899points at the root of a snapshot,
900the
901.CW snap
902field is the epoch of the snapshot.
903If the snapshot is intended to be archived to Venti,
904the
905.CW archive
906field is non-zero.
907.NH 2
908Block reclamation
909.HP
910The blocks in the active file system form a tree: each
911block has only one parent.
912Once a copy-on-write block
913.I b
914is replaced by its copy, it is no longer
915needed by the active file system.
916At this point,
917.I b
918is unlinked from the active file system.
919We say that
920.I b
921is now
922.I closed :
923it is needed only for snapshots.
924When a block is closed, the
925.CW BsClosed
926bit is set in its state, and the current epoch (called the block's closing epoch)
927is stored in the
928.CW epochClose
929label field.
930(Open blocks have an
931.CW epochClose
932of
933.CW ~0 ).
934.PP
935A block is referenced by snapshots with epochs
936between the block's allocation epoch and its closing epoch.
937Once the file system's low epoch grows to be greater than or equal to the block's
938closing epoch, the block is no longer needed for any snapshots
939and can be reused.
940.PP
941In a typical configuration, where nightly archival snapshots
942are taken and written to Venti, it is desirable to reclaim
943the space occupied by now-archived blocks if possible.
944To do this, Fossil keeps track of whether the pointers
945in each block are unique to that block.
946When a block
947.I bb
948is allocated, a pointer to
949.I bb
950is written into exactly one active block (say,
951.I b ).
952In the absence of snapshots, the pointer to
953.I bb
954will remain unique to
955.I b ,
956so that if the pointer is zeroed,
957.I bb
958can be immediately reused.
959Snapshots complicate this invariant:
960when
961.I b
962is copied-on-write, all its pointers
963are no longer unique to it.
964At time of the copy, the
965.CW BsCopied
966state bit in the block's label
967is set to note the duplication of the pointers contained within.
968.NH 2
969Disk layout
970.HP
971The file system header describes the file system layout and has this format:
972.P1
973.ta +\w'    'u +\w'                'u
974Header:
975	magic[4]	\fR0x3776AE89 (HeaderMagic)\fP
976	version[2]	\fR1 (HeaderVersion)\fP
977	blockSize[2]	\fIfile system block size\fP
978	super[4]	\fRblock offset of super block\fP
979	label[4]	\fRblock offset of labels\fP
980	data[4]	\fRdata blocks\fP
981	end[4]	\fRend of file system\fP
982.P2
983.LP
984The corresponding file system layout is:
985.PS
986.ps 8
987.vs 9
988boxwid=0.75
989boxht=0.15
990Empty: box "empty" ht 0.25
991Header: box "header" with .n at Empty.s
992Empty2: box "empty" with .n at Header.s
993Super: box "super block" with .n at Empty2.s
994Label: box "label" "blocks" with .n at Super.s ht 0.25
995Data: box "data" "blocks" with .n at Label.s ht 0.3
996"  0" ljust at Empty.ne
997"  128kB" ljust at Header.ne
998"  \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne
999"  \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne
1000"  \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne
1001"  \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se
1002"" at (-1,0)
1003"" at (6,0)
1004.PE
1005.LP
1006The numbers to the right of the blocks are byte offsets
1007of the boundaries.
1008.LP
1009The super block describes the file system itself and looks like:
1010.P1
1011.ta +\w'    'u +\w'                'u
1012Super:
1013	magic[4]	\fR0x2340A3B1 (SuperMagic)\fP
1014	version[2]	\fR1 (SuperVersion)\fP
1015	epochLow[4]	\fRfile system low epoch\fP
1016	epochHigh[4]	\fRfile system high (active) epoch\fP
1017	qid[8]	\fRnext qid to allocate\fP
1018	active[4]	\fRdata block number: root of active file system\fP
1019	next[4]	\fRdata block number: root of next file system to archive\fP
1020	current[4]	\fRdata block number: root of file system currently being archived\fP
1021	last[20]	\fRVenti score of last successful archive\fP
1022	name[128]	\fRname of file system (just a comment)\fP
1023.P2
1024.LP
1025.NH 1
1026Fossil server
1027.HP
1028The Fossil server is a user-space program that runs on a standard Plan 9 kernel.
1029.NH 2
1030Process structure
1031.PP
1032The file server is structured as a set of processes synchronizing
1033mostly through message passing along queues.
1034The processes are given names, which can be seen in the output of
1035.CW ps
1036.CW -a .
1037.PP
1038.CW Listen
1039processes announce on various network addresses.
1040A
1041.CW con
1042process handles each incoming connection, reading 9P requests
1043and adding them to a central message queue.
1044.CW Msg
1045processes remove 9P requests from the queue,
1046handle them, and write the responses to the appropriate
1047file descriptors.
1048.PP
1049The
1050.CW disk
1051process handles disk I/O requests made by the other processes.
1052The
1053.CW flush
1054process writes dirty blocks from the in-memory block cache to disk.
1055The
1056.CW unlink
1057process frees previously linked blocks once the blocks that point at them
1058have been written to disk.
1059.PP
1060A
1061.CW consI
1062reads from each console file (typically a pipe posted in
1063.CW /srv ),
1064adding the typed characters to the input queue.
1065The
1066.CW cons
1067process echoes input and runs the commands, saving
1068output in a ring buffer.
1069Because there is only one
1070.CW cons
1071process, only one console command may be executing at a time.
1072A
1073.CW consO
1074process copies this ring buffer to each console file.
1075.PP
1076The
1077.CW periodic
1078process runs periodic events, like
1079flushing the root metadata to disk or
1080taking snapshots of the file system.
1081.NH 2
1082Block cache
1083.HP
1084Fossil maintains an in-memory block cache which
1085holds both local disk blocks and Venti blocks.
1086Cache eviction follows a least recently used policy.
1087Dirty blocks are restricted to at most half the cache.
1088This can be changed by editing
1089.CW DirtyPercentage
1090in
1091.CW dat.h .
1092.PP
1093The block cache uses soft updates [1] to ensure that the on-disk
1094file system is always self-consistent.
1095Thus there is no
1096.I halt
1097console command
1098and no need to check a file system
1099that was shut down without halting.
1100.NH 2
1101Archiving
1102.HP
1103A background process writes blocks in archival snapshots to Venti.
1104Although
1105.CW /archive/\fIyyyy\fP/\fImmdds\fR
1106is a copy of only
1107.CW /active
1108at the time of the snapshot,
1109the archival process archives the
1110entire file tree rather than just
1111the subtree rooted at
1112.CW /active .
1113The snapshots
1114.CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm
1115are stored as empty directories.
1116Once all the blocks have been archived,
1117a
1118.CW VtRoot
1119header for the file system is archived.
1120The score of that header is recorded in
1121.CW super.score
1122and also printed on the file server console.
1123The score can used by
1124.I flfmt
1125to restore a file system (see
1126.I fossil (4)).
1127.NH 2
1128Contrast with the old file server
1129.HP
1130The most obvious difference between Fossil and the
1131old Plan 9 file server [2] is that Fossil uses a Venti server as
1132its archival storage in place of a WORM juke box.
1133There are a few other architectural differences to be
1134aware of.
1135.PP
1136Fossil is a user-level program run on a standard kernel.
1137.PP
1138Fossil does not have any way to concatenate, stripe, or
1139mirror disk files.  For functionality similar to the old file server's
1140configuration strings, use the experimental file stack device
1141(see
1142.I fs (3)).
1143.PP
1144Fossil speaks only 9P2000.  Old 9P (aka 9P1) is not supported.
1145.PP
1146... XXX words about converting an old file system to fossil?
1147.NH 1
1148References
1149.LP
1150[1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules,
1151and Yale N. Patt.
1152``Soft Updates: A Solution to the Metadata Update Problem
1153in File Systems,''
1154.I "ACM Transactions on Computer Systems" ,
1155Vol 18., No. 2, May 2000, pp. 127\-153.
1156.LP
1157[2] Sean Quinlan, ``A Cached WORM File System,''
1158.I "Software\(emPractice and Experience" ,
1159Vol 21., No 12., December 1991, pp. 1289\-1299.
1160.LP
1161[3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,''
1162.I "Usenix Conference on File and Storage Technologies" ,
11632002.
1164