xref: /plan9/sys/doc/fossil.ms (revision 5e96a66c77eb9140492ca53f857cbbf108e128ed)
1... .FP times
2... .fp 1 R R.nomath
3... .fp 5 CW LucidaSansCW83
4.TL
5Fossil, an Archival File Server
6.AU
7Sean Quinlan
8Jim McKie
9Russ Cox
10jmk,rsc@plan9.bell-labs.com
11.AB
12This paper describes the internals and
13operation of Fossil, an archival file server built for Plan 9.
14Fossil has not yet replaced the current Plan 9 file server
15and
16.CW kfs ,
17but that is our eventual intent.
18Both fossil and this documentation are
19works in progress.  Comments on either are most welcome.
20.AE
21.de HP
22.LP
23..
24.NH 1
25Introduction
26.HP
27Fossil is an archival file server built for Plan 9.
28In a typical configuration, it maintains a traditional file system
29in a local disk partition and periodically archives snapshots of the file system
30to a Venti server.  These archives are made available through
31a file system interface.
32Fossil can also be run without a Venti server, in which case the
33snapshots (if any) occupy local disk space.
34.PP
35The bulk of this paper explains the underlying data structures:
36Venti trees, the Venti archival file system format, and finally Fossil's
37file system format.
38The end of the paper discusses the architecture of the Fossil server.
39.PP
40The presentation of the data structures is very detailed, perhaps
41too detailed for most readers.
42The intent is to record all the details necessary to make structural
43changes to the file system format.
44Feel free to jump ahead when boredom strikes.
45.NH 1
46Venti trees and directory hierarchies
47.HP
48Venti [3] is an archival block storage server.
49Once a block is stored, it can be retrieved by presenting the 20-byte
50SHA1 hash of its contents, called a
51.I score .
52Blocks on Venti have a maximum length of about 56 kilobytes,
53though in practice smaller blocks are used.
54To store a byte stream of arbitrary length, Venti uses a hash tree.
55Conceptually, the data stream is broken into fixed-size (say,
56.I dsize -byte)
57chunks, which are stored on the Venti server.
58The resulting scores are concatenated into a new pointer stream, which is
59broken into fixed size (say,
60.I psize -byte)
61chunks, which are stored on the Venti server.
62.I Psize "" (
63is different from
64.I dsize
65so that we can ensure that each pointer block holds an
66integral number of pointers.)
67This yields a new pointer stream, and so on, until there is a single block
68and finally a single score describing the entire tree.
69The resulting structure looks like:
70.PS
71.ps 8
72.vs 10
73boxht=0.1
74boxwid=0.1
75
76B0: box invis wid 1 "\f(CWVtDataType\fP"
77move right 0.1
78L0a: box wid 0.2
79move right 0.1
80L0b: box wid 0.2
81move right 0.1
82L0c: box invis wid 0.2 "..."
83move right 0.1
84
85L0d: box wid 0.2
86move right 0.1
87L0e: box wid 0.2
88move right 0.2
89L0f: box invis wid 0.2 "..."
90move right 0.2
91
92L0g: box wid 0.2
93move right 0.1
94L0h: box wid 0.2
95move right 0.1
96L0i: box invis wid 0.2 "..."
97move right 0.1
98
99L0j: box wid 0.2
100move right 0.1
101L0k: box wid 0.2
102move right 0.1
103L0l: box invis wid 0.2 "..."
104move right 0.1
105L0m: box wid 0.2
106
107define boxppddd {
108	line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
109	line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
110	X: box invis at 0.1<$1.nw,$1.ne>
111	Y: box invis at 0.1<$1.sw,$1.se>
112	line -> from 0.5<X,Y> to $2.nw
113	X: box invis at 0.3<$1.nw,$1.ne>
114	Y: box invis at 0.3<$1.sw,$1.se>
115	line -> from 0.5<X,Y> to $3.nw
116	"..." at 0.7<$1.w,$1.e>
117}
118
119define boxppdddp {
120	line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
121	line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
122	line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
123	X: box invis at 0.1<$1.nw,$1.ne>
124	Y: box invis at 0.1<$1.sw,$1.se>
125	line -> from 0.5<X,Y> to $2.nw
126	X: box invis at 0.3<$1.nw,$1.ne>
127	Y: box invis at 0.3<$1.sw,$1.se>
128	line -> from 0.5<X,Y> to $3.nw
129	"..." at 0.6<$1.w,$1.e>
130	X: box invis at 0.9<$1.nw,$1.ne>
131	Y: box invis at 0.9<$1.sw,$1.se>
132	line -> from 0.5<X,Y> to $4.nw
133}
134
135define boxpdddp {
136	line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
137	line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
138	X: box invis at 0.1<$1.nw,$1.ne>
139	Y: box invis at 0.1<$1.sw,$1.se>
140	line -> from 0.5<X,Y> to $2.nw
141	"..." at 0.5<$1.w,$1.e>
142	X: box invis at 0.9<$1.nw,$1.ne>
143	Y: box invis at 0.9<$1.sw,$1.se>
144	line -> from 0.5<X,Y> to $3.nw
145}
146
147bhd=0.4
148L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd)
149boxppddd(L1abc, L0a, L0b)
150L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd)
151boxppddd(L1def, L0d, L0e)
152L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd)
153boxppddd(L1ghi, L0g, L0h)
154L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd)
155boxppdddp(L1jklm, L0j, L0k, L0m)
156B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd)
157
158L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd)
159boxppddd(L2abcdef, L1abc, L1def)
160L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd)
161boxpdddp(L2ghijklm, L1ghi, L1jklm)
162B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd)
163
164L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd)
165boxpdddp(L3atom, L2abcdef, L2ghijklm)
166B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd)
167.PE
168.LP
169The leaves are the original data stream.  Those blocks have type
170.CW VtDataType .
171The first pointer stream has type
172.CW VtPointerType0 ,
173the next has type
174.CW VtPointerType1 ,
175and so on.
176The figure ends with a single block of type
177.CW VtPointerType2 ,
178but in general trees can have height up to
179.CW VtPointerType6 .
180For a
181.I dsize
182of 8192 bytes
183and
184.I psize
185of 8180 bytes (409 pointers),
186this gives a maximum stream size of approximately 10 zettabytes
187(2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes).
188.PP
189Data block are truncated to remove trailing runs of zeros before
190storage to Venti; they are zero-filled back to
191.I dsize
192bytes after retrieval from Venti.
193SImilarly, trailing runs of pointers to zero-length blocks are
194removed from and added back to pointer blocks.
195These simple rules happen to make it particularly efficient to store
196large runs of zeros, as might occur in a data stream with ``holes:''
197the zero-length block itself can be interpreted as a tree of any
198depth encoding an all-zero data stream.
199.PP
200Reconstructing the data stream requires the score and type of the
201topmost block in the tree, the data chunk size, the pointer chunk size,
202and the data stream size.
203(From the data stream size and the chunk sizes we could derive the
204depth of the tree and thus the type of the topmost block, but it is convenient
205to allow trees that are deeper than necessary.)
206This information is kept in a 40-byte structure called a
207.CW VtEntry :
208.P1
209VtEntry:
210.ta +\w'    'u +\w'            'u
211	gen[4]	\fRgeneration number\fP
212	psize[2]	\fRsize of pointer blocks\fP
213	dsize[2]	\fRsize of data blocks\fP
214	flags[1]
215	zero[5]
216	size[6]	\fRlength of file\fP
217	score[20]	\fRscore of root block in tree\fP
218.P2
219(In this notation,
220.CW name[sz]
221indicates a
222.CW sz -byte
223field called
224.CW name .
225Integers are stored in big-endian order.
226.CW Size
227really is a 48-bit field.)
228.CW Flags
229is made up of the following bit fields.
230.I or ' `
231of the following flags:
232.P1
233.ta +\w'      'u +\w'                      'u
2340x01	VtEntryActive	\fRentry is allocated\fP
2350x02	VtEntryDir	\fRentry describes a Venti directory (q.v.)\fP
2360x1C	VtEntryDepthMask	\fRmask for tree depth\fP
2370x20	VtEntryLocal	\fRreserved (q.v.)\fP
238.P2
239.LP
240The depth of the described tree is stored in the 5 bits indicated:
241a tree with a topmost node of type
242.CW VtPointerType3
243has depth 4.
244.PP
245With
246.CW VtEntry
247we can build more complicated data structures,
248ones with multiple or nested data streams.
249A data stream consisting of
250.CW VtEntry
251structures is called a Venti directory.
252It is identical in structure to the Venti data stream
253we described earlier except that the bottom-level type is
254.CW VtDirType ,
255and
256the
257.CW VtEntry
258describing a Venti directory has the
259.CW VtEntryDir
260flag bit set.
261The
262.I dsize
263for a Venti directory
264is a multiple of 40 so that each data chunk holds
265an integer number of
266.CW VtEntry
267structures.
268By analogy with Venti directories,
269we call the original data stream a
270Venti file.
271Note that Venti files are assumed
272.I not
273to contain pointers to other Venti blocks.
274The only pointers to Venti blocks occur in
275.CW VtEntry
276structures in
277Venti directories
278(and in the internal hash tree structure of the
279individual files and directories).
280Note also that these directories are nothing more than pointer lists.
281In particular, there are no names or metadata like in a file system.
282.PP
283To make it easier to pass hierarchies between applications,
284the root of a hierarchy is described in a 300-byte structure
285called a
286.CW VtRoot :
287.P1
288VtRoot:
289.ta +\w'    'u +\w'                'u
290	version[2]	\f(CW2\fP
291	name[128]	\fRname of structure (just a comment)\fP
292	type[128]	\fRstring describing structure (\f(CWvac\fR)\f(CW
293	score[20]	\fRpointer to \f(CWVtDirType\fP block\f(CW
294	blockSize[2]	\fRmaximum block size in structure\fP
295	prev[20]	\fRprevious \f(CWVtRoot\fP in chain, if any\fP
296.P2
297.LP
298This structure is stored to Venti and its score is passed
299between applications, typically in the form
300``\fItype\f(CW:\fIrootscore\fR,''
301where
302.I type
303is the type field from the
304.CW VtRoot
305structure, and
306.I rootscore
307is the score of the
308.CW VtRoot
309block.
310.CW VtRoot
311structures can be chained together using the
312.I prev
313field to encode an archival history
314of the data structure.
315.PP
316For example, a small Venti hierarchy might look like:
317.PS
318.ps 8
319.vs 10
320boxwid=0.1
321boxht=0.1
322f=0.9
323mb=0.16
324
325VtRoot: [
326	right
327	B1: box
328	move right 0.1
329	"\f(CWVtRoot\fP" ljust
330]
331
332Root: [
333	right
334	B1: box fill f
335	B2: box fill f
336	B3: box fill f
337	move right 0.1
338] with .nw at VtRoot.sw+(0.2,-.1)
339Level1: [
340	RootMeta: [
341		box wid mb
342	]
343	MetaSource: [
344		right
345		B1: box wid 5*mb
346	] with .nw at RootMeta.sw+(0,-.1)
347
348	Source: [
349		right
350		B1: box fill f
351		B2: box fill f
352		B3: box fill f
353		B4: box fill f
354		B5: box fill f
355		B6: box fill f
356		B7: box fill f
357		B8: box fill f
358	] with .nw at MetaSource.sw+(0,-.1)
359	SB1: box invis at Source.B1
360	SB2: box invis at Source.B2
361	SB3: box invis at Source.B3
362] with .nw at Root.sw+(0.4,-.1)
363Level2: [
364	MetaSource: [
365		right
366		B1: box wid 5*mb
367	]
368	Source: [
369		right
370		B1: box fill f
371		B2: box fill f
372		B3: box fill f
373		B4: box fill f
374		B5: box fill f
375		B6: box fill f
376		B7: box fill f
377		B8: box fill f
378	] with .nw at MetaSource.sw+(0,-.1)
379	File: box wid 0.8 with .nw at Source.sw+(0,-.1)
380] with .nw at Level1.sw+(0.6,-.1)
381
382line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w
383line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w
384line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w
385line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w
386
387line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w
388line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w
389line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w
390
391[
392	KEY: box wid 1.5 invis "Key"
393	line from KEY.sw to KEY.se
394	k = -.1
395	kk=0.5
396	A: [
397		box wid 4*boxwid
398		"Venti file" ljust with .w at last box .w+(kk,0)
399	] with .nw at KEY.sw+(0,2*k)
400	B: [
401		box fill f
402		"Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0)
403	] with .nw at A.sw+(0,k)
404	C: [
405		right
406		CC: box fill f
407		box fill f
408		box fill f
409		box fill f
410		"Venti directory" ljust with .w at CC.w+(kk,0)
411	] with .nw at B.sw+(0,k)
412	D: [
413		line -> right 3*boxwid
414		"Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
415	] with .nw at C.sw+(0,k)
416] with .nw at VtRoot.nw+(3,0)
417.PE
418.LP
419Venti files are shown as white boxes, while directories are shown
420as shaded boxes.  Each shaded square represents a
421.CW VtEntry .
422Arrows represent pointers from
423.CW VtEntry
424structures to other
425Venti files or directories.
426.PP
427The hierarchical structure provided by Venti files and directories
428can be used as the base for more complicated data structures.
429Because this structure captures all the information
430about pointers to other blocks, tools written to traverse
431Venti hierarchies can traverse the more complicated
432data structures as well.
433For example,
434.I venti/copy
435(see
436.I ventiaux (8))
437copies a Venti hierarchy from one Venti server to another,
438given the root
439.CW VtEntry .
440Because the traditional file system described in later sections is
441layered on a Venti hierarchy,
442.I venti/copy
443can copy it without fully understanding its structure.
444.NH 1
445Vac file system format
446.HP
447The Venti archive format
448.I vac
449builds a traditional file system using a Venti hierarchy.
450Each vac file is implemented as a Venti file;
451each vac directory is implemented as a Venti
452directory and a Venti file to provide traditional file system metadata.
453The metadata is stored in a structure called a
454.CW DirEntry :
455.P1
456DirEntry:
457.ta +\w'    'u +\w'            'u
458	magic[4]	\f(CW0x1c4d9072\fP (DirMagic)\fP
459	version[2]	\f(CW9\fP
460	elem[s]	\fRname (final path element only)\fP
461	entry[4]	\fRentry number for Venti file or directory\fP
462	gen[4]	\fRgeneration number\fP
463	mentry[4]	\fRentry number for Venti file holding metadata\fP
464	mgen[4]	\fRgeneration number\fP
465	qid[8]	\fRunique file serial number\fP
466	uid[s]	\fRowner\fP
467	gid[s]	\fRgroup\fP
468	mid[s]	\fRlast modified by\fP
469	mtime[4]	\fRlast modification time\fP
470	ctime[4]	\fRcreation time\fP
471	atime[4]	\fRlast access time\fP
472	mode[4]	\fRmode bits\fP
473.P2
474The notation
475.CW name[s]
476denotes a string stored as a two-byte length
477and then that many bytes.
478The above describes Version 9 of the
479.CW DirEntry
480format.  Versions 7 and 8 are very similar; they can be
481read by the current
482.I vac
483source code but are not written.
484Earlier versions were not widespread.
485A
486.CW DirEntry
487may be followed by optional extension sections, though none
488are currently used.
489The
490.CW mode
491bits include bits commonly used by
492Unix and Windows, in addition to those used by Plan 9.
493.PP
494The
495.CW entry
496field is an index into the parallel Venti directory.
497The
498.CW gen
499field must match the
500.CW gen
501field in the corresponding
502.CW VtEntry
503in the directory;
504it is used to detect
505stale indices.
506Similarly,
507.CW mentry
508and
509.CW mgen
510are the index and generation number
511for the metadata Venti file,
512if the
513.CW DirEntry
514describes a vac directory.
515.PP
516The relation between Venti files and directories and
517vac files and directories can be seen in this figure:
518.PS
519.ps 8
520.vs 10
521boxwid=0.1
522boxht=0.1
523f=0.9
524mb=0.16
525
526VtRoot: [
527	right
528	B1: box
529	move right 0.1
530	"\f(CWVtRoot\fP" ljust
531]
532
533SuperRoot: [
534	right
535	B1: box fill f
536	move right 0.1
537	"fs root block" ljust
538] with .nw at VtRoot.sw + (0.2, -.2)
539Root: [
540	right
541	B1: box fill f
542	B2: box fill f
543	B3: box fill f
544	move right 0.1
545	"root directory info block" ljust
546] with .nw at SuperRoot.sw+(0.2, -.2)
547Level1: [
548	RootMeta: [
549		box wid mb
550		move right 0.1
551		"root metadata" ljust
552	]
553	MetaSource: [
554		right
555		B1: box wid mb
556		B2: box wid mb
557		B3: box wid mb
558		B4: box wid mb
559		B5: box wid mb
560	] with .nw at RootMeta.sw+(0,-.2)
561	MB1: box wid mb invis at MetaSource.B1
562	MB2: box wid mb invis at MetaSource.B2
563	MB3: box wid mb invis at MetaSource.B3
564	MB4: box wid mb invis at MetaSource.B4
565	MB5: box wid mb invis at MetaSource.B5
566
567	Source: [
568		right
569		B1: box fill f
570		B2: box fill f
571		B3: box fill f
572		B4: box fill f
573		B5: box fill f
574		B6: box fill f
575		B7: box fill f
576		B8: box fill f
577	] with .nw at MetaSource.sw+(0,-.1)
578	SB1: box invis at Source.B1
579	SB2: box invis at Source.B2
580	SB3: box invis at Source.B3
581	SB4: box invis at Source.B4
582	SB5: box invis at Source.B5
583	SB6: box invis at Source.B6
584	SB7: box invis at Source.B7
585	SB8: box invis at Source.B8
586] with .nw at Root.sw+(0.4,-.2)
587Level2: [
588	MetaSource: [
589		right
590		B1: box wid mb
591		B2: box wid mb
592		B3: box wid mb
593		B4: box wid mb
594		B5: box wid mb
595	]
596	Source: [
597		right
598		B1: box fill f
599		B2: box fill f
600		B3: box fill f
601		B4: box fill f
602		B5: box fill f
603		B6: box fill f
604		B7: box fill f
605		B8: box fill f
606	] with .nw at MetaSource.sw+(0,-.1)
607	File: box wid 0.8 with .nw at Source.sw+(0,-.2)
608] with .nw at Level1.sw+(0.6,-.2)
609
610line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w
611line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w
612line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w
613line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w
614line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w
615
616line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w
617line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w
618line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w
619
620arrowwid = arrowwid/2
621arrowht = arrowht/2
622line -> from Level1.MB1 to Level1.SB1.n
623line -> from Level1.MB2 to Level1.SB2.n
624line -> from Level1.MB2 to Level1.SB3.n
625line -> from Level1.MB4 to Level1.SB7.n
626line -> from Level1.MB5 to Level1.SB5.n
627arrowwid = arrowwid * 2
628arrowht = arrowht * 2
629
630box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
631box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
632box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2
633
634[
635	KEY: box wid 1.5 invis "Key"
636	line from KEY.sw to KEY.se
637	k = -.1
638	kk=0.5
639	A: [
640		box wid 4*boxwid
641		"Venti file" ljust with .w at last box .w+(kk,0)
642	] with .nw at KEY.sw+(0,2*k)
643	B: [
644		box fill f
645		"Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0)
646	] with .nw at A.sw+(0,k)
647	C: [
648		right
649		CC: box fill f
650		box fill f
651		box fill f
652		box fill f
653		"Venti directory" ljust with .w at CC.w+(kk,0)
654	] with .nw at B.sw+(0,k)
655	D: [
656		line -> right 3*boxwid
657		"Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
658	] with .nw at C.sw+(0,k)
659	DD: [
660		box dotted wid 4*boxwid
661		"Vac file" ljust with .w at last box .w+(kk,0)
662	] with .nw at D.sw+(0,k)
663	E: [
664		box wid mb
665		"Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0)
666	] with .nw at DD.sw+(0,k)
667	G: [
668		box dashed wid 4*boxwid
669		"Vac directory" ljust with .w at last box .w+(kk,0)
670	] with .nw at E.sw+(0,k)
671	H: [
672		arrowwid = arrowwid/2
673		arrowht = arrowht/2
674		line -> right 1.5*boxwid
675		"Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0)
676		arrowwid = arrowwid * 2
677		arrowht = arrowht * 2
678	] with .nw at G.sw+(0,k)
679] with .nw at VtRoot.nw+(3,0)
680.PE
681.LP
682In reality, the story is slightly more complicated.
683The metadata file in a Vac directory
684is not just the concatenation of
685.CW DirEntry
686structures.
687Instead, it is the concatenation of
688.CW MetaBlocks .
689A
690.CW MetaBlock
691contains some number of
692.CW DirEntry
693structures along with a sorted index to make it easy
694to look for a particular
695.CW DirEntry
696by its
697.CW elem
698field.
699The details are in the source code.
700.PP
701As shown in the diagram,
702the root directory of the file system is summarized by
703three
704.CW VtEntry
705structures describing
706the Venti directory for the children of the root,
707the Venti file for the metadata describing the children of the root,
708and a Venti file holding metadata for the root directory itself.
709These
710.CW VtEntry
711structures are placed in a Venti directory of their own,
712described by the single
713.CW VtEntry
714in the
715root block.
716.NH 1
717Fossil file system format
718.HP
719Fossil uses the vac format, with some small changes.
720The changes only affect the data on the local disk; the data
721archived to Venti is exactly in vac format.
722.PP
723Blocks stored on local disk may contain scores pointing at local disk
724blocks or at Venti blocks.
725Local block addresses are stored as 20-byte scores in which the first 16 bytes
726are all zero and the last 4 bytes specify a block number in the disk.
727Before a block is archived, all the
728blocks it points to must be archived, and the local scores in the block
729must be changed to Venti scores.
730Using block addresses rather than content hashes for local data
731makes the local file system easier to manage: if a local block's contents
732change, the pointer to the block does not need to change.
733.NH 2
734Snapshots
735.HP
736Fossil is an archival file server.
737It takes periodic snapshots of the file system,
738which are made accessible through the file system.
739Specifically, the active file system is presented in
740.CW /active .
741Ephemeral snapshots (those that are kept on local disk and eventually deleted)
742are presented in
743\f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR,
744where
745.I yyyy
746is the full year,
747.I mm
748is the month number,
749.I dd
750is the day number,
751.I hh
752is the hour,
753and
754.I mm
755is the minute.
756Archival snapshots (those that are archived to Venti and persist forever)
757are presented in
758\f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR,
759where
760.I yyyy ,
761.I mm ,
762and
763.I dd
764are year, month, and day as before,
765and
766.I s
767is a sequence number if more than one
768archival snapshot is done in a day.
769For the first snapshot,
770.I s
771is null.
772For the subsequent snapshots,
773.I s
774is
775.CW .1 ,
776.CW .2 ,
777.CW .3 ,
778etc.
779.PP
780To implement the snapshots, the file server maintains a
781current
782.I epoch
783for the active file system.
784Each local block has a label that records, among other things,
785the epoch in which the block was allocated.
786If a block was allocated in an epoch earlier than the current one,
787it is immutable and treated as copy-on-write.
788Taking a snapshot can be accomplished by
789recording the address of the current root block and then
790incrementing the epoch number.
791Notice that the copy-on-write method makes
792snapshots both time efficient and space efficient.
793The only time cost is waiting for all current file system
794requests to finish and then incrementing a counter.
795After a snapshot, blocks only get copied when they are
796next modified, so the per-snapshot
797space requirement is proportional
798to the amount of new data rather than the total
799size of the file system.
800.PP
801The blocks in the archival snapshots are moved to Venti,
802but the blocks in the ephemeral snapshots take up space
803in the local disk file.
804To allow reclamation of this disk space, the file system
805maintains a
806.I low
807.I epoch ,
808which is the epoch of the earliest ephemeral snapshot
809still available.
810Fossil only allows access to snapshots with epoch numbers
811between the
812low epoch and the current epoch
813(also called the high epoch).
814Incrementing the low epoch thus makes old
815snapshots inaccessible.
816The space required to store those snapshots can then
817be reclaimed, as described below.
818.NH 2
819Local blocks
820.HP
821The bulk of the local disk file is the local blocks.
822Each block has a 14-byte label associated with it, of the format:
823.P1
824Label:
825.ta +\w'    'u +\w'                'u
826	state[1]	\fRblock state\fP
827	type[1]	\fRblock type\fP
828	epoch[4]	\fRallocation epoch\fP
829	epochClose[4]	\fRclose epoch\fP
830	tag[4]	\fRrandom tag\fP
831.P2
832.LP
833The
834.CW type
835is an analogue of the block types described earlier,
836though different names are used, to distinguish between
837pointers blocks in a hash tree for a data stream
838and pointer blocks for a directory stream.
839The
840.CW epoch
841was mentioned in the last section.
842The other fields are explained below.
843.PP
844There are two distinguished blocks states
845.CW BsFree
846.CW 0x00 ) (
847and
848.CW BsBad
849.CW 0xFF ), (
850which mark blocks that are available for allocation
851and blocks that are bad and should be avoided.
852If
853.CW state
854is not one of these values, it is a bitwise
855.I or ' `
856of the following flags:
857.P1
858.ta +\w'      'u +\w'                'u
8590x01	BsAlloc	\fRblock is in use\fP
8600x02	BsCopied	\fRblock has been copied\fP
8610x04	BsVenti	\fRblock has been stored on Venti\fP
8620x08	BsClosed	\fRblock has been unlinked from active file system\fP
863.P2
864.LP
865The flags are explained as they arise in the discussions below.
866.PP
867It is convenient to store some extra fields in the
868.CW VtEntry
869structure when it describes a Venti file or directory
870stored on local disk.
871Specifically, we set the
872.CW VtEntryLocal
873flag bit
874and then use the bytes 7-16 of the score (which would
875otherwise be zero, since it is a local score) to hold these fields:
876.P1
877.ta +\w'    'u +\w'                'u
878	archive[1]	\fRboolean: this is an archival snapshot\fP
879	snap[4]	\fRepoch number if root of snapshot\fP
880	tag[4]	\fRrandom tag\fP
881.P2
882.LP
883The extended
884.CW VtEntry
885structure is called an
886.CW Entry .
887The
888.CW tag
889field
890in the
891.CW Label
892and the
893.CW Entry
894is used to identify dangling pointers or other file system corruption:
895all the local blocks in a hash tree must
896have tags matching the tag in the
897.CW Entry .
898If this
899.CW Entry
900points at the root of a snapshot,
901the
902.CW snap
903field is the epoch of the snapshot.
904If the snapshot is intended to be archived to Venti,
905the
906.CW archive
907field is non-zero.
908.NH 2
909Block reclamation
910.HP
911The blocks in the active file system form a tree: each
912block has only one parent.
913Once a copy-on-write block
914.I b
915is replaced by its copy, it is no longer
916needed by the active file system.
917At this point,
918.I b
919is unlinked from the active file system.
920We say that
921.I b
922is now
923.I closed :
924it is needed only for snapshots.
925When a block is closed, the
926.CW BsClosed
927bit is set in its state, and the current epoch (called the block's closing epoch)
928is stored in the
929.CW epochClose
930label field.
931(Open blocks have an
932.CW epochClose
933of
934.CW ~0 ).
935.PP
936A block is referenced by snapshots with epochs
937between the block's allocation epoch and its closing epoch.
938Once the file system's low epoch grows to be greater than or equal to the block's
939closing epoch, the block is no longer needed for any snapshots
940and can be reused.
941.PP
942In a typical configuration, where nightly archival snapshots
943are taken and written to Venti, it is desirable to reclaim
944the space occupied by now-archived blocks if possible.
945To do this, Fossil keeps track of whether the pointers
946in each block are unique to that block.
947When a block
948.I bb
949is allocated, a pointer to
950.I bb
951is written into exactly one active block (say,
952.I b ).
953In the absence of snapshots, the pointer to
954.I bb
955will remain unique to
956.I b ,
957so that if the pointer is zeroed,
958.I bb
959can be immediately reused.
960Snapshots complicate this invariant:
961when
962.I b
963is copied-on-write, all its pointers
964are no longer unique to it.
965At time of the copy, the
966.CW BsCopied
967state bit in the block's label
968is set to note the duplication of the pointers contained within.
969.NH 2
970Disk layout
971.HP
972The file system header describes the file system layout and has this format:
973.P1
974.ta +\w'    'u +\w'                'u
975Header:
976	magic[4]	\fR0x3776AE89 (HeaderMagic)\fP
977	version[2]	\fR1 (HeaderVersion)\fP
978	blockSize[2]	\fIfile system block size\fP
979	super[4]	\fRblock offset of super block\fP
980	label[4]	\fRblock offset of labels\fP
981	data[4]	\fRdata blocks\fP
982	end[4]	\fRend of file system\fP
983.P2
984.LP
985The corresponding file system layout is:
986.PS
987.ps 8
988.vs 9
989boxwid=0.75
990boxht=0.15
991Empty: box "empty" ht 0.25
992Header: box "header" with .n at Empty.s
993Empty2: box "empty" with .n at Header.s
994Super: box "super block" with .n at Empty2.s
995Label: box "label" "blocks" with .n at Super.s ht 0.25
996Data: box "data" "blocks" with .n at Label.s ht 0.3
997"  0" ljust at Empty.ne
998"  128kB" ljust at Header.ne
999"  \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne
1000"  \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne
1001"  \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne
1002"  \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se
1003"" at (-1,0)
1004"" at (6,0)
1005.PE
1006.LP
1007The numbers to the right of the blocks are byte offsets
1008of the boundaries.
1009.LP
1010The super block describes the file system itself and looks like:
1011.P1
1012.ta +\w'    'u +\w'                'u
1013Super:
1014	magic[4]	\fR0x2340A3B1 (SuperMagic)\fP
1015	version[2]	\fR1 (SuperVersion)\fP
1016	epochLow[4]	\fRfile system low epoch\fP
1017	epochHigh[4]	\fRfile system high (active) epoch\fP
1018	qid[8]	\fRnext qid to allocate\fP
1019	active[4]	\fRdata block number: root of active file system\fP
1020	next[4]	\fRdata block number: root of next file system to archive\fP
1021	current[4]	\fRdata block number: root of file system currently being archived\fP
1022	last[20]	\fRVenti score of last successful archive\fP
1023	name[128]	\fRname of file system (just a comment)\fP
1024.P2
1025.LP
1026.NH 1
1027Fossil server
1028.HP
1029The Fossil server is a user-space program that runs on a standard Plan 9 kernel.
1030.NH 2
1031Process structure
1032.PP
1033The file server is structured as a set of processes synchronizing
1034mostly through message passing along queues.
1035The processes are given names, which can be seen in the output of
1036.CW ps
1037.CW -a .
1038.PP
1039.CW Listen
1040processes announce on various network addresses.
1041A
1042.CW con
1043process handles each incoming connection, reading 9P requests
1044and adding them to a central message queue.
1045.CW Msg
1046processes remove 9P requests from the queue,
1047handle them, and write the responses to the appropriate
1048file descriptors.
1049.PP
1050The
1051.CW disk
1052process handles disk I/O requests made by the other processes.
1053The
1054.CW flush
1055process writes dirty blocks from the in-memory block cache to disk.
1056The
1057.CW unlink
1058process frees previously linked blocks once the blocks that point at them
1059have been written to disk.
1060.PP
1061A
1062.CW consI
1063reads from each console file (typically a pipe posted in
1064.CW /srv ),
1065adding the typed characters to the input queue.
1066The
1067.CW cons
1068process echoes input and runs the commands, saving
1069output in a ring buffer.
1070Because there is only one
1071.CW cons
1072process, only one console command may be executing at a time.
1073A
1074.CW consO
1075process copies this ring buffer to the each console file.
1076.PP
1077The
1078.CW periodic
1079process runs periodic events, like
1080flushing the root metadata to disk or
1081taking snapshots of the file system.
1082.NH 2
1083Block cache
1084.HP
1085Fossil maintains an in-memory block cache which
1086holds both local disk blocks and Venti blocks.
1087Cache eviction follows a least recently used policy.
1088Dirty blocks are restricted to at most half the cache.
1089This can be changed by editing
1090.CW DirtyPercentage
1091in
1092.CW dat.h .
1093.PP
1094The block cache uses soft updates [1] to ensure that the on-disk
1095file system is always self-consistent.
1096Thus there is no
1097.I halt
1098console command
1099and no need to check a file system
1100that was shut down without halting.
1101.NH 2
1102Archiving
1103.HP
1104A background process writes blocks in archival snapshots to Venti.
1105Although
1106.CW /archive/\fIyyyy\fP/\fImmdds\fR
1107is a copy of only
1108.CW /active
1109at the time of the snapshot,
1110the archival process archives the
1111entire file tree rather than just
1112the subtree rooted at
1113.CW /active .
1114The snapshots
1115.CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm
1116are stored as empty directories.
1117Once all the blocks have been archived,
1118a
1119.CW VtRoot
1120header for the file system is archived.
1121The score of that header is recorded in
1122.CW super.score
1123and also printed on the file server console.
1124The score can used by
1125.I flfmt
1126to restore a file system (see
1127.I fossil (4)).
1128.NH 2
1129Contrast with the old file server
1130.HP
1131The most obvious difference between Fossil and the
1132old Plan 9 file server [2] is that Fossil uses a Venti server as
1133its archival storage in place of a WORM juke box.
1134There are a few other architectural differences to be
1135aware of.
1136.PP
1137Fossil is a user-level program run on a standard kernel.
1138.PP
1139Fossil does not have any way to concatenate, stripe, or
1140mirror disk files.  For functionality similar to the old file server's
1141configuration strings, use the experimental file stack device
1142(see
1143.I devfs (3)).
1144.PP
1145Fossil speaks only 9P2000.  Old 9P (aka 9P1) is not supported.
1146.PP
1147... XXX words about converting an old file system to fossil?
1148.NH 1
1149References
1150.LP
1151[1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules,
1152and Yale N. Patt.
1153``Soft Updates: A Solution to the Metadata Update Problem
1154in File Systems,''
1155.I "ACM Transactions on Computer Systems" ,
1156Vol 18., No. 2, May 2000, pp. 127\-153.
1157.LP
1158[2] Sean Quinlan, ``A Cached WORM File System,''
1159.I "Software\(emPractice and Experience" ,
1160Vol 21., No 12., December 1991, pp. 1289\-1299.
1161.LP
1162[3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,''
1163.I "Usenix Conference on File and Storage Technologies" ,
11642002.
1165