1... .FP times 2... .fp 1 R R.nomath 3... .fp 5 CW LucidaSansCW83 4.TL 5Fossil, an Archival File Server 6.AU 7Sean Quinlan 8Jim McKie 9Russ Cox 10jmk,rsc@plan9.bell-labs.com 11.AB 12This paper describes the internals and 13operation of Fossil, an archival file server built for Plan 9. 14Fossil has not yet replaced the current Plan 9 file server 15and 16.CW kfs , 17but that is our eventual intent. 18Both fossil and this documentation are 19works in progress. Comments on either are most welcome. 20.AE 21.de HP 22.LP 23.. 24.NH 1 25Introduction 26.HP 27Fossil is an archival file server built for Plan 9. 28In a typical configuration, it maintains a traditional file system 29in a local disk partition and periodically archives snapshots of the file system 30to a Venti server. These archives are made available through 31a file system interface. 32Fossil can also be run without a Venti server, in which case the 33snapshots (if any) occupy local disk space. 34.PP 35The bulk of this paper explains the underlying data structures: 36Venti trees, the Venti archival file system format, and finally Fossil's 37file system format. 38The end of the paper discusses the architecture of the Fossil server. 39.PP 40The presentation of the data structures is very detailed, perhaps 41too detailed for most readers. 42The intent is to record all the details necessary to make structural 43changes to the file system format. 44Feel free to jump ahead when boredom strikes. 45.NH 1 46Venti trees and directory hierarchies 47.HP 48Venti [3] is an archival block storage server. 49Once a block is stored, it can be retrieved by presenting the 20-byte 50SHA1 hash of its contents, called a 51.I score . 52Blocks on Venti have a maximum length of about 56 kilobytes, 53though in practice smaller blocks are used. 54To store a byte stream of arbitrary length, Venti uses a hash tree. 55Conceptually, the data stream is broken into fixed-size (say, 56.I dsize -byte) 57chunks, which are stored on the Venti server. 58The resulting scores are concatenated into a new pointer stream, which is 59broken into fixed size (say, 60.I psize -byte) 61chunks, which are stored on the Venti server. 62.I Psize "" ( 63is different from 64.I dsize 65so that we can ensure that each pointer block holds an 66integral number of pointers.) 67This yields a new pointer stream, and so on, until there is a single block 68and finally a single score describing the entire tree. 69The resulting structure looks like: 70.PS 71.ps 8 72.vs 10 73boxht=0.1 74boxwid=0.1 75 76B0: box invis wid 1 "\f(CWVtDataType\fP" 77move right 0.1 78L0a: box wid 0.2 79move right 0.1 80L0b: box wid 0.2 81move right 0.1 82L0c: box invis wid 0.2 "..." 83move right 0.1 84 85L0d: box wid 0.2 86move right 0.1 87L0e: box wid 0.2 88move right 0.2 89L0f: box invis wid 0.2 "..." 90move right 0.2 91 92L0g: box wid 0.2 93move right 0.1 94L0h: box wid 0.2 95move right 0.1 96L0i: box invis wid 0.2 "..." 97move right 0.1 98 99L0j: box wid 0.2 100move right 0.1 101L0k: box wid 0.2 102move right 0.1 103L0l: box invis wid 0.2 "..." 104move right 0.1 105L0m: box wid 0.2 106 107define boxppddd { 108 line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se> 109 line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se> 110 X: box invis at 0.1<$1.nw,$1.ne> 111 Y: box invis at 0.1<$1.sw,$1.se> 112 line -> from 0.5<X,Y> to $2.nw 113 X: box invis at 0.3<$1.nw,$1.ne> 114 Y: box invis at 0.3<$1.sw,$1.se> 115 line -> from 0.5<X,Y> to $3.nw 116 "..." at 0.7<$1.w,$1.e> 117} 118 119define boxppdddp { 120 line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se> 121 line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se> 122 line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se> 123 X: box invis at 0.1<$1.nw,$1.ne> 124 Y: box invis at 0.1<$1.sw,$1.se> 125 line -> from 0.5<X,Y> to $2.nw 126 X: box invis at 0.3<$1.nw,$1.ne> 127 Y: box invis at 0.3<$1.sw,$1.se> 128 line -> from 0.5<X,Y> to $3.nw 129 "..." at 0.6<$1.w,$1.e> 130 X: box invis at 0.9<$1.nw,$1.ne> 131 Y: box invis at 0.9<$1.sw,$1.se> 132 line -> from 0.5<X,Y> to $4.nw 133} 134 135define boxpdddp { 136 line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se> 137 line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se> 138 X: box invis at 0.1<$1.nw,$1.ne> 139 Y: box invis at 0.1<$1.sw,$1.se> 140 line -> from 0.5<X,Y> to $2.nw 141 "..." at 0.5<$1.w,$1.e> 142 X: box invis at 0.9<$1.nw,$1.ne> 143 Y: box invis at 0.9<$1.sw,$1.se> 144 line -> from 0.5<X,Y> to $3.nw 145} 146 147bhd=0.4 148L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd) 149boxppddd(L1abc, L0a, L0b) 150L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd) 151boxppddd(L1def, L0d, L0e) 152L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd) 153boxppddd(L1ghi, L0g, L0h) 154L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd) 155boxppdddp(L1jklm, L0j, L0k, L0m) 156B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd) 157 158L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd) 159boxppddd(L2abcdef, L1abc, L1def) 160L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd) 161boxpdddp(L2ghijklm, L1ghi, L1jklm) 162B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd) 163 164L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd) 165boxpdddp(L3atom, L2abcdef, L2ghijklm) 166B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd) 167.PE 168.LP 169The leaves are the original data stream. Those blocks have type 170.CW VtDataType . 171The first pointer stream has type 172.CW VtPointerType0 , 173the next has type 174.CW VtPointerType1 , 175and so on. 176The figure ends with a single block of type 177.CW VtPointerType2 , 178but in general trees can have height up to 179.CW VtPointerType6 . 180For a 181.I dsize 182of 8192 bytes 183and 184.I psize 185of 8180 bytes (409 pointers), 186this gives a maximum stream size of approximately 10 zettabytes 187(2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes). 188.PP 189Data blocks are truncated to remove trailing runs of zeros before 190storage to Venti; they are zero-filled back to 191.I dsize 192bytes after retrieval from Venti. 193Similarly, trailing runs of pointers to zero-length blocks are 194removed from and added back to pointer blocks. 195These simple rules happen to make it particularly efficient to store 196large runs of zeros, as might occur in a data stream with ``holes:'' 197the zero-length block itself can be interpreted as a tree of any 198depth encoding an all-zero data stream. 199.PP 200Reconstructing the data stream requires the score and type of the 201topmost block in the tree, the data chunk size, the pointer chunk size, 202and the data stream size. 203(From the data stream size and the chunk sizes we could derive the 204depth of the tree and thus the type of the topmost block, but it is convenient 205to allow trees that are deeper than necessary.) 206This information is kept in a 40-byte structure called a 207.CW VtEntry : 208.P1 209VtEntry: 210.ta +\w' 'u +\w' 'u 211 gen[4] \fRgeneration number\fP 212 psize[2] \fRsize of pointer blocks\fP 213 dsize[2] \fRsize of data blocks\fP 214 flags[1] 215 zero[5] 216 size[6] \fRlength of file\fP 217 score[20] \fRscore of root block in tree\fP 218.P2 219(In this notation, 220.CW name[sz] 221indicates a 222.CW sz -byte 223field called 224.CW name . 225Integers are stored in big-endian order. 226.CW Size 227really is a 48-bit field.) 228.CW Flags 229is made up of the following bit fields. 230.P1 231.ta +\w' 'u +\w' 'u 2320x01 VtEntryActive \fRentry is allocated\fP 2330x02 VtEntryDir \fRentry describes a Venti directory (q.v.)\fP 2340x1C VtEntryDepthMask \fRmask for tree depth\fP 2350x20 VtEntryLocal \fRreserved (q.v.)\fP 236.P2 237.LP 238The depth of the described tree is stored in the 3 bits indicated: 239a tree with a topmost node of type 240.CW VtPointerType3 241has depth 4. 242.PP 243With 244.CW VtEntry 245we can build more complicated data structures, 246ones with multiple or nested data streams. 247A data stream consisting of 248.CW VtEntry 249structures is called a Venti directory. 250It is identical in structure to the Venti data stream 251we described earlier except that the bottom-level type is 252.CW VtDirType , 253and 254the 255.CW VtEntry 256describing a Venti directory has the 257.CW VtEntryDir 258flag bit set. 259The 260.I dsize 261for a Venti directory 262is a multiple of 40 so that each data chunk holds 263an integer number of 264.CW VtEntry 265structures. 266By analogy with Venti directories, 267we call the original data stream a 268Venti file. 269Note that Venti files are assumed 270.I not 271to contain pointers to other Venti blocks. 272The only pointers to Venti blocks occur in 273.CW VtEntry 274structures in 275Venti directories 276(and in the internal hash tree structure of the 277individual files and directories). 278Note also that these directories are nothing more than pointer lists. 279In particular, there are no names or metadata like in a file system. 280.PP 281To make it easier to pass hierarchies between applications, 282the root of a hierarchy is described in a 300-byte structure 283called a 284.CW VtRoot : 285.P1 286VtRoot: 287.ta +\w' 'u +\w' 'u 288 version[2] \f(CW2\fP 289 name[128] \fRname of structure (just a comment)\fP 290 type[128] \fRstring describing structure (\f(CWvac\fR)\f(CW 291 score[20] \fRpointer to \f(CWVtDirType\fP block\f(CW 292 blockSize[2] \fRmaximum block size in structure\fP 293 prev[20] \fRprevious \f(CWVtRoot\fP in chain, if any\fP 294.P2 295.LP 296This structure is stored to Venti and its score is passed 297between applications, typically in the form 298``\fItype\f(CW:\fIrootscore\fR,'' 299where 300.I type 301is the type field from the 302.CW VtRoot 303structure, and 304.I rootscore 305is the score of the 306.CW VtRoot 307block. 308.CW VtRoot 309structures can be chained together using the 310.I prev 311field to encode an archival history 312of the data structure. 313.PP 314For example, a small Venti hierarchy might look like: 315.PS 316.ps 8 317.vs 10 318boxwid=0.1 319boxht=0.1 320f=0.9 321mb=0.16 322 323VtRoot: [ 324 right 325 B1: box 326 move right 0.1 327 "\f(CWVtRoot\fP" ljust 328] 329 330Root: [ 331 right 332 B1: box fill f 333 B2: box fill f 334 B3: box fill f 335 move right 0.1 336] with .nw at VtRoot.sw+(0.2,-.1) 337Level1: [ 338 RootMeta: [ 339 box wid mb 340 ] 341 MetaSource: [ 342 right 343 B1: box wid 5*mb 344 ] with .nw at RootMeta.sw+(0,-.1) 345 346 Source: [ 347 right 348 B1: box fill f 349 B2: box fill f 350 B3: box fill f 351 B4: box fill f 352 B5: box fill f 353 B6: box fill f 354 B7: box fill f 355 B8: box fill f 356 ] with .nw at MetaSource.sw+(0,-.1) 357 SB1: box invis at Source.B1 358 SB2: box invis at Source.B2 359 SB3: box invis at Source.B3 360] with .nw at Root.sw+(0.4,-.1) 361Level2: [ 362 MetaSource: [ 363 right 364 B1: box wid 5*mb 365 ] 366 Source: [ 367 right 368 B1: box fill f 369 B2: box fill f 370 B3: box fill f 371 B4: box fill f 372 B5: box fill f 373 B6: box fill f 374 B7: box fill f 375 B8: box fill f 376 ] with .nw at MetaSource.sw+(0,-.1) 377 File: box wid 0.8 with .nw at Source.sw+(0,-.1) 378] with .nw at Level1.sw+(0.6,-.1) 379 380line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w 381line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w 382line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w 383line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w 384 385line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w 386line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w 387line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w 388 389[ 390 KEY: box wid 1.5 invis "Key" 391 line from KEY.sw to KEY.se 392 k = -.1 393 kk=0.5 394 A: [ 395 box wid 4*boxwid 396 "Venti file" ljust with .w at last box .w+(kk,0) 397 ] with .nw at KEY.sw+(0,2*k) 398 B: [ 399 box fill f 400 "Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0) 401 ] with .nw at A.sw+(0,k) 402 C: [ 403 right 404 CC: box fill f 405 box fill f 406 box fill f 407 box fill f 408 "Venti directory" ljust with .w at CC.w+(kk,0) 409 ] with .nw at B.sw+(0,k) 410 D: [ 411 line -> right 3*boxwid 412 "Venti pointer (score)" ljust with .w at last line .w+(kk, 0) 413 ] with .nw at C.sw+(0,k) 414] with .nw at VtRoot.nw+(3,0) 415.PE 416.LP 417Venti files are shown as white boxes, while directories are shown 418as shaded boxes. Each shaded square represents a 419.CW VtEntry . 420Arrows represent pointers from 421.CW VtEntry 422structures to other 423Venti files or directories. 424.PP 425The hierarchical structure provided by Venti files and directories 426can be used as the base for more complicated data structures. 427Because this structure captures all the information 428about pointers to other blocks, tools written to traverse 429Venti hierarchies can traverse the more complicated 430data structures as well. 431For example, 432.I venti/copy 433(see 434.I ventiaux (8)) 435copies a Venti hierarchy from one Venti server to another, 436given the root 437.CW VtEntry . 438Because the traditional file system described in later sections is 439layered on a Venti hierarchy, 440.I venti/copy 441can copy it without fully understanding its structure. 442.NH 1 443Vac file system format 444.HP 445The Venti archive format 446.I vac 447builds a traditional file system using a Venti hierarchy. 448Each vac file is implemented as a Venti file; 449each vac directory is implemented as a Venti 450directory and a Venti file to provide traditional file system metadata. 451The metadata is stored in a structure called a 452.CW DirEntry : 453.P1 454DirEntry: 455.ta +\w' 'u +\w' 'u 456 magic[4] \f(CW0x1c4d9072\fP (DirMagic)\fP 457 version[2] \f(CW9\fP 458 elem[s] \fRname (final path element only)\fP 459 entry[4] \fRentry number for Venti file or directory\fP 460 gen[4] \fRgeneration number\fP 461 mentry[4] \fRentry number for Venti file holding metadata\fP 462 mgen[4] \fRgeneration number\fP 463 qid[8] \fRunique file serial number\fP 464 uid[s] \fRowner\fP 465 gid[s] \fRgroup\fP 466 mid[s] \fRlast modified by\fP 467 mtime[4] \fRlast modification time\fP 468 ctime[4] \fRcreation time\fP 469 atime[4] \fRlast access time\fP 470 mode[4] \fRmode bits\fP 471.P2 472The notation 473.CW name[s] 474denotes a string stored as a two-byte length 475and then that many bytes. 476The above describes Version 9 of the 477.CW DirEntry 478format. Versions 7 and 8 are very similar; they can be 479read by the current 480.I vac 481source code but are not written. 482Earlier versions were not widespread. 483A 484.CW DirEntry 485may be followed by optional extension sections, though none 486are currently used. 487The 488.CW mode 489bits include bits commonly used by 490Unix and Windows, in addition to those used by Plan 9. 491.PP 492The 493.CW entry 494field is an index into the parallel Venti directory. 495The 496.CW gen 497field must match the 498.CW gen 499field in the corresponding 500.CW VtEntry 501in the directory; 502it is used to detect 503stale indices. 504Similarly, 505.CW mentry 506and 507.CW mgen 508are the index and generation number 509for the metadata Venti file, 510if the 511.CW DirEntry 512describes a vac directory. 513.PP 514The relation between Venti files and directories and 515vac files and directories can be seen in this figure: 516.PS 517.ps 8 518.vs 10 519boxwid=0.1 520boxht=0.1 521f=0.9 522mb=0.16 523 524VtRoot: [ 525 right 526 B1: box 527 move right 0.1 528 "\f(CWVtRoot\fP" ljust 529] 530 531SuperRoot: [ 532 right 533 B1: box fill f 534 move right 0.1 535 "fs root block" ljust 536] with .nw at VtRoot.sw + (0.2, -.2) 537Root: [ 538 right 539 B1: box fill f 540 B2: box fill f 541 B3: box fill f 542 move right 0.1 543 "root directory info block" ljust 544] with .nw at SuperRoot.sw+(0.2, -.2) 545Level1: [ 546 RootMeta: [ 547 box wid mb 548 move right 0.1 549 "root metadata" ljust 550 ] 551 MetaSource: [ 552 right 553 B1: box wid mb 554 B2: box wid mb 555 B3: box wid mb 556 B4: box wid mb 557 B5: box wid mb 558 ] with .nw at RootMeta.sw+(0,-.2) 559 MB1: box wid mb invis at MetaSource.B1 560 MB2: box wid mb invis at MetaSource.B2 561 MB3: box wid mb invis at MetaSource.B3 562 MB4: box wid mb invis at MetaSource.B4 563 MB5: box wid mb invis at MetaSource.B5 564 565 Source: [ 566 right 567 B1: box fill f 568 B2: box fill f 569 B3: box fill f 570 B4: box fill f 571 B5: box fill f 572 B6: box fill f 573 B7: box fill f 574 B8: box fill f 575 ] with .nw at MetaSource.sw+(0,-.1) 576 SB1: box invis at Source.B1 577 SB2: box invis at Source.B2 578 SB3: box invis at Source.B3 579 SB4: box invis at Source.B4 580 SB5: box invis at Source.B5 581 SB6: box invis at Source.B6 582 SB7: box invis at Source.B7 583 SB8: box invis at Source.B8 584] with .nw at Root.sw+(0.4,-.2) 585Level2: [ 586 MetaSource: [ 587 right 588 B1: box wid mb 589 B2: box wid mb 590 B3: box wid mb 591 B4: box wid mb 592 B5: box wid mb 593 ] 594 Source: [ 595 right 596 B1: box fill f 597 B2: box fill f 598 B3: box fill f 599 B4: box fill f 600 B5: box fill f 601 B6: box fill f 602 B7: box fill f 603 B8: box fill f 604 ] with .nw at MetaSource.sw+(0,-.1) 605 File: box wid 0.8 with .nw at Source.sw+(0,-.2) 606] with .nw at Level1.sw+(0.6,-.2) 607 608line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w 609line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w 610line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w 611line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w 612line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w 613 614line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w 615line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w 616line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w 617 618arrowwid = arrowwid/2 619arrowht = arrowht/2 620line -> from Level1.MB1 to Level1.SB1.n 621line -> from Level1.MB2 to Level1.SB2.n 622line -> from Level1.MB2 to Level1.SB3.n 623line -> from Level1.MB4 to Level1.SB7.n 624line -> from Level1.MB5 to Level1.SB5.n 625arrowwid = arrowwid * 2 626arrowht = arrowht * 2 627 628box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2 629box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2 630box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2 631 632[ 633 KEY: box wid 1.5 invis "Key" 634 line from KEY.sw to KEY.se 635 k = -.1 636 kk=0.5 637 A: [ 638 box wid 4*boxwid 639 "Venti file" ljust with .w at last box .w+(kk,0) 640 ] with .nw at KEY.sw+(0,2*k) 641 B: [ 642 box fill f 643 "Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0) 644 ] with .nw at A.sw+(0,k) 645 C: [ 646 right 647 CC: box fill f 648 box fill f 649 box fill f 650 box fill f 651 "Venti directory" ljust with .w at CC.w+(kk,0) 652 ] with .nw at B.sw+(0,k) 653 D: [ 654 line -> right 3*boxwid 655 "Venti pointer (score)" ljust with .w at last line .w+(kk, 0) 656 ] with .nw at C.sw+(0,k) 657 DD: [ 658 box dotted wid 4*boxwid 659 "Vac file" ljust with .w at last box .w+(kk,0) 660 ] with .nw at D.sw+(0,k) 661 E: [ 662 box wid mb 663 "Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0) 664 ] with .nw at DD.sw+(0,k) 665 G: [ 666 box dashed wid 4*boxwid 667 "Vac directory" ljust with .w at last box .w+(kk,0) 668 ] with .nw at E.sw+(0,k) 669 H: [ 670 arrowwid = arrowwid/2 671 arrowht = arrowht/2 672 line -> right 1.5*boxwid 673 "Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0) 674 arrowwid = arrowwid * 2 675 arrowht = arrowht * 2 676 ] with .nw at G.sw+(0,k) 677] with .nw at VtRoot.nw+(3,0) 678.PE 679.LP 680In reality, the story is slightly more complicated. 681The metadata file in a Vac directory 682is not just the concatenation of 683.CW DirEntry 684structures. 685Instead, it is the concatenation of 686.CW MetaBlocks . 687A 688.CW MetaBlock 689contains some number of 690.CW DirEntry 691structures along with a sorted index to make it easy 692to look for a particular 693.CW DirEntry 694by its 695.CW elem 696field. 697The details are in the source code. 698.PP 699As shown in the diagram, 700the root directory of the file system is summarized by 701three 702.CW VtEntry 703structures describing 704the Venti directory for the children of the root, 705the Venti file for the metadata describing the children of the root, 706and a Venti file holding metadata for the root directory itself. 707These 708.CW VtEntry 709structures are placed in a Venti directory of their own, 710described by the single 711.CW VtEntry 712in the 713root block. 714.NH 1 715Fossil file system format 716.HP 717Fossil uses the vac format, with some small changes. 718The changes only affect the data on the local disk; the data 719archived to Venti is exactly in vac format. 720.PP 721Blocks stored on local disk may contain scores pointing at local disk 722blocks or at Venti blocks. 723Local block addresses are stored as 20-byte scores in which the first 16 bytes 724are all zero and the last 4 bytes specify a block number in the disk. 725Before a block is archived, all the 726blocks it points to must be archived, and the local scores in the block 727must be changed to Venti scores. 728Using block addresses rather than content hashes for local data 729makes the local file system easier to manage: if a local block's contents 730change, the pointer to the block does not need to change. 731.NH 2 732Snapshots 733.HP 734Fossil is an archival file server. 735It takes periodic snapshots of the file system, 736which are made accessible through the file system. 737Specifically, the active file system is presented in 738.CW /active . 739Ephemeral snapshots (those that are kept on local disk and eventually deleted) 740are presented in 741\f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR, 742where 743.I yyyy 744is the full year, 745.I mm 746is the month number, 747.I dd 748is the day number, 749.I hh 750is the hour, 751and 752.I mm 753is the minute. 754Archival snapshots (those that are archived to Venti and persist forever) 755are presented in 756\f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR, 757where 758.I yyyy , 759.I mm , 760and 761.I dd 762are year, month, and day as before, 763and 764.I s 765is a sequence number if more than one 766archival snapshot is done in a day. 767For the first snapshot, 768.I s 769is null. 770For the subsequent snapshots, 771.I s 772is 773.CW .1 , 774.CW .2 , 775.CW .3 , 776etc. 777.PP 778To implement the snapshots, the file server maintains a 779current 780.I epoch 781for the active file system. 782Each local block has a label that records, among other things, 783the epoch in which the block was allocated. 784If a block was allocated in an epoch earlier than the current one, 785it is immutable and treated as copy-on-write. 786Taking a snapshot can be accomplished by 787recording the address of the current root block and then 788incrementing the epoch number. 789Notice that the copy-on-write method makes 790snapshots both time efficient and space efficient. 791The only time cost is waiting for all current file system 792requests to finish and then incrementing a counter. 793After a snapshot, blocks only get copied when they are 794next modified, so the per-snapshot 795space requirement is proportional 796to the amount of new data rather than the total 797size of the file system. 798.PP 799The blocks in the archival snapshots are moved to Venti, 800but the blocks in the ephemeral snapshots take up space 801in the local disk file. 802To allow reclamation of this disk space, the file system 803maintains a 804.I low 805.I epoch , 806which is the epoch of the earliest ephemeral snapshot 807still available. 808Fossil only allows access to snapshots with epoch numbers 809between the 810low epoch and the current epoch 811(also called the high epoch). 812Incrementing the low epoch thus makes old 813snapshots inaccessible. 814The space required to store those snapshots can then 815be reclaimed, as described below. 816.NH 2 817Local blocks 818.HP 819The bulk of the local disk file is the local blocks. 820Each block has a 14-byte label associated with it, of the format: 821.P1 822Label: 823.ta +\w' 'u +\w' 'u 824 state[1] \fRblock state\fP 825 type[1] \fRblock type\fP 826 epoch[4] \fRallocation epoch\fP 827 epochClose[4] \fRclose epoch\fP 828 tag[4] \fRrandom tag\fP 829.P2 830.LP 831The 832.CW type 833is an analogue of the block types described earlier, 834though different names are used, to distinguish between 835pointers blocks in a hash tree for a data stream 836and pointer blocks for a directory stream. 837The 838.CW epoch 839was mentioned in the last section. 840The other fields are explained below. 841.PP 842There are two distinguished blocks states 843.CW BsFree 844.CW 0x00 ) ( 845and 846.CW BsBad 847.CW 0xFF ), ( 848which mark blocks that are available for allocation 849and blocks that are bad and should be avoided. 850If 851.CW state 852is not one of these values, it is a bitwise 853.I or ' ` 854of the following flags: 855.P1 856.ta +\w' 'u +\w' 'u 8570x01 BsAlloc \fRblock is in use\fP 8580x02 BsCopied \fRblock has been copied\fP 8590x04 BsVenti \fRblock has been stored on Venti\fP 8600x08 BsClosed \fRblock has been unlinked from active file system\fP 861.P2 862.LP 863The flags are explained as they arise in the discussions below. 864.PP 865It is convenient to store some extra fields in the 866.CW VtEntry 867structure when it describes a Venti file or directory 868stored on local disk. 869Specifically, we set the 870.CW VtEntryLocal 871flag bit 872and then use the bytes 7-16 of the score (which would 873otherwise be zero, since it is a local score) to hold these fields: 874.P1 875.ta +\w' 'u +\w' 'u 876 archive[1] \fRboolean: this is an archival snapshot\fP 877 snap[4] \fRepoch number if root of snapshot\fP 878 tag[4] \fRrandom tag\fP 879.P2 880.LP 881The extended 882.CW VtEntry 883structure is called an 884.CW Entry . 885The 886.CW tag 887field 888in the 889.CW Label 890and the 891.CW Entry 892is used to identify dangling pointers or other file system corruption: 893all the local blocks in a hash tree must 894have tags matching the tag in the 895.CW Entry . 896If this 897.CW Entry 898points at the root of a snapshot, 899the 900.CW snap 901field is the epoch of the snapshot. 902If the snapshot is intended to be archived to Venti, 903the 904.CW archive 905field is non-zero. 906.NH 2 907Block reclamation 908.HP 909The blocks in the active file system form a tree: each 910block has only one parent. 911Once a copy-on-write block 912.I b 913is replaced by its copy, it is no longer 914needed by the active file system. 915At this point, 916.I b 917is unlinked from the active file system. 918We say that 919.I b 920is now 921.I closed : 922it is needed only for snapshots. 923When a block is closed, the 924.CW BsClosed 925bit is set in its state, and the current epoch (called the block's closing epoch) 926is stored in the 927.CW epochClose 928label field. 929(Open blocks have an 930.CW epochClose 931of 932.CW ~0 ). 933.PP 934A block is referenced by snapshots with epochs 935between the block's allocation epoch and its closing epoch. 936Once the file system's low epoch grows to be greater than or equal to the block's 937closing epoch, the block is no longer needed for any snapshots 938and can be reused. 939.PP 940In a typical configuration, where nightly archival snapshots 941are taken and written to Venti, it is desirable to reclaim 942the space occupied by now-archived blocks if possible. 943To do this, Fossil keeps track of whether the pointers 944in each block are unique to that block. 945When a block 946.I bb 947is allocated, a pointer to 948.I bb 949is written into exactly one active block (say, 950.I b ). 951In the absence of snapshots, the pointer to 952.I bb 953will remain unique to 954.I b , 955so that if the pointer is zeroed, 956.I bb 957can be immediately reused. 958Snapshots complicate this invariant: 959when 960.I b 961is copied-on-write, all its pointers 962are no longer unique to it. 963At time of the copy, the 964.CW BsCopied 965state bit in the block's label 966is set to note the duplication of the pointers contained within. 967.NH 2 968Disk layout 969.HP 970The file system header describes the file system layout and has this format: 971.P1 972.ta +\w' 'u +\w' 'u 973Header: 974 magic[4] \fR0x3776AE89 (HeaderMagic)\fP 975 version[2] \fR1 (HeaderVersion)\fP 976 blockSize[2] \fIfile system block size\fP 977 super[4] \fRblock offset of super block\fP 978 label[4] \fRblock offset of labels\fP 979 data[4] \fRdata blocks\fP 980 end[4] \fRend of file system\fP 981.P2 982.LP 983The corresponding file system layout is: 984.PS 985.ps 8 986.vs 9 987boxwid=0.75 988boxht=0.15 989Empty: box "empty" ht 0.25 990Header: box "header" with .n at Empty.s 991Empty2: box "empty" with .n at Header.s 992Super: box "super block" with .n at Empty2.s 993Label: box "label" "blocks" with .n at Super.s ht 0.25 994Data: box "data" "blocks" with .n at Label.s ht 0.3 995" 0" ljust at Empty.ne 996" 128kB" ljust at Header.ne 997" \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne 998" \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne 999" \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne 1000" \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se 1001"" at (-1,0) 1002"" at (6,0) 1003.PE 1004.LP 1005The numbers to the right of the blocks are byte offsets 1006of the boundaries. 1007.LP 1008The super block describes the file system itself and looks like: 1009.P1 1010.ta +\w' 'u +\w' 'u 1011Super: 1012 magic[4] \fR0x2340A3B1 (SuperMagic)\fP 1013 version[2] \fR1 (SuperVersion)\fP 1014 epochLow[4] \fRfile system low epoch\fP 1015 epochHigh[4] \fRfile system high (active) epoch\fP 1016 qid[8] \fRnext qid to allocate\fP 1017 active[4] \fRdata block number: root of active file system\fP 1018 next[4] \fRdata block number: root of next file system to archive\fP 1019 current[4] \fRdata block number: root of file system currently being archived\fP 1020 last[20] \fRVenti score of last successful archive\fP 1021 name[128] \fRname of file system (just a comment)\fP 1022.P2 1023.LP 1024.NH 1 1025Fossil server 1026.HP 1027The Fossil server is a user-space program that runs on a standard Plan 9 kernel. 1028.NH 2 1029Process structure 1030.PP 1031The file server is structured as a set of processes synchronizing 1032mostly through message passing along queues. 1033The processes are given names, which can be seen in the output of 1034.CW ps 1035.CW -a . 1036.PP 1037.CW Listen 1038processes announce on various network addresses. 1039A 1040.CW con 1041process handles each incoming connection, reading 9P requests 1042and adding them to a central message queue. 1043.CW Msg 1044processes remove 9P requests from the queue, 1045handle them, and write the responses to the appropriate 1046file descriptors. 1047.PP 1048The 1049.CW disk 1050process handles disk I/O requests made by the other processes. 1051The 1052.CW flush 1053process writes dirty blocks from the in-memory block cache to disk. 1054The 1055.CW unlink 1056process frees previously linked blocks once the blocks that point at them 1057have been written to disk. 1058.PP 1059A 1060.CW consI 1061reads from each console file (typically a pipe posted in 1062.CW /srv ), 1063adding the typed characters to the input queue. 1064The 1065.CW cons 1066process echoes input and runs the commands, saving 1067output in a ring buffer. 1068Because there is only one 1069.CW cons 1070process, only one console command may be executing at a time. 1071A 1072.CW consO 1073process copies this ring buffer to each console file. 1074.PP 1075The 1076.CW periodic 1077process runs periodic events, like 1078flushing the root metadata to disk or 1079taking snapshots of the file system. 1080.NH 2 1081Block cache 1082.HP 1083Fossil maintains an in-memory block cache which 1084holds both local disk blocks and Venti blocks. 1085Cache eviction follows a least recently used policy. 1086Dirty blocks are restricted to at most half the cache. 1087This can be changed by editing 1088.CW DirtyPercentage 1089in 1090.CW dat.h . 1091.PP 1092The block cache uses soft updates [1] to ensure that the on-disk 1093file system is always self-consistent. 1094Thus there is no 1095.I halt 1096console command 1097and no need to check a file system 1098that was shut down without halting. 1099.NH 2 1100Archiving 1101.HP 1102A background process writes blocks in archival snapshots to Venti. 1103Although 1104.CW /archive/\fIyyyy\fP/\fImmdds\fR 1105is a copy of only 1106.CW /active 1107at the time of the snapshot, 1108the archival process archives the 1109entire file tree rather than just 1110the subtree rooted at 1111.CW /active . 1112The snapshots 1113.CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm 1114are stored as empty directories. 1115Once all the blocks have been archived, 1116a 1117.CW VtRoot 1118header for the file system is archived. 1119The score of that header is recorded in 1120.CW super.score 1121and also printed on the file server console. 1122The score can used by 1123.I flfmt 1124to restore a file system (see 1125.I fossil (4)). 1126.NH 2 1127Contrast with the old file server 1128.HP 1129The most obvious difference between Fossil and the 1130old Plan 9 file server [2] is that Fossil uses a Venti server as 1131its archival storage in place of a WORM juke box. 1132There are a few other architectural differences to be 1133aware of. 1134.PP 1135Fossil is a user-level program run on a standard kernel. 1136.PP 1137Fossil does not have any way to concatenate, stripe, or 1138mirror disk files. For functionality similar to the old file server's 1139configuration strings, use the experimental file stack device 1140(see 1141.I fs (3)). 1142.PP 1143Fossil speaks only 9P2000. Old 9P (aka 9P1) is not supported. 1144.PP 1145... XXX words about converting an old file system to fossil? 1146.NH 1 1147References 1148.LP 1149[1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules, 1150and Yale N. Patt. 1151``Soft Updates: A Solution to the Metadata Update Problem 1152in File Systems,'' 1153.I "ACM Transactions on Computer Systems" , 1154Vol 18., No. 2, May 2000, pp. 127\-153. 1155.LP 1156[2] Sean Quinlan, ``A Cached WORM File System,'' 1157.I "Software\(emPractice and Experience" , 1158Vol 21., No 12., December 1991, pp. 1289\-1299. 1159.LP 1160[3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,'' 1161.I "Usenix Conference on File and Storage Technologies" , 11622002. 1163