1.HTML "Fossil, an Archival File Server 2... .FP times 3... .fp 1 R R.nomath 4... .fp 5 CW LucidaSansCW83 5.TL 6Fossil, an Archival File Server 7.AU 8Sean Quinlan 9Jim McKie 10Russ Cox 11jmk,rsc@plan9.bell-labs.com 12.AB 13This paper describes the internals and 14operation of Fossil, an archival file server built for Plan 9. 15Fossil has not yet replaced the current Plan 9 file server 16and 17.CW kfs , 18but that is our eventual intent. 19Both fossil and this documentation are 20works in progress. Comments on either are most welcome. 21.AE 22.de HP 23.LP 24.. 25.NH 1 26Introduction 27.HP 28Fossil is an archival file server built for Plan 9. 29In a typical configuration, it maintains a traditional file system 30in a local disk partition and periodically archives snapshots of the file system 31to a Venti server. These archives are made available through 32a file system interface. 33Fossil can also be run without a Venti server, in which case the 34snapshots (if any) occupy local disk space. 35.PP 36The bulk of this paper explains the underlying data structures: 37Venti trees, the Venti archival file system format, and finally Fossil's 38file system format. 39The end of the paper discusses the architecture of the Fossil server. 40.PP 41The presentation of the data structures is very detailed, perhaps 42too detailed for most readers. 43The intent is to record all the details necessary to make structural 44changes to the file system format. 45Feel free to jump ahead when boredom strikes. 46.NH 1 47Venti trees and directory hierarchies 48.HP 49Venti [3] is an archival block storage server. 50Once a block is stored, it can be retrieved by presenting the 20-byte 51SHA1 hash of its contents, called a 52.I score . 53Blocks on Venti have a maximum length of about 56 kilobytes, 54though in practice smaller blocks are used. 55To store a byte stream of arbitrary length, Venti uses a hash tree. 56Conceptually, the data stream is broken into fixed-size (say, 57.I dsize -byte) 58chunks, which are stored on the Venti server. 59The resulting scores are concatenated into a new pointer stream, which is 60broken into fixed size (say, 61.I psize -byte) 62chunks, which are stored on the Venti server. 63.I Psize "" ( 64is different from 65.I dsize 66so that we can ensure that each pointer block holds an 67integral number of pointers.) 68This yields a new pointer stream, and so on, until there is a single block 69and finally a single score describing the entire tree. 70The resulting structure looks like: 71.PS 72.ps 8 73.vs 10 74boxht=0.1 75boxwid=0.1 76 77B0: box invis wid 1 "\f(CWVtDataType\fP" 78move right 0.1 79L0a: box wid 0.2 80move right 0.1 81L0b: box wid 0.2 82move right 0.1 83L0c: box invis wid 0.2 "..." 84move right 0.1 85 86L0d: box wid 0.2 87move right 0.1 88L0e: box wid 0.2 89move right 0.2 90L0f: box invis wid 0.2 "..." 91move right 0.2 92 93L0g: box wid 0.2 94move right 0.1 95L0h: box wid 0.2 96move right 0.1 97L0i: box invis wid 0.2 "..." 98move right 0.1 99 100L0j: box wid 0.2 101move right 0.1 102L0k: box wid 0.2 103move right 0.1 104L0l: box invis wid 0.2 "..." 105move right 0.1 106L0m: box wid 0.2 107 108define boxppddd { 109 line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se> 110 line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se> 111 X: box invis at 0.1<$1.nw,$1.ne> 112 Y: box invis at 0.1<$1.sw,$1.se> 113 line -> from 0.5<X,Y> to $2.nw 114 X: box invis at 0.3<$1.nw,$1.ne> 115 Y: box invis at 0.3<$1.sw,$1.se> 116 line -> from 0.5<X,Y> to $3.nw 117 "..." at 0.7<$1.w,$1.e> 118} 119 120define boxppdddp { 121 line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se> 122 line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se> 123 line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se> 124 X: box invis at 0.1<$1.nw,$1.ne> 125 Y: box invis at 0.1<$1.sw,$1.se> 126 line -> from 0.5<X,Y> to $2.nw 127 X: box invis at 0.3<$1.nw,$1.ne> 128 Y: box invis at 0.3<$1.sw,$1.se> 129 line -> from 0.5<X,Y> to $3.nw 130 "..." at 0.6<$1.w,$1.e> 131 X: box invis at 0.9<$1.nw,$1.ne> 132 Y: box invis at 0.9<$1.sw,$1.se> 133 line -> from 0.5<X,Y> to $4.nw 134} 135 136define boxpdddp { 137 line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se> 138 line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se> 139 X: box invis at 0.1<$1.nw,$1.ne> 140 Y: box invis at 0.1<$1.sw,$1.se> 141 line -> from 0.5<X,Y> to $2.nw 142 "..." at 0.5<$1.w,$1.e> 143 X: box invis at 0.9<$1.nw,$1.ne> 144 Y: box invis at 0.9<$1.sw,$1.se> 145 line -> from 0.5<X,Y> to $3.nw 146} 147 148bhd=0.4 149L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd) 150boxppddd(L1abc, L0a, L0b) 151L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd) 152boxppddd(L1def, L0d, L0e) 153L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd) 154boxppddd(L1ghi, L0g, L0h) 155L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd) 156boxppdddp(L1jklm, L0j, L0k, L0m) 157B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd) 158 159L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd) 160boxppddd(L2abcdef, L1abc, L1def) 161L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd) 162boxpdddp(L2ghijklm, L1ghi, L1jklm) 163B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd) 164 165L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd) 166boxpdddp(L3atom, L2abcdef, L2ghijklm) 167B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd) 168.PE 169.LP 170The leaves are the original data stream. Those blocks have type 171.CW VtDataType . 172The first pointer stream has type 173.CW VtPointerType0 , 174the next has type 175.CW VtPointerType1 , 176and so on. 177The figure ends with a single block of type 178.CW VtPointerType2 , 179but in general trees can have height up to 180.CW VtPointerType6 . 181For a 182.I dsize 183of 8192 bytes 184and 185.I psize 186of 8180 bytes (409 pointers), 187this gives a maximum stream size of approximately 10 zettabytes 188(2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes). 189.PP 190Data blocks are truncated to remove trailing runs of zeros before 191storage to Venti; they are zero-filled back to 192.I dsize 193bytes after retrieval from Venti. 194Similarly, trailing runs of pointers to zero-length blocks are 195removed from and added back to pointer blocks. 196These simple rules happen to make it particularly efficient to store 197large runs of zeros, as might occur in a data stream with ``holes:'' 198the zero-length block itself can be interpreted as a tree of any 199depth encoding an all-zero data stream. 200.PP 201Reconstructing the data stream requires the score and type of the 202topmost block in the tree, the data chunk size, the pointer chunk size, 203and the data stream size. 204(From the data stream size and the chunk sizes we could derive the 205depth of the tree and thus the type of the topmost block, but it is convenient 206to allow trees that are deeper than necessary.) 207This information is kept in a 40-byte structure called a 208.CW VtEntry : 209.P1 210VtEntry: 211.ta +\w' 'u +\w' 'u 212 gen[4] \fRgeneration number\fP 213 psize[2] \fRsize of pointer blocks\fP 214 dsize[2] \fRsize of data blocks\fP 215 flags[1] 216 zero[5] 217 size[6] \fRlength of file\fP 218 score[20] \fRscore of root block in tree\fP 219.P2 220(In this notation, 221.CW name[sz] 222indicates a 223.CW sz -byte 224field called 225.CW name . 226Integers are stored in big-endian order. 227.CW Size 228really is a 48-bit field.) 229.CW Flags 230is made up of the following bit fields. 231.P1 232.ta +\w' 'u +\w' 'u 2330x01 VtEntryActive \fRentry is allocated\fP 2340x02 VtEntryDir \fRentry describes a Venti directory (q.v.)\fP 2350x1C VtEntryDepthMask \fRmask for tree depth\fP 2360x20 VtEntryLocal \fRreserved (q.v.)\fP 237.P2 238.LP 239The depth of the described tree is stored in the 3 bits indicated: 240a tree with a topmost node of type 241.CW VtPointerType3 242has depth 4. 243.PP 244With 245.CW VtEntry 246we can build more complicated data structures, 247ones with multiple or nested data streams. 248A data stream consisting of 249.CW VtEntry 250structures is called a Venti directory. 251It is identical in structure to the Venti data stream 252we described earlier except that the bottom-level type is 253.CW VtDirType , 254and 255the 256.CW VtEntry 257describing a Venti directory has the 258.CW VtEntryDir 259flag bit set. 260The 261.I dsize 262for a Venti directory 263is a multiple of 40 so that each data chunk holds 264an integer number of 265.CW VtEntry 266structures. 267By analogy with Venti directories, 268we call the original data stream a 269Venti file. 270Note that Venti files are assumed 271.I not 272to contain pointers to other Venti blocks. 273The only pointers to Venti blocks occur in 274.CW VtEntry 275structures in 276Venti directories 277(and in the internal hash tree structure of the 278individual files and directories). 279Note also that these directories are nothing more than pointer lists. 280In particular, there are no names or metadata like in a file system. 281.PP 282To make it easier to pass hierarchies between applications, 283the root of a hierarchy is described in a 300-byte structure 284called a 285.CW VtRoot : 286.P1 287VtRoot: 288.ta +\w' 'u +\w' 'u 289 version[2] \f(CW2\fP 290 name[128] \fRname of structure (just a comment)\fP 291 type[128] \fRstring describing structure (\f(CWvac\fR)\f(CW 292 score[20] \fRpointer to \f(CWVtDirType\fP block\f(CW 293 blockSize[2] \fRmaximum block size in structure\fP 294 prev[20] \fRprevious \f(CWVtRoot\fP in chain, if any\fP 295.P2 296.LP 297This structure is stored to Venti and its score is passed 298between applications, typically in the form 299``\fItype\f(CW:\fIrootscore\fR,'' 300where 301.I type 302is the type field from the 303.CW VtRoot 304structure, and 305.I rootscore 306is the score of the 307.CW VtRoot 308block. 309.CW VtRoot 310structures can be chained together using the 311.I prev 312field to encode an archival history 313of the data structure. 314.PP 315For example, a small Venti hierarchy might look like: 316.PS 317.ps 8 318.vs 10 319boxwid=0.1 320boxht=0.1 321f=0.9 322mb=0.16 323 324VtRoot: [ 325 right 326 B1: box 327 move right 0.1 328 "\f(CWVtRoot\fP" ljust 329] 330 331Root: [ 332 right 333 B1: box fill f 334 B2: box fill f 335 B3: box fill f 336 move right 0.1 337] with .nw at VtRoot.sw+(0.2,-.1) 338Level1: [ 339 RootMeta: [ 340 box wid mb 341 ] 342 MetaSource: [ 343 right 344 B1: box wid 5*mb 345 ] with .nw at RootMeta.sw+(0,-.1) 346 347 Source: [ 348 right 349 B1: box fill f 350 B2: box fill f 351 B3: box fill f 352 B4: box fill f 353 B5: box fill f 354 B6: box fill f 355 B7: box fill f 356 B8: box fill f 357 ] with .nw at MetaSource.sw+(0,-.1) 358 SB1: box invis at Source.B1 359 SB2: box invis at Source.B2 360 SB3: box invis at Source.B3 361] with .nw at Root.sw+(0.4,-.1) 362Level2: [ 363 MetaSource: [ 364 right 365 B1: box wid 5*mb 366 ] 367 Source: [ 368 right 369 B1: box fill f 370 B2: box fill f 371 B3: box fill f 372 B4: box fill f 373 B5: box fill f 374 B6: box fill f 375 B7: box fill f 376 B8: box fill f 377 ] with .nw at MetaSource.sw+(0,-.1) 378 File: box wid 0.8 with .nw at Source.sw+(0,-.1) 379] with .nw at Level1.sw+(0.6,-.1) 380 381line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w 382line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w 383line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w 384line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w 385 386line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w 387line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w 388line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w 389 390[ 391 KEY: box wid 1.5 invis "Key" 392 line from KEY.sw to KEY.se 393 k = -.1 394 kk=0.5 395 A: [ 396 box wid 4*boxwid 397 "Venti file" ljust with .w at last box .w+(kk,0) 398 ] with .nw at KEY.sw+(0,2*k) 399 B: [ 400 box fill f 401 "Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0) 402 ] with .nw at A.sw+(0,k) 403 C: [ 404 right 405 CC: box fill f 406 box fill f 407 box fill f 408 box fill f 409 "Venti directory" ljust with .w at CC.w+(kk,0) 410 ] with .nw at B.sw+(0,k) 411 D: [ 412 line -> right 3*boxwid 413 "Venti pointer (score)" ljust with .w at last line .w+(kk, 0) 414 ] with .nw at C.sw+(0,k) 415] with .nw at VtRoot.nw+(3,0) 416.PE 417.LP 418Venti files are shown as white boxes, while directories are shown 419as shaded boxes. Each shaded square represents a 420.CW VtEntry . 421Arrows represent pointers from 422.CW VtEntry 423structures to other 424Venti files or directories. 425.PP 426The hierarchical structure provided by Venti files and directories 427can be used as the base for more complicated data structures. 428Because this structure captures all the information 429about pointers to other blocks, tools written to traverse 430Venti hierarchies can traverse the more complicated 431data structures as well. 432For example, 433.I venti/copy 434(see 435.I venti (1)) 436copies a Venti hierarchy from one Venti server to another, 437given the root 438.CW VtEntry . 439Because the traditional file system described in later sections is 440layered on a Venti hierarchy, 441.I venti/copy 442can copy it without fully understanding its structure. 443.NH 1 444Vac file system format 445.HP 446The Venti archive format 447.I vac 448builds a traditional file system using a Venti hierarchy. 449Each vac file is implemented as a Venti file; 450each vac directory is implemented as a Venti 451directory and a Venti file to provide traditional file system metadata. 452The metadata is stored in a structure called a 453.CW DirEntry : 454.P1 455DirEntry: 456.ta +\w' 'u +\w' 'u 457 magic[4] \f(CW0x1c4d9072\fP (DirMagic)\fP 458 version[2] \f(CW9\fP 459 elem[s] \fRname (final path element only)\fP 460 entry[4] \fRentry number for Venti file or directory\fP 461 gen[4] \fRgeneration number\fP 462 mentry[4] \fRentry number for Venti file holding metadata\fP 463 mgen[4] \fRgeneration number\fP 464 qid[8] \fRunique file serial number\fP 465 uid[s] \fRowner\fP 466 gid[s] \fRgroup\fP 467 mid[s] \fRlast modified by\fP 468 mtime[4] \fRlast modification time\fP 469 ctime[4] \fRcreation time\fP 470 atime[4] \fRlast access time\fP 471 mode[4] \fRmode bits\fP 472.P2 473The notation 474.CW name[s] 475denotes a string stored as a two-byte length 476and then that many bytes. 477The above describes Version 9 of the 478.CW DirEntry 479format. Versions 7 and 8 are very similar; they can be 480read by the current 481.I vac 482source code but are not written. 483Earlier versions were not widespread. 484A 485.CW DirEntry 486may be followed by optional extension sections, though none 487are currently used. 488The 489.CW mode 490bits include bits commonly used by 491Unix and Windows, in addition to those used by Plan 9. 492.PP 493The 494.CW entry 495field is an index into the parallel Venti directory. 496The 497.CW gen 498field must match the 499.CW gen 500field in the corresponding 501.CW VtEntry 502in the directory; 503it is used to detect 504stale indices. 505Similarly, 506.CW mentry 507and 508.CW mgen 509are the index and generation number 510for the metadata Venti file, 511if the 512.CW DirEntry 513describes a vac directory. 514.PP 515The relation between Venti files and directories and 516vac files and directories can be seen in this figure: 517.PS 518.ps 8 519.vs 10 520boxwid=0.1 521boxht=0.1 522f=0.9 523mb=0.16 524 525VtRoot: [ 526 right 527 B1: box 528 move right 0.1 529 "\f(CWVtRoot\fP" ljust 530] 531 532SuperRoot: [ 533 right 534 B1: box fill f 535 move right 0.1 536 "fs root block" ljust 537] with .nw at VtRoot.sw + (0.2, -.2) 538Root: [ 539 right 540 B1: box fill f 541 B2: box fill f 542 B3: box fill f 543 move right 0.1 544 "root directory info block" ljust 545] with .nw at SuperRoot.sw+(0.2, -.2) 546Level1: [ 547 RootMeta: [ 548 box wid mb 549 move right 0.1 550 "root metadata" ljust 551 ] 552 MetaSource: [ 553 right 554 B1: box wid mb 555 B2: box wid mb 556 B3: box wid mb 557 B4: box wid mb 558 B5: box wid mb 559 ] with .nw at RootMeta.sw+(0,-.2) 560 MB1: box wid mb invis at MetaSource.B1 561 MB2: box wid mb invis at MetaSource.B2 562 MB3: box wid mb invis at MetaSource.B3 563 MB4: box wid mb invis at MetaSource.B4 564 MB5: box wid mb invis at MetaSource.B5 565 566 Source: [ 567 right 568 B1: box fill f 569 B2: box fill f 570 B3: box fill f 571 B4: box fill f 572 B5: box fill f 573 B6: box fill f 574 B7: box fill f 575 B8: box fill f 576 ] with .nw at MetaSource.sw+(0,-.1) 577 SB1: box invis at Source.B1 578 SB2: box invis at Source.B2 579 SB3: box invis at Source.B3 580 SB4: box invis at Source.B4 581 SB5: box invis at Source.B5 582 SB6: box invis at Source.B6 583 SB7: box invis at Source.B7 584 SB8: box invis at Source.B8 585] with .nw at Root.sw+(0.4,-.2) 586Level2: [ 587 MetaSource: [ 588 right 589 B1: box wid mb 590 B2: box wid mb 591 B3: box wid mb 592 B4: box wid mb 593 B5: box wid mb 594 ] 595 Source: [ 596 right 597 B1: box fill f 598 B2: box fill f 599 B3: box fill f 600 B4: box fill f 601 B5: box fill f 602 B6: box fill f 603 B7: box fill f 604 B8: box fill f 605 ] with .nw at MetaSource.sw+(0,-.1) 606 File: box wid 0.8 with .nw at Source.sw+(0,-.2) 607] with .nw at Level1.sw+(0.6,-.2) 608 609line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w 610line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w 611line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w 612line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w 613line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w 614 615line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w 616line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w 617line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w 618 619arrowwid = arrowwid/2 620arrowht = arrowht/2 621line -> from Level1.MB1 to Level1.SB1.n 622line -> from Level1.MB2 to Level1.SB2.n 623line -> from Level1.MB2 to Level1.SB3.n 624line -> from Level1.MB4 to Level1.SB7.n 625line -> from Level1.MB5 to Level1.SB5.n 626arrowwid = arrowwid * 2 627arrowht = arrowht * 2 628 629box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2 630box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2 631box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2 632 633[ 634 KEY: box wid 1.5 invis "Key" 635 line from KEY.sw to KEY.se 636 k = -.1 637 kk=0.5 638 A: [ 639 box wid 4*boxwid 640 "Venti file" ljust with .w at last box .w+(kk,0) 641 ] with .nw at KEY.sw+(0,2*k) 642 B: [ 643 box fill f 644 "Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0) 645 ] with .nw at A.sw+(0,k) 646 C: [ 647 right 648 CC: box fill f 649 box fill f 650 box fill f 651 box fill f 652 "Venti directory" ljust with .w at CC.w+(kk,0) 653 ] with .nw at B.sw+(0,k) 654 D: [ 655 line -> right 3*boxwid 656 "Venti pointer (score)" ljust with .w at last line .w+(kk, 0) 657 ] with .nw at C.sw+(0,k) 658 DD: [ 659 box dotted wid 4*boxwid 660 "Vac file" ljust with .w at last box .w+(kk,0) 661 ] with .nw at D.sw+(0,k) 662 E: [ 663 box wid mb 664 "Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0) 665 ] with .nw at DD.sw+(0,k) 666 G: [ 667 box dashed wid 4*boxwid 668 "Vac directory" ljust with .w at last box .w+(kk,0) 669 ] with .nw at E.sw+(0,k) 670 H: [ 671 arrowwid = arrowwid/2 672 arrowht = arrowht/2 673 line -> right 1.5*boxwid 674 "Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0) 675 arrowwid = arrowwid * 2 676 arrowht = arrowht * 2 677 ] with .nw at G.sw+(0,k) 678] with .nw at VtRoot.nw+(3,0) 679.PE 680.LP 681In reality, the story is slightly more complicated. 682The metadata file in a Vac directory 683is not just the concatenation of 684.CW DirEntry 685structures. 686Instead, it is the concatenation of 687.CW MetaBlocks . 688A 689.CW MetaBlock 690contains some number of 691.CW DirEntry 692structures along with a sorted index to make it easy 693to look for a particular 694.CW DirEntry 695by its 696.CW elem 697field. 698The details are in the source code. 699.PP 700As shown in the diagram, 701the root directory of the file system is summarized by 702three 703.CW VtEntry 704structures describing 705the Venti directory for the children of the root, 706the Venti file for the metadata describing the children of the root, 707and a Venti file holding metadata for the root directory itself. 708These 709.CW VtEntry 710structures are placed in a Venti directory of their own, 711described by the single 712.CW VtEntry 713in the 714root block. 715.NH 1 716Fossil file system format 717.HP 718Fossil uses the vac format, with some small changes. 719The changes only affect the data on the local disk; the data 720archived to Venti is exactly in vac format. 721.PP 722Blocks stored on local disk may contain scores pointing at local disk 723blocks or at Venti blocks. 724Local block addresses are stored as 20-byte scores in which the first 16 bytes 725are all zero and the last 4 bytes specify a block number in the disk. 726Before a block is archived, all the 727blocks it points to must be archived, and the local scores in the block 728must be changed to Venti scores. 729Using block addresses rather than content hashes for local data 730makes the local file system easier to manage: if a local block's contents 731change, the pointer to the block does not need to change. 732.NH 2 733Snapshots 734.HP 735Fossil is an archival file server. 736It takes periodic snapshots of the file system, 737which are made accessible through the file system. 738Specifically, the active file system is presented in 739.CW /active . 740Ephemeral snapshots (those that are kept on local disk and eventually deleted) 741are presented in 742\f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR, 743where 744.I yyyy 745is the full year, 746.I mm 747is the month number, 748.I dd 749is the day number, 750.I hh 751is the hour, 752and 753.I mm 754is the minute. 755Archival snapshots (those that are archived to Venti and persist forever) 756are presented in 757\f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR, 758where 759.I yyyy , 760.I mm , 761and 762.I dd 763are year, month, and day as before, 764and 765.I s 766is a sequence number if more than one 767archival snapshot is done in a day. 768For the first snapshot, 769.I s 770is null. 771For the subsequent snapshots, 772.I s 773is 774.CW .1 , 775.CW .2 , 776.CW .3 , 777etc. 778.PP 779To implement the snapshots, the file server maintains a 780current 781.I epoch 782for the active file system. 783Each local block has a label that records, among other things, 784the epoch in which the block was allocated. 785If a block was allocated in an epoch earlier than the current one, 786it is immutable and treated as copy-on-write. 787Taking a snapshot can be accomplished by 788recording the address of the current root block and then 789incrementing the epoch number. 790Notice that the copy-on-write method makes 791snapshots both time efficient and space efficient. 792The only time cost is waiting for all current file system 793requests to finish and then incrementing a counter. 794After a snapshot, blocks only get copied when they are 795next modified, so the per-snapshot 796space requirement is proportional 797to the amount of new data rather than the total 798size of the file system. 799.PP 800The blocks in the archival snapshots are moved to Venti, 801but the blocks in the ephemeral snapshots take up space 802in the local disk file. 803To allow reclamation of this disk space, the file system 804maintains a 805.I low 806.I epoch , 807which is the epoch of the earliest ephemeral snapshot 808still available. 809Fossil only allows access to snapshots with epoch numbers 810between the 811low epoch and the current epoch 812(also called the high epoch). 813Incrementing the low epoch thus makes old 814snapshots inaccessible. 815The space required to store those snapshots can then 816be reclaimed, as described below. 817.NH 2 818Local blocks 819.HP 820The bulk of the local disk file is the local blocks. 821Each block has a 14-byte label associated with it, of the format: 822.P1 823Label: 824.ta +\w' 'u +\w' 'u 825 state[1] \fRblock state\fP 826 type[1] \fRblock type\fP 827 epoch[4] \fRallocation epoch\fP 828 epochClose[4] \fRclose epoch\fP 829 tag[4] \fRrandom tag\fP 830.P2 831.LP 832The 833.CW type 834is an analogue of the block types described earlier, 835though different names are used, to distinguish between 836pointers blocks in a hash tree for a data stream 837and pointer blocks for a directory stream. 838The 839.CW epoch 840was mentioned in the last section. 841The other fields are explained below. 842.PP 843There are two distinguished blocks states 844.CW BsFree 845.CW 0x00 ) ( 846and 847.CW BsBad 848.CW 0xFF ), ( 849which mark blocks that are available for allocation 850and blocks that are bad and should be avoided. 851If 852.CW state 853is not one of these values, it is a bitwise 854.I or ' ` 855of the following flags: 856.P1 857.ta +\w' 'u +\w' 'u 8580x01 BsAlloc \fRblock is in use\fP 8590x02 BsCopied \fRblock has been copied\fP 8600x04 BsVenti \fRblock has been stored on Venti\fP 8610x08 BsClosed \fRblock has been unlinked from active file system\fP 862.P2 863.LP 864The flags are explained as they arise in the discussions below. 865.PP 866It is convenient to store some extra fields in the 867.CW VtEntry 868structure when it describes a Venti file or directory 869stored on local disk. 870Specifically, we set the 871.CW VtEntryLocal 872flag bit 873and then use the bytes 7-16 of the score (which would 874otherwise be zero, since it is a local score) to hold these fields: 875.P1 876.ta +\w' 'u +\w' 'u 877 archive[1] \fRboolean: this is an archival snapshot\fP 878 snap[4] \fRepoch number if root of snapshot\fP 879 tag[4] \fRrandom tag\fP 880.P2 881.LP 882The extended 883.CW VtEntry 884structure is called an 885.CW Entry . 886The 887.CW tag 888field 889in the 890.CW Label 891and the 892.CW Entry 893is used to identify dangling pointers or other file system corruption: 894all the local blocks in a hash tree must 895have tags matching the tag in the 896.CW Entry . 897If this 898.CW Entry 899points at the root of a snapshot, 900the 901.CW snap 902field is the epoch of the snapshot. 903If the snapshot is intended to be archived to Venti, 904the 905.CW archive 906field is non-zero. 907.NH 2 908Block reclamation 909.HP 910The blocks in the active file system form a tree: each 911block has only one parent. 912Once a copy-on-write block 913.I b 914is replaced by its copy, it is no longer 915needed by the active file system. 916At this point, 917.I b 918is unlinked from the active file system. 919We say that 920.I b 921is now 922.I closed : 923it is needed only for snapshots. 924When a block is closed, the 925.CW BsClosed 926bit is set in its state, and the current epoch (called the block's closing epoch) 927is stored in the 928.CW epochClose 929label field. 930(Open blocks have an 931.CW epochClose 932of 933.CW ~0 ). 934.PP 935A block is referenced by snapshots with epochs 936between the block's allocation epoch and its closing epoch. 937Once the file system's low epoch grows to be greater than or equal to the block's 938closing epoch, the block is no longer needed for any snapshots 939and can be reused. 940.PP 941In a typical configuration, where nightly archival snapshots 942are taken and written to Venti, it is desirable to reclaim 943the space occupied by now-archived blocks if possible. 944To do this, Fossil keeps track of whether the pointers 945in each block are unique to that block. 946When a block 947.I bb 948is allocated, a pointer to 949.I bb 950is written into exactly one active block (say, 951.I b ). 952In the absence of snapshots, the pointer to 953.I bb 954will remain unique to 955.I b , 956so that if the pointer is zeroed, 957.I bb 958can be immediately reused. 959Snapshots complicate this invariant: 960when 961.I b 962is copied-on-write, all its pointers 963are no longer unique to it. 964At time of the copy, the 965.CW BsCopied 966state bit in the block's label 967is set to note the duplication of the pointers contained within. 968.NH 2 969Disk layout 970.HP 971The file system header describes the file system layout and has this format: 972.P1 973.ta +\w' 'u +\w' 'u 974Header: 975 magic[4] \fR0x3776AE89 (HeaderMagic)\fP 976 version[2] \fR1 (HeaderVersion)\fP 977 blockSize[2] \fIfile system block size\fP 978 super[4] \fRblock offset of super block\fP 979 label[4] \fRblock offset of labels\fP 980 data[4] \fRdata blocks\fP 981 end[4] \fRend of file system\fP 982.P2 983.LP 984The corresponding file system layout is: 985.PS 986.ps 8 987.vs 9 988boxwid=0.75 989boxht=0.15 990Empty: box "empty" ht 0.25 991Header: box "header" with .n at Empty.s 992Empty2: box "empty" with .n at Header.s 993Super: box "super block" with .n at Empty2.s 994Label: box "label" "blocks" with .n at Super.s ht 0.25 995Data: box "data" "blocks" with .n at Label.s ht 0.3 996" 0" ljust at Empty.ne 997" 128kB" ljust at Header.ne 998" \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne 999" \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne 1000" \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne 1001" \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se 1002"" at (-1,0) 1003"" at (6,0) 1004.PE 1005.LP 1006The numbers to the right of the blocks are byte offsets 1007of the boundaries. 1008.LP 1009The super block describes the file system itself and looks like: 1010.P1 1011.ta +\w' 'u +\w' 'u 1012Super: 1013 magic[4] \fR0x2340A3B1 (SuperMagic)\fP 1014 version[2] \fR1 (SuperVersion)\fP 1015 epochLow[4] \fRfile system low epoch\fP 1016 epochHigh[4] \fRfile system high (active) epoch\fP 1017 qid[8] \fRnext qid to allocate\fP 1018 active[4] \fRdata block number: root of active file system\fP 1019 next[4] \fRdata block number: root of next file system to archive\fP 1020 current[4] \fRdata block number: root of file system currently being archived\fP 1021 last[20] \fRVenti score of last successful archive\fP 1022 name[128] \fRname of file system (just a comment)\fP 1023.P2 1024.LP 1025.NH 1 1026Fossil server 1027.HP 1028The Fossil server is a user-space program that runs on a standard Plan 9 kernel. 1029.NH 2 1030Process structure 1031.PP 1032The file server is structured as a set of processes synchronizing 1033mostly through message passing along queues. 1034The processes are given names, which can be seen in the output of 1035.CW ps 1036.CW -a . 1037.PP 1038.CW Listen 1039processes announce on various network addresses. 1040A 1041.CW con 1042process handles each incoming connection, reading 9P requests 1043and adding them to a central message queue. 1044.CW Msg 1045processes remove 9P requests from the queue, 1046handle them, and write the responses to the appropriate 1047file descriptors. 1048.PP 1049The 1050.CW disk 1051process handles disk I/O requests made by the other processes. 1052The 1053.CW flush 1054process writes dirty blocks from the in-memory block cache to disk. 1055The 1056.CW unlink 1057process frees previously linked blocks once the blocks that point at them 1058have been written to disk. 1059.PP 1060A 1061.CW consI 1062reads from each console file (typically a pipe posted in 1063.CW /srv ), 1064adding the typed characters to the input queue. 1065The 1066.CW cons 1067process echoes input and runs the commands, saving 1068output in a ring buffer. 1069Because there is only one 1070.CW cons 1071process, only one console command may be executing at a time. 1072A 1073.CW consO 1074process copies this ring buffer to each console file. 1075.PP 1076The 1077.CW periodic 1078process runs periodic events, like 1079flushing the root metadata to disk or 1080taking snapshots of the file system. 1081.NH 2 1082Block cache 1083.HP 1084Fossil maintains an in-memory block cache which 1085holds both local disk blocks and Venti blocks. 1086Cache eviction follows a least recently used policy. 1087Dirty blocks are restricted to at most half the cache. 1088This can be changed by editing 1089.CW DirtyPercentage 1090in 1091.CW dat.h . 1092.PP 1093The block cache uses soft updates [1] to ensure that the on-disk 1094file system is always self-consistent. 1095Thus there is no 1096.I halt 1097console command 1098and no need to check a file system 1099that was shut down without halting. 1100.NH 2 1101Archiving 1102.HP 1103A background process writes blocks in archival snapshots to Venti. 1104Although 1105.CW /archive/\fIyyyy\fP/\fImmdds\fR 1106is a copy of only 1107.CW /active 1108at the time of the snapshot, 1109the archival process archives the 1110entire file tree rather than just 1111the subtree rooted at 1112.CW /active . 1113The snapshots 1114.CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm 1115are stored as empty directories. 1116Once all the blocks have been archived, 1117a 1118.CW VtRoot 1119header for the file system is archived. 1120The score of that header is recorded in 1121.CW super.score 1122and also printed on the file server console. 1123The score can used by 1124.I flfmt 1125to restore a file system (see 1126.I fossil (4)). 1127.NH 2 1128Contrast with the old file server 1129.HP 1130The most obvious difference between Fossil and the 1131old Plan 9 file server [2] is that Fossil uses a Venti server as 1132its archival storage in place of a WORM juke box. 1133There are a few other architectural differences to be 1134aware of. 1135.PP 1136Fossil is a user-level program run on a standard kernel. 1137.PP 1138Fossil does not have any way to concatenate, stripe, or 1139mirror disk files. For functionality similar to the old file server's 1140configuration strings, use the experimental file stack device 1141(see 1142.I fs (3)). 1143.PP 1144Fossil speaks only 9P2000. Old 9P (aka 9P1) is not supported. 1145.PP 1146... XXX words about converting an old file system to fossil? 1147.NH 1 1148References 1149.LP 1150[1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules, 1151and Yale N. Patt. 1152``Soft Updates: A Solution to the Metadata Update Problem 1153in File Systems,'' 1154.I "ACM Transactions on Computer Systems" , 1155Vol 18., No. 2, May 2000, pp. 127\-153. 1156.LP 1157[2] Sean Quinlan, ``A Cached WORM File System,'' 1158.I "Software\(emPractice and Experience" , 1159Vol 21., No 12., December 1991, pp. 1289\-1299. 1160.LP 1161[3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,'' 1162.I "Usenix Conference on File and Storage Technologies" , 11632002. 1164