1... .FP times 2... .fp 1 R R.nomath 3... .fp 5 CW LucidaSansCW83 4.TL 5Fossil, an Archival File Server 6.AU 7Sean Quinlan 8Jim McKie 9Russ Cox 10jmk,rsc@plan9.bell-labs.com 11.AB 12This paper describes the internals and 13operation of Fossil, an archival file server built for Plan 9. 14Fossil has not yet replaced the current Plan 9 file server 15and 16.CW kfs , 17but that is our eventual intent. 18Both fossil and this documentation are 19works in progress. Comments on either are most welcome. 20.AE 21.de HP 22.LP 23.. 24.NH 1 25Introduction 26.HP 27Fossil is an archival file server built for Plan 9. 28In a typical configuration, it maintains a traditional file system 29in a local disk partition and periodically archives snapshots of the file system 30to a Venti server. These archives are made available through 31a file system interface. 32Fossil can also be run without a Venti server, in which case the 33snapshots (if any) occupy local disk space. 34.PP 35The bulk of this paper explains the underlying data structures: 36Venti trees, the Venti archival file system format, and finally Fossil's 37file system format. 38The end of the paper discusses the architecture of the Fossil server. 39.PP 40The presentation of the data structures is very detailed, perhaps 41too detailed for most readers. 42The intent is to record all the details necessary to make structural 43changes to the file system format. 44Feel free to jump ahead when boredom strikes. 45.NH 1 46Venti trees and directory hierarchies 47.HP 48Venti [3] is an archival block storage server. 49Once a block is stored, it can be retrieved by presenting the 20-byte 50SHA1 hash of its contents, called a 51.I score . 52Blocks on Venti have a maximum length of about 56 kilobytes, 53though in practice smaller blocks are used. 54To store a byte stream of arbitrary length, Venti uses a hash tree. 55Conceptually, the data stream is broken into fixed-size (say, 56.I dsize -byte) 57chunks, which are stored on the Venti server. 58The resulting scores are concatenated into a new pointer stream, which is 59broken into fixed size (say, 60.I psize -byte) 61chunks, which are stored on the Venti server. 62.I Psize "" ( 63is different from 64.I dsize 65so that we can ensure that each pointer block holds an 66integral number of pointers.) 67This yields a new pointer stream, and so on, until there is a single block 68and finally a single score describing the entire tree. 69The resulting structure looks like: 70.PS 71.ps 8 72.vs 10 73boxht=0.1 74boxwid=0.1 75 76B0: box invis wid 1 "\f(CWVtDataType\fP" 77move right 0.1 78L0a: box wid 0.2 79move right 0.1 80L0b: box wid 0.2 81move right 0.1 82L0c: box invis wid 0.2 "..." 83move right 0.1 84 85L0d: box wid 0.2 86move right 0.1 87L0e: box wid 0.2 88move right 0.2 89L0f: box invis wid 0.2 "..." 90move right 0.2 91 92L0g: box wid 0.2 93move right 0.1 94L0h: box wid 0.2 95move right 0.1 96L0i: box invis wid 0.2 "..." 97move right 0.1 98 99L0j: box wid 0.2 100move right 0.1 101L0k: box wid 0.2 102move right 0.1 103L0l: box invis wid 0.2 "..." 104move right 0.1 105L0m: box wid 0.2 106 107define boxppddd { 108 line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se> 109 line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se> 110 X: box invis at 0.1<$1.nw,$1.ne> 111 Y: box invis at 0.1<$1.sw,$1.se> 112 line -> from 0.5<X,Y> to $2.nw 113 X: box invis at 0.3<$1.nw,$1.ne> 114 Y: box invis at 0.3<$1.sw,$1.se> 115 line -> from 0.5<X,Y> to $3.nw 116 "..." at 0.7<$1.w,$1.e> 117} 118 119define boxppdddp { 120 line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se> 121 line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se> 122 line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se> 123 X: box invis at 0.1<$1.nw,$1.ne> 124 Y: box invis at 0.1<$1.sw,$1.se> 125 line -> from 0.5<X,Y> to $2.nw 126 X: box invis at 0.3<$1.nw,$1.ne> 127 Y: box invis at 0.3<$1.sw,$1.se> 128 line -> from 0.5<X,Y> to $3.nw 129 "..." at 0.6<$1.w,$1.e> 130 X: box invis at 0.9<$1.nw,$1.ne> 131 Y: box invis at 0.9<$1.sw,$1.se> 132 line -> from 0.5<X,Y> to $4.nw 133} 134 135define boxpdddp { 136 line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se> 137 line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se> 138 X: box invis at 0.1<$1.nw,$1.ne> 139 Y: box invis at 0.1<$1.sw,$1.se> 140 line -> from 0.5<X,Y> to $2.nw 141 "..." at 0.5<$1.w,$1.e> 142 X: box invis at 0.9<$1.nw,$1.ne> 143 Y: box invis at 0.9<$1.sw,$1.se> 144 line -> from 0.5<X,Y> to $3.nw 145} 146 147bhd=0.4 148L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd) 149boxppddd(L1abc, L0a, L0b) 150L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd) 151boxppddd(L1def, L0d, L0e) 152L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd) 153boxppddd(L1ghi, L0g, L0h) 154L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd) 155boxppdddp(L1jklm, L0j, L0k, L0m) 156B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd) 157 158L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd) 159boxppddd(L2abcdef, L1abc, L1def) 160L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd) 161boxpdddp(L2ghijklm, L1ghi, L1jklm) 162B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd) 163 164L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd) 165boxpdddp(L3atom, L2abcdef, L2ghijklm) 166B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd) 167.PE 168.LP 169The leaves are the original data stream. Those blocks have type 170.CW VtDataType . 171The first pointer stream has type 172.CW VtPointerType0 , 173the next has type 174.CW VtPointerType1 , 175and so on. 176The figure ends with a single block of type 177.CW VtPointerType2 , 178but in general trees can have height up to 179.CW VtPointerType6 . 180For a 181.I dsize 182of 8192 bytes 183and 184.I psize 185of 8180 bytes (409 pointers), 186this gives a maximum stream size of approximately 10 zettabytes 187(2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes). 188.PP 189Data block are truncated to remove trailing runs of zeros before 190storage to Venti; they are zero-filled back to 191.I dsize 192bytes after retrieval from Venti. 193SImilarly, trailing runs of pointers to zero-length blocks are 194removed from and added back to pointer blocks. 195These simple rules happen to make it particularly efficient to store 196large runs of zeros, as might occur in a data stream with ``holes:'' 197the zero-length block itself can be interpreted as a tree of any 198depth encoding an all-zero data stream. 199.PP 200Reconstructing the data stream requires the score and type of the 201topmost block in the tree, the data chunk size, the pointer chunk size, 202and the data stream size. 203(From the data stream size and the chunk sizes we could derive the 204depth of the tree and thus the type of the topmost block, but it is convenient 205to allow trees that are deeper than necessary.) 206This information is kept in a 40-byte structure called a 207.CW VtEntry : 208.P1 209VtEntry: 210.ta +\w' 'u +\w' 'u 211 gen[4] \fRgeneration number\fP 212 psize[2] \fRsize of pointer blocks\fP 213 dsize[2] \fRsize of data blocks\fP 214 flags[1] 215 zero[5] 216 size[6] \fRlength of file\fP 217 score[20] \fRscore of root block in tree\fP 218.P2 219(In this notation, 220.CW name[sz] 221indicates a 222.CW sz -byte 223field called 224.CW name . 225Integers are stored in big-endian order. 226.CW Size 227really is a 48-bit field.) 228.CW Flags 229is made up of the following bit fields. 230.I or ' ` 231of the following flags: 232.P1 233.ta +\w' 'u +\w' 'u 2340x01 VtEntryActive \fRentry is allocated\fP 2350x02 VtEntryDir \fRentry describes a Venti directory (q.v.)\fP 2360x1C VtEntryDepthMask \fRmask for tree depth\fP 2370x20 VtEntryLocal \fRreserved (q.v.)\fP 238.P2 239.LP 240The depth of the described tree is stored in the 5 bits indicated: 241a tree with a topmost node of type 242.CW VtPointerType3 243has depth 4. 244.PP 245With 246.CW VtEntry 247we can build more complicated data structures, 248ones with multiple or nested data streams. 249A data stream consisting of 250.CW VtEntry 251structures is called a Venti directory. 252It is identical in structure to the Venti data stream 253we described earlier except that the bottom-level type is 254.CW VtDirType , 255and 256the 257.CW VtEntry 258describing a Venti directory has the 259.CW VtEntryDir 260flag bit set. 261The 262.I dsize 263for a Venti directory 264is a multiple of 40 so that each data chunk holds 265an integer number of 266.CW VtEntry 267structures. 268By analogy with Venti directories, 269we call the original data stream a 270Venti file. 271Note that Venti files are assumed 272.I not 273to contain pointers to other Venti blocks. 274The only pointers to Venti blocks occur in 275.CW VtEntry 276structures in 277Venti directories 278(and in the internal hash tree structure of the 279individual files and directories). 280Note also that these directories are nothing more than pointer lists. 281In particular, there are no names or metadata like in a file system. 282.PP 283To make it easier to pass hierarchies between applications, 284the root of a hierarchy is described in a 300-byte structure 285called a 286.CW VtRoot : 287.P1 288VtRoot: 289.ta +\w' 'u +\w' 'u 290 version[2] \f(CW2\fP 291 name[128] \fRname of structure (just a comment)\fP 292 type[128] \fRstring describing structure (\f(CWvac\fR)\f(CW 293 score[20] \fRpointer to \f(CWVtDirType\fP block\f(CW 294 blockSize[2] \fRmaximum block size in structure\fP 295 prev[20] \fRprevious \f(CWVtRoot\fP in chain, if any\fP 296.P2 297.LP 298This structure is stored to Venti and its score is passed 299between applications, typically in the form 300``\fItype\f(CW:\fIrootscore\fR,'' 301where 302.I type 303is the type field from the 304.CW VtRoot 305structure, and 306.I rootscore 307is the score of the 308.CW VtRoot 309block. 310.CW VtRoot 311structures can be chained together using the 312.I prev 313field to encode an archival history 314of the data structure. 315.PP 316For example, a small Venti hierarchy might look like: 317.PS 318.ps 8 319.vs 10 320boxwid=0.1 321boxht=0.1 322f=0.9 323mb=0.16 324 325VtRoot: [ 326 right 327 B1: box 328 move right 0.1 329 "\f(CWVtRoot\fP" ljust 330] 331 332Root: [ 333 right 334 B1: box fill f 335 B2: box fill f 336 B3: box fill f 337 move right 0.1 338] with .nw at VtRoot.sw+(0.2,-.1) 339Level1: [ 340 RootMeta: [ 341 box wid mb 342 ] 343 MetaSource: [ 344 right 345 B1: box wid 5*mb 346 ] with .nw at RootMeta.sw+(0,-.1) 347 348 Source: [ 349 right 350 B1: box fill f 351 B2: box fill f 352 B3: box fill f 353 B4: box fill f 354 B5: box fill f 355 B6: box fill f 356 B7: box fill f 357 B8: box fill f 358 ] with .nw at MetaSource.sw+(0,-.1) 359 SB1: box invis at Source.B1 360 SB2: box invis at Source.B2 361 SB3: box invis at Source.B3 362] with .nw at Root.sw+(0.4,-.1) 363Level2: [ 364 MetaSource: [ 365 right 366 B1: box wid 5*mb 367 ] 368 Source: [ 369 right 370 B1: box fill f 371 B2: box fill f 372 B3: box fill f 373 B4: box fill f 374 B5: box fill f 375 B6: box fill f 376 B7: box fill f 377 B8: box fill f 378 ] with .nw at MetaSource.sw+(0,-.1) 379 File: box wid 0.8 with .nw at Source.sw+(0,-.1) 380] with .nw at Level1.sw+(0.6,-.1) 381 382line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w 383line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w 384line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w 385line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w 386 387line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w 388line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w 389line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w 390 391[ 392 KEY: box wid 1.5 invis "Key" 393 line from KEY.sw to KEY.se 394 k = -.1 395 kk=0.5 396 A: [ 397 box wid 4*boxwid 398 "Venti file" ljust with .w at last box .w+(kk,0) 399 ] with .nw at KEY.sw+(0,2*k) 400 B: [ 401 box fill f 402 "Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0) 403 ] with .nw at A.sw+(0,k) 404 C: [ 405 right 406 CC: box fill f 407 box fill f 408 box fill f 409 box fill f 410 "Venti directory" ljust with .w at CC.w+(kk,0) 411 ] with .nw at B.sw+(0,k) 412 D: [ 413 line -> right 3*boxwid 414 "Venti pointer (score)" ljust with .w at last line .w+(kk, 0) 415 ] with .nw at C.sw+(0,k) 416] with .nw at VtRoot.nw+(3,0) 417.PE 418.LP 419Venti files are shown as white boxes, while directories are shown 420as shaded boxes. Each shaded square represents a 421.CW VtEntry . 422Arrows represent pointers from 423.CW VtEntry 424structures to other 425Venti files or directories. 426.PP 427The hierarchical structure provided by Venti files and directories 428can be used as the base for more complicated data structures. 429Because this structure captures all the information 430about pointers to other blocks, tools written to traverse 431Venti hierarchies can traverse the more complicated 432data structures as well. 433For example, 434.I venti/copy 435(see 436.I ventiaux (8)) 437copies a Venti hierarchy from one Venti server to another, 438given the root 439.CW VtEntry . 440Because the traditional file system described in later sections is 441layered on a Venti hierarchy, 442.I venti/copy 443can copy it without fully understanding its structure. 444.NH 1 445Vac file system format 446.HP 447The Venti archive format 448.I vac 449builds a traditional file system using a Venti hierarchy. 450Each vac file is implemented as a Venti file; 451each vac directory is implemented as a Venti 452directory and a Venti file to provide traditional file system metadata. 453The metadata is stored in a structure called a 454.CW DirEntry : 455.P1 456DirEntry: 457.ta +\w' 'u +\w' 'u 458 magic[4] \f(CW0x1c4d9072\fP (DirMagic)\fP 459 version[2] \f(CW9\fP 460 elem[s] \fRname (final path element only)\fP 461 entry[4] \fRentry number for Venti file or directory\fP 462 gen[4] \fRgeneration number\fP 463 mentry[4] \fRentry number for Venti file holding metadata\fP 464 mgen[4] \fRgeneration number\fP 465 qid[8] \fRunique file serial number\fP 466 uid[s] \fRowner\fP 467 gid[s] \fRgroup\fP 468 mid[s] \fRlast modified by\fP 469 mtime[4] \fRlast modification time\fP 470 ctime[4] \fRcreation time\fP 471 atime[4] \fRlast access time\fP 472 mode[4] \fRmode bits\fP 473.P2 474The notation 475.CW name[s] 476denotes a string stored as a two-byte length 477and then that many bytes. 478The above describes Version 9 of the 479.CW DirEntry 480format. Versions 7 and 8 are very similar; they can be 481read by the current 482.I vac 483source code but are not written. 484Earlier versions were not widespread. 485A 486.CW DirEntry 487may be followed by optional extension sections, though none 488are currently used. 489The 490.CW mode 491bits include bits commonly used by 492Unix and Windows, in addition to those used by Plan 9. 493.PP 494The 495.CW entry 496field is an index into the parallel Venti directory. 497The 498.CW gen 499field must match the 500.CW gen 501field in the corresponding 502.CW VtEntry 503in the directory; 504it is used to detect 505stale indices. 506Similarly, 507.CW mentry 508and 509.CW mgen 510are the index and generation number 511for the metadata Venti file, 512if the 513.CW DirEntry 514describes a vac directory. 515.PP 516The relation between Venti files and directories and 517vac files and directories can be seen in this figure: 518.PS 519.ps 8 520.vs 10 521boxwid=0.1 522boxht=0.1 523f=0.9 524mb=0.16 525 526VtRoot: [ 527 right 528 B1: box 529 move right 0.1 530 "\f(CWVtRoot\fP" ljust 531] 532 533SuperRoot: [ 534 right 535 B1: box fill f 536 move right 0.1 537 "fs root block" ljust 538] with .nw at VtRoot.sw + (0.2, -.2) 539Root: [ 540 right 541 B1: box fill f 542 B2: box fill f 543 B3: box fill f 544 move right 0.1 545 "root directory info block" ljust 546] with .nw at SuperRoot.sw+(0.2, -.2) 547Level1: [ 548 RootMeta: [ 549 box wid mb 550 move right 0.1 551 "root metadata" ljust 552 ] 553 MetaSource: [ 554 right 555 B1: box wid mb 556 B2: box wid mb 557 B3: box wid mb 558 B4: box wid mb 559 B5: box wid mb 560 ] with .nw at RootMeta.sw+(0,-.2) 561 MB1: box wid mb invis at MetaSource.B1 562 MB2: box wid mb invis at MetaSource.B2 563 MB3: box wid mb invis at MetaSource.B3 564 MB4: box wid mb invis at MetaSource.B4 565 MB5: box wid mb invis at MetaSource.B5 566 567 Source: [ 568 right 569 B1: box fill f 570 B2: box fill f 571 B3: box fill f 572 B4: box fill f 573 B5: box fill f 574 B6: box fill f 575 B7: box fill f 576 B8: box fill f 577 ] with .nw at MetaSource.sw+(0,-.1) 578 SB1: box invis at Source.B1 579 SB2: box invis at Source.B2 580 SB3: box invis at Source.B3 581 SB4: box invis at Source.B4 582 SB5: box invis at Source.B5 583 SB6: box invis at Source.B6 584 SB7: box invis at Source.B7 585 SB8: box invis at Source.B8 586] with .nw at Root.sw+(0.4,-.2) 587Level2: [ 588 MetaSource: [ 589 right 590 B1: box wid mb 591 B2: box wid mb 592 B3: box wid mb 593 B4: box wid mb 594 B5: box wid mb 595 ] 596 Source: [ 597 right 598 B1: box fill f 599 B2: box fill f 600 B3: box fill f 601 B4: box fill f 602 B5: box fill f 603 B6: box fill f 604 B7: box fill f 605 B8: box fill f 606 ] with .nw at MetaSource.sw+(0,-.1) 607 File: box wid 0.8 with .nw at Source.sw+(0,-.2) 608] with .nw at Level1.sw+(0.6,-.2) 609 610line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w 611line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w 612line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w 613line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w 614line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w 615 616line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w 617line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w 618line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w 619 620arrowwid = arrowwid/2 621arrowht = arrowht/2 622line -> from Level1.MB1 to Level1.SB1.n 623line -> from Level1.MB2 to Level1.SB2.n 624line -> from Level1.MB2 to Level1.SB3.n 625line -> from Level1.MB4 to Level1.SB7.n 626line -> from Level1.MB5 to Level1.SB5.n 627arrowwid = arrowwid * 2 628arrowht = arrowht * 2 629 630box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2 631box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2 632box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2 633 634[ 635 KEY: box wid 1.5 invis "Key" 636 line from KEY.sw to KEY.se 637 k = -.1 638 kk=0.5 639 A: [ 640 box wid 4*boxwid 641 "Venti file" ljust with .w at last box .w+(kk,0) 642 ] with .nw at KEY.sw+(0,2*k) 643 B: [ 644 box fill f 645 "Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0) 646 ] with .nw at A.sw+(0,k) 647 C: [ 648 right 649 CC: box fill f 650 box fill f 651 box fill f 652 box fill f 653 "Venti directory" ljust with .w at CC.w+(kk,0) 654 ] with .nw at B.sw+(0,k) 655 D: [ 656 line -> right 3*boxwid 657 "Venti pointer (score)" ljust with .w at last line .w+(kk, 0) 658 ] with .nw at C.sw+(0,k) 659 DD: [ 660 box dotted wid 4*boxwid 661 "Vac file" ljust with .w at last box .w+(kk,0) 662 ] with .nw at D.sw+(0,k) 663 E: [ 664 box wid mb 665 "Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0) 666 ] with .nw at DD.sw+(0,k) 667 G: [ 668 box dashed wid 4*boxwid 669 "Vac directory" ljust with .w at last box .w+(kk,0) 670 ] with .nw at E.sw+(0,k) 671 H: [ 672 arrowwid = arrowwid/2 673 arrowht = arrowht/2 674 line -> right 1.5*boxwid 675 "Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0) 676 arrowwid = arrowwid * 2 677 arrowht = arrowht * 2 678 ] with .nw at G.sw+(0,k) 679] with .nw at VtRoot.nw+(3,0) 680.PE 681.LP 682In reality, the story is slightly more complicated. 683The metadata file in a Vac directory 684is not just the concatenation of 685.CW DirEntry 686structures. 687Instead, it is the concatenation of 688.CW MetaBlocks . 689A 690.CW MetaBlock 691contains some number of 692.CW DirEntry 693structures along with a sorted index to make it easy 694to look for a particular 695.CW DirEntry 696by its 697.CW elem 698field. 699The details are in the source code. 700.PP 701As shown in the diagram, 702the root directory of the file system is summarized by 703three 704.CW VtEntry 705structures describing 706the Venti directory for the children of the root, 707the Venti file for the metadata describing the children of the root, 708and a Venti file holding metadata for the root directory itself. 709These 710.CW VtEntry 711structures are placed in a Venti directory of their own, 712described by the single 713.CW VtEntry 714in the 715root block. 716.NH 1 717Fossil file system format 718.HP 719Fossil uses the vac format, with some small changes. 720The changes only affect the data on the local disk; the data 721archived to Venti is exactly in vac format. 722.PP 723Blocks stored on local disk may contain scores pointing at local disk 724blocks or at Venti blocks. 725Local block addresses are stored as 20-byte scores in which the first 16 bytes 726are all zero and the last 4 bytes specify a block number in the disk. 727Before a block is archived, all the 728blocks it points to must be archived, and the local scores in the block 729must be changed to Venti scores. 730Using block addresses rather than content hashes for local data 731makes the local file system easier to manage: if a local block's contents 732change, the pointer to the block does not need to change. 733.NH 2 734Snapshots 735.HP 736Fossil is an archival file server. 737It takes periodic snapshots of the file system, 738which are made accessible through the file system. 739Specifically, the active file system is presented in 740.CW /active . 741Ephemeral snapshots (those that are kept on local disk and eventually deleted) 742are presented in 743\f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR, 744where 745.I yyyy 746is the full year, 747.I mm 748is the month number, 749.I dd 750is the day number, 751.I hh 752is the hour, 753and 754.I mm 755is the minute. 756Archival snapshots (those that are archived to Venti and persist forever) 757are presented in 758\f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR, 759where 760.I yyyy , 761.I mm , 762and 763.I dd 764are year, month, and day as before, 765and 766.I s 767is a sequence number if more than one 768archival snapshot is done in a day. 769For the first snapshot, 770.I s 771is null. 772For the subsequent snapshots, 773.I s 774is 775.CW .1 , 776.CW .2 , 777.CW .3 , 778etc. 779.PP 780To implement the snapshots, the file server maintains a 781current 782.I epoch 783for the active file system. 784Each local block has a label that records, among other things, 785the epoch in which the block was allocated. 786If a block was allocated in an epoch earlier than the current one, 787it is immutable and treated as copy-on-write. 788Taking a snapshot can be accomplished by 789recording the address of the current root block and then 790incrementing the epoch number. 791Notice that the copy-on-write method makes 792snapshots both time efficient and space efficient. 793The only time cost is waiting for all current file system 794requests to finish and then incrementing a counter. 795After a snapshot, blocks only get copied when they are 796next modified, so the per-snapshot 797space requirement is proportional 798to the amount of new data rather than the total 799size of the file system. 800.PP 801The blocks in the archival snapshots are moved to Venti, 802but the blocks in the ephemeral snapshots take up space 803in the local disk file. 804To allow reclamation of this disk space, the file system 805maintains a 806.I low 807.I epoch , 808which is the epoch of the earliest ephemeral snapshot 809still available. 810Fossil only allows access to snapshots with epoch numbers 811between the 812low epoch and the current epoch 813(also called the high epoch). 814Incrementing the low epoch thus makes old 815snapshots inaccessible. 816The space required to store those snapshots can then 817be reclaimed, as described below. 818.NH 2 819Local blocks 820.HP 821The bulk of the local disk file is the local blocks. 822Each block has a 14-byte label associated with it, of the format: 823.P1 824Label: 825.ta +\w' 'u +\w' 'u 826 state[1] \fRblock state\fP 827 type[1] \fRblock type\fP 828 epoch[4] \fRallocation epoch\fP 829 epochClose[4] \fRclose epoch\fP 830 tag[4] \fRrandom tag\fP 831.P2 832.LP 833The 834.CW type 835is an analogue of the block types described earlier, 836though different names are used, to distinguish between 837pointers blocks in a hash tree for a data stream 838and pointer blocks for a directory stream. 839The 840.CW epoch 841was mentioned in the last section. 842The other fields are explained below. 843.PP 844There are two distinguished blocks states 845.CW BsFree 846.CW 0x00 ) ( 847and 848.CW BsBad 849.CW 0xFF ), ( 850which mark blocks that are available for allocation 851and blocks that are bad and should be avoided. 852If 853.CW state 854is not one of these values, it is a bitwise 855.I or ' ` 856of the following flags: 857.P1 858.ta +\w' 'u +\w' 'u 8590x01 BsAlloc \fRblock is in use\fP 8600x02 BsCopied \fRblock has been copied\fP 8610x04 BsVenti \fRblock has been stored on Venti\fP 8620x08 BsClosed \fRblock has been unlinked from active file system\fP 863.P2 864.LP 865The flags are explained as they arise in the discussions below. 866.PP 867It is convenient to store some extra fields in the 868.CW VtEntry 869structure when it describes a Venti file or directory 870stored on local disk. 871Specifically, we set the 872.CW VtEntryLocal 873flag bit 874and then use the bytes 7-16 of the score (which would 875otherwise be zero, since it is a local score) to hold these fields: 876.P1 877.ta +\w' 'u +\w' 'u 878 archive[1] \fRboolean: this is an archival snapshot\fP 879 snap[4] \fRepoch number if root of snapshot\fP 880 tag[4] \fRrandom tag\fP 881.P2 882.LP 883The extended 884.CW VtEntry 885structure is called an 886.CW Entry . 887The 888.CW tag 889field 890in the 891.CW Label 892and the 893.CW Entry 894is used to identify dangling pointers or other file system corruption: 895all the local blocks in a hash tree must 896have tags matching the tag in the 897.CW Entry . 898If this 899.CW Entry 900points at the root of a snapshot, 901the 902.CW snap 903field is the epoch of the snapshot. 904If the snapshot is intended to be archived to Venti, 905the 906.CW archive 907field is non-zero. 908.NH 2 909Block reclamation 910.HP 911The blocks in the active file system form a tree: each 912block has only one parent. 913Once a copy-on-write block 914.I b 915is replaced by its copy, it is no longer 916needed by the active file system. 917At this point, 918.I b 919is unlinked from the active file system. 920We say that 921.I b 922is now 923.I closed : 924it is needed only for snapshots. 925When a block is closed, the 926.CW BsClosed 927bit is set in its state, and the current epoch (called the block's closing epoch) 928is stored in the 929.CW epochClose 930label field. 931(Open blocks have an 932.CW epochClose 933of 934.CW ~0 ). 935.PP 936A block is referenced by snapshots with epochs 937between the block's allocation epoch and its closing epoch. 938Once the file system's low epoch grows to be greater than or equal to the block's 939closing epoch, the block is no longer needed for any snapshots 940and can be reused. 941.PP 942In a typical configuration, where nightly archival snapshots 943are taken and written to Venti, it is desirable to reclaim 944the space occupied by now-archived blocks if possible. 945To do this, Fossil keeps track of whether the pointers 946in each block are unique to that block. 947When a block 948.I bb 949is allocated, a pointer to 950.I bb 951is written into exactly one active block (say, 952.I b ). 953In the absence of snapshots, the pointer to 954.I bb 955will remain unique to 956.I b , 957so that if the pointer is zeroed, 958.I bb 959can be immediately reused. 960Snapshots complicate this invariant: 961when 962.I b 963is copied-on-write, all its pointers 964are no longer unique to it. 965At time of the copy, the 966.CW BsCopied 967state bit in the block's label 968is set to note the duplication of the pointers contained within. 969.NH 2 970Disk layout 971.HP 972The file system header describes the file system layout and has this format: 973.P1 974.ta +\w' 'u +\w' 'u 975Header: 976 magic[4] \fR0x3776AE89 (HeaderMagic)\fP 977 version[2] \fR1 (HeaderVersion)\fP 978 blockSize[2] \fIfile system block size\fP 979 super[4] \fRblock offset of super block\fP 980 label[4] \fRblock offset of labels\fP 981 data[4] \fRdata blocks\fP 982 end[4] \fRend of file system\fP 983.P2 984.LP 985The corresponding file system layout is: 986.PS 987.ps 8 988.vs 9 989boxwid=0.75 990boxht=0.15 991Empty: box "empty" ht 0.25 992Header: box "header" with .n at Empty.s 993Empty2: box "empty" with .n at Header.s 994Super: box "super block" with .n at Empty2.s 995Label: box "label" "blocks" with .n at Super.s ht 0.25 996Data: box "data" "blocks" with .n at Label.s ht 0.3 997" 0" ljust at Empty.ne 998" 128kB" ljust at Header.ne 999" \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne 1000" \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne 1001" \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne 1002" \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se 1003"" at (-1,0) 1004"" at (6,0) 1005.PE 1006.LP 1007The numbers to the right of the blocks are byte offsets 1008of the boundaries. 1009.LP 1010The super block describes the file system itself and looks like: 1011.P1 1012.ta +\w' 'u +\w' 'u 1013Super: 1014 magic[4] \fR0x2340A3B1 (SuperMagic)\fP 1015 version[2] \fR1 (SuperVersion)\fP 1016 epochLow[4] \fRfile system low epoch\fP 1017 epochHigh[4] \fRfile system high (active) epoch\fP 1018 qid[8] \fRnext qid to allocate\fP 1019 active[4] \fRdata block number: root of active file system\fP 1020 next[4] \fRdata block number: root of next file system to archive\fP 1021 current[4] \fRdata block number: root of file system currently being archived\fP 1022 last[20] \fRVenti score of last successful archive\fP 1023 name[128] \fRname of file system (just a comment)\fP 1024.P2 1025.LP 1026.NH 1 1027Fossil server 1028.HP 1029The Fossil server is a user-space program that runs on a standard Plan 9 kernel. 1030.NH 2 1031Process structure 1032.PP 1033The file server is structured as a set of processes synchronizing 1034mostly through message passing along queues. 1035The processes are given names, which can be seen in the output of 1036.CW ps 1037.CW -a . 1038.PP 1039.CW Listen 1040processes announce on various network addresses. 1041A 1042.CW con 1043process handles each incoming connection, reading 9P requests 1044and adding them to a central message queue. 1045.CW Msg 1046processes remove 9P requests from the queue, 1047handle them, and write the responses to the appropriate 1048file descriptors. 1049.PP 1050The 1051.CW disk 1052process handles disk I/O requests made by the other processes. 1053The 1054.CW flush 1055process writes dirty blocks from the in-memory block cache to disk. 1056The 1057.CW unlink 1058process frees previously linked blocks once the blocks that point at them 1059have been written to disk. 1060.PP 1061A 1062.CW consI 1063reads from each console file (typically a pipe posted in 1064.CW /srv ), 1065adding the typed characters to the input queue. 1066The 1067.CW cons 1068process echoes input and runs the commands, saving 1069output in a ring buffer. 1070Because there is only one 1071.CW cons 1072process, only one console command may be executing at a time. 1073A 1074.CW consO 1075process copies this ring buffer to the each console file. 1076.PP 1077The 1078.CW periodic 1079process runs periodic events, like 1080flushing the root metadata to disk or 1081taking snapshots of the file system. 1082.NH 2 1083Block cache 1084.HP 1085Fossil maintains an in-memory block cache which 1086holds both local disk blocks and Venti blocks. 1087Cache eviction follows a least recently used policy. 1088Dirty blocks are restricted to at most half the cache. 1089This can be changed by editing 1090.CW DirtyPercentage 1091in 1092.CW dat.h . 1093.PP 1094The block cache uses soft updates [1] to ensure that the on-disk 1095file system is always self-consistent. 1096Thus there is no 1097.I halt 1098console command 1099and no need to check a file system 1100that was shut down without halting. 1101.NH 2 1102Archiving 1103.HP 1104A background process writes blocks in archival snapshots to Venti. 1105Although 1106.CW /archive/\fIyyyy\fP/\fImmdds\fR 1107is a copy of only 1108.CW /active 1109at the time of the snapshot, 1110the archival process archives the 1111entire file tree rather than just 1112the subtree rooted at 1113.CW /active . 1114The snapshots 1115.CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm 1116are stored as empty directories. 1117Once all the blocks have been archived, 1118a 1119.CW VtRoot 1120header for the file system is archived. 1121The score of that header is recorded in 1122.CW super.score 1123and also printed on the file server console. 1124The score can used by 1125.I flfmt 1126to restore a file system (see 1127.I fossil (4)). 1128.NH 2 1129Contrast with the old file server 1130.HP 1131The most obvious difference between Fossil and the 1132old Plan 9 file server [2] is that Fossil uses a Venti server as 1133its archival storage in place of a WORM juke box. 1134There are a few other architectural differences to be 1135aware of. 1136.PP 1137Fossil is a user-level program run on a standard kernel. 1138.PP 1139Fossil does not have any way to concatenate, stripe, or 1140mirror disk files. For functionality similar to the old file server's 1141configuration strings, use the experimental file stack device 1142(see 1143.I devfs (3)). 1144.PP 1145Fossil speaks only 9P2000. Old 9P (aka 9P1) is not supported. 1146.PP 1147... XXX words about converting an old file system to fossil? 1148.NH 1 1149References 1150.LP 1151[1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules, 1152and Yale N. Patt. 1153``Soft Updates: A Solution to the Metadata Update Problem 1154in File Systems,'' 1155.I "ACM Transactions on Computer Systems" , 1156Vol 18., No. 2, May 2000, pp. 127\-153. 1157.LP 1158[2] Sean Quinlan, ``A Cached WORM File System,'' 1159.I "Software\(emPractice and Experience" , 1160Vol 21., No 12., December 1991, pp. 1289\-1299. 1161.LP 1162[3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,'' 1163.I "Usenix Conference on File and Storage Technologies" , 11642002. 1165