<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Venti: a new approach to archival storage</title>
</head>

<body bgcolor="white">

<h1>Venti: a new approach to archival storage</h1>
<p>
Sean Quinlan and Sean Dorward
<br>
Bell Labs, Lucent Technologies
<p>

<h1>Abstract</h1>
<p>
This paper describes a network storage system, called Venti, intended for archival data. In this system, a unique hash of a block's contents acts as the block identifier for read and write operations. This approach enforces a write-once policy, preventing accidental or malicious destruction of data. In addition, duplicate copies of a block can be coalesced, reducing the consumption of storage and simplifying the implementation of clients. Venti is a building block for constructing a variety of storage applications such as logical backup, physical backup, and snapshot file systems.
<p>
We have built a prototype of the system and present some preliminary performance results. The system uses magnetic disks as the storage technology, resulting in an access time for archival data that is comparable to non-archival data. The feasibility of the write-once model for storage is demonstrated using data from over a decade's use of two Plan 9 file systems.
<p>

<h1>1. Introduction</h1>
<p>
Archival storage is a second class citizen. Many computer environments provide access to a few recent versions of the information stored in file systems and databases, though this access can be tedious and may require the assistance of a system administrator. Less common is the ability for a user to examine data from last month or last year or last decade. Such a feature may not be needed frequently, but when it is needed it is often crucial.
<p>
The growth in capacity of storage technologies exceeds the ability of many users to generate data, making it practical to archive data in perpetuity. Plan 9, the computing environment that the authors use, includes a file system that stores archival data to an optical jukebox [16, 17]. Ken Thompson observed that, for our usage patterns, the capacity of the jukebox could be considered infinite. In the time it took for us to fill the jukebox, the improvement in technology would allow us to upgrade to a new jukebox with twice the capacity.
<p>
Abundant storage suggests that an archival system impose a write-once policy. Such a policy prohibits either a user or administrator from deleting or modifying data once it is stored. This approach greatly reduces the opportunities for accidental or malicious data loss and simplifies the system's implementation.
<p>
Moreover, our experience with Plan 9 is that a write-once policy changes the way one views storage. Obviously, some data is temporary, derivative, or so large that it is either undesirable or impractical to retain forever and should not be archived. However, once it is decided that the data is worth keeping, the resources needed to store the data have been consumed and cannot be reclaimed. This eliminates the task of periodically "cleaning up" and deciding whether the data is still worth keeping.
More thought is required before storing the data to a write-once archive, but as the cost of storage continues to fall, this becomes an easy decision.
<p>
This paper describes the design and implementation of an archival server, called Venti. The goal of Venti is to provide a write-once archival repository that can be shared by multiple client machines and applications. In addition, by using magnetic disks as the primary storage technology, the performance of the system approaches that of non-archival storage.
<p>

<h1>2. Background</h1>
<p>
A prevalent form of archival storage is the regular backup of data to magnetic tape [15]. A typical scenario is to provide backup as a central service for a number of client machines. Client software interfaces with a database or file system and determines what data to back up. The data is copied from the client to the tape device, often over a network, and a record of what was copied is stored in a catalog database.
<p>
Restoring data from a tape backup system can be tedious and error prone. The backup system violates the access permissions of the file system, requiring a system administrator or privileged software to perform the task. Since they are tedious, restore operations are infrequent and problems with the process may go undetected. Potential sources of error abound: tapes are mislabeled or reused or lost, drives wander out of alignment and cannot read their old tapes, technology becomes obsolete.
<p>
For tape backup systems, a tradeoff exists between the performance of backup and restore operations [1]. A full backup simplifies the process of restoring data since all the data is copied to a contiguous region on the tape media. For large file systems and databases, incremental backups are more efficient to generate, but such backups are not self-contained; the data for a restore operation is scattered across multiple incremental backups and perhaps multiple tapes. The conventional solution is to limit the extent of this scattering by performing a full backup followed by a small number of incremental backups.
<p>
File systems such as Plan 9 [16, 17], WAFL [5], and AFS [7] provide a more unified approach to the backup problem by implementing a snapshot feature. A snapshot is a consistent read-only view of the file system at some point in the past. The snapshot retains the file system permissions and can be accessed with standard tools (ls, cat, cp, grep, diff) without special privileges or assistance from an administrator. In our experience, snapshots are a relied-upon and frequently-used resource because they are always available and easy to access.
<p>
Snapshots avoid the tradeoff between full and incremental backups. Each snapshot is a complete file system tree, much like a full backup. The implementation, however, resembles an incremental backup because the snapshots and the active file system share any blocks that remain unmodified; a snapshot only requires additional storage for the blocks that have changed. To achieve reasonable performance, the device that stores the snapshots must efficiently support random access, limiting the suitability of tape storage for this approach.
<p>
In the WAFL and AFS systems, snapshots are ephemeral; only a small number of recent versions of the file system are retained.
This policy is reasonable since the most recent versions of files are the most useful. For these systems, archival storage requires an additional mechanism such as tape backup.
<p>
The philosophy of the Plan 9 file system is that random access storage is sufficiently cheap that it is feasible to retain snapshots permanently. The storage required to retain all daily snapshots of a file system is surprisingly modest; later in the paper we present statistics for two file servers that have been in use over the last 10 years.
<p>
Like Plan 9, the Elephant file system [18] retains many versions of data. This system allows a variety of storage reclamation policies that determine when a version of a file should be deleted. In particular, "landmark" versions of files are retained permanently and provide an archival record.
<p>

<h1>3. The Venti Archival Server</h1>
<p>
Venti is a block-level network storage system intended for archival data. The interface to the system is a simple protocol that enables client applications to read and write variable sized blocks of data. Venti itself does not provide the services of a file or backup system, but rather the backend archival storage for these types of applications.
<p>
Venti identifies data blocks by a hash of their contents. By using a collision-resistant hash function with a sufficiently large output, it is possible to consider the hash of a data block as unique. Such a unique hash is called the fingerprint of a block and can be used as the address for read and write operations. This approach results in a storage system with a number of interesting properties.
<p>
As blocks are addressed by the fingerprint of their contents, a block cannot be modified without changing its address; the behavior is intrinsically write-once. This property distinguishes Venti from most other storage systems, in which the address of a block and its contents are independent.
<p>
Moreover, writes are idempotent. Multiple writes of the same data can be coalesced and do not require additional storage space. This property can greatly increase the effective storage capacity of the server since it does not rely on the behavior of client applications. For example, an incremental backup application may not be able to determine exactly which blocks have changed, resulting in unnecessary duplication of data. On Venti, such duplicate blocks will be discarded and only one copy of the data will be retained. In fact, replacing the incremental backup with a full backup will consume the same amount of storage. Even duplicate data from different applications and machines can be eliminated if the clients write the data using the same block size and alignment.
<p>
The hash function can be viewed as generating a universal name space for data blocks. Without cooperating or coordinating, multiple clients can share this name space and share a Venti server. Moreover, the block level interface places few restrictions on the structures and format that clients use to store their data. In contrast, traditional backup and archival systems require more centralized control. For example, backup systems include some form of job scheduler to serialize access to tape devices and may only support a small number of predetermined data formats so that the catalog system can extract pertinent meta-data.
<p>
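To make the fingerprint-as-address idea concrete, the sketch below computes the fingerprint of a block from its contents alone; any client that stores the same bytes therefore obtains the same 20-byte address. This is an illustration rather than code from the prototype, and it uses OpenSSL's SHA1() where the prototype has its own Sha1 implementation.
<p>
<pre>
/* Sketch: a Venti-style fingerprint is just the Sha1 hash of the
 * block contents.  OpenSSL's SHA1() is used here for illustration;
 * the prototype has its own implementation. */
#include &lt;stdio.h>
#include &lt;string.h>
#include &lt;openssl/sha.h>

enum { ScoreSize = 20 };	/* 160 bit Sha1 hash */

void
fingerprint(const unsigned char *data, size_t n, unsigned char score[ScoreSize])
{
	SHA1(data, n, score);
}

int
main(void)
{
	unsigned char block[8192], score[ScoreSize];
	int i;

	memset(block, 'a', sizeof block);	/* any 8 Kbyte block of data */
	fingerprint(block, sizeof block, score);
	for(i = 0; i &lt; ScoreSize; i++)	/* the score is the block's address */
		printf("%02x", score[i]);
	printf("\n");
	return 0;
}
</pre>
<p>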
Venti provides inherent integrity checking of data. When a block is retrieved, both the client and the server can compute the fingerprint of the data and compare it to the requested fingerprint. This operation allows the client to avoid errors from undetected data corruption and enables the server to identify when error recovery is necessary.
<p>
Using the fingerprint of a block as its identity facilitates features such as replication, caching, and load balancing. Since the contents of a particular block are immutable, the problem of data coherency is greatly reduced; a cache or a mirror cannot contain a stale or out of date version of a block.
<p>

<h2>3.1. Choice of Hash Function</h2>
<p>
The design of Venti requires a hash function that generates a unique fingerprint for every data block that a client may want to store. Obviously, if the size of the fingerprint is smaller than the size of the data blocks, such a hash function cannot exist since there are fewer possible fingerprints than blocks. If the fingerprint is large enough and randomly distributed, this problem does not arise in practice. For a server of a given capacity, the likelihood that two different blocks will have the same hash value, also known as a collision, can be determined. If the probability of a collision is vanishingly small, we can be confident that each fingerprint is unique.
<p>
It is desirable that Venti employ a cryptographic hash function. For such a function, it is computationally infeasible to find two distinct inputs that hash to the same value [10]. This property is important because it prevents a malicious client from intentionally creating blocks that violate the assumption that each block has a unique fingerprint. As an additional benefit, using a cryptographic hash function strengthens a client's integrity check, preventing a malicious server from fulfilling a read request with fraudulent data. If the fingerprint of the returned block matches the requested fingerprint, the client can be confident the server returned the original data.
<p>
Venti uses the Sha1 hash function [13] developed by the US National Institute of Standards and Technology (NIST). Sha1 is a popular hash algorithm for many security systems and, to date, there are no known collisions. The output of Sha1 is a 160 bit (20 byte) hash value. Software implementations of Sha1 are relatively efficient; for example, a 700 MHz Pentium 3 can compute the Sha1 hash of 8 Kbyte data blocks in about 130 microseconds, a rate of 60 Mbytes per second.
<p>
Are the 160 bit hash values generated by Sha1 large enough to ensure the fingerprint of every block is unique? Assuming random hash values with a uniform distribution, a collection of n different data blocks and a hash function that generates b bits, the probability p that there will be one or more collisions is bounded by the number of pairs of blocks multiplied by the probability that a given pair will collide, i.e.
<p>
<img src="probablity.gif" ALT="probability">
<p>
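As a concrete check of this bound (our calculation, not from the paper), the fragment below evaluates p &lt;= n(n-1)/2 * 2^-b for the exabyte-scale scenario considered next: roughly 10^14 blocks and the 160 bit Sha1 hash.
<p>
<pre>
/* Sketch: evaluating the collision bound p &lt;= n(n-1)/2 * 2^-b
 * for n ~ 10^14 blocks (an exabyte in 8 Kbyte blocks) and b = 160. */
#include &lt;stdio.h>
#include &lt;math.h>

int
main(void)
{
	double n = 1e14;	/* number of distinct blocks */
	double b = 160;		/* bits in a Sha1 fingerprint */
	double p = n*(n-1)/2 * pow(2, -b);

	printf("collision bound: %.1e\n", p);	/* about 3.4e-21, below 10^-20 */
	return 0;
}
</pre>
<p>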
Today, a large storage system may contain a petabyte (10^15 bytes) of data. Consider an even larger system that contains an exabyte (10^18 bytes) stored as 8 Kbyte blocks (~10^14 blocks). Using the Sha1 hash function, the probability of a collision is less than 10^-20. Such a scenario seems sufficiently unlikely that we ignore it and use the Sha1 hash as a unique identifier for a block. Obviously, as storage technology advances, it may become feasible to store much more than an exabyte, at which point it may be necessary to move to a larger hash function. NIST has already proposed variants of Sha1 that produce 256, 384, and 512 bit results [14]. For the immediate future, however, Sha1 is a suitable choice for generating the fingerprint of a block.
<p>

<h2>3.2. Choice of Storage Technology</h2>
<p>
When the Plan 9 file system was designed in 1989, optical jukeboxes offered high capacity with respectable random access performance and thus were an obvious candidate for archival storage. The last decade, however, has seen the capacity of magnetic disks increase at a far faster rate than optical technologies [20]. Today, a disk array costs less than the equivalent capacity optical jukebox and occupies less physical space. Disk technology is even approaching tape in cost per bit.
<p>
Magnetic disk storage is not as stable or permanent as optical media. Reliability can be improved with technology such as RAID, but unlike write-once optical disks, there is little protection from erasure due to failures of the storage server or RAID array firmware. This issue is discussed in Section 7.
<p>
Using magnetic disks for Venti has the benefit of reducing the disparity in performance between conventional and archival storage. Operations that previously required data to be restored to magnetic disk can be accomplished directly from the archive. Similarly, the archive can contain the primary copy of often-accessed read-only data. In effect, archival data need not be further down the storage hierarchy; it is differentiated by the write-once policy of the server.
<p>

<h1>4. Applications</h1>
<p>
Venti is a building block on which to construct a variety of storage applications. Venti provides a large repository for data that can be shared by many clients, much as tape libraries are currently the foundation of many centralized backup systems. Applications need to accommodate the unique properties of Venti, which are different from traditional block level storage devices, but these properties enable a number of interesting features.
<p>
Applications use the block level service provided by Venti to store more complex data structures. Data is divided into blocks and written to the server. To enable this data to be retrieved, the application must record the fingerprints of these blocks. One approach is to pack the fingerprints into additional blocks, called pointer blocks, that are also written to the server, a process that can be repeated recursively until a single fingerprint is obtained. This fingerprint represents the root of a tree of blocks and corresponds to a hierarchical hash of the original data.
<p>
A simple data structure for storing a linear sequence of data blocks is shown in Figure 1. The data blocks are located via a fixed depth tree of pointer blocks which itself is addressed by a root fingerprint. Applications can use such a structure to store a single file or to mimic the behavior of a physical device such as a tape or a disk drive. The write-once nature of Venti does not allow such a tree to be modified, but new versions of the tree can be generated efficiently by storing the new or modified data blocks and reusing the unchanged sections of the tree as depicted in Figure 2.
<p>
<img src="SimpleTree.gif" ALT="simple tree">
<p>
Figure 1. A tree structure for storing a linear sequence of blocks
<p>
<img src="ModifiedTree.gif" ALT="modified tree">
<p>
Figure 2. Building a new version of the tree.
<p>
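The sketch below shows how a client might pack a linear sequence of blocks into such a tree and obtain the single root fingerprint. It is our illustration, not code from the paper: venti_write() stands in for a write over the Venti protocol and here merely computes the block's fingerprint; a real client would also send the block to the server. Rerunning the procedure with only the changed data blocks rewritten yields the new tree of Figure 2.
<p>
<pre>
/* Sketch: packing a sequence of data-block fingerprints into a tree
 * of pointer blocks, yielding one root fingerprint.  venti_write()
 * is a stand-in for a write over the Venti protocol. */
#include &lt;string.h>
#include &lt;openssl/sha.h>

enum { ScoreSize = 20, BlockSize = 8192, PerBlock = BlockSize / ScoreSize };

void
venti_write(const unsigned char *data, int n, unsigned char score[ScoreSize])
{
	SHA1(data, n, score);	/* the actual store to the server is omitted */
}

/* Replace n fingerprints with the fingerprints of the pointer blocks
 * that contain them; returns the new, smaller count. */
static int
pack_level(unsigned char *scores, int n)
{
	unsigned char block[BlockSize];
	int i, m, out;

	for(i = 0, out = 0; i &lt; n; i += m, out++){
		m = n - i &lt; PerBlock ? n - i : PerBlock;
		memcpy(block, scores + i*ScoreSize, m*ScoreSize);
		venti_write(block, m*ScoreSize, scores + out*ScoreSize);
	}
	return out;
}

/* scores[] initially holds the fingerprints of the data blocks. */
void
pack_root(unsigned char *scores, int n, unsigned char root[ScoreSize])
{
	while(n > 1)
		n = pack_level(scores, n);
	memcpy(root, scores, ScoreSize);
}
</pre>
<p>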
By mixing data and fingerprints in a block, more complex data structures can be constructed. For example, a structure for storing a file system may include three types of blocks: directory, pointer, and data. A directory block combines the meta information for a file and the fingerprint of a tree of data blocks containing the file's contents. The depth of the tree can be determined from the size of the file, assuming the pointer and data blocks have a fixed size. Other structures are obviously possible. Venti's block-level interface leaves the choice of format to client applications and different data structures can coexist on a single server.
<p>
The following sections describe three applications that use Venti as an archival data repository: a user level archive utility called vac, a proposal for a physical level backup utility, and our preliminary work on a new version of the Plan 9 file system.
<p>

<h2>4.1. Vac</h2>
<p>
Vac is an application for storing a collection of files and directories as a single object, similar in functionality to the utilities tar and zip. With vac, the contents of the selected files are stored as a tree of blocks on a Venti server. The root fingerprint for this tree is written to a vac archive file specified by the user, which consists of an ASCII representation of the 20 byte root fingerprint plus a fixed header string, and is always 45 bytes long. A corresponding program, called unvac, enables the user to restore files from a vac archive. Naturally, unvac requires access to the Venti server that contains the actual data, but in most situations this is transparent. For a user, it appears that vac compresses any amount of data down to 45 bytes.
<p>
An important attribute of vac is that it writes each file as a separate collection of Venti blocks, thus ensuring that duplicate copies of a file will be coalesced on the server. If multiple users vac the same data, only one copy will be stored on the server. Similarly, a user may repeatedly vac a directory over time and even if the contents of the directory change, the additional storage consumed on the server will be related to the extent of the changes rather than the total size of the contents. Since Venti coalesces data at the block level, even files that change may share many blocks with previous versions and thus require little space on the server; log and database files are good examples of this scenario.
<p>
On many Unix systems, the dump utility is used to back up file systems. Dump has the ability to perform incremental backups of data; a user specifies a dump level, and only files that are new or have changed since the last dump at this level are written to the archive. To implement incremental backups, dump examines the modified time associated with each file, which is an efficient method of filtering out the unchanged files.
<p>
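Returning briefly to the vac archive file described above, its fixed 45-byte size is easy to account for if the header string is assumed to be "vac:" (the exact string is our assumption): 4 bytes of header, 40 hexadecimal digits for the 20-byte root fingerprint, and a trailing newline.
<p>
<pre>
/* Sketch: writing a vac archive file.  The "vac:" header string is
 * an assumption; 4 bytes + 40 hex digits + 1 newline = 45 bytes. */
#include &lt;stdio.h>

enum { ScoreSize = 20 };

void
write_vac_file(FILE *f, const unsigned char root[ScoreSize])
{
	int i;

	fprintf(f, "vac:");
	for(i = 0; i &lt; ScoreSize; i++)
		fprintf(f, "%02x", root[i]);
	fprintf(f, "\n");
}
</pre>
<p>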
Vac also implements an incremental option based on the file modification times. The user specifies an existing vac file and this archive is used to reduce the number of blocks written to the Venti server. For each file, vac examines the modified time in both the file system and the vac archive. If they are the same, vac copies the fingerprint for the file from the old archive into the new archive. Copying just the 20-byte fingerprint enables the new archive to include the entire file without reading the data from the file system or writing the data across the network to the Venti server. In addition, unlike an incremental dump, the resulting archive will be identical to an archive generated without the incremental option; it is only a performance improvement. This means there is no need to have multiple levels of backups, some incremental, some full, and so restore operations are greatly simplified.
<p>
A variant of the incremental option improves the backup of files without reference to modification times. As vac reads a file, it computes the fingerprint for each block. Concurrently, the pointer blocks of the old archive are examined to determine the fingerprint for the block at the same offset in the old version of the file. If the fingerprints are the same, the block does not need to be written to Venti. Instead, the fingerprint can simply be copied into the appropriate pointer block. This optimization reduces the number of writes to the Venti server, saving both network and disk bandwidth. Like the file level optimization above, the resulting vac file is no different from the one produced without this optimization. It does, however, require the data for the file to be read and is only effective if there are a significant number of unchanged blocks.
<p>

<h2>4.2. Physical backup</h2>
<p>
Utilities such as vac, tar, and dump archive data at the file or logical level: they walk the file hierarchy converting both data and meta-data into their own internal format. An alternative approach is block-level or physical backup, in which the disk blocks that make up the file system are directly copied without interpretation. Physical backup has a number of benefits including simplicity and potentially much higher throughput [8]. A physical backup utility for file systems that stores the resulting data on Venti appears attractive, though we have not yet implemented such an application.
<p>
The simplest form of physical backup is to copy the raw contents of one or more disk drives to Venti. The backup also includes a tree of pointer blocks, which enables access to the data blocks. Like vac, the end result is a single fingerprint representing the root of the tree; that fingerprint needs to be recorded outside of Venti.
<p>
Coalescing duplicate blocks is the main advantage of making a physical backup to Venti rather than copying the data to another storage medium such as tape. Since file systems are inherently block based, we expect coalescing to be effective. Not only will backups of a file system over time share many unchanged blocks, but even file systems for different machines that are running the same operating system may have many blocks in common. As with vac, the user sees a full backup of the device, while retaining the storage space advantages of an incremental backup.
<p>
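A minimal sketch of this simplest physical backup follows, reusing venti_write() and pack_root() from the earlier sketch (again our illustration, not the proposed utility): every block of the raw device is written to Venti and the resulting fingerprints are packed into a pointer tree whose root is reported to the user.
<p>
<pre>
/* Sketch: the simplest physical backup -- copy a raw device image to
 * Venti block by block, then pack the fingerprints into a pointer
 * tree and return its root.  venti_write() and pack_root() are the
 * stand-ins defined in the earlier sketch. */
#include &lt;stdio.h>
#include &lt;stdlib.h>

enum { ScoreSize = 20, BlockSize = 8192 };

void venti_write(const unsigned char *data, int n, unsigned char score[ScoreSize]);
void pack_root(unsigned char *scores, int n, unsigned char root[ScoreSize]);

int
backup_device(FILE *dev, int nblocks, unsigned char root[ScoreSize])
{
	unsigned char block[BlockSize];
	unsigned char *scores;
	int i;

	scores = malloc((size_t)nblocks * ScoreSize);
	if(scores == NULL)
		return -1;
	for(i = 0; i &lt; nblocks; i++){
		if(fread(block, 1, BlockSize, dev) != BlockSize)
			break;		/* short read: end of the device */
		venti_write(block, BlockSize, scores + i*ScoreSize);
	}
	if(i == 0){
		free(scores);
		return -1;
	}
	pack_root(scores, i, root);	/* the root is all a restore needs */
	free(scores);
	return 0;
}
</pre>
Skipping free blocks, or comparing each block's fingerprint against the corresponding entry of a previous backup's pointer tree, slots naturally into this loop; both refinements are discussed in the following paragraphs.
<p>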
One enhancement to physical backup is to copy only blocks that are actively in use in the file system. For most file system formats it is relatively easy to determine if a block is in use or free without walking the file system hierarchy. Free blocks generally contain the remnants of temporary files that were created and removed in the time between backups, and it is advantageous not to store such blocks. This optimization requires that the backup format be able to represent missing blocks, which can easily be achieved on Venti by storing a null value for the appropriate entry in the pointer tree.
<p>
The random access performance of Venti is sufficiently good that it is possible to use a physical backup without first restoring it to disk. With operating system support, it is feasible to directly mount a backup file system image from Venti. Access to this file system is read only, but it provides a natural method of restoring a subset of files. For situations where a full restore is required, it might be possible to do this restore in a lazy fashion, copying blocks from Venti to the file system as needed, instead of copying the entire contents of the file system before resuming normal operation.
<p>
The time to perform a physical backup can be reduced using a variety of incremental techniques. Like vac, the backup utility can compute the fingerprint of each block and compare this fingerprint with the appropriate entry in the pointer tree of a previous backup. This optimization reduces the number of writes to the Venti server. If the file system provides information about which blocks have changed, as is the case with WAFL, the backup utility can avoid even reading the unchanged blocks. Again, a major advantage of using Venti is that the backup utility can implement these incremental techniques while still providing the user with a full backup. The backup utility writes the new blocks to the Venti server and constructs a pointer tree with the appropriate fingerprint for the unchanged blocks.
<p>

<h2>4.3. Plan 9 File system</h2>
<p>
When combined with a small amount of read/write storage, Venti can be used as the primary location for data rather than a place to store backups. A new version of the Plan 9 file system, which we are developing, exemplifies this approach.
<p>
Previously, the Plan 9 file system was stored on a combination of magnetic disks and a write-once optical jukebox. The jukebox furnishes the permanent storage for the system, while the magnetic disks act as a cache for the jukebox. The cache provides faster file access and, more importantly, accumulates the changes to the file system during the period between snapshots. When a snapshot is taken, new or modified blocks are written from the disk cache to the jukebox.
<p>
The disk cache can be smaller than the active file system, needing only to be big enough to contain the daily changes to the file system.
However, accesses that miss the cache are significantly slower since changing platters in the jukebox takes several seconds. This performance penalty makes certain operations on old snapshots prohibitively expensive. Also, on the rare occasions when the disk cache has been reinitialized due to corruption, the file server spends several days filling the cache before performance returns to normal.
<p>
The new version of the Plan 9 file system uses Venti instead of an optical jukebox as its storage device. Since the performance of Venti is comparable to disk, this substitution equalizes access both to the active and to the archival view of the file system. It also allows the disk cache to be quite small; the cache accumulates changes to the file system between snapshots, but does not speed file access.
<p>

<h1>5. Implementation</h1>
<p>
We have implemented a prototype of Venti. The implementation uses an append-only log of data blocks and an index that maps fingerprints to locations in this log. It also includes a number of features that improve robustness and performance. This section gives a brief overview of the implementation. Figure 3 shows a block diagram of the server.
<p>
<img src="Block.gif" ALT="block diagram">
<p>
Figure 3. A block diagram of the Venti prototype.
<p>
Since Venti is intended for archival storage, one goal of our prototype is robustness. The approach we have taken is to separate the storage of data blocks from the index used to locate a block. In particular, blocks are stored in an append-only log on a RAID array of disk drives. The simplicity of the append-only log structure eliminates many possible software errors that might cause data corruption and facilitates a variety of additional integrity strategies. A separate index structure allows a block to be efficiently located in the log; however, the index can be regenerated from the data log if required and thus does not have the same reliability constraints as the log itself.
<p>
The structure of the data log is illustrated in Figure 4. To ease maintenance, the log is divided into self-contained sections called arenas. Each arena contains a large number of data blocks and is sized to facilitate operations such as copying to removable media. Within an arena is a section for data blocks that is filled in an append-only manner. In Venti, data blocks are variable sized, up to a current limit of 52 Kbytes, but since blocks are immutable they can be densely packed into an arena without fragmentation.
<p>
<img src="LogFormat.gif" ALT="log format">
<p>
Figure 4. The format of the data log.
<p>
Each block is prefixed by a header that describes the contents of the block. The primary purpose of the header is to provide integrity checking during normal operation and to assist in data recovery. The header includes a magic number, the fingerprint and size of the block, the time when the block was first written, and the identity of the user that wrote it. The header also includes a user-supplied type identifier, which is explained in Section 7. Note that only one copy of a given block is stored in the log, so the user and wtime fields correspond to the first time the block was stored to the server.
<p>
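A possible layout for this header is sketched below; the field names follow the description above, but the widths and ordering are our assumptions rather than the prototype's on-disk format. The encoding and esize fields are explained in the next paragraph.
<p>
<pre>
/* Sketch of a data log block header.  The fields are those named in
 * the text; the widths and ordering are assumptions, not the
 * prototype's actual on-disk layout. */
#include &lt;stdint.h>

enum { ScoreSize = 20 };

typedef struct BlockHeader BlockHeader;
struct BlockHeader {
	uint32_t magic;			/* marks the start of a block header */
	uint8_t  score[ScoreSize];	/* Sha1 fingerprint of the uncompressed data */
	uint8_t  type;			/* user-supplied type identifier (Section 7) */
	uint32_t size;			/* size of the data before compression */
	uint32_t wtime;			/* time the block was first written */
	uint32_t uid;			/* identity of the user that first wrote it */
	uint8_t  encoding;		/* whether and how the data was compressed */
	uint32_t esize;			/* size of the data as stored in the arena */
};
</pre>
<p>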
Before storing a block in the log, an attempt is made to compress its contents. The inclusion of data compression increases the effective capacity of the archive and is simple to add given the log structure. Obviously, some blocks are incompressible. The encoding field in the block header indicates whether the data was compressed and, if so, the algorithm used. The esize field indicates the size of the data after compression, enabling the location of the next block in the arena to be determined. The downside of using compression is the computational cost, typically resulting in a decrease in the rate that blocks can be stored and retrieved. Our prototype uses a custom Lempel-Ziv '77 [21] algorithm that is optimized for speed. Compression is not a performance bottleneck for our existing server. Future implementations may benefit from hardware solutions.
<p>
In addition to a log of data blocks, an arena includes a header, a directory, and a trailer. The header identifies the arena. The directory contains a copy of the block header and offset for every block in the arena. By replicating the headers of all the blocks in one relatively small part of the arena, the server can rapidly check or rebuild the system's global block index. The directory also facilitates error recovery if part of the arena is destroyed or corrupted. The trailer summarizes the current state of the arena itself, including the number of blocks and the size of the log. Within the arena, the data log and the directory start at opposite ends and grow towards each other. When the arena is filled, it is marked as sealed, and a fingerprint is computed for the contents of the entire arena. Sealed arenas are never modified.
<p>
The basic operation of Venti is to store and retrieve blocks based on their fingerprints. A fingerprint is 160 bits long, and the number of possible fingerprints far exceeds the number of blocks stored on a server. The disparity between the number of fingerprints and blocks means it is impractical to map the fingerprint directly to a location on a storage device. Instead, we use an index to locate a block within the log.
<p>
We implement the index using a disk-resident hash table as illustrated in Figure 5. The index is divided into fixed-sized buckets, each of which is stored as a single disk block. Each bucket contains the index map for a small section of the fingerprint space. A hash function is used to map fingerprints to index buckets in a roughly uniform manner, and then the bucket is examined using binary search. If provisioned with sufficient buckets, the index hash table will be relatively empty and bucket overflows will be extremely rare. If a bucket does overflow, the extra entries are placed in an adjacent bucket. This structure is simple and efficient, requiring one disk access to locate a block in almost all cases.
<p>
<img src="Index.gif" ALT="index format">
<p>
Figure 5. Format of the index.
<p>
The need to go through an index is the main performance penalty for Venti compared to a conventional block storage device. Our prototype uses three techniques to increase the performance: caching, striping, and write buffering.
<p>
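The lookup itself can be sketched as follows (our illustration; the structure and helper names are not the prototype's): a few bytes of the fingerprint select a bucket, one disk access fetches that bucket, and a binary search over its sorted entries yields the block's location in the data log.
<p>
<pre>
/* Sketch of the index lookup: hash the fingerprint to a bucket, read
 * that single disk block, and binary-search its sorted entries.  The
 * structures and helpers are illustrations, not the prototype's. */
#include &lt;stdint.h>
#include &lt;string.h>

enum { ScoreSize = 20, MaxEntries = 120 };

typedef struct IEntry {
	uint8_t  score[ScoreSize];	/* fingerprint */
	uint64_t offset;		/* location of the block in the data log */
} IEntry;

typedef struct IBucket {
	int    nentry;
	IEntry entry[MaxEntries];	/* kept sorted by score */
} IBucket;

void read_bucket(uint32_t bucketno, IBucket *b);	/* one disk access */

static uint32_t
bucket_for(const uint8_t score[ScoreSize], uint32_t nbuckets)
{
	uint32_t h;

	/* fingerprints are uniformly distributed, so any few bytes
	 * of the score spread blocks evenly over the buckets */
	memcpy(&amp;h, score, sizeof h);
	return h % nbuckets;
}

int
index_lookup(const uint8_t score[ScoreSize], uint32_t nbuckets, uint64_t *offset)
{
	IBucket b;
	int lo, hi, mid, c;

	read_bucket(bucket_for(score, nbuckets), &amp;b);
	for(lo = 0, hi = b.nentry - 1; lo &lt;= hi; ){
		mid = (lo + hi) / 2;
		c = memcmp(score, b.entry[mid].score, ScoreSize);
		if(c == 0){
			*offset = b.entry[mid].offset;
			return 0;	/* found: the block is in the log */
		}
		if(c &lt; 0)
			hi = mid - 1;
		else
			lo = mid + 1;
	}
	return -1;			/* absent: the block is not on the server */
}
</pre>
<p>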
The current implementation has two important caches of approximately equal size: a block cache and an index cache. A hit in the block cache returns the data for that fingerprint, bypassing both the index lookup and access to the data log. Hits in the index cache eliminate only the index lookup, but the entries are much smaller and the hit rate correspondingly higher.
<p>
Unfortunately, these caches do not speed the process of storing a new block to Venti. The server must check that the block is not a duplicate by examining the index. If the block is not contained on the server, it will obviously not be in any cache. Since the fingerprint of the block contains no internal structure, the location of a fingerprint in the index is essentially random. Furthermore, the archival nature of Venti means the entire index will not fit in memory because of the large number of blocks. Combining these factors means that the write performance of Venti will be limited to the random IO performance of the index disk, which for current technology is a few hundred accesses per second. By striping the index across multiple disks, however, we get a linear speedup. This requires a sufficient number of concurrent accesses, which we ensure by buffering the writes before accessing the index.
<p>
The prototype Venti server is implemented for the Plan 9 operating system in about 10,000 lines of C. The server runs on a dedicated dual 550 MHz Pentium III processor system with 2 Gbytes of memory and is accessed over a 100 Mbit/s Ethernet network. The data log is stored on a 500 Gbyte MaxTronic IDE Raid 5 Array and the index resides on a string of 8 Seagate Cheetah 18XL 9 Gbyte SCSI drives.
<p>

<h1>6. Performance</h1>
<p>
Table 1 gives the preliminary performance results for read and write operations in a variety of situations. For comparison, we include the SCSI performance of the RAID array. Although the performance is still several times slower than directly accessing the disk, we believe the results are promising and will improve as the system matures.
<p>
Table 1. The performance of read and write operations in Mbytes/s for 8 Kbyte blocks.
<p>
<table align=center>
<tr>
<th></th>
<th width=150>sequential reads</th>
<th width=150>random reads</th>
<th width=150>virgin writes</th>
<th width=150>duplicate writes</th>
</tr>
<tr>
<td>uncached</td>
<td align=center>0.9</td>
<td align=center>0.4</td>
<td align=center>3.7</td>
<td align=center>5.6</td>
</tr>
<tr>
<td>index cache</td>
<td align=center>4.2</td>
<td align=center>0.7</td>
<td align=center>-</td>
<td align=center>6.2</td>
</tr>
<tr>
<td>block cache</td>
<td align=center>6.8</td>
<td align=center>-</td>
<td align=center>-</td>
<td align=center>6.5</td>
</tr>
<tr>
<td>raw raid</td>
<td align=center>14.8</td>
<td align=center>1.0</td>
<td align=center>12.4</td>
<td align=center>12.4</td>
</tr>
</table>
<p>
The uncached sequential read performance is particularly bad. The problem is that these sequential reads require a random read of the index. Without assistance from the client, the read operations are not overlapped and do not benefit from the striping of the index. One possible solution is a form of read-ahead. When reading a block from the data log, it is feasible to also read several following blocks. These extra blocks can be added to the caches without referencing the index. If blocks are read in the same order they were written to the log, the latency of uncached index lookups will be avoided. This strategy should work well for streaming data such as multimedia files.
<p>
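The proposed read-ahead could take roughly the following form (a sketch of the idea above, not something implemented in the prototype; read_log_block() and cache_insert() are assumed helpers). Because every block in an arena is prefixed by a header containing its fingerprint, the extra blocks can be cached without consulting the index.
<p>
<pre>
/* Sketch of log read-ahead: after fetching the requested block,
 * continue reading the next few blocks in the arena and insert them
 * into the block cache keyed by the fingerprints in their headers.
 * read_log_block() and cache_insert() are assumed helpers. */
enum { ScoreSize = 20, ReadAhead = 8 };

typedef struct LogBlock {
	unsigned char score[ScoreSize];	/* from the block header */
	unsigned char *data;
	int size;
	unsigned long long next;	/* offset of the following block */
} LogBlock;

int  read_log_block(unsigned long long offset, LogBlock *b);
void cache_insert(const unsigned char score[ScoreSize],
	const unsigned char *data, int size);

void
readahead(unsigned long long offset)
{
	LogBlock b;
	int i;

	for(i = 0; i &lt; ReadAhead; i++){
		if(read_log_block(offset, &amp;b) &lt; 0)
			break;		/* end of the arena's data section */
		cache_insert(b.score, b.data, b.size);
		offset = b.next;	/* the header gives the next block's location */
	}
}
</pre>
<p>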
The basic assumption in Venti is that the growth in capacity of disks combined with the removal of duplicate blocks and compression of their contents enables a model in which it is not necessary to reclaim space by deleting archival data. To demonstrate why we believe this model is practical, we present some statistics derived from a decade's use of the Plan 9 file system.
<p>
The computing environment in which we work includes two Plan 9 file servers named bootes and emelie. Bootes was our primary file repository from 1990 until 1997, at which point it was superseded by emelie. Over the life of these two file servers there have been 522 user accounts of which between 50 and 100 were active at any given time. The file servers have hosted numerous development projects and also contain several large data sets including chess end games, astronomical data, satellite imagery, and multimedia files.
<p>
Figure 6 depicts the size of the active file system as measured over time by du, the space consumed on the jukebox, and the size of the jukebox's data if it were to be stored on Venti. The ratio of the size of the archival data to the active file system is also given. As can be seen, even without using Venti, the storage required to implement the daily snapshots in Plan 9 is relatively modest, a result of the block level incremental approach to generating a snapshot. When the archival data is stored to Venti the cost of retaining the snapshots is reduced significantly. In the case of the emelie file system, the size on Venti is only slightly larger than the active file system; the cost of retaining the daily snapshots is almost zero. Note that the amount of storage that Venti uses for the snapshots would be the same even if more conventional methods were used to back up the file system. The Plan 9 approach to snapshots is not a necessity, since Venti will remove duplicate blocks.
<p>
<img src="bootes.gif" ALT="storage sizes for bootes">
<img src="emelie.gif" ALT="storage sizes for emelie">
<img src="bootes2.gif" ALT="ratio of sizes for bootes">
<img src="emelie2.gif" ALT="ratio of sizes for emelie">
<p>
Figure 6. Graphs of the various sizes of two Plan 9 file servers.
<p>
When stored on Venti, the size of the jukebox data is reduced by three factors: elimination of duplicate blocks, elimination of block fragmentation, and compression of the block contents. Table 2 presents the percent reduction for each of these factors. Note that bootes uses a 6 Kbyte block size while emelie uses 16 Kbyte, so the effect of removing fragmentation is more significant on emelie.
<p>
The 10 year history of the two Plan 9 file servers may be of interest to other researchers. We have made available per-block information including a hash of each block's contents, all the block pointers, and most of the directory information. The traces do not include the actual contents of files nor the file names. There is sufficient information to reconstruct the structure of the file system and to track the daily changes to this structure over time. The traces are available at http://www.cs.bell-labs.com/~seanq/p9trace.html.
<p>
Table 2. The percentage reduction in the size of data stored on Venti.
<p>
<table align=center>
<tr>
<th></th>
<th width=150>bootes</th>
<th width=150>emelie</th>
</tr>
<tr>
<td>Elimination of duplicates</td>
<td align=center>27.8%</td>
<td align=center>31.3%</td>
</tr>
<tr>
<td>Elimination of fragments</td>
<td align=center>10.2%</td>
<td align=center>25.4%</td>
</tr>
<tr>
<td>Data Compression</td>
<td align=center>33.8%</td>
<td align=center>54.1%</td>
</tr>
<tr>
<td>Total Reduction</td>
<td align=center>59.7%</td>
<td align=center>76.5%</td>
</tr>
</table>
<p>

<h1>7. Reliability and Recovery</h1>
<p>
In concert with the development of the Venti prototype, we have built a collection of tools for integrity checking and error recovery. Example uses of these tools include: verifying the structure of an arena, checking there is an index entry for every block in the data log and vice versa, rebuilding the index from the data log, and copying an arena to removable media. These tools directly access the storage devices containing the data log and index and are executed on the server.
<p>
The directory structure at the end of each arena enhances the efficiency of many integrity and recovery operations, since it is typically two orders of magnitude smaller than the arena, yet contains most of the needed information. The index checking utility, for example, is implemented as a disk based sort of all the arena directories, followed by a comparison between this sorted list and the index. Our prototype currently contains approximately 150 million blocks using 250 Gbytes of storage. An index check takes 2.2 hours, which is significantly less than the 6 hours it takes to read all the log data.
<p>
An additional integrity and recovery feature is the association of a type identifier with every block. This 8 bit identifier is included with all client read and write operations and has the effect of partitioning the server into multiple independent domains. The idea is that the type indicates the interpretation of the data contained in the block. A client can use this feature, for example, to indicate that a block is the root node for a tree of blocks. Currently, the data format associated with a type is left entirely to the client; the server does not interpret the type other than to use it in conjunction with a fingerprint as the key with which to index a block.
<p>
One use of the type identifier is to assist the administrator in locating blocks for which a user has accidentally lost the fingerprint. Using a tool on the server, the data log can be scanned for blocks that match specified criteria, including the block type, the write time, and user identifier. The type makes it relatively simple to locate forgotten root blocks. Future uses for the type might include the ability for the server to determine the location of fingerprints within a block, enabling the server to traverse the data structures that have been stored.
<p>
By storing the data log on a RAID 5 disk array, our server is protected against single drive failures. Obviously, there are many scenarios where this is not sufficient: multiple drives may fail, there may be a fire in the machine room, the RAID firmware may contain bugs, or the device may be stolen.
<p>
Additional protection could be obtained by using one or more off-site mirrors for the server. We have not yet implemented this strategy, but the architecture of Venti makes this relatively simple. A background process on the server copies new blocks from the data log to the mirrors. This copying can be achieved using the Venti protocol; the server is simply another client to the mirror.
<p>
Even mirroring may not be sufficient. The implementation of Venti may contain bugs that can be exploited to compromise the server. An automated attack may delete data on many servers simultaneously. Storage devices that provide low level enforcement of a write-once policy would provide protection against such an attack. Write-once read-many optical jukeboxes often provide such protection, but this is not yet common for magnetic disk based storage systems. We have thus resorted to copying the sealed arenas onto removable media.
<p>

<h1>8. Related Work</h1>
<p>
The Stanford Archival Vault [2] is a prototype archival repository intended for digital libraries. The archive consists of a write-once log of digital objects (files) and several auxiliary indexes for locating objects within the log. Objects are identified by the hash of their contents using a cyclic redundancy check (CRC). Unlike Venti, this system has no way to share data between objects that are partially the same, or to build up complex data structures such as a file system hierarchy. Rather, the archive consists of a collection of separate objects with a limited ability to group objects into sets.
<p>
On Venti, blocks are organized into more complex data structures by creating hash-trees, an idea originally proposed by Merkle [11] for an efficient digital signature scheme.
<p>
The approach to block retrieval in the Read-Only Secure File System (SFSRO) [3] is comparable to Venti. Blocks are identified by the Sha1 hash of their contents and this idea is applied recursively to build up more complex structures. The focus of this system is security, not archival storage. An administrator creates a digitally signed database offline. The database contains a public read-only file system that can be published on multiple servers and efficiently and securely accessed by clients. SFSRO outperforms traditional methods for providing data integrity between a client and a file server, demonstrating an attractive property of hash-based addressing.
<p>
Given their similarities, it would be simple to implement SFSRO on top of Venti. The goal of Venti is to provide a flexible location for archival storage and SFSRO is a good example of an application that could use this capability. In fact, using Venti would provide a trivial solution to SFSRO's problem with stale NFS handles since data is never deleted from Venti and thus a stale handle will never be encountered.
<p>
Content-Derived Names [6] are another example of naming objects based on a secure hash of their contents. This work addresses the issue of naming and managing the various binary software components, in particular shared libraries, that make up an application.
<p>
The philosophy of the Elephant file system [18] is similar to Venti; large, cheap disks make it feasible to retain many versions of data.
A feature of the Elephant system is the ability to specify a variety of data retention policies, which can be applied to individual files or directories. These policies attempt to strike a balance between the costs and benefits of storing every version of a file. In contrast, Venti focuses on the problem of how to store information after deciding that it should be retained in perpetuity. A system such as the Elephant file system could incorporate Venti as the storage device for the permanent "landmark" versions of files, much as the Plan 9 file system will use Venti to archive snapshots.
<p>
Self-Securing Storage [19] retains all versions of file system data in order to provide diagnosis and recovery from security breaches. The system is implemented as a self-contained network service that exports an object-based disk interface, providing protection from compromise of the client operating system. Old data is retained for a window of time and then deleted to reclaim storage.
<p>
Venti provides many of the features of self-securing storage: the server is self-contained and accessed through a simple low-level protocol, malicious users cannot corrupt or delete existing data on the server, and old versions of data are available for inspection. It is unlikely that a system would write every file system operation to Venti since storage is never reclaimed, but not deleting data removes the constraint that an intrusion must be detected within a limited window of time. A hybrid approach might retain every version for some time and some versions for all time. Venti could provide the long-term storage for such a hybrid.
<p>

<h1>9. Future Work</h1>
<p>
Venti could be distributed across multiple machines; the approach of identifying data by a hash of its contents simplifies such an extension. For example, the IO performance could be improved by replicating the server and using a simple load balancing algorithm. When storing or retrieving a block, clients direct the operation to a server based on a few bits of the fingerprint. Such load balancing could even be hidden from the client application by interposing a proxy server that performs this operation on behalf of the client.
<p>
Today, Venti provides little security. After authenticating to the server, clients can read any block for which they know the fingerprint. A fingerprint does act as a capability since the space of fingerprints is large and the Venti protocol does not include a means of enumerating the blocks on the server. However, this protection is weak, as a single root fingerprint enables access to an entire file tree and, once a fingerprint is known, there is no way to restrict access to a particular user. We are exploring ways of providing better access control.
<p>
To date, the structures we have used for storing data on Venti break files into a series of fixed sized blocks. Identical blocks are consolidated on Venti, but this consolidation will not occur if the data is shifted within the file or an application uses a different block size. This limitation can be overcome using an adaptation of Manber's algorithm for finding similarities in files [9]. The idea is to break files into variable sized blocks based on the identification of anchor or break points, increasing the occurrence of duplicate blocks [12]. Such a strategy can be implemented in client applications with no change to the Venti server.
<p>
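A simplified sketch of such a scheme is shown below. It uses a basic multiplicative rolling hash as a stand-in for the Rabin fingerprints of [12]: a block boundary is declared whenever the hash of the last few dozen bytes matches a fixed bit pattern, subject to minimum and maximum block sizes, so boundaries depend only on nearby content and survive insertions and deletions elsewhere in the file.
<p>
<pre>
/* Sketch: content-defined block boundaries with a simple rolling
 * hash (a stand-in for the Rabin fingerprints of [12]).  emit() is
 * called with the offset and length of each variable-sized block. */
#include &lt;stdio.h>

enum {
	Window   = 48,			/* bytes of context that set a boundary */
	MinBlock = 2*1024,		/* avoid very small blocks */
	MaxBlock = 32*1024,		/* and unbounded ones */
	Mask     = (1&lt;&lt;13) - 1	/* roughly 8 Kbyte average block size */
};

void
chunk(FILE *f, void (*emit)(long start, long len))
{
	unsigned char win[Window] = {0};
	unsigned long hash = 0, bw = 1;
	long start = 0, n = 0;
	int i, c;

	for(i = 0; i &lt; Window; i++)
		bw *= 31;		/* 31^Window, used to drop the oldest byte */

	while((c = getc(f)) != EOF){
		hash = hash*31 + c - bw*win[n % Window];
		win[n % Window] = c;
		n++;
		if((n - start >= MinBlock &amp;&amp; (hash &amp; Mask) == Mask) || n - start >= MaxBlock){
			emit(start, n - start);	/* fingerprint and store this block */
			start = n;
		}
	}
	if(n > start)
		emit(start, n - start);
}
</pre>
<p>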
A more detailed analysis of the decade of daily snapshots of the Plan 9 file systems might be interesting. The trace data we have made publicly available contains approximately the same information used for other studies of long term file activity [4].
<p>

<h1>10. Conclusion</h1>
<p>
The approach of identifying a block by the Sha1 hash of its contents is well suited to archival storage. The write-once model and the ability to coalesce duplicate copies of a block make Venti a useful building block for a number of interesting storage applications.
<p>
The large capacity of magnetic disks allows archival data to be retained and kept available on-line with performance that is comparable to conventional disks. Stored on our prototype server is over a decade of daily snapshots of two major departmental file servers. These snapshots are stored in a little over 200 Gbytes of disk space. Today, 100 Gbyte drives cost less than $300 and IDE RAID controllers are included on many motherboards. A scaled down version of our server could provide archival storage for a home user at an attractive price. Tomorrow, when terabyte disks can be had for the same price, it seems unlikely that archival data will be deleted to reclaim space. Venti provides an attractive approach to storing that data.
<p>

<h1>11. Acknowledgments</h1>
<p>
This paper was improved by comments and suggestions from Peter Bosch, Eric Grosse, Lorenz Huelsbergen, Rob Pike, Ross Quinlan, and Cliff Young, and six anonymous reviewers. The paper's shepherd was Ethan L. Miller. We thank them all for their help.
<p>

<h1>12. References</h1>
<p>
[1] Ann Chervenak, Vivekenand Vellanki, and Zachary Kurmas. Protecting file systems: A survey of backup techniques. In Proceedings of the Joint NASA and IEEE Mass Storage Conference, March 1998.
<p>
[2] Arturo Crespo and Hector Garcia-Molina. Archival storage for digital libraries. In Proceedings of the 3rd ACM International Conference on Digital Libraries, 1998.
<p>
[3] Kevin Fu, Frans Kaashoek, and David Mazières. Fast and secure distributed read-only file system. In Proceedings of the 4th Symposium on Operating Systems Design and Implementation, 2000.
<p>
[4] Timothy J. Gibson, Ethan L. Miller, and Darrell D. E. Long. Long-term file activity and inter-reference patterns. In Proceedings, 24th International Conference on Technology Management and Performance Evaluation of Enterprise-Wide Information Systems, Computer Measurement Group, December 1998.
<p>
[5] Dave Hitz, James Lau, and Michael Malcolm. File system design for an NFS file server appliance. In Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA, January 1994.
<p>
[6] J. K. Hollingsworth and E. L. Miller. Using content-derived names for configuration management. In Proceedings of the 1997 ACM Symposium on Software Reusability, Boston, May 1997.
<p>
[7] John Howard, Michael Kazar, Sherri Menees, David Nichols, Mahadev Satyanarayanan, Robert Sidebotham, and Michael West. Scale and performance in a distributed file system. ACM Transactions on Computer Systems, 6(1):51-81, February 1988.
<p>
[8] Norman C. Hutchinson, Stephen Manley, Mike Federwisch, Guy Harris, Dave Hitz, Steven Kleiman, and Sean O'Malley. Logical vs. physical file system backup. In Proceedings of the 3rd USENIX Symposium on Operating Systems Design and Implementation (OSDI), 1999.
<p>
[9] Udi Manber. Finding similar files in a large file system. In Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA, January 1994.
<p>
[10] Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone. Handbook of Applied Cryptography. CRC Press, 1996.
<p>
[11] Ralph C. Merkle. Protocols for public-key cryptosystems. In Proceedings of the IEEE Symposium on Security and Privacy, pp. 122-133, April 1980.
<p>
[12] Athicha Muthitacharoen, Benjie Chen, and David Mazières. A low-bandwidth network file system. In Proceedings of the 18th Symposium on Operating Systems Principles, October 2001.
<p>
[13] National Institute of Standards and Technology, FIPS 180-1. Secure Hash Standard. US Department of Commerce, April 1995.
<p>
[14] National Institute of Standards and Technology, Draft FIPS 180-2. Secure Hash Standard. US Department of Commerce, May 2001.
<p>
[15] Evi Nemeth, Garth Snyder, Scott Seebass, and Trent R. Hein. UNIX System Administration Handbook, 3rd Edition. Prentice Hall, 2001.
<p>
[16] Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken Thompson, Howard Trickey, and Phil Winterbottom. Plan 9 from Bell Labs. Computing Systems, Vol. 8, 3, pp. 221-254, Summer 1995.
<p>
[17] Sean Quinlan. A cached WORM file system. Software-Practice and Experience, Vol. 21, 12, pp. 1289-1299, December 1991.
<p>
[18] Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Alistair C. Veitch, Ross W. Carton, and Jacob Ofir. Deciding when to forget in the Elephant file system. In Proceedings of the 17th Symposium on Operating Systems Principles, December 12-15, 1999.
<p>
[19] John D. Strunk, Garth R. Goodson, Michael L. Scheinholtz, Craig A. N. Soules, and Gregory R. Ganger. Self-securing storage: protecting data in compromised systems. In Proceedings of the 4th Symposium on Operating Systems Design and Implementation, October 2000.
<p>
[20] D. A. Thompson and J. S. Best. The future of magnetic data storage technology. IBM Journal of Research and Development, Vol. 44, 3, pp. 311-322, May 2000.
<p>
[21] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory, Vol. IT-23, pp. 337-343, May 1977.
<p>

</body>
</html>