                            HAMMER2 DESIGN DOCUMENT

                                Matthew Dillon
                             dillon@backplane.com

                               08-Dec-2018 (v6)
                               24-Jul-2017 (v5)
                               09-Jul-2016 (v4)
                               03-Apr-2015 (v3)
                               14-May-2013 (v2)
                               08-Feb-2012 (v1)

                        Current Status as of document date

* Filesystem Core       - operational
  - bulkfree            - operational
  - Compression         - operational
  - Snapshots           - operational
  - Deduper             - live operational, batch specced
  - Subhierarchy quotas - scrapped (still possible on a limited basis)
  - Logical Encryption  - not specced yet
  - Copies              - not specced yet
  - fsync bypass        - not specced yet
  - FS consistency      - operational

* Clustering core
  - Network msg core    - operational
  - Network blk device  - operational
  - Error handling      - under development
  - Quorum Protocol     - under development
  - Synchronization     - under development
  - Transaction replay  - not specced yet
  - Cache coherency     - not specced yet

                            Recent Document Changes

* Reorganized the feature list to indicate currently operational features
  first, and moved the future features to another section (since they
  are taking so long to implement).

                            Current Features List

* Standard filesystem semantics with full hardlink and softlink support.
  64-bit hardlink count field.

* The topology is indexed with a dynamic radix tree rooted in several
  places: the super-root, the PFS root inode, and any inode boundary.
  Index keys are 64 bits.  Each element is referenced with a blockref
  structure (described below) that is capable of referencing a power-of-2
  sized block.  The block size is currently capped at 64KB to play
  nice(r) with the buffer cache and SSDs.

  The dynamic radix tree pushes elements into new indirect blocks only
  when the current level fills up, and will delete empty indirect blocks
  when a level is cleaned out.

* Block-copy-on-write filesystem mechanism for both the main topology
  and for the freemap.  Media-level block frees are deferred and flushes
  rotate between (up to) 4 volume headers (capped at 4 if the filesystem
  is > ~8GB).  Recovery will choose the most recent fully-valid volume
  header and can thus work around failures which cause partial volume
  header writes.

  Modifications issue copy-on-write updates up to the volume root.

* Utilizes a fat blockref structure (128 bytes) which can store up to
  64 bytes (512 bits) of check code data for each referenced block.
  In the original implementation I had gone with 64 byte blockrefs,
  but I eventually decided that I wanted to support up to a 512-bit
  hash (which eats 64 bytes), so I bumped it up to 128 bytes.  This
  turned out to be fortuitous because it made it possible to store
  most directory entries directly in the blockref structure without
  having to reference a separate data block via the blockref structure.

* 1KB 'fat' inode structure.  The inode structure directly embeds four
  blockrefs so small files and directories can be represented without
  requiring an indirect block to be allocated.  The inode structure can
  also overload the same space to store up to 512 bytes of direct
  file data (for files which are <= 512 bytes long).

  The super-root and PFS root inodes are directly represented in the
  topology, without the use of directory entries.  A combination of
  normal directory entries and separately-indexed inodes is implemented
  under each PFS.

  Normal filesystem inodes (other than inode 1) are indexed under the PFS
  root inode by their inode number.  Directory entries are indexed under the
  same PFS root by their filename hash.  Bit 63 is used to distinguish and
  partition the two.
  Filename hash collisions are handled by incrementing
  reserved low bits in the filename hash code.

* Directory entries representing filenames that are less than 64 bytes
  long are directly stored AS blockrefs.  This means that an inode
  representing a small directory can store up to 4 directory entries in
  the inode itself before resorting to indirect blocks, and then those
  indirect blocks themselves can directly embed up to 512 directory entries.
  Directory entries with long filenames reference an indirect data block
  to hold the filename instead of directly-embedding the filename.

  This results in *very* compact directories in terms of I/O bandwidth.
  Not as compact as e.g. UFS's variable-length directory entries, but still
  very good with a nominal 128 real bytes per directory entry.

  Because directory entries are represented using a dynamic radix tree via
  its blockrefs, directory entries can be randomly looked up without having
  to scan the whole directory.

* Multiple PFSs.  In HAMMER2, all PFSs are implemented the same way, with
  the kernel choosing a default PFS name for the mount if none is specified.
  For example, "ROOT" is the default PFS name for a root mount.  You can
  create as many PFSs as you like and you can specify the PFS name in the
  mount command using the <device_path>@<pfs_name> notation.

* Snapshots are implemented as PFSs.  Due to the copy-on-write nature of
  the filesystem, taking a snapshot is a trivial operation requiring only
  a normal filesystem sync and copying of the PFS root inode (1KB), and
  that's it.

  On the minus side, snapshots can complicate the bulkfree operation that
  is responsible for freeing up disk space.  It can take significantly
  longer when many snapshots are present.

* SNAPSHOTS ARE READ-WRITE.  You can mount any PFS read-write, including
  snapshots.  For example, you can revert to an earlier 'root' that you
  made a snapshot of simply by changing what the system mounts as the root
  filesystem.

* Full filesystem coherency at both the radix tree level and the filesystem
  semantics level.  This is true for all filesystem syncs, recovery after
  a crash, and snapshots.

  The filesystem syncs fully vfsync the buffer cache for the files
  that are part of the sync group, and keep track of dependencies to
  ensure that all inter-dependent inodes are flushed in the same sync
  group.  Atomic filesystem ops such as write()s are guaranteed to remain
  atomic across a sync, snapshot, and crash.

* Flushes and syncs are almost entirely asynchronous and will run
  concurrently with frontend operations.
  This feature is implemented by adding inodes
  to the sync group currently being flushed on-the-fly as new dependencies
  are created, and reordering inodes in the sync queue to prioritize inodes
  which the frontend is stalled on.

  By reprioritizing inodes in the syncq, frontend stalls are minimized.

  The only synchronous disk operation is the final sync of the volume
  header which updates the ultimate root of the filesystem.  A disk flush
  command is issued synchronously, then the write of the volume header is
  issued synchronously.  All other writes to the disk, regardless of the
  complexity of the dependencies, occur asynchronously and can make very
  good use of high-speed I/O and SSD bandwidth.

* Low memory footprint.  Except for the volume header, the buffer cache
  is completely asynchronous and dirty buffers can be retired by the OS
  directly to backing store with no further interactions with the filesystem.

* Compression support.  Multiple algorithms are supported and can be
  configured on a subdirectory hierarchy or individual file basis.
  Block compression up to 64KB will be used.  Only compression ratios at
  powers of 2 that are at least 2:1 (e.g. 2:1, 4:1, 8:1, etc) will work in
  this scheme because physical block allocations in HAMMER2 are always
  power-of-2.

  Modest compression can be achieved with low overhead, is turned on
  by default, and is compatible with deduplication.

  Compression is extremely useful and often gives you anywhere from 25%
  to 400% more logical storage than you have physical blocks, depending
  on the data.  Of course, .tgz and other pre-compressed files cannot be
  compressed further by the filesystem.

  The usefulness should not be underestimated.  Our users are constantly
  being surprised at the things the filesystem is able to compress, which
  just makes life a lot easier.  For example, 30GB core dumps tend to
  contain a great deal of highly compressible data.  Source trees, web
  files, executables, general data... this is why HAMMER2 turns modest
  compression on by default.  It just works.

* De-duplication support.  HAMMER2 uses a relatively simple freemap
  scheme that allows the filesystem to discard block references
  asynchronously.  The same scheme allows essentially unlimited references
  to the same data block in the hierarchy.  Thus, both live de-duplication
  and bulk deduplication are relatively easy to implement.

  HAMMER2 currently implements only live de-duplication.  This means that
  typical situations such as when copying files or whole directory hierarchies
  will naturally de-duplicate.
  Simply reading filesystem data in makes
  it available for deduplication later.  HAMMER2 will index a potentially
  very large number of blocks in memory, even beyond what the buffer cache
  can hold, for deduplication purposes.

* Zero-fill detection on write (writing all-zeros), which requires the data
  buffer to be scanned, is fully supported.  This allows the writing of 0's
  to create holes.

  Generally speaking, pre-writing zeroed blocks to reserve space doesn't work
  well on copy-on-write filesystems.  However, if both compression and
  check codes are disabled on a file, H2 will also disable zero-detection,
  allowing the file blocks to be pre-reserved (by actually zeroing them and
  reusing them later on), and allow data overwrites to write to the same
  sector.  Please be aware that DISABLING THE CHECK CODE IN THIS MANNER ALSO
  MEANS THAT SNAPSHOTS WILL NOT WORK.  The snapshot will contain the latest
  data for the file and not the data as-of the snapshot.  This is NOT turned
  on by default in HAMMER2 and is not recommended except in special
  well-controlled circumstances.

* Multiple supporting kernel threads, breaking up frontend VOP operation
  from backend I/O, compression, and decompression operation.  Buffer cache
  I/O and VOP ops message the backend.
  Actual I/O is handled by the backend
  and not by the frontend, which will theoretically allow us to survive
  stalled devices and nodes when implementing multi-node support.

                            Pending Features
                          (not yet implemented)

* Constructing a filesystem across multiple nodes.  Each low-level H2 device
  would be able to accommodate nodes belonging to multiple cluster components
  as well as nodes that are simply local to the device or machine.

  CURRENT STATUS: Not yet operational.

* Incremental synchronization via highest-transaction id propagation
  within the radix tree.  This is a queueless, incremental design.

  CURRENT STATUS: Due to the flat inode hierarchy now being employed,
  the current synchronization code which silently recurses indirect nodes
  will be inefficient due to the fact that all the inodes are at the
  same logical level in the topology.  To fix this, the code will need
  to explicitly iterate indirect nodes and keep track of the related
  key ranges to match them up on an indirect-block basis, which would
  be incredibly efficient.

* Background synchronization and mirroring occurs at the logical layer
  rather than the physical layer.  This allows cluster components to
  have differing storage arrangements.

  In addition, this mechanism will fully correct any out of sync nodes
  in the cluster as long as a sufficient number of other nodes agree on
  what the proper state should be.

  CURRENT STATUS: Not yet operational.

* Encryption.  Whole-disk encryption is supported by another layer, but I
  intend to give H2 an encryption feature at the logical layer which works
  approximately as follows:

  - Encryption controlled by the client on an inode/sub-tree basis.
  - Server has no visibility to decrypted data.
  - Encrypt filenames in directory entries.  Since the filename[] array
    is 256 bytes wide, client can add random bytes after the normal
    terminator to make it virtually impossible for an attacker to figure
    out the filename.
  - Encrypt file size and most inode contents.
  - Encrypt file data (holes are not encrypted).
  - Encryption occurs after compression, with random filler.
  - Check codes calculated after encryption & compression (not before).

  - Blockrefs are not encrypted.
  - Directory and File Topology is not encrypted.
  - Encryption is not sub-topology validation.  Client would have to keep
    track of that itself.  Server or other clients can still e.g. remove
    files, rename, etc.

  In particular, note that even though the file size field can be encrypted,
  the server does have visibility on the block topology and thus has a pretty
  good idea how big the file is.  However, a client could add junk blocks
  at the end of a file to make this less apparent, at the cost of space.

  If a client really wants a fully validated H2-encrypted space the easiest
  solution is to format a filesystem within an encrypted file by treating it
  as a block device, but I digress.

  CURRENT STATUS: Not yet operational.

* Device ganging, copies for redundancy, and file splitting.

  Device ganging - The idea here is not to gang devices into a single
  physical volume but to instead format each device independently
  and allow crossover-references in the blockref to other devices in
  the set.

  One of the things we want to accomplish is to ensure that a failed
  device does not prevent access to radix tree elements in other devices
  in the gang, and that the failed device can be reconstructed.  To do
  this, each device implements complete reachability from the node root
  to all elements underneath it.  When a device fails, the synchronization
  code can theoretically reconstruct the missing material in other
  devices making up the gang.
  New devices can be added to the gang and
  existing devices can be removed from the gang.

  Redundant copies - This is actually a fairly tough problem.  The
  solution I would like to implement is to use the device ganging feature
  to also implement redundancy, so that if a device fails within the
  gang there's a good chance that it can still remain completely functional
  without having to resynchronize.  But making this work is difficult to say
  the least.

  CURRENT STATUS: Not yet operational.

* MESI Cache coherency for multi-master/multi-client clustering operations.
  The servers hosting the MASTERs are also responsible for keeping track of
  the cache state.

  This is a feature that we would need to implement coherent cross-machine
  multi-threading and migration.

  CURRENT STATUS: Not yet operational.

* Implement unverified de-duplication (where only the check code is tested,
  avoiding having to actually read data blocks to calculate a
  de-duplication).  This would make use of the blockref structure's widest
  check field (512 bits).

  Out of necessity this type of feature would be settable on a file or
  recursive directory tree basis, but should only be used when the data
  is throw-away or can be reconstructed since data corruption (mismatched
  duplicates with the same hash) is still possible even with a 512-bit
  check code.

  The Unverified dedup feature is intended only for those files where
  occasional corruption is ok, such as in a web-crawler data store or
  other situations where the data content is not critically important
  or can be externally recovered if it becomes corrupt.

  CURRENT STATUS: Not yet operational.

                                GENERAL DESIGN

HAMMER2 generally implements a copy-on-write block design for the filesystem,
which is very different from HAMMER1's B-Tree design.  Because the design
is copy-on-write it can be trivially snapshotted simply by making a copy
of the block table we desire to snapshot.  Snapshotting the root inode
effectively snapshots the entire filesystem, whereas snapshotting a file
inode only snapshots that one file.  Snapshotting a directory inode is
generally unhelpful since it only contains directory entries and the
underlying files are not arranged under it in the radix tree.
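
To illustrate why a snapshot only needs a copy of the root of the block
table, here is a toy copy-on-write tree.  It is a sketch under simplifying
assumptions (in-memory immutable tuples instead of media blocks and
blockrefs, and a fixed fan-out of 4); Node, cow_write and read are
hypothetical names, not HAMMER2's data structures or functions.

```python
from typing import NamedTuple, Optional, Tuple

FANOUT = 4  # small fan-out, chosen only to keep the example readable


class Node(NamedTuple):
    children: Tuple[Optional["Node"], ...] = (None,) * FANOUT
    data: Optional[bytes] = None  # set on leaves only


def cow_write(root: Optional[Node], path: Tuple[int, ...], data: bytes) -> Node:
    """Return a new root; nodes on the path are copied, none are modified."""
    if not path:
        return Node(data=data)
    node = root if root is not None else Node()
    kids = list(node.children)
    # Only the child on the path is replaced; siblings are shared as-is.
    kids[path[0]] = cow_write(kids[path[0]], path[1:], data)
    return Node(children=tuple(kids))


def read(root: Optional[Node], path: Tuple[int, ...]) -> Optional[bytes]:
    """Follow path from root; return the leaf data, or None for a hole."""
    node = root
    for i in path:
        if node is None:
            return None
        node = node.children[i]
    return None if node is None else node.data
```

For example, after live = cow_write(None, (0, 1), b"v1"), keeping
snapshot = live and then calling live = cow_write(live, (0, 1), b"v2")
leaves read(snapshot, (0, 1)) returning the old data, while any subtree
untouched by the write is shared between both roots.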

The copy-on-write design implements a block table as a radix-tree,
with a small fan-out in the volume header and inode (typically 4x) and
a large fan-out for indirect blocks (typically 128x or 512x, depending
on the indirect block size).  The table is built bottom-up.  Intermediate
radixes are only created when necessary so small files and directories
will have a much shallower radix tree.

HAMMER2 implements several space optimizations:

    1. Directory entries with filenames <= 64 bytes will fit entirely
       in the 128-byte blockref structure and do not require additional data
       block references.  Since blockrefs are the core elements making up
       block tables, most directories should have good locality of reference
       for directory scans.

       Filenames > 64 bytes require a 1KB data-block reference, which
       is clearly less optimal, but very few filenames in a filesystem tend
       to be longer than 64 bytes so it works out.  This also simplifies
       the handling for large filenames as we can allow filenames up to
       1023 bytes long with this mechanism with no major changes to the
       code.

    2. Inodes embed 4 blockrefs, so files up to 256KB and directories with
       up to four directory entries (not including "." or "..") can be
       accommodated without requiring any indirect blocks.

    3. Indirect blocks can be sized to any power of two up to 65536 bytes,
       and H2 typically uses 16384 and 65536 bytes.  The smaller size is
       used for initial indirect blocks to reduce storage overhead for
       medium-sized files and directories.

    4. The File inode itself can directly hold the data for small
       files <= 512 bytes in size, overloading the space also used
       by its four 128-byte blockrefs (which are not needed if the
       file is <= 512 bytes in size).  This works out great for small
       files and directories.

    5. The last block in a file will have a storage allocation in powers
       of 2 from 1KB to 64KB as needed.  Thus a small file in excess of
       512 bytes but less than 64KB will not waste a full 64KB block.

    6. When compression is enabled, small physical blocks will be allocated
       when possible.  However, only reductions in powers of 2 are supported.
       So if a 64KB data block can be compressed to (16KB+1) to 32KB, then
       a 32KB block will be used.  This gives H2 modest compression at very
       low cost without too much added complexity.

    7. Live de-dup will attempt to share data blocks when file copying is
       detected, significantly reducing actual physical writes to storage
       and the storage used.  Bulk de-dup (when implemented), will catch
       other cases of de-duplication.
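
The sizes quoted in the list above follow from a few lines of arithmetic.
The sketch below only restates numbers given in the text (128-byte
blockrefs, 1KB to 64KB power-of-2 allocations, 4 embedded blockrefs); the
function names are hypothetical, and applying the same 1KB floor to
compressed blocks is an assumption.

```python
BLOCKREF_BYTES = 128   # from the text: size of one blockref
MIN_ALLOC = 1024       # from the text: smallest allocation (1KB)
MAX_ALLOC = 65536      # from the text: block size cap (64KB)


def alloc_size(nbytes: int) -> int:
    """Smallest power-of-2 allocation in [1KB, 64KB] that holds nbytes."""
    if not 0 < nbytes <= MAX_ALLOC:
        raise ValueError("expected 1..65536 bytes")
    size = MIN_ALLOC
    while size < nbytes:
        size *= 2
    return size


def blockrefs_per_block(block_bytes: int) -> int:
    """Fan-out of an indirect block that holds nothing but blockrefs."""
    return block_bytes // BLOCKREF_BYTES


def max_direct_file_bytes(embedded_refs: int = 4) -> int:
    """Largest file reachable through the inode's embedded blockrefs alone."""
    return embedded_refs * MAX_ALLOC
```

With these definitions a 64KB block that compresses to 16KB+1 bytes gets a
32KB allocation, 16KB and 64KB indirect blocks hold 128 and 512 blockrefs
respectively, and four embedded blockrefs cover a 256KB file.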

Directories contain directory entries which are indexed using a hash of
their filename.  The hash is carefully designed to maintain some natural
sort ordering.  The directory entries are implemented AS blockrefs.  So
an inode can contain up to 4 before requiring an indirect block, and
each indirect block can contain up to 512 entries, with further data block
references required for any directory entry whose filename is > 64 bytes.
Because the directory entries are blockrefs, random access lookups are
maximally efficient.  The directory hash is designed to very loosely try
to retain some alphanumeric sorting to bundle similarly-named files together
and reduce random lookups.

The copy-on-write nature of the filesystem means that any modification
whatsoever will have to eventually synchronize new disk blocks all the way
to the super-root of the filesystem and then to the volume header itself.
This forms the basis for crash recovery and also ensures that recovery
occurs on a completed high-level transaction boundary.  All disk writes are
to new blocks except for the volume header (which cycles through 4 copies),
thus allowing all writes to run asynchronously and concurrently prior to
and during a flush, and then just doing a final synchronization and volume
header update at the end.  Many of HAMMER2's features are enabled by this
core design feature.
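
The crash-recovery property described above rests on picking the newest
volume header copy that was written completely.  A minimal sketch of that
selection follows, assuming each header copy carries a transaction id and
can be checked for validity.  The VolHeader record and its tid/valid fields
are hypothetical names, and the simple round-robin in next_slot is an
assumption about how the 4 copies are cycled.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class VolHeader:
    slot: int    # which of the (up to) 4 header copies this is
    tid: int     # hypothetical: transaction id of the flush that wrote it
    valid: bool  # hypothetical: True if the copy passed its check code


def pick_volume_header(headers: Sequence[VolHeader]) -> Optional[VolHeader]:
    """Return the most recent fully-valid header, or None if none is valid."""
    candidates = [h for h in headers if h.valid]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.tid)


def next_slot(current_slot: int, nheaders: int = 4) -> int:
    """Assumed rotation: each flush writes the next header copy in turn."""
    return (current_slot + 1) % nheaders
```

A header torn by a crash mid-write simply fails validation and is skipped,
so recovery falls back to the previous flush, which is still intact because
no block it references has been overwritten.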

The Freemap is also implemented using a radix tree via a set of pre-reserved
blocks (approximately 4MB for every 2GB of storage), and also cycles through
multiple copies to ensure that crash recovery can restore the state of the
filesystem quickly at mount time.

HAMMER2 tries to maintain a small footprint and one way it does this is
by using the normal buffer cache for data and meta-data, and allowing the
kernel to asynchronously flush device buffers at any time (even during
synchronization).  The volume root is flushed separately, separated from
the asynchronous flushes by a synchronizing BUF_CMD_FLUSH op.  This means
that HAMMER2 has very low resource overhead from the point of view of the
operating system and is very much unlike HAMMER1 which had to lock dirty
buffers into memory for long periods of time.  HAMMER2 has no such
requirement.

Buffer cache overhead is very well bounded and can handle filesystem
operations of any complexity, even on boxes with very small amounts
of physical memory.  Buffer cache overhead is significantly lower with H2
than with H1 (and orders of magnitude lower than ZFS).

At some point I intend to implement a shortcut to make fsync()'s run fast,
and that is to allow deep updates to blockrefs to shortcut to auxiliary
space in the volume header to satisfy the fsync requirement.
The related blockref is then recorded when the filesystem is mounted after
a crash and the update chain is reconstituted when a matching blockref is
encountered again during normal operation of the filesystem.

                        FILESYSTEM SYNC SEQUENCING

HAMMER2 implements a filesystem sync mechanism that allows the frontend
to continue doing modifying operations concurrent with the sync. The
general sync mechanism operates in four phases:

    1. Individual file and directory inodes are fsync()d to disk,
       updating the blockrefs in the parent block above the inode, and
       removed from the syncq.

       Once removed from the syncq, the frontend can do a modifying
       operation on these file and directory inodes without further
       affecting the filesystem sync. These modifications will be
       flushed to disk on the next filesystem sync.

       To reduce frontend stall times, an inode blocked on by the frontend
       which is on the syncq will be reordered to the front of the syncq
       to give the syncer a shot at it more quickly, in order to unstall
       the frontend ASAP.

       If a frontend operation creates an unavoidable dependency between
       an inode on the syncq and an inode not on the syncq, both inodes
       are placed on (or back onto) the syncq as needed to ensure filesystem
       consistency for the filesystem sync. This can extend the filesystem
       sync time, but even under heavy loads syncs are still able to be
       retired.

    2. The PFS ROOT is fsync()d to storage along with the subhierarchy
       representing the inode index (whose inodes were flushed in (1)).
       This brings the block copy-on-write up to the root inode.

    3. The SUPER-ROOT inode is fsync()d to storage along with the
       subhierarchy representing the PFS ROOTs for the volume.

    4. Finally, a physical disk flush command is issued to the storage
       device, and then the volume header is written to disk. All
       I/O prior to this step occurred asynchronously. This is the only
       step which must occur synchronously.

                    MIRROR_TID, MODIFY_TID, UPDATE_TID

In HAMMER2, the core block reference is a 128-byte structure called a blockref.
The blockref contains various bits of information including the 64-bit radix
key (typically a directory hash if a directory entry, inode number if a
hidden hardlink target, or file offset if a file block), number of significant
key bits for ranged recursion of indirect blocks, a 64-bit device seek that
encodes the radix of the physical block size in the low bits (physical block
size can be different from logical block size due to compression),
three 64-bit transaction ids, type information, and up to 512 bits worth
of check data for the block being referenced which can be anything from
a simple CRC to a strong cryptographic hash.

mirror_tid - This is a media-centric (as in physical disk partition)
             transaction id which tracks media-level updates. The mirror_tid
             can be different at the same point on different nodes in a
             cluster.

             Whenever any block in the media topology is modified, its
             mirror_tid is updated with the flush id and will propagate
             upward during the flush all the way to the volume header.

             mirror_tid is monotonic. It is primarily used for on-mount
             recovery and volume root validation. The name is historical
             from H1; it is not used for nominal mirroring.

modify_tid - This is a cluster-centric (as in across all the nodes used
             to build a cluster) transaction id which tracks filesystem-level
             updates.

             modify_tid is updated when the front-end of the filesystem makes
             a change to an inode or data block. It does NOT propagate upward
             during a flush.

update_tid - This is a cluster synchronization transaction id. Modifications
             made to the topology will clear this field to 0 as they propagate
             up to the root. This gives the synchronizer an easy way to
             determine what needs revalidation.

             The synchronizer revalidates the cluster bottom-up by validating
             a sub-topology and propagating the highest modify_tid in the
             validated sub-topology up via the update_tid field.

             Updates to this field may be optimized by the HAMMER2 VFS to
             avoid the double-transition.

The synchronization code updates an out-of-sync node bottom-up and will
dynamically set update_tid as it goes, but media flushes can occur at any
time and these flushes will use mirror_tid for flush and freemap management.
The mirror_tid for each flush propagates upward to the volume header on each
flush. modify_tid is set for any chains modified by a cluster op but does
not propagate up, instead serving as a seed for update_tid.
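
The different propagation rules for the three transaction ids can be shown
with a toy in-memory tree. This is only a sketch under simplifying
assumptions: the Node class, the dirty flag and the function names are
invented here and do not correspond to HAMMER2's chain or flush code.

```python
class Node:
    """Toy topology element carrying the three transaction ids."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.mirror_tid = 0
        self.modify_tid = 0
        self.update_tid = 0
        self.dirty = False      # illustrative flag, not a HAMMER2 field
        if parent is not None:
            parent.children.append(self)


def modify(node, tid):
    """Front-end change: modify_tid is set on the node only, while
    update_tid is cleared to 0 on the whole path up to the root."""
    node.modify_tid = tid
    while node is not None:
        node.update_tid = 0
        node.dirty = True
        node = node.parent


def flush(node, flush_tid):
    """Media flush: the flush id becomes mirror_tid on every dirty node,
    so it propagates all the way up to the root."""
    if not node.dirty:
        return
    for child in node.children:
        flush(child, flush_tid)
    node.mirror_tid = flush_tid
    node.dirty = False


def revalidate(node):
    """Synchronizer: bottom-up, record the highest modify_tid of each
    sub-topology in update_tid. Sub-trees whose update_tid is already
    non-zero are skipped."""
    if node.update_tid != 0:
        return node.update_tid
    highest = node.modify_tid
    for child in node.children:
        highest = max(highest, revalidate(child))
    node.update_tid = highest
    return highest
```

After modify() on a leaf, only that leaf has a new modify_tid, every
ancestor has update_tid 0, and a later flush() stamps mirror_tid along the
same path.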

* The synchronization code is able to determine that a sub-tree is
  synchronized simply by observing the update_tid at the root of the sub-tree,
  on an inode-by-inode basis and also on a data-block-by-data-block basis.

* The synchronization code is able to do an incremental update of an
  out-of-sync node simply by skipping elements with a matching update_tid
  (when not 0).

* The synchronization code can be interrupted and restarted at any time,
  and is able to pick up where it left off with very low overhead.

* The synchronization code does not inhibit media flushes. Media flushes
  can occur (and must occur) while synchronization is ongoing.

There are several other stored transaction ids in HAMMER2. There is a
separate freemap_tid in the volume header that is used to allow freemap
flushes to be deferred, and inodes have a pfs_psnap_tid which is used in
conjunction with CHECK_NONE to allow blocks without a check code which do
not violate the most recent snapshot to be overwritten in-place.

Remember that since this is a copy-on-write filesystem, we can propagate
a considerable amount of information up the tree to the volume header
without adding to the I/O we already have to do.

                        DIRECTORIES AND INODES

Directories are hashed.
In HAMMER2, the PFS ROOT directory (aka inode 1 for
a PFS) can contain a mix of directory entries AND embedded inodes. This was
actually a design mistake, so the code to deal with the index of inodes
vs the directory entries is slightly convoluted (but not too bad).

In the first iteration of HAMMER2 I tried really hard to embed actual
inodes AS the directory entries, but it created a mass of problems for
implementing NFS export support and dealing with hardlinks, so in a later
iteration I implemented small independent directory entries (that wound up
mostly fitting in the blockref structure, so WIN WIN!). However, 'embedded'
inodes AS the directory entries still survive for the SUPER-ROOT and the
PFS-ROOTs under the SUPER-ROOT. They just aren't used in the individual
filesystem that each PFS represents.

Hardlinks are now implemented normally, with multiple directory entries
referencing the same inode and that inode containing an nlinks count.

                                RECOVERY

H2 allows freemap flushes to lag behind topology flushes. This improves
filesystem sync performance. The freemap flush tracks a separate
transaction id (via mirror_tid) in the volume header.

On mount, HAMMER2 will first locate the highest-sequenced check-code-validated
volume header from the 4 copies available (if the filesystem is big enough,
e.g. > ~8GB or so, there will be 4 copies of the volume header).

HAMMER2 will then run an incremental scan of the topology for mirror_tid
transaction ids between the last freemap flush tid and the last topology
flush tid in order to synchronize the freemap. Because this scan is
incremental the time it takes to run will be relatively short and well-bounded
at mount-time. This is NOT an fsck. Freemap flushes can be avoided for any
number of normal topology flushes but should still occur frequently enough
to avoid long recovery times in case of a crash.

The filesystem is then ready for use.

                        DISK I/O OPTIMIZATIONS

The freemap implements a 1KB allocation resolution. Each 2MB segment managed
by the freemap is zoned and has a tendency to collect inodes, small data,
indirect blocks, and larger data blocks into separate segments. The idea is
to greatly improve I/O performance (particularly by laying inodes down next
to each other which has a huge effect on directory scans).

The current implementation of HAMMER2 implements a fixed 64KB physical block
size in order to allow the mapping of hammer2_dio's in its IO subsystem
to consumers that might desire different sizes. This way we don't have to
worry about matching the buffer cache / DIO cache to the variable block
size of underlying elements.
In addition, 64KB I/Os allow compatibility
with physical sector sizes up to 64KB in the underlying physical storage
with no change in the byte-by-byte format of the filesystem. The DIO
layer also prevents ordering deadlocks between unrelated portions of the
filesystem hierarchy whose logical blocks wind up in the same physical block.

The biggest issue we are avoiding by having a fixed 64KB I/O size is not
actually a nominal front-end access issue but instead the complexity of
having to deal with mixed block sizes in the buffer cache,
particularly when blocks are freed and then later reused with a different
block size. HAMMER1 had to have specialized code to check for and
invalidate buffer cache buffers in the free/reuse case. HAMMER2 does not
need such code.

That said, HAMMER2 places no major restrictions on mixing logical block
sizes within a 64KB block. The only restriction is that a logical HAMMER2
block cannot cross a 64KB boundary. The soft restrictions the block
allocator puts in place exist primarily for performance reasons (i.e. to
try to collect 1K inodes together). The 2MB freemap zone granularity
should work very well in this regard.
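
The one hard restriction above (a logical block may not cross a 64KB
boundary) is easy to express as a check. The following Python sketch is
illustrative only. It assumes power-of-2 logical block sizes between 1KB and
64KB as described in this document, plus 1KB offset alignment, which is an
assumption made here; the function names are hypothetical.

```python
PHYS_BLOCK = 64 * 1024   # fixed physical I/O size from the text
MIN_ALLOC = 1024         # 1KB allocation resolution from the text


def crosses_phys_boundary(offset, size):
    """True if [offset, offset + size) spans two 64KB physical blocks."""
    first = offset // PHYS_BLOCK
    last = (offset + size - 1) // PHYS_BLOCK
    return first != last


def is_valid_logical_block(offset, size):
    """Check size (power of 2, 1KB..64KB), alignment (assumed 1KB),
    and the 64KB boundary rule."""
    power_of_two = size > 0 and (size & (size - 1)) == 0
    return (power_of_two
            and MIN_ALLOC <= size <= PHYS_BLOCK
            and offset % MIN_ALLOC == 0
            and not crosses_phys_boundary(offset, size))
```

Note that a block which is naturally aligned to its own power-of-2 size can
never cross a 64KB boundary, so an allocator that hands out naturally aligned
blocks satisfies the rule automatically.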

HAMMER2 also utilizes OS support for ganging 64KB buffers together into even
larger blocks for I/O (OS buffer cache 'clustering'), OS-supported read-ahead,
OS-driven asynchronous retirement, and other performance features typically
provided by the OS at the block-level to ensure smooth system operation.

By avoiding wiring buffers/memory and allowing the OS's buffer cache to
run normally, HAMMER2 winds up with very low OS overhead.

                                FREEMAP NOTES

The freemap is stored in the reserved blocks situated in the ~4MB reserved
area at the base of every ~1GB level-1 zone of physical storage. The current
implementation reserves 8 copies of every freemap block and cycles through
them in order to make the freemap operate in a copy-on-write fashion.

    - Freemap is copy-on-write.
    - Freemap operations are transactional, same as everything else.
    - All backup volume headers are consistent on-mount.

The Freemap is organized using the same radix blockmap algorithm used for
files and directories, but with fixed radix values. For a maximally-sized
filesystem the Freemap will wind up being a 5-level-deep radix blockmap,
but the top-level is embedded in the volume header so insofar as performance
goes it is really just a 4-level blockmap.

The freemap radix allocation mechanism is also the same, meaning that it is
bottom-up and will not allocate unnecessary intermediate levels for smaller
filesystems. The number of blockmap levels not including the volume header
for various filesystem sizes is as follows:

        up-to                   #of freemap levels
        1GB                     1-level
        256GB                   2-level
        64TB                    3-level
        16PB                    4-level
        4EB                     5-level
        16EB                    6-level

The Freemap has bitmap granularity down to 16KB and a linear iterator that
can linearly allocate space down to 1KB. Due to fragmentation it is possible
for the linear allocator to become marginalized, but it is relatively easy
to reallocate small blocks every once in a while (like once a year if you
care at all) and once the old data cycles out of the snapshots, or you also
rewrite the snapshots (which you can do), the freemap should wind up
relatively optimal again. Generally speaking I believe that algorithms can
be developed to make this a non-problem without requiring any media structure
changes. However, touching all the freemaps will replicate meta-data whereas
the meta-data was mostly shared in the original snapshot. So this is a
problem that needs solving in HAMMER2.
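
The freemap level table earlier in this section can be reproduced with a few
lines of arithmetic. The sketch below assumes that a level-1 freemap node
covers 1GB and that each additional level multiplies coverage by 256 (8 radix
bits), capped by the 64-bit address space. The factor of 256 is inferred from
the table itself rather than stated in this document, and the names are
illustrative.

```python
LEVEL1_BITS = 30   # a level-1 node covers 2^30 bytes (1GB), per the table
RADIX_BITS = 8     # inferred: each level covers 256x the one below


def freemap_levels(fs_bytes):
    """Smallest number of freemap levels whose coverage reaches fs_bytes."""
    if not 0 < fs_bytes <= 1 << 64:
        raise ValueError("size must be within a 64-bit address space")
    levels = 1
    while (1 << (LEVEL1_BITS + RADIX_BITS * (levels - 1))) < fs_bytes:
        levels += 1
    return levels
```

For instance, freemap_levels(1 << 46) corresponds to the 64TB row of the
table.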

In order to implement fast snapshots (and writable snapshots for that
matter), HAMMER2 does NOT ref-count allocations. All the freemap does is
keep track of 100% free blocks plus some extra bits for staging the bulkfree
scan. The lack of ref-counting makes it possible to:

    - Completely trivialize HAMMER2's snapshot operations.
    - Completely trivialize HAMMER2's de-dup operations.
    - Use any volume header backup trivially.
    - Destroy whole sub-trees without having to scan them.
      Deleting PFSs and snapshots is instant (though space recovery still
      requires two bulkfree scans).
    - Simplify normal crash recovery operations by not having to reconcile
      a ref-count.
    - Simplify catastrophic recovery operations for the same reason.

Normal crash recovery is simply a matter of doing an incremental scan
of the topology between the last flushed freemap TID and the last flushed
topology TID. This usually takes only a few seconds and allows:

    - Freemap flushes to be deferred for any number of topology flush
      cycles (with some care to ensure that all four volume headers
      remain valid).
    - The freemap to go unflushed for fsync, reducing fsync overhead.
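
The incremental recovery scan described above relies on mirror_tid
propagating upward: if the mirror_tid at the top of a sub-tree is not newer
than the last freemap flush, nothing below it is newer either and the whole
sub-tree can be skipped. A rough Python sketch of that idea follows; the dict
layout and the function name are assumptions made for illustration, not the
on-media format or the real recovery code.

```python
def recover_freemap(node, freemap_tid, freemap):
    """Mark blocks flushed after the last freemap flush as allocated.

    node is a dict with 'mirror_tid', 'block' and 'children' keys
    (a hypothetical in-memory stand-in for a blockref and its target).
    Returns the number of nodes actually visited.
    """
    if node["mirror_tid"] <= freemap_tid:
        return 0                      # sub-tree predates the freemap flush
    freemap.add(node["block"])        # re-mark the block as allocated
    visited = 1
    for child in node["children"]:
        visited += recover_freemap(child, freemap_tid, freemap)
    return visited
```

The scan cost is proportional to what changed since the last freemap flush,
not to the size of the filesystem, which is why this is not an fsck.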

                        FREEMAP - BULKFREE

Blocks are freed via a bulkfree scan, which is a two-stage meta-data scan.
Blocks are first marked as being possibly free and then finalized in the
second scan. Live filesystem operations are allowed to run during these
scans and any freemap block that is allocated or adjusted after the first
scan will simply be re-marked as allocated and the second scan will not
transition it to being free.

The cost of not doing ref-count tracking is that HAMMER2 must perform two
bulkfree scans of the meta-data to determine which blocks can actually be
freed. This can be complicated by the volume header backups and snapshots
which cause the same meta-data topology to be scanned over and over again,
but mitigated somewhat by keeping a cache of higher-level nodes to detect
when we would scan a sub-topology that we have already scanned. Due to the
copy-on-write nature of the filesystem, such detection is easy to implement.

Part of the ongoing design work is finding ways to reduce the scope of this
meta-data scan so the entire filesystem's meta-data does not need to be
scanned (though in tests with HAMMER1, even full meta-data scans have
turned out to be fairly low cost). In other words, it's an area where
improvements can be made without any media format changes.
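
The two-stage nature of the bulkfree scan can be modelled as a small state
machine per block. The sketch below is a simplification under stated
assumptions: the three state names, the dict-based freemap and the
"reachable" set (blocks the meta-data scan found referenced) are invented
for illustration and do not mirror the real freemap bit encoding.

```python
ALLOCATED = "allocated"
POSSIBLY_FREE = "possibly-free"
FREE = "free"


def bulkfree_pass(freemap, reachable):
    """One bulkfree scan over a {block: state} map.

    Referenced blocks are (re)marked allocated. Unreferenced blocks move
    one step: allocated -> possibly-free -> free.
    """
    for block, state in freemap.items():
        if block in reachable:
            freemap[block] = ALLOCATED
        elif state == ALLOCATED:
            freemap[block] = POSSIBLY_FREE
        elif state == POSSIBLY_FREE:
            freemap[block] = FREE


def allocate(freemap, block):
    """Live allocation between scans re-marks the block as allocated."""
    freemap[block] = ALLOCATED
```

A block that a live operation allocates after the first pass is back in the
allocated state, so the second pass can at most demote it to possibly-free
and never frees it outright.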

Another advantage of operating the freemap like this is that some future
version of HAMMER2 might decide to completely change how the freemap works
and would be able to make the change with relatively low downtime.

                                CLUSTERING

Clustering, as always, is the most difficult bit but we have some advantages
with HAMMER2 that we did not have with HAMMER1. First, HAMMER2's media
structures generally follow the kernel's filesystem hierarchy which allows
cluster operations to use topology cache and lock state. Second,
HAMMER2's writable snapshots make it possible to implement several forms
of multi-master clustering.

The mount device path you specify serves to bootstrap your entry into
the cluster. This is typically local media. It can even be a ram-disk
that only contains placemarkers that help HAMMER2 connect to a fully
networked cluster.

With HAMMER2 you mount a directory entry under the super-root. This entry
will contain a cluster identifier that helps HAMMER2 identify and integrate
with the nodes making up the cluster. HAMMER2 will automatically integrate
*all* entries under the super-root when you mount one of them. You have to
mount at least one for HAMMER2 to integrate the block device in the larger
cluster.

For cluster servers every HAMMER2-formatted partition has a "LOCAL" MASTER
which can be mounted in order to make the rest of the elements under the
super-root available to the network. (In a prior specification I emplaced
the cluster connections in the volume header's configuration space but I no
longer do that).

Connecting to the wider networked cluster involves setting up the /etc/hammer2
directory with appropriate IP addresses and keys. The user-mode hammer2
service daemon maintains the connections and performs graph operations
via libdmsg.

Node types within the cluster:

    DUMMY       - Used as a local placeholder (typically in ramdisk)
    CACHE       - Used as a local placeholder and cache (typically on a SSD)
    SLAVE       - A SLAVE in the cluster, can source data on quorum agreement.
    MASTER      - A MASTER in the cluster, can source and sink data on quorum
                  agreement.
    SOFT_SLAVE  - A SLAVE in the cluster, can source data locally without
                  quorum agreement (must be directly mounted).
    SOFT_MASTER - A local MASTER but *not* a MASTER in the cluster. Can source
                  and sink data locally without quorum agreement, intended to
                  be synchronized with the real MASTERs when connectivity
                  allows. Operations are not coherent with the real MASTERs
                  even when they are available.
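
The source/sink and quorum rules in the node type list can be summarized as
a small lookup table. This Python sketch only encodes what the list states
for the four SLAVE/MASTER variants. DUMMY and CACHE are treated as pure
placeholders that neither source nor sink, which is an assumption made here;
the Role tuple and function names are illustrative and not part of HAMMER2.

```python
from collections import namedtuple

Role = namedtuple("Role", "source sink needs_quorum")

# Capabilities as described in the node type list above.
ROLES = {
    "SLAVE":       Role(source=True, sink=False, needs_quorum=True),
    "MASTER":      Role(source=True, sink=True,  needs_quorum=True),
    "SOFT_SLAVE":  Role(source=True, sink=False, needs_quorum=False),
    "SOFT_MASTER": Role(source=True, sink=True,  needs_quorum=False),
}


def _allowed(node_type, have_quorum, want_sink):
    role = ROLES.get(node_type)
    if role is None:
        return False   # DUMMY / CACHE: placeholder only (assumption)
    capable = role.sink if want_sink else role.source
    return capable and (have_quorum or not role.needs_quorum)


def can_source(node_type, have_quorum):
    """May this node type serve reads in the given quorum state?"""
    return _allowed(node_type, have_quorum, want_sink=False)


def can_sink(node_type, have_quorum):
    """May this node type accept writes in the given quorum state?"""
    return _allowed(node_type, have_quorum, want_sink=True)
```

The table makes the SOFT_ distinction visible: the soft types trade quorum
agreement for local availability.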

    NOTE: SNAPSHOT, AUTOSNAP, etc. represent sub-types, typically under a
          SLAVE. A SNAPSHOT or AUTOSNAP is a SLAVE sub-type that is no longer
          synchronized against current masters.

    NOTE: Any SLAVE or other copy can be turned into its own writable MASTER
          by giving it a unique cluster id, taking it out of the cluster that
          originally spawned it.

There are four major protocols:

    Quorum protocol

        This protocol is used between MASTER nodes to vote on operations
        and resolve deadlocks.

        This protocol is used between SOFT_MASTER nodes in a sub-cluster
        to vote on operations, resolve deadlocks, determine what the latest
        transaction id for an element is, and to perform commits.

    Cache sub-protocol

        This is the MESI sub-protocol which runs under the Quorum
        protocol. This protocol is used to maintain cache state for
        sub-trees to ensure that operations remain cache coherent.

        Depending on administrative rights this protocol may or may
        not allow a leaf node in the cluster to hold a cache element
        indefinitely.
        The administrative controller may preemptively
        downgrade a leaf with insufficient administrative rights
        without giving it a chance to synchronize any modified state
        back to the cluster.

    Proxy protocol

        The Quorum and Cache protocols only operate between MASTER
        and SOFT_MASTER nodes. All other node types must use the
        Proxy protocol to perform similar actions. This protocol
        differs in that proxy requests are typically sent to just
        one adjacent node and that node then maintains state and
        forwards the request or performs the required operation.
        When the link is lost to the proxy, the proxy automatically
        forwards a deletion of the state to the other nodes based on
        what it has recorded.

        If a leaf has insufficient administrative rights it may not
        be allowed to actually initiate a quorum operation and may only
        be allowed to maintain partial MESI cache state or perhaps none
        at all (since cache state can block other machines in the
        cluster). Instead a leaf with insufficient rights will have to
        make do with a preemptive loss of cache state and any allowed
        modifying operations will have to be forwarded to the proxy which
        continues forwarding them until a node with sufficient
        administrative rights is encountered.

        To reduce issues and give the cluster more breadth, sub-clusters
        made up of SOFT_MASTERs can be formed in order to provide full
        cache coherency within a subset of machines and yet still tie them
        into a greater cluster that they normally would not have such
        access to. This effectively makes it possible to create a two
        or three-tier fan-out of groups of machines which are cache-coherent
        within the group, but perhaps not between groups, and use other
        means to synchronize between the groups.

    Media protocol

        This is basically the physical media protocol.

                        MASTER & SLAVE SYNCHRONIZATION

With HAMMER2 I really want to be hard-nosed about the consistency of the
filesystem, including the consistency of SLAVEs (snapshots, etc). In order
to guarantee consistency we take advantage of the copy-on-write nature of
the filesystem by forking consistent nodes and using the forked copy as the
source for synchronization.

Similarly, the target for synchronization is not updated on the fly but instead
is also forked and the forked copy is updated. When synchronization is
complete, forked sources can be thrown away and forked copies can replace
the original synchronization target.

This may seem complex, but 'forking a copy' is actually a virtually free
operation. The top-level inode (under the super-root), on-media, is simply
copied to a new inode and poof, we have an unchanging snapshot to work with.

    - Making a snapshot is fast... almost instantaneous.

    - Snapshots are used for various purposes, including synchronization
      of out-of-date nodes.

    - A snapshot can be converted into a MASTER or some other PFS type.

    - A snapshot can be forked off from its parent cluster entirely and
      turned into its own writable filesystem, either as a single MASTER
      or this can be done across the cluster by forking a quorum+ of
      existing MASTERs and transferring them all to a new cluster id.

More complex is reintegrating the target once the synchronization is complete.
For SLAVEs we just delete the old SLAVE and rename the copy to the same name.
However, if the SLAVE is mounted and not optioned as a static mount (that is
the mounter wants to see updates as they are synchronized), a reconciliation
must occur on the live mount to clean up the vnode, inode, and chain caches
and shift any remaining vnodes over to the updated copy.

    - A mounted SLAVE can track updates made to the SLAVE but the
      actual mechanism is that the SLAVE PFS is replaced with an
      updated copy, typically every 30-60 seconds.

Reintegrating a MASTER which has fallen out of the quorum due to being out
of date is also somewhat more complex.  The same updating mechanic is used:
we actually have to throw the 'old' MASTER away once the new one has been
updated.  However, if the cluster is undergoing heavy modifications the
updated MASTER will be out of date almost the instant its source is
snapshotted.  Reintegrating a MASTER thus requires a somewhat more complex
interaction.

    - If a MASTER is really out of date we can run one or more
      synchronization passes concurrent with modifying operations.
      The quorum can remain live.

    - A final synchronization pass is required with quorum operations
      blocked to reintegrate the now up-to-date MASTER into the cluster.


                            QUORUM OPERATIONS

Quorum operations can be broken down into HARD BLOCK operations and NETWORK
operations.  If your MASTERs are all local mounts, then failures and
sequencing are easy to deal with.

Quorum operations on a networked cluster are more complex.
The problems:

    - Masters cannot rely on clients to moderate quorum transactions.
      Apart from the reliance being unsafe, the client could also
      lose contact with one or more masters during the transaction and
      leave one or more masters out-of-sync without the master(s) knowing
      they are out of sync.

    - When many clients are present, we do not want a flaky network
      link from one to cause one or more masters to go out of
      synchronization and potentially stall the whole works.

    - Normal hammer2 mounts allow a virtually unlimited number of modifying
      transactions between actual flushes.  The media flush rolls everything
      up into a single transaction id per flush.  Detection of 'missing'
      transactions in a concurrent multi-client setup when one or more
      clients temporarily lose connectivity is thus difficult.

    - Clients have a limited amount of time to reconnect to a cluster after
      a network disconnect before their MESI cache states are lost.

    - Clients may proceed with several transactions before knowing for sure
      that earlier transactions were completely successful.  Performance is
      important; we won't be waiting for a full quorum-verified synchronous
      flush to media before allowing a system call to return.

    - Masters can decide that a client's MESI cache states were lost (i.e.
      that the transaction was too slow) as well.

The solutions (for modifying transactions):

    - Masters handle quorum confirmation amongst themselves and do not rely
      on the client for that purpose.

    - A client can connect to one or more masters regardless of the size of
      the quorum and can submit modifying operations to a single master if
      desired.  The master will take care of the rest.

      A client must still validate the quorum (and obtain MESI cache states)
      when doing read-only operations in order to present the correct data
      to the user process for the VOP.

    - Masters will run a 2-phase commit amongst themselves, often concurrent
      with other non-conflicting transactions, and will serialize operations
      and/or enforce synchronization points for 2-phase completion on
      serialized transactions from the same client or when cache state
      ownership is shifted from one client to another.

    - Clients will usually allow operations to run asynchronously and return
      from system calls more or less ASAP once they own the necessary cache
      coherency locks.
      The client can select the validation mode to wait for
      with mount options:

        (1) Fully async             (mount -o async)
        (2) Wait for phase-1 ack    (mount)
        (3) Wait for phase-2 ack    (mount -o sync)     (fsync - wait p2ack)
        (4) Wait for flush          (mount -o sync)     (fsync - wait flush)

      Modifying system calls cannot be told to wait for a full media
      flush, as full media flushes are prohibitively expensive.  You
      still have to fsync().

      The fsync wait mode for network links can be selected, either to
      return after the phase-2 ack or to return after the media flush.
      The default is to wait for the phase-2 ack, which at least guarantees
      that a network failure after that point will not disrupt operations
      issued before the fsync.

    - Clients must adjust the chain state for modifying operations prior to
      releasing chain locks / returning from the system call, even if the
      masters have not finished the transaction.  A late failure by the
      cluster will result in desynchronized state which requires erroring
      out the whole filesystem or resynchronizing somehow.

    - Clients can opt to keep a record of transactions through the phase-2
      ack or the actual media flush on the masters.

      However, replaying/revalidating the log cannot necessarily guarantee
      success.
      If the masters lose synchronization due to network issues
      between masters (or if the client was mounted fully-async), or if enough
      masters crash simultaneously such that a quorum fails to flush even
      after the phase-2 ack, then it is possible that by the time a client
      is able to replay/revalidate, some other client has squeezed in and
      committed something that would conflict.

      If the client crashes it works similarly to a crash with a local storage
      mount... many dirty buffers might be lost.  And the same happens in
      the cluster case.

                                TRANSACTION LOG

Keeping a short-term transaction log, much less being able to properly replay
it, is fraught with difficulty and I've made it a separate development task.
For now HAMMER2 does not have one.