
			    HAMMER2 DESIGN DOCUMENT

				Matthew Dillon
			     dillon@backplane.com

			       08-Dec-2018 (v6)
			       24-Jul-2017 (v5)
			       09-Jul-2016 (v4)
			       03-Apr-2015 (v3)
			       14-May-2013 (v2)
			       08-Feb-2012 (v1)

			Current Status as of document date

* Filesystem Core	- operational
  - bulkfree		- operational
  - Compression		- operational
  - Snapshots		- operational
  - Deduper		- live operational, batch specced
  - Subhierarchy quotas - scrapped (still possible on a limited basis)
  - Logical Encryption	- not specced yet
  - Copies		- not specced yet
  - fsync bypass	- not specced yet
  - FS consistency	- operational

* Clustering core
  - Network msg core	- operational
  - Network blk device	- operational
  - Error handling	- under development
  - Quorum Protocol	- under development
  - Synchronization	- under development
  - Transaction replay	- not specced yet
  - Cache coherency	- not specced yet

			Recent Document Changes

* Reorganized the feature list to indicate currently operational features
  first, and moved the future features to another section (since they
  are taking so long to implement).

			    Current Features List

* Standard filesystem semantics with full hardlink and softlink support.
  64-bit hardlink count field.

* The topology is indexed with a dynamic radix tree rooted in several
  places:  The super-root, the PFS root inode, and any inode boundary.
  Index keys are 64 bits.  Each element is referenced with a blockref
  structure (described below) that is capable of referencing a power-of-2
  sized block.  The block size is currently capped at 64KB to play
  nice(r) with the buffer cache and SSDs.

  The dynamic radix tree pushes elements into new indirect blocks only
  when the current level fills up, and will delete empty indirect blocks
  when a level is cleaned out.

* Block-copy-on-write filesystem mechanism for both the main topology
  and for the freemap.  Media-level block frees are deferred and flushes
  rotate between (up to) 4 volume headers (capped at 4 if the filesystem
  is > ~8GB).  Recovery will choose the most recent fully-valid volume
  header and can thus work around failures which cause partial volume
  header writes.

  Modifications issue copy-on-write updates up to the volume root.
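
  As a minimal sketch of the recovery rule above (not the actual HAMMER2
  code), the selection can be pictured as a scan over the volume header
  copies that keeps the valid copy with the highest transaction id.  The
  structure, its valid and tid fields, and the function name are all
  hypothetical; this document only states that the most recent fully-valid
  header is chosen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory summary of one on-media volume header copy. */
struct volhdr_summary {
	int		valid;	/* nonzero if the copy passed its integrity checks */
	uint64_t	tid;	/* transaction id of the flush that wrote it */
};

/*
 * Return the index of the most recent fully-valid copy, or -1 if no
 * copy is usable.  A partially written header fails validation and is
 * skipped, so the scan falls back to an older flush.
 */
int
choose_volume_header(const struct volhdr_summary *copies, size_t count)
{
	int best = -1;
	size_t i;

	assert(copies != NULL || count == 0);
	for (i = 0; i < count; ++i) {
		if (!copies[i].valid)
			continue;
		if (best < 0 || copies[i].tid > copies[(size_t)best].tid)
			best = (int)i;
	}
	return best;
}
```

  With four copies where the newest one is invalid, the function returns
  the index of the newest copy that still validates.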

* Utilizes a fat blockref structure (128 bytes) which can store up to
  64 bytes (512 bits) of check code data for each referenced block.
  In the original implementation I had gone with 64 byte blockrefs,
  but I eventually decided that I wanted to support up to a 512-bit
  hash (which eats 64 bytes), so I bumped it up to 128 bytes.  This
  turned out to be fortuitous because it made it possible to store
  most directory entries directly in the blockref structure without
  having to reference a separate data block via the blockref structure.

* 1KB 'fat' inode structure.  The inode structure directly embeds four
  blockrefs so small files and directories can be represented without
  requiring an indirect block to be allocated.  The inode structure can
  also overload the same space to store up to 512 bytes of direct
  file data (for files which are <= 512 bytes long).

  The super-root and PFS root inodes are directly represented in the
  topology, without the use of directory entries.  A combination of
  normal directory entries and separately-indexed inodes is implemented
  under each PFS.

  Normal filesystem inodes (other than inode 1) are indexed under the PFS
  root inode by their inode number.  Directory entries are indexed under the
  same PFS root by their filename hash.  Bit 63 is used to distinguish and
  partition the two.  Filename hash collisions are handled by incrementing
  reserved low bits in the filename hash code.
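
  The key partitioning described above can be sketched roughly as follows.
  This is an illustration under stated assumptions, not the on-media
  definition: it assumes directory-entry keys are the ones with bit 63 set
  (the text only says bit 63 separates the two key spaces), and the width
  of the reserved low bits (8 here) is a made-up value.  All names are
  hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEY_DIRENT_BIT		0x8000000000000000ULL	/* assumed: set for dirents */
#define KEY_COLLIDE_MASK	0x00000000000000FFULL	/* hypothetical reserved bits */

/* Nonzero if the key indexes a directory entry rather than an inode. */
int
key_is_dirent(uint64_t key)
{
	return (key & KEY_DIRENT_BIT) != 0;
}

/* Turn a filename hash into a base key with the collision bits cleared. */
uint64_t
dirent_base_key(uint64_t name_hash)
{
	return (name_hash | KEY_DIRENT_BIT) & ~KEY_COLLIDE_MASK;
}

/*
 * Resolve collisions by incrementing the reserved low bits until a key
 * is found that is not in the used[] list.  Returns 0 (never a valid
 * directory-entry key here, since bit 63 is clear) if all are taken.
 */
uint64_t
dirent_find_free_key(uint64_t name_hash, const uint64_t *used, size_t nused)
{
	uint64_t base = dirent_base_key(name_hash);
	uint64_t i;
	size_t j;

	assert(used != NULL || nused == 0);
	for (i = 0; i <= KEY_COLLIDE_MASK; ++i) {
		for (j = 0; j < nused; ++j) {
			if (used[j] == (base | i))
				break;
		}
		if (j == nused)
			return base | i;
	}
	return 0;
}
```

  A real implementation would probe the radix tree instead of scanning a
  flat array; the array only keeps the sketch self-contained.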

* Directory entries representing filenames that are less than 64 bytes
  long are directly stored AS blockrefs.  This means that an inode
  representing a small directory can store up to 4 directory entries in
  the inode itself before resorting to indirect blocks, and then those
  indirect blocks themselves can directly embed up to 512 directory entries.
  Directory entries with long filenames reference an indirect data block
  to hold the filename instead of directly-embedding the filename.

  This results in *very* compact directories in terms of I/O bandwidth.
  Not as compact as e.g. UFS's variable-length directory entries, but still
  very good with a nominal 128 real bytes per directory entry.

  Because directory entries are represented using a dynamic radix tree via
  its blockrefs, directory entries can be randomly looked up without having
  to scan the whole directory.

* Multiple PFSs.  In HAMMER2, all PFSs are implemented the same way, with
  the kernel choosing a default PFS name for the mount if none is specified.
  For example, "ROOT" is the default PFS name for a root mount.  You can
  create as many PFSs as you like and you can specify the PFS name in the
  mount command using the <device_path>@<pfs_name> notation.
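
  To make the <device_path>@<pfs_name> notation concrete, here is a small
  sketch of how such a specification could be split.  It is not the actual
  mount code; the function name is hypothetical, the default name is
  supplied by the caller, and splitting at the last '@' is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split "<device_path>@<pfs_name>" in place.  On return *device points
 * at the device path and *pfs at the PFS name.  If there is no '@', or
 * nothing follows it, *pfs is set to default_pfs instead.
 */
void
split_mount_spec(char *spec, const char *default_pfs,
    const char **device, const char **pfs)
{
	char *at;

	assert(spec != NULL && default_pfs != NULL);
	assert(device != NULL && pfs != NULL);

	at = strrchr(spec, '@');
	*device = spec;
	if (at == NULL) {
		*pfs = default_pfs;
	} else {
		*at = '\0';	/* terminate the device path */
		*pfs = (at[1] != '\0') ? at + 1 : default_pfs;
	}
}
```

  Passing "ROOT" as default_pfs mirrors the root-mount example given above.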

* Snapshots are implemented as PFSs.  Due to the copy-on-write nature of
  the filesystem, taking a snapshot is a trivial operation requiring only
  a normal filesystem sync and copying of the PFS root inode (1KB), and
  that's it.

  On the minus side, snapshots can complicate the bulkfree operation that
  is responsible for freeing up disk space.  It can take significantly
  longer when many snapshots are present.

* SNAPSHOTS ARE READ-WRITE.  You can mount any PFS read-write, including
  snapshots.  For example, you can revert to an earlier 'root' that you
  made a snapshot of simply by changing what the system mounts as the root
  filesystem.

* Full filesystem coherency at both the radix tree level and the filesystem
  semantics level.  This is true for all filesystem syncs, recovery after
  a crash, and snapshots.

  The filesystem syncs fully vfsync the buffer cache for the files
  that are part of the sync group, and keeps track of dependencies to
  ensure that all inter-dependent inodes are flushed in the same sync
  group.  Atomic filesystem ops such as write()s are guaranteed to remain
  atomic across a sync, snapshot, and crash.

* Flushes and syncs are almost entirely asynchronous and will run
  concurrently with frontend operations.  This feature is implemented by
  adding inodes to the sync group currently being flushed on-the-fly as new
  dependencies are created, and reordering inodes in the sync queue to
  prioritize inodes which the frontend is stalled on.

  By reprioritizing inodes in the syncq, frontend stalls are minimized.

  The only synchronous disk operation is the final sync of the volume
  header which updates the ultimate root of the filesystem.  A disk flush
  command is issued synchronously, then the write of the volume header is
  issued synchronously.  All other writes to the disk, regardless of the
  complexity of the dependencies, occur asynchronously and can make very
  good use of high-speed I/O and SSD bandwidth.

* Low memory footprint.  Except for the volume header, the buffer cache
  is completely asynchronous and dirty buffers can be retired by the OS
  directly to backing store with no further interactions with the filesystem.

* Compression support.  Multiple algorithms are supported and can be
  configured on a subdirectory hierarchy or individual file basis.
  Block compression up to 64KB will be used.  Only compression ratios at
  powers of 2 that are at least 2:1 (e.g. 2:1, 4:1, 8:1, etc) will work in
  this scheme because physical block allocations in HAMMER2 are always
  power-of-2.

  Modest compression can be achieved with low overhead, is turned on
  by default, and is compatible with deduplication.

  Compression is extremely useful and often gives you anywhere from 25%
  to 400% the logical storage as you have physical blocks, depending on
  the data.  Of course, .tgz and other pre-compressed files cannot be
  compressed further by the filesystem.

  The usefulness should not be underestimated.  Our users are constantly
  being surprised at what the filesystem is able to compress, which just
  makes life a lot easier.  For example, 30GB core dumps tend to contain
  a great deal of highly compressible data.  Source trees, web files,
  executables, general data... this is why HAMMER2 turns modest compression
  on by default.  It just works.

* De-duplication support.  HAMMER2 uses a relatively simple freemap
  scheme that allows the filesystem to discard block references
  asynchronously.  The same scheme allows essentially unlimited references
  to the same data block in the hierarchy.  Thus, both live de-duplication
  and bulk deduplication are relatively easy to implement.

  HAMMER2 currently implements only live de-duplication.  This means that
  typical situations such as when copying files or whole directory hierarchies
  will naturally de-duplicate.  Simply reading filesystem data in makes
  it available for deduplication later.  HAMMER2 will index a potentially
  very large number of blocks in memory, even beyond what the buffer cache
  can hold, for deduplication purposes.
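
  One way to picture the in-memory index mentioned above is a small
  direct-mapped table keyed by a block's check code.  This is only a sketch
  under assumptions: the table size, the 64-bit check value, and every name
  below are hypothetical, and the actual heuristic HAMMER2 uses is not
  described in this document.  A hit is only a candidate; the caller would
  still compare the data before sharing the block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEDUP_SLOTS	1024	/* hypothetical table size */

struct dedup_entry {
	uint64_t	check;		/* check code of a block seen earlier */
	uint64_t	data_off;	/* media offset of that block */
	int		used;
};

struct dedup_table {
	struct dedup_entry slots[DEDUP_SLOTS];
};

void
dedup_init(struct dedup_table *t)
{
	assert(t != NULL);
	memset(t, 0, sizeof(*t));
}

/* Remember a block that was just read or written (newest entry wins). */
void
dedup_record(struct dedup_table *t, uint64_t check, uint64_t data_off)
{
	struct dedup_entry *e = &t->slots[check % DEDUP_SLOTS];

	e->check = check;
	e->data_off = data_off;
	e->used = 1;
}

/* Return 1 and set *data_off if a candidate with this check code exists. */
int
dedup_lookup(const struct dedup_table *t, uint64_t check, uint64_t *data_off)
{
	const struct dedup_entry *e = &t->slots[check % DEDUP_SLOTS];

	if (!e->used || e->check != check)
		return 0;
	*data_off = e->data_off;
	return 1;
}
```

  Because entries simply overwrite each other, the table stays a fixed
  size; a miss only means a missed de-duplication opportunity, never
  incorrect data.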

* Zero-fill detection on write (writing all-zeros), which requires the data
  buffer to be scanned, is fully supported.  This allows the writing of 0's
  to create holes.

  Generally speaking pre-writing zeroed blocks to reserve space doesn't work
  well on copy-on-write filesystems.  However, if both compression and
  check codes are disabled on a file, H2 will also disable zero-detection,
  allowing the file blocks to be pre-reserved (by actually zeroing them and
  reusing them later on), and allowing data overwrites to write to the same
  sector.  Please be aware that DISABLING THE CHECK CODE IN THIS MANNER ALSO
  MEANS THAT SNAPSHOTS WILL NOT WORK.  The snapshot will contain the latest
  data for the file and not the data as-of the snapshot.  This is NOT turned
  on by default in HAMMER2 and is not recommended except in special
  well-controlled circumstances.
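
  The buffer scan that zero-fill detection requires can be sketched as
  below.  The function name is hypothetical and a production version would
  likely compare a machine word at a time; the byte loop is kept for
  clarity.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the buffer contains only zero bytes, otherwise 0. */
int
buffer_is_zero(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	size_t i;

	assert(buf != NULL || len == 0);
	for (i = 0; i < len; ++i) {
		if (p[i] != 0)
			return 0;	/* stop at the first non-zero byte */
	}
	return 1;
}
```

  A write path using this check would record a hole instead of allocating
  a block whenever the function returns 1.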

* Multiple supporting kernel threads, breaking up frontend VOP operation
  from backend I/O, compression, and decompression operation.  Buffer cache
  I/O and VOP ops message the backend.  Actual I/O is handled by the backend
  and not by the frontend, which will theoretically allow us to survive
  stalled devices and nodes when implementing multi-node support.

			    Pending Features
			  (not yet implemented)

* Constructing a filesystem across multiple nodes.  Each low-level H2 device
  would be able to accommodate nodes belonging to multiple cluster components
  as well as nodes that are simply local to the device or machine.

  CURRENT STATUS: Not yet operational.

* Incremental synchronization via highest-transaction id propagation
  within the radix tree.  This is a queueless, incremental design.

  CURRENT STATUS: Due to the flat inode hierarchy now being employed,
  the current synchronization code which silently recurses indirect nodes
  will be inefficient due to the fact that all the inodes are at the
  same logical level in the topology.  To fix this, the code will need
  to explicitly iterate indirect nodes and keep track of the related
  key ranges to match them up on an indirect-block basis, which would
  be incredibly efficient.
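
  The idea of highest-transaction id propagation can be illustrated with a
  toy tree.  This is a sketch only: the node layout, the fan-out of 4, and
  the function names are hypothetical.  Each node records the highest
  transaction id found anywhere beneath it, so a scan can skip any subtree
  whose id is not newer than the last synchronized id, without keeping a
  queue of changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NODE_MAX_CHILDREN	4	/* small fan-out, illustration only */

/* Hypothetical tree node; tid is the highest transaction id in the subtree. */
struct sync_node {
	uint64_t		tid;
	size_t			nchildren;
	struct sync_node	*children[NODE_MAX_CHILDREN];
};

/* Record a modification: raise tid on every node along a root-first path. */
void
sync_propagate_tid(struct sync_node **path, size_t depth, uint64_t tid)
{
	size_t i;

	assert(path != NULL || depth == 0);
	for (i = 0; i < depth; ++i) {
		if (path[i]->tid < tid)
			path[i]->tid = tid;
	}
}

/*
 * Count the nodes an incremental scan has to visit.  Subtrees whose
 * tid is <= synced_tid contain nothing new and are skipped entirely.
 */
size_t
sync_scan(const struct sync_node *node, uint64_t synced_tid)
{
	size_t visited, i;

	assert(node != NULL);
	if (node->tid <= synced_tid)
		return 0;
	visited = 1;
	for (i = 0; i < node->nchildren; ++i)
		visited += sync_scan(node->children[i], synced_tid);
	return visited;
}
```

  After one leaf is modified, only the nodes on the path from the root to
  that leaf are visited; unmodified siblings are never descended into.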

* Background synchronization and mirroring occurs at the logical layer
  rather than the physical layer.  This allows cluster components to
  have differing storage arrangements.

  In addition, this mechanism will fully correct any out of sync nodes
  in the cluster as long as a sufficient number of other nodes agree on
  what the proper state should be.

  CURRENT STATUS: Not yet operational.

* Encryption.  Whole-disk encryption is supported by another layer, but I
  intend to give H2 an encryption feature at the logical layer which works
  approximately as follows:

  - Encryption controlled by the client on an inode/sub-tree basis.
  - Server has no visibility to decrypted data.
  - Encrypt filenames in directory entries.  Since the filename[] array
    is 256 bytes wide, client can add random bytes after the normal
    terminator to make it virtually impossible for an attacker to figure
    out the filename.
  - Encrypt file size and most inode contents.
  - Encrypt file data (holes are not encrypted).
  - Encryption occurs after compression, with random filler.
  - Check codes calculated after encryption & compression (not before).

  - Blockrefs are not encrypted.
  - Directory and File Topology is not encrypted.
  - Encryption is not sub-topology validation.  Client would have to keep
    track of that itself.  Server or other clients can still e.g. remove
    files, rename, etc.

  In particular, note that even though the file size field can be encrypted,
  the server does have visibility on the block topology and thus has a pretty
  good idea how big the file is.  However, a client could add junk blocks
  at the end of a file to make this less apparent, at the cost of space.

  If a client really wants a fully validated H2-encrypted space the easiest
  solution is to format a filesystem within an encrypted file by treating it
  as a block device, but I digress.

  CURRENT STATUS: Not yet operational.

* Device ganging, copies for redundancy, and file splitting.

  Device ganging - The idea here is not to gang devices into a single
  physical volume but to instead format each device independently
  and allow crossover-references in the blockref to other devices in
  the set.

  One of the things we want to accomplish is to ensure that a failed
  device does not prevent access to radix tree elements in other devices
  in the gang, and that the failed device can be reconstructed.  To do
  this, each device implements complete reachability from the node root
  to all elements underneath it.  When a device fails, the synchronization
  code can theoretically reconstruct the missing material in other
  devices making up the gang.  New devices can be added to the gang and
  existing devices can be removed from the gang.

  Redundant copies - This is actually a fairly tough problem.  The
  solution I would like to implement is to use the device ganging feature
  to also implement redundancy, that way if a device fails within the
  gang there's a good chance that it can still remain completely functional
  without having to resynchronize.  But making this work is difficult to say
  the least.

  CURRENT STATUS: Not yet operational.

* MESI Cache coherency for multi-master/multi-client clustering operations.
  The servers hosting the MASTERs are also responsible for keeping track of
  the cache state.

  This is a feature that we would need to implement coherent cross-machine
  multi-threading and migration.

  CURRENT STATUS: Not yet operational.

* Implement unverified de-duplication (where only the check code is tested,
  avoiding having to actually read data blocks to calculate a
  de-duplication).  This would make use of the blockref structure's widest
  check field (512 bits).

  Out of necessity this type of feature would be settable on a file or
  recursive directory tree basis, but should only be used when the data
  is throw-away or can be reconstructed since data corruption (mismatched
  duplicates with the same hash) is still possible even with a 512-bit
  check code.

  The Unverified dedup feature is intended only for those files where
  occasional corruption is ok, such as in a web-crawler data store or
  other situations where the data content is not critically important
  or can be externally recovered if it becomes corrupt.

  CURRENT STATUS: Not yet operational.

				GENERAL DESIGN

HAMMER2 generally implements a copy-on-write block design for the filesystem,
which is very different from HAMMER1's B-Tree design.  Because the design
is copy-on-write it can be trivially snapshotted simply by making a copy
of the block table we desire to snapshot.  Snapshotting the root inode
effectively snapshots the entire filesystem, whereas snapshotting a file
inode only snapshots that one file.  Snapshotting a directory inode is
generally unhelpful since it only contains directory entries and the
underlying files are not arranged under it in the radix tree.

The copy-on-write design implements a block table as a radix-tree,
with a small fan-out in the volume header and inode (typically 4x) and
a large fan-out for indirect blocks (typically 128x and 512x depending).
The table is built bottom-up.  Intermediate radixes are only created when
necessary so small files and directories will have a much shallower radix
tree.
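
The 128x and 512x fan-out figures follow directly from the 128-byte blockref
size and the indirect block sizes mentioned elsewhere in this document
(16384 and 65536 bytes).  The arithmetic, as a small sketch with
hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

#define BLOCKREF_BYTES	128	/* blockref size stated in this document */

/* Number of blockrefs that fit in an indirect block of the given size. */
size_t
indirect_fanout(size_t block_bytes)
{
	assert(block_bytes >= BLOCKREF_BYTES);
	assert(block_bytes % BLOCKREF_BYTES == 0);
	return block_bytes / BLOCKREF_BYTES;
}
```

indirect_fanout(16384) yields 128 and indirect_fanout(65536) yields 512,
matching the two typical fan-outs quoted above.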

HAMMER2 implements several space optimizations:

  1. Directory entries with filenames <= 64 bytes will fit entirely
     in the 128-byte blockref structure and do not require additional data
     block references.  Since blockrefs are the core elements making up
     block tables, most directories should have good locality of reference
     for directory scans.

     Filenames > 64 bytes require a 1KB data-block reference, which
     is clearly less optimal, but very few filenames in a filesystem tend
     to be longer than 64 bytes so it works out.  This also simplifies
     the handling for large filenames as we can allow filenames up to
     1023 bytes long with this mechanism with no major changes to the
     code.

  2. Inodes embed 4 blockrefs, so files up to 256KB and directories with
     up to four directory entries (not including "." or "..") can be
     accommodated without requiring any indirect blocks.

  3. Indirect blocks can be sized to any power of two up to 65536 bytes,
     and H2 typically uses 16384 and 65536 bytes.  The smaller size is
     used for initial indirect blocks to reduce storage overhead for
     medium-sized files and directories.

  4. The File inode itself can directly hold the data for small
     files <= 512 bytes in size, overloading the space also used
     by its four 128-byte blockrefs (which are not needed if the
     file is <= 512 bytes in size).  This works out great for small
     files and directories.

  5. The last block in a file will have a storage allocation in powers
     of 2 from 1KB to 64KB as needed.  Thus a small file in excess of
     512 bytes but less than 64KB will not waste a full 64KB block.

  6. When compression is enabled, small physical blocks will be allocated
     when possible.  However, only reductions in powers of 2 are supported.
     So if a 64KB data block compresses to between (16KB+1) and 32KB, then
     a 32KB block will be used.  This gives H2 modest compression at very
     low cost without too much added complexity.

  7. Live de-dup will attempt to share data blocks when file copying is
     detected, significantly reducing actual physical writes to storage
     and the storage used.  Bulk de-dup (when implemented), will catch
     other cases of de-duplication.
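
Items 5 and 6 above both come down to rounding a byte count up to a
power-of-2 allocation.  The sketch below illustrates that rounding; the
function names are hypothetical, and applying the same 1KB minimum to
compressed blocks is an assumption rather than something stated here.

```c
#include <assert.h>
#include <stddef.h>

#define ALLOC_MIN_BYTES	1024u	/* smallest allocation (1KB) */
#define ALLOC_MAX_BYTES	65536u	/* largest allocation (64KB) */

/* Round a byte count up to the next power-of-2 size from 1KB to 64KB. */
size_t
alloc_size_for(size_t bytes)
{
	size_t size = ALLOC_MIN_BYTES;

	assert(bytes > 0 && bytes <= ALLOC_MAX_BYTES);
	while (size < bytes)
		size <<= 1;
	return size;
}

/*
 * Physical size used for a block after compression.  If rounding the
 * compressed size up does not yield a smaller allocation than the
 * logical block needs, nothing is gained and the logical size is kept.
 */
size_t
compressed_alloc_size(size_t logical_bytes, size_t compressed_bytes)
{
	size_t logical = alloc_size_for(logical_bytes);
	size_t physical = alloc_size_for(compressed_bytes);

	return (physical < logical) ? physical : logical;
}
```

A 64KB block that compresses to 16KB+1 bytes lands in a 32KB allocation,
while one that only compresses to 32KB+1 bytes stays at 64KB, which is why
only ratios of at least 2:1 pay off.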

Directories contain directory entries which are indexed using a hash of
their filename.  The hash is carefully designed to maintain some natural
sort ordering.  The directory entries are implemented AS blockrefs.  So
an inode can contain up to 4 before requiring an indirect block, and
each indirect block can contain up to 512 entries, with further data block
references required for any directory entry whose filename is > 64 bytes.
Because the directory entries are blockrefs, random access lookups are
maximally efficient.  The directory hash is designed to very loosely try
to retain some alphanumeric sorting to bundle similarly-named files together
and reduce random lookups.

The copy-on-write nature of the filesystem means that any modification
whatsoever will have to eventually synchronize new disk blocks all the way
to the super-root of the filesystem and then to the volume header itself.
This forms the basis for crash recovery and also ensures that recovery
occurs on a completed high-level transaction boundary.  All disk writes are
to new blocks except for the volume header (which cycles through 4 copies),
thus allowing all writes to run asynchronously and concurrently prior to
and during a flush, and then just doing a final synchronization and volume
header update at the end.  Many of HAMMER2's features are enabled by this
core design feature.

The Freemap is also implemented using a radix tree via a set of pre-reserved
blocks (approximately 4MB for every 2GB of storage), and also cycles through
multiple copies to ensure that crash recovery can restore the state of the
filesystem quickly at mount time.
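
To put the freemap reservation in perspective, the overhead can be estimated
with a few lines of arithmetic.  This is a rough sketch: the names are
hypothetical, the text above says "approximately", and rounding up to whole
2GB zones is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define FREEMAP_ZONE_BYTES	(2ULL * 1024 * 1024 * 1024)	/* 2GB */
#define FREEMAP_RESERVE_BYTES	(4ULL * 1024 * 1024)		/* 4MB */

/* Estimate the bytes pre-reserved for the freemap on a volume. */
uint64_t
freemap_reserved_bytes(uint64_t volume_bytes)
{
	uint64_t zones;

	assert(volume_bytes > 0);
	/* Round up to whole zones without overflowing on large volumes. */
	zones = volume_bytes / FREEMAP_ZONE_BYTES +
	    (volume_bytes % FREEMAP_ZONE_BYTES != 0);
	return zones * FREEMAP_RESERVE_BYTES;
}
```

For a 1TB volume this works out to 512 zones and 2GB of reserved space,
roughly 0.2% of the volume.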

HAMMER2 tries to maintain a small footprint and one way it does this is
by using the normal buffer cache for data and meta-data, and allowing the
kernel to asynchronously flush device buffers at any time (even during
synchronization).  The volume root is flushed separately, separated from
the asynchronous flushes by a synchronizing BUF_CMD_FLUSH op.  This means
that HAMMER2 has very low resource overhead from the point of view of the
operating system and is very much unlike HAMMER1 which had to lock dirty
buffers into memory for long periods of time.  HAMMER2 has no such
requirement.

Buffer cache overhead is very well bounded and can handle filesystem
operations of any complexity, even on boxes with very small amounts
of physical memory.  Buffer cache overhead is significantly lower with H2
than with H1 (and orders of magnitude lower than ZFS).

At some point I intend to implement a shortcut to make fsync()'s run fast,
and that is to allow deep updates to blockrefs to shortcut to auxiliary
space in the volume header to satisfy the fsync requirement.  The related
blockref is then recorded when the filesystem is mounted after a crash and
the update chain is reconstituted when a matching blockref is encountered
again during normal operation of the filesystem.
439b93cc2e0SMatthew Dillon
44057614c51SMatthew Dillon			    FILESYSTEM SYNC SEQUENCING
44157614c51SMatthew Dillon
44257614c51SMatthew DillonHAMMER2 implements a filesystem sync mechanism that allows the frontend
44357614c51SMatthew Dillonto continue doing modifying operations concurrent with the sync.  The
44457614c51SMatthew Dillongeneral sync mechanism operates in four phases:
44557614c51SMatthew Dillon
44657614c51SMatthew Dillon    1.	Individual file and directory inodes are fsync()d to disk,
44757614c51SMatthew Dillon	updated the blockrefs in the parent block above the inode, and
44857614c51SMatthew Dillon	removed from the syncq.
44957614c51SMatthew Dillon
45057614c51SMatthew Dillon	Once removed from the syncq, the frontend can do a modifying
45157614c51SMatthew Dillon	operation on these file and directory inodes without further
45257614c51SMatthew Dillon	effecting the filesystem sync.  These modifications will be
45357614c51SMatthew Dillon	flushed to disk on the next filesystem sync.
45457614c51SMatthew Dillon
45557614c51SMatthew Dillon	To reduce frontend stall times, an inode blocked on by the frontend
45657614c51SMatthew Dillon	which is on the syncq will be reordered to the front of the syncq
45757614c51SMatthew Dillon	to give the syncer a shot at it more quickly, in order to unstall
45857614c51SMatthew Dillon	the frontend ASAP.
45957614c51SMatthew Dillon
46057614c51SMatthew Dillon	If a frontend operations creates an unavoidable dependency between
46157614c51SMatthew Dillon	an inode on the syncq and an inode not on the syncq, both inodes
46257614c51SMatthew Dillon	are placed on (or back onto) the syncq as needed to ensure filesystem
46357614c51SMatthew Dillon	consistency for the filesystem sync.  This can extend the filesystem
46457614c51SMatthew Dillon	sync time, but even under heavy loads syncs are still able to be
46557614c51SMatthew Dillon	retired.
46657614c51SMatthew Dillon
46757614c51SMatthew Dillon    2.  The PFS ROOT is fsync()d to storage along with the subhierarchy
46857614c51SMatthew Dillon	representing the inode index (whos inodes were flushed in (1)).
46957614c51SMatthew Dillon	This brings the block copy-on-write up to the root inode.
47057614c51SMatthew Dillon
47157614c51SMatthew Dillon    3.	The SUPER-ROOT inode is fsync()d to storage along with the
47257614c51SMatthew Dillon	subhierarchy representing the PFS ROOTs for the volume.
47357614c51SMatthew Dillon
47457614c51SMatthew Dillon    4.	Finally, a physical disk flush command is issued to the storage
47557614c51SMatthew Dillon	device, and then the volume header is written to disk.  All
47657614c51SMatthew Dillon	I/O prior to this step occurred asynchronously.  This is the only
47757614c51SMatthew Dillon	step which must occur synchronously.
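
The four phases, together with the syncq reordering and dependency rules
from phase 1, can be sketched as the following toy model.  The class and
method names (Syncer, bump, add_dependency) are hypothetical, and the model
only records the order in which work is issued; it does no I/O.

```python
from collections import deque


class Syncer:
    """Toy model of one filesystem sync; all names are illustrative."""

    def __init__(self, dirty_inodes):
        self.syncq = deque(dirty_inodes)   # inodes captured by this sync
        self.log = []                      # ordered record of issued work

    def bump(self, inode):
        """The frontend is blocked on `inode`: move it to the front."""
        if inode in self.syncq:
            self.syncq.remove(inode)
            self.syncq.appendleft(inode)

    def add_dependency(self, a, b):
        """An unavoidable dependency puts both inodes on the syncq."""
        for inode in (a, b):
            if inode not in self.syncq:
                self.syncq.append(inode)

    def run(self):
        while self.syncq:                          # phase 1
            self.log.append(("inode", self.syncq.popleft()))
        self.log.append(("pfs-root", None))        # phase 2
        self.log.append(("super-root", None))      # phase 3
        self.log.append(("device-flush", None))    # phase 4: the only
        self.log.append(("volume-header", None))   # synchronous step
        return self.log
```

An inode that has left the syncq is free to be modified again; in this model
that simply means it is no longer part of the current run.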

			MIRROR_TID, MODIFY_TID, UPDATE_TID

In HAMMER2, the core block reference is a 128-byte structure called a blockref.
The blockref contains various bits of information including the 64-bit radix
key (typically a directory hash if a directory entry, inode number if a
hidden hardlink target, or file offset if a file block), number of significant
key bits for ranged recursion of indirect blocks, a 64-bit device seek that
encodes the radix of the physical block size in the low bits (physical block
size can be different from logical block size due to compression),
three 64-bit transaction ids, type information, and up to 512 bits worth
of check data for the block being referenced which can be anything from
a simple CRC to a strong cryptographic hash.
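
To make the size arithmetic and the "radix in the low bits" encoding
concrete, here is a small sketch.  The field order, the 16 spare bytes and
the choice of 6 low bits for the radix are assumptions made for this
example; the authoritative layout is the one in the HAMMER2 on-media header
file, not this snippet.

```python
import struct

# Illustrative layout only: 8 bytes of type/keybits/etc., then five 64-bit
# words (key, device seek, three transaction ids), 16 spare bytes, and
# 64 bytes (512 bits) of check data.  Order and spare area are assumed.
BLOCKREF_FMT = "<6BH5Q16s64s"
BLOCKREF_SIZE = struct.calcsize(BLOCKREF_FMT)

RADIX_MASK = 0x3F   # assumed: low 6 bits of the device seek hold the radix


def encode_data_off(byte_offset, radix):
    """Fold log2(physical block size) into the low bits of the offset."""
    if byte_offset & RADIX_MASK:
        raise ValueError("offset must leave the low bits clear")
    if not 0 <= radix <= RADIX_MASK:
        raise ValueError("radix out of range")
    return byte_offset | radix


def decode_data_off(data_off):
    """Split a device seek into (byte offset, physical size in bytes)."""
    radix = data_off & RADIX_MASK
    return data_off & ~RADIX_MASK, 1 << radix
```

The encoding works because block offsets are aligned to at least the 1KB
allocation resolution, so the low bits of a raw offset are always zero.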

mirror_tid - This is a media-centric (as in physical disk partition)
	     transaction id which tracks media-level updates.  The mirror_tid
	     can be different at the same point on different nodes in a
	     cluster.

	     Whenever any block in the media topology is modified, its
	     mirror_tid is updated with the flush id and will propagate
	     upward during the flush all the way to the volume header.

	     mirror_tid is monotonic.  It is primarily used for on-mount
	     recovery and volume root validation.  The name is historical,
	     carried over from H1; it is not used for nominal mirroring.

modify_tid - This is a cluster-centric (as in across all the nodes used
	     to build a cluster) transaction id which tracks filesystem-level
	     updates.

	     modify_tid is updated when the front-end of the filesystem makes
	     a change to an inode or data block.  It does NOT propagate upward
	     during a flush.

update_tid - This is a cluster synchronization transaction id.  Modifications
	     made to the topology will clear this field to 0 as they propagate
	     up to the root.  This gives the synchronizer an easy way to
	     determine what needs revalidation.

	     The synchronizer revalidates the cluster bottom-up by validating
	     a sub-topology and propagating the highest modify_tid in the
	     validated sub-topology up via the update_tid field.

	     Updates to this field may be optimized by the HAMMER2 VFS to
	     avoid the double-transition.

The synchronization code updates an out-of-sync node bottom-up and will
dynamically set update_tid as it goes, but media flushes can occur at any
time and these flushes will use mirror_tid for flush and freemap management.
The mirror_tid for each flush propagates upward to the volume header on each
flush.  modify_tid is set for any chains modified by a cluster op but does
not propagate up, instead serving as a seed for update_tid.

* The synchronization code is able to determine that a sub-tree is
  synchronized simply by observing the update_tid at the root of the sub-tree,
  on an inode-by-inode basis and also on a data-block-by-data-block basis.

* The synchronization code is able to do an incremental update of an
  out-of-sync node simply by skipping elements with a matching update_tid
  (when not 0).

* The synchronization code can be interrupted and restarted at any time,
  and is able to pick up where it left off with very low overhead.

* The synchronization code does not inhibit media flushes.  Media flushes
  can occur (and must occur) while synchronization is ongoing.
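
The different propagation rules for the three ids can be illustrated with a
single-node toy model.  Chain, frontend_modify, media_flush and synchronize
are hypothetical names, and the model is deliberately simplified: a real
synchronizer compares update_tid values between cluster nodes, whereas here
any non-zero update_tid is treated as "already validated".

```python
class Chain:
    """Toy element of the topology; field names follow the text above."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.mirror_tid = 0
        self.modify_tid = 0
        self.update_tid = 0
        if parent is not None:
            parent.children.append(self)


def frontend_modify(chain, modify_tid):
    """modify_tid stays on the chain; update_tid is cleared up to the root."""
    chain.modify_tid = modify_tid
    node = chain
    while node is not None:
        node.update_tid = 0
        node = node.parent


def media_flush(chain, flush_tid):
    """mirror_tid propagates from the modified block up to the root."""
    node = chain
    while node is not None:
        node.mirror_tid = flush_tid
        node = node.parent


def synchronize(node):
    """Bottom-up revalidation: update_tid = highest modify_tid below."""
    if node.update_tid != 0:
        return node.update_tid          # validated earlier, skip sub-tree
    highest = node.modify_tid
    for child in node.children:
        highest = max(highest, synchronize(child))
    node.update_tid = highest
    return highest
```

After a change, only the path from the modified chain to the root has
update_tid cleared, so a restarted synchronize() skips every sibling
sub-tree that was already validated.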

There are several other stored transaction ids in HAMMER2.  There is a
separate freemap_tid in the volume header that is used to allow freemap
flushes to be deferred, and inodes have a pfs_psnap_tid which is used in
conjunction with CHECK_NONE to allow blocks without a check code which do
not violate the most recent snapshot to be overwritten in-place.

Remember that since this is a copy-on-write filesystem, we can propagate
a considerable amount of information up the tree to the volume header
without adding to the I/O we already have to do.

			    DIRECTORIES AND INODES

Directories are hashed.  In HAMMER2, the PFS ROOT directory (aka inode 1 for
a PFS) can contain a mix of directory entries AND embedded inodes.  This was
actually a design mistake, so the code to deal with the index of inodes
vs the directory entries is slightly convoluted (but not too bad).

In the first iteration of HAMMER2 I tried really hard to embed actual
inodes AS the directory entries, but it created a mass of problems for
implementing NFS export support and dealing with hardlinks, so in a later
iteration I implemented small independent directory entries (that wound up
mostly fitting in the blockref structure, so WIN WIN!).  However, 'embedded'
inodes AS the directory entries still survive for the SUPER-ROOT and the
PFS-ROOTs under the SUPER-ROOT.  They just aren't used in the individual
filesystem that each PFS represents.

Hardlinks are now implemented normally, with multiple directory entries
referencing the same inode and that inode containing an nlinks count.

				    RECOVERY

H2 allows freemap flushes to lag behind topology flushes.  This improves
filesystem sync performance.  The freemap flush tracks a separate
transaction id (via mirror_tid) in the volume header.

On mount, HAMMER2 will first locate the highest-sequenced check-code-validated
volume header from the 4 copies available (if the filesystem is big enough,
e.g. > ~8GB or so, there will be 4 copies of the volume header).
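
Selecting the header amounts to "validate each copy, then take the newest
valid one".  The sketch below uses a made-up header format (an 8-byte
sequence number, a payload, and a CRC32 trailer); the format, make_header
and pick_header are all assumptions for illustration and say nothing about
the real volume header layout or check algorithm.

```python
import zlib


def make_header(seq, payload):
    """Build a toy header: sequence number, payload, CRC over both."""
    body = seq.to_bytes(8, "little") + payload
    return body + zlib.crc32(body).to_bytes(4, "little")


def pick_header(copies):
    """Return (seq, payload) of the newest copy whose check code passes."""
    best = None
    for raw in copies:
        if raw is None or len(raw) < 12:
            continue                      # copy absent on a small volume
        body, stored = raw[:-4], int.from_bytes(raw[-4:], "little")
        if zlib.crc32(body) != stored:
            continue                      # torn or corrupt copy
        seq = int.from_bytes(body[:8], "little")
        if best is None or seq > best[0]:
            best = (seq, body[8:])
    return best
```

A copy with a higher sequence number but a failed check is ignored, so a
header write interrupted by a crash falls back to the previous good copy.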

HAMMER2 will then run an incremental scan of the topology for mirror_tid
transaction ids between the last freemap flush tid and the last topology
flush tid in order to synchronize the freemap.  Because this scan is
incremental the time it takes to run will be relatively short and well-bounded
at mount-time.  This is NOT an fsck.  Freemap flushes can be avoided for any
number of normal topology flushes but should still occur frequently enough
to avoid long recovery times in case of a crash.
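
The reason the scan stays short is that mirror_tid propagates upward, so a
block whose mirror_tid is not newer than the last freemap flush cannot have
anything newer beneath it.  A sketch, assuming a toy tree of dicts and a
hypothetical incremental_scan helper:

```python
def incremental_scan(node, freemap_tid, visit):
    """Visit only blocks whose mirror_tid is newer than the freemap's.

    `node` is a dict: {"mirror_tid": int, "off": int, "children": [...]}.
    A parent's mirror_tid is >= that of everything beneath it, so a stale
    parent prunes its whole sub-tree.  Returns the number of blocks visited.
    """
    if node["mirror_tid"] <= freemap_tid:
        return 0
    visit(node["off"])                 # e.g. mark the block as allocated
    scanned = 1
    for child in node.get("children", []):
        scanned += incremental_scan(child, freemap_tid, visit)
    return scanned
```

The cost is proportional to what changed since the last freemap flush, not
to the size of the filesystem.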

The filesystem is then ready for use.

			    DISK I/O OPTIMIZATIONS

The freemap implements a 1KB allocation resolution.  Each 2MB segment managed
by the freemap is zoned and has a tendency to collect inodes, small data,
indirect blocks, and larger data blocks into separate segments.  The idea is
to greatly improve I/O performance (particularly by laying inodes down next
to each other which has a huge effect on directory scans).

The current implementation of HAMMER2 implements a fixed 64KB physical block
size in order to allow the mapping of hammer2_dio's in its IO subsystem
to consumers that might desire different sizes.  This way we don't have to
worry about matching the buffer cache / DIO cache to the variable block
size of underlying elements.  In addition, 64KB I/Os allow compatibility
with physical sector sizes up to 64KB in the underlying physical storage
with no change in the byte-by-byte format of the filesystem.  The DIO
layer also prevents ordering deadlocks between unrelated portions of the
filesystem hierarchy whose logical blocks wind up in the same physical block.

The main point of having a fixed 64KB I/O size is not actually to help
nominal front-end access but instead to reduce the complexity of having
to deal with mixed block sizes in the buffer cache,
particularly when blocks are freed and then later reused with a different
block size.  HAMMER1 had to have specialized code to check for and
invalidate buffer cache buffers in the free/reuse case.  HAMMER2 does not
need such code.

That said, HAMMER2 places no major restrictions on mixing logical block
sizes within a 64KB block.  The only restriction is that a logical HAMMER2
block cannot cross a 64KB boundary.  The soft restrictions the block
allocator puts in place exist primarily for performance reasons (i.e. to
try to collect 1K inodes together).  The 2MB freemap zone granularity
should work very well in this regard.
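
The 64KB boundary rule is plain arithmetic on the device offset.  As a
sketch (crosses_64k and physical_buffer are hypothetical helper names, not
functions from the HAMMER2 sources):

```python
PBUF_SIZE = 64 * 1024      # fixed physical I/O size described above


def crosses_64k(offset, size):
    """True if [offset, offset + size) spans more than one 64KB block."""
    return offset // PBUF_SIZE != (offset + size - 1) // PBUF_SIZE


def physical_buffer(offset):
    """Return (base of the enclosing 64KB block, offset within it)."""
    base = offset & ~(PBUF_SIZE - 1)
    return base, offset - base
```

For example, a 1KB inode at offset 63KB fits in the first physical block,
while a 2KB block at the same offset would straddle the boundary and is
therefore not a legal placement.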

HAMMER2 also utilizes OS support for ganging 64KB buffers together into even
larger blocks for I/O (OS buffer cache 'clustering'), OS-supported read-ahead,
OS-driven asynchronous retirement, and other performance features typically
provided by the OS at the block-level to ensure smooth system operation.

By avoiding wiring buffers/memory and allowing the OS's buffer cache to
run normally, HAMMER2 winds up with very low OS overhead.

				FREEMAP NOTES

The freemap is stored in the reserved blocks situated in the ~4MB reserved
area at the base of every ~1GB level-1 zone of physical storage.  The current
implementation reserves 8 copies of every freemap block and cycles through
them in order to make the freemap operate in a copy-on-write fashion.

    - Freemap is copy-on-write.
    - Freemap operations are transactional, same as everything else.
    - All backup volume headers are consistent on-mount.

The Freemap is organized using the same radix blockmap algorithm used for
files and directories, but with fixed radix values.  For a maximally-sized
filesystem the Freemap will wind up being a 5-level-deep radix blockmap,
but the top-level is embedded in the volume header so insofar as performance
goes it is really just a 4-level blockmap.

The freemap radix allocation mechanism is also the same, meaning that it is
bottom-up and will not allocate unnecessary intermediate levels for smaller
filesystems.  The number of blockmap levels not including the volume header
for various filesystem sizes is as follows:

	up-to		#of freemap levels
	1GB		1-level
	256GB		2-level
	64TB		3-level
	16PB		4-level
	4EB		5-level
	16EB		6-level
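
The rows of the table follow a simple pattern: the first level covers 1GB
(2^30 bytes) and each additional level multiplies the coverage by 256.  The
sketch below reproduces the table under that reading.  The 8-bits-per-level
figure is inferred from the table itself, and the cap at 64 bits is what
makes the last row 16EB; freemap_levels is a hypothetical helper.

```python
LEAF_RADIX = 30     # one level-1 freemap block covers 1GB (2**30 bytes)
LEVEL_BITS = 8      # inferred from the table: each level is x256
MAX_RADIX = 64      # 64-bit offsets cap the address space at 16EB


def freemap_levels(fs_bytes):
    """Freemap levels (excluding the volume header) for a given size."""
    if fs_bytes > 1 << MAX_RADIX:
        raise ValueError("larger than a 64-bit address space")
    levels, radix = 1, LEAF_RADIX
    while (1 << radix) < fs_bytes:
        radix = min(radix + LEVEL_BITS, MAX_RADIX)
        levels += 1
    return levels
```

A 2GB filesystem already needs two levels under this model, since a single
level-1 block covers only 1GB.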

The Freemap has bitmap granularity down to 16KB and a linear iterator that
can linearly allocate space down to 1KB.  Due to fragmentation it is possible
for the linear allocator to become marginalized, but it is relatively easy
to reallocate small blocks every once in a while (like once a year if you
care at all) and once the old data cycles out of the snapshots, or you also
rewrite the snapshots (which you can do), the freemap should wind up
relatively optimal again.  Generally speaking I believe that algorithms can
be developed to make this a non-problem without requiring any media structure
changes.  However, touching all the freemaps will replicate meta-data whereas
the meta-data was mostly shared in the original snapshot.  So this is a
problem that needs solving in HAMMER2.

In order to implement fast snapshots (and writable snapshots for that
matter), HAMMER2 does NOT ref-count allocations.  All the freemap does is
keep track of 100% free blocks plus some extra bits for staging the bulkfree
scan.  The lack of ref-counting makes it possible to:

    - Completely trivialize HAMMER2's snapshot operations.
    - Completely trivialize HAMMER2's de-dup operations.
    - Use any volume header backup trivially.
    - Destroy whole sub-trees without having to scan them.
      Deleting PFSs and snapshots is instant (though space recovery still
      requires two bulkfree scans).
    - Simplify normal crash recovery operations by not having to reconcile
      a ref-count.
    - Simplify catastrophic recovery operations for the same reason.

Normal crash recovery is simply a matter of doing an incremental scan
of the topology between the last flushed freemap TID and the last flushed
topology TID.  This usually takes only a few seconds and allows:

    - Freemap flushes to be deferred for any number of topology flush
      cycles (with some care to ensure that all four volume headers
      remain valid).
    - The freemap to be left unflushed on fsync, reducing fsync overhead.

				FREEMAP - BULKFREE

Blocks are freed via a bulkfree scan, which is a two-stage meta-data scan.
Blocks are first marked as being possibly free and then finalized in the
second scan.  Live filesystem operations are allowed to run during these
scans and any freemap block that is allocated or adjusted after the first
scan will simply be re-marked as allocated and the second scan will not
transition it to being free.
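
The two-stage rule can be modelled as a three-state machine per block.  The
state names and the helpers bulkfree_pass and live_allocate below are
hypothetical, and the freemap is reduced to a dict; the point is only to
show why a block touched between scans survives the second one.

```python
ALLOCATED, POSSIBLY_FREE, FREE = "allocated", "possibly-free", "free"


def bulkfree_pass(freemap, reachable):
    """One bulkfree scan over `freemap` (block -> state).

    `reachable` is the set of blocks found by walking the meta-data.
    An unreachable block is demoted one step per pass, so an allocated
    block needs two passes to become free.
    """
    for block, state in freemap.items():
        if block in reachable or state == FREE:
            continue
        freemap[block] = POSSIBLY_FREE if state == ALLOCATED else FREE


def live_allocate(freemap, block):
    """A live allocation between passes re-marks the block as allocated."""
    freemap[block] = ALLOCATED
```

A block re-marked by live_allocate() enters the second pass as allocated,
so the worst that pass can do is stage it as possibly free again.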

The cost of not doing ref-count tracking is that HAMMER2 must perform two
bulkfree scans of the meta-data to determine which blocks can actually be
freed.  This can be complicated by the volume header backups and snapshots
which cause the same meta-data topology to be scanned over and over again,
but mitigated somewhat by keeping a cache of higher-level nodes to detect
when we would scan a sub-topology that we have already scanned.  Due to the
copy-on-write nature of the filesystem, such detection is easy to implement.
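
A sketch of that detection, assuming a toy map from block offset to child
offsets.  scan_roots is a hypothetical helper, and it remembers every
visited offset, whereas the text describes a cache of higher-level nodes
only; the unbounded set is a simplification for the example.

```python
def scan_roots(roots, blocks):
    """Walk several roots (live PFS, snapshots, header backups).

    `blocks` maps a block offset to the child offsets it references.
    With copy-on-write, two roots that reference the same offset share
    the whole sub-topology under it, so a repeat offset can be skipped.
    Returns (set of offsets seen, number of blocks actually scanned).
    """
    seen = set()
    visits = 0
    stack = list(roots)
    while stack:
        off = stack.pop()
        if off in seen:
            continue                  # sub-topology already scanned
        seen.add(off)
        visits += 1
        stack.extend(blocks.get(off, []))
    return seen, visits
```

With two roots sharing most of their tree, the shared part is walked once
rather than once per root.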

Part of the ongoing design work is finding ways to reduce the scope of this
meta-data scan so the entire filesystem's meta-data does not need to be
scanned (though in tests with HAMMER1, even full meta-data scans have
turned out to be fairly low cost).  In other words, it's an area where
improvements can be made without any media format changes.

Another advantage of operating the freemap like this is that some future
version of HAMMER2 might decide to completely change how the freemap works
and would be able to make the change with relatively low downtime.

				  CLUSTERING

Clustering, as always, is the most difficult bit but we have some advantages
with HAMMER2 that we did not have with HAMMER1.  First, HAMMER2's media
structures generally follow the kernel's filesystem hierarchy which allows
cluster operations to use topology cache and lock state.  Second,
HAMMER2's writable snapshots make it possible to implement several forms
of multi-master clustering.

The mount device path you specify serves to bootstrap your entry into
the cluster.  This is typically local media.  It can even be a ram-disk
that only contains placemarkers that help HAMMER2 connect to a fully
networked cluster.

With HAMMER2 you mount a directory entry under the super-root.  This entry
will contain a cluster identifier that helps HAMMER2 identify and integrate
with the nodes making up the cluster.  HAMMER2 will automatically integrate
*all* entries under the super-root when you mount one of them.  You have to
mount at least one for HAMMER2 to integrate the block device in the larger
cluster.

For cluster servers every HAMMER2-formatted partition has a "LOCAL" MASTER
which can be mounted in order to make the rest of the elements under the
super-root available to the network.  (In a prior specification I emplaced
the cluster connections in the volume header's configuration space but I no
longer do that).

Connecting to the wider networked cluster involves setting up the /etc/hammer2
directory with appropriate IP addresses and keys.  The user-mode hammer2
service daemon maintains the connections and performs graph operations
via libdmsg.

Node types within the cluster:

    DUMMY	- Used as a local placeholder (typically in ramdisk)
    CACHE	- Used as a local placeholder and cache (typically on a SSD)
    SLAVE	- A SLAVE in the cluster, can source data on quorum agreement.
    MASTER	- A MASTER in the cluster, can source and sink data on quorum
		  agreement.
    SOFT_SLAVE	- A SLAVE in the cluster, can source data locally without
		  quorum agreement (must be directly mounted).
    SOFT_MASTER	- A local MASTER but *not* a MASTER in the cluster.  Can source
		  and sink data locally without quorum agreement, intended to
		  be synchronized with the real MASTERs when connectivity
		  allows.  Operations are not coherent with the real MASTERs
		  even when they are available.

    NOTE: SNAPSHOT, AUTOSNAP, etc represent sub-types, typically under a
	  SLAVE.  A SNAPSHOT or AUTOSNAP is a SLAVE sub-type that is no longer
	  synchronized against current masters.

    NOTE: Any SLAVE or other copy can be turned into its own writable MASTER
	  by giving it a unique cluster id, taking it out of the cluster that
	  originally spawned it.
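
The source/sink/quorum wording in the list above can be restated as a small
capability table.  This is only a reading of the list, not code from the
cluster implementation: Caps, NODE_CAPS and may_write are hypothetical
names, and DUMMY and CACHE are treated simply as placeholders that never
accept writes in this model.

```python
from typing import NamedTuple


class Caps(NamedTuple):
    source: bool    # may supply data
    sink: bool      # may accept modifications
    quorum: bool    # operations require quorum agreement


NODE_CAPS = {
    "SLAVE":       Caps(source=True, sink=False, quorum=True),
    "MASTER":      Caps(source=True, sink=True, quorum=True),
    "SOFT_SLAVE":  Caps(source=True, sink=False, quorum=False),
    "SOFT_MASTER": Caps(source=True, sink=True, quorum=False),
}
PLACEHOLDERS = {"DUMMY", "CACHE"}


def may_write(node_type, quorum_reached):
    """Can a node of this type accept a modification right now?"""
    if node_type in PLACEHOLDERS:
        return False
    caps = NODE_CAPS[node_type]
    return caps.sink and (quorum_reached or not caps.quorum)
```

The table makes the SOFT_MASTER trade-off visible: it can always accept
writes locally, which is exactly why it is not coherent with the real
MASTERs.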
7835a9a531cSMatthew Dillon
7842910a90cSMatthew DillonThere are four major protocols:
7855a9a531cSMatthew Dillon
7862910a90cSMatthew Dillon    Quorum protocol
7872910a90cSMatthew Dillon
7882910a90cSMatthew Dillon	This protocol is used between MASTER nodes to vote on operations
7892910a90cSMatthew Dillon	and resolve deadlocks.
7902910a90cSMatthew Dillon
7912910a90cSMatthew Dillon	This protocol is used between SOFT_MASTER nodes in a sub-cluster
7922910a90cSMatthew Dillon	to vote on operations, resolve deadlocks, determine what the latest
7932910a90cSMatthew Dillon	transaction id for an element is, and to perform commits.
7942910a90cSMatthew Dillon
7952910a90cSMatthew Dillon    Cache sub-protocol
7962910a90cSMatthew Dillon
7972910a90cSMatthew Dillon	This is the MESI sub-protocol which runs under the Quorum
7982910a90cSMatthew Dillon	protocol.  This protocol is used to maintain cache state for
7992910a90cSMatthew Dillon	sub-trees to ensure that operations remain cache coherent.
8002910a90cSMatthew Dillon
8012910a90cSMatthew Dillon	Depending on administrative rights this protocol may or may
8022910a90cSMatthew Dillon	not allow a leaf node in the cluster to hold a cache element
8032910a90cSMatthew Dillon	indefinitely.  The administrative controller may preemptively
8042910a90cSMatthew Dillon	downgrade a leaf with insufficient administrative rights
8052910a90cSMatthew Dillon	without giving it a chance to synchronize any modified state
8062910a90cSMatthew Dillon	back to the cluster.
8072910a90cSMatthew Dillon
8082910a90cSMatthew Dillon    Proxy protocol
8092910a90cSMatthew Dillon
8102910a90cSMatthew Dillon	The Quorum and Cache protocols only operate between MASTER
8112910a90cSMatthew Dillon	and SOFT_MASTER nodes.  All other node types must use the
8122910a90cSMatthew Dillon	Proxy protocol to perform similar actions.  This protocol
8132910a90cSMatthew Dillon	differs in that proxy requests are typically sent to just
8142910a90cSMatthew Dillon	one adjacent node and that node then maintains state and
8152910a90cSMatthew Dillon	forwards the request or performs the required operation.
8162910a90cSMatthew Dillon	When the link is lost to the proxy, the proxy automatically
8172910a90cSMatthew Dillon	forwards a deletion of the state to the other nodes based on
8182910a90cSMatthew Dillon	what it has recorded.
8192910a90cSMatthew Dillon
8202910a90cSMatthew Dillon	If a leaf has insufficient administrative rights it may not
8212910a90cSMatthew Dillon	be allowed to actually initiate a quorum operation and may only
8222910a90cSMatthew Dillon	be allowed to maintain partial MESI cache state or perhaps none
8232910a90cSMatthew Dillon	at all (since cache state can block other machines in the
8242910a90cSMatthew Dillon	cluster).  Instead a leaf with insufficient rights will have to
8252910a90cSMatthew Dillon	make due with a preemptive loss of cache state and any allowed
8262910a90cSMatthew Dillon	modifying operations will have to be forwarded to the proxy which
8272910a90cSMatthew Dillon	continues forwarding it until a node with sufficient administrative
8282910a90cSMatthew Dillon	rights is encountered.
8292910a90cSMatthew Dillon
8302910a90cSMatthew Dillon	To reduce issues and give the cluster more breath, sub-clusters
8312910a90cSMatthew Dillon	made up of SOFT_MASTERs can be formed in order to provide full
8322910a90cSMatthew Dillon	cache coherent within a subset of machines and yet still tie them
8332910a90cSMatthew Dillon	into a greater cluster that they normally would not have such
8342910a90cSMatthew Dillon	access to.  This effectively makes it possible to create a two
8352910a90cSMatthew Dillon	or three-tier fan-out of groups of machines which are cache-coherent
8362910a90cSMatthew Dillon	within the group, but perhaps not between groups, and use other
8372910a90cSMatthew Dillon	means to synchronize between the groups.
8382910a90cSMatthew Dillon
8392910a90cSMatthew Dillon    Media protocol
8402910a90cSMatthew Dillon
8412910a90cSMatthew Dillon	This is basically the physical media protocol.
8425a9a531cSMatthew Dillon
		       MASTER & SLAVE SYNCHRONIZATION

With HAMMER2 I really want to be hard-nosed about the consistency of the
filesystem, including the consistency of SLAVEs (snapshots, etc).  In order
to guarantee consistency we take advantage of the copy-on-write nature of
the filesystem by forking consistent nodes and using the forked copy as the
source for synchronization.

Similarly, the target for synchronization is not updated on the fly but instead
is also forked and the forked copy is updated.  When synchronization is
complete, forked sources can be thrown away and forked copies can replace
the original synchronization target.

This may seem complex, but 'forking a copy' is actually a virtually free
operation.  The top-level inode (under the super-root), on-media, is simply
copied to a new inode and poof, we have an unchanging snapshot to work with.

	- Making a snapshot is fast... almost instantaneous.

	- Snapshots are used for various purposes, including synchronization
	  of out-of-date nodes.

	- A snapshot can be converted into a MASTER or some other PFS type.

	- A snapshot can be forked off from its parent cluster entirely and
	  turned into its own writable filesystem, either as a single MASTER
	  or this can be done across the cluster by forking a quorum+ of
	  existing MASTERs and transferring them all to a new cluster id.

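As a rough illustration of why the fork is nearly free, the sketch below
models a copy-on-write tree in Python.  It is not HAMMER2 code: Node,
SuperRoot, cow_write() and read() are hypothetical names, and the on-media
inode copy is stood in for by copying a single reference.

```python
class Node:
    """A tree node that is never modified once built (copy-on-write)."""

    def __init__(self, children=None, data=b""):
        self.children = dict(children or {})
        self.data = data


def cow_write(node, path, data):
    """Return a new root with `path` set to `data`.

    Only the nodes along `path` are rebuilt; all other subtrees are shared
    with the old root.
    """
    if not path:
        return Node(node.children if node else None, data)
    name, rest = path[0], path[1:]
    children = dict(node.children) if node else {}
    children[name] = cow_write(children.get(name), rest, data)
    return Node(children, node.data if node else b"")


def read(node, path):
    """Follow `path` down from `node` and return the data found there."""
    for name in path:
        node = node.children[name]
    return node.data


class SuperRoot:
    """Hypothetical super-root: maps PFS names to their top-level node."""

    def __init__(self):
        self.pfs = {}

    def snapshot(self, src, dst):
        # 'Forking a copy': only the top-level entry is duplicated.
        self.pfs[dst] = self.pfs[src]

    def write(self, name, path, data):
        self.pfs[name] = cow_write(self.pfs[name], path, data)


sr = SuperRoot()
sr.pfs["ROOT"] = Node()
sr.write("ROOT", ["etc", "motd"], b"v1")
sr.snapshot("ROOT", "ROOT.snap")
sr.write("ROOT", ["etc", "motd"], b"v2")   # does not disturb the snapshot
```

Only the path from the root to the modified leaf is rebuilt on a write, so
the snapshot keeps referencing the old blocks without any data being copied.
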
More complex is reintegrating the target once the synchronization is complete.
For SLAVEs we just delete the old SLAVE and rename the copy to the same name.
However, if the SLAVE is mounted and not optioned as a static mount (that is,
the mounter wants to see updates as they are synchronized), a reconciliation
must occur on the live mount to clean up the vnode, inode, and chain caches
and shift any remaining vnodes over to the updated copy.

	- A mounted SLAVE can track updates made to the SLAVE but the
	  actual mechanism is that the SLAVE PFS is replaced with an
	  updated copy, typically every 30-60 seconds.

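The fork, update, and swap sequence for a SLAVE can be sketched as follows.
This is a toy model in Python under loud assumptions: a PFS is a flat dict,
fork() is a shallow copy rather than the O(1) on-media inode copy, and the
'.syncsrc' / '.synctmp' names are made up for illustration.  The live-mount
reconciliation of vnode, inode, and chain caches is not modeled.

```python
def fork(pfs, src, dst):
    """Stand-in for the on-media fork; here simply a shallow dict copy."""
    pfs[dst] = dict(pfs[src])


def sync_slave(pfs, master, slave):
    """Synchronize `slave` from `master` using forked copies on both sides."""
    src_name = master + ".syncsrc"      # hypothetical temporary names
    tmp_name = slave + ".synctmp"
    fork(pfs, master, src_name)         # unchanging source to copy from
    fork(pfs, slave, tmp_name)          # working copy of the target
    src, work = pfs[src_name], pfs[tmp_name]

    for key in list(work):              # drop entries the source lacks
        if key not in src:
            del work[key]
    for key, val in src.items():        # add or refresh everything else
        if work.get(key) != val:
            work[key] = val

    del pfs[src_name]                   # forked source is thrown away
    pfs[slave] = pfs.pop(tmp_name)      # copy replaces the old SLAVE


pfs = {"M": {"a": 1, "b": 2}, "S": {"a": 0, "c": 9}}
old_slave = pfs["S"]                    # e.g. what a static mount still sees
sync_slave(pfs, "M", "S")
```

The old SLAVE object is never touched during the pass, so anything still
referencing it sees a consistent (if outdated) view until the swap.
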
Reintegrating a MASTER which has fallen out of the quorum due to being out
of date is also somewhat more complex.  The same updating mechanism is used;
we actually have to throw the 'old' MASTER away once the new one has been
updated.  However, if the cluster is undergoing heavy modifications the
updated MASTER will be out of date almost the instant its source is
snapshotted.  Reintegrating a MASTER thus requires a somewhat more complex
interaction.

	- If a MASTER is really out of date we can run one or more
	  synchronization passes concurrent with modifying operations.
	  The quorum can remain live.

	- A final synchronization pass is required with quorum operations
	  blocked to reintegrate the now up-to-date MASTER into the cluster.

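The two-stage catch-up for a MASTER can be simulated in a few lines.  The
sketch below is only a model of the idea, written in Python: Cluster,
reintegrate(), and the concurrent_work callback are hypothetical, the
cluster state is a plain dict, and 'blocking quorum operations' is reduced
to a flag.

```python
class Cluster:
    """Toy quorum: a key/value state plus a transaction id counter."""

    def __init__(self):
        self.state = {}
        self.tid = 0
        self.blocked = False

    def modify(self, key, val):
        if self.blocked:
            raise RuntimeError("quorum operations are blocked")
        self.tid += 1
        self.state[key] = val

    def snapshot(self):
        # Stand-in for forking a consistent source to synchronize from.
        return self.tid, dict(self.state)


def reintegrate(cluster, master, concurrent_work, live_passes=2):
    """Bring `master` up to date; return its lag after each live pass."""
    lags = []
    for _ in range(live_passes):
        tid, state = cluster.snapshot()
        concurrent_work(cluster)          # the quorum stays live meanwhile
        master.update(tid=tid, state=state)
        lags.append(cluster.tid - master["tid"])
    cluster.blocked = True                # final pass: block quorum operations
    try:
        tid, state = cluster.snapshot()
        master.update(tid=tid, state=state)
    finally:
        cluster.blocked = False
    return lags


cluster = Cluster()
for i in range(5):
    cluster.modify(f"file{i}", i)

stale = {"tid": 0, "state": {}}
counter = iter(range(100, 200))
lags = reintegrate(cluster, stale, lambda c: c.modify("busy", next(counter)))
```

Each live pass shrinks the gap but cannot close it while writers are active,
which is why the last pass runs with quorum operations blocked.
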
				QUORUM OPERATIONS

Quorum operations can be broken down into HARD BLOCK operations and NETWORK
operations.  If your MASTERs are all local mounts, then failures and
sequencing are easy to deal with.

Quorum operations on a networked cluster are more complex.  The problems:

    - Masters cannot rely on clients to moderate quorum transactions.
      Apart from the reliance being unsafe, the client could also
      lose contact with one or more masters during the transaction and
      leave one or more masters out-of-sync without the master(s) knowing
      they are out of sync.

    - When many clients are present, we do not want a flaky network
      link from one to cause one or more masters to go out of
      synchronization and potentially stall the whole works.

    - Normal hammer2 mounts allow a virtually unlimited number of modifying
      transactions between actual flushes.  The media flush rolls everything
      up into a single transaction id per flush.  Detection of 'missing'
      transactions in a concurrent multi-client setup when one or more clients
      temporarily lose connectivity is thus difficult.

    - Clients have a limited amount of time to reconnect to a cluster after
      a network disconnect before their MESI cache states are lost.

    - Clients may proceed with several transactions before knowing for sure
      that earlier transactions were completely successful.  Performance is
      important; we won't be waiting for a full quorum-verified synchronous
      flush to media before allowing a system call to return.

    - Masters can decide that a client's MESI cache states were lost (i.e.
      that the transaction was too slow) as well.

The solutions (for modifying transactions):

    - Masters handle quorum confirmation amongst themselves and do not rely
      on the client for that purpose.

    - A client can connect to one or more masters regardless of the size of
      the quorum and can submit modifying operations to a single master if
      desired.  The master will take care of the rest.

      A client must still validate the quorum (and obtain MESI cache states)
      when doing read-only operations in order to present the correct data
      to the user process for the VOP.

    - Masters will run a 2-phase commit amongst themselves, often concurrent
      with other non-conflicting transactions, and will serialize operations
      and/or enforce synchronization points for 2-phase completion on
      serialized transactions from the same client or when cache state
      ownership is shifted from one client to another.

    - Clients will usually allow operations to run asynchronously and return
      from system calls more or less ASAP once they own the necessary cache
      coherency locks.  The client can select the validation mode to wait for
      with mount options:

      (1) Fully async		(mount -o async)
      (2) Wait for phase-1 ack	(mount)
      (3) Wait for phase-2 ack	(mount -o sync)		(fsync - wait p2ack)
      (4) Wait for flush	(mount -o sync)		(fsync - wait flush)

      Modifying system calls cannot be told to wait for a full media
      flush, as full media flushes are prohibitively expensive.  You
      still have to fsync().

      The fsync wait mode for network links can be selected, either to
      return after the phase-2 ack or to return after the media flush.
      The default is to wait for the phase-2 ack, which at least guarantees
      that a network failure after that point will not disrupt operations
      issued before the fsync.

    - Clients must adjust the chain state for modifying operations prior to
      releasing chain locks / returning from the system call, even if the
      masters have not finished the transaction.  A late failure by the
      cluster will result in desynchronized state which requires erroring
      out the whole filesystem or resynchronizing somehow.

    - Clients can opt to keep a record of transactions through the phase-2
      ack or the actual media flush on the masters.

      However, replaying/revalidating the log cannot necessarily guarantee
      success.  If the masters lose synchronization due to network issues
      between masters (or if the client was mounted fully-async), or if enough
      masters crash simultaneously such that a quorum fails to flush even
      after the phase-2 ack, then it is possible that by the time a client
      is able to replay/revalidate, some other client has squeezed in and
      committed something that would conflict.

      If the client crashes it works similarly to a crash with a local storage
      mount... many dirty buffers might be lost.  And the same happens in
      the cluster case.

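To make the master-side 2-phase commit a little more concrete, here is a
minimal sketch in Python.  It is an illustration only and not the HAMMER2
protocol: the Master class, two_phase_commit(), and the use of a simple
majority as the quorum threshold are all assumptions, and serialization,
MESI cache state, and media flushes are left out entirely.

```python
class Master:
    """Hypothetical master node holding prepared and committed state."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.prepared = {}    # tid -> (key, value) awaiting phase 2
        self.committed = {}   # key -> value

    def prepare(self, tid, op):
        """Phase 1: record the operation and acknowledge it."""
        if not self.reachable:
            return False
        self.prepared[tid] = op
        return True

    def commit(self, tid):
        """Phase 2: apply a previously prepared operation."""
        if not self.reachable or tid not in self.prepared:
            return False
        key, value = self.prepared.pop(tid)
        self.committed[key] = value
        return True

    def abort(self, tid):
        self.prepared.pop(tid, None)


def two_phase_commit(masters, tid, op):
    """Run both phases among the masters; return (phase1_ok, phase2_ok)."""
    need = len(masters) // 2 + 1          # assumed: simple majority quorum
    voters = [m for m in masters if m.prepare(tid, op)]
    if len(voters) < need:
        for m in voters:                  # no quorum: undo the prepares
            m.abort(tid)
        return False, False
    done = [m for m in voters if m.commit(tid)]
    return True, len(done) >= need


masters = [Master("A"), Master("B"), Master("C", reachable=False)]
ok = two_phase_commit(masters, tid=1, op=("motd", b"hello"))

masters[1].reachable = False              # now only one master is reachable
failed = two_phase_commit(masters, tid=2, op=("motd", b"bye"))
```

In this model the client only ever talks to whichever master calls
two_phase_commit(); the two booleans correspond to the points at which a
client waiting for the phase-1 or phase-2 ack could be released.
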
				TRANSACTION LOG

Keeping a short-term transaction log, much less being able to properly replay
it, is fraught with difficulty and I've made it a separate development task.
For now HAMMER2 does not have one.
