dm/dmirror/dmirror_notes.txt

    Now that Alex has the basic lvm stuff in we need to add soft-raid-1
    to it.  I have some ideas on how it could be implemented.

    This is not set in stone at all; this is just me rattling off my
    RAID-1 implementation ideas.  It isn't quite as complex as it sounds,
    really!  I swear it isn't!  But if we could implement something like
    this we would have the best soft-raid-1 implementation around.

    Here are the basic problems which need to be solved:

	* Allow partial downtimes for pieces of the mirror such that
	  when the mirror becomes whole again the entire drive does not
	  have to be copied.  Instead only the segments of the drive that
	  are out of sync would be resynchronized.

	  We want to avoid having to completely resynchronize the entire
	  contents of a potentially multi-terabyte drive if one is
	  taken offline temporarily and then brought back online.

	* Allow mixed I/O errors on both drives making up the mirror
	  without taking the entire mirror offline.

	* Allow I/O read or write errors on one drive to degrade only
	  the related segment and not the whole drive.

	* Allow most writes to be asynchronous to the two drives making
	  up the mirror up to the synchronization point.  Avoid unnecessary
	  writes to the segment array on-media even through a synchronization
	  point.

	* Detect mirrors that are out of sync due to a system crash
	  occurring prior to a synchronization point (i.e. when the
	  drives themselves are just fine).  When this case occurs either
	  copy is valid and one must be selected, but then the selected
	  copy must be resynchronized to the other drive in the mirror
	  to prevent the read data from 'changing' randomly from the point
	  of view of whoever is reading it.

    And my idea on implementation:

	* Implement a segment descriptor array for each drive in the
	  mirror, breaking the drive down into large pieces.  For
	  example, 128MB per segment.  The segment array would be stored
	  on both disks making up the mirror.  In addition, each disk will
	  store the segment state for BOTH disks.

	  Thus a 1TBx2 mirror would have 8192 logical segments and 8192x4
	  segment descriptors (4 descriptors for each logical segment).
	  The segment descriptor array would ideally be small enough to
	  cache in-memory.  Being able to cache it in-memory simplifies
	  lookups.

	  A segment descriptor would be, oh I don't know... probably
	  16 bytes.  Leave room for expansion :-)  At 16 bytes the whole
	  array for the 1TBx2 example comes to 512KB.

	  Why does each disk need to store a segment descriptor for both
	  disks?  So we can 'remember' the state of the dead disk on the
	  live disk in order to resolve mismatches later on when the
	  dead disk comes back to life.

	* The state of the segment descriptor must be consulted when reading
	  or writing.  Some states are in-memory-only states while others
	  can exist on-media or in-memory.  The states are represented by
	  a set of bit flags:

	  MEDIA_UNSTABLE	0: The content is stable on-media and
				   fully synchronized.

				1: The content is unstable on-media
				   (writes have been made and have not
				    been completely synchronized to both
				    drives).

	  MEDIA_READ_DEGRADED	0: No I/O read error occurred on this segment
				1: I/O read error(s) occurred on this segment

	  MEDIA_WRITE_DEGRADED	0: No I/O write error occurred on this segment
				1: I/O write error(s) occurred on this segment

	  MEDIA_MASTER		0: Normal operation

				1: Mastership operation for this segment
				   on this drive, which is set when the
				   other drive in the mirror has failed
				   and writes are made to the drive that
				   is still operational.

	  UNINITIALIZED		0: The segment contains normal data.

				1: The entire segment is empty and should
				   read all zeros regardless of the actual
				   content on the media.

				   (Use for newly initialized mirrors as
				   a way to avoid formatting the whole
				   drive or SSD?).

	  OLD_UNSTABLE		Copy of original MEDIA_UNSTABLE bit initially
				read from the media.  This bit is only
				recopied after the related segment has been
				fully synchronized.

	  OLD_MASTER		Copy of original MEDIA_MASTER bit initially
				read from the media.  This bit is only
				recopied after the related segment has been
				fully synchronized.

	  We probably need room for a serial number or timestamp in the
	  segment descriptor as well in order to resolve certain situations.

	* Since updating a segment descriptor on-media is expensive
	  (requiring at least one disk synchronization command and of
	  course a nasty seek), segment descriptors on-media are updated
	  synchronously only when going from a STABLE to an UNSTABLE state,
	  meaning the segment is undergoing active writing.

	  Changing a segment descriptor from unstable to stable can be
	  delayed indefinitely (synchronized on a long timer, like
	  30 or 60 seconds).  All that happens if a crash occurs in the
	  meantime is that a little extra copying of segments occurs on
	  reboot.  Theoretically anyway.

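    A minimal sketch of the descriptor and sizing arithmetic described
    above.  Every identifier here (DM_SEG_SIZE, struct dm_seg_desc, the
    DM_* flag values, the helper functions) is hypothetical; the notes
    only fix the flag meanings, the 128MB example segment size and the
    "probably 16 bytes" descriptor size.  The field layout is one
    possible way to fill those 16 bytes, not a settled format.

```c
#include <stdint.h>

/* Example segment size from the notes: 128MB. */
#define DM_SEG_SIZE		(128ULL << 20)

/* Hypothetical bit assignments for the flags listed in the notes. */
enum dm_seg_flags {
	DM_MEDIA_UNSTABLE	= 0x01,	/* writes not yet synced to both drives */
	DM_MEDIA_READ_DEGRADED	= 0x02,	/* read error(s) seen on this segment */
	DM_MEDIA_WRITE_DEGRADED	= 0x04,	/* write error(s) seen on this segment */
	DM_MEDIA_MASTER		= 0x08,	/* this drive holds the only good copy */
	DM_UNINITIALIZED	= 0x10,	/* segment reads back as all zeros */
	DM_OLD_UNSTABLE		= 0x20,	/* MEDIA_UNSTABLE as first read from media */
	DM_OLD_MASTER		= 0x40	/* MEDIA_MASTER as first read from media */
};

/* One possible 16-byte layout (assumption, leaves room for a serial). */
struct dm_seg_desc {
	uint32_t flags;		/* enum dm_seg_flags bits */
	uint32_t uninit_bitmap;	/* possible UNINITIALIZED sub-bitmap */
	uint64_t serial;	/* serial number or timestamp */
};

/* Number of logical segments needed to cover one drive. */
static uint64_t
dm_seg_count(uint64_t disk_bytes)
{
	return (disk_bytes + DM_SEG_SIZE - 1) / DM_SEG_SIZE;
}

/* Total descriptor bytes: 2 drives, each storing the state of both. */
static uint64_t
dm_seg_array_bytes(uint64_t disk_bytes)
{
	return dm_seg_count(disk_bytes) * 4 * sizeof(struct dm_seg_desc);
}

/*
 * Mark a segment unstable before writing to it.  Returns 1 when the
 * descriptor changed and must be flushed to media synchronously before
 * the write proceeds, 0 when it was already unstable.  Going back to
 * stable is left to a slow timer and is not shown here.
 */
static int
dm_mark_unstable(struct dm_seg_desc *sd)
{
	if (sd->flags & DM_MEDIA_UNSTABLE)
		return 0;
	sd->flags |= DM_MEDIA_UNSTABLE;
	return 1;
}
```

    For the 1TBx2 example dm_seg_count() gives 8192 segments and
    dm_seg_array_bytes() gives 512KB, which is small enough to keep
    entirely in memory.
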
    Ok, now what actions need to be taken to satisfy a read or write?
    The actions taken will be based on the segment state for the segment
    involved in the I/O.  Any I/O which crosses a segment boundary would
    be split into two or more I/Os and treated separately.

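    Splitting an I/O at segment boundaries could look roughly like the
    following.  This is only a sketch: struct dm_io_piece and
    dm_split_io() are made-up names, and the 128MB segment size is the
    example value used in these notes.

```c
#include <stddef.h>
#include <stdint.h>

#define DM_SEG_SIZE	(128ULL << 20)	/* example segment size: 128MB */

/* One piece of an I/O, fully contained in a single segment. */
struct dm_io_piece {
	uint64_t offset;	/* byte offset on the logical mirror */
	uint64_t length;	/* byte count of this piece */
	uint64_t segment;	/* index of the segment it falls in */
};

/*
 * Split [offset, offset + length) into per-segment pieces.
 * Returns the number of pieces written to out (at most max_pieces).
 */
static size_t
dm_split_io(uint64_t offset, uint64_t length,
    struct dm_io_piece *out, size_t max_pieces)
{
	size_t n = 0;

	while (length > 0 && n < max_pieces) {
		uint64_t seg = offset / DM_SEG_SIZE;
		/* bytes left before the next segment boundary */
		uint64_t room = (seg + 1) * DM_SEG_SIZE - offset;
		uint64_t len = (length < room) ? length : room;

		out[n].offset = offset;
		out[n].length = len;
		out[n].segment = seg;
		n++;
		offset += len;
		length -= len;
	}
	return n;
}
```

    Each piece can then be dispatched on its own according to the state
    of its segment.
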
    Remember there are four descriptors for each segment, two on each drive:

	DISK1 STATE stored on disk1
	DISK2 STATE stored on disk1

	DISK1 STATE stored on disk2
	DISK2 STATE stored on disk2

    In order to simplify matters any inconsistencies between e.g. the DISK2
    state as stored on disk1 and the DISK2 state as stored on disk2 would
    be resolved immediately prior to initiation of the actual I/O.  Otherwise
    the combination of four states is just too complex.

    So if both drives are operational this resolution must take place.  If
    only one drive is operational then the state stored in the segment
    descriptors on that one operational drive is consulted to obtain the
    state of both drives.

    This is the hard part.  Let's take the mismatched cases first.  That is,
    when the DISK2 STATE stored on DISK1 is different from the DISK2 STATE
    stored on DISK2 (or vice versa... disk1 state stored on each drive):

	* If one of the two conflicting states has the UNSTABLE or MASTER
	  bits set then set the same bits in the other.

	  Basically just OR some of the bits together and store to
	  both copies.  But not all of the bits.

	* If doing a write operation and the segment is marked UNINITIALIZED
	  the entire segment must be zero-filled and the bit cleared prior
	  to the write operation. ????  (needs more thought, maybe even a
	  sub-bitmap. See later on in this email).

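    The "OR some of the bits" step might be sketched like this.  The
    flag values and the dm_resolve_copies() name are assumptions; the
    notes only say that UNSTABLE and MASTER propagate and that the
    other bits do not.

```c
#include <stdint.h>

/* Hypothetical bit values for the two flags that get merged. */
#define DM_MEDIA_UNSTABLE	0x01u
#define DM_MEDIA_MASTER		0x08u
#define DM_RESOLVE_MASK		(DM_MEDIA_UNSTABLE | DM_MEDIA_MASTER)

/*
 * Reconcile the two stored copies of one drive's segment state
 * (the copy kept on disk1 and the copy kept on disk2).  Only the
 * UNSTABLE and MASTER bits are merged; every other bit is left as
 * it was in each copy.  Returns nonzero if either copy changed and
 * therefore needs to be written back to media.
 */
static int
dm_resolve_copies(uint32_t *copy_on_disk1, uint32_t *copy_on_disk2)
{
	uint32_t merged = (*copy_on_disk1 | *copy_on_disk2) & DM_RESOLVE_MASK;
	uint32_t new1 = *copy_on_disk1 | merged;
	uint32_t new2 = *copy_on_disk2 | merged;
	int changed = (new1 != *copy_on_disk1) || (new2 != *copy_on_disk2);

	*copy_on_disk1 = new1;
	*copy_on_disk2 = new2;
	return changed;
}
```

    After this runs for both DISK1 STATE and DISK2 STATE, the two
    copies of each state agree on UNSTABLE and MASTER, which are the
    bits the read and write rules care about most.
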
    Ok, now that we have done that we can just consider two states, one for
    DISK1 and one for DISK2, coupled with the I/O operation:

    WHEN READING:

	* If MASTER is NOT set on either drive the read may be
	  sent to either drive.

	* If MASTER is set on one of the drives the read must be sent
	  only to that drive.

	* If MASTER is set on both drives then we are screwed.  This case
	  can occur if one of the mirror drives goes down and a bunch of
	  writes are made to the other, then the system is rebooted and the
	  original mirror drive comes up but the other drive goes down.

	  So this condition detects a conflict.  We must return an I/O
	  error for the READ, presumably.  The only way to resolve this
	  is for a manual intervention to explicitly select one or the
	  other drive as the master.

	* If READ_DEGRADED is set on one drive the read can be directed to
	  the other.  If READ_DEGRADED is set on both drives then either
	  drive can be selected.  If the read fails on any given drive
	  it is of course redispatched to the other drive regardless.

	  When READ_DEGRADED is set on one drive and only that drive is up
	  we still issue the read to that drive, obviously, since we have
	  no other choice.

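    The read rules above can be written down as a small decision
    function.  This is a sketch under assumptions: the names and flag
    values are invented, the up1/up2 arguments stand in for however
    drive availability ends up being tracked, and returning an error
    when the MASTER drive is down is my reading rather than something
    the notes spell out.  Redispatching a failed read to the other
    drive is not shown.

```c
#include <stdint.h>

#define DM_MEDIA_READ_DEGRADED	0x02u	/* hypothetical bit values */
#define DM_MEDIA_MASTER		0x08u

enum dm_read_choice {
	DM_READ_ERROR,		/* conflict or no usable drive: fail the read */
	DM_READ_DISK1,
	DM_READ_DISK2,
	DM_READ_EITHER
};

/* f1/f2: resolved segment flags for disk1/disk2; up1/up2: drive is online. */
static enum dm_read_choice
dm_pick_read(uint32_t f1, uint32_t f2, int up1, int up2)
{
	int m1 = (f1 & DM_MEDIA_MASTER) != 0;
	int m2 = (f2 & DM_MEDIA_MASTER) != 0;
	int d1 = (f1 & DM_MEDIA_READ_DEGRADED) != 0;
	int d2 = (f2 & DM_MEDIA_READ_DEGRADED) != 0;

	if (m1 && m2)			/* both claim mastership: conflict */
		return DM_READ_ERROR;
	if (m1)				/* only the master copy is valid */
		return up1 ? DM_READ_DISK1 : DM_READ_ERROR;
	if (m2)
		return up2 ? DM_READ_DISK2 : DM_READ_ERROR;

	if (!up1 && !up2)
		return DM_READ_ERROR;
	if (!up2)			/* no other choice, even if degraded */
		return DM_READ_DISK1;
	if (!up1)
		return DM_READ_DISK2;

	if (d1 && !d2)			/* prefer the non-degraded drive */
		return DM_READ_DISK2;
	if (d2 && !d1)
		return DM_READ_DISK1;
	return DM_READ_EITHER;
}
```
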
    WHEN WRITING:

	* If MASTER is NOT set on either drive the write is directed to
	  both drives.

	* If MASTER is set on only one drive the write is directed only
	  to that drive.

	* If both drives are marked MASTER the write is directed to both
	  drives.  This is a conflict situation on read but writing will
	  still work just fine.  The MASTER bit is left alone.

	* If an I/O error occurs on one of the drives the WRITE_DEGRADED
	  bit is set for that drive and the other drive (where the write
	  succeeded) is marked as MASTER.

	  However, we can only do this if neither drive is already a MASTER.

	  If a drive is already marked MASTER we cannot mark the other drive
	  as MASTER.  The failed write will cause an I/O error to be
	  returned.

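    A sketch of the write rules, again with invented names and flag
    values.  dm_write_targets() returns a bitmask of the drives to
    write to, and dm_write_failed() applies the error rule.  Leaving
    the flags untouched when a MASTER already exists is an assumption;
    the notes only say the write returns an I/O error in that case.

```c
#include <errno.h>
#include <stdint.h>

#define DM_MEDIA_WRITE_DEGRADED	0x04u	/* hypothetical bit values */
#define DM_MEDIA_MASTER		0x08u

#define DM_WRITE_DISK1		0x1	/* bits of the target mask */
#define DM_WRITE_DISK2		0x2

/* Decide which drives a write goes to, given the resolved flags. */
static int
dm_write_targets(uint32_t f1, uint32_t f2)
{
	int m1 = (f1 & DM_MEDIA_MASTER) != 0;
	int m2 = (f2 & DM_MEDIA_MASTER) != 0;

	if (m1 == m2)		/* neither or both MASTER: write to both */
		return DM_WRITE_DISK1 | DM_WRITE_DISK2;
	return m1 ? DM_WRITE_DISK1 : DM_WRITE_DISK2;
}

/*
 * Handle a write error on one drive.  failed points at the flags of
 * the drive whose write failed, other at the flags of the drive whose
 * write succeeded.  Returns 0 if the segment was degraded onto the
 * other drive, or EIO if the error must be passed up to the caller.
 */
static int
dm_write_failed(uint32_t *failed, uint32_t *other)
{
	if ((*failed | *other) & DM_MEDIA_MASTER)
		return EIO;	/* cannot hand mastership to the other drive */
	*failed |= DM_MEDIA_WRITE_DEGRADED;
	*other |= DM_MEDIA_MASTER;
	return 0;
}
```

    Once dm_write_failed() has moved mastership, later writes to that
    segment go only to the surviving drive until resynchronization
    clears the MASTER bit.
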
    RESYNCHRONIZATION:

	* A kernel thread is created to manage mirror synchronization.

	* Synchronization of out-of-sync mirror segments can occur
	  asynchronously, but must interlock against I/O operations
	  that might conflict.

	  The segment array on the drive(s) is used to determine what
	  segments need to be resynchronized.

	* Synchronization occurs when the segment for one drive is
	  marked MASTER and the segment for the other drive is not.

	* In a conflict situation (where both drives are marked MASTER
	  for any given segment) a manual intervention is required to
	  specify (e.g. through an ioctl) which of the two drives is
	  the master.  This overrides the MASTER bits for all segments
	  and allows synchronization to occur for all conflicting
	  segments (or possibly all segments, period, in the case where
	  a new mirror drive is being deployed).

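    What the synchronization thread decides per segment might look
    something like this.  All names are hypothetical.  The crash case
    (UNSTABLE set, no MASTER) picks disk1 as the source purely as an
    arbitrary choice, since the notes say either copy is valid there.
    Interlocking against in-flight I/O and the actual copy are left out.

```c
#include <stddef.h>
#include <stdint.h>

#define DM_MEDIA_UNSTABLE	0x01u	/* hypothetical bit values */
#define DM_MEDIA_MASTER		0x08u

enum dm_sync_plan {
	DM_SYNC_NONE,		/* segment is in sync, nothing to do */
	DM_SYNC_FROM_DISK1,	/* copy segment disk1 -> disk2 */
	DM_SYNC_FROM_DISK2,	/* copy segment disk2 -> disk1 */
	DM_SYNC_CONFLICT	/* both MASTER: needs manual intervention */
};

/* Decide what to do with one segment given both drives' flags. */
static enum dm_sync_plan
dm_plan_sync(uint32_t f1, uint32_t f2)
{
	int m1 = (f1 & DM_MEDIA_MASTER) != 0;
	int m2 = (f2 & DM_MEDIA_MASTER) != 0;

	if (m1 && m2)
		return DM_SYNC_CONFLICT;
	if (m1)
		return DM_SYNC_FROM_DISK1;
	if (m2)
		return DM_SYNC_FROM_DISK2;
	/* Crash before a sync point: either copy is valid, pick one. */
	if ((f1 | f2) & DM_MEDIA_UNSTABLE)
		return DM_SYNC_FROM_DISK1;
	return DM_SYNC_NONE;
}

/*
 * Manual override (e.g. driven by an ioctl): make master_disk (1 or 2)
 * the master for every segment so that all of them can be synchronized.
 */
static void
dm_force_master(uint32_t *f1, uint32_t *f2, size_t nsegs, int master_disk)
{
	for (size_t i = 0; i < nsegs; i++) {
		if (master_disk == 1) {
			f1[i] |= DM_MEDIA_MASTER;
			f2[i] &= ~DM_MEDIA_MASTER;
		} else {
			f2[i] |= DM_MEDIA_MASTER;
			f1[i] &= ~DM_MEDIA_MASTER;
		}
	}
}
```
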
    Segment array on-media and header.

	* The mirroring code must reserve some of the sectors on the
	  drives to hold a header and the segment array, making the
	  resulting logical mirror a bit smaller than it otherwise would
	  be.

	* The header must contain a unique serial number (the uuid code
	  can be used to generate it).

	* When manual intervention is required to specify a master a new
	  unique serial number must be generated for that master to
	  prevent 'old' mirror drives that were removed from the system
	  from being improperly recognized as being part of the new mirror
	  when they aren't any more.

	* Automatic detection of the mirror status is possible by using
	  the serial number in the header.

	* If the serial numbers for the header(s) for the two drives
	  making up the mirror do not match (when both drives are up and
	  both header read I/Os succeeded), manual intervention is required.

	* Auto-detection of mirror segments a la GEOM, using on-disk
	  headers, is discouraged.  I think it is too dangerous and would
	  much rather the detection be based on drive serial number rather
	  than serial numbers stored on-media in headers.

	  However, I guess this is a function of LVM?  So I might not have
	  any control over it.

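    The header serial check could be as simple as the following sketch.
    The 16-byte serial (uuid sized), struct dm_header and the status
    names are all assumptions made for illustration; a NULL header
    stands for a drive that is down or whose header read failed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical on-media header; only the serial matters here. */
struct dm_header {
	uint8_t serial[16];	/* unique serial, e.g. generated as a uuid */
};

enum dm_mirror_status {
	DM_MIRROR_OK,		/* both headers readable and serials match */
	DM_MIRROR_SINGLE,	/* only one header available */
	DM_MIRROR_NEEDS_MANUAL,	/* serial mismatch: operator must pick a master */
	DM_MIRROR_DOWN		/* neither header available */
};

/* Pass NULL for a drive that is down or whose header read failed. */
static enum dm_mirror_status
dm_check_headers(const struct dm_header *h1, const struct dm_header *h2)
{
	if (h1 == NULL && h2 == NULL)
		return DM_MIRROR_DOWN;
	if (h1 == NULL || h2 == NULL)
		return DM_MIRROR_SINGLE;
	if (memcmp(h1->serial, h2->serial, sizeof(h1->serial)) != 0)
		return DM_MIRROR_NEEDS_MANUAL;
	return DM_MIRROR_OK;
}
```
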
    The UNINITIALIZED FLAG

	When formatting a new mirror or when a drive is torn out and a new
	drive is added the drive(s) in question must be formatted.  To
	avoid actually writing to all sectors of the drive, which would
	take too long on multi-terabyte drives and create unnecessary
	writes on things like SSDs, we instead use an UNINITIALIZED flag
	state in the descriptor.

	If set any read I/O to the related segment is simply zero-filled.

	When writing we have to zero-fill the segment (write zeros to the
	whole 128MB segment) and then clear the UNINITIALIZED flag before
	allowing the write I/O to proceed.

	We might want to use some of the bits in the descriptor as a
	sub-bitmap.  e.g. if we reserve 4 bytes in the 16-byte descriptor
	to be an 'UNINITIALIZED' sub-bitmap we can break the 128MB
	segment down into 4MB pieces (32 bits x 4MB = 128MB) and only
	zero-fill/write portions of the 128MB segment instead of having
	to do the whole segment.

	I don't know how well this idea would work in real life.  Another
	option is to just return random data for the uninitialized portions
	of a new mirror but that kinda breaks the whole abstraction and
	could blow up certain types of filesystems, like ZFS, which
	assume any read data is stable on-media.

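    To make the sub-bitmap idea a little more concrete, here is a
    sketch assuming a 32-bit bitmap where a set bit means "this 4MB
    piece is still uninitialized".  The names are hypothetical, and
    offsets are relative to the start of the segment.  A real
    implementation could skip the zero-fill for pieces the write
    overwrites completely; that refinement is not shown.

```c
#include <stdint.h>

#define DM_SEG_SIZE	(128ULL << 20)	/* example segment size: 128MB */
#define DM_SUB_SIZE	(4ULL << 20)	/* 128MB / 32 bits = 4MB per bit */

/*
 * Bitmask of the 4MB pieces touched by [off, off + len) within one
 * segment.  Requires len > 0 and off + len <= DM_SEG_SIZE.
 */
static uint32_t
dm_sub_mask(uint64_t off, uint64_t len)
{
	uint64_t first = off / DM_SUB_SIZE;
	uint64_t last = (off + len - 1) / DM_SUB_SIZE;
	uint32_t mask = 0;

	for (uint64_t i = first; i <= last; i++)
		mask |= (uint32_t)1 << i;
	return mask;
}

/*
 * Prepare a write: returns the pieces that must be zero-filled on
 * media before the write proceeds, and clears them in the bitmap.
 */
static uint32_t
dm_sub_prepare_write(uint32_t *uninit_bitmap, uint64_t off, uint64_t len)
{
	uint32_t need = *uninit_bitmap & dm_sub_mask(off, len);

	*uninit_bitmap &= ~need;
	return need;
}

/* A read of an uninitialized piece is zero-filled instead of hitting media. */
static int
dm_sub_reads_zero(uint32_t uninit_bitmap, uint64_t off)
{
	return (uninit_bitmap >> (off / DM_SUB_SIZE)) & 1;
}
```

    With this a small write into a fresh segment zero-fills at most
    a couple of 4MB pieces instead of the whole 128MB.
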

						-Matt