xref: /dflybsd-src/sys/dev/disk/dm/dmirror/dmirror_notes.txt (revision f603807b2c8b9b8ca8a7a99e36eeb70cd39b460d)
    Now that Alex has the basic lvm stuff in we need to add soft-raid-1
    to it.  I have some ideas on how it could be implemented.

    This is not set in stone at all, this is just me rattling off my
    RAID-1 implementation ideas.  It isn't quite as complex as it sounds,
    really!  I swear it isn't!  But if we could implement something like
    this we would have the best soft-raid-1 implementation around.

    Here are the basic problems which need to be solved:

	* Allow partial downtimes for pieces of the mirror such that
	  when the mirror becomes whole again the entire drive does not
	  have to be copied.  Instead only the segments of the drive that
	  are out of sync would be resynchronized.

	  We want to avoid having to completely resynchronize the entire
	  contents of a potentially multi-terabyte drive if one is
	  taken offline temporarily and then brought back online.

	* Allow mixed I/O errors on both drives making up the mirror
	  without taking the entire mirror offline.

	* Allow I/O read or write errors on one drive to degrade only
	  the related segment and not the whole drive.

	* Allow most writes to be asynchronous to the two drives making
	  up the mirror up to the synchronization point.  Avoid unnecessary
	  writes to the segment array on-media even through a synchronization
	  point.

	* Detect mirrors that are out of sync due to a system crash
	  occurring prior to a synchronization point (i.e. when the
	  drives themselves are just fine).  When this case occurs either
	  copy is valid and one must be selected, but then the selected
	  copy must be resynchronized to the other drive in the mirror
	  to prevent the read data from 'changing' randomly from the point
	  of view of whoever is reading it.

    And my idea on implementation:

	* Implement a segment descriptor array for each drive in the
	  mirror, breaking the drive down into large pieces.  For
	  example, 128MB per segment.  The segment array would be stored
	  on both disks making up the mirror.  In addition, each disk will
	  store the segment state for BOTH disks.

	  Thus a 1TBx2 mirror would have 8192 logical segments and 8192x4
	  segment descriptors (4 segment descriptors for each logical
	  segment).  The segment descriptor array would ideally be small
	  enough to cache in-memory.  Being able to cache it in-memory
	  simplifies lookups.

	  A segment descriptor would be, oh I don't know... probably
	  16 bytes.  Leave room for expansion :-)

	  Why does each disk need to store a segment descriptor for both
	  disks?  So we can 'remember' the state of the dead disk on the
	  live disk in order to resolve mismatches later on when the
	  dead disk comes back to life.
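
    The sizing arithmetic above can be sketched in C.  This is only an
    illustration: SEG_SIZE, SEG_DESC_SIZE and both function names are
    made up here, and the 16-byte descriptor size is the tentative
    figure from the notes, not a settled format.

```c
#include <assert.h>
#include <stdint.h>

#define SEG_SIZE	(128ULL << 20)	/* 128MB per segment (from the notes) */
#define SEG_DESC_SIZE	16ULL		/* tentative descriptor size in bytes */

/* Logical segments needed to cover one drive, rounding up. */
static uint64_t
seg_count(uint64_t drive_bytes)
{
	assert(drive_bytes > 0);
	return (drive_bytes + SEG_SIZE - 1) / SEG_SIZE;
}

/* Descriptor bytes stored on ONE disk, which holds the state of BOTH disks. */
static uint64_t
seg_array_bytes_per_disk(uint64_t drive_bytes)
{
	return seg_count(drive_bytes) * 2 * SEG_DESC_SIZE;
}
```

    For a 1TB (2^40 byte) drive this gives 8192 segments and 256KB of
    descriptors per disk, so caching all four descriptor sets in memory
    would take about 512KB.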

	* The state of the segment descriptor must be consulted when reading
	  or writing.  Some states are in-memory-only states while others
	  can exist on-media or in-memory.  The states are represented by
	  a set of bit flags:

	  MEDIA_UNSTABLE	0: The content is stable on-media and
				   fully synchronized.

				1: The content is unstable on-media
				   (writes have been made and have not
				    been completely synchronized to both
				    drives).

	  MEDIA_READ_DEGRADED	0: No I/O read error occurred on this segment
				1: I/O read error(s) occurred on this segment

	  MEDIA_WRITE_DEGRADED	0: No I/O write error occurred on this segment
				1: I/O write error(s) occurred on this segment

	  MEDIA_MASTER		0: Normal operation

				1: Mastership operation for this segment
				   on this drive, which is set when the
				   other drive in the mirror has failed
				   and writes are made to the drive that
				   is still operational.

	  UNINITIALIZED		0: The segment contains normal data.

				1: The entire segment is empty and should
				   read all zeros regardless of the actual
				   content on the media.

				   (Use for newly initialized mirrors as
				   a way to avoid formatting the whole
				   drive or SSD?).

	  OLD_UNSTABLE		Copy of original MEDIA_UNSTABLE bit initially
				read from the media.  This bit is only
				recopied after the related segment has been
				fully synchronized.

	  OLD_MASTER		Copy of original MEDIA_MASTER bit initially
				read from the media.  This bit is only
				recopied after the related segment has been
				fully synchronized.

	  We probably need room for a serial number or timestamp in the
	  segment descriptor as well in order to resolve certain situations.
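
    One possible C rendering of the flags and a 16-byte descriptor.
    The bit values, the SEGF_ prefix, the field layout and the helper
    name are all assumptions made for illustration; the notes only fix
    the flag names, the rough size and the wish for a serial number or
    timestamp.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions are arbitrary picks, not a defined on-media format. */
#define SEGF_MEDIA_UNSTABLE		0x0001u
#define SEGF_MEDIA_READ_DEGRADED	0x0002u
#define SEGF_MEDIA_WRITE_DEGRADED	0x0004u
#define SEGF_MEDIA_MASTER		0x0008u
#define SEGF_UNINITIALIZED		0x0010u
#define SEGF_OLD_UNSTABLE		0x0020u
#define SEGF_OLD_MASTER			0x0040u

struct seg_desc {
	uint32_t flags;		/* SEGF_* bits */
	uint32_t reserved;	/* room for expansion */
	uint64_t serial;	/* serial number or timestamp */
};

static_assert(sizeof(struct seg_desc) == 16, "descriptor should be 16 bytes");

/* Recopy the MEDIA_UNSTABLE/MEDIA_MASTER bits into their OLD_* shadows. */
static void
seg_desc_snapshot_old(struct seg_desc *sd)
{
	sd->flags &= ~(SEGF_OLD_UNSTABLE | SEGF_OLD_MASTER);
	if (sd->flags & SEGF_MEDIA_UNSTABLE)
		sd->flags |= SEGF_OLD_UNSTABLE;
	if (sd->flags & SEGF_MEDIA_MASTER)
		sd->flags |= SEGF_OLD_MASTER;
}
```

    The snapshot helper would be called when the descriptor is first
    read from the media and again once the segment has been fully
    synchronized, matching the OLD_* descriptions above.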

	* Since updating a segment descriptor on-media is expensive
	  (requiring at least one disk synchronization command and of
	  course a nasty seek), segment descriptors on-media are updated
	  synchronously only when going from a STABLE to an UNSTABLE state,
	  meaning the segment is undergoing active writing.

	  Changing a segment descriptor from unstable to stable can be
	  delayed indefinitely (synchronized on a long timer, like
	  30 or 60 seconds).  All that happens if a crash occurs in the
	  meantime is that a little extra copying of segments occurs on
	  reboot.  Theoretically anyway.
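
    A rough sketch of that update policy, tracking the in-memory state
    separately from what was last written to the media.  The struct,
    the in-flight counter and the function names are hypothetical, and
    a "completed" write here is assumed to mean the data has reached
    both drives past a synchronization point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SEGF_MEDIA_UNSTABLE	0x0001u

struct seg_track {
	uint32_t mem_flags;	/* current in-memory state */
	uint32_t media_flags;	/* state last written to the media */
	unsigned inflight;	/* writes not yet synchronized to both drives */
};

/* Returns true when the descriptor must hit the media BEFORE the data write. */
static bool
seg_write_start(struct seg_track *st)
{
	st->inflight++;
	st->mem_flags |= SEGF_MEDIA_UNSTABLE;
	if (st->media_flags & SEGF_MEDIA_UNSTABLE)
		return false;		/* already unstable on-media, no extra I/O */
	st->media_flags |= SEGF_MEDIA_UNSTABLE;	/* stands in for the synchronous update */
	return true;
}

/* Called once a write is synchronized to both drives. */
static void
seg_write_done(struct seg_track *st)
{
	assert(st->inflight > 0);
	if (--st->inflight == 0)
		st->mem_flags &= ~SEGF_MEDIA_UNSTABLE;
}

/* Long timer (30-60s): returns true if the stable state should be written out. */
static bool
seg_timer_flush(struct seg_track *st)
{
	if (st->inflight == 0 && (st->media_flags & SEGF_MEDIA_UNSTABLE)) {
		st->media_flags &= ~SEGF_MEDIA_UNSTABLE;
		return true;
	}
	return false;
}
```

    Only the first write into a stable segment pays for a synchronous
    descriptor update; a crash before the timer fires just leaves the
    segment marked unstable on-media.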

    Ok, now what actions need to be taken to satisfy a read or write?
    The actions taken will be based on the segment state for the segment
    involved in the I/O.  Any I/O which crosses a segment boundary would
    be split into two or more I/Os and treated separately.
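
    Splitting at segment boundaries could look something like this.
    The function names are invented for the example, and the 128MB
    segment size is the example figure used earlier in these notes.

```c
#include <assert.h>
#include <stdint.h>

#define SEG_SIZE	(128ULL << 20)	/* 128MB per segment */

/* Length of the leading piece of the I/O that stays inside one segment. */
static uint64_t
seg_chunk_len(uint64_t offset, uint64_t length)
{
	uint64_t room = SEG_SIZE - (offset % SEG_SIZE);

	assert(length > 0);
	return (length < room) ? length : room;
}

/* Walk an I/O piece by piece; returns how many per-segment I/Os it becomes. */
static unsigned
seg_split_count(uint64_t offset, uint64_t length)
{
	unsigned n = 0;

	while (length > 0) {
		uint64_t chunk = seg_chunk_len(offset, length);

		/*
		 * A real implementation would dispatch (offset, chunk) here
		 * using the state of segment number offset / SEG_SIZE.
		 */
		offset += chunk;
		length -= chunk;
		n++;
	}
	return n;
}
```

    A 1KB I/O that starts 512 bytes before a segment boundary becomes
    two I/Os of 512 bytes each, one per segment.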

    Remember there are four descriptors for each segment, two on each drive:

	DISK1 STATE stored on disk1
	DISK2 STATE stored on disk1

	DISK1 STATE stored on disk2
	DISK2 STATE stored on disk2

    In order to simplify matters any inconsistencies between e.g. the DISK2
    state as stored on disk1 and the DISK2 state as stored on disk2 would
    be resolved immediately prior to initiation of the actual I/O.  Otherwise
    the combination of four states is just too complex.

    So if both drives are operational this resolution must take place.  If
    only one drive is operational then the state stored in the segment
    descriptors on that one operational drive is consulted to obtain the
    state of both drives.

    This is the hard part.  Let's take the mismatched cases first.  That is,
    when the DISK2 STATE stored on DISK1 is different from the DISK2 STATE
    stored on DISK2 (or vice versa... disk1 state stored on each drive):

	* If one of the two conflicting states has the UNSTABLE or MASTER
	  bits set then set the same bits in the other.

	  Basically just OR some of the bits together and store to
	  both copies.  But not all of the bits.

	* If doing a write operation and the segment is marked UNINITIALIZED
	  the entire segment must be zero-filled and the bit cleared prior
	  to the write operation. ????  (needs more thought, maybe even a
	  sub-bitmap. See later on in this email).
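
    The first mismatch rule might be sketched as below.  Only the
    UNSTABLE and MASTER bits are propagated, since those are the two
    the notes name; how the remaining bits should be reconciled is left
    open above, so this sketch leaves them untouched in each copy.  The
    names and bit values are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SEGF_MEDIA_UNSTABLE	0x0001u
#define SEGF_MEDIA_MASTER	0x0008u

/* The bits that get OR'd across the two stored copies. */
#define SEGF_PROPAGATE	(SEGF_MEDIA_UNSTABLE | SEGF_MEDIA_MASTER)

/*
 * copy_a and copy_b are the two stored copies of ONE disk's state
 * (e.g. DISK2 state on disk1 and DISK2 state on disk2).
 * Returns true if either copy changed and must be written back.
 */
static bool
seg_resolve(uint32_t *copy_a, uint32_t *copy_b)
{
	uint32_t merged, old_a, old_b;

	assert(copy_a != NULL && copy_b != NULL);
	merged = (*copy_a | *copy_b) & SEGF_PROPAGATE;
	old_a = *copy_a;
	old_b = *copy_b;
	*copy_a |= merged;
	*copy_b |= merged;
	return (*copy_a != old_a || *copy_b != old_b);
}
```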

    Ok, now that we have done that we can just consider two states, one for
    DISK1 and one for DISK2, coupled with the I/O operation:

    WHEN READING:

	* If MASTER is NOT set on either drive the read may be
	  sent to either drive.

	* If MASTER is set on one of the drives the read must be sent
	  only to that drive.

	* If MASTER is set on both drives then we are screwed.  This case
	  can occur if one of the mirror drives goes down and a bunch of
	  writes are made to the other, then the system is rebooted and the
	  original mirror drive comes up but the other drive goes down.

	  So this condition detects a conflict.  We must return an I/O
	  error for the READ, presumably.  The only way to resolve this
	  is for a manual intervention to explicitly select one or the
	  other drive as the master.

	* If READ_DEGRADED is set on one drive the read can be directed to
	  the other.  If READ_DEGRADED is set on both drives then either
	  drive can be selected.  If the read fails on any given drive
	  it is of course redispatched to the other drive regardless.

	  When READ_DEGRADED is set on one drive and only one drive is up
	  we still issue the read to that drive, obviously, since we have
	  no other choice.
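
    The read rules can be condensed into a small selection function.
    This is a sketch under assumptions: drives are indexed 0 and 1, -1
    stands for an I/O error, and returning an error when the MASTER
    drive is down is my reading, since that case is not spelled out
    above.  The function name and bit values are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SEGF_MEDIA_READ_DEGRADED	0x0002u
#define SEGF_MEDIA_MASTER		0x0008u

/*
 * st[i] is the resolved state of drive i for this segment, up[i] says
 * whether drive i is operational.  Returns the drive to try first,
 * or -1 when the read must fail.
 */
static int
read_pick(const uint32_t st[2], const bool up[2])
{
	bool m0, m1, d0, d1;

	assert(st != NULL && up != NULL);
	m0 = (st[0] & SEGF_MEDIA_MASTER) != 0;
	m1 = (st[1] & SEGF_MEDIA_MASTER) != 0;
	if (m0 && m1)
		return -1;		/* conflict, needs manual resolution */
	if (m0)
		return up[0] ? 0 : -1;	/* only the master may serve it */
	if (m1)
		return up[1] ? 1 : -1;

	if (!up[0] && !up[1])
		return -1;
	if (!up[0])
		return 1;		/* no other choice */
	if (!up[1])
		return 0;

	d0 = (st[0] & SEGF_MEDIA_READ_DEGRADED) != 0;
	d1 = (st[1] & SEGF_MEDIA_READ_DEGRADED) != 0;
	if (d0 && !d1)
		return 1;		/* steer away from the degraded copy */
	return 0;			/* otherwise either drive will do */
}
```

    The caller would still redispatch to the other drive if the chosen
    one returns an error, where the rules above allow it.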

    WHEN WRITING:

	* If MASTER is NOT set on either drive the write is directed to
	  both drives.

	* Otherwise a WRITE is directed only to the drive with MASTER set.

	* If both drives are marked MASTER the write is directed to both
	  drives.  This is a conflict situation on read but writing will
	  still work just fine.  The MASTER bit is left alone.

	* If an I/O error occurs on one of the drives the WRITE_DEGRADED
	  bit is set for that drive and the other drive (where the write
	  succeeded) is marked as MASTER.

	  However, we can only do this if neither drive is already a MASTER.

	  If a drive is already marked MASTER we cannot mark the other drive
	  as MASTER.  The failed write will cause an I/O error to be
	  returned.
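
    The write rules, sketched the same way.  Drive sets are bitmasks
    (DISK1, DISK2), 0 means success and -1 an I/O error.  All names are
    hypothetical.  I am assuming that "we can only do this" covers both
    the WRITE_DEGRADED and the MASTER update, and that a write which
    succeeds on one drive while degrading the other is reported as
    successful.

```c
#include <assert.h>
#include <stdint.h>

#define SEGF_MEDIA_WRITE_DEGRADED	0x0004u
#define SEGF_MEDIA_MASTER		0x0008u

#define DISK1	0x1u	/* bit 0 of a drive mask */
#define DISK2	0x2u	/* bit 1 of a drive mask */

/* Which drives should receive the write for this segment. */
static unsigned
write_targets(const uint32_t st[2])
{
	int m0 = (st[0] & SEGF_MEDIA_MASTER) != 0;
	int m1 = (st[1] & SEGF_MEDIA_MASTER) != 0;

	if (m0 == m1)
		return DISK1 | DISK2;	/* neither or both are MASTER */
	return m0 ? DISK1 : DISK2;	/* only the MASTER drive */
}

/* 'failed' is the subset of 'targets' whose write returned an error. */
static int
write_complete(uint32_t st[2], unsigned targets, unsigned failed)
{
	int bad;

	assert(targets != 0 && (failed & ~targets) == 0);
	if (failed == 0)
		return 0;
	if (failed == targets)
		return -1;		/* nothing succeeded */
	if ((st[0] | st[1]) & SEGF_MEDIA_MASTER)
		return -1;		/* cannot hand mastership over */

	bad = (failed & DISK1) ? 0 : 1;
	st[bad] |= SEGF_MEDIA_WRITE_DEGRADED;
	st[1 - bad] |= SEGF_MEDIA_MASTER;
	return 0;
}
```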

    RESYNCHRONIZATION:

	* A kernel thread is created to manage mirror synchronization.

	* Synchronization of out-of-sync mirror segments can occur
	  asynchronously, but must interlock against I/O operations
	  that might conflict.

	  The segment array on the drive(s) is used to determine what
	  segments need to be resynchronized.

	* Synchronization occurs when the segment for one drive is
	  marked MASTER and the segment for the other drive is not.

	* In a conflict situation (where both drives are marked MASTER
	  for any given segment) a manual intervention is required to
	  specify (e.g. through an ioctl) which of the two drives is
	  the master.  This overrides the MASTER bits for all segments
	  and allows synchronization to occur for all conflicting
	  segments (or possibly all segments, period, in the case where
	  a new mirror drive is being deployed).
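
    A sketch of how the resync thread might pick its source drive, and
    of one possible form of the manual override.  The names and return
    codes are invented.  The override shown only touches segments where
    the losing drive claims mastership; forcing a full copy for a newly
    deployed drive would need something stronger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEGF_MEDIA_MASTER	0x0008u

enum { RESYNC_NONE = -1, RESYNC_CONFLICT = -2 };

/* Returns the drive to copy FROM (0 or 1), or one of the codes above. */
static int
resync_source(const uint32_t st[2])
{
	int m0 = (st[0] & SEGF_MEDIA_MASTER) != 0;
	int m1 = (st[1] & SEGF_MEDIA_MASTER) != 0;

	if (m0 && m1)
		return RESYNC_CONFLICT;	/* wait for manual intervention */
	if (m0)
		return 0;
	if (m1)
		return 1;
	return RESYNC_NONE;
}

/*
 * Manual override (e.g. from an ioctl handler): 'winner' takes over
 * mastership wherever the other drive claims it.  Returns the number
 * of segments changed.
 */
static size_t
force_master(uint32_t (*segs)[2], size_t nseg, int winner)
{
	size_t i, changed = 0;

	assert(winner == 0 || winner == 1);
	for (i = 0; i < nseg; i++) {
		if (segs[i][1 - winner] & SEGF_MEDIA_MASTER) {
			segs[i][1 - winner] &= ~SEGF_MEDIA_MASTER;
			segs[i][winner] |= SEGF_MEDIA_MASTER;
			changed++;
		}
	}
	return changed;
}
```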

    Segment array on-media and header.

	* The mirroring code must reserve some of the sectors on the
	  drives to hold a header and the segment array, making the
	  resulting logical mirror a bit smaller than it otherwise would
	  be.

	* The header must contain a unique serial number (the uuid code
	  can be used to generate it).

	* When manual intervention is required to specify a master a new
	  unique serial number must be generated for that master to
	  prevent 'old' mirror drives that were removed from the system
	  from being improperly recognized as being part of the new mirror
	  when they aren't any more.

	* Automatic detection of the mirror status is possible by using
	  the serial number in the header.

	* If the serial numbers for the header(s) for the two drives
	  making up the mirror do not match (when both drives are up and
	  both header read I/Os succeeded), manual intervention is required.

	* Auto-detection of mirror segments a la GEOM, using on-disk headers,
	  is discouraged.  I think it is too dangerous and would much rather
	  the detection be based on drive serial number rather than serial
	  numbers stored on-media in headers.

	  However, I guess this is a function of LVM?  So I might not have
	  any control over it.
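
    The header serial number check from the bullets above could be as
    simple as the following.  The header layout, the 16-byte
    (uuid-sized) serial and all names are assumptions; nothing here is
    an existing on-media format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SERIAL_LEN	16	/* uuid-sized serial number (assumed) */

struct mirror_hdr {
	unsigned char serial[SERIAL_LEN];
};

enum hdr_verdict {
	HDR_MATCH,		/* both headers read, serials agree */
	HDR_SINGLE,		/* only one header could be read */
	HDR_NEED_MANUAL,	/* both read, serials differ */
	HDR_NONE		/* neither header could be read */
};

/* ok[i] is true when drive i is up and its header read succeeded. */
static enum hdr_verdict
hdr_check(const struct mirror_hdr hdr[2], const bool ok[2])
{
	assert(hdr != NULL && ok != NULL);
	if (!ok[0] && !ok[1])
		return HDR_NONE;
	if (!ok[0] || !ok[1])
		return HDR_SINGLE;
	if (memcmp(hdr[0].serial, hdr[1].serial, SERIAL_LEN) != 0)
		return HDR_NEED_MANUAL;
	return HDR_MATCH;
}
```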

    The UNINITIALIZED FLAG

	When formatting a new mirror or when a drive is torn out and a new
	drive is added the drive(s) in question must be formatted.  To
	avoid actually writing to all sectors of the drive, which would
	take too long on multi-terabyte drives and create unnecessary
	writes on things like SSDs, we instead use an UNINITIALIZED flag
	state in the descriptor.

	If set, any read I/O to the related segment is simply zero-filled.

	When writing we have to zero-fill the segment (write zeros to the
	whole 128MB segment) and then clear the UNINITIALIZED flag before
	allowing the write I/O to proceed.

	We might want to use some of the bits in the descriptor as a
	sub-bitmap.  e.g. if we reserve 4 bytes in the 16-byte descriptor
	to be an 'UNINITIALIZED' sub-bitmap we can break the 128MB
	segment down into 4MB pieces and only zero-fill/write portions
	of the 128MB segment instead of having to do the whole segment.
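
	The sub-bitmap idea works out evenly: 32 bits times 4MB is
	exactly 128MB.  A sketch, where a set bit is assumed to mean
	"this 4MB piece is still uninitialized" and the names are
	made up:

```c
#include <assert.h>
#include <stdint.h>

#define SEG_SIZE	(128ULL << 20)	/* 128MB segment */
#define PIECE_SIZE	(4ULL << 20)	/* 4MB piece, 32 per segment */

/* Bitmask of the 4MB pieces touched by [off, off+len) within a segment. */
static uint32_t
piece_mask(uint64_t off, uint64_t len)
{
	uint64_t first, last, i;
	uint32_t mask = 0;

	assert(len > 0 && off < SEG_SIZE && len <= SEG_SIZE - off);
	first = off / PIECE_SIZE;
	last = (off + len - 1) / PIECE_SIZE;
	for (i = first; i <= last; i++)
		mask |= (uint32_t)1 << i;
	return mask;
}

/* Pieces that must be zero-filled before this write may proceed. */
static uint32_t
pieces_to_zero(uint32_t uninit_bitmap, uint64_t off, uint64_t len)
{
	return uninit_bitmap & piece_mask(off, len);
}
```

	After zero-filling the returned pieces the writer would clear
	those bits in the descriptor; a read that touches a piece whose
	bit is still set would return zeros for that piece.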

	I don't know how well this idea would work in real life.  Another
	option is to just return random data for the uninitialized portions
	of a new mirror but that kinda breaks the whole abstraction and
	could blow up certain types of filesystems, like ZFS, which
	assume any read data is stable on-media.


						-Matt