Now that Alex has the basic lvm stuff in we need to add soft-raid-1
to it.  I have some ideas on how it could be implemented.

This is not set in stone at all, this is just me rattling off my
RAID-1 implementation ideas.  It isn't quite as complex as it sounds,
really!  I swear it isn't!  But if we could implement something like
this we would have the best soft-raid-1 implementation around.

Here are the basic problems which need to be solved:

    * Allow partial downtimes for pieces of the mirror such that
      when the mirror becomes whole again the entire drive does not
      have to be copied.  Instead only the segments of the drive that
      are out of sync would be resynchronized.

      We want to avoid having to completely resynchronize the entire
      contents of a potentially multi-terabyte drive if one is
      taken offline temporarily and then brought back online.

    * Allow mixed I/O errors on both drives making up the mirror
      without taking the entire mirror offline.

    * Allow I/O read or write errors on one drive to degrade only
      the related segment and not the whole drive.

    * Allow most writes to be asynchronous to the two drives making
      up the mirror up to the synchronization point.
      Avoid unnecessary writes to the segment array on-media even
      through a synchronization point.

    * Detect mirrors that are out of sync due to a system crash
      occurring prior to a synchronization point (i.e. when the
      drives themselves are just fine).  When this case occurs either
      copy is valid and one must be selected, but then the selected
      copy must be resynchronized to the other drive in the mirror
      to prevent the read data from 'changing' randomly from the point
      of view of whoever is reading it.

And my idea on implementation:

    * Implement a segment descriptor array for each drive in the
      mirror, breaking the drive down into large pieces.  For
      example, 128MB per segment.  The segment array would be stored
      on both disks making up the mirror.  In addition, each disk will
      store the segment state for BOTH disks.

      Thus a 1TBx2 mirror would have 8192x4 segments (4 segment
      descriptors for each logical segment).  The segment descriptor
      array would ideally be small enough to cache in-memory.  Being
      able to cache it in-memory simplifies lookups.

      A segment descriptor would be, oh I don't know... probably
      16 bytes.
      Leave room for expansion :-)

      Why does each disk need to store a segment descriptor for both
      disks?  So we can 'remember' the state of the dead disk on the
      live disk in order to resolve mismatches later on when the
      dead disk comes back to life.

    * The state of the segment descriptor must be consulted when reading
      or writing.  Some states are in-memory-only states while others
      can exist on-media or in-memory.  The states are represented by
      a set of bit flags:

        MEDIA_UNSTABLE          0: The content is stable on-media and
                                   fully synchronized.

                                1: The content is unstable on-media
                                   (writes have been made and have not
                                   been completely synchronized to both
                                   drives).

        MEDIA_READ_DEGRADED     0: No I/O read error occurred on this segment
                                1: I/O read error(s) occurred on this segment

        MEDIA_WRITE_DEGRADED    0: No I/O write error occurred on this segment
                                1: I/O write error(s) occurred on this segment

        MEDIA_MASTER            0: Normal operation

                                1: Mastership operation for this segment
                                   on this drive, which is set when the
                                   other drive in the mirror has failed
                                   and writes are made to the drive that
                                   is still operational.

        UNINITIALIZED           0: The segment contains normal data.

                                1: The entire segment is empty and should
                                   read all zeros regardless of the actual
                                   content on the media.

                                   (Use for newly initialized mirrors as
                                   a way to avoid formatting the whole
                                   drive or SSD?).

        OLD_UNSTABLE            Copy of original MEDIA_UNSTABLE bit initially
                                read from the media.  This bit is only
                                recopied after the related segment has been
                                fully synchronized.

        OLD_MASTER              Copy of original MEDIA_MASTER bit initially
                                read from the media.  This bit is only
                                recopied after the related segment has been
                                fully synchronized.

      We probably need room for a serial number or timestamp in the
      segment descriptor as well in order to resolve certain situations.

    * Since updating a segment descriptor on-media is expensive
      (requiring at least one disk synchronization command and of
      course a nasty seek), segment descriptors on-media are updated
      synchronously only when going from a STABLE to an UNSTABLE state,
      meaning the segment is undergoing active writing.

      Changing a segment descriptor from unstable to stable can be
      delayed indefinitely (synchronized on a long timer, like
      30 or 60 seconds).  All that happens if a crash occurs in the
      meantime is a little extra copying of segments on reboot.
      Theoretically anyway.

Ok, now what actions need to be taken to satisfy a read or write?
The actions taken will be based on the segment state for the segment
involved in the I/O.  Any I/O which crosses a segment boundary would
be split into two or more I/Os and treated separately.

Remember there are four descriptors for each segment, two on each drive:

    DISK1 STATE stored on disk1
    DISK2 STATE stored on disk1

    DISK1 STATE stored on disk2
    DISK2 STATE stored on disk2

In order to simplify matters any inconsistencies between e.g. the DISK2
state as stored on disk1 and the DISK2 state as stored on disk2 would
be resolved immediately prior to initiation of the actual I/O.  Otherwise
the combination of four states is just too complex.

So if both drives are operational this resolution must take place.
If only one drive is operational then the state stored in the segment
descriptors on that one operational drive is consulted to obtain the
state of both drives.

This is the hard part.  Let's take the mismatched cases first.  That is,
when the DISK2 STATE stored on DISK1 is different from the DISK2 STATE
stored on DISK2 (or vice-versa... disk1 state stored on each drive):

    * If one of the two conflicting states has the UNSTABLE or MASTER
      bits set then set the same bits in the other.

      Basically just OR some of the bits together and store to
      both copies.  But not all of the bits.

    * If doing a write operation and the segment is marked UNINITIALIZED
      the entire segment must be zero-filled and the bit cleared prior
      to the write operation.  ???? (needs more thought, maybe even a
      sub-bitmap.  See later on in this email).

Ok, now that we have done that we can just consider two states, one for
DISK1 and one for DISK2, coupled with the I/O operation:

WHEN READING:

    * If MASTER is NOT set on either drive the read may be
      sent to either drive.

    * If MASTER is set on one of the drives the read must be sent
      only to that drive.

    * If MASTER is set on both drives then we are screwed.  This case
      can occur if one of the mirror drives goes down and a bunch of
      writes are made to the other, then the system is rebooted and the
      original mirror drive comes up but the other drive goes down.

      So this condition detects a conflict.  We must return an I/O
      error for the READ, presumably.  The only way to resolve this
      is for manual intervention to explicitly select one or the
      other drive as the master.

    * If READ_DEGRADED is set on one drive the read can be directed to
      the other.  If READ_DEGRADED is set on both drives then either
      drive can be selected.  If the read fails on any given drive
      it is of course redispatched to the other drive regardless.

      When READ_DEGRADED is set on one drive and only one drive is up
      we still issue the read to that drive, obviously, since we have
      no other choice.

WHEN WRITING:

    * If MASTER is NOT set on either drive the write is directed to
      both drives.

    * Otherwise a WRITE is directed only to the drive with MASTER set.

    * If both drives are marked MASTER the write is directed to both
      drives.
      This is a conflict situation on read but writing will
      still work just fine.  The MASTER bit is left alone.

    * If an I/O error occurs on one of the drives the WRITE_DEGRADED
      bit is set for that drive and the other drive (where the write
      succeeded) is marked as MASTER.

      However, we can only do this if neither drive is already a MASTER.

      If a drive is already marked MASTER we cannot mark the other drive
      as MASTER.  The failed write will cause an I/O error to be
      returned.

RESYNCHRONIZATION:

    * A kernel thread is created to manage mirror synchronization.

    * Synchronization of out-of-sync mirror segments can occur
      asynchronously, but must interlock against I/O operations
      that might conflict.

      The segment array on the drive(s) is used to determine what
      segments need to be resynchronized.

    * Synchronization occurs when the segment for one drive is
      marked MASTER and the segment for the other drive is not.

    * In a conflict situation (where both drives are marked MASTER
      for any given segment) manual intervention is required to
      specify (e.g. through an ioctl) which of the two drives is
      the master.
      This overrides the MASTER bits for all segments
      and allows synchronization to occur for all conflicting
      segments (or possibly all segments, period, in the case where
      a new mirror drive is being deployed).

Segment array on-media and header.

    * The mirroring code must reserve some of the sectors on the
      drives to hold a header and the segment array, making the
      resulting logical mirror a bit smaller than it otherwise would
      be.

    * The header must contain a unique serial number (the uuid code
      can be used to generate it).

    * When manual intervention is required to specify a master a new
      unique serial number must be generated for that master to
      prevent 'old' mirror drives that were removed from the system
      from being improperly recognized as being part of the new mirror
      when they aren't any more.

    * Automatic detection of the mirror status is possible by using
      the serial number in the header.

    * If the serial numbers for the header(s) for the two drives
      making up the mirror do not match (when both drives are up and
      both header read I/Os succeeded), manual intervention is required.

    * Auto-detection of mirror segments ala Geom...
      using on-disk headers, is discouraged.  I think it is too
      dangerous and would much rather the detection be based on drive
      serial number rather than serial numbers stored on-media in
      headers.

      However, I guess this is a function of LVM?  So I might not have
      any control over it.

The UNINITIALIZED FLAG

When formatting a new mirror or when a drive is torn out and a new
drive is added the drive(s) in question must be formatted.  To
avoid actually writing to all sectors of the drive, which would
take too long on multi-terabyte drives and create unnecessary
writes on things like SSDs we instead use an UNINITIALIZED flag
state in the descriptor.

If set any read I/O to the related segment is simply zero-filled.

When writing we have to zero-fill the segment (write zeros to the
whole 128MB segment) and then clear the UNINITIALIZED flag before
allowing the write I/O to proceed.

We might want to use some of the bits in the descriptor as a
sub-bitmap.  e.g. if we reserve 4 bytes in the 16-byte descriptor
to be an 'UNINITIALIZED' sub-bitmap we can break the 128MB
segment down into 4MB pieces and only zero-fill/write portions
of the 128MB segment instead of having to do the whole segment.
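
To make the sub-bitmap arithmetic concrete, here is a minimal C sketch.
It assumes one bit per 4MB piece (32 bits x 4MB = 128MB) with a set bit
meaning "still uninitialized".  The names sub_mask(), sub_need_zero()
and sub_reads_zero() are made up for illustration and are not part of
any existing code.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes from the notes: 128MB segments, a 4-byte (32-bit) sub-bitmap. */
#define SEG_SIZE	(128ULL * 1024 * 1024)
#define SUB_BITS	32
#define SUB_SIZE	(SEG_SIZE / SUB_BITS)	/* 4MB covered by each bit */

/*
 * Hypothetical helper: mask of sub-bitmap bits covering the byte range
 * [off, off + len) inside a single segment.
 */
static uint32_t
sub_mask(uint64_t off, uint64_t len)
{
	uint32_t first, last, nbits;

	assert(len > 0 && off < SEG_SIZE && len <= SEG_SIZE - off);
	first = (uint32_t)(off / SUB_SIZE);
	last = (uint32_t)((off + len - 1) / SUB_SIZE);
	nbits = last - first + 1;
	if (nbits == SUB_BITS)		/* avoid shifting by the full width */
		return 0xFFFFFFFFu;
	return ((1u << nbits) - 1) << first;
}

/*
 * Pieces that must be zero-filled before a write may proceed: the ones
 * the write touches that are still flagged uninitialized.  The caller
 * would clear these bits once the zero-fill has completed.
 */
static uint32_t
sub_need_zero(uint32_t uninit, uint64_t off, uint64_t len)
{
	return uninit & sub_mask(off, len);
}

/* Nonzero if a read at this offset should simply return zeros. */
static int
sub_reads_zero(uint32_t uninit, uint64_t off)
{
	assert(off < SEG_SIZE);
	return (uninit >> (off / SUB_SIZE)) & 1;
}
```

With this layout a small write into a fresh segment zero-fills at most
the 4MB pieces it touches rather than the whole 128MB.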

I don't know how well this idea would work in real life.  Another
option is to just return random data for the uninitialized portions
of a new mirror but that kinda breaks the whole abstraction and
could blow up certain types of filesystems, like ZFS, which
assume any read data is stable on-media.


                                        -Matt
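
For illustration only, the mismatch, READING and WRITING rules above
could be sketched in C roughly as follows.  The flag bit values, the
T_* target codes and all function names are assumptions invented for
this example; the notes only name the flags.  The sketch assumes both
drives are up, and it also assumes WRITE_DEGRADED gets recorded even
when the surviving drive cannot be promoted to MASTER.

```c
#include <assert.h>
#include <stdint.h>

/* Flag names from the notes; the bit values are assumptions. */
#define MEDIA_UNSTABLE		0x01
#define MEDIA_READ_DEGRADED	0x02
#define MEDIA_WRITE_DEGRADED	0x04
#define MEDIA_MASTER		0x08

/* Hypothetical I/O target codes. */
#define T_NONE	0x0			/* conflict, return an I/O error */
#define T_DISK1	0x1
#define T_DISK2	0x2
#define T_BOTH	(T_DISK1 | T_DISK2)	/* for reads: either drive will do */

/* Two copies of one disk's state disagree: OR UNSTABLE/MASTER into both. */
static void
resolve_mismatch(uint32_t *a, uint32_t *b)
{
	uint32_t sticky = (*a | *b) & (MEDIA_UNSTABLE | MEDIA_MASTER);

	*a |= sticky;
	*b |= sticky;
}

/* Pick the read target from the resolved DISK1 and DISK2 states. */
static int
read_target(uint32_t d1, uint32_t d2)
{
	int m1 = (d1 & MEDIA_MASTER) != 0;
	int m2 = (d2 & MEDIA_MASTER) != 0;
	int r1 = (d1 & MEDIA_READ_DEGRADED) != 0;
	int r2 = (d2 & MEDIA_READ_DEGRADED) != 0;

	if (m1 && m2)
		return T_NONE;		/* needs manual intervention */
	if (m1)
		return T_DISK1;
	if (m2)
		return T_DISK2;
	if (r1 != r2)			/* steer away from the degraded drive */
		return r1 ? T_DISK2 : T_DISK1;
	return T_BOTH;
}

/* Pick the write target(s). */
static int
write_target(uint32_t d1, uint32_t d2)
{
	int m1 = (d1 & MEDIA_MASTER) != 0;
	int m2 = (d2 & MEDIA_MASTER) != 0;

	if (m1 == m2)			/* neither or both are MASTER */
		return T_BOTH;
	return m1 ? T_DISK1 : T_DISK2;
}

/*
 * A write failed on one drive (T_DISK1 or T_DISK2).  Returns 0 if the
 * other drive was promoted to MASTER, -1 if an I/O error must be
 * returned because a MASTER already exists.
 */
static int
write_error(uint32_t *d1, uint32_t *d2, int failed)
{
	uint32_t *bad, *good;

	assert(failed == T_DISK1 || failed == T_DISK2);
	bad = (failed == T_DISK1) ? d1 : d2;
	good = (failed == T_DISK1) ? d2 : d1;

	*bad |= MEDIA_WRITE_DEGRADED;
	if ((*d1 | *d2) & MEDIA_MASTER)
		return -1;
	*good |= MEDIA_MASTER;
	return 0;
}
```

Redispatching a failed read to the other drive and the single-drive
case are left out to keep the sketch short.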