.\" $NetBSD: raidctl.8,v 1.52 2008/05/02 18:11:05 martin Exp $
.\"
.\" Copyright (c) 1998, 2002 The NetBSD Foundation, Inc.
.\" All rights reserved.
.\"
.\" This code is derived from software contributed to The NetBSD Foundation
.\" by Greg Oster
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
.\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
.\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
.\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
.\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
.\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
.\" POSSIBILITY OF SUCH DAMAGE.
.\"
.\"
.\" Copyright (c) 1995 Carnegie-Mellon University.
.\" All rights reserved.
.\"
.\" Author: Mark Holland
.\"
.\" Permission to use, copy, modify and distribute this software and
.\" its documentation is hereby granted, provided that both the copyright
.\" notice and this permission notice appear in all copies of the
.\" software, derivative works or modified versions, and any portions
.\" thereof, and that both notices appear in supporting documentation.
.\"
.\" CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
.\" CONDITION.  CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
.\" FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
.\"
.\" Carnegie Mellon requests users of this software to return to
.\"
.\"  Software Distribution Coordinator  or  Software.Distribution@CS.CMU.EDU
.\"  School of Computer Science
.\"  Carnegie Mellon University
.\"  Pittsburgh PA 15213-3890
.\"
.\" any improvements or extensions that they make and grant Carnegie the
.\" rights to redistribute these changes.
.\"
.Dd August 6, 2007
.Dt RAIDCTL 8
.Os
.Sh NAME
.Nm raidctl
.Nd configuration utility for the RAIDframe disk driver
.Sh SYNOPSIS
.Nm
.Op Fl v
.Fl a Ar component Ar dev
.Nm
.Op Fl v
.Fl A Op yes | no | root
.Ar dev
.Nm
.Op Fl v
.Fl B Ar dev
.Nm
.Op Fl v
.Fl c Ar config_file Ar dev
.Nm
.Op Fl v
.Fl C Ar config_file Ar dev
.Nm
.Op Fl v
.Fl f Ar component Ar dev
.Nm
.Op Fl v
.Fl F Ar component Ar dev
.Nm
.Op Fl v
.Fl g Ar component Ar dev
.Nm
.Op Fl v
.Fl G Ar dev
.Nm
.Op Fl v
.Fl i Ar dev
.Nm
.Op Fl v
.Fl I Ar serial_number Ar dev
.Nm
.Op Fl v
.Fl p Ar dev
.Nm
.Op Fl v
.Fl P Ar dev
.Nm
.Op Fl v
.Fl r Ar component Ar dev
.Nm
.Op Fl v
.Fl R Ar component Ar dev
.Nm
.Op Fl v
.Fl s Ar dev
.Nm
.Op Fl v
.Fl S Ar dev
.Nm
.Op Fl v
.Fl u Ar dev
.Sh DESCRIPTION
.Nm
is the user-land control program for
.Xr raid 4 ,
the RAIDframe disk device.
.Nm
is primarily used to dynamically configure and unconfigure RAIDframe disk
devices.
For more information about the RAIDframe disk device, see
.Xr raid 4 .
.Pp
This document assumes the reader has at least rudimentary knowledge of
RAID and RAID concepts.
.Pp
The command-line options for
.Nm
are as follows:
.Bl -tag -width indent
.It Fl a Ar component Ar dev
Add
.Ar component
as a hot spare for the device
.Ar dev .
Component labels (which identify the location of a given
component within a particular RAID set) are automatically added to the
hot spare after it has been used and are not required for
.Ar component
before it is used.
.It Fl A Ic yes Ar dev
Make the RAID set auto-configurable.
The RAID set will be automatically configured at boot
.Ar before
the root file system is mounted.
Note that all components of the set must be of type
.Dv RAID
in the disklabel.
.It Fl A Ic no Ar dev
Turn off auto-configuration for the RAID set.
.It Fl A Ic root Ar dev
Make the RAID set auto-configurable, and also mark the set as being
eligible to be the root partition.
A RAID set configured this way will
.Ar override
the use of the boot disk as the root device.
All components of the set must be of type
.Dv RAID
in the disklabel.
Note that only certain architectures
.Pq currently alpha, i386, pmax, sparc, sparc64, and vax
support booting a kernel directly from a RAID set.
.It Fl B Ar dev
Initiate a copyback of reconstructed data from a spare disk to
its original disk.
This is performed after a component has failed,
and the failed drive has been reconstructed onto a spare drive.
.It Fl c Ar config_file Ar dev
Configure the RAIDframe device
.Ar dev
according to the configuration given in
.Ar config_file .
A description of the contents of
.Ar config_file
is given later.
.It Fl C Ar config_file Ar dev
As for
.Fl c ,
but forces the configuration to take place.
This is required the first time a RAID set is configured.
.It Fl f Ar component Ar dev
This marks the specified
.Ar component
as having failed, but does not initiate a reconstruction of that component.
.It Fl F Ar component Ar dev
Fails the specified
.Ar component
of the device, and immediately begins a reconstruction of the failed
disk onto an available hot spare.
This is one of the mechanisms used to start
the reconstruction process if a component does have a hardware failure.
.It Fl g Ar component Ar dev
Get the component label for the specified component.
.It Fl G Ar dev
Generate the configuration of the RAIDframe device in a format suitable for
use with the
.Fl c
or
.Fl C
options.
.It Fl i Ar dev
Initialize the RAID device.
In particular, (re-)write the parity on the selected device.
This
.Em MUST
be done for
.Em all
RAID sets before the RAID device is labeled and before
file systems are created on the RAID device.
.It Fl I Ar serial_number Ar dev
Initialize the component labels on each component of the device.
.Ar serial_number
is used as one of the keys in determining whether a
particular set of components belongs to the same RAID set.
While not strictly enforced, different serial numbers should be used for
different RAID sets.
This step
.Em MUST
be performed when a new RAID set is created.
.It Fl p Ar dev
Check the status of the parity on the RAID set.
Displays a status message,
and returns successfully if the parity is up-to-date.
.It Fl P Ar dev
Check the status of the parity on the RAID set, and initialize
(re-write) the parity if the parity is not known to be up-to-date.
This is normally used after a system crash (and before a
.Xr fsck 8 )
to ensure the integrity of the parity.
.It Fl r Ar component Ar dev
Remove the spare disk specified by
.Ar component
from the set of available spare components.
.It Fl R Ar component Ar dev
Fails the specified
.Ar component ,
if necessary, and immediately begins a reconstruction back to
.Ar component .
This is useful for reconstructing back onto a component after
it has been replaced following a failure.
.It Fl s Ar dev
Display the status of the RAIDframe device for each of the components
and spares.
.It Fl S Ar dev
Check the status of parity re-writing, component reconstruction, and
component copyback.
The output indicates the amount of progress
achieved in each of these areas.
.It Fl u Ar dev
Unconfigure the RAIDframe device.
.It Fl v
Be more verbose.
For operations such as reconstructions, parity
re-writing, and copybacks, provide a progress indicator.
.El
.Pp
The device used by
.Nm
is specified by
.Ar dev .
.Ar dev
may be either the full name of the device, e.g.,
.Pa /dev/rraid0d ,
for the i386 architecture, or
.Pa /dev/rraid0c
for many others, or just simply
.Pa raid0
(for
.Pa /dev/rraid0[cd] ) .
It is recommended that the partitions used to represent the
RAID device are not used for file systems.
.Ss Configuration file
The format of the configuration file is complex, and
only an abbreviated treatment is given here.
In the configuration files, a
.Sq #
indicates the beginning of a comment.
.Pp
There are 4 required sections of a configuration file, and 2
optional sections.
Each section begins with a
.Sq START ,
followed by the section name,
and the configuration parameters associated with that section.
The first section is the
.Sq array
section, and it specifies
the number of rows, columns, and spare disks in the RAID set.
For example:
.Bd -literal -offset indent
START array
1 3 0
.Ed
.Pp
indicates an array with 1 row, 3 columns, and 0 spare disks.
Note that although multi-dimensional arrays may be specified, they are
.Em NOT
supported in the driver.
.Pp
The second section, the
.Sq disks
section, specifies the actual components of the device.
For example:
.Bd -literal -offset indent
START disks
/dev/sd0e
/dev/sd1e
/dev/sd2e
.Ed
.Pp
specifies the three component disks to be used in the RAID device.
If any of the specified drives cannot be found when the RAID device is
configured, then they will be marked as
.Sq failed ,
and the system will operate in degraded mode.
Note that it is
.Em imperative
that the order of the components in the configuration file does not
change between configurations of a RAID device.
Changing the order of the components will result in data loss
if the set is configured with the
.Fl C
option.
In normal circumstances, the RAID set will not configure if only
.Fl c
is specified, and the components are out-of-order.
.Pp
The next section, which is the
.Sq spare
section, is optional, and, if present, specifies the devices to be used as
.Sq hot spares
\(em devices which are on-line,
but are not actively used by the RAID driver unless
one of the main components fails.
A simple
.Sq spare
section might be:
.Bd -literal -offset indent
START spare
/dev/sd3e
.Ed
.Pp
for a configuration with a single spare component.
If no spare drives are to be used in the configuration, then the
.Sq spare
section may be omitted.
.Pp
The next section is the
.Sq layout
section.
This section describes the general layout parameters for the RAID device,
and provides such information as
sectors per stripe unit,
stripe units per parity unit,
stripe units per reconstruction unit,
and the parity configuration to use.
This section might look like:
.Bd -literal -offset indent
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
32 1 1 5
.Ed
.Pp
The sectors per stripe unit specifies, in blocks, the interleave
factor; i.e., the number of contiguous sectors to be written to each
component for a single stripe.
Appropriate selection of this value (32 in this example)
is the subject of much research in RAID architectures.
The stripe units per parity unit and
stripe units per reconstruction unit are normally each set to 1.
While certain values above 1 are permitted, a discussion of valid
values and the consequences of using anything other than 1 is outside
the scope of this document.
The last value in this section (5 in this example)
indicates the parity configuration desired.
Valid entries include:
.Bl -tag -width inde
.It 0
RAID level 0.
No parity, only simple striping.
.It 1
RAID level 1.
Mirroring.
The parity is the mirror.
.It 4
RAID level 4.
Striping across components, with parity stored on the last component.
.It 5
RAID level 5.
Striping across components, parity distributed across all components.
.El
.Pp
There are other valid entries here, including those for Even-Odd
parity, RAID level 5 with rotated sparing, Chained declustering,
and Interleaved declustering, but as of this writing the code for
those parity operations has not been tested with
.Nx .
.Pp
The next required section is the
.Sq queue
section.
This is most often specified as:
.Bd -literal -offset indent
START queue
fifo 100
.Ed
.Pp
where the queuing method is specified as fifo (first-in, first-out),
and the size of the per-component queue is limited to 100 requests.
Other queuing methods may also be specified, but a discussion of them
is beyond the scope of this document.
.Pp
The final section, the
.Sq debug
section, is optional.
For more details on this the reader is referred to
the RAIDframe documentation discussed in the
.Sx HISTORY
section.
.Pp
See
.Sx EXAMPLES
for a more complete configuration file example.
.Sh FILES
.Bl -tag -width /dev/XXrXraidX -compact
.It Pa /dev/{,r}raid*
.Cm raid
device special files.
.El
.Sh EXAMPLES
It is highly recommended that, before using the RAID driver for real
file systems, the system administrator(s) become quite familiar
with the use of
.Nm ,
and that they understand how the component reconstruction process works.
The examples in this section will focus on configuring a
number of different RAID sets of varying degrees of redundancy.
By working through these examples, administrators should be able to
develop a good feel for how to configure a RAID set, and how to
initiate reconstruction of failed components.
.Pp
In the following examples
.Sq raid0
will be used to denote the RAID device.
Depending on the architecture,
.Pa /dev/rraid0c
or
.Pa /dev/rraid0d
may be used in place of
.Pa raid0 .
.Ss Initialization and Configuration
The initial step in configuring a RAID set is to identify the components
that will be used in the RAID set.
All components should be the same size.
Each component should have a disklabel type of
.Dv FS_RAID ,
and a typical disklabel entry for a RAID component might look like:
.Bd -literal -offset indent
f:  1800000   200495     RAID                     # (Cyl.  405*- 4041*)
.Ed
.Pp
While
.Dv FS_BSDFFS
will also work as the component type, the type
.Dv FS_RAID
is preferred for RAIDframe use, as it is required for features such as
auto-configuration.
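.Pp
As a hypothetical illustration only (the disk name
.Pa sd1
and its partition are assumptions, not part of the examples that follow),
the fstype of an existing partition can be changed by editing the label
in place with:
.Bd -literal -offset indent
disklabel -e sd1
.Ed
.Pp
and setting the fstype field of the chosen partition to RAID before
saving and exiting the editor.
.Pp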
As part of the initial configuration of each RAID set,
each component will be given a
.Sq component label .
A
.Sq component label
contains important information about the component, including a
user-specified serial number, the row and column of that component in
the RAID set, the redundancy level of the RAID set, a
.Sq modification counter ,
and whether the parity information (if any) on that
component is known to be correct.
Component labels are an integral part of the RAID set,
since they are used to ensure that components
are configured in the correct order, and used to keep track of other
vital information about the RAID set.
Component labels are also required for the auto-detection
and auto-configuration of RAID sets at boot time.
For a component label to be considered valid, that
particular component label must be in agreement with the other
component labels in the set.
For example, the serial number,
.Sq modification counter ,
number of rows and number of columns must all be in agreement.
If any of these are different, then the component is
not considered to be part of the set.
See
.Xr raid 4
for more information about component labels.
.Pp
Once the components have been identified, and the disks have
appropriate labels,
.Nm
is then used to configure the
.Xr raid 4
device.
To configure the device, a configuration file which looks something like
the following is first created:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 3 1

START disks
/dev/sd1e
/dev/sd2e
/dev/sd3e

START spare
/dev/sd4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
32 1 1 5

START queue
fifo 100
.Ed
.Pp
The above configuration file specifies a RAID 5
set consisting of the components
.Pa /dev/sd1e ,
.Pa /dev/sd2e ,
and
.Pa /dev/sd3e ,
with
.Pa /dev/sd4e
available as a
.Sq hot spare
in case one of the three main drives should fail.
A RAID 0 set would be specified in a similar way:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 4 0

START disks
/dev/sd10e
/dev/sd11e
/dev/sd12e
/dev/sd13e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
64 1 1 0

START queue
fifo 100
.Ed
.Pp
In this case, devices
.Pa /dev/sd10e ,
.Pa /dev/sd11e ,
.Pa /dev/sd12e ,
and
.Pa /dev/sd13e
are the components that make up this RAID set.
Note that there are no hot spares for a RAID 0 set,
since there is no way to recover data if any of the components fail.
.Pp
For a RAID 1 (mirror) set, the following configuration might be used:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 2 0

START disks
/dev/sd20e
/dev/sd21e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1

START queue
fifo 100
.Ed
.Pp
In this case,
.Pa /dev/sd20e
and
.Pa /dev/sd21e
are the two components of the mirror set.
While no hot spares have been specified in this
configuration, they easily could be, just as they were specified in
the RAID 5 case above.
Note as well that RAID 1 sets are currently limited to only 2 components.
At present, n-way mirroring is not possible.
.Pp
The first time a RAID set is configured, the
.Fl C
option must be used:
.Bd -literal -offset indent
raidctl -C raid0.conf raid0
.Ed
.Pp
where
.Pa raid0.conf
is the name of the RAID configuration file.
The
.Fl C
option forces the configuration to succeed, even if any of the component
labels are incorrect.
The
.Fl C
option should not be used lightly in
situations other than initial configurations: if
the system is refusing to configure a RAID set, there is probably a
very good reason for it.
After the initial configuration is done (and
appropriate component labels are added with the
.Fl I
option), raid0 can be configured normally with:
.Bd -literal -offset indent
raidctl -c raid0.conf raid0
.Ed
.Pp
When the RAID set is configured for the first time, it is
necessary to initialize the component labels, and to initialize the
parity on the RAID set.
Initializing the component labels is done with:
.Bd -literal -offset indent
raidctl -I 112341 raid0
.Ed
.Pp
where
.Sq 112341
is a user-specified serial number for the RAID set.
This initialization step is
.Em required
for all RAID sets.
As well, using different serial numbers between RAID sets is
.Em strongly encouraged ,
as using the same serial number for all RAID sets will only serve to
decrease the usefulness of the component label checking.
.Pp
Initializing the RAID set is done via the
.Fl i
option.
This initialization
.Em MUST
be done for
.Em all
RAID sets, since among other things it verifies that the parity (if
any) on the RAID set is correct.
Since this initialization may be quite time-consuming, the
.Fl v
option may also be used in conjunction with
.Fl i :
.Bd -literal -offset indent
raidctl -iv raid0
.Ed
.Pp
This will give more verbose output on the
status of the initialization:
.Bd -literal -offset indent
Initiating re-write of parity
Parity Re-write status:
 10% |****                                   | ETA:    06:03 /
.Ed
.Pp
The output provides a
.Sq Percent Complete
in both a numeric and graphical format, as well as an estimated time
to completion of the operation.
.Pp
Since it is the parity that provides the
.Sq redundancy
part of RAID, it is critical that the parity is kept correct as much
as possible.
If the parity is not correct, then there is no
guarantee that data will not be lost if a component fails.
.Pp
Once the parity is known to be correct, it is then safe to perform
.Xr disklabel 8 ,
.Xr newfs 8 ,
or
.Xr fsck 8
on the device or its file systems, and then to mount the file systems
for use.
.Pp
Under certain circumstances (e.g., the additional component has not
arrived, or data is being migrated off of a disk destined to become a
component) it may be desirable to configure a RAID 1 set with only
a single component.
This can be achieved by using the word
.Dq absent
to indicate that a particular component is not present.
In the following:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 2 0

START disks
absent
/dev/sd0e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1

START queue
fifo 100
.Ed
.Pp
.Pa /dev/sd0e
is the real component, and will be the second disk of a RAID 1 set.
The first component is simply marked as being absent.
Configuration (using
.Fl C
and
.Fl I Ar 12345
as above) proceeds normally, but initialization of the RAID set will
have to wait until all physical components are present.
After configuration, this set can be used normally, but will be operating
in degraded mode.
Once a second physical component is obtained, it can be hot-added,
the existing data mirrored, and normal operation resumed, as sketched below.
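.Pp
As a sketch only (the device name
.Pa /dev/sd5e
is an assumption, and the
.Sq component0
name assumes the absent slot is reported in the same style as the missing
components described later in these examples), the new disk could be added
as a hot spare and the absent slot reconstructed onto it with:
.Bd -literal -offset indent
raidctl -a /dev/sd5e raid0
raidctl -F component0 raid0
.Ed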
.Pp
The size of the resulting RAID set will depend on the number of data
components in the set.
Space is automatically reserved for the component labels, and
the actual amount of space used
for data on a component will be rounded down to the largest possible
multiple of the sectors per stripe unit (sectPerSU) value.
Thus, the amount of space provided by the RAID set will be less
than the sum of the size of the components.
.Ss Maintenance of the RAID set
After the parity has been initialized for the first time, the command:
.Bd -literal -offset indent
raidctl -p raid0
.Ed
.Pp
can be used to check the current status of the parity.
To check the parity and rebuild it if necessary (for example,
after an unclean shutdown) the command:
.Bd -literal -offset indent
raidctl -P raid0
.Ed
.Pp
is used.
Note that re-writing the parity can be done while
other operations on the RAID set are taking place (e.g., while doing a
.Xr fsck 8
on a file system on the RAID set).
However, for maximum effectiveness of the RAID set, the parity should be
known to be correct before any data on the set is modified.
.Pp
To see how the RAID set is doing, the following command can be used to
show the RAID set's status:
.Bd -literal -offset indent
raidctl -s raid0
.Ed
.Pp
The output will look something like:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: optimal
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
Component label for /dev/sd1e:
   Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
Component label for /dev/sd2e:
   Row: 0 Column: 1 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
Component label for /dev/sd3e:
   Row: 0 Column: 2 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
This indicates that all is well with the RAID set.
Of importance here are the component lines which read
.Sq optimal ,
and the
.Sq Parity status
line.
.Sq Parity status: clean
indicates that the parity is up-to-date for this RAID set,
whether or not the RAID set is in redundant or degraded mode.
.Sq Parity status: DIRTY
indicates that it is not known if the parity information is
consistent with the data, and that the parity information needs
to be checked.
Note that if there are file systems open on the RAID set,
the individual components will not be
.Sq clean
but the set as a whole can still be clean.
.Pp
To check the component label of
.Pa /dev/sd1e ,
the following is used:
.Bd -literal -offset indent
raidctl -g /dev/sd1e raid0
.Ed
.Pp
The output of this command will look something like:
.Bd -literal -offset indent
Component label for /dev/sd1e:
   Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
.Ed
.Ss Dealing with Component Failures
If for some reason
(perhaps to test reconstruction) it is necessary to pretend a drive
has failed, the following will perform that function:
.Bd -literal -offset indent
raidctl -f /dev/sd2e raid0
.Ed
.Pp
The system will then be performing all operations in degraded mode,
where missing data is re-computed from existing data and the parity.
In this case, obtaining the status of raid0 will return (in part):
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: failed
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
.Ed
.Pp
Note that with the use of
.Fl f
a reconstruction has not been started.
To both fail the disk and start a reconstruction, the
.Fl F
option must be used:
.Bd -literal -offset indent
raidctl -F /dev/sd2e raid0
.Ed
.Pp
The
.Fl f
option may be used first, and then the
.Fl F
option used later, on the same disk, if desired.
Immediately after the reconstruction is started, the status will report:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: reconstructing
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 10% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
This indicates that a reconstruction is in progress.
To find out how the reconstruction is progressing, the
.Fl S
option may be used.
This will indicate the progress in terms of the
percentage of the reconstruction that is completed.
When the reconstruction is finished the
.Fl s
option will show:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: spared
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
At this point there are at least two options.
First, if
.Pa /dev/sd2e
is known to be good (i.e., the failure was either caused by
.Fl f
or
.Fl F ,
or the failed disk was replaced), then a copyback of the data can
be initiated with the
.Fl B
option.
In this example, this would copy the entire contents of
.Pa /dev/sd4e
to
.Pa /dev/sd2e .
Once the copyback procedure is complete, the
status of the device would be (in part):
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: optimal
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
.Ed
.Pp
and the system is back to normal operation.
.Pp
The second option after the reconstruction is to simply use
.Pa /dev/sd4e
in place of
.Pa /dev/sd2e
in the configuration file.
For example, the configuration file (in part) might now look like:
.Bd -literal -offset indent
START array
1 3 0

START disks
/dev/sd1e
/dev/sd4e
/dev/sd3e
.Ed
.Pp
This can be done as
.Pa /dev/sd4e
is completely interchangeable with
.Pa /dev/sd2e
at this point.
Note that extreme care must be taken when
changing the order of the drives in a configuration.
This is one of the few instances where the devices and/or
their orderings can be changed without loss of data!
In general, the ordering of components in a configuration file should
.Em never
be changed.
.Pp
If a component fails and there are no hot spares
available on-line, the status of the RAID set might (in part) look like:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: failed
           /dev/sd3e: optimal
No spares.
.Ed
.Pp
In this case there are a number of options.
The first option is to add a hot spare using:
.Bd -literal -offset indent
raidctl -a /dev/sd4e raid0
.Ed
.Pp
After the hot add, the status would then be:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: failed
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
.Ed
.Pp
Reconstruction could then take place using
.Fl F
as described above.
.Pp
A second option is to rebuild directly onto
.Pa /dev/sd2e .
Once the disk containing
.Pa /dev/sd2e
has been replaced, one can simply use:
.Bd -literal -offset indent
raidctl -R /dev/sd2e raid0
.Ed
.Pp
to rebuild the
.Pa /dev/sd2e
component.
As the rebuilding is in progress, the status will be:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: reconstructing
           /dev/sd3e: optimal
No spares.
.Ed
.Pp
and when completed, will be:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: optimal
           /dev/sd3e: optimal
No spares.
.Ed
.Pp
In circumstances where a particular component is completely
unavailable after a reboot, a special component name will be used to
indicate the missing component.
For example:
.Bd -literal -offset indent
Components:
           /dev/sd2e: optimal
          component1: failed
No spares.
.Ed
.Pp
indicates that the second component of this RAID set was not detected
at all by the auto-configuration code.
The name
.Sq component1
can be used anywhere a normal component name would be used.
For example, to add a hot spare to the above set, and rebuild onto that hot
spare, the following could be done:
.Bd -literal -offset indent
raidctl -a /dev/sd3e raid0
raidctl -F component1 raid0
.Ed
.Pp
at which point the data missing from
.Sq component1
would be reconstructed onto
.Pa /dev/sd3e .
.Pp
When more than one component is marked as
.Sq failed
due to a non-component hardware failure (e.g., loss of power to two
components, adapter problems, termination problems, or cabling issues), it
is quite possible to recover the data on the RAID set.
The first thing to be aware of is that the first disk to fail will
almost certainly be out-of-sync with the remainder of the array.
If any IO was performed between the time the first component is considered
.Sq failed
and when the second component is considered
.Sq failed ,
then the first component to fail will
.Em not
contain correct data, and should be ignored.
When the second component is marked as failed, however, the RAID device will
(currently) panic the system.
At this point the data on the RAID set
(not including the first failed component) is still self-consistent,
and will be in no worse state of repair than had the power gone out in
the middle of a write to a file system on a non-RAID device.
The problem, however, is that the component labels may now have 3 different
.Sq modification counters
(one value on the first component that failed, one value on the second
component that failed, and a third value on the remaining components).
In such a situation, the RAID set will not autoconfigure,
and can only be forcibly re-configured
with the
.Fl C
option.
To recover the RAID set, one must first remedy whatever physical
problem caused the multiple-component failure.
After that is done, the RAID set can be restored by forcibly
configuring the RAID set
.Em without
the component that failed first.
For example, if
.Pa /dev/sd1e
and
.Pa /dev/sd2e
fail (in that order) in a RAID set of the following configuration:
.Bd -literal -offset indent
START array
1 4 0

START disks
/dev/sd1e
/dev/sd2e
/dev/sd3e
/dev/sd4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5

START queue
fifo 100
.Ed
.Pp
then the following configuration (say "recover_raid0.conf")
.Bd -literal -offset indent
START array
1 4 0

START disks
/dev/sd6e
/dev/sd2e
/dev/sd3e
/dev/sd4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5

START queue
fifo 100
.Ed
.Pp
(where
.Pa /dev/sd6e
has no physical device) can be used with
.Bd -literal -offset indent
raidctl -C recover_raid0.conf raid0
.Ed
.Pp
to force the configuration of raid0.
A
.Bd -literal -offset indent
raidctl -I 12345 raid0
.Ed
.Pp
will be required in order to synchronize the component labels.
At this point the file systems on the RAID set can then be checked and
corrected.
To complete the reconstruction of the RAID set,
.Pa /dev/sd1e
is simply hot-added back into the array, and reconstructed
as described earlier.
.Ss RAID on RAID
RAID sets can be layered to create more complex and much larger RAID sets.
A RAID 0 set, for example, could be constructed from four RAID 5 sets.
The following configuration file shows such a setup:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 4 0

START disks
/dev/raid1e
/dev/raid2e
/dev/raid3e
/dev/raid4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
128 1 1 0

START queue
fifo 100
.Ed
.Pp
A similar configuration file might be used for a RAID 0 set
constructed from components on RAID 1 sets.
In such a configuration, the mirroring provides a high degree
of redundancy, while the striping provides additional speed benefits.
.Ss Auto-configuration and Root on RAID
RAID sets can also be auto-configured at boot.
To make a set auto-configurable,
simply prepare the RAID set as above, and then use:
.Bd -literal -offset indent
raidctl -A yes raid0
.Ed
.Pp
to turn on auto-configuration for that set.
To turn off auto-configuration, use:
.Bd -literal -offset indent
raidctl -A no raid0
.Ed
.Pp
RAID sets which are auto-configurable will be configured before the
root file system is mounted.
These RAID sets are thus available for
use as a root file system, or for any other file system.
A primary advantage of using the auto-configuration is that RAID components
become more independent of the disks they reside on.
For example, SCSI IDs can change, but auto-configured sets will always be
configured correctly, even if the SCSI IDs of the component disks
have become scrambled.
.Pp
Having a system's root file system
.Pq Pa /
on a RAID set is also allowed, with the
.Sq a
partition of such a RAID set being used for
.Pa / .
To use raid0a as the root file system, simply use:
.Bd -literal -offset indent
raidctl -A root raid0
.Ed
.Pp
To return raid0a to being just an auto-configuring set, simply use the
.Fl A Ar yes
arguments.
.Pp
Note that kernels can only be directly read from RAID 1 components on
architectures that support that
.Pq currently alpha, i386, pmax, sparc, sparc64, and vax .
On those architectures, the
.Dv FS_RAID
file system is recognized by the bootblocks, and will properly load the
kernel directly from a RAID 1 component.
For other architectures, or to support the root file system
on other RAID sets, some other mechanism must be used to get a kernel booting.
For example, a small partition containing only the secondary boot-blocks
and an alternate kernel (or two) could be used.
Once a kernel is booting, however, and an auto-configuring RAID set is
found that is eligible to be root, then that RAID set will be
auto-configured and used as the root device.
If two or more RAID sets claim to be root devices, then the
user will be prompted to select the root device.
At this time, RAID 0, 1, 4, and 5 sets are all supported as root devices.
.Pp
A typical RAID 1 setup with root on RAID might be as follows:
.Bl -enum
.It
wd0a - a small partition, which contains a complete, bootable, basic
.Nx
installation.
.It
wd1a - also contains a complete, bootable, basic
.Nx
installation.
.It
wd0e and wd1e - a RAID 1 set, raid0, used for the root file system.
.It
wd0f and wd1f - a RAID 1 set, raid1, which will be used only for
swap space.
.It
wd0g and wd1g - a RAID 1 set, raid2, used for
.Pa /usr ,
.Pa /home ,
or other data, if desired.
.It
wd0h and wd1h - a RAID 1 set, raid3, if desired.
.El
.Pp
RAID sets raid0, raid1, and raid2 are all marked as auto-configurable.
raid0 is marked as being a root file system.
When new kernels are installed, the kernel is not only copied to
.Pa / ,
but also to wd0a and wd1a.
The kernel on wd0a is required, since that
is the kernel the system boots from.
The kernel on wd1a is also
required, since that will be the kernel used should wd0 fail.
The important point here is to have redundant copies of the kernel
available, in the event that one of the drives fails.
.Pp
There is no requirement that the root file system be on the same disk
as the kernel.
For example, obtaining the kernel from wd0a, and using
sd0e and sd1e for raid0, and the root file system, is fine.
It
.Em is
critical, however, that there be multiple kernels available, in the
event of media failure.
.Pp
Multi-layered RAID devices (such as a RAID 0 set made
up of RAID 1 sets) are
.Em not
supported as root devices or auto-configurable devices at this point.
(Multi-layered RAID devices
.Em are
supported in general, however, as mentioned earlier.)
Note that in order to enable component auto-detection and
auto-configuration of RAID devices, the line:
.Bd -literal -offset indent
options    RAID_AUTOCONFIG
.Ed
.Pp
must be in the kernel configuration file.
See
.Xr raid 4
for more details.
.Ss Swapping on RAID
A RAID device can be used as a swap device.
In order to ensure that a RAID device used as a swap device
is correctly unconfigured when the system is shut down or rebooted,
it is recommended that the line
.Bd -literal -offset indent
swapoff=YES
.Ed
.Pp
be added to
.Pa /etc/rc.conf .
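.Pp
Following the typical setup described above, where raid1 is the set used
only for swap, the corresponding
.Pa /etc/fstab
entry might look something like (the
.Sq b
partition here is an assumption, and depends on how the set was labeled):
.Bd -literal -offset indent
/dev/raid1b none swap sw 0 0
.Ed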
.Ss Unconfiguration
The final operation performed by
.Nm
is to unconfigure a
.Xr raid 4
device.
This is accomplished via a simple:
.Bd -literal -offset indent
raidctl -u raid0
.Ed
.Pp
at which point the device is ready to be reconfigured.
.Ss Performance Tuning
Selection of the various parameter values which result in the best
performance can be quite tricky, and often requires a bit of
trial-and-error to get those values most appropriate for a given system.
A whole range of factors come into play, including:
.Bl -enum
.It
Types of components (e.g., SCSI vs. IDE) and their bandwidth
.It
Types of controller cards and their bandwidth
.It
Distribution of components among controllers
.It
IO bandwidth
.It
File system access patterns
.It
CPU speed
.El
.Pp
As with most performance tuning, benchmarking under real-life loads
may be the only way to measure expected performance.
Understanding some of the underlying technology is also useful in tuning.
The goal of this section is to provide pointers to those parameters which may
make significant differences in performance.
.Pp
For a RAID 1 set, a SectPerSU value of 64 or 128 is typically sufficient.
Since data in a RAID 1 set is arranged in a linear
fashion on each component, selecting an appropriate stripe size is
somewhat less critical than it is for a RAID 5 set.
However, a stripe size that is too small will cause large IOs to be
broken up into a number of smaller ones, hurting performance.
At the same time, a large stripe size may cause problems with
concurrent accesses to stripes, which may also affect performance.
Thus values in the range of 32 to 128 are often the most effective.
.Pp
Tuning RAID 5 sets is trickier.
In the best case, IO is presented to the RAID set one stripe at a time.
Since the entire stripe is available at the beginning of the IO,
the parity of that stripe can be calculated before the stripe is written,
and then the stripe data and parity can be written in parallel.
When the amount of data being written is less than a full stripe worth, the
.Sq small write
problem occurs.
Since a
.Sq small write
means only a portion of the stripe on the components is going to
change, the data (and parity) on the components must be updated
slightly differently.
First, the
.Sq old parity
and
.Sq old data
must be read from the components.
Then the new parity is constructed,
using the new data to be written, and the old data and old parity.
Finally, the new data and new parity are written.
All this extra data shuffling results in a serious loss of performance,
and is typically 2 to 4 times slower than a full stripe write (or read).
To combat this problem in the real world, it may be useful
to ensure that stripe sizes are small enough that a
.Sq large IO
from the system will use exactly one large stripe write.
As is seen later, there are some file system dependencies
which may come into play here as well.
.Pp
Since the size of a
.Sq large IO
is often (currently) only 32K or 64K, on a 5-drive RAID 5 set it may
be desirable to select a SectPerSU value of 16 blocks (8K) or 32
blocks (16K).
Since there are 4 data components per stripe, the maximum
data per stripe is 64 blocks (32K) or 128 blocks (64K).
Again, empirical measurement will provide the best indicators of which
values will yield better performance.
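.Pp
As a concrete sketch of the arithmetic above (the values are illustrative
only, not a recommendation), a 5-drive RAID 5 set tuned for 32K
.Sq large IOs
might use a layout section such as:
.Bd -literal -offset indent
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
16 1 1 5
.Ed
.Pp
giving 4 data stripe units of 16 blocks (8K) each, or 64 blocks (32K) of
data per stripe.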
.Pp
The parameters used for the file system are also critical to good performance.
For
.Xr newfs 8 ,
for example, increasing the block size to 32K or 64K may improve
performance dramatically.
As well, changing the cylinders-per-group
parameter from 16 to 32 or higher is often not only necessary for
larger file systems, but may also have positive performance implications.
.Ss Summary
Despite the length of this man-page, configuring a RAID set is a
relatively straightforward process.
All that needs to be done are the following steps:
.Bl -enum
.It
Use
.Xr disklabel 8
to create the components (of type RAID).
.It
Construct a RAID configuration file: e.g.,
.Pa raid0.conf
.It
Configure the RAID set with:
.Bd -literal -offset indent
raidctl -C raid0.conf raid0
.Ed
.Pp
.It
Initialize the component labels with:
.Bd -literal -offset indent
raidctl -I 123456 raid0
.Ed
.Pp
.It
Initialize other important parts of the set with:
.Bd -literal -offset indent
raidctl -i raid0
.Ed
.Pp
.It
Get the default label for the RAID set:
.Bd -literal -offset indent
disklabel raid0 \*[Gt] /tmp/label
.Ed
.Pp
.It
Edit the label:
.Bd -literal -offset indent
vi /tmp/label
.Ed
.Pp
.It
Put the new label on the RAID set:
.Bd -literal -offset indent
disklabel -R -r raid0 /tmp/label
.Ed
.Pp
.It
Create the file system:
.Bd -literal -offset indent
newfs /dev/rraid0e
.Ed
.Pp
.It
Mount the file system:
.Bd -literal -offset indent
mount /dev/raid0e /mnt
.Ed
.Pp
.It
Use:
.Bd -literal -offset indent
raidctl -c raid0.conf raid0
.Ed
.Pp
to re-configure the RAID set the next time it is needed, or put
.Pa raid0.conf
into
.Pa /etc
where it will automatically be started by the
.Pa /etc/rc.d
scripts (see also the note following this list).
.El
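.Pp
As an aside, once a set has been configured, the
.Fl G
option can be used to capture the running configuration in a form suitable
for later use with
.Fl c
or
.Fl C ,
for example (the destination file name is only a suggestion):
.Bd -literal -offset indent
raidctl -G raid0 \*[Gt] /etc/raid0.conf
.Ed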
.Sh SEE ALSO
.Xr ccd 4 ,
.Xr raid 4 ,
.Xr rc 8
.Sh HISTORY
RAIDframe is a framework for rapid prototyping of RAID structures
developed by the folks at the Parallel Data Laboratory at Carnegie
Mellon University (CMU).
A more complete description of the internals and functionality of
RAIDframe is found in the paper "RAIDframe: A Rapid Prototyping Tool
for RAID Systems", by William V. Courtright II, Garth Gibson, Mark
Holland, LeAnn Neal Reilly, and Jim Zelenka, and published by the
Parallel Data Laboratory of Carnegie Mellon University.
.Pp
The
.Nm
command first appeared as a program in CMU's RAIDframe v1.1 distribution.
This version of
.Nm
is a complete re-write, and first appeared in
.Nx 1.4 .
.Sh COPYRIGHT
.Bd -literal
The RAIDframe Copyright is as follows:

Copyright (c) 1994-1996 Carnegie-Mellon University.
All rights reserved.

Permission to use, copy, modify and distribute this software and
its documentation is hereby granted, provided that both the copyright
notice and this permission notice appear in all copies of the
software, derivative works or modified versions, and any portions
thereof, and that both notices appear in supporting documentation.

CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
CONDITION.  CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.

Carnegie Mellon requests users of this software to return to

 Software Distribution Coordinator  or  Software.Distribution@CS.CMU.EDU
 School of Computer Science
 Carnegie Mellon University
 Pittsburgh PA 15213-3890

any improvements or extensions that they make and grant Carnegie the
rights to redistribute these changes.
.Ed
.Sh WARNINGS
Certain RAID levels (1, 4, 5, 6, and others) can protect against some
data loss due to component failure.
However, the loss of two components of a RAID 4 or 5 system,
or the loss of a single component of a RAID 0 system, will
result in the entire file system being lost.
RAID is
.Em NOT
a substitute for good backup practices.
.Pp
Recomputation of parity
.Em MUST
be performed whenever there is a chance that it may have been compromised.
This includes after system crashes, or before a RAID
device has been used for the first time.
Failure to keep parity correct will be catastrophic should a
component ever fail \(em it is better to use RAID 0 and get the
additional space and speed, than it is to use parity, but
not keep the parity correct.
At least with RAID 0 there is no perception of increased data security.
.Sh BUGS
Hot-spare removal is currently not available.