1.\" $NetBSD: raidctl.8,v 1.49 2005/02/28 22:03:05 wiz Exp $ 2.\" 3.\" Copyright (c) 1998, 2002 The NetBSD Foundation, Inc. 4.\" All rights reserved. 5.\" 6.\" This code is derived from software contributed to The NetBSD Foundation 7.\" by Greg Oster 8.\" 9.\" Redistribution and use in source and binary forms, with or without 10.\" modification, are permitted provided that the following conditions 11.\" are met: 12.\" 1. Redistributions of source code must retain the above copyright 13.\" notice, this list of conditions and the following disclaimer. 14.\" 2. Redistributions in binary form must reproduce the above copyright 15.\" notice, this list of conditions and the following disclaimer in the 16.\" documentation and/or other materials provided with the distribution. 17.\" 3. All advertising materials mentioning features or use of this software 18.\" must display the following acknowledgement: 19.\" This product includes software developed by the NetBSD 20.\" Foundation, Inc. and its contributors. 21.\" 4. Neither the name of The NetBSD Foundation nor the names of its 22.\" contributors may be used to endorse or promote products derived 23.\" from this software without specific prior written permission. 24.\" 25.\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS 26.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED 27.\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 28.\" PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS 29.\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 30.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 31.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 32.\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 33.\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 34.\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 35.\" POSSIBILITY OF SUCH DAMAGE. 36.\" 37.\" 38.\" Copyright (c) 1995 Carnegie-Mellon University. 39.\" All rights reserved. 40.\" 41.\" Author: Mark Holland 42.\" 43.\" Permission to use, copy, modify and distribute this software and 44.\" its documentation is hereby granted, provided that both the copyright 45.\" notice and this permission notice appear in all copies of the 46.\" software, derivative works or modified versions, and any portions 47.\" thereof, and that both notices appear in supporting documentation. 48.\" 49.\" CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS" 50.\" CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND 51.\" FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE. 52.\" 53.\" Carnegie Mellon requests users of this software to return to 54.\" 55.\" Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU 56.\" School of Computer Science 57.\" Carnegie Mellon University 58.\" Pittsburgh PA 15213-3890 59.\" 60.\" any improvements or extensions that they make and grant Carnegie the 61.\" rights to redistribute these changes. 
62.\" 63.Dd February 28, 2005 64.Dt RAIDCTL 8 65.Os 66.Sh NAME 67.Nm raidctl 68.Nd configuration utility for the RAIDframe disk driver 69.Sh SYNOPSIS 70.Nm 71.Op Fl v 72.Fl a Ar component Ar dev 73.Nm 74.Op Fl v 75.Fl A Op yes | no | root 76.Ar dev 77.Nm 78.Op Fl v 79.Fl B Ar dev 80.Nm 81.Op Fl v 82.Fl c Ar config_file Ar dev 83.Nm 84.Op Fl v 85.Fl C Ar config_file Ar dev 86.Nm 87.Op Fl v 88.Fl f Ar component Ar dev 89.Nm 90.Op Fl v 91.Fl F Ar component Ar dev 92.Nm 93.Op Fl v 94.Fl g Ar component Ar dev 95.Nm 96.Op Fl v 97.Fl G Ar dev 98.Nm 99.Op Fl v 100.Fl i Ar dev 101.Nm 102.Op Fl v 103.Fl I Ar serial_number Ar dev 104.Nm 105.Op Fl v 106.Fl p Ar dev 107.Nm 108.Op Fl v 109.Fl P Ar dev 110.Nm 111.Op Fl v 112.Fl r Ar component Ar dev 113.Nm 114.Op Fl v 115.Fl R Ar component Ar dev 116.Nm 117.Op Fl v 118.Fl s Ar dev 119.Nm 120.Op Fl v 121.Fl S Ar dev 122.Nm 123.Op Fl v 124.Fl u Ar dev 125.Sh DESCRIPTION 126.Nm 127is the user-land control program for 128.Xr raid 4 , 129the RAIDframe disk device. 130.Nm 131is primarily used to dynamically configure and unconfigure RAIDframe disk 132devices. 133For more information about the RAIDframe disk device, see 134.Xr raid 4 . 135.Pp 136This document assumes the reader has at least rudimentary knowledge of 137RAID and RAID concepts. 138.Pp 139The command-line options for 140.Nm 141are as follows: 142.Bl -tag -width indent 143.It Fl a Ar component Ar dev 144Add 145.Ar component 146as a hot spare for the device 147.Ar dev . 148Component labels (which identify the location of a given 149component within a particular RAID set) are automatically added to the 150hot spare after it has been used and are not required for 151.Ar component 152before it is used. 153.It Fl A Ic yes Ar dev 154Make the RAID set auto-configurable. 155The RAID set will be automatically configured at boot 156.Ar before 157the root file system is mounted. 158Note that all components of the set must be of type 159.Dv RAID 160in the disklabel. 161.It Fl A Ic no Ar dev 162Turn off auto-configuration for the RAID set. 163.It Fl A Ic root Ar dev 164Make the RAID set auto-configurable, and also mark the set as being 165eligible to be the root partition. 166A RAID set configured this way will 167.Ar override 168the use of the boot disk as the root device. 169All components of the set must be of type 170.Dv RAID 171in the disklabel. 172Note that only certain architectures 173.Pq currently alpha, i386, pmax, sparc, sparc64, and vax 174support booting a kernel directly from a RAID set. 175.It Fl B Ar dev 176Initiate a copyback of reconstructed data from a spare disk to 177its original disk. 178This is performed after a component has failed, 179and the failed drive has been reconstructed onto a spare drive. 180.It Fl c Ar config_file Ar dev 181Configure the RAIDframe device 182.Ar dev 183according to the configuration given in 184.Ar config_file . 185A description of the contents of 186.Ar config_file 187is given later. 188.It Fl C Ar config_file Ar dev 189As for 190.Fl c , 191but forces the configuration to take place. 192This is required the first time a RAID set is configured. 193.It Fl f Ar component Ar dev 194This marks the specified 195.Ar component 196as having failed, but does not initiate a reconstruction of that component. 197.It Fl F Ar component Ar dev 198Fails the specified 199.Ar component 200of the device, and immediately begin a reconstruction of the failed 201disk onto an available hot spare. 
202This is one of the mechanisms used to start 203the reconstruction process if a component does have a hardware failure. 204.It Fl g Ar component Ar dev 205Get the component label for the specified component. 206.It Fl G Ar dev 207Generate the configuration of the RAIDframe device in a format suitable for 208use with the 209.Fl c 210or 211.Fl C 212options. 213.It Fl i Ar dev 214Initialize the RAID device. 215In particular, (re-)write the parity on the selected device. 216This 217.Em MUST 218be done for 219.Em all 220RAID sets before the RAID device is labeled and before 221file systems are created on the RAID device. 222.It Fl I Ar serial_number Ar dev 223Initialize the component labels on each component of the device. 224.Ar serial_number 225is used as one of the keys in determining whether a 226particular set of components belong to the same RAID set. 227While not strictly enforced, different serial numbers should be used for 228different RAID sets. 229This step 230.Em MUST 231be performed when a new RAID set is created. 232.It Fl p Ar dev 233Check the status of the parity on the RAID set. 234Displays a status message, 235and returns successfully if the parity is up-to-date. 236.It Fl P Ar dev 237Check the status of the parity on the RAID set, and initialize 238(re-write) the parity if the parity is not known to be up-to-date. 239This is normally used after a system crash (and before a 240.Xr fsck 8 ) 241to ensure the integrity of the parity. 242.It Fl r Ar component Ar dev 243Remove the spare disk specified by 244.Ar component 245from the set of available spare components. 246.It Fl R Ar component Ar dev 247Fails the specified 248.Ar component , 249if necessary, and immediately begins a reconstruction back to 250.Ar component . 251This is useful for reconstructing back onto a component after 252it has been replaced following a failure. 253.It Fl s Ar dev 254Display the status of the RAIDframe device for each of the components 255and spares. 256.It Fl S Ar dev 257Check the status of parity re-writing, component reconstruction, and 258component copyback. 259The output indicates the amount of progress 260achieved in each of these areas. 261.It Fl u Ar dev 262Unconfigure the RAIDframe device. 263.It Fl v 264Be more verbose. 265For operations such as reconstructions, parity 266re-writing, and copybacks, provide a progress indicator. 267.El 268.Pp 269The device used by 270.Nm 271is specified by 272.Ar dev . 273.Ar dev 274may be either the full name of the device, e.g., 275.Pa /dev/rraid0d , 276for the i386 architecture, or 277.Pa /dev/rraid0c 278for many others, or just simply 279.Pa raid0 280(for 281.Pa /dev/rraid0[cd] ) . 282It is recommended that the partitions used to represent the 283RAID device are not used for file systems. 284.Ss Configuration file 285The format of the configuration file is complex, and 286only an abbreviated treatment is given here. 287In the configuration files, a 288.Sq # 289indicates the beginning of a comment. 290.Pp 291There are 4 required sections of a configuration file, and 2 292optional sections. 293Each section begins with a 294.Sq START , 295followed by the section name, 296and the configuration parameters associated with that section. 297The first section is the 298.Sq array 299section, and it specifies 300the number of rows, columns, and spare disks in the RAID set. 301For example: 302.Bd -literal -offset indent 303START array 3041 3 0 305.Ed 306.Pp 307indicates an array with 1 row, 3 columns, and 0 spare disks. 
308Note that although multi-dimensional arrays may be specified, they are 309.Em NOT 310supported in the driver. 311.Pp 312The second section, the 313.Sq disks 314section, specifies the actual components of the device. 315For example: 316.Bd -literal -offset indent 317START disks 318/dev/sd0e 319/dev/sd1e 320/dev/sd2e 321.Ed 322.Pp 323specifies the three component disks to be used in the RAID device. 324If any of the specified drives cannot be found when the RAID device is 325configured, then they will be marked as 326.Sq failed , 327and the system will operate in degraded mode. 328Note that it is 329.Em imperative 330that the order of the components in the configuration file does not 331change between configurations of a RAID device. 332Changing the order of the components will result in data loss 333if the set is configured with the 334.Fl C 335option. 336In normal circumstances, the RAID set will not configure if only 337.Fl c 338is specified, and the components are out-of-order. 339.Pp 340The next section, which is the 341.Sq spare 342section, is optional, and, if present, specifies the devices to be used as 343.Sq hot spares 344\(em devices which are on-line, 345but are not actively used by the RAID driver unless 346one of the main components fail. 347A simple 348.Sq spare 349section might be: 350.Bd -literal -offset indent 351START spare 352/dev/sd3e 353.Ed 354.Pp 355for a configuration with a single spare component. 356If no spare drives are to be used in the configuration, then the 357.Sq spare 358section may be omitted. 359.Pp 360The next section is the 361.Sq layout 362section. 363This section describes the general layout parameters for the RAID device, 364and provides such information as 365sectors per stripe unit, 366stripe units per parity unit, 367stripe units per reconstruction unit, 368and the parity configuration to use. 369This section might look like: 370.Bd -literal -offset indent 371START layout 372# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level 37332 1 1 5 374.Ed 375.Pp 376The sectors per stripe unit specifies, in blocks, the interleave 377factor; i.e., the number of contiguous sectors to be written to each 378component for a single stripe. 379Appropriate selection of this value (32 in this example) 380is the subject of much research in RAID architectures. 381The stripe units per parity unit and 382stripe units per reconstruction unit are normally each set to 1. 383While certain values above 1 are permitted, a discussion of valid 384values and the consequences of using anything other than 1 are outside 385the scope of this document. 386The last value in this section (5 in this example) 387indicates the parity configuration desired. 388Valid entries include: 389.Bl -tag -width inde 390.It 0 391RAID level 0. 392No parity, only simple striping. 393.It 1 394RAID level 1. 395Mirroring. 396The parity is the mirror. 397.It 4 398RAID level 4. 399Striping across components, with parity stored on the last component. 400.It 5 401RAID level 5. 402Striping across components, parity distributed across all components. 403.El 404.Pp 405There are other valid entries here, including those for Even-Odd 406parity, RAID level 5 with rotated sparing, Chained declustering, 407and Interleaved declustering, but as of this writing the code for 408those parity operations has not been tested with 409.Nx . 410.Pp 411The next required section is the 412.Sq queue 413section. 
414This is most often specified as: 415.Bd -literal -offset indent 416START queue 417fifo 100 418.Ed 419.Pp 420where the queuing method is specified as fifo (first-in, first-out), 421and the size of the per-component queue is limited to 100 requests. 422Other queuing methods may also be specified, but a discussion of them 423is beyond the scope of this document. 424.Pp 425The final section, the 426.Sq debug 427section, is optional. 428For more details on this the reader is referred to 429the RAIDframe documentation discussed in the 430.Sx HISTORY 431section. 432.Pp 433See 434.Sx EXAMPLES 435for a more complete configuration file example. 436.Sh FILES 437.Bl -tag -width /dev/XXrXraidX -compact 438.It Pa /dev/{,r}raid* 439.Cm raid 440device special files. 441.El 442.Sh EXAMPLES 443It is highly recommended that before using the RAID driver for real 444file systems that the system administrator(s) become quite familiar 445with the use of 446.Nm , 447and that they understand how the component reconstruction process works. 448The examples in this section will focus on configuring a 449number of different RAID sets of varying degrees of redundancy. 450By working through these examples, administrators should be able to 451develop a good feel for how to configure a RAID set, and how to 452initiate reconstruction of failed components. 453.Pp 454In the following examples 455.Sq raid0 456will be used to denote the RAID device. 457Depending on the architecture, 458.Pa /dev/rraid0c 459or 460.Pa /dev/rraid0d 461may be used in place of 462.Pa raid0 . 463.Ss Initialization and Configuration 464The initial step in configuring a RAID set is to identify the components 465that will be used in the RAID set. 466All components should be the same size. 467Each component should have a disklabel type of 468.Dv FS_RAID , 469and a typical disklabel entry for a RAID component might look like: 470.Bd -literal -offset indent 471f: 1800000 200495 RAID # (Cyl. 405*- 4041*) 472.Ed 473.Pp 474While 475.Dv FS_BSDFFS 476will also work as the component type, the type 477.Dv FS_RAID 478is preferred for RAIDframe use, as it is required for features such as 479auto-configuration. 480As part of the initial configuration of each RAID set, 481each component will be given a 482.Sq component label . 483A 484.Sq component label 485contains important information about the component, including a 486user-specified serial number, the row and column of that component in 487the RAID set, the redundancy level of the RAID set, a 488.Sq modification counter , 489and whether the parity information (if any) on that 490component is known to be correct. 491Component labels are an integral part of the RAID set, 492since they are used to ensure that components 493are configured in the correct order, and used to keep track of other 494vital information about the RAID set. 495Component labels are also required for the auto-detection 496and auto-configuration of RAID sets at boot time. 497For a component label to be considered valid, that 498particular component label must be in agreement with the other 499component labels in the set. 500For example, the serial number, 501.Sq modification counter , 502number of rows and number of columns must all be in agreement. 503If any of these are different, then the component is 504not considered to be part of the set. 505See 506.Xr raid 4 507for more information about component labels. 
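.Pp
As an illustrative check on an already-configured set (the component
names used here are taken from the examples below, and should be adjusted
to match the actual hardware), the
.Fl g
option can be run against each component in turn to confirm that the
labels agree:
.Bd -literal -offset indent
# component names assumed from the RAID 5 example below; adjust as needed
for c in /dev/sd1e /dev/sd2e /dev/sd3e; do
	raidctl -g $c raid0
done
.Ed
.Pp
The serial number, modification counter, and row and column counts
reported for each component should match.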
508.Pp 509Once the components have been identified, and the disks have 510appropriate labels, 511.Nm 512is then used to configure the 513.Xr raid 4 514device. 515To configure the device, a configuration file which looks something like: 516.Bd -literal -offset indent 517START array 518# numRow numCol numSpare 5191 3 1 520 521START disks 522/dev/sd1e 523/dev/sd2e 524/dev/sd3e 525 526START spare 527/dev/sd4e 528 529START layout 530# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5 53132 1 1 5 532 533START queue 534fifo 100 535.Ed 536.Pp 537is created in a file. 538The above configuration file specifies a RAID 5 539set consisting of the components 540.Pa /dev/sd1e , 541.Pa /dev/sd2e , 542and 543.Pa /dev/sd3e , 544with 545.Pa /dev/sd4e 546available as a 547.Sq hot spare 548in case one of the three main drives should fail. 549A RAID 0 set would be specified in a similar way: 550.Bd -literal -offset indent 551START array 552# numRow numCol numSpare 5531 4 0 554 555START disks 556/dev/sd10e 557/dev/sd11e 558/dev/sd12e 559/dev/sd13e 560 561START layout 562# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0 56364 1 1 0 564 565START queue 566fifo 100 567.Ed 568.Pp 569In this case, devices 570.Pa /dev/sd10e , 571.Pa /dev/sd11e , 572.Pa /dev/sd12e , 573and 574.Pa /dev/sd13e 575are the components that make up this RAID set. 576Note that there are no hot spares for a RAID 0 set, 577since there is no way to recover data if any of the components fail. 578.Pp 579For a RAID 1 (mirror) set, the following configuration might be used: 580.Bd -literal -offset indent 581START array 582# numRow numCol numSpare 5831 2 0 584 585START disks 586/dev/sd20e 587/dev/sd21e 588 589START layout 590# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1 591128 1 1 1 592 593START queue 594fifo 100 595.Ed 596.Pp 597In this case, 598.Pa /dev/sd20e 599and 600.Pa /dev/sd21e 601are the two components of the mirror set. 602While no hot spares have been specified in this 603configuration, they easily could be, just as they were specified in 604the RAID 5 case above. 605Note as well that RAID 1 sets are currently limited to only 2 components. 606At present, n-way mirroring is not possible. 607.Pp 608The first time a RAID set is configured, the 609.Fl C 610option must be used: 611.Bd -literal -offset indent 612raidctl -C raid0.conf raid0 613.Ed 614.Pp 615where 616.Pa raid0.conf 617is the name of the RAID configuration file. 618The 619.Fl C 620forces the configuration to succeed, even if any of the component 621labels are incorrect. 622The 623.Fl C 624option should not be used lightly in 625situations other than initial configurations, as if 626the system is refusing to configure a RAID set, there is probably a 627very good reason for it. 628After the initial configuration is done (and 629appropriate component labels are added with the 630.Fl I 631option) then raid0 can be configured normally with: 632.Bd -literal -offset indent 633raidctl -c raid0.conf raid0 634.Ed 635.Pp 636When the RAID set is configured for the first time, it is 637necessary to initialize the component labels, and to initialize the 638parity on the RAID set. 639Initializing the component labels is done with: 640.Bd -literal -offset indent 641raidctl -I 112341 raid0 642.Ed 643.Pp 644where 645.Sq 112341 646is a user-specified serial number for the RAID set. 647This initialization step is 648.Em required 649for all RAID sets. 
As well, using different serial numbers between RAID sets is
.Em strongly encouraged ,
as using the same serial number for all RAID sets will only serve to
decrease the usefulness of the component label checking.
.Pp
Initializing the RAID set is done via the
.Fl i
option.
This initialization
.Em MUST
be done for
.Em all
RAID sets, since among other things it verifies that the parity (if
any) on the RAID set is correct.
Since this initialization may be quite time-consuming, the
.Fl v
option may also be used in conjunction with
.Fl i :
.Bd -literal -offset indent
raidctl -iv raid0
.Ed
.Pp
This will give more verbose output on the
status of the initialization:
.Bd -literal -offset indent
Initiating re-write of parity
Parity Re-write status:
 10% |**** | ETA: 06:03 /
.Ed
.Pp
The output provides a
.Sq Percent Complete
in both a numeric and graphical format, as well as an estimated time
to completion of the operation.
.Pp
Since it is the parity that provides the
.Sq redundancy
part of RAID, it is critical that the parity be kept correct.
If the parity is not correct, then there is no
guarantee that data will not be lost if a component fails.
.Pp
Once the parity is known to be correct, it is then safe to perform
.Xr disklabel 8 ,
.Xr newfs 8 ,
or
.Xr fsck 8
on the device or its file systems, and then to mount the file systems
for use.
.Pp
Under certain circumstances (e.g., the additional component has not
arrived, or data is being migrated off of a disk destined to become a
component) it may be desirable to configure a RAID 1 set with only
a single component.
This can be achieved by using the word
.Dq absent
to indicate that a particular component is not present.
In the following:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 2 0

START disks
absent
/dev/sd0e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1

START queue
fifo 100
.Ed
.Pp
.Pa /dev/sd0e
is the real component, and will be the second disk of a RAID 1 set.
The first component is simply marked as being absent.
Configuration (using
.Fl C
and
.Fl I Ar 12345
as above) proceeds normally, but initialization of the RAID set will
have to wait until all physical components are present.
After configuration, this set can be used normally, but will be operating
in degraded mode.
Once a second physical component is obtained, it can be hot-added,
the existing data mirrored, and normal operation resumed.
.Ss Maintenance of the RAID set
After the parity has been initialized for the first time, the command:
.Bd -literal -offset indent
raidctl -p raid0
.Ed
.Pp
can be used to check the current status of the parity.
To check the parity and rebuild it if necessary (for example,
after an unclean shutdown) the command:
.Bd -literal -offset indent
raidctl -P raid0
.Ed
.Pp
is used.
Note that re-writing the parity can be done while
other operations on the RAID set are taking place (e.g., while doing a
.Xr fsck 8
on a file system on the RAID set).
However, for maximum effectiveness of the RAID set, the parity should be
known to be correct before any data on the set is modified.
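.Pp
As with the initialization above, the
.Fl v
option may be combined with
.Fl P
to obtain a progress indicator while any required parity re-write runs:
.Bd -literal -offset indent
raidctl -Pv raid0
.Ed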
757.Pp 758To see how the RAID set is doing, the following command can be used to 759show the RAID set's status: 760.Bd -literal -offset indent 761raidctl -s raid0 762.Ed 763.Pp 764The output will look something like: 765.Bd -literal -offset indent 766Components: 767 /dev/sd1e: optimal 768 /dev/sd2e: optimal 769 /dev/sd3e: optimal 770Spares: 771 /dev/sd4e: spare 772Component label for /dev/sd1e: 773 Row: 0 Column: 0 Num Rows: 1 Num Columns: 3 774 Version: 2 Serial Number: 13432 Mod Counter: 65 775 Clean: No Status: 0 776 sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1 777 RAID Level: 5 blocksize: 512 numBlocks: 1799936 778 Autoconfig: No 779 Last configured as: raid0 780Component label for /dev/sd2e: 781 Row: 0 Column: 1 Num Rows: 1 Num Columns: 3 782 Version: 2 Serial Number: 13432 Mod Counter: 65 783 Clean: No Status: 0 784 sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1 785 RAID Level: 5 blocksize: 512 numBlocks: 1799936 786 Autoconfig: No 787 Last configured as: raid0 788Component label for /dev/sd3e: 789 Row: 0 Column: 2 Num Rows: 1 Num Columns: 3 790 Version: 2 Serial Number: 13432 Mod Counter: 65 791 Clean: No Status: 0 792 sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1 793 RAID Level: 5 blocksize: 512 numBlocks: 1799936 794 Autoconfig: No 795 Last configured as: raid0 796Parity status: clean 797Reconstruction is 100% complete. 798Parity Re-write is 100% complete. 799Copyback is 100% complete. 800.Ed 801.Pp 802This indicates that all is well with the RAID set. 803Of importance here are the component lines which read 804.Sq optimal , 805and the 806.Sq Parity status 807line. 808.Sq Parity status: clean 809indicates that the parity is up-to-date for this RAID set, 810whether or not the RAID set is in redundant or degraded mode. 811.Sq Parity status: DIRTY 812indicates that it is not known if the parity information is 813consistent with the data, and that the parity information needs 814to be checked. 815Note that if there are file systems open on the RAID set, 816the individual components will not be 817.Sq clean 818but the set as a whole can still be clean. 819.Pp 820To check the component label of 821.Pa /dev/sd1e , 822the following is used: 823.Bd -literal -offset indent 824raidctl -g /dev/sd1e raid0 825.Ed 826.Pp 827The output of this command will look something like: 828.Bd -literal -offset indent 829Component label for /dev/sd1e: 830 Row: 0 Column: 0 Num Rows: 1 Num Columns: 3 831 Version: 2 Serial Number: 13432 Mod Counter: 65 832 Clean: No Status: 0 833 sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1 834 RAID Level: 5 blocksize: 512 numBlocks: 1799936 835 Autoconfig: No 836 Last configured as: raid0 837.Ed 838.Ss Dealing with Component Failures 839If for some reason 840(perhaps to test reconstruction) it is necessary to pretend a drive 841has failed, the following will perform that function: 842.Bd -literal -offset indent 843raidctl -f /dev/sd2e raid0 844.Ed 845.Pp 846The system will then be performing all operations in degraded mode, 847where missing data is re-computed from existing data and the parity. 848In this case, obtaining the status of raid0 will return (in part): 849.Bd -literal -offset indent 850Components: 851 /dev/sd1e: optimal 852 /dev/sd2e: failed 853 /dev/sd3e: optimal 854Spares: 855 /dev/sd4e: spare 856.Ed 857.Pp 858Note that with the use of 859.Fl f 860a reconstruction has not been started. 
To both fail the disk and start a reconstruction, the
.Fl F
option must be used:
.Bd -literal -offset indent
raidctl -F /dev/sd2e raid0
.Ed
.Pp
The
.Fl f
option may be used first, and then the
.Fl F
option used later, on the same disk, if desired.
Immediately after the reconstruction is started, the status will report:
.Bd -literal -offset indent
Components:
 /dev/sd1e: optimal
 /dev/sd2e: reconstructing
 /dev/sd3e: optimal
Spares:
 /dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 10% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
This indicates that a reconstruction is in progress.
To find out how the reconstruction is progressing, the
.Fl S
option may be used.
This will indicate the progress in terms of the
percentage of the reconstruction that is completed.
When the reconstruction is finished, the
.Fl s
option will show:
.Bd -literal -offset indent
Components:
 /dev/sd1e: optimal
 /dev/sd2e: spared
 /dev/sd3e: optimal
Spares:
 /dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
At this point there are at least two options.
First, if
.Pa /dev/sd2e
is known to be good (i.e., the failure was either caused by
.Fl f
or
.Fl F ,
or the failed disk was replaced), then a copyback of the data can
be initiated with the
.Fl B
option.
In this example, this would copy the entire contents of
.Pa /dev/sd4e
to
.Pa /dev/sd2e .
Once the copyback procedure is complete, the
status of the device would be (in part):
.Bd -literal -offset indent
Components:
 /dev/sd1e: optimal
 /dev/sd2e: optimal
 /dev/sd3e: optimal
Spares:
 /dev/sd4e: spare
.Ed
.Pp
and the system is back to normal operation.
.Pp
The second option after the reconstruction is to simply use
.Pa /dev/sd4e
in place of
.Pa /dev/sd2e
in the configuration file.
For example, the configuration file (in part) might now look like:
.Bd -literal -offset indent
START array
1 3 0

START disks
/dev/sd1e
/dev/sd4e
/dev/sd3e
.Ed
.Pp
This can be done as
.Pa /dev/sd4e
is completely interchangeable with
.Pa /dev/sd2e
at this point.
Note that extreme care must be taken when
changing the order of the drives in a configuration.
This is one of the few instances where the devices and/or
their orderings can be changed without loss of data!
In general, the ordering of components in a configuration file should
.Em never
be changed.
.Pp
If a component fails and there are no hot spares
available on-line, the status of the RAID set might (in part) look like:
.Bd -literal -offset indent
Components:
 /dev/sd1e: optimal
 /dev/sd2e: failed
 /dev/sd3e: optimal
No spares.
.Ed
.Pp
In this case there are a number of options.
The first option is to add a hot spare using:
.Bd -literal -offset indent
raidctl -a /dev/sd4e raid0
.Ed
.Pp
After the hot add, the status would then be:
.Bd -literal -offset indent
Components:
 /dev/sd1e: optimal
 /dev/sd2e: failed
 /dev/sd3e: optimal
Spares:
 /dev/sd4e: spare
.Ed
.Pp
Reconstruction could then take place using
.Fl F
as described above.
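.Pp
In this example, that would be:
.Bd -literal -offset indent
raidctl -F /dev/sd2e raid0
.Ed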
997.Pp 998A second option is to rebuild directly onto 999.Pa /dev/sd2e . 1000Once the disk containing 1001.Pa /dev/sd2e 1002has been replaced, one can simply use: 1003.Bd -literal -offset indent 1004raidctl -R /dev/sd2e raid0 1005.Ed 1006.Pp 1007to rebuild the 1008.Pa /dev/sd2e 1009component. 1010As the rebuilding is in progress, the status will be: 1011.Bd -literal -offset indent 1012Components: 1013 /dev/sd1e: optimal 1014 /dev/sd2e: reconstructing 1015 /dev/sd3e: optimal 1016No spares. 1017.Ed 1018.Pp 1019and when completed, will be: 1020.Bd -literal -offset indent 1021Components: 1022 /dev/sd1e: optimal 1023 /dev/sd2e: optimal 1024 /dev/sd3e: optimal 1025No spares. 1026.Ed 1027.Pp 1028In circumstances where a particular component is completely 1029unavailable after a reboot, a special component name will be used to 1030indicate the missing component. 1031For example: 1032.Bd -literal -offset indent 1033Components: 1034 /dev/sd2e: optimal 1035 component1: failed 1036No spares. 1037.Ed 1038.Pp 1039indicates that the second component of this RAID set was not detected 1040at all by the auto-configuration code. 1041The name 1042.Sq component1 1043can be used anywhere a normal component name would be used. 1044For example, to add a hot spare to the above set, and rebuild to that hot 1045spare, the following could be done: 1046.Bd -literal -offset indent 1047raidctl -a /dev/sd3e raid0 1048raidctl -F component1 raid0 1049.Ed 1050.Pp 1051at which point the data missing from 1052.Sq component1 1053would be reconstructed onto 1054.Pa /dev/sd3e . 1055.Pp 1056When more than one component is marked as 1057.Sq failed 1058due to a non-component hardware failure (e.g., loss of power to two 1059components, adapter problems, termination problems, or cabling issues) it 1060is quite possible to recover the data on the RAID set. 1061The first thing to be aware of is that the first disk to fail will 1062almost certainly be out-of-sync with the remainder of the array. 1063If any IO was performed between the time the first component is considered 1064.Sq failed 1065and when the second component is considered 1066.Sq failed , 1067then the first component to fail will 1068.Em not 1069contain correct data, and should be ignored. 1070When the second component is marked as failed, however, the RAID device will 1071(currently) panic the system. 1072At this point the data on the RAID set 1073(not including the first failed component) is still self consistent, 1074and will be in no worse state of repair than had the power gone out in 1075the middle of a write to a file system on a non-RAID device. 1076The problem, however, is that the component labels may now have 3 different 1077.Sq modification counters 1078(one value on the first component that failed, one value on the second 1079component that failed, and a third value on the remaining components). 1080In such a situation, the RAID set will not autoconfigure, 1081and can only be forcibly re-configured 1082with the 1083.Fl C 1084option. 1085To recover the RAID set, one must first remedy whatever physical 1086problem caused the multiple-component failure. 1087After that is done, the RAID set can be restored by forcibly 1088configuring the raid set 1089.Em without 1090the component that failed first. 
For example, if
.Pa /dev/sd1e
and
.Pa /dev/sd2e
fail (in that order) in a RAID set of the following configuration:
.Bd -literal -offset indent
START array
1 4 0

START disks
/dev/sd1e
/dev/sd2e
/dev/sd3e
/dev/sd4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5

START queue
fifo 100
.Ed
.Pp
then the following configuration (say "recover_raid0.conf")
.Bd -literal -offset indent
START array
1 4 0

START disks
/dev/sd6e
/dev/sd2e
/dev/sd3e
/dev/sd4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5

START queue
fifo 100
.Ed
.Pp
(where
.Pa /dev/sd6e
has no physical device) can be used with
.Bd -literal -offset indent
raidctl -C recover_raid0.conf raid0
.Ed
.Pp
to force the configuration of raid0.
A
.Bd -literal -offset indent
raidctl -I 12345 raid0
.Ed
.Pp
will be required in order to synchronize the component labels.
At this point the file systems on the RAID set can then be checked and
corrected.
To complete the reconstruction of the RAID set,
.Pa /dev/sd1e
is simply hot-added back into the array, and reconstructed
as described earlier.
.Ss RAID on RAID
RAID sets can be layered to create more complex and much larger RAID sets.
A RAID 0 set, for example, could be constructed from four RAID 5 sets.
The following configuration file shows such a setup:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 4 0

START disks
/dev/raid1e
/dev/raid2e
/dev/raid3e
/dev/raid4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
128 1 1 0

START queue
fifo 100
.Ed
.Pp
A similar configuration file might be used for a RAID 0 set
constructed from components on RAID 1 sets.
In such a configuration, the mirroring provides a high degree
of redundancy, while the striping provides additional speed benefits.
.Ss Auto-configuration and Root on RAID
RAID sets can also be auto-configured at boot.
To make a set auto-configurable,
simply prepare the RAID set as above, and then do a:
.Bd -literal -offset indent
raidctl -A yes raid0
.Ed
.Pp
to turn on auto-configuration for that set.
To turn off auto-configuration, use:
.Bd -literal -offset indent
raidctl -A no raid0
.Ed
.Pp
RAID sets which are auto-configurable will be configured before the
root file system is mounted.
These RAID sets are thus available for
use as a root file system, or for any other file system.
A primary advantage of using the auto-configuration is that RAID components
become more independent of the disks they reside on.
For example, SCSI ID's can change, but auto-configured sets will always be
configured correctly, even if the SCSI ID's of the component disks
have become scrambled.
.Pp
Having a system's root file system
.Pq Pa /
on a RAID set is also allowed, with the
.Sq a
partition of such a RAID set being used for
.Pa / .
To use raid0a as the root file system, simply use:
.Bd -literal -offset indent
raidctl -A root raid0
.Ed
.Pp
To return raid0a to be just an auto-configuring set, simply use the
.Fl A Ar yes
arguments.
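.Pp
Whether auto-configuration has taken effect can be confirmed by examining
the component labels.
As an illustrative check (assuming the
.Fl s
output format shown earlier), the command:
.Bd -literal -offset indent
raidctl -s raid0 | grep Autoconfig
.Ed
.Pp
should report
.Sq Autoconfig: Yes
for each component once the set has been marked auto-configurable.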
.Pp
Note that kernels can only be directly read from RAID 1 components on
architectures that support this
.Pq currently alpha, i386, pmax, sparc, sparc64, and vax .
On those architectures, partitions of type
.Dv FS_RAID
are recognized by the bootblocks, which will properly load the
kernel directly from a RAID 1 component.
For other architectures, or to support the root file system
on other RAID sets, some other mechanism must be used to get a kernel booting.
For example, a small partition containing only the secondary boot-blocks
and an alternate kernel (or two) could be used.
Once a kernel is booting, however, and an auto-configuring RAID set is
found that is eligible to be root, then that RAID set will be
auto-configured and used as the root device.
If two or more RAID sets claim to be root devices, then the
user will be prompted to select the root device.
At this time, RAID 0, 1, 4, and 5 sets are all supported as root devices.
.Pp
A typical RAID 1 setup with root on RAID might be as follows:
.Bl -enum
.It
wd0a - a small partition, which contains a complete, bootable, basic
.Nx
installation.
.It
wd1a - also contains a complete, bootable, basic
.Nx
installation.
.It
wd0e and wd1e - a RAID 1 set, raid0, used for the root file system.
.It
wd0f and wd1f - a RAID 1 set, raid1, which will be used only for
swap space.
.It
wd0g and wd1g - a RAID 1 set, raid2, used for
.Pa /usr ,
.Pa /home ,
or other data, if desired.
.It
wd0h and wd1h - a RAID 1 set, raid3, if desired.
.El
.Pp
RAID sets raid0, raid1, and raid2 are all marked as auto-configurable.
raid0 is marked as being a root file system.
When new kernels are installed, the kernel is not only copied to
.Pa / ,
but also to wd0a and wd1a.
The kernel on wd0a is required, since that
is the kernel the system boots from.
The kernel on wd1a is also
required, since that will be the kernel used should wd0 fail.
The important point here is to have redundant copies of the kernel
available, in the event that one of the drives fails.
.Pp
There is no requirement that the root file system be on the same disk
as the kernel.
For example, obtaining the kernel from wd0a, and using
sd0e and sd1e for raid0, and the root file system, is fine.
It
.Em is
critical, however, that there be multiple kernels available, in the
event of media failure.
.Pp
Multi-layered RAID devices (such as a RAID 0 set made
up of RAID 1 sets) are
.Em not
supported as root devices or auto-configurable devices at this point.
(Multi-layered RAID devices
.Em are
supported in general, however, as mentioned earlier.)
Note that in order to enable component auto-detection and
auto-configuration of RAID devices, the line:
.Bd -literal -offset indent
options RAID_AUTOCONFIG
.Ed
.Pp
must be in the kernel configuration file.
See
.Xr raid 4
for more details.
.Ss Swapping on RAID
A RAID device can be used as a swap device.
In order to ensure that a RAID device used as a swap device
is correctly unconfigured when the system is shut down or rebooted,
it is recommended that the line
.Bd -literal -offset indent
swapoff=YES
.Ed
.Pp
be added to
.Pa /etc/rc.conf .
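.Pp
For example, if the
.Sq b
partition of the raid1 set described above has been labeled as swap,
an illustrative
.Xr fstab 5
entry for it might look like:
.Bd -literal -offset indent
# illustrative entry; adjust the device name to match the actual setup
/dev/raid1b	none	swap	sw	0	0
.Ed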
.Ss Unconfiguration
The final operation performed by
.Nm
is to unconfigure a
.Xr raid 4
device.
This is accomplished via a simple:
.Bd -literal -offset indent
raidctl -u raid0
.Ed
.Pp
at which point the device is ready to be reconfigured.
.Ss Performance Tuning
Selection of the various parameter values which result in the best
performance can be quite tricky, and often requires a bit of
trial-and-error to get those values most appropriate for a given system.
A whole range of factors comes into play, including:
.Bl -enum
.It
Types of components (e.g., SCSI vs. IDE) and their bandwidth
.It
Types of controller cards and their bandwidth
.It
Distribution of components among controllers
.It
IO bandwidth
.It
File system access patterns
.It
CPU speed
.El
.Pp
As with most performance tuning, benchmarking under real-life loads
may be the only way to measure expected performance.
Understanding some of the underlying technology is also useful in tuning.
The goal of this section is to provide pointers to those parameters which may
make significant differences in performance.
.Pp
For a RAID 1 set, a SectPerSU value of 64 or 128 is typically sufficient.
Since data in a RAID 1 set is arranged in a linear
fashion on each component, selecting an appropriate stripe size is
somewhat less critical than it is for a RAID 5 set.
However, a stripe size that is too small will cause large IO's to be
broken up into a number of smaller ones, hurting performance.
At the same time, a large stripe size may cause problems with
concurrent accesses to stripes, which may also affect performance.
Thus values in the range of 32 to 128 are often the most effective.
.Pp
Tuning RAID 5 sets is trickier.
In the best case, IO is presented to the RAID set one stripe at a time.
Since the entire stripe is available at the beginning of the IO,
the parity of that stripe can be calculated before the stripe is written,
and then the stripe data and parity can be written in parallel.
When the amount of data being written is less than a full stripe's worth, the
.Sq small write
problem occurs.
Since a
.Sq small write
means only a portion of the stripe on the components is going to
change, the data (and parity) on the components must be updated
slightly differently.
First, the
.Sq old parity
and
.Sq old data
must be read from the components.
Then the new parity is constructed,
using the new data to be written, and the old data and old parity.
Finally, the new data and new parity are written.
All this extra data shuffling results in a serious loss of performance,
and is typically 2 to 4 times slower than a full stripe write (or read).
To combat this problem in the real world, it may be useful
to ensure that stripe sizes are small enough that a
.Sq large IO
from the system will use exactly one large stripe write.
As is seen later, there are some file system dependencies
which may come into play here as well.
.Pp
Since the size of a
.Sq large IO
is often (currently) only 32K or 64K, on a 5-drive RAID 5 set it may
be desirable to select a SectPerSU value of 16 blocks (8K) or 32
blocks (16K).
Since there are 4 data stripe units per stripe, the maximum
data per stripe is 64 blocks (32K) or 128 blocks (64K).
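.Pp
The arithmetic behind these numbers can be summarized as follows, using
the 5-drive RAID 5 example:
.Bd -literal -offset indent
5 components - 1 parity unit per stripe = 4 data stripe units per stripe
SectPerSU = 16 blocks (8K):  4 x  8K = 32K of data per stripe
SectPerSU = 32 blocks (16K): 4 x 16K = 64K of data per stripe
.Ed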
Again, empirical measurement will provide the best indicators of which
values will yield better performance.
.Pp
The parameters used for the file system are also critical to good performance.
For
.Xr newfs 8 ,
for example, increasing the block size to 32K or 64K may improve
performance dramatically.
As well, changing the cylinders-per-group
parameter from 16 to 32 or higher is often not only necessary for
larger file systems, but may also have positive performance implications.
.Ss Summary
Despite the length of this manual page, configuring a RAID set is a
relatively straightforward process.
All that needs to be done is to follow these steps:
.Bl -enum
.It
Use
.Xr disklabel 8
to create the components (of type RAID).
.It
Construct a RAID configuration file, e.g.,
.Pa raid0.conf
.It
Configure the RAID set with:
.Bd -literal -offset indent
raidctl -C raid0.conf raid0
.Ed
.Pp
.It
Initialize the component labels with:
.Bd -literal -offset indent
raidctl -I 123456 raid0
.Ed
.Pp
.It
Initialize other important parts of the set with:
.Bd -literal -offset indent
raidctl -i raid0
.Ed
.Pp
.It
Get the default label for the RAID set:
.Bd -literal -offset indent
disklabel raid0 \*[Gt] /tmp/label
.Ed
.Pp
.It
Edit the label:
.Bd -literal -offset indent
vi /tmp/label
.Ed
.Pp
.It
Put the new label on the RAID set:
.Bd -literal -offset indent
disklabel -R -r raid0 /tmp/label
.Ed
.Pp
.It
Create the file system:
.Bd -literal -offset indent
newfs /dev/rraid0e
.Ed
.Pp
.It
Mount the file system:
.Bd -literal -offset indent
mount /dev/raid0e /mnt
.Ed
.Pp
.It
Use:
.Bd -literal -offset indent
raidctl -c raid0.conf raid0
.Ed
.Pp
to re-configure the RAID set the next time it is needed, or put
.Pa raid0.conf
into
.Pa /etc
where it will automatically be started by the
.Pa /etc/rc.d
scripts.
.El
.Sh SEE ALSO
.Xr ccd 4 ,
.Xr raid 4 ,
.Xr rc 8
.Sh HISTORY
RAIDframe is a framework for rapid prototyping of RAID structures
developed by the folks at the Parallel Data Laboratory at Carnegie
Mellon University (CMU).
A more complete description of the internals and functionality of
RAIDframe is found in the paper "RAIDframe: A Rapid Prototyping Tool
for RAID Systems", by William V. Courtright II, Garth Gibson, Mark
Holland, LeAnn Neal Reilly, and Jim Zelenka, and published by the
Parallel Data Laboratory of Carnegie Mellon University.
.Pp
The
.Nm
command first appeared as a program in CMU's RAIDframe v1.1 distribution.
This version of
.Nm
is a complete re-write, and first appeared in
.Nx 1.4 .
.Sh COPYRIGHT
.Bd -literal
The RAIDframe Copyright is as follows:

Copyright (c) 1994-1996 Carnegie-Mellon University.
All rights reserved.

Permission to use, copy, modify and distribute this software and
its documentation is hereby granted, provided that both the copyright
notice and this permission notice appear in all copies of the
software, derivative works or modified versions, and any portions
thereof, and that both notices appear in supporting documentation.

CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
CONDITION.
CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.

Carnegie Mellon requests users of this software to return to

	Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU
	School of Computer Science
	Carnegie Mellon University
	Pittsburgh PA 15213-3890

any improvements or extensions that they make and grant Carnegie the
rights to redistribute these changes.
.Ed
.Sh WARNINGS
Certain RAID levels (1, 4, 5, 6, and others) can protect against some
data loss due to component failure.
However, the loss of two components of a RAID 4 or 5 system,
or the loss of a single component of a RAID 0 system, will
result in the entire file system being lost.
RAID is
.Em NOT
a substitute for good backup practices.
.Pp
Recomputation of parity
.Em MUST
be performed whenever there is a chance that it may have been compromised.
This includes after system crashes, or before a RAID
device has been used for the first time.
Failure to keep parity correct will be catastrophic should a
component ever fail \(em it is better to use RAID 0 and get the
additional space and speed, than it is to use parity, but
not keep the parity correct.
At least with RAID 0 there is no perception of increased data security.
.Sh BUGS
Hot-spare removal is currently not available.