1.\" $NetBSD: raidctl.8,v 1.35 2003/02/25 10:35:08 wiz Exp $ 2.\" 3.\" Copyright (c) 1998, 2002 The NetBSD Foundation, Inc. 4.\" All rights reserved. 5.\" 6.\" This code is derived from software contributed to The NetBSD Foundation 7.\" by Greg Oster 8.\" 9.\" Redistribution and use in source and binary forms, with or without 10.\" modification, are permitted provided that the following conditions 11.\" are met: 12.\" 1. Redistributions of source code must retain the above copyright 13.\" notice, this list of conditions and the following disclaimer. 14.\" 2. Redistributions in binary form must reproduce the above copyright 15.\" notice, this list of conditions and the following disclaimer in the 16.\" documentation and/or other materials provided with the distribution. 17.\" 3. All advertising materials mentioning features or use of this software 18.\" must display the following acknowledgement: 19.\" This product includes software developed by the NetBSD 20.\" Foundation, Inc. and its contributors. 21.\" 4. Neither the name of The NetBSD Foundation nor the names of its 22.\" contributors may be used to endorse or promote products derived 23.\" from this software without specific prior written permission. 24.\" 25.\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS 26.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED 27.\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 28.\" PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS 29.\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 30.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 31.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 32.\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 33.\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 34.\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 35.\" POSSIBILITY OF SUCH DAMAGE. 36.\" 37.\" 38.\" Copyright (c) 1995 Carnegie-Mellon University. 39.\" All rights reserved. 40.\" 41.\" Author: Mark Holland 42.\" 43.\" Permission to use, copy, modify and distribute this software and 44.\" its documentation is hereby granted, provided that both the copyright 45.\" notice and this permission notice appear in all copies of the 46.\" software, derivative works or modified versions, and any portions 47.\" thereof, and that both notices appear in supporting documentation. 48.\" 49.\" CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS" 50.\" CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND 51.\" FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE. 52.\" 53.\" Carnegie Mellon requests users of this software to return to 54.\" 55.\" Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU 56.\" School of Computer Science 57.\" Carnegie Mellon University 58.\" Pittsburgh PA 15213-3890 59.\" 60.\" any improvements or extensions that they make and grant Carnegie the 61.\" rights to redistribute these changes. 
62.\" 63.Dd July 10, 2001 64.Dt RAIDCTL 8 65.Os 66.Sh NAME 67.Nm raidctl 68.Nd configuration utility for the RAIDframe disk driver 69.Sh SYNOPSIS 70.Nm 71.Op Fl v 72.Fl a Ar component Ar dev 73.Nm 74.Op Fl v 75.Fl A Op yes | no | root 76.Ar dev 77.Nm 78.Op Fl v 79.Fl B Ar dev 80.Nm 81.Op Fl v 82.Fl c Ar config_file Ar dev 83.Nm 84.Op Fl v 85.Fl C Ar config_file Ar dev 86.Nm 87.Op Fl v 88.Fl f Ar component Ar dev 89.Nm 90.Op Fl v 91.Fl F Ar component Ar dev 92.Nm 93.Op Fl v 94.Fl g Ar component Ar dev 95.Nm 96.Op Fl v 97.Fl G Ar dev 98.Nm 99.Op Fl v 100.Fl i Ar dev 101.Nm 102.Op Fl v 103.Fl I Ar serial_number Ar dev 104.Nm 105.Op Fl v 106.Fl p Ar dev 107.Nm 108.Op Fl v 109.Fl P Ar dev 110.Nm 111.Op Fl v 112.Fl r Ar component Ar dev 113.Nm 114.Op Fl v 115.Fl R Ar component Ar dev 116.Nm 117.Op Fl v 118.Fl s Ar dev 119.Nm 120.Op Fl v 121.Fl S Ar dev 122.Nm 123.Op Fl v 124.Fl u Ar dev 125.Sh DESCRIPTION 126.Nm 127is the user-land control program for 128.Xr raid 4 , 129the RAIDframe disk device. 130.Nm 131is primarily used to dynamically configure and unconfigure RAIDframe disk 132devices. 133For more information about the RAIDframe disk device, see 134.Xr raid 4 . 135.Pp 136This document assumes the reader has at least rudimentary knowledge of 137RAID and RAID concepts. 138.Pp 139The command-line options for 140.Nm 141are as follows: 142.Bl -tag -width indent 143.It Fl a Ar component Ar dev 144Add 145.Ar component 146as a hot spare for the device 147.Ar dev . 148.It Fl A Ic yes Ar dev 149Make the RAID set auto-configurable. 150The RAID set will be automatically configured at boot 151.Ar before 152the root file system is mounted. 153Note that all components of the set must be of type 154.Dv RAID 155in the disklabel. 156.It Fl A Ic no Ar dev 157Turn off auto-configuration for the RAID set. 158.It Fl A Ic root Ar dev 159Make the RAID set auto-configurable, and also mark the set as being 160eligible to be the root partition. 161A RAID set configured this way will 162.Ar override 163the use of the boot disk as the root device. 164All components of the set must be of type 165.Dv RAID 166in the disklabel. 167Note that the kernel being booted must currently reside on a non-RAID set. 168.It Fl B Ar dev 169Initiate a copyback of reconstructed data from a spare disk to 170its original disk. 171This is performed after a component has failed, 172and the failed drive has been reconstructed onto a spare drive. 173.It Fl c Ar config_file Ar dev 174Configure the RAIDframe device 175.Ar dev 176according to the configuration given in 177.Ar config_file . 178A description of the contents of 179.Ar config_file 180is given later. 181.It Fl C Ar config_file Ar dev 182As for 183.Fl c , 184but forces the configuration to take place. 185This is required the first time a RAID set is configured. 186.It Fl f Ar component Ar dev 187This marks the specified 188.Ar component 189as having failed, but does not initiate a reconstruction of that component. 190.It Fl F Ar component Ar dev 191Fails the specified 192.Ar component 193of the device, and immediately begin a reconstruction of the failed 194disk onto an available hot spare. 195This is one of the mechanisms used to start 196the reconstruction process if a component does have a hardware failure. 197.It Fl g Ar component Ar dev 198Get the component label for the specified component. 199.It Fl G Ar dev 200Generate the configuration of the RAIDframe device in a format suitable for 201use with the 202.Fl c 203or 204.Fl C 205options. 
.It Fl i Ar dev
Initialize the RAID device.
In particular, (re-)write the parity on the selected device.
This
.Em MUST
be done for
.Em all
RAID sets before the RAID device is labeled and before
file systems are created on the RAID device.
.It Fl I Ar serial_number Ar dev
Initialize the component labels on each component of the device.
.Ar serial_number
is used as one of the keys in determining whether a
particular set of components belong to the same RAID set.
While not strictly enforced, different serial numbers should be used for
different RAID sets.
This step
.Em MUST
be performed when a new RAID set is created.
.It Fl p Ar dev
Check the status of the parity on the RAID set.
Displays a status message,
and returns successfully if the parity is up-to-date.
.It Fl P Ar dev
Check the status of the parity on the RAID set, and initialize
(re-write) the parity if the parity is not known to be up-to-date.
This is normally used after a system crash (and before a
.Xr fsck 8 )
to ensure the integrity of the parity.
.It Fl r Ar component Ar dev
Remove the spare disk specified by
.Ar component
from the set of available spare components.
.It Fl R Ar component Ar dev
Fails the specified
.Ar component ,
if necessary, and immediately begins a reconstruction back to
.Ar component .
This is useful for reconstructing back onto a component after
it has been replaced following a failure.
.It Fl s Ar dev
Display the status of the RAIDframe device for each of the components
and spares.
.It Fl S Ar dev
Check the status of parity re-writing, component reconstruction, and
component copyback.
The output indicates the amount of progress
achieved in each of these areas.
.It Fl u Ar dev
Unconfigure the RAIDframe device.
.It Fl v
Be more verbose.
For operations such as reconstructions, parity
re-writing, and copybacks, provide a progress indicator.
.El
.Pp
The device used by
.Nm
is specified by
.Ar dev .
.Ar dev
may be either the full name of the device, e.g.,
.Pa /dev/rraid0d ,
for the i386 architecture, or
.Pa /dev/rraid0c
for many others, or just simply
.Pa raid0
(for
.Pa /dev/rraid0[cd] ) .
It is recommended that the partitions used to represent the
RAID device are not used for file systems.
.Ss Configuration file
The format of the configuration file is complex, and
only an abbreviated treatment is given here.
In the configuration files, a
.Sq #
indicates the beginning of a comment.
.Pp
There are 4 required sections of a configuration file, and 2
optional sections.
Each section begins with a
.Sq START ,
followed by the section name,
and the configuration parameters associated with that section.
The first section is the
.Sq array
section, and it specifies
the number of rows, columns, and spare disks in the RAID set.
For example:
.Bd -literal -offset indent
START array
1 3 0
.Ed
.Pp
indicates an array with 1 row, 3 columns, and 0 spare disks.
Note that although multi-dimensional arrays may be specified, they are
.Em NOT
supported in the driver.
.Pp
The second section, the
.Sq disks
section, specifies the actual components of the device.
For example:
.Bd -literal -offset indent
START disks
/dev/sd0e
/dev/sd1e
/dev/sd2e
.Ed
.Pp
specifies the three component disks to be used in the RAID device.
If any of the specified drives cannot be found when the RAID device is
configured, then they will be marked as
.Sq failed ,
and the system will operate in degraded mode.
Note that it is
.Em imperative
that the order of the components in the configuration file does not
change between configurations of a RAID device.
Changing the order of the components will result in data loss
if the set is configured with the
.Fl C
option.
In normal circumstances, the RAID set will not configure if only
.Fl c
is specified, and the components are out-of-order.
.Pp
The next section, which is the
.Sq spare
section, is optional, and, if present, specifies the devices to be used as
.Sq hot spares
\(em devices which are on-line,
but are not actively used by the RAID driver unless
one of the main components fails.
A simple
.Sq spare
section might be:
.Bd -literal -offset indent
START spare
/dev/sd3e
.Ed
.Pp
for a configuration with a single spare component.
If no spare drives are to be used in the configuration, then the
.Sq spare
section may be omitted.
.Pp
The next section is the
.Sq layout
section.
This section describes the general layout parameters for the RAID device,
and provides such information as
sectors per stripe unit,
stripe units per parity unit,
stripe units per reconstruction unit,
and the parity configuration to use.
This section might look like:
.Bd -literal -offset indent
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
32 1 1 5
.Ed
.Pp
The sectors per stripe unit specifies, in blocks, the interleave
factor; i.e., the number of contiguous sectors to be written to each
component for a single stripe.
Appropriate selection of this value (32 in this example)
is the subject of much research in RAID architectures.
The stripe units per parity unit and
stripe units per reconstruction unit are normally each set to 1.
While certain values above 1 are permitted, a discussion of valid
values and the consequences of using anything other than 1 is outside
the scope of this document.
The last value in this section (5 in this example)
indicates the parity configuration desired.
Valid entries include:
.Bl -tag -width inde
.It 0
RAID level 0.
No parity, only simple striping.
.It 1
RAID level 1.
Mirroring.
The parity is the mirror.
.It 4
RAID level 4.
Striping across components, with parity stored on the last component.
.It 5
RAID level 5.
Striping across components, parity distributed across all components.
.El
.Pp
There are other valid entries here, including those for Even-Odd
parity, RAID level 5 with rotated sparing, Chained declustering,
and Interleaved declustering, but as of this writing the code for
those parity operations has not been tested with
.Nx .
.Pp
The next required section is the
.Sq queue
section.
This is most often specified as:
.Bd -literal -offset indent
START queue
fifo 100
.Ed
.Pp
where the queuing method is specified as fifo (first-in, first-out),
and the size of the per-component queue is limited to 100 requests.
Other queuing methods may also be specified, but a discussion of them
is beyond the scope of this document.
.Pp
The final section, the
.Sq debug
section, is optional.
For more details on this the reader is referred to
the RAIDframe documentation discussed in the
.Sx HISTORY
section.
.Pp
See
.Sx EXAMPLES
for a more complete configuration file example.
.Sh FILES
.Bl -tag -width /dev/XXrXraidX -compact
.It Pa /dev/{,r}raid*
.Cm raid
device special files.
.El
.Sh EXAMPLES
It is highly recommended that before using the RAID driver for real
file systems, the system administrator(s) become quite familiar
with the use of
.Nm ,
and that they understand how the component reconstruction process works.
The examples in this section will focus on configuring a
number of different RAID sets of varying degrees of redundancy.
By working through these examples, administrators should be able to
develop a good feel for how to configure a RAID set, and how to
initiate reconstruction of failed components.
.Pp
In the following examples
.Sq raid0
will be used to denote the RAID device.
Depending on the architecture,
.Pa /dev/rraid0c
or
.Pa /dev/rraid0d
may be used in place of
.Pa raid0 .
.Ss Initialization and Configuration
The initial step in configuring a RAID set is to identify the components
that will be used in the RAID set.
All components should be the same size.
Each component should have a disklabel type of
.Dv FS_RAID ,
and a typical disklabel entry for a RAID component might look like:
.Bd -literal -offset indent
f:  1800000  200495     RAID                  # (Cyl. 405*- 4041*)
.Ed
.Pp
While
.Dv FS_BSDFFS
will also work as the component type, the type
.Dv FS_RAID
is preferred for RAIDframe use, as it is required for features such as
auto-configuration.
As part of the initial configuration of each RAID set,
each component will be given a
.Sq component label .
A
.Sq component label
contains important information about the component, including a
user-specified serial number, the row and column of that component in
the RAID set, the redundancy level of the RAID set, a
.Sq modification counter ,
and whether the parity information (if any) on that
component is known to be correct.
Component labels are an integral part of the RAID set,
since they are used to ensure that components
are configured in the correct order, and used to keep track of other
vital information about the RAID set.
Component labels are also required for the auto-detection
and auto-configuration of RAID sets at boot time.
For a component label to be considered valid, that
particular component label must be in agreement with the other
component labels in the set.
For example, the serial number,
.Sq modification counter ,
number of rows and number of columns must all be in agreement.
If any of these are different, then the component is
not considered to be part of the set.
See
.Xr raid 4
for more information about component labels.
.Pp
Once the components have been identified, and the disks have
appropriate labels,
.Nm
is then used to configure the
.Xr raid 4
device.
To configure the device, a configuration file which looks something like:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 3 1

START disks
/dev/sd1e
/dev/sd2e
/dev/sd3e

START spare
/dev/sd4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
32 1 1 5

START queue
fifo 100
.Ed
.Pp
is created.
The above configuration file specifies a RAID 5
set consisting of the components
.Pa /dev/sd1e ,
.Pa /dev/sd2e ,
and
.Pa /dev/sd3e ,
with
.Pa /dev/sd4e
available as a
.Sq hot spare
in case one of the three main drives should fail.
A RAID 0 set would be specified in a similar way:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 4 0

START disks
/dev/sd10e
/dev/sd11e
/dev/sd12e
/dev/sd13e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
64 1 1 0

START queue
fifo 100
.Ed
.Pp
In this case, devices
.Pa /dev/sd10e ,
.Pa /dev/sd11e ,
.Pa /dev/sd12e ,
and
.Pa /dev/sd13e
are the components that make up this RAID set.
Note that there are no hot spares for a RAID 0 set,
since there is no way to recover data if any of the components fail.
.Pp
For a RAID 1 (mirror) set, the following configuration might be used:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 2 0

START disks
/dev/sd20e
/dev/sd21e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1

START queue
fifo 100
.Ed
.Pp
In this case,
.Pa /dev/sd20e
and
.Pa /dev/sd21e
are the two components of the mirror set.
While no hot spares have been specified in this
configuration, they easily could be, just as they were specified in
the RAID 5 case above.
Note as well that RAID 1 sets are currently limited to only 2 components.
At present, n-way mirroring is not possible.
.Pp
The first time a RAID set is configured, the
.Fl C
option must be used:
.Bd -literal -offset indent
raidctl -C raid0.conf raid0
.Ed
.Pp
where
.Pa raid0.conf
is the name of the RAID configuration file.
The
.Fl C
option forces the configuration to succeed, even if any of the component
labels are incorrect.
The
.Fl C
option should not be used lightly in
situations other than initial configurations, as if
the system is refusing to configure a RAID set, there is probably a
very good reason for it.
After the initial configuration is done (and
appropriate component labels are added with the
.Fl I
option), raid0 can be configured normally with:
.Bd -literal -offset indent
raidctl -c raid0.conf raid0
.Ed
.Pp
When the RAID set is configured for the first time, it is
necessary to initialize the component labels, and to initialize the
parity on the RAID set.
Initializing the component labels is done with:
.Bd -literal -offset indent
raidctl -I 112341 raid0
.Ed
.Pp
where
.Sq 112341
is a user-specified serial number for the RAID set.
This initialization step is
.Em required
for all RAID sets.
As well, using different serial numbers between RAID sets is
.Em strongly encouraged ,
as using the same serial number for all RAID sets will only serve to
decrease the usefulness of the component label checking.
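.Pp
One purely illustrative convention is to derive the serial number from
the date each set is created, which keeps the values distinct across
sets with no extra bookkeeping:
.Bd -literal -offset indent
raidctl -I 20020101 raid0
raidctl -I 20020102 raid1
.Ed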
.Pp
Initializing the RAID set is done via the
.Fl i
option.
This initialization
.Em MUST
be done for
.Em all
RAID sets, since among other things it verifies that the parity (if
any) on the RAID set is correct.
Since this initialization may be quite time-consuming, the
.Fl v
option may also be used in conjunction with
.Fl i :
.Bd -literal -offset indent
raidctl -iv raid0
.Ed
.Pp
This will give more verbose output on the
status of the initialization:
.Bd -literal -offset indent
Initiating re-write of parity
Parity Re-write status:
 10% |****                                   | ETA:    06:03 /
.Ed
.Pp
The output provides a
.Sq Percent Complete
in both a numeric and graphical format, as well as an estimated time
to completion of the operation.
.Pp
Since it is the parity that provides the
.Sq redundancy
part of RAID, it is critical that the parity be correct whenever possible.
If the parity is not correct, then there is no
guarantee that data will not be lost if a component fails.
.Pp
Once the parity is known to be correct, it is then safe to perform
.Xr disklabel 8 ,
.Xr newfs 8 ,
or
.Xr fsck 8
on the device or its file systems, and then to mount the file systems
for use.
.Pp
Under certain circumstances (e.g., the additional component has not
arrived, or data is being migrated off of a disk destined to become a
component) it may be desirable to configure a RAID 1 set with only
a single component.
This can be achieved by configuring the set with a physically existing
component (as either the first or second component) and with a
.Sq fake
component.
In the following:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 2 0

START disks
/dev/sd6e
/dev/sd0e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1

START queue
fifo 100
.Ed
.Pp
.Pa /dev/sd0e
is the real component, and will be the second disk of a RAID 1 set.
The component
.Pa /dev/sd6e ,
which must exist, but have no physical device associated with it,
is simply used as a placeholder.
Configuration (using
.Fl C
and
.Fl I Ar 12345
as above) proceeds normally, but initialization of the RAID set will
have to wait until all physical components are present.
After configuration, this set can be used normally, but will be operating
in degraded mode.
Once a second physical component is obtained, it can be hot-added,
the existing data mirrored, and normal operation resumed.
.Ss Maintenance of the RAID set
After the parity has been initialized for the first time, the command:
.Bd -literal -offset indent
raidctl -p raid0
.Ed
.Pp
can be used to check the current status of the parity.
To check the parity and rebuild it if necessary (for example,
after an unclean shutdown) the command:
.Bd -literal -offset indent
raidctl -P raid0
.Ed
.Pp
is used.
Note that re-writing the parity can be done while
other operations on the RAID set are taking place (e.g., while doing a
.Xr fsck 8
on a file system on the RAID set).
However, for maximum effectiveness of the RAID set, the parity should be
known to be correct before any data on the set is modified.
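.Pp
Since
.Fl v
also provides a progress indicator for parity re-writing, it can be
combined with
.Fl P
to watch any parity rebuild triggered by the check (a sketch; the
output format matches the
.Fl i
example above):
.Bd -literal -offset indent
raidctl -Pv raid0
.Ed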
.Pp
To see how the RAID set is doing, the following command can be used to
show the RAID set's status:
.Bd -literal -offset indent
raidctl -s raid0
.Ed
.Pp
The output will look something like:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: optimal
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
Component label for /dev/sd1e:
   Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
Component label for /dev/sd2e:
   Row: 0 Column: 1 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
Component label for /dev/sd3e:
   Row: 0 Column: 2 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
This indicates that all is well with the RAID set.
Of importance here are the component lines which read
.Sq optimal ,
and the
.Sq Parity status
line which indicates that the parity is up-to-date.
Note that if there are file systems open on the RAID set,
the individual components will not be
.Sq clean
but the set as a whole can still be clean.
.Pp
To check the component label of
.Pa /dev/sd1e ,
the following is used:
.Bd -literal -offset indent
raidctl -g /dev/sd1e raid0
.Ed
.Pp
The output of this command will look something like:
.Bd -literal -offset indent
Component label for /dev/sd1e:
   Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
.Ed
.Ss Dealing with Component Failures
If for some reason
(perhaps to test reconstruction) it is necessary to pretend a drive
has failed, the following will perform that function:
.Bd -literal -offset indent
raidctl -f /dev/sd2e raid0
.Ed
.Pp
The system will then be performing all operations in degraded mode,
where missing data is re-computed from existing data and the parity.
In this case, obtaining the status of raid0 will return (in part):
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: failed
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
.Ed
.Pp
Note that with the use of
.Fl f
a reconstruction has not been started.
To both fail the disk and start a reconstruction, the
.Fl F
option must be used:
.Bd -literal -offset indent
raidctl -F /dev/sd2e raid0
.Ed
.Pp
The
.Fl f
option may be used first, and then the
.Fl F
option used later, on the same disk, if desired.
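.Pp
A brief sketch of that two-step sequence, using the same component
as above:
.Bd -literal -offset indent
raidctl -f /dev/sd2e raid0    # mark the component as failed
raidctl -F /dev/sd2e raid0    # later: fail it again and reconstruct
.Ed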
Immediately after the reconstruction is started, the status will report:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: reconstructing
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 10% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
This indicates that a reconstruction is in progress.
To find out how the reconstruction is progressing the
.Fl S
option may be used.
This will indicate the progress in terms of the
percentage of the reconstruction that is completed.
When the reconstruction is finished the
.Fl s
option will show:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: spared
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
At this point there are at least two options.
First, if
.Pa /dev/sd2e
is known to be good (i.e., the failure was either caused by
.Fl f
or
.Fl F ,
or the failed disk was replaced), then a copyback of the data can
be initiated with the
.Fl B
option.
In this example, this would copy the entire contents of
.Pa /dev/sd4e
to
.Pa /dev/sd2e .
Once the copyback procedure is complete, the
status of the device would be (in part):
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: optimal
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
.Ed
.Pp
and the system is back to normal operation.
.Pp
The second option after the reconstruction is to simply use
.Pa /dev/sd4e
in place of
.Pa /dev/sd2e
in the configuration file.
For example, the configuration file (in part) might now look like:
.Bd -literal -offset indent
START array
1 3 0

START disks
/dev/sd1e
/dev/sd4e
/dev/sd3e
.Ed
.Pp
This can be done as
.Pa /dev/sd4e
is completely interchangeable with
.Pa /dev/sd2e
at this point.
Note that extreme care must be taken when
changing the order of the drives in a configuration.
This is one of the few instances where the devices and/or
their orderings can be changed without loss of data!
In general, the ordering of components in a configuration file should
.Em never
be changed.
.Pp
If a component fails and there are no hot spares
available on-line, the status of the RAID set might (in part) look like:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: failed
           /dev/sd3e: optimal
No spares.
.Ed
.Pp
In this case there are a number of options.
The first option is to add a hot spare using:
.Bd -literal -offset indent
raidctl -a /dev/sd4e raid0
.Ed
.Pp
After the hot add, the status would then be:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: failed
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
.Ed
.Pp
Reconstruction could then take place using
.Fl F
as described above.
.Pp
A second option is to rebuild directly onto
.Pa /dev/sd2e .
Once the disk containing
.Pa /dev/sd2e
has been replaced, one can simply use:
.Bd -literal -offset indent
raidctl -R /dev/sd2e raid0
.Ed
.Pp
to rebuild the
.Pa /dev/sd2e
component.
As the rebuilding is in progress, the status will be:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: reconstructing
           /dev/sd3e: optimal
No spares.
.Ed
.Pp
and when completed, will be:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: optimal
           /dev/sd3e: optimal
No spares.
.Ed
.Pp
In circumstances where a particular component is completely
unavailable after a reboot, a special component name will be used to
indicate the missing component.
For example:
.Bd -literal -offset indent
Components:
           /dev/sd2e: optimal
          component1: failed
No spares.
.Ed
.Pp
indicates that the second component of this RAID set was not detected
at all by the auto-configuration code.
The name
.Sq component1
can be used anywhere a normal component name would be used.
For example, to add a hot spare to the above set, and rebuild to that hot
spare, the following could be done:
.Bd -literal -offset indent
raidctl -a /dev/sd3e raid0
raidctl -F component1 raid0
.Ed
.Pp
at which point the data missing from
.Sq component1
would be reconstructed onto
.Pa /dev/sd3e .
.Pp
When more than one component is marked as
.Sq failed
due to a non-component hardware failure (e.g., loss of power to two
components, adapter problems, termination problems, or cabling issues) it
is quite possible to recover the data on the RAID set.
The first thing to be aware of is that the first disk to fail will
almost certainly be out-of-sync with the remainder of the array.
If any IO was performed between the time the first component is considered
.Sq failed
and when the second component is considered
.Sq failed ,
then the first component to fail will
.Em not
contain correct data, and should be ignored.
When the second component is marked as failed, however, the RAID device will
(currently) panic the system.
At this point the data on the RAID set
(not including the first failed component) is still self-consistent,
and will be in no worse state of repair than had the power gone out in
the middle of a write to a file system on a non-RAID device.
The problem, however, is that the component labels may now have 3 different
.Sq modification counters
(one value on the first component that failed, one value on the second
component that failed, and a third value on the remaining components).
In such a situation, the RAID set will not autoconfigure,
and can only be forcibly re-configured
with the
.Fl C
option.
To recover the RAID set, one must first remedy whatever physical
problem caused the multiple-component failure.
After that is done, the RAID set can be restored by forcibly
configuring the raid set
.Em without
the component that failed first.
For example, if
.Pa /dev/sd1e
and
.Pa /dev/sd2e
fail (in that order) in a RAID set of the following configuration:
.Bd -literal -offset indent
START array
1 4 0

START disks
/dev/sd1e
/dev/sd2e
/dev/sd3e
/dev/sd4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5

START queue
fifo 100
.Ed
.Pp
then the following configuration (say "recover_raid0.conf")
.Bd -literal -offset indent
START array
1 4 0

START disks
/dev/sd6e
/dev/sd2e
/dev/sd3e
/dev/sd4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5

START queue
fifo 100
.Ed
.Pp
(where
.Pa /dev/sd6e
has no physical device) can be used with
.Bd -literal -offset indent
raidctl -C recover_raid0.conf raid0
.Ed
.Pp
to force the configuration of raid0.
A
.Bd -literal -offset indent
raidctl -I 12345 raid0
.Ed
.Pp
will be required in order to synchronize the component labels.
At this point the file systems on the RAID set can then be checked and
corrected.
To complete the re-construction of the RAID set,
.Pa /dev/sd1e
is simply hot-added back into the array, and reconstructed
as described earlier.
.Ss RAID on RAID
RAID sets can be layered to create more complex and much larger RAID sets.
A RAID 0 set, for example, could be constructed from four RAID 5 sets.
The following configuration file shows such a setup:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 4 0

START disks
/dev/raid1e
/dev/raid2e
/dev/raid3e
/dev/raid4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
128 1 1 0

START queue
fifo 100
.Ed
.Pp
A similar configuration file might be used for a RAID 0 set
constructed from components on RAID 1 sets.
In such a configuration, the mirroring provides a high degree
of redundancy, while the striping provides additional speed benefits.
.Ss Auto-configuration and Root on RAID
RAID sets can also be auto-configured at boot.
To make a set auto-configurable,
simply prepare the RAID set as above, and then do a:
.Bd -literal -offset indent
raidctl -A yes raid0
.Ed
.Pp
to turn on auto-configuration for that set.
To turn off auto-configuration, use:
.Bd -literal -offset indent
raidctl -A no raid0
.Ed
.Pp
RAID sets which are auto-configurable will be configured before the
root file system is mounted.
These RAID sets are thus available for
use as a root file system, or for any other file system.
A primary advantage of using the auto-configuration is that RAID components
become more independent of the disks they reside on.
For example, SCSI IDs can change, but auto-configured sets will always be
configured correctly, even if the SCSI IDs of the component disks
have become scrambled.
.Pp
Having a system's root file system
.Pq Pa /
on a RAID set is also allowed, with the
.Sq a
partition of such a RAID set being used for
.Pa / .
To use raid0a as the root file system, simply use:
.Bd -literal -offset indent
raidctl -A root raid0
.Ed
.Pp
To return raid0a to being just an auto-configuring set, simply use the
.Fl A Ar yes
arguments.
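.Pp
To confirm that the setting took effect, the
.Sq Autoconfig
field of each component label can be checked, for example with (a
sketch using standard shell tools):
.Bd -literal -offset indent
raidctl -s raid0 | grep Autoconfig
.Ed
.Pp
which should report
.Sq Autoconfig: Yes
for every component of an auto-configurable set.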
.Pp
Note that kernels can only be directly read from RAID 1 components on
alpha and pmax architectures.
On those architectures, the
.Dv FS_RAID
file system is recognized by the bootblocks, and will properly load the
kernel directly from a RAID 1 component.
For other architectures, or to support the root file system
on other RAID sets, some other mechanism must be used to get a kernel booting.
For example, a small partition containing only the secondary boot-blocks
and an alternate kernel (or two) could be used.
Once a kernel is booting, however, and an auto-configuring RAID set is
found that is eligible to be root, then that RAID set will be
auto-configured and used as the root device.
If two or more RAID sets claim to be root devices, then the
user will be prompted to select the root device.
At this time, RAID 0, 1, 4, and 5 sets are all supported as root devices.
.Pp
A typical RAID 1 setup with root on RAID might be as follows:
.Bl -enum
.It
wd0a - a small partition, which contains a complete, bootable, basic
.Nx
installation.
.It
wd1a - also contains a complete, bootable, basic
.Nx
installation.
.It
wd0e and wd1e - a RAID 1 set, raid0, used for the root file system.
.It
wd0f and wd1f - a RAID 1 set, raid1, which will be used only for
swap space.
.It
wd0g and wd1g - a RAID 1 set, raid2, used for
.Pa /usr ,
.Pa /home ,
or other data, if desired.
.It
wd0h and wd1h - a RAID 1 set, raid3, if desired.
.El
.Pp
RAID sets raid0, raid1, and raid2 are all marked as auto-configurable.
raid0 is marked as being a root file system.
When new kernels are installed, the kernel is not only copied to
.Pa / ,
but also to wd0a and wd1a.
The kernel on wd0a is required, since that
is the kernel the system boots from.
The kernel on wd1a is also
required, since that will be the kernel used should wd0 fail.
The important point here is to have redundant copies of the kernel
available, in the event that one of the drives fails.
.Pp
There is no requirement that the root file system be on the same disk
as the kernel.
For example, obtaining the kernel from wd0a, and using
sd0e and sd1e for raid0, and the root file system, is fine.
It
.Em is
critical, however, that there be multiple kernels available, in the
event of media failure.
.Pp
Multi-layered RAID devices (such as a RAID 0 set made
up of RAID 1 sets) are
.Em not
supported as root devices or auto-configurable devices at this point.
(Multi-layered RAID devices
.Em are
supported in general, however, as mentioned earlier.)
Note that in order to enable component auto-detection and
auto-configuration of RAID devices, the line:
.Bd -literal -offset indent
options RAID_AUTOCONFIG
.Ed
.Pp
must be in the kernel configuration file.
See
.Xr raid 4
for more details.
.Ss Swapping on RAID
A RAID device can be used as a swap device.
In order to ensure that a RAID device used as a swap device
is correctly unconfigured when the system is shutdown or rebooted,
it is recommended that the line
.Bd -literal -offset indent
swapoff=YES
.Ed
.Pp
be added to
.Pa /etc/rc.conf .
.Ss Unconfiguration
The final operation performed by
.Nm
is to unconfigure a
.Xr raid 4
device.
This is accomplished via a simple:
.Bd -literal -offset indent
raidctl -u raid0
.Ed
.Pp
at which point the device is ready to be reconfigured.
.Ss Performance Tuning
Selection of the various parameter values which result in the best
performance can be quite tricky, and often requires a bit of
trial-and-error to get those values most appropriate for a given system.
A whole range of factors comes into play, including:
.Bl -enum
.It
Types of components (e.g., SCSI vs. IDE) and their bandwidth
.It
Types of controller cards and their bandwidth
.It
Distribution of components among controllers
.It
IO bandwidth
.It
file system access patterns
.It
CPU speed
.El
.Pp
As with most performance tuning, benchmarking under real-life loads
may be the only way to measure expected performance.
Understanding some of the underlying technology is also useful in tuning.
The goal of this section is to provide pointers to those parameters which may
make significant differences in performance.
.Pp
For a RAID 1 set, a SectPerSU value of 64 or 128 is typically sufficient.
Since data in a RAID 1 set is arranged in a linear
fashion on each component, selecting an appropriate stripe size is
somewhat less critical than it is for a RAID 5 set.
However, a stripe size that is too small will cause large IOs to be
broken up into a number of smaller ones, hurting performance.
At the same time, a large stripe size may cause problems with
concurrent accesses to stripes, which may also affect performance.
Thus values in the range of 32 to 128 are often the most effective.
.Pp
Tuning RAID 5 sets is trickier.
In the best case, IO is presented to the RAID set one stripe at a time.
Since the entire stripe is available at the beginning of the IO,
the parity of that stripe can be calculated before the stripe is written,
and then the stripe data and parity can be written in parallel.
When the amount of data being written is less than a full stripe worth, the
.Sq small write
problem occurs.
Since a
.Sq small write
means only a portion of the stripe on the components is going to
change, the data (and parity) on the components must be updated
slightly differently.
First, the
.Sq old parity
and
.Sq old data
must be read from the components.
Then the new parity is constructed,
using the new data to be written, and the old data and old parity.
Finally, the new data and new parity are written.
All this extra data shuffling results in a serious loss of performance,
and is typically 2 to 4 times slower than a full stripe write (or read).
To combat this problem in the real world, it may be useful
to ensure that stripe sizes are small enough that a
.Sq large IO
from the system will use exactly one large stripe write.
As is seen later, there are some file system dependencies
which may come into play here as well.
.Pp
Since the size of a
.Sq large IO
is often (currently) only 32K or 64K, on a 5-drive RAID 5 set it may
be desirable to select a SectPerSU value of 16 blocks (8K) or 32
blocks (16K).
Since there are 4 data stripe units per stripe, the maximum
data per stripe is 64 blocks (32K) or 128 blocks (64K).
Again, empirical measurement will provide the best indicators of which
values will yield better performance.
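.Pp
As a concrete sketch of that arithmetic, a
.Sq layout
section for such a 5-drive RAID 5 set aimed at 32K stripes might read:
.Bd -literal -offset indent
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
# 16 blocks per SU * 4 data SUs = 64 blocks (32K) of data per stripe
16 1 1 5
.Ed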
.Pp
The parameters used for the file system are also critical to good performance.
For
.Xr newfs 8 ,
for example, increasing the block size to 32K or 64K may improve
performance dramatically.
As well, changing the cylinders-per-group
parameter from 16 to 32 or higher is often not only necessary for
larger file systems, but may also have positive performance implications.
.Ss Summary
Despite the length of this man-page, configuring a RAID set is a
relatively straightforward process.
All that needs to be done is the following:
.Bl -enum
.It
Use
.Xr disklabel 8
to create the components (of type RAID).
.It
Construct a RAID configuration file: e.g.,
.Pa raid0.conf
.It
Configure the RAID set with:
.Bd -literal -offset indent
raidctl -C raid0.conf raid0
.Ed
.Pp
.It
Initialize the component labels with:
.Bd -literal -offset indent
raidctl -I 123456 raid0
.Ed
.Pp
.It
Initialize other important parts of the set with:
.Bd -literal -offset indent
raidctl -i raid0
.Ed
.Pp
.It
Get the default label for the RAID set:
.Bd -literal -offset indent
disklabel raid0 \*[Gt] /tmp/label
.Ed
.Pp
.It
Edit the label:
.Bd -literal -offset indent
vi /tmp/label
.Ed
.Pp
.It
Put the new label on the RAID set:
.Bd -literal -offset indent
disklabel -R -r raid0 /tmp/label
.Ed
.Pp
.It
Create the file system:
.Bd -literal -offset indent
newfs /dev/rraid0e
.Ed
.Pp
.It
Mount the file system:
.Bd -literal -offset indent
mount /dev/raid0e /mnt
.Ed
.Pp
.It
Use:
.Bd -literal -offset indent
raidctl -c raid0.conf raid0
.Ed
.Pp
to re-configure the RAID set the next time it is needed, or put
.Pa raid0.conf
into
.Pa /etc
where the set will automatically be configured by the
.Pa /etc/rc.d
scripts.
.El
.Sh SEE ALSO
.Xr ccd 4 ,
.Xr raid 4 ,
.Xr rc 8
.Sh HISTORY
RAIDframe is a framework for rapid prototyping of RAID structures
developed by the folks at the Parallel Data Laboratory at Carnegie
Mellon University (CMU).
A more complete description of the internals and functionality of
RAIDframe is found in the paper "RAIDframe: A Rapid Prototyping Tool
for RAID Systems", by William V. Courtright II, Garth Gibson, Mark
Holland, LeAnn Neal Reilly, and Jim Zelenka, and published by the
Parallel Data Laboratory of Carnegie Mellon University.
.Pp
The
.Nm
command first appeared as a program in CMU's RAIDframe v1.1 distribution.
This version of
.Nm
is a complete re-write, and first appeared in
.Nx 1.4 .
.Sh COPYRIGHT
.Bd -literal
The RAIDframe Copyright is as follows:

Copyright (c) 1994-1996 Carnegie-Mellon University.
All rights reserved.

Permission to use, copy, modify and distribute this software and
its documentation is hereby granted, provided that both the copyright
notice and this permission notice appear in all copies of the
software, derivative works or modified versions, and any portions
thereof, and that both notices appear in supporting documentation.

CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
CONDITION.  CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.

Carnegie Mellon requests users of this software to return to

 Software Distribution Coordinator  or  Software.Distribution@CS.CMU.EDU
 School of Computer Science
 Carnegie Mellon University
 Pittsburgh PA 15213-3890

any improvements or extensions that they make and grant Carnegie the
rights to redistribute these changes.
.Ed
.Sh WARNINGS
Certain RAID levels (1, 4, 5, 6, and others) can protect against some
data loss due to component failure.
However, the loss of two components of a RAID 4 or 5 system,
or the loss of a single component of a RAID 0 system, will
result in the entire file system being lost.
RAID is
.Em NOT
a substitute for good backup practices.
.Pp
Recomputation of parity
.Em MUST
be performed whenever there is a chance that it may have been compromised.
This includes after system crashes, or before a RAID
device has been used for the first time.
Failure to keep parity correct will be catastrophic should a
component ever fail \(em it is better to use RAID 0 and get the
additional space and speed, than it is to use parity, but
not keep the parity correct.
At least with RAID 0 there is no perception of increased data security.
.Sh BUGS
Hot-spare removal is currently not available.