CHAPTER 10

RAID

This chapter describes software and hardware RAID in three sections: Considerations for Backend Storage, Insights into Disk Performance Measurement, and Lustre Software RAID Support.


10.1 Considerations for Backend Storage

Lustre's architecture allows it to use any kind of block device as backend storage. The characteristics of such devices, particularly in the case of failures, vary significantly and have an impact on configuration choices.

This section surveys issues and recommendations regarding backend storage.

10.1.1 Selecting Storage for the MDS or OSTs

MDS

The MDS performs a large number of small writes. For this reason, we recommend that you use RAID1 for MDT storage. If you require more capacity for an MDT than one disk provides, we recommend RAID 1+0 (also known as RAID10). LVM is not recommended at this time for performance reasons.

OST

A quick calculation (shown below) makes it clear that, without further redundancy, RAID5 is not acceptable for large clusters and RAID6 is a must.

Take a 1 PB file system (2,000 disks of 500 GB capacity). The MTTF[1] of a disk is about 1,000 days. This means that the expected failure rate is 2000/1000 = 2 disks per day. Repair time at 10% of disk bandwidth is close to 1 day (500 GB at 5 MB/sec = 100,000 sec ~= 1 day).

If we have a RAID 5 stripe that is 10 disks wide, then during 1 day of rebuilding, the chance that a second disk in the same array fails is about 9/1000 ~= 1/100. With 2 disk failures per day, each carrying a roughly 1/100 chance of a second failure during the rebuild, a double failure in a RAID 5 stripe (and with it data loss) is expected about once every 50 days.

So, RAID 6 or another double parity algorithm is necessary for OST storage.
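The estimate above can be reproduced with a short shell sketch. The function names and argument order are illustrative and not part of Lustre. The sketch uses the unrounded 9/1000 probability, so it reports about 56 days where the text, rounding to 1/100, arrives at 50.

```shell
#!/bin/sh
# Sketch of the RAID 5 double-failure estimate. Names are illustrative only.

# rebuild_seconds CAPACITY_GB RATE_MB_PER_SEC
# Time to rewrite one disk at the given rebuild rate (1 GB taken as 1000 MB).
rebuild_seconds() {
    echo $(( $1 * 1000 / $2 ))
}

# days_between_double_failures DISKS MTTF_DAYS STRIPE_DISKS REBUILD_DAYS
# Expected number of days between double failures within one RAID 5 stripe.
days_between_double_failures() {
    awk -v n="$1" -v mttf="$2" -v width="$3" -v rebuild="$4" 'BEGIN {
        failures_per_day = n / mttf                 # 2000 / 1000 = 2
        p_second = (width - 1) * rebuild / mttf     # 9 * 1 / 1000
        printf "%.0f\n", 1 / (failures_per_day * p_second)
    }'
}

rebuild_seconds 500 5                          # prints 100000
days_between_double_failures 2000 1000 10 1    # prints 56
```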

For better performance, we recommend that you create RAID sets with no more than 8 data disks (+1 or +2 parity disks) as this will provide more IOPS from having multiple independent RAID sets.

File system: Use RAID5 with 5 or 9 disks or RAID6 with 6 or 10 disks, each on a different controller. The stripe width is the optimal minimum I/O size. Ideally, the RAID configuration should allow 1 MB Lustre RPCs to fit evenly on a single RAID stripe without an expensive read-modify-write cycle. Use this formula to determine the stripe_width.

<stripe_width> = <chunksize> * (<disks> - <parity_disks>) <= 1 MB 

where <parity_disks> is 1 for RAID5/RAID-Z and 2 for RAID6/RAID-Z2. If the RAID configuration does not allow <chunksize> to fit evenly into 1 MB, select <chunksize>, such that <stripe_width> is close to 1 MB, but not larger.

For example, RAID6 with 6 disks has 4 data and 2 parity disks, so we get:

<chunksize> <= 1024kB/4; either 256kB, 128kB or 64kB
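As a minimal sketch of the formula above, the following shell function finds the largest chunk size that keeps <stripe_width> at or below 1 MB. The function name is hypothetical, and it assumes that chunk sizes are powers of two, which the text does not state explicitly.

```shell
#!/bin/sh
# largest_chunk_kb DISKS PARITY_DISKS
# Prints the largest power-of-two chunk size (in KB) such that
# chunk * (DISKS - PARITY_DISKS) does not exceed 1024 KB.
# Assumption: only power-of-two chunk sizes are considered.
largest_chunk_kb() {
    data_disks=$(( $1 - $2 ))
    chunk=1024
    while [ $(( chunk * data_disks )) -gt 1024 ] && [ "$chunk" -gt 1 ]; do
        chunk=$(( chunk / 2 ))
    done
    echo "$chunk"
}

largest_chunk_kb 6 2     # RAID6 with 6 disks (4 data disks): prints 256
largest_chunk_kb 10 2    # RAID6 with 10 disks (8 data disks): prints 128
```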

The <stripe_width> value must equal <chunksize> * (<disks> - <parity_disks>). Use it for OST file systems only (not MDT file systems).

$ mkfs.lustre --mountfsoptions="stripe=<stripe_width_blocks>" ...

External journal: Use RAID1 with two partitions of 400 MB (or more), each from disks on different controllers.

To set up the journal device (/dev/mdJ), run:

$ mke2fs -O journal_dev -b 4096 /dev/mdJ

Then run mkfs.lustre with the --reformat option on the file system device (/dev/mdX), specifying the RAID geometry to the underlying ldiskfs file system, where:

<chunk_blocks> = <chunksize> / 4096
<stripe_width_blocks> = <stripe_width> / 4096

$ mkfs.lustre --reformat ... \
--mkfsoptions "-j -J device=/dev/mdJ -E stride=<chunk_blocks>" /dev/mdX
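The two block counts can be derived from the chunk size and disk counts as sketched below. The function name is illustrative. It assumes the chunk size is given in KB and the file system uses the 4096-byte block size shown above.

```shell
#!/bin/sh
# raid_geometry CHUNK_KB DISKS PARITY_DISKS
# Prints the stride (<chunk_blocks>) and stripe (<stripe_width_blocks>)
# values in 4096-byte blocks. The name and output format are illustrative.
raid_geometry() {
    chunk_bytes=$(( $1 * 1024 ))
    data_disks=$(( $2 - $3 ))
    chunk_blocks=$(( chunk_bytes / 4096 ))
    stripe_width_blocks=$(( chunk_bytes * data_disks / 4096 ))
    echo "stride=$chunk_blocks stripe=$stripe_width_blocks"
}

raid_geometry 256 6 2     # prints: stride=64 stripe=256
raid_geometry 128 10 2    # prints: stride=32 stripe=256
```

In both cases the stripe covers 256 blocks of 4 KB, which is exactly the 1 MB Lustre RPC size.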

10.1.2 Reliability Best Practices

It is considered mandatory that you use disk monitoring software, so that failed disks are detected promptly and rebuilds start without delay.

We recommend backups of the metadata file systems. This can be done with LVM snapshots or using raw partition backups.

10.1.3 Performance Tradeoffs

Writeback cache can dramatically increase write performance on any type of RAID array.[2] Unfortunately, unless the RAID array has battery-backed cache (a feature only found in some higher-priced hardware RAID arrays), interrupting the power to the array may result in out-of-sequence writes. This causes problems for journaling.

If writeback cache is enabled, a file system check is required after the array loses power. Data may also be lost because of this.

Therefore, we recommend against the use of writeback cache when data integrity is critical. You should carefully consider whether the benefits of using writeback cache outweigh the risks.

10.1.4 Formatting Options for RAID Devices

When formatting a file system on a RAID device, it is beneficial to specify additional parameters at the time of formatting. This ensures that the file system is optimized for the underlying disk geometry. Use the --mkfsoptions parameter to specify these options when formatting the OST or MDT.

For RAID 5, RAID 6, or RAID 1+0 storage, specifying the -E stride=<chunk_blocks> option improves the layout of the file system metadata, ensuring that no single disk contains all of the allocation bitmaps. The <chunk_blocks> parameter is the chunk size in units of 4096-byte blocks and represents the amount of contiguous data written to a single disk before moving to the next disk. This is applicable to both MDS and OST file systems.

For more information on how to override the defaults while formatting MDS or OST file systems, see Options for Formatting the MDT and OSTs.

10.1.4.1 Creating an External Journal

If you have configured a RAID array and use it directly as an OST, it houses both data and metadata. For better performance[3], we recommend putting the OST journal on a separate device, by creating a small RAID 1 array and using it as an external journal for the OST.

It is not known if external journals improve performance of MDTs. Currently, we recommend against using them for MDTs to reduce complexity.

No more than 102,400 file system blocks will ever be used for a journal. For Lustre's standard 4 KB block size, this corresponds to a 400 MB journal. A larger partition can be created, but only the first 400 MB will be used. Additionally, a copy of the journal is kept in RAM on the OSS. Therefore, make sure you have enough memory available to hold copies of all the journals.
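The sizing above is simple arithmetic, sketched here in shell. The function names are illustrative. The memory estimate assumes, as a worst case, that a full 400 MB copy is held in RAM for every OST journal.

```shell
#!/bin/sh
# journal_mb BLOCKS BLOCK_SIZE_BYTES
# Maximum journal size in MB for the given block count and block size.
journal_mb() {
    echo $(( $1 * $2 / 1024 / 1024 ))
}

# journal_ram_mb OST_COUNT
# Worst-case RAM (in MB) needed to hold one full journal copy per OST.
journal_ram_mb() {
    echo $(( $1 * $(journal_mb 102400 4096) ))
}

journal_mb 102400 4096    # prints 400
journal_ram_mb 7          # an OSS with 7 OSTs: prints 2800
```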

To create an external journal, perform these steps for each OST on the OSS:

1. Create a 400 MB (or larger) journal partition (RAID 1 is recommended).

In this example, /dev/sdb is a RAID 1 device. Run:

$ sfdisk -uC /dev/sdb << EOF 
> ,50,L 
> EOF

2. Create a journal device on the partition. Run:

$ mke2fs -b 4096 -O journal_dev /dev/sdb1

3. Create the OST.

In this example, /dev/sdc is the RAID 6 device to be used as the OST. Run:

$ mkfs.lustre --ost --mgsnode=mds@osib \
--mkfsoptions="-J device=/dev/sdb1" /dev/sdc

4. Mount the OST as usual.

10.1.5 Handling Degraded RAID Arrays

Lustre 1.8.2 and later versions include functionality that notifies Lustre if an external RAID array has degraded performance (resulting in a degraded OST), either because a disk has failed and not been replaced, or because a disk was replaced and is undergoing a rebuild. To avoid a global performance slowdown due to a degraded OST, the MDS can avoid the OST for new object allocation if it is notified of the degraded state.

A new file (called "degraded") in /proc/fs/lustre/obdfilter/{OST} marks the OST as degraded when a "1" (or any non-zero value) is written to it, and the OST stays marked until a "0" is written. Therefore, write "1" to the file when the array becomes degraded and "0" when the array becomes healthy again.

If the OST is remounted due to a reboot or other condition, the flag resets to "0".
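How the degraded state is detected depends on the array. As a hedged sketch, assume a Linux software RAID array whose status is visible in /proc/mdstat; a hardware array would need the vendor's own status tool instead. The function name and the sample status text are made up for illustration. The function prints the value that would be written to the "degraded" file.

```shell
#!/bin/sh
# md_degraded_flag ARRAY
# Reads /proc/mdstat-style text on stdin and prints 1 if the status map of
# ARRAY (for example [UUUU_]) shows a missing member, otherwise 0.
# Returns non-zero if the array is not found. Illustrative helper only.
md_degraded_flag() {
    awk -v md="$1" '
        $1 == md && $2 == ":" { in_array = 1; next }
        in_array && /^[^ \t]/ { in_array = 0 }   # next unindented line ends the block
        in_array {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^\[[U_]+\]$/) { flag = ($i ~ /_/) ? 1 : 0; found = 1 }
        }
        END { if (found) print flag; else exit 1 }
    '
}

# Demonstration with made-up sample text: md10 has lost one of ten members.
md_degraded_flag md10 <<'EOF'
md10 : active raid6 sdb[0] sdc[1] sdd[2]
      3906525184 blocks level 6, 128k chunk, algorithm 2 [10/9] [UUUUUUUUU_]
EOF
```

On a live OSS, the result could be redirected into the "degraded" file of the matching OST, for example from a periodic job: md_degraded_flag md10 < /proc/mdstat > /proc/fs/lustre/obdfilter/{OST}/degraded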


10.2 Insights into Disk Performance Measurement

Several tips and insights for disk performance measurement are provided below. Some of this information is specific to RAID arrays and/or the Linux RAID implementation.

Before creating a software RAID array, benchmark all disks individually. We have frequently encountered situations where drive performance was not consistent for all devices in the array. Replace any disks that are significantly slower than the rest.

To identify the optimal request size for a given disk, benchmark the disk with different record sizes, ranging from 4 KB up to 1 MB or 2 MB.
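One possible way to sweep the record sizes is sketched below. The text does not name a benchmark tool, so the use of dd with direct reads (the GNU iflag=direct extension) is an assumption, as are the function names and the 256 MB read per run. The benchmark only reads from the device.

```shell
#!/bin/sh
# record_sizes_kb MIN_KB MAX_KB
# Prints record sizes in KB, doubling from MIN_KB up to MAX_KB.
record_sizes_kb() {
    size=$1
    while [ "$size" -le "$2" ]; do
        echo "$size"
        size=$(( size * 2 ))
    done
}

# read_benchmark DEVICE
# Reads 256 MB from DEVICE at each record size and shows dd's summary line.
# Assumes GNU dd (iflag=direct bypasses the page cache). Read-only.
read_benchmark() {
    for kb in $(record_sizes_kb 4 2048); do
        count=$(( 262144 / kb ))
        echo "record size ${kb} KB:"
        dd if="$1" of=/dev/null bs="${kb}k" count="$count" iflag=direct 2>&1 | tail -n 1
    done
}

record_sizes_kb 4 2048    # prints 4, 8, 16, ... 2048, one per line
# read_benchmark /dev/sdb   # run as root against the disk under test
```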



Note - Try to avoid sync writes. Without them, a subsequent write will probably fill the stripe, so no reads are needed. Try to configure RAID arrays and the application so that most of the writes are full-stripe and stripe-aligned.


For the MDT, use RAID1 with an internal journal and two disks from different controllers.

If you need a larger MDT, create multiple RAID1 devices from pairs of disks, and then make a RAID0 array of the RAID1 devices. This ensures maximum reliability because multiple disk failures only have a small chance of hitting both disks in the same RAID1 device.

Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance that even two disk failures can cause the loss of the whole MDT device. The first failure disables an entire half of the mirror and the second failure has a 50% chance of disabling the remaining mirror.
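The difference between the two layouts can be made concrete with a small calculation. This is a sketch under a simplifying assumption: after the first failure, the second failure is equally likely to hit any of the remaining disks. Under that model the 50% figure is the limit for wide arrays, and small arrays fare slightly worse. The function name is illustrative.

```shell
#!/bin/sh
# second_failure_loss_pct DISKS
# For an array of DISKS disks (an even number), prints the chance in percent
# that a second disk failure destroys the array:
#   first value:  RAID0 over RAID1 pairs (only the mirror partner is fatal)
#   second value: RAID1 over two RAID0 halves (any disk in the other half is fatal)
second_failure_loss_pct() {
    awk -v n="$1" 'BEGIN {
        printf "%.1f %.1f\n", 100 * 1 / (n - 1), 100 * (n / 2) / (n - 1)
    }'
}

second_failure_loss_pct 4     # prints: 33.3 66.7
second_failure_loss_pct 20    # prints: 5.3 52.6
```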


10.3 Lustre Software RAID Support

A number of Linux kernels offer software RAID support, by which the kernel organizes disks into a RAID array. All Lustre-supported kernels have software RAID capability, but Lustre has added performance improvements to the RHEL 4 and RHEL 5 kernels that make operations even faster[4]. Therefore, if you are using software RAID functionality, we recommend that you use a Lustre-patched RHEL 4 or 5 kernel to take advantage of these performance improvements, rather than a SLES kernel.

10.3.0.1 Enabling Software RAID on Lustre

This procedure describes how to set up software RAID on a Lustre system. It requires use of mdadm, a third-party tool to manage devices using software RAID.

1. Install Lustre, but do not configure it yet. See Installing Lustre.

2. Create the RAID array with the mdadm command.

The mdadm command is used to create and manage software RAID arrays in Linux, as well as to monitor the arrays while they are running. To create a RAID array, use the --create option and specify the MD device to create, the array components, and the options appropriate to the array.



Note - For best performance, we generally recommend using disks from as many controllers as possible in one RAID array.


To illustrate how to create a software RAID array for Lustre, the steps below include a worked example that creates a 10-disk RAID 6 array from disks /dev/dsk/c0t0d0 through c0t0d4 and /dev/dsk/c1t0d0 through c1t0d4. This RAID array has no spares.

For the 10-disk RAID 6 array, there are 8 data disks and 2 parity disks. The chunk size must be chosen such that <chunksize> <= 1024KB/8. Therefore, the largest valid chunk size is 128KB.

a. Create a RAID array for an OST. On the OSS, run:

$ mdadm --create <array_device> -c <chunksize> -l \
<raid_level> -n <active_disks> -x <spare_disks> <block_devices>

where:


<array_device>

RAID array to create, in the form of /dev/mdX

<chunksize>

Size of each stripe piece on the array’s disks (in KB); discussed above.

<raid_level>

Architecture of the RAID array. RAID 5 and RAID 6 are commonly used for OSTs.

<active_disks>

Number of active disks in the array, including parity disks.

<spare_disks>

Number of spare disks initially assigned to the array. More disks may be brought in via spare pooling (see below).

<block_devices>

List of the block devices used for the RAID array; wildcards may be used.


For the worked example, the command is:

$ mdadm --create /dev/md10 -c 128 -l 6 -n 10 -x 0 \
/dev/dsk/c0t0d[01234] /dev/dsk/c1t0d[01234]

This command output displays:

mdadm: array /dev/md10 started.

We also want an external journal on a RAID 1 device. We create this from two 400MB partitions on separate disks: /dev/dsk/c0t0d20p1 and /dev/dsk/c1t0d20p1.

b. Create a RAID array for an external journal. On the OSS, run:

$ mdadm --create <array_device> -l <raid_level> -n \
<active_devices> -x <spare_devices> <block_devices>

where:


<array_device>

RAID array to create, in the form of /dev/mdX

<raid_level>

Architecture of the RAID array. RAID 1 is recommended for external journals.

<active_devices>

Number of active disks in the RAID array, including mirrors.

<spare_devices>

Number of spare disks initially assigned to the RAID array. More disks may be brought in via spare pooling (see below).

<block_devices>

List of the block devices used for the RAID array; wildcards may be used.


For the worked example, the command is:

$ mdadm --create /dev/md20 -l 1 -n 2 -x 0 /dev/dsk/c0t0d20p1 \
/dev/dsk/c1t0d20p1

This command output displays:

mdadm: array /dev/md20 started.

We now have two arrays: a RAID 6 array for the OST (/dev/md10), and a RAID 1 array for the external journal (/dev/md20).

The arrays will now be re-synced, a process which re-synchronizes the various disks in the array so their contents match. The arrays may be used during the re-sync process (including formatting the OSTs), but performance will not be as high as usual. The re-sync progress may be monitored by reading the /proc/mdstat file.
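As a rough sketch of monitoring the re-sync, the function below extracts the progress percentage for one array from /proc/mdstat-style text. The function name and the sample text are made up for illustration, and the parsing assumes the usual "resync = NN.N%" progress line.

```shell
#!/bin/sh
# md_resync_progress ARRAY
# Reads /proc/mdstat-style text on stdin and prints the resync or recovery
# percentage for ARRAY, or "idle" if no progress line is present.
md_resync_progress() {
    awk -v md="$1" '
        $1 == md && $2 == ":" { in_array = 1; next }
        in_array && /^[^ \t]/ { in_array = 0 }   # next unindented line ends the block
        in_array && /(resync|recovery)/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { sub(/^=/, "", $i); print $i; found = 1 }
        }
        END { if (!found) print "idle" }
    '
}

# Demonstration with made-up sample text; prints 12.6%
md_resync_progress md10 <<'EOF'
Personalities : [raid6] [raid5] [raid4]
md10 : active raid6 sdk[9] sdj[8] sdi[7]
      3906525184 blocks level 6, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      [==>..................]  resync = 12.6% (2316416/18334720) finish=5.7min speed=46283K/sec
EOF
```

On a live server the function would be fed from the real file: md_resync_progress md10 < /proc/mdstat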

Next, you need to create a RAID array for an MDT. In this example, a RAID 10 array is created with 4 disks: /dev/dsk/c0t0d1, c0t0d3, c1t0d1, and c1t0d3. For smaller arrays, RAID 1 could be used.

c. Create a RAID array for an MDT. On the MDS, run:

$ mdadm --create <array_device> -l <raid_level> -n \
<active_devices> -x <spare_devices> <block_devices>

where:


<array_device>

RAID array to create, in the form of /dev/mdX

<raid_level>

Architecture of the RAID array. RAID 1 or RAID 10 is recommended for MDTs.

<active_devices>

Number of active disks in the RAID array, including mirrors.

<spare_devices>

Number of spare disks initially assigned to the RAID array. More disks may be brought in via spare pooling (see below).

<block_devices>

List of the block devices used for the RAID array; wildcards may be used.


For the worked example, the command is:

$ mdadm --create /dev/md10 -l 10 -n 4 -x 0 /dev/dsk/c[01]t0d[13]

This command output displays:

mdadm: array /dev/md10 started.

If you are creating many arrays across many servers, we recommend scripting this process.
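A script for this could start from a helper like the one sketched below, which only prints the mdadm command so it can be reviewed before anything is created. The function name and argument order are hypothetical; the mdadm options are the ones used in the worked examples above.

```shell
#!/bin/sh
# mdadm_create_cmd ARRAY CHUNK_KB LEVEL ACTIVE SPARES DEVICE...
# Prints (does not run) the mdadm command for one array. Dry run only.
mdadm_create_cmd() {
    array=$1; chunk_kb=$2; level=$3; active=$4; spares=$5
    shift 5
    echo "mdadm --create $array -c $chunk_kb -l $level -n $active -x $spares $*"
}

# Reproduces the OST command from the worked example. The device patterns
# are quoted so that they are printed rather than expanded here.
mdadm_create_cmd /dev/md10 128 6 10 0 '/dev/dsk/c0t0d[01234]' '/dev/dsk/c1t0d[01234]'
```

Once the printed commands look right, the output can be piped to sh on each server, or the echo can be replaced with a direct call.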



Note - Do not use the --assume-clean option when creating arrays. This could lead to data corruption on RAID 5 and will cause array checks to show errors with all RAID types.


3. Set up the mdadm tool.

The mdadm tool enables you to monitor disks for failures (you will receive a notification). It also enables you to manage spare disks. When a disk fails, you can use mdadm to make a spare disk active, until such time as the failed disk is replaced.

Here is an example mdadm.conf from an OSS with 7 OSTs including external journals. Note how spare groups are configured, so that OSTs without spares still benefit from the spare disks assigned to other OSTs.

ARRAY /dev/md10 level=raid6 num-devices=10
    UUID=e8926d28:0724ee29:65147008:b8df0bd1 spare-group=raids
ARRAY /dev/md11 level=raid6 num-devices=10 spares=1 
    UUID=7b045948:ac4edfc4:f9d7a279:17b468cd spare-group=raids
ARRAY /dev/md12 level=raid6 num-devices=10 spares=1 
    UUID=29d8c0f0:d9408537:39c8053e:bd476268 spare-group=raids
ARRAY /dev/md13 level=raid6 num-devices=10
    UUID=1753fa96:fd83a518:d49fc558:9ae3488c spare-group=raids
ARRAY /dev/md14 level=raid6 num-devices=10 spares=1 
    UUID=7f0ad256:0b3459a4:d7366660:cf6c7249 spare-group=raids
ARRAY /dev/md15 level=raid6 num-devices=10
    UUID=09830fd2:1cac8625:182d9290:2b1ccf2a spare-group=raids
ARRAY /dev/md16 level=raid6 num-devices=10
    UUID=32bf1b12:4787d254:29e76bd7:684d7217 spare-group=raids
ARRAY /dev/md20 level=raid1 num-devices=2 spares=1 
    UUID=bcfb5f40:7a2ebd50:b3111587:8b393b86 spare-group=journals
ARRAY /dev/md21 level=raid1 num-devices=2 spares=1 
    UUID=6c82d034:3f5465ad:11663a04:58fbc2d1 spare-group=journals
ARRAY /dev/md22 level=raid1 num-devices=2 spares=1 
    UUID=7c7274c5:8b970569:03c22c87:e7a40e11 spare-group=journals
ARRAY /dev/md23 level=raid1 num-devices=2 spares=1 
    UUID=46ecd502:b39cd6d9:dd7e163b:dd9b2620 spare-group=journals
ARRAY /dev/md24 level=raid1 num-devices=2 spares=1 
    UUID=5c099970:2a9919e6:28c9b741:3134be7e spare-group=journals
ARRAY /dev/md25 level=raid1 num-devices=2 spares=1 
    UUID=b44a56c0:b1893164:4416e0b8:75beabc4 spare-group=journals
ARRAY /dev/md26 level=raid1 num-devices=2 spares=1
    UUID=2adf9d0f:2b7372c5:4e5f483f:3d9a0a25 spare-group=journals
 
# Email address to notify of events (e.g. disk failures)
MAILADDR admin@example.com

4. Set up periodic checks of the RAID array.

We recommend checking the software RAID arrays monthly for consistency. This can be done using cron and should be scheduled for an idle period so performance is not affected.

To start a check, write "check" into /sys/block/[ARRAY]/md/sync_action. For example, to check /dev/md10, run this command on the Lustre server:

$ echo check > /sys/block/md10/md/sync_action
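To cover every array on a server from a single cron job, a sketch along these lines could be used. The function name, the optional sysfs root argument (useful for trying the function against a scratch directory) and the script path in the cron example are all assumptions. The function skips arrays that are not currently idle.

```shell
#!/bin/sh
# start_md_checks [SYSFS_ROOT]
# Writes "check" to the sync_action file of every MD array that is idle.
# SYSFS_ROOT defaults to /sys. Illustrative helper only.
start_md_checks() {
    sysfs_root=${1:-/sys}
    for action in "$sysfs_root"/block/md*/md/sync_action; do
        [ -w "$action" ] || continue          # no arrays, or not writable
        if [ "$(cat "$action")" = "idle" ]; then
            echo check > "$action"
            echo "started check: $action"
        fi
    done
}

# Example /etc/cron.d entry (hypothetical script path), 02:00 on the 1st of each month:
#   0 2 1 * * root /usr/local/sbin/md-check.sh
# where md-check.sh contains this function followed by a call to start_md_checks.
```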

5. Format the OSTs and MDT, and continue with normal Lustre setup and configuration.

For configuration information, see Configuring Lustre.



Note - The default value of stripe_cache_size is 16 KB.


These additional resources may be helpful when enabling software RAID on Lustre:


1 (Footnote) Mean Time to Failure
2 (Footnote) Client writeback cache improves performance for many small files or for a single, large file alike. However, if the cache is filled with small files, cache flushing is likely to be much slower (because of less data being sent per RPC), so there may be a drop-off in total throughput.
3 (Footnote) Performance is affected because, while writing large sequential data, small I/O writes are done to update metadata. This small-sized I/O can affect performance of large sequential I/O with disk seeks.
4 (Footnote) These enhancements have mostly improved write performance.