CHAPTER 17
Backing Up and Restoring a File System
Lustre provides backups at the file system level, the device level, and the file level. This chapter describes how to back up and restore a Lustre file system at each of these levels.
Backing up a complete file system gives you full control over the files to back up, and allows restoration of individual files as needed. File system-level backups are also the easiest to integrate into existing backup solutions.
File system backups are performed from a Lustre client (or from many clients working in parallel in different directories) rather than on individual server nodes; this is no different from backing up any other file system.
However, due to the large size of most Lustre file systems, it is not always possible to get a complete backup. We recommend that you back up subsets of a file system instead, such as subdirectories of the entire file system, filesets for a single user, files incremented by date, and so on.
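One way to back up subsets is to archive each top-level subdirectory separately from a Lustre client. The following is a minimal sketch under stated assumptions: the helper name backup_subdirs, the one-archive-per-subdirectory naming scheme, and the example paths are illustrative and not part of Lustre.

```shell
#!/bin/sh
# Sketch: archive each top-level subdirectory of a mounted file system
# into its own tar file, so subsets can be backed up (and restored)
# independently. The helper name and archive naming are assumptions.

backup_subdirs() (
    src=$1      # e.g. a Lustre client mount point such as /mnt/lustre
    dest=$2     # directory that receives one archive per subdirectory
    mkdir -p "$dest" || exit 1
    for dir in "$src"/*/; do
        [ -d "$dir" ] || continue      # no subdirectories: skip the literal glob
        name=$(basename "$dir")
        # -C keeps the paths inside the archive relative to the source root
        tar -czf "$dest/$name.tgz" -C "$src" "$name" || exit 1
        echo "$name.tgz"
    done
)
```

For example, `backup_subdirs /mnt/lustre /backup/lustre` writes one archive per subdirectory of /mnt/lustre. Several clients could each run the function on a different part of the tree to work in parallel.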
The lustre_rsync feature keeps a backup copy of the entire file system in sync by replicating the file system’s changes to a second file system (the second file system need not be a Lustre file system, but it must be sufficiently large). The lustre_rsync utility uses Lustre changelogs to synchronize the file systems efficiently, without having to scan (directory walk) the Lustre file system. This efficiency is critically important for large file systems, and distinguishes the lustre_rsync feature from other replication/backup solutions.
The lustre_rsync feature works by periodically running lustre_rsync, a userspace program used to synchronize changes in the Lustre file system onto the target file system. The lustre_rsync utility keeps a status file, which enables it to be safely interrupted and restarted without losing synchronization between the file systems.
The first time that lustre_rsync is run, the user must specify a set of parameters for the program to use. These parameters are described in the following table and in lustre_rsync. On subsequent runs, these parameters are read from the status file, and only the name of the status file needs to be passed to lustre_rsync.
The lustre_rsync utility uses the following parameters:
| Parameter | Description |
| --source=<src> | The path to the root of the Lustre file system (source) which will be synchronized. This is a mandatory option if a valid status log created during a previous synchronization operation (--statuslog) is not specified. |
| --target=<tgt> | The path to the root where the source file system will be synchronized (target). This is a mandatory option if the status log created during a previous synchronization operation (--statuslog) is not specified. This option can be repeated if multiple synchronization targets are desired. |
| --mdt=<mdt> | The metadata device to be synchronized. A changelog user must be registered for this device. This is a mandatory option if a valid status log created during a previous synchronization operation (--statuslog) is not specified. |
| --user=<user id> | The changelog user ID for the specified MDT. To use lustre_rsync, the changelog user must be registered. For details, see the changelog_register parameter in lctl. This is a mandatory option if a valid status log created during a previous synchronization operation (--statuslog) is not specified. |
| --statuslog=<log> | A log file to which synchronization status is saved. When the lustre_rsync utility starts, if the status log from a previous synchronization operation is specified, the state is read from the log and the otherwise mandatory --source, --target and --mdt options can be skipped. Specifying the --source, --target and/or --mdt options in addition to the --statuslog option overrides the corresponding parameters in the status log; command line options take precedence over options in the status log. |
| --xattr <yes|no> | Specifies whether extended attributes (xattrs) are synchronized or not. The default is to synchronize extended attributes. Note - Disabling xattrs causes Lustre striping information not to be synchronized. |
| --dry-run | Shows the output of lustre_rsync commands (copy, mkdir, etc.) on the target file system without actually executing them. |
| --abort-on-err | Stops processing the lustre_rsync operation if an error occurs. The default is to continue the operation. |
Sample lustre_rsync commands are listed below.
Register a changelog user for an MDT (e.g. lustre-MDT0000).
# lctl --device lustre-MDT0000 changelog_register
lustre-MDT0000 Registered changelog userid 'cl1'
Synchronize a Lustre file system (/mnt/lustre) to a target file system (/mnt/target).
$ lustre_rsync --source=/mnt/lustre --target=/mnt/target --mdt=lustre-MDT0000 --user=cl1 --statuslog sync.log --verbose
Lustre filesystem: lustre
MDT device: lustre-MDT0000
Source: /mnt/lustre
Target: /mnt/target
Statuslog: sync.log
Changelog registration: cl1
Starting changelog record: 0
Errors: 0
lustre_rsync took 1 seconds
Changelog records consumed: 22
After the file system undergoes changes, synchronize the changes onto the target file system. Only the statuslog name needs to be specified, as it has all the parameters passed earlier.
$ lustre_rsync --statuslog sync.log --verbose
Replicating Lustre filesystem: lustre
MDT device: lustre-MDT0000
Source: /mnt/lustre
Target: /mnt/target
Statuslog: sync.log
Changelog registration: cl1
Starting changelog record: 22
Errors: 0
lustre_rsync took 2 seconds
Changelog records consumed: 42
Synchronize a Lustre file system (/mnt/lustre) to two target file systems (/mnt/target1 and /mnt/target2).
$ lustre_rsync --source=/mnt/lustre --target=/mnt/target1 --target=/mnt/target2 \
    --mdt=lustre-MDT0000 --user=cl1 --statuslog sync.log
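The difference between the first run (all parameters) and later runs (status log only) can be captured in a small wrapper. This is only a sketch: the helper name rsync_cmd is hypothetical, it prints the command rather than running it, and treating any non-empty file as a valid status log is a simplifying assumption.

```shell
#!/bin/sh
# Sketch: print the lustre_rsync command line for the first run
# (all parameters) or for a later run (status log only).
# Assumption: a non-empty status log file means an earlier run succeeded.

rsync_cmd() (
    statuslog=$1; src=$2; tgt=$3; mdt=$4; user=$5
    if [ -s "$statuslog" ]; then
        # Later runs: parameters are read back from the status log.
        echo "lustre_rsync --statuslog $statuslog"
    else
        # First run: every mandatory parameter must be given.
        echo "lustre_rsync --source=$src --target=$tgt --mdt=$mdt --user=$user --statuslog $statuslog"
    fi
)
```

A periodic job could call `rsync_cmd sync.log /mnt/lustre /mnt/target lustre-MDT0000 cl1`, review the printed command, and then execute it.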
In some cases, it is useful to do a full device-level backup of an individual device (MDT or OST) before replacing hardware, performing maintenance, etc. A full device-level backup ensures that all of the data and configuration files are preserved in their original state, and it is the easiest method of doing a backup. For the MDT file system, it may also be the fastest way to perform the backup and restore, since it can do large streaming read and write operations at the maximum bandwidth of the underlying devices.
If hardware replacement is the reason for the backup or if a spare storage device is available, it is possible to do a raw copy of the MDT or OST from one block device to the other, as long as the new device is at least as large as the original device. To do this, run:
dd if=/dev/{original} of=/dev/{new} bs=1M
If hardware errors cause read problems on the original device, use the command below to allow as much data as possible to be read from the original device while skipping sections of the disk with errors:
dd if=/dev/{original} of=/dev/{new} bs=4k conv=sync,noerror count={original size in 4kB blocks}
Even in the face of hardware errors, the ldiskfs file system is very robust and it may be possible to recover the file system data after running e2fsck -f on the new device.
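The count value in the second dd command is the size of the original device in 4 kB blocks. A minimal sketch of that arithmetic follows; the helper name blocks_4k is hypothetical, and the result is rounded up so that a trailing partial block is still covered.

```shell
#!/bin/sh
# Sketch: number of 4 kB blocks needed to cover a device of a given
# size in bytes, rounded up. The helper name is hypothetical.

blocks_4k() {
    echo $(( ($1 + 4095) / 4096 ))
}
```

On Linux, the size in bytes can be read with `blockdev --getsize64 /dev/{original}`, so the count could be passed as `count=$(blocks_4k "$(blockdev --getsize64 /dev/{original})")`.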
This procedure provides another way to back up or migrate the data of an OST at the file level, so that the unused space of the OST does not need to be backed up. Backing up a single OST device is not necessarily the best way to perform backups of the Lustre file system, since the files stored in the backup are not usable without metadata stored on the MDT. However, it is the preferred method for migration of OST devices, especially when it is desirable to reformat the underlying file system with different configuration options or to reduce fragmentation.
1. Make a mountpoint for the file system.
[oss]# mkdir -p /mnt/ost
2. Mount the file system.
[oss]# mount -t ldiskfs /dev/{ostdev} /mnt/ost
3. Change to the mountpoint being backed up.
[oss]# cd /mnt/ost
4. Back up the extended attributes.
[oss]# getfattr -R -d -m '.*' -e hex -P . > ea-$(date +%Y%m%d).bak
5. Verify that the ea-$date.bak file has properly backed up the EA data on the OST.
Without this attribute data, the restore process may be missing extra data that can be very useful in case of later file system corruption. Look at this file with more or a text editor. Each object file should have a corresponding entry similar to this:
# file: O/0/d0/100992
trusted.fid=0x0d822200000000004a8a73e500000000808a0100000000000000000000000000
6. Back up all file system data.
[oss]# tar czvf {backup file}.tgz --sparse .
| Note - In Lustre 1.6.7 and later, the --sparse option reduces the size of the backup file. Be sure to use it so the tar command does not mistakenly create an archive full of zeros. |
7. Change directory out of the file system.
[oss]# cd -
8. Unmount the file system.
[oss]# umount /mnt/ost
| Note - When restoring an OST backup on a different node as part of an OST migration, you also have to change server NIDs and use the --writeconf command to re-generate the configuration logs. See Changing a Server NID. |
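The manual check in step 5 of the backup procedure can be partly automated by scanning the dump for object entries that have no trusted.fid attribute. The sketch below assumes the dump format shown in step 5 and considers only paths under O/. The helper name missing_fid is hypothetical, and some entries it reports (for example, directories) may legitimately lack the attribute.

```shell
#!/bin/sh
# Sketch: list entries in a getfattr dump that have no trusted.fid
# attribute. Only paths under O/ are considered, which is an
# assumption based on the sample entry in the procedure.

missing_fid() {
    awk '
        /^# file: / {
            if (path != "" && !seen) print path   # previous record lacked the attribute
            path = substr($0, 9); seen = 0        # text after "# file: "
            if (path !~ /^O\//) path = ""         # ignore entries outside O/
            next
        }
        /^trusted\.fid=/ { seen = 1 }
        END { if (path != "" && !seen) print path }
    ' "$1"
}
```

Running `missing_fid ea-{date}.bak` prints nothing when every entry under O/ carries a trusted.fid value.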
To restore data from a file-level backup, you need to format the device, restore the file data and then restore the EA data.
1. Format the new device.
[oss]# mkfs.lustre --ost --index {OST index} {other options} {newdev}
2. Mount the file system.
[oss]# mount -t ldiskfs {newdev} /mnt/ost
3. Change to the new file system mount point.
[oss]# cd /mnt/ost
4. Restore the file system backup.
[oss]# tar xzvpf {backup file} --sparse
5. Restore the file system extended attributes.
[oss]# setfattr --restore=ea-${date}.bak
6. Verify that the extended attributes were restored.
[oss]# getfattr -d -m ".*" -e hex O/0/d0/100992
trusted.fid=0x0d822200000000004a8a73e500000000808a0100000000000000000000000000
7. Change directory out of the file system.
[oss]# cd -
8. Unmount the new file system.
[oss]# umount /mnt/ost
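Step 6 of the restore procedure checks a single object. To compare all attributes, a dump taken on the restored file system (before it is unmounted) can be compared with the backed-up dump. The following sketch assumes both files are in the getfattr dump format used above; the helper names flatten_ea and same_ea are hypothetical.

```shell
#!/bin/sh
# Sketch: compare two getfattr dumps regardless of record order.

# Flatten a dump into sorted "path<TAB>attr=value" lines.
flatten_ea() {
    awk '
        /^# file: / { path = substr($0, 9); next }   # remember the current path
        NF          { print path "\t" $0 }           # one line per attribute
    ' "$1" | sort
}

# Exit status 0 when both dumps hold the same attributes, non-zero otherwise.
same_ea() (
    a=$(mktemp) && b=$(mktemp) || exit 2
    flatten_ea "$1" > "$a"
    flatten_ea "$2" > "$b"
    cmp -s "$a" "$b"; rc=$?
    rm -f "$a" "$b"
    exit "$rc"
)
```

For example, after creating a second dump with the same getfattr command used for the backup, `same_ea ea-{date}.bak {new dump}` succeeds only if every path carries identical attribute values in both files.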
If the file system was used between the time the backup was made and when it was restored, then the lfsck tool (part of Lustre e2fsprogs) can optionally be run to ensure the file system is coherent. If all of the device file systems were backed up at the same time after the entire Lustre file system was stopped, this is not necessary. In either case, the file system should be immediately usable even if lfsck is not run, though there may be I/O errors reading from files that are present on the MDT but not the OSTs, and files that were created after the MDT backup will not be accessible/visible.
If you want to perform disk-based backups (because, for example, access to the backup system needs to be as fast as to the primary Lustre file system), you can use the Linux LVM snapshot tool to maintain multiple, incremental file system backups.
Because LVM snapshots cost CPU cycles as new files are written, taking snapshots of the main Lustre file system will probably result in unacceptable performance losses. You should create a new, backup Lustre file system and periodically (e.g., nightly) back up new/changed files to it. Periodic snapshots can be taken of this backup file system to create a series of "full" backups.
Use this procedure to create a backup Lustre file system for use with the LVM snapshot mechanism.
1. Create LVM volumes for the MDT and OSTs.
Create LVM devices for your MDT and OST targets. Make sure not to use the entire disk for the targets; save some room for the snapshots. The snapshots start out as 0 size, but grow as you make changes to the current file system. If you expect to change 20% of the file system between backups, the most recent snapshot will be 20% of the target size, the next older one will be 40%, etc. Here is an example:
cfs21:~# pvcreate /dev/sda1
  Physical volume "/dev/sda1" successfully created
cfs21:~# vgcreate volgroup /dev/sda1
  Volume group "volgroup" successfully created
cfs21:~# lvcreate -L200M -nMDT volgroup
  Logical volume "MDT" created
cfs21:~# lvcreate -L200M -nOST0 volgroup
  Logical volume "OST0" created
cfs21:~# lvscan
  ACTIVE '/dev/volgroup/MDT' [200.00 MB] inherit
  ACTIVE '/dev/volgroup/OST0' [200.00 MB] inherit
2. Format the LVM volumes as Lustre targets.
In this example, the backup file system is called “main” and designates the current, most up-to-date backup.
cfs21:~# mkfs.lustre --mdt --fsname=main /dev/volgroup/MDT
No management node specified, adding MGS to this MDT.
Permanent disk data:
Target: main-MDTffff
Index: unassigned
Lustre FS: main
Mount type: ldiskfs
Flags: 0x75
(MDT MGS needs_index first_time update )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters:
checking for existing Lustre data
device size = 200MB
formatting backing filesystem ldiskfs on /dev/volgroup/MDT
target name main-MDTffff
4k blocks 0
options -i 4096 -I 512 -q -O dir_index -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L main-MDTffff -i 4096 -I 512 -q -O dir_index -F /dev/volgroup/MDT
Writing CONFIGS/mountdata
cfs21:~# mkfs.lustre --ost --mgsnode=cfs21 --fsname=main /dev/volgroup/OST0
Permanent disk data:
Target: main-OSTffff
Index: unassigned
Lustre FS: main
Mount type: ldiskfs
Flags: 0x72
(OST needs_index first_time update )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.0.21@tcp
checking for existing Lustre data
device size = 200MB
formatting backing filesystem ldiskfs on /dev/volgroup/OST0
target name main-OSTffff
4k blocks 0
options -I 256 -q -O dir_index -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L main-OSTffff -I 256 -q -O dir_index -F /dev/volgroup/OST0
Writing CONFIGS/mountdata
cfs21:~# mount -t lustre /dev/volgroup/MDT /mnt/mdt
cfs21:~# mount -t lustre /dev/volgroup/OST0 /mnt/ost
cfs21:~# mount -t lustre cfs21:/main /mnt/main
At periodic intervals (e.g., nightly), back up new and changed files to the LVM-based backup file system.
cfs21:~# cp /etc/passwd /mnt/main
cfs21:~# cp /etc/fstab /mnt/main
cfs21:~# ls /mnt/main
fstab passwd
Whenever you want to make a "checkpoint" of the main Lustre file system, create LVM snapshots of all target MDT and OSTs in the LVM-based backup file system. You must decide the maximum size of a snapshot ahead of time, although you can dynamically change this later. The size of a daily snapshot is dependent on the amount of data changed daily in the main Lustre file system. It is likely that a two-day old snapshot will be twice as big as a one-day old snapshot.
You can create as many snapshots as you have room for in the volume group. If necessary, you can dynamically add disks to the volume group.
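The space needed for a series of snapshots can be estimated from the expected change rate. The sketch below assumes a fixed percentage of the target changes between backups, so that the Nth-newest snapshot holds N intervals of changes (20%, 40%, and so on for a 20% change rate). The helper name snapshot_space is hypothetical, and real usage depends on how much the changed data overlaps from one interval to the next.

```shell
#!/bin/sh
# Sketch: estimate snapshot space for COUNT snapshots of a target of
# TARGET_MB megabytes when CHANGE_PCT percent changes per interval.
# Integer arithmetic rounds each size down.

snapshot_space() (
    target_mb=$1; change_pct=$2; count=$3
    total=0; n=1
    while [ "$n" -le "$count" ]; do
        size=$(( target_mb * change_pct * n / 100 ))   # Nth-newest snapshot
        echo "snapshot $n: ${size} MB"
        total=$(( total + size ))
        n=$(( n + 1 ))
    done
    echo "total: ${total} MB"
)
```

For a 200 MB target with 20% change per interval, `snapshot_space 200 20 3` reports 40 MB, 80 MB and 120 MB, for a total of 240 MB to reserve in the volume group.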
The snapshots of the target MDT and OSTs should be taken at the same point in time. Make sure that the cronjob updating the backup file system is not running, since that is the only thing writing to the disks. Here is an example:
cfs21:~# modprobe dm-snapshot
cfs21:~# lvcreate -L50M -s -n MDTb1 /dev/volgroup/MDT
  Rounding up size to full physical extent 52.00 MB
  Logical volume "MDTb1" created
cfs21:~# lvcreate -L50M -s -n OSTb1 /dev/volgroup/OST0
  Rounding up size to full physical extent 52.00 MB
  Logical volume "OSTb1" created
After the snapshots are taken, you can continue to back up new/changed files to "main". The snapshots will not contain the new files.
cfs21:~# cp /etc/termcap /mnt/main
cfs21:~# ls /mnt/main
fstab passwd termcap
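Because the MDT and OST snapshots should be taken at the same point in time, it can help to generate all the lvcreate commands in one pass. The sketch below only prints the commands. The helper name snapshot_cmds and the naming scheme (logical volume name plus a suffix) are assumptions, so the OST snapshot comes out as OST0b1 rather than OSTb1 as in the example above.

```shell
#!/bin/sh
# Sketch: print the lvcreate commands that snapshot every listed
# logical volume in a volume group under a common suffix.
# Nothing is executed; the output is meant to be reviewed first.

snapshot_cmds() (
    vg=$1; size=$2; suffix=$3; shift 3
    for lv in "$@"; do
        echo "lvcreate -L$size -s -n $lv$suffix /dev/$vg/$lv"
    done
)
```

For example, `snapshot_cmds volgroup 50M b1 MDT OST0` prints one lvcreate command per target. Once the job that updates the backup file system is known to be idle, the output could be piped to a shell.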
Use this procedure to restore the file system from an LVM snapshot.
1. Rename the LVM snapshot.
Rename the file system snapshot from "main" to "back" so that you can mount it without unmounting "main". This is recommended, but not required. Use the --reformat flag to tunefs.lustre to force the name change. For example:
cfs21:~# tunefs.lustre --reformat --fsname=back --writeconf /dev/volgroup/MDTb1
checking for existing Lustre data
found Lustre data
Reading CONFIGS/mountdata
Read previous values:
Target: main-MDT0000
Index: 0
Lustre FS: main
Mount type: ldiskfs
Flags: 0x5
(MDT MGS )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters:
Permanent disk data:
Target: back-MDT0000
Index: 0
Lustre FS: back
Mount type: ldiskfs
Flags: 0x105
(MDT MGS writeconf )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters:
Writing CONFIGS/mountdata
cfs21:~# tunefs.lustre --reformat --fsname=back --writeconf /dev/volgroup/OSTb1
checking for existing Lustre data
found Lustre data
Reading CONFIGS/mountdata
Read previous values:
Target: main-OST0000
Index: 0
Lustre FS: main
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.0.21@tcp
Permanent disk data:
Target: back-OST0000
Index: 0
Lustre FS: back
Mount type: ldiskfs
Flags: 0x102
(OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.0.21@tcp
Writing CONFIGS/mountdata
When renaming a file system, you must also erase the last_rcvd file from the snapshots:
cfs21:~# mount -t ldiskfs /dev/volgroup/MDTb1 /mnt/mdtback
cfs21:~# rm /mnt/mdtback/last_rcvd
cfs21:~# umount /mnt/mdtback
cfs21:~# mount -t ldiskfs /dev/volgroup/OSTb1 /mnt/ostback
cfs21:~# rm /mnt/ostback/last_rcvd
cfs21:~# umount /mnt/ostback
2. Mount the file system from the LVM snapshot.
cfs21:~# mount -t lustre /dev/volgroup/MDTb1 /mnt/mdtback
cfs21:~# mount -t lustre /dev/volgroup/OSTb1 /mnt/ostback
cfs21:~# mount -t lustre cfs21:/back /mnt/back
3. Note the old directory contents, as of the snapshot time.
cfs21:~/cfs/b1_5/lustre/utils# ls /mnt/back
fstab passwd
To reclaim disk space, you can erase old snapshots as your backup policy dictates. Run:
lvremove /dev/volgroup/MDTb1
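A retention policy such as "keep the newest N snapshots" can be sketched as follows. The helper name prune_cmds is hypothetical, the snapshot names must be supplied oldest first, and the function only prints lvremove commands so they can be reviewed before anything is deleted.

```shell
#!/bin/sh
# Sketch: given snapshot names ordered oldest first, print lvremove
# commands for all but the newest KEEP snapshots.

prune_cmds() (
    vg=$1; keep=$2; shift 2
    excess=$(( $# - keep ))            # how many of the oldest to remove
    for snap in "$@"; do
        [ "$excess" -gt 0 ] || break
        echo "lvremove /dev/$vg/$snap"
        excess=$(( excess - 1 ))
    done
)
```

For example, with three MDT snapshots named MDTb1, MDTb2 and MDTb3 (the last two names are illustrative), `prune_cmds volgroup 2 MDTb1 MDTb2 MDTb3` prints only the command that removes MDTb1. The matching OST snapshots from the same checkpoint would normally be removed at the same time.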
You can also extend or shrink snapshot volumes if you find your daily deltas are smaller or larger than expected. Run:
lvextend -L10G /dev/volgroup/MDTb1
| Note - Extending snapshots does not appear to work in older versions of LVM. It works in LVM v2.02.01. |
Copyright © 2011, Oracle and/or its affiliates. All rights reserved.