Lustre Snapshots

Requirements
Users have requested the ability to create filesystem-level snapshots for several main uses:


 * Regular users can restore older versions of their files from snapshots to recover from user or application errors. This functionality would typically only be used from interactive login nodes, since the snapshots would be mounted read-only and should not be used for batch HPC processing.
 * Administrators can create consistent backups from a snapshot on a backup node.
 * Administrators can create snapshots before significant system upgrade events, and then revert the whole filesystem to an earlier snapshot in case of serious software-level bugs.

It should be noted that filesystem-level snapshots will not help in the case of device-level corruption (e.g. device failure beyond RAID parity protection, or RAID parity reconstruction errors), because snapshots are typically stored on the same physical devices as the live copy and normally share the same data and metadata blocks unless the file has been modified. Device-level corruption would affect unmodified files regardless of whether they are accessed from the snapshot or the live copy, and would be equally likely to corrupt the copy of any modified file and the filesystem or snapshot metadata.

Snapshot Implementation Proposals
There are two approaches that could be taken to implement filesystem-level snapshots with Lustre, which are compatible to some extent. The first approach, Simplified Userspace Snapshots, is largely implemented in userspace by creating device-level snapshots for all OSTs and MDTs in the filesystem in a roughly synchronized manner. Each snapshot must be explicitly mounted as a separate Lustre filesystem that is visible only on a limited number of client nodes that need such access (e.g. separate /scratch/ and /scratch-yesterday/ filesystems). The second approach is Integrated Lustre Snapshots, which creates device-level snapshots at the OSD level using a coordinated RPC mechanism, and exposes the snapshots as part of the Lustre filesystem namespace visible on all clients (e.g. /scratch/.lustre/snapshots/yesterday/ is a subdirectory of the single /scratch/ filesystem).

The Simplified Userspace Snapshots approach offers several advantages in ease of implementation and time to first availability as well as usefulness of implemented components beyond the scope of snapshots, at the expense of somewhat more administration complexity and runtime overhead (connections, mount points). The Integrated Lustre Snapshots approach allows more features and flexibility in the long term, at the expense of a much more complex initial implementation requirement and a significantly longer lead-time before first availability. There are several components that can be shared between the two approaches, such as the read-only mounting and MDS/OSS Object Destroy Barrier.

Simplified Userspace Snapshots
With a simplified snapshot implementation, there would be one or more device-level snapshots created of the backing filesystem, one for each MDT and OST in the filesystem, using ldiskfs+LVM or ZFS. Userspace scripts would create snapshots on all servers. Initially, there would only be best effort coordination (e.g. within 5-10 seconds) between the MDT and OST snapshots. Only files created or deleted during this window might be inconsistent if accessed in the snapshot (e.g. a "file or directory not found" error returned to the user for files created or deleted during this window). This is not significantly worse than inconsistencies in file data due to applications in the middle of writing to files at the time of snapshot creation, even if it was instantaneous.

In a preliminary implementation, the creation of snapshots on the MDT and OST devices can be managed by an external mechanism across the server nodes (e.g. ssh or pdsh, etc). There is no strong requirement for coordination of the snapshots with clients, from a filesystem metadata consistency point-of-view. This would not guarantee that files being modified at the time of the snapshot contain consistent application state, but only files modified within the past 30s or so might be inconsistent.
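As a rough sketch, a userspace driver for this could simply fan out one snapshot command per server via pdsh. The hostnames, pool/dataset names, and the target map below are hypothetical examples, not a prescribed layout; a real deployment would substitute its own, and would need equivalent logic for LVM-backed ldiskfs targets.

```python
import time

# Hypothetical map of server hostname -> backing ZFS datasets for its targets.
TARGETS = {
    "mds01": ["lustre-mdt0/mdt0"],
    "oss01": ["lustre-ost0/ost0", "lustre-ost1/ost1"],
}

def snapshot_commands(label=None):
    """Build one 'zfs snapshot' command per server, to be run via
    ssh/pdsh in a roughly synchronized manner (best effort; a few
    seconds of skew between MDT and OST snapshots is acceptable)."""
    label = label or time.strftime("%Y%m%d-%H%M")
    cmds = []
    for host, datasets in TARGETS.items():
        snaps = " ".join("%s@%s" % (ds, label) for ds in datasets)
        cmds.append("pdsh -w %s zfs snapshot %s" % (host, snaps))
    return cmds
```

Running the generated commands near-simultaneously (rather than serially per server) keeps the MDT/OST skew within the best-effort window described above.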

Once the individual snapshots were created for all of the OST and MDT backing filesystems, a second step to fix the Lustre-level configuration would be needed to make the snapshots suitable for mounting on the clients. If the snapshots were only created for disaster recovery purposes (e.g. in case of software or application corruption of large parts of the filesystem) then this step may not be necessary, and the snapshots could just take the place of the original devices.

In the case where it is desirable to mount the snapshots on the clients (e.g. to allow users to recover deleted files on login nodes), the Lustre filesystem name in the filesystem labels and in the Lustre configuration logs would need to be changed, so that the clients can distinguish between the MDT(s)/OST(s) of the original filesystem and the snapshot filesystem.

From an implementation point of view, it would be possible to write to a snapshot filesystem, but it may be desirable from an administrative point of view to always mount the snapshot read-only on the clients. To the Lustre clients and servers, a renamed snapshot filesystem will just be a new Lustre filesystem, with no visible relation to the original filesystem. This makes it easy to implement (virtually nothing new at the Lustre level in the kernel, just some scripts/utilities to change the fsname), but it will lack integration with the original filesystem (e.g. it will potentially increase memory consumption by double-caching filesystem metadata, etc), and will appear disjoint from the original filesystem.

As you can see, for the simplified snapshots, most of the process is in the userspace/tools area, related to creating the snapshots in some coordinated manner, and then renaming the filesystem. Some minor changes are needed for the Lustre tools to rename the filesystem, possibly along with a new subcommand for lctl to be able to rename or delete the config of old snapshots on the MGS while it is in use (assuming the same MGS will be used for the original and snapshot filesystems).

Simplified Userspace Snapshot Components
The following components need to be developed as part of the Simplified Userspace Snapshots implementation. However, only the Filesystem Rename component is required for snapshots to become functional at the Lustre level. The rest of the components can be implemented incrementally to improve usability and functionality as schedule allows and requirements demand.

Filesystem Rename
In order to mount the snapshot OST and MDT filesystems as a new Lustre filesystem, the filesystem name (fsname) needs to be changed on the OST or MDT. The fsname is stored in several places on each device - filesystem label, last_rcvd file, CONFIGS/mountdata, and the per-device configuration log. Changing the Lustre fsname is currently difficult to do correctly without expert knowledge.

Changing the filesystem label can be done for ldiskfs with tune2fs, and for ZFS the device properties can be changed during zfs clone. Deleting the last_rcvd file will cause it to be recreated automatically at mount, though this would also evict all mounted clients; alternatively, it would be trivial to change the fsname directly and simply truncate the file, preserving the server state that is stored there. The CONFIGS/{fsname}-{device} file is recreated after tunefs.lustre --writeconf and registering with the MGS again. Finally, the CONFIGS/mountdata file currently has no mechanism to update or recreate it, but it would be straightforward to implement this.

Having tunefs.lustre --fsname={new_name} perform all of the above steps on an unmounted ldiskfs or ZFS OST or MDT snapshot would be needed, and would be useful for users outside the context of snapshots if they want to rename the filesystem for some other reason. This would be relatively simple to implement in the tunefs.lustre tool and does not require detailed Lustre knowledge (LU-5070).
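The per-device rename steps above can be sketched as an ordered command plan. This is an illustration of the sequencing only: the device paths and the ZFS clone naming are hypothetical, and the last_rcvd/CONFIGS step is shown as a placeholder because no tool currently performs it.

```python
def rename_steps(backend, device, new_fsname, index):
    """Return the ordered steps to rename one unmounted snapshot target,
    e.g. index "OST0001".  Illustrative sketch, not an actual tool."""
    new_label = "%s-%s" % (new_fsname, index)
    steps = []
    work = device
    if backend == "ldiskfs":
        # ldiskfs: relabel the filesystem directly (snapshot device).
        steps.append("tune2fs -L %s %s" % (new_label, device))
    elif backend == "zfs":
        # ZFS: a read-only snapshot must first be cloned read-write
        # before its label/properties can be changed ("@snap" and the
        # clone name are placeholders).
        work = device + "-newfs"
        steps.append("zfs clone %s@snap %s" % (device, work))
    # No tool exists for this step yet; see CONFIGS/mountdata above.
    steps.append("rewrite fsname in last_rcvd and CONFIGS/mountdata on %s" % work)
    # Regenerate the configuration and re-register with the MGS.
    steps.append("tunefs.lustre --writeconf --fsname=%s %s" % (new_fsname, work))
    return steps
```

A tunefs.lustre --fsname implementation would essentially execute this plan for each MDT and OST device in turn.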

It should be noted that there is currently a restriction on the fsname length in Lustre to a maximum of 8 characters, though there is no Lustre-imposed limit on the name of the directory where the snapshot is mounted. This precludes storing a full snapshot timestamp in the fsname itself, allowing at most a short index integer to be used to distinguish the fsname. This limitation was originally imposed by the ldiskfs filesystem label length and is inherited by the size of fields in a variety of data structures; no such limit is present in ZFS. It may be possible to increase this limit, or hide it from the user via automount mapping of the short fsname to a more useful snapshot and mount point name.
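To make the constraint concrete, a snapshot naming scheme has to squeeze an index into the 8-character budget. The "_s{index}" convention below is just one possible choice, with the descriptive timestamp kept in the mount point or automount map instead.

```python
MAX_FSNAME = 8   # current Lustre fsname limit (from the ldiskfs label)

def snap_fsname(base, index):
    """Derive a short snapshot fsname by truncating the base fsname to
    make room for an index suffix, e.g. ("scratch", 3) -> "scrat_s3".
    The full timestamp would live in the automount map, not the fsname."""
    suffix = "_s%d" % index
    name = base[:MAX_FSNAME - len(suffix)] + suffix
    assert len(name) <= MAX_FSNAME
    return name
```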

Read-only Mount of OST/MDT
Currently, clients can mount a Lustre filesystem read-only using the "-o ro" mount option on the client. This is needed for every client that is mounting a filesystem. It is not currently possible to actually do a read-only mount of an OST or MDT device. It would be desirable to mark the OST and MDT snapshot filesystems read-only in the last_rcvd file, and/or mount the backing filesystems read-only to ensure that clients could not modify the snapshot. This is useful beyond the scope of snapshots (e.g. disaster recovery, to allow copying files from a partially failing OST while minimizing further problems, or administrative quiescence of the filesystem before decommissioning) and is relatively simple to implement (LU-5553).

MDS/OSS Object Destroy Barrier
In Lustre 2.4 and later, the MDS completely manages file unlink operations, waiting until the last link is deleted from the namespace and committed to disk before sending OST RPCs to destroy the objects. In all versions of Lustre the OST objects are pre-created before they are allocated to MDT inodes. Due to non-atomic snapshot creation across multiple MDT and OST targets there may be user-visible namespace inconsistencies with files created or destroyed in the interval between the MDT and OST snapshots, regardless of which ones are created first.

If the MDT snapshot is created first, then a file is unlinked and the objects are deleted before the OST snapshots are created, access to the file in the MDT snapshot will result in errors accessing the deleted OST objects. Conversely, if the OST snapshots are created first and a new file is created on the MDT before its snapshot is created, this results in an MDT inode referencing OST objects that are not in the corresponding snapshot. To avoid these snapshot ordering issues, the MDS should flush any pending OST object destroys, then impose a barrier on sending any more OST object destroys until both the MDTs and OSTs have completed their snapshots and the barrier is removed. The MDS barrier would also suspend file creation if pre-created OST objects are exhausted during snapshot creation.
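The barrier ordering described above can be modeled roughly as follows. This is an illustrative model of the sequencing logic only (flush pending destroys, hold new ones, release after all snapshots complete), not actual MDS code; the class and method names are invented for the sketch.

```python
class DestroyBarrier:
    """Model of the MDS-side OST object destroy barrier."""

    def __init__(self):
        self.barrier_up = False
        self.held = []        # destroys queued while the barrier is up
        self.sent = []        # destroys actually sent to the OSTs

    def flush_and_raise(self, pending):
        """Flush already-pending destroys to the OSTs, then raise the
        barrier before any snapshots are created."""
        self.sent.extend(pending)
        self.barrier_up = True

    def destroy(self, obj):
        """Unlink committed on the MDT: send or defer the OST destroy."""
        if self.barrier_up:
            self.held.append(obj)   # deferred until snapshots complete
        else:
            self.sent.append(obj)

    def lower(self):
        """All MDT and OST snapshots are done: release held destroys."""
        self.barrier_up = False
        self.sent.extend(self.held)
        self.held = []
```

In the same window, the MDS would also suspend file creation if the pre-created OST objects were exhausted, as noted above.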

It may or may not be desirable to allow LFSCK to run on a snapshot filesystem. On the positive side, LFSCK can resolve minor inconsistencies that may exist in the original filesystem at snapshot time, or that were caused by races between targets during snapshot creation. On the negative side, one reason to create snapshots is to provide a disaster-recovery fallback in case of software bugs, and since LFSCK itself modifies the filesystem it may be the cause of data corruption in the first place. In either case, LFSCK should not be run automatically on a snapshot filesystem.

Automount of Snapshot Filesystems
In order to simplify management of snapshot filesystems on Lustre clients that may want to mount them, and avoid complexity with aging and expiry of older snapshots, it makes sense to use automount to mount the snapshots on access from the clients/users and unmount them when the snapshots are idle. There have been some problems with using automount together with Lustre in the past that need to be investigated/fixed (LU-430). Recent reports indicate that automount may be working, in which case a test should be added to conf-sanity.sh to ensure this continues to work. Having functional automount is useful independent of the scope of snapshots.

The client-side mountpoint for the snapshot filesystems is somewhat arbitrary, but it would be desirable to plan the deployment in a manner that simplifies migration to Integrated Lustre Snapshots in the future. In this light, it would be useful to mount the snapshots in $MOUNT/.lustre/snapshot/ so that this mountpoint can also be handled automatically in the future to create a new snapshot by mkdir $MOUNT/.lustre/snapshot/<name>. Creating snapshots this way would require the ability to create directories under .lustre/snapshot, so it isn't a requirement for initial deployment.
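For illustration, such a deployment might use autofs maps along the following lines. All names here are hypothetical (the MGS nickname, map file path, fsname "scr_s1", and timeout), and the exact Lustre location syntax accepted by automount would need to be verified against the LU-430 findings.

```
# /etc/auto.master (illustrative): automount snapshots under the
# planned .lustre/snapshot directory, expiring idle mounts.
/scratch/.lustre/snapshot  /etc/auto.lustre_snap  --timeout=600

# /etc/auto.lustre_snap (illustrative): map the user-visible snapshot
# name to the renamed (short-fsname) snapshot filesystem, read-only.
2015-02-03  -fstype=lustre,ro  mgs01@tcp:/scr_s1
```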

MGS Configuration Migration
Since the MDT and OST snapshots have a different fsname from the source filesystem, when they are registered with the MGS after a tunefs.lustre --writeconf operation, they appear as a new filesystem and do not keep any of the configuration information from the original filesystem. The configuration information is typically not required for functionality, but keeping filesystem tunables for the snapshot may be important in some cases. Generating a new copy of the configuration log for the new snapshot on the MGS should be done at the time the snapshot is created. Otherwise, there would need to be an interface to copy the Lustre configuration log of the source filesystem to the new filesystem name while the MGS is active, and then change the fsname within the configuration log. There needs to be a mechanism to delete an old snapshot configuration when deleting a snapshot.

Integrated Lustre Snapshots
Integrated Lustre Snapshots (ILS) require support inside Lustre for clients and servers to distinguish between different snapshots (or ZFS datasets, a superset of snapshots) on the servers when accessing any file in the filesystem. Each Lustre File Identifier (FID) used by the clients and servers would need to track an identifier for which dataset each object belongs to. At the server, the FID is mapped internally by Lustre to a specific dataset, and then used as today to do a lookup of the individual object within the dataset. The Lustre target is in charge of creating and mounting all snapshots internally, and presenting this to the client as a single filesystem. The existing PtlRPC client-target connections (e.g. OSC->OST) would be shared for all existing targets, avoiding increasing the number of PtlRPC connections that need to be managed (pings, exports, recovery, etc) which is already too much overhead for the largest systems deployed today.

Integrated Lustre Snapshot Components
The following components need to be developed as part of the Integrated Lustre Snapshots implementation. Some of the components specified for Simplified Userspace Snapshots above are required before Integrated Lustre Snapshots are usable, including MDS/OSS Object Destroy Barrier and Read-only Mount of OST/MDT. There is a substantial up-front Lustre implementation effort required for the infrastructure before ILS snapshots are usable.

Local OSD Snapshot Creation and Access
First and foremost there needs to be a mechanism implemented by the Lustre OSD to create and access the snapshot datasets on the MDS and OSS nodes. This would allow creation of local backing filesystem (ZFS) snapshots, as well as eventually allowing clients to access the snapshots without explicit interaction from the user or administrator. To reduce the overhead of needing Lustre-level or network-level connections for each snapshot, the snapshot dataset should be mounted internally to the OSD, and accessible using the same Lustre connections between the clients and servers. To minimize server overhead, it may be desirable to only mount snapshots internally on demand, as there may be many snapshots that are never used by users.
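The on-demand internal mounting could follow a simple lazy-cache pattern, sketched below. This is a model of the caching behaviour only; the class name and the mount hook are invented for illustration, and a real OSD would also need reference counting and idle unmounting.

```python
class OsdDatasetCache:
    """Lazily mount snapshot datasets inside the OSD on first access."""

    def __init__(self, mount_fn):
        self.mount_fn = mount_fn   # backend hook that mounts one dataset
        self.mounted = {}          # dsi -> mounted dataset handle

    def get(self, dsi):
        """Return the dataset for 'dsi', mounting it on first use so
        that snapshots nobody accesses cost the server nothing."""
        if dsi not in self.mounted:
            self.mounted[dsi] = self.mount_fn(dsi)
        return self.mounted[dsi]
```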

The concept of a snapshot in ZFS is a subset of the ZFS dataset concept; snapshots and datasets are both sub-filesystems within a single ZFS pool. A snapshot is a read-only dataset that is initially populated with objects referenced from the original dataset. A snapshot dataset can be converted into a read-write clone dataset, allowing it to be modified (e.g. for changing the Lustre fsname) or possibly replacing the original dataset in the face of catastrophic failure. Additionally, a new empty dataset could be created in the same ZFS storage pool. Separate datasets may be useful for allowing users to have different ZFS parameters for their dataset than others (e.g. snapshot frequency and retention, data compression, restricting dataset access, fast deletion of user or project data, etc.). Since there is only minimal added complexity for supporting ZFS datasets compared to only supporting ZFS snapshots, the term dataset may be used interchangeably with snapshot in this document to describe required ZFS snapshot functionality.

Data Set Identifier Support
The Lustre FID used to identify files contains a reserved 32-bit field, f_ver, which has always been intended to be used for snapshots or file versioning. Since ZFS datasets are a superset of snapshots, it makes sense to repurpose the f_ver field for the Dataset Identifier (DSI) as f_dsi. The DSI component of the FID will be used throughout the Lustre client and server code to identify which dataset (snapshot) each file and directory belongs to. The DSI is generated by the master MDS at the time a snapshot is first created, and is shared by all MDT and OST datasets that make up a single coherent Lustre filesystem snapshot.

Passing the DSI as part of the existing Lustre FID minimizes the number of places in the code that need to be modified in order to support snapshots. It should be noted that the DSI is NOT stored on disk with the FID in the snapshot, since multiple snapshots can reference a given object at the same time. The actual f_dsi field in the FID would be filled dynamically as objects are being fetched from disk on the server, and would be largely transparent to the client, which is a significant benefit. The value of f_dsi depends on the dataset by which the file was accessed.
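The key property, that the DSI is stamped into the FID at fetch time rather than stored on disk, can be illustrated as follows. The class and function names are invented for the sketch and do not correspond to Lustre's actual struct lu_fid handling.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class LuFid:
    """Illustrative FID: the on-disk copy always has f_dsi == 0."""
    f_seq: int
    f_oid: int
    f_dsi: int = 0   # repurposed f_ver field; 0 = live filesystem

def fid_for_dataset(disk_fid, dsi):
    """Return the client-visible FID for an object fetched via dataset
    'dsi'.  Many snapshots may reference the same on-disk object, so
    the DSI depends on the access path, not on the stored FID."""
    return replace(disk_fid, f_dsi=dsi)
```

The same on-disk object therefore yields a distinct client-visible FID per snapshot, while the stored copy remains shared and unversioned.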

The major drawback of having a unique f_dsi field for each version of an object is that this requires modifying every FID handled by Lustre on the server. This would require an in-depth knowledge of every data structure that contains a FID, such as directories, link xattr, all LOV layout types, and other data structures that do not even exist yet. Ideally this could be done by a FID accessor function used throughout the code (e.g. swab, but that isn't done on the server).

Another alternative is to pass a separate DSI throughout the Lustre code to identify which dataset is currently being accessed. This would remove the requirement for modifying all FIDs, at the expense of requiring significant API changes throughout the Lustre code in order to manage this extra state for each operation.

It may be possible to have a hybrid model, where the DSI is modified as part of the primary FID returned for an object, and then anywhere that accesses data structures associated with that object (e.g. link xattr, LOV layouts, etc.) will do their own virtual replacement of the f_dsi field as needed. This localizes knowledge of the data structures to the code that is accessing them, avoiding global API changes to the code to pass the DSI separately. The drawback is that several parts of the client and servers would need to become aware of f_dsi replacement.

Map Dataset Identifier to Local Dataset on Server
The FID with its accompanying DSI needs to be mapped at the MDT and OST levels to a local ZFS dataset identifier using a lookup table. The Lustre Dataset Index (LDI) would be analogous to the current Object Index (OI) that maps Lustre FIDs to the ZFS local dnode number.

The LDI maps a common global DSI to an MDT- or OST-unique dataset identifier, and avoids any requirement for coordination between the dataset identifiers between targets. While such coordination might be possible for ZFS filesystems created at the same time, it would be quite difficult to implement if OSTs or MDTs were added to an existing filesystem, and may be impossible with other filesystem types, so the use of local identifiers is best kept internal to the OSD implementation.

The LDI is expected to be a simple ZAP index that maps the global DSI number to the ZFS dataset number, which is its root dnode number in the ZFS Meta Object Store (MOS). The LDI would need to be accessed for every FID lookup to locate the correct dataset on the server, and then the dataset-local OI is used to map the FID to the local ZFS dnode number.
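The resulting two-hop lookup (LDI first, then the dataset-local OI) can be sketched with plain dictionaries standing in for the ZAP indices. All identifier values below are hypothetical.

```python
# Hop 1: global DSI -> local dataset (its root dnode number in the MOS).
LDI = {7: 123}

# Hop 2: per-dataset OI mapping (f_seq, f_oid) -> local ZFS dnode number.
OI = {123: {(0x200000400, 5): 9876}}

def fid_to_dnode(f_seq, f_oid, f_dsi):
    """Resolve a FID to a local dnode: LDI selects the dataset, then
    that dataset's own OI maps the FID to the dnode."""
    dataset = LDI[f_dsi]               # would be a ZAP lookup in the OSD
    return OI[dataset][(f_seq, f_oid)] # dataset-local OI lookup
```

Because the LDI is consulted on every FID lookup, it would need to be small and cached, much like the existing OI.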

Namespace Mapping of Dataset Identifiers
In order to determine the DSI for a FID, the client needs to do a lookup of the root object for the particular dataset. This would typically be done by doing a lookup by name in the $LUSTRE/.lustre/snapshot directory. This directory on the MDS would persistently store the mapping of all the snapshot names to DSI numbers. Once a client had traversed into the .lustre/snapshot directory (or in the case of a non-snapshot dataset attached elsewhere in the filesystem namespace) all FIDs returned by the MDS would need to be modified to include the DSI number to ensure that FID lookups reference the correct snapshot version.

Remote Dataset Creation, Listing, Removal, Rename
In addition to implementing the MDS/OSS Object Destroy Barrier described for Simplified Userspace Snapshots, an RPC mechanism is needed to access the remote OSD functionality to create and remove snapshots on targets. The proposed mechanism for creating a new snapshot is via "mkdir $MOUNT/.lustre/snapshot/<name>" to create a new snapshot of the current filesystem state with the given name. This also provides a simple mechanism to list and remove Lustre snapshots, simply "ls $MOUNT/.lustre/snapshot" and "rmdir $MOUNT/.lustre/snapshot/<name>" respectively. This avoids the need for extra tools and provides a natural interface for creating snapshots.

It needs further investigation whether it is possible to (easily) rename a snapshot or dataset after creation. If that is practical, then "mv $MOUNT/.lustre/snapshot/<old_name> $MOUNT/.lustre/snapshot/<new_name>" should be allowed, but this is a secondary goal and is not required for the initial implementation.

Dataset Recovery Mechanism
There would need to be a recovery mechanism for creating and removing remote snapshots in case a target was offline at the time the snapshot was created. Fortunately, if a target is offline at the time a new snapshot or dataset is created it cannot be modified, so it is sufficient to create any pending datasets at the time the target comes online. Similarly, for deleted snapshots it is sufficient to remove datasets on the failed target when it comes online, since they will no longer be in use.
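The replay logic is simple precisely because an offline target cannot change: the returning target only needs to be brought up to date against the coordinator's log. The sketch below models this reconciliation; the log format and function name are invented for illustration.

```python
def replay_pending(target_dsis, snapshot_log):
    """Given the DSIs a returning target currently has and the
    coordinator's ordered log of (op, dsi) entries, return the
    operations the target must apply to catch up."""
    have = set(target_dsis)
    want = set()
    for op, dsi in snapshot_log:
        if op == "create":
            want.add(dsi)
        elif op == "remove":
            want.discard(dsi)   # deleted snapshots need not be created
    creates = sorted(want - have)   # missed while offline
    removes = sorted(have - want)   # deleted while offline
    return [("create", d) for d in creates] + [("remove", d) for d in removes]
```

Note that a snapshot both created and removed during the outage cancels out and is never instantiated on the returning target.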

ZFS vs. LVM+ldiskfs Snapshots
The choice of backing filesystem type affects the implementation of Simplified Userspace Snapshots only to the degree that the userspace scripts for creating and managing the snapshots are different. There will also be some difference in how read-only devices will be mounted by the OSD. However, the clients will remain unaware of the distinction of OSD type, as will the higher-level server layers. The major difference will be in the filesystem performance when snapshots exist. For ZFS the creation of a snapshot is normal behaviour every few seconds of the system's lifetime, and keeping additional snapshots is largely a decision about how much space will be unavailable to users. On the other hand, LVM+ldiskfs snapshots impose significant overhead at runtime for every new snapshot that is created, as any blocks modified in the live filesystem need to be read from disk and copied to every other snapshot before the block can be overwritten. This makes LVM+ldiskfs somewhat impractical for deployment of more than a single snapshot, and unattractive as a development target for this feature, even if it would be possible to develop scripts for it with approximately an equal amount of effort. There is a secondary concern that some aspects of osd-ldiskfs are not tightly integrated with the filesystem freeze API in the kernel, and this would need to be fixed to block OST IO submissions during the creation of LVM snapshots.

For Integrated Lustre Snapshots the situation is somewhat different, since the OSD needs to be able to create and remove snapshots efficiently, and access objects from multiple snapshots transparently to the upper layers. This makes ZFS the main target for the second phase of feature development.

Beyond Integrated Snapshots
Once the Integrated Snapshot functionality is implemented, it is a relatively small step to allowing integrated datasets to be handled. The main difference between a dataset and a snapshot is that the dataset is not rooted in the existing filesystem, and may attach to the namespace at an arbitrary location.

Dataset Redirection Entry
In the case of only snapshots under the .lustre/snapshot directory, the mapping of names to DSI numbers might be special-cased in the lookup methods of this directory. However, in order to attach completely new datasets into arbitrary locations in the namespace (e.g. /home/adilger or /project/tokomak), there would need to be a special Dataset Redirection Entry (DRE) that maps a name to a specific DSI number. Similar to lookups in the .lustre/snapshot directory, the lookup of any name in this directory would attach the DSI number to f_dsi for all subsequent operations within the dataset. In this case, the f_dsi number would be stored on disk in the directory entry. Some thought needs to be given to how a snapshot would be accessed from within a dataset: either by virtue of the .lustre/snapshot directory within the dataset mountpoint, or possibly by transparently traversing from the parent snapshot to the dataset snapshot, though the latter may be impossible if the snapshots of the parent and dataset are not coordinated. However, without such transparent traversal it may be confusing for users if accessing /project/.lustre/snapshot/2015-02-03/tokomak goes through the snapshot of the parent filesystem but leads to the current tokomak dataset, while /project/tokomak/.lustre/snapshot/2015-02-03 correctly leads to the snapshot of the tokomak project dataset.

Cross-Dataset Rename and Hard Link
For read-only snapshots, rename and hard links will not be an issue since the snapshot cannot be modified. Since the datasets can be modified, but are separate ZFS datasets at the OSD side, it will not be possible to create hard links or rename files between separate datasets. This would generate an EXDEV error to userspace and be handled in the same way as normal cross-device operations (typically copying the file data). Since the mountpoint will be the same in userspace, this will be detected within the kernel by comparing the source and target f_dsi values during link and rename operations. It may be desirable to export separate s_dev values to userspace for inodes in different datasets so that userspace can also detect that files are on different datasets, similar to what is done with Btrfs.
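The kernel-side check reduces to comparing the source and target DSI values during link and rename, sketched below. This mirrors the standard VFS cross-mount behaviour; the function name is invented for illustration.

```python
import errno

def check_cross_dataset(src_dsi, tgt_dsi):
    """Return 0 if a link/rename may proceed, or -EXDEV if the source
    and target live in different datasets.  Userspace then falls back
    to the usual cross-device handling (typically copying the data)."""
    if src_dsi != tgt_dsi:
        return -errno.EXDEV
    return 0
```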