Managing Lustre with the ZFS backend as a High Availability Service

Overview
Service continuity, or providing high availability for a service, is implemented in Lustre by means of a mechanism called “failover”. A failover service is one that can run on exactly one machine (a physical or virtual computer node) at a time, but has a choice of machines on which to run. The machines are usually identical, or at least very similar, in configuration, and share common infrastructure characteristics that permit the service to start and stop in a predictable manner on each machine. The service is therefore able to run with the same configuration and data, regardless of the selected machine. If the service is running on a machine that fails, it can be restarted on one of the surviving machines in the common infrastructure pool.

Lustre services are tightly coupled to the data storage. An MGS has a corresponding MGT, each MDS has one or more corresponding MDTs, and each OSS has one or more corresponding OSTs. Failover resources are defined in terms of the storage targets (MGT, MDTs, OSTs). When a Lustre server node develops a fault, it is the mount point of the storage that migrates, or “fails over”, between the nodes. This is because the running state of Lustre services is governed by the mount and umount commands. There are no user-space daemons like httpd or nfsd to start and stop; we simply mount the Lustre storage target to start the service, and unmount the target to stop the service.
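Because there are no daemons to manage, "starting" and "stopping" a Lustre service reduces to mount and umount invocations. The following sketch illustrates this for an MGS; the pool name (mgspool/mgt) and mount point (/lustre/mgt) are illustrative assumptions, and RUN=echo makes it a dry run that only prints the commands:

```shell
#!/bin/sh
# Starting and stopping an MGS is purely a mount/umount operation --
# there is no mgs daemon. RUN=echo prints commands instead of running them.
RUN=${RUN:-echo}

start_mgs() {
    $RUN mkdir -p /lustre/mgt                     # assumed mount point
    $RUN mount -t lustre mgspool/mgt /lustre/mgt  # mounting the MGT starts the MGS
}

stop_mgs() {
    $RUN umount /lustre/mgt                       # unmounting the MGT stops the MGS
}

start_mgs
stop_mgs
```

Set RUN to the empty string on a real server (as root) to execute the commands for real.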

This also means that the storage targets must be reachable by more than one server. Typically, Lustre file systems are assembled from multiple pairs of servers, each pair being connected to a common pool of shared storage. Each host in a pair mounts a subset of the connected storage, presented to the OS in the form of one or more logical units (an aggregated set of devices arranged in a RAID pattern) or, in the case of JBOD storage enclosures, discrete devices.

For example, in a typical metadata server cluster pair, the storage will comprise one MGT volume and one MDT volume. One server will act as the primary host for the MGT and corresponding MGS, while the other server is the primary or preferred host for the MDT/MDS. If there is a server failure, the affected services are restarted on the surviving host.

For failover to work with Lustre, the storage targets must be configured with the NIDs of the hosts that are expected to be able to mount and provide storage services for that specific storage target. These are specified using the --servicenode or --failnode command line options to mkfs.lustre (or, for an existing target, tunefs.lustre).
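As a concrete sketch, the command below formats a ZFS-backed OST with two service NIDs recorded in its configuration, so either server can mount it. The file system name, NIDs, and pool/dataset names are all illustrative assumptions, and RUN=echo makes this a dry run:

```shell
#!/bin/sh
# Format an OST with failover NIDs. All names and addresses below are
# hypothetical examples; RUN=echo prints the command instead of running it.
RUN=${RUN:-echo}

format_ha_ost() {
    $RUN mkfs.lustre --ost --index=0 --fsname=demofs \
        --mgsnode=192.168.1.10@tcp --mgsnode=192.168.1.11@tcp \
        --servicenode=192.168.1.20@tcp \
        --servicenode=192.168.1.21@tcp \
        --backfstype=zfs \
        ostpool/ost0
}

format_ha_ost
```

Listing both hosts with --servicenode (rather than --failnode) treats them as equal peers, either of which may serve the target.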

For high-availability clusters with ZFS storage shared between nodes as a failover resource, it is also required that each ZFS pool is created and imported with the cachefile property set to the special value none, and the multihost property set to on:

 zpool create [-f] -O canmount=off \
     -o cachefile=none \
     -o multihost=on \
     [-o ashift=<value>] \
     <pool> <vdev specification>

 zpool import [-f] -o cachefile=none <pool>
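A filled-in sketch of the synopsis above may be helpful. The pool name, mirror layout, device names, and ashift value are illustrative assumptions; RUN=echo makes this a dry run:

```shell
#!/bin/sh
# Create an OST pool with the settings required for HA operation:
# cachefile=none prevents automatic import at boot, and multihost=on
# enables ZFS multiple-import protection. Names are hypothetical.
RUN=${RUN:-echo}

create_ost_pool() {
    $RUN zpool create -O canmount=off \
        -o cachefile=none \
        -o multihost=on \
        -o ashift=12 \
        ostpool mirror /dev/sdb /dev/sdc
}

# On a failover host the existing pool is imported, never created again:
import_ost_pool() {
    $RUN zpool import -o cachefile=none ostpool
}

create_ost_pool
import_ost_pool
```

Note that cachefile=none must be repeated on every import; it is not a persistent pool property.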

The remainder of this article describes the underlying mechanics of managing migration / failover of Lustre services. Refer to Creating a Framework for High Availability with Pacemaker for information regarding the integration of Lustre services into the Pacemaker and Corosync high availability cluster framework.

Controlled Migration or Failover of a Lustre Service
The migration, or “failover”, of a Lustre service between hosts in a cooperative high-availability cluster is straightforward. A Lustre service runs wherever the corresponding storage is mounted (e.g., the MGS runs where the MGT has been mounted, similarly for MDS/MDT or OSS/OST). A service is started when the storage targets are mounted, and stopped when the storage targets are unmounted. So, a Lustre service is migrated by unmounting the storage from one host and remounting it on a different host in the cluster.

The concept applies to both LDISKFS and ZFS object storage devices (OSDs). For ZFS storage there is some additional complexity, because ZFS is a volume management solution as well as a file system: the ZFS pool must be imported before the Lustre OSD can be mounted.

Some care is also required to prevent storage from being mounted on multiple hosts simultaneously. For LDISKFS volumes, multiple-mount protection is enabled automatically, while for ZFS, the multihost property must be enabled and a persistent hostid set on each server.
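ZFS multihost protection relies on each server having a stable, unique hostid. The check below is safe to run on any Linux host; creating /etc/hostid with genhostid requires root:

```shell
#!/bin/sh
# Check whether a persistent hostid is set, as required by ZFS multihost.
# Without /etc/hostid, the hostid is derived from the network configuration
# and may not be stable across reboots.
if [ -r /etc/hostid ]; then
    echo "persistent hostid file present: /etc/hostid"
else
    echo "no /etc/hostid found; run 'genhostid' (as root) to create one"
fi

hostid    # prints the current host identifier as 8 hex digits
```

Each server in the failover pair must report a different hostid, otherwise multihost protection cannot distinguish the hosts.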

The procedure for the controlled migration of a Lustre storage target, where both hosts are online and active, is as follows:

1. Unmount the storage from the current (primary) host:

    umount <mount point>

2. For ZFS OSDs, use the zpool export command to export the ZFS pool from the configuration on the current host:

    zpool export <pool>

3. Log into the failover host.

4. For ZFS OSDs, run the zpool import command, being sure to set cachefile=none:

    zpool import [-f] -o cachefile=none <pool>

Do not use the -f flag unless absolutely necessary. The ZFS pool should import cleanly if it was exported from the primary host. If the import fails, and the output from zpool import indicates that the pool may be in use on another host, check the host that is referred to. Ensure that the zpool has been properly exported from the primary host by running zpool list on the primary and verifying that the pool is no longer present in the listed output. If the pool is positively confirmed as exported, or at least not active on any other host, then run the zpool import command again, this time including the -f flag. Note: when the multihost property is on, import of a pool will be prevented if the pool is imported on another node.

Once imported, check that the pool is healthy and that there are no active issues on the pool:

    zpool status
    zfs list

Finally, mount the storage on the failover host:

    mkdir -p <mount point>
    mount -t lustre <pool>/<dataset> <mount point>

For ZFS OSDs, the zdb command can be useful in verifying the on-disk configuration:

    zdb -e <pool>
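The controlled-migration procedure can be sketched as a pair of shell helpers, one run on each host. The pool, dataset, and mount point names are illustrative assumptions, and RUN=echo makes this a dry run:

```shell
#!/bin/sh
# Controlled migration of a ZFS-backed Lustre target.
# release_target runs on the current (primary) host;
# acquire_target runs on the failover host. Names are hypothetical.
RUN=${RUN:-echo}

release_target() {                  # on the primary host
    $RUN umount /lustre/ost0        # stop the Lustre service
    $RUN zpool export ostpool       # release the pool for the other host
}

acquire_target() {                  # on the failover host
    $RUN zpool import -o cachefile=none ostpool  # no -f needed after a clean export
    $RUN zpool status ostpool       # verify the pool imported cleanly
    $RUN mkdir -p /lustre/ost0
    $RUN mount -t lustre ostpool/ost0 /lustre/ost0  # restart the service
}

release_target
acquire_target
```

In a real cluster these two functions would run on different machines, with release_target completing before acquire_target begins.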

Failover of a Lustre Service when a Server Fails
When a server fails without cleanly exporting its pools, the migration must be driven entirely from the failover host. Attempting a zpool import on the failover host will, in this case, report that the pool was last in use on another system, and display the hostname and hostid of that system.

For a two-node HA failover group, if the hostname and hostid fields match the identity of the host that has the fault, and the faulted host has been isolated from the storage (e.g. powered off), then it is safe to proceed with the import. If more than two hosts are connected to the shared storage, make sure that no other host has imported the zpool before continuing. Note: for versions of ZFS that support the multihost property (introduced in ZFS on Linux version 0.7.0), the ZFS pool will be protected against accidental imports of pools that are active on other machines.

1. For ZFS OSDs, run the zpool import command, being sure to set cachefile=none:

    zpool import [-f] -o cachefile=none <pool>

In this scenario, with the primary host offline, the -f flag will almost certainly have to be used to successfully import the pool, but always treat -f as an option of last resort.

2. Check that the ZFS pool has been imported cleanly and that there are no active issues on the pool:

    zpool status
    zfs list
    zfs get all <pool>/<dataset>

The output will show updated hostname and hostid fields with values corresponding to the failover host.

3. Mount the storage on the failover host:

    mkdir -p <mount point>
    mount -t lustre <pool>/<dataset> <mount point>

Setting the cachefile and multihost properties correctly is critical to protecting the ZFS storage pools from data corruption in high-availability clusters.
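The failed-server sequence can be sketched as follows: attempt a clean import first, and fall back to a forced import only once the faulted host is confirmed isolated. Pool and mount point names are illustrative assumptions, and RUN=echo makes this a dry run:

```shell
#!/bin/sh
# Failover after a server failure. A clean import is attempted first;
# 'zpool import -f' is a last resort, used only after the faulted host
# has been verified as down (e.g. via out-of-band power control).
# Names are hypothetical. With RUN=echo the first import "succeeds"
# (echo returns 0), so the forced branch is not exercised in a dry run.
RUN=${RUN:-echo}

forced_failover() {
    if $RUN zpool import -o cachefile=none ostpool; then
        echo "pool imported cleanly"
    else
        # Faulted host confirmed isolated from the shared storage:
        $RUN zpool import -f -o cachefile=none ostpool
    fi
    $RUN zpool status ostpool
    $RUN mkdir -p /lustre/ost0
    $RUN mount -t lustre ostpool/ost0 /lustre/ost0
}

forced_failover
```

A cluster manager such as Pacemaker automates exactly this logic, including the fencing step that makes the forced import safe.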

Caution: Do not rely on the system defaults when working with shared ZFS storage in high-availability clusters.