Managing Lustre with the ZFS backend as a High Availability Service

Overview

Service continuity, or providing high availability for a service, is implemented in Lustre by means of a mechanism called “failover”. A failover service is one that can run on exactly one machine (a physical or virtual computer node) at a time, but has a choice of machines on which to run. The machines are usually identical, or very similar, in configuration, and share common infrastructure characteristics that allow the service to start and stop in a predictable manner on each machine. The service is therefore able to run with the same configuration and data, regardless of the selected machine. If the service is running on a machine that fails, it can be restarted on one of the surviving machines in the common infrastructure pool.

Lustre services are tightly coupled to the data storage. An MGS has a corresponding MGT, each MDS has one or more corresponding MDTs, and each OSS has one or more corresponding OSTs. Failover resources are defined in terms of the storage targets (MGT, MDTs, OSTs). When a Lustre server node develops a fault, it is the mount point of the storage that migrates, or “fails over”, between the nodes. This is because the running state of a Lustre service is governed by the mount and umount commands. There are no user-space daemons like httpd or nfsd to start and stop; we simply mount the Lustre storage target to start the service, and unmount the target to stop the service.

This also means that the storage targets must be reachable by more than one server. Typically, Lustre file systems are assembled from multiple pairs of servers, each pair being connected to a common pool of shared storage. Each host in a pair mounts a subset of the connected storage, presented to the OS in the form of one or more logical units (an aggregated set of devices arranged in a RAID pattern) or, in the case of JBOD storage enclosures, discrete devices.

For example, in a typical metadata server cluster pair, the storage will comprise one MGT volume and one MDT volume. One server will act as the primary host for the MGT and corresponding MGS, while the other server is the primary or preferred host for the MDT/MDS. If there is a server failure, the affected services are restarted on the surviving host.

For failover to work with Lustre, the storage targets must be configured with the NIDs of the hosts that are expected to be able to mount and provide storage services for that specific storage target. These are specified using either the --failnode or --servicenode command line options to mkfs.lustre.
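
For example, an OST might be formatted with one --servicenode entry for each server in its failover pair. The following is a sketch only: the file system name, index, NIDs, and pool/dataset names are hypothetical, and the ZFS pool is assumed to have been created in advance with the options shown below.

mkfs.lustre --ost --backfstype=zfs \
  --fsname=demo --index=0 \
  --mgsnode=10.10.0.1@tcp \
  --servicenode=10.10.0.11@tcp \
  --servicenode=10.10.0.12@tcp \
  ostpool0/ost0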

For high-availability clusters with ZFS storage shared between nodes as a failover resource, it is also required that each ZFS pool is created and imported with the cachefile property set equal to the special value of none, and the multihost property set to on:

zpool create [-f] -O canmount=off \
  -o cachefile=none \
  -o multihost=on \
  [-o ashift=<n>] \
  <zpool name> <zpool specification>

zpool import [-f] -o cachefile=none <zpool name>
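
As a concrete illustration of the zpool create command above (with a hypothetical pool name and device paths), a mirrored pool for a metadata target might be created as follows; ashift=12 is appropriate for devices with 4 KiB sectors:

zpool create -O canmount=off \
  -o cachefile=none \
  -o multihost=on \
  -o ashift=12 \
  mdt0pool mirror /dev/mapper/mpatha /dev/mapper/mpathb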

The remainder of this article describes the underlying mechanics of managing migration / failover of Lustre services. Refer to Creating a Framework for High Availability with Pacemaker for information regarding the integration of Lustre services into the Pacemaker and Corosync high availability cluster framework.

Controlled Migration or Failover of a Lustre Service

The migration, or “failover”, of a Lustre service between hosts in a cooperative high-availability cluster is straightforward. A Lustre service runs wherever the corresponding storage is mounted (e.g., the MGS runs where the MGT has been mounted, similarly for MDS/MDT or OSS/OST). A service is started when the storage targets are mounted, and stopped when the storage targets are unmounted. So, a Lustre service is migrated by unmounting the storage from one host and remounting it on a different host in the cluster.

The concept applies to both LDISKFS and ZFS object storage devices (OSDs). For ZFS storage, there is some additional complexity because ZFS is a volume management solution as well as a file system: the ZFS pool must be imported before the Lustre OSD can be mounted.

Some care is also required in order to prevent storage from being mounted on multiple hosts simultaneously. For LDISKFS volumes, multiple-mount protection (MMP) is enabled automatically, while for ZFS, the multihost property must be enabled and a persistent hostid set on each server.
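
The hostid can be checked, and made persistent if required, along the following lines. This is a minimal sketch; the exact tooling varies by distribution and ZFS version (zgenhostid ships with ZFS on Linux, and genhostid is available on many distributions):

# Display the hostid currently in use
hostid

# Write a persistent hostid to /etc/hostid if one does not already exist
zgenhostid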

The procedure for the controlled migration of a Lustre storage target, where both hosts are online and active, is as follows:

  1. Unmount the storage from the current (primary) host:
    umount <path>
    
  2. For ZFS OSDs, use the zpool export command to export the ZFS pool from the configuration on the current host:
    zpool export <zpool name>
    
  3. Log into the failover host.
  4. For ZFS OSDs, run the zpool import command, being sure to set cachefile=none:
    zpool import [-f] -o cachefile=none <zpool name>
    

    Do not use the -f flag unless absolutely necessary. The ZFS pool should import cleanly if it was exported from the primary host. If the import fails, and the output from zpool import indicates that the pool may be in use on another host, check the host that is referred to. Ensure that the zpool has been properly exported from the primary host by running zpool list on the primary and verifying that the pool is no longer present in the output.

    If the pool is positively confirmed as being exported, or at least not active on any other host, then run the zpool import command again, including the -f flag.

    Note: When the multihost property is on, import of a pool will be prevented if the pool is imported on another node.

  5. For ZFS OSDs, check that the pool is imported cleanly and that there are no active issues on the pool:
    zpool status
    zfs list
    
  6. Mount the storage on the failover host:
    mkdir -p <mount point>
    mount -t lustre {<device path> | <zpool name>/<dataset name>} <mount point>
    
  7. Verify that the services are running:
    df -ht lustre
    lctl dl
    

The failover host is now managing the services associated with the migrated storage. To migrate the service back to the original host, run the same procedure, with the roles of the hosts reversed.
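
As an end-to-end illustration of the controlled migration above, using hypothetical host names (mds01, mds02), a hypothetical pool and dataset (mdt0pool/mdt0), and a hypothetical mount point:

# On the current (primary) host, mds01: stop the service and release the pool
umount /lustre/mdt0
zpool export mdt0pool

# On the failover host, mds02: import the pool and restart the service
zpool import -o cachefile=none mdt0pool
zpool status mdt0pool
zfs list
mkdir -p /lustre/mdt0
mount -t lustre mdt0pool/mdt0 /lustre/mdt0

# Verify that the service is running
df -ht lustre
lctl dl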

Forced Migration of a Lustre Service Between Hosts

If the primary host has failed, and it is not possible to log into the host or otherwise unmount the storage (and, in the case of ZFS, export the zpool), then a forced migration must be undertaken to move the service to the failover node.

The process for a forced migration is very similar to a controlled migration, but there is no interaction with the original primary host, because the primary host is offline:

  1. Remove power from the failed node, or otherwise ensure that it cannot access the shared storage containing the Lustre file systems.
  2. Log into the failover host. For LDISKFS OSDs, make sure that the storage is not mounted on any other node.

    For ZFS OSDs, be absolutely certain that no other host has imported the pool before continuing. The zdb command can be useful in verifying the on-disk configuration:

    zdb -e <zpool name>
    

    For a two-node HA failover group, if the hostname and hostid fields match the identity of the host that has the fault, and the faulted host has been isolated from the storage (e.g. powered off), then it is safe to proceed with the import. If there were more than two hosts connected to the shared storage, make sure that no other host has imported the zpool before continuing.

    Note: For versions of ZFS that support the multihost property (introduced in ZFS on Linux version 0.7.0), the pool is protected against being accidentally imported while it is active on another machine.

  3. For ZFS OSDs, run the zpool import command, being sure to set cachefile=none:
    zpool import [-f] -o cachefile=none <zpool name>
    
    • In this scenario, with the primary host offline, the -f flag will almost certainly have to be used to successfully import the pool, but always treat -f as an option of last resort.
    • Check that the ZFS pool has been imported cleanly and that there are no active issues on the pool:
      zpool status
      zfs list
      zfs get all <zpool name>[/<dataset name>]
      zdb -C <zpool name>[/<dataset name>]
      

      The zdb output will show the updated hostname and hostid fields with values corresponding to the failover host.

  4. Mount the storage on the failover host:
    mkdir -p <mount point>
    mount -t lustre <zpool name>/<dataset name> <mount point>
    
  5. Verify that the services are running:
    df -ht lustre
    lctl dl
    

To restore the service back to its original host, run through the controlled migration process for active hosts.
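
A condensed sketch of the forced migration, using the same hypothetical names as the controlled example above, and assuming the failed host has already been powered off or otherwise fenced:

# On the failover host, mds02: verify the on-disk configuration, then import with -f
zdb -e mdt0pool
zpool import -f -o cachefile=none mdt0pool
zpool status mdt0pool
zfs list
mkdir -p /lustre/mdt0
mount -t lustre mdt0pool/mdt0 /lustre/mdt0

# Verify that the service is running
df -ht lustre
lctl dl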

Note: Remember that correctly setting the persistent hostid and the ZFS pool properties cachefile=none and multihost=on is critical to protecting the ZFS storage pools from data corruption in high-availability clusters.

Caution: Do not rely on the system defaults when working with shared ZFS storage in high-availability clusters.