Protecting File System Volumes from Concurrent Access

Storage volumes formatted for LDISKFS (which is based on EXT4) and ZFS are not SAN-aware parallel file systems and do not support multiple concurrent accesses from different computers. If two or more computers attempt to write directly to the same storage, the write operations will not be coordinated, which could lead to data corruption: each host that has the storage mounted will assume that it has exclusive access to the data and will not take into account any I/O transactions external to that host.

For this reason, Lustre OSDs must be mounted by no more than one host at any single point in time. A consequence for high-availability frameworks is that each individual storage target can only participate in an HA cluster as a failover resource (also referred to as an active-passive resource).

Administrators must take this into consideration when planning Lustre file system deployments and when conducting any maintenance.

To protect against data corruption, the LDISKFS OSD format has built-in protection against multiple concurrent mounts of a storage volume, which is referred to as “multi-mount protection” (MMP). The storage will refuse to mount if the host detects that the storage might already be mounted elsewhere.
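
For example, whether MMP is enabled on an existing LDISKFS target can be checked, and enabled if required, using the standard EXT4 administration tools (the file system must be unmounted when enabling the feature). The device path below is only a placeholder:

tune2fs -l /dev/mapper/mdt0 | grep -i mmp
tune2fs -O mmp /dev/mapper/mdt0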

OpenZFS also provides protection against concurrent pool access, a feature called multiple import protection, introduced in version 0.7.0 of ZFS on Linux. Multiple import protection will prevent a pool from being imported if that pool is already imported on another node, even if the force (-f) option is used.

OpenZFS multiple import protection is enabled by setting the zpool property multihost=on.
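
For example, the property can be enabled on an existing pool and then verified as follows (the pool name is a placeholder):

zpool set multihost=on <zpool name>
zpool get multihost <zpool name>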

In addition to enabling the multihost feature, each server running ZFS must have a unique, persistent (i.e., unchanging) hostid that can be used by the SPL (Solaris Porting Layer). The hostid of the system where the zpool was last successfully imported is written into the pool configuration, so the zpool carries a notion of system ownership in its configuration. If an attempt is made to import a zpool whose recorded hostid does not match the hostid of the system where the import is being executed, the attempt fails. This check complements the multihost protection: both the hostid and the multihost property are needed to prevent a zpool from being simultaneously imported on multiple machines. Exporting the zpool will clear the hostid.
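
The hostid and hostname recorded for a pool can be inspected by dumping the device labels with zdb, for example (the device path is a placeholder, and the exact output format varies between OpenZFS releases):

zdb -l /dev/sdb1 | grep -E 'hostid|hostname'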

SPL can be provided with a hostid via the kernel command line on system boot, or by using the genhostid command to create a random hostid which is written to the file /etc/hostid.
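
For reference, when the kernel command line approach is used, the value is passed as the spl_hostid module parameter, for example by appending an entry such as the following (the value shown is illustrative) to the kernel boot parameters:

spl.spl_hostid=0xe343ce58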

The genhostid command is the easiest option to implement and is recommended. Run the command as follows (the command does not generate any output to the shell):

[root@rh7z-mds1 ~]# genhostid

Verify that data has been written to the file:

[root@rh7z-mds1 ~]# od -An -tx /etc/hostid
 e343ce58

The genhostid command must be run on every Lustre server that has or will have ZFS volumes configured. Make sure that each server gets a unique hostid. Reboot each system after the hostid is configured in order for the SPL kernel module to pick up the change.

On operating system distributions where the genhostid command is not available, use the following instead:

# Convert the output of hostid into the 4-byte binary format expected in /etc/hostid
# (assumes a little-endian host)
h=$(hostid); a=${h:6:2}; b=${h:4:2}; c=${h:2:2}; d=${h:0:2}
sudo sh -c "echo -ne \"\x$a\x$b\x$c\x$d\" > /etc/hostid"

For older releases of OpenZFS, where multiple import protection is not available, limited protection against concurrent mounts can still be implemented by generating the persistent hostid, although this protection can be circumvented: without multiple import protection, the import can be forced, which overrides the hostid check. When forcing the import of a ZFS pool on older ZFS software releases, be very careful to ensure that the volume is not currently imported anywhere else.

By default, the SPL sets the value of its internal hostid to 0 (zero), based on the result of the gethostid() system call, and this is the value that gets written into the zpool. This means that, without any additional configuration of the operating system, the SPL hostid will be 0 across all hosts. When a zpool is created and imported, it inherits this hostid value of zero and writes it into the zpool configuration. Any system with a hostid of zero will then be able to import the zpool, because the zpool import command always succeeds when the hostid of the system matches the hostid recorded in the zpool.

Relying on the default configuration therefore defeats the check against concurrent access by multiple hosts, potentially leading to corruption of the zpool and data loss.

It is easy to be fooled into believing that the hostid is configured, because the hostid command in the GNU coreutils package will always return a non-zero value, even when the SPL hostid has not been set. To determine the SPL's active hostid value, it must be read from /sys/module/spl/parameters/spl_hostid in the kernel sysfs interface. In the following example, the SPL hostid is 0 (zero):

[root@rh7z-mds1 ~]# cat /sys/module/spl/parameters/spl_hostid
0

The other essential step in protecting a ZFS pool from concurrent mounts across hosts is to prevent the pool from being automatically imported on system boot. This is controlled by the content of the system default ZFS pool configuration cache, or cache file, usually written to /etc/zfs/zpool.cache.

The configuration cache keeps a record of the configuration of each zpool that is either created or imported to a host. If a zpool is exported, it is removed from the configuration cache.

Note: Any exported zpools that are detected by the host on system boot will be automatically [re-]imported and added back into the default configuration cache.
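
Which cache file, if any, a pool is currently associated with can be checked with the zpool get command, for example (the pool name is a placeholder):

zpool get cachefile <zpool name>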

The cache file is a generally useful feature that can speed up the “assembly” of zpools on system boot, but special care must be taken when managing pools that are kept on shared storage and participate in high-availability server configurations.

Caution: Any zpool that has its configuration recorded in the default cache file will be imported by the host on system boot automatically. In a high-availability framework where ZFS is a shared managed resource, this may not be desirable, since the resource manager is meant to determine where the resources run, not the default init process.

Multiple import protection, when enabled, will prevent a pool from being imported by multiple hosts concurrently, but does not control which host imports the pool. Constraints defined by the HA framework could be compromised if pools are imported on system boot.

To prevent a zpool from being imported automatically on boot, use the cachefile option to create a separate, unique cache file for the pool being created, or set the property cachefile=none. This is only effective when the hostid has also been set correctly for SPL, as described earlier.

Note: The cachefile option has to be specified every time the zpool import command is run, not just when creating the pool or running the import command for the first time. It must also be specified when running the import command on a failover host in an HA cluster. Also be aware that the value none is a reserved keyword and cannot be used as a file name. Setting the cachefile parameter equal to the empty string ('' or "") is the same as telling ZFS to use the default cache file.

For high-availability clusters with ZFS storage shared between nodes as a failover resource, it is recommended that each zpool is created and imported with the cachefile set equal to the special value of none:

zpool create [-f] -O canmount=off -o cachefile=none \
  [-o ashift=<n>] \
  <zpool name> <zpool specification>

zpool import [-f] -o cachefile=none <zpool name>
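
As an illustration only, a mirrored metadata pool might be created on one node and later imported on its failover partner as follows. The pool and device names are hypothetical, and multihost=on is included to combine this with the multiple import protection described earlier:

zpool create -f -O canmount=off -o cachefile=none -o multihost=on \
  mdt0pool mirror /dev/sdb /dev/sdc

zpool import -o cachefile=none mdt0pool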

For further safety, the ZFS storage services can also be prevented from starting automatically on system boot, which means that the host will not attempt any automated import of ZFS storage.

The simplest way to disable the services is to disable the ZFS target milestone:

systemctl disable zfs.target

Note: If there are any non-Lustre storage devices formatted using ZFS, they will also be affected by this change and will not be available until explicitly imported after system start-up.
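
On systemd-based distributions, the state of the target and the ZFS services grouped under it can be reviewed as follows:

systemctl is-enabled zfs.target
systemctl list-dependencies zfs.target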

Using ZFS Properties to Further Protect Lustre OSDs

To protect the integrity of ZFS volumes used by Lustre, the zpool command should be invoked with an option to set the property canmount=off when working with Lustre storage volumes. This property is also automatically applied to any ZFS datasets created by the mkfs.lustre command.

The property canmount=off prevents a dataset within a pool from being mounted by the standard ZFS tools, e.g., by executing the zfs mount -a command, thus preventing accidental and incorrect mounts of ZFS storage that is being used for Lustre. Setting the property in the zpool command ensures that all of the datasets in the zpool inherit it. It also ensures that the file system datasets that have been formatted for use by Lustre will not get mounted on system boot by the ZFS services in systemd or sysvinit (on hosts running RHEL 7, for example, the systemd zfs-mount service will run zfs mount -a during system startup).

However, note that the canmount property will not prevent a zpool from being imported by a host on system boot; it will only stop the datasets in an imported zpool from being mounted. Users must be very careful when managing ZFS volumes to ensure that the zpools are only imported onto a single host at a time.
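
The inherited canmount setting can be confirmed across all datasets in a pool with a recursive property query, for example (the pool name is a placeholder):

zfs get -r canmount <zpool name>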

The zfs command line executed by mkfs.lustre also sets the xattr=sa property. This is used to improve performance of the ZFS storage, especially when using POSIX ACLs (Access Control Lists). SA stands for System Attributes, and provides an alternative implementation to the default directory-based extended attributes. Storing extended attributes using system attributes significantly decreases disk I/O and is recommended for systems that make use of SELinux or POSIX ACLs. Refer to the zfs manual page for a more detailed explanation.

Many of the ZFS properties can be altered after a zpool or dataset has been created, and formatting a Lustre target using the ZFS OSD will always set the ZFS properties canmount=off and xattr=sa.
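
As a final check after formatting, both properties can be confirmed on a Lustre target dataset, for example (the pool and dataset names are placeholders):

zfs get canmount,xattr <zpool name>/<dataset name>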