ZFS OSD Storage Basics
Introduction
When working with ZFS OSDs, one can bundle the entire process of creating a zpool and formatting a storage target into a single command using mkfs.lustre, or split the work into two steps, where creation of the zpool is separated from formatting the OSD. Both methods are discussed in this section; however, we recommend creating the ZFS storage pools separately from formatting the Lustre OSDs.
For high-availability configurations where the ZFS volumes are kept on shared storage, the zpools must be created independently of the mkfs.lustre command in order to correctly prepare them for use in a high-availability failover environment.
ZFS Storage Pool Basics
ZFS separates storage volume definition from the file system specification, providing two separate tools to manage each. The zpool command is used to define the volumes and manage the physical storage assets, while the zfs command provides management of the ZFS file system datasets themselves.
A ZFS pool comprises one or more entities called Virtual Devices, or vdevs. There are two basic categories of vdev: physical and logical. A physical vdev can be a complete physical storage device such as a disk drive, a partition on a disk drive, or a file. For Lustre file systems, it is strongly recommended that whole disks are used, with no pre-defined partition table.
Logical vdevs are assemblies of physical vdevs, arranged into groups, usually for the purpose of providing additional storage redundancy. Logical vdevs include mirrors and RAIDZ data protection layouts. There can be many vdevs assigned per ZFS pool, and data is written in stripes across the vdevs in the pool.
The following examples illustrate how ZFS pools are created:
- Simple stripe across two physical vdevs:

  zpool create tank sda sdb

  The result of this command is the creation of a pool named tank containing two physical vdevs, sda and sdb, in a stripe (equivalent to RAID 0).

- Two-disk mirror:

  zpool create tank mirror sda sdb
- Striped mirrors (equivalent to RAID 1+0):

  zpool create tank \
    mirror sda sdb \
    mirror sdc sdd \
    mirror sde sdf

  The above example has a single pool consisting of three mirrored vdevs. Data is striped across the three mirrors.
- Pool with a single RAIDZ2 vdev (equivalent to RAID 6):

  zpool create tank raidz2 sda sdb sdc sdd sde sdf
- Pool with a 2-parity/1-spare dRAID vdev (equivalent to RAID 6 + hot spare):

  zpool create tank draid2:3d:1s sda sdb sdc sdd sde sdf
Note: In order to simplify the presentation of commands, the examples use the shortened block device name (i.e., sdN), which is not guaranteed to refer to the same physical device on each server boot or when the server configuration is altered. The short block device path is therefore not recommended for use in production. When defining the storage pools, use a persistent device path, such as /dev/disk/by-id/<name>, or some other reliable convention that ensures that the device file will always refer to the same physical device.
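As a minimal sketch of this recommendation, a mirrored pool can be created using persistent /dev/disk/by-id paths instead of the short sdN names. The by-id device names below are placeholders and will differ on every system:

```shell
# List the persistent device names available on this host:
ls -l /dev/disk/by-id/

# Create a mirrored pool using persistent paths (names are illustrative):
zpool create tank mirror \
  /dev/disk/by-id/scsi-35000c500a1b2c3d4 \
  /dev/disk/by-id/scsi-35000c500a1b2c3d5
```

The same persistent names then appear in zpool status output, which makes it easier to identify a failed drive in a populated enclosure.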
Formatting a ZFS OSD using only the mkfs.lustre command

The basic syntax for creating a ZFS-based OSD using only the mkfs.lustre command is:
mkfs.lustre --mgs | --mdt | --ost \
  [--reformat] \
  [ --fsname <name> ] \
  [ --index <n> ] \
  [ --mgsnode <MGS NID> [--mgsnode <MGS NID> …]] \
  [ --servicenode <NID> [--servicenode <NID> …]] \
  [ --failnode <NID> [--failnode <NID> …]] \
  --backfstype=zfs \
  [ --mkfsoptions <options> ] \
  <pool name>/<dataset> \
  <zpool specification>
The servicenode and failnode command-line options are used to identify the NIDs of the hosts that are able to run the Lustre service in a high-availability configuration. The servicenode and failnode options are mutually exclusive: choose one or the other when defining the HA failover hosts that are expected to provide the Lustre service.
The servicenode syntax defines all of the NIDs of all of the hosts that will be able to run the Lustre service. All of the hosts must be referenced, including the host that is expected to be the preferred primary for running the service (this is usually the host where the format command is running).
This example uses the --servicenode syntax to create an MGT that can be run on two servers as an HA failover resource:

[root@rh7z-mds1 ~]# mkfs.lustre --mgs \
  --servicenode 192.168.227.11@tcp1 \
  --servicenode 192.168.227.12@tcp1 \
  --backfstype=zfs \
  mgspool/mgt mirror sda sdc
The command line formats a new MGT that will be used by the MGS for storage. The command further defines a mirrored zpool called mgspool consisting of two devices, and creates a ZFS dataset called mgt. Two server NIDs are supplied as service nodes for the MGS, 192.168.227.11@tcp1 and 192.168.227.12@tcp1.
The failnode syntax is similar, but is used to define only a failover target for the storage service. The failnode syntax is an older method for creating services: implicit in the definition of the storage service is the notion of a primary server and one or more secondary, or failover, servers. Only the failover servers are included in the command-line definition. The primary is written to the storage service configuration only the first time that the formatted device is mounted: whichever host mounts the storage first will have its NID written in as the effective primary server.
Example format command with the failnode syntax:

[root@rh7z-mds1 ~]# mkfs.lustre --mgs \
  --failnode 192.168.227.12@tcp1 \
  --backfstype=zfs \
  mgspool/mgt mirror sda sdc
Here, the failover host is identified as 192.168.227.12@tcp1, one metadata server in an HA pair (which, incidentally, has the hostname rh7z-mds2). The mkfs.lustre command was executed on rh7z-mds1 (NID: 192.168.227.11@tcp1), and the mount command must also be run from this host when the MGS service starts for the very first time. Otherwise, the primary NID will not be written to the storage configuration, and the failover mechanism will not work as expected.
Note: If, after formatting the storage, the MGT is mounted on 192.168.227.12@tcp1, then both the primary NID and the failover NID will be the same. The intended primary host, 192.168.227.11@tcp1, would be excluded from being able to run the MGS service.
Wherever possible, use the servicenode syntax to define the high availability configuration for Lustre services.
The mkfs.lustre command can pass additional flags to the underlying file system creation software using the --mkfsoptions flag. This mechanism was originally used to pass through options for EXT-based storage, but can also be used in a limited way for ZFS. The --mkfsoptions parameter allows a user to pass through options that are added to the zfs command-line invocation, but cannot be used to modify the zpool command-line invocation. There are times when the zpool command defaults are not sufficient to support a production Lustre file system, especially when the storage system is on shared drives and there is a failover configuration in place.

For this reason, it is recommended that ZFS pools always be created explicitly and separately from the mkfs.lustre command.
Formatting a ZFS OSD using zpool and mkfs.lustre

To create a ZFS-based OSD suitable for use as a high-availability failover storage device, first create a ZFS pool to contain the file system dataset, then use mkfs.lustre to actually create the file system inside the zpool:
zpool create [-f] -O canmount=off \
  [ -O compression=lz4 ] \
  [ -o ashift=<n> ] \
  -o cachefile=/etc/zfs/<zpool name>.spec | -o cachefile=none \
  <zpool name> <zpool specification>

mkfs.lustre --mgs | --mdt | --ost \
  [--reformat] \
  [ --fsname <name> ] \
  [ --index <n> ] \
  [ --mgsnode <MGS NID> [--mgsnode <MGS NID> …]] \
  [ --servicenode <NID> [--servicenode <NID> …]] \
  [ --failnode <NID> [--failnode <NID> …]] \
  --backfstype=zfs \
  [ --mkfsoptions <options> ] \
  <pool name>/<dataset name>
For example:
# Create the zpool
zpool create -O canmount=off \
  -o cachefile=none \
  mgspool mirror sda sdc

# Format the Lustre MGT
mkfs.lustre --mgs \
  --servicenode 192.168.227.11@tcp1 \
  --servicenode 192.168.227.12@tcp1 \
  --backfstype=zfs \
  mgspool/mgt
After formatting, use tunefs.lustre to review the newly created OSD. The command-line format is:
tunefs.lustre --dryrun <pool name>/<dataset name>
For example:
[root@rh7z-mds1 ~]# tunefs.lustre --dryrun mgspool/mgt
checking for existing Lustre data: found

   Read previous values:
Target:     MGS
Index:      unassigned
Lustre FS:  
Mount type: zfs
Flags:      0x1044
              (MGS update no_primnode )
Persistent mount opts: 
Parameters: failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1

   Permanent disk data:
Target:     MGS
Index:      unassigned
Lustre FS:  
Mount type: zfs
Flags:      0x1044
              (MGS update no_primnode )
Persistent mount opts: 
Parameters: failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1
Use the zfs get command to retrieve comprehensive metadata information about the file system dataset and to confirm that the Lustre properties have been set correctly:
zfs get all | awk '$2 ~ /lustre/'
For example:
[root@rh7z-mds1 ~]# zfs get all | awk '$2 ~/lustre/'
mgspool/mgt  lustre:version        1                                        local
mgspool/mgt  lustre:index          65535                                    local
mgspool/mgt  lustre:failover.node  192.168.227.11@tcp1:192.168.227.12@tcp1  local
mgspool/mgt  lustre:svname         MGS                                      local
mgspool/mgt  lustre:flags          4196                                     local
Only the zpool is created directly by the administrator. The mkfs.lustre command is still used to control creation of the file system dataset from the pool. Additional dataset properties can be applied by mkfs.lustre using the --mkfsoptions flag. The mkfs.lustre command will fail with an error if an attempt is made to format an existing dataset:
[root@rh7z-mds1 ~]# mkfs.lustre --mgs \
> --servicenode 192.168.227.11@tcp1 \
> --servicenode 192.168.227.12@tcp1 \
> --backfstype=zfs mgspool/mgt

   Permanent disk data:
Target:     MGS
Index:      unassigned
Lustre FS:  
Mount type: zfs
Flags:      0x1064
              (MGS first_time update no_primnode )
Persistent mount opts: 
Parameters: failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1

checking for existing Lustre data: not found
mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgspool/mgt
cannot create 'mgspool/mgt': dataset already exists

mkfs.lustre FATAL: Unable to create file system mgspool/mgt (256)
mkfs.lustre FATAL: mkfs failed 256
One can force the dataset to be formatted as a Lustre OSD by adding the --reformat flag to mkfs.lustre:
[root@rh7z-mds1 ~]# mkfs.lustre --reformat --mgs \
> --servicenode 192.168.227.11@tcp1 \
> --servicenode 192.168.227.12@tcp1 \
> --backfstype=zfs mgspool/mgt

   Permanent disk data:
Target:     MGS
Index:      unassigned
Lustre FS:  
Mount type: zfs
Flags:      0x1064
              (MGS first_time update no_primnode )
Persistent mount opts: 
Parameters: failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1

mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgspool/mgt
Writing mgspool/mgt properties
  lustre:version=1
  lustre:flags=4196
  lustre:index=65535
  lustre:svname=MGS
  lustre:failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1
Alternatively, destroy the dataset and have mkfs.lustre recreate it:

zfs destroy <pool name>/<dataset name>

mkfs.lustre --mgs \
  [ --servicenode <NID> [--servicenode <NID> …]] \
  --backfstype=zfs \
  [ --mkfsoptions <options> ] \
  <pool name>/<dataset>
If the dataset is reformatted, then previously applied properties will be lost. Remember to re-apply any ZFS-specific properties by making use of the --mkfsoptions flag.
Working with ZFS Imports
The assembly and incorporation of a ZFS storage pool into the operating system run-time environment is managed by the zpool import command; an imported pool is released from the run-time environment using the zpool export command. The basic syntax to import a pool is:
zpool import [-f] [ -o <properties>] <pool name>
This technique enables storage pools to migrate between hosts in a consistent and reliable manner. Any host that is connected to a pool of storage in a shared enclosure can import the ZFS pool, making it straightforward to facilitate high availability failover.
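The migration pattern just described can be sketched as follows, using the mgspool pool from the earlier examples. The mount point is illustrative, and the Lustre service on the pool must be stopped before the export:

```shell
# On the host that currently owns the pool (e.g. rh7z-mds1):
umount /lustre/mgt        # stop the Lustre service first (path illustrative)
zpool export mgspool

# On the failover host (e.g. rh7z-mds2), connected to the same enclosure:
zpool import -o cachefile=none mgspool
```

This is the same sequence that HA frameworks such as Pacemaker automate when moving a ZFS-backed Lustre resource between servers.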
A ZFS storage pool can only be imported to a single host at a time. If a pool has been imported onto a host, it must be exported before it can be safely imported to a different host. If a zpool in a shared storage enclosure is simultaneously imported to more than one host, the pool data will be corrupted. To reduce the risk of this happening, all servers must be configured with a unique hostid that is used to label each zpool with the current active host that has imported the pool. Multiple Import Protection (using the multihost pool property), implemented in ZFS 0.7.0, must also be configured when available. Please refer to Protecting File System Volumes from Concurrent Access for information about how to enable and verify that this protection is properly configured.
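Assuming ZFS 0.7.0 or newer, the protections described above can be sketched as follows; the pool name follows the earlier examples:

```shell
# Give each server a unique hostid (written to /etc/hostid):
genhostid
hostid                            # verify a non-zero hostid

# Enable Multiple Import Protection (MMP) on the shared pool:
zpool set multihost=on mgspool
zpool get multihost mgspool       # confirm the property is active
```

With multihost=on, an import attempt on a second host fails while the pool is actively in use elsewhere, rather than silently corrupting the pool.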
If the pool is not exported, it will appear to still be active on its original host, even if that host is offline (regardless of whether it was powered off cleanly or crashed). In this case, the import will fail, provided that the hostids for the servers have been correctly configured, and the import will need to be forced.
If an import fails, it could mean that the pool is already imported on a different host. Before trying to force the import of a pool onto a host, check that the storage has not already been imported to another system. If the zpool has been configured correctly, and all hosts have valid hostids, then the import command will indicate the host that last had the pool imported.
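A hedged sketch of the forced-import procedure: force the import only after confirming on every other connected host that the pool is not active there:

```shell
# List importable pools; the status line names the last host to access the pool:
zpool import

# Once it is confirmed that no other host has the pool imported:
zpool import -f mgspool
```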
The zpool import command can be used to list pools that are not currently imported on the host, without actually performing any import:
zpool import [-d <dev directory>] [-D]
When invoked using this syntax, zpool import lists the pools that are potentially available for import, and will ignore zpools that are already imported to the host. The -d (lower case) flag is used to specify a directory containing block devices, for example /dev/disk/by-id. This flag is not often required. The -D (upper case) flag is used to list zpools that have been destroyed.
In the following example, the import fails because the zpool cannot be found:
[root@rh7z-mds2 system]# zpool import demo-mdtpool
cannot import 'demo-mdtpool': no such pool available
This could be because the zpool does not exist, or because the pool has the wrong name, but it could also mean that the pool has been destroyed. Run zpool import without any other options to get a list of the zpools that the operating system can locate, that are not already imported to the host:
[root@rh7z-mds2 system]# zpool import
   pool: mgspool
     id: 2186474330384511828
  state: ONLINE
 status: The pool was last accessed by another system.
 action: The pool can be imported using its name or numeric identifier and
	the '-f' flag.
    see: http://zfsonlinux.org/msg/ZFS-8000-EY
 config:

	mgspool     ONLINE
	  mirror-0  ONLINE
	    sda     ONLINE
	    sdc     ONLINE
The output only lists a single zpool, called mgspool. There is no obvious sign of any other configured zpool available to the operating system. If the list of ZFS pools does not match expectations, perhaps one or more of the pools has been destroyed. Check using zpool import -D:
[root@rh7z-mds2 system]# zpool import -D
   pool: demo-mdtpool
     id: 6641674394771267657
  state: ONLINE (DESTROYED)
 action: The pool can be imported using its name or numeric identifier.
 config:

	demo-mdtpool                              ONLINE
	  mirror-0                                ONLINE
	    scsi-0QEMU_QEMU_HARDDISK_EEMDT0001    ONLINE
	    scsi-0QEMU_QEMU_HARDDISK_EEMDT0000    ONLINE
From this, it can be seen that at some point in the past, another zpool existed on the system, but that it was destroyed. It is possible to recover a destroyed pool as follows:
zpool import -D <pool name>
Provided that the storage from which the pool was originally assembled has not been modified or the data over-written, the pool will be re-assembled and can be used as normal.
The zdb command can be helpful in determining where a zpool has been imported. If an exported pool cannot be imported cleanly into a host, use zdb to check the MOS configuration to see if it is perhaps “registered” with another host:
zdb -e <zpool name> | awk '/^MOS/,/^$/{print}'
The hostid and hostname fields will indicate the last host known to have imported the pool or dataset. For example:
[root@rh7z-mds1 ~]# zdb -e mgspool | awk '/^MOS/,/^$/{print}'
MOS Configuration:
        version: 5000
        name: 'mgspool'
        state: 0
        txg: 34351
        pool_guid: 11089712772589408485
        errata: 0
        hostid: 1386610045
        hostname: 'rh7z-mds2'
        vdev_children: 1
        vdev_tree:
        …
<output truncated for brevity>
In this example, it would be prudent to check the host rh7z-mds2 to see if the pool has been imported there before taking any further action.
Lustre and ZFS File System Datasets
When working with ZFS-based storage, each Lustre storage target is held on a file system dataset inside a ZFS pool. The dataset will be created by Lustre when the storage is formatted with the mkfs.lustre command and --backfstype=zfs has been selected. While it is possible to create multiple file system datasets within a single storage pool and use those for Lustre, this is not recommended: each dataset will compete for the pool’s resources, affecting performance and making it more difficult to balance IO across the storage cluster. Also bear in mind that the unit of failover is the ZFS pool, not the dataset: one cannot migrate a dataset in a pool without migrating the entire pool. For example, if the MGT and MDT0 are created within the same ZFS pool, then the MGS and MDS services will always have to run on the same host, because the pool can only be imported to one host at a time. The MGS therefore loses its independence from the MDS for MDT0.
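A sketch of the recommended layout, with the MGT and MDT0 in separate pools so that the MGS and MDS can fail over independently. The file system name demo, the pool name mdtpool, and the device names are illustrative:

```shell
# One pool per service, so each pool can be imported to a different host:
zpool create -O canmount=off -o cachefile=none mgspool mirror sda sdc
zpool create -O canmount=off -o cachefile=none mdtpool mirror sdb sdd

# MGT in its own pool:
mkfs.lustre --mgs \
  --servicenode 192.168.227.11@tcp1 \
  --servicenode 192.168.227.12@tcp1 \
  --backfstype=zfs mgspool/mgt

# MDT0 in a separate pool:
mkfs.lustre --mdt --fsname demo --index 0 \
  --mgsnode 192.168.227.11@tcp1 --mgsnode 192.168.227.12@tcp1 \
  --servicenode 192.168.227.11@tcp1 \
  --servicenode 192.168.227.12@tcp1 \
  --backfstype=zfs mdtpool/mdt0
```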
Don’t use the zfs command directly to create datasets that will be used as Lustre targets. ZFS datasets created independently of the mkfs.lustre command will have to be unmounted or destroyed and then reformatted. The properties of Lustre file system datasets can be altered after formatting, or can be supplied as options to the mkfs.lustre command by using the --mkfsoptions flag. Refer to the mkfs.lustre manual page for details.
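As a sketch of the --mkfsoptions mechanism: its contents are appended to the underlying zfs create invocation (visible as mkfs_cmd in the command output shown later in this section), so ZFS dataset properties can be supplied in -o property=value form. The pool name, file system name, and property values here are illustrative, not recommendations:

```shell
# Assumes the pool ostpool already exists; properties are examples only.
mkfs.lustre --ost --fsname demo --index 0 \
  --mgsnode 192.168.227.11@tcp1 \
  --backfstype=zfs \
  --mkfsoptions="-o compression=lz4 -o recordsize=1M" \
  ostpool/ost0
```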
If a ZFS dataset already exists and is not unmounted, the mkfs.lustre command will not report an error when an attempt is made to format, but it will not be able to format the volume. The only immediate indication of a failure is that the mkfs.lustre output will be truncated:
[root@rh7z-mds1 ~]# zpool create -O canmount=off -o cachefile=none mgspool mirror sda sdc

# Create a file system dataset using the default properties.
# The dataset will be automatically mounted once created.
[root@rh7z-mds1 ~]# zfs create mgspool/mgt

# Try to format the dataset for Lustre.
# The command will fail but will not report an error.
[root@rh7z-mds1 ~]# mkfs.lustre --mgs \
> --servicenode 192.168.227.11@tcp1 \
> --servicenode 192.168.227.12@tcp1 \
> --backfstype=zfs \
> mgspool/mgt

   Permanent disk data:
Target:     MGS
Index:      unassigned
Lustre FS:  
Mount type: zfs
Flags:      0x1064
              (MGS first_time update no_primnode )
Persistent mount opts: 
Parameters: failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1

[root@rh7z-mds1 ~]#
Administrators must use the mount.lustre command whenever starting Lustre services, and the umount command when stopping services.
For the purposes of comparison, let’s examine what the mkfs.lustre command itself is doing when it creates/formats a ZFS OSD. The following example command output shows an MGT being created from a ZFS mirrored zpool consisting of two disks:
[root@rh7z-mds1 ~]# mkfs.lustre --mgs \
> --servicenode 192.168.227.11@tcp1 \
> --servicenode 192.168.227.12@tcp1 \
> --backfstype=zfs \
> mgspool/mgt

   Permanent disk data:
Target:     MGS
Index:      unassigned
Lustre FS:  
Mount type: zfs
Flags:      0x1064
              (MGS first_time update no_primnode )
Persistent mount opts: 
Parameters: failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1

checking for existing Lustre data: not found
mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgspool/mgt
Writing mgspool/mgt properties
  lustre:version=1
  lustre:flags=4196
  lustre:index=65535
  lustre:svname=MGS
  lustre:failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1
[root@rh7z-mds1 ~]#
Note that the mkfs.lustre command hands creation of the mgspool/mgt dataset to the relevant ZFS command, acting as a convenient wrapper that encapsulates creation of the ZFS dataset and the Lustre formatted storage devices. The complete ZFS command-line invocations are displayed in the mkfs.lustre output, providing transparency in the way that the underlying storage is configured.
Unfortunately, the zpool command-line options used by mkfs.lustre cannot be modified, other than to define the zpool specification. The mkfs.lustre command therefore should not be used to create the zpool when working with failover shared storage: the restriction prevents a user from defining the correct cachefile property for the zpool, which is needed to prevent hosts from automatically importing a zpool on system boot. The mkfs.lustre command also does not allow a user to provide tuning options at pool creation time, notably the ashift property, which helps with write alignment for storage devices that do not correctly report the underlying sector size.
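For example, a pool for drives with a 4 KiB physical sector size might be created with ashift=12 (2^12 = 4096 bytes). The value shown is an assumption and should be matched to the actual device sector size:

```shell
# Check the physical sector size reported by the drive first:
cat /sys/block/sda/queue/physical_block_size

# ashift=12 assumes 4 KiB physical sectors:
zpool create -o ashift=12 -O canmount=off -o cachefile=none \
  ostpool mirror sda sdb
zpool get ashift ostpool
```

Setting ashift too low on 4 KiB-sector drives causes read-modify-write overhead; it cannot be changed after the vdev is created, which is why it matters that mkfs.lustre offers no way to set it.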
Refer to the ZFS OSD Tuning section for information on properties that influence ZFS and Lustre performance.
Refer to vdev_id.conf(5) and vdev_id(8) for more information on using /dev/disk/by-vdev/ paths where names reflect the slots devices are in.
Refer to zpoolconcepts(7) and https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAID%20Howto.html for the limitations and advantages of different vdev types.