ZFS OSD Storage Basics

Introduction
When working with ZFS OSDs, one can either bundle the entire process of creating a zpool and formatting a storage target into a single command using mkfs.lustre, or split the work into two steps, where creation of the zpool is separated from formatting the OSD.

Both methods are discussed in this section; however, we recommend creating the ZFS storage pools separately from formatting the Lustre OSD.

For high-availability configurations where the ZFS volumes are kept on shared storage, the zpools must be created independently of the mkfs.lustre command so that they can be correctly prepared for use in a high-availability, failover environment.

ZFS Storage Pool Basics
ZFS separates storage volume definition from the file system specification, providing two separate tools to manage each. The zpool command is used to define the volumes and manages the physical storage assets, while the zfs command provides management of the ZFS file system datasets themselves.

A ZFS pool is composed of one or more entities called virtual devices, or vdevs. There are two basic categories of vdev: physical and logical. A physical vdev can be a complete physical storage device such as a disk drive, a partition on a disk drive, or a file. For Lustre file systems, it is strongly recommended that whole disks are used, with no pre-defined partition table.

Logical vdevs are assemblies of physical vdevs, arranged into groups, usually for the purpose of providing additional storage redundancy. Logical vdevs include mirrors and RAIDZ data protection layouts. There can be many vdevs assigned per ZFS pool, and data is written in stripes across the vdevs in the pool.

The following examples illustrate how ZFS pools are created:

Simple stripe across two physical vdevs:

 zpool create tank sda sdb

The result of this command is the creation of a pool named tank containing two physical vdevs, sda and sdb, in a stripe (equivalent to RAID 0).

Two-disk mirror:

 zpool create tank mirror sda sdb 

 Striped mirrors (equivalent to RAID 1+0):

 zpool create tank \
   mirror sda sdb \
   mirror sdc sdd \
   mirror sde sdf

The above example has a single pool consisting of three mirrored vdevs. Data is striped across the three mirrors. 

Pool with a single RAIDZ2 vdev (equivalent to RAID 6):

 zpool create tank raidz2 sda sdb sdc sdd sde sdf



Pool with two RAIDZ2 vdevs (equivalent to RAID 6+0):

 zpool create tank \
   raidz2 sda sdb sdc sdd sde sdf \
   raidz2 sdg sdh sdi sdj sdk sdl

Note: In order to simplify the presentation of commands, the examples use the shortened block device name (e.g., sda), which is not guaranteed to refer to the same physical device on each server boot or when the server configuration is altered. The short block device path is therefore not recommended for use in production. When defining the storage pools, use a persistent device path, such as /dev/disk/by-id, or some other reliable convention that ensures that the device file will always refer to the same physical device.
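For example, a mirrored pool created with persistent by-id paths might look like the following sketch (the device identifiers shown are purely illustrative):

 zpool create tank mirror \
   /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_EEOST0000 \
   /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_EEOST0001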

Formatting a ZFS OSD using only the mkfs.lustre command
The basic syntax for creating a ZFS-based OSD using only the mkfs.lustre command is:

 mkfs.lustre --mgs | --mdt | --ost \
   [--reformat] \
   [ --fsname <name> ] \
   [ --index <n> ] \
   [ --mgsnode <MGS NID> [--mgsnode <MGS NID> …]] \
   [ --servicenode <NID> [--servicenode <NID> …]] \
   [ --failnode <NID> [--failnode <NID> …]] \
   --backfstype=zfs \
   [ --mkfsoptions <options> ] \
   <pool name>/<dataset name> \
   [[<vdev type>] <device> [<device> …] …]

The --servicenode and --failnode command-line options are used to identify the NIDs of the hosts that are able to run the Lustre service in a high-availability configuration. The options --servicenode and --failnode are mutually incompatible: choose one or the other when defining the HA failover hosts that are expected to provide the Lustre service.

The --servicenode syntax defines all of the NIDs of all of the hosts that will be able to run the Lustre service. All of the hosts must be referenced, including the host that is expected to be the preferred primary for running the service (this is usually the host where the format command is running).

This example uses the --servicenode syntax to create an MGT that can be run on two servers as an HA failover resource:

 [root@rh7z-mds1 ~]# mkfs.lustre --mgs \
     --servicenode 192.168.227.11@tcp1 \
     --servicenode 192.168.227.12@tcp1 \
     --backfstype=zfs \
     mgspool/mgt mirror sda sdc

The command line formats a new MGT that will be used by the MGS for storage. The command further defines a mirrored zpool called mgspool consisting of two devices, and creates a ZFS dataset called mgt. Two server NIDs are supplied as service nodes for the MGS, 192.168.227.11@tcp1 and 192.168.227.12@tcp1.

The --failnode syntax is similar, but is used to define only a failover target for the storage service. The --failnode syntax is an older method for creating services, and implicit in the definition of the storage service is the notion of a primary server and one or more secondary, or failover, servers. Only the failover servers are included in the command-line definition. The primary is only written to the storage service configuration the first time that the formatted device is mounted. Whichever host mounts the storage first will have its NID written in as the effective primary server.

Example format command with the --failnode syntax:

 [root@rh7z-mds1 ~]# mkfs.lustre --mgs \
     --failnode 192.168.227.12@tcp1 \
     --backfstype=zfs \
     mgspool/mgt mirror sda sdc

Here, the failover host is identified as 192.168.227.12@tcp1, one metadata server in an HA pair (and which, incidentally, has the hostname rh7z-mds2). The mkfs.lustre command was executed on rh7z-mds1 (NID: 192.168.227.11@tcp1), and the mount command must also be run from this host when the MGS service starts for the very first time. Otherwise, the primary NID will not be written to the storage configuration, and the failover mechanism will not work as expected.

Note: If, after formatting the storage, the MGT is mounted on rh7z-mds2, then both the primary NID and the failover NID will be the same. The intended primary host, rh7z-mds1, would be excluded from being able to run the MGS service.

Wherever possible, use the --servicenode syntax to define the high availability configuration for Lustre services.

The mkfs.lustre command can pass additional command flags to the underlying file system creation software using the --mkfsoptions flag. The flag was originally used to pass through options for EXT-based storage, but can also be used in a limited way for ZFS. The --mkfsoptions parameter allows a user to pass through options that are appended to the zfs command-line invocation, but cannot be used to modify the zpool command-line invocation. There are times when the zpool command defaults are not sufficient to support a production Lustre file system, especially when the storage system is on shared drives and there is a failover configuration in place.

For this reason, it is recommended that ZFS pools always be created explicitly and separately from the mkfs.lustre command.

Formatting a ZFS OSD using zpool and mkfs.lustre
To create a ZFS-based OSD suitable for use as a high-availability failover storage device, first create a ZFS pool to contain the file system dataset, then use mkfs.lustre to actually create the file system inside the zpool:

 zpool create [-f] -O canmount=off \
   [ -O compression=lz4 ] \
   [ -o ashift=<n> ] \
   -o cachefile=/etc/zfs/<pool name>.spec | -o cachefile=none \
   <pool name> \
   [<vdev type>] <device> [<device> …]

 mkfs.lustre --mgs | --mdt | --ost \
   [--reformat] \
   [ --fsname <name> ] \
   [ --index <n> ] \
   [ --mgsnode <MGS NID> [--mgsnode <MGS NID> …]] \
   [ --servicenode <NID> [--servicenode <NID> …]] \
   [ --failnode <NID> [--failnode <NID> …]] \
   --backfstype=zfs \
   [ --mkfsoptions <options> ] \
   <pool name>/<dataset name>

Use tunefs.lustre to review the newly created OSD. The command line format is:

 tunefs.lustre --dryrun <pool name>/<dataset name>

Use the zfs get command to retrieve comprehensive metadata information about the file system dataset and to confirm that the Lustre properties have been set correctly:

 zfs get all | awk '$2 ~ /lustre/'

For example:

 [root@rh7z-mds1 ~]# zfs get all | awk '$2 ~/lustre/'
 mgspool/mgt  lustre:version        1                                        local
 mgspool/mgt  lustre:index          65535                                    local
 mgspool/mgt  lustre:failover.node  192.168.227.11@tcp1:192.168.227.12@tcp1  local
 mgspool/mgt  lustre:svname         MGS                                      local
 mgspool/mgt  lustre:flags          4196                                     local
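Putting the pieces together, a minimal sketch of the complete two-step sequence for the MGT used throughout this section (device names as in the earlier examples):

 # Step 1: create the zpool explicitly, disabling the cachefile for HA use
 zpool create -O canmount=off -o cachefile=none mgspool mirror sda sdc

 # Step 2: format the Lustre target inside the pool
 mkfs.lustre --mgs \
   --servicenode 192.168.227.11@tcp1 \
   --servicenode 192.168.227.12@tcp1 \
   --backfstype=zfs \
   mgspool/mgt

 # Review the result without making any changes
 tunefs.lustre --dryrun mgspool/mgt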

Only the zpool is created directly by the administrator. The mkfs.lustre command is still used to control creation of the file system dataset from the pool. Additional properties of the dataset can be applied by mkfs.lustre using the --mkfsoptions flag. The mkfs.lustre command will fail with an error if an attempt is made to format an existing dataset:

 [root@rh7z-mds1 ~]# mkfs.lustre --mgs \
 > --servicenode 192.168.227.11@tcp1 \
 > --servicenode 192.168.227.12@tcp1 \
 > --backfstype=zfs mgspool/mgt

 Permanent disk data:
 Target:     MGS
 Index:      unassigned
 Lustre FS:
 Mount type: zfs
 Flags:      0x1064 (MGS first_time update no_primnode )
 Persistent mount opts:
 Parameters: failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1

 checking for existing Lustre data: not found
 mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgspool/mgt
 cannot create 'mgspool/mgt': dataset already exists

mkfs.lustre FATAL: Unable to create file system mgspool/mgt (256)

mkfs.lustre FATAL: mkfs failed 256

One can force the dataset to be formatted as a Lustre OSD by adding the --reformat flag to mkfs.lustre:

 [root@rh7z-mds1 ~]# mkfs.lustre --reformat --mgs \
 > --servicenode 192.168.227.11@tcp1 \
 > --servicenode 192.168.227.12@tcp1 \
 > --backfstype=zfs mgspool/mgt

 Permanent disk data:
 Target:     MGS
 Index:      unassigned
 Lustre FS:
 Mount type: zfs
 Flags:      0x1064 (MGS first_time update no_primnode )
 Persistent mount opts:
 Parameters: failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1

 mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgspool/mgt
 Writing mgspool/mgt properties
   lustre:version=1
   lustre:flags=4196
   lustre:index=65535
   lustre:svname=MGS
   lustre:failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1

Alternatively, destroy the dataset and have mkfs.lustre recreate it:

 zfs destroy <pool name>/<dataset name>

Then re-run the mkfs.lustre command without the --reformat flag.

Working with ZFS Imports
The assembly and incorporation of a ZFS storage pool into the operating system run-time environment is managed by the zpool command. Existing pools are incorporated into a host’s run-time environment using the zpool import command, and can be released using the zpool export command. The basic syntax to import a pool is:

 zpool import [-f] [ -o <property>=<value> ] <pool name>

This technique enables storage pools to migrate between hosts in a consistent and reliable manner. Any host that is connected to a pool of storage in a shared enclosure can import the ZFS pool, making it straightforward to facilitate high availability failover.
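For example, a sketch of migrating the pool from the earlier examples between the two metadata servers in an HA pair:

 # On the host that currently has the pool imported:
 [root@rh7z-mds1 ~]# zpool export mgspool

 # On the failover partner:
 [root@rh7z-mds2 ~]# zpool import mgspool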

A ZFS storage pool can only be imported to a single host at a time. If a pool has been imported onto a host, it must be exported before it can be safely imported to a different host. If a zpool in a shared storage enclosure is simultaneously imported to more than one host, the pool data will be corrupted. To reduce the risk of this happening, all servers must be configured with a unique hostid that is used to label each zpool with the current active host that has imported the pool. Multiple Import Protection (using the multihost pool property), implemented in ZFS 0.7.0, must also be configured when available. Please refer to Protecting File System Volumes from Concurrent Access for information about how to enable and verify that this protection is properly configured.
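As a sketch, assuming ZFS 0.7.0 or newer and the pool from the earlier examples, the protection can be verified and enabled as follows:

 # Verify that this server has a unique, non-zero hostid
 [root@rh7z-mds1 ~]# hostid

 # Enable multiple import protection on the pool and confirm the setting
 [root@rh7z-mds1 ~]# zpool set multihost=on mgspool
 [root@rh7z-mds1 ~]# zpool get multihost mgspool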

If the pool is not exported, it will appear to still be active on its original host, even if that host is offline (regardless of whether it was powered off cleanly or crashed). In this case, the import will fail, provided that the hostids for the servers have been correctly configured, and the import will need to be forced.

If an import fails, it could mean that the pool is already imported on a different host. Before trying to force the importation of a pool onto a host, check that the storage has not already been imported to another system. If the zpool has been configured correctly, and all hosts have valid hostids, then the import command will indicate the host that last had the pool imported.

The zpool import command can be used to list pools that are not currently imported on the host, without actually performing any import:

 zpool import [ -d <dir> ] [-D]

When invoked using this syntax, zpool import lists the pools that are potentially available for import, and will ignore zpools that are already imported to the host. The -d (lower case) flag is used to specify a directory containing block devices, for example /dev/disk/by-id. This flag is not often required. The -D (upper case) flag is used to list zpools that have been destroyed.
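For example, to restrict the scan to the persistent device directory:

 zpool import -d /dev/disk/by-id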

In the following example, the import fails because the zpool cannot be found:

 [root@rh7z-mds2 system]# zpool import demo-mdtpool
 cannot import 'demo-mdtpool': no such pool available

This could be because the zpool does not exist, or because the pool has the wrong name, but it could also mean that the pool has been destroyed. Run zpool import without any other options to get a list of the zpools that the operating system can locate and that are not already imported to the host:

 [root@rh7z-mds2 system]# zpool import
    pool: mgspool
      id: 2186474330384511828
   state: ONLINE
  status: The pool was last accessed by another system.
  action: The pool can be imported using its name or numeric identifier and
          the '-f' flag.
     see: http://zfsonlinux.org/msg/ZFS-8000-EY
  config:

         mgspool     ONLINE
           mirror-0  ONLINE
             sda     ONLINE
             sdc     ONLINE

The output only lists a single zpool, called mgspool. There is no obvious sign of any other configured zpool available to the operating system. If the list of ZFS pools does not match expectations, perhaps one or more of the pools has been destroyed. Check using zpool import -D:

 [root@rh7z-mds2 system]# zpool import -D
    pool: demo-mdtpool
      id: 6641674394771267657
   state: ONLINE (DESTROYED)
  action: The pool can be imported using its name or numeric identifier.
  config:

         demo-mdtpool                             ONLINE
           mirror-0                               ONLINE
             scsi-0QEMU_QEMU_HARDDISK_EEMDT0001   ONLINE
             scsi-0QEMU_QEMU_HARDDISK_EEMDT0000   ONLINE

From this, it can be seen that at some point in the past, another zpool existed on the system, but that it was destroyed. It is possible to recover a destroyed pool as follows:

 zpool import -D <pool name>

Provided that the storage from which the pool was originally assembled has not been modified or the data over-written, the pool will be re-assembled and can be used as normal.
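For example, to recover the destroyed pool from the listing above (the -f flag may also be required if the pool was last accessed by another system):

 [root@rh7z-mds2 system]# zpool import -D demo-mdtpool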

The zdb command can be helpful in determining where a zpool has been imported. If an exported pool cannot be imported cleanly into a host, use zdb to check the MOS configuration to see if it is perhaps “registered” with another host:

 zdb -e <pool name> | awk '/^MOS/,/^$/{print}'

The hostid and hostname fields will indicate the last host known to have imported the pool or dataset. For example:

 [root@rh7z-mds1 ~]# zdb -e mgspool | awk '/^MOS/,/^$/{print}'
 MOS Configuration:
         version: 5000
         name: 'mgspool'
         state: 0
         txg: 34351
         pool_guid: 11089712772589408485
         errata: 0
         hostid: 1386610045
         hostname: 'rh7z-mds2'
         vdev_children: 1
         vdev_tree:
 … <output truncated for brevity>

In this example, it would be prudent to check the host rh7z-mds2 to see if the pool has been imported there before taking any further action.

Lustre and ZFS File System Datasets
When working with ZFS-based storage, each Lustre storage target is held on a file system dataset inside a ZFS pool. The dataset will be created by Lustre when the storage is formatted with the mkfs.lustre command and --backfstype=zfs has been selected. While it is possible to create multiple file system datasets within a single storage pool and use those for Lustre, this is not recommended: each dataset will compete for the pool’s resources, affecting performance and making it more difficult to balance IO across the storage cluster. Also bear in mind that the unit of failover is the ZFS pool, not the dataset. One cannot migrate a dataset in a pool without migrating the entire pool. For example, if the MGT and MDT are created within the same ZFS pool, then the MGS and MDS services will always have to run on the same host, because the pool can only be imported to one host at a time. The MGS therefore loses its independence from the MDS.
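For example, a minimal sketch that keeps the MGT and MDT in separate pools, so that each target can be imported, and therefore failed over, independently (the device names are hypothetical):

 zpool create -O canmount=off -o cachefile=none mgspool mirror sda sdc
 zpool create -O canmount=off -o cachefile=none mdtpool mirror sdb sdd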

Don’t use the zfs create command directly to create datasets that will be used as Lustre targets. ZFS datasets created independently of the mkfs.lustre command will have to be unmounted or destroyed and then reformatted. The properties of Lustre file system datasets can be altered after formatting, or can be supplied as options to the mkfs.lustre command by using the --mkfsoptions flag. Refer to the mkfs.lustre manual page for details.
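As a sketch of the latter approach, dataset properties can be passed through to the underlying zfs create invocation at format time (the pool name, dataset name, and property value here are purely illustrative, and assume the pool already exists):

 mkfs.lustre --ost --fsname demo --index 0 \
   --mgsnode 192.168.227.11@tcp1 \
   --backfstype=zfs \
   --mkfsoptions "-o recordsize=1M" \
   ostpool/ost0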

If a ZFS dataset already exists and is not unmounted, the mkfs.lustre command will not report an error when an attempt is made to format the dataset, but it will not be able to format the volume. The only immediate indication of a failure is that the mkfs.lustre output will be truncated:

 [root@rh7z-mds1 ~]# zpool create -O canmount=off -o cachefile=none mgspool mirror sda sdc

[root@rh7z-mds1 ~]# zfs create mgspool/mgt
 * 1) Create a file system dataset using the default properties.
 * 2) The dataset will be automatically mounted once created.

 [root@rh7z-mds1 ~]# mkfs.lustre --mgs \
 >  --servicenode 192.168.227.11@tcp1 \
 >  --servicenode 192.168.227.12@tcp1 \
 >  --backfstype=zfs \
 >  mgspool/mgt
 * 1) Try to format the dataset for Lustre.
 * 2) The command will fail but will not report an error.

 Permanent disk data:
 Target:     MGS
 Index:      unassigned
 Lustre FS:
 Mount type: zfs
 Flags:      0x1064 (MGS first_time update no_primnode )
 Persistent mount opts:
 Parameters: failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1
 [root@rh7z-mds1 ~]#

Administrators must use the mount command whenever starting Lustre services, and the umount command when stopping services.
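For example, a sketch of starting and stopping the MGS from the earlier examples (the mount point is hypothetical):

 # Start the MGS by mounting the target
 [root@rh7z-mds1 ~]# mkdir -p /lustre/mgt
 [root@rh7z-mds1 ~]# mount -t lustre mgspool/mgt /lustre/mgt

 # Stop the MGS by unmounting it
 [root@rh7z-mds1 ~]# umount /lustre/mgt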

For the purposes of comparison, let’s examine what the mkfs.lustre command itself is doing when it creates and formats a ZFS OSD. The following example command output shows an MGT being created from a ZFS mirrored zpool consisting of two disks:

 [root@rh7z-mds1 ~]# mkfs.lustre --mgs \
 > --servicenode 192.168.227.11@tcp1 \
 > --servicenode 192.168.227.12@tcp1 \
 > --backfstype=zfs \
 > mgspool/mgt

 Permanent disk data:
 Target:     MGS
 Index:      unassigned
 Lustre FS:
 Mount type: zfs
 Flags:      0x1064 (MGS first_time update no_primnode )
 Persistent mount opts:
 Parameters: failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1

 checking for existing Lustre data: not found
 mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgspool/mgt
 Writing mgspool/mgt properties
   lustre:version=1
   lustre:flags=4196
   lustre:index=65535
   lustre:svname=MGS
   lustre:failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1
 [root@rh7z-mds1 ~]#

Note that the mkfs.lustre command hands creation of the mgt dataset to the relevant ZFS command, acting as a convenient wrapper that encapsulates creation of the ZFS dataset and the Lustre-formatted storage devices. The complete ZFS command-line invocations are displayed in the mkfs.lustre output, providing transparency in the way that the underlying storage is configured.

Unfortunately, the zpool command-line options used by mkfs.lustre cannot be modified, other than to define the zpool specification. The mkfs.lustre command therefore should not be used to create the zpool when working with failover shared storage: the restriction prevents a user from defining the correct cachefile property for the zpool, which is needed to prevent hosts from automatically importing a zpool on system boot. The mkfs.lustre command also will not allow a user to provide tuning options at pool creation time, notably the ashift property that helps with write alignment for storage devices that do not correctly report the underlying sector size.
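For comparison, a sketch of a pool created explicitly with both of these settings (the ashift value and device names are illustrative; ashift=12 suits devices with 4KiB sectors):

 zpool create -O canmount=off \
   -o ashift=12 \
   -o cachefile=none \
   ostpool mirror sda sdb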

Refer to the ZFS OSD Tuning section for information on properties that influence ZFS and Lustre performance.