ZFS OSD Storage Basics

From Lustre Wiki

Introduction

When working with ZFS OSDs, one can bundle the entire process of creating a zpool and formatting a storage target into a single command using mkfs.lustre, or split the work into two steps, where creation of the zpool is separated from formatting the OSD.

Both methods are discussed in this section; however, we recommend creating the ZFS storage pools separately from formatting the Lustre OSD.

For high-availability configurations where the ZFS volumes are kept on shared storage, the zpools must be created independently of the mkfs.lustre command so that they can be correctly prepared for use in a failover environment.

ZFS Storage Pool Basics

ZFS separates storage volume definition from the file system specification, providing two separate tools to manage each. The zpool command is used to define the volumes and manages the physical storage assets, while the zfs command provides management of the ZFS file system datasets themselves.

A ZFS pool is composed of one or more entities called Virtual Devices, or vdevs. There are two basic categories of vdev: physical and logical. A physical vdev can be a complete physical storage device such as a disk drive, a partition on a disk drive, or a file. For Lustre file systems, it is strongly recommended that whole disks are used, with no pre-defined partition table.

Logical vdevs are assemblies of physical vdevs, arranged into groups, usually for the purpose of providing additional storage redundancy. Logical vdevs include mirrors and RAIDZ data protection layouts. There can be many vdevs assigned per ZFS pool, and data is written in stripes across the vdevs in the pool.

The following examples illustrate how ZFS pools are created:

  • Simple stripe across two physical vdevs:
    zpool create tank sda sdb
    

    The result of this command is the creation of a pool named tank containing two physical vdevs, sda and sdb, in a stripe (equivalent to RAID 0).

  • Two-disk mirror:
    zpool create tank mirror sda sdb
    
  • Striped mirrors (equivalent to RAID 1+0):
    zpool create tank \
      mirror sda sdb \
      mirror sdc sdd \
      mirror sde sdf
    

    The above example has a single pool consisting of three mirrored vdevs. Data is striped across the three mirrors.

  • Pool with single RAIDZ2 vdev (equivalent to RAID 6):
    zpool create tank raidz2 sda sdb sdc sdd sde sdf
    
  • Pool with two RAIDZ2 vdevs (equivalent to RAID 6+0):
    zpool create tank \
      raidz2 sda sdb sdc sdd sde sdf \
      raidz2 sdg sdh sdi sdj sdk sdl
    

    Note: In order to simplify the presentation of commands, the examples use the shortened block device name (i.e. sdN), which is not guaranteed to refer to the same physical device on each server boot or when the server configuration is altered. The short block device path is therefore not recommended for use in production. When defining the storage pools, use a persistent device path, such as /dev/disk/by-id/<name> or some other reliable convention that ensures that the device file will always refer to the same physical device.
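
One way to find the persistent aliases for a short device name is to walk the by-id directory and compare link targets. The following is a minimal sketch, not part of the ZFS or Lustre tooling: the function name persistent_names and its optional directory argument are assumptions made for illustration.

```shell
# Print every alias in a by-id style directory that resolves to the
# given block device. The second argument (hypothetical, for testing)
# overrides the default directory /dev/disk/by-id.
persistent_names() {
    dev=$(readlink -f "$1") || return 1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        # Skip anything that is not a symbolic link (including an
        # unmatched glob pattern).
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$dev" ]; then
            printf '%s\n' "$link"
        fi
    done
}

# Example: persistent_names /dev/sda
```

Any of the paths printed for a device can then be used in place of the short name in the zpool specification.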

Formatting a ZFS OSD using only the mkfs.lustre command

    The basic syntax for creating a ZFS-based OSD using only the mkfs.lustre command is:

    mkfs.lustre --mgs | --mdt | --ost \
      [--reformat] \
      [ --fsname <name> ] \
      [ --index <n> ] \
      [ --mgsnode <MGS NID> [--mgsnode <MGS NID> …]] \
      [ --servicenode <NID> [--servicenode <NID> …]] \
      [ --failnode <NID> [--failnode <NID> …]] \
      --backfstype=zfs \
      [ --mkfsoptions <options> ] \
      <pool name>/<dataset> \
      <zpool specification>
    

    The servicenode and failnode command-line options are used to identify the NIDs of the hosts that are able to run the Lustre service in a high-availability configuration. The options servicenode and failnode are mutually incompatible: choose one or the other when defining the HA failover hosts that are expected to provide the Lustre service.

    The servicenode syntax defines all of the NIDs of all of the hosts that will be able to run the Lustre service. All of the hosts must be referenced, including the host that is expected to be the preferred primary for running the service (this is usually the host where the format command is running).

    This example uses the --servicenode syntax to create an MGT that can be run on two servers as an HA failover resource:

    [root@rh7z-mds1 ~]# mkfs.lustre --mgs \
      --servicenode 192.168.227.11@tcp1 \
      --servicenode 192.168.227.12@tcp1 \
      --backfstype=zfs \
      mgspool/mgt mirror sda sdc
    

    The command line formats a new MGT that will be used by the MGS for storage. The command further defines a mirrored zpool called mgspool consisting of two devices, and creates a ZFS dataset called mgt. Two server NIDs are supplied as service nodes for the MGS, 192.168.227.11@tcp1 and 192.168.227.12@tcp1.
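
Because every host that can run the service must be listed, it can help to generate the --servicenode arguments from a single list of NIDs. The sketch below only prints the resulting command so that it can be reviewed before being run. The function name mgt_format_cmd is hypothetical, and only flags already shown in this section are used.

```shell
# Print (do not run) a mkfs.lustre command for an MGT with one
# --servicenode flag per NID.
# Usage: mgt_format_cmd <pool>/<dataset> <NID> [<NID> ...]
mgt_format_cmd() {
    target=$1
    shift
    cmd="mkfs.lustre --mgs"
    for nid in "$@"; do
        cmd="$cmd --servicenode $nid"
    done
    printf '%s --backfstype=zfs %s\n' "$cmd" "$target"
}

mgt_format_cmd mgspool/mgt 192.168.227.11@tcp1 192.168.227.12@tcp1
```

The zpool specification is deliberately left out of the printed command, on the assumption that the pool has already been created separately.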

    The failnode syntax is similar, but defines only the failover targets for the storage service. It is an older method for creating services, and it implies a primary server and one or more secondary (failover) servers. Only the failover servers are included in the command-line definition. The primary is written to the storage service configuration only when the formatted device is mounted for the first time: whichever host mounts the storage first has its NID recorded as the effective primary server.

    Example format command with the failnode syntax:

    [root@rh7z-mds1 ~]# mkfs.lustre --mgs \
      --failnode 192.168.227.12@tcp1 \
      --backfstype=zfs \
      mgspool/mgt mirror sda sdc
    

    Here, the failover host is identified as 192.168.227.12@tcp1, one of the metadata servers in an HA pair (hostname rh7z-mds2). The mkfs.lustre command was executed on rh7z-mds1 (NID: 192.168.227.11@tcp1), and the mount command must also be run from this host when the MGS service starts for the very first time. Otherwise, the primary NID will not be written to the storage configuration, and the failover mechanism will not work as expected.

    Note: If, after formatting the storage, the MGT is mounted on 192.168.227.12@tcp1, then both the primary NID and the failover NID will be the same. The intended primary host, 192.168.227.11@tcp1, would be excluded from being able to run the MGS service.
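
When the failnode syntax cannot be avoided, a small guard in the script that performs the first mount can reduce the risk of mounting on the wrong host. This is only a sketch: the function name check_first_mount_host is hypothetical, and it compares short hostnames rather than NIDs.

```shell
# Succeed only when the current host is the intended primary.
# The optional second argument overrides the detected hostname.
check_first_mount_host() {
    intended=$1
    current=${2:-$(hostname -s)}
    if [ "$current" = "$intended" ]; then
        echo "ok: $current is the intended primary"
    else
        echo "refusing: first mount must happen on $intended, not $current" >&2
        return 1
    fi
}

check_first_mount_host rh7z-mds1 rh7z-mds1
```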

    Wherever possible, use the servicenode syntax to define the high availability configuration for Lustre services.

    The mkfs.lustre command can pass additional options to the underlying file system creation software using the --mkfsoptions flag. The flag was originally used to pass through options for EXT-based storage, but can also be used in a limited way for ZFS. The --mkfsoptions parameter allows a user to pass through options that are added to the zfs command line, but it cannot be used to modify the zpool command line invocation. There are times when the zpool command defaults are not sufficient to support a production Lustre file system, especially when the storage system is on shared drives and there is a failover configuration in place.

    For this reason, it is recommended that ZFS pools always be created explicitly and separately from the mkfs.lustre command.

Formatting a ZFS OSD using zpool and mkfs.lustre

    To create a ZFS-based OSD suitable for use as a high-availability failover storage device, first create a ZFS pool to contain the file system dataset, then use mkfs.lustre to actually create the file system inside the zpool:

    zpool create [-f] -O canmount=off \
      [ -O compression=lz4 ] \
      [ -o ashift=<n> ] \
      -o cachefile=/etc/zfs/<zpool name>.spec | -o cachefile=none \
      <zpool name> <zpool specification>
    
    mkfs.lustre --mgs | --mdt | --ost \
      [--reformat] \
      [ --fsname <name> ] \
      [ --index <n> ] \
      [ --mgsnode <MGS NID> [--mgsnode <MGS NID> …]] \
      [ --servicenode <NID> [--servicenode <NID> …]] \
      [ --failnode <NID> [--failnode <NID> …]] \
      --backfstype=zfs \
      [ --mkfsoptions <options> ] \
      <pool name>/<dataset name>
    

    For example:

    # Create the zpool
    zpool create -O canmount=off \
      -o cachefile=none \
      mgspool mirror sda sdc
    
    # Format the Lustre MGT 
    mkfs.lustre --mgs \
      --servicenode 192.168.227.11@tcp1 \
      --servicenode 192.168.227.12@tcp1 \
      --backfstype=zfs \
      mgspool/mgt
    

    After formatting, use tunefs.lustre to review the newly created OSD. The command line format is:

    tunefs.lustre --dryrun <pool name>/<dataset name>
    

    For example:

    [root@rh7z-mds1 ~]# tunefs.lustre --dryrun mgspool/mgt
    checking for existing Lustre data: found
    
       Read previous values:
    Target:     MGS
    Index:      unassigned
    Lustre FS:  
    Mount type: zfs
    Flags:      0x1044
                  (MGS update no_primnode )
    Persistent mount opts: 
    Parameters: failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1
    
    
       Permanent disk data:
    Target:     MGS
    Index:      unassigned
    Lustre FS:  
    Mount type: zfs
    Flags:      0x1044
                  (MGS update no_primnode )
    Persistent mount opts: 
    Parameters: failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1
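
For scripted checks, the NIDs can be pulled out of the Parameters line. The following sketch assumes the output layout shown above; the function name failover_nids is hypothetical. In practice the input would come from tunefs.lustre --dryrun rather than from the embedded sample.

```shell
# Read tunefs.lustre --dryrun output on stdin and print each NID from
# the first failover.node parameter, one per line.
failover_nids() {
    awk '/^ *Parameters:/ {
        for (i = 2; i <= NF; i++) {
            if (sub(/^failover\.node=/, "", $i)) {
                n = split($i, nids, ":")
                for (j = 1; j <= n; j++) print nids[j]
            }
        }
        exit
    }'
}

# Sample input copied from the example output above.
failover_nids <<'EOF'
Target:     MGS
Mount type: zfs
Parameters: failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1
EOF
```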
    

    Use the zfs get command to retrieve comprehensive metadata information about the file system dataset and to confirm that the Lustre properties have been set correctly:

    zfs get all | awk '$2 ~ /lustre/'
    

    For example:

    [root@rh7z-mds1 ~]# zfs get all | awk '$2 ~/lustre/'
    mgspool/mgt  lustre:version        1                                        local
    mgspool/mgt  lustre:index          65535                                    local
    mgspool/mgt  lustre:failover.node  192.168.227.11@tcp1:192.168.227.12@tcp1  local
    mgspool/mgt  lustre:svname         MGS                                      local
    mgspool/mgt  lustre:flags          4196                                     local
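
The lustre:flags property is stored in decimal, whereas mkfs.lustre prints the Flags field in hexadecimal. The sketch below extracts a single property from zfs get all output and converts the flags value, so the two can be compared. The function name lustre_prop is hypothetical, and the sample input is a subset of the output shown above.

```shell
# Print the value of one lustre:* property from "zfs get all" output
# supplied on stdin.
lustre_prop() {
    awk -v prop="lustre:$1" '$2 == prop { print $3 }'
}

sample='mgspool/mgt  lustre:svname  MGS   local
mgspool/mgt  lustre:flags   4196  local'

flags=$(printf '%s\n' "$sample" | lustre_prop flags)
printf 'flags: %d = 0x%x\n' "$flags" "$flags"
```

The value 4196 converts to 0x1064, which is the Flags value printed by mkfs.lustre in the format examples in this section.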
    

    Only the zpool is created directly by the administrator. The mkfs.lustre command is still used to control creation of the file system dataset from the pool. Additional properties of the dataset can be applied by mkfs.lustre using the --mkfsoptions flag. The mkfs.lustre command will fail with an error if an attempt is made to format an existing dataset:

    [root@rh7z-mds1 ~]# mkfs.lustre --mgs \
    >  --servicenode 192.168.227.11@tcp1 \
    >  --servicenode 192.168.227.12@tcp1 \
    >  --backfstype=zfs   mgspool/mgt
    
       Permanent disk data:
    Target:     MGS
    Index:      unassigned
    Lustre FS:  
    Mount type: zfs
    Flags:      0x1064
                  (MGS first_time update no_primnode )
    Persistent mount opts: 
    Parameters:  failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1
    
    checking for existing Lustre data: not found
    mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgspool/mgt
       cannot create 'mgspool/mgt': dataset already exists
    
    mkfs.lustre FATAL: Unable to create file system mgspool/mgt (256)
    
    mkfs.lustre FATAL: mkfs failed 256
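
A script can check for the dataset before calling mkfs.lustre, rather than relying on the error above. This is a sketch that assumes the zfs command is on the PATH; the function names dataset_exists and format_plan are hypothetical, and nothing is created or destroyed.

```shell
# Return success if the named dataset already exists.
dataset_exists() {
    zfs list -H -o name "$1" >/dev/null 2>&1
}

# Print which path to take for a target, without changing anything.
format_plan() {
    if dataset_exists "$1"; then
        echo "$1 exists: add --reformat or destroy it first"
    else
        echo "$1 not found: mkfs.lustre can create it"
    fi
}

# Example: format_plan mgspool/mgt
```

Note that a host without the zfs command installed would also report the dataset as not found, so the check is only meaningful on a configured server.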
    

    One can force the dataset to be formatted as a Lustre OSD by adding the --reformat flag to mkfs.lustre:

    [root@rh7z-mds1 ~]# mkfs.lustre --reformat --mgs \
    >  --servicenode 192.168.227.11@tcp1 \
    >  --servicenode 192.168.227.12@tcp1 \
    >  --backfstype=zfs   mgspool/mgt
    
       Permanent disk data:
    Target:     MGS
    Index:      unassigned
    Lustre FS:  
    Mount type: zfs
    Flags:      0x1064
                  (MGS first_time update no_primnode )
    Persistent mount opts: 
    Parameters:  failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1
    
    mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgspool/mgt
    Writing mgspool/mgt properties
      lustre:version=1
      lustre:flags=4196
      lustre:index=65535
      lustre:svname=MGS
      lustre:failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1
    

    Alternatively, destroy the dataset and have mkfs.lustre recreate it:

    zfs destroy <pool name>/<dataset name>
    
    mkfs.lustre --mgs \
      [ --servicenode <NID> [--servicenode <NID> …]] \
      --backfstype=zfs \
      [ --mkfsoptions <options> ] \
      <pool name>/<dataset>
    

    If the dataset is reformatted, then previously applied properties will obviously be lost. Remember to include any ZFS-specific properties by making use of the --mkfsoptions flag.

Working with ZFS Imports

    The assembly and incorporation of a ZFS storage pool into a host’s run-time environment is managed by the zpool import command. A pool that has been imported can be released again using the zpool export command. The basic syntax to import a pool is:

    zpool import [-f] [ -o <properties>] <pool name>
    

    This technique enables storage pools to migrate between hosts in a consistent and reliable manner. Any host that is connected to a pool of storage in a shared enclosure can import the ZFS pool, making it straightforward to facilitate high availability failover.

    A ZFS storage pool can only be imported to a single host at a time. If a pool has been imported onto a host, it must be exported before it can be safely imported to a different host. If a zpool in a shared storage enclosure is simultaneously imported to more than one host, the pool data will be corrupted. To reduce the risk of this happening, all servers must be configured with a unique hostid that is used to label each zpool with the current active host that has imported the pool. Multiple Import Protection (using the multihost pool property), implemented in ZFS 0.7.0, must also be configured when available. Please refer to Protecting File System Volumes from Concurrent Access for information about how to enable and verify that this protection is properly configured.

    If the pool is not exported, it will appear to still be active on its original host, even if that host is offline (regardless of whether it was powered off cleanly or crashed). In this case, the import will fail, provided that the hostids for the servers have been correctly configured, and the import will need to be forced.

    If an import fails, it could mean that the pool is already imported on a different host. Before trying to force the importation of a pool onto a host, check that the storage has not already been imported to another system. If the zpool has been configured correctly, and all hosts have valid hostids, then the import command will indicate the host that last had the pool imported.

    The zpool import command can be used to list pools that are not currently imported on the host, without actually performing any import:

    zpool import [-d <dev directory>] [-D]
    

    When invoked using this syntax, zpool import lists the pools that are potentially available for import, and will ignore zpools that are already imported to the host. The -d (lower case) flag is used to specify a directory containing block devices, for example /dev/disk/by-id. This flag is not often required. The -D (upper case) flag is used to list zpools that have been destroyed.

    In the following example, the import fails because the zpool cannot be found:

    [root@rh7z-mds2 system]# zpool import demo-mdtpool
    cannot import 'demo-mdtpool': no such pool available
    

    This could be because the zpool does not exist, or because the pool has the wrong name, but it could also mean that the pool has been destroyed. Run zpool import without any other options to get a list of the zpools that the operating system can locate, that are not already imported to the host:

    [root@rh7z-mds2 system]# zpool import
       pool: mgspool
         id: 2186474330384511828
      state: ONLINE
     status: The pool was last accessed by another system.
     action: The pool can be imported using its name or numeric identifier and
    	the '-f' flag.
       see: http://zfsonlinux.org/msg/ZFS-8000-EY
     config:
    
    	mgspool     ONLINE
    	  mirror-0  ONLINE
    	    sda     ONLINE
    	    sdc     ONLINE
    

    The output only lists a single zpool, called mgspool. There is no obvious sign of any other configured zpool available to the operating system. If the list of ZFS pools does not match expectations, perhaps one or more of the pools has been destroyed. Check using zpool import -D:

    [root@rh7z-mds2 system]# zpool import -D
       pool: demo-mdtpool
         id: 6641674394771267657
      state: ONLINE (DESTROYED)
     action: The pool can be imported using its name or numeric identifier.
     config:
    
    	demo-mdtpool                                   ONLINE
    	  mirror-0                                ONLINE
    	    scsi-0QEMU_QEMU_HARDDISK_EEMDT0001  ONLINE
    	    scsi-0QEMU_QEMU_HARDDISK_EEMDT0000  ONLINE
    

    From this, it can be seen that at some point in the past, another zpool existed on the system, but that it was destroyed. It is possible to recover a destroyed pool as follows:

    zpool import -D <pool name>
    

    Provided that the storage from which the pool was originally assembled has not been modified or the data over-written, the pool will be re-assembled and can be used as normal.
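
When several pools are listed, a one-line summary per pool makes destroyed pools easier to spot. The sketch below assumes the zpool import output layout shown in the examples above; the function name pool_states is hypothetical.

```shell
# Read zpool import (or zpool import -D) output on stdin and print
# "<pool>: <state>" for each pool listed.
pool_states() {
    awk '$1 == "pool:"  { name = $2 }
         $1 == "state:" { $1 = ""; sub(/^ +/, ""); print name ": " $0 }'
}

# Sample input copied from the zpool import -D example above.
pool_states <<'EOF'
   pool: demo-mdtpool
     id: 6641674394771267657
  state: ONLINE (DESTROYED)
 action: The pool can be imported using its name or numeric identifier.
EOF
```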

    The zdb command can be helpful in determining where a zpool has been imported. If an exported pool cannot be imported cleanly into a host, use zdb to check the MOS configuration to see if it is perhaps “registered” with another host:

    zdb -e <zpool name> | awk '/^MOS/,/^$/{print}'
    

    The hostid and hostname fields will indicate the last host known to have imported the pool or dataset. For example:

    [root@rh7z-mds1 ~]# zdb -e mgspool | awk '/^MOS/,/^$/{print}'
    MOS Configuration:
            version: 5000
            name: 'mgspool'
            state: 0
            txg: 34351
            pool_guid: 11089712772589408485
            errata: 0
            hostid: 1386610045
            hostname: 'rh7z-mds2'
            vdev_children: 1
            vdev_tree:
    …
    <output truncated for brevity>
    

    In this example, it would be prudent to check the host rh7z-mds2 to see if the pool has been imported there before taking any further action.
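
The hostid field in the MOS configuration is printed in decimal, while the hostid command on a server normally prints hexadecimal, so a conversion is needed before the two can be compared. The following sketch assumes the zdb output layout shown above; the function names hostid_hex and last_importer are hypothetical.

```shell
# Convert a decimal hostid, as printed by zdb, to eight hex digits.
hostid_hex() {
    printf '%08x\n' "$1"
}

# Read MOS configuration text on stdin and print "<hostname> <hostid>".
last_importer() {
    awk '$1 == "hostid:"   { id = $2 }
         $1 == "hostname:" { gsub(/\047/, "", $2); host = $2 }
         END { print host, id }'
}

# Sample input copied from the zdb example above.
last_importer <<'EOF'
        hostid: 1386610045
        hostname: 'rh7z-mds2'
EOF
hostid_hex 1386610045
```

For the example pool, 1386610045 corresponds to 52a5fd7d, which can be compared with the value reported by the hostid command on rh7z-mds2.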

Lustre and ZFS File System Datasets

    When working with ZFS-based storage, each Lustre storage target is held on a file system dataset inside a ZFS pool. The dataset will be created by Lustre when the storage is formatted with the mkfs.lustre command and --backfstype=zfs has been selected. While it is possible to create multiple file system datasets within a single storage pool and use those for Lustre, this is not recommended: each dataset will compete for the pool’s resources, affecting performance and making it more difficult to balance IO across the storage cluster. Also bear in mind that the unit of failover is the ZFS pool, not the dataset. One cannot migrate a dataset in a pool without migrating the entire pool. For example, if the MGT and MDT0 are created within the same ZFS pool, then the MGS and MDS services will always have to run on the same host, because the pool can only be imported to one host at a time. The MGS therefore loses its independence from the MDS for MDT0.
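
A quick audit can flag pools that hold more than one Lustre target. The sketch below reads pairs of dataset name and lustre:svname value, which could be produced with a command such as zfs get -H -o name,value lustre:svname (an assumption about how the input is gathered). The function name shared_pools and all pool and target names in the sample are hypothetical.

```shell
# Report pools that contain more than one Lustre target.
# stdin: "<dataset> <svname>" per line; "-" means no Lustre label.
shared_pools() {
    awk '$2 != "-" {
             split($1, parts, "/")
             count[parts[1]]++
             names[parts[1]] = names[parts[1]] " " $2
         }
         END {
             for (pool in count)
                 if (count[pool] > 1)
                     print pool ":" names[pool]
         }'
}

# Hypothetical layout: the MGT and MDT0 share one pool.
shared_pools <<'EOF'
mdspool -
mdspool/mgt MGS
mdspool/mdt0 demo-MDT0000
ostpool/ost0 demo-OST0000
EOF
```

In this sample, mdspool is reported because the MGS and the MDS for MDT0 would be forced to fail over together.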

    Don’t use the zfs command directly to create datasets that will be used as Lustre targets. ZFS datasets created independently of the mkfs.lustre command will have to be unmounted or destroyed and then reformatted. The properties of Lustre file system datasets can be altered after formatting or can be supplied as options to the mkfs.lustre command by using the --mkfsoptions flag. Refer to the mkfs.lustre manual page for details.

    If a ZFS dataset already exists and is still mounted, the mkfs.lustre command will not report an error when an attempt is made to format it, but it will not be able to format the volume. The only immediate indication of a failure is that the mkfs.lustre output will be truncated:

    [root@rh7z-mds1 ~]# zpool create -O canmount=off -o cachefile=none mgspool mirror sda sdc
    
    # Create a file system dataset using the default properties.
    # The dataset will be automatically mounted once created.
    [root@rh7z-mds1 ~]# zfs create mgspool/mgt
    
    # Try to format the dataset for Lustre.
    # The command will fail but will not report an error.
    [root@rh7z-mds1 ~]# mkfs.lustre --mgs \
    >   --servicenode 192.168.227.11@tcp1 \
    >   --servicenode 192.168.227.12@tcp1 \
    >   --backfstype=zfs \
    >   mgspool/mgt
    
       Permanent disk data:
    Target:     MGS
    Index:      unassigned
    Lustre FS:  
    Mount type: zfs
    Flags:      0x1064
                  (MGS first_time update no_primnode )
    Persistent mount opts: 
    Parameters:  failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1
    [root@rh7z-mds1 ~]#
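
Since the only sign of this failure is the truncated output, a script can look for the property-writing step that appears in a successful run. This is a sketch based on the output shown in this section; the function name format_completed is hypothetical, and the check depends on the exact wording printed by mkfs.lustre.

```shell
# Succeed only if mkfs.lustre output on stdin reached the step where
# the Lustre properties are written to the dataset.
format_completed() {
    grep -q '^ *Writing .* properties'
}

# Two samples based on the outputs in this section.
truncated='Parameters:  failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1'
complete='mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgspool/mgt
Writing mgspool/mgt properties
  lustre:version=1'

for sample in "$truncated" "$complete"; do
    if printf '%s\n' "$sample" | format_completed; then
        echo "format completed"
    else
        echo "format did not complete"
    fi
done
```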
    

    Administrators must use the mount.lustre command whenever starting Lustre services, and the umount command when stopping services.

    For the purposes of comparison, let’s examine what the mkfs.lustre command itself is doing when it creates/formats a ZFS OSD. The following example command output shows an MGT being created from a ZFS mirrored zpool consisting of two disks:

    [root@rh7z-mds1 ~]# mkfs.lustre --mgs \
    >  --servicenode 192.168.227.11@tcp1 \
    >  --servicenode 192.168.227.12@tcp1 \
    >  --backfstype=zfs \
    >  mgspool/mgt
    
       Permanent disk data:
    Target:     MGS
    Index:      unassigned
    Lustre FS:  
    Mount type: zfs
    Flags:      0x1064
                  (MGS first_time update no_primnode )
    Persistent mount opts: 
    Parameters:  failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1
    
    checking for existing Lustre data: not found
    mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgspool/mgt
    Writing mgspool/mgt properties
      lustre:version=1
      lustre:flags=4196
      lustre:index=65535
      lustre:svname=MGS
      lustre:failover.node=192.168.227.11@tcp1:192.168.227.12@tcp1
    [root@rh7z-mds1 ~]#
    

    Note that the mkfs.lustre command hands creation of the mgspool/mgt dataset to the relevant ZFS command, acting as a convenient wrapper that encapsulates creation of the ZFS dataset and the Lustre formatted storage devices. The complete ZFS command-line invocations are displayed in the mkfs.lustre output, providing transparency in the way that the underlying storage is configured.

    Unfortunately, zpool command line options used by mkfs.lustre cannot be modified, other than to define the zpool specification. The mkfs.lustre command therefore should not be used to create the zpool when working with failover shared storage. This is because the restriction prevents a user from defining the correct cachefile property for the zpool to prevent hosts from automatically importing a zpool on system boot. The mkfs.lustre command also will not allow a user to provide tuning options at file system create time, notably the ashift property that helps with write alignment for storage devices that do not correctly report the underlying sector size.
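
The ashift value is the base-2 logarithm of the sector size, so 512-byte sectors correspond to ashift=9 and 4096-byte sectors to ashift=12. The sketch below works through that arithmetic for a sector size supplied by the administrator; the function name ashift_for is hypothetical.

```shell
# Print the ashift value (log2 of the sector size in bytes).
# Rejects sizes that are not a power of two.
ashift_for() {
    size=$1
    if [ "$size" -lt 1 ] || [ $((size & (size - 1))) -ne 0 ]; then
        echo "sector size must be a power of two: $size" >&2
        return 1
    fi
    bits=0
    while [ "$size" -gt 1 ]; do
        size=$((size / 2))
        bits=$((bits + 1))
    done
    echo "$bits"
}

ashift_for 512    # prints 9
ashift_for 4096   # prints 12
```

The result would be supplied to zpool create as -o ashift=<n>, as in the syntax shown earlier in this section.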

    Refer to the ZFS OSD Tuning section for information on properties that influence ZFS and Lustre performance.