Creating Lustre Object Storage Services (OSS)

From Lustre Wiki

Syntax Overview

The syntax for creating an OST is:

mkfs.lustre --ost \
  [--reformat] \
  --fsname <name> \
  --index <n> \
  --mgsnode <MGS NID> [--mgsnode <MGS NID> …] \
  [ --servicenode <NID> [--servicenode <NID> …]] \
  [ --failnode <NID> [--failnode <NID> …]] \
  [ --backfstype=ldiskfs|zfs ] \
  [ --mkfsoptions <options> ] \
  <device path> | <pool name>/<dataset> <zpool specification>

The command-line syntax for formatting an OST is very similar to that for the MDT. To create an OST, the mkfs.lustre command requires the file system name, the OST index, the list of MGS server NIDs, and the NIDs of each OSS machine that can mount the OST (used for HA failover). The command also specifies the back-end file system type and the device path or ZFS pool.

The MGS NIDs are supplied using the --mgsnode flag. If there is more than one potential location for the MGS (i.e., it is part of a high availability failover cluster configuration), then the option is repeated for as many failover nodes as are configured (usually there are two).

Ordering of the MGS nodes on the command line is significant: the first --mgsnode flag must reference the NID of the current active or primary MGS server. If this is not the case, then the first time that the OST tries to join the cluster, it will fail. The first mount of a storage target does not currently check the failover locations when trying to establish a connection with the MGS. When adding new storage targets to Lustre, the MGS must be running on its primary NID.

Each OST must also be supplied with the name (--fsname) of the Lustre file system it will join (maximum 8 characters), and an index number (--index) unique to the file system.
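Note that the value supplied with --index is encoded into the target name as a four-digit, zero-padded hexadecimal number (index 0 produces OST0000, as can be seen in the mkfs.lustre output later in this page). A quick shell illustration of the mapping:

```shell
# Lustre renders the OST index as a 4-digit, zero-padded hex number
# in the target name, e.g. demo-OST0000 for index 0.
printf 'OST%04x\n' 0     # OST0000
printf 'OST%04x\n' 26    # OST001a
```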

The list of service nodes (--servicenode) or failover nodes (--failnode) must be specified for any high availability configuration. Although there are more compact declarations for defining the nodes, for simplicity, list the NID of each server that can mount the storage as a separate --servicenode entry.

The next example uses the --servicenode syntax to create an OST that can be run on two servers as an HA failover resource:

[root@rh7z-oss1 system]# mkfs.lustre --ost \
>   --fsname demo \
>   --index 0 \
>   --mgsnode 192.168.227.11@tcp1 \
>   --mgsnode 192.168.227.12@tcp1 \
>   --servicenode 192.168.227.21@tcp1 \
>   --servicenode 192.168.227.22@tcp1 \
>   /dev/dm-3

The command line formats a new OST that will be used by an OSS for storage. The OST will be part of a file system called demo, with index number 0 (zero). There are two NIDs defined as the nodes able to host the OSS service, denoted by the --servicenode options, and two NIDs supplied for the MGS that the OSS will register with.

The --failnode syntax is similar, but is used to define only a failover target for the storage service. For example:

[root@rh7z-oss1 system]# mkfs.lustre --ost \
>   --fsname demo \
>   --index 0 \
>   --mgsnode 192.168.227.11@tcp1 \
>   --mgsnode 192.168.227.12@tcp1 \
>   --failnode 192.168.227.22@tcp1 \
>   /dev/dm-3

Here, the failover host is identified as 192.168.227.22@tcp1, one server in an HA pair (which, for the purposes of this example, has the hostname rh7z-oss2). The mkfs.lustre command was executed on rh7z-oss1 (NID: 192.168.227.21@tcp1), and the mount command must also be run from this host when the service starts for the very first time.

OST Formatted as an LDISKFS OSD

The syntax for creating an LDISKFS-based OST is:

mkfs.lustre --ost \
  [--reformat] \
  --fsname <name> \
  --index <n> \
  --mgsnode <MGS NID> [--mgsnode <MGS NID> …] \
  [ --servicenode <NID> [--servicenode <NID> …]] \
  [ --failnode <NID> [--failnode <NID> …]] \
  [ --backfstype=ldiskfs ] \
  [ --mkfsoptions <options> ] \
  <device path>

The next example uses the --servicenode syntax to create an OST that can be run on two servers as an HA failover resource:

[root@rh7z-oss1 system]# mkfs.lustre --ost \
>   --fsname demo \
>   --index 0 \
>   --mgsnode 192.168.227.11@tcp1 \
>   --mgsnode 192.168.227.12@tcp1 \
>   --servicenode 192.168.227.21@tcp1 \
>   --servicenode 192.168.227.22@tcp1 \
>   --backfstype=ldiskfs \
>   /dev/dm-3

The --failnode syntax is similar, but is used to define only a failover target for the storage service. For example:

[root@rh7z-oss1 system]# mkfs.lustre --ost \
>   --fsname demo \
>   --index 0 \
>   --mgsnode 192.168.227.11@tcp1 \
>   --mgsnode 192.168.227.12@tcp1 \
>   --failnode 192.168.227.22@tcp1 \
>   --backfstype=ldiskfs \
>   /dev/dm-3

The above examples are repeated from the main introduction to the syntax, but are included here to maintain symmetry with the rest of the text. However, note that the --backfstype flag has been set to ldiskfs, which tells mkfs.lustre to format the device as an LDISKFS OSD.

OST Formatted as a ZFS OSD

Formatting an OST using only the mkfs.lustre command

Note: For the greatest flexibility and control when creating ZFS-based Lustre storage targets, do not use this approach – instead, create the zpool separately from formatting the Lustre OSD. See Formatting an OST using zpool and mkfs.lustre. See also ZFS_Tunables_for_Lustre_Object_Storage_Servers_(OSS) for ZFS-specific tuning options.

The syntax for creating a ZFS-based OST using only the mkfs.lustre command is:

mkfs.lustre --ost \
  [--reformat] \
  --fsname <name> \
  --index <n> \
  --mgsnode <MGS NID> [--mgsnode <MGS NID> …] \
  [ --servicenode <NID> [--servicenode <NID> …]] \
  [ --failnode <NID> [--failnode <NID> …]] \
  --backfstype=zfs \
  [ --mkfsoptions <options> ] \
  <pool name>/<dataset> \
  <zpool specification>

The next example uses the --servicenode syntax to create an OST that can be run on two servers as an HA failover resource:

[root@rh7z-oss1 system]# mkfs.lustre --ost \
>   --fsname demo \
>   --index 0 \
>   --mgsnode 192.168.227.11@tcp1 \
>   --mgsnode 192.168.227.12@tcp1 \
>   --servicenode 192.168.227.21@tcp1 \
>   --servicenode 192.168.227.22@tcp1 \
>   --backfstype=zfs \
>   --mkfsoptions "recordsize=1024K -o compression=lz4" \
>   demo-ost0pool/ost0 \
>   raidz2 sda sdb sdc sdd sde sdf

The command line formats a new ZFS-based OST that will be used by an OSS for storage. The OST will be part of a file system called demo, with index number 0 (zero). The back-end storage is a ZFS pool called demo-ost0pool comprising a single RAIDZ2 vdev constructed from six physical devices; the command also creates a ZFS file system dataset called ost0 within the pool. Two server NIDs are supplied as service nodes for the OSS, 192.168.227.21@tcp1 and 192.168.227.22@tcp1, and there are NIDs for the MGS primary and failover hosts.

In the above command line, notice that custom ZFS properties are set using --mkfsoptions "recordsize=1024K -o compression=lz4". The syntax of this string needs some explanation.

The mkfs.lustre command takes the string provided as an argument to --mkfsoptions, prefixes the string with -o and inserts it into the zfs command line that creates the dataset. This can be observed in the output of the mkfs.lustre command.
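As an illustrative sketch of this expansion (the canmount and xattr options shown here are those emitted by mkfs.lustre in the example output later in this page; the exact ordering may vary between Lustre versions):

```shell
# Sketch: mkfs.lustre prefixes the --mkfsoptions string with "-o"
# and splices it into the generated "zfs create" command line.
mkfsoptions="recordsize=1024K -o compression=lz4"
echo "zfs create -o canmount=off -o xattr=sa -o ${mkfsoptions} demo-ost0pool/ost0"
# → zfs create -o canmount=off -o xattr=sa -o recordsize=1024K -o compression=lz4 demo-ost0pool/ost0
```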

If only one property is being altered, then just create the <name>=<value> pair string without any additional flags or arguments. However, if more than one property needs to be set, then separate each property with -o. For example:

--mkfsoptions="recordsize=1024K -o compression=lz4 -o mountpoint=none"

Refer to ZFS recordsize Property for recommendations on how to set the compression and recordsize properties appropriately.

The --failnode syntax is similar, but is used to define only a failover target for the storage service. For example:

[root@rh7z-oss1 system]# mkfs.lustre --ost \
>   --fsname demo --index 0 \
>   --mgsnode 192.168.227.11@tcp1 --mgsnode 192.168.227.12@tcp1 \
>   --failnode 192.168.227.22@tcp1 \
>   --backfstype=zfs \
>   demo-ost0pool/ost0 mirror sdb sdd

Note: When creating a ZFS-based OSD using only the mkfs.lustre command, it is not possible to set or change some properties of the zpool or its vdevs, such as the multihost and ashift properties. For this reason, it is highly recommended that the zpools be created independently of the mkfs.lustre command, as shown in the next section.

Formatting an OST using zpool and mkfs.lustre

To create a ZFS-based OST, create a zpool to contain the OST file system dataset, then use mkfs.lustre to create the actual file system dataset inside the zpool:

zpool create [-f] -o multihost=on \
  -O canmount=off \
  -O recordsize=<RK> \
  -O compression=lz4|on \
  [-o ashift=<n>] \
  -o cachefile=/etc/zfs/<zpool name>.spec | -o cachefile=none \
  <zpool name> \
  <zpool specification>

mkfs.lustre --ost \
  [--reformat] \
  --fsname <name> \
  --index <n> \
  --mgsnode <MGS NID> [--mgsnode <MGS NID> …] \
  [ --servicenode <NID> [--servicenode <NID> …]] \
  [ --failnode <NID> [--failnode <NID> …]] \
  --backfstype=zfs \
  [ --mkfsoptions <options> ] \
  <pool name>/<dataset>

For example:

# Create the zpool
# Pool will comprise a single RAIDZ2 vdev with 6 devices
# Recordsize is set to 1024K
# Compression is enabled and set to use lz4 algorithm
zpool create -o multihost=on \
  -O canmount=off \
  -O recordsize=1024K \
  -O compression=lz4 \
  -o cachefile=none \
  demo-ost0pool \
  raidz2 sda sdb sdc sdd sde sdf

# Format OST0 for Lustre file system "demo"
mkfs.lustre --ost \
  --fsname demo \
  --index 0 \
  --mgsnode 192.168.227.11@tcp1 --mgsnode 192.168.227.12@tcp1 \
  --servicenode 192.168.227.21@tcp1 \
  --servicenode 192.168.227.22@tcp1 \
  --backfstype=zfs \
  demo-ost0pool/ost0

The output from the above example will look something like this:

# The zpool command will not return output unless there are errors
[root@rh7z-oss1 system]# zpool create -o multihost=on \
>   -O canmount=off \
>   -O recordsize=1024K \
>   -O compression=lz4 \
>   -o cachefile=none \
>   demo-ost0pool \
>   raidz2 sda sdb sdc sdd sde sdf

[root@rh7z-oss1 lz]# mkfs.lustre --ost \
>   --fsname demo \
>   --index 0 \
>   --mgsnode 192.168.227.11@tcp1 --mgsnode 192.168.227.12@tcp1 \
>   --servicenode 192.168.227.21@tcp1 \
>   --servicenode 192.168.227.22@tcp1 \
>   --backfstype=zfs \
>   demo-ost0pool/ost0

   Permanent disk data:
Target:     demo:OST0000
Index:      0
Lustre FS:  demo
Mount type: zfs
Flags:      0x1062
              (OST first_time update no_primnode )
Persistent mount opts: 
Parameters:  mgsnode=192.168.227.11@tcp1:192.168.227.12@tcp1  failover.node=192.168.227.21@tcp1:192.168.227.22@tcp1

checking for existing Lustre data: not found
mkfs_cmd = zfs create -o canmount=off -o xattr=sa demo-ost0pool/ost0
Writing demo-ost0pool/ost0 properties
  lustre:version=1
  lustre:flags=4194
  lustre:index=0
  lustre:fsname=demo
  lustre:svname=demo:OST0000
  lustre:mgsnode=192.168.227.11@tcp1:192.168.227.12@tcp1
  lustre:failover.node=192.168.227.21@tcp1:192.168.227.22@tcp1

Refer to ZFS recordsize Property for recommendations on how to set the recordsize appropriately.

Use the zfs get command or tunefs.lustre to verify that the file system dataset has been formatted correctly. For example:

[root@rh7-oss1 ~]# zfs get all -s local,inherited
NAME                PROPERTY              VALUE                                    SOURCE
demo-ost0pool       recordsize            1M                                       local
demo-ost0pool       compression           lz4                                      local
demo-ost0pool       canmount              off                                      local
demo-ost0pool/ost0  recordsize            1M                                       inherited from demo-ost0pool
demo-ost0pool/ost0  compression           lz4                                      inherited from demo-ost0pool
demo-ost0pool/ost0  canmount              off                                      local
demo-ost0pool/ost0  xattr                 sa                                       local
demo-ost0pool/ost0  lustre:mgsnode        192.168.227.11@tcp1:192.168.227.12@tcp1  local
demo-ost0pool/ost0  lustre:flags          4194                                     local
demo-ost0pool/ost0  lustre:fsname         demo                                     local
demo-ost0pool/ost0  lustre:version        1                                        local
demo-ost0pool/ost0  lustre:failover.node  192.168.227.21@tcp1:192.168.227.22@tcp1  local
demo-ost0pool/ost0  lustre:index          0                                        local
demo-ost0pool/ost0  lustre:svname         demo:OST0000                             local
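For the tunefs.lustre route, a minimal command sketch (--dryrun prints the parameters stored on the target without modifying it):

```shell
# Read back the Lustre configuration recorded on the OSD; --dryrun
# guarantees nothing is written to the target.
tunefs.lustre --dryrun demo-ost0pool/ost0
```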

Starting the OSS Service

The mount command is used to start all Lustre storage services, including the OSS. The syntax is:

mount -t lustre [-o <options>] \
  <ldiskfs blockdev>|<zpool>/<dataset> <mount point>

The mount command syntax is very similar for both LDISKFS and ZFS storage targets. The main difference is the format of the path to the storage. For LDISKFS, the path will resolve to a block device, such as /dev/sda or /dev/mapper/mpatha, whereas for ZFS, the path resolves to a dataset in a zpool, e.g. demo-ost0pool/ost0.

The mount point directory must exist before the mount command is executed. The recommended convention for the mount point of the OST storage is /lustre/<fsname>/ost<n>, where <fsname> is the name of the file system and <n> is the index number of the OST.
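When provisioning several OSTs, the convention lends itself to scripting; a minimal sketch, assuming file system name demo and OST indexes 0 through 3 (each path would then be created with mkdir -p before mounting):

```shell
# Generate conventional OST mount point paths: /lustre/<fsname>/ost<n>
fsname="demo"
for index in 0 1 2 3; do
    echo "/lustre/${fsname}/ost${index}"
done
```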

The following example starts a ZFS-based OSS:

[root@rh7z-oss1 ~]# zfs list -o name,used,avail,refer
NAME                 USED  AVAIL  REFER
demo-ost0pool        173M  48.0G    19K
demo-ost0pool/ost0   173M  48.0G   173M

[root@rh7z-oss1 lz]# mkdir -p /lustre/demo/ost0
[root@rh7z-oss1 lz]# mount -t lustre demo-ost0pool/ost0 /lustre/demo/ost0

[root@rh7z-oss1 lz]# df -ht lustre
Filesystem           Size  Used Avail Use% Mounted on
demo-ost0pool/ost0   49G  2.9M   49G   1% /lustre/demo/ost0

As with all Lustre storage targets, only the mount -t lustre command can start Lustre services.

To verify that the OSS is running, check that the device has been mounted, then get the Lustre device list with lctl dl and review the running processes:

[root@rh7z-oss1 lz]# lctl dl
  0 UP osd-zfs demo-OST0000-osd demo-OST0000-osd_UUID 5
  1 UP mgc MGC192.168.227.11@tcp1 4106d169-ed51-cd92-3361-12800a73962d 5
  2 UP ost OSS OSS_uuid 3
  3 UP obdfilter demo-OST0000 demo-OST0000_UUID 7
  4 UP lwp demo-MDT0000-lwp-OST0000 demo-MDT0000-lwp-OST0000_UUID 5

[root@rh7z-oss1 lz]# ps -ef | awk '/ost/ && !/awk/'
root      1932     1  0 Apr04 ?        00:00:00 /usr/libexec/postfix/master -w
postfix   1984  1932  0 Apr04 ?        00:00:00 pickup -l -t unix -u
postfix   1985  1932  0 Apr04 ?        00:00:00 qmgr -l -t unix -u
root     24709     2  0 00:03 ?        00:00:00 [ll_ost00_000]
root     24710     2  0 00:03 ?        00:00:00 [ll_ost00_001]
root     24711     2  0 00:03 ?        00:00:00 [ll_ost00_002]
root     24712     2  0 00:03 ?        00:00:00 [ll_ost_create00]
root     24713     2  0 00:03 ?        00:00:00 [ll_ost_create00]
root     24714     2  0 00:03 ?        00:00:00 [ll_ost_io00_000]
root     24715     2  0 00:03 ?        00:00:00 [ll_ost_io00_001]
root     24716     2  0 00:03 ?        00:00:00 [ll_ost_io00_002]
root     24717     2  0 00:03 ?        00:00:00 [ll_ost_seq00_00]
root     24718     2  0 00:03 ?        00:00:00 [ll_ost_seq00_00]
root     24719     2  0 00:03 ?        00:00:00 [ll_ost_out00_00]
root     24720     2  0 00:03 ?        00:00:00 [ll_ost_out00_00]

Stopping the OSS Service

To stop a Lustre service, umount the corresponding target:

umount <mount point>

The mount point must correspond to the mount point used with the mount -t lustre command. For example:

[root@rh7z-oss1 lz]# df -ht lustre
Filesystem           Size  Used Avail Use% Mounted on
demo-ost0pool/ost0   49G  2.9M   49G   1% /lustre/demo/ost0
[root@rh7z-oss1 ~]# umount /lustre/demo/ost0
[root@rh7z-oss1 ~]# df -ht lustre
df: no file systems processed
[root@rh7z-oss1 ~]# lctl dl
[root@rh7z-oss1 ~]#

The regular umount command is the correct way to stop a given Lustre service and unmount the associated storage, for both LDISKFS and ZFS-based Lustre storage volumes.

Do not use the zfs unmount command to stop a Lustre service. Attempting to use zfs commands to unmount a storage target that is mounted as part of an active Lustre service will return an error.