Creating the Lustre Management Service (MGS)

From Lustre Wiki

MGT Formatted as an LDISKFS OSD

The syntax for creating an LDISKFS-based MGT is as follows:

mkfs.lustre --mgs \
  [ --reformat ] \
  [ --servicenode <NID> [--servicenode <NID> …]] \
  [ --failnode <NID> [--failnode <NID> …]] \
  [ --backfstype=ldiskfs] \
  [ --mkfsoptions <options> ] \
  <device path>

The next example uses the --servicenode syntax to create an MGT that can be run on two servers as a high availability failover resource:

[root@rh7z-mds1 ~]# mkfs.lustre --mgs \
  --servicenode 192.168.227.11@tcp1 \
  --servicenode 192.168.227.12@tcp1 \
  --backfstype=ldiskfs \
  /dev/dm-1

This command formats a new MGT that the MGS will use for storage. Two server NIDs, 192.168.227.11@tcp1 and 192.168.227.12@tcp1, are supplied as service nodes for the MGS.

The --failnode syntax is similar, but is used to define only a failover target for the storage service. For example:

[root@rh7z-mds1 ~]# mkfs.lustre --mgs \
  --failnode 192.168.227.12@tcp1 \
  --backfstype=ldiskfs \
  /dev/dm-1

Here, the failover host is identified as 192.168.227.12@tcp1, one server in an HA pair (and which, for the purpose of this example, has the hostname rh7z-mds2). The mkfs.lustre command was executed on rh7z-mds1 (NID: 192.168.227.11@tcp1), and the mount command must also be run from rh7z-mds1 when the MGS service starts for the very first time.
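After formatting, the parameters recorded on the target can be reviewed without modifying it by running tunefs.lustre in dry-run mode. A minimal sketch, using the same device path as the example above:

```shell
# Print the Lustre configuration stored on the formatted MGT.
# --dryrun makes no changes to the target; it only reports the
# current parameters (service nodes, backing fstype, flags).
tunefs.lustre --dryrun /dev/dm-1
```

This is a convenient sanity check that the service node NIDs were recorded as intended before the target is mounted for the first time.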

MGT Formatted as a ZFS OSD

Formatting the MGT using only the mkfs.lustre command

Note: For the greatest flexibility and control when creating ZFS-based Lustre storage targets, do not use this approach – instead, create the zpool separately from formatting the Lustre OSD. See Formatting the MGT using zpool and mkfs.lustre.

The syntax for creating a ZFS-based MGT using only the mkfs.lustre command is as follows:

mkfs.lustre --mgs \
  [ --reformat ] \
  [ --servicenode <NID> [--servicenode <NID> …]] \
  [ --failnode <NID> [--failnode <NID> …]] \
  --backfstype=zfs \
  [ --mkfsoptions <options> ] \
  <zpool>/<dataset> <zpool specification>

This example uses the --servicenode syntax to create an MGT that can be run on two servers as a high availability failover resource:

[root@rh7z-mds1 ~]# mkfs.lustre --mgs \
  --servicenode 192.168.227.11@tcp1 \
  --servicenode 192.168.227.12@tcp1 \
  --backfstype=zfs \
  mgspool/mgt mirror sda sdc

This command formats a new MGT that the MGS will use for storage. It also defines a mirrored zpool called mgspool (consisting of two devices) and creates a ZFS dataset called mgt within it. Two server NIDs, 192.168.227.11@tcp1 and 192.168.227.12@tcp1, are supplied as service nodes for the MGS.

The --failnode syntax is similar, but is used to define only a failover target for the storage service. For example:

[root@rh7z-mds1 ~]# mkfs.lustre --mgs \
  --failnode 192.168.227.12@tcp1 \
  --backfstype=zfs \
  mgspool/mgt mirror sda sdc

As with the LDISKFS example in the previous section, the failover host is identified as 192.168.227.12@tcp1, one server in an HA pair (and which has the hostname rh7z-mds2). The mkfs.lustre command was executed on rh7z-mds1 (NID: 192.168.227.11@tcp1), and the mount command must also be run from rh7z-mds1 when the MGS service starts for the very first time.

Note that when creating a ZFS-based OSD using only the mkfs.lustre command, it is not possible to set or change some properties of the zpool or its vdevs, such as the multihost and ashift properties. For this reason, it is highly recommended that the zpools be created independently of the mkfs.lustre command, as shown in the next section.

Formatting the MGT using zpool and mkfs.lustre

To create a ZFS-based MGT, first create a zpool to contain the MGT file system dataset:

zpool create [-f] -O canmount=off -o multihost=on \
  [-o ashift=<n>] \
  -o cachefile=/etc/zfs/<zpool name>.spec | -o cachefile=none \
  <zpool name> <zpool specification>

Next, use mkfs.lustre to actually create the file system inside the zpool:

mkfs.lustre --mgs \
  [ --reformat ] \
  [ --servicenode <NID> [--servicenode <NID> …]] \
  [ --failnode <NID> [--failnode <NID> …]] \
  --backfstype=zfs \
  [ --mkfsoptions <options> ] \
  <pool name>/<dataset>

For example:

# Create the zpool
zpool create -O canmount=off -o multihost=on \
  -o cachefile=none \
  mgspool mirror sda sdc

# Format the Lustre MGT 
mkfs.lustre --mgs \
  --servicenode 192.168.227.11@tcp1 \
  --servicenode 192.168.227.12@tcp1 \
  --backfstype=zfs \
  mgspool/mgt

Use the zfs get command to retrieve comprehensive metadata information about the file system dataset and to confirm that the Lustre properties have been set correctly:

zfs get all | awk '$2 ~ /lustre/'

Alternatively, use the following command to retrieve all properties that have been explicitly set (which may be a larger list than just the lustre: properties):

zfs get all -s local

An MGT example:

[root@rh7z-mds1 ~]# zfs get all -s local
NAME         PROPERTY              VALUE                                    SOURCE
mgspool      canmount              off                                      local
mgspool/mgt  canmount              off                                      local
mgspool/mgt  xattr                 sa                                       local
mgspool/mgt  lustre:version        1                                        local
mgspool/mgt  lustre:index          65535                                    local
mgspool/mgt  lustre:failover.node  192.168.227.11@tcp1:192.168.227.12@tcp1  local
mgspool/mgt  lustre:svname         MGS                                      local
mgspool/mgt  lustre:flags          4132                                     local

Starting the MGS Service

The mount command is used to start all Lustre storage services, including the MGS. Therefore, to start up the MGS, one must mount the MGT on the server.

The syntax is:

mount -t lustre [-o <options>] \
  <ldiskfs blockdev>|<zpool>/<dataset> <mount point>

The mount command syntax is very similar for both LDISKFS and ZFS MGT storage targets. The main difference is the format of the path to the storage. For LDISKFS, the path will resolve to a block device, such as /dev/sda or /dev/mapper/mpatha, whereas for ZFS, the path resolves to a dataset in a zpool, e.g., mgspool/mgt.

The mount point directory must exist before the mount command is executed. The recommended convention for the mount point of the MGT storage is /lustre/mgt.

The following example starts a ZFS-based MGT:

# Ignore MOUNTPOINT column in output: not used by Lustre
[root@rh7z-mds1 ~]# zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
mgspool      1.67M   974M    19K  /mgspool
mgspool/mgt  1.59M   974M  1.59M  /mgspool/mgt

[root@rh7z-mds1 ~]# mkdir -p /lustre/mgt

[root@rh7z-mds1 ~]# mount -t lustre mgspool/mgt /lustre/mgt

[root@rh7z-mds1 ~]# df -ht lustre
Filesystem      Size  Used Avail Use% Mounted on
mgspool/mgt     960M  1.7M  957M   1% /lustre/mgt

Note that the default output of zfs list shows mount points for the MGS pool and MGT dataset in the MOUNTPOINT column. This column can be ignored: the referenced mount points are not used for mounting Lustre ZFS OSDs. Instead, the mount point is created explicitly by the administrator, just as for an LDISKFS-based storage target or any other regular file system mount point.

To reduce confusion, the ZFS file system mountpoint property can be set to none. For example:

zfs set mountpoint=none mgspool
zfs set mountpoint=none mgspool/mgt
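The change can be confirmed with zfs get. A quick check, using the pool and dataset names from the example:

```shell
# Show the mountpoint property for the pool and the MGT dataset.
# -s local restricts the output to values that were explicitly set,
# so both entries should now report 'none' with SOURCE 'local'.
zfs get -s local mountpoint mgspool mgspool/mgt
```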

Note: Only the mount -t lustre command can start Lustre services. Mounting storage as type ldiskfs or zfs will mount a storage target on the host, but it will not trigger the startup of the Lustre kernel processes.

To verify that the MGS is running, check that the device has been mounted, then get the Lustre device list with lctl dl and review the running processes:

[root@rh7z-mds1 lustre]# df -ht lustre
Filesystem      Size  Used Avail Use% Mounted on
mgspool/mgt     960M  2.0M  956M   1% /lustre/mgt

[root@rh7z-mds1 ~]# lctl dl
  0 UP osd-zfs MGS-osd MGS-osd_UUID 5
  1 UP mgs MGS MGS 5
  2 UP mgc MGC192.168.227.11@tcp1 5d62a612-f872-09a4-7da8-4ce562af6e0c 5

[root@rh7z-mds1 ~]# ps -ef | awk '/mgs/ && !/awk/'
root     15162     2  0 02:44 ?        00:00:00 [mgs_params_noti]
root     15163     2  0 02:44 ?        00:00:00 [ll_mgs_0000]
root     15164     2  0 02:44 ?        00:00:00 [ll_mgs_0001]
root     15165     2  0 02:44 ?        00:00:00 [ll_mgs_0002]

Stopping the MGS Service

To stop a Lustre service, run umount on the corresponding target:

umount <mount point>

The mount point must correspond to the mount point used with the mount -t lustre command. For example:

[root@rh7z-mds1 ~]# df -ht lustre
Filesystem      Size  Used Avail Use% Mounted on
mgspool/mgt     960M  2.0M  956M   1% /lustre/mgt
[root@rh7z-mds1 ~]# umount /lustre/mgt
[root@rh7z-mds1 ~]# df -ht lustre
df: no file systems processed
[root@rh7z-mds1 ~]# lctl dl
[root@rh7z-mds1 ~]#

Using the regular umount command is the correct way to stop a given Lustre service and unmount the associated storage, for both LDISKFS and ZFS-based Lustre storage volumes.

Do not use the zfs unmount command to stop a Lustre service. Attempting to use zfs commands to unmount a storage target that is mounted as part of an active Lustre service will return an error:

[root@rh7z-mds1 ~]# lctl dl
  0 UP osd-zfs MGS-osd MGS-osd_UUID 5
  1 UP mgs MGS MGS 5
  2 UP mgc MGC192.168.227.11@tcp1 be9fad27-107b-d165-8494-9a723b90e863 5

[root@rh7z-mds1 ~]# mount -t lustre
mgspool/mgt on /lustre/mgt type lustre (ro)

[root@rh7z-mds1 ~]# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
mgspool       2.05M   974M    19K  /mgspool
mgspool/mgt   1.97M   974M  1.97M  /mgspool/mgt

[root@rh7z-mds1 ~]# zpool status
  pool: mgspool
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	mgspool     ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sda     ONLINE       0     0     0
	    sdc     ONLINE       0     0     0

errors: No known data errors

[root@rh7z-mds1 ~]# zfs unmount mgspool/mgt
cannot unmount 'mgspool/mgt': not currently mounted

[root@rh7z-mds1 ~]# zfs unmount /lustre/mgt
cannot unmount '/lustre/mgt': not a ZFS file system

In the example, the MGS is up and running on a host, and the MGT storage is formatted as a ZFS dataset in a mirrored zpool. The service is online and the storage is mounted with the Lustre file system type. When an attempt is made to use the zfs command to unmount the volume, the command fails, regardless of whether <zpool>/<dataset> or the mount point is used as the reference to the storage volume.

These examples are provided to reinforce the point that many of the Lustre server management tools are the same whether LDISKFS or ZFS is used for the underlying storage. Of course there are storage-level differences, but where possible, the Lustre tools are common to both storage target formats.