Creating the Lustre Metadata Service (MDS)

Syntax Overview
The syntax for creating an MDT is:

 mkfs.lustre --mdt \
   [--reformat] \
   --fsname <name> \
   --index <n> \
   --mgsnode <MGS NID> [--mgsnode <MGS NID> …] \
   [ --servicenode <NID> [--servicenode <NID> …]] \
   [ --failnode <NID> [--failnode <NID> …]] \
   [ --backfstype=ldiskfs|zfs ] \
   [ --mkfsoptions <options> ] \
   <device path> | <zpool name>/<dataset name>

The MDT must reference the NIDs of the MGS, which are supplied with the --mgsnode flag. If there is more than one potential location for the MGS (i.e., it is part of a high availability failover cluster configuration), then the option is repeated for as many failover nodes as are configured (usually there are two).

The ordering of the MGS nodes on the command line is significant: the first --mgsnode flag must reference the NID of the current active or primary MGS server. If this is not the case, the MDS will fail the first time it tries to join the cluster, because the first-time mount of a storage target does not currently check the failover locations when trying to establish a connection with the MGS. When adding new storage targets to Lustre, the MGS must be running on its primary NID.
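When the command line is generated by a script, the ordering rule can be enforced mechanically. The following is a minimal sketch, not part of the Lustre tools: the helper name mgsnode_args is hypothetical, and it only prints the --mgsnode arguments, placing the primary MGS NID first and dropping any repeat of that NID from the remaining list.

```shell
#!/bin/sh
# Sketch only: mgsnode_args is a hypothetical helper, not a Lustre command.
# Usage: mgsnode_args <primary MGS NID> [<other MGS NID> ...]
# Prints the --mgsnode arguments with the primary (active) MGS NID first.
mgsnode_args() (
    primary=$1
    shift
    args="--mgsnode $primary"
    for nid in "$@"; do
        # Skip the primary if it also appears among the failover NIDs.
        if [ "$nid" != "$primary" ]; then
            args="$args --mgsnode $nid"
        fi
    done
    printf '%s\n' "$args"
)

# Example with the NIDs used in this document; the primary is given first.
mgsnode_args 192.168.227.11@tcp1 192.168.227.12@tcp1 192.168.227.11@tcp1
```

The output can be spliced into a larger mkfs.lustre command line; the function itself never formats anything.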

The MDT must also be supplied with the name of the Lustre file system (--fsname, maximum 8 characters) and an index number (--index) that is unique within the file system. There must always be an MDT with index 0 (zero) for each file system, representing the root of the file system tree. For many Lustre file systems, a single MDT (referred to as MDT0) is sufficient.
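The naming constraints can be checked before any device is formatted. The sketch below is illustrative only: valid_fsname and mdt_target_name are hypothetical helper names, the length check reflects the 8 character limit stated above, and the target label format (file system name, then MDT, then the index as four hexadecimal digits) is an assumption based on labels such as demo-MDT0000.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical, not Lustre commands.

# Succeeds if the file system name is 1 to 8 characters long.
valid_fsname() {
    name=$1
    len=${#name}
    [ "$len" -ge 1 ] && [ "$len" -le 8 ]
}

# Prints the MDT target label for a file system name and index.
# Assumption: the index is rendered as four hexadecimal digits.
mdt_target_name() {
    printf '%s-MDT%04x\n' "$1" "$2"
}

# Example: the file system "demo" with MDT index 0.
if valid_fsname demo; then
    mdt_target_name demo 0
fi
```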

The list of service nodes (--servicenode) or failover nodes (--failnode) must be specified for any high availability configuration. Although there are more compact declarations for defining the nodes, for simplicity, list the NID of each server that can mount the storage as a separate --servicenode entry.

The next example uses the --servicenode syntax to create an MDT that can be run on two servers as an HA failover resource:

 [root@rh7z-mds2 system]# mkfs.lustre --mdt \
 >  --fsname demo \
 >  --index 0 \
 >  --mgsnode 192.168.227.11@tcp1 \
 >  --mgsnode 192.168.227.12@tcp1 \
 >  --servicenode 192.168.227.12@tcp1 \
 >  --servicenode 192.168.227.11@tcp1 \
 >  /dev/dm-2

The command line formats a new MDT that will be used by the MDS for storage. The MDT will provide metadata for a file system called demo, and has index number 0 (zero). There are two NIDs defined as the nodes able to host the MDS service, denoted by the --servicenode options, and two NIDs supplied for the MGS that the MDS will register with.
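For scripted deployments it can help to assemble and review the command line before running it. The following sketch is a dry run under stated assumptions: build_mdt_mkfs is a hypothetical helper that only prints an mkfs.lustre command built from the flags described above, it formats nothing, and it expects the NID lists as space-separated strings with the primary MGS NID first.

```shell
#!/bin/sh
# Sketch only: build_mdt_mkfs is a hypothetical helper. It prints a command
# line and never executes mkfs.lustre.
# Usage: build_mdt_mkfs <fsname> <index> <device> "<MGS NIDs>" "<service NIDs>"
build_mdt_mkfs() (
    fsname=$1
    index=$2
    device=$3
    cmd="mkfs.lustre --mdt --fsname $fsname --index $index"
    # $4 and $5 are intentionally unquoted so each NID becomes one word.
    for nid in $4; do
        cmd="$cmd --mgsnode $nid"
    done
    for nid in $5; do
        cmd="$cmd --servicenode $nid"
    done
    printf '%s %s\n' "$cmd" "$device"
)

# Reproduces the example above as a single line for review.
build_mdt_mkfs demo 0 /dev/dm-2 \
    "192.168.227.11@tcp1 192.168.227.12@tcp1" \
    "192.168.227.12@tcp1 192.168.227.11@tcp1"
```

Printing rather than executing keeps the destructive step (formatting the device) under the administrator's direct control.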

The --failnode syntax is similar, but is used to define only a failover target for the storage service. For example:

 [root@rh7z-mds2 system]# mkfs.lustre --mdt \
 >  --fsname demo \
 >  --index 0 \
 >  --mgsnode 192.168.227.11@tcp1 \
 >  --mgsnode 192.168.227.12@tcp1 \
 >  --failnode 192.168.227.11@tcp1 \
 >  /dev/dm-2

Here, the failover host is identified as 192.168.227.11@tcp1, one server in an HA pair. The mkfs.lustre command was executed on rh7z-mds2 (NID: 192.168.227.12@tcp1), and the mount command must also be run from this host when the service starts for the very first time.

MDT Formatted as an LDISKFS OSD
The syntax for creating an LDISKFS-based MDT is:

 mkfs.lustre --mdt \
   [--reformat] \
   --fsname <name> \
   --index <n> \
   --mgsnode <MGS NID> [--mgsnode <MGS NID> …] \
   [ --servicenode <NID> [--servicenode <NID> …]] \
   [ --failnode <NID> [--failnode <NID> …]] \
   [ --backfstype=ldiskfs ] \
   [ --mkfsoptions <options> ] \
   <device path>

The following example uses the --servicenode syntax to create an MDT that can be run on two servers as an HA failover resource:

<pre style="overflow-x:auto;">
[root@rh7z-mds2 system]# mkfs.lustre --mdt \
>  --fsname demo \
>  --index 0 \
>  --mgsnode 192.168.227.11@tcp1 \
>  --mgsnode 192.168.227.12@tcp1 \
>  --servicenode 192.168.227.12@tcp1 \
>  --servicenode 192.168.227.11@tcp1 \
>  --backfstype=ldiskfs \
>  /dev/dm-2
</pre>

The --failnode syntax is similar, but is used to define only a failover target for the storage service. For example:

<pre style="overflow-x:auto;">
[root@rh7z-mds2 system]# mkfs.lustre --mdt \
>  --fsname demo --index 0 \
>  --mgsnode 192.168.227.11@tcp1 --mgsnode 192.168.227.12@tcp1 \
>  --failnode 192.168.227.11@tcp1 \
>  --backfstype=ldiskfs \
>  /dev/dm-2
</pre>

The above examples are repeated from the main introduction to the syntax, and are included here to maintain symmetry with the rest of the text. Note, however, that the --backfstype flag has been set to ldiskfs, which tells mkfs.lustre to format the device as an LDISKFS OSD.

Formatting an MDT using only the mkfs.lustre command
Note: For the greatest flexibility and control when creating ZFS-based Lustre storage targets, do not use this approach – instead, create the zpool separately from formatting the Lustre OSD. See Formatting an MDT using zpool and mkfs.lustre.

The syntax for creating a ZFS-based MDT using only the mkfs.lustre command is:

<pre style="overflow-x:auto;">
mkfs.lustre --mdt \
  [--reformat] \
  --fsname <name> \
  --index <n> \
  --mgsnode <MGS NID> [--mgsnode <MGS NID> …] \
  [ --servicenode <NID> [--servicenode <NID> …]] \
  [ --failnode <NID> [--failnode <NID> …]] \
  --backfstype=zfs \
  [ --mkfsoptions <options> ] \
  <zpool name>/<dataset name> <zpool specification>
</pre>

The following example uses the --servicenode syntax to create an MDT that can be run on two servers as an HA failover resource:

<pre style="overflow-x:auto;">
[root@rh7z-mds2 system]# mkfs.lustre --mdt \
>  --fsname demo \
>  --index 0 \
>  --mgsnode 192.168.227.11@tcp1 --mgsnode 192.168.227.12@tcp1 \
>  --servicenode 192.168.227.12@tcp1 \
>  --servicenode 192.168.227.11@tcp1 \
>  --backfstype=zfs \
>  demo-mdt0pool/mdt0 \
>  mirror sdb sdd
</pre>

In addition to defining the parameters of the MDT service itself, the command defines a mirrored ZFS zpool called demo-mdt0pool consisting of two devices, and creates a ZFS dataset called mdt0. Normally, the MDT is created from a larger pool of storage, to maximize performance and meet capacity requirements; the above example is provided only to outline the command line syntax.

The --failnode syntax is similar, but is used to define only a failover target for the storage service. For example:

<pre style="overflow-x:auto;">
[root@rh7z-mds2 system]# mkfs.lustre --mdt \
>  --fsname demo --index 0 \
>  --mgsnode 192.168.227.11@tcp1 --mgsnode 192.168.227.12@tcp1 \
>  --failnode 192.168.227.11@tcp1 \
>  --backfstype=zfs \
>  demo-mdt0pool/mdt0 mirror sdb sdd
</pre>

Note: When creating a ZFS-based OSD using only the mkfs.lustre command, it is not possible to set or change some properties of the zpool or its vdevs, such as the ashift property. For this reason, it is highly recommended that the zpools be created independently of the mkfs.lustre command, as shown in the next section.

Formatting an MDT using zpool and mkfs.lustre
To create a ZFS-based MDT, create a zpool to contain the MDT file system dataset, then use mkfs.lustre to create the actual file system dataset inside the zpool:

<pre style="overflow-x:auto;">
zpool create [-f] -O canmount=off \
  [-o ashift=<n>] \
  -o cachefile=/etc/zfs/<zpool name>.spec | -o cachefile=none \
  <zpool name> <zpool specification>

mkfs.lustre --mdt \
  [--reformat] \
  --fsname <name> \
  --index <n> \
  --mgsnode <MGS NID> [--mgsnode <MGS NID> …] \
  [ --servicenode <NID> [--servicenode <NID> …]] \
  [ --failnode <NID> [--failnode <NID> …]] \
  --backfstype=zfs \
  [ --mkfsoptions <options> ] \
  <zpool name>/<dataset name>
</pre>

Use the zfs get command to verify that the file system dataset has been formatted correctly. For example:

<pre style="overflow-x:auto;">
[root@rh7z-mds2 ~]# zfs get all -s local
NAME               PROPERTY              VALUE                                    SOURCE
demo-mdt0pool      canmount              off                                      local
demo-mdt0pool/mdt0 canmount              off                                      local
demo-mdt0pool/mdt0 xattr                 sa                                       local
demo-mdt0pool/mdt0 lustre:svname         demo-MDT0000                             local
demo-mdt0pool/mdt0 lustre:flags          4129                                     local
demo-mdt0pool/mdt0 lustre:failover.node  192.168.227.12@tcp1:192.168.227.11@tcp1  local
demo-mdt0pool/mdt0 lustre:version        1                                        local
demo-mdt0pool/mdt0 lustre:mgsnode        192.168.227.11@tcp1:192.168.227.12@tcp1  local
demo-mdt0pool/mdt0 lustre:fsname         demo                                     local
demo-mdt0pool/mdt0 lustre:index          0                                        local
</pre>
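A script can make the same check by pulling individual properties out of the table. The sketch below is one possible approach, not an official tool: zfs_prop_value is a hypothetical helper that reads zfs get output on standard input and prints the VALUE column for a given dataset and property, assuming the value contains no whitespace. The sample rows are copied from the output above so the sketch runs without a ZFS pool.

```shell
#!/bin/sh
# Sketch only: zfs_prop_value is a hypothetical helper.
# Reads "zfs get" table output on stdin and prints the VALUE column for the
# given dataset and property. Assumes the value contains no whitespace.
zfs_prop_value() {
    awk -v ds="$1" -v prop="$2" '$1 == ds && $2 == prop { print $3 }'
}

# Sample rows copied from the zfs get output shown in this section.
sample='demo-mdt0pool/mdt0 lustre:svname         demo-MDT0000                             local
demo-mdt0pool/mdt0 lustre:fsname         demo                                     local
demo-mdt0pool/mdt0 lustre:index          0                                        local'

printf '%s\n' "$sample" | zfs_prop_value demo-mdt0pool/mdt0 lustre:svname

# On a live server the input would come from the real command instead:
#   zfs get all -s local | zfs_prop_value demo-mdt0pool/mdt0 lustre:svname
```

Where only one property is needed, zfs get -H -o value lustre:svname demo-mdt0pool/mdt0 returns the bare value directly and avoids parsing the table.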

Starting the MDS Service
The mount command is used to start all Lustre storage services, including the MDS. The syntax is:

<pre style="overflow-x:auto;">
mount -t lustre [-o <options>] \
  <device path> | <zpool name>/<dataset name> <mount point>
</pre>

The mount command syntax is very similar for both LDISKFS and ZFS storage targets. The main difference is the format of the path to the storage. For LDISKFS, the path will resolve to a block device, such as /dev/dm-2, whereas for ZFS, the path resolves to a dataset in a zpool, e.g. demo-mdt0pool/mdt0.

The mount point directory must exist before the mount command is executed. The recommended convention for the mount point of the MDT storage is /lustre/<fsname>/mdt<n>, where <fsname> is the name of the file system and <n> is the index number of the MDT.
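The convention is easy to encode in a start-up script. The following is a minimal sketch: mdt_mount_point is a hypothetical helper, and the /lustre/<fsname>/mdt<n> layout it prints is taken from the mount point used in this document's examples (/lustre/demo/mdt0). The commands that touch the system are left as comments.

```shell
#!/bin/sh
# Sketch only: mdt_mount_point is a hypothetical helper.
# Prints the conventional mount point for an MDT: /lustre/<fsname>/mdt<n>.
mdt_mount_point() {
    printf '/lustre/%s/mdt%s\n' "$1" "$2"
}

mp=$(mdt_mount_point demo 0)
echo "$mp"

# The directory must exist before mounting, so a start script would run:
#   mkdir -p "$mp"
#   mount -t lustre demo-mdt0pool/mdt0 "$mp"
```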

The following example starts a ZFS-based MDS:

<pre style="overflow-x:auto;">
[root@rh7z-mds2 ~]# zfs list
NAME                USED  AVAIL  REFER  MOUNTPOINT
demo-mdt0pool      2.87M  9.62G    19K  /demo-mdt0pool
demo-mdt0pool/mdt0 2.79M  9.62G  2.79M  /demo-mdt0pool/mdt0
# Ignore MOUNTPOINT column in output: not used by Lustre

[root@rh7z-mds2 ~]# mkdir -p /lustre/demo/mdt0

[root@rh7z-mds2 ~]# mount -t lustre demo-mdt0pool/mdt0 /lustre/demo/mdt0

[root@rh7z-mds2 ~]# df -ht lustre
File system         Size  Used Avail Use% Mounted on
demo-mdt0pool/mdt0  9.7G  2.8M  9.7G   1% /lustre/demo/mdt0
</pre>

Note: The default output for zfs list shows mount points for the demo-mdt0pool pool and the demo-mdt0pool/mdt0 dataset in the MOUNTPOINT column. As for all ZFS-formatted OSDs, the content in this column can be ignored.

To reduce confusion, the ZFS file system mountpoint property can be set to none. For example:

<pre style="overflow-x:auto;">
zfs set mountpoint=none demo-mdt0pool
zfs set mountpoint=none demo-mdt0pool/mdt0
</pre>

Note: Only the mount -t lustre command can start Lustre services. Mounting storage as type ldiskfs or zfs will mount a storage target on the host, but it will not trigger the startup of the requisite Lustre kernel processes.

To verify that the MDS is running, check that the MDT device has been mounted, then get the Lustre device list with lctl dl, and review the running processes:

<pre style="overflow-x:auto;">
[root@rh7z-mds2 ~]# df -ht lustre
File system         Size  Used Avail Use% Mounted on
demo-mdt0pool/mdt0  9.7G  2.8M  9.7G   1% /lustre/demo/mdt0

[root@rh7z-mds2 ~]# lctl dl
  0 UP osd-zfs demo-MDT0000-osd demo-MDT0000-osd_UUID 7
  1 UP mgc MGC192.168.227.11@tcp1 1605562b-d702-9251-6f38-1fd4a64e2720 5
  2 UP mds MDS MDS_uuid 3
  3 UP lod demo-MDT0000-mdtlov demo-MDT0000-mdtlov_UUID 4
  4 UP mdt demo-MDT0000 demo-MDT0000_UUID 5
  5 UP mdd demo-MDD0000 demo-MDD0000_UUID 4
  6 UP qmt demo-QMT0000 demo-QMT0000_UUID 4
  7 UP lwp demo-MDT0000-lwp-MDT0000 demo-MDT0000-lwp-MDT0000_UUID 5

[root@rh7z-mds2 ~]# ps -ef | awk '/mdt/ && !/awk/'
root    32320     2  0 Mar30 ? 00:00:00 [mdt00_000]
root    32321     2  0 Mar30 ? 00:00:00 [mdt00_001]
root    32322     2  0 Mar30 ? 00:00:00 [mdt00_002]
root    32323     2  0 Mar30 ? 00:00:00 [mdt_rdpg00_000]
root    32324     2  0 Mar30 ? 00:00:00 [mdt_rdpg00_001]
root    32325     2  0 Mar30 ? 00:00:00 [mdt_attr00_000]
root    32326     2  0 Mar30 ? 00:00:00 [mdt_attr00_001]
root    32327     2  0 Mar30 ? 00:00:00 [mdt_out00_000]
root    32328     2  0 Mar30 ? 00:00:00 [mdt_out00_001]
root    32329     2  0 Mar30 ? 00:00:00 [mdt_seqs_0000]
root    32330     2  0 Mar30 ? 00:00:00 [mdt_seqs_0001]
root    32331     2  0 Mar30 ? 00:00:00 [mdt_seqm_0000]
root    32332     2  0 Mar30 ? 00:00:00 [mdt_seqm_0001]
root    32333     2  0 Mar30 ? 00:00:00 [mdt_fld_0000]
root    32334     2  0 Mar30 ? 00:00:00 [mdt_fld_0001]
root    32340     2  0 Mar30 ? 00:00:00 [mdt_ck]
</pre>
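The device list check lends itself to automation, for example in a monitoring probe. The sketch below is an illustration under stated assumptions: mdt_is_up is a hypothetical helper, and it assumes the column layout seen in the lctl dl output above (index, state, type, name, UUID, reference count). It reads the device list on standard input so that it can be exercised here with a captured sample instead of a live server.

```shell
#!/bin/sh
# Sketch only: mdt_is_up is a hypothetical helper.
# Reads "lctl dl" output on stdin; succeeds if a device of type "mdt" with
# the given name is listed in state UP.
mdt_is_up() {
    awk -v name="$1" '
        $2 == "UP" && $3 == "mdt" && $4 == name { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Two rows copied from the lctl dl output shown in this section.
sample='  3 UP lod demo-MDT0000-mdtlov demo-MDT0000-mdtlov_UUID 4
  4 UP mdt demo-MDT0000 demo-MDT0000_UUID 5'

if printf '%s\n' "$sample" | mdt_is_up demo-MDT0000; then
    echo "demo-MDT0000 is UP"
fi

# On a live MDS the check would be:
#   lctl dl | mdt_is_up demo-MDT0000
```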

Stopping the MDS Service
To stop a Lustre service, umount the corresponding target:

<pre style="overflow-x:auto;">
umount <mount point>
</pre>

The mount point must correspond to the mount point used with the mount command. For example:

<pre style="overflow-x:auto;">
[root@rh7z-mds2 ~]# df -ht lustre
File system         Size  Used Avail Use% Mounted on
demo-mdt0pool/mdt0  9.7G  2.8M  9.7G   1% /lustre/demo/mdt0
[root@rh7z-mds2 ~]# umount /lustre/demo/mdt0
[root@rh7z-mds2 ~]# df -ht lustre
df: no file systems processed
[root@rh7z-mds2 ~]# lctl dl
[root@rh7z-mds2 ~]#
</pre>

Using the regular umount command is the correct way to stop a given Lustre service and unmount the associated storage, for both LDISKFS and ZFS-based Lustre storage volumes.

Do not use the zfs unmount command to stop a Lustre service. Attempting to use zfs commands to unmount a storage target that is mounted as part of an active Lustre service will return an error.
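A stop script can guard against unmounting the wrong path by first confirming that the mount point is of type lustre. The following is a sketch only: is_lustre_mount is a hypothetical helper, it expects mount table lines in the /proc/mounts layout (device, mount point, type, options), and the sample line, including its mount options, is illustrative rather than captured from a real server.

```shell
#!/bin/sh
# Sketch only: is_lustre_mount is a hypothetical helper.
# Reads a mount table in /proc/mounts layout on stdin; succeeds if the given
# mount point is mounted with file system type "lustre".
is_lustre_mount() {
    awk -v mp="$1" '
        $2 == mp && $3 == "lustre" { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Illustrative mount table entry (options field is an assumption).
sample='demo-mdt0pool/mdt0 /lustre/demo/mdt0 lustre ro 0 0'

if printf '%s\n' "$sample" | is_lustre_mount /lustre/demo/mdt0; then
    echo "safe to run: umount /lustre/demo/mdt0"
fi

# On a live server:
#   is_lustre_mount /lustre/demo/mdt0 < /proc/mounts && umount /lustre/demo/mdt0
```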