Creating the Lustre Metadata Service (MDS)
Revision as of 22:54, 30 August 2017
Syntax Overview
The syntax for creating an MDT is:
  mkfs.lustre --mdt \
    [--reformat] \
    --fsname <name> \
    --index <n> \
    --mgsnode <MGS NID> [--mgsnode <MGS NID> …] \
    [ --servicenode <NID> [--servicenode <NID> …]] \
    [ --failnode <NID> [--failnode <NID> …]] \
    [ --backfstype=ldiskfs|zfs ] \
    [ --mkfsoptions <options> ] \
    <device path> | <pool name>/<dataset> <zpool specification>
The command line syntax for formatting an MDT takes several parameters beyond those needed for the much simpler MGT. First, the metadata target requires a record of the NIDs that can provide the Lustre management service (MGS). The MGS NIDs are supplied using the --mgsnode flag. If there is more than one potential location for the MGS (i.e., it is part of a high availability failover cluster configuration), then the option is repeated for as many failover nodes as are configured (usually there are two).
Ordering of the MGS nodes in the command line is significant: the first --mgsnode flag must reference the NID of the current active or primary MGS server. If this is not the case, then the first time that the MDS tries to join the cluster, it will fail. The first-time mount of a storage target does not currently check the failover locations when trying to establish a connection with the MGS. When adding new storage targets to Lustre, the MGS must be running on its primary NID.
The MDT must also be supplied with the name (--fsname) of the Lustre file system (maximum 8 characters), and an index number (--index) that is unique within the file system. There must always be an MDT with index=0 (zero) for each file system, representing the root of the file system tree. For many Lustre file systems, a single MDT (referred to as MDT0) is sufficient.
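The naming constraints above can be checked before any device is formatted. The following is a minimal sketch, not part of the Lustre tools: mdt_label is a hypothetical helper name, and the label format (file system name, "-MDT", then the index as four hexadecimal digits) is an assumption based on the demo-MDT0000 device name shown later on this page.

```shell
# Hypothetical helper: validate an fsname/index pair before running mkfs.lustre.
# Prints the expected target label on success; returns non-zero otherwise.
mdt_label() {
    fsname=$1
    index=$2
    # The file system name must be non-empty and at most 8 characters long.
    if [ -z "$fsname" ] || [ "${#fsname}" -gt 8 ]; then
        echo "invalid fsname: '$fsname' (1-8 characters required)" >&2
        return 1
    fi
    # The index must be a non-negative decimal integer without leading zeros.
    case $index in
        ''|*[!0-9]*|0?*)
            echo "invalid index: '$index'" >&2
            return 1
            ;;
    esac
    # Assumption: label is <fsname>-MDT followed by the index as 4 hex digits.
    printf '%s-MDT%04x\n' "$fsname" "$index"
}

mdt_label demo 0
```

Running the check first catches an over-long name before mkfs.lustre is invoked against real storage.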
The list of service nodes (--servicenode) or failover nodes (--failnode) must be specified for any high availability configuration. Although there are more compact declarations for defining the nodes, for simplicity, list the NID of each server that can mount the storage as a separate --servicenode entry.
The next example uses the --servicenode syntax to create an MDT that can be run on two servers as an HA failover resource:
  [root@rh7z-mds2 system]# mkfs.lustre --mdt \
  > --fsname demo \
  > --index 0 \
  > --mgsnode 192.168.227.11@tcp1 \
  > --mgsnode 192.168.227.12@tcp1 \
  > --servicenode 192.168.227.12@tcp1 \
  > --servicenode 192.168.227.11@tcp1 \
  > /dev/dm-2
The command line formats a new MDT that will be used by the MDS for storage. The MDT will provide metadata for a file system called demo, and has index number 0 (zero). There are two NIDs defined as the nodes able to host the MDS service, denoted by the --servicenode options, and two NIDs supplied for the MGS that the MDS will register with.
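Because the order of the --mgsnode flags matters, it can help to assemble the command from explicit lists and review it before running it. The sketch below only prints a command line and never executes mkfs.lustre; mkfs_mdt_cmd is a hypothetical helper name, and it uses only the flags already shown in the example above.

```shell
# Hypothetical dry-run helper: print (do not run) an mkfs.lustre command for an MDT.
# Usage: mkfs_mdt_cmd <fsname> <index> <device> "<MGS NIDs>" "<service NIDs>"
# NID lists are space-separated; the first MGS NID must be the primary MGS.
mkfs_mdt_cmd() {
    fsname=$1
    index=$2
    device=$3
    mgs_nids=$4
    service_nids=$5
    cmd="mkfs.lustre --mdt --fsname $fsname --index $index"
    # One --mgsnode flag per MGS NID, in the order given.
    for nid in $mgs_nids; do
        cmd="$cmd --mgsnode $nid"
    done
    # One --servicenode flag per server able to mount the target.
    for nid in $service_nids; do
        cmd="$cmd --servicenode $nid"
    done
    printf '%s %s\n' "$cmd" "$device"
}

mkfs_mdt_cmd demo 0 /dev/dm-2 \
    "192.168.227.11@tcp1 192.168.227.12@tcp1" \
    "192.168.227.12@tcp1 192.168.227.11@tcp1"
```

Printing the command rather than running it keeps the helper safe to use on any host, since formatting a target is destructive.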
The --failnode syntax is similar, but is used to define only a failover target for the storage service. For example:
  [root@rh7z-mds2 system]# mkfs.lustre --mdt \
  > --fsname demo \
  > --index 0 \
  > --mgsnode 192.168.227.11@tcp1 \
  > --mgsnode 192.168.227.12@tcp1 \
  > --failnode 192.168.227.11@tcp1 \
  > /dev/dm-2
Here, the failover host is identified as 192.168.227.11@tcp1, one server in an HA pair (which, for the purpose of this example, has the hostname rh7z-mds1). The mkfs.lustre command was executed on rh7z-mds2 (NID: 192.168.227.12@tcp1), and the mount command must also be run from this host when the service starts for the very first time.
MDT Formatted as an LDISKFS OSD
The syntax for creating an LDISKFS-based MDT is:
  mkfs.lustre --mdt \
    [--reformat] \
    --fsname <name> \
    --index <n> \
    --mgsnode <MGS NID> [--mgsnode <MGS NID> …] \
    [ --servicenode <NID> [--servicenode <NID> …]] \
    [ --failnode <NID> [--failnode <NID> …]] \
    [ --backfstype=ldiskfs ] \
    [ --mkfsoptions <options> ] \
    <device path>
The following example uses the --servicenode syntax to create an MDT that can be run on two servers as an HA failover resource:
  [root@rh7z-mds2 system]# mkfs.lustre --mdt \
  > --fsname demo \
  > --index 0 \
  > --mgsnode 192.168.227.11@tcp1 \
  > --mgsnode 192.168.227.12@tcp1 \
  > --servicenode 192.168.227.12@tcp1 \
  > --servicenode 192.168.227.11@tcp1 \
  > --backfstype=ldiskfs \
  > /dev/dm-2
The --failnode syntax is similar, but is used to define only a failover target for the storage service. For example:
  [root@rh7z-mds2 system]# mkfs.lustre --mdt \
  > --fsname demo --index 0 \
  > --mgsnode 192.168.227.11@tcp1 --mgsnode 192.168.227.12@tcp1 \
  > --failnode 192.168.227.11@tcp1 \
  > --backfstype=ldiskfs \
  > /dev/dm-2
The above examples repeat those in the syntax overview and are included here for symmetry with the rest of the text. Note, however, that the --backfstype flag has been set to ldiskfs, which tells mkfs.lustre to format the device as an LDISKFS OSD.
MDT Formatted as a ZFS OSD
Formatting an MDT using only the mkfs.lustre command
Note: For the greatest flexibility and control when creating ZFS-based Lustre storage targets, do not use this approach – instead, create the zpool separately from formatting the Lustre OSD. See Formatting an MDT using zpool and mkfs.lustre.
The syntax for creating a ZFS-based MDT using only the mkfs.lustre command is:
  mkfs.lustre --mdt \
    [--reformat] \
    --fsname <name> \
    --index <n> \
    --mgsnode <MGS NID> [--mgsnode <MGS NID> …] \
    [ --servicenode <NID> [--servicenode <NID> …]] \
    [ --failnode <NID> [--failnode <NID> …]] \
    --backfstype=zfs \
    [ --mkfsoptions <options> ] \
    <pool name>/<dataset> \
    <zpool specification>
The next example uses the --servicenode syntax to create an MDT that can be run on two servers as an HA failover resource:
  [root@rh7z-mds2 system]# mkfs.lustre --mdt \
  > --fsname demo \
  > --index 0 \
  > --mgsnode 192.168.227.11@tcp1 --mgsnode 192.168.227.12@tcp1 \
  > --servicenode 192.168.227.12@tcp1 \
  > --servicenode 192.168.227.11@tcp1 \
  > --backfstype=zfs \
  > demo-mdt0pool/mdt0 \
  > mirror sdb sdd
In addition to defining the parameters of the MDT service itself, the command defines a mirrored ZFS zpool called demo-mdt0pool consisting of two devices, and creates a ZFS dataset called mdt0. Normally, the MDT is created from a larger pool of storage, to maximize performance and meet capacity requirements; the above example is provided only to outline the command line syntax.
The --failnode syntax is similar, but is used to define only a failover target for the storage service. For example:
  [root@rh7z-mds2 system]# mkfs.lustre --mdt \
  > --fsname demo --index 0 \
  > --mgsnode 192.168.227.11@tcp1 --mgsnode 192.168.227.12@tcp1 \
  > --failnode 192.168.227.11@tcp1 \
  > --backfstype=zfs \
  > demo-mdt0pool/mdt0 mirror sdb sdd
Note: When creating a ZFS-based OSD using only the mkfs.lustre command, it is not possible to set or change some properties of the zpool or its vdevs, such as the ashift property. For this reason, it is highly recommended that the zpools be created independently of the mkfs.lustre command, as shown in the next section.
Formatting an MDT using zpool and mkfs.lustre
To create a ZFS-based MDT, create a zpool to contain the MDT file system dataset, then use mkfs.lustre to create the actual file system dataset inside the zpool:
  zpool create [-f] -O canmount=off \
    [-o ashift=<n>] \
    -o cachefile=/etc/zfs/<zpool name>.spec | -o cachefile=none \
    <zpool name> \
    <zpool specification>

  mkfs.lustre --mdt \
    [--reformat] \
    --fsname <name> \
    --index <n> \
    --mgsnode <MGS NID> [--mgsnode <MGS NID> …] \
    [ --servicenode <NID> [--servicenode <NID> …]] \
    [ --failnode <NID> [--failnode <NID> …]] \
    --backfstype=zfs \
    [ --mkfsoptions <options> ] \
    <pool name>/<dataset>
For example:
  # Create the zpool
  # Pool will comprise 3 mirrors each with 2 devices.
  # Mirrors will be concatenated (striped).
  zpool create -O canmount=off \
    -o cachefile=none \
    demo-mdt0pool \
    mirror sdd sde mirror sdf sdg mirror sdh sdi

  # Format MDT0 for Lustre file system "demo"
  mkfs.lustre --mdt \
    --fsname demo \
    --index 0 \
    --mgsnode 192.168.227.11@tcp1 --mgsnode 192.168.227.12@tcp1 \
    --servicenode 192.168.227.12@tcp1 \
    --servicenode 192.168.227.11@tcp1 \
    --backfstype=zfs \
    demo-mdt0pool/mdt0
The output from the above example will look something like this:
  # The zpool command will not return output if there are no errors
  [root@rh7z-mds2 system]# zpool create -O canmount=off \
  > -o cachefile=none \
  > demo-mdt0pool \
  > mirror sdd sde mirror sdf sdg mirror sdh sdi
  [root@rh7z-mds2 system]# mkfs.lustre --mdt \
  > --fsname demo \
  > --index 0 \
  > --mgsnode 192.168.227.11@tcp1 --mgsnode 192.168.227.12@tcp1 \
  > --servicenode 192.168.227.12@tcp1 \
  > --servicenode 192.168.227.11@tcp1 \
  > --backfstype=zfs \
  > demo-mdt0pool/mdt0

     Permanent disk data:
  Target:     demo:MDT0000
  Index:      0
  Lustre FS:  demo
  Mount type: zfs
  Flags:      0x1061
                (MDT first_time update no_primnode )
  Persistent mount opts:
  Parameters: mgsnode=192.168.227.11@tcp1:192.168.227.12@tcp1 failover.node=192.168.227.12@tcp1:192.168.227.11@tcp1

  checking for existing Lustre data: not found
  mkfs_cmd = zfs create -o canmount=off -o xattr=sa demo-mdt0pool/mdt0
  Writing demo-mdt0pool/mdt0 properties
    lustre:version=1
    lustre:flags=4193
    lustre:index=0
    lustre:fsname=demo
    lustre:svname=demo:MDT0000
    lustre:mgsnode=192.168.227.11@tcp1:192.168.227.12@tcp1
    lustre:failover.node=192.168.227.12@tcp1:192.168.227.11@tcp1
Use the zfs get command or tunefs.lustre to verify that the file system dataset has been formatted correctly. For example:
  [root@rh7z-mds2 ~]# zfs get all -s local
  NAME                PROPERTY              VALUE                                    SOURCE
  demo-mdt0pool       canmount              off                                      local
  demo-mdt0pool/mdt0  canmount              off                                      local
  demo-mdt0pool/mdt0  xattr                 sa                                       local
  demo-mdt0pool/mdt0  lustre:svname         demo-MDT0000                             local
  demo-mdt0pool/mdt0  lustre:flags          4129                                     local
  demo-mdt0pool/mdt0  lustre:failover.node  192.168.227.12@tcp1:192.168.227.11@tcp1  local
  demo-mdt0pool/mdt0  lustre:version        1                                        local
  demo-mdt0pool/mdt0  lustre:mgsnode        192.168.227.11@tcp1:192.168.227.12@tcp1  local
  demo-mdt0pool/mdt0  lustre:fsname         demo                                     local
  demo-mdt0pool/mdt0  lustre:index          0                                        local
Starting the MDS Service
The mount command is used to start all Lustre storage services, including the MDS. The syntax is:
  mount -t lustre [-o <options>] \
    <ldiskfs blockdev>|<zpool>/<dataset> <mount point>
The mount command syntax is very similar for both LDISKFS and ZFS storage targets. The main difference is the format of the path to the storage. For LDISKFS, the path will resolve to a block device, such as /dev/sda or /dev/mapper/mpatha, whereas for ZFS, the path resolves to a dataset in a zpool, e.g. demo-mdt0pool/mdt0.
The mount point directory must exist before the mount command is executed. The recommended convention for the mount point of the MDT storage is /lustre/<fsname>/mdt<n>, where <fsname> is the name of the file system and <n> is the index number of the MDT.
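The naming convention is easy to encode so that every MDS host derives the same path. This is a small sketch under that convention only; mdt_mount_point is a hypothetical helper name, and the mkdir and mount commands appear as comments because they require a formatted MDT.

```shell
# Hypothetical helper: derive the conventional mount point /lustre/<fsname>/mdt<n>.
mdt_mount_point() {
    printf '/lustre/%s/mdt%s\n' "$1" "$2"
}

mnt=$(mdt_mount_point demo 0)
echo "$mnt"

# On an MDS host, the directory would be created before mounting (not run here):
#   mkdir -p "$mnt"
#   mount -t lustre demo-mdt0pool/mdt0 "$mnt"
```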
The following example starts a ZFS-based MDS:
  # Ignore MOUNTPOINT column in output: not used by Lustre
  [root@rh7z-mds2 ~]# zfs list
  NAME                 USED  AVAIL  REFER  MOUNTPOINT
  demo-mdt0pool       2.87M  9.62G    19K  /demo-mdt0pool
  demo-mdt0pool/mdt0  2.79M  9.62G  2.79M  /demo-mdt0pool/mdt0

  [root@rh7z-mds2 ~]# mkdir -p /lustre/demo/mdt0
  [root@rh7z-mds2 ~]# mount -t lustre demo-mdt0pool/mdt0 /lustre/demo/mdt0

  [root@rh7z-mds2 ~]# df -ht lustre
  File system         Size  Used Avail Use% Mounted on
  demo-mdt0pool/mdt0  9.7G  2.8M  9.7G   1% /lustre/demo/mdt0
Note: The default output for zfs list shows mount points for the demo-mdt0pool pool and the mdt0 dataset in the MOUNTPOINT column. As with all ZFS-formatted OSDs, the content of this column can be ignored.
To reduce confusion, the ZFS mountpoint property can be set to none. For example:
  zfs set mountpoint=none demo-mdt0pool
  zfs set mountpoint=none demo-mdt0pool/mdt0
Note: Only the mount -t lustre command can start Lustre services. Mounting storage as type ldiskfs or zfs will mount a storage target on the host, but it will not trigger the startup of the requisite Lustre kernel processes.
To verify that the MDS is running, check that the MDT device has been mounted, then get the Lustre device list with lctl dl, and review the running processes:
  [root@rh7z-mds2 ~]# df -ht lustre
  File system         Size  Used Avail Use% Mounted on
  demo-mdt0pool/mdt0  9.7G  2.8M  9.7G   1% /lustre/demo/mdt0

  [root@rh7z-mds2 ~]# lctl dl
    0 UP osd-zfs demo-MDT0000-osd demo-MDT0000-osd_UUID 7
    1 UP mgc MGC192.168.227.11@tcp1 1605562b-d702-9251-6f38-1fd4a64e2720 5
    2 UP mds MDS MDS_uuid 3
    3 UP lod demo-MDT0000-mdtlov demo-MDT0000-mdtlov_UUID 4
    4 UP mdt demo-MDT0000 demo-MDT0000_UUID 5
    5 UP mdd demo-MDD0000 demo-MDD0000_UUID 4
    6 UP qmt demo-QMT0000 demo-QMT0000_UUID 4
    7 UP lwp demo-MDT0000-lwp-MDT0000 demo-MDT0000-lwp-MDT0000_UUID 5

  [root@rh7z-mds2 ~]# ps -ef | awk '/mdt/ && !/awk/'
  root     32320     2  0 Mar30 ?        00:00:00 [mdt00_000]
  root     32321     2  0 Mar30 ?        00:00:00 [mdt00_001]
  root     32322     2  0 Mar30 ?        00:00:00 [mdt00_002]
  root     32323     2  0 Mar30 ?        00:00:00 [mdt_rdpg00_000]
  root     32324     2  0 Mar30 ?        00:00:00 [mdt_rdpg00_001]
  root     32325     2  0 Mar30 ?        00:00:00 [mdt_attr00_000]
  root     32326     2  0 Mar30 ?        00:00:00 [mdt_attr00_001]
  root     32327     2  0 Mar30 ?        00:00:00 [mdt_out00_000]
  root     32328     2  0 Mar30 ?        00:00:00 [mdt_out00_001]
  root     32329     2  0 Mar30 ?        00:00:00 [mdt_seqs_0000]
  root     32330     2  0 Mar30 ?        00:00:00 [mdt_seqs_0001]
  root     32331     2  0 Mar30 ?        00:00:00 [mdt_seqm_0000]
  root     32332     2  0 Mar30 ?        00:00:00 [mdt_seqm_0001]
  root     32333     2  0 Mar30 ?        00:00:00 [mdt_fld_0000]
  root     32334     2  0 Mar30 ?        00:00:00 [mdt_fld_0001]
  root     32340     2  0 Mar30 ?        00:00:00 [mdt_ck]
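The device list check can also be scripted, for example as part of a monitoring or HA resource script. The following is a sketch only: mdt_is_up is a hypothetical helper name, and it assumes the column layout of the lctl dl output shown above (index, state, device type, device name). It reads the listing from standard input, so it can be tried without a running Lustre server.

```shell
# Hypothetical check: succeed only if the named MDT device is listed as UP.
# Reads `lctl dl` output on stdin; assumes columns: index, state, type, name.
mdt_is_up() {
    awk -v target="$1" '
        $2 == "UP" && $3 == "mdt" && $4 == target { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Sample lines copied from the device list above.
sample='  0 UP osd-zfs demo-MDT0000-osd demo-MDT0000-osd_UUID 7
  4 UP mdt demo-MDT0000 demo-MDT0000_UUID 5'

if printf '%s\n' "$sample" | mdt_is_up demo-MDT0000; then
    echo "demo-MDT0000 is UP"
else
    echo "demo-MDT0000 is not running"
fi
```

On a live MDS, the same function would be fed directly from the tool, as in lctl dl | mdt_is_up demo-MDT0000.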
Stopping the MDS Service
To stop a Lustre service, umount the corresponding target:
umount <mount point>
The mount point must correspond to the mount point used with the mount -t lustre command. For example:
  [root@rh7z-mds2 ~]# df -ht lustre
  File system         Size  Used Avail Use% Mounted on
  demo-mdt0pool/mdt0  9.7G  2.8M  9.7G   1% /lustre/demo/mdt0
  [root@rh7z-mds2 ~]# umount /lustre/demo/mdt0
  [root@rh7z-mds2 ~]# df -ht lustre
  df: no file systems processed
  [root@rh7z-mds2 ~]# lctl dl
  [root@rh7z-mds2 ~]#
Using the regular umount command is the correct way to stop a given Lustre service and unmount the associated storage, for both LDISKFS and ZFS-based Lustre storage volumes.
Do not use the zfs unmount command to stop a Lustre service. Attempting to use zfs commands to unmount a storage target that is mounted as part of an active Lustre service will return an error.