Lustre Management Service (MGS)

=== Summary ===
The Management Service, or MGS, is a global resource that can support multiple file systems in a service domain. The MGS stores configuration information for all of the Lustre file systems in a cluster and provides this information to other Lustre hosts. Servers and clients connect to the MGS on startup in order to retrieve the configuration log for the file system. Notifications of changes to a file system's configuration, including server restarts, are distributed by the MGS. Global file system settings, such as those for Quota and Jobstats, are managed by the MGS. MGS data is stored on a block device called the management target (MGT).


=== System Design ===
The MGS has very modest resource requirements compared to the Metadata Service (MDS) and the Object Storage Service (OSS), as it is not a compute- or storage-intensive service. The MGS is only active when Lustre targets start and stop, when clients mount the file system, and when certain global configuration settings are changed. The MGS is not directly in the client IO path. The minimum storage required for the MGT is 100MB, and any physical storage allocation is likely to far exceed this minimum requirement. Therefore the minimum real-world requirement is a fault-tolerant storage device, ideally a two-disk, RAID 1 mirror. Some storage arrays support the creation of a small virtual disk from a larger physical array configuration. For example, one might create a physical storage array consisting of 24 disks in a RAID 10 configuration and split it into two vdisks: a small vdisk for the MGT (perhaps 1GB), with the remainder allocated to, e.g., MDT0. Each vdisk must be presented to the servers as an independent block device that can be managed independently from the point of view of failover configuration.


The MGS can run on a standalone server, but like all Lustre services, if that host fails, the MGS will be offline until the host can be restored. When the MGS is offline, clients will be unable to mount the file system. Therefore, the MGS is most often deployed in an HA failover configuration alongside an MDS, with a small high-availability device or volume that can be mounted on more than one host independently. Because the MGS consumes only a small amount of a server's resources, it is unusual to create an HA cluster that contains only the MGS. Instead, the MGS will typically be paired with the root MDS (i.e., the metadata service for MDT0, which is the storage target that contains the root of a Lustre file system namespace). The MGS can safely run with an MDS on the same server, and it is common to see this in DNE deployments.
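As an illustration, in a Pacemaker-based HA pair the MGT mount can be managed with the generic ocf:heartbeat:Filesystem resource agent. The resource name, device path and mount point below are hypothetical, and a production cluster would also need fencing plus ordering and colocation constraints appropriate to the site.

 # Hypothetical sketch: resource name, device path and mount point are placeholders.
 # The Filesystem agent mounts the MGT with "mount -t lustre" on whichever node owns the resource.
 pcs resource create mgs ocf:heartbeat:Filesystem \
     device=/dev/mapper/mgt directory=/lustre/mgt fstype=lustre \
     op monitor interval=30s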


==== MGT ====
For truly flexible, high-availability configurations, where resources are somewhat autonomous and can be managed independently, the MGT storage should be allocated to a device or volume that is independent of the MDT storage, from the perspective of the OS (although both MGT and MDT might be in the same physical storage enclosure). The intention is that each service can migrate independently between hosts in a high availability server configuration.
 
For ldiskfs-based storage, this generally means that the MGT and MDT are contained on separate LUNs or vdisks in a storage array, and for ZFS, the MGT and MDT should be in separate zpools.
 
When using ZFS for the MGT and MDT storage in a high availability configuration, do not configure the MGT and MDT as datasets in the same pool. The pool can only be imported onto one host at a time, which will prevent the services from running on separate hosts, and will not allow independent service migration or failover.
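For example, an HA metadata server pair using ZFS might create two independent mirrored pools, one for the MGT and one for MDT0, so that each target can be imported on either host on its own. The pool names and disk identifiers below are illustrative only; the targets are then formatted with mkfs.lustre as described under Installation and Use.

 # Illustrative only: pool names and disk identifiers are placeholders.
 # Separate pools allow the MGT and MDT0 to fail over independently.
 zpool create mgtpool mirror /dev/disk/by-id/diskA /dev/disk/by-id/diskB
 zpool create mdt0pool mirror /dev/disk/by-id/diskC /dev/disk/by-id/diskD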


Refer to the [[:File:Metadata Server HA Cluster Simple lowres v1.png|Metadata server cluster]] diagram for a typical high availability MGS and MDS server cluster configuration.


There can only be one MGS running on a node at one time. This means that it is not possible to have multiple MGTs configured in the same HA cluster, because even if the services initially start on separate nodes, after a failover they will both end up located on the same host. Incoming client connections will not be able to determine which service to connect to, and may be connected arbitrarily, resulting in the registration and other configuration data being split in unpredictable ways across the competing resources. This could cause corruption of the configuration on both targets. There is no specific requirement in Lustre to create multiple MGS instances; one MGS will often suffice for many file systems in a subnet.
==== Installation and Use ====
After installation of the Lustre server software, mkfs.lustre can be used to format a management target:
 mkfs.lustre --mgs [options] <device|zpool/dataset>
The MGS must be mounted before other Lustre targets start and before clients mount, as it is the central registration and management point. Lustre servers and clients communicate with the MGS via Management Service Client (MGC) kernel processes.

To start the MGS service:
 mount -t lustre <device|zpool/dataset> <mount point>
To stop the MGS service:
 umount <mount point>
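As a concrete sketch, with hypothetical device, pool and mount point names, an MGT could be created and started as follows (use either the ldiskfs or the ZFS variant, not both):

 # Hypothetical example: device names, pool names and mount points are placeholders.
 # ldiskfs-backed MGT on a small mirrored LUN:
 mkfs.lustre --mgs /dev/mapper/mgt
 mkdir -p /lustre/mgt
 mount -t lustre /dev/mapper/mgt /lustre/mgt
 
 # Alternatively, a ZFS-backed MGT as a dataset in its own pool:
 mkfs.lustre --mgs --backfstype=zfs mgtpool/mgt
 mount -t lustre mgtpool/mgt /lustre/mgt
 
 # Stopping the service:
 umount /lustre/mgt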
=== System Settings ===
The MGS controls high-level file system settings and is used to change system-wide defaults. Some common settings are listed below; a sketch of the corresponding commands follows the list.
* Jobstats give administrators the ability to track HPC jobs and their IO activity in a file system. See the Lustre Manual, "12.2 Lustre Jobstats", for full details.
* Quotas are used to monitor and limit the amount of storage that can be used. See the Lustre Manual, "22.2.2 Enabling Disk Quotas (Lustre Software Release 2.4 and later)", for full details.
* Global client parameters can be set persistently with "lctl set_param -P ...". This simplifies deployment of system-wide client configuration changes for common parameters such as osc.*.max_rpcs_in_flight. See the Lustre Manual, "13.10.3.3 Setting Permanent Parameters with lctl set_param -P", for more details.
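The following sketch assumes a file system named testfs and is illustrative only; the commands are run on the MGS node, and the parameter values are examples rather than recommendations.

 # Illustrative only: "testfs" and the values shown are placeholders.
 # Enable jobstats by selecting the source of the job identifier:
 lctl conf_param testfs.sys.jobid_var=procname_uid
 
 # Enable user and group quota enforcement on the MDTs and OSTs:
 lctl conf_param testfs.quota.mdt=ug
 lctl conf_param testfs.quota.ost=ug
 
 # Set a persistent, system-wide client parameter:
 lctl set_param -P osc.*.max_rpcs_in_flight=16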




[[Category:Lustre File System Components]]
