Lustre Metadata Service (MDS)

The Metadata Service (MDS) provides the index, or namespace, for a Lustre file system. The metadata content is stored on object storage device (OSD) volumes called Metadata Targets (MDTs); a Lustre file system’s directory structure and file names, permissions, extended attributes, and file layouts are recorded to the MDTs. Each Lustre file system must have a minimum of one MDT, called MDT0, which contains the root of the file system namespace, but a single Lustre file system can have many MDTs for file systems with complex directory structures and very large quantities of files.

In addition to managing the namespace hierarchy, the metadata service records the layout of files (number of stripes and stripe size), and is responsible for object allocation on the OSTs. The MDS determines and allocates a file’s layout; an application or user on a Lustre client can specify the parameters for a file’s layout when the file is created. When a layout is not specified, the MDS provides a default (usually a single object on the next suitable object storage target). File layouts are allocated for each individual file in the file system in order to allow users to optimize the I/O for a given application workload.

For convenience, a default policy for the file layout can be assigned to a directory so that each file created within that directory will inherit the same layout format. Files do not share objects, so each individual file has a unique storage layout, although many files will conform to the same layout policy.

The MDS is only involved in metadata operations for a file or directory. After a file has been opened, the MDS does not participate in I/O transactions again until the file is closed, avoiding any overheads that might be incurred by an application switching between throughput and metadata work.

Prior to Lustre version 2.4, only a single MDT could be used to store the metadata for a Lustre file system. With the introduction of the Distributed Namespace Environment (DNE) feature, the metadata workload for a single file system can be distributed across multiple MDTs, and consequently multiple metadata servers. There are two implementations of distributed metadata available to file system architects: remote directories and striped directories.

DNE remote directories (sometimes referred to as DNE phase 1 or DNE1) provide a way for administrators to assign a discrete sub-tree of the overall file system namespace to a specific MDT. In this way, the MDTs are connected into a virtual tree structure, with each MDT associated with a specific sub-directory. MDT0 is always used to represent the root of the name space, with all other MDTs as children of MDT0.

While it is technically possible to create nested MDT relationships, this is disabled by default and discouraged as an architecture, because loss of an MDT means loss of access to all of the subdirectories hosted by that MDT, including any content on MDTs that are serving subdirectories more deeply nested in the tree than the failed MDT.

DNE striped directories (sometimes called DNE phase 2 or DNE2) use a more sophisticated structure to load balance metadata operations across multiple servers and MDTs. With striped directories, the structure of a directory is split into shards that are distributed across multiple MDTs. The user defines the layout for a given directory. Directory entries are split across the MDTs, and the inode for a file is created on the same MDT that holds the file name entry.
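The sharding idea behind striped directories can be illustrated with a toy model. The sketch below is not Lustre's placement code: it assumes that a file name is hashed with a 64-bit FNV-1a function and that the shard is chosen by taking the hash modulo the number of directory stripes. The function names `shard_for_name` and `distribute` are hypothetical and exist only for this illustration.

```python
# Toy model of a striped directory. This is an illustration only and does
# NOT reproduce Lustre's actual directory hashing or placement logic.

FNV64_OFFSET = 0xCBF29CE484222325  # standard FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001B3        # standard FNV-1a 64-bit prime
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Return the 64-bit FNV-1a hash of a byte string."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def shard_for_name(name: str, stripe_count: int) -> int:
    """Pick a directory shard (0..stripe_count-1) for a file name (assumed rule)."""
    return fnv1a_64(name.encode("utf-8")) % stripe_count


def distribute(names, stripe_count):
    """Group file names by the shard that would hold their directory entry."""
    shards = {i: [] for i in range(stripe_count)}
    for name in names:
        shards[shard_for_name(name, stripe_count)].append(name)
    return shards


if __name__ == "__main__":
    names = [f"file{i:04d}.dat" for i in range(1000)]
    for index, entries in distribute(names, 4).items():
        print(f"shard {index}: {len(entries)} entries")
```

Because the shard is derived from the name alone, a lookup in this model only needs to contact the one MDT that holds that shard, while creation of many files in one directory is spread over all of the stripes.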

A single metadata server can serve the content of many MDTs, even MDTs for different file systems. The metadata service can also be configured for high-availability, as can all of the services of a Lustre file system.

When metadata servers are grouped into high-availability cluster configurations, this capability allows MDT resources to be configured to run on each host in the cluster so that each server can be actively providing service to the network with no idle “standby” nodes. When a server fails or requires maintenance, its MDT resources can migrate to other nodes, preserving operational continuity of the metadata services. This works well for DNE-enabled file systems, as well as for hosting MDTs for multiple individual file systems.

Hosting the MDTs for multiple file systems on a common set of servers is not unusual, especially where there are budgetary or physical constraints, such as limited space or power in a data center, or where there is a desire for multiple file systems to be established in a single environment. This MDT hosting method provides a way to maximize utilization of the available hardware infrastructure. As with any arrangement where resources are shared between multiple services, care must be taken to ensure balance is maintained between the competing processes.

Metadata Server Building Blocks
The most common high-availability metadata server design pattern is a two-node configuration that comprises an MGS and the MDS for MDT0.



Figure 1 shows a typical blueprint for the metadata server high-availability building block. The building block comprises two servers, connected to a common external storage enclosure. The storage array in the diagram has been configured to hold the MGT as well as MDT0 for a Lustre file system. Two of the drives have been retained as spares. Each server has some internal, node-local storage to host the operating system, typically two disks in a RAID 1 mirror. There are three network interfaces on each server: a high-performance data network for Lustre and other application traffic, a management network for administration of the servers, and a dedicated point-to-point server connection for use as a communication ring in a high-availability framework, such as Pacemaker with Corosync.

Corosync is used for communications in HA clusters, and can be configured to use multiple networks for its communications rings. In this reference topology, both the dedicated point-to-point connection and the management network can be used in Corosync rings. Pacemaker high-availability clusters also commonly include a connection to a power management interface (not pictured), which is used to isolate, or fence, machines when a fault is detected in the cluster. Examples of the power management interface include a network or serial device connection to a smart PDU, or to a server’s BMC via IPMI. The management network is sometimes used to provide the connection to these power management interfaces.

Because the node that normally just runs the MGS is underutilized on its own, there is sufficient available capacity to run an additional MDS on the same host, provided care is taken to ensure that the MGS and additional MDS resources can be operated and migrated independently of one another within a cluster framework. This can be exploited to enable the MDT of a second file system to be hosted within the same HA cluster configuration, or to provide an additional MDT to an existing file system.

Figure 2 shows an example configuration where there are two MDT storage devices in separate storage enclosures, along with the MGT, which in this case has been created by mirroring two drives, one from each enclosure. This maximizes the available storage for metadata bandwidth while still leaving room for a spare drive in each enclosure. This is only one possibility among many valid choices.



Metadata Server Design Considerations
The Metadata Service is very resource-intensive and systems will benefit from frequency-optimized CPUs: clock speed and cache are more important than the number of CPU cores, although when there are a large number of service threads running, spreading the work across cores will also derive a benefit. The MDS will also exploit system memory to good advantage as cache for metadata and for lock management. The more RAM a metadata server can access, the better able it is to deliver strong performance for the concurrent workloads typical in HPC environments. On ZFS-based platforms, system memory is also used extensively for caching, providing significant improvements in application performance.

Services can scale from very small deployments for testing, such as VMs with perhaps 2 VCPU cores and as little as 2GB of RAM, up to high-end production servers with 24 cores, 3 GHz CPUs and 512GB RAM. Sizing depends on anticipated workload and file system working set population, and as computing requirements evolve, requirements invariably become more demanding. When a metadata target is formatted, the number of inodes for the MDT is calculated using a ratio of 2KB per inode. The inode itself consumes 512 bytes, but additional blocks can be allocated to store extended attributes when the initial space is consumed. This happens, for example, when a file has a lot of stripes. Using this 2KB-per-inode ratio has proven to be a reliable way to allocate capacity for the MDT.

A typical configuration for a metadata server is to allocate 16 cores and 256GB RAM. Assuming that caching a single inode in RAM without a lock requires approximately 2KB, this allows somewhere in the region of 130 million inodes to be cached in memory, depending on other operating system overheads, which represents around 13% of a 1 billion inode name space. A subset of these files will also be part of an active working set, which incurs an additional overhead for locking. Every client that accesses a file requires a lock, which consumes around 1KB per file per client. For example, if a thousand clients each open a thousand files, this incurs approximately 1GB of additional RAM for the locks. When sizing memory requirements, aiming for an active working set in cache of between 10 and 20 percent of the total file system name space is a reasonable goal.
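The sizing arithmetic above can be reproduced with a few lines of code. This is a rough sketch that uses only the figures quoted in the text (about 2KB of RAM per cached inode and about 1KB per lock per client); the function names are hypothetical, and the result ignores operating system and other overheads.

```python
# Rough MDS memory sizing, using the approximate per-inode and per-lock
# figures quoted in the text. Function names are illustrative only.

KIB = 1024
GIB = 1024 ** 3


def cacheable_inodes(ram_bytes: int, bytes_per_inode: int = 2 * KIB) -> int:
    """Upper bound on inodes that fit in RAM, ignoring all other overheads."""
    return ram_bytes // bytes_per_inode


def lock_memory_bytes(clients: int, files_per_client: int,
                      bytes_per_lock: int = 1 * KIB) -> int:
    """Memory consumed by locks when each client holds one lock per open file."""
    return clients * files_per_client * bytes_per_lock


if __name__ == "__main__":
    inodes = cacheable_inodes(256 * GIB)
    print(f"cacheable inodes: {inodes:,}")
    print(f"share of a 1 billion inode name space: {inodes / 1e9:.1%}")
    locks = lock_memory_bytes(clients=1000, files_per_client=1000)
    print(f"lock memory: {locks / GIB:.2f} GiB")
```

With 256GiB of RAM the upper bound works out to roughly 134 million inodes, in line with the "region of 130 million" estimate, and a thousand clients holding a thousand locks each consume just under 1GiB.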

File systems with large active working sets may require an increase in RAM to achieve optimal performance, and metadata-intensive workloads with high IOps activity will benefit from additional CPU power, with the largest benefit being derived from higher clock speeds rather than very large core counts.

Metadata storage is subjected to small, random I/O that is very IOps-intensive and somewhat transactional in nature, and MDT I/O shares many of the attributes of an OLTP database workload. A metadata server with a large active dataset can process tens of thousands to hundreds of thousands of very small I/O operations per second. High-speed storage is therefore essential. SAS storage is commonly used, but the use of flash storage for metadata is increasing.

Storage should be arranged in a RAID 1+0 (striped mirrors) layout to provide the best balance between performance and reliability, without the overheads introduced by RAID 5/6 or equivalent parity-based layouts. Metadata workloads are small, random I/O operations, and will perform poorly on RAID 5/6 layouts because the I/O is very unlikely to fit neatly into a single full stripe write, thus leading to read-modify-write overheads when recalculating parity to write data to the storage. With ZFS, the IOps are spread across the virtual devices (vdevs) in a zpool, which effectively means that the more vdevs in the pool, the better the overall performance. A pool containing many mirrored vdevs will provide better IOps performance than a pool with a single vdev.
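The cost of read-modify-write can be estimated with the commonly used "write penalty" model. The sketch below assumes the textbook back-end operation counts for a small random write (2 for RAID 1+0, 4 for RAID 5, 6 for RAID 6) and ignores controller caches and full-stripe writes; the drive count and per-drive IOps figures are hypothetical values chosen only for illustration.

```python
# Simplified write-penalty model for small random writes.
# Back-end disk operations needed per host write (textbook values):
#   RAID 1+0: write both mirror copies                      -> 2
#   RAID 5:   read data + parity, write data + parity       -> 4
#   RAID 6:   read data + 2 parity, write data + 2 parity   -> 6
WRITE_PENALTY = {
    "raid10": 2,
    "raid5": 4,
    "raid6": 6,
}


def effective_write_iops(drive_count: int, iops_per_drive: float,
                         layout: str) -> float:
    """Estimate host-visible random write IOps for a given RAID layout."""
    raw_iops = drive_count * iops_per_drive
    return raw_iops / WRITE_PENALTY[layout]


if __name__ == "__main__":
    # Hypothetical array: 12 drives at 200 random IOps each.
    for layout in ("raid10", "raid5", "raid6"):
        print(layout, effective_write_iops(12, 200, layout))
```

Under these assumptions the same set of drives delivers half as many small random writes on RAID 5 as on RAID 1+0, and a third as many on RAID 6, which is why parity layouts are a poor fit for MDT storage.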

When formatting MDT storage, Lustre will assume a ratio of 2KB per inode and allocate the on-disk format accordingly. Note that the ratio of inodes to storage capacity is not the same as the size of the inode itself, which is 512 bytes on ldiskfs storage. Additional space is used on the MDT for extended attributes that contain the file layout and other information, which is one of the reasons why this ratio is chosen. Metadata storage formatted using ldiskfs cannot allocate more than 4 billion inodes, a limitation in the EXT4 file system upon which ldiskfs is based. Because of the 2KB-per-inode allocation ratio, this puts the maximum volume size for an ldiskfs MDT at 8TB. Attempting to format a storage volume for ldiskfs that is larger than 8TB will fail, and even if the format were to succeed, the capacity beyond 8TB would be wasted. ZFS storage does not suffer from this limitation.
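The relationship between MDT capacity, the 2KB-per-inode ratio, and the 8TB ceiling can be checked with a short calculation. This sketch assumes that "4 billion" means 2^32 inodes and that capacities are measured in binary units; the function names are hypothetical.

```python
# Inode capacity arithmetic for an ldiskfs MDT, based on the 2KB-per-inode
# ratio described in the text. Assumes "4 billion" means 2**32 inodes.

BYTES_PER_INODE_RATIO = 2 * 1024   # space reserved per inode at format time
MAX_LDISKFS_INODES = 2 ** 32       # assumed inode limit inherited from EXT4
TIB = 1024 ** 4


def inode_count(mdt_bytes: int) -> int:
    """Number of inodes allocated for an MDT of the given size."""
    return mdt_bytes // BYTES_PER_INODE_RATIO


def max_ldiskfs_mdt_bytes() -> int:
    """Largest MDT size whose inodes all fit under the ldiskfs limit."""
    return MAX_LDISKFS_INODES * BYTES_PER_INODE_RATIO


if __name__ == "__main__":
    print(f"inodes on a 1 TiB MDT: {inode_count(1 * TIB):,}")
    print(f"maximum ldiskfs MDT size: {max_ldiskfs_mdt_bytes() // TIB} TiB")
```

A 1TiB MDT yields about 537 million inodes under this ratio, and the inode limit is reached at exactly 8TiB, which matches the maximum volume size stated above.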

When considering ZFS for the MDT back-end file system, be sure to use the latest stable ZFS software release to ensure the best metadata performance. Versions of ZFS on Linux prior to 0.7.0 did not perform as well as LDISKFS for metadata-intensive workloads. With version 0.7.0 and newer, ZFS performance is dramatically improved.

Also consider allocating additional storage capacity for recording snapshots. Snapshots of MDTs can be very useful for providing a means to create online backups of the metadata for a file system, without incurring an outage of the file system. Catastrophic loss of the MDT means loss of the file system name space, and consequently loss of the index to the data objects for each file that are held on the Object Storage Servers. In short, loss of the MDT renders a Lustre file system unusable. If, however, a regular backup is made of the metadata, then it is possible to recover a file system back to production state. Using a snapshot makes it easier to take a copy of the MDT while the file system is still online.

ZFS snapshots are very efficient and introduce very little overhead. LDISKFS does not have any built-in snapshot capability, but it is possible to use LVM to create a logical volume formatted for ldiskfs and use LVM to create snapshots. Be aware that LVM snapshots can degrade the performance of storage significantly, so snapshots should be destroyed after the backup has been successfully completed.

It is technically possible to combine the MGT and MDT into a single LUN; however, this is strongly discouraged. It reduces the flexibility of both services, makes maintenance more complex, and does not allow for distribution of the services across nodes in an HA cluster to optimize performance. When designing Lustre high-availability storage solutions, do not combine the MGT and MDT into a single storage volume or ZFS pool.