Lustre Server Requirements Guidelines

Figure 1. Lustre High Availability Architecture

Lustre can support 32 or more metadata servers (MDS), each with one or more metadata targets (MDTs) for storage, and at least 2000 object storage servers (OSS), each with one or more object storage targets (OSTs) for storage, within a single Lustre file system. Figure 1 depicts a Lustre file system architecture comprised of metadata and object storage server HA building blocks. As the requirements for performance and capacity grow, more OSS or MDS building blocks can be added to the file system cluster to meet demand.

The MGS, MDS and OSS servers typically use the same core operating platform and are often installed from a common template. Variations in the server operating platforms are commonly limited to driver differences arising from the different hardware configurations of the MDS and OSS machines.

It is the established practice to align the operating system installations for Lustre servers as closely as possible. Ideally, each server will have the same core operating system environment installed from a common template, and the software package manifests and versions will match across all of the server assets. This practice extends to drivers and firmware.

Variation of software and hardware is discouraged, but can occur when a deployment has been in place for some years and inevitable changes in hardware catalogues force differences in the design of individual servers belonging to different hardware generations.

In high availability configurations, each server of the HA group should ideally be identical, or as similar as possible in every design aspect, including the server hardware, slot placement for add-on cards, CPU choices and memory configuration. Each server should have an identical software package manifest to ensure consistent performance and application behaviour during operation of the cluster group.

Most of the examples used throughout the systems administration guide are based on the following system architecture:

  • Two metadata servers in an HA configuration, each equipped with:
    • Three network interfaces:
      • One in-band management network
      • One dedicated point-to-point network connection used as a Corosync ring
      • One high-speed, low-latency data fabric interface (e.g. OPA, InfiniBand)
    • One storage enclosure accessible from both hosts to contain the MGT and MDT0
  • Two or more object storage servers grouped into HA cluster pairs, each equipped with:
    • Three network interfaces:
      • One in-band management network
      • One dedicated point-to-point network connection used as a Corosync ring
      • One high-speed, low-latency data fabric interface (e.g. OPA, InfiniBand)
    • One storage enclosure accessible from both hosts to contain OST0 and OST1
  • Lustre clients, each with one management network interface and one high-performance data fabric interface

Lustre Server Guidelines

Figure 2. Lustre High Availability Building Blocks

Lustre servers normally have at least two network interfaces:

  1. A network for management and maintenance, including software management, health monitoring, time synchronisation, and remote administrator access.
  2. A high-performance data network exclusively for carrying Lustre traffic.

Extra network interfaces can be configured to provide multi-homing capabilities to servers, or to provide additional performance and resiliency characteristics using LNet's Multi-rail capability.
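
For illustration, the following sketch shows how a second high-speed interface might be added to an LNet network with the lnetctl utility to enable Multi-Rail. The interface names (ib0, ib1) and the o2ib0 network name are assumptions chosen for the example, not part of the reference architecture:

# Initialise LNet from user space (assumes the lnet kernel module is loaded)
lnetctl lnet configure

# Add the primary o2ib network on the first InfiniBand interface
lnetctl net add --net o2ib0 --if ib0

# Add a second interface to the same network to enable Multi-Rail
lnetctl net add --net o2ib0 --if ib1

# Review the resulting LNet configuration
lnetctl net show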

Additional network interfaces can also be used to provide redundancy for Corosync communications in the HA software framework. The hardware and operating system requirements for Pacemaker and Corosync HA framework are described in Creating a Framework for High Availability with Pacemaker. More complex HA topologies might employ a dedicated switch for cluster communications traffic.
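
As a rough sketch only, a corosync.conf nodelist might define two rings per node, one on the dedicated point-to-point link and one on the management network. The node names and addresses below are placeholders, and the remaining totem, quorum and transport settings are covered in the HA guide referenced above:

# /etc/corosync/corosync.conf (fragment; names and addresses are placeholders)
nodelist {
    node {
        name: mds1
        nodeid: 1
        ring0_addr: 192.168.10.1    # dedicated point-to-point link
        ring1_addr: 10.10.0.1       # management network as a second ring
    }
    node {
        name: mds2
        nodeid: 2
        ring0_addr: 192.168.10.2
        ring1_addr: 10.10.0.2
    }
}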

The operating system storage requires relatively little capacity, as the footprint of installed packages is quite modest. However, other factors influence the OS payload, including requirements for development tools needed to support DKMS (typically used to install OpenZFS), and additional system management tools. Log file storage must also be accommodated, as well as space for crash dumps, when configured.

A swap partition must also be included in estimates for OS storage needs. Follow the operating system vendor’s guide for configuration of swap. In the case of Red Hat, the guidance is to allocate a fixed 4GB swap partition for systems with 64GB or more of system RAM. System performance can degrade markedly when swap is actually used, so the goal is to configure the system with sufficient RAM that it never needs to use swap.

Note: Lustre itself will never use swap because it is implemented as kernel modules and it is not possible to swap kernel memory. However, there will always be a small number of programs running in user space on the host and because of this, allocating some storage for swap is essential as a contingency.
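
As a simple illustration, the following commands review the current swap and memory configuration, and the commented fstab entry shows how a fixed 4GB swap partition might be declared. The device path is a placeholder:

# Display active swap devices and overall memory usage
swapon --show
free -h

# Example /etc/fstab entry for a dedicated swap partition (device path is a placeholder)
# /dev/sda3    none    swap    defaults    0 0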

Because Lustre services run mostly in the kernel, it is important to ensure that there is sufficient free space on the root disks to accommodate crash dump files generated by the OS kernel when there is a kernel-level software fault. Kernel crash dumps are essential to debugging kernel software, including Lustre. Typically, crash files are stored in the /var directory or partition. Refer to the operating system documentation for instructions and guidance on crash dump (kdump) configuration.
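
As a hedged example for RHEL or CentOS systems, kdump can typically be enabled along the following lines; the crashkernel value and dump path are illustrative and should follow the distribution's sizing guidance:

# Reserve memory for the crash kernel on all installed kernels
grubby --update-kernel=ALL --args="crashkernel=auto"

# The dump target is defined in /etc/kdump.conf, e.g.:
#   path /var/crash

# Enable the kdump service and confirm that it is active
# (a reboot is required for the crashkernel reservation to take effect)
systemctl enable --now kdump
systemctl status kdump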

Two disks, internal to the server chassis and configured as a RAID-1 mirror, provide some fault tolerance. Where possible, use a hardware RAID controller supplied with the server to manage the root disk mirror; this will reduce complexity in the operating system configuration. Otherwise, use LVM to create a root disk mirror. LVM has the added advantage of supporting snapshots, which can be useful when conducting system maintenance, such as a software upgrade. Refer to the documentation for the Linux distribution in use for more detailed information on establishing LVM storage volumes.
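
For example, when the root file system resides on an LVM logical volume, a snapshot can be taken before a software upgrade and then discarded or merged back afterwards. The volume group and logical volume names below are placeholders:

# Create a snapshot of the root logical volume before maintenance
lvcreate --size 10G --snapshot --name root_presnap /dev/vg_root/root

# List logical volumes to confirm that the snapshot exists
lvs

# Remove the snapshot once the upgrade has been verified...
lvremove /dev/vg_root/root_presnap

# ...or merge it back to roll the root volume back to its pre-upgrade
# state (the merge completes when the volume is next activated)
lvconvert --merge /dev/vg_root/root_presnap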

Lustre servers should be configured with a large amount of system memory in order to take advantage of Lustre’s caching features. Metadata servers, in particular, benefit from being able to cache the file system namespace in memory.

While a test system can be configured with as little as 2GB RAM, a production Lustre server should be equipped with at least 64GB for an entry-level platform; ideally 128GB or more should be installed in each server. Insufficient memory capacity can lead to out-of-memory errors when the servers are exposed to demanding, high-performance, production workloads, destabilizing the server and, by extension, the file system. Refer to the section on OpenZFS Performance Tuning for further information.
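
For example, when the storage targets are formatted with OpenZFS, the ARC size is commonly capped so that memory remains available for the rest of the system. The value below (96GB on a 128GB server) is illustrative only; refer to the OpenZFS Performance Tuning section for guidance:

# /etc/modprobe.d/zfs.conf
# Limit the ZFS ARC to 96GB (value in bytes); illustrative only
options zfs zfs_arc_max=103079215104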

Storage Preparation

In the reference architecture, servers are grouped into pairs, and each pair is connected to a multi-ported storage enclosure. The storage devices in each enclosure are grouped into Logical Units (LUs), and the logical units are visible to each of the servers in an HA group.

A high availability Lustre file system using the reference architecture is comprised of at least two HA cluster groups: one group for the MGS and MDS, usually referred to as the metadata cluster (or individually as metadata servers), and one or more HA groups for the object storage servers.

Storage enclosures are usually attached to each of the servers with redundant cable connections, to protect against individual component-level failures, such as a broken cable or host bus adapter (HBA). This is commonly referred to as multipathing. Multipathing configuration is specific to the storage hardware vendor, although some basic guidance is available from the operating system distribution providers. In most Linux distributions, multipath capability is managed by a software package called Device Mapper Multipath (DM-Multipath). Storage vendors will sometimes provide their own software for managing multipath configuration.

The device-mapper multipath software is not installed by default on RHEL or CentOS systems using the @core and @base package groups. This software is usually required when working with external storage systems, although one should check with the storage vendor’s documentation for information regarding integration of the storage with the Linux distribution in use.

RHEL and CentOS systems provide the multipath software in two packages: device-mapper-multipath and device-mapper-multipath-libs. To install, use YUM:

yum [-y] install device-mapper-multipath

YUM will automatically resolve any dependencies and include those packages in the installation manifest. In this case, device-mapper-multipath depends on device-mapper-multipath-libs, which YUM will automatically include so that the -libs package does not have to be specified on the yum command line.
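
Once installed, the multipath service can be enabled with a default configuration and the discovered devices reviewed. The commands below apply to RHEL and CentOS; consult the storage vendor's documentation for the recommended device settings:

# Generate a default /etc/multipath.conf and start the multipathd service
mpathconf --enable --with_multipathd y

# Confirm the service is running and list the multipath devices and their paths
systemctl status multipathd
multipath -ll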

Information on the configuration of multipath devices is available from all of the major Linux distributions, and from several storage hardware manufacturers. For example: