CHAPTER 1

Understanding Lustre

This chapter describes the Lustre architecture and the features of Lustre. It includes the following sections:

What Lustre Is (and What It Isn’t)
Lustre Components
Lustre Storage and I/O


1.1 What Lustre Is (and What It Isn’t)

Lustre is a storage architecture for clusters. The central component of the Lustre architecture is the Lustre file system, which is supported on the Linux operating system and provides a POSIX-compliant UNIX file system interface.

The Lustre storage architecture is used for many different kinds of clusters. It is best known for powering seven of the ten largest high-performance computing (HPC) clusters worldwide, with tens of thousands of client systems, petabytes (PB) of storage and hundreds of gigabytes per second (GB/sec) of I/O throughput. Many HPC sites use Lustre as a site-wide global file system, serving dozens of clusters.

The ability of a Lustre file system to scale capacity and performance for any need reduces the need to deploy many separate file systems, such as one for each compute cluster. Storage management is simplified by avoiding the need to copy data between compute clusters. In addition to aggregating storage capacity of many servers, the I/O throughput is also aggregated and scales with additional servers. Moreover, throughput and/or capacity can be easily increased by adding servers dynamically.

While Lustre can function in many work environments, it is not necessarily the best choice for all applications. It is best suited for uses that exceed the capacity that a single server can provide, though in some use cases Lustre can perform better with a single server than other file systems due to its strong locking and data coherency.

Lustre is currently not particularly well suited for "peer-to-peer" usage models where there are clients and servers running on the same node, each sharing a small amount of storage, due to the lack of Lustre-level data replication. In such uses, if one client/server fails, then the data stored on that node will not be accessible until the node is restarted.

1.1.1 Lustre Features

Lustre runs on a variety of vendors’ kernels. For more details, see Lustre_Release_Information on the Lustre wiki.

A Lustre installation can be scaled up or down with respect to the number of client nodes, disk storage and bandwidth. Scalability and performance are dependent on available disk and network bandwidth and the processing power of the servers in the system. Lustre can be deployed in a wide variety of configurations that can be scaled well beyond the size and performance observed in production systems to date.

TABLE 1-1 shows the practical range of scalability and performance characteristics of the Lustre file system and some test results in production systems.

TABLE 1-1 Lustre Scalability and Performance

| Feature | Practical Range | Tested in Production |
|---|---|---|
| Clients: Scalability | 100-100,000 | 50,000+ clients; many installations in 10,000 to 20,000 range |
| Clients: Performance | Single client: 2 GB/sec I/O, 1,000 metadata ops/sec. File system: 2.5 TB/sec | Single client: 2 GB/sec. File system: 240 GB/sec I/O |
| OSSs: Scalability | OSSs: 4-500 with up to 4,000 OSTs. File system: 64 PB, file size 320 TB | OSSs: 450 OSSs with 1,000 OSTs; 192 OSSs with 1,344 OSTs. File system: 10 PB, file size multi-TB |
| OSSs: Performance | Up to 5 GB/sec | OSS throughput at 2.0+ GB/sec |
| MDSs: Scalability | 1 + 1 (failover with one backup) | |
| MDSs: Performance | Up to 35,000/s create, 100,000/s stat metadata operations | 15,000/s create, 25,000/s stat metadata operations |


Other Lustre features are:


1.2 Lustre Components

An installation of the Lustre software includes a management server (MGS) and one or more Lustre file systems interconnected with Lustre networking (LNET).

A basic configuration of Lustre components is shown in FIGURE 1-1.

FIGURE 1-1 Lustre components in a basic cluster


1.2.1 Management Server (MGS)

The MGS stores configuration information for all the Lustre file systems in a cluster and provides this information to other Lustre components. Each Lustre target contacts the MGS to provide information, and Lustre clients contact the MGS to retrieve information.

It is preferable that the MGS have its own storage space so that it can be managed independently. However, the MGS can be “co-located” and share storage space with an MDS as shown in FIGURE 1-1.

1.2.2 Lustre File System Components

Each Lustre file system consists of the following components:

The Lustre client software provides an interface between the Linux virtual file system and the Lustre servers. The client software includes a Management Client (MGC), a Metadata Client (MDC), and multiple Object Storage Clients (OSCs), one corresponding to each OST in the file system.

A logical object volume (LOV) aggregates the OSCs to provide transparent access across all the OSTs. Thus, a client with the Lustre file system mounted sees a single, coherent, synchronized namespace. Several clients can write to different parts of the same file simultaneously, while, at the same time, other clients can read from the file.

TABLE 1-2 provides the requirements for attached storage for each Lustre file system component and describes desirable characteristics of the hardware used.


TABLE 1-2 Storage and hardware requirements for Lustre components

| Component | Required attached storage | Desirable hardware characteristics |
|---|---|---|
| MDSs | 1-2% of file system capacity | Adequate CPU power, plenty of memory, fast disk storage. |
| OSSs | 1-16 TB per OST, 1-8 OSTs per OSS | Good bus bandwidth. Recommended that storage be balanced evenly across OSSs. |
| Clients | None | Low latency, high bandwidth network. |


For additional hardware requirements and considerations, see Chapter 5: Setting Up a Lustre File System.
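The storage guidelines in TABLE 1-2 can be turned into a rough sizing estimate. The following is a minimal sketch, not a sizing tool: the function names are hypothetical, the 1-2% range and the limit of 8 OSTs per OSS are taken directly from the table, and real deployments should follow the hardware chapter referenced above.

```python
def mdt_capacity_range_tb(fs_capacity_tb):
    """Return (low, high) MDS storage in TB, using the 1-2% guideline in TABLE 1-2."""
    return (fs_capacity_tb / 100, fs_capacity_tb / 50)


def min_oss_count(num_osts, osts_per_oss=8):
    """Fewest OSSs that keep every OSS at or below osts_per_oss OSTs.

    The default of 8 is the upper end of the 1-8 OSTs per OSS range in TABLE 1-2.
    """
    if num_osts < 1 or osts_per_oss < 1:
        raise ValueError("num_osts and osts_per_oss must be positive")
    return -(-num_osts // osts_per_oss)  # ceiling division on integers
```

For example, mdt_capacity_range_tb(1000) returns (10.0, 20.0) for a 1,000 TB file system, and min_oss_count(20) returns 3.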

1.2.3 Lustre Networking (LNET)

Lustre Networking (LNET) is a custom networking API that provides the communication infrastructure that handles metadata and file I/O data for the Lustre file system servers and clients. For more information about LNET, see Chapter 2: Understanding Lustre Networking (LNET).

1.2.4 Lustre Cluster

At scale, the Lustre cluster can include hundreds of OSSs and thousands of clients (see FIGURE 1-2). More than one type of network can be used in a Lustre cluster. Shared storage between OSSs enables failover capability. For more details about OSS failover, see Chapter 3: Understanding Failover in Lustre.

FIGURE 1-2 Lustre cluster at scale



1.3 Lustre Storage and I/O

In a Lustre file system, a file stored on the MDT points to one or more objects associated with a data file, as shown in FIGURE 1-3. Each object contains data and is stored on an OST. If the MDT file points to one object, all the file data is stored in that object. If the file points to more than one object, the file data is “striped” across the objects (using RAID 0) and each object is stored on a different OST. (For more information about how striping is implemented in Lustre, see Lustre File System and Striping.)

In FIGURE 1-3, each filename points to an inode. The inode contains all of the file attributes, such as owner, access permissions, Lustre striping layout, access time, and access control. Multiple filenames may point to the same inode.

FIGURE 1-3 MDT file points to objects on OSTs containing file data


When a client opens a file, the file open operation transfers the file layout from the MDS to the client. The client then uses this information to perform I/O on the file, directly interacting with the OSS nodes where the objects are stored. This process is illustrated in FIGURE 1-4.

FIGURE 1-4 File open and file I/O in Lustre


Each file on the MDT contains the layout of the associated data file, including the OST number and object identifier. Clients request the file layout from the MDS and then perform file I/O operations by communicating directly with the OSSs that manage that file data.
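The separation between the metadata path and the data path can be illustrated with a toy in-memory model. This is only a sketch of the flow described above and does not use any Lustre API: the ToyMDS and ToyOSS classes, their methods, and read_file are hypothetical names invented for illustration, and a single ToyOSS object stands in for all OSS nodes.

```python
class ToyOSS:
    """Stands in for the OSS nodes; holds object data keyed by (ost_index, object_id)."""

    def __init__(self):
        self.objects = {}

    def write(self, ost_index, object_id, data):
        self.objects[(ost_index, object_id)] = data

    def read(self, ost_index, object_id):
        return self.objects[(ost_index, object_id)]


class ToyMDS:
    """Stands in for the MDS; stores a layout per file name and never any file data."""

    def __init__(self):
        self.layouts = {}

    def create(self, name, layout):
        self.layouts[name] = layout

    def open(self, name):
        # Opening a file hands the layout to the client:
        # a list of (ost_index, object_id) pairs.
        return self.layouts[name]


def read_file(mds, oss, name):
    """Return the data of each object of the file, in layout order."""
    layout = mds.open(name)  # one metadata request to the MDS
    # All data requests go directly to the object storage side.
    return [oss.read(ost_index, object_id) for ost_index, object_id in layout]
```

The point of the sketch is that the MDS is consulted once, at open time, and is not involved in the data transfers that follow.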

The available bandwidth of a Lustre file system is determined as follows:

1.3.1 Lustre File System and Striping

One of the main factors leading to the high performance of Lustre file systems is the ability to stripe data across multiple OSTs in a round-robin fashion. Users can optionally configure for each file the number of stripes, stripe size, and OSTs that are used.

Striping can be used to improve performance when the aggregate bandwidth to a single file exceeds the bandwidth of a single OST. The ability to stripe is also useful when a single OST does not have enough free space to hold an entire file. For more information about benefits and drawbacks of file striping, see Lustre File Striping Considerations.

Striping allows segments or “chunks” of data in a file to be stored on different OSTs, as shown in FIGURE 1-5. In the Lustre file system, a RAID 0 pattern is used in which data is "striped" across a certain number of objects. The number of objects in a single file is called the stripe_count.

Each object contains a chunk of data from the file. When the chunk of data being written to a particular object exceeds the stripe_size, the next chunk of data in the file is stored on the next object.
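The round-robin placement described above can be written as a short calculation. The following is a minimal sketch that assumes a plain RAID 0 pattern with a fixed stripe_size and stripe_count; the function name locate is hypothetical and is not part of any Lustre interface.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a byte offset in a file to (object_index, offset_within_object)."""
    if offset < 0 or stripe_size < 1 or stripe_count < 1:
        raise ValueError("offset must be >= 0; stripe_size and stripe_count must be >= 1")
    chunk = offset // stripe_size        # index of the stripe_size-sized chunk in the file
    object_index = chunk % stripe_count  # chunks rotate round-robin over the objects
    stripe_row = chunk // stripe_count   # complete passes over all objects before this chunk
    object_offset = stripe_row * stripe_size + offset % stripe_size
    return object_index, object_offset
```

With a stripe_size of 1 MB (1048576 bytes) and a stripe_count of 3, locate(3 * 1048576 + 5, 1048576, 3) returns (0, 1048581): the fourth chunk of the file wraps around to the first object and is stored after that object's first chunk. With a stripe_count of 1, every offset maps to object 0 unchanged.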

Default values for stripe_count and stripe_size are set for the file system. The default value for stripe_count is 1 stripe per file and the default value for stripe_size is 1 MB. The user may change these values on a per directory or per file basis. For more details, see Setting the File Layout/Striping Configuration (lfs setstripe).

In FIGURE 1-5, the stripe_size for File C is larger than the stripe_size for File A, allowing more data to be stored in a single stripe for File C. The stripe_count for File A is 3, resulting in data striped across three objects, while the stripe_count for File B and File C is 1.

No space is reserved on the OST for unwritten data. File A in FIGURE 1-5 is a sparse file that is missing chunk 6.

FIGURE 1-5 File striping pattern across three OSTs for three different data files


The maximum file size is not limited by the size of a single target. Lustre can stripe files across multiple objects (up to 160), and each object can be up to 2 TB in size. This leads to a maximum file size of 320 TB. (Note that Lustre itself can support files up to 2^64 bytes depending on the backing storage used by OSTs.)

Although a single file can only be striped over 160 objects, Lustre file systems can have thousands of OSTs. The I/O bandwidth to access a single file is the aggregated I/O bandwidth to the objects in the file, which can be as much as the combined bandwidth of 160 servers. On systems with more than 160 OSTs, clients can do I/O using multiple files to utilize the full file system bandwidth.
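The limits in this section follow from simple arithmetic, sketched below. The constants come from the text (160 objects per file, 2 TB per object). The function names are hypothetical, and the bandwidth figure is only an upper bound that assumes every OST delivers the same bandwidth and that nothing else, such as the network or the client, is the bottleneck.

```python
MAX_STRIPE_COUNT = 160   # maximum objects per file, from the text
MAX_OBJECT_SIZE_TB = 2   # maximum object size in TB, from the text


def max_file_size_tb(stripe_count=MAX_STRIPE_COUNT):
    """Largest file, in TB, that stripe_count objects can hold."""
    if not 1 <= stripe_count <= MAX_STRIPE_COUNT:
        raise ValueError("stripe_count must be between 1 and 160")
    return stripe_count * MAX_OBJECT_SIZE_TB


def single_file_bandwidth_gb_s(per_ost_gb_s, stripe_count):
    """Upper bound on bandwidth to one file, assuming identical OSTs (an assumption)."""
    return per_ost_gb_s * min(stripe_count, MAX_STRIPE_COUNT)


def files_to_cover_all_osts(total_osts, stripe_count=MAX_STRIPE_COUNT):
    """Fewest files needed so that every OST can hold an object of some file."""
    return -(-total_osts // stripe_count)  # ceiling division on integers
```

Here max_file_size_tb() returns 320, matching the 320 TB figure above, and files_to_cover_all_osts(1000) returns 7, which shows why a file system with 1,000 OSTs needs I/O on several files at once to reach its full bandwidth.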

For more information about striping, see Chapter 18: Managing File Striping and Free Space.