| C H A P T E R 1 |
|
Introduction to Lustre |
This chapter describes Lustre software and components, and includes the following sections:
These instructions assume you have some familiarity with Linux system administration, cluster systems and network technologies.
Lustre is a storage architecture for clusters. The central component is the Lustre file system, which is available for Linux and provides a POSIX-compliant UNIX file system interface.
The Lustre architecture is used for many different kinds of clusters. It is best known for powering seven of the ten largest high-performance computing (HPC) clusters worldwide, with tens of thousands of client systems, petabytes (PB) of storage and hundreds of gigabytes per second (GB/sec) of I/O throughput. Many HPC sites use Lustre as a site-wide global file system, serving dozens of clusters on an unprecedented scale.
The scalability of a Lustre file system reduces the need to deploy many separate file systems (such as one for each cluster). This offers significant storage management advantages, for example, avoiding maintenance of multiple data copies staged on multiple file systems. Hand in hand with aggregating file system capacity with many servers, I/O throughput is also aggregated and scales with additional servers. Moreover, throughput (or capacity) can be easily adjusted by adding servers dynamically.
Lustre has been integrated with several vendor’s kernels. We offer Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise (SUSE) kernels with Lustre patches.
The key features of Lustre include:
Additionally, Lustre offers these features:
A Lustre file system consists of the following basic components (see FIGURE 1-1).
The Lustre client software consists of an interface between the Linux Virtual File System and the Lustre servers. Each target has a client counterpart: Metadata Client (MDC), Object Storage Client (OSC), and a Management Client (MGC). A group of OSCs are wrapped into a single LOV. Working in concert, the OSCs provide transparent access to the file system.
Clients, which mount the Lustre file system, see a single, coherent, synchronized namespace at all times. Different clients can write to different parts of the same file at the same time, while other clients can read from the file.
Lustre includes several additional components, LNET and the MGS, described in the following sections.
FIGURE 1-1 Lustre components in a basic cluster
Lustre Networking (LNET) is an API that handles metadata and file I/O data for file system servers and clients. LNET supports multiple, heterogeneous interfaces on clients and servers. LNET interoperates with a variety of network transports through Network Abstraction Layers (NAL). Lustre Network Drivers (LNDs) are available for a number of commodity and high-end networks, including Infiniband, TCP/IP, Quadrics Elan, Myrinet (MX and GM) and Cray.
In clusters with a Lustre file system, servers and clients communicate with one another over a custom networking API known as Lustre Networking (LNET), while the disk storage behind the MDSs and OSSs is connected to these servers using traditional SAN technologies.
The MGS stores configuration information for all Lustre file systems in a cluster. Each Lustre target contacts the MGS to provide information, and Lustre clients contact the MGS to retrieve information. The MGS requires its own disk for storage. However, there is a provision that allows the MGS to share a disk ("co-locate") with a single MDT. The MGS is not considered "part" of an individual file system; it provides configuration information for all managed Lustre file systems to other Lustre components.
Lustre components work together as coordinated systems to manage file and directory operations in the file system (see FIGURE 1-2).
FIGURE 1-2 Lustre system interaction in a file system
The characteristics of the Lustre system include:
At scale, the Lustre cluster can include up to 1,000 OSSs and 100,000 clients (see FIGURE 1-3).
FIGURE 1-3 Lustre cluster at scale
Traditional UNIX disk file systems use inodes, which contain lists of block numbers where file data for the inode is stored. Similarly, for each file in a Lustre file system, one inode exists on the MDT. However, in Lustre, the inode on the MDT does not point to data blocks, but instead, points to one or more objects associated with the files. This is illustrated in FIGURE 1-4. These objects are implemented as files on the OST file systems and contain file data.
FIGURE 1-4 MDS inodes point to objects, ext3 inodes point to data
FIGURE 1-5 shows how a file open operation transfers the object pointers from the MDS to the client when a client opens the file, and how the client uses this information to perform I/O on the file, directly interacting with the OSS nodes where the objects are stored.
FIGURE 1-5 File open and file I/O in Lustre
If only one object is associated with an MDS inode, that object contains all of the data in that Lustre file. When more than one object is associated with a file, data in the file is "striped" across the objects.
The MDS knows the layout of each file, the number and location of the file's stripes. The clients obtain the file layout from the MDS. Client do I/O against the stripes of a file by communicating directly with the relevant OSTs.
The benefits of the Lustre arrangement are clear. The capacity of a Lustre file system equals the sum of the capacities of the storage targets. The aggregate bandwidth available in the file system equals the aggregate bandwidth offered by the OSSs to the targets. Both capacity and aggregate I/O bandwidth scale simply with the number of OSSs.
Striping allows parts of files to be stored on different OSTs, as shown in FIGURE 1-6. A RAID 0 pattern, in which data is "striped" across a certain number of objects, is used; the number of objects is called the stripe_count. Each object contains "chunks" of data. When the "chunk" being written to a particular object exceeds the stripe_size, the next "chunk" of data in the file is stored on the next target.
FIGURE 1-6 Files striped with a stripe count of 2 and 3 with different stripe sizes
File striping presents several benefits. One is that the maximum file size is not limited by the size of a single target. Lustre can stripe files over up to 160 targets, and each target can support a maximum disk use of 8 TB[3] by a file. This leads to a maximum disk use of 1.48 PB[4] by a file. Note that the maximum file size is much larger (2^64 bytes), but the file cannot have more than 1.48 PB2 of allocated data; hence a file larger than 1.48 PB2 must have many sparse sections. While a single file can only be striped over 160 targets, Lustre file systems have been built with almost 5000 targets, which is enough to support a 40 PB file system.
Another benefit of striped files is that the I/O bandwidth to a single file is the aggregate I/O bandwidth to the objects in a file and this can be as much as the bandwidth of up to 160 servers.
The storage attached to the servers is partitioned, optionally organized with logical volume management (LVM) and formatted as file systems. Lustre OSS and MDS servers read, write and modify data in the format imposed by these file systems.
Each OSS can manage multiple object storage targets (OSTs), one for each volume; I/O traffic is load-balanced against servers and targets. An OSS should also balance network bandwidth between the system network and attached storage to prevent network bottlenecks. Depending on the server's hardware, an OSS typically serves between 2 and 25 targets, with each target up to 8 terabytes (TBs) in size.
For the MDS nodes, storage must be attached for Lustre metadata, for which 1-2 percent of the file system capacity is needed. The data access pattern for MDS storage is different from the OSS storage: the former is a metadata access pattern with many seeks and read-and-writes of small amounts of data, while the latter is an I/O access pattern, which typically involves large data transfers.
High throughput to MDS storage is not important. Therefore, we recommend that a different storage type be used for the MDS (for example FC or SAS drives, which provide much lower seek times). Moreover, for low levels of I/O, RAID 5/6 patterns are not optimal, a RAID 0+1 pattern yields much better results.
Lustre uses journaling file system technology on the targets, and for a MDS, an approximately 20 percent performance gain can sometimes be obtained by placing the journal on a separate device. Typically, the MDS requires CPU power; we recommend at least four processor cores.
Lustre file system capacity is the sum of the capacities provided by the targets.
As an example, 64 OSSs, each with two 8-TB targets, provide a file system with a capacity of nearly 1 PB. If this system uses sixteen 1-TB SATA disks, it may be possible to get 50 MB/sec from each drive, providing up to 800 MB/sec of disk bandwidth. If this system is used as storage backend with a system network like InfiniBand that supports a similar bandwidth, then each OSS could provide 800 MB/sec of end-to-end I/O throughput. Note that the OSS must provide inbound and outbound bus throughput of 800 MB/sec simultaneously. The cluster could see aggregate I/O bandwidth of 64x800, or about 50 GB/sec. Although the architectural constraints described here are simple, in practice it takes careful hardware selection, benchmarking and integration to obtain such results.
In a Lustre file system, storage is only attached to server nodes, not to client nodes. If failover capability is desired, then this storage must be attached to multiple servers. In all cases, the use of storage area networks (SANs) with expensive switches can be avoided, because point-to-point connections between the servers and the storage arrays normally provide the simplest and best attachments.
Lustre file systems are easy to configure. First, the Lustre software is installed, and then MDT and OST partitions are formatted using the standard UNIX mkfs command. Next, the volumes carrying the Lustre file system targets are mounted on the server nodes as local file systems. Finally, the Lustre client systems are mounted (in a manner similar to NFS mounts).
The configuration commands listed below are for the Lustre cluster shown in FIGURE 1-7.
On the MDS (mds.your.org@tcp0):
mkfs.lustre --mdt --mgs --fsname=large-fs /dev/sda mount -t lustre /dev/sda /mnt/mdt
mkfs.lustre --ost --fsname=large-fs --mgsnode=mds.your.org@tcp0 /dev/sdb mount -t lustre /dev/sdb/mnt/ost1
mkfs.lustre --ost --fsname=large-fs --mgsnode=mds.your.org@tcp0 /dev/sdc mount -t lustre /dev/sdc/mnt/ost2
FIGURE 1-7 A simple Lustre cluster
In clusters with a Lustre file system, the system network connects the servers and the clients. The disk storage behind the MDSs and OSSs connects to these servers using traditional SAN technologies, but this SAN does not extend to the Lustre client system. Servers and clients communicate with one another over a custom networking API known as Lustre Networking (LNET). LNET interoperates with a variety of network transports through Network Abstraction Layers (NAL).
LNET includes LNDs to support many network type including:
The LNDs that support these networks are pluggable modules for the LNET software stack.
LNET offers extremely high performance. It is common to see end-to-end throughput over GigE networks in excess of 110 MB/sec, InfiniBand double data rate (DDR) links reach bandwidths up to 1.5 GB/sec, and 10GigE interfaces provide end-to-end bandwidth of over 1 GB/sec.
Lustre offers a robust, application-transparent failover mechanism that delivers call completion. This failover mechanism, in conjunction with software that offers interoperability between versions, is used to support rolling upgrades of file system software on active clusters.
The Lustre recovery feature allows servers to be upgraded without taking down the system. The server is simply taken offline, upgraded and restarted (or failed over to a standby server with the new software). All active jobs continue to run without failures, they merely experience a delay.
Lustre MDSs are configured as an active/passive pair, while OSSs are typically deployed in an active/active configuration that provides redundancy without extra overhead, as shown in FIGURE 1-8. Often the standby MDS is the active MDS for another Lustre file system, so no nodes are idle in the cluster.
FIGURE 1-8 Lustre failover configurations for OSSs and MDSs
Although a file system checking tool (lfsck) is provided for disaster recovery, journaling and sophisticated protocols re-synchronize the cluster within seconds, without the need for a lengthy fsck. Lustre version interoperability between successive minor versions is guaranteed. As a result, the Lustre failover capability is used regularly to upgrade the software without cluster downtime.
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.