Frequently Asked Questions (Old Wiki)

From Lustre Wiki
{| class='wikitable'
|-
!Note: This page originated on the old Lustre wiki. It was identified as likely having value and was migrated to the new wiki. It is in the process of being reviewed/updated and may currently have content that is out of date.
|}

See the [[Frequently_Asked_Questions]] page.


== Fundamentals ==

<small>''(Updated: Dec 2009)''</small>

'''Can you describe the data caching and cache coherency method?'''

Complete cache coherence is provided for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.

'''Does Lustre separate metadata and file data?'''

Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs). Note that unlike many block-based clustered filesystems, where the MDS is still in charge of block allocation, the Lustre MDS is ''not'' involved in file I/O in any manner and is not a source of contention for file I/O.

The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file's data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.

'''What is the difference between an OST and an OSS?'''

As the architecture has evolved, we have refined these terms.

An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces and usually one or more disks.

An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.

'''Is it common for a single OSS to export more than one OST?'''

Yes, for example to get around the Linux 2.6 maximum partition size. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB.

'''Does Lustre perform high-level I/O load balancing?'''

Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.

Objects are distributed amongst OSTs in a round-robin manner to ensure even load balancing across OSTs and OSS nodes. In Lustre 1.6 and later, if the OSTs are imbalanced in terms of space usage, the MDS will take this into account and allocate a larger fraction of files to OSTs with more free space.

'''Is there a common synchronized namespace for files and directories?'''

Yes. All clients that mount the file system will see a single, coherent, synchronized namespace at all times.

'''Can Lustre be used as part of a "single system image" installation?'''

Yes. Lustre as the ''root'' file system is being used by some installations on both clients and servers.

'''Do Lustre clients use NFS to reach the servers?'''

No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre's metadata, I/O, locking, recovery, or performance requirements.

'''Does Lustre use or provide a single security domain?'''

Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the MDS by a server-local, PAM-managed group database. Lustre supports Access Control Lists (ACLs). Strong security using Kerberos is being developed and will appear in a future release.

'''Does Lustre support the standard POSIX file system APIs?'''

Yes. Applications that use standard POSIX file system APIs can run on Lustre without modifications.

'''Is Lustre "POSIX compliant"? Are there any exceptions?'''

POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.

For example, the coherency of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results.

This is true of all I/O and metadata operations, with two exceptions:

* ''atime'' updates

: It is not practical to maintain fully coherent ''atime'' updates in a high-performance cluster file system. Lustre will update the ''atime'' of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an ''atime'' update if needed -- and when files are closed. Clients will refresh a file's ''atime'' whenever they read or write objects from that file from the OST(s), but will only do local ''atime'' updates for reads from cache.

* ''flock/lockf''

: POSIX and BSD ''flock/lockf'' system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not enabled by default today. It is possible to enable client-local ''flock'' locking with the ''-o localflock'' mount option, or cluster-wide locking with the ''-o flock'' mount option. If/when this becomes the default, it will also be possible to disable ''flock'' for a client with the ''-o noflock'' mount option.

Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of ''mmap'' I/O. In 1.4.1, changes were made to support cache-coherent ''mmap'' I/O and robust execution of binaries and shared libraries residing in Lustre. ''mmap()'' I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.

'''Can you grow/shrink file systems online?'''

Lustre 1.6 contains support for online addition of OST targets, either on a new or on an existing OSS. In an upcoming version of Lustre, the recently added support for online resizing of ''ext3'' volumes will provide an additional way of growing file systems. Shrinking is not supported.

'''Which disk file systems are supported as Lustre back-end file systems?'''

Lustre servers use a modified ''ext3'' file system as backend storage (see ''How is file data allocated on disk?'' in the Object Servers section below).

== Installation ==

<small>''(Updated: Dec 2009)''</small>

'''Which operating systems are supported as clients and servers?'''

Please see [[FAQ - OS_Support|OS Support]].

'''Can you use NFS or CIFS to reach a Lustre volume?'''

Yes. Any native Lustre client (running Linux today, by definition) can export a volume using NFS or Samba. Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.

Although NFS export works today, it is slower than native Lustre access, and does not provide the cache-coherent access that some applications depend on.

CIFS export with Samba, even from a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.

'''What is the typical MDS node configuration?'''

10,000-node clusters with moderate metadata loads are commonly supported with a 4-socket quad-core node with 32GB of RAM, providing sustained throughput of over 5,000 creates or deletes per second, and up to 20,000 getattrs per second. It is common for these systems to have roughly 200 million files. Even in 10,000-client clusters, the single MDS has been shown not to be a significant bottleneck under typical HPC loads. Having a low-latency network like InfiniBand plays a significant role in improved MDS performance.

High throughput with very large directories is possible with extra RAM and, optionally, solid-state disks. Typically, write I/O is low, but seek latency is very important; hence RAID-0+1 mirrored storage (RAID-0 striping of multiple RAID-1 mirrored disks) is strongly recommended for the MDS. Access to metadata from RAM can be 10x faster, so more RAM is usually beneficial.

MDS storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.

'''What is the typical OSS node configuration?'''

Multi-core 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual- or quad-socket and support up to 4 fibrechannel interfaces. RAM is used for locks and metadata caching, and in 1.8 and later extra server RAM will be used for file data caching, though it is not strictly required. Approximately 2 GB/OST is recommended for OSS nodes in 1.8, and twice that if OST failover will also run backup OSTs on the node.

'''Which architectures are interoperable?'''

Lustre requires the page size on server nodes (MDS and OSS) to be smaller than or the same size as on client nodes. Apart from this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.

'''Which storage devices are supported, on MDS and OSS nodes?'''

Servers support all block storage: SCSI, SATA, SAS, FC and exotic storage (SSD, NVRAM) are supported.

'''Which storage interconnects are supported?'''

Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.

For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.

'''Are fibrechannel switches necessary? How does HA shared storage work?'''

Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on an FC-AL. Shared SCSI and SAS devices will also work.

Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and can be configured to take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.

'''Can you put the file system journal on a separate device?'''

Yes. This can be configured when the backend ext3 file systems are created.

'''Can you run Lustre on LVM volumes, software RAID, etc?'''

Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.

'''Can you describe the installation process?'''

The current installation process is straightforward, but manual:

1. Install the provided kernel and Lustre RPMs.

2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.

3. Format and mount the OST and MDT filesystems. The command is usually identical on all nodes, so it's easy to use a utility like ''pdsh/prun'' to execute it.

4. Start the clients with ''mount'', similar to how NFS is mounted.

'''What is the estimated installation time per compute node?'''

Assuming that the node doesn't require special drivers or kernel configuration, 5 minutes: install the RPMs, then add a line for the file system to ''/etc/fstab''. Compute nodes can be installed and started in parallel.

'''What is the estimated installation time per I/O node?'''

5-30 minutes, plus formatting time, which can also be done in parallel. The per-node time depends on how easily the commands can be run in parallel on the I/O nodes. Install the RPMs, format the devices, add a line to ''/etc/fstab'' for the file system, and mount.

== Licensing and Support ==

<small>''(Updated: Dec 2009)''</small>
 
''' What is the licensing model for the Lustre file system for Linux? '''
 
The Lustre file system for Linux is an Open Source product.
 
New releases are available at the [http://downloads.whamcloud.com/ Lustre download site] to the general public, under the terms and conditions of the GNU GPL.
 
== Metadata Servers ==
 
<small>''(Updated: Dec 2009)''</small>
 
'''How many metadata servers does Lustre support?'''
 
Lustre 2.x introduced the Distributed Namespace Environment (DNE) feature, which permits dozens of metadata servers working in parallel for a single file system.
 
'''How does DNE work?'''
 
At a high level, it is reasonably simple: each directory can be striped over multiple metadata targets, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.
 
When you consider the details of doing this efficiently, coherently, and in a way that is completely recoverable in the face of any number of different failures, it becomes more complicated.
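As a sketch of the idea (this is not Lustre's actual hash function, and the helper name is hypothetical), every client can independently map a name to one of ''N'' metadata targets like this:

```python
import zlib

def mdt_index(name: str, stripe_count: int) -> int:
    """Map a directory entry name to one of `stripe_count` metadata
    targets (MDTs). Illustrative only: Lustre's real hash differs."""
    return zlib.crc32(name.encode("utf-8")) % stripe_count

# Every client computes the same MDT for the same name, so lookups
# and creates for that name always go to the same server.
targets = [mdt_index(n, 4) for n in ("alpha", "beta", "gamma")]
assert all(0 <= t < 4 for t in targets)
assert mdt_index("alpha", 4) == mdt_index("alpha", 4)  # deterministic
```

Because the mapping is deterministic, no central table lookup is needed on the client's path to the right server.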
 
'''Isn't the single metadata server a bottleneck?'''
 
Not so far. We regularly perform tests with single directories containing millions of files, and we have several customers with 10,000-node clusters (or larger) and a single metadata server.
 
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.
 
The Lustre metadata server software is extremely multithreaded, and we have made substantial efforts to reduce the lock contention on the MDS so that many clients can work concurrently.  These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.  Future versions of Lustre will continue to improve the metadata performance.
 
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SunFire x4600 or Bull NovaScale) with large memory spaces and solid-state disks which can utilize large caches to speed MDS operations.
 
 
'''How is metadata allocated on disk?'''
 
The standard way that Lustre formats the MDS file system is with 512-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.
 
'''What is intent locking?'''
 
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.
 
Consider the case of 1,000 clients all chdir'ing to ''/tmp'' and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.
 
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.
 
Consider another very common case, like a user's home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).
 
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.
 
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.
 
Lustre 1.x does not include a metadata writeback cache on the client, so today's metadata server always executes the operation on the client's behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.
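The shape of the protocol can be sketched as a toy model (all names here are hypothetical, and the real protocol is far more involved): a single RPC carries the whole operation, and the server chooses between executing it immediately or granting a writeback lock.

```python
def metadata_rpc(server, intent):
    """Toy model of an intent RPC: the request carries the entire
    operation ("create name X in dir D"), so the server can either
    execute it immediately and return just a result code, or grant
    the client a lock for writeback caching. (Per the text, Lustre
    1.x servers always take the execute-immediately branch.)"""
    if server["prefer_execute"]:
        server["namespace"].setdefault(intent["dir"], set()).add(intent["name"])
        return {"rc": 0, "lock_granted": False}   # one RPC, no lock ping-pong
    return {"rc": 0, "lock_granted": True}        # client may cache updates

mds = {"prefer_execute": True, "namespace": {}}
reply = metadata_rpc(mds, {"op": "create", "dir": "/tmp", "name": "out.0"})
assert reply == {"rc": 0, "lock_granted": False}
assert "out.0" in mds["namespace"]["/tmp"]
```

The point of the bundling is visible in the model: the create completes in one round trip regardless of which branch the server picks.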
 
''' How does the metadata locking protocol work?'''
 
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.
 
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file ''dir/file'', as would happen during the unpacking of a tar archive. The client would first lookup ''dir'', and in doing so take a lock to cache the result. It would then ask the metadata server to create "file" inside of it, which would lock ''dir'' to modify it, thus yanking the lock back from the client. This "ping-pong" back and forth is unnecessary and very inefficient, so Lustre 1.4.x introduced separate locking of different parts of the inode (simple existence, directory pages, and attributes).
 
''' Does the MDS do any pre-allocation?'''
 
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.
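The mechanism can be sketched as a pool of ready-made object IDs that is topped up before it runs dry (a toy model; real preallocation is per-OST and the refill is an asynchronous RPC):

```python
import itertools

class PrecreatedPool:
    """Toy model of object preallocation: the OST pre-creates object
    IDs in batches so that file creation needs no extra RPC."""
    def __init__(self, batch=32, low_water=8):
        self._ost_ids = itertools.count(1)   # stands in for the OST
        self.batch, self.low_water = batch, low_water
        self.pool = []
        self._refill()

    def _refill(self):
        # In Lustre this is an asynchronous request to the OST.
        self.pool.extend(next(self._ost_ids) for _ in range(self.batch))

    def allocate(self):
        obj = self.pool.pop(0)
        if len(self.pool) < self.low_water:
            self._refill()                   # replenish before running dry
        return obj

pool = PrecreatedPool()
stripes = [pool.allocate() for _ in range(100)]
assert len(set(stripes)) == 100              # every stripe gets a unique object
```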
 
== Networking ==
 
<small>''(Updated: Dec 2009)''</small>
 
'''Which interconnects and protocols are currently supported?'''

Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), OFED, Myrinet MX, and Cray's SeaStar networks. Other older networking technologies were also supported, but are virtually unused today and will be dropped in future releases.
 
The up-to-date list of supported versions for each network type is found at the beginning of the lnet/ChangeLog file and in each release announcement.
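For reference, a node selects its networks through an LNET module option. A typical example for a node with one ethernet and one InfiniBand interface might look like the following (interface and file names are illustrative; consult the manual for your release):

```
# /etc/modprobe.d/lustre.conf -- interface names are examples
options lnet networks="tcp0(eth0),o2ib0(ib0)"
```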
 
'''Can I use more than one interface of the same type on the same node?'''
 
Yes, with Lustre 1.4.6 and later.
 
'''Can I use two or more different interconnects on the same node?'''
 
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.
 
'''Can I use TCP offload cards?'''
 
Probably -- but we've tried many of these cards, and for various reasons we didn't see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.
 
Second, the problem isn't the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.
 
'''Does Lustre support crazy heterogeneous network topologies?'''
 
Yes, although the craziest of them are not yet fully supported.
 
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.
 
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and InfiniBand interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.
 
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system.  On the other hand, TCP/IP on GigE is the interconnect of choice for most organizations, which requires no additional Portals routing.
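In current releases this routing is configured through LNET. As a sketch (addresses and interface names are hypothetical), a router node and a TCP-only client might be configured along these lines:

```
# Router node: attached to both networks, with forwarding enabled
options lnet networks="o2ib0(ib0),tcp0(eth0)" forwarding="enabled"

# TCP-only client: reach the InfiniBand network via the router's TCP NID
options lnet networks="tcp0(eth0)" routes="o2ib0 192.168.1.1@tcp0"
```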
 
== OS Support ==
 
<small>''(Updated: Dec 2009)''</small>
 
 
== Object Servers and I/O Throughput ==
 
<small>''(Updated: Dec 2009)''</small>
 
'''What levels of throughput should I expect?'''
 
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application's I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, and not extremely small I/O requests), Lustre has demonstrated up to 90% of the system's raw I/O bandwidth capability.
 
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:
 
* TCP/IP
**Single-connected GigE: 115 MB/s
**Dual-NIC GigE on a 32-bit OSS: 180 MB/s
**Dual-NIC GigE on a 64-bit OSS: 220 MB/s
**Single-connected 10GigE on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest-based systems)
* Unoptimized InfiniBand
**Single-port SDR InfiniBand on a 64-bit OSS: 700-900 MB/s
**DDR InfiniBand on a 64-bit OSS: 1500 MB/s
 
'''How fast can a single OSS be?'''
 
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.
 
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.
 
'''How well does Lustre scale as OSSs are added?'''
 
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.
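The linearity claim is easy to sanity-check against the single-server numbers above: the aggregate divides back to roughly one saturated GigE port per OSS.

```python
# 104 OSSs, one GigE port each, 11.1 GB/s aggregate sustained bandwidth
aggregate_mb_s = 11.1 * 1000
per_oss = aggregate_mb_s / 104
# ~107 MB/s per OSS -- close to the 115 MB/s single-GigE figure above
assert 100 < per_oss < 115
```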
 
'''How many clients can each OSS support?'''
 
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [[FAQ - Sizing|File Sizing]].
 
'''What is a typical OSS node configuration?'''
 
Please see [[FAQ - Installation|Installation]].
 
'''How do I automate failover of my OSSs?'''
 
Please see [[FAQ - Recovery|Recovery]].
 
'''Do you plan to support OSS failover without shared storage?'''
 
Yes. Server Network Striping will add RAID-1 and RAID-5 file I/O features, providing redundancy and recoverability in the Lustre object protocol rather than requiring shared storage. In the meantime, some users have been trying DRBD to replicate the OST block device to a backup OSS instead of having shared storage, though this is not an officially supported configuration.
 
'''How is file data allocated on disk?'''
 
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like "934151", which are object numbers. Inside each object is a file's data, or a portion of that file's data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.
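The relationship between a file's logical offsets and these objects, for the RAID-0 striping described earlier, can be sketched as a simple calculation (the helper is hypothetical; in Lustre the stripe size and count come from the file's striping metadata):

```python
def locate(offset: int, stripe_size: int, stripe_count: int):
    """Map a file offset to (object index, offset within that object)
    for a RAID-0 striped file. Illustrative of the layout only."""
    stripe_number = offset // stripe_size        # which stripe overall
    obj_index = stripe_number % stripe_count     # which OST object
    obj_offset = ((stripe_number // stripe_count) * stripe_size
                  + offset % stripe_size)        # position inside that object
    return obj_index, obj_offset

# 1 MB stripes over 4 objects: byte 0 lands in object 0, byte 1 MB in
# object 1, and byte 4 MB wraps back to object 0 at its second stripe.
MB = 1 << 20
assert locate(0, MB, 4) == (0, 0)
assert locate(MB, MB, 4) == (1, 0)
assert locate(4 * MB, MB, 4) == (0, MB)
```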
 
The allocation of this file data to disk blocks is governed by ''ext3'', although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre's ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.
 
'''How does the object locking protocol work?'''
 
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:
 
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.
 
Second, it removes the so-called "split-brain" problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.
 
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.
 
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.
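A minimal model of the revocation sequence described above (all names are hypothetical; real Lustre locks carry modes, extents, and timeouts):

```python
class Server:
    """Stands in for an OST receiving flushed data."""
    def __init__(self):
        self.disk = {}
    def store(self, extent, data):
        self.disk[extent] = data

class Client:
    """Toy model of a Lustre client responding to a blocking callback."""
    def __init__(self):
        self.cache = {}        # extent -> dirty data held under a lock
        self.locks = set()

    def take_lock(self, extent):
        self.locks.add(extent)

    def write(self, extent, data):
        assert extent in self.locks, "must hold a lock before caching data"
        self.cache[extent] = data

    def on_blocking_callback(self, extent, server):
        # Another client requested a conflicting lock: flush dirty data
        # back to the OST, drop it from cache, then release the lock.
        if extent in self.cache:
            server.store(extent, self.cache.pop(extent))
        self.locks.discard(extent)

ost, client = Server(), Client()
client.take_lock((0, 4096))
client.write((0, 4096), b"checkpoint")
client.on_blocking_callback((0, 4096), ost)
assert ost.disk[(0, 4096)] == b"checkpoint" and not client.locks
```

The ordering is the important part: data is flushed and purged before the lock is dropped, so no cached data ever exists without a covering lock.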
 
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object's attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing ("ls -l") in the output directory while the job is writing its data.
 
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don't actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.
 
'''Does Lustre support Direct I/O?'''
 
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O, but does not cache the data on the client or server.  With an RDMA-capable network (anything other than TCP) there is only a single data copy directly from the client RAM to the server RAM and straight to the disk.
 
'''Can these locks be disabled?'''
 
Yes, but:
 
* It's only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.
 
'''Do you plan to support T-10 object devices?'''
 
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.

'''Does Lustre support/require special parallel I/O libraries?'''
 
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries, when the IO size is sufficient.
 
A Lustre-specific MPI/IO ADIO driver has been developed to allow an application to provide hints about how it would like its output files to be striped, and to optimize the IO pattern when many clients are doing small read or write operations to a single file.
 
== Recovery ==
 
<small>''(Updated: Dec 2009)''</small>
 
'''How do I configure failover services?'''
 
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure. This does not typically require a Fibre Channel switch.
 
'''How do I automate failover of my MDSs/OSSs?'''
 
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat's Cluster Manager and SuSE's Heartbeat).
 
Completely automated failover also requires some kind of programmatically controllable power switch, because the new "active" MDS must be able to completely power off the failed node. Otherwise, there is a chance that the "dead" node could wake up, start using the disk at the same time, and cause massive corruption.
 
''' How necessary is failover, really?'''
 
The answer depends on how close to 100% uptime you need to achieve. Failover doesn't protect against the failure of individual disks -- that is handled by software or hardware RAID at the OST and MDT level. Lustre failover handles the failure of an entire MDS or OSS node which, in our experience, is not very common.
 
We would suggest that simple RAID-5 or RAID-6 storage is sufficient for most users, with manual restart of failed OSS and MDS nodes, but that the most important production systems should consider failover.
 
'''I don't need failover, and don't want shared storage. How will this work?'''
 
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.
 
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.
 
'''If a node suffers a connection failure, will the node select an alternate route for recovery?'''
 
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.
 
'''What are the supported hardware methods for HBA, switch, and controller failover?'''
 
These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. Applications will see a delay while failover and recovery is in progress, but system calls complete without errors.
 
'''Can you describe an example failure scenario, and its resolution?'''
 
Although failures are becoming rarer, it is more likely that a node will hang or time out than crash. If a client node hangs or crashes, other client and server nodes are usually unaffected; such a client is normally rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, causing only a short delay to applications that try to use that node. Other server nodes and clients are not usually affected.
 
''' How are power failures, disk or RAID controller failures, etc. addressed?'''
 
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.
 
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device's caches, Lustre requires a file system repair. Lustre's repair tools will reliably fix any damage they can. Repair can run in parallel on all nodes, but can still be very time consuming for large file systems.
 
== Release Testing and Upgrading ==
 
<small>''(Updated: Dec 2009)''</small>
 
''' How does the Lustre group fix issues? '''
 
The Lustre group approaches bug tracking and fixing seriously and methodically:
 
* ''Regression testing:'' A test is written to reproduce the problem, which is added to the ongoing test suite.
* ''Architecture and design:'' Depending on the severity or invasiveness of the change, an update to the architecture description may be written and reviewed by senior architects. A detailed design description for the patch is written and reviewed by principal engineers.
* ''Implementation:'' Fixes are implemented according to the design description and added to a bug for review and inspection.
* ''Review and Inspection:'' A developer or development team will review the code first and then submit it for a methodical inspection by senior and principal engineers.
* ''Testing:'' The developer runs a small suite of tests before the code leaves his or her desk. Then it's added to a branch for regression testing and final release testing.
 
This process can be tracked closely via [http://bugzilla.lustre.org/ Lustre Bugzilla].
 
''' What testing does each version undergo prior to release? '''
 
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.
 
'''Are Lustre releases backwards and forward compatible on the disk? On the wire?'''
 
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems.  After the format change, new versions continue to support the old formats for some time and transparently update disk structures when old versions are encountered.
 
Support for running with older protocols is removed in every second major release, so the 1.4.x release will not be interoperable with the 1.8.x release. Similarly, the 1.6.x release will not be able to interoperate with the 2.0.x release.
 
The Lustre group always tests release 1.x.y with the preceding 1.x.(y-1) release, and with the older 1.(x-2).latest version when making a release.  It isn't possible to exhaustively test all release combinations.  As a result, we cannot guarantee that all releases are fully interoperable, even though we strive through design and code review to ensure that any 1.x release will work with any 1.(x-2) release.
 
''' Do you have to reboot to upgrade? '''
 
Not unless you upgrade your kernel. It's usually a simple matter of unmounting the file system or stopping the server as the case may be, installing the new RPMs, and restarting it.
 
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.
 
== Sizing ==
 
<small>''(Updated: Dec 2009)''</small>
 
''' What is the maximum file system size? What is the largest file system you've tested? '''
 
Each backend OST file system is restricted to a maximum of 8 TB on Linux 2.6 (imposed by ''ext3'').
Filesystems that are based on ''ext4'' (SLES11) will soon be able to handle single OSTs of up to 16TB.
Of course, it is possible to have multiple OST backends on a single OSS, and to aggregate multiple OSSs within a single Lustre file system. Tests have been run with almost 4,000 smaller OSTs, so 32 PB or 64 PB file systems could be achieved today.
 
Lustre users already run single production file systems of over 10PB, using over 1300 8TB OSTs.
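The aggregation arithmetic above is easy to check. A minimal sketch, using binary units and the per-OST limits quoted in the text (the function name is illustrative):

```python
# Back-of-the-envelope Lustre capacity math from the figures above.
TB = 1024**4  # terabyte (binary)
PB = 1024**5  # petabyte (binary)

def fs_capacity(num_osts, ost_size_tb):
    """Total file system capacity: OST capacities simply add up."""
    return num_osts * ost_size_tb * TB

# ~4000 OSTs at the 8 TB (ext3) and 16 TB (ext4) per-OST limits:
print(fs_capacity(4000, 8) / PB)   # 31.25 -> the "32 PB" class
print(fs_capacity(4000, 16) / PB)  # 62.5  -> the "64 PB" class

# The >10 PB production system: over 1300 OSTs of 8 TB each.
print(fs_capacity(1300, 8) / PB)
```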
 
''' What is the maximum file system block size? '''
 
The basic ''ext3'' block size is 4096 bytes, although this could in principle be easily changed to as large as PAGE_SIZE (on IA64 or PPC, for example) in ext4. It is not clear, however, that this is necessary.
 
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ''ldiskfs'' extents and ''mballoc'' features used by Lustre do a good job of allocating I/O aligned and contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is usually 1MB (aligned to the start of the LUN) and can be further aggregated by the disk elevator or RAID controller.
 
''' What is the maximum single-file size? '''
 
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64-bit clients, the maximum file size is 2^63 bytes. A current Lustre limit for allocated file space arises from a maximum of 160 stripes and 2 TB per object on current ''ldiskfs'' file systems, leading to a limit of 320 TB per file.
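The allocated-space limit follows directly from the stripe count and per-object size given above; a quick illustrative check:

```python
# Illustrative arithmetic for the per-file limits quoted above.
TB = 1024**4

max_stripes = 160       # ldiskfs stripe-count limit (from the text)
max_object_tb = 2       # 2 TB per backing object (from the text)

max_file_tb = max_stripes * max_object_tb
print(max_file_tb)      # 320 -> the 320 TB allocated-space limit

# The 64-bit client VFS limit of 2^63 bytes is far larger:
print(2**63 / TB)
```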
 
''' What is the maximum number of files in a single file system? In a single directory? '''
 
We use the ''ext3'' hashed directory code, which has a theoretical limit of about 15 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories in a single directory is the same as the file limit, about 15 million.
 
We regularly run tests with ten million files in a single directory. On a properly configured quad-socket MDS with 32 GB of RAM, it is possible to do random lookups in this directory at a rate of 20,000 per second.
 
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size/4kB, so about 2 billion inodes for a 8 TB MDS file system.  This can be increased at initial file system creation time by specifying ''mkfs'' options to increase the number of inodes.  Production file systems containing over 300 million files exist.
 
With the introduction of clustered metadata servers (Lustre 2.x) and with ZFS-based MDTs, these limits will disappear.
 
''' How many OSSs do I need? '''
 
The short answer is: As many as you need to achieve the required aggregate I/O throughput.

The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-GigE-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.

Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.

''' Why did Lustre choose ''ext3''? Do you ever plan to support others?'''

There are many reasons to choose ''ext3''. One is size; at about 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ''ext3'' is proven stable by millions of users, with an excellent file system repair tool.

When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last few years, however, the Lustre team has carried ''ext3'' substantially forward, and it is now competitive with other Linux file systems. Most of the changes made to ''ext3'' for improving Lustre performance have been included into the upstream ''ext4'' filesystem, reducing the number and size of patches in ''ldiskfs'' dramatically.

Lustre includes a patched version of the ''ext3'' file system, called ''ldiskfs'', with additional features such as extents, an efficient multi-block allocator, ''htree'' directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. In newer kernels that support the ''ext4'' file system, this will be used instead of ''ext3''. Work is underway to use the Solaris ZFS file system to increase the scalability and robustness of the back-end file system.

Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. In future Lustre releases, we will support ZFS as the backing file system for both OSTs and MDTs.

''' What is the largest possible I/O request? '''

When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a ''read()'' or ''write()'' system call. In principle this is limited only by the address space on the client, and we have tested single ''read()'' and ''write()'' system calls up to 1GB in size, so it has not been an issue in reality.

The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it's important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.

Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests "in flight" at a time, per server. There is still a per-syscall overhead for locking and such, so using 1MB or larger read/write requests will minimize this overhead.

On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. In newer kernels there is less need to modify the kernel, though the block device tunables are still set for low-latency desktop workloads, and need to be tuned for high-bandwidth IO.

''' Why didn't you use IBM's distributed lock manager?'''

The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM's DLM), experience thus far has seemed to indicate that we've made the correct choice: it's smaller, simpler and, at least for our needs, more extensible.

The Lustre DLM, at around 6,000 lines of code, has proven to be an overseeable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however. To its credit, it is a complete DLM which implements many features which we do not require in Lustre.

In particular, Lustre's DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).

''' How many nodes can connect to a single Lustre file system? '''

The largest single production Lustre installation is approximately 26,000 nodes today (2009). This is the site-wide file system for a handful of different supercomputers.

Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many more clients.

''' Are services at user or kernel level? How do they communicate?'''

All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.
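The earlier answer about the largest possible I/O request noted that clients keep 5 to 10 one-megabyte RPCs in flight per server. Why that matters can be sketched as a simple pipelining calculation; the function name and the round-trip latency below are illustrative assumptions, not Lustre parameters:

```python
# Rough sketch: to keep a pipeline full, the data outstanding on the
# wire must cover the bandwidth-delay product. With a fixed RPC size,
# throughput is capped by how many RPCs are in flight per server.

def sustainable_mb_per_s(rpcs_in_flight, rpc_size_mb, round_trip_s):
    """Throughput ceiling for one client pipelining RPCs to one server."""
    return rpcs_in_flight * rpc_size_mb / round_trip_s

# Example: 8 one-megabyte RPCs in flight over a 10 ms round trip
# gives an 800 MB/s ceiling; one RPC at a time would cap at 100 MB/s.
print(sustainable_mb_per_s(8, 1, 0.010))
print(sustainable_mb_per_s(1, 1, 0.010))
```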
[[Category: NeedsReview]]


== Installation ==

<small>''(Updated: Dec 2009)''</small>

Which operating systems are supported as clients and servers?

Please see OS Support.

Can you use NFS or CIFS to reach a Lustre volume?

Yes. Any native Lustre client (running Linux today, by definition) can export a volume using NFS or Samba. Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.

Although NFS export works today, it is slower than native Lustre access, and does not have cache coherent access that some applications depend on.

CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.

What is the typical MDS node configuration?

10,000-node clusters with moderate metadata loads are commonly supported with a 4-socket quad-core node with 32GB of RAM, providing sustained throughput of over 5,000 creates or deletes per second, and up to 20,000 getattr per second. It is common for these systems to have roughly 200 million files. Even in 10,000-client clusters, the single MDS has been shown not to be a significant bottleneck under typical HPC loads. Having a low-latency network like InfiniBand plays a significant role in improved MDS performance.

High throughput with very large directories is possible with extra RAM and, optionally, solid-state disks. Typically, write I/O is low, but seek latency is very important, hence RAID-0+1 mirrored storage (RAID-0 striping of multiple RAID-1 mirrored disks) is strongly recommended for the MDS. Access to metadata from RAM can be 10x faster, so more RAM is usually beneficial.

MDS storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.
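That 4 kB-per-file rule of thumb makes MDT sizing a one-line calculation; a minimal sketch (the function name is illustrative):

```python
# MDT sizing sketch using the ~4 kB-per-file rule of thumb above.
KB, TB = 1024, 1024**4

def mdt_bytes_needed(num_files, bytes_per_file=4 * KB):
    """Approximate MDT storage for a given file count."""
    return num_files * bytes_per_file

# 200 million files (the figure quoted for typical large systems)
# needs roughly three quarters of a terabyte of MDT space:
print(mdt_bytes_needed(200_000_000) / TB)
```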

What is the typical OSS node configuration?

Multi-core 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual- or quad-socket and support up to 4 fibrechannel interfaces. RAM is used for locks and metadata caching, and in 1.8 and later extra server RAM will be used for file data caching, though it is not strictly required. Approximately 2 GB per OST is recommended for OSS nodes in 1.8, and twice that if OST failover will also run backup OSTs on the node.

Which architectures are interoperable?

Lustre requires the page size on server nodes (MDS and OSS) to be smaller or the same size as client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.

Which storage devices are supported, on MDS and OSS nodes?

Servers support all block storage: SCSI, SATA, SAS, FC, and exotic devices (SSD, NVRAM).

Which storage interconnects are supported?

Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.

For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.

Are fibrechannel switches necessary? How does HA shared storage work?

Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and SAS devices will also work.

Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and can be configured to take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.

Can you put the file system journal on a separate device?

Yes. This can be configured when the backend ext3 file systems are created.

Can you run Lustre on LVM volumes, software RAID, etc?

Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.

Can you describe the installation process?

The current installation process is straightforward, but manual:

1. Install the provided kernel and Lustre RPMs.

2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.

3. Format and mount the OST and MDT filesystems. The command is usually identical on all nodes, so it's easy to use a utility like pdsh/prun to execute it.

4. Start the clients with mount, similar to how NFS is mounted.

What is the estimated installation time per compute node?

Assuming the node doesn't require special drivers or kernel configuration, about 5 minutes: install the RPMs and add a line for the file system to /etc/fstab. Compute nodes can be installed and started in parallel.

What is the estimated installation time per I/O node?

5-30 minutes, plus formatting time, which can also be done in parallel. The per-node time depends on how easily the commands can be run in parallel on the I/O nodes. Install the RPMs, format the devices, add a line to /etc/fstab for the file system, mount.

== Licensing and Support ==

<small>''(Updated: Dec 2009)''</small>

What is the licensing model for the Lustre file system for Linux?

The Lustre file system for Linux is an Open Source product.

New releases are available at the Lustre download site to the general public, under the terms and conditions of the GNU GPL.

== Metadata Servers ==

<small>''(Updated: Dec 2009)''</small>

How many metadata servers does Lustre support?

Lustre 2.x introduced the Distributed Namespace Environment (DNE) feature, which permits dozens of metadata servers working in parallel for a single file system.

How does DNE work?

At a high level, it is reasonably simple: each directory can be striped over multiple metadata targets, each of which contains a disjoint portion of the namespace. When a client wants to look up or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.

When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.
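As a rough illustration of the hashing idea (a toy sketch only -- this is not Lustre's actual hash function or protocol):

```python
# Toy sketch of hash-based namespace distribution: every client hashes
# a directory entry name to pick which metadata target (MDT) owns it,
# so no lookup table or coordination is needed for placement.
import zlib

def mdt_for_name(name, num_mdts):
    """Map a directory entry name to one of num_mdts servers."""
    return zlib.crc32(name.encode()) % num_mdts

# Every client computes the same mapping for the same name:
names = ["alpha.dat", "beta.dat", "gamma.dat"]
print([mdt_for_name(n, 4) for n in names])
```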

Isn't the single metadata server a bottleneck?

Not so far. We regularly perform tests with single directories containing millions of files, and we have several customers with 10,000-node clusters (or larger) and a single metadata server.

Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.

The Lustre metadata server software is extremely multithreaded, and we have made substantial efforts to reduce the lock contention on the MDS so that many clients can work concurrently. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second. Future versions of Lustre will continue to improve the metadata performance.

If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SunFire x4600 or Bull NovaScale) with large memory spaces and solid-state disks which can utilize large caches to speed MDS operations.


How is metadata allocated on disk?

The standard way that Lustre formats the MDS file system is with 512-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.

What is intent locking?

Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.

Consider the case of 1,000 clients all chdir'ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.

This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.

Consider another very common case, like a user's home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).

The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.

What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.

Lustre 1.x does not include a metadata writeback cache on the client, so today's metadata server always executes the operation on the client's behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.

How does the metadata locking protocol work?

Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.

There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file dir/file, as would happen during the unpacking of a tar archive. The client would first lookup dir, and in doing so take a lock to cache the result. It would then ask the metadata server to create "file" inside of it, which would lock dir to modify it, thus yanking the lock back from the client. This "ping-pong" back and forth is unnecessary and very inefficient, so Lustre 1.4.x introduced separate locking of different parts of the inode (simple existence, directory pages, and attributes).

Does the MDS do any pre-allocation?

Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.
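The pre-creation idea can be sketched as a pool that is drawn down on the fast path and topped up in the background. Everything here (class and method names, batch sizes) is an illustrative assumption, not Lustre's implementation:

```python
# Sketch of object pre-creation: the MDS assigns file stripes from a
# pool of objects the OST created in advance, avoiding an extra RPC
# on the file-creation fast path.
from collections import deque

class OstPool:
    def __init__(self, precreate_batch=32, low_water=8):
        self.pool = deque()
        self.batch = precreate_batch
        self.low_water = low_water
        self.next_id = 0
        self.replenish()            # initial pre-creation by the OST

    def replenish(self):
        # In Lustre this happens asynchronously via an RPC to the OST.
        for _ in range(self.batch):
            self.pool.append(self.next_id)
            self.next_id += 1

    def allocate_stripe(self):
        obj = self.pool.popleft()   # no OST round trip on the fast path
        if len(self.pool) < self.low_water:
            self.replenish()        # top the pool back up
        return obj

pool = OstPool()
ids = [pool.allocate_stripe() for _ in range(40)]
print(ids[:5], len(pool.pool))
```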

== Networking ==

<small>''(Updated: Dec 2009)''</small>

Which interconnects and protocols are currently supported?

Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), OFED, Myrinet MX, and Cray's Seastar networks. Other older networking technologies were also supported, but are virtually unused today and will be dropped in future releases.

The up-to-date versions of each network type are at the beginning of the lnet/ChangeLog file and each release announcement.

Can I use more than one interface of the same type on the same node?

Yes, with Lustre 1.4.6 and later.

Can I use two or more different interconnects on the same node?

Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.

Can I use TCP offload cards?

Probably -- but we've tried many of these cards, and for various reasons we didn't see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.

Second, the problem isn't the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.

Does Lustre support crazy heterogeneous network topologies?

Yes, although the craziest of them are not yet fully supported.

Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.

Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and InfiniBand interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.

These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on GigE is the interconnect of choice for most organizations, which requires no additional Portals routing.


== Object Servers and I/O Throughput ==

<small>''(Updated: Dec 2009)''</small>

'''What levels of throughput should I expect?'''

This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application's I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound and not dominated by extremely small requests), Lustre has demonstrated up to 90% of the system's raw I/O bandwidth capability.

With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:

* TCP/IP
** Single-connected GigE: 115 MB/s
** Dual-NIC GigE on a 32-bit OSS: 180 MB/s
** Dual-NIC GigE on a 64-bit OSS: 220 MB/s
** Single-connected 10GigE on a 64-bit OSS: 550 MB/s; 1 GB/s on Woodcrest
* Unoptimized InfiniBand
** Single-port SDR InfiniBand on a 64-bit OSS: 700-900 MB/s
** DDR InfiniBand on a 64-bit OSS: 1500 MB/s

'''How fast can a single OSS be?'''

Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.

Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.

'''How well does Lustre scale as OSSs are added?'''

Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.

'''How many clients can each OSS support?'''

The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see File Sizing.

'''What is a typical OSS node configuration?'''

Please see Installation.

'''How do I automate failover of my OSSs?'''

Please see Recovery.

'''Do you plan to support OSS failover without shared storage?'''

Yes. Server Network Striping will provide RAID-1 and RAID-5 file I/O, implementing redundancy and recoverability in the Lustre object protocol rather than requiring shared storage. In the meantime, some users have been trying DRBD to replicate the OST block device to a backup OSS instead of using shared storage, though this is not an officially supported configuration.

'''How is file data allocated on disk?'''

Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like "934151", which are object numbers. Inside each object is a file's data, or a portion of that file's data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.

The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre's ext3 manages file data as extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can quickly return very large contiguous disk extents without extensive searching.
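
As a rough illustration of how RAID-0 striping maps a file's bytes onto these per-OST objects, here is a hypothetical sketch (the function name, stripe size, and stripe count are invented for illustration; the real layout logic lives inside Lustre):

```python
STRIPE_SIZE = 1 << 20   # 1 MB stripes, a common default
STRIPE_COUNT = 4        # file striped across 4 OST objects

def stripe_of(offset, stripe_size=STRIPE_SIZE, stripe_count=STRIPE_COUNT):
    """Map a logical file offset to (object index, offset within object)
    under simple round-robin (RAID-0) striping. Illustrative only."""
    stripe_index = offset // stripe_size           # which stripe-sized chunk
    obj = stripe_index % stripe_count              # round-robin over objects
    obj_offset = (stripe_index // stripe_count) * stripe_size \
        + offset % stripe_size
    return obj, obj_offset
```

For example, with these parameters byte 5 MB + 10 of a file lands in object 1, at offset 1 MB + 10 within that object.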

'''How does the object locking protocol work?'''

Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:

First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.

Second, it removes the so-called "split-brain" problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.

In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.

If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.
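
A toy sketch of this callback flow may help (all class and method names here are invented; in real Lustre the lock manager is the LDLM, and the "flush" is an actual network write-back of dirty pages):

```python
class Client:
    def __init__(self, name):
        self.name = name
        self.cache = {}     # extent -> cached (possibly dirty) data
        self.locks = set()  # extents this client holds locks on

    def blocking_callback(self, extent):
        """Server asks us to drop a conflicting lock: flush, purge, drop."""
        self.cache.pop(extent, None)   # write back dirty data, discard cache
        self.locks.discard(extent)

class OSTLockServer:
    """Each OST runs its own lock server for the stripes it stores."""
    def __init__(self):
        self.holders = {}   # extent -> current lock holder

    def request_lock(self, client, extent):
        holder = self.holders.get(extent)
        if holder is not None and holder is not client:
            holder.blocking_callback(extent)   # conflict: recall the lock
        self.holders[extent] = client
        client.locks.add(extent)
```

In the real protocol the callback is asynchronous, and a holder that fails to respond within the timeout is evicted as described above.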

Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object's attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing ("ls -l") in the output directory while the job is writing its data.

Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don't actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.

'''Does Lustre support Direct I/O?'''

Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O, but does not cache the data on the client or server. With an RDMA-capable network (anything other than TCP) there is only a single data copy directly from the client RAM to the server RAM and straight to the disk.

'''Can these locks be disabled?'''

Yes, but:

* It's only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.

'''Do you plan to support T-10 object devices?'''

We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.

'''Does Lustre support/require special parallel I/O libraries?'''

Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries, when the IO size is sufficient.

A Lustre-specific MPI/IO ADIO driver has been developed to allow an application to provide hints about how it would like its output files to be striped, and to optimize the IO pattern when many clients are doing small read or write operations to a single file.

== Recovery ==

<small>''(Updated: Dec 2009)''</small>

'''How do I configure failover services?'''

Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure. This does not typically require a Fibre Channel switch.

'''How do I automate failover of my MDSs/OSSs?'''

The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat's Cluster Manager and SuSE's Heartbeat).

Completely automated failover also requires some kind of programmatically controllable power switch, because the new "active" MDS must be able to completely power off the failed node. Otherwise, there is a chance that the "dead" node could wake up, start using the disk at the same time, and cause massive corruption.

'''How necessary is failover, really?'''

The answer depends on how close to 100% uptime you need to achieve. Failover doesn't protect against the failure of individual disks -- that is handled by software or hardware RAID at the OST and MDT level. Lustre failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.

We would suggest that simple RAID-5 or RAID-6 storage is sufficient for most users, with manual restart of failed OSS and MDS nodes, but that the most important production systems should consider failover.

'''I don't need failover, and don't want shared storage. How will this work?'''

If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.

When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.

'''If a node suffers a connection failure, will the node select an alternate route for recovery?'''

Yes. If a node has multiple network paths, and one fails, it can continue to use the others.

'''What are the supported hardware methods for HBA, switch, and controller failover?'''

These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. Applications will see a delay while failover and recovery is in progress, but system calls complete without errors.

'''Can you describe an example failure scenario, and its resolution?'''

Although failures are becoming rarer, it is more likely that a node will hang or time out than crash. If a client node hangs or crashes, the other client and server nodes are usually unaffected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, causing only a short delay to applications that try to use that node. Other server nodes and clients are not usually affected.

'''How are power failures, disk or RAID controller failures, etc. addressed?'''

If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.

If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device's caches, Lustre requires a file system repair. Lustre's repair tools will fix any damage they can; they run in parallel on all nodes, but repair can still be very time consuming for large file systems.

== Release Testing and Upgrading ==

<small>''(Updated: Dec 2009)''</small>

'''How does the Lustre group fix issues?'''

The Lustre group approaches bug tracking and fixing seriously and methodically:

* Regression testing: A test is written to reproduce the problem, and is added to the ongoing test suite.
* Architecture and design: If the fix requires an update to the architecture description, that update may be written and reviewed by senior architects, depending on how severe or invasive it is. A detailed design description for the patch is written and reviewed by principal engineers.
* Implementation: Fixes are implemented according to the design description and added to a bug for review and inspection.
* Review and inspection: A developer or development team reviews the code first and then submits it for a methodical inspection by senior and principal engineers.
* Testing: The developer runs a small suite of tests before the code leaves his or her desk. Then it is added to a branch for regression testing and final release testing.

This process can be tracked closely via Lustre Bugzilla.

'''What testing does each version undergo prior to release?'''

Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.

'''Are Lustre releases backwards and forward compatible on the disk? On the wire?'''

Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time and transparently update disk structures when old versions are encountered.

Support for running with older protocols is removed in every second major release, so the 1.4.x releases will not interoperate with the 1.8.x releases. Similarly, the 1.6.x releases will not interoperate with the 2.0.x releases.

The Lustre group always tests release 1.x.y with the preceding 1.x.(y-1) release, and with the older 1.(x-2).latest version when making a release. It isn't possible to exhaustively test all release combinations. As a result, we cannot guarantee that all releases are fully interoperable, even though we strive through design and code review to ensure that any 1.x release will work with any 1.(x-2) release.

'''Do you have to reboot to upgrade?'''

Not unless you upgrade your kernel. It's usually a simple matter of unmounting the file system or stopping the server as the case may be, installing the new RPMs, and restarting it.

Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.

== Sizing ==

<small>''(Updated: Dec 2009)''</small>

'''What is the maximum file system size? What is the largest file system you've tested?'''

Each backend OST file system is restricted to a maximum of 8 TB on Linux 2.6 (a limit imposed by ext3). File systems based on ext4 (SLES 11) will soon be able to handle single OSTs of up to 16 TB. Of course, it is possible to have multiple OST backends on a single OSS, and to aggregate multiple OSSs within a single Lustre file system. Tests have been run with almost 4,000 smaller OSTs, so 32 PB or 64 PB file systems could be achieved today.
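
The arithmetic behind those figures can be sketched as follows, treating prefixes loosely in decimal as the text does (the helper name is invented; the numbers are the FAQ's own, not guarantees):

```python
def fs_capacity_pb(n_osts, ost_tb):
    """Aggregate file system capacity in PB, using 1 PB = 1000 TB."""
    return n_osts * ost_tb / 1000

# ~4000 OSTs at 8 TB each gives ~32 PB; 16 TB OSTs (ext4) gives ~64 PB.
```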

Lustre users already run single production file systems of over 10PB, using over 1300 8TB OSTs.

'''What is the maximum file system block size?'''

The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to as large as PAGE_SIZE (on IA64 or PPC, for example) in ext4. It is not clear, however, that this is necessary.

Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ldiskfs extents and mballoc features used by Lustre do a good job of allocating I/O aligned and contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is usually 1MB (aligned to the start of the LUN) and can be further aggregated by the disk elevator or RAID controller.

'''What is the maximum single-file size?'''

On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64-bit clients, the maximum file size is 2^63 bytes. A current Lustre limit on allocated file space arises from a maximum of 160 stripes and 2 TB per object on current ldiskfs file systems, leading to a limit of 320 TB per file.

'''What is the maximum number of files in a single file system? In a single directory?'''

We use the ext3 hashed directory code, which has a theoretical limit of about 15 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories in a single directory is the same as the file limit, about 15 million.

We regularly run tests with ten million files in a single directory. On a properly-configured quad-socket MDS with 32 GB of ram, it is possible to do random lookups in this directory at a rate of 20,000/second.

A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size/4 kB, so about 2 billion inodes for an 8 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options to increase the number of inodes. Production file systems containing over 300 million files exist.
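
The default inode count works out as sketched below (the 4 kB of device space per inode is the default ratio cited above; the function name is invented for illustration):

```python
def default_mds_inodes(device_bytes, bytes_per_inode=4096):
    """Approximate default inode limit for an MDS file system:
    roughly one inode per 4 kB of device space."""
    return device_bytes // bytes_per_inode

TB = 2 ** 40
# An 8 TB MDT yields about 2**31, i.e. ~2 billion, inodes by default.
```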

With the introduction of clustered metadata servers (Lustre 2.x) and with ZFS-based MDTs, these limits will disappear.

'''How many OSSs do I need?'''

The short answer is: As many as you need to achieve the required aggregate I/O throughput.

The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-GigE-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.
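
A back-of-the-envelope helper for this sizing exercise (purely illustrative; real planning must also account for RAID overhead, network limits, and failover pairing):

```python
import math

def osses_needed(target_gb_per_s, per_oss_mb_per_s):
    """How many OSSs are needed to reach a target aggregate
    bandwidth, using 1 GB = 1000 MB as in the text."""
    return math.ceil(target_gb_per_s * 1000 / per_oss_mb_per_s)
```

With the FAQ's numbers: 10 GB/s from 100 MB/s GigE nodes needs 100 OSSs, while 2.5 GB/s heavy-duty servers bring that down to 4.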

Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.

'''What is the largest possible I/O request?'''

When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, and we have tested single read() and write() system calls up to 1GB in size, so it has not been an issue in reality.

The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it's important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.

Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests "in flight" at a time, per server. There is still a per-syscall overhead for locking and such, so using 1MB or larger read/write requests will minimize this overhead.
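
For intuition, here is a sketch of how a large write breaks into wire RPCs (the 1 MB RPC size comes from the text; the in-flight cap of 8 is one value within the 5-10 range quoted above, and the function name is invented):

```python
import math

def rpc_schedule(write_bytes, rpc_bytes=1 << 20, max_in_flight=8):
    """Number of RPCs a write generates toward one server, and how
    many 'waves' of concurrent RPCs are needed at the in-flight cap."""
    n_rpcs = math.ceil(write_bytes / rpc_bytes)
    waves = math.ceil(n_rpcs / max_in_flight)
    return n_rpcs, waves
```

A 10 MB write thus becomes ten 1 MB RPCs, sent in two waves of at most eight concurrent requests.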

On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. In newer kernels there is less need to modify the kernel, though the block device tunables are still set for low-latency desktop workloads, and need to be tuned for high-bandwidth IO.

'''How many nodes can connect to a single Lustre file system?'''

The largest single production Lustre installation is approximately 26,000 nodes today (2009). This is the site-wide file system for a handful of different supercomputers.

Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many more clients.