Frequently Asked Questions (Old Wiki)

See Frequently_Asked_Questions page.

Installation
(Updated: Dec 2009)

Which operating systems are supported as clients and servers?

Please see OS Support.

Can you use NFS or CIFS to reach a Lustre volume?

Yes. Any native Lustre client (running Linux today, by definition) can export a volume using NFS or Samba. Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.

Although NFS export works today, it is slower than native Lustre access, and does not have cache coherent access that some applications depend on.

CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.

What is the typical MDS node configuration?

10,000-node clusters with moderate metadata loads are commonly supported with a 4-socket quad-core node with 32GB of RAM, providing sustained throughput of over 5,000 creates or deletes per second, and up to 20,000 getattr per second. It is common for these systems to have roughly 200 million files. Even in 10,000-client clusters, the single MDS has been shown not to be a significant bottleneck under typical HPC loads. Having a low-latency network like InfiniBand plays a significant role in improved MDS performance.

High throughput with very large directories is possible with extra RAM and, optionally, solid-state disks. Typically, write I/O is low, but seek latency is very important, hence RAID-0+1 mirrored storage (RAID-0 striping of multiple RAID-1 mirrored disks) is strongly recommended for the MDS. Access to metadata from RAM can be 10x faster, so RAM more RAM is usually beneficial.

MDS storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.

What is the typical OSS node configuration?

Multi-core 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual- or quad- socket and support up to 4 fibrechannel interfaces. RAM is used for locks and metadata caching, and in 1.8 and later extra server RAM will be used for file data caching, though is not strictly required. Aproximately 2 GB/OST is recommended for OSS nodes in 1.8, and twice that if OST failover will also run backup OSTs on the node.

Which architectures are interoperable?

Lustre requires the page size on server nodes (MDS and OSS) to be smaller or the same size as client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.

Which storage devices are supported, on MDS and OSS nodes?

Servers support all block storage: SCSI, SATA, SAS, FC and exotic storage (SSD, NVRAM) are supported.

Which storage interconnects are supported?

Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.

For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.

'''Are fibrechannel switches necessary? How does HA shared storage work?'''

Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and SAS devices will also work.

Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and can be configured to take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.

Can you put the file system journal on a separate device?

Yes. This can be configured when the backend ext3 file systems are created. Can you run Lustre on LVM volumes, software RAID, etc?

Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices. Can you describe the installation process?

The current installation process is straightforward, but manual:

1. Install the provided kernel and Lustre RPMs.

2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.

3. Format and mount the OST and MDT filesystems. The command is usually identical on all nodes, so it's easy to use a utility like pdsh/prun to execute it.

4. Start the clients with mount, similar to how NFS is mounted.

What is the estimated installation time per compute node?

Assuming that node doesn't require special drivers or kernel configuration, 5 minutes. Install the RPMs, add a line for the file system to /etc/fstab. Compute nodes can be installed and started in parallel.

What is the estimated installation time per I/O node?

5-30 minutes, plus formatting time, which can also be done in parallel. The per-node time depends on how easily the commands can be run in parallel on the I/O nodes. Install the RPMs, format the devices, add a line to /etc/fstab for the file system, mount.

Licensing and Support
(Updated: Dec 2009)

''' What is the licensing model for the Lustre file system for Linux? '''

The Lustre file system for Linux is an Open Source product.

New releases are available at the Lustre download site to the general public, under the terms and conditions of the GNU GPL.

Metadata Servers
(Updated: Dec 2009)

How many metadata servers does Lustre support?

Lustre 2.x introduced the Distributed Namespace Environment (DNE) feature, which permits dozens of metadata servers working in parallel for a single file system.

How does DNE work?

At a high level, it is reasonably simple: each directory can be striped over multiple metadata targets, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.

When you consider the details of doing this efficiently, coherently, and completely recoverable in the face of any number of different failures, it becomes more complicated.

Isn't the single metadata server a bottleneck?

Not so far. We regularly perform tests with single directories containing millions of files, and we have several customers with 10,000-node clusters (or larger) and a single metadata server.

Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.

The Lustre metadata server software is extremely multithreaded, and we have made substantial efforts to reduce the lock contention on the MDS so that many clients can work concurrently. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second. Future versions of Lustre will continue to improve the metadata performance.

If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SunFire x4600 or Bull NovaScale) with large memory spaces and solid-state disks which can utilize large caches to speed MDS operations.

How is metadata allocated on disk?

The standard way that Lustre formats the MDS file system is with 512-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.

What is intent locking?

Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.

Consider the case of 1,000 clients all chdir'ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.

This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.

Consider another very common case, like a user's home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).

The moral of the store is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.

What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.

Lustre 1.x does not include a metadata writeback cache on the client, so today's metadata server always executes the operation on the client's behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.

 How does the metadata locking protocol work?

Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.

There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file dir/file, as would happen during the unpacking of a tar archive. The client would first lookup dir, and in doing so take a lock to cache the result. It would then ask the metadata server to create "file" inside of it, which would lock dir to modify it, thus yanking the lock back from the client. This "ping-pong" back and forth is unnecessary and very inefficient, so Lustre 1.4.x introduced separate locking of different parts of the inode (simple existence, directory pages, and attributes).

 Does the MDS do any pre-allocation?

Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.

Networking
(Updated: Dec 2009)

Which interconnects and protocols are currently supported? Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), OFED, Myrinet MX, and Cray's Seastar networks. Other older networking technologies were also supported, but are virtually unused today and will be dropped in future releases.

The up-to-date versions of each network type are at the beginning of the lnet/ChangeLog file and each release announcement.

Can I use more than one interface of the same type on the same node?

Yes, with Lustre 1.4.6 and later.

Can I use two or more different interconnects on the same node?

Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.

Can I use TCP offload cards?

Probably -- but we've tried many of these cards, and for various reasons we didn't see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.

Second, the problem isn't the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.

Does Lustre support crazy heterogeneous network topologies?

Yes, although the craziest of them are not yet fully supported.

Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.

Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and InfiniBand interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.

These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on GigE is the interconnect of choice for most organizations, which requires no additional Portals routing.

OS Support
(Updated: Dec 2009)

Which interconnects and protocols are currently supported? Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), OFED, Myrinet MX, and Cray's Seastar networks. Other older networking technologies were also supported, but are virtually unused today and will be dropped in future releases.

The up-to-date versions of each network type are at the beginning of the lnet/ChangeLog file and each release announcement.

Can I use more than one interface of the same type on the same node?

Yes, with Lustre 1.4.6 and later.

Can I use two or more different interconnects on the same node?

Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.

Can I use TCP offload cards?

Probably -- but we've tried many of these cards, and for various reasons we didn't see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.

Second, the problem isn't the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.

Does Lustre support crazy heterogeneous network topologies?

Yes, although the craziest of them are not yet fully supported.

Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.

Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and InfiniBand interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.

These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on GigE is the interconnect of choice for most organizations, which requires no additional Portals routing.

Object Servers and I/O Throughput
(Updated: Dec 2009)

What levels of throughput should I expect?

This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application's I/O patterns, tuning, and more. With standard HPC workloads and reasonable (ie, not seek-bound, nor extremely small I/O requests, etc) Lustre has demonstrated up to 90% of the system's raw I/O bandwidth capability.

With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:


 * TCP/IP
 * Single-connected GigE: 115 MB/s
 * Dual-NIC GigE on a 32-bit OSS: 180 MB/s
 * Dual-NIC GigE on a 64-bit OSS: 220 MB/s
 * Single-connected 10GigE on a 64-bit OSS: 550 MB/s, 1GB/s on woodcrest
 * Unoptimized InfiniBand
 * Single-port SDR InfiniBand on a 64-bit OSS: 700-900 MB/s
 * DDR InfiniBand on a 64-bit OSS: 1500 MB/s

How fast can a single OSS be?

Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.

Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.

How well does Lustre scale as OSSs are added?

Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.

How many clients can each OSS support?

The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see File Sizing.

What is a typical OSS node configuration?

Please see Installation.

How do I automate failover of my OSSs?

Please see Recovery.

Do you plan to support OSS failover without shared storage?

Yes, Server Network Striping will allow RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage. In the meantime, some users have been trying DRBD to replicate the OST block device to a backup OSS instead of having shared storage, though this is not an officially supported configuration.

How is file data allocated on disk?

Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like "934151", which are object numbers. Inside each object is a file's data, or a portion of that file's data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.

The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre's ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.

How does the object locking protocol work?

Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:

First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.

Second, it removes the so-called "split-brain" problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.

In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.

If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.

Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object's attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing ("ls -l") in the output directory while the job is writing its data.

Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don't actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.

Does Lustre support Direct I/O?

Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O, but does not cache the data on the client or server. With an RDMA-capable network (anything other than TCP) there is only a single data copy directly from the client RAM to the server RAM and straight to the disk.

Can these locks be disabled?

Yes, but:


 * It's only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.
 * In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.

Do you plan to support T-10 object devices?

We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes. Does Lustre support/require special parallel I/O libraries?

Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries, when the IO size is sufficient.

A Lustre-specific MPI/IO ADIO driver has been developed to allow an application to provide hints about how it would like its output files to be striped, and to optimize the IO pattern when many clients are doing small read or write operations to a single file.

Recovery
(Updated: Dec 2009)

How do I configure failover services?

Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure. This does not typically require a Fibre Channel switch.

How do I automate failover of my MDSs/OSSs?

The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat's Cluster Manager and SuSE's Heartbeat).

Completely automated failover also requires some kind of programmatically controllable power switch, because the new "active" MDS must be able to completely power off the failed node. Otherwise, there is a chance that the "dead" node could wake up, start using the disk at the same time, and cause massive corruption.

 How necessary is failover, really?

The answer depends on how close to 100% uptime you need to achieve. Failover doesn't protect against the failure of individual disks -- that is handled by software or hardware RAID at the OST and MDT level. Lustre failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.

We would suggest that simple RAID-5 or RAID-6 storage is sufficient for most users, with manual restart of failed OSS and MDS nodes, but that the most important production systems should consider failover.

'''I don't need failover, and don't want shared storage. How will this work?'''

If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.

When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.

If a node suffers a connection failure, will the node select an alternate route for recovery?

Yes. If a node has multiple network paths, and one fails, it can continue to use the others.

What are the supported hardware methods for HBA, switch, and controller failover?

These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. Applications will see a delay while failover and recovery is in progress, but system calls complete without errors.

Can you describe an example failure scenario, and its resolution?

Although failures are becoming more rare, it is more likely that a node will hang or timeout rather than crash. If a client node hangs or crashes, usually all other client and server nodes are not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.

 How are power failures, disk or RAID controller failures, etc. addressed?

If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.

If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device's caches, Lustre requires a file system repair. Lustre's tools will reliably repair any damage it can. It will run in parallel on all nodes, but can still be very time consuming for large file systems.

Release Testing and Upgrading
(Updated: Dec 2009)

''' How does the Lustre group fix issues? '''

The Lustre group approaches bug tracking and fixing seriously and methodically:


 * Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.
 * Architecture and design: Depending on how severe or invasive an update to the architecture description is, it may be written and reviewed by senior architects. A detailed design description for the patch is written and reviewed by principal engineers.
 * Implementation: Fixes are implemented according to the design description and added to a bug for review and inspection.
 * Review and Inspection: A developer or development team will review the code first and then submit it for a methodical inspection by senior and principal engineers.
 * Testing: The developer runs a small suite of tests before the code leaves his or her desk. Then it's added to a branch for regression testing and final release testing.

This process can be tracked closely via Lustre Bugzilla.

''' What testing does each version undergo prior to release? '''

Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.

'''Are Lustre releases backwards and forward compatible on the disk? On the wire?'''

Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time and transparently update disk structures when old versions are encountered.

Support for running with older protocols are removed on every second major release, so the 1.4.x release will not be interoperable with 1.8.x release. Similarly, the 1.6.x release will not be able to interoperate with 2.0.x release.

The Lustre group always tests release 1.x.y with the preceding 1.x.(y-1) release, and with the older 1.(x-2).latest version when making a release. It isn't possible to exhaustively test all release combinations. As a result, we cannot guarantee that all releases are fully interoperable, even though we strive through design and code review to ensure that any 1.x release will work with any 1.(x-2) release.

''' Do you have to reboot to upgrade? '''

Not unless you upgrade your kernel. It's usually a simple matter of unmounting the file system or stopping the server as the case may be, installing the new RPMs, and restarting it.

Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.

Sizing
(Updated: Dec 2009)

''' What is the maximum file system size? What is the largest file system you've tested? '''

Each backend OST file system is restricted to a maximum of 8 TB on Linux 2.6 (imposed by ext3). Filesystems that are based on ext4 (SLES11) will soon be able to handle single OSTs of up to 16TB. Of course, it is possible to have multiple OST backends on a single OSS, and to aggregate multiple OSSs within a single Lustre file system. Running tests with almost 4000 smaller OSTs has been tried - hence 32PB or 64PB file systems could be achieved today.

Lustre users already run single production file systems of over 10PB, using over 1300 8TB OSTs.

''' What is the maximum file system block size? '''

The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to as large as PAGE_SIZE (on IA64 or PPC, for example) in ext4. It is not clear, however, that this is necessary.

Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ldiskfs extents and mballoc features used by Lustre do a good job of allocating I/O aligned and contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is usually 1MB (aligned to the start of the LUN) and can be further aggregated by the disk elevator or RAID controller.

''' What is the maximum single-file size? '''

On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64 bit clients, the maximum file size is 2^63. A current Lustre limit for allocated file space arises from a maximum of 160 stripes and 2TB per file on current ldiskfs file systems, leading to a limit of 320TB per file.

''' What is the maximum number of files in a single file system? In a single directory? '''

We use the ext3 hashed directory code, which has a theoretical limit of about 15 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories in a single directory is the same as the file limit, about 15 million.

We regularly run tests with ten million files in a single directory. On a properly-configured quad-socket MDS with 32 GB of ram, it is possible to do random lookups in this directory at a rate of 20,000/second.

A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size/4kB, so about 2 billion inodes for a 8 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options to increase the number of inodes. Production file systems containing over 300 million files exist.

With the introduction of clustered metadata servers (Lustre 2.x) and with ZFS-based MDTs, these limits will disappear.

''' How many OSSs do I need? '''

The short answer is: As many as you need to achieve the required aggregate I/O throughput.

The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-GigE-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.

Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.

''' What is the largest possible I/O request? '''

When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read or write system call. In principle this is limited only by the address space on the client, and we have tested single read and write system calls up to 1GB in size, so it has not been an issue in reality.

The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it's important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.

Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests "in flight" at a time, per server. There is still a per-syscall overhead for locking and such, so using 1MB or larger read/write requests will minimize this overhead.

On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. In newer kernels there is less need to modify the kernel, though the block device tunables are still set for low-latency desktop workloads, and need to be tuned for high-bandwidth IO.

''' How many nodes can connect to a single Lustre file system? '''

The largest single production Lustre installation is approximately 26,000 nodes today (2009). This is the site-wide file system for a handful of different supercomputers.

Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many more clients.