Frequently Asked Questions
Welcome to the Lustre FAQ!
What is Lustre?
Lustre is a distributed parallel filesystem with a scale-out architecture. Metadata services and storage are separated from data services and storage.
Can you describe the data caching and cache coherency method?
Complete cache coherence is provided for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.
Does Lustre separate metadata and file data?
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs). Note that unlike many block-based clustered filesystems where the MDS is still in charge of block allocation, the Lustre MDS is not involved in file IO in any manner and is not a source of contention for file IO.
The data for each file may reside in multiple objects on separate servers. Lustre normally manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file's data. File Level Redundancy (FLR) allows the user or administrator to mirror the file data across multiple OSTs to provide data redundancy.
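As a rough illustration of the RAID-0 layout described above, the following sketch maps a byte offset in a file to the object that holds it and the offset inside that object. The function name and the stripe_size and stripe_count parameters are hypothetical names chosen for this example, not taken from the Lustre code.

```python
from typing import Tuple


def locate(offset: int, stripe_size: int, stripe_count: int) -> Tuple[int, int]:
    """Map a file byte offset to (object index, offset within that object).

    Illustrative RAID-0 arithmetic only; the parameter names are assumptions.
    """
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; stripe_size and stripe_count must be > 0")
    stripe_number = offset // stripe_size        # which stripe-sized chunk of the file
    object_index = stripe_number % stripe_count  # chunks rotate across the objects
    # Each object stores every stripe_count-th chunk, packed back to back.
    object_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return object_index, object_offset
```

With a 1 MiB stripe size and four objects, byte 0 lands at the start of object 0, byte 1 MiB at the start of object 1, and byte 4 MiB wraps around to object 0 at offset 1 MiB.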
What is the difference between a server (MDS/OSS) and a target (MDT/OST)?
A target is the persistent storage device (HDD or SSD) where object (file) data or metadata is stored. The servers handle network RPC requests on behalf of clients in order to access the files and metadata. Clients do not access storage directly.
Is it common for a single OSS/MDS to export more than one OST/MDT?
To improve parallelism for file IO, and to reduce e2fsck time for large ldiskfs OSTs, there are often multiple OSTs exported from a single OSS. Lustre still aggregates the OSTs into a single large file system on the client side. It is much less common to export multiple MDTs from a single MDS, since MDTs are typically much smaller than OSTs, and the performance gains from running multiple MDTs on a single MDS are relatively small.
Does Lustre perform high-level I/O load balancing?
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.
Objects are distributed between OSTs in a round-robin manner to ensure even load balancing across OSTs and OSS nodes. If the OSTs are imbalanced in terms of space usage, the MDS will take this into account and allocate a larger fraction of files to OSTs with more free space.
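The policy described above can be sketched as follows. This is not Lustre's actual allocator: the function name, the imbalance_threshold parameter and its default of 0.2, and the use of free space as a selection weight are all assumptions made for illustration.

```python
import itertools
import random


def make_allocator(free_space, imbalance_threshold=0.2, rng=None):
    """Return a function that picks the OST index for the next object.

    free_space is a mutable list of free bytes per OST that the caller keeps
    up to date. The threshold and the weighting scheme are hypothetical.
    """
    rng = rng or random.Random()
    round_robin = itertools.cycle(range(len(free_space)))

    def next_ost() -> int:
        lowest, highest = min(free_space), max(free_space)
        # Balanced (or completely full) OSTs: plain round-robin.
        if highest == 0 or (highest - lowest) / highest <= imbalance_threshold:
            return next(round_robin)
        # Imbalanced: favor OSTs with more free space, in proportion to it.
        return rng.choices(range(len(free_space)), weights=free_space, k=1)[0]

    return next_ost
```

When all OSTs have similar free space the allocator cycles through them in order. Once the gap exceeds the threshold, emptier OSTs are chosen more often, and an OST with no free space is never chosen.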
Is there a common synchronized namespace for files and directories?
Yes. All clients that mount the file system will see a single, coherent, synchronized namespace at all times.
Can Lustre be used as part of a "single system image" installation?
Yes. Some installations use Lustre as the root file system on both clients and servers.
Do Lustre clients use NFS to reach the servers?
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre's metadata, I/O, locking, recovery, or performance requirements. It is possible to re-export Lustre mountpoints on a client via NFS for non-Linux clients, but those clients will not have the same performance and data coherency advantages as native Lustre mountpoints.
Does Lustre use or provide a single security domain?
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the MDS by a server-local PAM-managed group database. Lustre supports POSIX Access Control Lists (ACLs). Strong user and host authentication using Kerberos is also available.
Does Lustre support the standard POSIX file system APIs?
Yes. Applications that use standard POSIX file system APIs can run on Lustre without modifications.
Is Lustre "POSIX compliant"? Are there any exceptions?
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.
For example, the coherency of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results.
This is true of all I/O and metadata operations, with two exceptions:
- atime updates: It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre updates the atime of files lazily (if an inode needs to be changed on disk anyway, an atime update is piggy-backed onto that change if needed) and when files are closed. Clients refresh a file's atime whenever they read or write objects of that file from the OST(s), but only do local atime updates for reads from cache. The atime is stored on disk when the file is closed.
- flock/lockf: POSIX and BSD flock/lockf system calls are completely coherent across the cluster, using the Lustre lock manager, and are enabled by default in Lustre 2.12.3 and later. It is possible to enable client-local flock locking with the -o localflock mount option. It is also possible to disable flock for a client with the -o noflock mount option.
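From the application's point of view, these are the ordinary flock/lockf interfaces. The sketch below takes an exclusive BSD-style lock before appending to a file, using Python's standard fcntl module. Nothing in it is Lustre-specific: the helper name append_record is hypothetical, and the demo runs on a temporary directory. On a Lustre client the path would simply live under the Lustre mountpoint, with the scope of the lock depending on the mount option in effect.

```python
import fcntl
import os
import tempfile


def append_record(path: str, record: bytes) -> None:
    """Append one record while holding an exclusive flock on the file."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
        try:
            os.write(fd, record)
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Demo on a local temporary directory (an assumption for portability).
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "shared.log")
        append_record(log, b"first\n")
        append_record(log, b"second\n")
```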
Can you grow/shrink file systems online?
Lustre supports online addition of OST/MDT targets, either on a new OSS/MDS or on an existing OSS/MDS. Lustre will automatically start using newly-added OSTs in the filesystem, but new MDTs need to be linked into the filesystem (e.g. via lfs mkdir -i) for them to be used. Shrinking OST filesystems online is not supported.
Which disk file systems are supported as Lustre back-end file systems?
Lustre includes an enhanced version of the ext4 file system, called ldiskfs, with additional functional and performance enhancements. It is also possible to use the OpenZFS file system to increase the scalability and robustness of the back-end file system.
Why did Lustre choose ext4 for the backing filesystem? Do you ever plan to support others?
There are many reasons to choose ext4. One is size; it is understandable, maintainable, and modifiable. Another is reliability; ext4 is proven stable by millions of users, with an excellent file system repair tool, and is still in active use by many large sites such as Google.
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. Over many years, the Lustre team carried ext3 substantially forward with functionality and performance improvements, many of which have been included in the popular ext4 filesystem, dramatically reducing the number and size of patches in ldiskfs.
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. Since Lustre 2.4, ZFS has been supported as the backing file system for both OSTs and MDTs.
Why didn't you use IBM's distributed lock manager?
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM's DLM), experience thus far has seemed to indicate that we've made the correct choice: it's smaller, simpler and, at least for our needs, more extensible.
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however. To its credit, it is a complete DLM which implements many features that we do not require in Lustre.
In particular, Lustre's DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node managing a specific MDT or OST, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).
Are services at user or kernel level? How do they communicate?
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.
Why does lists.lustre.org say "The list overview page has been disabled temporarily"?
The mailing lists at lists.lustre.org run on a fully vendor-managed solution at DreamHost. Unfortunately, DreamHost has disabled the list overview page for all of its customers and has no definitive plan to correct that in the near future. "Temporarily" apparently really means "permanently". While this is unfortunate, DreamHost's mailing list servers are pleasantly fast, low in maintenance, and low in cost. We will just need to live with this minor inconvenience. In the meantime, you can get a full list of mailing lists from Mailing Lists and IRC.
What do all these acronyms mean?
- ACL - Access Control List - more fine-grained file access permission mechanism beyond standard POSIX User/Group/Other permissions
- DLM - Distributed Lock Manager - subsystem that manages consistency between multiple client and server access of file/directory data and metadata, also Lustre DLM (LDLM)
- DNE - Distributed Namespace Environment - feature aggregating multiple MDTs (possibly on multiple MDSes) into a single filesystem namespace
- FID - File IDentifier - unique 128-bit identifier for every object within a single filesystem, made up of a 64-bit sequence (SEQ), a 32-bit object ID (OID), and a 32-bit version field
- FLDB - FID Location DataBase - table that maps FID SEQ values/ranges to a specific MDT or OST that holds those objects
- IB - InfiniBand - high-performance low-latency network interface, alternative to Ethernet
- IDIF - OST object ID In FID - specific FID range reserved for compatibility with pre-DNE OST objects
- IGIF - Inode and Generation In FID - specific FID range reserved for compatibility with Lustre 1.x MDT inode objects
- IOV - I/O Vector - used to describe file data pages for submission to the network or disk
- LBUG - Lustre BUG - serious error detected by the kernel code that causes a thread to stop running
- LDLM - Lustre Distributed Lock Manager - see DLM
- LMV - Logical Metadata Volume - client software layer that handles client (llite) access to multiple MDTs
- LND - Lustre Network Driver - LNet interface to low-level network interfaces such as Ethernet, InfiniBand, etc.
- LNet - Lustre Network - Type-agnostic networking layer (TCP/IP on Ethernet, InfiniBand, Omni-Path, Cray Gemini, etc.)
- LOD - Logical Object Device - MDS software layer that handles access to multiple MDTs and multiple OSTs
- LOV - Logical Object Volume - client software layer that handles client (llite) access to multiple OSTs
- MDC - MetaData Client - client software layer that interfaces to the MDS
- MDD - Metadata Device Driver - MDS software layer that understands POSIX semantics for file access
- MDS - MetaData Server - software service that handles filesystem namespace requests (inodes, paths, permissions) from clients
- MDT - MetaData Target - storage device that holds the filesystem metadata (attributes, inodes, directories, xattrs, etc)
- MGS - ManaGement Server - service that helps clients and servers with configuration
- MGT - ManaGement Target - storage device that holds the configuration logs
- NID - Network Identifier - used to uniquely identify a Lustre network endpoint by node ID and network type
- OBD - Object Based Device - generic term for Lustre devices such as MDT, OST, OSD, LMV, LOV, MDC, OSC
- OFD - Object Filter Device - OSS software layer that handles file IO
- OID - Object IDentifier - the sub-range of a FID that is allocated by a client within a SEQuence to identify an object
- OSC - Object Storage Client - client software layer that interfaces to the OST
- OSD - Object Storage Device - server software layer that abstracts MDD and OFD access to underlying disk filesystems like ldiskfs and ZFS
- OSP - Object Storage Proxy - server software layer that interfaces from one MDS to the OSD on another MDS or another OSS
- OSS - Object Storage Server - software service that manages access to filesystem data (read, write, truncate, etc)
- OST - Object Storage Target - storage device that holds the filesystem data (regular data files, not directories, xattrs, or other metadata)
- PTLRPC - Portal RPC - remote procedure call protocol layered over LNet
- SEQ - SEQuence - sequence number identifying a range of FIDs that is allocated to a client or server for its own exclusive use. The SEQ controller on MDT0000 manages the entire 64-bit SEQ space, and grants large sub-ranges to MDTs and OSTs, which may in turn grant sub-sub-ranges to individual clients.
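To make the FID layout from the list above concrete (64-bit SEQ, 32-bit OID, 32-bit version), here is a minimal sketch that packs the three fields into a single 128-bit integer and prints them in bracketed hexadecimal form. The class name, the field order inside the packed integer, and the example values are assumptions for illustration; this is not Lustre's on-disk or wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fid:
    """Illustrative FID: 64-bit sequence, 32-bit object ID, 32-bit version."""
    seq: int
    oid: int
    ver: int = 0

    def __post_init__(self) -> None:
        if not (0 <= self.seq < 1 << 64):
            raise ValueError("seq must fit in 64 bits")
        if not (0 <= self.oid < 1 << 32 and 0 <= self.ver < 1 << 32):
            raise ValueError("oid and ver must each fit in 32 bits")

    def pack(self) -> int:
        # Assumed ordering: seq in the high 64 bits, then oid, then ver.
        return (self.seq << 64) | (self.oid << 32) | self.ver

    @classmethod
    def unpack(cls, value: int) -> "Fid":
        return cls(value >> 64, (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF)

    def __str__(self) -> str:
        return f"[{self.seq:#x}:{self.oid:#x}:{self.ver:#x}]"
```

For example, Fid(0x200000400, 1) is rendered as [0x200000400:0x1:0x0], and unpacking the packed value returns an equal Fid. The sequence value used here is arbitrary.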
When is the next Major release?
We plan to tag and build Major releases of Lustre about once a year. The roadmap gives a high-level idea of the current expectations.
When is the next Maintenance Release?
There are multiple maintenance releases per year, but the exact timing can depend upon several factors (Linux distro releases, OpenZFS releases, important bugfixes). The Lustre Working Group is the best source of information about the latest status.