Lustre Hardware Sizing Guide
This page provides practical guidance for sizing a Lustre filesystem deployment. It covers MDT, OST, memory, and network sizing with worked examples. For filesystem limits, see Lustre Architecture for Admins.
MDT Sizing
The MDT stores all filesystem metadata: directory entries, filenames, permissions, timestamps, file layout information, and extended attributes.
How Much MDT Space Do I Need?
Rule of thumb: Each file consumes approximately 2 KiB of usable MDT space on ldiskfs (including the inode and directory entry).
Basic formula:
MDT size = (expected file count) × 2 KiB × safety_factor
A safety factor of 2× is recommended to allow for growth and to avoid performance degradation as the MDT fills up.
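The basic formula can be sketched in a few lines of Python. This is a minimal illustration, not a Lustre tool: the function name `mdt_size_bytes` and its parameters are made up for this example, and the 2 KiB per file and 2× safety factor are simply the rule-of-thumb values from this page.

```python
KIB = 1024
GIB = 1024 ** 3

def mdt_size_bytes(file_count, bytes_per_file=2 * KIB, safety_factor=2.0):
    """MDT size = file_count x bytes_per_file x safety_factor (rule of thumb)."""
    return int(file_count * bytes_per_file * safety_factor)

# 20 million files with the default 2 KiB per file and 2x safety factor.
size = mdt_size_bytes(20_000_000)
print(f"{size / GIB:.1f} GiB")  # prints "76.3 GiB"
```

The exact binary result is slightly below the rounded figures used in quick estimates, which is harmless because the safety factor already dominates the calculation.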
Data on MDT (DoM): If you plan to use Data on MDT (storing small file data directly on the MDT, available since Lustre 2.11), you need significantly more MDT space. A reasonable starting point is 5% or more of total filesystem capacity, depending on how many small files you expect.
Worked Examples
Example 1: Traditional HPC storage
- 100 TiB total usable OST space
- Average file size: 5 MiB
- Expected file count: ~20 million files
MDT size = 20M × 2 KiB × 2 = 80 GiB
A 100 GiB MDT (RAID1 SSD) would be appropriate.
Example 2: Many small files with DoM
- 100 TiB total usable OST space
- Average file size: 64 KiB
- Expected file count: ~1 billion files
- DoM stripe size: 64 KiB per file
Metadata alone: 1B × 2 KiB × 2 = 4 TiB
DoM data: 1B × 64 KiB = 64 TiB
Total MDT: ~68 TiB (use multiple MDTs with DNE)
This is an extreme case. Most deployments fall somewhere between these examples.
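The DoM arithmetic in Example 2 can be reproduced with a short sketch. The helper `dom_mdt_size_bytes` is hypothetical, and the sketch assumes "1 billion" is treated as 2^30 files so that the binary units come out to the round numbers above.

```python
KIB = 1024
TIB = 1024 ** 4

def dom_mdt_size_bytes(file_count, dom_bytes_per_file,
                       bytes_per_file=2 * KIB, safety_factor=2):
    """Return (metadata, dom_data, total) in bytes for an MDT that also holds DoM data."""
    metadata = file_count * bytes_per_file * safety_factor
    dom_data = file_count * dom_bytes_per_file  # no safety factor applied, as in Example 2
    return metadata, dom_data, metadata + dom_data

# Assumption: 1 billion files approximated as 2**30.
metadata, dom_data, total = dom_mdt_size_bytes(2 ** 30, 64 * KIB)
print(metadata // TIB, dom_data // TIB, total // TIB)  # prints "4 64 68"
```

With a decimal billion (10^9) the total is a few TiB lower, so the figures above are on the conservative side.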
Example 3: Research cluster
- 500 TiB total usable OST space
- Average file size: 1 GiB
- Expected file count: ~500,000 files
MDT size = 500K × 2 KiB × 2 = 2 GiB
Even a small SSD works here, but use at least 20 GiB RAID1 for headroom and performance.
ldiskfs vs. ZFS for MDT
| | ldiskfs | ZFS |
|---|---|---|
| Inode allocation | Fixed at format time | Dynamic (allocates on demand) |
| Minimum per inode | ~2 KiB | ~4 KiB |
| Maximum files | 4 billion per MDT | 256 trillion per MDT |
| Resize | grow only (no shrink) | grow only (add VDEVs) |
| Recommended RAID | RAID1 or RAID10 | mirror VDEVs |
Note: With ldiskfs, the inode count is fixed at format time. Over-provisioning is cheap (SSD space is relatively inexpensive) and prevents the painful situation of running out of inodes while having plenty of disk space.
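To see how much inode headroom a given ldiskfs MDT leaves, one rough check is to divide the MDT size by the bytes-per-inode ratio and compare the result with the expected file count. The sketch below assumes the ~2 KiB per inode figure from the table; the function names are illustrative, and the real inode count chosen at format time will be somewhat lower because the journal and other filesystem structures also consume space.

```python
KIB = 1024
GIB = 1024 ** 3

def ldiskfs_inode_estimate(mdt_bytes, bytes_per_inode=2 * KIB):
    """Upper-bound estimate of inodes created at format time (assumed ratio)."""
    return mdt_bytes // bytes_per_inode

def inode_headroom(mdt_bytes, expected_files, bytes_per_inode=2 * KIB):
    """Ratio of estimated inodes to expected files; below 1.0 means too few inodes."""
    return ldiskfs_inode_estimate(mdt_bytes, bytes_per_inode) / expected_files

# A 100 GiB MDT holding an expected 20 million files.
print(ldiskfs_inode_estimate(100 * GIB))                     # prints 52428800
print(round(inode_headroom(100 * GIB, 20_000_000), 2))       # prints 2.62
```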
MDT Storage Hardware
- SSD or NVMe strongly recommended. MDT access patterns are database-like: many small, random reads and writes. Spinning disks will bottleneck metadata performance.
- RAID1 (mirror) for a single MDT. For larger MDTs, use RAID10 (ldiskfs) or mirrored ZFS VDEVs.
- Do not use RAID5/RAID6 for the MDT — the write penalty for small random I/O is severe.
- Do not use RAID1-of-RAID0 (RAID01) — a two-disk failure has a 50% chance of destroying the entire MDT. Use RAID10 instead.
- Dedicated controller recommended — do not share with OSTs.
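The difference between RAID10 and RAID01 under a two-disk failure can be checked by brute-force enumeration. The sketch below is a simplified model that assumes both failures happen before any rebuild and that every pair of disks is equally likely to fail; the function names are illustrative. Under this model the RAID01 loss fraction is n/(2n-1) for n disks per stripe, which approaches the 50% quoted above as arrays grow, while RAID10 is 1/(2n-1).

```python
from itertools import combinations

def fatal_fraction_raid10(mirror_pairs):
    """Stripe over mirrored pairs: data is lost only if both disks of one pair fail."""
    disks = range(2 * mirror_pairs)
    combos = list(combinations(disks, 2))
    # Disks 2k and 2k+1 form mirror pair k.
    fatal = sum(1 for a, b in combos if a // 2 == b // 2)
    return fatal / len(combos)

def fatal_fraction_raid01(disks_per_stripe):
    """Mirror of two stripes: data is lost if one disk fails in each stripe."""
    n = disks_per_stripe
    combos = list(combinations(range(2 * n), 2))
    # Disks 0..n-1 form stripe A, disks n..2n-1 form stripe B.
    fatal = sum(1 for a, b in combos if (a < n) != (b < n))
    return fatal / len(combos)

# Eight disks in total for both layouts.
print(round(fatal_fraction_raid10(4), 3))  # prints 0.143
print(round(fatal_fraction_raid01(4), 3))  # prints 0.571
```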
OST Sizing
OSTs store the actual file data. Each file's data is striped across one or more OSTs in chunks (the default stripe size is 1 MiB).
How Many OSTs?
Capacity planning:
Number of OSTs = total capacity / capacity per OST
Typical OST sizes range from 24 TiB to 48 TiB. Maximum is 1024 TiB.
Throughput planning:
Number of OSTs = required aggregate bandwidth / bandwidth per OST
A single HDD-based OST can typically sustain 1–3 GB/s, depending on RAID configuration and spindle count. NVMe-based OSTs can sustain significantly more.
Rule of thumb: Each OSS typically serves 2 to 8 OSTs. Size your OSS count so that network bandwidth matches storage bandwidth.
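Because capacity and throughput each impose their own minimum, a reasonable approach is to compute both and take the larger. The sketch below shows that, plus the resulting OSS count; the function names and the example inputs (768 TiB, 40 GB/s, 2 GB/s per OST, 4 OSTs per OSS) are illustrative assumptions rather than recommendations.

```python
import math

def ost_count(capacity_tib, ost_size_tib, bandwidth_gbs, ost_bandwidth_gbs):
    """Number of OSTs needed to satisfy both the capacity and throughput targets."""
    by_capacity = math.ceil(capacity_tib / ost_size_tib)
    by_throughput = math.ceil(bandwidth_gbs / ost_bandwidth_gbs)
    return max(by_capacity, by_throughput)

def oss_count(osts, osts_per_oss):
    """Number of OSS nodes needed to host the given OSTs."""
    return math.ceil(osts / osts_per_oss)

# 768 TiB on 48 TiB OSTs needs 16 OSTs, but 40 GB/s at 2 GB/s per OST needs 20.
osts = ost_count(768, 48, 40, 2)
print(osts, oss_count(osts, 4))  # prints "20 5"
```

In this example throughput, not capacity, determines the OST count, so the filesystem ends up with more raw capacity than strictly required.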
OST Storage Hardware
- Streaming I/O pattern — high throughput is more important than IOPS.
- RAID6 is common for HDD-based OSTs (good balance of capacity and protection).
- For ZFS: RAIDZ2 (similar to RAID6) is recommended.
- dRAID is available in ZFS for improved rebuild times.
- One OST per block device or ZFS dataset. Do not partition.
Memory Sizing
MDS Memory
MDS memory is critical for performance. The MDS caches metadata (inodes, directory entries, lock state) in RAM.
Sizing formula:
MDS RAM = OS overhead (4 GB)
+ journal cache (4 GB)
+ client lock memory (clients × cores × files_per_core × 2 KB)
+ working set cache (working_set_files × 1.5 KB)
Worked example:
- 1024 compute nodes, 32 cores each, 256 files/core open concurrently
- 12 interactive/login nodes, 100K files each
- 20 million file working set
OS + journal: 4 + 4 = 8 GB
Compute clients: 1024 × 32 × 256 × 2 KB = 16 GB
Interactive: 12 × 100K × 2 KB = 2.4 GB
Working set: 20M × 1.5 KB = 30 GB
Total: ~60 GB minimum
For active-active DNE failover: Double the client lock memory, since one MDS may serve two MDTs after a failover. Plan for 96 GB per MDS in this example.
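The base sizing formula translates directly into code. The sketch below is illustrative only: `mds_ram_gb` is a made-up helper, it uses decimal units (1 GB = 10^6 KB) throughout, and it covers the non-failover case. The exact sum is a little under the rounded 60 GB figure, which leaves a small margin.

```python
KB_PER_GB = 1_000_000  # decimal units, an assumption of this sketch

def mds_ram_gb(compute_nodes, cores_per_node, files_per_core,
               login_nodes, files_per_login_node, working_set_files,
               os_gb=4, journal_gb=4, lock_kb=2, cache_kb=1.5):
    """Break down MDS RAM (in GB) using the rule-of-thumb terms from this page."""
    compute_locks = compute_nodes * cores_per_node * files_per_core * lock_kb / KB_PER_GB
    login_locks = login_nodes * files_per_login_node * lock_kb / KB_PER_GB
    working_set = working_set_files * cache_kb / KB_PER_GB
    total = os_gb + journal_gb + compute_locks + login_locks + working_set
    return {
        "os_and_journal": os_gb + journal_gb,
        "compute_locks": compute_locks,
        "login_locks": login_locks,
        "working_set": working_set,
        "total": total,
    }

# Inputs from the worked example: 1024 nodes x 32 cores x 256 files,
# 12 login nodes x 100K files, 20M-file working set.
breakdown = mds_ram_gb(1024, 32, 256, 12, 100_000, 20_000_000)
for name, gb in breakdown.items():
    print(f"{name}: {gb:.1f} GB")
```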
OSS Memory
Rule of thumb:
- 24 GB base + 4 GB per OST (non-failover)
- 24 GB base + 8 GB per OST (HA failover — one OSS may need to serve its partner's OSTs)
| Configuration | OSTs per OSS | Recommended RAM |
|---|---|---|
| 4 OSTs, no failover | 4 | 40 GB |
| 8 OSTs, no failover | 8 | 56 GB |
| 4 OSTs, HA failover | 4 (8 during failover) | 56 GB |
| 8 OSTs, HA failover | 8 (16 during failover) | 88 GB |
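The table rows follow directly from the rule of thumb, as this small sketch shows. The helper name `oss_ram_gb` is illustrative; `osts_per_oss` is the number of OSTs the server hosts in normal operation, and the HA case budgets 8 GB per OST to cover the partner's OSTs during failover.

```python
def oss_ram_gb(osts_per_oss, ha_failover=False, base_gb=24):
    """Recommended OSS RAM in GB: 24 GB base plus 4 GB (or 8 GB with HA) per OST."""
    per_ost_gb = 8 if ha_failover else 4
    return base_gb + per_ost_gb * osts_per_oss

for osts in (4, 8):
    for ha in (False, True):
        label = "HA failover" if ha else "no failover"
        print(f"{osts} OSTs, {label}: {oss_ram_gb(osts, ha)} GB")
```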
Client Memory
Minimum: 2 GB. Clients with more RAM can cache more data locally, improving read performance. No special sizing formula needed.
Network Sizing
Fabric Selection
| Fabric | Bandwidth | Latency | LNet Driver | Notes |
|---|---|---|---|---|
| 1 GbE | 125 MB/s | ~50 μs | ksocklnd | Testing only — too slow for production |
| 10 GbE | 1.2 GB/s | ~20 μs | ksocklnd | Small clusters, archival |
| 25 GbE | 3 GB/s | ~15 μs | ksocklnd | Mid-range clusters |
| 100 GbE | 12 GB/s | ~10 μs | ksocklnd | Large clusters |
| InfiniBand HDR (200 Gb/s) | 24 GB/s | ~1 μs | ko2iblnd | High-performance HPC |
| InfiniBand NDR (400 Gb/s) | 48 GB/s | ~1 μs | ko2iblnd | Latest generation HPC |
Key principle: Balance network bandwidth with storage bandwidth. There is no benefit in having 200 Gb/s InfiniBand if your OSTs can only sustain 2 GB/s each.
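One way to apply this principle is to compare, per OSS, the aggregate bandwidth of its OSTs with the bandwidth of its network links and see which side is the limit. The sketch below assumes simple additive bandwidth and ignores protocol overhead; the function name and inputs are illustrative, with link speeds taken from the fabric table.

```python
def oss_balance(osts_per_oss, ost_gbs, link_gbs, links=1):
    """Return (deliverable GB/s, limiting side) for one OSS."""
    storage = osts_per_oss * ost_gbs   # what the OSTs can sustain together
    network = links * link_gbs         # what the network links can carry
    if storage < network:
        limiter = "storage"
    elif network < storage:
        limiter = "network"
    else:
        limiter = "balanced"
    return min(storage, network), limiter

# Four OSTs at 2 GB/s each behind different fabrics.
print(oss_balance(4, 2, 3))    # 25 GbE:  prints (3, 'network')
print(oss_balance(4, 2, 12))   # 100 GbE: prints (8, 'storage')
print(oss_balance(4, 2, 24))   # IB HDR:  prints (8, 'storage')
```

In the last case the extra fabric bandwidth goes unused until more or faster OSTs are placed behind the OSS.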
LNet Configuration
- Default: LNet uses the first TCP interface. For most setups, create /etc/lnet.conf to specify the correct interface.
- All Lustre nodes communicate on TCP port 988. Ensure this port is open.
- For multi-network environments, LNet supports routing between different subnets or fabric types. See LNet Router Config Guide.
- For bandwidth aggregation, see Multi-Rail LNet.
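For TCP-based setups, a quick way to confirm that a firewall is not blocking port 988 is to attempt a plain TCP connection from another node. The sketch below uses only the Python standard library; `port_reachable` is an illustrative helper and the hostname in the usage comment is a placeholder. It checks only that a TCP connection can be opened, not that LNet itself is healthy.

```python
import socket

LNET_TCP_PORT = 988  # TCP port named in this guide

def port_reachable(host, port=LNET_TCP_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

# Usage (hostname is a placeholder for one of your servers):
#   port_reachable("mds01.example.com")
```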
Putting It Together: Example Configurations
Small: Research Lab (5 nodes)
| Role | Count | Hardware | Storage |
|---|---|---|---|
| Combined MGS/MDS | 1 | 4+ cores, 32 GB RAM | 100 GB SSD (RAID1) |
| OSS | 2 | 4+ cores, 32 GB RAM each | 2× 24 TB HDD (RAID6) per OSS = 4 OSTs total |
| Client | 2 | 2+ GB RAM | — |
| Network | 10 or 25 GbE | | |
Total capacity: ~96 TB usable. Suitable for a small team.
Medium: Department Cluster (50 nodes)
| Role | Count | Hardware | Storage |
|---|---|---|---|
| Combined MGS/MDS | 1 (with HA pair) | 8+ cores, 64 GB RAM | 500 GB NVMe (RAID1) |
| OSS | 4 (2 HA pairs) | 8+ cores, 56 GB RAM each | 4× 48 TB HDD (RAID6) per OSS = 16 OSTs total |
| Client | 44 | 4+ GB RAM | — |
| Network | 25 GbE or InfiniBand HDR100 | | |
Total capacity: ~770 TB usable. HA configured for MDS and OSS failover.
Large: HPC Center (1000+ nodes)
| Role | Count | Hardware | Storage |
|---|---|---|---|
| MGS | 1 (with HA pair) | 4+ cores, 16 GB RAM | 10 GB SSD (RAID1) |
| MDS | 2–4 (DNE, HA pairs) | 16+ cores, 96 GB RAM each | 2–4 TB NVMe (RAID1) per MDT |
| OSS | 50+ (HA pairs) | 8+ cores, 88 GB RAM each | 8× 48 TB (RAID6 or RAIDZ2) per OSS |
| Client | 1000+ | 4+ GB RAM | — |
| LNet routers | 2–4 (if multi-fabric) | 8+ cores, 32 GB RAM | — |
| Network | InfiniBand HDR/NDR | | |
Total capacity: 10+ PB usable. DNE for metadata scaling. Separate MGS for independent management.
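The capacity totals for the three example configurations can be cross-checked with a one-line calculation. The sketch assumes each size listed in the tables is the usable capacity of one OST; the function name is illustrative, and the large configuration uses its minimum of 50 OSS nodes.

```python
def total_ost_capacity_tb(oss_count, osts_per_oss, ost_size_tb):
    """Aggregate OST capacity in TB for a uniform set of OSS nodes."""
    return oss_count * osts_per_oss * ost_size_tb

print(total_ost_capacity_tb(2, 2, 24))    # small:  prints 96
print(total_ost_capacity_tb(4, 4, 48))    # medium: prints 768
print(total_ost_capacity_tb(50, 8, 48))   # large:  prints 19200 (about 19 PB)
```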
Next Steps
- Set up your first filesystem: Lustre Quick Start Guide
- Choose your deployment pattern: Lustre Deployment Patterns
- Understand the architecture: Lustre Architecture for Admins