Lustre Hardware Sizing Guide

This page provides practical guidance for sizing a Lustre filesystem deployment. It covers MDT, OST, memory, and network sizing with worked examples. For filesystem limits, see Lustre Architecture for Admins.

MDT Sizing

The MDT stores all filesystem metadata: directory entries, filenames, permissions, timestamps, file layout information, and extended attributes.

How Much MDT Space Do I Need?

Rule of thumb: Each file consumes approximately 2 KiB of usable MDT space on ldiskfs (including the inode and directory entry).

Basic formula:

MDT size = (expected file count) × 2 KiB × safety_factor

A safety factor of 2× is recommended to allow for growth and to avoid performance degradation as the MDT fills up.

Data on MDT (DoM): If you plan to use Data on MDT (storing small file data directly on the MDT, available since Lustre 2.11), you need significantly more MDT space. A reasonable starting point is 5% or more of total filesystem capacity, depending on how many small files you expect.

Worked Examples

Example 1: Traditional HPC storage

100 TiB total usable OST space
Average file size: 5 MiB
Expected file count: ~20 million files

MDT size = 20M × 2 KiB × 2 = 80 GiB

A 100 GiB MDT (RAID1 SSD) would be appropriate.
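
The arithmetic in Example 1 can be reproduced with a short Python sketch. It assumes the filesystem fills completely at the stated average file size; the helper names expected_file_count and mdt_size_bytes are illustrative and not part of any Lustre tool.

```python
KIB, MIB, GIB, TIB = 1024, 1024**2, 1024**3, 1024**4

def expected_file_count(ost_capacity_bytes, avg_file_bytes):
    """Files that fit if the OSTs fill up at the average file size (assumption)."""
    return ost_capacity_bytes // avg_file_bytes

def mdt_size_bytes(file_count, bytes_per_file=2 * KIB, safety_factor=2):
    """MDT size = file count x per-file metadata x safety factor."""
    return file_count * bytes_per_file * safety_factor

files = expected_file_count(100 * TIB, 5 * MIB)
size = mdt_size_bytes(files)
print(f"{files:,} files -> {size / GIB:.0f} GiB MDT")
```

With binary units the file count comes to 20,971,520, which is the "~20 million" of the example, and the MDT size is 80 GiB. Changing safety_factor or bytes_per_file (for instance to 4 KiB for ZFS) shows how sensitive the result is to each input.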

Example 2: Many small files with DoM

100 TiB total usable OST space
Average file size: 64 KiB
Expected file count: ~1 billion files
DoM stripe size: 64 KiB per file

Metadata alone: 1B × 2 KiB × 2 = 4 TiB
DoM data: 1B × 64 KiB = 64 TiB
Total MDT: ~68 TiB (use multiple MDTs with DNE)

This is an extreme case. Most deployments fall somewhere between these examples.
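
A minimal sketch of the DoM calculation in Example 2 follows. It applies the safety factor to the metadata only, as the example does, and takes "~1 billion" as 2**30 so the round figures match; both choices are assumptions, and mdt_size_with_dom is an illustrative name.

```python
KIB, TIB = 1024, 1024**4

def mdt_size_with_dom(file_count, dom_bytes_per_file,
                      metadata_bytes_per_file=2 * KIB, safety_factor=2):
    """Return (metadata_bytes, dom_bytes, total_bytes) for an MDT that also holds DoM data."""
    metadata = file_count * metadata_bytes_per_file * safety_factor
    dom = file_count * dom_bytes_per_file  # no safety factor on file data (assumption)
    return metadata, dom, metadata + dom

files = 2**30  # ~1 billion files, taken as 2**30 (assumption)
metadata, dom, total = mdt_size_with_dom(files, 64 * KIB)
for label, value in (("metadata", metadata), ("DoM data", dom), ("total", total)):
    print(f"{label}: {value / TIB:.0f} TiB")
```

The DoM term dominates: the file data is 16 times larger than the metadata estimate, which is why DoM deployments need far more MDT space than the 2 KiB rule of thumb suggests.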

Example 3: Research cluster

500 TiB total usable OST space
Average file size: 1 GiB
Expected file count: ~500,000 files

MDT size = 500K × 2 KiB × 2 = 2 GiB

Even a small SSD works here, but use at least 20 GiB RAID1 for headroom and performance.

ldiskfs vs. ZFS for MDT

Property	ldiskfs	ZFS
Inode allocation	Fixed at format time	Dynamic (allocates on demand)
Minimum per inode	~2 KiB	~4 KiB
Maximum files	4 billion per MDT	256 trillion per MDT
Resize	grow only (no shrink)	grow only (add VDEVs)
Recommended RAID	RAID1 or RAID10	mirror VDEVs

Note: With ldiskfs, the inode count is fixed at format time. Over-provisioning is cheap (SSD space is relatively inexpensive) and prevents the painful situation of running out of inodes while having plenty of disk space.

MDT Storage Hardware

SSD or NVMe strongly recommended. MDT access patterns are database-like: many small, random reads and writes. Spinning disks will bottleneck metadata performance.
RAID1 (mirror) for a single MDT. For larger MDTs, use RAID10 (ldiskfs) or mirrored ZFS VDEVs.
Do not use RAID5/RAID6 for the MDT — the write penalty for small random I/O is severe.
Do not use RAID1-of-RAID0 (RAID01) — a two-disk failure has a 50% chance of destroying the entire MDT. Use RAID10 instead.
Dedicated controller recommended — do not share with OSTs.

OST Sizing

OSTs store the actual file data. Each file's data is striped across one or more OSTs in chunks (the default stripe size is 1 MiB).

How Many OSTs?

Capacity planning:

Number of OSTs = total capacity / capacity per OST

Typical OST sizes range from 24 TiB to 48 TiB. Maximum is 1024 TiB.

Throughput planning:

Number of OSTs = required aggregate bandwidth / bandwidth per OST

A single HDD-based OST can typically sustain 1–3 GB/s, depending on RAID configuration and spindle count. NVMe-based OSTs can sustain significantly more.

Rule of thumb: Each OSS typically serves 2 to 8 OSTs. Size your OSS count so that network bandwidth matches storage bandwidth.
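
The two planning formulas can be combined by taking whichever gives the larger OST count. The sketch below does that; the input figures are hypothetical, and the choice to take the maximum is an assumption about how to reconcile the two formulas rather than something the formulas themselves state.

```python
import math

def ost_count(total_capacity_tib, ost_capacity_tib, required_bw_gbs, ost_bw_gbs):
    """Return the OST count that satisfies both the capacity and the throughput target."""
    by_capacity = math.ceil(total_capacity_tib / ost_capacity_tib)
    by_throughput = math.ceil(required_bw_gbs / ost_bw_gbs)
    return max(by_capacity, by_throughput)

def oss_count(osts, osts_per_oss):
    """Return the number of OSS nodes needed at a given OST density."""
    return math.ceil(osts / osts_per_oss)

# Hypothetical target: 1000 TiB and 60 GB/s, with 48 TiB OSTs at 2 GB/s each
osts = ost_count(1000, 48, 60, 2)
print(osts, "OSTs on", oss_count(osts, 8), "OSS nodes")
```

In this hypothetical case capacity alone needs 21 OSTs, but the throughput target needs 30, so throughput drives the design.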

OST Storage Hardware

Streaming I/O pattern — high throughput is more important than IOPS.
RAID6 is common for HDD-based OSTs (good balance of capacity and protection).
For ZFS: RAIDZ2 (similar to RAID6) is recommended.
dRAID is available in ZFS for improved rebuild times.
One OST per block device or ZFS dataset. Do not partition.

Memory Sizing

MDS Memory

MDS memory is critical for performance. The MDS caches metadata (inodes, directory entries, lock state) in RAM.

Sizing formula:

MDS RAM = OS overhead (4 GB)
        + journal cache (4 GB)
        + client lock memory (clients × cores × files_per_core × 2 KB)
        + working set cache (working_set_files × 1.5 KB)

Worked example:

1024 compute nodes, 32 cores each, 256 files/core open concurrently
12 interactive/login nodes, 100K files each
20 million file working set

OS + journal:        4 + 4 = 8 GB
Compute clients: 1024 × 32 × 256 × 2 KB = 16 GB
Interactive:      12 × 100K × 2 KB = 2.4 GB
Working set:      20M × 1.5 KB = 30 GB
Total: 8 + 16 + 2.4 + 30 = 56.4 GB, so plan for ~60 GB minimum

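For active-active DNE failover: Double the client lock memory, since one MDS may serve two MDTs after a failover. Plan for 96 GB per MDS in this example; the extra memory beyond the doubled lock memory is available for caching during normal operation.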
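
The sizing formula can be sketched in Python as follows. The function and parameter names are illustrative. The sketch uses decimal units (1 GB = 1,000,000 KB), which is an assumption: the worked example rounds the compute-client term to 16 GB, so the total here comes out slightly higher.

```python
KB_PER_GB = 1_000_000  # decimal units (assumption)

def mds_ram_gb(compute_nodes, cores_per_node, files_per_core,
               login_nodes, files_per_login_node, working_set_files,
               os_gb=4, journal_gb=4, lock_kb=2, cache_kb=1.5):
    """Return each term of the MDS RAM formula in GB."""
    return {
        "os_and_journal": os_gb + journal_gb,
        "compute_locks": compute_nodes * cores_per_node * files_per_core * lock_kb / KB_PER_GB,
        "login_locks": login_nodes * files_per_login_node * lock_kb / KB_PER_GB,
        "working_set": working_set_files * cache_kb / KB_PER_GB,
    }

parts = mds_ram_gb(1024, 32, 256, 12, 100_000, 20_000_000)
for name, gb in parts.items():
    print(f"{name}: {gb:.1f} GB")
print(f"total: {sum(parts.values()):.1f} GB")
```

The total is about 57 GB, in line with the ~60 GB minimum. The working set cache is the largest term, so an accurate estimate of the active file count matters more than the client count.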

OSS Memory

Rule of thumb:

24 GB base + 4 GB per OST (non-failover)
24 GB base + 8 GB per OST (HA failover — one OSS may need to serve its partner's OSTs)

Configuration	OSTs per OSS	Recommended RAM
4 OSTs, no failover	4	40 GB
8 OSTs, no failover	8	56 GB
4 OSTs, HA failover	4 (8 during failover)	56 GB
8 OSTs, HA failover	8 (16 during failover)	88 GB
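
The rule of thumb and the table can be reproduced with a few lines of Python. The function name oss_ram_gb is illustrative.

```python
def oss_ram_gb(osts_per_oss, ha_failover=False, base_gb=24):
    """24 GB base plus 4 GB per OST, or 8 GB per OST when the OSS is in an HA pair."""
    per_ost_gb = 8 if ha_failover else 4
    return base_gb + per_ost_gb * osts_per_oss

for osts in (4, 8):
    for ha in (False, True):
        mode = "HA failover" if ha else "no failover"
        print(f"{osts} OSTs, {mode}: {oss_ram_gb(osts, ha)} GB")
```

Pass the number of OSTs the OSS serves in normal operation; the HA case already accounts for the partner's OSTs through the doubled per-OST figure.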

Client Memory

Minimum: 2 GB. Clients with more RAM can cache more data locally, improving read performance. No special sizing formula needed.

Network Sizing

Fabric Selection

Fabric	Bandwidth	Latency	LNet Driver	Notes
1 GbE	125 MB/s	~50 μs	ksocklnd	Testing only — too slow for production
10 GbE	1.2 GB/s	~20 μs	ksocklnd	Small clusters, archival
25 GbE	3 GB/s	~15 μs	ksocklnd	Mid-range clusters
100 GbE	12 GB/s	~10 μs	ksocklnd	Large clusters
InfiniBand HDR (200 Gb/s)	24 GB/s	~1 μs	ko2iblnd	High-performance HPC
InfiniBand NDR (400 Gb/s)	48 GB/s	~1 μs	ko2iblnd	Latest generation HPC

Key principle: Balance network bandwidth with storage bandwidth on each OSS. A 200 Gb/s InfiniBand link (about 24 GB/s) is largely wasted if the OSTs behind it sustain far less in aggregate, for example 4 OSTs at 2 GB/s each (8 GB/s).
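
One way to check the balance for a single OSS is to compare its link bandwidth with the combined bandwidth of its OSTs. The sketch below is a simplification that ignores protocol overhead and multi-rail setups; oss_bottleneck is an illustrative name.

```python
def oss_bottleneck(link_gbs, osts, ost_bw_gbs):
    """Return (deliverable GB/s, limiting side) for one OSS."""
    storage_gbs = osts * ost_bw_gbs
    if storage_gbs < link_gbs:
        limiter = "storage"
    elif storage_gbs > link_gbs:
        limiter = "network"
    else:
        limiter = "balanced"
    return min(link_gbs, storage_gbs), limiter

print(oss_bottleneck(24, 4, 2))  # InfiniBand HDR link, 4 OSTs at 2 GB/s
print(oss_bottleneck(3, 4, 2))   # 25 GbE link, same storage
```

In the first case the storage limits the OSS to 8 GB/s and two thirds of the link sits idle; in the second the 25 GbE link caps the same storage at 3 GB/s.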

LNet Configuration

Default: LNet uses the first TCP interface. For most setups, create /etc/lnet.conf to specify the correct interface.
All Lustre nodes communicate on TCP port 988. Ensure this port is open.
For multi-network environments, LNet supports routing between different subnets or fabric types. See LNet Router Config Guide.
For bandwidth aggregation, see Multi-Rail LNet.
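
A quick way to confirm that port 988 is reachable from another node is a plain TCP connection attempt. The sketch below uses only the Python standard library; tcp_port_open is an illustrative helper, and a successful connection shows only that the port is reachable through any firewalls, not that LNet is configured correctly.

```python
import socket

LNET_TCP_PORT = 988  # the port named in the text above

def tcp_port_open(host, port=LNET_TCP_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

For example, tcp_port_open("mds01.example.com") could be run from a client against a server node, where the hostname is a placeholder for one of your own servers.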

Putting It Together: Example Configurations

Small: Research Lab (5 nodes)

Role	Count	Hardware	Storage
Combined MGS/MDS	1	4+ cores, 32 GB RAM	100 GB SSD (RAID1)
OSS	2	4+ cores, 32 GB RAM each	2× 24 TB OSTs (HDD, RAID6) per OSS = 4 OSTs total
Client	2	2+ GB RAM	—
Network	10 or 25 GbE

Total capacity: ~96 TB usable. Suitable for a small team.

Medium: Department Cluster (50 nodes)

Role	Count	Hardware	Storage
Combined MGS/MDS	1 (with HA pair)	8+ cores, 64 GB RAM	500 GB NVMe (RAID1)
OSS	4 (2 HA pairs)	8+ cores, 56 GB RAM each	4× 48 TB OSTs (HDD, RAID6) per OSS = 16 OSTs total
Client	44	4+ GB RAM	—
Network	25 GbE or InfiniBand HDR100

Total capacity: ~770 TB usable. HA configured for MDS and OSS failover.

Large: HPC Center (1000+ nodes)

Role	Count	Hardware	Storage
MGS	1 (with HA pair)	4+ cores, 16 GB RAM	10 GB SSD (RAID1)
MDS	2–4 (DNE, HA pairs)	16+ cores, 96 GB RAM each	2–4 TB NVMe (RAID1) per MDT
OSS	50+ (HA pairs)	8+ cores, 88 GB RAM each	8× 48 TB OSTs (RAID6 or RAIDZ2) per OSS
Client	1000+	4+ GB RAM	—
LNet routers	2–4 (if multi-fabric)	8+ cores, 32 GB RAM	—
Network	InfiniBand HDR/NDR

Total capacity: 10+ PB usable. DNE for metadata scaling. Separate MGS for independent management.
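
The capacity figures for the three configurations follow from OSS count × OSTs per OSS × OST size. The sketch below checks them, assuming the OST sizes in the tables are usable capacity after RAID; usable_capacity_tb is an illustrative name.

```python
def usable_capacity_tb(oss_count, osts_per_oss, ost_size_tb):
    """Total usable capacity in TB, taking the OST size as usable space (assumption)."""
    return oss_count * osts_per_oss * ost_size_tb

configs = {
    "small": (2, 2, 24),
    "medium": (4, 4, 48),
    "large": (50, 8, 48),  # 50 is the minimum OSS count in the table
}
for name, (oss, osts, size_tb) in configs.items():
    print(f"{name}: {usable_capacity_tb(oss, osts, size_tb):,} TB")
```

The small and medium results (96 TB and 768 TB) match the stated totals. The large configuration comes to about 19 PB at 50 OSS nodes, comfortably within the "10+ PB" description.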

Next Steps

Set up your first filesystem: Lustre Quick Start Guide
Choose your deployment pattern: Lustre Deployment Patterns
Understand the architecture: Lustre Architecture for Admins
