Lustre Hardware Sizing Guide

From Lustre Wiki

This page provides practical guidance for sizing a Lustre filesystem deployment. It covers MDT, OST, memory, and network sizing with worked examples. For filesystem limits, see Lustre Architecture for Admins.

MDT Sizing

The MDT stores all filesystem metadata: directory entries, filenames, permissions, timestamps, file layout information, and extended attributes.

How Much MDT Space Do I Need?

Rule of thumb: Each file consumes approximately 2 KiB of usable MDT space on ldiskfs (including the inode and directory entry).

Basic formula:

MDT size = (expected file count) × 2 KiB × safety_factor

A safety factor of 2 is recommended to allow for growth and to avoid performance degradation as the MDT fills up.

Data on MDT (DoM): If you plan to use Data on MDT (storing small file data directly on the MDT, available since Lustre 2.11), you need significantly more MDT space. A reasonable starting point is 5% or more of total filesystem capacity, depending on how many small files you expect.
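
The sizing rule above can be sketched in a few lines of Python. This is a minimal sketch, not a Lustre tool: the function name and parameters are hypothetical, the defaults are the 2 KiB per file rule of thumb and a safety factor of 2, and DoM data is added per file without a safety factor (an assumption that mirrors the DoM worked example later on this page).

```python
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4


def mdt_size_bytes(file_count, bytes_per_file=2 * KIB, safety_factor=2,
                   dom_bytes_per_file=0):
    """Estimate MDT capacity in bytes (hypothetical helper).

    dom_bytes_per_file is the amount of file data kept on the MDT per file
    when Data on MDT is used; 0 means DoM is not used.
    """
    metadata = file_count * bytes_per_file * safety_factor
    dom_data = file_count * dom_bytes_per_file  # assumption: no safety factor on DoM data
    return metadata + dom_data


# 20 Mi files, metadata only
print(mdt_size_bytes(20 * 2 ** 20) / GIB)                          # 80.0
# 1 Gi files with 64 KiB of DoM data each
print(mdt_size_bytes(2 ** 30, dom_bytes_per_file=64 * KIB) / TIB)  # 68.0
```

The file counts are written as binary multiples (20 × 2^20, 2^30) so the results come out as round GiB and TiB figures; with decimal counts the results are a few percent smaller.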

Worked Examples

Example 1: Traditional HPC storage

  • 100 TiB total usable OST space
  • Average file size: 5 MiB
  • Expected file count: ~20 million files

MDT size = 20M × 2 KiB × 2 = 80 GiB

A 100 GiB MDT (RAID1 SSD) would be appropriate.

Example 2: Many small files with DoM

  • 100 TiB total usable OST space
  • Average file size: 64 KiB
  • Expected file count: ~1 billion files
  • DoM stripe size: 64 KiB per file

Metadata alone: 1B × 2 KiB × 2 = 4 TiB
DoM data: 1B × 64 KiB = 64 TiB
Total MDT: ~68 TiB (use multiple MDTs with DNE)

This is an extreme case. Most deployments fall somewhere between these examples.

Example 3: Research cluster

  • 500 TiB total usable OST space
  • Average file size: 1 GiB
  • Expected file count: ~500,000 files

MDT size = 500K × 2 KiB × 2 = 2 GiB

Even a small SSD works here, but use at least 20 GiB RAID1 for headroom and performance.

ldiskfs vs. ZFS for MDT

| Property | ldiskfs | ZFS |
|---|---|---|
| Inode allocation | Fixed at format time | Dynamic (allocates on demand) |
| Minimum space per inode | ~2 KiB | ~4 KiB |
| Maximum files | 4 billion per MDT | 256 trillion per MDT |
| Resize | Grow only (no shrink) | Grow only (add VDEVs) |
| Recommended RAID | RAID1 or RAID10 | Mirror VDEVs |

Note: With ldiskfs, the inode count is fixed at format time. Over-provisioning is cheap (SSD space is relatively inexpensive) and prevents the painful situation of running out of inodes while having plenty of disk space.

MDT Storage Hardware

  • SSD or NVMe strongly recommended. MDT access patterns are database-like: many small, random reads and writes. Spinning disks will bottleneck metadata performance.
  • RAID1 (mirror) for a single MDT. For larger MDTs, use RAID10 (ldiskfs) or mirrored ZFS VDEVs.
  • Do not use RAID5/RAID6 for the MDT — the write penalty for small random I/O is severe.
  • Do not use RAID1-of-RAID0 (RAID01). The first disk failure takes an entire RAID0 half offline, and a second failure then has roughly a 50% chance of destroying the entire MDT. Use RAID10 instead.
  • Dedicated controller recommended — do not share with OSTs.
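
The RAID10 versus RAID01 recommendation above can be checked by brute force. The sketch below enumerates every possible pair of failed disks and counts how many pairs destroy the array. It assumes a simplified model that is not taken from this page: all two-disk failures are equally likely, RAID10 is a stripe over two-disk mirrors, and RAID01 is a mirror of two equal stripes. All names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def fatal_pair_fraction(groups, array_dies):
    """Fraction of all two-disk failures that destroy the array."""
    disks = [d for group in groups for d in group]
    pairs = list(combinations(disks, 2))
    fatal = sum(1 for pair in pairs if array_dies(set(pair), groups))
    return Fraction(fatal, len(pairs))


def raid10_dies(failed, mirrors):
    # Stripe over mirrors: lost when both disks of any one mirror have failed.
    return any(set(mirror) <= failed for mirror in mirrors)


def raid01_dies(failed, stripes):
    # Mirror of stripes: lost when every stripe contains a failed disk.
    return all(set(stripe) & failed for stripe in stripes)


def layouts(n_disks):
    """Build both layouts from the same set of disks (n_disks must be even)."""
    disks = list(range(n_disks))
    half = n_disks // 2
    mirrors = [(disks[i], disks[i + half]) for i in range(half)]
    stripes = [tuple(disks[:half]), tuple(disks[half:])]
    return mirrors, stripes


for n in (4, 8, 24):
    mirrors, stripes = layouts(n)
    print(n,
          fatal_pair_fraction(mirrors, raid10_dies),
          fatal_pair_fraction(stripes, raid01_dies))
```

Under this model a 24-disk RAID01 is lost in 12 of every 23 two-disk failures (about 52%), while the same disks arranged as RAID10 are lost in 1 of 23. The RAID01 figure approaches 50% from above as the array grows.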

OST Sizing

OSTs store the actual file data. Each file's data is striped across one or more OSTs in chunks (default 4 MiB).
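
To make the striping concrete, the following sketch maps a byte offset in a file to the stripe that holds it. It assumes a plain round-robin (RAID0-style) layout with a fixed stripe size and stripe count, and ignores composite layouts; the function name is hypothetical and is not part of any Lustre API.

```python
MIB = 1024 * 1024


def locate(offset, stripe_size, stripe_count):
    """Return (stripe_index, object_offset) for a byte offset in a file.

    stripe_index selects one of the file's stripe_count OST objects;
    object_offset is the byte position inside that object.
    """
    chunk = offset // stripe_size        # which stripe-sized chunk of the file
    stripe_index = chunk % stripe_count  # chunks are dealt out round-robin
    object_offset = (chunk // stripe_count) * stripe_size + offset % stripe_size
    return stripe_index, object_offset


# A file striped over 4 OSTs in 4 MiB chunks:
print(locate(9 * MIB, 4 * MIB, 4))   # (2, 1048576): third object, 1 MiB into it
print(locate(17 * MIB, 4 * MIB, 4))  # (0, 5242880): wraps back to the first object
```

Large sequential reads and writes therefore touch all of a file's OSTs in turn, which is why aggregate OST bandwidth matters for the sizing below.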

How Many OSTs?

Capacity planning:

Number of OSTs = total capacity / capacity per OST

Typical OST sizes range from 24 TiB to 48 TiB. Maximum is 1024 TiB.

Throughput planning:

Number of OSTs = required aggregate bandwidth / bandwidth per OST

A single HDD-based OST can typically sustain 1–3 GB/s, depending on RAID configuration and spindle count. NVMe-based OSTs can sustain significantly more.

Rule of thumb: Each OSS typically serves 2 to 8 OSTs. Size your OSS count so that network bandwidth matches storage bandwidth.
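
The two formulas above can be combined by taking whichever one demands more OSTs. That combination is an assumption of this sketch, as are the function names and the example inputs (1000 TiB, 60 GB/s aggregate, 48 TiB and 2 GB/s per OST, 4 OSTs per OSS).

```python
import math


def ost_count(capacity_tib, tib_per_ost, bandwidth_gbs, gbs_per_ost):
    """Number of OSTs needed to satisfy both capacity and throughput targets."""
    by_capacity = math.ceil(capacity_tib / tib_per_ost)
    by_throughput = math.ceil(bandwidth_gbs / gbs_per_ost)
    return max(by_capacity, by_throughput)


def oss_count(osts, osts_per_oss):
    """Number of OSS nodes needed to host the given OSTs."""
    return math.ceil(osts / osts_per_oss)


osts = ost_count(capacity_tib=1000, tib_per_ost=48, bandwidth_gbs=60, gbs_per_ost=2)
print(osts)                # 30 (capacity alone would need only 21)
print(oss_count(osts, 4))  # 8
```

In this example throughput, not capacity, drives the OST count, which is common for HDD-based systems with modest capacity and high bandwidth targets.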

OST Storage Hardware

  • Streaming I/O pattern — high throughput is more important than IOPS.
  • RAID6 is common for HDD-based OSTs (good balance of capacity and protection).
  • For ZFS: RAIDZ2 (similar to RAID6) is recommended.
  • dRAID is available in ZFS for improved rebuild times.
  • One OST per block device or ZFS dataset. Do not partition.

Memory Sizing

MDS Memory

MDS memory is critical for performance. The MDS caches metadata (inodes, directory entries, lock state) in RAM.

Sizing formula:

MDS RAM = OS overhead (4 GB)
        + journal cache (4 GB)
        + client lock memory (clients × cores × files_per_core × 2 KB)
        + working set cache (working_set_files × 1.5 KB)

Worked example:

  • 1024 compute nodes, 32 cores each, 256 files/core open concurrently
  • 12 interactive/login nodes, 100K files each
  • 20 million file working set

OS + journal:        4 + 4 = 8 GB
Compute clients: 1024 × 32 × 256 × 2 KB = 16 GB
Interactive:      12 × 100K × 2 KB = 2.4 GB
Working set:      20M × 1.5 KB = 30 GB
Total: 8 + 16 + 2.4 + 30 = 56.4 GB, so provision ~60 GB minimum
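
The MDS sizing formula can be written as a small function. This is a sketch with hypothetical names; it uses decimal units throughout (1 GB = 1,000,000 KB), so its result differs slightly from the rounded figures in the worked example.

```python
KB_PER_GB = 1_000_000


def mds_ram_gb(compute_clients, cores_per_client, files_per_core,
               interactive_clients, files_per_interactive, working_set_files,
               os_gb=4, journal_gb=4, lock_kb=2, cache_kb=1.5):
    """Estimate minimum MDS RAM in GB from the formula on this page."""
    # Files held open (and therefore locked) by compute and interactive clients.
    locked_files = (compute_clients * cores_per_client * files_per_core
                    + interactive_clients * files_per_interactive)
    lock_total_kb = locked_files * lock_kb
    cache_total_kb = working_set_files * cache_kb
    return os_gb + journal_gb + (lock_total_kb + cache_total_kb) / KB_PER_GB


ram = mds_ram_gb(compute_clients=1024, cores_per_client=32, files_per_core=256,
                 interactive_clients=12, files_per_interactive=100_000,
                 working_set_files=20_000_000)
print(round(ram, 1))  # 57.2
```

The result is a floor, not a target: rounding up to the next common memory configuration leaves room for the page cache and for growth in the working set.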

For active-active DNE failover pairs, one MDS may have to serve two MDTs after a failover, so provision extra lock and cache memory on each MDS: at least 96 GB per MDS in this example.

OSS Memory

Rule of thumb:

  • 24 GB base + 4 GB per OST (non-failover)
  • 24 GB base + 8 GB per OST (HA failover, where one OSS may need to serve its partner's OSTs)

| Configuration | OSTs per OSS | Recommended RAM |
|---|---|---|
| 4 OSTs, no failover | 4 | 40 GB |
| 8 OSTs, no failover | 8 | 56 GB |
| 4 OSTs, HA failover | 4 (8 during failover) | 56 GB |
| 8 OSTs, HA failover | 8 (16 during failover) | 88 GB |
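
The OSS rule of thumb is simple enough to express as a one-line calculation. The function below is a hypothetical helper that reproduces the four rows of the table above.

```python
def oss_ram_gb(osts_per_oss, ha_failover=False, base_gb=24):
    """Recommended OSS RAM in GB: a base amount plus a per-OST allowance."""
    # With HA failover, each OST gets double the allowance so the surviving
    # OSS has enough memory to also serve its partner's OSTs.
    per_ost_gb = 8 if ha_failover else 4
    return base_gb + per_ost_gb * osts_per_oss


for osts, ha in [(4, False), (8, False), (4, True), (8, True)]:
    print(osts, "HA" if ha else "no failover", oss_ram_gb(osts, ha), "GB")
```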

Client Memory

Minimum: 2 GB. Clients with more RAM can cache more data locally, improving read performance. No special sizing formula needed.

Network Sizing

Fabric Selection

| Fabric | Bandwidth | Latency | LNet driver | Notes |
|---|---|---|---|---|
| 1 GbE | 125 MB/s | ~50 μs | ksocklnd | Testing only; too slow for production |
| 10 GbE | 1.2 GB/s | ~20 μs | ksocklnd | Small clusters, archival |
| 25 GbE | 3 GB/s | ~15 μs | ksocklnd | Mid-range clusters |
| 100 GbE | 12 GB/s | ~10 μs | ksocklnd | Large clusters |
| InfiniBand HDR (200 Gb/s) | 24 GB/s | ~1 μs | ko2iblnd | High-performance HPC |
| InfiniBand NDR (400 Gb/s) | 48 GB/s | ~1 μs | ko2iblnd | Latest generation HPC |

Key principle: Balance network bandwidth with storage bandwidth. There is no benefit in having 200 Gb/s InfiniBand if your OSTs can only sustain 2 GB/s each.
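
A quick way to apply this principle is to compare, per OSS, the network link against the combined bandwidth of the OSTs behind it. The sketch below does that; the function name is hypothetical, and the example inputs (OST counts and 2 GB/s per OST) are illustrative rather than measured values.

```python
def oss_bottleneck(network_gbs, osts, gbs_per_ost):
    """Return (deliverable GB/s, limiting side) for a single OSS."""
    storage_gbs = osts * gbs_per_ost
    if network_gbs < storage_gbs:
        limit = "network"
    elif network_gbs > storage_gbs:
        limit = "storage"
    else:
        limit = "balanced"
    return min(network_gbs, storage_gbs), limit


print(oss_bottleneck(24, 4, 2))  # (8, 'storage'): HDR link, 4 OSTs at 2 GB/s
print(oss_bottleneck(12, 8, 2))  # (12, 'network'): 100 GbE link, 8 OSTs at 2 GB/s
```

In the first case two thirds of the link is idle; in the second, a faster fabric or a second interface would be needed to use the storage fully.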

LNet Configuration

  • Default: LNet uses the first TCP interface. For most setups, create /etc/lnet.conf to specify the correct interface.
  • With the TCP driver (ksocklnd), Lustre nodes communicate on TCP port 988. Ensure this port is open in any firewall between nodes.
  • For multi-network environments, LNet supports routing between different subnets or fabric types. See LNet Router Config Guide.
  • For bandwidth aggregation, see Multi-Rail LNet.
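
As a quick sanity check for the port requirement above, a script can attempt a TCP connection to port 988 on a server. This is a sketch using only the Python standard library: it confirms that the port is reachable through firewalls, not that LNet itself is configured correctly. The function name and the host name in the usage comment are hypothetical.

```python
import socket


def tcp_port_open(host, port=988, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Usage (hypothetical host name):
#   tcp_port_open("mds01.example.com")
```

A False result from a client that should be able to reach the server usually points at a firewall rule or at LNet not being started on the server.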

Putting It Together: Example Configurations

Small: Research Lab (5 nodes)

| Role | Count | Hardware | Storage |
|---|---|---|---|
| Combined MGS/MDS | 1 | 4+ cores, 32 GB RAM | 100 GB SSD (RAID1) |
| OSS | 2 | 4+ cores, 32 GB RAM each | 2× 24 TB HDD (RAID6) per OSS = 4 OSTs total |
| Client | 2 | 2+ GB RAM | |
| Network | | 10 or 25 GbE | |

Total capacity: ~96 TB usable. Suitable for a small team.

Medium: Department Cluster (50 nodes)

| Role | Count | Hardware | Storage |
|---|---|---|---|
| Combined MGS/MDS | 1 (with HA pair) | 8+ cores, 64 GB RAM | 500 GB NVMe (RAID1) |
| OSS | 4 (2 HA pairs) | 8+ cores, 56 GB RAM each | 4× 48 TB HDD (RAID6) per OSS = 16 OSTs total |
| Client | 44 | 4+ GB RAM | |
| Network | | 25 GbE or InfiniBand HDR100 | |

Total capacity: ~770 TB usable. HA configured for MDS and OSS failover.

Large: HPC Center (1000+ nodes)

| Role | Count | Hardware | Storage |
|---|---|---|---|
| MGS | 1 (with HA pair) | 4+ cores, 16 GB RAM | 10 GB SSD (RAID1) |
| MDS | 2–4 (DNE, HA pairs) | 16+ cores, 96 GB RAM each | 2–4 TB NVMe (RAID1) per MDT |
| OSS | 50+ (HA pairs) | 8+ cores, 88 GB RAM each | 8× 48 TB (RAID6 or RAIDZ2) per OSS |
| Client | 1000+ | 4+ GB RAM | |
| LNet routers | 2–4 (if multi-fabric) | 8+ cores, 32 GB RAM | |
| Network | | InfiniBand HDR/NDR | |

Total capacity: 10+ PB usable. DNE for metadata scaling. Separate MGS for independent management.
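
The capacity totals quoted for the three example configurations follow directly from the OSS rows of their tables. The sketch below recomputes them with a hypothetical helper; for the large configuration it uses the minimum of 50 OSS nodes, so the result is a lower bound for that row.

```python
def usable_capacity_tb(oss_count, osts_per_oss, tb_per_ost):
    """Total OST capacity in TB for a uniform set of OSS nodes."""
    return oss_count * osts_per_oss * tb_per_ost


print(usable_capacity_tb(2, 2, 24))   # 96     small: research lab
print(usable_capacity_tb(4, 4, 48))   # 768    medium: department cluster
print(usable_capacity_tb(50, 8, 48))  # 19200  large: HPC center with 50 OSS
```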

Next Steps