Lustre IO Monitoring

Overview

Lustre exposes detailed I/O statistics at multiple layers — client, RPC, and server. Understanding these statistics is essential for diagnosing performance problems, identifying bottlenecks, and validating tuning changes.

This page focuses on interpreting the stats. For a broader monitoring overview, see the Lustre Monitoring and Statistics Guide. For tuning guidance, see Lustre Tuning.

Client-Side Statistics (llite)

Client stats show how applications interact with the Lustre filesystem:

lctl get_param llite.*.stats

Key counters:

Counter	Description
`read_bytes`	Total bytes read by applications
`write_bytes`	Total bytes written by applications
`open`	Number of file open operations
`close`	Number of file close operations
`mmap`	Number of memory-mapped I/O operations
`seek`	Number of lseek calls
`fsync`	Number of fsync calls (can indicate write-barrier-heavy workloads)
`truncate`	Number of truncate operations
`setattr`	Number of attribute changes (chmod, chown, etc.)
`getattr`	Number of attribute reads (stat calls)

To reset client stats:

lctl set_param llite.*.stats=clear

Reading the Output Format

Each stat line shows:

{name} {count} samples [{unit}] {min} {max} {sum} {sumsq}

For example:

read_bytes    500 samples [bytes] 4096 1048576 209715200 ...

This means 500 read operations occurred, with the smallest read being 4 KB, the largest 1 MB, and a total of 200 MB read.
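The same reading can be automated. The following is a minimal sketch, assuming the field order shown above; `StatLine` and `parse_stat_line` are hypothetical helper names, not part of any Lustre tooling. Counters that report only a count (no min/max/sum fields) are rejected by this sketch.

```python
from typing import NamedTuple


class StatLine(NamedTuple):
    name: str
    count: int
    unit: str
    minimum: int
    maximum: int
    total: int


def parse_stat_line(line: str) -> StatLine:
    """Parse '<name> <count> samples [<unit>] <min> <max> <sum> ...'."""
    fields = line.split()
    # fields[2] is the literal word 'samples'; fields[3] is '[unit]'.
    if len(fields) < 7 or fields[2] != "samples":
        raise ValueError(f"not a min/max/sum stat line: {line!r}")
    return StatLine(
        name=fields[0],
        count=int(fields[1]),
        unit=fields[3].strip("[]"),
        minimum=int(fields[4]),
        maximum=int(fields[5]),
        total=int(fields[6]),
    )


# The example line from above, without the elided sumsq field.
stat = parse_stat_line("read_bytes    500 samples [bytes] 4096 1048576 209715200")
mean_bytes = stat.total / stat.count  # sum divided by sample count
print(f"{stat.name}: mean {mean_bytes / 1024:.1f} KiB per call")
```

For the example line, the mean works out to 209715200 / 500 = 419430.4 bytes, or about 410 KB per read, which sits well between the 4 KB minimum and the 1 MB maximum.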

RPC Statistics (osc)

RPC stats reveal how the client packages I/O into RPCs sent to OSTs:

lctl get_param osc.*.rpc_stats

This produces histogram output with several sections:

Pages per RPC

Shows the distribution of RPC sizes in pages (typically 4 KB each):

pages per rpc         rpcs   % cumulative %
1:                     150  15  15
2:                      50   5  20
...
256:                   500  50 100

What to look for:

A healthy large-file workload should show most RPCs at the maximum size (256 pages = 1 MB by default).
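A high percentage of small RPCs (1-4 pages) indicates the application is doing small, non-sequential I/O. Adjusting the application's I/O pattern (larger, sequential requests) is usually the effective fix; raising max_pages_per_rpc only helps when RPCs are already reaching the current limit.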
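To put a number on "a high percentage of small RPCs", the histogram can be parsed and summed. This is a rough sketch under two assumptions: the simplified single-column layout shown above (real output may carry separate read and write column groups), and hypothetical helper names `parse_histogram` and `small_rpc_share`. The sample contains only the rows listed in the excerpt, so the share is computed over those rows alone.

```python
def parse_histogram(text: str) -> dict:
    """Map each numeric bucket label to its count, skipping other lines."""
    buckets = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only rows shaped like '<integer>: <count> ...'.
        if sep and label.strip().isdigit() and fields and fields[0].isdigit():
            buckets[int(label)] = int(fields[0])
    return buckets


def small_rpc_share(buckets: dict, small_max_pages: int = 4) -> float:
    """Fraction of RPCs carrying at most small_max_pages pages."""
    total = sum(buckets.values())
    if total == 0:
        return 0.0
    small = sum(n for pages, n in buckets.items() if pages <= small_max_pages)
    return small / total


# Only the rows visible in the excerpt above; elided rows are not guessed.
sample = """\
pages per rpc         rpcs   % cumulative %
1:                     150  15  15
2:                      50   5  20
256:                   500  50 100
"""
hist = parse_histogram(sample)
print(f"small RPCs: {small_rpc_share(hist):.0%} of the listed buckets")
```

The 4-page cutoff mirrors the "1-4 pages" range mentioned above; it is a parameter so that a different definition of "small" can be tried.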

RPCs in Flight

Shows how many RPCs are concurrently in flight:

rpcs in flight        rpcs   % cumulative %
1:                     200  20  20
2:                     300  30  50
...
8:                     100  10 100

What to look for:

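If most RPCs are at depth 1, the client is issuing little concurrent I/O (for example, a single-threaded, synchronous application) and may be latency-bound. Raising max_rpcs_in_flight (default 8) helps only when the histogram already reaches the current limit.
If the histogram is concentrated at the maximum value, the client is fully utilizing its RPC pipeline. The bottleneck is then likely on the server side or the network, or the max_rpcs_in_flight limit itself is too low.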
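The two checks above can be expressed as a small summary function. This is an illustrative sketch only: `summarize_in_flight` is a hypothetical helper, the input is an already-parsed `{depth: rpc_count}` mapping, and the counts used below are made up rather than taken from a real system.

```python
def summarize_in_flight(counts: dict, max_in_flight: int = 8) -> dict:
    """Summarize a {depth: rpc_count} histogram of RPCs in flight."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty histogram")
    # Count-weighted average depth across all buckets.
    mean_depth = sum(depth * n for depth, n in counts.items()) / total
    at_limit = sum(n for depth, n in counts.items() if depth >= max_in_flight)
    return {
        "mean_depth": mean_depth,
        "share_at_depth_1": counts.get(1, 0) / total,
        "share_at_limit": at_limit / total,
    }


# Hypothetical counts for illustration.
summary = summarize_in_flight({1: 200, 2: 300, 4: 400, 8: 100})
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

A large `share_at_depth_1` corresponds to the first case above, and a large `share_at_limit` to the second. Pass the actual `max_rpcs_in_flight` setting as `max_in_flight` if it differs from 8.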

Offset Distribution

Shows whether I/O is sequential or random.

What to look for:

Sequential workloads show offsets that increase monotonically.
Random workloads show a flat offset distribution across the file.

Server-Side BRW Statistics (obdfilter)

BRW (Bulk Read/Write) stats show the actual disk I/O patterns on the OSS:

lctl get_param obdfilter.*.brw_stats

Sections include:

Disk I/O Size

disk I/O size          ios   % cumulative %
4K:                    100  10  10
8K:                     50   5  15
...
1M:                    500  50 100

Large I/O sizes (512 KB–1 MB) indicate efficient bulk transfer.
Many small I/O sizes suggest fragmentation, small-file workloads, or clients sending suboptimal RPCs.
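To compare size buckets against a threshold such as 512 KB, the labels first need converting to bytes. The sketch below assumes labels use only a bare number or a `K`/`M` suffix with binary multiples, as in the excerpt above; `label_to_bytes` and `large_io_share` are hypothetical helpers, and the rows are limited to those visible in the excerpt.

```python
UNIT_BYTES = {"": 1, "K": 1024, "M": 1024 ** 2}


def label_to_bytes(label: str) -> int:
    """Convert a bucket label such as '4K' or '1M:' to a byte count."""
    label = label.strip().rstrip(":")
    unit = label[-1] if label[-1:].isalpha() else ""
    number = label[: len(label) - len(unit)]
    return int(number) * UNIT_BYTES[unit.upper()]


def large_io_share(rows, threshold_bytes: int = 512 * 1024) -> float:
    """Fraction of I/Os whose bucket size is at least threshold_bytes."""
    total = sum(count for _, count in rows)
    if total == 0:
        return 0.0
    large = sum(
        count for label, count in rows
        if label_to_bytes(label) >= threshold_bytes
    )
    return large / total


# Only the rows visible in the excerpt above; elided rows are not guessed.
rows = [("4K", 100), ("8K", 50), ("1M", 500)]
print(f"I/Os of 512 KB or more: {large_io_share(rows):.0%} of the listed buckets")
```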

Contiguous vs. Non-Contiguous Access

Shows whether I/O requests from clients target contiguous disk regions:

High contiguous access percentages indicate sequential I/O — good for throughput.
High non-contiguous (discontinuous) access means the server is seeking — a potential performance bottleneck on HDDs (less impactful on SSDs).

I/O Time

Shows the time distribution for completing I/O operations:

Look for long-tail latencies that could indicate slow disks, RAID rebuilds, or resource contention.

Interpreting the Histogram Format

All Lustre stats histograms use the same format:

{bucket_label}:  {count}  {percentage}  {cumulative_percentage}

count — number of operations in this bucket.
percentage — fraction of total operations in this bucket.
cumulative percentage — running total; 100% at the last bucket.

The bucket labels are usually powers of 2 (1, 2, 4, 8, 16, ...) representing pages, bytes, or microseconds depending on the section. Some sections, such as RPCs in flight, may instead count in linear steps (1, 2, 3, ...).
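The relationship between the three columns can be made concrete by rebuilding the percentage and cumulative percentage from raw counts. This is a minimal sketch with a hypothetical `histogram_rows` helper and made-up counts; it does not attempt to reproduce how Lustre itself rounds the printed values.

```python
def histogram_rows(counts):
    """Return (label, count, pct, cumulative_pct) for ordered (label, count) pairs."""
    total = sum(n for _, n in counts)
    running = 0
    rows = []
    for label, n in counts:
        running += n
        pct = 100 * n / total if total else 0.0
        cumulative = 100 * running / total if total else 0.0
        rows.append((label, n, pct, cumulative))
    return rows


# Hypothetical bucket counts for illustration.
for label, n, pct, cumulative in histogram_rows([(1, 10), (2, 30), (4, 60)]):
    print(f"{label}: {n:>6} {pct:>4.0f} {cumulative:>4.0f}")
```

With these counts the percentages are 10, 30, and 60, and the cumulative column reads 10, 40, and 100, reaching 100 at the last bucket as described above.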

Common Performance Anti-Patterns

Symptom	Where to Look	Likely Cause
Many small RPCs	`osc.*.rpc_stats` (pages per rpc)	Application doing small random I/O; misaligned I/O; stripe size too small
Low RPCs in flight	`osc.*.rpc_stats` (rpcs in flight)	`max_rpcs_in_flight` too low; single-threaded application
High queue depth on server	`obdfilter.*.brw_stats`	Server overloaded; too many clients; slow storage backend
Imbalanced I/O across OSTs	`lfs df` or per-OST stats	Uneven striping; hot files on specific OSTs; OST QoS weights
High getattr/setattr rate	`llite.*.stats`	Metadata-heavy workload (e.g., `ls -l` on large directories); consider MDT tuning
Many fsync calls	`llite.*.stats`	Application forcing write barriers; affects throughput significantly

Collecting Stats Over Time

To capture a baseline and then measure a workload:

# Clear stats
lctl set_param llite.*.stats=clear
lctl set_param osc.*.rpc_stats=clear

# Run workload
# ...

# Collect stats
lctl get_param llite.*.stats > /tmp/llite_stats.txt
lctl get_param osc.*.rpc_stats > /tmp/rpc_stats.txt

On servers:

lctl set_param obdfilter.*.brw_stats=clear
# ... (after workload)
lctl get_param obdfilter.*.brw_stats > /tmp/brw_stats.txt
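Where clearing stats is undesirable (for example, on a shared system), an alternative is to save one snapshot before the workload and one after, then subtract. The sketch below is one way to do that under stated assumptions: `read_counts` and `diff_counts` are hypothetical helpers, the snapshot text comes from a single mount (counters from several mounts would overwrite one another), and the sample values are made up.

```python
def read_counts(snapshot_text: str) -> dict:
    """Return {counter_name: sample_count} from llite-style stats text."""
    counts = {}
    for line in snapshot_text.splitlines():
        fields = line.split()
        # Keep only '<name> <count> samples ...' lines; skip everything else.
        if len(fields) >= 3 and fields[1].isdigit() and fields[2] == "samples":
            counts[fields[0]] = int(fields[1])
    return counts


def diff_counts(before: dict, after: dict) -> dict:
    """Per-counter increase between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


# Hypothetical snapshots for illustration.
before_text = """\
open        100 samples [regs]
read_bytes  500 samples [bytes] 4096 1048576 209715200
"""
after_text = """\
open        130 samples [regs]
read_bytes  900 samples [bytes] 4096 1048576 419430400
fsync         4 samples [regs]
"""
delta = diff_counts(read_counts(before_text), read_counts(after_text))
for name, increase in sorted(delta.items()):
    print(f"{name}: +{increase}")
```

In practice the two text snapshots would be files written by `lctl get_param llite.*.stats` before and after the workload, read back with `open(path).read()`.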
