IOR: Difference between revisions

Latest revision as of 08:18, 13 October 2025

Description

IOR (Interleaved or Random) is a commonly used file system benchmarking application particularly well-suited for evaluating the performance of parallel file systems. The software is most commonly distributed in source code form and normally needs to be compiled on the target platform.

IOR is not a Lustre-specific benchmark and can be run on any POSIX-compliant file system, but it does require a fully installed and configured file system implementation in order to run. For Lustre, this means the MGS, MDS and OSS services must be installed, configured and running, and that there is a population of Lustre client nodes running with the Lustre file system mounted.

Purpose

IOR can be used for testing performance of parallel file systems using various interfaces and access patterns. IOR uses MPI for process synchronisation – typically there are several IOR processes running in parallel across several nodes in an HPC cluster. As a user-space benchmarking application it is suitable for comparing the performance of different file systems. Typically one IOR process is run on each participating client node mounting the target file system but this is completely configurable.

Preparation

The ior and mdtest programs are built from the same IOR project source tree and share identical build and dependency steps.

For current build and compilation instructions (prerequisites, release tarball usage, configure, make, optional install), see the "Download and Compile MDTest" section of the MDTest page. That procedure produces both the ior and mdtest binaries (./src/ior and ./src/mdtest). After building once, copy the binaries to every client node or place them on a shared file system.

What IOR Measures

Parallel data throughput characteristics:

Sequential / random read & write bandwidth
Shared-file vs. file-per-process scaling
Influence of transfer size (-t) vs. block size (-b)
Storage layout / striping effects (e.g. Lustre directives)
Read-back ordering / cache behavior (-C)

Key Options (Common Subset)

(Use ./src/ior -h for the authoritative list.)

-a <API>: Select the I/O API (POSIX is the default; MPIIO, HDF5, etc.)
-w / -r: Write phase / read phase
-k: Keep files (skip removal)
-o <path[@path...]>: Output file or list of files (used as a template when combined with -F)
-F: File-per-process
-u: Unique directory per process (with -F) to reduce directory contention
-t <size>: Transfer size (I/O request size, e.g. 1m)
-b <size>: Block size per process (must be a multiple of -t)
-i <N>: Iterations
-v (-vv …): Verbosity
-C: Reorder tasks on the read phase
-m: Use the iteration count as the number of files (multi-file mode)
-O "directive[,...]": Implementation directives (Lustre examples):

 * lustreStripeCount=<int>
 * lustreStripeSize=<bytes>
 * lustreStartOST=<int>
 * lustreIgnoreLocks=1
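
As a small illustration of the comma-separated directive syntax, the following sketch builds a string suitable for -O. The helper name lustre_directives and the stripe values are hypothetical and chosen only for the example; check ./src/ior -h for the directives supported by the installed build.

```shell
#!/bin/sh
# Build a comma-separated directive string for -O.
# Arguments: stripe count, stripe size in bytes.
# (lustre_directives is a hypothetical helper, not part of IOR.)
lustre_directives() {
    echo "lustreStripeCount=$1,lustreStripeSize=$2"
}

# Example: 4 stripes of 4 MiB each (illustrative values only).
echo "-O \"$(lustre_directives 4 $((4 * 1024 * 1024)))\""
```

The printed text can be pasted into an ior command line in place of the -O argument used in the examples later on this page.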

Sizing:

Shared-file total size = block_size * num_processes
File-per-process total = block_size * num_processes (each file = block_size)
Ensure block_size % transfer_size == 0
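
The sizing rules above can be checked before a run with a few lines of shell. This is a minimal sketch: the helper names to_bytes and aggregate_bytes are hypothetical, and it assumes the k, m and g suffixes denote binary multiples (1024, 1024^2, 1024^3) and that the default of one segment per process is in use.

```shell
#!/bin/sh
# Convert a size such as 256k, 1m or 4g into bytes.
# Assumption: suffixes are binary multiples (k=1024, m=1024^2, g=1024^3).
to_bytes() {
    num=${1%[kKmMgG]}
    case "$1" in
        *[kK]) echo $((num * 1024)) ;;
        *[mM]) echo $((num * 1024 * 1024)) ;;
        *[gG]) echo $((num * 1024 * 1024 * 1024)) ;;
        *)     echo "$num" ;;
    esac
}

# Print the aggregate data size in bytes, or fail if the block size
# is not a multiple of the transfer size.
# Arguments: block size (-b), transfer size (-t), number of processes.
aggregate_bytes() {
    block=$(to_bytes "$1")
    xfer=$(to_bytes "$2")
    if [ $((block % xfer)) -ne 0 ]; then
        echo "block size $1 is not a multiple of transfer size $2" >&2
        return 1
    fi
    echo $((block * $3))
}

# 32 processes writing 1 GiB each with 1 MiB transfers.
aggregate_bytes 1g 1m 32
```

Comparing the printed value with the free space on the target file system, and with the combined client cache, helps avoid runs that either fail part-way or measure cache instead of storage.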

Designing Test Cases

Access pattern: Compare shared file vs. -F (file-per-process) to separate locking from raw bandwidth.
Scaling: Weak (constant per-rank block) vs. strong (constant aggregate size) scaling reveals saturation points.
Transfer size sweep: Test 256k, 1m, 4m to observe protocol/RPC efficiency and network effects.
Striping strategy: Adjust lustreStripeCount / size to match or exceed active ranks touching a file.
Cache influence: Use -C on read; optionally increase dataset beyond aggregate client cache to observe backend limits.
Iterations: Multiple (-i 3–5+) for stability.
Large dataset: Ensure data > (client_cache * participating_nodes) to reduce artificial cache inflation.
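
A transfer size sweep is easy to script. The sketch below only prints the command lines so they can be reviewed before anything is launched; the binary path, target file, process count and hostfile location are placeholders borrowed from the examples on this page and should be adjusted locally.

```shell
#!/bin/sh
# Placeholder values for illustration; override via the environment.
IOREXE=${IOREXE:-/lustre/demo/bin/ior}
TARGET=${TARGET:-/lustre/demo/iorbench/ior-test.file}
NPROC=${NPROC:-32}

# Print one mpirun command line per transfer size without running anything.
# The 1g block size is a multiple of every transfer size in the sweep.
print_sweep() {
    for xfer in 256k 1m 4m; do
        echo "mpirun -np ${NPROC} --map-by node -hostfile ./hfile" \
             "${IOREXE} -w -r -C -i 3 -t ${xfer} -b 1g -o ${TARGET}"
    done
}

print_sweep
```

Keeping the block size fixed while only -t changes means the aggregate data size stays constant, so differences between the runs can be attributed to the transfer size.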

Interpreting Results

Key output fields:

bw(MiB/s): Compare write vs. read symmetry; large disparity may indicate caching or read-ahead limits.
open/close times: Elevated values suggest metadata contention or locking overhead.
Min vs. Max bandwidth gap: Imbalance; investigate stripe distribution or network variability.
Iteration consistency: Divergence over iterations may reflect throttling, congestion, or caching.

Diagnostic clues:

Improving read after first iteration: Cache-dominated.
Flat scaling when adding ranks: Stripe or network saturation; increase stripe count or per-OST concurrency.
High variance with -F but not shared: Directory or path resolution contention; evaluate -u or layout.
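
To put a number on the min-versus-max gap and on iteration consistency, the per-iteration bandwidth values can be fed to a short awk filter. This is a sketch only: bw_spread is a hypothetical helper, it expects one value per line that has already been copied out of the IOR output, and the sample figures are invented.

```shell
#!/bin/sh
# Read one bandwidth value (MiB/s) per line on stdin and print
# min, max, mean, and the min-to-max gap as a percentage of max.
bw_spread() {
    awk '
        { v = $1 + 0 }
        NR == 1 { min = v; max = v }
        { if (v < min) min = v; if (v > max) max = v; sum += v }
        END {
            if (NR == 0) exit 1
            printf "min=%.1f max=%.1f mean=%.1f gap=%.1f%%\n",
                   min, max, sum / NR, (max - min) * 100 / max
        }'
}

# Invented per-iteration values, for illustration only.
printf '%s\n' 1800 2000 1900 1700 | bw_spread
```

What counts as an acceptable gap depends on the system; the value is most useful when tracked across runs with the same configuration.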

Prepare the run-time environment

Most run-time environment guidance (user account considerations, SSH key usage, hostfile creation concepts, MPI module loading) is identical to that for mdtest and is not repeated here; see the Benchmark Execution section of the MDTest page for full details.

Minimal IOR prerequisites:

MPI runtime on all client nodes
Consistent user (UID/GID) across nodes
ior binary accessible everywhere (shared FS recommended)
MPI environment loaded (e.g. module load mpi/openmpi-x86_64)
(Optional) Passwordless SSH if required by MPI implementation
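
The prerequisites above can be spot-checked on a node with a short script. This is a sketch under assumptions: preflight is a hypothetical helper, the default binary path is the one used in the examples on this page, and the check covers only the local node, so it would need to be run on every client to compare the reported user and IDs.

```shell
#!/bin/sh
# Report whether each prerequisite is present on the local node.
# Argument: path to the ior binary. Returns non-zero if anything is missing.
preflight() {
    ior_path=$1
    rc=0
    if command -v mpirun >/dev/null 2>&1; then
        echo "ok: mpirun found"
    else
        echo "missing: mpirun not in PATH"; rc=1
    fi
    if [ -x "$ior_path" ]; then
        echo "ok: $ior_path is executable"
    else
        echo "missing: $ior_path not executable"; rc=1
    fi
    # Printed so the values can be compared across nodes.
    echo "user: $(id -un) uid=$(id -u) gid=$(id -g)"
    return $rc
}

preflight "${IOREXE:-/lustre/demo/bin/ior}" || echo "preflight failed on $(uname -n)"
```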

Benchmark Execution

Setup

Log in to one of the compute nodes as the benchmark user.
Create a host file for the mpirun command, containing the list of Lustre clients that will be used for the benchmark. Each line in the file represents a machine and the number of slots (usually equal to the number of CPU cores). For example:
```
for i in `seq -f "%03g" 1 32`; do
  echo "n"$i" slots=16"
done > $HOME/hfile

# Result:
n001 slots=16
n002 slots=16
n003 slots=16
n004 slots=16
...
```
- The first column of the host file contains the node name. An IP address can be used instead if neither /etc/hosts nor DNS is set up.
- The second column gives the number of slots, that is, the number of processes that may be launched on that node (usually the number of CPU cores).
Run a quick test using mpirun to launch the benchmark and verify that the environment is set up correctly. For example:
```
mpirun --hostfile $HOME/hfile --map-by node -np `cat $HOME/hfile|wc -l` hostname
```
This should return the hostnames of all the machines that are in the test environment. The results are returned unsorted, in order of completion.
Note: If --map-by node does not work, and the output contains only one or a very small number of unique hostnames repeated many times, then set slots=1 for each host in the host file. Otherwise, mpirun will fill up the slots on the first node before launching processes on subsequent nodes.
This may be desirable for multi-process tests but not for the single task per client test. Do not set the slot count higher than the number of cores present. If over-subscription is required, set the -np flag to greater than the number of physical cores. This informs OpenMPI that the task will be oversubscribed and will run in a mode that yields the processor to peers.
Refer to: OpenMPI FAQ -- Oversubscribing Nodes, and also the notes on OpenMPI at the end of this document.
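
Two host file chores from the steps above, counting the hosts for -np and falling back to slots=1, can be sketched as follows. The helper names count_hosts and one_slot_per_host are hypothetical, and the demonstration uses a temporary file rather than a real host file.

```shell
#!/bin/sh
# Count host entries, ignoring comment lines and blank lines.
count_hosts() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" | wc -l | tr -d ' '
}

# Print a copy of the host file with every slots=N changed to slots=1.
one_slot_per_host() {
    sed 's/slots=[0-9][0-9]*/slots=1/' "$1"
}

# Demonstration on a throwaway file.
hf=$(mktemp)
printf '%s\n' '# benchmark clients' 'n001 slots=16' '' 'n002 slots=16' > "$hf"
count_hosts "$hf"
one_slot_per_host "$hf"
rm -f "$hf"
```

Unlike a plain wc -l, the count skips comments and empty lines, so the value passed to -np matches the number of usable hosts.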

Example: IOR Read / Write Test, Single File, Multiple Clients

The following annotated script demonstrates how to configure an IOR benchmark for a test that uses a single shared file:

#!/bin/bash
module purge
module load mpi/openmpi-x86_64

IOREXE="/lustre/demo/bin/ior"

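# Node count: every non-comment line of ./hfile is counted, so blank lines
# inflate the result. Run the script from the directory that contains hfile.
NCT=`grep -v ^# hfile |wc -l`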

# Date Stamp for benchmark
DS=`date +"%F_%H:%M:%S"`
# IOR will be run in a loop, doubling the number of processes per client node
# with every iteration from $SEQ -> $MAXPROCS. If SEQ=1 and MAXPROCS=8, then the
# iterations will be 1, 2, 4, 8 processes per node.
# SEQ and MAXPROCS should be a power of 2 (including 2^0).
SEQ=1
MAXPROCS=8

# Data written per client node in GiB (aggregate = DATA_SIZE * node count).
# Must be >= MAXPROCS and should be a power of 2.
DATA_SIZE=8

BASE_DIR=/lustre/demo/iorbench
mkdir -p ${BASE_DIR}

while [ ${SEQ} -le ${MAXPROCS} ]; do
NPROC=`expr ${NCT} \* ${SEQ}`
# Pick a reasonable block size, bearing in mind the size of the target file system.
# Bear in mind that the overall data size will be block size * number of processes.
# Block size must be a multiple of transfer size (-t option in command line).
BSZ=`expr ${DATA_SIZE} / ${SEQ}`"g"
# Alternatively, set to a static value and let the data size increase.
# BSZ="1g"
# BSZ="${DATA_SIZE}"
mpirun -np ${NPROC} --map-by node -hostfile ./hfile \
  ${IOREXE} -v -w -r -i 4 \
  -o ${BASE_DIR}/ior-test.file \
  -t 1m -b ${BSZ} \
  -O "lustreStripeCount=-1" | tee ${HOME}/IOR-RW-Single_File-c_${NCT}-s_${SEQ}_${DS}
SEQ=`expr ${SEQ} \* 2`
done
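
The loop arithmetic in the script can be previewed without launching MPI. The sketch below reproduces only the calculation of the process count and block size for each pass; plan_runs is a hypothetical helper, and the arguments (32 nodes, up to 8 processes per node, DATA_SIZE of 8) are example values.

```shell
#!/bin/sh
# Print the process count and block size for each pass of the loop.
# Arguments: client node count, max processes per node, DATA_SIZE in GiB.
plan_runs() {
    nct=$1; maxprocs=$2; data_size=$3
    seq=1
    while [ "$seq" -le "$maxprocs" ]; do
        nproc=$((nct * seq))        # same as NPROC in the script
        bsz=$((data_size / seq))    # same as BSZ, without the "g" suffix
        echo "procs_per_node=${seq} np=${nproc} block=${bsz}g total=$((nproc * bsz))g"
        seq=$((seq * 2))
    done
}

plan_runs 32 8 8
```

With these inputs every pass moves the same aggregate amount of data (node count times DATA_SIZE), which is what makes the results of the passes comparable.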

Example: IOR Read/Write Test, Multiple Files per Process, Multiple Clients

This script is similar to the previous example, but this time the -F flag is used, informing IOR to create a unique file per process. Additionally, the Lustre stripe count is set to 1.

#!/bin/bash

module purge
module load mpi/openmpi-x86_64

IOREXE="/lustre/demo/bin/ior"

NCT=`grep -v ^# hfile |wc -l`
DS=`date +"%F_%H:%M:%S"`
SEQ=1
MAXPROCS=8
DATA_SIZE=8
BASE_DIR=/lustre/demo/iorbench
# The -o path below places the files in ${BASE_DIR}/test, so create that directory too.
mkdir -p ${BASE_DIR}/test

while [ ${SEQ} -le ${MAXPROCS} ]; do
NPROC=`expr ${NCT} \* ${SEQ}`
BSZ=`expr ${DATA_SIZE} / ${SEQ}`"g"
# BSZ="1g"
# BSZ="${DATA_SIZE}"
mpirun -np ${NPROC} --map-by node -hostfile ./hfile \
  ${IOREXE} -v -w -r -i 4 -F \
  -o ${BASE_DIR}/test/ior-test.file \
  -t 1m -b ${BSZ} \
  -O "lustreStripeCount=1" | tee ${HOME}/IOR-RW-Multiple_Files-Common_Dir-c_${NCT}-s_${SEQ}_${DS}
SEQ=`expr ${SEQ} \* 2`
done

Optionally, add the -u flag to create a unique directory for each file created. The full file name paths for each process can also be specified by supplying a list of files delimited by @ to the -o flag. This can be useful for DNE testing.
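
A list of per-process paths for -o can be assembled with a small helper. This is a sketch: join_paths is a hypothetical function, and the directory names are invented stand-ins for directories that would, for example, be placed on different MDTs in a DNE test.

```shell
#!/bin/sh
# Join the given file paths with '@' for use as the argument to -o.
join_paths() {
    out=""
    for p in "$@"; do
        out="${out:+${out}@}${p}"
    done
    echo "$out"
}

# Invented directories, for illustration only.
join_paths /lustre/demo/iorbench/dir0/ior-test.file \
           /lustre/demo/iorbench/dir1/ior-test.file
```

The directories named in the list need to exist before the run starts.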

Notes on MPI

Refer to the mdtest documentation for shared MPI guidance (process mapping, oversubscription, binding, examples); see the Notes on OpenMPI and Benchmark Execution sections of the MDTest page. Key reminders:

Use --map-by node to distribute ranks evenly.
Adjust slots or add --oversubscribe (if supported) when intentionally exceeding physical cores.
Validate placement with a trivial command (e.g. mpirun ... hostname) before running IOR at scale.
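
Placement is easier to judge when the hostname output is summarised. The following sketch counts ranks per host from whatever is piped into it; placement_summary is a hypothetical helper, and the sample input stands in for real mpirun output.

```shell
#!/bin/sh
# Summarise hostnames read from stdin: ranks per host, then the number of unique hosts.
placement_summary() {
    sort | uniq -c | awk '
        { printf "%s ranks=%d\n", $2, $1; n++ }
        END { printf "unique_hosts=%d\n", n }'
}

# Typical use (requires a working MPI environment):
#   mpirun --hostfile $HOME/hfile --map-by node -np 4 hostname | placement_summary
# Sample input standing in for mpirun output:
printf '%s\n' n002 n001 n002 n001 | placement_summary
```

An even distribution shows the same rank count on every host, and unique_hosts should equal the number of entries in the host file.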

References

https://github.com/hpc/ior
