IOR

From Lustre Wiki

Latest revision as of 08:18, 13 October 2025

Description

IOR (Interleaved or Random) is a commonly used file system benchmarking application particularly well-suited for evaluating the performance of parallel file systems. The software is most commonly distributed in source code form and normally needs to be compiled on the target platform.

IOR is not a Lustre-specific benchmark and can be run on any POSIX-compliant file system, but it does require a fully installed and configured file system implementation in order to run. For Lustre, this means the MGS, MDS and OSS services must be installed, configured and running, and that there is a population of Lustre client nodes running with the Lustre file system mounted.

Purpose

IOR can be used for testing performance of parallel file systems using various interfaces and access patterns. IOR uses MPI for process synchronisation – typically there are several IOR processes running in parallel across several nodes in an HPC cluster. As a user-space benchmarking application it is suitable for comparing the performance of different file systems. Typically one IOR process is run on each participating client node mounting the target file system but this is completely configurable.

Preparation

The ior and mdtest programs are built from the same IOR project source tree and share identical build and dependency steps.

For current build and compilation instructions (prerequisites, release tarball usage, configure, make, optional install), refer to the Download and Compile MDTest section of the MDTest page. That procedure produces both the ior and mdtest binaries (./src/ior and ./src/mdtest). After building once, copy the binaries to every client node or place them on a shared file system.

What IOR Measures

Parallel data throughput characteristics:

  • Sequential / random read & write bandwidth
  • Shared-file vs. file-per-process scaling
  • Influence of transfer size (-t) vs. block size (-b)
  • Storage layout / striping effects (e.g. Lustre directives)
  • Read-back ordering / cache behavior (-C)

Key Options (Common Subset)

(Use ./src/ior -h for the authoritative list.)

  • -a API Select I/O API (POSIX default; MPIIO, HDF5, etc.)
  • -w / -r Write phase / read phase
  • -k Keep files (skip removal)
  • -o <path[@path...]> Output file or list (template when combined with -F)
  • -F File-per-process
  • -u Unique directory per process (with -F) to reduce directory contention
  • -t <size> Transfer size (I/O request size, e.g. 1m)
  • -b <size> Block size per process (must be multiple of -t)
  • -i <N> Iterations
  • -v (-vv …) Verbosity
  • -C Reorder tasks on read phase
  • -m Use iteration count as number of files (multi-file mode)
  • -O "directive[,...]" Implementation directives (Lustre examples):
      • lustreStripeCount=<int>
      • lustreStripeSize=<bytes>
      • lustreStartOST=<int>
      • lustreIgnoreLocks=1
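
As a rough illustration of how these options combine, the sketch below assembles an IOR command line and prints it instead of running it. The function name ior_cmd, the iteration count and the paths are hypothetical placeholders; check every flag against ./src/ior -h for the installed version.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) an IOR command line built from
# the options listed above. Paths and sizes are placeholders.
ior_cmd() {
  local mode="$1"      # "shared" or "fpp" (file-per-process)
  local xfer="$2"      # transfer size, e.g. 1m
  local block="$3"     # block size per process, e.g. 4g
  local out="$4"       # output file (a template when -F is used)
  local flags="-a POSIX -w -r -i 3 -t ${xfer} -b ${block} -o ${out}"
  if [ "${mode}" = "fpp" ]; then
    flags="${flags} -F"
  fi
  echo "ior ${flags}"
}

ior_cmd shared 1m 4g /lustre/demo/iorbench/shared.dat
ior_cmd fpp 1m 4g /lustre/demo/iorbench/fpp.dat
```

Printing the command first makes it easy to review the flags before wrapping the line in mpirun.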

Sizing:

  • Shared-file total size = block_size * num_processes
  • File-per-process total = block_size * num_processes (each file = block_size)
  • Ensure block_size % transfer_size == 0
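
The sizing rules can be checked with a little shell arithmetic before a run is launched. This is a minimal sketch: the helper names are hypothetical, and it assumes that the k, m and g suffixes are binary multiples (1k = 1024 bytes).

```shell
#!/bin/bash
# Convert an IOR-style size (e.g. 256k, 1m, 4g) to bytes.
# Assumption: suffixes are binary multiples (k = 1024).
to_bytes() {
  local v="$1" num unit
  num="${v%[kKmMgG]}"       # strip one trailing unit letter, if present
  unit="${v#"$num"}"        # whatever was stripped (may be empty)
  case "${unit}" in
    k|K) echo $(( num * 1024 )) ;;
    m|M) echo $(( num * 1024 * 1024 )) ;;
    g|G) echo $(( num * 1024 * 1024 * 1024 )) ;;
    *)   echo "${num}" ;;
  esac
}

# Aggregate data written = block size * number of processes
# (the same total for shared-file and file-per-process runs).
aggregate_bytes() {
  echo $(( $(to_bytes "$1") * $2 ))
}

# Succeeds only if the block size is a whole multiple of the transfer size.
check_multiple() {
  [ $(( $(to_bytes "$1") % $(to_bytes "$2") )) -eq 0 ]
}

aggregate_bytes 1g 64                         # 64 processes, 1 GiB each
check_multiple 1g 1m && echo "block size OK"
```

The aggregate figure is also the minimum free space the target file system needs for the write phase.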

Designing Test Cases

  • Access pattern: Compare shared file vs. -F (file-per-process) to separate locking from raw bandwidth.
  • Scaling: Weak (constant per-rank block) vs. strong (constant aggregate size) scaling reveals saturation points.
  • Transfer size sweep: Test 256k, 1m, 4m to observe protocol/RPC efficiency and network effects.
  • Striping strategy: Adjust lustreStripeCount / size to match or exceed active ranks touching a file.
  • Cache influence: Use -C on read; optionally increase dataset beyond aggregate client cache to observe backend limits.
  • Iterations: Multiple (-i 3–5+) for stability.
  • Large dataset: Ensure data > (client_cache * participating_nodes) to reduce artificial cache inflation.
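
One way to turn these guidelines into a test plan is to enumerate the combinations first. The sketch below prints one candidate IOR invocation for each access pattern and transfer size; nothing is executed. The block size, iteration count and output path are placeholder assumptions, not recommendations.

```shell
#!/bin/bash
# Print one candidate IOR invocation per (access pattern, transfer size) pair.
# Nothing is executed; review the list, then wrap each line in mpirun.
BLOCK="4g"                                # placeholder per-process block size
OUT="/lustre/demo/iorbench/sweep.dat"     # placeholder output path

sweep_matrix() {
  local pattern xfer extra
  for pattern in shared fpp; do
    for xfer in 256k 1m 4m; do
      extra=""
      if [ "${pattern}" = "fpp" ]; then
        extra=" -F"                       # file-per-process variant
      fi
      echo "ior -w -r -C -i 3 -t ${xfer} -b ${BLOCK} -o ${OUT}${extra}"
    done
  done
}

sweep_matrix
```

Keeping the block size fixed across the sweep means that only the transfer size and the access pattern change between runs, which makes the results easier to compare.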

Interpreting Results

Key output fields:

  • bw(MiB/s): Compare write vs. read symmetry; large disparity may indicate caching or read-ahead limits.
  • open/close times: Elevated values suggest metadata contention or locking overhead.
  • Min vs. Max bandwidth gap: Imbalance; investigate stripe distribution or network variability.
  • Iteration consistency: Divergence over iterations may reflect throttling, congestion, or caching.

Diagnostic clues:

  • Improving read after first iteration: Cache-dominated.
  • Flat scaling when adding ranks: Stripe or network saturation; increase stripe count or per-OST concurrency.
  • High variance with -F but not shared: Directory or path resolution contention; evaluate -u or layout.
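
When output is saved to a file (for example with tee), the headline figures can be pulled out for a quick write/read comparison. The sketch below assumes summary lines of the form printed by IOR 3.0.x, such as "Max Write: 1072.96 MiB/sec (1125.08 MB/sec)"; other versions may format the summary differently, so treat the patterns as assumptions. The function name is hypothetical.

```shell
#!/bin/bash
# Extract "Max Write" / "Max Read" bandwidth (MiB/s) from saved IOR output
# and print the read/write ratio. Reads the named files, or stdin if none.
# Assumed line format (IOR 3.0.x):
#   Max Write: 1072.96 MiB/sec (1125.08 MB/sec)
summarize_ior() {
  awk '
    /^Max Write:/ { w = $3 }
    /^Max Read:/  { r = $3 }
    END {
      if (w == "" || r == "") { print "no summary lines found"; exit 1 }
      printf "write=%s read=%s read/write=%.2f\n", w, r, r / w
    }' "$@"
}

# Typical use on a saved log:
#   summarize_ior "${HOME}/ior-run.log"
```

A ratio far from 1 is a prompt to look at caching and read-ahead, as described above, not a diagnosis in itself.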

Prepare the run-time environment

Most run-time environment guidance (user account considerations, SSH key usage, hostfile creation concepts, MPI module loading) is identical to mdtest and not repeated here. See: MDTest (Benchmark Execution section) for full details.

Minimal IOR prerequisites:

  • MPI runtime on all client nodes
  • Consistent user (UID/GID) across nodes
  • ior binary accessible everywhere (shared FS recommended)
  • MPI environment loaded (e.g. module load mpi/openmpi-x86_64)
  • (Optional) Passwordless SSH if required by MPI implementation
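
A short pre-flight check can catch missing prerequisites on the launch node before any MPI job is started. This is a sketch under stated assumptions: the function name preflight is hypothetical, and it only inspects the local node, so it does not prove that the remote nodes are set up correctly.

```shell
#!/bin/bash
# Hypothetical pre-flight check for the prerequisites listed above.
# Prints one line per check and returns non-zero if any check fails.
preflight() {
  local ior_bin="$1" hostfile="$2" rc=0
  if command -v mpirun >/dev/null 2>&1; then
    echo "ok: mpirun found"
  else
    echo "missing: mpirun not in PATH"; rc=1
  fi
  if [ -x "${ior_bin}" ]; then
    echo "ok: ${ior_bin} is executable"
  else
    echo "missing: ${ior_bin} not executable"; rc=1
  fi
  if [ -s "${hostfile}" ]; then
    echo "ok: ${hostfile} is non-empty"
  else
    echo "missing: ${hostfile} absent or empty"; rc=1
  fi
  return "${rc}"
}

# Typical use:
#   preflight /lustre/demo/bin/ior "$HOME/hfile" || exit 1
```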

Benchmark Execution

Setup

  1. Log in to one of the compute nodes as the benchmark user
  2. Create a host file for the mpirun command, containing the list of Lustre clients that will be used for the benchmark. Each line in the file represents a machine and the number of slots (usually equal to the number of CPU cores). For example:
    for i in `seq -f "%03g" 1 32`; do
      echo "n"$i" slots=16"
    done > $HOME/hfile
    
    # Result:
    n001 slots=16
    n002 slots=16
    n003 slots=16
    n004 slots=16
    ...
    
    • The first column of the host file contains the node name. This can also be an IP address if the /etc/hosts file or DNS is not set up.
    • The second column sets the number of slots available on the node, usually equal to the number of CPU cores.
  3. Run a quick test using mpirun to launch the benchmark and verify that the environment is set up correctly. For example:
    mpirun --hostfile $HOME/hfile --map-by node -np `cat $HOME/hfile|wc -l` hostname
    

    This should return the hostnames of all the machines that are in the test environment. The results are returned unsorted, in order of completion.

    Note: If the --map-by node option does not work, and the output contains only one or a very small number of unique hostnames, then set slots=1 for each host in the host file. Otherwise, mpirun will fill up the slots on the first node before launching processes on subsequent nodes.

    Filling one node first may be desirable for multi-process tests but not for the single-task-per-client test. Do not set the slot count higher than the number of cores present. If over-subscription is required, set the -np flag to a value greater than the number of physical cores. This informs OpenMPI that the nodes will be oversubscribed, and it will run in a mode that yields the processor to peers.

    Refer to the OpenMPI FAQ entry on oversubscribing nodes (https://www.open-mpi.org/faq/?category=all#oversubscribing), and also the Notes on MPI section at the end of this document.
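
The hostname test in step 3 is easier to read when the output is summarised per host. The sketch below counts ranks per hostname and reports whether the spread is even. The function name is hypothetical, and because it only sees hosts that appear in the output, a node that received no ranks at all has to be spotted by comparing the host count with the host file.

```shell
#!/bin/bash
# Summarise the output of `mpirun ... hostname`: ranks per host, plus a
# non-zero exit status when ranks are not spread evenly.
# Reads one hostname per line on stdin.
placement_report() {
  sort | uniq -c | awk '
    { print $2 ": " $1 " rank(s)"
      if (NR == 1 || $1 < min) min = $1
      if (NR == 1 || $1 > max) max = $1 }
    END {
      if (NR == 0) { print "no hostnames read"; exit 1 }
      if (max != min) { print "uneven placement"; exit 1 }
      print "even placement across " NR " host(s)" }'
}

# Typical use (requires a working MPI environment):
#   mpirun --hostfile $HOME/hfile --map-by node -np 4 hostname | placement_report
```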

Example: IOR Read / Write Test, Single File, Multiple Clients

The following annotated script demonstrates how to configure an IOR benchmark for a single shared-file test:

#!/bin/bash
module purge
module load mpi/openmpi-x86_64

IOREXE="/lustre/demo/bin/ior"

# Node count -- not very accurate
NCT=`grep -v ^# hfile |wc -l`

# Date Stamp for benchmark
DS=`date +"%F_%H:%M:%S"`
# IOR will be run in a loop, doubling the number of processes per client node
# with every iteration from $SEQ -> $MAXPROCS. If SEQ=1 and MAXPROCS=8, then the
# iterations will be 1, 2, 4, 8 processes per node.
# SEQ and MAXPROCS should be a power of 2 (including 2^0).
SEQ=1
MAXPROCS=8

# Overall data set size in GiB. Must be >=MAXPROCS. Should be a power of 2.
DATA_SIZE=8

BASE_DIR=/lustre/demo/iorbench
mkdir -p ${BASE_DIR}

while [ ${SEQ} -le ${MAXPROCS} ]; do
NPROC=`expr ${NCT} \* ${SEQ}`
# Pick a reasonable block size, bearing in mind the size of the target file system.
# Bear in mind that the overall data size will be block size * number of processes.
# Block size must be a multiple of transfer size (-t option in command line).
BSZ=`expr ${DATA_SIZE} / ${SEQ}`"g"
# Alternatively, set to a static value and let the data size increase.
# BSZ="1g"
# BSZ="${DATA_SIZE}"
mpirun -np ${NPROC} --map-by node -hostfile ./hfile \
  ${IOREXE} -v -w -r -i 4 \
  -o ${BASE_DIR}/ior-test.file \
  -t 1m -b ${BSZ} \
  -O "lustreStripeCount=-1" | tee ${HOME}/IOR-RW-Single_File-c_${NCT}-s_${SEQ}_${DS}
SEQ=`expr ${SEQ} \* 2`
done
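
Before committing cluster time, it can help to preview what the loop will do. The sketch below repeats the same arithmetic as the script (doubling the processes per node and dividing DATA_SIZE by SEQ) and prints the plan without launching IOR. The function name plan_runs and the node count of 4 are illustrative assumptions.

```shell
#!/bin/bash
# Dry run of the loop above: print processes per node, total ranks and
# per-process block size for each step, without launching IOR.
plan_runs() {
  local nct="$1" seq="$2" maxprocs="$3" data_size="$4"
  while [ "${seq}" -le "${maxprocs}" ]; do
    echo "ppn=${seq} nproc=$(( nct * seq )) block=$(( data_size / seq ))g"
    seq=$(( seq * 2 ))
  done
}

plan_runs 4 1 8 8     # 4 client nodes, SEQ=1, MAXPROCS=8, DATA_SIZE=8
```

With these inputs every step writes the same aggregate amount, nproc * block = 32 GiB, that is DATA_SIZE GiB per client node, so the file system needs at least that much free space.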

Example: IOR Read/Write Test, File per Process, Multiple Clients

This script is similar to the previous example, but this time the -F flag is used, which tells IOR to create a separate file for each process. Additionally, the Lustre stripe count is set to 1.

#!/bin/bash

module purge
module load mpi/openmpi-x86_64

IOREXE="/lustre/demo/bin/ior"

NCT=`grep -v ^# hfile |wc -l`
DS=`date +"%F_%H:%M:%S"`
SEQ=1
MAXPROCS=8
DATA_SIZE=8
BASE_DIR=/lustre/demo/iorbench
# The output path below uses a "test" subdirectory, so create it as well.
mkdir -p ${BASE_DIR}/test

while [ ${SEQ} -le ${MAXPROCS} ]; do
NPROC=`expr ${NCT} \* ${SEQ}`
BSZ=`expr ${DATA_SIZE} / ${SEQ}`"g"
# BSZ="1g"
# BSZ="${DATA_SIZE}"
mpirun -np ${NPROC} --map-by node -hostfile ./hfile \
  ${IOREXE} -v -w -r -i 4 -F \
  -o ${BASE_DIR}/test/ior-test.file \
  -t 1m -b ${BSZ} \
  -O "lustreStripeCount=1" | tee ${HOME}/IOR-RW-Multiple_Files-Common_Dir-c_${NCT}-s_${SEQ}_${DS}
SEQ=`expr ${SEQ} \* 2`
done

Optionally, add the -u flag to create a unique directory for each file created. The full file name paths for each process can also be specified by supplying a list of files delimited by @ to the -o flag. This can be useful for DNE testing.
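
For the @-delimited form of -o, the list can be assembled from individual paths. A minimal sketch follows; the helper name join_paths and the directory names are hypothetical, standing in for directories that an administrator has placed on different MDTs.

```shell
#!/bin/bash
# Join the arguments into a single '@'-delimited string for the -o flag.
# Directory names below are placeholders, not real paths.
join_paths() {
  local IFS='@'        # "$*" joins arguments with the first character of IFS
  echo "$*"
}

join_paths /lustre/demo/mdt0/test.dat /lustre/demo/mdt1/test.dat

# Typical use inside an mpirun command line:
#   ${IOREXE} -w -r -F -o "$(join_paths /lustre/demo/mdt0/test.dat /lustre/demo/mdt1/test.dat)" ...
```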


Notes on MPI

Refer to the MDTest page (Notes on OpenMPI and Benchmark Execution sections) for shared MPI guidance on process mapping, oversubscription, binding, and examples. Key reminders:

  • Use --map-by node to distribute ranks evenly.
  • Adjust slots or add --oversubscribe (if supported) when intentionally exceeding physical cores.
  • Validate placement with a trivial command (e.g. mpirun ... hostname) before running IOR at scale.
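
Adjusting the slot count usually means editing every line of the host file. The sketch below writes a copy with a new slots= value on each host line, which is one way to produce a slots=1 variant for single-task-per-client runs. The function name set_slots is hypothetical, and the host file layout is assumed to be the "name slots=N" form used earlier on this page.

```shell
#!/bin/bash
# Print a copy of a host file with every slots= value replaced by N.
# Host lines without a slots= field get one appended; comments and
# blank lines are copied unchanged.
set_slots() {
  local src="$1" n="$2"
  awk -v n="${n}" '
    /^[[:space:]]*(#|$)/ { print; next }
    { found = 0
      for (i = 2; i <= NF; i++)
        if ($i ~ /^slots=/) { $i = "slots=" n; found = 1 }
      if (!found) $0 = $0 " slots=" n
      print }' "${src}"
}

# Typical use:
#   set_slots "$HOME/hfile" 1 > "$HOME/hfile.1slot"
```

Writing to a separate file keeps the original host file intact for the multi-process tests.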

References