MDTest

From Lustre Wiki

Latest revision as of 07:22, 13 October 2025

Description

MDTest is an MPI-based application for evaluating the metadata performance of a file system, and is designed for testing parallel file systems. It is not a Lustre-specific benchmark and can be run on any POSIX-compliant file system, but it does require a fully installed and configured file system in order to run. For Lustre, this means the MGS, MDS and OSS services must be installed, configured and running, and a population of Lustre clients must be running with the Lustre file system mounted.

The mdtest application runs on Lustre clients in a fully configured Lustre file system. Multiple mdtest processes are run in parallel across several nodes using MPI in order to saturate the metadata services of the file system. The program can create directory trees of arbitrary depth and can be directed to create a mixture of workloads, including file-only tests.

Purpose

MDTest measures the metadata performance of a given file system implementation and will run on any POSIX-compliant file system. The program works by creating, stat-ing and deleting a tree of directories and files in parallel across a population of machines (typically compute nodes in an HPC cluster). In the case of Lustre, the machines are Lustre clients. While mdtest can be run stand-alone to measure local file system performance, it is really intended to be run on parallel and shared file systems.

Metadata performance is a critical measurement of file system capability and is increasingly relevant to parallel file system workloads in general. It is therefore important to be able to demonstrate the ability of Lustre to match and even exceed application requirements for file systems. MDTest provides a way to define a standard test that can be used to assess baseline performance of a file system, and provide a comparative measure against storage platforms.

Preparation

The mdtest application is distributed as source code and must be compiled for use on the target environment. The preferred distribution of mdtest is maintained as part of the IOR project on GitHub (https://github.com/hpc/ior). This version includes features such as Lustre-awareness, which allows striping across multiple MDTs, and AWS S3 support.

The remainder of this document uses MPI for the examples (either OpenMPI or MPICH can be used). Integration with job schedulers is not discussed – examples will call the mpirun command directly.

Download and Compile MDTest

To obtain the mdtest binary, it is recommended to use an official release tarball from the IOR project rather than a development (git) checkout.

  1. Download the latest (or required) release from the IOR Releases page:
    # Pick a version; example only – check GitHub for the latest tag
    VERSION=4.0.0
    wget https://github.com/hpc/ior/releases/download/${VERSION}/ior-${VERSION}.tar.gz
    tar -xzf ior-${VERSION}.tar.gz
    cd ior-${VERSION}
    
  2. Ensure that an MPI toolchain (OpenMPI or MPICH) and the required build tools (gcc, make, etc.) are available, for example by loading the relevant environment module.
  3. Configure and build (the release tarball already contains the generated configure script):
    # (Optional) --prefix selects an install location you control
    ./configure --prefix=$HOME/ior-${VERSION}-install
    make -j $(nproc)
    # (Optional) install into the chosen prefix
    make install
    
  4. Verify binaries:
    ./src/mdtest -h
    ./src/ior -h
    
  5. Distribute the resulting mdtest (and optionally ior) binary to all client nodes, or place it on a shared filesystem accessible by every test node.
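
Before starting the build in step 3, it can help to confirm that the tools named in step 2 are actually on the PATH. The following is a minimal sketch; the helper name missing_tools is hypothetical, and the tool list (mpicc as the MPI compiler wrapper, plus make, gcc and tar) is an assumption that may need adjusting for the local MPI installation.

```shell
#!/bin/sh
# Print the name of every listed tool that cannot be found on PATH,
# one per line. Prints nothing when all tools are present.
missing_tools() (
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
)

# Assumed tool names: mpicc is the usual MPI C compiler wrapper.
missing=$(missing_tools mpicc make gcc tar)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Missing build prerequisites:" $missing
else
    echo "All build prerequisites found"
fi
```

If mpicc is reported missing on a system that uses environment modules, loading the MPI module and re-running the check is usually the next step.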

Quick functional check:

./src/mdtest -n 1 -i 1 -d /tmp/mdtest-check -u

What MDTest Measures

MDTest focuses on filesystem metadata operations executed in parallel:

  • Directory create / stat / remove
  • File create / stat / read (open/close) / remove
  • Tree (aggregate) create & remove timings
  • Optional extended operations (xattrs, hard links, etc., when enabled at build time)

Each phase is timed across all MPI ranks; results report max/min/mean and standard deviation of operations per second (or time) per operation class.

Key Options (Common Subset)

  • -n <N> Number of items (files/dirs) that each MPI rank creates, stats, reads and removes (most commonly used)
  • -b <N> Branching factor of the directory tree (affects hierarchy breadth)
  • -i <I> Iterations (repeat entire sequence to gauge variability)
  • -d <path> Base directory under which test subtrees are created
  • -u Use a unique working directory per rank (reduces contention)
  • -W <sec> Stonewall timer: stop the creation phase after this time to normalize run length (available in recent IOR releases)
  • -F File-only mode (skip directory tests)
  • -D Directory-only mode (skip file tests)
  • -R Stat files in random order (more realistic cache behavior)
  • -z <N> Depth of the directory tree
  • -L Create files only at the leaf level of the tree
  • -N <N> Stride between neighboring ranks for file/directory operations, so that a rank operates on items created by another rank (reduces client cache effects)
  • -v Increase verbosity (each occurrence raises the level)
  • -E Only perform the read phase; -C, -T and -r similarly restrict a run to the create, stat and remove phases

(Invoke ./src/mdtest -h for the authoritative list—options evolve over releases.)

Designing Test Cases

Consider the following when constructing a benchmark plan:

  • Scaling model: Weak scaling (constant items per rank) vs. strong scaling (constant total items) reveals different behaviors.
  • Dataset size vs. metadata cache: Ensure total inodes created exceed MDS and client cache to avoid unrealistically high results dominated by cache hits. A rule of thumb is to create several multiples of aggregate metadata cache capacity.
  • Unique vs. shared directories: Using -u isolates rank performance; omitting it increases lock contention and stresses directory concurrency.
  • Iteration count: Use multiple (-i) iterations to observe stability; large variance may indicate load interference or imbalance.
  • Stepwise growth: Start small (e.g. 10k total objects) and scale toward target (100k–1M+) validating runtime and resource impact.
  • Depth / breadth: Adjust directory fan-out (-b) and depth (-z) to mimic application namespace patterns.
  • Stonewalling: Use -W for time-normalized comparisons across systems with different performance tiers.
  • Isolation: Run on an otherwise idle system for baseline; repeat under mixed load to understand degradation.

Interpreting Results

Key metrics reported:

  • Ops/sec (create/stat/read/remove) per operation class: Higher is better.
  • Tree creation / removal time: Indicates aggregate namespace scalability.
  • Max vs. Min divergence: Large gaps can signal load imbalance, network issues, or server hot spots.
  • Std Dev: High variability across ranks suggests contention or transient interference.
  • Iteration consistency: Stable means with low variance increase confidence in reported peak rates.

Diagnostic cues:

  • Fast create but slow stat/remove may indicate caching tuned for creates but lock or RPC bottlenecks elsewhere.
  • Consistently slower subset of ranks often maps to specific OSS/MDS affinity or network path differences.
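
The Max vs. Min divergence described above can be computed directly from a saved summary table. The sketch below assumes the table layout "Operation : Max Min Mean StdDev" with one operation per row; the exact layout differs between mdtest releases, so the column order should be checked against real output first. The function name summary_spread is hypothetical, and the sample rows are made-up values for illustration only, not measured results.

```shell
#!/bin/sh
# Read an mdtest summary table on standard input and print, for each
# operation row, the Max-Min spread as a percentage of the mean.
summary_spread() {
    awk -F':' '
        NF == 2 {
            # Assumed column order: v[1]=Max v[2]=Min v[3]=Mean v[4]=StdDev
            n = split($2, v, " ")
            if (n != 4 || v[3] + 0 <= 0) next   # skip non-data and zero-mean rows
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim padding around the name
            printf "%s: %.1f%%\n", name, (v[1] - v[2]) / v[3] * 100
        }'
}

# Made-up sample rows, for illustration only.
summary_spread <<'EOF'
   Operation                      Max            Min          Mean        Std Dev
   ---------                     ---            ---          ----        -------
   File creation     :       1200.000       1000.000       1100.000         80.000
   File stat         :       5000.000       4900.000       4950.000         40.000
EOF
```

A spread of a few percent suggests stable iterations; a much larger figure is a prompt to look for the imbalance or interference causes listed above.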

Example Scenario Patterns (Brief)

  • Single-rank functional smoke test: mdtest -n 100 -i 1 -u -d /path
  • Metadata scaling (weak scaling): Increase ranks (1,2,4,8,...) holding -n 20000 constant.
  • Contention stress: Omit -u so all ranks operate in shared namespace.
  • Cache pressure: Large -n and a deeper tree (-z) to exceed MDS cache.
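
The weak-scaling pattern above can be planned as a dry run before any benchmark time is spent. This sketch only prints the commands; it does not launch anything. The function name weak_scaling_plan, the iteration count of 3 and the scratch path are illustrative assumptions.

```shell
#!/bin/sh
# Print (do not run) one mpirun command per rank count for a weak-scaling
# sweep: items per rank stay fixed, so the total grows with the rank count.
# Usage: weak_scaling_plan <items-per-rank> <np> [<np> ...]
weak_scaling_plan() (
    per_rank=$1
    shift
    for np in "$@"; do
        total=$((np * per_rank))
        # \$HOME is escaped so it is expanded when the printed command is run
        echo "np=$np total=$total: mpirun --hostfile \$HOME/hfile --map-by node -np $np ./mdtest -n $per_rank -i 3 -u -d /lustre/demo/mdtest-scratch"
    done
)

weak_scaling_plan 20000 1 2 4 8 16
```

Reviewing the printed totals first makes it easy to spot a sweep that would create far more objects than intended.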

Benchmark Execution

  1. Log in to one of the compute nodes as the benchmark user.
  2. Ensure passwordless SSH is set up across all of the nodes in question (not covered here).
  3. Create a host file for the mpirun command, containing the list of Lustre clients that will be used for the benchmark. Each line in the file represents a machine and the number of slots (usually equal to the number of CPU cores). For example:
    for i in `seq -f "%03g" 1 32`; do
      echo "n"$i" slots=16"
    done > $HOME/hfile
    
    # Result:
    n001 slots=16
    n002 slots=16
    n003 slots=16
    n004 slots=16
    ...
    
    • The first column of the host file contains the name of the nodes. This can also be an IP address if the /etc/hosts file or DNS is not set up.
    • The second column gives the number of slots (processes) that may be started on the node, usually the number of CPU cores.
  4. Run a quick test using mpirun to launch the benchmark and verify that the environment is set up correctly. For example:
    mpirun --hostfile $HOME/hfile --map-by node -np `cat $HOME/hfile|wc -l` hostname
    

    This should return the hostnames of all the machines that are in the test environment. The results are returned unsorted, in order of completion.

    Note: If the --map-by node option does not work, and the output contains only one or a very small number of unique hostnames repeated, then set slots=1 for each host in the host file. Otherwise, mpirun will fill up the slots on the first node before launching processes on subsequent nodes.

    Filling one node first may be desirable for multi-process tests but not for the single task per client test. Do not set the slot count higher than the number of cores present. If over-subscription is required, set the -np flag to a value greater than the number of physical cores; depending on the OpenMPI version, the --oversubscribe option may also be needed. OpenMPI then treats the job as oversubscribed and runs in a mode that yields the processor to peers.

    Refer to: OpenMPI FAQ -- Oversubscribing Nodes (https://www.open-mpi.org/faq/?category=all#oversubscribing), and also the notes on OpenMPI at the end of this document.

  5. Use mpirun to launch the mdtest benchmark. For example:
    mpirun --hostfile $HOME/hfile -np 48 ./mdtest -n 20840 -i 10 -u -d /lustre/demo/mdtest-scratch
    

    In the above example, 48 processes (-np 48) are distributed across the nodes listed in the host file (--hostfile $HOME/hfile), with each process creating 20,840 directories and files (-n 20840), for a total of 48 × 20,840 = 1,000,320 files/directories. The test conducts 10 iterations (-i 10) and uses /lustre/demo/mdtest-scratch as the target base directory (-d <path>). The -u flag tells the program to assign a unique working directory per task.

    When first running the test on a new system, size the test for 10,000 files/directories. This gives an idea of how the system will handle the test. Gradually increase the number of files/directories as confidence in the results grows, up to a maximum of 1,000,000 files/directories, or higher if there is a specific requirement in excess of this value.

    Start with a small number of processes (MPI ranks) and increase with each run using a doubling sequence starting at one (1, 2, 4, 8, 16), keeping the total number of files created as close to the target number of files/directories as possible. This means that as the process count increases, the value of the -n parameter should decrease.
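
The relationship between the process count and -n described in step 5 is simple integer arithmetic, sketched below. The function name strong_scaling_plan is hypothetical. Integer division rounds down, so the real total can fall slightly short of the target; the example in step 5 rounds up instead (20,840 per process rather than 20,833).

```shell
#!/bin/sh
# For a fixed target number of items, print the per-rank -n value for each
# requested process count, together with the total actually created.
# Usage: strong_scaling_plan <target-total> <np> [<np> ...]
strong_scaling_plan() (
    target=$1
    shift
    for np in "$@"; do
        per_rank=$((target / np))          # integer division rounds down
        echo "-np $np -n $per_rank  (total $((np * per_rank)))"
    done
)

strong_scaling_plan 1000000 1 2 4 8 16 48
```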

Notes on OpenMPI

  • If you encounter issues with process mapping or oversubscription, consult the documentation for the MPI implementation in use (e.g., OpenMPI or MPICH).
  • Common flags that may be useful include:
    • --map-by: Controls how processes are mapped to nodes.
    • --bind-to: Controls how processes are bound to CPU cores.
    • --oversubscribe: Allows more processes than slots, useful for testing limits.
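
To see why process mapping matters, the two placement policies mentioned in this document (filling the slots of one node first versus --map-by node) can be modelled without running MPI at all. The sketch below is a simplified model only: the function name place_ranks is hypothetical, it reads a host file in the "name slots=N" form used earlier, and real placement can differ between MPI implementations and versions.

```shell
#!/bin/sh
# Model how many of <np> processes land on each host under two policies.
# Reads host file lines ("name slots=N") on standard input.
# Usage: place_ranks <slot|node> <np>
place_ranks() {
    awk -v policy="$1" -v np="$2" '
        {
            host[NR] = $1
            sub(/^slots=/, "", $2)
            slots[NR] = $2 + 0
            count[NR] = 0
        }
        END {
            left = np + 0
            if (policy == "slot") {
                # fill every slot on a host before moving to the next host
                for (i = 1; i <= NR && left > 0; i++) {
                    take = (left < slots[i]) ? left : slots[i]
                    count[i] = take
                    left -= take
                }
            } else {
                # round-robin: one process per host per pass while slots remain
                while (left > 0) {
                    placed = 0
                    for (i = 1; i <= NR && left > 0; i++)
                        if (count[i] < slots[i]) { count[i]++; left--; placed = 1 }
                    if (!placed) break   # every slot is used; do not oversubscribe
                }
            }
            for (i = 1; i <= NR; i++) print host[i], count[i]
        }'
}

hosts='n001 slots=16
n002 slots=16
n003 slots=16
n004 slots=16'

echo "fill slots first, 24 processes:"
printf '%s\n' "$hosts" | place_ranks slot 24
echo "map by node, 24 processes:"
printf '%s\n' "$hosts" | place_ranks node 24
```

With four 16-slot hosts and 24 processes, the slot-filling model leaves two hosts idle, while the by-node model places six processes on each host. The real behaviour can be confirmed by running hostname under mpirun, as in step 4 of the benchmark procedure.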

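References

  • https://github.com/hpc/ior

Further Reading

  • High-level discussion of mdtest methodology (external blog): https://www.glennklockwood.com/garden/mdtest
  • MPI implementation documentation (OpenMPI / MPICH) for process mapping and oversubscription guidance.
  • https://www.open-mpi.org/faq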