MDTest
Revision author: Elliswilson
Latest revision as of 07:22, 13 October 2025
Description
MDTest is an MPI-based application for evaluating the metadata performance of a file system and is designed to test parallel file systems. MDTest is not a Lustre-specific benchmark and can be run on any POSIX-compliant file system, but it does require a fully installed and configured file system in order to run. For Lustre, this means that the MGS, MDS and OSS services must be installed, configured and running, and that a population of Lustre clients is running with the Lustre file system mounted.
The mdtest application runs on Lustre clients in a fully configured Lustre file system. Multiple mdtest processes are run in parallel across several nodes using MPI in order to saturate file system I/O. The program can create directory trees of arbitrary depth and can be directed to create a mixture of workloads, including file-only tests.
Purpose
MDTest measures the metadata performance of a given file system implementation and will run on any POSIX-compliant file system. The program works by creating, stat-ing and deleting a tree of directories and files in parallel across a population of machines (typically compute nodes in an HPC cluster). In the case of Lustre, the machines are Lustre clients. While mdtest can be run stand-alone to measure local file system performance, it is really intended to be run on parallel and shared file systems.
Metadata performance is a critical measurement of file system capability and is increasingly relevant to parallel file system workloads in general. It is therefore important to be able to demonstrate the ability of Lustre to match and even exceed application requirements for file systems. MDTest provides a way to define a standard test that can be used to assess baseline performance of a file system, and provide a comparative measure against storage platforms.
Preparation
The mdtest application is distributed as source code and must be compiled for use on the target environment. The preferred distribution of mdtest is maintained on GitHub as part of the IOR project (https://github.com/hpc/ior). This version includes features such as AWS S3 support and Lustre awareness, which allows striping across multiple MDTs.
The remainder of this document uses MPI for the examples (either OpenMPI or MPICH can be used). Integration with job schedulers is not discussed – examples will call the mpirun command directly.
Download and Compile MDTest
To obtain the mdtest binary, it is recommended to use an official release tarball from the IOR project rather than a development (git) checkout.
- Download the latest (or required) release from the IOR Releases page:
    # Pick a version; example only, check GitHub for the latest tag
    VERSION=4.0.0
    wget https://github.com/hpc/ior/releases/download/${VERSION}/ior-${VERSION}.tar.gz
    tar -xzf ior-${VERSION}.tar.gz
    cd ior-${VERSION}
- Load, or otherwise make available, an MPI toolchain (OpenMPI or MPICH) and the required build tools (gcc, make, etc.).
- Configure and build (the Autotools release tarball already contains the build system):
    # --prefix is optional; it selects an install location you control
    ./configure --prefix=$HOME/ior-${VERSION}-install
    make -j $(nproc)
    # (Optional) install to the prefix chosen above
    make install
- Verify binaries:
    ./src/mdtest -h
    ./src/ior -h
- Distribute the resulting mdtest (and optionally ior) binary to all client nodes, or place it on a shared filesystem accessible by every test node.
Quick functional check:
./src/mdtest -n 1 -i 1 -d /tmp/mdtest-check -u
What MDTest Measures
MDTest focuses on filesystem metadata operations executed in parallel:
- Directory create / stat / remove
- File create / stat / read (open/close) / remove
- Tree (aggregate) create & remove timings
- Optional extended operations (xattrs, hard links, etc., when enabled at build time)
Each phase is timed across all MPI ranks; results report max/min/mean and standard deviation of operations per second (or time) per operation class.
Key Options (Common Subset)
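- -n <N> Number of items (files/dirs) that each MPI rank creates, stats, reads and removes (most commonly used)
- -N <N> Stride between neighbouring ranks for file/dir operations (0 = local; a non-zero stride makes a rank operate on items created by another rank, which helps avoid client cache effects)
- -i <I> Iterations (repeat entire sequence to gauge variability)
- -d <path> Base directory under which test subtrees are created
- -u Use a unique working directory per rank (reduces contention)
- -t Time the unique working directory overhead (takes no argument; this is not the stonewall timer)
- -W <sec> Stonewall timer: stop the creation phase after this time (per iteration) to normalize run length
- -F File-only mode (skip directory-only tests)
- -D Directory-only mode (skip file operations)
- -R Random order of stat operations (more realistic cache behavior)
- -L Create files only at the leaf level of the tree (takes no argument)
- -z <N> Depth of the hierarchical directory tree
- -b <N> Branching factor (fan-out) of the directory tree
- -q Quiet (reduced output); confirm with -h that your release supports it
- -r Remove-only: only remove files or directories left behind by previous runs
- -E Read-only: only run the read phase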
(Invoke ./src/mdtest -h for the authoritative list; options evolve over releases.)
Designing Test Cases
Consider the following when constructing a benchmark plan:
- Scaling model: Weak scaling (constant items per rank) vs. strong scaling (constant total items) reveals different behaviors.
- Dataset size vs. metadata cache: Ensure total inodes created exceed MDS and client cache to avoid unrealistically high results dominated by cache hits. A rule of thumb is to create several multiples of aggregate metadata cache capacity.
- Unique vs. shared directories: Using -u isolates rank performance; omitting it increases lock contention and stresses directory concurrency.
- Iteration count: Use multiple (-i) iterations to observe stability; large variance may indicate load interference or imbalance.
- Stepwise growth: Start small (e.g. 10k total objects) and scale toward the target (100k–1M+), validating runtime and resource impact at each step.
- Depth / breadth: Adjust directory fan-out (-b) and depth (-z) to mimic application namespace patterns.
- Stonewalling: Use -W for time-normalized comparisons across systems with different performance tiers.
- Isolation: Run on an otherwise idle system for baseline; repeat under mixed load to understand degradation.
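The two scaling models differ only in how the per-rank item count is chosen. A minimal sketch of that arithmetic follows; the helper names (items_per_rank, total_items) and the rank counts are assumptions made for illustration and are not part of mdtest.

```shell
#!/bin/sh
# Hypothetical helper: per-rank item count for strong scaling,
# i.e. a fixed total split evenly across ranks (integer division).
# Arguments: total_items ranks
items_per_rank() {
    echo $(( $1 / $2 ))
}

# Hypothetical helper: total item count for weak scaling,
# i.e. a fixed per-rank count multiplied by the number of ranks.
# Arguments: items_per_rank ranks
total_items() {
    echo $(( $1 * $2 ))
}

# Strong scaling: hold roughly 1,000,000 items in total.
for ranks in 1 2 4 8 16; do
    echo "strong ranks=$ranks n=$(items_per_rank 1000000 "$ranks")"
done

# Weak scaling: hold 20000 items per rank.
for ranks in 1 2 4 8 16; do
    echo "weak ranks=$ranks total=$(total_items 20000 "$ranks")"
done
```

The value printed after n= is what would be passed to -n for that rank count; because of integer division the strong-scaling total can fall slightly short of the target.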
Interpreting Results
Key metrics reported:
- Ops/sec (create/stat/read/remove) per operation class: Higher is better.
- Tree creation / removal time: Indicates aggregate namespace scalability.
- Max vs. Min divergence: Large gaps can signal load imbalance, network issues, or server hot spots.
- Std Dev: High variability across ranks suggests contention or transient interference.
- Iteration consistency: Stable means with low variance increase confidence in reported peak rates.
Diagnostic cues:
- Fast create but slow stat/remove may indicate caching tuned for creates but lock or RPC bottlenecks elsewhere.
- Consistently slower subset of ranks often maps to specific OSS/MDS affinity or network path differences.
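One way to put a number on the max/min divergence described above is to express the gap as a percentage of the maximum. The sketch below assumes the max and min rates for one operation have been copied by hand from the mdtest summary; the function name and the sample numbers are illustrative only, and no particular threshold is implied.

```shell
# Hypothetical helper: gap between max and min rate as a percentage of max.
# Arguments: max_rate min_rate (ops/sec, may be fractional).
rate_spread_pct() {
    awk -v max="$1" -v min="$2" 'BEGIN {
        if (max <= 0) { print "n/a"; exit }
        printf "%.1f\n", (max - min) / max * 100
    }'
}

# Illustrative numbers, not measured results.
rate_spread_pct 50000 45000
rate_spread_pct 50000 20000
```

A small percentage suggests a consistent run; a large one is a cue to look for the imbalance or interference causes listed above.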
Example Scenario Patterns (Brief)
- Single-rank functional smoke test: mdtest -n 100 -i 1 -u -d /path
- Metadata scaling (weak scaling): Increase ranks (1,2,4,8,...) holding -n 20000 constant.
- Contention stress: Omit -u so all ranks operate in a shared namespace.
- Cache pressure: Large -n and deeper -z to exceed MDS cache.
Benchmark Execution
- Log in to one of the compute nodes as the benchmark user.
- Ensure passwordless SSH is set up across all nodes in question (not covered here).
- Create a host file for the mpirun command, containing the list of Lustre clients that will be used for the benchmark. Each line in the file represents a machine and the number of slots (usually equal to the number of CPU cores). For example:
    for i in `seq -f "%03g" 1 32`; do
      echo "n${i} slots=16"
    done > $HOME/hfile
    # Result:
    n001 slots=16
    n002 slots=16
    n003 slots=16
    n004 slots=16
    ...
  - The first column of the host file contains the name of the nodes. This can also be an IP address if the /etc/hosts file or DNS is not set up.
  - The second column is used to represent the number of CPU cores.
- Run a quick test using mpirun to launch the benchmark and verify that the environment is set up correctly. For example:
    mpirun --hostfile $HOME/hfile --map-by node -np `cat $HOME/hfile|wc -l` hostname
  This should return the hostnames of all the machines that are in the test environment. The results are returned unsorted, in order of completion.
  Note: If --map-by node does not work, and the output has only one or a very small number of unique hostnames repeated in the output, then set slots=1 for each host in the host file. Otherwise, mpirun will fill up the slots on the first node before launching processes on subsequent nodes.
  This may be desirable for multi-process tests but not for the single task per client test. Do not set the slot count higher than the number of cores present. If over-subscription is required, set the -np flag to greater than the number of physical cores. This informs OpenMPI that the task will be oversubscribed and will run in a mode that yields the processor to peers.
  Refer to: OpenMPI FAQ -- Oversubscribing Nodes (https://www.open-mpi.org/faq/?category=all#oversubscribing), and also the notes on OpenMPI at the end of this document.
- Use mpirun to launch the mdtest benchmark. For example:
    mpirun --hostfile $HOME/hfile -np 48 ./mdtest -n 20840 -i 10 -u -d /lustre/demo/mdtest-scratch
  In the above example, 48 processes (-np 48) will be distributed across the nodes listed in the host file (--hostfile hfile), with each process creating 20,840 directories and files (-n 20840) for a total of 48 × 20,840 = 1,000,320 files/directories. The test will conduct 10 iterations (-i 10) and use /lustre/demo/mdtest-scratch as the target base directory (-d <path>). The -u flag tells the program to assign a unique working directory per task.
  When first running the test on a new system, size the test for 10,000 files/directories. This will give you an idea of how your system will handle the test. Gradually increase the number of files/directories as you become more comfortable with the results, up to a maximum of 1,000,000 files/directories, or higher if there is a specific requirement in excess of this value.
  Start with a small number of processes and increase with each run using a doubling sequence starting at one (1, 2, 4, 8, 16), keeping the total number of files created as close to your target files/directories as possible. This means that as the process count increases, the value of the -n parameter should decrease.
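The doubling sequence can be scripted as a dry run that only prints the commands, so the per-process -n value can be reviewed before anything touches the file system. This sketch reuses the host file and target directory from the example above; the function name sweep_commands and the 1,000,000 target are assumptions for illustration.

```shell
# Print (do not execute) one mpirun command per process count,
# lowering -n as the process count doubles so the total stays near the target.
# Argument: target total number of items.
sweep_commands() {
    target=$1
    np=1
    while [ "$np" -le 16 ]; do
        n=$(( target / np ))
        # \$HOME is escaped so it is printed literally and expanded only at run time.
        echo "mpirun --hostfile \$HOME/hfile -np $np ./mdtest -n $n -i 10 -u -d /lustre/demo/mdtest-scratch"
        np=$(( np * 2 ))
    done
}

sweep_commands 1000000
```

Once the printed commands look right, they can be run one at a time, checking the results of each before moving to the next process count.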
Notes on OpenMPI
- If you encounter issues with process mapping or oversubscription, consult the documentation for the MPI implementation in use (e.g., OpenMPI or MPICH).
- Common flags that may be useful include:
  - --map-by: Controls how processes are mapped to nodes.
  - --bind-to: Controls how processes are bound to CPU cores.
  - --oversubscribe: Allows more processes than slots, useful for testing limits.
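To check whether a mapping option actually spread processes across nodes, the output of the mpirun hostname test from Benchmark Execution can be summarized per host. A minimal sketch, assuming the hostnames arrive one per line on standard input; the function name is hypothetical and the sample input stands in for real mpirun output.

```shell
# Count how many processes reported each hostname (reads stdin).
# Prints one "hostname count" pair per line, sorted by hostname.
procs_per_host() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Sample input standing in for the output of:
#   mpirun --hostfile $HOME/hfile --map-by node -np 4 hostname
printf 'n002\nn001\nn001\nn003\n' | procs_per_host
```

If a single host accounts for most of the processes, the slots were filled on the first node before moving on, which is the situation the note in Benchmark Execution describes.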
References
- https://github.com/hpc/ior
Further Reading
- High-level discussion of mdtest methodology (external blog): https://www.glennklockwood.com/garden/mdtest
- MPI implementation documentation (OpenMPI / MPICH) for process mapping and oversubscription guidance.