MDTest

From Lustre Wiki
Revision as of 07:22, 13 October 2025 by Elliswilson (talk | contribs)

Description

MDTest is an MPI-based application for evaluating the metadata performance of a file system, and it is designed to test parallel file systems. MDTest is not a Lustre-specific benchmark and can be run on any POSIX-compliant file system, but it does require a fully installed and configured file system in order to run. For Lustre, this means that the MGS, MDS and OSS services must be installed, configured and running, and that there is a population of Lustre clients with the Lustre file system mounted.

The mdtest application runs on Lustre clients in a fully configured Lustre file system. Multiple mdtest processes are run in parallel across several nodes using MPI in order to saturate file system I/O. The program can create directory trees of arbitrary depth and can be directed to run a mixture of workloads, including file-only tests.

Purpose

MDTest measures the metadata performance of a given file system implementation and will run on any POSIX-compliant file system. The program works by creating, stat-ing and deleting a tree of directories and files in parallel across a population of machines (typically compute nodes in an HPC cluster). In the case of Lustre, the machines are Lustre clients. While mdtest can be run stand-alone to measure local file system performance, it is really intended to be run on parallel and shared file systems.

Metadata performance is a critical measure of file system capability and is increasingly relevant to parallel file system workloads in general. It is therefore important to be able to demonstrate that Lustre can match, and even exceed, the metadata requirements of applications. MDTest provides a way to define a standard test that can be used to assess the baseline performance of a file system and to compare it against other storage platforms.

Preparation

The mdtest application is distributed as source code and must be compiled for the target environment. The preferred distribution of mdtest is maintained on GitHub as part of the IOR project (hpc/ior). This version includes features such as AWS S3 support and Lustre awareness, which allows striping across multiple MDTs.

The remainder of this document uses MPI for the examples (either OpenMPI or MPICH can be used). Integration with job schedulers is not discussed – examples will call the mpirun command directly.

Download and Compile MDTest

To obtain the mdtest binary, it is recommended to use an official release tarball from the IOR project rather than a development (git) checkout.

  1. Download the latest (or required) release from the IOR Releases page:
    # Pick a version; example only – check GitHub for the latest tag
    VERSION=4.0.0
    wget https://github.com/hpc/ior/releases/download/${VERSION}/ior-${VERSION}.tar.gz
    tar -xzf ior-${VERSION}.tar.gz
    cd ior-${VERSION}
    
  2. Load, or otherwise make available, an MPI toolchain (OpenMPI or MPICH) and the required build tools (gcc, make, etc.).
  3. Configure and build (the release tarball already contains the generated Autotools build system):
    # (Optional) choose an install prefix you control
    ./configure --prefix=$HOME/ior-${VERSION}-install
    make -j $(nproc)
    # (Optional) install into the chosen prefix
    make install
    
  4. Verify binaries:
    ./src/mdtest -h
    ./src/ior -h
    
  5. Distribute the resulting mdtest (and optionally ior) binary to all client nodes, or place it on a shared filesystem accessible by every test node.

Quick functional check:

./src/mdtest -n 1 -i 1 -d /tmp/mdtest-check -u

What MDTest Measures

MDTest focuses on filesystem metadata operations executed in parallel:

  • Directory create / stat / remove
  • File create / stat / read (open/close) / remove
  • Tree (aggregate) create & remove timings
  • Optional extended operations (xattrs, hard links, etc., when enabled at build time)

Each phase is timed across all MPI ranks; results report max/min/mean and standard deviation of operations per second (or time) per operation class.

Key Options (Common Subset)

  • -n <N> Number of items (files/dirs) per MPI rank (most commonly used)
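  • -b <N> Branching factor of the directory tree (affects hierarchy breadth)
  • -N <N> Stride between neighbouring ranks for file/dir operations, so that a rank works on items created by another rank (reduces client cache effects)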
  • -i <I> Iterations (repeat entire sequence to gauge variability)
  • -d <path> Base directory under which test subtrees are created
  • -u Use a unique working directory per rank (reduces contention)
  • -W <sec> Stonewall timer: stop creation phase after this time (per iteration) to normalize run length
  • -F File-only mode (skip directory-only tests)
  • -D Directory-only mode (skip file operations)
  • -R Random order of stat operations (more realistic cache behavior)
  • -z <N> Set directory depth (levels) for tree tests
  • -L Create files only at the leaf level of the tree
  • -q Quiet (reduced output)
  • -E Run the read phase; when none of the phase selectors (-C create, -T stat, -E read, -r remove) is given, all phases run

(Invoke ./src/mdtest -h for the authoritative list—options evolve over releases.)

Designing Test Cases

Consider the following when constructing a benchmark plan:

  • Scaling model: Weak scaling (constant items per rank) vs. strong scaling (constant total items) reveals different behaviors.
  • Dataset size vs. metadata cache: Ensure total inodes created exceed MDS and client cache to avoid unrealistically high results dominated by cache hits. A rule of thumb is to create several multiples of aggregate metadata cache capacity.
  • Unique vs. shared directories: Using -u isolates rank performance; omitting it increases lock contention and stresses directory concurrency.
  • Iteration count: Use multiple (-i) iterations to observe stability; large variance may indicate load interference or imbalance.
  • Stepwise growth: Start small (e.g. 10k total objects) and scale toward target (100k–1M+) validating runtime and resource impact.
  • Depth / breadth: Adjust directory fan-out (-b) and depth (-z) to mimic application namespace patterns.
  • Stonewalling: Use -W for time-normalized comparisons across systems with different performance tiers.
  • Isolation: Run on an otherwise idle system for baseline; repeat under mixed load to understand degradation.
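
The cache-sizing rule of thumb above comes down to simple arithmetic. The following is a minimal sketch, not part of mdtest: the function name is hypothetical, and the cache capacity passed to it is a site-specific estimate that the tester must supply. It prints the smallest -n value for which the total item count reaches the chosen multiple of that capacity.

```shell
#!/bin/sh
# Hypothetical helper: smallest -n (items per rank) such that
# ranks * n >= multiple * cache_entries.
#   $1 = assumed aggregate metadata cache capacity, in entries (an estimate)
#   $2 = multiple of that capacity to exceed
#   $3 = number of MPI ranks
min_items_per_rank() {
    cache_entries=$1
    multiple=$2
    ranks=$3
    # Integer ceiling division: (a + b - 1) / b
    echo $(( (cache_entries * multiple + ranks - 1) / ranks ))
}

# Example: an assumed 2,000,000 cached entries, a 4x multiple, 48 ranks
min_items_per_rank 2000000 4 48    # prints 166667
```

The value printed can be passed to mdtest as -n; the 4x multiple is only an example of "several multiples".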

Interpreting Results

Key metrics reported:

  • Ops/sec (create/stat/read/remove) per operation class: Higher is better.
  • Tree creation / removal time: Indicates aggregate namespace scalability.
  • Max vs. Min divergence: Large gaps can signal load imbalance, network issues, or server hot spots.
  • Std Dev: High variability across ranks suggests contention or transient interference.
  • Iteration consistency: Stable means with low variance increase confidence in reported peak rates.

Diagnostic cues:

  • Fast create but slow stat/remove may indicate caching tuned for creates but lock or RPC bottlenecks elsewhere.
  • Consistently slower subset of ranks often maps to specific OSS/MDS affinity or network path differences.
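
The Max/Min and Std Dev checks described above can be applied mechanically to a results summary. The sketch below assumes summary rows of the form "operation : max min mean stddev"; that layout is an assumption and may not match every mdtest release. The thresholds (0.8 and 0.1) are arbitrary starting points, and the sample rows are made-up numbers for illustration only, not measured results.

```shell
#!/bin/sh
# Flag operations whose spread looks large. Assumes rows shaped like
#   <operation name> : <max> <min> <mean> <std dev>
# Lines that do not fit this shape (headers, separators) are skipped.
flag_variability() {
    awk -F':' '
        NF == 2 && split($2, v, " ") == 4 && v[1] + 0 > 0 && v[3] + 0 > 0 {
            name = $1
            gsub(/^ +| +$/, "", name)    # trim surrounding spaces
            ratio = v[2] / v[1]          # min/max: 1.0 means perfectly even
            cv = v[4] / v[3]             # std dev relative to the mean
            flag = (ratio < 0.8 || cv > 0.1) ? "CHECK" : "ok"
            printf "%-20s min/max=%.2f cv=%.2f %s\n", name, ratio, cv, flag
        }'
}

# Made-up sample rows, for illustration only
flag_variability <<'EOF'
File creation : 10000.0 9500.0 9800.0 150.0
File stat : 50000.0 20000.0 35000.0 9000.0
EOF
```

In practice the function would read saved mdtest output instead of the inline sample, after the parsing has been checked against the output of the installed version.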

Example Scenario Patterns (Brief)

  • Single-rank functional smoke test: mdtest -n 100 -i 1 -u -d /path
  • Metadata scaling (weak scaling): Increase ranks (1,2,4,8,...) holding -n 20000 constant.
  • Contention stress: Omit -u so all ranks operate in shared namespace.
  • Cache pressure: Large -n and a deeper tree (-z) to exceed MDS cache.
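
The weak-scaling pattern above can be laid out as a series of commands before anything is run. The following is a dry-run sketch: it only prints the mpirun command lines and does not execute them. The host file, binary path, target directory and the iteration count of 3 are placeholders chosen for illustration.

```shell
#!/bin/sh
# Dry run: print one mpirun command per step of a weak-scaling series,
# holding -n constant while the rank count doubles.
# HOSTFILE, MDTEST and TESTDIR are placeholders; adjust them for the site.
HOSTFILE=${HOSTFILE:-$HOME/hfile}
MDTEST=${MDTEST:-./src/mdtest}
TESTDIR=${TESTDIR:-/lustre/demo/mdtest-scratch}

weak_scaling_plan() {
    items=$1
    max_ranks=$2
    np=1
    while [ "$np" -le "$max_ranks" ]; do
        echo "mpirun --hostfile $HOSTFILE -np $np $MDTEST -n $items -i 3 -u -d $TESTDIR"
        np=$(( np * 2 ))
    done
}

# 20,000 items per rank at 1, 2, 4 and 8 ranks
weak_scaling_plan 20000 8
```

Dropping -u from the printed command turns the same series into the contention stress pattern.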

Benchmark Execution

  1. Log in to one of the compute nodes as the benchmark user.
  2. Ensure passwordless ssh is set up across all nodes in question (not covered here).
  3. Create a host file for the mpirun command, containing the list of Lustre clients that will be used for the benchmark. Each line in the file represents a machine and the number of slots (usually equal to the number of CPU cores). For example:
    for i in `seq -f "%03g" 1 32`; do
      echo "n"$i" slots=16"
    done > $HOME/hfile
    
    # Result:
    n001 slots=16
    n002 slots=16
    n003 slots=16
    n004 slots=16
    ...
    
    • The first column of the host file contains the name of the nodes. This can also be an IP address if the /etc/hosts file or DNS is not set up.
    • The second column is used to represent the number of CPU cores.
  4. Run a quick test using mpirun to launch the benchmark and verify that the environment is set up correctly. For example:
    mpirun --hostfile $HOME/hfile --map-by node -np `cat $HOME/hfile|wc -l` hostname
    

    This should return the hostnames of all the machines that are in the test environment. The results are returned unsorted, in order of completion.

    Note: If --map-by node does not work, and the output contains only one or a very small number of unique hostnames, then set slots=1 for each host in the host file. Otherwise, mpirun will fill up the slots on the first node before launching processes on subsequent nodes.

    This may be desirable for multi-process tests but not for the single task per client test. Do not set the slot count higher than the number of cores present. If over-subscription is required, set the -np flag to greater than the number of physical cores. This informs OpenMPI that the task will be oversubscribed and will run in a mode that yields the processor to peers.

    Refer to: OpenMPI FAQ -- Oversubscribing Nodes, and also the notes on OpenMPI at the end of this document.

  5. Use mpirun to launch the mdtest benchmark. For example:
    mpirun --hostfile $HOME/hfile -np 48 ./mdtest -n 20840 -i 10 -u -d /lustre/demo/mdtest-scratch
    

    In the above example, 48 processes (-np 48) will be distributed across the nodes listed in the host file (--hostfile $HOME/hfile), with each process creating 20,840 directories and files (-n 20840) for a total of 1,000,320 files/directories (48 × 20,840). The test will conduct 10 iterations (-i 10) and use /lustre/demo/mdtest-scratch as the target base directory (-d <path>). The -u flag tells the program to assign a unique working directory per task.

    When first running the test on a new system, size the test for a total of 10,000 files/directories. This will give you an idea of how your system handles the test. Gradually increase the number of files/directories as you gain confidence in the results, up to a maximum of 1,000,000 files/directories, or higher if there is a specific requirement in excess of this value.

    Start with a small number of processes and increase with each run using a doubling sequence starting at one (1, 2, 4, 8, 16), keeping the total number of files created as close to your target files/directories as possible. This means that as the process count increases, the value of the -n parameter should decrease.
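
The doubling sequence described in step 5 can be tabulated ahead of time. This is a minimal sketch with a hypothetical function name: for a fixed target total it prints the -n value for each process count, using integer division, along with the total that results.

```shell
#!/bin/sh
# For a fixed target total, print the -n value and the resulting total
# for each process count in a doubling sequence (1, 2, 4, ...).
strong_scaling_plan() {
    target=$1
    max_np=$2
    np=1
    while [ "$np" -le "$max_np" ]; do
        per_rank=$(( target / np ))      # integer division rounds down
        echo "np=$np n=$per_rank total=$(( per_rank * np ))"
        np=$(( np * 2 ))
    done
}

# Target of 1,000,000 items, from 1 to 16 processes
strong_scaling_plan 1000000 16
```

When the process count does not divide the target evenly, the total falls slightly short: 48 processes give 20,833 items each, or 999,984 in total, which is why the example in step 5 rounds -n up to 20,840.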

Notes on OpenMPI

  • If you encounter issues with process mapping or oversubscription, consult the documentation for the MPI implementation in use (e.g., OpenMPI or MPICH).
  • Common flags that may be useful include:
    • --map-by: Controls how processes are mapped to nodes.
    • --bind-to: Controls how processes are bound to CPU cores.
    • --oversubscribe: Allows more processes than slots, useful for testing limits.
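
To confirm that a mapping option has the intended effect, the output of the hostname test can be tallied per node. The sketch below is a small helper with a hypothetical name; it reads one hostname per line and prints how many processes landed on each host. The input shown is a stand-in for real mpirun output.

```shell
#!/bin/sh
# Count how many processes landed on each host, reading one hostname per
# line on standard input (for example, the output of "mpirun ... hostname").
tally_hosts() {
    sort | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# Stand-in input; in practice pipe in the real output, e.g.
#   mpirun --hostfile $HOME/hfile --map-by node -np 4 hostname | tally_hosts
printf 'n002\nn001\nn001\nn003\n' | tally_hosts
```

A count of one per host indicates one process per node; a single host with a large count suggests that mpirun filled the slots on the first node before moving on.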
