MDTest
Latest revision as of 10:32, 22 June 2018
Description
MDtest is an MPI-based application for evaluating the metadata performance of a file system and has been designed to test parallel file systems. MDTest is not a Lustre-specific benchmark and can be run on any POSIX-compliant file system, but it does require a fully installed and configured file system implementation in order to run. For Lustre, this means the MGS, MDS and OSS services must be installed, configured and running, and that there is a population of Lustre clients running with the Lustre file system mounted.
The mdtest application runs on Lustre clients in a fully configured Lustre file system. Multiple mdtest processes are run in parallel across several nodes using MPI in order to saturate file system I/O. The program can create directory trees of arbitrary depth and can be directed to create a mixture of workloads, including file-only tests.
Purpose
MDTest measures the metadata performance of a given file system implementation and will run on any POSIX-compliant file system. The program works by creating, stat-ing and deleting a tree of directories and files in parallel across a population of machines (typically compute nodes in an HPC cluster). In the case of Lustre, the machines are Lustre clients. While mdtest can be run stand-alone to measure local file system performance, it is really intended to be run on parallel and shared file systems.
Metadata performance is a critical measurement of file system capability and is increasingly relevant to parallel file system workloads in general. It is therefore important to be able to demonstrate the ability of Lustre to match and even exceed application requirements for file systems. MDTest provides a way to define a standard test that can be used to assess baseline performance of a file system, and provide a comparative measure against storage platforms.
Preparation
The mdtest application is distributed as source code and must be compiled for use on the target environment. The preferred distribution of mdtest is available on GitHub as part of the IOR project (https://github.com/IOR-LANL/ior). LANL has added features not available in the LLNL version, the most notable of which are Lustre awareness, which allows striping across multiple MDTs, and AWS S3 support. There is also a third option, in Lustre JIRA ticket LU-56 (https://jira.whamcloud.com/browse/LU-56), that adds the ability to run against multiple mountpoints on a single client.
The remainder of this document uses OpenMPI for the examples. Integration with job schedulers is not discussed; the examples call the mpirun command directly.
Download and Compile MDTest
To compile the mdtest binary, run the following steps on a suitable machine:
- Install the pre-requisite development tools. On RHEL or CentOS systems, this can be accomplished by running the following command:
sudo yum -y install openmpi-devel git
- Download the mdtest source:
git clone https://github.com/IOR-LANL/ior.git
- Compile the software:
cd ior
module load mpi/openmpi-x86_64
make clean && make
- Quickly verify that the program runs:
./src/mdtest
For example:
[mduser@ct7-c1 mdtest]$ ./src/mdtest
-- started at 06/28/2017 03:07:55 --

mdtest-1.9.4-rc1 was launched with 1 total task(s) on 1 node(s)
Command line used: ./src/mdtest
Path: /lustre/demo/mdtest
FS: 58.0 GiB   Used FS: 1.1%   Inodes: 5.0 Mi   Used Inodes: 0.0%

1 tasks, 0 files/directories

SUMMARY: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation:          0.000          0.000          0.000          0.000
   Directory stat    :          0.000          0.000          0.000          0.000
   Directory removal :          0.000          0.000          0.000          0.000
   File creation     :          0.000          0.000          0.000          0.000
   File stat         :          0.000          0.000          0.000          0.000
   File read         :          0.000          0.000          0.000          0.000
   File removal      :          0.000          0.000          0.000          0.000
   Tree creation     :        461.255        461.255        461.255          0.000
   Tree removal      :        497.512        497.512        497.512          0.000

-- finished at 06/28/2017 03:07:55 --
- Copy the mdtest command onto all of the Lustre client nodes that will be used to run the benchmark. Alternatively, copy it onto the Lustre file system itself so that the application is available on all of the nodes automatically.
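When comparing several runs, it can help to pull a single column out of the SUMMARY table shown in the sample output. The following is a minimal sketch, assuming the row layout of that sample (operation name, a colon, then Max, Min, Mean and Std Dev); other mdtest versions may print a different table. The function name mdtest_means is hypothetical and is not part of mdtest.

```shell
#!/bin/sh
# Sketch: print "operation<TAB>mean" for each row of an mdtest SUMMARY table.
# Assumes rows of the form seen in the sample output:
#   <operation name> : <max> <min> <mean> <std dev>
# mdtest_means is a hypothetical helper name, not an mdtest feature.
mdtest_means() {
  awk -F':' '
    /^[ \t]*(Directory|File|Tree) [a-z]+[ \t]*:/ {
      name = $1
      gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim padding around the name
      split($2, v, " ")                   # v[1]=max v[2]=min v[3]=mean v[4]=stddev
      printf "%s\t%s\n", name, v[3]
    }'
}

# Example use (not run here):
#   ./src/mdtest | mdtest_means
```

Lines that are not table rows, such as the "started at" and "Command line used" lines, do not match the pattern and are skipped.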
Note: There is currently a bug in some versions of the libfabric library, notably version 1.3.0, that can cause a delay in starting MPI applications. When this occurs, the following warning will appear in the command output:
hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
This issue affects RHEL and CentOS 7.3, and is resolved in RHEL / CentOS 7.4+ and the upstream project. Details can be found here:
https://bugzilla.redhat.com/show_bug.cgi?id=1408316
Prepare the run-time environment
- Create a user account from which to run the application, if a suitable account does not already exist. The account must be propagated across all of the Lustre client nodes that will participate in the benchmark, as well as the MDS servers for the file system. On the servers, it is recommended that the account is disabled in order to prevent users from logging into those machines.
- Some MPI implementations rely upon passphrase-less SSH keys. Log in as the benchmark user to one of the nodes and create a passphrase-less SSH key. This will enable the mpirun command to launch processes on each of the client nodes that will run the benchmark. For example:
[mjcowe@ct7-c1 ~]$ ssh-keygen -t rsa -N "" -f $HOME/.ssh/id_rsa
Generating public/private rsa key pair.
Created directory '/home/mjcowe/.ssh'.
Your identification has been saved in /home/mjcowe/.ssh/id_rsa.
Your public key has been saved in /home/mjcowe/.ssh/id_rsa.pub.
The key fingerprint is:
e4:b1:10:a2:7f:e8:b1:74:f3:c3:24:76:46:3d:4d:91 mjcowe@ct7-c1
The key's randomart image is:
+--[ RSA 2048]----+
| . . oo |
| . . . . oE |
| . . + o . |
| . . = o . |
| = * S |
| o * O |
| o + |
| . |
| |
+-----------------+
- Copy the public key into the $HOME/.ssh/authorized_keys file for the account.
- If the user account is not hosted on a shared file system (e.g. a Lustre file system), then copy the public and private keys that were generated into the $HOME/.ssh directory of each of the Lustre client nodes that will be used in the benchmark. Normally, user accounts are hosted on a shared resource, making this step unnecessary.
- Consider relaxing the StrictHostKeyChecking SSH option so that host entries are automatically added into $HOME/.ssh/known_hosts rather than prompting the user to confirm the connection. When running MPI programs across many nodes, this can save a good deal of inconvenience. If the account home directory is not on shared storage, all nodes will need to be updated. For example, in $HOME/.ssh/config:
Host *
  StrictHostKeyChecking no
- Install the MPI runtime onto all Lustre client nodes:
yum install openmpi
- Append the following lines to $HOME/.bashrc (assuming BASH is the login shell) on the account running the benchmark:
module purge
module load mpi/openmpi-x86_64
This ensures that the Open MPI library path and binary path are added to the user environment every time the user logs in (and every time mpirun is invoked across multiple nodes). The .bash_profile file is not read when mpirun starts processes on remote nodes, which is why it is not chosen in this case.
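Before moving on, it can be useful to confirm that the passphrase-less login prepared in the steps above actually works on every client. The sketch below assumes a host list file with the host name in the first column, such as the mpirun host file described under Benchmark Execution. The function names hosts_in and check_ssh are hypothetical; BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
#!/bin/sh
# hosts_in FILE: print the first column of a host file,
# skipping blank lines and lines that start with '#'.
hosts_in() {
  awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# check_ssh FILE: attempt a non-interactive login to every host in FILE.
# BatchMode=yes makes ssh fail instead of prompting for a password or
# a host key confirmation, so any prompt shows up as FAIL.
check_ssh() {
  failed=0
  for host in $(hosts_in "$1"); do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
      echo "ok   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  return "$failed"
}

# Example use (not run here):
#   check_ssh $HOME/hfile
```

A host reported as FAIL usually needs its authorized_keys or known_hosts entries revisited before mpirun will be able to start processes there.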
Benchmark Execution
- Log in to one of the compute nodes as the benchmark user.
- Create a host file for the mpirun command, containing the list of Lustre clients that will be used for the benchmark. Each line in the file represents a machine and the number of slots (usually equal to the number of CPU cores). For example:
for i in `seq -f "%03g" 1 32`; do
  echo "n"$i" slots=16"
done > $HOME/hfile
# Result:
n001 slots=16
n002 slots=16
n003 slots=16
n004 slots=16
...
- The first column of the host file contains the name of the nodes. This can also be an IP address if the /etc/hosts file or DNS is not set up.
- The second column is used to represent the number of CPU cores.
- Run a quick test using mpirun to verify that the environment is set up correctly. For example:
mpirun --hostfile $HOME/hfile --map-by node -np `cat $HOME/hfile|wc -l` hostname
This should return the hostnames of all the machines that are in the test environment. The results are returned unsorted, in order of completion.
Note: If the --map-by node option does not work, and the output contains only one or a very small number of unique hostnames, then set slots=1 for each host in the host file. Otherwise, mpirun will fill up the slots on the first node before launching processes on subsequent nodes.
This may be desirable for multi-process tests but not for the single task per client test. Do not set the slot count higher than the number of cores present. If over-subscription is required, set the -np flag to a value greater than the number of physical cores. This informs OpenMPI that the task will be oversubscribed, and it will run in a mode that yields the processor to peers.
Refer to the OpenMPI FAQ entry on oversubscribing nodes (https://www.open-mpi.org/faq/?category=all#oversubscribing), and also the notes on OpenMPI at the end of this document.
- Use mpirun to launch the mdtest benchmark. For example:
mpirun --hostfile $HOME/hfile -np 48 ./mdtest -n 20840 -i 10 -u -d /lustre/demo/mdtest-scratch
In the above example, 48 processes (-np 48) will be distributed across the nodes listed in the host file (--hostfile $HOME/hfile), with each process creating 20,840 directories and files (-n 20840) for a total of 1,000,320 files/directories. The test will conduct 10 iterations (-i 10) and use /lustre/demo/mdtest-scratch as the target base directory (-d <path>). The -u flag tells the program to assign a unique working directory per task.
When first running the test on a new system, size the test for 10,000 files/directories. This will give an idea of how the system handles the test. Gradually increase the number of files/directories as confidence in the results grows, up to a maximum of 1,000,000 files/directories, or higher if there is a specific requirement in excess of this value. Note that 100,000 files/directories is probably the minimum value that will deliver a meaningful result (such that MDS caching does not affect results).
Start with a small number of processes and increase with each run using a doubling sequence starting at one (1, 2, 4, 8, 16), keeping the total number of files created as close to the target files/directories as possible. This means that as the process count increases, the value of the -n parameter should decrease.
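The relationship between the process count and -n can be worked out mechanically: the value of -n is the target total divided by the number of processes. The sketch below prints, without running anything, one mpirun command line per step of the doubling sequence. The function names items_per_task and print_sweep are hypothetical, and the paths are simply those used in the example above.

```shell
#!/bin/sh
# items_per_task TOTAL NP: value for mdtest -n so that NP processes create
# roughly TOTAL items between them (integer division, rounds down).
items_per_task() {
  echo $(( $1 / $2 ))
}

# print_sweep TOTAL: print one mpirun command line per process count in the
# doubling sequence 1, 2, 4, 8, 16. Nothing is executed.
print_sweep() {
  total=$1
  for np in 1 2 4 8 16; do
    echo "mpirun --hostfile \$HOME/hfile -np $np ./mdtest -n $(items_per_task "$total" "$np") -i 10 -u -d /lustre/demo/mdtest-scratch"
  done
}

# Example use (not run here):
#   print_sweep 1000000
```

For a target of 1,000,000 items this gives -n 1000000 for one process, falling to -n 62500 for sixteen. Because the division rounds down, the realised total can fall slightly short of the target when the process count does not divide it evenly.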
Notes on OpenMPI
When preparing the benchmark, pay careful attention to the distribution of processes across the nodes. By default, mpirun will fill the slots of one node before allocating processes to the next node in the list. That is, all of the slots on the first node in the file will be consumed before allocating processes to the second node, then the third node, and so on. If the number of slots requested is lower than the overall number of slots in the host file, then utilisation will not be evenly distributed, and some nodes may not be used at all.
If the number of processes is larger than the number of available slots, mpirun will oversubscribe one or more nodes until all the processes have been launched. This can be exploited to create a more even distribution of processes across nodes by setting the number of slots per host to 1. However, note that mpirun will decide where the additional processes will run, which can lead to performance variance from run to run of a job.
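Setting one slot per host does not require editing the host file by hand. As a minimal sketch, assuming the two-column host file format used in this document, the following filter rewrites every entry with slots=1. The function name one_slot_per_host and the output file name in the usage comment are hypothetical.

```shell
#!/bin/sh
# one_slot_per_host: read a host file on stdin and write each host back
# out with slots=1, dropping blank lines and '#' comment lines.
one_slot_per_host() {
  awk 'NF && $1 !~ /^#/ { print $1, "slots=1" }'
}

# Example use (not run here); hfile.1slot is a hypothetical file name:
#   one_slot_per_host < $HOME/hfile > $HOME/hfile.1slot
```

Writing to a separate file keeps the original host file available for the multi-process tests.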
The --map-by node option distributes processes evenly across the nodes, and does not try to consume all of the slots from one node before allocating processes to the next node in the list. For example, if there are 4 nodes, each with 16 slots (64 slots total), and a job is submitted that requires only 24 slots, then each node will be allocated 6 processes.
Experiment with the options by using the hostname command as the target application. For example:
[mduser@ct7-c1 ~]$ cat $HOME/hfile
ct7-c1 slots=16
ct7-c2 slots=16
ct7-c3 slots=16
ct7-c4 slots=16

# By default, mpirun will fill slots on one node before allocating slots from the next:
[mduser@ct7-c1 ~]$ mpirun --hostfile $HOME/hfile -np `cat $HOME/hfile|wc -l` hostname
ct7-c1
ct7-c1
ct7-c1
ct7-c1

# The --map-by node option distributes the processes evenly:
[mduser@ct7-c1 ~]$ mpirun --hostfile $HOME/hfile --map-by node -np `cat $HOME/hfile|wc -l` hostname
ct7-c2
ct7-c1
ct7-c3
ct7-c4
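With more than a handful of nodes, the raw hostname output becomes hard to read by eye. A small filter, sketched here with the hypothetical name count_per_host, tallies how many processes landed on each host so that an uneven distribution stands out.

```shell
#!/bin/sh
# count_per_host: read one hostname per line on stdin (for example the
# output of "mpirun ... hostname") and print "host count", sorted by host.
count_per_host() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Example use (not run here):
#   mpirun --hostfile $HOME/hfile --map-by node -np 24 hostname | count_per_host
```

In the 4-node, 24-process case described above, an even distribution would show a count of 6 against each host.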
The -np parameter is the total number of processes. If the host file has 16 nodes but the value of -np is 1, then only one process on one node is used to complete the operations.
The mpirun(1) man page provides a comprehensive description of the available options.
See also the OpenMPI FAQ (https://www.open-mpi.org/faq/), and the section on oversubscription (https://www.open-mpi.org/faq/?category=all#oversubscribing).