CHAPTER 24
Benchmarking Lustre Performance (Lustre I/O Kit)
This chapter describes the Lustre I/O kit, a collection of I/O benchmarking tools for a Lustre cluster, and PIOS, a parallel I/O simulator for Linux and Solaris. It includes:
The tools in the Lustre I/O kit are used to benchmark Lustre hardware and validate that it is working as expected before you install the Lustre software. They can also be used to validate the performance of the various hardware and software layers in the cluster and to find and troubleshoot I/O issues.
Typically, performance is measured starting with single raw devices and then proceeding to groups of devices. Once raw performance has been established, other software layers are then added incrementally and tested.
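The I/O kit contains three tests, each of which tests a progressively higher layer in the Lustre stack: sgpdd_survey, which measures the raw device; obdfilter_survey, which measures the OSTs either locally or across the network; and ost_survey, which compares the OSTs as seen from a Lustre client.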
Typically with these tests, Lustre should deliver 85-90% of the raw device performance.
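As a rough illustration of the 85-90% guideline, the sketch below computes what percentage of the raw device bandwidth a Lustre measurement represents. The function name lustre_efficiency and the 160.00 figure are hypothetical; substitute your own measurements.

```shell
#!/bin/sh
# Hypothetical helper: percentage of raw device bandwidth delivered by Lustre.
# Usage: lustre_efficiency <raw_bandwidth> <lustre_bandwidth>
# Both arguments must use the same unit (for example, MB/s).
lustre_efficiency() {
    awk -v raw="$1" -v lus="$2" 'BEGIN {
        if (raw <= 0) exit 1          # avoid dividing by zero
        printf "%.1f\n", 100 * lus / raw
    }'
}

# Illustrative numbers only, not measured results.
lustre_efficiency 180.50 160.00
```

A result below roughly 85 suggests that one of the layers added on top of the raw device deserves a closer look.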
A utility stats-collect is also provided to collect application profiling information from Lustre clients and servers. See Collecting Application Profiling Information (stats-collect) for more information.
The following prerequisites must be met to use the tests in the Lustre I/O kit:
Download the Lustre I/O kit (lustre-iokit) from:
http://downloads.lustre.org/public/tools/lustre-iokit/
The sgpdd_survey tool is used to test bare metal I/O performance of the raw hardware, while bypassing as much of the kernel as possible. This survey may be used to characterize the performance of a SCSI device by simulating an OST serving multiple stripe files. The data gathered by this survey can help set expectations for the performance of a Lustre OST using this device.
The script uses sgp_dd to carry out raw sequential disk I/O. It runs with variable numbers of sgp_dd threads to show how performance varies with different request queue depths.
The script spawns variable numbers of sgp_dd instances, each reading or writing a separate area of the disk to demonstrate performance variance within a number of concurrent stripe files.
Several tips and insights for disk performance measurement are described below. Some of this information is specific to RAID arrays and/or the Linux RAID implementation.
Before creating a RAID array, benchmark all disks individually. We have frequently encountered situations where drive performance was not consistent for all devices in the array. Replace any disks that are significantly slower than the rest.
To identify the optimal request size for a given disk, benchmark the disk with different record sizes ranging from 4 KB up to 1 MB or 2 MB.
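One way to enumerate the record sizes for such a sweep is sketched below. Stepping by powers of two is an assumption made here for illustration; the text above specifies only the range. The function name record_sizes_kb is hypothetical.

```shell
#!/bin/sh
# Print power-of-two record sizes in KB from a minimum to a maximum (inclusive).
# Usage: record_sizes_kb <min_kb> <max_kb>
record_sizes_kb() {
    rsz=$1
    while [ "$rsz" -le "$2" ]; do
        printf '%s\n' "$rsz"
        rsz=$((rsz * 2))
    done
}

# 4 KB up to 2 MB (2048 KB), the range suggested above.
record_sizes_kb 4 2048
```

Each printed value can then be used as the record size for one benchmark run against the disk.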
Caution - The sgpdd_survey script overwrites the device being tested, which results in the LOSS OF ALL DATA on that device. Exercise caution when selecting the device to be tested.
Note - Array performance with all LUNs loaded does not always match the performance of a single LUN when tested in isolation.
The device(s) being tested must meet one of these two requirements:
Raw and SCSI devices cannot be mixed in the test specification.
To get large I/O transfers (1 MB) to disk, it may be necessary to tune several kernel parameters as specified:
/sys/block/sdN/queue/max_sectors_kb = 4096
/sys/block/sdN/queue/max_phys_segments = 256
/proc/scsi/sg/allow_dio = 1
/sys/module/ib_srp/parameters/srp_sg_tablesize = 255
/sys/block/sdN/queue/scheduler
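The sketch below prints, without executing, the commands that would apply these settings to one device, so that they can be reviewed first. The function name print_io_tuning and the device name sdb are hypothetical. Whether each file exists and is writable at run time depends on the kernel version and drivers in use, so treat the output as a starting point rather than a verified procedure.

```shell
#!/bin/sh
# Print (do not execute) commands that would apply the settings listed above
# to one block device. Usage: print_io_tuning <device>
print_io_tuning() {
    dev=$1
    printf 'echo 4096 > /sys/block/%s/queue/max_sectors_kb\n' "$dev"
    printf 'echo 256 > /sys/block/%s/queue/max_phys_segments\n' "$dev"
    printf 'echo 1 > /proc/scsi/sg/allow_dio\n'
    printf 'echo 255 > /sys/module/ib_srp/parameters/srp_sg_tablesize\n'
    # No value is given for the scheduler entry above, so only display it.
    printf 'cat /sys/block/%s/queue/scheduler\n' "$dev"
}

# sdb is an example device name only.
print_io_tuning sdb
```

Printing the commands instead of running them keeps the sketch safe to try on any machine; the reviewed output can be run as root on the node under test.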
The sgpdd_survey script must be customized for the particular device being tested and for the location where the script saves its working and result files (by specifying the ${rslt} variable). Customization variables are described at the beginning of the script.
When the sgpdd_survey script runs, it creates a number of working files and a pair of result files. The names of all the files created start with the prefix defined in the variable ${rslt}. (The default value is /tmp.) The files include:
${rslt}_<date/time>.summary
${rslt}_<date/time>_*
${rslt}_<date/time>.detail
The stdout and the .summary file will contain lines like this:
total_size 8388608K rsz 1024 thr 1 crg 1 180.45 MB/s 1 x 180.50 \=/ 180.50 MB/s
Each line corresponds to a run of the test. Each test run will have a different number of threads, record size, or number of regions.
If there are so many threads that the sgp_dd script is unlikely to be able to allocate I/O buffers, then ENOMEM is printed in place of the aggregate bandwidth result.
If one or more sgp_dd instances do not successfully report a bandwidth number, then FAILED is printed in place of the aggregate bandwidth result.
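To compare runs, it can help to reduce each summary line to a few fields. The following is a minimal sketch, assuming the line layout shown above and assuming that ENOMEM or FAILED appears as a separate word on the line. The function name sgpdd_summary_to_csv is hypothetical and is not part of the I/O kit.

```shell
#!/bin/sh
# Convert sgpdd_survey summary lines on stdin to "rsz,thr,crg,result" rows.
# result is the first bandwidth figure (the field before "MB/s"), or the
# ENOMEM / FAILED marker when that appears first on the line.
sgpdd_summary_to_csv() {
    awk '
    {
        rsz = thr = crg = result = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "rsz") rsz = $(i + 1)
            else if ($i == "thr") thr = $(i + 1)
            else if ($i == "crg") crg = $(i + 1)
            else if (result == "" && ($i == "ENOMEM" || $i == "FAILED")) result = $i
            else if (result == "" && i > 1 && $i == "MB/s") result = $(i - 1)
        }
        # Skip lines that do not look like summary lines.
        if (rsz != "") print rsz "," thr "," crg "," result
    }'
}

# Sample line copied from the summary output shown above.
printf '%s\n' 'total_size 8388608K rsz 1024 thr 1 crg 1 180.45 MB/s 1 x 180.50 \=/ 180.50 MB/s' |
    sgpdd_summary_to_csv
```

In practice the input would be the .summary file, for example: sgpdd_summary_to_csv < <summary_file>.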
The obdfilter_survey script generates sequential I/O from varying numbers of threads and objects (files) to simulate the I/O patterns of a Lustre client.
The obdfilter_survey script can be run directly on the OSS node to measure the OST storage performance without any intervening network, or it can be run remotely on a Lustre client to measure the OST performance including network overhead.
The obdfilter_survey is used to characterize the performance of the following:
Run the script using the case=disk parameter to run the test against all the local OSTs. The script automatically detects all local OSTs and includes them in the survey.
To run the test against only specific OSTs, run the script using the targets= parameter to list the OSTs to be tested explicitly. If some OSTs are on remote nodes, specify their hostnames in addition to the OST name (for example, oss2:lustre-OST0004).
All obdfilter instances are driven directly. The script automatically loads the obdecho module (if required) and creates one instance of echo_client for each obdfilter instance in order to generate I/O requests directly to the OST.
For more details, see Testing Local Disk Performance.
Pass the parameters case=network and targets=<hostname|IP_of_server> to the script. For each network case, the script does the required setup.
For more details, see Testing Network Performance.
To run the test against all the local OSCs, pass the parameter case=netdisk to the script. Alternatively, you can pass the targets= parameter with one or more OSC devices (e.g., lustre-OST0000-osc-ffff88007754bc00) against which the tests are to be run.
For more details, see Testing Remote Disk Performance.
Note - If the obdfilter_survey test is terminated before it completes, some small amount of space is leaked. You can either ignore it or reformat the file system.
The obdfilter_survey script can be run automatically or manually against a local disk. This script profiles the overall throughput of storage hardware, including the file system and RAID layers managing the storage, by sending workloads to the OSTs that vary in thread count, object count, and I/O size.
When the obdfilter_survey script is run, it provides information about the performance abilities of the storage hardware and shows the saturation points.
The plot-obdfilter script takes the output of obdfilter_survey and generates a CSV file and parameters that can be imported into a spreadsheet or gnuplot to visualize the data.
To run the obdfilter_survey script, create a standard Lustre configuration; no special setup is needed.
The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.
2. Verify that the obdecho module is loaded. Run:
modprobe obdecho
3. Run the obdfilter_survey script with the parameter case=disk.
For example, to run a local test with up to two objects (nobjhi), up to two threads (thrhi), and 1024 MB transfer size (size):
$ nobjhi=2 thrhi=2 size=1024 case=disk sh obdfilter-survey
The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.
2. Verify that the obdecho module is loaded. Run:
modprobe obdecho
3. Determine the OST names.
On the OSS nodes to be tested, run the lctl dl command. The OST device names are listed in the fourth column of the output. For example:
$ lctl dl | grep obdfilter
0 UP obdfilter lustre-OST0001 lustre-OST0001_UUID 1159
2 UP obdfilter lustre-OST0002 lustre-OST0002_UUID 1159
...
4. List all OSTs you want to test.
Use the targets= parameter to list the OSTs separated by spaces. List the individual OSTs by name using the format <fsname>-<OSTnumber> (for example, lustre-OST0001). You do not have to specify an MDS or LOV.
5. Run the obdfilter_survey script with the targets= parameter.
For example, to run a local test with up to two objects (nobjhi), up to two threads (thrhi), and 1024 MB (size) transfer size:
$ nobjhi=2 thrhi=2 size=1024 targets="lustre-OST0001 lustre-OST0002" sh obdfilter-survey
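The device names can also be extracted from the lctl dl listing instead of being copied by hand. The following is a minimal sketch, assuming the column layout shown in the lctl dl example above (device type in the third column, device name in the fourth). The function name device_names is hypothetical.

```shell
#!/bin/sh
# Read "lctl dl" output on stdin and print the names (fourth column) of all
# devices of the given type (third column) as one space-separated string.
# Usage: lctl dl | device_names obdfilter
device_names() {
    awk -v want="$1" '
    $3 == want { printf "%s%s", sep, $4; sep = " " }
    END { print "" }'
}

# Sample input copied from the lctl dl example above; on a real OSS node,
# pipe the output of "lctl dl" into the function instead.
targets=$(printf '%s\n' \
    '0 UP obdfilter lustre-OST0001 lustre-OST0001_UUID 1159' \
    '2 UP obdfilter lustre-OST0002 lustre-OST0002_UUID 1159' | device_names obdfilter)
echo "targets=\"$targets\""
```

The resulting string can be supplied as the value of the targets parameter. Passing osc instead of obdfilter selects OSC devices in the same way for the remote disk case.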
The obdfilter_survey script can only be run automatically against a network; no manual test is provided.
To run the network test, a specific Lustre setup is needed. Make sure that these configuration requirements have been met.
The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.
2. Verify that the obdecho module is loaded. Run:
modprobe obdecho
3. Start lctl and check the device list, which must be empty. Run:
lctl dl
4. Run the obdfilter_survey script with the parameters case=network and targets=<hostname|IP_of_server>. For example:
$ nobjhi=2 thrhi=2 size=1024 targets="oss1 oss2" case=network sh obdfilter-survey
5. On the server side, view the statistics at:
/proc/fs/lustre/obdecho/<echo_srv>/stats
where <echo_srv> is the obdecho server created by the script.
The obdfilter_survey script can be run automatically or manually against a network disk. To run the network disk test, start with a standard Lustre configuration. No special setup is needed.
The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.
2. Verify that the obdecho module is loaded. Run:
modprobe obdecho
3. Run the obdfilter_survey script with the parameter case=netdisk. For example:
$ nobjhi=2 thrhi=2 size=1024 case=netdisk sh obdfilter-survey
The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.
2. Verify that the obdecho module is loaded. Run:
modprobe obdecho
3. Determine the OSC names.
On the OSS nodes to be tested, run the lctl dl command. The OSC device names are listed in the fourth column of the output. For example:
$ lctl dl | grep osc
3 UP osc lustre-OST0000-osc-ffff88007754bc00 54b91eab-0ea9-1516-b571-5e6df349592e 5
4 UP osc lustre-OST0001-osc-ffff88007754bc00 54b91eab-0ea9-1516-b571-5e6df349592e 5
...
4. List all OSCs you want to test.
Use the targets= parameter to list the OSCs separated by spaces. List the individual OSCs by name using the format <fsname>-<OST_name>-osc-<OSC_number> (for example, lustre-OST0000-osc-ffff88007754bc00). You do not have to specify an MDS or LOV.
5. Run the obdfilter_survey script with the targets= parameter and case=netdisk.
An example of a test run with up to two objects (nobjhi), up to two threads (thrhi), and 1024 MB (size) transfer size is shown below:
$ nobjhi=2 thrhi=2 size=1024 \
  targets="lustre-OST0000-osc-ffff88007754bc00 \
  lustre-OST0001-osc-ffff88007754bc00" \
  case=netdisk sh obdfilter-survey
When the obdfilter_survey script runs, it creates a number of working files and a pair of result files. All files start with the prefix defined in the variable ${rslt}.
The obdfilter_survey script iterates over the given number of threads and objects performing the specified tests and checks that all test processes have completed successfully.
The .summary file and stdout of the obdfilter_survey script contain lines like:
ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 [ 64.00, 82.00]
Note - Although the numbers of threads and objects are specified per-OST in the customization section of the script, the reported results are aggregated over all OSTs.
It is useful to import the obdfilter_survey script summary data (it is fixed width) into Excel (or any graphing package) and graph the bandwidth versus the number of threads for varying numbers of concurrent regions. This shows how the OSS performs for a given number of concurrently-accessed objects (files) with varying numbers of I/Os in flight.
It is also useful to monitor and record average disk I/O sizes during each test using the “disk io size” histogram in the file /proc/fs/lustre/obdfilter/*/brw_stats (see Watching the OST Block I/O Stream for details). These numbers help identify problems in the system when full-sized I/Os are not submitted to the underlying disk. This may be caused by problems in the device driver or Linux block layer.
The plot-obdfilter script included in the I/O toolkit is an example of processing output files to a .csv format and plotting a graph using gnuplot.
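In the same spirit, the sketch below turns summary lines into comma-separated rows for a spreadsheet. It is not the plot-obdfilter script. It assumes the line layout of the sample summary line shown earlier in this section, and the operation names rewrite and read are assumptions, since that sample shows only write. The function name obdfilter_summary_to_csv is hypothetical.

```shell
#!/bin/sh
# Convert obdfilter_survey summary lines on stdin to CSV rows of the form
# ost,rsz,obj,thr,op,bw (one row per operation found on a line).
obdfilter_summary_to_csv() {
    awk '
    BEGIN { print "ost,rsz,obj,thr,op,bw" }
    $1 == "ost" {
        for (i = 1; i < NF; i++) {
            if ($i == "ost" || $i == "rsz" || $i == "obj" || $i == "thr")
                val[$i] = $(i + 1)                      # remember key/value pairs
            else if ($i == "write" || $i == "rewrite" || $i == "read")
                print val["ost"] "," val["rsz"] "," val["obj"] "," val["thr"] "," $i "," $(i + 1)
        }
    }'
}

# Sample line copied from the summary output shown earlier in this section.
printf '%s\n' 'ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 [ 64.00, 82.00]' |
    obdfilter_summary_to_csv
```

Graphing the bw column against thr, with one series per obj value, gives the bandwidth-versus-threads view described above.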
The ost_survey tool is a shell script that uses lfs setstripe to perform I/O against a single OST. The script writes a file (currently using dd) to each OST in the Lustre file system, and compares read and write speeds. The ost_survey tool is used to detect anomalies between otherwise identical disk subsystems.
To run the ost_survey script, supply a file size (in KB) and the Lustre mount point. For example, run:
$ ./ost-survey.sh 10 /mnt/lustre
Average read Speed: 6.73
Average write Speed: 5.41
read - Worst OST indx 0 5.84 MB/s
write - Worst OST indx 0 3.77 MB/s
read - Best OST indx 1 7.38 MB/s
write - Best OST indx 1 6.31 MB/s
3 OST devices found
Ost index 0 Read speed 5.84 Write speed 3.77
Ost index 0 Read time 0.17 Write time 0.27
Ost index 1 Read speed 7.38 Write speed 6.31
Ost index 1 Read time 0.14 Write time 0.16
Ost index 2 Read speed 6.98 Write speed 6.16
Ost index 2 Read time 0.14 Write time 0.16
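Because ost_survey is meant to expose anomalies between otherwise identical OSTs, a simple check is to flag any OST whose speed falls well below the average. The following is a minimal sketch that reads the per-OST "Read speed / Write speed" lines in the format shown above. The function name flag_slow_osts and the 80% threshold are hypothetical choices, not part of the I/O kit.

```shell
#!/bin/sh
# Flag OSTs whose read or write speed falls below a fraction of the average.
# Reads ost_survey output on stdin. Usage: flag_slow_osts <fraction>
flag_slow_osts() {
    awk -v frac="$1" '
    # Only the per-OST speed lines are used; the time lines are ignored.
    $1 == "Ost" && $4 == "Read" && $5 == "speed" {
        n++; idx[n] = $3; rd[n] = $6 + 0; wr[n] = $9 + 0
        rsum += rd[n]; wsum += wr[n]
    }
    END {
        if (n == 0) exit
        ravg = rsum / n; wavg = wsum / n
        for (i = 1; i <= n; i++) {
            if (rd[i] < frac * ravg)
                printf "OST %s read %.2f below %.0f%% of average %.2f\n", idx[i], rd[i], frac * 100, ravg
            if (wr[i] < frac * wavg)
                printf "OST %s write %.2f below %.0f%% of average %.2f\n", idx[i], wr[i], frac * 100, wavg
        }
    }'
}

# Per-OST lines copied from the sample output above.
printf '%s\n' \
    'Ost index 0 Read speed 5.84 Write speed 3.77' \
    'Ost index 1 Read speed 7.38 Write speed 6.31' \
    'Ost index 2 Read speed 6.98 Write speed 6.16' | flag_slow_osts 0.8
```

With the sample figures, only the write speed of OST index 0 falls under the threshold, which matches the "Worst OST" line in the output above.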
The stats-collect utility contains the following scripts used to collect application profiling information from Lustre clients and servers:
The stats-collect utility requires:
The stats-collect utility is configured by including profiling configuration variables in the config.sh script. Each configuration variable takes the following form, where 0 indicates statistics are to be collected only when the script starts and stops and n indicates the interval in seconds at which statistics are to be collected:
<statistic>_INTERVAL=[0|n]
Statistics that can be collected include:
To collect profile information:
1. Begin collecting statistics on each node specified in the config.sh script.
Start the collect profile daemon on each node by entering:
sh gather_stats_everywhere.sh config.sh start
2. Run the test.
3. Stop collecting statistics on each node, clean up the temporary file, and create a profiling tarball.
sh gather_stats_everywhere.sh config.sh stop <log_name.tgz>
When <log_name.tgz> is specified, a profile tarball /tmp/<log_name.tgz> is created.
4. Analyze the collected statistics and create a csv tarball for the specified profiling data.
sh gather_stats_everywhere.sh config.sh analyse log_tarball.tgz csv
Copyright © 2011, Oracle and/or its affiliates. All rights reserved.