CHAPTER 18
Lustre I/O Kit
This chapter describes the Lustre I/O kit, the PIOS performance tool, and LNET self-test.
The Lustre I/O kit is a collection of benchmark tools for a Lustre cluster. The I/O kit can be used to validate the performance of the various hardware and software layers in the cluster and also as a way to find and troubleshoot I/O issues.
The I/O kit contains three tests. The first surveys basic performance of the device and bypasses the kernel block device layers, buffer cache and file system. The subsequent tests survey progressively higher layers of the Lustre stack. Typically with these tests, Lustre should deliver 85-90% of the raw device performance.
It is very important to establish performance from the “bottom up” perspective. First, the performance of a single raw device should be verified. Once this is complete, verify that performance is stable within a larger number of devices. Frequently, while troubleshooting such performance issues, we find that array performance with all LUNs loaded does not always match the performance of a single LUN when tested in isolation. After the raw performance has been established, other software layers can be added and tested in an incremental manner.
You can download the I/O kit from:
http://downloads.lustre.org/public/tools/lustre-iokit/
In this directory, you will find two packages:
The following prerequisites must be met to use the Lustre I/O kit:
As mentioned above, the I/O kit contains these test tools:
Use the sgpdd_survey tool to test bare metal performance, while bypassing as much of the kernel as possible. This script requires the sgp_dd package, although it does not require Lustre software. This survey may be used to characterize the performance of a SCSI device by simulating an OST serving multiple stripe files. The data gathered by this survey can help set expectations for the performance of a Lustre OST exporting the device.
The script uses sgp_dd to carry out raw sequential disk I/O. It runs with variable numbers of sgp_dd threads to show how performance varies with different request queue depths.
The script spawns variable numbers of sgp_dd instances, each reading or writing a separate area of the disk to demonstrate performance variance within a number of concurrent stripe files.
The device(s) used must meet one of the two tests described below:
Must appear in the output of sg_map (make sure the kernel module "sg" is loaded)
Must appear in the output of raw -qa
If you need to create raw devices in order to use the sgpdd_survey tool, note that raw device 0 cannot be used due to a bug in certain versions of the "raw" utility (including the version shipped with RHEL4U4).
You may not mix raw and SCSI devices in the test specification.
Caution - The sgpdd_survey script overwrites the device being tested, which results in the LOSS OF ALL DATA on that device. Exercise caution when selecting the device to be tested.
The sgpdd_survey script must be customized according to the particular device being tested and also according to the location where it should keep its working files. Customization variables are described explicitly at the start of the script.
When the sgpdd_survey script runs, it creates a number of working files and a pair of result files. All files start with the prefix given by the script variable ${rslt}.
${rslt}_<date/time>.summary    same as stdout
${rslt}_<date/time>_*          tmp files
${rslt}_<date/time>.detail     collected tmp files for post-mortem
The summary file and stdout should contain lines like this:
total_size 8388608K rsz 1024 thr 1 crg 1 180.45 MB/s 1 x 180.50 \=/ 180.50 MB/s
The number immediately before the first MB/s is bandwidth, computed by measuring total data and elapsed time. The remaining numbers are a check on the bandwidths reported by the individual sgp_dd instances.
If there are so many threads that the sgp_dd script is unlikely to be able to allocate I/O buffers, then "ENOMEM" is printed.
If one or more sgp_dd instances do not successfully report a bandwidth number, then "failed" is printed.
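As a minimal sketch of post-processing the summary file, the aggregate bandwidth can be extracted by taking the field immediately before the first MB/s on each line. The helper name summary_bandwidth is hypothetical and is not part of the I/O kit. The sketch assumes that a line reporting ENOMEM or failed carries no MB/s field, in which case nothing is printed for it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the I/O kit): print the number that
# immediately precedes the first "MB/s" on each input line.
# Lines with no "MB/s" field produce no output.
summary_bandwidth() {
    awk '{
        for (i = 2; i <= NF; i++) {
            if ($i == "MB/s") { print $(i - 1); break }
        }
    }'
}

# Sample line copied from the summary output shown above.
line='total_size 8388608K rsz 1024 thr 1 crg 1 180.45 MB/s 1 x 180.50 \=/ 180.50 MB/s'
printf '%s\n' "$line" | summary_bandwidth
```

In practice the function would be fed the ${rslt}_<date/time>.summary file instead of a single sample line.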
To get large I/O (1 MB) to disk, it may be necessary to tune several kernel parameters as specified:
/sys/block/sdN/queue/max_sectors_kb = 4096
/sys/block/sdN/queue/max_phys_segments = 256
/sys/module/ib_srp/parameters/srp_sg_tablesize = 255
The obdfilter_survey script processes sequential I/O with varying numbers of threads and objects (files) by using lctl to drive the echo_client connected to local or remote obdfilter instances or remote obdecho instances. It can be used to characterize the performance of the following Lustre components:
The script exercises one or more instances of obdfilter directly. The script may run on one or more nodes, for example, when the nodes are all attached to the same multi-ported disk subsystem.
Tell the script the names of all obdfilter instances (which should be up and running already). If some instances are on different nodes, specify their hostnames too (for example, node1:ost1). Alternately, you can pass parameter case=disk to the script. (The script automatically detects the local obdfilter instances.)
All obdfilter instances are driven directly. The script automatically loads the obdecho module (if required) and creates one instance of echo_client for each obdfilter instance.
The script drives one or more instances of the obdecho server via instances of echo_client running on one or more nodes. Pass the parameters case=network and targets="<hostname/ip_of_server>" to the script. For each network case, the script does the required setup.
The script drives one or more instances of obdfilter via instances of echo_client running on one or more nodes.
Tell the script the names of the OSCs (which should be up and running). Alternately, you can pass the parameter case=netdisk to the script. The script will use all of the local OSCs.
Note - The obdfilter_survey script is NOT scalable to hundreds of nodes since it is only intended to measure individual servers, not the scalability of the entire system.
The obdfilter_survey script supports automatic and manual runs against a local disk. The script profiles the overall throughput of storage hardware by sending ranges of workloads to the OSTs (varied in thread counts and I/O sizes).
When the obdfilter_survey script is complete, it provides information on the performance abilities of the storage hardware and shows the saturation points. If you use plot scripts on the data, this information is shown graphically.
To run the obdfilter_survey script, create a normal Lustre configuration; no special setup is needed.
1. Set up the Lustre file system.
2. Verify that the obdecho.ko module is present.
3. Run the obdfilter_survey script with the parameter case=disk. For example:
$ nobjhi=2 thrhi=2 size=1024 case=disk sh obdfilter-survey
1. List all OSTs you want to test. (You do not have to specify an MDS or LOV.)
$ mkfs.lustre --fsname spfs --mdt --mgs /dev/sda
3. Determine the obdfilter instance names on all Lustre clients. The device names appear in the fourth column of the lctl dl command output. For example:
$ pdsh -w oss[01-02] lctl dl |grep obdfilter |sort
oss01: 0 UP obdfilter oss01-sdb oss01-sdb_UUID 3
oss01: 2 UP obdfilter oss01-sdd oss01-sdd_UUID 3
oss02: 0 UP obdfilter oss02-sdi oss02-sdi_UUID 3
...
In this example, the obdfilter instance names are oss01-sdb, oss01-sdd, and oss02-sdi. Since you are driving obdfilter instances directly, set the shell array variable, targets, to the names of the obdfilter instances. For example:
$ targets='oss01:oss01-sdb oss01:oss01-sdd oss02:oss02-sdi' ./obdfilter-survey
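The targets value can also be assembled from the pdsh output instead of being typed by hand. The following is a sketch under the assumption that the output has exactly the column layout of the example above (hostname prefix, index, state, type, instance name); the helper name build_targets is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn "host: idx UP obdfilter name uuid refs" lines
# into a space-separated list of host:name pairs.
build_targets() {
    awk '$4 == "obdfilter" {
        sub(/:$/, "", $1)              # strip the trailing colon pdsh adds
        printf "%s%s:%s", sep, $1, $5
        sep = " "
    }
    END { print "" }'
}

# Sample input copied from the example output above.
sample_output='oss01: 0 UP obdfilter oss01-sdb oss01-sdb_UUID 3
oss01: 2 UP obdfilter oss01-sdd oss01-sdd_UUID 3
oss02: 0 UP obdfilter oss02-sdi oss02-sdi_UUID 3'

targets=$(printf '%s\n' "$sample_output" | build_targets)
echo "targets='$targets'"
```

On a live system the sample text would be replaced by the real pdsh pipeline, and the resulting value passed to the script in the environment as in the manual example.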
The obdfilter_survey script can only be run automatically against a network; no manual test is supported.
To run the network test, a specific Lustre setup is needed. Make sure that these configuration requirements have been met.
1. Run the obdfilter_survey script with the parameters case=network and targets="<hostname/ip_of_server>". For example:
$ nobjhi=2 thrhi=2 size=1024 targets="<hostname/ip_of_server>" case=network sh obdfilter-survey
On the server side, you can see the statistics at:
/proc/fs/lustre/obdecho/<echo_srv>/stats
where 'echo_srv' is the obdecho server created by the script.
The obdfilter_survey script can be run automatically or manually against a network disk.
To run the network disk test, create a Lustre configuration using normal methods; no special setup is needed.
1. Set up the Lustre file system with the required OSTs.
2. Verify that the obdecho.ko module is present.
3. Run the obdfilter_survey script with the parameter case=netdisk. For example:
$ nobjhi=2 thrhi=2 size=1024 case=netdisk sh obdfilter-survey
1. Run the obdfilter_survey script and tell the script the names of all echo_client instances (which should be up and running already).
$ nobjhi=2 thrhi=2 size=1024 targets="<osc_name> ..." sh obdfilter-survey
When the obdfilter_survey script runs, it creates a number of working files and a pair of result files. All files start with the prefix given by ${rslt}.
The obdfilter_survey script iterates over the given number of threads and objects performing the specified tests and checks that all test processes have completed successfully.
The summary file and stdout of the obdfilter_survey script contain lines such as:
ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 [ 64.00, 82.00]
Note - Although the numbers of threads and objects are specified per-OST in the customization section of the script, the reported results are aggregated over all OSTs.
It is useful to import the obdfilter_survey script summary data (it is fixed width) into Excel (or any graphing package) and graph the bandwidth versus the number of threads for varying numbers of concurrent regions. This shows how the OSS performs for a given number of concurrently-accessed objects (files) with varying numbers of I/Os in flight.
It is also extremely useful to record average disk I/O sizes during each test. These numbers help locate pathologies in the file system block allocator and the block device elevator.
The plot-obdfilter script (included) is an example of processing output files to a .csv format and plotting a graph using gnuplot.
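As a rough illustration of the kind of conversion involved (this is not the plot-obdfilter script itself), the sketch below pulls the object count, thread count and write figure out of a summary line and emits them as CSV. The helper name survey_to_csv and the column names are hypothetical, and only the fields visible in the sample line above are handled.

```shell
#!/bin/sh
# Hypothetical helper: convert obdfilter_survey summary lines to CSV rows
# containing the values that follow the "obj", "thr" and "write" keywords.
survey_to_csv() {
    echo "obj,thr,write"
    awk '{
        obj = thr = bw = ""
        for (i = 1; i < NF; i++) {
            if ($i == "obj")   obj = $(i + 1)
            if ($i == "thr")   thr = $(i + 1)
            if ($i == "write") bw  = $(i + 1)
        }
        # Skip lines that do not carry all three fields.
        if (obj != "" && thr != "" && bw != "") print obj "," thr "," bw
    }'
}

# Sample line copied from the summary output shown above.
line='ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 [ 64.00, 82.00]'
printf '%s\n' "$line" | survey_to_csv
```

The resulting rows can be loaded into a spreadsheet or gnuplot to graph the write figure against the thread count for each object count.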
The ost_survey tool is a shell script that uses lfs setstripe to perform I/O against a single OST. The script writes a file (currently using dd) to each OST in the Lustre file system, and compares read and write speeds. The ost_survey tool is used to detect misbehaving disk subsystems.
Note - We have frequently discovered wide performance variations across all LUNs in a cluster.
To run the ost_survey script, supply a file size (in KB) and the Lustre mount point. For example, run:
$ ./ost-survey.sh 10 /mnt/lustre
Average read Speed: 6.73
Average write Speed: 5.41
read - Worst OST indx 0 5.84 MB/s
write - Worst OST indx 0 3.77 MB/s
read - Best OST indx 1 7.38 MB/s
write - Best OST indx 1 6.31 MB/s
3 OST devices found
Ost index 0 Read speed 5.84 Write speed 3.77
Ost index 0 Read time 0.17 Write time 0.27
Ost index 1 Read speed 7.38 Write speed 6.31
Ost index 1 Read time 0.14 Write time 0.16
Ost index 2 Read speed 6.98 Write speed 6.16
Ost index 2 Read time 0.14 Write time 0.16
The stats-collect utility contains the following scripts used to collect application profiling information from Lustre clients and servers:
The stats-collect utility requires:
Configuring the stats-collect utility is simple - all of the profiling configuration VARs are in the config.sh script.
XXX_INTERVAL is the profiling interval where the value of interval means:
If XXX_INTERVAL is not specified, then XXX statistics are not collected. XXX can be VMSTAT, SERVICE, BRW, SDIO, MBALLOC, IO, JBD, or CLIENT.
The gather_stats_everywhere.sh script should be run in three phases:
Starts statistics collection on each node specified in the config.sh script.
Stops collecting statistics on each node. If <log_name.tgz> is provided, it creates a profile tarball /tmp/<log_name.tgz>.
Analyzes the log_tarball and creates a csv tarball for this profiling tarball.
To collect profile information:
1. Start the collect profile daemon on each node.
sh gather_stats_everywhere.sh config.sh start
3. Stop the collect profile daemon on each node, clean up the temporary file and create a profiling tarball.
sh gather_stats_everywhere.sh config.sh stop log_tarball.tgz
4. Create a csv file according to the profile.
sh gather_stats_everywhere.sh config.sh analyse log_tarball.tgz csv
The PIOS test tool is a parallel I/O simulator for Linux and Solaris. PIOS generates I/O on file systems, block devices and zpools similar to what can be expected from a large Lustre OSS server when handling the load from many clients. The program generates and executes the I/O load in a manner substantially similar to an OSS, that is, multiple threads take work items from a simulated request queue. It forks a CPU load generator to simulate running on a system with additional load.
PIOS can read/write data to a single shared file or multiple files (default is a single file). To specify multiple files, use the --fpp option. (It is better to measure with both single and multiple files.) If the final argument is a file, block device or zpool, PIOS writes to RegionCount regions in one file. PIOS issues I/O commands of size ChunkSize. The regions are spaced apart Offset bytes (or, in the case of many files, the region starts at Offset bytes). In each region, RegionSize bytes are written or read, one ChunkSize I/O at a time. Note that:
ChunkSize <= RegionSize <= Offset
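The constraint can be checked before launching a run. The sketch below is not part of PIOS; the helper names to_bytes and check_sizes are hypothetical, and it assumes that the K, M and G suffixes denote powers of 1024, which should be confirmed against the PIOS byte specifiers.

```shell
#!/bin/sh
# Hypothetical helper: convert a size such as 256K or 8M to bytes.
# ASSUMPTION: K, M and G are powers of 1024.
to_bytes() {
    case "$1" in
        *K) echo $(( ${1%K} * 1024 )) ;;
        *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
        *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

# Hypothetical helper: report whether ChunkSize <= RegionSize <= Offset.
# Arguments: ChunkSize RegionSize Offset
check_sizes() {
    chunk=$(to_bytes "$1")
    region=$(to_bytes "$2")
    offset=$(to_bytes "$3")
    if [ "$chunk" -le "$region" ] && [ "$region" -le "$offset" ]; then
        echo "ok"
    else
        echo "invalid"
    fi
}

check_sizes 1M 8M 8M    # chunk fits in region, region fits in offset
check_sizes 4M 2M 8M    # chunk is larger than the region
```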
Multiple runs can be specified with comma separated lists of values for ChunkSize, Offset, RegionCount, ThreadCount, and RegionSize. Multiple runs can also be specified by giving a starting (low) value, increase (in percent) and high value for each of these arguments. If a low value is given, no value list or value may be supplied.
Every run is given a timestamp, and the timestamp and offset are written with every chunk (to allow verification). Before every run, PIOS executes the pre-run shell command. After every run, PIOS executes the post-run command. Typically, this is used to clear and collect statistics for the run, or to start and stop statistics gathering during the run. The timestamp is passed to both pre-run and post-run.
For convenience, PIOS understands byte specifiers and uses:
Download the PIOS test tool at:
http://downloads.lustre.org/public/tools/benchmarks/pios/
pios [--chunksize|-c =values, (--chunksize_low|-a =value --chunksize_high|-b =value --chunksize_incr|-g =value)]
     [--offset|-o =values, (--offset_low|-m =value --offset_high|-q =value --offset_incr|-r =value)]
     [--regioncount|-n =values, (--regioncount_low|-i =value --regioncount_high|-j =value --regioncount_incr|-k =value)]
     [--threadcount|-t =values, (--threadcount_low|-l =value --threadcount_high|-h =value --threadcount_incr|-e =value)]
     [--regionsize|-s =values, (--regionsize_low|-A =value --regionsize_high|-B =value --regionsize_incr|-C =value)]
     [--directio|-d, --posixio|-x, --cowio|-w]
     [--cleanup|-L --threaddelay|-T =ms --regionnoise|-I =shift --chunknoise|-N =bytes --fpp|-F]
     [--verify|-V =values]
     [--prerun|-P =pre-command --postrun|-R =post-command]
     [--path|-p =output-file-path]
There are several supported PIOS I/O modes:
--posixio - This is the default operational mode where I/O is done using standard POSIX calls, such as pwrite/pread. This mode is valid on both Linux and Solaris.
--directio - This mode corresponds to the O_DIRECT flag in the open(2) system call, and it is currently applicable only to Linux. Use this mode when using PIOS on the ldiskfs file system on an OSS.
--cowio - This mode corresponds to the copy overwrite operation where file system blocks that are being overwritten are copied to shadow files. Only use this mode if you want to see the overhead of preserving existing data (in case of overwrite). This mode is valid on both Linux and Solaris.
PIOS has five basic parameters to determine the amount of data that is being written.
ChunkSize - Amount of data that a thread writes in one attempt. ChunkSize should be a multiple of the file system block size.
RegionSize - Amount of data required to fill up a region. PIOS writes a ChunkSize of data continuously until it fills the RegionSize. RegionSize should be a multiple of ChunkSize.
RegionCount - Number of regions to write in one or multiple files. The total amount of data written by PIOS is RegionSize x RegionCount.
ThreadCount - Number of threads working on regions.
Offset - Distance between two successive regions when all threads are writing to the same file. In the case of multiple files, threads start writing in files at Offset bytes.
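The total load follows directly from the RegionSize x RegionCount rule stated above. A minimal arithmetic sketch (the helper name total_mb is hypothetical, and sizes are given in whole megabytes for simplicity):

```shell
#!/bin/sh
# Hypothetical helper: total data written, in MB.
# $1 = RegionCount, $2 = RegionSize in MB
total_mb() {
    echo $(( $1 * $2 ))
}

echo "$(total_mb 128 8) MB"    # 128 regions of 8 MB each: 1024 MB (1 GB)
echo "$(total_mb 16 2) MB"     # 16 regions of 2 MB each: 32 MB
```

Note that the thread count and the chunk size do not appear in the calculation; they change how the data is written, not how much of it there is.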
To create a 1 GB load with a different number of threads:
pios -t 1,2,4,8,16,32,64,128 -n 128 -c 1M -s 8M -o 8M --load=posixio -p /mnt/lustre
pios -t 1,2,4,8,16,32,64,128 -n 128 -c 1M -s 8M -o 8M --load=posixio,fpp -p /mnt/lustre
To create a 1 GB load with a different number of chunksizes on ldiskfs with direct I/O:
pios -t 32 -n 128 -c 128K,256K,512K,1M,2M,4M -s 8M -o 8M --load=directio -p /mnt/lustre
pios -t 32 -n 128 -c 128K,256K,512K,1M,2M,4M -s 8M -o 8M --load=directio,fpp -p /mnt/lustre
To create a 32 MB to 128 MB load with different RegionSizes on a Solaris zpool:
pios -t 8 -n 16 -c 1M -A 2M -B 8M -C 100 -o 8M --load=posixio -p /myzpool/
pios -t 8 -n 16 -c 1M -A 2M -B 8M -C 100 -o 8M --load=posixio,fpp -p /myzpool/
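The low, high and increment arguments in this example (-A 2M -B 8M -C 100) describe a series of RegionSize values. The sketch below shows one plausible reading, in which each value grows by the given percentage of the previous one until the high value is passed. This is an assumption: the exact stepping and rounding rules used by PIOS are not described here, and the helper name value_sequence is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the values from low to high, each one larger
# than the previous by incr percent.
# ASSUMPTION: this mirrors how PIOS steps through a low/high/incr range.
# $1 = low, $2 = high, $3 = increase in percent (all plain integers)
value_sequence() {
    value=$1
    high=$2
    incr=$3
    while [ "$value" -le "$high" ]; do
        echo "$value"
        step=$(( value * incr / 100 ))
        # Stop if the increment rounds down to nothing, to avoid looping forever.
        [ "$step" -gt 0 ] || break
        value=$(( value + step ))
    done
}

# RegionSize in MB for -A 2M -B 8M -C 100
value_sequence 2 8 100
```

Under this reading the example runs with RegionSize values of 2 MB, 4 MB and 8 MB, which with 16 regions gives loads of 32 MB, 64 MB and 128 MB, consistent with the range stated for the example.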
To read and verify timestamps:
pios -t 40 -n 1024 -c 256K -s 4M -o 8M --load=posixio -p /mnt/lustre
Keep the same parameters to read:
pios -t 40 -n 1024 -c 256K -s 4M -o 8M --load=posixio -p /mnt/lustre --verify
LNET self-test helps site administrators confirm that Lustre Networking (LNET) has been properly installed and configured, and that underlying network software and hardware are performing according to expectations.
LNET self-test is a kernel module that runs over LNET and LNDs. It is designed to:
Note - Apart from the performance impact, LNET self-test is invisible to Lustre.
This section describes basic concepts of LNET self-test, utilities and a sample script.
To run LNET self-test, these modules must be loaded: libcfs, lnet, lnet_selftest and one of the klnds (e.g., ksocklnd, ko2iblnd). To load all necessary modules, run modprobe lnet_selftest (this recursively loads the modules on which LNET self-test depends).
The LNET self-test cluster has two types of nodes:
The console and test nodes require all previously-listed modules to be loaded. (The userspace test node does not require these modules.)
LNET self-test has two user utilities, lst and lstclient.
In the context of LNET self-test, a test node can be associated with only one session at a time, which ensures that the session has exclusive use of the node. Almost all operations should be performed in a session context. From the console node, a user can only operate nodes in their own session. If a session ends, the session context in all test nodes is destroyed.
The console node can be used to create, change or destroy a session (new_session, end_session, show_session). For more information, see Session.
The console node is the user interface of the LNET self-test system, and can be any node in the test cluster. All self-test commands are entered from the console node. From the console node, a user can control and monitor the status of the entire test cluster (session). The console node is exclusive, meaning that a user cannot control two different sessions (LNET self-test clusters) on one node.
An LNET self-test group is just a named collection of nodes. There are no restrictions on group membership, i.e., a node can be included in any number of groups, and any number of groups can exist in a single LNET self-test session.
Each node in a group has a rank, determined by the order in which it was added to the group, which is used to establish test traffic patterns.
A user can only control nodes in his/her session. To allocate nodes to the session, the user needs to add nodes to a group (of the session). All nodes in a group can be referenced by the group's name. A node can be allocated to multiple groups of a session.
A test generates network load between two arbitrary groups of nodes - the test's "from" and "to" groups. When a test is running, each node in the "from" group sends requests to nodes in the "to" group, and receives responses in return. This activity is designed to mimic Lustre RPC traffic, i.e. the "from" group acts like a set of clients and the "to" group acts like a set of servers.
The traffic pattern and test intensity are determined by several properties, including test type, distribution of test nodes, concurrency of test, RDMA operation type, etc. Several of the available test parameters are described below.
--distribute 1:1 This is the default setting. Each "from" node communicates with the "to" node of the same rank (modulo the "to" group size). Note that if there are more "from" nodes than "to" nodes, some "from" nodes may share the same "to" nodes. Also, if there are more "to" nodes than "from" nodes, some higher-ranked "to" nodes will be idle.
--distribute 1:n (where 'n' is the size of the "to" group). Each "from" node communicates with every node in the "to" group.
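The 1:1 pairing rule can be illustrated with a few lines of shell. This sketch only prints which ranks would be paired and does not touch LNET; the helper name map_one_to_one is hypothetical, and numbering ranks from 0 is an assumption made for the illustration.

```shell
#!/bin/sh
# Hypothetical helper: show the "from" -> "to" pairing for --distribute 1:1,
# where each "from" rank talks to the "to" rank equal to its own rank
# modulo the size of the "to" group.
# ASSUMPTION: ranks are numbered from 0.
# $1 = number of "from" nodes, $2 = number of "to" nodes
map_one_to_one() {
    from_count=$1
    to_count=$2
    i=0
    while [ "$i" -lt "$from_count" ]; do
        echo "from[$i] -> to[$(( i % to_count ))]"
        i=$(( i + 1 ))
    done
}

# Five "from" nodes and two "to" nodes: the "to" nodes are shared.
map_one_to_one 5 2
```

With more "to" nodes than "from" nodes (for example, map_one_to_one 2 5), only to[0] and to[1] appear in the output, which corresponds to the idle higher-ranked nodes described above.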
A batch is an arbitrary collection of tests which are started and stopped together; they run in parallel. Each test should belong to a batch; tests should not exist individually. Users can control a test batch (run, stop); they cannot control individual tests. Tests in a batch are non-destructive to the file system, and can be run in a normal Lustre environment (provided the performance impact is acceptable).
The simplest batch might contain only a single test - running brw to determine whether network bandwidth will be an I/O bottleneck. In this example, the "to" group consists of Lustre OSSes and the "from" group includes the compute nodes. Adding a second test to perform pings from a login node to the MDS could tell you how much checkpointing would affect the ls -l process.
These are the steps to run a sample LNET self-test script simulating the traffic pattern of a set of Lustre servers on a TCP network, accessed by Lustre clients on an InfiniBand network (connected via LNET routers). In this example, half the clients are reading and half the clients are writing.
1. Load libcfs.ko, lnet.ko, ksocklnd.ko and lnet_selftest.ko on all test nodes and the console node.
2. Run this script on the console node:
#!/bin/bash
export LST_SESSION=$$
lst new_session read/write
lst add_group servers 192.168.10.[8,10,12-16]@tcp
lst add_group readers 192.168.1.[1-253/2]@o2ib
lst add_group writers 192.168.1.[2-254/2]@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers \
    brw read check=simple size=1M
lst add_test --batch bulk_rw --from writers --to servers \
    brw write check=full size=4K
# start running
lst run bulk_rw
# display server stats for 30 seconds
lst stat servers & sleep 30; kill $!
# tear down
lst end_session
Note - This script can be easily adapted to pass the group NIDs by shell variables or command line arguments (making it good for general-purpose use).
The LNET self-test (lst) utility is used to issue LNET self-test commands. The lst utility takes a number of command line arguments. The first argument is the command name and subsequent arguments are command-specific.
This section lists lst session commands.
Process Environment (LST_SESSION)
The lst utility uses the LST_SESSION environmental variable to identify the session locally on the self-test console node. This should be a numeric value that uniquely identifies all session processes on the node. It is convenient to set this to the process ID of the shell both for interactive use and in shell scripts. Almost all lst commands require LST_SESSION to be set.
new_session [--timeout SECONDS] [--force] NAME
Stops all operations and tests in the current session and clears the session’s status.
$ lst end_session
Shows the session information. This command prints information about the current session. It does not require LST_SESSION to be defined in the process environment.
$ lst show_session
This section lists lst group commands.
Creates the group and adds a list of test nodes to the group.
A string that may be expanded into one or more LNET NIDs.
$ lst add_group servers 192.168.10.[35,40-45]@tcp
$ lst add_group clients 192.168.1.[10-100]@tcp 192.168.[2,4].[10-20]@tcp
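To see which addresses a bracketed expression covers, the following sketch expands the text between one pair of brackets. It is not part of lst. The helper name expand_range_list is hypothetical, and reading a/b-c/n as "every n-th value from b to c" is an interpretation based on the readers and writers groups of the sample script, where [1-253/2] and [2-254/2] split the clients into two halves.

```shell
#!/bin/sh
# Hypothetical helper: expand the contents of one bracket pair, for example
# "8,10,12-16" or "1-253/2", into a space-separated list of numbers.
# ASSUMPTION: "first-last/step" selects every step-th value in the range.
expand_range_list() {
    result=""
    old_ifs=$IFS
    IFS=,
    for item in $1; do
        step=1
        case "$item" in
            */*) step=${item#*/}; item=${item%/*} ;;
        esac
        case "$item" in
            *-*) first=${item%-*}; last=${item#*-} ;;
            *)   first=$item; last=$item ;;
        esac
        n=$first
        while [ "$n" -le "$last" ]; do
            result="$result $n"
            n=$(( n + step ))
        done
    done
    IFS=$old_ifs
    echo "${result# }"
}

expand_range_list "8,10,12-16"
expand_range_list "1-9/2"
```

The first call corresponds to the last octet of the servers group in the sample script; the second shows the stride form on a short range.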
update_group NAME [--refresh] [--clean STATE] [--remove NIDs]
Updates the state of nodes in a group or adjusts a group’s membership. This command is useful if some nodes have crashed and should be excluded from the group.
list_group [NAME] [--active] [--busy] [--down] [--unknown] [--all]
Prints information about a group or lists all groups in the current session if no group is specified.
Lists all nodes.
ACTIVE BUSY DOWN UNKNOWN TOTAL
$ lst list_group clients --all
Removes a group from the session. If the group is referred to by any test, then the operation fails. If nodes in the group are referred to only by this group, then they are kicked out from the current session; otherwise, they are still in the current session.
$ lst del_group clients
Userland client (lstclient --sesid NID --group NAME)
Use lstclient to run the userland self-test client. lstclient should be executed after creating a session on the console. There are only two mandatory options for lstclient:
The test group to join.
Client1 $ lstclient --sesid 192.168.1.52@tcp --group clients
Also, lstclient has an optional argument, --server_mode, that forces LNET to behave as a server (start acceptor if the underlying NID needs it, use privileged ports, etc.):
Client1 $ lstclient --sesid 192.168.1.52@tcp --group clients --server_mode
Note - Only the super user is allowed to use the --server_mode option.
This section lists lst batch and test commands.
The default batch (named “batch”) is created when the session is started. However, the user can specify a batch name by using add_batch:
$ lst add_batch bulkperf
add_test --batch BATCH [--loop #] [--concurrency #] [--distribute #:#] --from GROUP --to GROUP TEST ...
Adds a test to a batch. For now, TEST can be brw or ping.
There are only two test types:
list_batch [NAME] [--test INDEX] [--active] [--invalid] [--server]
Lists batches in the current session or lists client|server nodes in a batch or a test.
$ lst run bulkperf
$ lst stop bulkperf
query NAME [--test INDEX] [--timeout #] [--loop #] [--delay #] [--all]
--test INDEX - Only queries the specified test. The test INDEX starts from 1.
--timeout # - The timeout value to wait for RPC. The default is 5 seconds.
--all - Lists the status of all nodes in a batch or a test.
This section lists other lst commands.
ping [--session] [--group NAME] [--nodes NIDs] [--batch name] [--server] [--timeout #]
Sends a “hello” query to the nodes.
stat [--bw] [--rate] [--read] [--write] [--max] [--min] [--avg] [--timeout #] [--delay #] GROUP|NIDs [GROUP|NIDs]
Collects performance and RPC statistics of one or more nodes.
Specifying a group name (GROUP) causes statistics to be gathered for all nodes in a test group. For example:
$ lst stat servers
where servers is the name of a test group created by lst add_group
Specifying a NID range (NIDs) causes statistics to be gathered for selected nodes. For example:
$ lst stat 192.168.0.[1-100/2]@tcp
Currently, only LNET performance statistics are available. By default, all statistics information is displayed. Users can specify additional information with these options.
show_error [--session] [GROUP]|[NIDs] ...
Lists the number of failed RPCs on test nodes.
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.