CHAPTER 18

Lustre I/O Kit

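This chapter describes the Lustre I/O kit and PIOS performance tool, and includes the following sections:

  • Lustre I/O Kit Description and Prerequisites
  • Running I/O Kit Tests
  • PIOS Test Tool
  • LNET Self-Test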


18.1 Lustre I/O Kit Description and Prerequisites

The Lustre I/O kit is a collection of benchmark tools for a Lustre cluster. The I/O kit can be used to validate the performance of the various hardware and software layers in the cluster and also as a way to find and troubleshoot I/O issues.

The I/O kit contains three tests. The first surveys basic performance of the device and bypasses the kernel block device layers, buffer cache and file system. The subsequent tests survey progressively higher layers of the Lustre stack. Typically with these tests, Lustre should deliver 85-90% of the raw device performance.

It is very important to establish performance from the “bottom up” perspective. First, the performance of a single raw device should be verified. Once this is complete, verify that performance is stable within a larger number of devices. Frequently, while troubleshooting such performance issues, we find that array performance with all LUNs loaded does not always match the performance of a single LUN when tested in isolation. After the raw performance has been established, other software layers can be added and tested in an incremental manner.

18.1.1 Downloading an I/O Kit

You can download the I/O kit from:

http://downloads.lustre.org/public/tools/lustre-iokit/

In this directory, you will find two packages:

18.1.2 Prerequisites to Using an I/O Kit

The following prerequisites must be met to use the Lustre I/O kit:


18.2 Running I/O Kit Tests

As mentioned above, the I/O kit contains these test tools:

  • sgpdd_survey
  • obdfilter_survey
  • ost_survey
  • stats-collect

18.2.1 sgpdd_survey

Use the sgpdd_survey tool to test bare metal performance, while bypassing as much of the kernel as possible. This script requires the sgp_dd package, although it does not require Lustre software. This survey may be used to characterize the performance of a SCSI device by simulating an OST serving multiple stripe files. The data gathered by this survey can help set expectations for the performance of a Lustre OST exporting the device.

The script uses sgp_dd to carry out raw sequential disk I/O. It runs with variable numbers of sgp_dd threads to show how performance varies with different request queue depths.

The script spawns variable numbers of sgp_dd instances, each reading or writing a separate area of the disk to demonstrate performance variance within a number of concurrent stripe files.

The device(s) used must meet one of the two tests described below:

SCSI device:

Must appear in the output of sg_map (make sure the kernel module "sg" is loaded)

Raw device:

Must appear in the output of raw -qa

If you need to create raw devices in order to use the sgpdd_survey tool, note that raw device 0 cannot be used due to a bug in certain versions of the "raw" utility (including the one shipped with RHEL4U4).

You may not mix raw and SCSI devices in the test specification.



Caution - The sgpdd_survey script overwrites the device being tested, which results in the LOSS OF ALL DATA on that device. Exercise caution when selecting the device to be tested.


The sgpdd_survey script must be customized according to the particular device being tested and also according to the location where it should keep its working files. Customization variables are described explicitly at the start of the script.

When the sgpdd_survey script runs, it creates a number of working files and a pair of result files. All files start with the prefix given by the script variable ${rslt}.

${rslt}_<date/time>.summary same as stdout
${rslt}_<date/time>_* tmp files
${rslt}_<date/time>.detail collected tmp files for post-mortem

The summary file and stdout should contain lines like this:

total_size 8388608K rsz 1024 thr 1 crg 1 180.45 MB/s 1 x 180.50 = 180.50 MB/s

The number immediately before the first MB/s is bandwidth, computed by measuring total data and elapsed time. The remaining numbers are a check on the bandwidths reported by the individual sgp_dd instances.

If there are so many threads that the sgp_dd script is unlikely to be able to allocate I/O buffers, then "ENOMEM" is printed.

If one or more sgp_dd instances do not successfully report a bandwidth number, then "failed" is printed.
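
As a minimal sketch of how the summary lines described above might be post-processed, the following shell function prints the aggregate bandwidth, taken as the number immediately before the first MB/s on each line. The function name sgpdd_bw is hypothetical and is not part of the I/O kit. Lines that contain no MB/s field (for example, those reporting ENOMEM) produce no output.

```shell
#!/bin/bash
# Hypothetical helper (not part of the I/O kit): print the aggregate
# bandwidth from sgpdd_survey summary lines read on stdin.
sgpdd_bw() {
    awk '{
        # The bandwidth is the field immediately before the first "MB/s".
        for (i = 2; i <= NF; i++) {
            if ($i == "MB/s") { print $(i - 1); break }
        }
    }'
}
```

For the sample line shown above, piping it through sgpdd_bw prints 180.45.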

18.2.1.1 Tuning sgpdd_survey

To get large I/O (1 MB) to disk, it may be necessary to tune several kernel and module parameters as specified:

/sys/block/sdN/queue/max_sectors_kb = 4096

/sys/block/sdN/queue/max_phys_segments = 256

/proc/scsi/sg/allow_dio = 1

/sys/module/ib_srp/parameters/srp_sg_tablesize = 255
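
Before changing anything, it can help to see how the current settings compare with the values listed above. The following is a minimal sketch under stated assumptions: the function name report_sgpdd_tuning is hypothetical, the function only reads the files, and a file that is missing on a given kernel or without the relevant module loaded is reported as absent.

```shell
#!/bin/bash
# Hypothetical helper: compare the current value of each tunable listed
# above with the suggested value for one SCSI disk (for example, sdb).
# It only reads the files; it changes nothing.
report_sgpdd_tuning() {
    local dev=$1 path want now
    while read -r path want; do
        if [ -r "$path" ]; then
            now=$(cat "$path")
        else
            now=absent    # file not provided by this kernel or module
        fi
        echo "$path want=$want now=$now"
    done <<EOF
/sys/block/$dev/queue/max_sectors_kb 4096
/sys/block/$dev/queue/max_phys_segments 256
/proc/scsi/sg/allow_dio 1
/sys/module/ib_srp/parameters/srp_sg_tablesize 255
EOF
}
```

Running report_sgpdd_tuning sdb prints one line per tunable, which makes it easy to spot values that differ from the suggested ones.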

 

 

18.2.2 obdfilter_survey

The obdfilter_survey script processes sequential I/O with varying numbers of threads and objects (files) by using lctl to drive the echo_client connected to local or remote obdfilter instances or remote obdecho instances. It can be used to characterize the performance of the following Lustre components:

OSTs

The script exercises one or more instances of obdfilter directly. The script may run on one or more nodes, for example, when the nodes are all attached to the same multi-ported disk subsystem.

Tell the script the names of all obdfilter instances (which should be up and running already). If some instances are on different nodes, specify their hostnames too (for example, node1:ost1). Alternatively, you can pass the parameter case=disk to the script. (The script automatically detects the local obdfilter instances.)

All obdfilter instances are driven directly. The script automatically loads the obdecho module (if required) and creates one instance of echo_client for each obdfilter instance.

Network

The script drives one or more instances of the obdecho server via instances of echo_client running on one or more nodes. Pass the parameters case=network and targets="<hostname/ip_of_server>" to the script. For each network case, the script does the required setup.

Striped File System Over the Network

The script drives one or more instances of obdfilter via instances of echo_client running on one or more nodes.

Tell the script the names of the OSCs (which should be up and running). Alternatively, you can pass the parameter case=netdisk to the script. The script will use all of the local OSCs.



Note - The obdfilter_survey script is NOT scalable to hundreds of nodes since it is only intended to measure individual servers, not the scalability of the entire system.




Note - The obdfilter_survey script must be customized, depending on the components under test and where the script’s working files should be kept. Customization variables are clearly described in the script (Customization Variables section). In particular, refer to the maximum supported value ranges for customization variables.


18.2.2.1 Running obdfilter_survey Against a Local Disk

The obdfilter_survey script supports automatic and manual runs against a local disk. The script profiles the overall throughput of storage hardware[1] by sending ranges of workloads to the OSTs (varied in thread counts and I/O sizes).

When the obdfilter_survey script is complete, it provides information on the performance abilities of the storage hardware and shows the saturation points. If you use plot scripts on the data, this information is shown graphically.

To run the obdfilter_survey script, create a normal Lustre configuration; no special setup is needed.

To perform an automatic run:

1. Set up the Lustre file system.

2. Verify that the obdecho.ko module is present.

3. Run the obdfilter_survey script with the parameter case=disk. For example:

$ nobjhi=2 thrhi=2 size=1024 case=disk sh obdfilter-survey

To perform a manual run:

1. List all OSTs you want to test. (You do not have to specify an MDS or LOV.)

2. On all OSSs, run:

$ mkfs.lustre --fsname spfs --mdt --mgs /dev/sda


Caution - Write tests are destructive. This test should be run before the Lustre file system is started. If you do this, you will not need to reformat to restart the Lustre system. However, if the obdfilter_survey test is terminated before it completes, you may have to remove objects from the disk.


3. Determine the obdfilter instance names on all OSSs. The device names appear in the fourth column of the lctl dl command output. For example:

$ pdsh -w oss[01-02] lctl dl |grep obdfilter |sort
oss01:	0 UP obdfilter oss01-sdb oss01-sdb_UUID 3
oss01:	2 UP obdfilter oss01-sdd oss01-sdd_UUID 3
oss02:	0 UP obdfilter oss02-sdi oss02-sdi_UUID 3
...

In this example, the obdfilter instance names are oss01-sdb, oss01-sdd, and oss02-sdi. Since you are driving obdfilter instances directly, set the shell array variable, targets, to the names of the obdfilter instances. For example:

targets='oss01:oss01-sdb oss01:oss01-sdd oss02:oss02-sdi' ./obdfilter-survey
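
The targets list can also be derived from the lctl dl output rather than typed by hand. The following is a minimal sketch, assuming the pdsh output format shown in the example above (a "host:" prefix followed by the lctl dl columns); the function name obdfilter_targets is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: turn "pdsh ... lctl dl" output (read on stdin) into
# a space-separated host:instance list suitable for the targets variable.
obdfilter_targets() {
    awk '$4 == "obdfilter" {
        host = $1
        sub(/:$/, "", host)          # pdsh prefixes each line with "host:"
        printf "%s%s:%s", sep, host, $5
        sep = " "
    }
    END { print "" }'
}
```

With the three lines of sample output above on stdin, the function prints oss01:oss01-sdb oss01:oss01-sdd oss02:oss02-sdi. Note that the pdsh prefix shifts the instance name from the fourth column to the fifth.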

18.2.2.2 Running obdfilter_survey Against a Network

The obdfilter_survey script can only be run automatically against a network; no manual test is supported.

To run the network test, a specific Lustre setup is needed. Make sure that these configuration requirements have been met.

To perform an automatic run:

1. Run the obdfilter_survey script with the parameters case=network and targets="<hostname/ip_of_server>". For example:

$ nobjhi=2 thrhi=2 size=1024 targets="<hostname/ip_of_server>" case=network sh obdfilter-survey

On the server side, you can see the statistics at:

/proc/fs/lustre/obdecho/<echo_srv>/stats

where 'echo_srv' is the obdecho server created by the script.

18.2.2.3 Running obdfilter_survey Against a Network Disk

The obdfilter_survey script can be run automatically or manually against a network disk.

To run the network disk test, create a Lustre configuration using normal methods; no special setup is needed.

To perform an automatic run:

1. Set up the Lustre file system with the required OSTs.

2. Verify that the obdecho.ko module is present.

3. Run the obdfilter_survey script with the parameter case=netdisk. For example:

$ nobjhi=2 thrhi=2 size=1024 case=netdisk sh obdfilter-survey

To perform a manual run:

1. Run the obdfilter_survey script and tell the script the names of all echo_client instances (which should be up and running already).

$ nobjhi=2 thrhi=2 size=1024 targets="<osc_name> ..." sh obdfilter-survey

18.2.2.4 Output Files

When the obdfilter_survey script runs, it creates a number of working files and a pair of result files. All files start with the prefix given by ${rslt}.


File                   Description
${rslt}.summary        Same as stdout
${rslt}.script_*       Per-host test script files
${rslt}.detail_tmp*    Per-OST result files
${rslt}.detail         Collected result files for post-mortem


The obdfilter_survey script iterates over the given number of threads and objects performing the specified tests and checks that all test processes have completed successfully.



Note - The obdfilter_survey script may not clean up properly if it is aborted or if it encounters an unrecoverable error. In this case, a manual cleanup may be required, possibly including killing any running instances of 'lctl' (local or remote), removing echo_client instances created by the script and unloading obdecho.


18.2.2.5 Script Output

The summary file and stdout of the obdfilter_survey script contain lines such as:

ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 [ 64.00, 82.00]

Where:


Parameter          Description
ost 8              Total number of OSTs being tested.
sz 67108864K       Total amount of data read or written (in KB).
rsz 1024           Record size (size of each echo_client I/O, in KB).
obj 8              Total number of objects over all OSTs.
thr 8              Total number of threads over all OSTs and objects.
write              Test name. If more tests have been specified, they all appear on the same line.
613.54             Aggregate bandwidth over all OSTs (measured by dividing the total number of MB by the elapsed time).
[ 64.00, 82.00]    Minimum and maximum instantaneous bandwidths on an individual OST.




Note - Although the numbers of threads and objects are specified per-OST in the customization section of the script, the reported results are aggregated over all OSTs.


18.2.2.6 Visualizing Results

It is useful to import the obdfilter_survey script summary data (it is fixed width) into Excel (or any graphing package) and graph the bandwidth versus the number of threads for varying numbers of concurrent regions. This shows how the OSS performs for a given number of concurrently-accessed objects (files) with varying numbers of I/Os in flight.

It is also extremely useful to record average disk I/O sizes during each test. These numbers help locate pathologies in the file system block allocator and the block device elevator.

The plot-obdfilter script (included) is an example of processing output files to a .csv format and plotting a graph using gnuplot.
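
As an illustration of the kind of conversion involved, the following minimal sketch turns summary lines into CSV rows that a spreadsheet or gnuplot can read. It is not the plot-obdfilter script; the function name summary_to_csv is hypothetical, it assumes the field order of the sample line in the Script Output section, and it reports only the first test on each line.

```shell
#!/bin/bash
# Hypothetical helper (not the plot-obdfilter script): convert
# obdfilter_survey summary lines on stdin to CSV rows for graphing.
# Only the first test on each line is reported.
summary_to_csv() {
    awk 'BEGIN { print "obj,thr,test,bandwidth" }
    $1 == "ost" && $7 == "obj" && $9 == "thr" {
        print $8 "," $10 "," $11 "," $12
    }'
}
```

For the sample summary line, the output is a header row followed by 8,8,write,613.54.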

18.2.3 ost_survey

The ost_survey tool is a shell script that uses lfs setstripe to perform I/O against a single OST. The script writes a file (currently using dd) to each OST in the Lustre file system, and compares read and write speeds. The ost_survey tool is used to detect misbehaving disk subsystems.



Note - We have frequently discovered wide performance variations across all LUNs in a cluster.


To run the ost_survey script, supply a file size (in KB) and the Lustre mount point. For example, run:

$ ./ost-survey.sh 10 /mnt/lustre
Average read Speed:		6.73
Average write Speed:		5.41
read - Worst OST indx 0		5.84 MB/s
write - Worst OST indx 0	3.77 MB/s
read - Best OST indx 1		7.38 MB/s
write - Best OST indx 1		6.31 MB/s
3 OST devices found
Ost index 0 Read speed	5.84	Write speed	3.77
Ost index 0 Read time	0.17	Write time	0.27
Ost index 1 Read speed	7.38	Write speed	6.31
Ost index 1 Read time	0.14	Write time	0.16
Ost index 2 Read speed	6.98	Write speed	6.16
Ost index 2 Read time	0.14	Write time	0.16 

18.2.4 stats-collect

The stats-collect utility contains the following scripts used to collect application profiling information from Lustre clients and servers:

The stats-collect utility requires:

Configuring stats-collect

Configuring the stats-collect utility is simple: all of the profiling configuration variables are in the config.sh script.

XXX_INTERVAL is the profiling interval where the value of interval means:

If XXX_INTERVAL is not specified, then XXX statistics are not collected. XXX can be VMSTAT, SERVICE, BRW, SDIO, MBALLOC, IO, JBD or CLIENT.

Running stats-collect

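The gather_stats_everywhere.sh script should be run in three phases:

sh gather_stats_everywhere.sh config.sh start

Starts statistics collection on each node specified in the config.sh script.

sh gather_stats_everywhere.sh config.sh stop <log_name.tgz>

Stops collecting statistics on each node. If <log_name.tgz> is provided, it creates a profile tarball /tmp/<log_name.tgz>.

sh gather_stats_everywhere.sh config.sh analyse log_tarball.tgz csv

Analyzes the log_tarball and creates a csv tarball for this profiling tarball.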

Examples

To collect profile information:

1. Start the collect profile daemon on each node.

sh gather_stats_everywhere.sh config.sh start 

2. Run your test.

3. Stop the collect profile daemon on each node, clean up the temporary file and create a profiling tarball.

sh gather_stats_everywhere.sh config.sh stop log_tarball.tgz

4. Create a csv file according to the profile.

sh gather_stats_everywhere.sh config.sh analyse log_tarball.tgz csv

 

 


18.3 PIOS Test Tool

The PIOS test tool is a parallel I/O simulator for Linux and Solaris. PIOS generates I/O on file systems, block devices and zpools similar to what can be expected from a large Lustre OSS server when handling the load from many clients. The program generates and executes the I/O load in a manner substantially similar to an OSS, that is, multiple threads take work items from a simulated request queue. It forks a CPU load generator to simulate running on a system with additional load.

PIOS can read/write data to a single shared file or multiple files (default is a single file). To specify multiple files, use the --fpp option. (It is better to measure with both single and multiple files.) If the final argument is a file, block device or zpool, PIOS writes to RegionCount regions in one file. PIOS issues I/O commands of size ChunkSize. The regions are spaced apart Offset bytes (or, in the case of many files, the region starts at Offset bytes). In each region, RegionSize bytes are written or read, one ChunkSize I/O at a time. Note that:

ChunkSize <= RegionSize <= Offset

Multiple runs can be specified with comma separated lists of values for ChunkSize, Offset, RegionCount, ThreadCount, and RegionSize. Multiple runs can also be specified by giving a starting (low) value, increase (in percent) and high value for each of these arguments. If a low value is given, no value list or value may be supplied.

Every run is given a timestamp, and the timestamp and offset are written with every chunk (to allow verification). Before every run, PIOS executes the pre-run shell command. After every run, PIOS executes the post-run command. Typically, this is used to clear and collect statistics for the run, or to start and stop statistics gathering during the run. The timestamp is passed to both pre-run and post-run.

For convenience, PIOS understands byte specifiers and uses:

K,k for kilobytes (1<<10)

M,m for megabytes (1<<20)

G,g for gigabytes (1<<30)

T,t for terabytes (1<<40)
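
The byte specifiers and the ordering rule ChunkSize <= RegionSize <= Offset can be checked before a run. The following is a minimal sketch, not PIOS code: the function names to_bytes and sizes_ok are hypothetical, and treating a bare number as a byte count is an assumption made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: convert a PIOS-style byte specifier (for example
# 256K or 8M) to a number of bytes. A bare number is taken as bytes
# (an assumption made for this sketch).
to_bytes() {
    local spec=$1 num unit
    num=${spec%[KkMmGgTt]}      # strip one trailing unit letter, if any
    unit=${spec#"$num"}         # what remains is the unit (or nothing)
    case $unit in
        [Kk]) echo $(( num << 10 )) ;;
        [Mm]) echo $(( num << 20 )) ;;
        [Gg]) echo $(( num << 30 )) ;;
        [Tt]) echo $(( num << 40 )) ;;
        "")   echo "$num" ;;
    esac
}

# Return success if ChunkSize <= RegionSize <= Offset holds.
sizes_ok() {
    local c s o
    c=$(to_bytes "$1"); s=$(to_bytes "$2"); o=$(to_bytes "$3")
    [ "$c" -le "$s" ] && [ "$s" -le "$o" ]
}
```

For example, sizes_ok 1M 8M 8M succeeds, while sizes_ok 8M 4M 8M fails because the chunk size is larger than the region size.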

Download the PIOS test tool at:

http://downloads.lustre.org/public/tools/benchmarks/pios/

18.3.1 Synopsis

pios 
[--chunksize|-c =values, (--chunksize_low|-a =value 
--chunksize_high|-b =value --chunksize_incr|-g =value)]
 
[--offset|-o =values, (--offset_low|-m =value --offset_high|-q =value
--offset_incr|-r =value)]
 
[--regioncount|-n =values, (--regioncount_low|-i =value 
--regioncount_high|-j =value --regioncount_incr|-k =value)]
 
[--threadcount|-t =values, (--threadcount_low|-l =value
--threadcount_high|-h =value --threadcount_incr|-e =value)]
 
[--regionsize|-s =values, (--regionsize_low|-A =value 
--regionsize_high|-B =value --regionsize_incr|-C =value)]
 
[--directio|-d, --posixio|-x, --cowio|-w] [--cleanup|-L 
--threaddelay|-T =ms --regionnoise|-I =shift 
--chunknoise|-N =bytes --fpp|-F ]
 
[--verify|-V =values]
 
[--prerun|-P =pre-command --postrun|-R =post-command]
 
[--path|-p =output-file-path]

18.3.2 PIOS I/O Modes

There are several supported PIOS I/O modes:

POSIX I/O:

This is the default operational mode where I/O is done using standard POSIX calls, such as pwrite/pread. This mode is valid on both Linux and Solaris.

DIRECT I/O:

This mode corresponds to the O_DIRECT flag in open(2) system call, and it is currently applicable only to Linux. Use this mode when using PIOS on the ldiskfs file system on an OSS.

COW I/O:

This mode corresponds to the copy overwrite operation where file system blocks that are being overwritten are copied to shadow files. Only use this mode if you want to see the overhead of preserving existing data (in case of overwrite). This mode is valid on both Linux and Solaris.

18.3.3 PIOS Parameters

PIOS has five basic parameters to determine the amount of data that is being written.

ChunkSize(c):

Amount of data that a thread writes in one attempt. ChunkSize should be a multiple of file system block size.

RegionSize(s):

Amount of data required to fill up a region. PIOS writes a chunksize of data continuously until it fills the regionsize. RegionSize should be a multiple of ChunkSize.

RegionCount(n):

Number of regions to write in one or multiple files. The total amount of data written by PIOS is RegionSize x RegionCount.

ThreadCount(t):

Number of threads working on regions.

Offset(o):

Distance between two successive regions when all threads are writing to the same file. In the case of multiple files, threads start writing in files at Offset bytes.


Parameter

Description

--chunknoise = N

N is a byte specifier. When performing an I/O task, add a random signed integer in the range [-N,N] to the chunksize. All regions are still fully written. This randomizes the I/O size to some extent.

--chunksize = N[,N2,N3...]

N is a byte specifier. PIOS performs I/O in chunks of N kilo-, mega-, giga- or terabytes. You can give a comma separated list of multiple values. This argument is mutually exclusive with --chunksize_low. Note that each thread allocates a buffer of size chunksize + chunknoise for use during the run.

--chunksize_low=L
--chunksize_high=H
--chunksize_incr=F

Performs a sequence of operations starting with a chunksize of L, increasing it by F% each time until chunksize exceeds H.

--cleanup

Removes files that were created during the run. If existing files are encountered, they are overwritten.

--directio
--posixio
--cowio

One of these arguments must be passed to indicate whether DIRECT I/O, POSIX I/O or COW I/O is used.

--offset=O[,O2,O3...]

The argument is a byte specifier or a list of specifiers. Each run uses regions at offset multiple of O in a single file. If the run targets multiple files, then the I/O writes at offset O in each file.

--offset_low=OL
--offset_high=OH
--offset_incr=P

The arguments are byte specifiers. They generate runs with a range of offsets starting at OL, increasing P% until the offset exceeds OH. Each of these arguments is exclusive with the offset argument.

--prerun="pre-command"

Before each run, executes the pre-command as a shell command through the system(3) call. The timestamp of the run is appended as the last argument to the pre-command string. Typically, this is used to clear statistics or start a data collection script when the run starts.

--postrun="post-command"

After each run, executes the post-command as a shell command through the system(3) call. The timestamp of the run is appended as the last argument to the post-command string. Typically, this is used to append statistics for the run or close an open data collection script when the run completes.

--regioncount=N[,N2,N3...]

PIOS writes to N regions in a single file or block device or to N files.

--regioncount_low=RL
--regioncount_high=RH
--regioncount_incr=P

Generate runs with a range of region counts starting at RL, increasing P% until the region count exceeds RH. Each of these arguments is exclusive with the regioncount argument.

--regionnoise=k

When generating the next I/O task, do not select the next chunk in the next stream, but shift a random number with a maximum noise of shifting k regions ahead. The run will complete when all regions are fully written or read. This merely introduces a randomization of the ordering.

--regionsize=S[,S2,S3...]

The argument is a byte specifier or a list of byte specifiers. During the run(s), write S bytes to each region.

--regionsize_low=RL
--regionsize_high=RH
--regionsize_incr=P

The arguments are byte specifiers. Generate runs with a range of region sizes starting at RL, increasing P% until the region size exceeds RH. Each argument is exclusive with the regionsize argument.

--threadcount=T[,T2,T3...]

PIOS runs with T threads performing I/O. A sequence of values may be given.

--threadcount_low=TL
--threadcount_high=TH
--threadcount_incr=TP

Generate runs with a range of thread counts starting at TL, increasing TP% until the thread count exceeds TH. Each of these arguments is exclusive with the threadcount argument.

--threaddelay=ms

A random delay not exceeding ms is inserted between the time that a thread identifies the next chunk it needs to read or write and the time it starts the I/O.

--fpp

Where threads write to files:

  • fpp indicates files per process behavior where threads write to multiple files.
  • sff indicates single shared files where all threads write to the same file.

--verify|-V [=timestamp[,timestamp2,timestamp3...]]

Verify a written file or set of files. A single timestamp or sequence of timestamps can be given for each run, respectively. If no argument is passed, the verification is done from timestamps read from the first location of files previously written in the test. If sequence is given, then each run verifies the timestamp accordingly. If a single timestamp is given, then it is verified with all files written.
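
The _low, _high and _incr arguments described above each define a geometric sequence of values. The following minimal sketch lists the values such a triple would produce. It is an illustration, not PIOS code: the function name pios_range is hypothetical, and it assumes that each step increases the value by the given percentage and that the sequence stops once the value exceeds the high limit. Values are plain integers in whatever unit you choose.

```shell
#!/bin/bash
# Hypothetical helper: list the values a _low/_high/_incr triple would
# produce, assuming each step adds incr percent to the current value and
# the sequence stops once the value exceeds high. Integer units only.
pios_range() {
    local value=$1 high=$2 incr=$3 step
    while [ "$value" -le "$high" ]; do
        echo "$value"
        step=$(( value * incr / 100 ))
        [ "$step" -gt 0 ] || break    # avoid looping forever on a zero step
        value=$(( value + step ))
    done
}
```

For a range from 2 to 8 with a 100% increase, pios_range 2 8 100 prints 2, 4 and 8 on separate lines.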


18.3.4 PIOS Examples

To create a 1 GB load with a different number of threads:

In one file:

pios -t 1,2,4,8,16,32,64,128 -n 128 -c 1M -s 8M -o 8M --load=posixio -p /mnt/lustre

In multiple files:

pios -t 1,2,4,8,16,32,64,128 -n 128 -c 1M -s 8M -o 8M --load=posixio,fpp -p /mnt/lustre

To create a 1 GB load with different chunk sizes on ldiskfs with direct I/O:

In one file:

pios -t 32 -n 128 -c 128K,256K,512K,1M,2M,4M -s 8M -o 8M --load=directio -p /mnt/lustre

In multiple files:

pios -t 32 -n 128 -c 128K,256K,512K,1M,2M,4M -s 8M -o 8M --load=directio,fpp -p /mnt/lustre

To create a 32 MB to 128 MB load with different RegionSizes on a Solaris zpool:

In one file:

pios -t 8 -n 16 -c 1M -A 2M -B 8M -C 100 -o 8M --load=posixio -p /myzpool/

In multiple files:

pios -t 8 -n 16 -c 1M -A 2M -B 8M -C 100 -o 8M --load=posixio,fpp -p /myzpool/

To read and verify timestamps:

Create a load with PIOS:

pios -t 40 -n 1024 -c 256K -s 4M -o 8M --load=posixio -p /mnt/lustre

Keep the same parameters to read:

pios -t 40 -n 1024 -c 256K -s 4M -o 8M --load=posixio -p /mnt/lustre --verify


18.4 LNET Self-Test

LNET self-test helps site administrators confirm that Lustre Networking (LNET) has been properly installed and configured, and that underlying network software and hardware are performing according to expectations.

LNET self-test is a kernel module that runs over LNET and LNDs. It is designed to:



Note - Apart from the performance impact, LNET self-test is invisible to Lustre.


18.4.1 Basic Concepts of LNET Self-Test

This section describes basic concepts of LNET self-test, utilities and a sample script.

18.4.1.1 Modules

To run LNET self-test, these modules must be loaded: libcfs, lnet, lnet_selftest and one of the klnds (for example, ksocklnd or ko2iblnd). To load all necessary modules, run modprobe lnet_selftest (this recursively loads the modules on which LNET self-test depends).

The LNET self-test cluster has two types of nodes: the console node and test nodes.

The console and test nodes require all previously-listed modules to be loaded. (The userspace test node does not require these modules.)



Note - Test nodes can be in either kernel or userspace. A console user can invite a kernel test node to join the test session by running lst add_group NID, but the user cannot actively add a userspace test node to the test-session. However, the console user can passively accept a test node to the test session while the test node runs lstclient to connect to the console.


18.4.1.2 Utilities

LNET self-test has two user utilities, lst and lstclient.

18.4.1.3 Session

In the context of LNET self-test, a test node can be associated with only one session at a time, to ensure that the session has exclusive use of the node. Almost all operations should be performed in a session context. From the console node, a user can only operate nodes in their own session. If a session ends, the session context in all test nodes is destroyed.

The console node can be used to create, change or destroy a session (new_session, end_session, show_session). For more information, see Session.

18.4.1.4 Console

The console node is the user interface of the LNET self-test system, and can be any node in the test cluster. All self-test commands are entered from the console node. From the console node, a user can control and monitor the status of the entire test cluster (session). The console node is exclusive, meaning that a user cannot control two different sessions (LNET self-test clusters) on one node.

18.4.1.5 Group

An LNET self-test group is just a named collection of nodes. There are no restrictions on group membership, i.e., a node can be included in any number of groups, and any number of groups can exist in a single LNET self-test session.

Each node in a group has a rank, determined by the order in which it was added to the group, which is used to establish test traffic patterns.

A user can only control nodes in his or her own session. To allocate nodes to the session, the user needs to add nodes to a group (of the session). All nodes in a group can be referenced by the group's name. A node can be allocated to multiple groups of a session.



Note - A console user can associate kernel space test nodes with the session by running lst add_group NIDs, but a userspace test node cannot be actively added to the session. However, the console user can passively "accept" a test node into the test session when the test node runs lstclient to connect to the console node (for example, lstclient --sesid CONSOLE_NID --group NAME).


18.4.1.6 Test

A test generates network load between two arbitrary groups of nodes - the test's "from" and "to" groups. When a test is running, each node in the "from" group sends requests to nodes in the "to" group, and receives responses in return. This activity is designed to mimic Lustre RPC traffic, i.e. the "from" group acts like a set of clients and the "to" group acts like a set of servers.

The traffic pattern and test intensity are determined by several properties, including test type, distribution of test nodes, concurrency of test, RDMA operation type, etc. Several of the available test parameters are described below.

--distribute 1:1 This is the default setting. Each "from" node communicates with the "to" node of the same rank (modulo the "to" group size). Note that if there are more "from" nodes than "to" nodes, some "from" nodes may share the same "to" nodes. Also, if there are more "to" nodes than "from" nodes, some higher-ranked "to" nodes will be idle.

--distribute 1:n (where 'n' is the size of the "to" group). Each "from" node communicates with every node in the "to" group.
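
The 1:1 pairing described above can be sketched as a rank calculation. This is an illustration only, not LNET self-test code: the function name distribute_1to1 is hypothetical, and numbering ranks from 0 is an assumption made for the example.

```shell
#!/bin/bash
# Hypothetical helper: print which "to" rank each "from" rank talks to
# under --distribute 1:1, given the two group sizes. Ranks start at 0
# here, which is an assumption made for illustration.
distribute_1to1() {
    local from_size=$1 to_size=$2 rank
    for (( rank = 0; rank < from_size; rank++ )); do
        echo "from $rank -> to $(( rank % to_size ))"
    done
}
```

With three "from" nodes and two "to" nodes, distribute_1to1 3 2 shows "from" ranks 0 and 2 sharing "to" rank 0. With more "to" nodes than "from" nodes, the higher "to" ranks never appear in the output, which corresponds to the idle nodes mentioned above.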

18.4.1.7 Batch

A batch is an arbitrary collection of tests which are started and stopped together; they run in parallel. Each test should belong to a batch; tests should not exist individually. Users can control a test batch (run, stop); they cannot control individual tests. Tests in a batch are non-destructive to the file system, and can be run in a normal Lustre environment (provided the performance impact is acceptable).

The simplest batch might contain only a single test - running brw to determine whether network bandwidth will be an I/O bottleneck. In this example, the "to" group is comprised of Lustre OSSes and the "from" group includes the compute nodes. Adding a second test to perform pings from a login node to the MDS could tell you how much checkpointing would affect the ls -l process.

18.4.1.8 Sample Script

These are the steps to run a sample LNET self-test script simulating the traffic pattern of a set of Lustre servers on a TCP network, accessed by Lustre clients on an InfiniBand network (connected via LNET routers). In this example, half the clients are reading and half the clients are writing.

1. Load libcfs.ko, lnet.ko, ksocklnd.ko and lnet_selftest.ko on all test nodes and the console node.

2. Run this script on the console node:

#!/bin/bash
export LST_SESSION=$$
lst new_session read/write
lst add_group servers 192.168.10.[8,10,12-16]@tcp
lst add_group readers 192.168.1.[1-253/2]@o2ib
lst add_group writers 192.168.1.[2-254/2]@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers \
brw read check=simple size=1M
lst add_test --batch bulk_rw --from writers --to servers \
brw write check=full size=4K
# start running
lst run bulk_rw
# display server stats for 30 seconds
lst stat servers & sleep 30; kill $!
# tear down
lst end_session


Note - This script can be easily adapted to pass the group NIDs by shell variables or command line arguments (making it good for general-purpose use).
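
One possible adaptation is sketched below. It wraps the sample script in a function that takes the group NIDs as arguments. The function name lst_bulk_rw, the run wrapper and the DRY_RUN variable are hypothetical names introduced for this sketch; only the lst commands themselves come from the sample script. With DRY_RUN=1, each command is printed instead of executed, so the sequence can be reviewed first.

```shell
#!/bin/bash
# Sketch of the sample script with group NIDs passed as arguments.
# run() is a hypothetical wrapper: with DRY_RUN=1 it prints each command
# instead of executing it, so the sequence can be reviewed first.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Usage: lst_bulk_rw SERVER_NIDS READER_NIDS WRITER_NIDS [STAT_SECONDS]
lst_bulk_rw() {
    local servers=$1 readers=$2 writers=$3 secs=${4:-30}
    export LST_SESSION=$$
    run lst new_session read/write
    run lst add_group servers "$servers"
    run lst add_group readers "$readers"
    run lst add_group writers "$writers"
    run lst add_batch bulk_rw
    run lst add_test --batch bulk_rw --from readers --to servers \
        brw read check=simple size=1M
    run lst add_test --batch bulk_rw --from writers --to servers \
        brw write check=full size=4K
    run lst run bulk_rw
    # display server stats for the requested time, then stop lst stat
    run lst stat servers &
    sleep "$secs"
    kill $! 2>/dev/null || true
    run lst end_session
}
```

For example, DRY_RUN=1 lst_bulk_rw '192.168.10.[8,10,12-16]@tcp' '192.168.1.[1-253/2]@o2ib' '192.168.1.[2-254/2]@o2ib' prints the same command sequence as the sample script. Quoting the NID expressions keeps the shell from treating the brackets as file name patterns.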


18.4.2 LNET Self-Test Commands

The LNET self-test (lst) utility is used to issue LNET self-test commands. The lst utility takes a number of command line arguments. The first argument is the command name and subsequent arguments are command-specific.

18.4.2.1 Session

This section lists lst session commands.

Process Environment (LST_SESSION)

The lst utility uses the LST_SESSION environment variable to identify the session locally on the self-test console node. This should be a numeric value that uniquely identifies all session processes on the node. It is convenient to set this to the process ID of the shell, both for interactive use and in shell scripts. Almost all lst commands require LST_SESSION to be set.

new_session [--timeout SECONDS] [--force] NAME

Creates a new session.


--timeout SECONDS

Console timeout value of the session. The session ends automatically if it remains idle (i.e., no commands are issued) for this period.

--force

Ends conflicting sessions. This determines who “wins” when one session conflicts with another. For example, if there is already an active session on this node, then this attempt to create a new session fails unless the --force flag is specified. If the --force flag is specified, the other session is ended. Similarly, if this session attempts to add a node that is already “owned” by another session, the --force flag allows this session to “steal” the node.

NAME

A human-readable string to print when listing sessions or reporting session conflicts.

$ export LST_SESSION=$$

$ lst new_session --force liangzhen


end_session

Stops all operations and tests in the current session and clears the session’s status.

$ lst end_session

show_session

Shows the session information. This command prints information about the current session. It does not require LST_SESSION to be defined in the process environment.

$ lst show_session

18.4.2.2 Group

This section lists lst group commands.

add_group NAME NIDs [NIDs...]

Creates the group and adds a list of test nodes to the group.


NAME

Name of the group.

NIDs

A string that may be expanded into one or more LNET NIDs.

$ lst add_group servers 192.168.10.[35,40-45]@tcp

$ lst add_group clients 192.168.1.[10-100]@tcp 192.168.[2,4].[10-20]@tcp
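To preview which NIDs a pattern covers before passing it to add_group, the bracket syntax can be expanded in the shell. The following is a minimal sketch, assuming the syntax is what the examples in this chapter suggest (comma-separated items, each a number, a lo-hi range, or a lo-hi/step range). The helper names expand_range and expand_nids are hypothetical; lst performs its own expansion internally.

```shell
# Expand one bracket body such as "8,10,12-16" or "1-253/2" into numbers, one per line.
# The body runs in a subshell, so IFS and the working variables stay local.
expand_range() (
    IFS=,
    set -- $1                      # split the expression on commas
    for item in "$@"; do
        step=1
        case $item in
            */*) step=${item#*/}; item=${item%/*} ;;
        esac
        lo=${item%-*}; hi=${item#*-}   # a bare number gives lo == hi
        i=$lo
        while [ "$i" -le "$hi" ]; do
            echo "$i"
            i=$((i + step))
        done
    done
)

# Expand a NID pattern, handling bracketed fields left to right by recursion.
expand_nids() (
    case $1 in
        *\[*)
            prefix=${1%%\[*}; rest=${1#*\[}
            body=${rest%%\]*}; suffix=${rest#*\]}
            for n in $(expand_range "$body"); do
                expand_nids "$prefix$n$suffix"
            done ;;
        *)  echo "$1" ;;
    esac
)
```

For example, expand_nids '192.168.10.[35,40-45]@tcp' prints the seven server NIDs, one per line.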


update_group NAME [--refresh] [--clean STATUS] [--remove NIDs]

Updates the state of nodes in a group or adjusts a group’s membership. This command is useful if some nodes have crashed and should be excluded from the group.


--refresh

Refreshes the state of all inactive nodes in the group.

--clean STATUS

Removes nodes with a specified status from the group. Status may be:

active

The node is in the current session.

busy

The node is now owned by another session.

down

The node has been marked down.

unknown

The node’s status has yet to be determined.

invalid

Any state but active.

--remove NIDs

Removes specified nodes from the group.

$ lst update_group clients --refresh
$ lst update_group clients --clean busy
$ lst update_group clients --clean invalid   # invalid == busy || down || unknown
$ lst update_group clients --remove 192.168.1.[10-20]@tcp


list_group [NAME] [--active] [--busy] [--down] [--unknown] [--all]

Prints information about a group or lists all groups in the current session if no group is specified.


NAME

The name of the group.

--active

Lists the active nodes.

--busy

Lists the busy nodes.

--down

Lists the down nodes.

--unknown

Lists unknown nodes.

--all

Lists all nodes.

$ lst list_group

1) clients

2) servers

Total 2 groups

$ lst list_group clients

ACTIVE BUSY DOWN UNKNOWN TOTAL

3 1 2 0 6

$ lst list_group clients --all

192.168.1.10@tcp Active

192.168.1.11@tcp Active

192.168.1.12@tcp Busy

192.168.1.13@tcp Active

192.168.1.14@tcp DOWN

192.168.1.15@tcp DOWN

Total 6 nodes

$ lst list_group clients --busy

192.168.1.12@tcp Busy

Total 1 node
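Before running a batch, it can be useful to confirm that every node in a group is active. The following is a minimal sketch that reads the summary printed by lst list_group NAME and succeeds only if the ACTIVE count equals TOTAL. The helper name all_active is hypothetical, and the column layout is assumed to match the sample output above.

```shell
# Succeed only if every node in the group summary is active.
# Expects the "ACTIVE BUSY DOWN UNKNOWN TOTAL" summary on stdin (layout assumed from the sample).
all_active() {
    awk 'NF == 5 && $1 ~ /^[0-9]+$/ { found = 1; ok = ($1 == $5) }
         END { exit !(found && ok) }'
}
```

A typical use would be: lst list_group clients | all_active || echo "some clients are not active".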


del_group NAME

Removes a group from the session. If the group is referred to by any test, the operation fails. Nodes that are referred to only by this group are removed from the current session; nodes that also belong to other groups remain in the session.

$ lst del_group clients

Userland client (lstclient --sesid NID --group NAME)

Use lstclient to run the userland self-test client. lstclient should be executed after creating a session on the console. There are only two mandatory options for lstclient:


--sesid NID

The first console’s NID.

--group NAME

The test group to join.

Console $ lst new_session testsession

Client1 $ lstclient --sesid 192.168.1.52@tcp --group clients


In addition, lstclient has an optional parameter that forces LNET to behave as a server (start the acceptor if the underlying NID needs it, use privileged ports, etc.):

--server_mode

For example:

Client1 $ lstclient --sesid 192.168.1.52@tcp --group clients --server_mode


Note - Only the super user is allowed to use the --server_mode option.


18.4.2.3 Batch and Test

This section lists lst batch and test commands.

add_batch NAME

The default batch (named “batch”) is created when the session is started. However, the user can specify a batch name by using add_batch:

$ lst add_batch bulkperf

add_test --batch BATCH [--loop #] [--concurrency #] [--distribute #:#] --from GROUP --to GROUP TEST ...

Adds a test to a batch. Currently, TEST can be brw or ping. The options are:


--loop #

Loop count of the test.

--concurrency #

Concurrency of the test.

--from GROUP

The source group (test client).

--to GROUP

The target group (test server).

--distribute #:#

The distribution of nodes between clients and servers. The first number is the size of each client subset (a count of nodes taken from the “from” group). The second number is the size of each server subset (a count of nodes taken from the “to” group). Only nodes in corresponding subsets talk to each other. The following examples are illustrative:

Clients: (C1, C2, C3, C4, C5, C6)

Server: (S1, S2, S3)

--distribute 1:1

(C1->S1), (C2->S2), (C3->S3), (C4->S1), (C5->S2), (C6->S3)

Here, -> denotes a test conversation.

--distribute 2:1

(C1,C2->S1), (C3,C4->S2), (C5,C6->S3)

--distribute 3:1

(C1,C2,C3->S1), (C4,C5,C6->S2), (NULL->S3)

--distribute 3:2

(C1,C2,C3->S1,S2), (C4,C5,C6->S3,S1)

--distribute 4:1

(C1,C2,C3,C4->S1), (C5,C6->S2), (NULL->S3)

--distribute 4:2

(C1,C2,C3,C4->S1,S2), (C5, C6->S3, S1)

--distribute 6:3

(C1,C2,C3,C4,C5,C6->S1,S2,S3)
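The pattern in these examples can be reproduced with a small model: client subsets of size A are taken in order, and each one is paired with the next B servers, wrapping around the server list. The following is a sketch of that model only. It is inferred from the examples above rather than taken from the lst implementation, and the function name distribute is hypothetical.

```shell
# Print the conversations implied by --distribute A:B for a given number of clients and servers.
# Usage: distribute A B NCLIENTS NSERVERS
# Model inferred from the documented examples; servers left without clients are not printed.
distribute() (
    a=$1 b=$2 nclients=$3 nservers=$4
    k=0                                   # index of the current client subset
    while [ $(( k * a )) -lt "$nclients" ]; do
        clients= servers=
        i=$(( k * a + 1 ))
        while [ "$i" -le $(( (k + 1) * a )) ] && [ "$i" -le "$nclients" ]; do
            clients="${clients:+$clients,}C$i"
            i=$(( i + 1 ))
        done
        j=0
        while [ "$j" -lt "$b" ]; do
            # server subsets advance by b and wrap around the server list
            servers="${servers:+$servers,}S$(( (k * b + j) % nservers + 1 ))"
            j=$(( j + 1 ))
        done
        echo "($clients->$servers)"
        k=$(( k + 1 ))
    done
)
```

For instance, distribute 4 2 6 3 prints (C1,C2,C3,C4->S1,S2) followed by (C5,C6->S3,S1), matching the --distribute 4:2 example.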


 

There are only two test types:


ping

There are no private parameters for the ping test.

brw

The brw test can have several options:

read | write

Read or write. The default is read.

size=# | #K | #M

I/O size in bytes, KB or MB (e.g., size=1024, size=4K, size=1M). The default is 4K bytes.

check=full | simple

A data validation check (checksum of data). The default is no-check. As an example:

$ lst add_group clients 192.168.1.[10-17]@tcp

$ lst add_group servers 192.168.10.[100-103]@tcp

$ lst add_batch bulkperf

$ lst add_test --batch bulkperf --loop 100 \
--concurrency 4 --distribute 4:2 --from clients --to servers \
brw write size=16K
# Adds a brw test (write, 16 KB) to batch bulkperf; the test runs with a concurrency of 4.
# 192.168.1.[10-13] will write to 192.168.10.[100,101]
# 192.168.1.[14-17] will write to 192.168.10.[102,103]
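When comparing results across runs, it helps to convert the size= values described above to bytes. The following is a minimal sketch; the helper name size_to_bytes is hypothetical, and it assumes that K means 1024 bytes and M means 1024 x 1024 bytes.

```shell
# Convert a brw size value (#, #K or #M) to bytes.
# Assumption: K = 1024 and M = 1024 * 1024.
size_to_bytes() {
    case $1 in
        *[Kk]) echo $(( ${1%?} * 1024 )) ;;
        *[Mm]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}
```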


list_batch [NAME] [--test INDEX] [--active] [--invalid] [--server]

Lists batches in the current session or lists client|server nodes in a batch or a test.


--test INDEX

Lists tests in a batch. If no option is used, all tests in the batch are listed. If the option is used, only specified tests in the batch are listed.

$ lst list_batch

bulkperf

$ lst list_batch bulkperf

Batch: bulkperf Tests: 1 State: Idle

ACTIVE BUSY DOWN UNKNOWN TOTAL

client 8 0 0 0 8

server 4 0 0 0 4

Test 1(brw) (loop: 100, concurrency: 4)

ACTIVE BUSY DOWN UNKNOWN TOTAL

client 8 0 0 0 8

server 4 0 0 0 4

$ lst list_batch bulkperf --server --active

192.168.10.100@tcp Active

192.168.10.101@tcp Active

192.168.10.102@tcp Active

192.168.10.103@tcp Active


run NAME

Runs the batch.

$ lst run bulkperf

stop NAME

Stops the batch.

$ lst stop bulkperf

query NAME [--test INDEX] [--timeout #] [--loop #] [--delay #] [--all]

Queries the batch status.


--test INDEX

Only queries the specified test. The test INDEX starts from 1.

--timeout #

The timeout value to wait for RPC. The default is 5 seconds.

--loop #

The loop count of the query.

--delay #

The interval of each query. The default is 5 seconds.

--all

Lists the status of all nodes in a batch or a test.

$ lst run bulkperf

$ lst query bulkperf --loop 5 --delay 3

Batch is running

Batch is running

Batch is running

Batch is running

Batch is running

$ lst query bulkperf --all

192.168.1.10@tcp Running

192.168.1.11@tcp Running

192.168.1.12@tcp Running

192.168.1.13@tcp Running

192.168.1.14@tcp Running

192.168.1.15@tcp Running

192.168.1.16@tcp Running

192.168.1.17@tcp Running

$ lst stop bulkperf

$ lst query bulkperf

Batch is idle
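In a script, the query command can be polled until the batch is no longer running. The following is a minimal sketch, assuming the "Batch is idle" text shown in the sample output above; the function name wait_for_idle and its default values are hypothetical.

```shell
# Poll a batch until lst query reports it idle, or give up after max_polls attempts.
# Usage: wait_for_idle BATCH [MAX_POLLS] [DELAY_SECONDS]
# Relies on the "Batch is idle" text from the sample output (an assumption about the format).
wait_for_idle() {
    batch=$1 max_polls=${2:-60} delay=${3:-5}
    n=0
    while [ "$n" -lt "$max_polls" ]; do
        if lst query "$batch" | grep -q "Batch is idle"; then
            return 0
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, wait_for_idle bulkperf 12 5 checks the batch every 5 seconds for up to a minute and returns a non-zero status if the batch is still running at the end.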


18.4.2.4 Other Commands

This section lists other lst commands.

ping [--session] [--group NAME] [--nodes NIDs] [--batch NAME] [--server] [--timeout #]

Sends a “hello” query to the nodes.


--session

Pings all nodes in the current session.

--group NAME

Pings all nodes in a specified group.

--nodes NIDs

Pings all specified nodes.

--batch NAME

Pings all client nodes in a batch.

--server

Sends RPCs to all server nodes instead of client nodes. This option is only used with --batch NAME.

--timeout #

The RPC timeout value.

$ lst ping --nodes 192.168.1.[15-20]@tcp

192.168.1.15@tcp Active [session: liang id: 192.168.1.3@tcp]

192.168.1.16@tcp Active [session: liang id: 192.168.1.3@tcp]

192.168.1.17@tcp Active [session: liang id: 192.168.1.3@tcp]

192.168.1.18@tcp Busy [session: Isaac id: 192.168.10.10@tcp]

192.168.1.19@tcp Down [session: <NULL> id: LNET_NID_ANY]

192.168.1.20@tcp Down [session: <NULL> id: LNET_NID_ANY]


 

stat [--bw] [--rate] [--read] [--write] [--max] [--min] [--avg] [--timeout #] [--delay #] GROUP|NIDs [GROUP|NIDs]

Collects performance and RPC statistics of one or more nodes.

Specifying a group name (GROUP) causes statistics to be gathered for all nodes in a test group. For example:

$ lst stat servers

where servers is the name of a test group created by lst add_group.

Specifying a NID range (NIDs) causes statistics to be gathered for selected nodes. For example:

$ lst stat 192.168.0.[1-100/2]@tcp

Currently, only LNET performance statistics are available.[2] By default, all statistics information is displayed. Users can select specific information with these options.


--bw

Displays the bandwidth of the specified group/nodes.

--rate

Displays the rate of RPCs of the specified group/nodes.

--read

Displays the read statistics of the specified group/nodes.

--write

Displays the write statistics of the specified group/nodes.

--max

Displays the maximum value of the statistics.

--min

Displays the minimum value of the statistics.

--avg

Displays the average of the statistics.

--timeout #

The timeout of the statistics RPC. The default is 5 seconds.

--delay #

The interval of the statistics (in seconds).

$ lst run bulkperf

$ lst stat clients

[LNet Rates of clients]

[W] Avg: 1108 RPC/s Min: 1060 RPC/s Max: 1155 RPC/s

[R] Avg: 2215 RPC/s Min: 2121 RPC/s Max: 2310 RPC/s

[LNet Bandwidth of clients]

[W] Avg: 16.60 MB/s Min: 16.10 MB/s Max: 17.1 MB/s

[R] Avg: 40.49 MB/s Min: 40.30 MB/s Max: 40.68 MB/s
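To record a single figure from this output, the report can be filtered with awk. The following is a minimal sketch that extracts the average read or write bandwidth; the helper name avg_bandwidth is hypothetical, and the layout is assumed to match the sample output above.

```shell
# Extract the average bandwidth (in MB/s) from lst stat output read on stdin.
# $1 is R (read) or W (write). Layout assumed from the sample output.
avg_bandwidth() {
    awk -v dir="[$1]" '
        /LNet Bandwidth/ { in_bw = 1; next }   # start of the bandwidth section
        /LNet Rates/     { in_bw = 0; next }   # the rates section is skipped
        in_bw && $1 == dir { print $3; exit }
    '
}
```

Because lst stat keeps printing until it is stopped, capture a bounded sample to a file first (for example, with the sleep and kill approach used in the sample script) and pass that file to avg_bandwidth on standard input.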


show_error [--session] [GROUP]|[NIDs] ...

Lists the number of failed RPCs on test nodes.


--session

Lists errors in the current test session. With this option, historical RPC errors are not listed.

$ lst show_error clients

clients

12345-192.168.1.15@tcp: [Session: 1 brw errors, 0 ping errors] \
[RPC: 20 errors, 0 dropped,

12345-192.168.1.16@tcp: [Session: 0 brw errors, 0 ping errors] \
[RPC: 1 errors, 0 dropped,

Total 2 error nodes in clients

$ lst show_error --session clients

clients

12345-192.168.1.15@tcp: [Session: 1 brw errors, 0 ping errors]

Total 1 error nodes in clients


 


1 (Footnote) The sgpdd-survey profiles individual disks. This script is destructive, and should not be run anywhere you want to preserve existing data.
2 (Footnote) In the future, more statistics will be supported.