CHAPTER 21
LustreProc
This chapter describes Lustre /proc entries and includes the following sections:
The proc file system acts as an interface to internal data structures in the kernel. Proc variables can be used to control aspects of Lustre performance and provide information.
This section describes /proc entries for Lustre.
Use the proc files on the MGS to locate the following:
# cat /proc/fs/lustre/mgs/MGS/filesystems
spfs
lustre
# cat /proc/fs/lustre/mgs/MGS/live/spfs
fsname: spfs
flags: 0x0 gen: 7
spfs-MDT0000
spfs-OST0000
All servers are named according to this convention: <fsname>-<MDT|OST><XXXX>. This can be shown for live servers under /proc/fs/lustre/devices:
# cat /proc/fs/lustre/devices
0 UP mgs MGS MGS 11
1 UP mgc MGC192.168.10.34@tcp 1f45bb57-d9be-2ddb-c0b0-5431a49226705
2 UP mdt MDS MDS_uuid 3
3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4
4 UP mds lustre-MDT0000 lustre-MDT0000_UUID 7
5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5
6 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
7 UP lov lustre-clilov-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa04
8 UP mdc lustre-MDT0000-mdc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05
9 UP osc lustre-OST0000-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05
10 UP osc lustre-OST0001-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05
Or from the device label at any time:
# e2label /dev/sda
lustre-MDT0000
Lustre uses two types of timeouts.
Congested routers can be a source of spurious LND timeouts. To avoid this, increase the number of LNET router buffers to reduce back-pressure and/or increase LND timeouts on all nodes on all connected networks. You should also consider increasing the total number of LNET router nodes in the system so that the aggregate router bandwidth matches the aggregate server bandwidth.
Specific Lustre timeouts are described below.
/proc/sys/lustre/timeout
This is the time period that a client waits for a server to complete an RPC (default is 100s). Servers wait half of this time for a normal client RPC to complete and a quarter of this time for a single bulk request (read or write of up to 1 MB) to complete. The client pings recoverable targets (MDS and OSTs) at one quarter of the timeout, and the server waits one and a half times the timeout before evicting a client for being "stale."
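The ratios described above can be worked through for a given timeout value. The following is a minimal sketch in Python; the names DerivedTimeouts and derive_timeouts are hypothetical and exist only for this illustration. It applies the stated fractions and does not read any value from a live system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivedTimeouts:
    """Intervals derived from the Lustre timeout, all in seconds."""
    client_rpc_wait: float   # client waits the full timeout for an RPC
    server_rpc_wait: float   # server waits half the timeout for a normal RPC
    server_bulk_wait: float  # server waits a quarter for a single bulk request
    ping_interval: float     # client pings recoverable targets at a quarter
    eviction_wait: float     # server evicts a silent client after 1.5x


def derive_timeouts(timeout_s: float = 100.0) -> DerivedTimeouts:
    """Apply the ratios stated in the text to a timeout given in seconds."""
    if timeout_s <= 0:
        raise ValueError("timeout must be positive")
    return DerivedTimeouts(
        client_rpc_wait=timeout_s,
        server_rpc_wait=timeout_s / 2,
        server_bulk_wait=timeout_s / 4,
        ping_interval=timeout_s / 4,
        eviction_wait=timeout_s * 1.5,
    )


if __name__ == "__main__":
    # With the default of 100s: 50s server wait, 25s bulk wait,
    # 25s ping interval, 150s before eviction.
    print(derive_timeouts(100))
```

With the default of 100s, a server waits 50s for a normal RPC and 25s for a bulk request, clients ping every 25s, and eviction follows after 150s of silence.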
/proc/sys/lustre/ldlm_timeout
This is the time period for which a server will wait for a client to reply to an initial AST (lock cancellation request). The default is 20s for an OST and 6s for an MDS. If the client replies to the AST, the server will give it a normal timeout (half of the client timeout) to flush any dirty data and release the lock.
Note - When adaptive timeouts are enabled, the ldlm_timeout tunable is not used.
/proc/sys/lustre/fail_loc
This is the internal debugging failure hook.
See lustre/include/linux/obd_support.h for the definitions of individual failure locations. The default value is 0 (zero).
sysctl -w lustre.fail_loc=0x80000122 # drop a single reply
/proc/sys/lustre/dump_on_timeout
This triggers dumps of the Lustre debug log when timeouts occur. The default value is 0 (zero).
/proc/sys/lustre/dump_on_eviction
This triggers dumps of the Lustre debug log when an eviction occurs. The default value is 0 (zero). By default, debug logs are dumped to the /tmp folder; this location can be changed via /proc.
Lustre 1.8 introduces an adaptive mechanism to set RPC timeouts. This feature causes servers to track actual RPC completion times, and to report estimated completion times for future RPCs back to clients. The clients use these estimates to set their future RPC timeout values. If server request processing slows down for any reason, the RPC completion estimates increase, and the clients allow more time for RPC completion.
If RPCs queued on the server approach their timeouts, then the server sends an early reply to the client, telling the client to allow more time. In this manner, clients avoid RPC timeouts and disconnect/reconnect cycles. Conversely, as a server speeds up, RPC timeout values decrease, allowing faster detection of non-responsive servers and faster attempts to reconnect to a server's failover partner.
Note - In Lustre 1.8, adaptive timeouts are enabled by default. In earlier Lustre versions supporting adaptive timeouts (1.6.5 through 1.6.7.x), this feature was disabled by default.
In previous Lustre versions, the static obd_timeout (/proc/sys/lustre/timeout) value was used as the maximum completion time for all RPCs; this value also affected the client-server ping interval and initial recovery timer. Now, with adaptive timeouts, obd_timeout is only used for the ping interval and initial recovery estimate. When a client reconnects during recovery, the server uses the client's timeout value to reset the recovery wait period; i.e., the server learns how long the client had been willing to wait, and takes this into account when adjusting the recovery period.
One of the goals of adaptive timeouts is to relieve users from having to tune the obd_timeout value. In general, obd_timeout should no longer need to be changed. However, there are several parameters related to adaptive timeouts that users can set. In most situations, the default values should be used.
The following parameters can be set persistently system-wide using lctl conf_param on the MGS. For example, lctl conf_param work1.sys.at_max=1500 sets the at_max value for all servers and clients using the work1 file system.
Note - Nodes using multiple Lustre file systems must use the same at_* values for all file systems.
at_min
Sets the minimum adaptive timeout (in seconds). Default value is 0. The at_min parameter is the minimum processing time that a server will report. Clients base their timeouts on this value, but they do not use this value directly. If, for unknown reasons, the adaptive timeout value is too short and clients time out their RPCs (usually due to temporary network outages), you can increase the at_min value to compensate. Ideally, users should leave at_min set to its default.
at_max
Sets the maximum adaptive timeout (in seconds). The at_max parameter is an upper limit on the service time estimate, and is used as a 'failsafe' in case of rogue/bad/buggy code that would lead to never-ending estimate increases. If at_max is reached, an RPC request is considered 'broken' and should time out. Setting at_max to 0 disables adaptive timeouts and causes the old fixed-timeout method (obd_timeout) to be used. This is the default value in Lustre 1.6.5. Note that slow hardware might validly cause the service estimate to increase beyond the default value of at_max. In this case, increase at_max to the maximum time you are willing to wait for an RPC completion.
at_history
Sets a time period (in seconds) within which adaptive timeouts remember the slowest event that occurred. Default value is 600.
at_early_margin
Sets how far before the deadline Lustre sends an early reply. Default value is 5[1].
at_extra
Sets the incremental amount of time that a server asks for with each early reply. The server does not know how much time the RPC will take, so it asks for a fixed value. Default value is 30[2]. When a server finds a queued request about to time out (and needs to send an early reply out), the server adds the at_extra value. If the time expires, the Lustre client enters recovery status and reconnects to restore it to normal status. If you see multiple early replies for the same RPC asking for multiple 30-second increases, change the at_extra value to a larger number to cut down on early replies sent and, therefore, network load.
ldlm_enqueue_min
Sets the minimum lock enqueue time. Default value is 100. The ldlm_enqueue time is the maximum of the measured enqueue estimate (influenced by the at_min and at_max parameters), multiplied by a weighting factor, and the ldlm_enqueue_min setting. LDLM lock enqueues were previously based on the obd_timeout value; now they have a dedicated minimum value. Lock enqueues increase as the measured enqueue times increase (similar to adaptive timeouts).[3]
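Two of the relationships described for these parameters can be worked through numerically. The sketch below is illustrative only: the function names are hypothetical, the weighting factor is left as an input because its value is not stated here, and the early-reply count assumes each early reply extends the deadline by exactly at_extra seconds.

```python
import math


def ldlm_enqueue_time(measured_estimate_s: float,
                      weighting_factor: float,
                      ldlm_enqueue_min_s: float = 100.0) -> float:
    """Maximum of the weighted enqueue estimate and ldlm_enqueue_min.

    weighting_factor is a placeholder input; its real value is not
    given in the text.
    """
    return max(measured_estimate_s * weighting_factor, ldlm_enqueue_min_s)


def early_replies_needed(extra_time_needed_s: float,
                         at_extra_s: float = 30.0) -> int:
    """Number of early replies sent if each one adds at_extra seconds.

    Simplifying assumption: every early reply extends the deadline by
    exactly at_extra, and the RPC needs extra_time_needed_s beyond its
    original deadline.
    """
    if at_extra_s <= 0:
        raise ValueError("at_extra must be positive")
    if extra_time_needed_s <= 0:
        return 0
    return math.ceil(extra_time_needed_s / at_extra_s)


if __name__ == "__main__":
    # A small weighted estimate is raised to the ldlm_enqueue_min floor.
    print(ldlm_enqueue_time(30, 2.0))
    # An RPC that overruns by 100s needs four 30s extensions but only
    # two 60s extensions, which is why a larger at_extra reduces traffic.
    print(early_replies_needed(100, 30), early_replies_needed(100, 60))
```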
In Lustre 1.8, adaptive timeouts are enabled by default. To disable adaptive timeouts at run time, set at_max to 0. On the MGS, run:
$ lctl conf_param <fsname>.sys.at_max=0
Note - Changing adaptive timeouts status at runtime may cause transient timeouts, reconnections, and recovery.
Adaptive timeouts information can be read from /proc/fs/lustre/*/timeouts files (for each service and client) or with the lctl command.
This is an example from the /proc/fs/lustre/*/timeouts files:
cfs21:~# cat /proc/fs/lustre/ost/OSS/ost_io/timeouts
This is an example using the lctl command:
$ lctl get_param -n ost.*.ost_io.timeouts
service : cur 33 worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2
The ost_io service on this node is currently reporting an estimate of 33 seconds. The worst RPC service time was 34 seconds, and it happened 26 minutes ago.
The output also provides a history of service times. In the example, there are 4 "bins" of adaptive_timeout_history, with the maximum RPC time in each bin reported. In 0-150 seconds, the maximum RPC time was 1, with the same result in 150-300 seconds. From 300-450 seconds, the worst (maximum) RPC time was 33 seconds, and from 450-600s the worst time was 2 seconds. The current estimated service time is the maximum value of the 4 bins (33 seconds in this example).
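The relationship between the history bins and the current estimate can be checked with a short parser. This is a minimal sketch in Python that assumes the line layout shown in the example output; parse_timeouts_line is a hypothetical helper, not part of any Lustre tool.

```python
import re

# Matches lines such as:
#   service : cur 33 worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2
LINE_RE = re.compile(
    r"^(?P<name>.+?)\s*:\s*cur\s+(?P<cur>\d+)\s+worst\s+(?P<worst>\d+)\s+"
    r"\(at\s+(?P<at>\d+),\s*(?P<ago>[^)]*?)\s+ago\)\s*(?P<bins>(?:\d+\s*)*)$"
)


def parse_timeouts_line(line: str) -> dict:
    """Split one timeouts line into its fields and history bins."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized timeouts line: {line!r}")
    return {
        "name": match.group("name").strip(),
        "cur": int(match.group("cur")),
        "worst": int(match.group("worst")),
        "worst_at": int(match.group("at")),
        "worst_ago": match.group("ago"),
        "bins": [int(value) for value in match.group("bins").split()],
    }


if __name__ == "__main__":
    sample = "service : cur 33 worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2"
    record = parse_timeouts_line(sample)
    # The current estimate equals the largest value in the history bins.
    print(record["cur"], max(record["bins"]))
    # With a 600-second history split into 4 bins, each bin covers 150s.
    print(600 // len(record["bins"]))
```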
Service times (as reported by the servers) are also tracked in the client OBDs:
cfs21:# lctl get_param osc.*.timeouts
last reply : 1193428639, 0d0h00m00s ago
network    : cur 1  worst 2  (at 1193427053, 0d0h26m26s ago) 1 1 1 1
portal 6   : cur 33 worst 34 (at 1193427052, 0d0h26m27s ago) 33 33 33 2
portal 28  : cur 1  worst 1  (at 1193426141, 0d0h41m38s ago) 1 1 1 1
portal 7   : cur 1  worst 1  (at 1193426141, 0d0h41m38s ago) 1 0 1 1
portal 17  : cur 1  worst 1  (at 1193426177, 0d0h41m02s ago) 1 0 0 1
In this case, the entry for portal 6, the OST_IO_PORTAL (see lustre/include/lustre/lustre_idl.h), shows the history of what the ost_io portal has reported as the service estimate.
Server statistic files also show the range of estimates in the normal min/max/sum/sumsq manner.
cfs21:~# lctl get_param mdt.*.mdt.stats
...
req_timeout 6 samples [sec] 1 10 15 105
...
This section describes /proc entries for LNET information.
Shows all NIDs known to this node and also gives information on the queue state.
# cat /proc/sys/lnet/peers
nid                refs  state  max  rtr  min  tx  min  queue
0@lo               1     ~rtr   0    0    0    0   0    0
192.168.10.35@tcp  1     ~rtr   8    8    8    8   6    0
192.168.10.36@tcp  1     ~rtr   8    8    8    8   6    0
192.168.10.37@tcp  1     ~rtr   8    8    8    8   6    0
The fields are explained below:
Credits work like a semaphore. At start, they are initialized to allow a certain number of operations (8 in this example). LNET keeps track of the minimum value so that you can see how congested a resource was.
If rtr/tx is less than max, there are operations in progress. The number of operations is equal to rtr or tx subtracted from max.
If rtr/tx is greater than max, there are operations blocking.
LNET also limits concurrent sends and router buffers allocated to a single peer so that no peer can occupy all these resources.
# cat /proc/sys/lnet/nis
nid                refs  peer  max  tx   min
0@lo               3     0     0    0    0
192.168.10.34@tcp  4     8     256  256  252
Shows the current queue health on this node. The fields are explained below:
peer: Number of peer-to-peer send credits on this NID. Credits are used to size buffer pools.
Subtracting tx from max (max - tx) yields the number of sends currently active. A large or increasing number of active sends may indicate a problem.
# cat /proc/sys/lnet/nis
nid               refs  peer  max  tx   min
0@lo              2     0     0    0    0
10.67.73.173@tcp  4     8     256  256  253
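The max - tx calculation can be applied to every NID in the nis output at once. The following Python sketch assumes the whitespace-separated column layout shown in the example (nid, refs, peer, max, tx, min); active_sends is a hypothetical helper written for this illustration.

```python
def active_sends(nis_text: str) -> dict:
    """Return {nid: max - tx} for each row of /proc/sys/lnet/nis output.

    Expects the header line followed by one whitespace-separated row
    per NID, as in the example output.
    """
    lines = [ln.split() for ln in nis_text.strip().splitlines() if ln.strip()]
    if not lines:
        return {}
    header, rows = lines[0], lines[1:]
    result = {}
    for row in rows:
        # zip stops at the shorter list, so the duplicate-free nis header
        # maps cleanly onto each row.
        record = dict(zip(header, row))
        result[record["nid"]] = int(record["max"]) - int(record["tx"])
    return result


if __name__ == "__main__":
    sample = """
    nid refs peer max tx min
    0@lo 2 0 0 0 0
    10.67.73.173@tcp 4 8 256 256 253
    """
    # Both NIDs in the sample have all credits available, so no sends
    # are active.
    print(active_sends(sample))
```

In the sample output, tx equals max for both NIDs, so no sends are active at the time of the snapshot.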
Free-space stripe weighting, as set, gives a priority of "0" to free space (versus trying to place the stripes "widely" -- nicely distributed across OSSs and OSTs to maximize network balancing). To adjust this priority (as a percentage), use the qos_prio_free proc tunable:
$ cat /proc/fs/lustre/lov/<fsname>-mdtlov/qos_prio_free
Currently, the default is 90%. You can permanently set this value by running this command on the MGS:
$ lctl conf_param <fsname>-MDT0000.lov.qos_prio_free=90
Setting the priority to 100% means that OSS distribution does not count in the weighting, but the stripe assignment is still done via weighting. If OST 2 has twice as much free space as OST 1, it is twice as likely to be used, but it is NOT guaranteed to be used.
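The "twice as likely, but not guaranteed" behavior is what a weighted random choice produces. The sketch below models only that idea, with selection weight proportional to free space; it is not Lustre's allocator, and the function names and OST labels are hypothetical.

```python
import random


def selection_probabilities(free_space: dict) -> dict:
    """Probability of picking each OST when weight equals free space."""
    total = sum(free_space.values())
    if total <= 0:
        raise ValueError("at least one OST must have free space")
    return {ost: space / total for ost, space in free_space.items()}


def pick_ost(free_space: dict, rng: random.Random) -> str:
    """Pick one OST at random, weighted by its free space."""
    osts = list(free_space)
    weights = [free_space[ost] for ost in osts]
    return rng.choices(osts, weights=weights, k=1)[0]


if __name__ == "__main__":
    free = {"OST1": 100, "OST2": 200}  # OST2 has twice the free space
    print(selection_probabilities(free))
    rng = random.Random(0)
    counts = {ost: 0 for ost in free}
    for _ in range(30000):
        counts[pick_ost(free, rng)] += 1
    # OST2 is chosen roughly twice as often, yet OST1 is still chosen.
    print(counts)
```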
Also note that free-space stripe weighting does not activate until two OSTs are imbalanced by more than 20%. Until then, a faster round-robin stripe allocator is used. (The new round-robin order also maximizes network balancing.)
The MDS uses two methods to manage stripe allocation and determine which OSTs to use for file object storage:
Quality of Service (QOS) considers an OST’s available blocks, speed, and the number of existing objects, etc. Using these criteria, the MDS selects OSTs with more free space more often than OSTs with less free space.
Round-Robin (RR) allocates objects evenly across all OSTs. The RR stripe allocator is faster than QOS, and used often because it distributes space usage/load best in most situations, maximizing network balancing and improving performance.
Whether QOS or RR is used depends on the setting of the qos_threshold_rr proc tunable. The qos_threshold_rr variable specifies a percentage threshold where the use of QOS or RR becomes more/less likely. The qos_threshold_rr tunable can be set as an integer, from 0 to 100, and results in this stripe allocation behavior:
This section describes I/O tunables.
/proc/fs/lustre/llite/<fsname>-<uid>/max_cached_mb
# cat /proc/fs/lustre/llite/lustre-ce63ca00/max_cached_mb
128
This tunable is the maximum amount of inactive data cached by the client (default is 3/4 of RAM).
The Lustre engine always attempts to pack an optimal amount of data into each I/O RPC and attempts to keep a consistent number of issued RPCs in progress at a time. Lustre exposes several tuning variables to adjust behavior according to network conditions and cluster size. Each OSC has its own tree of these tunables. For example:
$ ls -d /proc/fs/lustre/osc/OSC_client_ost1_MNT_client_2 /localhost
/proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost
/proc/fs/lustre/osc/OSC_uml0_ost2_MNT_localhost
/proc/fs/lustre/osc/OSC_uml0_ost3_MNT_localhost
$ ls /proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost
blocksize filesfree max_dirty_mb ost_server_uuid stats
RPC stream tunables are described below.
/proc/fs/lustre/osc/<object name>/max_dirty_mb
This tunable controls how many MBs of dirty data can be written and queued up in the OSC. POSIX file writes that are cached contribute to this count. When the limit is reached, additional writes stall until previously-cached writes are written to the server. This may be changed by writing a single ASCII integer to the file. Only values between 0 and 512 are allowable. If 0 is given, no writes are cached. Performance suffers noticeably unless you use large writes (1 MB or more).
/proc/fs/lustre/osc/<object name>/cur_dirty_bytes
This tunable is a read-only value that returns the current number of bytes written and cached on this OSC.
/proc/fs/lustre/osc/<object name>/max_pages_per_rpc
This tunable is the maximum number of pages that will undergo I/O in a single RPC to the OST. The minimum is a single page and the maximum for this setting is platform dependent (256 for i386/x86_64, possibly less for ia64/PPC with larger PAGE_SIZE), though generally amounts to a total of 1 MB in the RPC.
/proc/fs/lustre/osc/<object name>/max_rpcs_in_flight
This tunable is the maximum number of concurrent RPCs in flight from an OSC to its OST. If the OSC tries to initiate an RPC but finds that it already has the same number of RPCs outstanding, it will wait to issue further RPCs until some complete. The minimum setting is 1 and maximum setting is 32. If you are looking to improve small file I/O performance, increase the max_rpcs_in_flight value.
To maximize performance, the value for max_dirty_mb is recommended to be 4 * max_pages_per_rpc * max_rpcs_in_flight.
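The recommendation can be worked through as arithmetic. The text does not state units for the formula, so this sketch assumes that the max_pages_per_rpc term is first converted to megabytes using the page size, and that the result is capped at the 512 MB limit mentioned for max_dirty_mb. The function name and the 4 KB default page size are assumptions for illustration.

```python
def recommended_max_dirty_mb(max_pages_per_rpc: int,
                             max_rpcs_in_flight: int,
                             page_size_bytes: int = 4096,
                             upper_limit_mb: float = 512.0) -> float:
    """Return 4 * (RPC size in MB) * max_rpcs_in_flight, capped at the limit.

    Assumption: the pages-per-RPC term of the formula is converted to MB
    before multiplying, since max_dirty_mb is expressed in MB.
    """
    if max_pages_per_rpc < 1 or max_rpcs_in_flight < 1:
        raise ValueError("both tunables must be at least 1")
    rpc_size_mb = max_pages_per_rpc * page_size_bytes / (1024 * 1024)
    return min(4 * rpc_size_mb * max_rpcs_in_flight, upper_limit_mb)


if __name__ == "__main__":
    # 256 pages of 4 KB is a 1 MB RPC; with 8 RPCs in flight the
    # recommendation works out to 4 * 1 * 8 = 32 MB.
    print(recommended_max_dirty_mb(256, 8))
    # With the maximum of 32 RPCs in flight: 4 * 1 * 32 = 128 MB.
    print(recommended_max_dirty_mb(256, 32))
```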
Note - The <object name> varies depending on the specific Lustre configuration. For <object name> examples, refer to the sample command output.
The same directory contains an rpc_stats file with a histogram showing the composition of previous RPCs. The histogram can be cleared by writing any value into the rpc_stats file.
# cat /proc/fs/lustre/osc/spfs-OST0000-osc-c45f9c00/rpc_stats
snapshot_time: 1174867307.156604 (secs.usecs)
read RPCs in flight: 0
write RPCs in flight: 0
pending write pages: 0
pending read pages: 0
                     read               write
pages per rpc    rpcs  %  cum %  |  rpcs  %  cum %
1:                  0  0      0  |     0  0      0
                     read               write
rpcs in flight   rpcs  %  cum %  |  rpcs  %  cum %
0:                  0  0      0  |     0  0      0
                     read               write
offset           rpcs  %  cum %  |  rpcs  %  cum %
0:                  0  0      0  |     0  0      0
The offset_stats parameter maintains statistics for occurrences where a series of read or write calls from a process did not access the next sequential location. The offset field is reset to 0 (zero) whenever a different file is read/written.
Read/write offset statistics are off, by default. The statistics can be activated by writing anything into the offset_stats file.
# cat /proc/fs/lustre/llite/lustre-f57dee00/rw_offset_stats
snapshot_time: 1155748884.591028 (secs.usecs)
R/W  PID   RANGE START  RANGE END  SMALLEST EXTENT  LARGEST EXTENT  OFFSET
R    8385  0            128        128              128             0
R    8385  0            224        224              224             -128
W    8385  0            250        50               100             0
W    8385  100          1110       10               500             -150
W    8384  0            5233       5233             5233            0
R    8385  500          600        100              100             -610
Client-Based I/O Extent Size Survey
The extents_stats histogram in the llite directory shows the statistics for the sizes of the read-write I/O extents. This file does not maintain per-process statistics.
$ cat /proc/fs/lustre/llite/lustre-ee5af200/extents_stats
snapshot_time: 1213828728.348516 (secs.usecs)
                      read        |        write
extents          calls  %  cum%   |  calls   %  cum%
0K - 4K :            0  0     0   |      2   2     2
4K - 8K :            0  0     0   |      0   0     2
8K - 16K :           0  0     0   |      0   0     2
16K - 32K :          0  0     0   |     20  23    26
32K - 64K :          0  0     0   |      0   0    26
64K - 128K :         0  0     0   |     51  60    86
128K - 256K :        0  0     0   |      0   0    86
256K - 512K :        0  0     0   |      0   0    86
512K - 1024K :       0  0     0   |      0   0    86
1M - 2M :            0  0     0   |     11  13   100
The file can be cleared by issuing the following command:
$ echo > /proc/fs/lustre/llite/lustre-ee5af200/extents_stats
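The % and cum% columns of the extents_stats histogram can be reproduced from the calls column. The Python sketch below assumes the percentages are truncated to whole numbers, which is consistent with the write column of the sample output (84 calls in total); histogram_percentages is a hypothetical helper, not a Lustre interface.

```python
def histogram_percentages(calls: list) -> list:
    """Return (calls, percent, cumulative percent) for each bucket.

    Percentages are truncated to integers, an assumption that matches
    the sample output rather than a documented rule.
    """
    total = sum(calls)
    rows = []
    running = 0
    for count in calls:
        running += count
        percent = 100 * count // total if total else 0
        cumulative = 100 * running // total if total else 0
        rows.append((count, percent, cumulative))
    return rows


if __name__ == "__main__":
    # Write-side call counts from the sample, smallest bucket first.
    write_calls = [2, 0, 0, 20, 0, 51, 0, 0, 0, 11]
    for row in histogram_percentages(write_calls):
        print(row)
```

Reading the cum% column this way shows, for instance, that 86% of the sampled writes were 128 KB or smaller.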
Per-Process Client I/O Statistics
The extents_stats_per_process file maintains the I/O extent size statistics on a per-process basis, so you can track the per-process statistics for the last MAX_PER_PROCESS_HIST processes.
$ cat /proc/fs/lustre/llite/lustre-ee5af200/extents_stats_per_process
snapshot_time: 1213828762.204440 (secs.usecs)
                      read        |        write
extents          calls  %  cum%   |  calls    %  cum%
PID: 11488
0K - 4K :            0  0     0   |      0    0     0
4K - 8K :            0  0     0   |      0    0     0
8K - 16K :           0  0     0   |      0    0     0
16K - 32K :          0  0     0   |      0    0     0
32K - 64K :          0  0     0   |      0    0     0
64K - 128K :         0  0     0   |      0    0     0
128K - 256K :        0  0     0   |      0    0     0
256K - 512K :        0  0     0   |      0    0     0
512K - 1024K :       0  0     0   |      0    0     0
1M - 2M :            0  0     0   |     10  100   100
PID: 11491
0K - 4K :            0  0     0   |      0    0     0
4K - 8K :            0  0     0   |      0    0     0
8K - 16K :           0  0     0   |      0    0     0
16K - 32K :          0  0     0   |     20  100   100
PID: 11424
0K - 4K :            0  0     0   |      0    0     0
4K - 8K :            0  0     0   |      0    0     0
8K - 16K :           0  0     0   |      0    0     0
16K - 32K :          0  0     0   |      0    0     0
32K - 64K :          0  0     0   |      0    0     0
64K - 128K :         0  0     0   |     16  100   100
PID: 11426
0K - 4K :            0  0     0   |      1  100   100
PID: 11429
0K - 4K :            0  0     0   |      1  100   100
Similarly, there is a brw_stats histogram in the obdfilter directory which shows you the statistics for number of I/O requests sent to the disk, their size and whether they are contiguous on the disk or not.
# cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats
snapshot_time: 1174875636.764630 (secs:usecs)
                         read               write
pages per brw        brws  %  cum %  |  rpcs  %  cum %
1:                      0  0      0  |     0  0      0
                         read               write
discont pages        rpcs  %  cum %  |  rpcs  %  cum %
1:                      0  0      0  |     0  0      0
                         read               write
discont blocks       rpcs  %  cum %  |  rpcs  %  cum %
1:                      0  0      0  |     0  0      0
                         read               write
dio frags            rpcs  %  cum %  |  rpcs  %  cum %
1:                      0  0      0  |     0  0      0
                         read               write
disk ios in flight   rpcs  %  cum %  |  rpcs  %  cum %
1:                      0  0      0  |     0  0      0
                         read               write
io time (1/1000s)    rpcs  %  cum %  |  rpcs  %  cum %
1:                      0  0      0  |     0  0      0
                         read               write
disk io size         rpcs  %  cum %  |  rpcs  %  cum %
1:                      0  0      0  |     0  0      0
                         read               write
The fields are explained below:
For each Lustre service, the following information is provided:
Additionally, data on each Lustre service is provided by service type:
Lustre 1.6.5.1 introduced file readahead and directory statahead functionality that read data into memory in anticipation of a process actually requesting the data. File readahead functionality reads file content data into memory. Directory statahead functionality reads metadata into memory. When readahead and/or statahead work well, a data-consuming process finds that the information it needs is available when requested, and it is unnecessary to wait for network I/O.
File readahead is triggered when two or more sequential reads by an application fail to be satisfied by the Linux buffer cache. The size of the initial readahead is 1 MB. Additional readaheads grow linearly, and increment until the readahead cache on the client is full at 40 MB.
/proc/fs/lustre/llite/<fsname>-<uid>/max_read_ahead_mb
This tunable controls the maximum amount of data readahead on a file. Files are read ahead in RPC-sized chunks (1 MB or the size of read() call, if larger) after the second sequential read on a file descriptor. Random reads are done at the size of the read() call only (no readahead). Reads to non-contiguous regions of the file reset the readahead algorithm, and readahead is not triggered again until there are sequential reads again. To disable readahead, set this tunable to 0. The default value is 40 MB.
/proc/fs/lustre/llite/<fsname>-<uid>/max_read_ahead_whole_mb
This tunable controls the maximum size of a file that is read in its entirety, regardless of the size of the read().
When the ls -l process opens a directory, its process ID is recorded. When the first directory entry is "stated" with this recorded process ID, a statahead thread is triggered which stats ahead all of the directory entries, in order. The ls -l process can use the stated directory entries directly, improving performance.
/proc/fs/lustre/llite/*/statahead_max
This tunable controls whether directory statahead is enabled and the maximum statahead count. By default, statahead is active.
To disable statahead, set this tunable to:
echo 0 > /proc/fs/lustre/llite/*/statahead_max
To set the maximum statahead count (n), set this tunable to:
echo n > /proc/fs/lustre/llite/*/statahead_max
The maximum value of n is 8192.
/proc/fs/lustre/llite/*/statahead_status
This is a read-only interface that indicates the current statahead status.
Lustre 1.8 introduces the OSS read cache feature, which provides read-only caching of data on an OSS. This functionality uses the regular Linux page cache to store the data. Just like caching from a regular filesystem in Linux, OSS read cache uses as much physical memory as is allocated.
OSS read cache improves Lustre performance in these situations:
OSS read cache offers these benefits:
OSS read cache is implemented on the OSS, and does not require any special support on the client side. Since OSS read cache uses the memory available in the Linux page cache, you should use I/O patterns to determine the appropriate amount of memory for the cache; if the data is mostly reads, then more cache is required than for writes.
OSS read cache is enabled, by default, and managed by the following tunables:
When the OSS receives a read request from a client, it reads data from disk into its memory and sends the data as a reply to the requests. If read cache is enabled, this data stays in memory after the client’s request is finished, and the OSS skips reading data from disk when subsequent read requests for the same are received. The read cache is managed by the Linux kernel globally across all OSTs on that OSS, and the least recently used cache pages will be dropped from memory when the amount of free memory is running low.
If read cache is disabled (read_cache_enable = 0), then the OSS will discard the data after the client’s read requests are serviced and, for subsequent read requests, the OSS must read the data from disk.
To disable read cache on all OSTs of an OSS, run:
root@oss1# lctl set_param obdfilter.*.read_cache_enable=0
To re-enable read cache on one OST, run:
root@oss1# lctl set_param obdfilter.{OST_name}.read_cache_enable=1
To check if read cache is enabled on all OSTs on an OSS, run:
root@oss1# lctl get_param obdfilter.*.read_cache_enable
When the OSS receives write requests from a client, it receives data from the client into its memory and writes the data to disk. If writethrough cache is enabled, this data stays in memory after the write request is completed, allowing the OSS to skip reading this data from disk if a later read request, or partial-page write request, for the same data is received.
If writethrough cache is disabled (writethrough_cache_enable = 0), then the OSS discards the data after the client’s write request is completed, and for subsequent read requests or partial-page write requests, the OSS must re-read the data from disk.
Enabling writethrough cache is advisable if clients are doing small or unaligned writes that would cause partial-page updates, or if the files written by one node are immediately being accessed by other nodes. Some examples where this might be useful include producer-consumer I/O models or shared-file writes with a different node doing I/O not aligned on 4096-byte boundaries. Disabling writethrough cache is advisable in the case where files are mostly written to the file system but are not re-read within a short time period, or files are only written and re-read by the same node, regardless of whether the I/O is aligned or not.
To disable writethrough cache on all OSTs of an OSS, run:
root@oss1# lctl set_param obdfilter.*.writethrough_cache_enable=0
To re-enable writethrough cache on one OST, run:
root@oss1# lctl set_param obdfilter.{OST_name}.writethrough_cache_enable=1
To check if writethrough cache is enabled on all OSTs of an OSS, run:
root@oss1# lctl get_param obdfilter.*.writethrough_cache_enable
This can be very useful for workloads where relatively small files are repeatedly accessed by many clients, such as job startup files, executables, log files, etc., but large files are read or written only once. By not putting the larger files into the cache, it is much more likely that more of the smaller files will remain in cache for a longer time.
When setting readcache_max_filesize, the input value can be specified in bytes, or can have a suffix to indicate other binary units such as Kilobytes, Megabytes, Gigabytes, Terabytes, or Petabytes.
To limit the maximum cached file size to 32MB on all OSTs of an OSS, run:
root@oss1# lctl set_param obdfilter.*.readcache_max_filesize=32M
To disable the maximum cached file size on an OST, run:
root@oss1# lctl set_param obdfilter.{OST_name}.readcache_max_filesize=-1
To check the current maximum cached file size on all OSTs of an OSS, run:
root@oss1# lctl get_param obdfilter.*.readcache_max_filesize
The OSS asynchronous journal commit feature synchronously writes data to disk without forcing a journal flush. This reduces the number of seeks and significantly improves performance on some hardware.
Note - Asynchronous journal commit cannot work with O_DIRECT writes; a journal flush is still forced.
When asynchronous journal commit is enabled, client nodes keep data in the page cache (a page reference). Lustre clients monitor the last committed transaction number (transno) in messages sent from the OSS to the clients. When a client sees that the last committed transno reported by the OSS is greater than or equal to the bulk write transno, it releases the reference on the corresponding pages. To avoid page references being held for too long on clients after a bulk write, a 7 second ping request is scheduled (jbd commit time is 5 seconds) after the bulk write reply is received, so the OSS has an opportunity to report the last committed transno.
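The client-side bookkeeping described above can be modeled in a few lines. This is a simplified sketch, not Lustre client code: the class BulkWritePages and its methods are hypothetical, and pages are represented by plain integers. It only shows the rule that pages are released once the reported last committed transno reaches the transno of the bulk write that produced them.

```python
class BulkWritePages:
    """Tracks page references held until the server commits a write."""

    def __init__(self):
        # Maps a bulk write transno to the pages still referenced for it.
        self._pinned = {}

    def on_bulk_write_reply(self, transno: int, pages: list) -> None:
        """Remember the pages of a bulk write that is not yet committed."""
        self._pinned.setdefault(transno, []).extend(pages)

    def on_server_message(self, last_committed: int) -> list:
        """Release pages whose transno is <= the reported last committed."""
        released = []
        for transno in sorted(t for t in self._pinned if t <= last_committed):
            released.extend(self._pinned.pop(transno))
        return released

    def pinned_count(self) -> int:
        """Number of bulk writes still holding page references."""
        return len(self._pinned)


if __name__ == "__main__":
    client = BulkWritePages()
    client.on_bulk_write_reply(10, [1, 2])
    client.on_bulk_write_reply(12, [3])
    print(client.on_server_message(9))   # nothing committed yet
    print(client.on_server_message(10))  # pages of transno 10 released
    print(client.on_server_message(15))  # pages of transno 12 released
```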
If the OSS crashes before the journal commit occurs, then the intermediate data is lost. However, new OSS recovery functionality (introduced in the asynchronous journal commit feature), causes clients to replay their write requests and compensate for the missing disk updates by restoring the state of the file system.
Tip - An issue related to OSS recovery was fixed in Lustre 1.8.2. If you plan to use the asynchronous journal commit feature, we recommend using version 1.8.2 or later.
Note - When asynchronous journal commit parameters are tuned, the sync on lock cancel option is reset.
To enable asynchronous journal commit, set the sync_journal parameter to zero (sync_journal=0):
$ lctl set_param obdfilter.*.sync_journal=0
obdfilter.lol-OST0001.sync_journal=0
By default, asynchronous journal commit is disabled (sync_journal=1), which forces a journal flush after every bulk write.
When asynchronous journal commit is used, clients keep a page reference until the journal transaction commits. This can cause problems when a client receives a blocking callback, because pages need to be removed from the page cache, but they cannot be removed because of the extra page reference.
This problem is solved by forcing a journal flush on lock cancellation. When this happens, the client is granted the metadata blocks that have hit the disk, and it can safely release the page reference before processing the blocking callback. The parameter which controls this action is sync_on_lock_cancel, which can be set to the following values:
always: Always force a journal flush on lock cancellation
blocking: Force a journal flush only when the local cancellation is due to a blocking callback
never: Do not force any journal flush
Here is an example of sync_on_lock_cancel being set not to force a journal flush:
$ lctl get_param obdfilter.*.sync_on_lock_cancel
obdfilter.lol-OST0001.sync_on_lock_cancel=never
By default, sync_on_lock_cancel is set to never, because asynchronous journal commit is disabled by default.
When asynchronous journal commit is enabled (sync_journal=0), sync_on_lock_cancel is automatically set to always, if it was previously set to never.
Similarly, when asynchronous journal commit is disabled (sync_journal=1), sync_on_lock_cancel is forced to never.
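The coupling between sync_journal and sync_on_lock_cancel described above amounts to a small rule. The Python sketch below restates it; the function name is hypothetical and the code models only the described behavior, not the OSS implementation.

```python
VALID_MODES = ("always", "blocking", "never")


def sync_on_lock_cancel_after(sync_journal: int, current_mode: str) -> str:
    """Return the sync_on_lock_cancel value after sync_journal is set.

    sync_journal=0 enables asynchronous journal commit; a previous
    value of "never" is then switched to "always".
    sync_journal=1 disables it, and the value is forced to "never".
    """
    if sync_journal not in (0, 1):
        raise ValueError("sync_journal must be 0 or 1")
    if current_mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {current_mode!r}")
    if sync_journal == 0:
        return "always" if current_mode == "never" else current_mode
    return "never"


if __name__ == "__main__":
    print(sync_on_lock_cancel_after(0, "never"))     # always
    print(sync_on_lock_cancel_after(0, "blocking"))  # blocking
    print(sync_on_lock_cancel_after(1, "always"))    # never
```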
/proc/fs/ldiskfs/sda/mb_history
Multi-Block-Allocate (mballoc) enables Lustre to ask ext3 to allocate multiple blocks with a single request to the block allocator. Typically, an ext3 file system allocates only one block at a time. Each mballoc-enabled partition has this file. This is sample output:
pid   inode   goal        result      found  grps  cr  merge  tail  broken
2838  139267  17/12288/1  17/12288/1  1      0     0   M      1     8192
2838  139267  17/12289/1  17/12289/1  1      0     0   M      0     0
2838  139267  17/12290/1  17/12290/1  1      0     0   M      1     2
2838  24577   3/12288/1   3/12288/1   1      0     0   M      1     8192
2838  24578   3/12288/1   3/771/1     1      1     1          0     0
2838  32769   4/12288/1   4/12288/1   1      0     0   M      1     8192
2838  32770   4/12288/1   4/12289/1   13     1     1          0     0
2838  32771   4/12288/1   5/771/1     26     2     1          0     0
2838  32772   4/12288/1   5/896/1     31     2     1          1     128
2838  32773   4/12288/1   5/897/1     31     2     1          0     0
2828  32774   4/12288/1   5/898/1     31     2     1          1     2
2838  32775   4/12288/1   5/899/1     31     2     1          0     0
2838  32776   4/12288/1   5/900/1     31     2     1          1     4
2838  32777   4/12288/1   5/901/1     31     2     1          0     0
2838  32778   4/12288/1   5/902/1     31     2     1          1     2
The parameters are described below:
Most customers are probably interested in found/cr. If cr is 0 or 1 and found is less than 100, then mballoc is doing quite well.
Also, number-of-blocks-in-request (third number in the goal triple) tells you the number of blocks requested by the obdfilter. If the obdfilter is doing a lot of small requests (just a few blocks), then either the client is processing input/output to a lot of small files, or something may be wrong with the client (because it is better if the client sends large input/output requests). This can be investigated with the OSC rpc_stats or OST brw_stats mentioned above.
Number of groups scanned (grps column) should be small. If it reaches a few dozen often, then either your disk file system is pretty fragmented or mballoc is doing something wrong in the group selection part.
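The found/cr rule of thumb can be applied to individual mb_history lines. The Python sketch below assumes the column order shown in the sample output (pid, inode, goal, result, found, grps, cr, then optional trailing columns) and tolerates the line-continuation backslashes that appear in it. Both function names are hypothetical helpers for this illustration.

```python
def parse_mb_history_line(line: str) -> dict:
    """Extract the leading columns of one mb_history line."""
    fields = line.replace("\\", " ").split()
    if len(fields) < 7:
        raise ValueError(f"too few columns: {line!r}")
    goal = fields[2]
    return {
        "pid": int(fields[0]),
        "inode": int(fields[1]),
        "goal": goal,
        "result": fields[3],
        "found": int(fields[4]),
        "grps": int(fields[5]),
        "cr": int(fields[6]),
        # Third number of the goal triple: blocks in the request.
        "goal_blocks": int(goal.split("/")[2]),
    }


def allocation_looks_healthy(record: dict, found_limit: int = 100) -> bool:
    """Apply the rule of thumb: cr is 0 or 1 and found is under 100."""
    return record["cr"] in (0, 1) and record["found"] < found_limit


if __name__ == "__main__":
    sample = r"2838 32771 4/12288/1 5/771/1 26 2 1 \ 0 0"
    record = parse_mb_history_line(sample)
    print(record["found"], record["grps"], record["cr"])
    print(allocation_looks_healthy(record))
```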
Lustre version 1.6.1 and later includes mballoc3, which was built on top of mballoc2. By default, mballoc3 is enabled, and adds these features:
The following mballoc3 tunables are available:
The following tunables, providing more control over allocation policy, will be available in the next version:
/proc/fs/lustre/ldlm/ldlm/namespaces/<OSC name|MDC name>/lru_size
The lru_size parameter is used to control the number of client-side locks in an LRU queue. LRU size is dynamic, based on load. This optimizes the number of locks available to nodes that have different workloads (e.g., login/build nodes vs. compute nodes vs. backup nodes).
The total number of locks available is a function of the server’s RAM. The default limit is 50 locks/1 MB of RAM. If there is too much memory pressure, then the LRU size is shrunk. The number of locks on the server is limited to {number of OST/MDT on node} * {number of clients} * {client lru_size}.
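The two limits described above are simple products, and it can help to work them out for a planned configuration. This sketch only restates the arithmetic from the text; the function names and all input values (16 GB of RAM, 4 OSTs, 1000 clients, an lru_size of 400) are hypothetical.

```shell
#!/bin/sh
# lock_limit_by_ram: default limit of 50 locks per 1 MB of server RAM.
#   $1 = server RAM in MB
lock_limit_by_ram() {
    echo $(( $1 * 50 ))
}

# lock_limit_by_clients: {number of OST/MDT on node} * {number of clients} * {client lru_size}
#   $1 = targets on the node, $2 = number of clients, $3 = client lru_size
lock_limit_by_clients() {
    echo $(( $1 * $2 * $3 ))
}

lock_limit_by_ram 16384             # hypothetical 16 GB server: prints 819200
lock_limit_by_clients 4 1000 400    # hypothetical cluster: prints 1600000
```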
To clear the LRU on a single client, and as a result flush client cache, without changing the lru_size value:
$ lctl set_param ldlm.namespaces.<osc_name|mdc_name>.lru_size=clear
If you shrink the LRU size below the number of existing unused locks, then the unused locks are canceled immediately. Use echo clear to cancel all locks without changing the value.
| Note - Currently, the lru_size parameter can only be set temporarily with lctl set_param; it cannot be set permanently. |
To disable LRU sizing, run this command on the Lustre clients:
$ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))
Replace NR_CPU with the number of CPUs on the node.
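One way to fill in NR_CPU without counting CPUs by hand is sketched below. It assumes a Linux client where getconf _NPROCESSORS_ONLN reports the number of online CPUs; lru_size_for_cpus is a hypothetical helper name. The script prints the resulting lctl command instead of running it, so that the value can be reviewed first.

```shell
#!/bin/sh
# lru_size_for_cpus: hypothetical helper implementing NR_CPU*100 from the command above.
lru_size_for_cpus() {
    echo $(( $1 * 100 ))
}

# Number of online CPUs; fall back to 1 if getconf cannot report it.
NR_CPU=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Print the command for review rather than executing it.
echo "lctl set_param ldlm.namespaces.*osc*.lru_size=$(lru_size_for_cpus "$NR_CPU")"
```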
To determine the number of locks being granted:
$ lctl get_param ldlm.namespaces.*.pool.limit
In Lustre 1.8 and later, MDS and OSS thread counts (minimum and maximum) can be set via the {min,max}_thread_count tunable. For each service, a new /proc/fs/lustre/{service}/*/threads_{min,max,started} entry is created. The tunable, {service}.threads_{min,max,started}, can be used to set the minimum and maximum thread counts or get the current number of running threads for the following services.
# lctl {get,set}_param {service}.threads_{min,max,started}
# lctl conf_param {service}.threads_{min,max,started}
The following examples show how to set thread counts and get the number of running threads for the ost_io service.
# lctl get_param ost.OSS.ost_io.threads_started
The command output will be similar to this:
ost.OSS.ost_io.threads_started=128
# lctl get_param ost.OSS.ost_io.threads_max
ost.OSS.ost_io.threads_max=512
# lctl set_param ost.OSS.ost_io.threads_max=256
ost.OSS.ost_io.threads_max=256
# lctl get_param ost.OSS.ost_io.threads_max
The command output will be similar to this:
ost.OSS.ost_io.threads_max=256
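To see how close a service is to its thread limit, the two values can be combined into a percentage. This is an illustrative sketch that works on the name=value lines printed by lctl get_param; param_value and thread_usage are hypothetical helper names, and the inputs are copied from the sample output above.

```shell
#!/bin/sh
# param_value: print the value part of a "name=value" line.
param_value() {
    echo "${1#*=}"
}

# thread_usage: started threads as a percentage of threads_max (integer division).
#   $1 = threads_started line, $2 = threads_max line
thread_usage() {
    echo $(( $(param_value "$1") * 100 / $(param_value "$2") ))
}

thread_usage "ost.OSS.ost_io.threads_started=128" "ost.OSS.ost_io.threads_max=512"   # prints 25
thread_usage "ost.OSS.ost_io.threads_started=128" "ost.OSS.ost_io.threads_max=256"   # prints 50
```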
By default, Lustre generates a detailed log of all operations to aid in debugging. This logging adds overhead that can reduce the performance you achieve with Lustre, so it is useful to turn down the debug level to improve performance and to raise it again when you need to collect logs for debugging problems. The debugging mask can be set with "symbolic names" instead of the numerical values that were used in prior releases. The new symbolic format is shown in the examples below.
| Note - All of the commands below must be run as root; note the # nomenclature. |
To verify the current debug level, examine the sysctl that controls debugging:
# sysctl lnet.debug
lnet.debug = ioctl neterror warning error emerg ha config console
To turn off debugging (except for network error debugging), run this command on all concerned nodes:
# sysctl -w lnet.debug="neterror"
lnet.debug = neterror
To turn off debugging completely, run this command on all concerned nodes:
# sysctl -w lnet.debug=0
lnet.debug = 0
To set an appropriate debug level for a production environment, run:
# sysctl -w lnet.debug="warning dlmtrace error emerg ha rpctrace vfstrace"
lnet.debug = warning dlmtrace error emerg ha rpctrace vfstrace
The flags above collect enough high-level information to aid debugging, but they do not cause any serious performance impact.
To clear all flags and set new ones, run:
# sysctl -w lnet.debug="warning"
lnet.debug = warning
To add new flags to existing ones, prefix them with a "+":
# sysctl -w lnet.debug="+neterror +ha"
lnet.debug = +neterror +ha
# sysctl lnet.debug
lnet.debug = neterror warning ha
To remove flags, prefix them with a "-":
# sysctl -w lnet.debug="-ha"
lnet.debug = -ha
# sysctl lnet.debug
lnet.debug = neterror warning
You can verify and change the debug level using the /proc interface in Lustre. To use the flags with /proc, run:
# cat /proc/sys/lnet/debug
neterror warning
# echo "+ha" > /proc/sys/lnet/debug
# cat /proc/sys/lnet/debug
neterror warning ha
# echo "-warning" > /proc/sys/lnet/debug
# cat /proc/sys/lnet/debug
neterror ha
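The three forms of the debug value (unprefixed flags replace the mask, "+" adds a flag, "-" removes one) can be modelled in a few lines of shell. This is only a sketch of the behavior described above, useful for predicting the result of a change before applying it; apply_debug_change is a hypothetical name, numeric values such as 0 are not handled, and the kernel may print the flags in a different order than this model does.

```shell
#!/bin/sh
# apply_debug_change: model how a written value combines with the current mask.
#   $1 = current flags (space-separated), $2 = value being written
# The body runs in a subshell so its variables do not leak into the caller.
apply_debug_change() (
    result=$1
    replaced=0
    for word in $2; do
        case $word in
            +*) flag=${word#+}
                case " $result " in
                    *" $flag "*) ;;                    # already set, nothing to do
                    *) result="$result $flag" ;;
                esac ;;
            -*) flag=${word#-}
                kept=
                for f in $result; do
                    [ "$f" = "$flag" ] || kept="$kept $f"
                done
                result=$kept ;;
            *)  # the first unprefixed flag clears the existing mask
                if [ "$replaced" -eq 0 ]; then result=; replaced=1; fi
                result="$result $word" ;;
        esac
    done
    set -- $result      # re-split to normalize spacing
    echo "$*"
)

apply_debug_change "neterror warning" "+ha"          # prints: neterror warning ha
apply_debug_change "neterror warning ha" "-warning"  # prints: neterror ha
apply_debug_change "neterror warning ha" "warning"   # prints: warning
```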
/proc/sys/lnet/subsystem_debug
This controls the debug logs for subsystems (see S_* definitions).
This indicates the location where debugging symbols should be stored for gdb. The default is set to /r/tmp/lustre-log-localhost.localdomain.
These values can also be set via sysctl -w lnet.debug={value}
| Note - The above entries only exist when Lustre has already been loaded. |
This causes Lustre to call "panic" when it detects an internal problem (an LBUG); panic crashes the node. This is particularly useful when a kernel crash dump utility is configured. The crash dump is triggered when the internal inconsistency is detected by Lustre.
This allows you to specify the path to the binary which will be invoked when an LBUG is encountered. This binary is called with four parameters. The first one is the string "LBUG". The second one is the file where the LBUG occurred. The third one is the function name. The fourth one is the line number in the file.
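As an illustration of the four-parameter convention, the sketch below shows the core of what such a binary might do with its arguments. It is a hypothetical handler, not one shipped with Lustre; format_lbug and the demo values are made up, and a real handler would typically forward the message to the system log or a monitoring system.

```shell
#!/bin/sh
# format_lbug: hypothetical LBUG upcall logic.
# Argument order follows the description above:
#   $1 = the string "LBUG", $2 = source file, $3 = function name, $4 = line number
format_lbug() {
    if [ "$#" -ne 4 ] || [ "$1" != "LBUG" ]; then
        echo "unexpected upcall arguments: $*"
        return 1
    fi
    echo "LBUG in $3() at $2:$4"
}

# Demo with made-up values.
format_lbug LBUG example.c example_function 123
# prints: LBUG in example_function() at example.c:123
```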
Some OBD devices maintain a count of the number of RPC events that they process. Sometimes these events are more specific to operations of the device, like llite, than actual raw RPC counts.
$ find /proc/fs/lustre/ -name stats
/proc/fs/lustre/osc/lustre-OST0001-osc-ce63ca00/stats
/proc/fs/lustre/osc/lustre-OST0000-osc-ce63ca00/stats
/proc/fs/lustre/osc/lustre-OST0001-osc/stats
/proc/fs/lustre/osc/lustre-OST0000-osc/stats
/proc/fs/lustre/mdt/MDS/mds_readpage/stats
/proc/fs/lustre/mdt/MDS/mds_setattr/stats
/proc/fs/lustre/mdt/MDS/mds/stats
/proc/fs/lustre/mds/lustre-MDT0000/exports/ab206805-0630-6647-8543-d24265c91a3d/stats
/proc/fs/lustre/mds/lustre-MDT0000/exports/08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0/stats
/proc/fs/lustre/mds/lustre-MDT0000/stats
/proc/fs/lustre/ldlm/services/ldlm_canceld/stats
/proc/fs/lustre/ldlm/services/ldlm_cbd/stats
/proc/fs/lustre/llite/lustre-ce63ca00/stats
The OST .../stats files can be used to track client statistics (client activity) for each OST. It is possible to get a periodic dump of values from these files (for example, every 10 seconds) that shows the RPC rates (similar to iostat) by using the llstat.pl tool:
# llstat /proc/fs/lustre/osc/lustre-OST0000-osc/stats
/usr/bin/llstat: STATS on 09/14/07 /proc/fs/lustre/osc/lustre-OST0000-osc/stats on 192.168.10.34@tcp
snapshot_time 1189732762.835363
ost_create    1
ost_get_info  1
ost_connect   1
ost_set_info  1
obd_ping      212
To clear the statistics, give the -c option to llstat.pl. To specify how frequently the statistics should be cleared (in seconds), use an integer for the -i option. This is sample output with the -c and -i10 options used, providing statistics every 10 seconds:
$ llstat -c -i10 /proc/fs/lustre/ost/OSS/ost_io/stats
/usr/bin/llstat: STATS on 06/06/07 /proc/fs/lustre/ost/OSS/ost_io/stats on 192.168.16.35@tcp
snapshot_time 1181074093.276072

/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074103.284895
Name          Cur.Count Cur.Rate #Events Unit     last      min    avg        max     stddev
req_waittime  8         0        8       [usec]   2078      34     259.75     868     317.49
req_qdepth    8         0        8       [reqs]   1         0      0.12       1       0.35
req_active    8         0        8       [reqs]   11        1      1.38       2       0.52
reqbuf_avail  8         0        8       [bufs]   511       63     63.88      64      0.35
ost_write     8         0        8       [bytes]  1697677   72914  212209.62  387579  91874.29

/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074113.290180
Name          Cur.Count Cur.Rate #Events Unit     last      min    avg        max     stddev
req_waittime  31        3        39      [usec]   30011     34     822.79     12245   2047.71
req_qdepth    31        3        39      [reqs]   0         0      0.03       1       0.16
req_active    31        3        39      [reqs]   58        1      1.77       3       0.74
reqbuf_avail  31        3        39      [bufs]   1977      63     63.79      64      0.41
ost_write     30        3        38      [bytes]  10284679  15019  315325.16  910694  197776.51

/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074123.325560
Name          Cur.Count Cur.Rate #Events Unit     last      min    avg        max     stddev
req_waittime  21        2        60      [usec]   14970     34     784.32     12245   1878.66
req_qdepth    21        2        60      [reqs]   0         0      0.02       1       0.13
req_active    21        2        60      [reqs]   33        1      1.70       3       0.70
reqbuf_avail  21        2        60      [bufs]   1341      63     63.82      64      0.39
ost_write     21        2        59      [bytes]  7648424   15019  332725.08  910694  180397.87
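When watching a single statistic over time, it can be easier to reduce this output to one line per interval. The sketch below extracts the average request wait time for each snapshot. It assumes that each statistic appears on one line with the ten columns named in the header (Name, Cur.Count, Cur.Rate, #Events, Unit, last, min, avg, max, stddev) and that each interval starts with a "<stats file> @ <timestamp>" line; waittime_avg is a hypothetical helper name.

```shell
#!/bin/sh
# waittime_avg: print "<timestamp> <avg> <unit>" for req_waittime in each interval.
waittime_avg() {
    awk '
        $2 == "@" { stamp = $3 }                          # "<stats file> @ <timestamp>"
        $1 == "req_waittime" && NF == 10 { print stamp, $8, $5 }
    '
}

# Demo on two intervals taken from the sample output above.
waittime_avg <<'EOF'
/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074103.284895
Name Cur.Count Cur.Rate #Events Unit last min avg max stddev
req_waittime 8 0 8 [usec] 2078 34 259.75 868 317.49
req_qdepth 8 0 8 [reqs] 1 0 0.12 1 0.35
/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074113.290180
Name Cur.Count Cur.Rate #Events Unit last min avg max stddev
req_waittime 31 3 39 [usec] 30011 34 822.79 12245 2047.71
EOF
# prints:
# 1181074103.284895 259.75 [usec]
# 1181074113.290180 822.79 [usec]
```

If the output of llstat on your system matches these assumptions, it could be piped straight into the function.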
The events common to all services are:
Some service-specific events of interest are:
|
Time it takes to enqueue a lock (this includes file open on the MDS) |
|
|
Time it takes to process an MDS modification record (includes create, mkdir, unlink, rename and setattr) |
The llobdstat utility displays statistics for the activity of a specific OST on an OSS:
/proc/fs/lustre/<ost_name>/stats
Use llobdstat to monitor changes in statistics over time and I/O rates for all OSTs on a server. The llobdstat utility provides utilization graphs for selectable time scales.
# llobdstat <ost_name> [<interval>]
# llobdstat lustre-OST0000 2
The MDT .../stats files can be used to track MDT statistics for the MDS. Here is sample output for an MDT stats file:
# cat /proc/fs/lustre/mds/*-MDT0000/stats
snapshot_time   1244832003.676892 secs.usecs
open            2 samples [reqs]
close           1 samples [reqs]
getxattr        3 samples [reqs]
process_config  1 samples [reqs]
connect         2 samples [reqs]
disconnect      2 samples [reqs]
statfs          3 samples [reqs]
setattr         1 samples [reqs]
getattr         3 samples [reqs]
llog_init       6 samples [reqs]
notify          16 samples [reqs]
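A stats file like this one is easy to rank by activity. The following sketch lists operations with the highest sample counts first. It assumes lines of the form "<name> <count> samples [<unit>]" as in the sample above and skips anything else, such as the snapshot_time line; top_ops is a hypothetical helper name.

```shell
#!/bin/sh
# top_ops: list "<count> <name>" pairs from a stats file, busiest operation first.
top_ops() {
    awk '$3 == "samples" { print $2, $1 }' | sort -k1,1nr -k2,2
}

# Demo on a few lines copied from the sample output above.
top_ops <<'EOF'
snapshot_time 1244832003.676892 secs.usecs
open 2 samples [reqs]
close 1 samples [reqs]
llog_init 6 samples [reqs]
notify 16 samples [reqs]
EOF
# prints:
# 16 notify
# 6 llog_init
# 2 open
# 1 close
```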
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.