| C H A P T E R 23 |
|
Lustre Troubleshooting |
This chapter provides information to troubleshoot Lustre, submit a Lustre bug, and Lustre performance tips. It includes the following sections:
Several resources are available to help use troubleshoot Lustre. This section describes error numbers, error messages and logs.
Error numbers for Lustre come from the Linux errno.h, and are located in /usr/include/asm/errno.h. Lustre does not use all of the available Linux error numbers. The exact meaning of an error number depends on where it is used. Here is a summary of the basic errors that Lustre users may encounter.
As Lustre code runs on the kernel, single-digit error codes display to the application; these error codes are an indication of the problem. Refer to the kernel console log (dmesg) for all recent kernel messages from that node. On the node, /var/log/messages holds a log of all messages for at least the past day.
The error message initiates with "LustreError" in the console log and provides a short description of:
Lustre logs are dumped to /proc/sys/lnet/debug_path.
Collect the first group of messages related to a problem, and any messages that precede "LBUG" or "assertion failure" errors. Messages that mention server nodes (OST or MDS) are specific to that server; you must collect similar messages from the relevant server console logs.
Another Lustre debug log holds information for Lustre action for a short period of time which, in turn, depends on the processes on the node to use Lustre. Use the following command to extract debug logs on each of the nodes, run
$ lctl dk <filename>
| Note - LBUG freezes the thread to allow capture of the panic stack. A system reboot is needed to clear the thread. |
If, after troubleshooting your Lustre system, you cannot resolve the problem, consider reporting a Lustre bug. To do this, you will need an account on Bugzilla (defect tracking system used for Lustre), and download the Lustre diagnostics tool to run and capture the diagnostics output.
| Note - Create a Lustre Bugzilla account. Download the Lustre diagnostics tool and install it on the affected nodes. Make sure you are using the most recent version of the diagnostics tool. |
1. Once you have a Lustre Bugzilla account, open a new bug and describe the problem you having.
2. Run the Lustre diagnostics tool, using one of the following commands:
# lustre-diagnostics -t <bugzilla bug #> # lustre-diagnostics.
In case you need to use it later, the output of the bug is sent directly to the terminal. Normal file redirection can be used to send the output to a file which you can manually attach to this bug, if necessary.
This section describes common issues encountered with Lustre, as well as tips to improve Lustre performance.
One of the most common problems encountered in a Lustre environment is when an OST becomes unavailable, because of a network partition, OSS node crash, etc. When this happens, the OST’s clients pause and wait for the OST to become available again, either on the primary OSS or a failover OSS. When the OST comes back online, Lustre starts a recovery process to enable clients to reconnect to the OST. Lustre servers put a limit on the time they will wait in recovery for clients to reconnect[1].
During recovery, clients reconnect and replay their requests, serially, in the same order they were done originally.[2] Periodically, a progress message prints to the log, stating how_many/expected clients have reconnected. If the recovery is aborted, this log shows how many clients managed to reconnect. When all clients have completed recovery, or if the recovery timeout is reached, the recovery period ends and the OST resumes normal request processing.
If some clients fail to replay their requests during the recovery period, this will not stop the recovery from completing. You may have a situation where the OST recovers, but some clients are not able to participate in recovery (e.g. network problems or client failure), so they are evicted and their requests are not replayed. This would result in any operations on the evicted clients failing, including in-progress writes, which would cause cached writes to be lost. This is a normal outcome; the recovery cannot wait indefinitely, or the file system would be hung any time a client failed. The lost transactions are an unfortunate result of the recovery process.
| Note - Lustre 1.8 introduces the version-based recovery (VBR) feature, which enables a failed client to be ''skipped'', so remaining clients can replay their requests, resulting in a more successful recovery from a downed OST. For more information about the VBR feature, see Version-based Recovery.
In Lustre 1.6 and earlier, the success of the recovery process was limited by uncommitted client requests that are unable to be replayed. Because clients attempted to replay their requests to the OST and MDT in serial order, a client that could not replay its requests causes the recovery stream to stop, and left the remaining clients without an opportunity to reconnect and replay their requests. |
Typically, the performance of write operations on a Lustre cluster is better than read operations. When doing writes, all clients are sending write RPCs asynchronously. The RPCs are allocated, and written to disk in the order they arrive. In many cases, this allows the back-end storage to aggregate writes efficiently.
In the case of read operations, the reads from clients may come in a different order and need a lot of seeking to get read from the disk. This noticeably hampers the read throughput.
Currently, there is no readahead on the OSTs themselves, though the clients do readahead. If there are lots of clients doing reads it would not be possible to do any readahead in any case because of memory consumption (consider that even a single RPC (1 MB) readahead for 1000 clients would consume 1 GB of RAM).
For file systems that use socklnd (TCP, Ethernet) as interconnect, there is also additional CPU overhead because the client cannot receive data without copying it from the network buffers. In the write case, the client CAN send data without the additional data copy. This means that the client is more likely to become CPU-bound during reads than writes.
If the OSS fails to find an object or finds a damaged object, this message appears:
OST object missing or damaged (OST "ost1", object 98148, error -2)
If the reported error is -2 (-ENOENT, or "No such file or directory"), then the object is missing. This can occur either because the MDS and OST are out of sync, or because an OST object was corrupted and deleted.
If you have recovered the file system from a disk failure by using e2fsck, then unrecoverable objects may have been deleted or moved to /lost+found on the raw OST partition. Because files on the MDS still reference these objects, attempts to access them produce this error.
If you have recovered a backup of the raw MDS or OST partition, then the restored partition is very likely to be out of sync with the rest of your cluster. No matter which server partition you restored from backup, files on the MDS may reference objects which no longer exist (or did not exist when the backup was taken); accessing those files produces this error.
If neither of those descriptions is applicable to your situation, then it is possible that you have discovered a programming error that allowed the servers to get out of sync. Please report this condition to the Lustre group, and we will investigate.
If the reported error is anything else (such as -5, "I/O error"), it likely indicates a storage failure. The low-level file system returns this error if it is unable to read from the storage device.
If the reported error is -2, you can consider checking in /lost+found on your raw OST device, to see if the missing object is there. However, it is likely that this object is lost forever, and that the file that references the object is now partially or completely lost. Restore this file from backup, or salvage what you can and delete it.
If the reported error is anything else, then you should immediately inspect this server for storage problems.
If the SCSI devices are inaccessible to Lustre at the block device level, then ext3 remounts the device read-only to prevent file system corruption. This is a normal behavior. The status in /proc/fs/lustre/healthcheck also shows "not healthy" on the affected nodes.
To determine what caused the "not healthy" condition:
To recover from this problem, you must restart Lustre services using these file systems. There is no other way to know that the I/O made it to disk, and the state of the cache may be inconsistent with what is on disk.
If an OST is missing for any reason, you may need to know what files are affected. Although an OST is missing, the files system should be operational. From any mounted client node, generate a list of files that reside on the affected OST. It is advisable to mark the missing OST as ’unavailable’ so clients and the MDS do not time out trying to contact it.
1. Generate a list of devices and determine the OST’s device number. Run:
$ lctl dl
The lctl dl command output lists the device name and number, along with the device UUID and the number of references on the device.
2. Deactivate the OST (on the OSS at the MDS). Run:
$ lctl --device <OST device name or number> deactivate
The OST device number or device name is generated by the lctl dl command.
The deactivate command prevents clients from creating new objects on the specified OST, although you can still access the OST for reading.
| Note - If the OST later becomes available it needs to be reactivated, run:
# lctl --device <OST device name or number> activate |
3. Determine all the files that are striped over the missing OST, run:
# lfs getstripe -r -O {OST_UUID} /mountpoint
This returns a simple list of filenames from the affected file system.
| Note - For releases before Lustre 1.8.2, use the lfs find command instead. Run:
# lfs find -r -O {OST_UUID} /mountpoint |
4. If necessary, you can read the valid parts of a striped file, run:
# dd if=filename of=new_filename bs=4k conv=sync,noerror
5. You can delete these files with the unlink or munlink command.
# unlink|munlink filename {filename ...}
6. If you need to know, specifically, which parts of the file are missing data, then you first need to determine the file layout (striping pattern), which includes the index of the missing OST). Run:
# lfs getstripe -v {filename}
7. Use this computation is to determine which offsets in the file are affected: [(C*N + X)*S, (C*N + X)*S + S - 1], N = { 0, 1, 2, ...}
X = index of bad OST for this file
For example, for a 2 stripe file, stripe size = 1M, the bad OST is at index 0, and you have holes in the file at: [(2*N + 0)*1M, (2*N + 0)*1M + 1M - 1], N = { 0, 1, 2, ...}
If the file system cannot be mounted, currently there is no way that parses metadata directly from an MDS. If the bad OST does not start, options to mount the file system are to provide a loop device OST in its place or replace it with a newly-formatted OST. In that case, the missing objects are created and are read as zero-filled.
In releases prior to Lustre 1.8, you could not mount a file system with a missing OST.
A Lustre environment where an application writes small file chunks from many clients to a single file will result in bad I/O performance. To improve Lustre’s performance with small files:
These are the default striping settings:
To change the default striping information.
$ lctl conf_param testfs-MDT0000.lov.stripesize=4M
$ mdt/cli> cat /proc/fs/lustre/lov/testfs-{mdt|cli}lov/stripe*
If you want to erase a file system, run this command on your targets:
$ "mkfs.lustre -reformat"
If you are using a separate MGS and want to keep other file systems defined on that MGS, then set the writeconf flag on the MDT for that file system. The writeconf flag causes the configuration logs to be erased; they are regenerated the next time the servers start.
To set the writeconf flag on the MDT:
1. Unmount all clients/servers using this file system, run:
$ umount /mnt/lustre
2. Erase the file system and, presumably, replace it with another file system, run:
$ mkfs.lustre -reformat --fsname spfs --mdt --mgs /dev/sda
3. If you have a separate MGS (that you do not want to reformat), then add the "writeconf" flag to mkfs.lustre on the MDT, run:
$ mkfs.lustre --reformat --writeconf -fsname spfs --mdt \ --mgs /dev/sda
Each OST contains a LAST_ID file, which holds the last object (pre-)created by the MDS[3]. The MDT contains a lov_objid file, with values that represent the last object the MDS has allocated to a file.
During normal operation, the MDT keeps some pre-created (but unallocated) objects on the OST, and the relationship between LAST_ID and lov_objid should be LAST_ID <= lov_objid. Any difference in the file values results in objects being created on the OST when it next connects to the MDS. These objects are never actually allocated to a file, since they are of 0 length (empty), but they do no harm. Creating empty objects enables the OST to catch up to the MDS, so normal operations resume.
However, in the case where lov_objid < LAST_ID, bad things can happen as the MDS is not aware of objects that have already been allocated on the OST, and it reallocates them to new files, overwriting their existing contents.
Here is the rule to avoid this scenario:
LAST_ID >= lov_objid and LAST_ID == last_physical_object and lov_objid >= last_used_object
Although the lov_objid value should be equal to the last_used_object value, the above rule suffices to keep Lustre happy at the expense of a few leaked objects.
In situations where there is on-disk corruption of the OST, for example caused by running with write cache enabled on the disks, the LAST_ID value may become inconsistent and result in a message similar to:
"filter_precreate()) HOME-OST0003: Serious error: objid 3478673 already exists; is this filesystem corrupt?"
A related situation may happen if there is a significant discrepancy between the record of previously-created objects on the OST and the previously-allocated objects on the MDS, for example if the MDS has been corrupted, or restored from backup, which may cause significant data loss if left unchecked. This produces a message like:
"HOME-OST0003: ignoring bogus orphan destroy request: obdid 3438673 last_id 3478673"
To recover from this situation, determine and set a reasonable LAST_ID value.
| Note - The file system must be stopped on all servers before performing this procedure. |
For hex < -> decimal translations:
(gdb) p /x 15028 $2 = 0x3ab4
echo "obase=16; 15028" | bc
1. Determine a reasonable value for the LAST_ID file. Check on the MDS:
# mount -t ldiskfs /dev/<mdsdev> /mnt/mds # od -Ax -td8 /mnt/mds/lov_objid
There is one entry for each OST, in OST index order. This is what the MDS thinks is the last in-use object.
2. Determine the OST index for this OST.
# od -Ax -td4 /mnt/ost/last_rcvd
It will have it at offset 0x8c.
3. Check on the OST. Use debugfs to check the LAST_ID value:
debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/XXX ; od -Ax -td8 /tmp/LAST_ID"
4. Check the objects on the OST:
mount -rt ldiskfs /dev/{ostdev} /mnt/ost
# note the ls below is a number one and not a letter L
ls -1s /mnt/ost/O/0/d* | grep -v [a-z] |
sort -k2 -n > /tmp/objects.{diskname}
tail -30 /tmp/objects.{diskname}
This shows you the OST state. There may be some pre-created orphans. Check for zero-length objects. Any zero-length objects with IDs higher than LAST_ID should be deleted. New objects will be pre-created.
If the OST LAST_ID value matches that for the objects existing on the OST, then it is possible the lov_objid file on the MDS is incorrect. Delete the lov_objid file on the MDS and it will be re-created from the LAST_ID on the OSTs.
If you determine the LAST_ID file on the OST is incorrect (that is, it does not match what objects exist, does not match the MDS lov_objid value), then you have decided on a proper value for LAST_ID.
Once you have decided on a proper value for LAST_ID, use this repair procedure.
mount -t ldiskfs /dev/{ostdev} /mnt/ost
od -Ax -td8 /mnt/ost/O/0/LAST_ID
3. Be very safe, only work on backups:
cp /mnt/ost/O/0/LAST_ID /tmp/LAST_ID
xxd /tmp/LAST_ID /tmp/LAST_ID.asc
vi /tmp/LAST_ID.asc
xxd -r /tmp/LAST_ID.asc /tmp/LAST_ID.new
od -Ax -td8 /tmp/LAST_ID.new
cp /tmp/LAST_ID.new /mnt/ost/O/0/LAST_ID
umount /mnt/ost
All current Lustre installations run the ext3 file system internally on service nodes. By default, the ext3 reserves 5% of the disk space for the root user. In order to reclaim this space, run the following command on your OSSs:
tune2fs [-m reserved_blocks_percent] [device]
You do not need to shut down Lustre before running this command or restart it afterwards.
Depending on your cluster size and workload, you may want to connect a SAN with Lustre. Before making this connection, consider the following:
During startup, Lustre may report a bind: Address already in use error and reject to start the operation. This is caused by a portmap service (often NFS locking) which starts before Lustre and binds to the default port 988. You must have port 988 open from firewall or IP tables for incoming connections on the client, OSS, and MDS nodes. LNET will create three outgoing connections on available, reserved ports to each client-server pair, starting with 1023, 1022 and 1021.
Unfortunately, you cannot set sunprc to avoid port 988. If you receive this error, do the following:
options lnet accept_port=988
The OST file system is an ldiskfs file system, which is simply a normal ext3 file system plus some performance enhancements--making if very close, in fact, to ext4. To copy the contents of an existing OST to a new OST (or an old MDS to a new MDS), use one of these methods:
mount -t ldiskfs /dev/old /mnt/ost_old mount -t ldiskfs /dev/new /mnt/ost_new rsync -aSv /mnt/ost_old/ /mnt/ost_new # note trailing slash on ost_old/
rsync -aSvz /mnt/ost_old/ new_ost_node:/mnt/ost_new
cd /mnt/mds_old; getfattr -R -e base64 -d . > /tmp/mdsea; \<copy all MDS files as above>; cd /mnt/mds_new; setfattr \--restore=/tmp/mdsea
Linux error -28 is -ENOSPC and indicates that the file system has run out of space. You need to create larger file systems for the OSTs. Normally, Lustre reports this to your application. If the application is checking the return code from its function calls, then it decodes it into a textual error message like "No space left on device." It also appears in the system log messages.
During a "write" or "sync" operation, the file in question resides on an OST which is already full. New files that are created do not use full OSTs, but existing files continue to use the same OST. You need to expand the specific OST or copy/stripe the file over to an OST with more space available. You encounter this situation occasionally when creating files, which may indicate that your MDS has run out of inodes and needs to be enlarged. To check this, use df -i
You may also receive this error if the MDS runs out of free blocks. Since the output of df is an aggregate of the data from the MDS and all of the OSTs, it may not show that the file system is full when one of the OSTs has run out of space. To determine which OST or MDS is running out of space, check the free space and inodes on a client:
grep '[0-9]' /proc/fs/lustre/osc/*/kbytes{free,avail,total}
grep '[0-9]' /proc/fs/lustre/osc/*/files{free,total}
grep '[0-9]' /proc/fs/lustre/mdc/*/kbytes{free,avail,total}
grep '[0-9]' /proc/fs/lustre/mdc/*/files{free,total}
You can find other numeric error codes in /usr/include/asm/errno.h along with their short name and text description.
In some cases, a server node triggers a watchdog timer and this causes a process stack to be dumped to the console along with a Lustre kernel debug log being dumped into /tmp (by default). The presence of a watchdog timer does NOT mean that the thread OOPSed, but rather that it is taking longer time than expected to complete a given operation. In some cases, this situation is expected.
For example, if a RAID rebuild is really slowing down I/O on an OST, it might trigger watchdog timers to trip. But another message follows shortly thereafter, indicating that the thread in question has completed processing (after some number of seconds). Generally, this indicates a transient problem. In other cases, it may legitimately signal that a thread is stuck because of a software error (lock inversion, for example).
Lustre: 0:0:(watchdog.c:122:lcw_cb())
The above message indicates that the watchdog is active for pid 933:
Lustre: 0:0:(linux-debug.c:132:portals_debug_dumpstack())
933 ll_ost_25 D F896071A 0 933 1 934 932 (L-TLB) f6d87c60 00000046 00000000 f896071a f8def7cc 00002710 00001822 2da48cae 0008cf1a f6d7c220 f6d7c3d0 f6d86000 f3529648 f6d87cc4 f3529640 f8961d3d 00000010 f6d87c9c ca65a13c 00001fff 00000001 00000001 00000000 00000001
filter_do_bio+0x3dd/0xb90 [obdfilter] default_wake_function+0x0/0x20 filter_direct_io+0x2fb/0x990 [obdfilter] filter_preprw_read+0x5c5/0xe00 [obdfilter] lustre_swab_niobuf_remote+0x0/0x30 [ptlrpc] ost_brw_read+0x18df/0x2400 [ost] ost_handle+0x14c2/0x42d0 [ost] ptlrpc_server_handle_request+0x870/0x10b0 [ptlrpc] ptlrpc_main+0x42e/0x7c0 [ptlrpc]
If you come across timeouts or hangs on the initial setup of your Lustre system, verify that name resolution for servers and clients is working correctly. Some distributions configure /etc/hosts sts so the name of the local machine (as reported by the 'hostname' command) is mapped to local host (127.0.0.1) instead of a proper IP address.
This might produce this error:
LustreError:(ldlm_handle_cancel()) received cancel for unknown lock cookie 0xe74021a4b41b954e from nid 0x7f000001 (0:127.0.0.1)
Each time Lustre changes the state of the disk file system, it records a unique transaction number. Occasionally, when committing these transactions to the disk, the last committed transaction number displays to other nodes in the cluster to assist the recovery. Therefore, the promised transactions remain absolutely safe on the disappeared disk.
If you know the exact reason for the error, then it is safe to proceed with no further action. If you do not know the reason, then this is a serious issue and you should explore it with your disk vendor.
If the error occurs during failover, examine your disk cache settings. If it occurs after a restart without failover, try to determine how the disk can report that a write succeeded, then lose the Data Device corruption or Disk Errors.
The slow start_page_write message appears when the operation takes an extremely long time to allocate a batch of memory pages. Use these pages to receive network traffic first, and then write to disk.
It is possible to do multi-client O_APPEND writes to a single file, but there are few drawbacks that may make this a sub-optimal solution. These drawbacks are:
When Lustre starts, the Lustre file system needs to read in data from the disk. For the very first mdsrate run after the reboot, the MDS needs to wait on all the OSTs for object pre-creation. This causes a slowdown to occur when Lustre starts up.
After the file system has been running for some time, it contains more data in cache and hence, the variability caused by reading critical metadata from disk is mostly eliminated. The file system now reads data from the cache.
When planning the hardware for an OSS node, consider the memory usage of several components in the Lustre system. If insufficient memory is available, an ‘out of memory’ message can be logged.
During normal operation, several conditions indicate insufficient RAM on a server node:
For information on determining the MDS memory and OSS memory requirements, see Memory Requirements.
The number of OSTs required for sustained throughput depends on your hardware configuration. If you are adding an OST that is identical to an existing OST, you can use the speed of the existing OST to determine how many more OSTs to add.
Keep in mind that adding OSTs affects resource limitations, such as bus bandwidth in the OSS and network bandwidth of the OSS interconnect. You need to understand the performance capability of all system components to develop an overall design that meets your performance goals and scales to future system requirements.
| Note - For best performance, put the MGS and MDT on separate devices. |
Some SCSI drivers default to a maximum I/O size that is too small for good Lustre performance. we have fixed quite a few drivers, but you may still find that some drivers give unsatisfactory performance with Lustre. As the default value is hard-coded, you need to recompile the drivers to change their default. On the other hand, some drivers may have a wrong default set.
If you suspect bad I/O performance and an analysis of Lustre statistics indicates that I/O is not 1 MB, check /sys/block/<device>/queue/max_sectors_kb. If the max_sectors_kb value is less than 1024, set it to at least 1024 to improve performance. If changing max_sectors_kb does not change the I/O size as reported by Lustre, you may want to examine the SCSI driver code.
Use this procedure to identify the file containing a given object on a given OST.
1. On the OST (as root), run debugfs to display the FID[4] of the file associated with the object.
For example, if the object is 34976 on /dev/lustre/ost_test2, the debug command is:
# debugfs -c -R "stat /O/0/d$((34976 %32))/34976" /dev/lustre/ost_test2
debugfs 1.41.5.sun2 (23-Apr-2009) /dev/lustre/ost_test2: catastrophic mode - not reading inode or group bitmaps Inode: 352365 Type: regular Mode: 0666 Flags: 0x80000 Generation: 1574463214 Version: 0xea020000:00000000 User: 500 Group: 500 Size: 260096 File ACL: 0 Directory ACL: 0 Links: 1 Blockcount: 512 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009 atime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009 mtime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009 crtime: 0x4a216b3c:975870dc -- Sat May 30 13:22:04 2009 Size of extra inode fields: 24 Extended attributes stored in inode body: fid = "e2 00 11 00 00 00 00 00 25 43 c1 87 00 00 00 00 a0 88 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (32) BLOCKS: (0-63):47968-48031 TOTAL: 64
2. Note the FID’s EA and apply it to the osd_inode_id mapping.
In this example, the FID’s EA is:
e2001100000000002543c18700000000a0880000000000000000000000000000
struct osd_inode_id {
__u64 oii_ino; /* inode number */
__u32 oii_gen; /* inode generation */
__u32 oii_pad; /* alignment padding */
};
After swapping, you get an inode number of 0x001100e2 and generation of 0.
3. On the MDT (as root), use debugfs to find the file associated with the inode.
# debugfs -c -R "ncheck 0x001100e2" /dev/lustre/mdt_test
debugfs 1.41.5.sun2 (23-Apr-2009) /dev/lustre/mdt_test: catastrophic mode - not reading inode or group bitmaps Inode Pathname 1114338 /ROOT/brian-laptop-guest/clients/client11/~dmtmp/PWRPNT/ZD16.BMP
The command lists the inode and pathname associated with the object.
| Note - Debugfs' ''ncheck'' is a brute-force search that may take a long time to complete. |
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.