CHAPTER 19
Lustre Recovery
This chapter describes how to recover Lustre, and includes the following sections:
Lustre's recovery support is responsible for dealing with node or network failure and returning the cluster to a consistent, performant state. Because Lustre allows servers to perform asynchronous update operations to the on-disk file system (i.e., the server can reply without waiting for the update to synchronously commit to disk), the clients may have state in memory that is newer than what the server can recover from disk after a crash.
Several types of failure can cause recovery to occur: client failure, MDS failure (and failover), OST failure (and failover), and transient network partition.
Currently, all Lustre failure and recovery operations are based on the concept of connection failure; all imports or exports associated with a given connection are considered to fail if any of them fail.
For information on Lustre recovery, see Metadata Replay. For information on recovering from a corrupt file system, see Recovering from Errors or Corruption on a Backing File System. For information on resolving orphaned objects, a common issue after recovery, see Working with Orphaned Objects.
Lustre's support for recovery from client failure is based on revoking the failed client's locks and releasing its other resources, so that surviving clients can continue their work uninterrupted. If a client fails to respond in a timely manner to a blocking lock callback from the Distributed Lock Manager (DLM), or fails to communicate with the server for a long period of time (i.e., no pings), the client is forcibly removed from the cluster (evicted). This enables other clients to acquire locks blocked by the dead client's locks, and also frees resources (file handles, export data) associated with that client. Note that this scenario can be caused by a network partition as well as by an actual client node system failure. Network Partition describes this case in more detail.
If a client is not behaving properly from the server's point of view, it will be evicted. This ensures that the whole file system can continue to function in the presence of failed or misbehaving clients. An evicted client must invalidate all locks, which in turn, results in all cached inodes becoming invalidated and all cached data being flushed.
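The eviction rule for silent clients can be illustrated with a minimal Python sketch. It is not Lustre code: the class names, the `ping_timeout` value and the in-memory export table are assumptions made for illustration, and lock callbacks are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Export:
    """Server-side state for one connected client (illustrative only)."""
    last_heard: float                       # time of the last ping or RPC
    locks: set = field(default_factory=set)


class EvictionTracker:
    def __init__(self, ping_timeout: float):
        # Hypothetical timeout in seconds; not a Lustre default.
        self.ping_timeout = ping_timeout
        self.exports = {}                   # client UUID -> Export

    def heard_from(self, uuid: str, now: float) -> None:
        """Record that a ping or request arrived from this client."""
        export = self.exports.setdefault(uuid, Export(last_heard=now))
        export.last_heard = now

    def evict_silent_clients(self, now: float) -> list:
        """Evict every client that has been silent longer than the timeout."""
        evicted = [uuid for uuid, export in self.exports.items()
                   if now - export.last_heard > self.ping_timeout]
        for uuid in evicted:
            # Dropping the export releases its locks and other resources,
            # so clients blocked on those locks can proceed.
            del self.exports[uuid]
        return evicted
```

In this sketch a client that pinged 15 seconds ago survives a 30-second timeout, while one that has been silent for 40 seconds is evicted and its export state is discarded.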
Reasons why a client might be evicted:
Highly-available (HA) Lustre operation requires that the metadata server have a peer configured for failover, including the use of a shared storage device for the MDT backing file system. The actual mechanism for detecting peer failure, power off (STONITH) of the failed peer (to prevent it from continuing to modify the shared disk), and takeover of the Lustre MDS service on the backup node depends on external HA software such as Heartbeat. It is also possible to have MDS recovery with a single MDS node. In this case, recovery will take as long as is needed for the single MDS to be restarted.
When clients detect an MDS failure (either by timeouts of in-flight requests or idle-time ping messages), they connect to the new backup MDS and use the Metadata Replay protocol. Metadata Replay is responsible for ensuring that the backup MDS re-acquires state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk.
The reconnection to a new (or restarted) MDS is managed by the file system configuration loaded by the client when the file system is first mounted. If a failover MDS has been configured (using the --failnode= option to mkfs.lustre or tunefs.lustre), the client tries to reconnect to both the primary and backup MDS until one of them responds that the failed MDT is again available. At that point, the client begins recovery. For more information, see Metadata Replay.
Transaction numbers are used to ensure that operations are replayed in the order they were originally performed, so that they are guaranteed to succeed and present the same filesystem state as before the failure. In addition, clients inform the new server of their existing lock state (including locks that have not yet been granted). All metadata and lock replay must complete before new, non-recovery operations are permitted. In addition, only clients that were connected at the time of MDS failure are permitted to reconnect during the recovery window, to avoid the introduction of state changes that might conflict with what is being replayed by previously-connected clients.
When an OST fails or has communication problems with the client, the default action is that the corresponding OSC enters recovery, and I/O requests going to that OST are blocked waiting for OST recovery or failover. It is possible to administratively mark the OSC as inactive on the client, in which case file operations that involve the failed OST will return an I/O error (-EIO). Otherwise, the application waits until the OST has recovered or the client process is interrupted (e.g., with CTRL-C).
The MDS (via the LOV) detects that an OST is unavailable and skips it when assigning objects to new files. When the OST is restarted or re-establishes communication with the MDS, the MDS and OST automatically perform orphan recovery to destroy any objects that belong to files that were deleted while the OST was unavailable. For more information, see Working with Orphaned Objects.
While the OSC to OST operation recovery protocol is the same as that between the MDC and MDT using the Metadata Replay protocol, typically the OST commits bulk write operations to disk synchronously and each reply indicates that the request is already committed and the data does not need to be saved for recovery. In some cases, the OST replies to the client before the operation is committed to disk (e.g. truncate, destroy, setattr, and I/O operations in very new versions of Lustre), and normal replay and resend handling is done, including resending of the bulk writes. In this case, the client keeps a copy of the data available in memory until the server indicates that the write has committed to disk.
To force an OST recovery, unmount the OST and then mount it again. If the OST was connected to clients before it failed, then a recovery process starts after the remount, enabling clients to reconnect to the OST and replay transactions in their queue. When the OST is in recovery mode, all new client connections are refused until the recovery finishes. The recovery is complete when either all previously-connected clients reconnect and their transactions are replayed or a client connection attempt times out. If a connection attempt times out, then all clients waiting to reconnect (and their transactions) are lost.
Network failures may be transient. To avoid invoking recovery, the client initially tries to resend any timed-out request to the server. If the resend also fails, the client tries to re-establish a connection to the server. Upon reconnecting, a client can detect that the partition was harmless if the server has not had any reason to evict it.
If a request was processed by the server, but the reply was dropped (i.e., did not arrive back at the client), the server must reconstruct the reply when the client resends the request, rather than performing the same request twice.
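A minimal Python sketch of this idea follows. It is a toy model, not the Lustre implementation: the `Server` class, the in-memory `last_reply` table and the counter standing in for a non-idempotent operation are all assumptions made for illustration. Remembering only the most recent request per client is also a simplification of this sketch.

```python
class Server:
    """Toy server that reconstructs a lost reply instead of re-executing."""

    def __init__(self):
        self.counter = 0        # stands in for non-idempotent file system state
        self.last_reply = {}    # client UUID -> (xid, saved reply)

    def handle(self, client: str, xid: int) -> dict:
        saved = self.last_reply.get(client)
        if saved is not None and saved[0] == xid:
            # Same XID seen again: the request was already processed and the
            # reply was lost, so return the saved reply without re-executing.
            return saved[1]
        self.counter += 1       # the actual (non-idempotent) operation
        reply = {"xid": xid, "counter": self.counter}
        self.last_reply[client] = (xid, reply)
        return reply
```

Resending XID 7 after a dropped reply returns the original reply and leaves `counter` at 1; a new XID is executed normally.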
In the case of failed recovery, a client is evicted by the server and must reconnect after having flushed its saved state related to that server, as described in Client Eviction, above. Failed recovery might occur for a number of reasons, including:
Highly available Lustre operation requires that the MDS have a peer configured for failover, including the use of a shared storage device for the MDS backing file system. When a client detects an MDS failure, it connects to the new MDS and uses the metadata replay protocol to replay its requests.
Metadata replay ensures that the failover MDS re-accumulates state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk.
Each request sent by the client contains an XID number, which is a client-unique, monotonically increasing 64-bit integer. The initial value of the XID is chosen so that it is highly unlikely that the same client node reconnecting to the same server after a reboot would have the same XID sequence. The XID is used by the client to order all of the requests that it sends, until such a time that the request is assigned a transaction number. The XID is also used in Reply Reconstruction to uniquely identify per-client requests at the server.
Each client request processed by the server that involves any state change (metadata update, file open, write, etc., depending on server type) is assigned a transaction number by the server that is a target-unique, monotonically increasing, server-wide 64-bit integer. The transaction number for each file system-modifying request is sent back to the client along with the reply to that client request. The transaction numbers allow the client and server to unambiguously order every modification to the file system in case recovery is needed.
Each reply sent to a client (regardless of request type) also contains the last committed transaction number that indicates the highest transaction number committed to the file system. The backing file systems that Lustre uses (ext3/4, ZFS) enforce the requirement that any earlier disk operation will always be committed to disk before a later disk operation, so the last committed transaction number also reports that any requests with a lower transaction number have been committed to disk.
Lustre recovery can be separated into two distinct types of operations: replay and resend.
Replay operations are those for which the client received a reply from the server that the operation had been successfully completed. These operations need to be redone in exactly the same manner after a server restart as had been reported before the server failed. Replay can only happen if the server failed; otherwise it will not have lost any state in memory.
Resend operations are those for which the client never received a reply, so their final state is unknown to the client. The client sends unanswered requests to the server again in XID order, and again awaits a reply for each one. In some cases, resent requests have been handled and committed to disk by the server (possibly also having dependent operations committed), in which case the server performs reply reconstruction for the lost reply. In other cases, the server did not receive the lost request at all and processing proceeds as with any normal request. Both cases can arise from a network interruption. It is also possible that the server received the request, but was unable to reply or commit it to disk before failure.
All file system-modifying requests have the potential to be required for server state recovery (replay) in case of a server failure. Replies that have an assigned transaction number that is higher than the last committed transaction number received in any reply from each server are preserved for later replay in a per-server replay list. As each reply is received from the server, it is checked to see if it has a higher last committed transaction number than the previous highest last committed number. Most requests that now have a lower transaction number can safely be removed from the replay list. One exception to this rule is for open requests, which need to be saved for replay until the file is closed so that the MDS can properly reference count open-unlinked files.
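The bookkeeping described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Lustre's client code: the `Request` and `ReplayList` names are hypothetical, a transaction number of 0 is used here to mean "no state change", and the `closed` flag stands in for the later close of an opened file.

```python
from dataclasses import dataclass


@dataclass
class Request:
    xid: int
    transno: int            # 0 means the request did not modify the file system
    is_open: bool = False
    closed: bool = False    # set once the file opened by this request is closed


class ReplayList:
    """Per-server list of requests a client keeps for possible replay."""

    def __init__(self):
        self.last_committed = 0
        self.pending = []

    def on_reply(self, req: Request, last_committed: int) -> None:
        # Every reply carries the server's last committed transaction number.
        self.last_committed = max(self.last_committed, last_committed)
        if req.transno > 0:
            self.pending.append(req)
        self._prune()

    def _prune(self) -> None:
        # Keep uncommitted requests, plus open requests for files that are
        # still open (kept regardless of whether they have been committed).
        self.pending = [r for r in self.pending
                        if r.transno > self.last_committed
                        or (r.is_open and not r.closed)]
```

With this model, a committed open request stays on the list until its `closed` flag is set, while ordinary requests are dropped as soon as a reply reports a last committed transaction number at or above their own.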
A server enters recovery if it was not shut down cleanly. If, upon startup, any client entries are in the last_rcvd file for any previously connected clients, the server enters recovery mode and waits for these previously-connected clients to reconnect and begin replaying or resending their requests. This allows the server to recreate state that was exposed to clients (a request that completed successfully) but was not committed to disk before failure.
In the absence of any client connection attempts, the server waits indefinitely for the clients to reconnect. This is intended to handle the case where the server has a network problem and clients are unable to reconnect and/or if the server needs to be restarted repeatedly to resolve some problem with hardware or software. Once the server detects client connection attempts - either new clients or previously-connected clients - a recovery timer starts and forces recovery to finish in a finite time regardless of whether the previously-connected clients are available or not.
If no client entries are present in the last_rcvd file, or if the administrator manually aborts recovery, the server does not wait for client reconnection and proceeds to allow all clients to connect.
As clients connect, the server gathers information from each one to determine how long the recovery needs to take. Each client reports its connection UUID, and the server does a lookup for this UUID in the last_rcvd file to determine if this client was previously connected. If not, the client is refused connection and it will retry until recovery is completed. Each client reports its last seen transaction, so the server knows when all transactions have been replayed. The client also reports the amount of time that it was previously waiting for request completion so that the server can estimate how long some clients might need to detect the server failure and reconnect.
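The admission rules during recovery can be summarized with a small Python sketch. The `RecoveringTarget` class and its methods are hypothetical names; the set passed to the constructor stands in for the client entries read from the last_rcvd file, and the recovery timer is not modeled.

```python
class RecoveringTarget:
    """Toy model of which clients a server admits while in recovery."""

    def __init__(self, last_rcvd_uuids):
        # Clients recorded as connected before the failure.
        self.expected = set(last_rcvd_uuids)
        self.reconnected = set()
        # With no client entries there is nothing to recover.
        self.in_recovery = bool(self.expected)

    def connect(self, uuid: str) -> bool:
        """Return True if the connection is accepted."""
        if not self.in_recovery:
            return True
        if uuid not in self.expected:
            # New client: refused for now; it retries until recovery ends.
            return False
        self.reconnected.add(uuid)
        return True

    def abort_recovery(self) -> None:
        """Administrator aborts recovery; all clients may then connect."""
        self.in_recovery = False
```

A target constructed with an empty list never enters recovery, and after `abort_recovery()` a previously refused client is accepted.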
If the client times out during replay, it attempts to reconnect. If the client is unable to reconnect, REPLAY fails and it returns to DISCON state. It is possible that clients will time out frequently during REPLAY, so reconnection should not delay an already slow process more than necessary. This can be mitigated by increasing the timeout during replay.
If a client was previously connected, it gets a response from the server telling it that the server is in recovery and what the last committed transaction number on disk is. The client can then iterate through its replay list and use this last committed transaction number to prune any previously-committed requests. It replays any newer requests to the server in transaction number order, one at a time, waiting for a reply from the server before replaying the next request.
Open requests that are on the replay list may have a transaction number lower than the server's last committed transaction number. The server processes those open requests immediately. The server then processes replayed requests from all of the clients in transaction number order, starting at the last committed transaction number to ensure that the state is updated on disk in exactly the same manner as it was before the crash. As each replayed request is processed, the last committed transaction is incremented. If the server receives a replay request from a client that is higher than the current last committed transaction, that request is put aside until other clients provide the intervening transactions. In this manner, the server replays requests in the same sequence as they were previously executed on the server until either all clients are out of requests to replay or there is a gap in a sequence.
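The ordering rule can be sketched in Python as below. This is a simplified model: it assumes transaction numbers are consecutive integers, it ignores the special handling of open requests, and the `replay` function and its argument layout are hypothetical. Each client offers only the head of its queue, mirroring the one-at-a-time replay described earlier.

```python
def replay(last_committed, queues):
    """Replay client requests in global transaction-number order.

    queues maps a client name to its list of transaction numbers in
    ascending order. Returns (order, last_committed, stalled), where
    stalled holds the requests left set aside behind a gap.
    """
    queues = {client: list(q) for client, q in queues.items()}
    order = []
    progressed = True
    while progressed:
        progressed = False
        for client, q in queues.items():
            # Only the next expected transaction may be executed; anything
            # higher stays set aside until the intervening ones arrive.
            if q and q[0] == last_committed + 1:
                last_committed = q.pop(0)
                order.append((client, last_committed))
                progressed = True
    stalled = {client: q for client, q in queues.items() if q}
    return order, last_committed, stalled
```

For example, with `last_committed` at 100, client `a` holding 101 and 104, and client `b` holding 102 and 103, the requests are executed as 101, 102, 103, 104. If client `b` never reconnects, replay stops after 101 and request 104 remains set aside behind the gap.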
In some cases, a gap may occur in the replay sequence. This might be caused by lost replies, where the request was processed and committed to disk but the reply was not received by the client. It can also be caused by clients missing from recovery due to partial network failure or client death.
In the case where all clients have reconnected but there is a gap in the replay sequence, the only possibility is that some requests were processed by the server but the reply was lost. Since the client must still have these requests in its resend list, they are processed after recovery is finished.
In the case where not all clients have reconnected, it is likely that the failed clients had requests that will no longer be replayed. In Lustre 1.8 and later, version-based recovery (VBR) is used to determine if a request following a transaction gap is safe to be replayed. Each item in the file system (MDS inode or OST object) stores on disk the number of the last transaction in which it was modified. Each reply from the server contains the previous version number of the objects that it affects. During VBR replay, the server matches the previous version numbers in the replayed request against the current version number. If the versions match, the request is the next one that affects the object and can be safely replayed. For more information, see Version-based Recovery.
If all requests were replayed successfully and all clients reconnected, clients then replay their locks -- that is, every client sends information about every lock it holds from this server and its state (whether or not it was granted, what mode, what properties, and so on), and then recovery completes successfully. Currently, Lustre does not do lock verification and just trusts clients to present an accurate lock state. This does not raise any security concerns, since Lustre 1.x clients are also trusted for other information (e.g., user ID) during normal operation.
After all of the saved requests and locks have been replayed, the client sends an MDS_GETSTATUS request with last-replay flag set. The reply to that request is held back until all clients have completed replay (sent the same flagged getstatus request), so that clients don't send non-recovery requests before recovery is complete.
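The hold-back behavior acts like a barrier across the recovering clients. A minimal Python sketch, with a hypothetical `ReplayBarrier` class and client names standing in for the flagged requests, could look like this:

```python
class ReplayBarrier:
    """Hold each client's final replay request until all clients have sent one."""

    def __init__(self, clients):
        self.waiting_for = set(clients)   # clients that have not finished replay
        self.held = []                    # clients whose replies are held back

    def final_request(self, client: str) -> list:
        """Register a last-replay request; return the clients to reply to now."""
        self.waiting_for.discard(client)
        self.held.append(client)
        if self.waiting_for:
            return []                     # someone is still replaying: hold
        released, self.held = self.held, []
        return released                   # everyone is done: release all replies
```

In this model the first client to finish receives no reply until the last client has also sent its flagged request, at which point all held replies are released together.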
Once all of the previously-shared state has been recovered on the server (the target file system is up-to-date with client cache and the server has recreated locks representing the locks held by the client), the client can resend any requests that did not receive an earlier reply. This processing is done like normal request processing, and, in some cases, the server may do reply reconstruction.
When a reply is dropped, the MDS needs to be able to reconstruct the reply when the original request is re-sent. This must be done without repeating any non-idempotent operations, while preserving the integrity of the locking system. In the event of MDS failover, the information used to reconstruct the reply must be serialized on the disk in transactions that are joined or nested with those operating on the disk.
For the majority of requests, it is sufficient for the server to store three pieces of data in the last_rcvd file:
For open requests, the "disposition" of the open must also be stored.
An open reply consists of up to three pieces of information (in addition to the contents of the "request log"):
The disposition, status and request data (re-sent intact by the client) are sufficient to determine which type of lock handle was granted, whether an open file handle was created, and which resource should be described in the mds_body.
The file handle can be found in the XID of the request and the list of per-export open file handles. The file handle contains the resource/FID.
The lock handle can be found by walking the list of granted locks for the resource looking for one with the appropriate remote file handle (present in the re-sent request). Verify that the lock has the right mode (determined by performing the disposition/request/status analysis above) and is granted to the proper client.
Lustre 1.8 introduces the Version-based Recovery (VBR) feature, which improves Lustre reliability in cases where client requests (RPCs) fail to replay during recovery[1].
In pre-1.8 versions of Lustre, if the MGS or an OST went down and then recovered, a recovery process was triggered in which clients attempted to replay their requests. Clients were only allowed to replay RPCs in serial order. If a particular client could not replay its requests, then those requests were lost as well as the requests of clients later in the sequence. The "downstream" clients never got to replay their requests because of the wait on the earlier client's RPCs. Eventually, the recovery period would time out (so the component could accept new requests), leaving some number of clients evicted and their requests and data lost.
With VBR, the recovery mechanism does not result in the loss of clients or their data, because changes in inode versions are tracked, and more clients are able to reintegrate into the cluster. With VBR, inode tracking looks like this:
| Note - An RPC can contain up to four pre-operation versions, because several inodes can be involved in an operation. In the case of a ''rename'' operation, four different inodes can be modified. |
During normal operation, the server:
When the recovery mechanism is underway, VBR follows these steps:
1. VBR only allows clients to replay transactions if the affected inodes have the same version as during the original execution of the transactions, even if there is a gap in transactions due to a missed client.
2. The server attempts to execute every transaction that the client offers, even if it encounters a re-integration failure.
3. When the replay is complete, the client and server check if a replay failed on any transaction because of inode version mismatch. If the versions match, the client gets a successful re-integration message. If the versions do not match, then the client is evicted.
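The three steps above can be sketched in Python. This is a simplified model rather than the Lustre implementation: the `vbr_replay` function, the dictionary layout of a request and the object names are assumptions for illustration. Each object's version is taken to be the number of the last transaction that modified it, as described earlier in this chapter.

```python
def vbr_replay(current_versions, requests):
    """Replay requests whose pre-operation versions match the current state.

    current_versions maps an object name to its on-disk version.
    Each request is a dict with a "transno" and a "pre_versions" mapping
    of object name to the version seen before the original execution.
    Returns (versions after replay, list of transnos that failed).
    """
    versions = dict(current_versions)
    failed = []
    for req in sorted(requests, key=lambda r: r["transno"]):
        pre = req["pre_versions"]
        if all(versions.get(obj) == ver for obj, ver in pre.items()):
            # Versions match: this is the next change to these objects,
            # so it is safe to replay even across a transaction gap.
            for obj in pre:
                versions[obj] = req["transno"]
        else:
            # Mismatch: record the failure but keep trying later requests.
            failed.append(req["transno"])
    return versions, failed
```

Suppose a missing client held transaction 101, which changed `dirB` from version 90. A surviving client replays transaction 102 on `dirA` (pre-version 100) and transaction 103 on `dirB` (pre-version 101). Transaction 102 succeeds despite the gap, while 103 fails the version check; a non-empty failure list corresponds to the client being evicted in step 3.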
VBR recovery is fully transparent to users. It may lead to slightly longer recovery times if the cluster loses several clients during server recovery.
With VBR, it is possible to recover clients even after the server’s recovery window closes. This is known as delayed recovery. This feature is useful if clients have become temporarily unavailable during recovery (e.g., because of a network partition).
In Lustre 1.8, the VBR feature is built into the Lustre recovery functionality. It cannot be disabled.
Delayed recovery can be enabled with the --enable-delayed-recovery option:
./configure ... --enable-delayed-recovery
During reboot, a list of new messages is displayed.
The following recovery message:
CWARN("RECOVERY: service %s, %d recoverable clients, last_transno "LPU64"\n");
was updated to include the number of delayed clients:
CWARN("RECOVERY: service %s, %d recoverable clients, %d delayed clients, last_transno "LPU64"\n");
| Note - There should be no delayed clients until delayed recovery is enabled. |
These are some VBR messages that may be displayed:
DEBUG_REQ(D_WARNING, req, "Version mismatch during replay\n");
This message indicates why the client was evicted. No action is needed.
CWARN("%s: version recovery fails, reconnecting\n");
This message indicates why the recovery failed. No action is needed.
These are some VBR messages that may be displayed if delayed recovery is enabled:
CWARN("RECOVERY: service %s, %d recoverable clients, %d delayed clients, last_transno "LPU64"\n");
This message reports the number of delayed clients. There should be 0 delayed clients unless delayed recovery is enabled.
CWARN("%s: NID %s (%s) export was already marked as delayed and will wait for end of recovery\n");
The old client is trying to reconnect, but it will wait for end of the server’s recovery period. No action is needed.
VBR will be successful for clients which do not share data with other clients. Therefore, the strategy for reliable use of VBR is to store a client's data in its own directory, where possible. VBR can recover these clients, even if other clients are lost.
When an OSS, MDS, or MGS server crash occurs, it is not necessary to run e2fsck on the file system. Ext3 journaling ensures that the file system remains coherent. The backing file systems are never accessed directly from the client, so client crashes are not relevant.
The only time it is REQUIRED that e2fsck be run on a device is when an event causes problems that ext3 journaling is unable to handle, such as a hardware device failure or I/O error. If the ext3 kernel code detects corruption on the disk, it mounts the file system as read-only to prevent further corruption, but still allows read access to the device. This appears as error "-30" (EROFS) in the syslogs on the server, e.g.:
Dec 29 14:11:32 mookie kernel: LDISKFS-fs error (device sdz): ldiskfs_lookup: unlinked inode 5384166 in dir #145170469
Dec 29 14:11:32 mookie kernel: Remounting filesystem read-only
In such a situation, it is normally only required that e2fsck be run on the bad device before placing the device back into service.
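The log pattern shown above can be detected with a short script. The Python sketch below is an illustration only: the `devices_needing_e2fsck` function is hypothetical, and it assumes that the "Remounting filesystem read-only" line follows the LDISKFS-fs error line for the same device, as in the excerpt.

```python
import re

# Matches the device name in lines such as
# "LDISKFS-fs error (device sdz): ldiskfs_lookup: ..."
LDISKFS_ERROR = re.compile(r"LDISKFS-fs error \(device (?P<dev>[^)]+)\)")


def devices_needing_e2fsck(lines):
    """Return devices that logged an error followed by a read-only remount."""
    suspect = None
    flagged = []
    for line in lines:
        match = LDISKFS_ERROR.search(line)
        if match:
            suspect = match.group("dev")
        elif "Remounting filesystem read-only" in line and suspect:
            if suspect not in flagged:
                flagged.append(suspect)
            suspect = None
    return flagged
```

Applied to the two syslog lines above, the function reports device `sdz` as the candidate for e2fsck.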
In the vast majority of cases, Lustre can cope with any inconsistencies it finds on the disk and between other devices in the file system.
| Note - lfsck is rarely required for Lustre operation. |
For problem analysis, it is strongly recommended that e2fsck be run under a logger, like script, to record all of the output and changes that are made to the file system in case this information is needed later.
If time permits, it is also a good idea to first run e2fsck in non-fixing mode (-n option) to assess the type and extent of damage to the file system. The drawback is that in this mode, e2fsck does not recover the file system journal, so there may appear to be file system corruption when none really exists.
To address concern about whether corruption is real or only due to the journal not being replayed, you can briefly mount and unmount the ext3 filesystem directly on the node with Lustre stopped (NOT via Lustre), using a command similar to:
mount -t ldiskfs /dev/{ostdev} /mnt/ost; umount /mnt/ost
This causes the journal to be recovered.
The e2fsck utility works well when fixing file system corruption (better than similar file system recovery tools and a primary reason why ext3 was chosen over other file systems for Lustre). However, it is often useful to identify the type of damage that has occurred so an ext3 expert can make intelligent decisions about what needs fixing, in place of e2fsck.
root# {stop lustre services for this device, if running}
root# script /tmp/e2fsck.sda
Script started, file is /tmp/e2fsck.sda
root# mount -t ldiskfs /dev/sda /mnt/ost
root# umount /mnt/ost
root# e2fsck -fn /dev/sda # don't fix file system, just check for corruption
:
[e2fsck output]
:
root# e2fsck -fp /dev/sda # fix filesystem using "prudent" answers (usually 'y')
In addition, the e2fsprogs package contains the lfsck tool, which does distributed coherency checking for the Lustre file system after e2fsck has been run. Running lfsck is NOT required in a large majority of cases, at a small risk of having some leaked space in the file system. To avoid a lengthy downtime, it can be run (with care) after Lustre is started.
In cases where the MDS or an OST becomes corrupt, you can run a distributed check on the file system to determine what sort of problems exist. Use lfsck to correct any defects found.
1. Stop the Lustre file system.
2. Run e2fsck -f on the individual MDS / OST that had problems to fix any local file system damage.
We recommend running e2fsck under script, to create a log of changes made to the file system in case it is needed later. After e2fsck is run, bring up the file system, if necessary, to reduce the outage window.
3. Run a full e2fsck of the MDS to create a database for lfsck. It is critical to use the -n option for a mounted file system, otherwise you will corrupt the file system.
e2fsck -n -v --mdsdb /tmp/mdsdb /dev/{mdsdev}
The mdsdb file can grow fairly large, depending on the number of files in the file system (10 GB or more for millions of files, though the actual file size is larger because the file is sparse). It is quicker to write the file to a local file system due to seeking and small writes. Depending on the number of files, this step can take several hours to complete.
e2fsck -n -v --mdsdb /tmp/mdsdb /dev/sdb
e2fsck 1.39.cfs1 (29-May-2006)
Warning: skipping journal recovery because doing a read-only filesystem check.
lustre-MDT0000 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
MDS: ost_idx 0 max_id 288
MDS: got 8 bytes = 1 entries in lov_objids
MDS: max_files = 13
MDS: num_osts = 1
mds info db file written
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (656160, counted=656058).
Fix? no
Free inodes count wrong (786419, counted=786036).
Fix? no
Pass 6: Acquiring information for lfsck
MDS: max_files = 13
MDS: num_osts = 1
MDS: 'lustre-MDT0000_UUID' mdt idx 0: compat 0x4 rocomp 0x1 incomp 0x4
lustre-MDT0000: ******* WARNING: Filesystem still has errors *******
13 inodes used (0%)
2 non-contiguous inodes (15.4%)
# of inodes with ind/dind/tind blocks: 0/0/0
130272 blocks used (16%)
0 bad blocks
1 large file
296 regular files
91 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
387 files
4. Make this file accessible on all OSTs, either by using a shared file system or copying the file to the OSTs. The pdcp command is useful here.
The pdcp command (installed with pdsh) can be used to copy files to groups of hosts. pdcp is available here:
http://sourceforge.net/projects/pdsh
5. Run a similar e2fsck step on the OSTs. The e2fsck --ostdb command can be run in parallel on all OSTs.
e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ostNdb} /dev/{ostNdev}
The mdsdb file is read-only in this step; a single copy can be shared by all OSTs.
| Note - If the OSTs do not have shared file system access to the MDS, a stub mdsdb file, {mdsdb}.mdshdr, is generated. This can be used instead of the full mdsdb file. |
[root@oss161 ~]# e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb \
/tmp/ostdb /dev/sda
e2fsck 1.39.cfs1 (29-May-2006)
Warning: skipping journal recovery because doing a read-only filesystem check.
lustre-OST0000 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (989015, counted=817968).
Fix? no
Free inodes count wrong (262088, counted=261767).
Fix? no
Pass 6: Acquiring information for lfsck
OST: 'lustre-OST0000_UUID' ost idx 0: compat 0x2 rocomp 0 incomp 0x2
OST: num files = 321
OST: last_id = 321
lustre-OST0000: ******* WARNING: Filesystem still has errors *******
56 inodes used (0%)
27 non-contiguous inodes (48.2%)
# of inodes with ind/dind/tind blocks: 13/0/0
59561 blocks used (5%)
0 bad blocks
1 large file
329 regular files
39 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
368 files
6. Make the mdsdb file and all ostdb files available on a mounted client and run lfsck to examine the file system. Optionally, correct the defects found by lfsck.
script /root/lfsck.lustre.log
lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ost1db} /tmp/{ost2db} ... /lustre/mount/point
script /root/lfsck.lustre.log
lfsck -n -v --mdsdb /home/mdsdb --ostdb /home/{ost1db} /mnt/lustre/client/
MDSDB: /home/mdsdb
OSTDB[0]: /home/ostdb
MOUNTPOINT: /mnt/lustre/client/
MDS: max_id 288 OST: max_id 321
lfsck: ost_idx 0: pass1: check for duplicate objects
lfsck: ost_idx 0: pass1 OK (287 files total)
lfsck: ost_idx 0: pass2: check for missing inode objects
lfsck: ost_idx 0: pass2 OK (287 objects)
lfsck: ost_idx 0: pass3: check for orphan objects
[0] uuid lustre-OST0000_UUID
[0] last_id 288
[0] zero-length orphan objid 1
lfsck: ost_idx 0: pass3 OK (321 files total)
lfsck: pass4: check for duplicate object references
lfsck: pass4 OK (no duplicates)
lfsck: fixed 0 errors
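When the procedure above is scripted, the command lines for steps 3, 5 and 6 can be assembled programmatically. The Python helpers below are a sketch: the function names and default paths are hypothetical, only the options shown in this section are used, and the commands are built as argument lists without being executed.

```python
def mds_db_command(mds_dev, mdsdb="/tmp/mdsdb"):
    """Step 3: read-only e2fsck of the MDS that writes the mdsdb file."""
    return ["e2fsck", "-n", "-v", "--mdsdb", mdsdb, mds_dev]


def ost_db_command(ost_dev, ostdb, mdsdb="/tmp/mdsdb"):
    """Step 5: read-only e2fsck of one OST that writes its ostdb file."""
    return ["e2fsck", "-n", "-v", "--mdsdb", mdsdb, "--ostdb", ostdb, ost_dev]


def lfsck_command(mount_point, ostdbs, mdsdb="/tmp/mdsdb"):
    """Step 6: read-only lfsck on a mounted client using all database files."""
    return ["lfsck", "-n", "-v", "--mdsdb", mdsdb,
            "--ostdb", *ostdbs, mount_point]
```

Keeping `-n` in every helper reflects the warning in step 3 that a mounted file system must only be checked in read-only mode. The resulting lists could be passed to `subprocess.run` on the appropriate node once reviewed.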
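By default, lfsck reports errors, but it does not repair any inconsistencies found. lfsck checks for three kinds of inconsistencies: orphaned objects (OST objects that no inode references), dangling inodes (inodes whose objects are missing), and inodes that reference duplicate objects.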
If the file system is in use and being modified while the --mdsdb and --ostdb steps are running, lfsck may report inconsistencies where none exist due to files and objects being created/removed after the database files were collected. Examine the lfsck results closely. You may want to re-run the test.
The easiest problem to resolve is that of orphaned objects. When the -l option for lfsck is used, these objects are linked to new files and put into lost+found in the Lustre file system, where they can be examined and saved or deleted as necessary. If you are certain the objects are not useful, run lfsck with the -d option to delete orphaned objects and free up any space they are using.
To fix dangling inodes, use lfsck with the -c option to create new, zero-length objects on the OSTs. These files read back with binary zeros for stripes that had objects re-created. Even without lfsck repair, these files can be read by entering:
dd if=/lustre/bad/file of=/new/file bs=4k conv=sync,noerror
Because it is rarely useful to have files with large holes in them, most users delete these files after reading them (if useful) and/or restoring them from backup.
| Note - You cannot write to the holes of such files without having lfsck re-create the objects. Generally, it is easier to delete these files and restore them from backup. |
To fix inodes with duplicate objects, use lfsck with the -c option to copy the duplicate object to a new object and assign it to a file. One file will be okay and the duplicate will likely contain garbage. By itself, lfsck cannot tell which file is the usable one.
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.