
=Simplified Interoperability Solution Architecture=

==Introduction==
Due to the different RPC protocols in 1.8 and 2.0, interoperability between these two releases was considerably more complex than necessary. The first part of the work was to make the 1.8 client understand both the 1.8 and 2.0 wire protocols. This enabled the 1.8 client to talk to 1.8 MDS/OST servers as well as 2.0 MDS/OST servers. However, saved open RPCs in the replay list were particularly complex, since 1.8 clients may have saved an RPC in the 1.8 format, which would not be understood by a 2.0 MDS. It would be better if the client generated a new-format open RPC for each open Lustre file handle rather than saving the old-format RPC in the replay list. Dropping the saved open RPC would also reduce memory usage after the open+create RPC is committed on the MDT, since only the "open FID" state needs to be saved, not the whole file layout and other attributes stored in the saved RPC.

From the point of view of a client, a server upgrade or downgrade is treated as normal recovery after a server failure. When a server is upgraded without Simplified Interoperability, the clients only become aware of the server shutdown after it has already stopped (possibly with uncommitted changes in memory) and has been started again with the new version. The client reconnection and recovery process includes request replay/resend, including OPEN and other MDS_REINT and LDLM requests, which covers a large amount of saved state and complicates recovery significantly. In rare cases, the client may need to use a new wire protocol with the new server, which is difficult to test when the number of saved RPCs is large and reformatting them is complex.

Simplified Interoperability is aimed at allowing the admin to quiesce clients before an upgrade or other system maintenance, thereby reducing or eliminating the need for complex recovery upon restart. By notifying clients in advance of the outage, they can reduce their saved filesystem state to a bare minimum before the server goes offline. This avoids client errors during recovery, and reduces load and spurious error messages on the server until the system maintenance is completed. To achieve this goal, we introduce "Controlled Server Shutdown" and "Open from Handle". When the server is doing a controlled shutdown, it notifies all connected clients that it is going to shut down. The clients should flush all dirty cache, discard all cached data, and cancel all LDLM locks from this server. The client should then disconnect from the server to acknowledge this notification, similar to IDLE client disconnection. At the same time, the client should block the generation of new RPC requests for this target, except a new OBD_CONNECT, resulting in an (interruptible) application barrier for that server. If File Level Redundancy is available for a file, the client can still try other active targets to access the file data, if any. Upon notification from the MGS that the target is active again, the client can reconnect and needs only minimal RPCs to recover, using the VFS open file handles on this mountpoint to re-establish state on the MDS.

The detailed design and alternatives for Simplified Interoperability are described in the following sections.

==Client and server state==
Once this feature has landed, all server shutdowns (umount) will by default be "Controlled Server Shutdowns". The server will notify all clients that it is about to be unmounted. It will then start a timer, waiting for all clients to cancel their locks, flush their caches, and complete pending requests. When an OBD_DISCONNECT request has arrived from every client, or the timeout expires, the server will stop processing new requests from clients.
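The server-side wait described above can be sketched as a small model. This is a minimal Python sketch, not actual Lustre code; the class and method names are illustrative, and the timer-extension policy (extend on MSG_QUIESCE-flagged RPCs up to a hard maximum) is modeled directly from the description in the next section.

```python
import time

class ControlledShutdown:
    """Toy model of the server's Controlled Server Shutdown wait loop
    (illustrative names, not actual Lustre symbols)."""

    def __init__(self, clients, base_timeout=30.0, max_timeout=300.0):
        self.pending = set(clients)  # clients that have not yet disconnected
        self.extend = base_timeout
        self.deadline = time.monotonic() + base_timeout
        self.hard_deadline = time.monotonic() + max_timeout

    def on_rpc(self, client, opcode, quiesce_flagged):
        """Handle an incoming RPC while the shutdown timer is running."""
        if opcode == "DISCONNECT":
            self.pending.discard(client)
        elif quiesce_flagged:
            # MSG_QUIESCE-flagged RPCs show the client is draining its state,
            # so extend the timer, but never past the hard maximum.
            self.deadline = min(time.monotonic() + self.extend,
                                self.hard_deadline)
        # RPCs without MSG_QUIESCE come from clients that did not receive
        # (or do not understand) the quiesce request: no timer extension.

    def may_stop(self):
        """Shutdown proceeds when all clients are gone or time is up."""
        return not self.pending or time.monotonic() >= self.deadline
```

The key design point modeled here is that only quiesce-aware traffic keeps the server waiting; everything else runs out the clock.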

When a client gets a quiesce notification from the server, it should mark its import as "UNAVAILABLE" so that no new requests are created. However, lock cancellation requests, pending write requests for cached data, and in-flight requests should still be allowed in this state. MDS_SYNC and OST_SYNC should also be allowed to be sent to the server. Client RPCs during CSS should be flagged with MSG_QUIESCE so that the server can see the client has received the quiesce request and that incoming RPCs are draining client state (possibly extending the shutdown timer in a manner similar to the recovery timer, up to some maximum). RPCs without the MSG_QUIESCE flag are from clients that do not understand CSS or did not get the quiesce request, so they should not extend the shutdown timer. When the client has flushed all saved state (except open file handles), MDS_DISCONNECT or OST_DISCONNECT should be sent to the server to indicate that the client has completed its operations.
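The client-side behavior above can be sketched as a simple state check on each outgoing RPC. This is an illustrative Python model only: the MSG_QUIESCE flag value, the opcode strings, and the allow-list contents are assumptions drawn from the paragraph above, not actual Lustre definitions.

```python
MSG_QUIESCE = 0x1  # hypothetical flag bit, for illustration only

# RPC types that remain allowed while the import is quiescing:
# lock cancels, cache flushes, syncs, disconnect, and a fresh connect.
ALLOWED_IN_QUIESCE = {
    "LDLM_CANCEL",
    "OST_WRITE",
    "MDS_SYNC", "OST_SYNC",
    "MDS_DISCONNECT", "OST_DISCONNECT",
    "OBD_CONNECT",
}

class Import:
    """Toy model of a client import during Controlled Server Shutdown."""

    def __init__(self):
        self.state = "FULL"

    def handle_quiesce_notification(self):
        # Server announced a controlled shutdown: block new request creation.
        self.state = "UNAVAILABLE"

    def send_rpc(self, opcode):
        """Return flags for an outgoing RPC, or None if it must be blocked
        (the application sees an interruptible barrier for this target)."""
        if self.state == "UNAVAILABLE":
            if opcode not in ALLOWED_IN_QUIESCE:
                return None
            # Everything sent during CSS carries MSG_QUIESCE so the server
            # knows this client is draining its state.
            return MSG_QUIESCE
        return 0
```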

==Snapshot Barrier==
There is an existing snapshot barrier implementation that may be leveraged to block client requests when the server is in maintenance mode. Re-using the snapshot barrier code will avoid redundant functionality in the code. On the client side, it may also be useful to integrate with the fsfreeze mechanism in the VFS to block userspace threads at a high level, if *all* targets are marked "UNAVAILABLE", rather than implementing a Lustre-specific mechanism to do the same. However, if only a single target is UNAVAILABLE, freezing the whole filesystem would be detrimental to client operation, since the client could otherwise use FLR to access the same data transparently from another target.
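The condition for when a whole-mountpoint freeze is appropriate reduces to a single predicate. A minimal sketch, assuming a hypothetical map of target name to import state:

```python
def should_fsfreeze(import_states):
    """Freeze the whole client mountpoint only when *every* target is
    UNAVAILABLE. If any target remains active, FLR may still serve the
    data from another target, so a filesystem-wide freeze would
    needlessly block the application."""
    return all(state == "UNAVAILABLE" for state in import_states.values())
```

If this predicate is false, the per-import barrier from the previous section applies to the unavailable targets only, leaving FLR failover possible.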

==MDS_OPEN request replay==
Open client file handles should be preserved during upgrade across protocol versions. Currently, all MDS_OPEN RPC requests are saved in the client RPC imp_replay_list, even after the corresponding RPC transno has been committed to storage on the MDS, until the file is closed. This allows the client to replay the MDS_OPEN request upon reconnect for all currently-open file handles, and ensures that open file handles are replayed in the order they were sent relative to other operations such as unlink, so that open-unlinked files are properly handled on MDS recovery. However, this also adds complexity in the client-side replay handler because of the special case "keep RPC in replay list after transno is committed", and would cause complexity during RPC protocol upgrades due to saving the literal RPC message that was sent during file open. As part of Simplified Interoperability, it would be preferable for the MDS_OPEN request to be re-created with the new wire protocol from the list of open file handles that the client VFS already saves, and replayed to the MDS. By regenerating MDS_OPEN RPCs from the saved VFS file handles, the same mechanism can also be used by other projects such as the Client Metadata Writeback Cache.
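The regeneration approach can be sketched as follows. This is an illustrative Python model, not the actual wire format: the per-handle state, the OPEN_BY_FID flag value, and the RPC dictionary layout are assumptions; the point is that only the FID and open flags survive after commit, and the replay RPC is built fresh in whatever protocol version was negotiated at reconnect.

```python
from collections import namedtuple

# Minimal per-handle state kept after the open has committed on the MDT:
# just the "open FID" state, not the full saved RPC message.
OpenHandle = namedtuple("OpenHandle", ["fid", "flags"])

OPEN_BY_FID = 0x1000  # hypothetical flag: open by FID, no name lookup

def build_open_replay(open_handles, wire_version):
    """Regenerate MDS_OPEN requests in the *current* wire format from the
    open file handles the VFS already tracks, instead of replaying the
    literal RPC messages that were saved at open time."""
    rpcs = []
    for h in open_handles:
        rpcs.append({
            "opcode": "MDS_OPEN",
            "version": wire_version,          # negotiated at reconnect
            "fid": h.fid,
            "flags": h.flags | OPEN_BY_FID,   # reopen the committed FID
        })
    return rpcs
```

Because the RPC is rebuilt from scratch, a client that reconnects to a server speaking a newer protocol never needs to translate an old saved message.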

==Callback/Notification from server==
The server must be able to notify its clients that it is going to shut down and that they should quiesce that import. This ensures the client can complete its requests and flush its cache before the server goes offline. Once the server is back online and has completed its maintenance, it is preferable for it to actively notify the clients that it is available, rather than having the clients repeatedly try to connect to the server until they succeed. This avoids needless load and spurious messages on the server, and allows the admin to start the server for verification without having all clients immediately connect. There are three alternatives for doing this.

===MGS via Imperative Recovery===
Using the existing Imperative Recovery (IR) mechanism would allow the server to notify the client of server shutdown, as well as when it is available for reconnection. Currently, the server immediately advertises that it is started and available for recovery, but for CSS it makes sense for the server to persistently save state on the target (e.g. in the last_rcvd file) indicating that it should not advertise itself to the MGS until the maintenance is completed. To avoid confusing the admin if the server restarts and does not enter recovery due to the saved state, it should periodically print a message to the server console stating that recovery is disabled, along with the command to re-enable it. If the MGS is also unavailable (possibly also being upgraded), then clients should retry connecting to the target in the normal manner, as when IR is not available.
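The persistent "do not advertise" state can be sketched as below. This is an illustrative Python model under stated assumptions: the flag name, the MGS interface, and the console message text are hypothetical; only the sequencing (flag set before shutdown, advertisement suppressed until the admin clears it) comes from the description above.

```python
class TargetState:
    """Toy model of a maintenance flag persisted with the target
    (e.g. alongside last_rcvd); names are illustrative."""

    def __init__(self):
        self.maintenance = False  # would be stored persistently on disk

    def begin_maintenance(self):
        self.maintenance = True   # set before the controlled shutdown

    def end_maintenance(self):
        self.maintenance = False  # cleared by the admin after verification

    def on_mount(self, mgs):
        """On restart: advertise to the MGS only if maintenance is over."""
        if self.maintenance:
            # Periodically remind the admin instead of silently skipping
            # recovery (message text is a placeholder).
            print("recovery disabled: maintenance flag is set")
            return False
        mgs.advertise(self)
        return True
```

This lets the admin mount and verify the target without triggering an immediate reconnect storm from all clients.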

===LDLM===
After the connection to a server is established successfully, the client takes an additional LDLM lock directly from that server, with a special lock resource name. This is an IBITS lock on the MDS, or an extent lock on the OST. The MDS/OST must handle this lock specially, because the resource name does not contain a valid object identifier. The client holds this lock in CR mode. Only while holding this lock can the client send new requests to the server, with the exception of some special requests such as CONNECT. When the server wants to notify its clients, it tries to take an EX lock on that resource name. This triggers a blocking callback to the clients. Each client should respond to this callback by cancelling all locks acquired from this server, flushing all data, and finally cancelling this lock as well. When the server succeeds in taking the EX lock, or times out, it continues to shut down. This approach duplicates a significant amount of functionality with Imperative Recovery and does not seem to have any significant benefit beyond functioning without the MGS present.
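The CR/EX conflict that drives this scheme can be sketched as a toy model. This is not the LDLM API; the class, the callback signature, and the timeout handling are illustrative. It shows only the logic: every client's shared CR lock conflicts with the server's EX request, so each holder receives a blocking callback and must drain its state before cancelling.

```python
class QuiesceResource:
    """Toy model of the special quiesce lock resource."""

    def __init__(self):
        self.cr_holders = set()

    def client_connect(self, client):
        # CR is a shared mode, so every client's CR lock is granted at once.
        self.cr_holders.add(client)

    def server_request_ex(self, flush_cb, unresponsive=()):
        """Server requests EX: conflicts with every CR lock, triggering a
        blocking callback (flush_cb) on each holder. Returns True if the
        EX lock was granted (all responsive holders cancelled)."""
        for client in sorted(self.cr_holders):
            if client in unresponsive:
                continue  # no response: evicted when the timeout expires
            flush_cb(client)               # flush data, cancel other locks...
            self.cr_holders.discard(client)  # ...then cancel the CR lock
        # EX is granted once no conflicting CR locks remain.
        return len(self.cr_holders - set(unresponsive)) == 0
```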

===Connection callback===
The server initiates a special callback to its clients: a connection callback. On the client this can be handled by the LDLM callback thread, or by a new dedicated thread. However, a new dedicated thread is not a good idea for clients with limited resources, so if this can be handled by the existing client callback thread as a special case, that would be preferred. The steps clients should take in response to this callback are the same as for normal LDLM callbacks. This also does not seem sufficiently beneficial to warrant a separate implementation.

==Risks==
There will be some race conditions in these solutions: new requests, pending requests, and new connections on clients; callbacks to clients, stopping the processing of new requests, and connections on servers. These should be discussed in detail in the HLD. Which solution to use, LDLM or connection callback, is not yet decided; this should be addressed in wider discussion.

==Related documents==
[[media:SC09-Simplified-Interop.pdf]]