Simplified Interoperability

From Lustre Wiki
Revision as of 15:00, 13 December 2019 by Adilger (Introduction: update description to include file reopen from handle, FLR, idle disconnect)

Simplified Interoperability Solution Architecture

Introduction

Due to the different RPC protocols in 1.8 and 2.0, interoperability between these two releases was considerably more complex than necessary. The first part of the work was to make the 1.8 client understand both the 1.8 and 2.0 wire protocols. This enabled the 1.8 client to talk to 1.8 MDS/OST servers as well as 2.0 MDS/OST servers. However, saved open RPCs in the replay list were particularly complex, since 1.8 clients may have saved 1.8-format RPCs, which would not be understood by a 2.0 MDS. It would be better if the client generated a new-format open RPC for each open Lustre file handle rather than saving the old-format RPC in the replay list. Dropping the saved open RPC would also reduce memory usage after the open+create RPC is committed on the MDT, since only the "open FID" state needs to be saved, not the whole file layout and other attributes stored in the saved RPC.

From the point of view of a client, a server upgrade/downgrade is treated as normal recovery after a server failure. When a server is upgraded without Simplified Interoperability, the clients only become aware of the server shutdown after it has already stopped (possibly with uncommitted changes in memory) and been started again with the new version. The client reconnection and recovery process includes request replay/resend, including OPEN and other MDS_REINT and LDLM requests, which covers a large amount of saved state and complicates recovery significantly. In rare cases, the client may need to use a new wire protocol with the new server, which becomes untestable when the number of saved RPCs is large and reformatting is complex.

Simplified Interoperability is aimed at allowing the admin to quiesce clients before an upgrade or other system maintenance, thereby reducing or eliminating the need for complex recovery upon restart. By notifying clients in advance of the outage, they can reduce their saved filesystem state to a bare minimum before the server goes offline. This avoids client errors during recovery, and reduces load and spurious error messages on the server until the system maintenance is completed. To achieve this goal, we introduce "Controlled Server Shutdown" and "Open from Handle". When the server is doing a controlled shutdown, it notifies all connected clients that it is going to shut down. The clients should flush all dirty cache, discard all cached data, and cancel all LDLM locks from this server. The client should then disconnect from the server to acknowledge this notification, similar to IDLE client disconnection. At the same time, the client should block the generation of new RPC requests on this target, except a new OBD_CONNECT, resulting in an (interruptible) application barrier for that server. If File Level Redundancy is available for a file, the client can still try other active targets to access file data, if any. Upon notification from the MGS that the target is active again, the client can reconnect and recover with minimal RPCs, using the VFS open file handles on this mountpoint to re-establish state on the MDS.

The detailed design and alternatives for simplified interoperation are described in the following sections.

Client and server state

Normally, once this feature has landed, every server shutdown (umount) will be a "controlled" shutdown. The server should notify all of its clients that it is going to be unmounted. The server will start a timer, waiting for all of its clients to cancel their locks, flush their caches, and complete pending requests. When the OBD_DISCONNECT requests arrive at the server, or the timer expires, the server will stop processing any new requests from clients.
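The server-side sequence above can be sketched as follows. This is a minimal Python model, not Lustre code; all names (`ServerTarget`, `notify_client`, `handle_disconnect`) are illustrative, not actual Lustre symbols.

```python
import time

class ServerTarget:
    """Illustrative model of a server target doing a controlled shutdown.
    All names here are hypothetical; they do not match Lustre source symbols."""

    def __init__(self, clients, timeout=30.0):
        self.clients = set(clients)      # currently connected client UUIDs
        self.accepting_requests = True
        self.timeout = timeout           # how long to wait for clients

    def controlled_shutdown(self, now=time.monotonic):
        # 1. Notify every connected client of the impending unmount.
        for c in list(self.clients):
            self.notify_client(c)
        # 2. Wait until all clients have disconnected, or the timer expires.
        deadline = now() + self.timeout
        while self.clients and now() < deadline:
            time.sleep(0.01)
        # 3. Either way, stop processing new client requests.
        self.accepting_requests = False

    def notify_client(self, client):
        pass  # placeholder: send the shutdown notification to the client

    def handle_disconnect(self, client):
        # An OBD_DISCONNECT request acknowledges the notification.
        self.clients.discard(client)
```

Note that the timeout bounds the wait, so a dead or unresponsive client cannot block the unmount indefinitely.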

When a client receives such a notification from the server, it should mark its import as "NOT AVAILABLE" for creating new requests. However, lock cancellation requests, pending write requests for cached data, and other pending requests should still be allowed in this state. MDS_SYNC and OST_SYNC should also be allowed to be sent to the server. After that, MDS_DISCONNECT or OST_DISCONNECT should be sent to the server to indicate that the client has completed its operations, and then the client should try to reconnect to the server. Only when the new connection is established will the import be marked as "AVAILABLE" to send new requests.
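The client import transitions described above can be sketched as a small state machine. This is a hypothetical Python illustration; the state names follow the text, but the class, method names, and the exact set of opcodes allowed while draining are assumptions, not Lustre code.

```python
from enum import Enum, auto

class ImportState(Enum):
    AVAILABLE = auto()      # normal operation, new requests allowed
    NOT_AVAILABLE = auto()  # shutdown notified: only flush/cancel/sync allowed
    DISCONNECTED = auto()   # OBD_DISCONNECT sent, awaiting reconnection

# Illustrative set of request types still permitted after the notification:
# lock cancellations, pending cached writes, and the SYNC requests.
ALLOWED_WHEN_DRAINING = {"LDLM_CANCEL", "OST_WRITE", "MDS_SYNC", "OST_SYNC"}

class ClientImport:
    """Hypothetical sketch of the import state transitions in the text."""

    def __init__(self):
        self.state = ImportState.AVAILABLE

    def on_shutdown_notification(self):
        self.state = ImportState.NOT_AVAILABLE

    def may_send(self, opcode):
        if self.state is ImportState.AVAILABLE:
            return True
        if self.state is ImportState.NOT_AVAILABLE:
            return opcode in ALLOWED_WHEN_DRAINING
        return opcode == "OBD_CONNECT"  # only reconnection while disconnected

    def disconnect(self):
        # Sent once cache flush and lock cancellation are complete.
        self.state = ImportState.DISCONNECTED

    def on_reconnect(self):
        # New connection established: new requests are allowed again.
        self.state = ImportState.AVAILABLE
```

For example, after `on_shutdown_notification()` a new `MDS_REINT` would be refused while `MDS_SYNC` is still permitted, and after `disconnect()` only `OBD_CONNECT` passes the check.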

Callback/Notification from server

The server needs to notify its clients that it is going to shut down, so that the clients can complete their requests and flush their caches. There are two alternatives for doing this.

LDLM

The client gets an LDLM lock from the server, with a special lock resource name, after the connection to the server has been established successfully. This is an IBITS lock on the MDS, or an extent lock on the OST. The MDS/OST should handle this lock specially, because the resource name does not contain a valid object identifier. The client holds this as a CR lock. Only while holding this lock can the client send new requests to the server, except for some special requests such as CONNECT. When the server wants to notify its clients, it tries to get an EX lock on that resource name. This triggers a blocking callback to the clients. Clients should respond to this callback by cancelling all the locks they acquired from this server, flushing all data, and finally cancelling this lock as well. When the server succeeds in getting the EX lock, or the timer expires, it continues to shut down.
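The mechanism relies on CR locks being mutually compatible while conflicting with EX, so the server's EX enqueue fires a blocking callback on every client holding the CR "notification" lock. A minimal Python sketch of that conflict-as-broadcast idea, with entirely hypothetical names (`NotificationResource`, `enqueue`, `cancel`), not Lustre's LDLM implementation:

```python
# Lock-mode compatibility for the two modes used here: CR (concurrent read)
# locks coexist with each other, but conflict with EX (exclusive).
COMPATIBLE = {
    ("CR", "CR"): True,
    ("CR", "EX"): False,
    ("EX", "CR"): False,
    ("EX", "EX"): False,
}

class NotificationResource:
    """Hypothetical model of the special notification lock resource."""

    def __init__(self):
        self.granted = []  # list of (owner, mode) pairs currently granted

    def enqueue(self, owner, mode, blocking_cb=None):
        # Issue a blocking callback to every holder of a conflicting lock.
        for holder, held_mode in list(self.granted):
            if not COMPATIBLE[(held_mode, mode)] and blocking_cb:
                blocking_cb(holder)
        # The lock is granted only once no conflicting locks remain.
        if all(COMPATIBLE[(m, mode)] for _, m in self.granted):
            self.granted.append((owner, mode))
            return True
        return False  # waiting for cancellations (or, eventually, timeout)

    def cancel(self, owner):
        self.granted = [(o, m) for o, m in self.granted if o != owner]
```

Usage mirrors the text: each client enqueues a CR lock at connect time; when the server enqueues EX it is initially refused, but every CR holder receives a blocking callback, and once they cancel, the EX lock is granted and the shutdown proceeds.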

Connection callback

The server initiates a special callback to its clients: the connection callback. This can be handled on the client by the LDLM callback thread, or by a new dedicated thread. However, adding a new thread for this is undesirable for some customers, such as Cray, so it would be preferable to handle it in the existing client callback thread as a special case. The steps the clients should take in response to this callback are the same as for the LDLM approach.

Risks

There will be some race conditions in these solutions: new requests, pending requests, and new connections on the clients; callbacks to clients, stopping the processing of new requests, and connections on the servers. These should be discussed in detail in the HLD. Which solution to use, LDLM or connection callback, is not yet decided; this should be addressed in a wider discussion.

OPEN request reformat in upgrade

OPEN handles should be preserved across an upgrade between protocol versions, so the OPEN request needs to be replayed. The OPEN request will be re-created with the new wire protocol and then replayed. This is a special case for simplified interoperation.
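The re-creation step can be sketched as follows: rather than replaying a saved (possibly old-format) open RPC, the client keeps only the minimal per-handle state and rebuilds the request in whatever wire format was negotiated at reconnect. All names and fields below (`OpenHandle`, `build_open_rpc`, `replay_opens`, the dict layout) are illustrative assumptions, not the actual Lustre request format.

```python
class OpenHandle:
    """Minimal state kept per open file handle (illustrative fields only)."""
    def __init__(self, fid, flags):
        self.fid = fid      # file identifier (the saved "open FID" state)
        self.flags = flags  # open flags, e.g. "O_RDWR"

def build_open_rpc(handle, wire_version):
    # Format the request for the protocol version negotiated at reconnect,
    # so a 1.8 client can replay opens to either a 1.8 or a 2.0 MDS.
    return {
        "opcode": "MDS_REINT_OPEN",
        "wire_version": wire_version,
        "fid": handle.fid,
        "flags": handle.flags,
    }

def replay_opens(handles, negotiated_version):
    # Called during recovery: one freshly built RPC per open file handle.
    return [build_open_rpc(h, negotiated_version) for h in handles]
```

Because the RPC is generated at replay time, the client never needs to keep or reformat the original request, which also gives the memory saving noted in the introduction.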

Related documents

media:SC09-Simplified-Interop.pdf

[[Category:Architecture]] [[Category:Recovery]] [[Category:Scalability]]