Simplified Interoperability

=Simplified Interoperability Solution Architecture=

Introduction
Because Lustre 1.8 and 2.0 use different RPC wire protocols, interoperability between these two releases was considerably more complex than necessary. The first part of the work was to make the 1.8 client understand both the 1.8 and the 2.0 wire protocol. This lets a 1.8 client talk to 1.8 MDS/OST servers as well as 2.0 MDS/OST servers. Saved open RPCs in the replay list remained particularly complex, however: a 1.8 client may have saved an old-format RPC, which a 2.0 MDS would not understand. It would be better for the client to generate a new-format open RPC from the open Lustre file handles rather than keep the old-format RPC in the replay list. Dropping the saved open RPC would also reduce memory usage once the open+create RPC is committed on the MDT.

An upgrade or downgrade is treated as a normal recovery from a server failover. When the MDS is upgraded from 1.8 to 2.0, the clients see the old MDS go down and come back up with the new version, and they then talk to the new server using the new wire protocol. This recovery includes request replay/resend, and OPEN replay may be needed in the process. Request replay/resend during an upgrade/downgrade is more complex than in the normal case: each request must be re-packed before it is replayed or resent, because the client now has to talk to the server in a different wire protocol. We call this request reformat. Any reint or ldlm request may be replayed or resent during recovery. Request reformat not only adds code complexity, it also makes testing more difficult.

Simplified Interoperability aims to reduce or eliminate the need for request reformat. Before an upgrade/downgrade, we try to empty the replay/resend queue on the client side by invalidating the locks and flushing the caches. To achieve this, we introduce a "controlled" shutdown of servers. When a server performs a controlled shutdown, it notifies all connected clients that it is going to shut down. Each client should cancel all the locks it acquired from this server, flushing any dirty cache or discarding cached data along with each lock. The client should then send OBD_DISCONNECT to the server to acknowledge the notification. At the same time, the client should mark its import as "NOT AVAILABLE" for sending new requests, with the exception of a new reconnection.
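
The client-side reaction described above can be sketched roughly as follows. This is only an illustration of the ordering of the steps, not Lustre code: the `ClientImport` class, its fields, and the `"WRITE"` and `"LOCK_CANCEL"` labels are hypothetical names chosen for the example; only `OBD_DISCONNECT` comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ClientImport:
    """Hypothetical client-side view of one server connection (an 'import')."""
    available: bool = True
    locks: set = field(default_factory=set)           # locks granted by this server
    dirty_pages: list = field(default_factory=list)   # cached writes not yet sent
    sent: list = field(default_factory=list)          # log of requests sent

    def on_shutdown_notice(self):
        """React to the server's controlled-shutdown notification."""
        # Stop creating new requests on this import.
        self.available = False
        # Flush dirty cache while the covering locks are still held.
        while self.dirty_pages:
            self.sent.append(("WRITE", self.dirty_pages.pop(0)))
        # Cancel every lock acquired from this server.
        for lock in sorted(self.locks):
            self.sent.append(("LOCK_CANCEL", lock))
        self.locks.clear()
        # Acknowledge the notification last, once nothing is left to replay.
        self.sent.append(("OBD_DISCONNECT", None))
```

After `on_shutdown_notice()` returns, the import in this sketch holds no locks and no dirty data, which is the condition under which nothing would need to be reformatted after the server restarts.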

The detailed design of simplified interoperability, and its alternatives, are described in the following sections.

Client and server state
Normally, once this feature has landed, every server shutdown (umount) is a "controlled" shutdown. The server notifies all its clients that it is going to be unmounted. It then starts a timer and waits for all its clients to cancel their locks, flush their caches, and complete their pending requests. When the OBD_DISCONNECT request arrives at the server, or the timer expires, the server stops processing any new requests from clients.
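
The server-side wait can be pictured as a loop bounded by a deadline. The sketch below is a simplified model under stated assumptions: `wait_for_clients`, `poll_disconnects`, and the injectable `clock` and `sleep` parameters are hypothetical names introduced here, and a real server would be event-driven rather than polling.

```python
import time


def wait_for_clients(connected, poll_disconnects, timeout_s,
                     clock=time.monotonic, sleep=time.sleep):
    """Wait until every client has disconnected or the timer expires.

    connected        -- iterable of client identifiers notified of the shutdown
    poll_disconnects -- callable returning the clients that disconnected
                        since the last call (hypothetical hook)
    Returns the set of clients that had NOT disconnected when waiting ended.
    """
    pending = set(connected)
    deadline = clock() + timeout_s
    while pending and clock() < deadline:
        pending -= set(poll_disconnects())
        if pending:
            sleep(0.01)  # avoid a busy loop while clients drain
    return pending
```

An empty return value means every client acknowledged the shutdown in time; a non-empty one identifies the clients whose requests may still need replay after the restart.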

When a client gets such a notification from the server, it should mark its import as "NOT AVAILABLE" for creating new requests. However, lock cancellation requests, pending write requests for cached data, and other pending requests should still be allowed in this state. "MDS_SYNC" and "OST_SYNC" should also still be allowed to be sent to the server. After that, "MDS_DISCONNECT" or "OST_DISCONNECT" should be sent to the server to indicate that the client has completed its operations. The client should then try to reconnect to the server. Only when a new connection is established is the import marked as "AVAILABLE" for sending new requests.
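
The import states and the requests allowed in each can be summarized as a small gate function. This is a sketch of the rule stated above, not the actual import state machine: `ImportState`, `may_send`, and the `"LOCK_CANCEL"`, `"CACHE_WRITE"`, and `"CONNECT"` labels are illustrative names; the SYNC and DISCONNECT names are taken from the text.

```python
from enum import Enum


class ImportState(Enum):
    AVAILABLE = "AVAILABLE"
    NOT_AVAILABLE = "NOT AVAILABLE"  # server announced a controlled shutdown


# Requests that may still be sent while the import is NOT AVAILABLE.
ALLOWED_WHILE_DRAINING = {
    "LOCK_CANCEL",      # illustrative label for lock cancellation
    "CACHE_WRITE",      # illustrative label for writes of cached data
    "MDS_SYNC", "OST_SYNC",
    "MDS_DISCONNECT", "OST_DISCONNECT",
    "CONNECT",          # illustrative label for the reconnection attempt
}


def may_send(state, opcode, already_pending=False):
    """Return True if a request may be sent in the given import state."""
    if state is ImportState.AVAILABLE:
        return True
    # Requests created before the notification are allowed to complete.
    return already_pending or opcode in ALLOWED_WHILE_DRAINING
```

For example, `may_send(ImportState.NOT_AVAILABLE, "OPEN")` is false for a newly created request, while the same call with `already_pending=True` is true.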

Callback/Notification from server
The server needs to notify its clients that it is going to shut down. This gives the clients the chance to complete their requests and flush their caches. There are two alternative ways to do this.

LDLM
After the connection to a server is established successfully, the client gets an LDLM lock from the server on a special lock resource name. This is an IBITS lock on an MDS, or an extent lock on an OST. The MDS/OST should handle this lock specially, because the resource name does not contain a valid object identification. The client holds this lock in CR mode. Only while holding such a lock can it send new requests to the server, except for some special requests such as CONNECT. When the server wants to notify its clients, it tries to get an EX lock on that resource name. This triggers a blocking callback to the clients. Each client should respond to this callback by cancelling all the locks it acquired from this server, flushing all data, and cancelling this lock too. When the server succeeds in getting the EX lock, or the attempt times out, it continues with the shutdown.
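
The LDLM alternative relies on the fact that an EX lock conflicts with every CR lock on the same resource, so the server's EX request cannot be granted until all clients have cancelled. A minimal model of that one special resource, with the hypothetical class name `ShutdownResource` and hypothetical method names, might look like this:

```python
class ShutdownResource:
    """Toy model of the special lock resource used for shutdown notification."""

    def __init__(self):
        self.cr_holders = set()    # clients holding the CR lock
        self.ex_requested = False  # server has asked for EX

    def grant_cr(self, client):
        """Grant CR to a newly connected client unless shutdown has begun."""
        if self.ex_requested:
            return False
        self.cr_holders.add(client)
        return True

    def request_ex(self):
        """Server enqueues EX; return clients needing a blocking callback."""
        self.ex_requested = True
        return sorted(self.cr_holders)

    def cancel_cr(self, client):
        """Client cancels its CR lock after flushing and cancelling the rest."""
        self.cr_holders.discard(client)

    def ex_granted(self):
        """EX is granted only once no conflicting CR lock remains."""
        return self.ex_requested and not self.cr_holders
```

In this model the server would call `request_ex()`, send a blocking callback to each returned client, and proceed with the shutdown once `ex_granted()` is true or its timer expires.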

Connection callback
The server initiates a special callback to its clients: a connection callback. On the client, this can be handled by the LDLM callback thread or by a new dedicated thread. However, adding a new thread for this is sometimes not a good idea for customers such as Cray. So it would be preferable if this could be handled by the client callback thread as a special case. The steps a client should take in response to this callback are the same as in the LDLM alternative.

Risks
Both solutions involve race conditions: on the clients, among new requests, pending requests, and new connections; on the servers, among the callback to the clients, stopping the processing of new requests, and connections. These should be discussed in detail in the HLD. Which solution to use, LDLM or connection callback, has not yet been decided and should be addressed in a wider discussion.

OPEN request reformat in upgrade
OPEN handles should be preserved across protocol versions during an upgrade, so the OPEN request needs to be replayed. The OPEN request will be re-created in the new wire protocol and then replayed. This is a special case for simplified interoperability.
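
The idea of re-creating OPEN requests from the open handles, instead of keeping saved RPCs, can be sketched as below. Everything here is an assumption made for illustration: `OpenHandle`, its fields, and the two packer functions are hypothetical, and the dictionaries they return are placeholders that do not reflect the real 1.8 or 2.0 wire layouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenHandle:
    """Hypothetical record the client keeps for each open file."""
    object_id: str
    flags: int


def pack_old_format(handle):
    # Placeholder for the old wire format (NOT the real 1.8 layout).
    return {"format": "old", "op": "OPEN",
            "id": handle.object_id, "flags": handle.flags}


def pack_new_format(handle):
    # Placeholder for the new wire format (NOT the real 2.0 layout).
    return {"format": "new", "op": "OPEN",
            "id": handle.object_id, "flags": handle.flags}


def rebuild_open_replays(open_handles, pack):
    """Build one OPEN replay request per open handle with the given packer."""
    return [pack(handle) for handle in open_handles]
```

Because the replay list is derived from the handles at reconnect time, the client in this sketch only needs to pick the packer that matches the server it reconnected to, and no old-format request has to be stored or converted.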

Related documents
[[media:SC09-Simplified-Interop.pdf]]

[[Category:Architecture]] [[Category:Recovery]] [[Category:Scalability]]