
1. Introduction

The Portal RPC subsystem is a reliable messaging service layered on top of LNET. It caters for both small messages and bulk data transfers. A large proportion of Lustre’s error recovery functionality is embodied in the Portal RPC subsystem.

Communication is initiated by the client in the form of a request that is transmitted to the server for processing; the server then returns a reply to the client. If bulk data is to be transferred to/from the server, that is handled separately from the request/reply pair.

The communication end-points are represented by an "export" object on the server side and an "import" object on the client side.

Requests and replies are communicated using an "on the wire" format. At each end, the request or reply is packed/unpacked from/to the host format. The formats are all fixed at compile time.

Portal RPC messages can be sent either individually or in sets. Communication can be either synchronous (wait for the reply before proceeding) or asynchronous (proceed without waiting for a reply - the reply is processed by the Portal RPC daemon, ptlrpcd).
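
To make the two styles concrete, here is a minimal sketch (assuming a request req has already been allocated and packed elsewhere; my_reply_interpreter is a hypothetical callback, and the exact ptlrpcd_add_req() signature varies between Lustre versions):

/* Synchronous: block until the reply arrives (or the request times out). */
rc = ptlrpc_queue_wait(req);

/* Asynchronous: hand the request to the ptlrpcd daemon; the
 * rq_interpret_reply callback runs when the reply is received. */
req->rq_interpret_reply = my_reply_interpreter;  /* hypothetical callback */
ptlrpcd_add_req(req);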

2. Principal data structures

When thinking about how messages are assembled and disassembled, the most relevant types are:

ptlrpc_request

top-level object that, essentially, contains everything known about the message

req_capsule

a field (rq_pill) in the ptlrpc_request object that defines the message’s format (what fields it contains) and the lengths of any variable-length fields in the message

req_format

describes the format of a message (what fields are contained in the request and the associated reply).

req_msg_field

characterises an individual field in a message (how big the field is, what function to use to convert from host to on-the-wire format, etc.)

Each RPC request (and the associated reply) has a fixed number of fields. Some fields are fixed size but others can be variable length (strings or arrays of data). The functions that pack and unpack the requests/replies to/from the on-the-wire format need to know for each RPC request/reply type what fields it contains and how those fields should be handled.

A req_format object is defined for each known RPC type and it contains two arrays of req_msg_field objects (one array describes the fields in the request sent by the client and the other array describes the fields in the reply sent by the server).

Each req_msg_field object specifies, for a single RPC message field, whether it is fixed or variable size, whether it is an array or string, and what special functions are required to byte-swap it to/from the on-the-wire format or dump it out.
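
As an illustration, formats are built from shared field descriptors in lustre/ptlrpc/layout.c roughly as follows (a sketch: the array names and the exact server-side field list for LDLM_CANCEL are illustrative):

/* Fields in an LDLM_CANCEL request, in on-the-wire order.  The
 * descriptors (RMF_*) are declared once and shared by many formats. */
static const struct req_msg_field *ldlm_cancel_client[] = {
        &RMF_PTLRPC_BODY,       /* generic header field */
        &RMF_DLM_REQ            /* struct ldlm_request payload */
};

static const struct req_msg_field *ldlm_cancel_server[] = {
        &RMF_PTLRPC_BODY        /* the reply carries only the generic body */
};

struct req_format RQF_LDLM_CANCEL =
        DEFINE_REQ_FMT0("LDLM_CANCEL",
                        ldlm_cancel_client, ldlm_cancel_server);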

Messages are packed into an on-the-wire format (lustre_msg_v2) that consists of a variable-length header that includes a buffer count (lm_bufcount) and an array of buffer lengths (lm_buflens). Following the header are lm_bufcount buffers that contain the actual message content and whatever other data is required on the wire (e.g. a crypto wrapper).

The ptlrpc_request field rq_reqmsg points at the packed request data within the message buffer and the message buffer itself is pointed to by rq_reqbuf.
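
For reference, a sketch of the wire header (following the lustre_msg_v2 definition in lustre_idl.h of this era; lm_buflens is the variable-length tail):

struct lustre_msg_v2 {
        __u32 lm_bufcount;      /* number of buffers that follow */
        __u32 lm_secflvr;       /* security flavour */
        __u32 lm_magic;         /* LUSTRE_MSG_MAGIC_V2 */
        __u32 lm_repsize;       /* reply buffer size the client prepared */
        __u32 lm_cksum;
        __u32 lm_flags;
        __u32 lm_padding_2;
        __u32 lm_padding_3;
        __u32 lm_buflens[0];    /* lm_bufcount lengths; the buffers follow */
};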

This diagram shows these relationships for a typical request. In this example (an "LDLM_CANCEL" request), the request contains two fields: "ptlrpc_body" and "dlm_req".

       (ptlrpc_request)
req -->|--------------|
       |              |
       | ...          |
       |              |            on-the-wire format
       |              |
       |--------------|            (lustre_msg_v2)
       | rq_reqbuf    |----------->|---------------|               ^
       |--------------|            | lm_bufcount   |               |
       | rq_reqbuf_len|            |---------------|               |
       |--------------|            |               |               |
       | rq_reqmsg    |-------*    |---------------|               |
       |--------------|       |    | ...           |               |
       | rq_reqlen    |       |    |---------------|               |
       |--------------|       |    | lm_buflens[0] |               | rq_reqbuf_len
       | ...          |       |    |---------------|               |
       |--------------|       |    | ...           |               |
       | rq_pill      |       |    |---------------|               |
       |              |       |    | lm_buflens[n] |               |
       |   |----------|       |    |---------------|               |
       |   | rc_fmt   |--*    |    | other buffers |               |
       |   |----------|  |    |    | (crypto, etc.)|               |
       |   | rc_area[]|  |    |    |---------------|               |
       |   |----------|  |    *--->| msg field[0]  | ^             |
       |--------------|  |         |---------------| |             |
                         |         | ...           | | rq_reqlen   |
   *---------------------*         |---------------| |             |
   |                               | msg field[n]  | v             v
   |                               |---------------|
   |
   |
   |   (req_format)
   *-->|-----------------------|
       | rf_name               |---> e.g. "LDLM_CANCEL"
       |-----------------------|
       | rf_idx                | = value set by req_layout_init()
       |-----------------------|
       | rf_fields[RCL_CLIENT] |------>|--------------| (array of req_msg_field)
       |-----------------------|       | rmf_flags    | = 0
       | rf_fields[RCL_SERVER] |--*    |--------------|
       |-----------------------|  |    | rmf_name     | = "ptlrpc_body"
                                  |    |--------------|
                                  |  0 | rmf_size     | = sizeof(struct ptlrpc_body)
   [reply req_msg_field array]<---*    |--------------|
                                       | rmf_swabber  | = lustre_swab_ptlrpc_body
                                       |--------------|
                                       | rmf_dumper   | = NULL
                                       |--------------|
                                       | rmf_offset   | = value set by req_layout_init()
                                    ---|--------------|
                                       | rmf_flags    | = RMF_F_NO_SIZE_CHECK
                                       |--------------|
                                       | rmf_name     | = "dlm_req"
                                       |--------------|
                                     1 | rmf_size     | = sizeof(struct ldlm_request)
                                       |--------------|
                                       | rmf_swabber  | = lustre_swab_ldlm_request
                                       |--------------|
                                       | rmf_dumper   | = NULL
                                       |--------------|
                                       | rmf_offset   | = value set by req_layout_init()
                                    ---|--------------|

2.1. ptlrpc_connection

Represents a connection to a Portal RPC peer.

/**
 * Structure to define a single portal connection.
 */
struct ptlrpc_connection {
        /** linkage for connections hash table */
        struct hlist_node       c_hash;
        /** Our own lnet nid for this connection */
        lnet_nid_t              c_self;
        /** Remote side nid for this connection */
        lnet_process_id_t       c_peer;
        /** UUID of the other side */
        struct obd_uuid         c_remote_uuid;
        /** reference counter for this connection */
        atomic_t                c_refcount;
};

2.2. ptlrpc_client

Represents a Portal RPC client - holds the portal number the client will use to send requests and receive replies.

/** Client definition for PortalRPC */
struct ptlrpc_client {
        /** What lnet portal does this client send messages to by default */
        __u32                   cli_request_portal;
        /** What portal do we expect replies on */
        __u32                   cli_reply_portal;
        /** Name of the client */
        char                   *cli_name;
};

2.3. ptlrpc_request_set

Represents a group of Portal RPC requests that are to be sent concurrently - the set completes when replies have been received for all of the requests sent.

/**
 * Definition of request set structure.
 * A request set is a list of requests (not necessarily to the same target)
 * that, once populated with RPCs, can be sent in parallel.
 * There are two kinds of request sets: general purpose and with a dedicated
 * serving thread. An example of the latter is the ptlrpcd set.
 * For general-purpose sets, once the set has started sending it is impossible
 * to add new requests to it.
 * Provides a way to call "completion callbacks" when all requests in the set
 * have returned.
 */
struct ptlrpc_request_set {
        /** number of uncompleted requests */
        int               set_remaining;
        /** wait queue to wait on for request events */
        cfs_waitq_t       set_waitq;
        cfs_waitq_t      *set_wakeup_ptr;
        /** List of requests in the set */
        struct list_head  set_requests;
        /**
         * List of completion callbacks to be called when the set is completed
         * This is only used if \a set_interpret is NULL.
         * Links struct ptlrpc_set_cbdata.
         */
        struct list_head  set_cblist;
        /** Completion callback, if only one. */
        set_interpreter_func    set_interpret;
        /** opaque argument passed to completion \a set_interpret callback. */
        void              *set_arg;
        /**
         * Lock for \a set_new_requests manipulations
         * locked so that any old caller can communicate requests to
         * the set holder who can then fold them into the lock-free set
         */
        spinlock_t        set_new_req_lock;
        /** List of new yet unsent requests. Only used with ptlrpcd now. */
        struct list_head  set_new_requests;
};

/**
 * Description of a single ptlrpc_set callback
 */
struct ptlrpc_set_cbdata {
        /** List linkage item */
        struct list_head        psc_item;
        /** Pointer to interpreting function */
        set_interpreter_func    psc_interpret;
        /** Opaque argument to pass to the callback */
        void                   *psc_data;
};
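
A typical usage pattern for request sets looks roughly like this (a sketch; error handling is trimmed, and req1/req2 are assumed to be already allocated and packed):

struct ptlrpc_request_set *set;
int rc;

set = ptlrpc_prep_set();                /* allocate an empty set */
if (set == NULL)
        return -ENOMEM;

ptlrpc_set_add_req(set, req1);          /* add previously packed requests */
ptlrpc_set_add_req(set, req2);

rc = ptlrpc_set_wait(set);              /* send all, wait for all replies */

ptlrpc_set_destroy(set);                /* release the set and its requests */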

2.4. ptlrpc_reply_state

Represents a reply on the server.

/**
 * Structure to define reply state on the server
 */
struct ptlrpc_reply_state {
        /** Callback description */
        struct ptlrpc_cb_id    rs_cb_id;
        /** Linkage for list of all reply states in a system */
        struct list_head       rs_list;
        /** Linkage for list of all reply states on same export */
        struct list_head       rs_exp_list;
        /** Linkage for list of all reply states for same obd */
        struct list_head       rs_obd_list;
#if RS_DEBUG
        struct list_head       rs_debug_list;
#endif
        /** A spinlock to protect the reply state flags */
        spinlock_t             rs_lock;
        /** Reply state flags */
        unsigned long          rs_difficult:1;     /* ACK/commit stuff */
        unsigned long          rs_no_ack:1;    /* no ACK, even for
                                                  difficult requests */
        unsigned long          rs_scheduled:1;     /* being handled? */
        unsigned long          rs_scheduled_ever:1;/* any schedule attempts? */
        unsigned long          rs_handled:1;  /* been handled yet? */
        unsigned long          rs_on_net:1;   /* reply_out_callback pending? */
        unsigned long          rs_prealloc:1; /* rs from prealloc list */
        unsigned long          rs_committed:1;/* the transaction was committed
                                                 and the rs was dispatched
                                                 by ptlrpc_commit_replies */
        /** Size of the state */
        int                    rs_size;
        /** opcode */
        __u32                  rs_opc;
        /** Transaction number */
        __u64                  rs_transno;
        /** xid */
        __u64                  rs_xid;
        struct obd_export     *rs_export;
        struct ptlrpc_service *rs_service;
        /** Lnet metadata handle for the reply */
        lnet_handle_md_t       rs_md_h;
        atomic_t               rs_refcount;

        /** Context for the service thread */
        struct ptlrpc_svc_ctx *rs_svc_ctx;
        /** Reply buffer (actually sent to the client), encoded if needed */
        struct lustre_msg     *rs_repbuf;       /* wrapper */
        /** Size of the reply buffer */
        int                    rs_repbuf_len;   /* wrapper buf length */
        /** Size of the reply message */
        int                    rs_repdata_len;  /* wrapper msg length */
        /**
         * Actual reply message. Its content is encrypted (if needed) to
         * produce the reply buffer for actual sending. In the simple case
         * of no network encryption we just set \a rs_repbuf to \a rs_msg
         */
        struct lustre_msg     *rs_msg;          /* reply message */

        /** Number of locks awaiting client ACK */
        int                    rs_nlocks;
        /** Handles of locks awaiting client reply ACK */
        struct lustre_handle   rs_locks[RS_MAX_LOCKS];
        /** Lock modes of locks in \a rs_locks */
        ldlm_mode_t            rs_modes[RS_MAX_LOCKS];
};

2.5. ptlrpc_request_pool

Holds a number of pre-allocated requests.

/**
 * Definition of request pool structure.
 * The pool is used to store empty preallocated requests for the case
 * when we would actually need to send something without performing
 * any allocations (to avoid e.g. OOM).
 */
struct ptlrpc_request_pool {
        /** Locks the list */
        spinlock_t prp_lock;
        /** list of ptlrpc_request structs */
        struct list_head prp_req_list;
        /** Maximum message size that would fit into a request from this pool */
        int prp_rq_size;
        /** Function to allocate more requests for this pool */
        void (*prp_populate)(struct ptlrpc_request_pool *, int);
};
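
A pool is normally created once, e.g. when an import is set up; a sketch (OST_MAXREQSIZE and the pool size of 4 are illustrative; ptlrpc_add_rqs_to_pool() is the stock populate callback):

struct ptlrpc_request_pool *pool;

/* Pre-allocate four requests whose message buffers can hold up to
 * OST_MAXREQSIZE bytes, for use when regular allocation might fail. */
pool = ptlrpc_init_rq_pool(4, OST_MAXREQSIZE, ptlrpc_add_rqs_to_pool);
if (pool == NULL)
        return -ENOMEM;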

2.6. req_capsule

Describes the format of an RPC request/reply pair including the sizes of variable-length fields.

enum req_location {
        RCL_CLIENT,
        RCL_SERVER,
        RCL_NR
};

/* Maximal number of fields (buffers) in a request message. */
#define REQ_MAX_FIELD_NR  9

struct req_capsule {
        struct ptlrpc_request   *rc_req;
        const struct req_format *rc_fmt;
        enum req_location        rc_loc;
        __u32                    rc_area[RCL_NR][REQ_MAX_FIELD_NR];
};
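
A server-side handler typically binds a format to the capsule and then fetches fields from it; a hedged sketch:

struct req_capsule *pill = &req->rq_pill;
struct ldlm_request *dlm_req;

/* Bind the expected format, then unpack (and byte-swap if needed)
 * the dlm_req field of the incoming request. */
req_capsule_set(pill, &RQF_LDLM_CANCEL);
dlm_req = req_capsule_client_get(pill, &RMF_DLM_REQ);
if (dlm_req == NULL)
        return -EPROTO;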

2.7. req_format

Describes the format of a single type of RPC request/reply pair.

struct req_format {
        const char *rf_name;
        int         rf_idx;
        struct {
                int                          nr;
                const struct req_msg_field **d;
        } rf_fields[RCL_NR];
};

2.8. req_msg_field

Describes a single field in an RPC request/reply - fields can be structured and contain multiple values.

struct req_msg_field {
        __u32       rmf_flags;
        const char *rmf_name;
        /**
         * Field length. (-1) means "variable length".  If the
         * \a RMF_F_STRUCT_ARRAY flag is set the field is also variable-length,
         * but the actual size must be a whole multiple of \a rmf_size.
         */
        int         rmf_size;
        void      (*rmf_swabber)(void *);
        void      (*rmf_dumper)(void *);
        int         rmf_offset[ARRAY_SIZE(req_formats)][RCL_NR];
};

2.9. ptlrpc_body

All RPC requests contain an initial field, "ptlrpc_body", that contains generic request data (opcode, version, flags, etc.)

/* without gss, ptlrpc_body is put at the first buffer. */
#define PTLRPC_NUM_VERSIONS     4
struct ptlrpc_body {
        struct lustre_handle pb_handle;
        __u32 pb_type;
        __u32 pb_version;
        __u32 pb_opc;
        __u32 pb_status;
        __u64 pb_last_xid;
        __u64 pb_last_seen;
        __u64 pb_last_committed;
        __u64 pb_transno;
        __u32 pb_flags;
        __u32 pb_op_flags;
        __u32 pb_conn_cnt;
        __u32 pb_timeout;  /* for req, the deadline, for rep, the service est */
        __u32 pb_service_time; /* for rep, actual service time */
        __u32 pb_limit;
        __u64 pb_slv;
        /* VBR: pre-versions */
        __u64 pb_pre_versions[PTLRPC_NUM_VERSIONS];
        /* padding for future needs */
        __u64 pb_padding[4];
};
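
Code does not usually touch ptlrpc_body directly; accessor helpers locate it inside the packed message first. A small sketch:

/* Read generic header values out of a packed request message. */
__u32 opc   = lustre_msg_get_opc(req->rq_reqmsg);    /* pb_opc */
__u32 flags = lustre_msg_get_flags(req->rq_reqmsg);  /* pb_flags */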

2.10. ptlrpc_request

Represents a single RPC request.

/**
 * Represents remote procedure call.
 *
 * This is a staple structure used by everybody wanting to send a request
 * in Lustre.
 */
struct ptlrpc_request {
        /* Request type: one of PTL_RPC_MSG_* */
        int rq_type;
        /**
         * Linkage item through which this request is included into
         * sending/delayed lists on client and into rqbd list on server
         */
        struct list_head rq_list;
        /**
         * Server side list of incoming unserved requests sorted by arrival
         * time.  Traversed from time to time to spot about-to-expire
         * requests and send back "early replies" to clients to let them
         * know the server is alive and well, just too busy to service their
         * requests in time
         */
        struct list_head rq_timed_list;
        /** server-side history, used for debugging purposes. */
        struct list_head rq_history_list;
        /** server-side per-export list */
        struct list_head rq_exp_list;
        /** server-side hp handlers */
        struct ptlrpc_hpreq_ops *rq_ops;
        /** history sequence # */
        __u64            rq_history_seq;
        /** the index of service's srv_at_array into which request is linked */
        time_t rq_at_index;
        /** Result of request processing */
        int rq_status;
        /** Lock to protect request flags and some other important bits, like
         * rq_list
         */
        spinlock_t rq_lock;
        /** client-side flags are serialized by rq_lock */
        unsigned long rq_intr:1, rq_replied:1, rq_err:1,
                rq_timedout:1, rq_resend:1, rq_restart:1,
                /**
                 * when ->rq_replay is set, request is kept by the client even
                 * after server commits corresponding transaction. This is
                 * used for operations that require sequence of multiple
                 * requests to be replayed. The only example currently is file
                 * open/close. When last request in such a sequence is
                 * committed, ->rq_replay is cleared on all requests in the
                 * sequence.
                 */
                rq_replay:1,
                rq_no_resend:1, rq_waiting:1, rq_receiving_reply:1,
                rq_no_delay:1, rq_net_err:1, rq_wait_ctx:1,
                rq_early:1, rq_must_unlink:1,
                rq_fake:1,          /* this fake req */
                /* server-side flags */
                rq_packed_final:1,  /* packed final reply */
                rq_hp:1,            /* high priority RPC */
                rq_at_linked:1,     /* link into service's srv_at_array */
                rq_reply_truncate:1;

        enum rq_phase rq_phase; /* one of RQ_PHASE_* */
        enum rq_phase rq_next_phase; /* one of RQ_PHASE_* to be used next */
        atomic_t rq_refcount;   /* client-side refcount for SENT race,
                                   server-side refcount for multiple replies */

        /** initial thread servicing this request */
        struct ptlrpc_thread *rq_svc_thread;

        /** Portal to which this request would be sent */
        int rq_request_portal;  /* XXX FIXME bug 249 */
        /** Portal where to wait for reply and where reply would be sent */
        int rq_reply_portal;    /* XXX FIXME bug 249 */

        /** client-side # reply bytes actually received  */
        int rq_nob_received;

        /** Request length */
        int rq_reqlen;
        /** Request message - what client sent */
        struct lustre_msg *rq_reqmsg;

        /** Reply length */
        int rq_replen;
        /** Reply message - server response */
        struct lustre_msg *rq_repmsg;
        /** Transaction number */
        __u64 rq_transno;
        /** xid */
        __u64 rq_xid;
        /**
         * List item for the replay list. Not-yet-committed requests get linked
         * here.
         * Also see \a rq_replay comment above.
         */
        struct list_head rq_replay_list;

        /**
         * security and encryption data
         * @{
         */
        struct ptlrpc_cli_ctx   *rq_cli_ctx;     /* client's half ctx */
        struct ptlrpc_svc_ctx   *rq_svc_ctx;     /* server's half ctx */
        struct list_head         rq_ctx_chain;   /* link to waited ctx */

        struct sptlrpc_flavor    rq_flvr;        /* client & server */
        enum lustre_sec_part     rq_sp_from;

        unsigned long            /* client/server security flags */
                                 rq_ctx_init:1,      /* context initiation */
                                 rq_ctx_fini:1,      /* context destroy */
                                 rq_bulk_read:1,     /* request bulk read */
                                 rq_bulk_write:1,    /* request bulk write */
                                 /* server authentication flags */
                                 rq_auth_gss:1,      /* authenticated by gss */
                                 rq_auth_remote:1,   /* authed as remote user */
                                 rq_auth_usr_root:1, /* authed as root */
                                 rq_auth_usr_mdt:1,  /* authed as mdt */
                                 /* security tfm flags */
                                 rq_pack_udesc:1,
                                 rq_pack_bulk:1,
                                 /* doesn't expect reply FIXME */
                                 rq_no_reply:1,
                                 rq_pill_init:1;     /* pill initialized */

        uid_t                    rq_auth_uid;        /* authed uid */
        uid_t                    rq_auth_mapped_uid; /* authed uid mapped to */

        /* (server side), pointed directly into req buffer */
        struct ptlrpc_user_desc *rq_user_desc;

        /* @} */

        /** early replies go to offset 0, regular replies go after that */
        unsigned int             rq_reply_off;

        /* various buffer pointers */
        struct lustre_msg       *rq_reqbuf;      /* req wrapper */
        int                      rq_reqbuf_len;  /* req wrapper buf len */
        int                      rq_reqdata_len; /* req wrapper msg len */
        char                    *rq_repbuf;      /* rep buffer */
        int                      rq_repbuf_len;  /* rep buffer len */
        struct lustre_msg       *rq_repdata;     /* rep wrapper msg */
        int                      rq_repdata_len; /* rep wrapper msg len */
        struct lustre_msg       *rq_clrbuf;      /* only in priv mode */
        int                      rq_clrbuf_len;  /* only in priv mode */
        int                      rq_clrdata_len; /* only in priv mode */

        /** Fields that help to see if request and reply were swabbed or not */
        __u32 rq_req_swab_mask;
        __u32 rq_rep_swab_mask;

        /** What was import generation when this request was sent */
        int rq_import_generation;
        enum lustre_imp_state rq_send_state;

        /** how many early replies (for stats) */
        int rq_early_count;

        /** client+server request */
        lnet_handle_md_t     rq_req_md_h;
        struct ptlrpc_cb_id  rq_req_cbid;

        /** server-side fields: */
        /** request arrival time */
        struct timeval       rq_arrival_time;
        /** separated reply state */
        struct ptlrpc_reply_state *rq_reply_state;
        /** incoming request buffer */
        struct ptlrpc_request_buffer_desc *rq_rqbd;
#ifdef CRAY_XT3
        __u32                rq_uid;            /* peer uid, used in MDS only */
#endif

        /** client-only incoming reply: */
        lnet_handle_md_t     rq_reply_md_h;
        cfs_waitq_t          rq_reply_waitq;
        struct ptlrpc_cb_id  rq_reply_cbid;

        /** our LNet NID */
        lnet_nid_t           rq_self;
        /** Peer description */
        lnet_process_id_t    rq_peer;
        /** Server-side, export on which request was received */
        struct obd_export   *rq_export;
        /** Client side, import where request is being sent */
        struct obd_import   *rq_import;

        /** Replay callback, called after request is replayed at recovery */
        void (*rq_replay_cb)(struct ptlrpc_request *);
        /**
         * Commit callback, called when request is committed and about to be
         * freed.
         */
        void (*rq_commit_cb)(struct ptlrpc_request *);
        /** Opaque data for replay and commit callbacks. */
        void  *rq_cb_data;

        /** For bulk requests on client only: bulk descriptor */
        struct ptlrpc_bulk_desc *rq_bulk;

        /** client outgoing req: */
        /**
         * when request/reply sent (secs), or time when request should be sent
         */
        time_t rq_sent;

        /** when request must finish. volatile
         * so that servers' early reply updates to the deadline aren't
         * kept in per-cpu cache */
        volatile time_t rq_deadline;
        /** when req reply unlink must finish. */
        time_t rq_reply_deadline;
        /** when req bulk unlink must finish. */
        time_t rq_bulk_deadline;
        /**
         * service time estimate (secs)
         * If the request is not served by this time, it is marked as timed out.
         */
        int    rq_timeout;

        /** Multi-rpc bits: */
        /** Link item for request set lists */
        struct list_head rq_set_chain;
        /** Link back to the request set */
        struct ptlrpc_request_set *rq_set;
        /** Async completion handler, called when reply is received */
        ptlrpc_interpterer_t rq_interpret_reply;
        /** Async completion context */
        union ptlrpc_async_args rq_async_args;

        /** Pool, if the request is from a preallocated list */
        struct ptlrpc_request_pool *rq_pool;

        struct lu_context           rq_session;
        struct lu_context           rq_recov_session;

        /** request format */
        struct req_capsule          rq_pill;
};

2.11. ptlrpc_bulk_page

Represents a single page of bulk data.

/**
 * Structure that defines a single page of a bulk transfer
 */
struct ptlrpc_bulk_page {
        /** Linkage to list of pages in a bulk */
        struct list_head bp_link;
        /**
         * Number of bytes in a page to transfer starting from \a bp_pageoffset
         */
        int              bp_buflen;
        /** offset within a page */
        int              bp_pageoffset;
        /** The page itself */
        struct page     *bp_page;
};

2.12. ptlrpc_bulk_desc

Represents the bulk part of an RPC request.

/**
 * Definition of bulk descriptor.
 * Bulks are special "two-phase" RPCs where an initial request message
 * is sent first and is followed by a transfer (or receipt) of a large
 * amount of data to be settled into pages referenced from the bulk descriptors.
 * Bulk transfers (the actual data following the small requests) are done
 * on separate LNet portals.
 * In Lustre we use bulk transfers for READ and WRITE transfers from/to OSTs.
 * Another user is readpage for MDT.
 */
struct ptlrpc_bulk_desc {
        /** completed successfully */
        unsigned long bd_success:1;
        /** accessible to the network (network io potentially in progress) */
        unsigned long bd_network_rw:1;
        /** {put,get}{source,sink} */
        unsigned long bd_type:2;
        /** client side */
        unsigned long bd_registered:1;
        /** For serialization with callback */
        spinlock_t   bd_lock;
        /** Import generation when request for this bulk was sent */
        int bd_import_generation;
        /** Server side - export this bulk created for */
        struct obd_export *bd_export;
        /** Client side - import this bulk was sent on */
        struct obd_import *bd_import;
        /** LNet portal for this bulk */
        __u32 bd_portal;
        /** Back pointer to the request */
        struct ptlrpc_request *bd_req;
        cfs_waitq_t            bd_waitq;        /* server side only WQ */
        int                    bd_iov_count;    /* # entries in bd_iov */
        int                    bd_max_iov;      /* allocated size of bd_iov */
        int                    bd_nob;          /* # bytes covered */
        int                    bd_nob_transferred; /* # bytes GOT/PUT */

        __u64                  bd_last_xid;

        struct ptlrpc_cb_id    bd_cbid;         /* network callback info */
        lnet_handle_md_t       bd_md_h;         /* associated MD */
        lnet_nid_t             bd_sender;       /* stash event::sender */

#if defined(__KERNEL__)
        /*
         * encrypt iov, size is either 0 or bd_iov_count.
         */
        lnet_kiov_t           *bd_enc_iov;

        lnet_kiov_t            bd_iov[0];
#else
        lnet_md_iovec_t        bd_iov[0];
#endif
};
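
On the client a bulk descriptor is attached to a request before it is sent; a sketch for a bulk read, where the client is the sink of an LNet PUT (npages and pages[] are assumed to exist; a bulk write would use BULK_GET_SOURCE instead):

struct ptlrpc_bulk_desc *desc;
int i;

desc = ptlrpc_prep_bulk_imp(req, npages, BULK_PUT_SINK, OST_BULK_PORTAL);
if (desc == NULL)
        return -ENOMEM;

for (i = 0; i < npages; i++)
        ptlrpc_prep_bulk_page(desc, pages[i], 0, CFS_PAGE_SIZE);

/* desc is now linked from req->rq_bulk; sending the request kicks
 * off the second (bulk) phase of the transfer. */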

2.13. ptlrpc_thread

Represents a thread that’s serving Portal RPC requests.

/**
 * Definition of server service thread structure
 */
struct ptlrpc_thread {
        /**
         * List of active threads in svc->srv_threads
         */
        struct list_head t_link;
        /**
         * thread-private data (preallocated memory)
         */
        void *t_data;
        __u32 t_flags;
        /**
         * service thread index, from ptlrpc_start_threads
         */
        unsigned int t_id;
        /**
         * service thread pid
         */
        pid_t t_pid;
        /**
         * put watchdog in the structure per thread b=14840
         */
        struct lc_watchdog *t_watchdog;
        /**
         * the svc this thread belonged to b=18582
         */
        struct ptlrpc_service *t_svc;
        cfs_waitq_t t_ctl_waitq;
        struct lu_env *t_env;
};

2.14. ptlrpc_request_buffer_desc

Server-side object that represents a single request buffer which will hold an incoming request.

/**
 * Request buffer descriptor structure.
 * This is a structure that contains one posted request buffer for service.
 * Once data lands in a buffer, the event callback creates the actual request
 * and wakes one of the service threads to process the new incoming request.
 * More than one request can fit into the buffer.
 */
struct ptlrpc_request_buffer_desc {
        /** Link item for rqbds on a service */
        struct list_head       rqbd_list;
        /** History of requests for this buffer */
        struct list_head       rqbd_reqs;
        /** Back pointer to service for which this buffer is registered */
        struct ptlrpc_service *rqbd_service;
        /** LNet descriptor */
        lnet_handle_md_t       rqbd_md_h;
        int                    rqbd_refcount;
        /** The buffer itself */
        char                  *rqbd_buffer;
        struct ptlrpc_cb_id    rqbd_cbid;
        /**
         * This "embedded" request structure is only used for the
         * last request to fit into the buffer
         */
        struct ptlrpc_request  rqbd_req;
};

2.15. ptlrpc_service

Represents the "service" provided by a particular RPC portal.

/**
 * Definition of PortalRPC service.
 * The service listens on a particular portal (like a TCP port)
 * and performs actions for a specific server, such as the I/O service for an
 * OST or the general metadata service for an MDS.
 */
struct ptlrpc_service {
        struct list_head srv_list;              /* chain thru all services */
        int              srv_max_req_size;      /* biggest request to receive */
        int              srv_max_reply_size;    /* biggest reply to send */
        int              srv_buf_size;          /* size of individual buffers */
        int              srv_nbuf_per_group;    /* # buffers to allocate in 1 group */
        int              srv_nbufs;             /* total # req buffer descs allocated */
        int              srv_threads_min;       /* threads to start at SOW */
        int              srv_threads_max;       /* thread upper limit */
        int              srv_threads_started;   /* index of last started thread */
        int              srv_threads_running;   /* # running threads */
        atomic_t         srv_n_difficult_replies; /* # 'difficult' replies */
        int              srv_n_active_reqs;     /* # reqs being served */
        int              srv_n_hpreq;           /* # HPreqs being served */
        cfs_duration_t   srv_rqbd_timeout;      /* timeout before re-posting reqs, in tick */
        int              srv_watchdog_factor;   /* soft watchdog timeout multiplier */
        unsigned         srv_cpu_affinity:1;    /* bind threads to CPUs */
        unsigned         srv_at_check:1;        /* check early replies */
        unsigned         srv_is_stopping:1;     /* under unregister_service */
        cfs_time_t       srv_at_checktime;      /* debug */

        /** Local portal on which to receive requests */
        __u32            srv_req_portal;
        /** Portal on the client to send replies to */
        __u32            srv_rep_portal;

        /** AT stuff */
        /** @{ */
        struct adaptive_timeout srv_at_estimate;/* estimated rpc service time */
        spinlock_t        srv_at_lock;
        struct ptlrpc_at_array  srv_at_array;   /* reqs waiting for replies */
        cfs_timer_t       srv_at_timer;         /* early reply timer */
        /** @} */

        int               srv_n_queued_reqs;    /* # reqs in either of the queues below */
        int               srv_hpreq_count;      /* # hp requests handled */
        int               srv_hpreq_ratio;      /* # hp per lp reqs to handle */
        struct list_head  srv_req_in_queue;     /* incoming reqs */
        struct list_head  srv_request_queue;    /* reqs waiting for service */
        struct list_head  srv_request_hpq;      /* high priority queue */

        struct list_head  srv_request_history;  /* request history */
        __u64             srv_request_seq;      /* next request sequence # */
        __u64             srv_request_max_cull_seq; /* highest seq culled from history */
        svcreq_printfn_t  srv_request_history_print_fn; /* service-specific print fn */

        struct list_head  srv_idle_rqbds;       /* request buffers to be reposted */
        struct list_head  srv_active_rqbds;     /* req buffers receiving */
        struct list_head  srv_history_rqbds;    /* request buffer history */
        int               srv_nrqbd_receiving;  /* # posted request buffers */
        int               srv_n_history_rqbds;  /* # request buffers in history */
        int               srv_max_history_rqbds;/* max # request buffers in history */

        atomic_t          srv_outstanding_replies;
        struct list_head  srv_active_replies;   /* all the active replies */
#ifndef __KERNEL__
        struct list_head  srv_reply_queue;      /* replies waiting for service */
#endif
        cfs_waitq_t       srv_waitq; /* all threads sleep on this. This
                                      * wait-queue is signalled when new
                                      * incoming request arrives and when
                                      * difficult reply has to be handled. */

        struct list_head   srv_threads;         /* service thread list */
        /** Handler function for incoming requests for this service */
        svc_handler_t      srv_handler;
        svc_hpreq_handler_t srv_hpreq_handler;  /* hp request handler */

        char *srv_name; /* only statically allocated strings here; we don't clean them */
        char *srv_thread_name; /* only statically allocated strings here; we don't clean them */

        spinlock_t               srv_lock;

        /** Root of /proc dir tree for this service */
        cfs_proc_dir_entry_t    *srv_procroot;
        /** Pointer to statistic data for this service */
        struct lprocfs_stats    *srv_stats;

        /** List of free reply_states */
        struct list_head         srv_free_rs_list;
        /** waitq to run, when adding stuff to srv_free_rs_list */
        cfs_waitq_t              srv_free_rs_waitq;

        /**
         * Tags for lu_context associated with this thread, see struct
         * lu_context.
         */
        __u32                    srv_ctx_tags;
        /**
         * if non-NULL called during thread creation (ptlrpc_start_thread())
         * to initialize service specific per-thread state.
         */
        int (*srv_init)(struct ptlrpc_thread *thread);
        /**
         * if non-NULL called during thread shutdown (ptlrpc_main()) to
         * destruct state created by ->srv_init().
         */
        void (*srv_done)(struct ptlrpc_thread *thread);

        //struct ptlrpc_srv_ni srv_interfaces[0];
};

2.16. ptlrpcd_ctl

Control data for ptlrpcd.

/**
 * Declaration of ptlrpcd control structure
 */
struct ptlrpcd_ctl {
        /**
         * Ptlrpc thread control flags (LIOD_START, LIOD_STOP, LIOD_FORCE)
         */
        unsigned long               pc_flags;
        /**
         * Thread lock protecting structure fields.
         */
        spinlock_t                  pc_lock;
        /**
         * Start completion.
         */
        struct completion           pc_starting;
        /**
         * Stop completion.
         */
        struct completion           pc_finishing;
        /**
         * Thread requests set.
         */
        struct ptlrpc_request_set  *pc_set;
        /**
         * Thread name used in cfs_daemonize()
         */
        char                        pc_name[16];
        /**
         * Environment for request interpreters to run in.
         */
        struct lu_env               pc_env;
#ifndef __KERNEL__
        /**
         * Async rpcs flag to make sure that ptlrpcd_check() is called only
         * once.
         */
        int                         pc_recurred;
        /**
         * Currently not used.
         */
        void                       *pc_callback;
        /**
         * User-space async rpcs callback.
         */
        void                       *pc_wait_callback;
        /**
         * User-space check idle rpcs callback.
         */
        void                       *pc_idle_callback;
#endif
};

2.17. obd_export

Represents an RPC server (target) - used on both the client and the server.

/**
 * Export structure. Represents target-side of connection in portals.
 * Also used in Lustre to connect between layers on the same node when
 * there is no network-connection in-between.
 * For every connected client there is an export structure on the server
 * attached to the same obd device.
 */
struct obd_export {
        /**
         * Export handle, its id is provided to client on connect
         * Subsequent client RPCs contain this handle id to identify
         * what export they are talking to.
         */
        struct portals_handle     exp_handle;
        atomic_t                  exp_refcount;
        /**
         * Set of counters below is to track where export references are
         * kept. The exp_rpc_count is used for reconnect handling also,
         * the cb_count and locks_count are for debug purposes only for now.
         * The sum of them should be less than exp_refcount by 3
         */
        atomic_t                  exp_rpc_count; /** RPC references */
        atomic_t                  exp_cb_count; /** Commit callback references */
        atomic_t                  exp_locks_count; /** Lock references */
#if LUSTRE_TRACKS_LOCK_EXP_REFS
        struct list_head          exp_locks_list;
        spinlock_t                exp_locks_list_guard;
#endif
        /** Number of queued replay requests to be processed */
        atomic_t                  exp_replay_count;
        /** UUID of client connected to this export */
        struct obd_uuid           exp_client_uuid;
        /** To link all exports on an obd device */
        struct list_head          exp_obd_chain;
        struct hlist_node         exp_uuid_hash; /** uuid-export hash*/
        struct hlist_node         exp_nid_hash; /** nid-export hash */
        /**
         * All exports eligible for ping evictor are linked into a list
         * through this field in "most time since last request on this export"
         * order
         * protected by obd_dev_lock
         */
        struct list_head          exp_obd_chain_timed;
        /** Obd device of this export */
        struct obd_device        *exp_obd;
        /** "reverse" import to send requests (e.g. from ldlm) back to client */
        struct obd_import        *exp_imp_reverse;
        struct nid_stat          *exp_nid_stats;
        struct lprocfs_stats     *exp_md_stats;
        /** Active connection */
        struct ptlrpc_connection *exp_connection;
        /** Connection count value from last successful reconnect rpc */
        __u32                     exp_conn_cnt;
        /** Hash list of all ldlm locks granted on this export */
        lustre_hash_t            *exp_lock_hash;
        /** lock to protect exp_lock_hash accesses */
        spinlock_t                exp_lock_hash_lock;
        struct list_head          exp_outstanding_replies;
        struct list_head          exp_uncommitted_replies;
        spinlock_t                exp_uncommitted_replies_lock;
        /** Last committed transno for this export */
        __u64                     exp_last_committed;
        /** When was last request received */
        cfs_time_t                exp_last_request_time;
        /** On replay all requests waiting for replay are linked here */
        struct list_head          exp_req_replay_queue;
        /** protects exp_flags and exp_outstanding_replies */
        spinlock_t                exp_lock;
        /** Compatibility flags for this export */
        __u64                     exp_connect_flags;
        enum obd_option           exp_flags;
        unsigned long             exp_failed:1,
                                  exp_in_recovery:1,
                                  exp_disconnected:1,
                                  exp_connecting:1,
                                  /** VBR: export missed recovery */
                                  exp_delayed:1,
                                  /** VBR: failed version checking */
                                  exp_vbr_failed:1,
                                  exp_req_replay_needed:1,
                                  exp_lock_replay_needed:1,
                                  exp_need_sync:1,
                                  exp_flvr_changed:1,
                                  exp_flvr_adapt:1,
                                  exp_libclient:1, /* liblustre client? */
                                  /* client timed out and tried to reconnect,
                                   * but couldn't because of active rpcs */
                                  exp_abort_active_req:1;
        struct list_head          exp_queued_rpc;  /* RPC to be handled */
        /* also protected by exp_lock */
        enum lustre_sec_part      exp_sp_peer;
        struct sptlrpc_flavor     exp_flvr;             /* current */
        struct sptlrpc_flavor     exp_flvr_old[2];      /* about-to-expire */
        cfs_time_t                exp_flvr_expire[2];   /* seconds */

        /** Target specific data */
        union {
                struct lu_export_data     eu_target_data;
                struct mdt_export_data    eu_mdt_data;
                struct filter_export_data eu_filter_data;
                struct ec_export_data     eu_ec_data;
        } u;
};

#define exp_target_data u.eu_target_data
#define exp_mdt_data    u.eu_mdt_data
#define exp_filter_data u.eu_filter_data
#define exp_ec_data     u.eu_ec_data
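
Exports are reference counted; every user takes and drops a reference explicitly, e.g. (a sketch):

struct obd_export *exp = class_export_get(req->rq_export);
/* ... use exp ... */
class_export_put(exp);          /* may free the export on the last put */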

2.18. obd_import

Represents an RPC client - used on both the client and server.

/**
 * Definition of the PortalRPC import structure.
 * An import represents the client-side view of a remote target.
 */
struct obd_import {
        /** Local handle (== id) for this import. */
        struct portals_handle     imp_handle;
        /** Reference handle */
        atomic_t                  imp_refcount;
        struct lustre_handle      imp_dlm_handle; /* client's ldlm export */
        /** Currently active connection */
        struct ptlrpc_connection *imp_connection;
        /** PortalRPC client structure for this import */
        struct ptlrpc_client     *imp_client;
        /** List element for linking into pinger chain */
        struct list_head          imp_pinger_chain;
        /** List element for linking into chain for destruction */
        struct list_head          imp_zombie_chain;

        /**
         * Lists of requests that are retained for replay, waiting for a reply,
         * or waiting for recovery to complete, respectively.
         * @{
         */
        struct list_head          imp_replay_list;
        struct list_head          imp_sending_list;
        struct list_head          imp_delayed_list;
        /** @} */

        /** obd device for this import */
        struct obd_device        *imp_obd;

        /**
         * some security-related fields
         * @{
         */
        struct ptlrpc_sec        *imp_sec;
        struct semaphore          imp_sec_mutex;
        cfs_time_t                imp_sec_expire;
        /** @} */

        /** Wait queue for those who need to wait for recovery completion */
        cfs_waitq_t               imp_recovery_waitq;

        /** Number of requests currently in-flight */
        atomic_t                  imp_inflight;
        /** Number of requests currently unregistering */
        atomic_t                  imp_unregistering;
        /** Number of replay requests inflight */
        atomic_t                  imp_replay_inflight;
        /** Number of currently happening import invalidations */
        atomic_t                  imp_inval_count;
        /** Number of request timeouts */
        atomic_t                  imp_timeouts;
        /** Current import state */
        enum lustre_imp_state     imp_state;
        /** History of import states */
        struct import_state_hist  imp_state_hist[IMP_STATE_HIST_LEN];
        int                       imp_state_hist_idx;
        /** Current import generation. Incremented on every reconnect */
        int                       imp_generation;
        /** Incremented every time we send reconnection request */
        __u32                     imp_conn_cnt;
        /**
         * \see ptlrpc_free_committed remembers imp_generation value here
         * after a check to save on unnecessary replay list iterations
         */
        int                       imp_last_generation_checked;
        /** Last transno we replayed */
        __u64                     imp_last_replay_transno;
        /** Last transno committed on remote side */
        __u64                     imp_peer_committed_transno;
        /**
         * \see ptlrpc_free_committed remembers last_transno since its last
         * check here and if last_transno did not change since last run of
         * ptlrpc_free_committed and import generation is the same, we can
         * skip looking for requests to remove from replay list as optimisation
         */
        __u64                     imp_last_transno_checked;
        /**
         * Remote export handle. This is how remote side knows what export
         * we are talking to. Filled from response to connect request
         */
        struct lustre_handle      imp_remote_handle;
        /** When to perform next ping. time in jiffies. */
        cfs_time_t                imp_next_ping;
        /** When we last successfully connected. time in 64bit jiffies */
        __u64                     imp_last_success_conn;

        /** List of all possible connection for import. */
        struct list_head          imp_conn_list;
        /**
         * Current connection. \a imp_connection is imp_conn_current->oic_conn
         */
        struct obd_import_conn   *imp_conn_current;

        /** Protects flags, level, generation, conn_cnt, *_list */
        spinlock_t                imp_lock;

        /* flags */
        unsigned long             imp_no_timeout:1,       /* timeouts are disabled */
                                  imp_invalid:1,          /* evicted */
                                  imp_deactive:1,         /* administratively disabled */
                                  imp_replayable:1,       /* try to recover the import */
                                  imp_dlm_fake:1,         /* don't run recovery (timeout instead) */
                                  imp_server_timeout:1,   /* use 1/2 timeout on MDS' OSCs */
                                  imp_initial_recov:1,    /* retry the initial connection */
                                  imp_initial_recov_bk:1, /* turn off init_recov after trying all failover nids */
                                  imp_delayed_recovery:1, /* VBR: imp in delayed recovery */
                                  imp_no_lock_replay:1,   /* VBR: if gap was found then no lock replays */
                                  imp_vbr_failed:1,       /* recovery by versions was failed */
                                  imp_force_verify:1,     /* force an immediate ping */
                                  imp_pingable:1,         /* pingable */
                                  imp_resend_replay:1,    /* resend for replay */
                                  imp_recon_bk:1,         /* turn off reconnect if all failovers fail */
                                  imp_last_recon:1,       /* internally used by above */
                                  imp_force_reconnect:1;  /* import must be reconnected instead of choosing a new connection */
        __u32                     imp_connect_op;
        struct obd_connect_data   imp_connect_data;
        __u64                     imp_connect_flags_orig;
        int                       imp_connect_error;

        __u32                     imp_msg_magic;
        __u32                     imp_msghdr_flags;       /* adjusted based on server capability */

        struct ptlrpc_request_pool *imp_rq_pool;          /* emergency request pool */

        struct imp_at             imp_at;                 /* adaptive timeout data */
        time_t                    imp_last_reply_time;    /* for health check */
};

3. Initialization and cleanup functions

3.1. ptlrpc_init()

When the Portal RPC kernel module is loaded, this function is called to initialize the subsystem. It is also called from lllib_init() in liblustre.

__init int ptlrpc_init(void);

Most of the initialization is carried out by sub-functions.

The function starts by calling lustre_assert_wire_constants() to check that the numerous constants and data types used in the on-the-wire protocol have the expected values/sizes/offsets.

After initializing a few spin locks and mutexes, ptlrpc_init_xid() is called to initialize the node’s XID (ptlrpc_last_xid) to a value based on the current time. This is done so that the XID sequence generated after a reboot cannot contain a previously used value.

Next, req_layout_init() is called to set up the array of request formats, req_formats. Each element in the array (of type req_format) defines the on-the-wire format used by the client and the server for a given RPC operation. Each request or reply can contain a number of fields (field type/order is fixed at compile time). The request format for a particular message type specifies what fields are present and, for each field, whether the field is fixed or variable length and what swabber and dumper functions are to be used to byte-swap and dump that field’s value.

Next, ptlrpc_hr_init() is called to start the reply handling threads. One thread per online CPU is started. Each thread executes ptlrpc_hr_main(). Until the reply handler is terminated, that function waits for replies to be added to its queue and then dequeues them and passes each reply to ptlrpc_handle_rs().

Next, ptlrpc_init_portals() is called:

This function starts by calling ptlrpc_ni_init() which:

  • calls LNetNIInit() to initialise the LNET interface

  • calls LNetEQAlloc() to allocate the global PTLRPC event queue, ptlrpc_eq_h (for kernel PTLRPC, the event queue’s callback function is set to ptlrpc_master_callback())

For non-kernel PTLRPC, liblustre_check_services() is registered as a liblustre wait callback function and then ptlrpcd_addref() is called to obtain a reference to the ptlrpc daemon (ptlrpcd). The first call to that function will start the daemon.

Back in ptlrpc_init(), ptlrpc_connection_init() is called to allocate and initialize the connection hash table (conn_hash).

Next, ptl_put_connection_superhack is initialized to ptlrpc_connection_put(). FIXME - explain

A call to ptlrpc_start_pinger() starts the pinger thread, which executes ptlrpc_pinger_main().

Next, ldlm_init() is called to carry out low-level LDLM initialization (storage allocation).

Next, sptlrpc_init() is called to initialize the PTLRPC security system.

Finally, llog_recov_init() is called to allocate llcd_cache.

If any of the above functions indicate failure, control jumps to cleanup code that undoes the work of those initialization functions that have already successfully completed.

The function returns 0 on success.
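
The unwinding follows the familiar kernel error-path shape; a heavily condensed sketch of the idea (the real function's cleanup bookkeeping differs, and only a few of the steps are shown):

__init int ptlrpc_init(void)
{
        int rc;

        lustre_assert_wire_constants();
        /* ... spin lock/mutex setup, ptlrpc_init_xid() ... */

        rc = req_layout_init();
        if (rc)
                return rc;
        rc = ptlrpc_hr_init();
        if (rc)
                goto cleanup_layout;
        rc = ptlrpc_init_portals();
        if (rc)
                goto cleanup_hr;
        /* ... connection table, pinger, ldlm_init(), sptlrpc_init(),
         * llog_recov_init(), each undone on a later failure ... */
        return 0;

cleanup_hr:
        ptlrpc_hr_fini();
cleanup_layout:
        req_layout_fini();
        return rc;
}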

3.2. ptlrpc_exit()

When the Portal RPC kernel module is unloaded, this function is called to teardown the subsystem.

static void __exit ptlrpc_exit(void);

4. Functions for message manipulation

4.1. ptlrpc_request_alloc()

Allocate a new request structure and setup its buffers appropriately for the required message format.

struct ptlrpc_request *ptlrpc_request_alloc(struct obd_import *imp,
                                            const struct req_format *format);

Arguments are:

imp

the obd_import through which the message will be sent

format

the req_format object that defines the message’s format

This function simply calls ptlrpc_request_alloc_internal() with a NULL pool.

4.2. ptlrpc_request_alloc_pool()

Allocate a new request structure (from a pool if possible) and setup its buffers appropriately for the required message format.

struct ptlrpc_request *ptlrpc_request_alloc_pool(struct obd_import *imp,
                                                 struct ptlrpc_request_pool * pool,
                                                 const struct req_format *format)

Arguments are:

imp

the obd_import through which the message will be sent

pool

the request pool (ptlrpc_request_pool) from which to take the request

format

the req_format object that defines the message’s format

This function simply calls ptlrpc_request_alloc_internal().

4.3. ptlrpc_request_alloc_internal()

Allocate a new request structure (from a pool if possible) and setup its buffers appropriately for the required message format.

static struct ptlrpc_request *
ptlrpc_request_alloc_internal(struct obd_import *imp,
                              struct ptlrpc_request_pool * pool,
                              const struct req_format *format)

Arguments are:

imp

the obd_import through which the message will be sent

pool

the request pool (ptlrpc_request_pool) from which to take the request

format

the req_format object that defines the message’s format

The message allocation is delegated to __ptlrpc_request_alloc() which first tries to allocate the message from the pool (if pool is non-NULL) and, if the pool was empty, just allocates the memory itself. If the message is allocated, a reference to the import is obtained (and the import is assigned to the request’s rq_import field).

The message’s request capsule, rq_pill, is initialized by a call to req_capsule_init() which sets up a couple of fields and fills the array of field sizes (rc_area) with -1’s.

A call to req_capsule_set() stores format into the capsule’s rc_fmt field.

4.4. ptlrpc_request_free()

Free a request (to a pool if the request is associated with one).

void ptlrpc_request_free(struct ptlrpc_request *request);

Argument is the request to free. If the request’s rq_pool field is non-NULL, the request is passed to __ptlrpc_free_req_to_pool() to put it back in the pool. Otherwise, the request’s memory is freed.

4.5. ptlrpc_request_alloc_pack()

Allocate a new (simple) request and pack it ready for transmission.

struct ptlrpc_request *ptlrpc_request_alloc_pack(struct obd_import *imp,
                                                const struct req_format *format,
                                                __u32 version, int opcode)

Arguments are:

imp

the obd_import through which the message will be sent

format

the req_format object that defines the message’s format

version

client part of message version

opcode

the type of the RPC request

This function first calls ptlrpc_request_alloc() to allocate the request and then calls ptlrpc_request_pack() to pack the request, which is then returned.

4.6. ptlrpc_request_pack()

Pack a request ready for transmission.

int ptlrpc_request_pack(struct ptlrpc_request *request,
                        __u32 version, int opcode);

Arguments are:

request

the ptlrpc_request to pack

version

client part of message version

opcode

the type of the RPC request

Simply calls ptlrpc_request_bufs_pack() with NULL values for bufs and ctx.

4.7. ptlrpc_request_bufs_pack()

Pack a request ready for transmission.

int ptlrpc_request_bufs_pack(struct ptlrpc_request *request,
                             __u32 version, int opcode, char **bufs,
                             struct ptlrpc_cli_ctx *ctx)

Arguments are:

request

the ptlrpc_request to pack

version

client part of message version

opcode

the type of the RPC request

bufs

if non-NULL, it should be an array of pointers to buffers containing field data - one pointer for each RPC field in the message

ctx

optional client-side security context

Firstly, req_capsule_filled_sizes() is passed the address of request→rq_pill so that it can fill in those elements of rc_area (the array of field lengths) that have not yet been assigned (they are still set to -1). That function returns the number of fields to be packed, and that count is then passed to __ptlrpc_request_bufs_pack() along with this function’s arguments and the array of field lengths.

4.8. __ptlrpc_request_bufs_pack()

Pack a request ready for transmission.

static int __ptlrpc_request_bufs_pack(struct ptlrpc_request *request,
                                      __u32 version, int opcode,
                                      int count, __u32 *lengths, char **bufs,
                                      struct ptlrpc_cli_ctx *ctx)

Arguments are:

request

the ptlrpc_request to pack

version

client part of message version

opcode

the type of the RPC request

count

the number of fields in the request

lengths

array of field lengths

bufs

optional array of field initialisation data

ctx

optional client-side security context

Firstly, the request’s client security context (rq_cli_ctx) is assigned either the passed-in ctx (if non-NULL) or the value returned from a call to sptlrpc_req_get_ctx().

Next, sptlrpc_req_set_flavor() is called to set some message security flags depending on what the RPC opcode is.

Now, lustre_pack_request() is called to pack the RPC fields into the request’s message buffer (rq_reqmsg). When that function returns, opcode and version are also stored into the request message.

The remainder of the function is concerned with initializing myriad other fields in request, such as:

  • the callback functions that are to be invoked when the request has been sent or the reply received.

  • the request and reply portals

  • the phase of the RPC operation (RQ_PHASE_NEW)

  • various other list heads, spin locks, wait queues, etc.

4.9. lustre_pack_request()

Low level function to allocate and pack a request’s message.

int lustre_pack_request(struct ptlrpc_request *req, __u32 magic, int count,
                        __u32 *lens, char **bufs)

Arguments are:

req

the ptlrpc_request to pack

magic

a magic number that is now ignored (LUSTRE_MSG_MAGIC_V2 is used instead)

count

the number of fields in the request

lens

array of field lengths

bufs

optional array of field initialisation data

Simply calls lustre_pack_request_v2() to do the real work.

4.10. lustre_pack_request_v2()

Low level function to allocate and pack a request’s message.

int lustre_pack_request_v2(struct ptlrpc_request *req, int count,
                           __u32 *lens, char **bufs)

Arguments are:

req

the ptlrpc_request to pack

count

the number of fields in the request

lens

array of field lengths

bufs

optional array of field initialisation data

The size of the request message buffer is calculated with a call to lustre_msg_size_v2(). The resulting size is the sum of the message header size and the sizes of each of the fields (each value being rounded up to a multiple of 8 bytes before the sum is taken). The size is passed to sptlrpc_cli_alloc_reqbuf() along with req and the allocation of the message buffer (possibly from a pool of pre-allocated buffers) is done through the client security context.
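
As a standalone model of this size calculation, the following sketch uses a simplified stand-in for lustre_msg_v2 (the real header carries more fields); it illustrates the rounding-and-summing rule, not the real code.

#include <stddef.h>

/* simplified stand-in for the on-the-wire header */
struct msg_v2 {
        unsigned int lm_bufcount;
        unsigned int lm_buflens[];
};

/* round up to the next multiple of 8, as the packing code does */
static unsigned int size_round8(unsigned int val)
{
        return (val + 7) & ~7U;
}

/* header size (itself rounded) plus the rounded size of each field */
static unsigned int msg_size(int count, const unsigned int *lens)
{
        unsigned int size;
        int i;

        size = size_round8(sizeof(struct msg_v2) +
                           count * sizeof(unsigned int));
        for (i = 0; i < count; i++)
                size += size_round8(lens[i]);
        return size;
}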

Once the message buffer has been allocated, it can be accessed via req→rq_reqmsg.

The message buffer and the supplied count, lens and bufs are passed to lustre_init_msg_v2(), which stores the count and buffer lengths into the message buffer and, if bufs is non-NULL, uses it as an array of pointers to initialization data (one for each buffer), copying that data into the message buffer.

Finally, PTLRPC_MSG_VERSION is added (ORed) into the message’s pb_version field.
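
Continuing the simplified model above (reusing struct msg_v2 and size_round8() from the previous sketch), the copy step performed by lustre_init_msg_v2() can be pictured like this; it is a hedged illustration, not the real code.

#include <string.h>

/* store the count and lengths in the header, then copy each supplied
 * buffer (when bufs is non-NULL) into its 8-byte-aligned slot */
static void init_msg(struct msg_v2 *msg, int count,
                     const unsigned int *lens, char **bufs)
{
        char *ptr;
        int i;

        msg->lm_bufcount = count;
        for (i = 0; i < count; i++)
                msg->lm_buflens[i] = lens[i];

        /* the first field starts just after the (rounded) header */
        ptr = (char *)msg + size_round8(sizeof(struct msg_v2) +
                                        count * sizeof(unsigned int));
        for (i = 0; i < count; i++) {
                if (bufs != NULL && bufs[i] != NULL)
                        memcpy(ptr, bufs[i], lens[i]);
                ptr += size_round8(lens[i]);
        }
}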

4.11. ptlrpc_prep_req()

Prepare request ready for sending.

struct ptlrpc_request *
ptlrpc_prep_req(struct obd_import *imp, __u32 version, int opcode, int count,
                __u32 *lengths, char **bufs)

The arguments are:

imp

the obd_import through which the request is sent

version

client part of message version

opcode

the RPC message type

count

number of buffers supplied

lengths

array of buffer lengths

bufs

array of buffer pointers

Simply passes the arguments to ptlrpc_prep_req_pool() with a NULL pool.

4.12. ptlrpc_prep_req_pool()

Prepare request ready for sending.

struct ptlrpc_request *
ptlrpc_prep_req_pool(struct obd_import *imp,
                     __u32 version, int opcode,
                     int count, __u32 *lengths, char **bufs,
                     struct ptlrpc_request_pool *pool)

The arguments are:

imp

the obd_import through which the request is sent

version

client part of message version

opcode

the RPC message type

count

number of buffers supplied

lengths

array of buffer lengths

bufs

array of buffer pointers

pool

request pool containing free requests

The request allocation is delegated to __ptlrpc_request_alloc() which will either allocate the request from the pool (if non-NULL) or allocate the memory itself. If the allocation succeeds, __ptlrpc_request_bufs_pack() is called to pack the request, which is then returned.
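
A hedged usage sketch for this older, format-less interface follows. Consistent with the message layout described at the start of this document (the ptlrpc_body is the first field of every message), buffer 0 is assumed here to hold the ptlrpc_body; SOME_VERSION, SOME_OPCODE, payload and payload_len are illustrative placeholders.

        __u32 lens[2] = { sizeof(struct ptlrpc_body), payload_len };
        char *bufs[2] = { NULL, payload };  /* NULL: no init data */
        struct ptlrpc_request *req;

        req = ptlrpc_prep_req(imp, SOME_VERSION, SOME_OPCODE,
                              2, lens, bufs);
        if (req == NULL)
                return -ENOMEM;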

4.13. ptlrpc_queue_wait()

Send a request and wait for the reply.

int ptlrpc_queue_wait(struct ptlrpc_request *req);

Argument req is the prepared request.

Firstly, ptlrpc_prep_set() is called to allocate a new RPC set.

For debugging purposes, the current PID is stored in the message’s pb_status field.

A reference to the request is obtained (on behalf of the set) and ptlrpc_set_add_req() is called to add the request to the set and increment the set’s set_remaining field.

The set is passed to ptlrpc_set_wait() which will send the request and wait for it to complete.

The set is destroyed by calling ptlrpc_set_destroy().
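
Putting these steps together, the synchronous send path reduces to roughly the following condensed sketch (based on the prototypes given in this document; error handling is simplified and the function name is illustrative).

int queue_wait_sketch(struct ptlrpc_request *req)
{
        struct ptlrpc_request_set *set;
        int rc;

        set = ptlrpc_prep_set();
        if (set == NULL)
                return -ENOMEM;

        /* ... store current PID in pb_status for debugging ... */

        /* a reference to req is taken on behalf of the set */
        ptlrpc_set_add_req(set, req);

        rc = ptlrpc_set_wait(set);      /* send and wait for completion */

        ptlrpc_set_destroy(set);        /* releases the set's requests */
        return rc;
}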

4.14. ptlrpc_prep_set()

Allocates and initializes a new request set.

struct ptlrpc_request_set *ptlrpc_prep_set(void);

If the memory for the new set cannot be allocated, the function returns NULL. Otherwise, various fields are initialised in the expected fashion and the pointer to the new set is returned.

4.15. ptlrpc_set_destroy()

Cleans up any requests in the set and then frees the set.

void ptlrpc_set_destroy(struct ptlrpc_request_set *set);

The function starts by checking that all requests in the set are either completed or new.

Each request in the set is removed from the set and passed to ptlrpc_req_interpret() with a status of -EBADR (equivalent to the message being lost in-flight). The request is then released by passing it to ptlrpc_req_finished() (which drops a reference).

Finally, the set is freed.

4.16. ptlrpc_set_wait()

Send all unsent requests in a set and then wait for them to complete.

int ptlrpc_set_wait(struct ptlrpc_request_set *set);

If the set contains no requests, the function returns 0 immediately.

Each request in the set that is new (its rq_phase field is equal to RQ_PHASE_NEW) is sent by passing it to ptlrpc_send_new_req().

The function now loops as long as there are uncompleted requests and no error has been detected:

ptlrpc_set_next_timeout() is called to determine how many seconds will elapse before the first message in the set times out.

The thread now waits for ptlrpc_check_set() to return non-zero (indicating either that all requests have been sent and no more replies are expected, or that the timeout period should be restarted). If no timeout is pending (no messages are in-flight) and no signals are pending (cfs_signal_pending() returns non-zero when a signal is pending), the thread waits for up to 1 second with interrupts enabled. Otherwise, the thread waits until the timeout period has expired. If the wait times out, the set is passed to ptlrpc_expired_set() and if the wait was interrupted, the set is passed to ptlrpc_interrupted_set().

Each request’s status (rq_status) is checked and, if non-zero, assigned to local rc.

If the set has a non-NULL completion callback function (set_interpret), it is passed the set. Otherwise, if the set has any completion callback objects (in list set_cblist), each callback object is removed from the list, the associated callback function is passed the set and the callback object freed. Any non-zero callback function return value is propagated into rc.

The function returns rc.
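
A hedged sketch of the completion-callback dispatch just described; the cbdata structure and field names (psc_item, psc_interpret, psc_data) and set_arg are assumptions made for illustration.

        if (set->set_interpret != NULL) {
                rc = set->set_interpret(set, set->set_arg, rc);
        } else {
                struct ptlrpc_set_cbdata *cbdata, *n;
                int err;

                list_for_each_entry_safe(cbdata, n, &set->set_cblist,
                                         psc_item) {
                        list_del_init(&cbdata->psc_item);
                        err = cbdata->psc_interpret(set, cbdata->psc_data,
                                                    rc);
                        if (err != 0 && rc == 0)
                                rc = err;       /* propagate a failure */
                        OBD_FREE_PTR(cbdata);
                }
        }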

4.17. ptlrpc_check_set()

Sends any unsent requests in a set and then returns non-zero if no more replies are expected.

int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)

The arguments are:

env

a lu_env that is passed on to a request’s completion handler when it completes

set

the request set

Local force_timer_recalc is initialized to zero. This will be set if an error is detected and causes the function to return non-zero so that a thread waiting on this function will be woken.

If all requests in the set have already completed (set_remaining is 0), the function immediately returns success (1).

The remainder of the function loops over the requests in the set with local req pointing at each request.

If the request has not yet been sent (req→rq_phase equals RQ_PHASE_NEW), req is passed to ptlrpc_send_new_req() for sending and if that function indicates failure, force_timer_recalc is set.

If the request has not yet been sent because it is scheduled to be sent in the future, the loop continues on to the next request in the set.

If the current request’s phase is RQ_PHASE_UNREGISTERING:

If the request’s reply is currently being received or the reply has yet to be unlinked or if an associated bulk transfer is in progress, the loop continues.

The request’s phase is moved on to rq_next_phase.

If the request’s phase is now RQ_PHASE_COMPLETE, the loop continues with the next request.

If the request’s phase is now RQ_PHASE_INTERPRET, control jumps to near the end of the loop where the reply is interpreted.

If the request has suffered a network error (req→rq_net_err is true) and has not timed out (req→rq_timedout is false), it is passed to ptlrpc_expire_one_request() and then, if the reply has yet to be unlinked or an associated bulk transfer is in progress, the loop continues.

If the request has suffered an error (req→rq_err is true), req→rq_replied is cleared (and if req→rq_status is zero, it is set to -EIO), the request’s phase is changed to RQ_PHASE_INTERPRET and control jumps to the reply interpretation code below.

If the request has received an interrupt (req→rq_intr is true) and has either timed out (req→rq_timedout is true) or is waiting for the import to come ready (req→rq_waiting is true) or is waiting for a context (req→rq_wait_ctx is true) then req→rq_status is set to -EINTR, the request’s phase is changed to RQ_PHASE_INTERPRET and control jumps to the reply interpretation code below.

If the request’s phase is RQ_PHASE_RPC:

If the request has either timed out or is being resent (req→rq_resend is true) or waiting for the import to become ready or is waiting for a context:

ptlrpc_unregister_reply() is called to (asynchronously) unlink the request’s reply buffer from the network. Until the unlink has completed, the function will return false and this loop continues on to look at the next request.

ptlrpc_import_delay_req() is called to determine if the request should be delayed until the import becomes ready and, if so, it is appended to the import’s imp_delayed_list and the loop continues. (Note that rq_waiting is not set here as it is in ptlrpc_send_new_req(); is this a bug?)

If an error was detected in ptlrpc_import_delay_req(), that status is assigned to req→rq_status, the request’s phase is changed to RQ_PHASE_INTERPRET and control jumps to the reply interpretation code below.

If the request is not to be resent (req→rq_no_resend is true) and is not waiting for a context then rq_status is set to -ENOTCONN, the request’s phase is changed to RQ_PHASE_INTERPRET and control jumps to the reply interpretation code below.

The request is appended to the import’s imp_sending_list.

The request is marked as not waiting for the import (req→rq_waiting is cleared).

If the request has timed out or is being resent (req→rq_timedout or req→rq_resend true), req→rq_resend is set true and if there is a bulk transfer associated with the request (req→rq_bulk is true), the request’s rq_xid field is assigned a new XID value to make the previous bulk transfer fail.

Now sptlrpc_req_refresh_ctx() is called to refresh the request’s client context. If that function returns an error code and req→rq_err is true, the error code is stored in req→rq_status and force_timer_recalc is set true. If the function returned an error code and req→rq_err is false, req→rq_wait_ctx is set true to indicate that the request is waiting for a context.

If the client context was refreshed, req→rq_wait_ctx is cleared.

The request is sent by passing it to ptl_send_rpc() and if that function returns an error code, force_timer_recalc and req→rq_net_err are set true.

force_timer_recalc is set true to reset the timeout.

If the request has received an early reply, ptlrpc_at_recv_early_reply() is called and the loop continues.

If the request is currently receiving a reply, the loop continues.

If the request has not yet received a reply, the loop continues.

The request is passed to after_reply() and if req→rq_resend is now true, the loop continues.

If there is no bulk transfer associated with this request (req→rq_bulk is NULL) or the request has failed (req→rq_status is non-zero), then the request’s phase is changed to RQ_PHASE_INTERPRET and control jumps to the reply interpretation code below.

The request’s phase is changed to RQ_PHASE_BULK.

If the bulk transfer is still active, the loop continues.

The request’s phase is changed to RQ_PHASE_INTERPRET and control has now reached the portion of the loop that interprets the reply.

ptlrpc_unregister_reply() is called to (asynchronously) unlink the request’s reply buffer from the network. Until the unlink has completed, the function will return false and this loop continues on to look at the next request.

Similarly, ptlrpc_unregister_bulk() is called to (asynchronously) unlink the request’s bulk descriptor from the network. Until the unlink has completed, the function will return false and this loop continues on to look at the next request.

Now, ptlrpc_req_interpret() is passed env, req and the current value of req→rq_status and it simply invokes the request’s callback function rq_interpret_reply (if non-NULL).

The request’s phase is changed to RQ_PHASE_COMPLETE.

If the request is still linked onto a list (rq_list not empty), it is removed from the list and the import’s number of in-flight RPCs (imp_inflight) is decremented.

The number of requests in the set that haven’t yet completed (set_remaining) is decremented and the threads waiting on the import’s recovery wait queue (imp_recovery_waitq) are woken.

After all of the requests have been processed by the loop, the function returns true if either the number of requests in the set that haven’t completed has become zero (the set is now empty) or if force_timer_recalc is true.
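
In summary, ptlrpc_check_set() drives each request through the phases below; errors, interrupts and requests without bulk transfers jump straight from RQ_PHASE_RPC to RQ_PHASE_INTERPRET, and RQ_PHASE_UNREGISTERING interposes wherever reply or bulk buffers must be unlinked before the phase can advance.

       RQ_PHASE_NEW
            |
            v
       RQ_PHASE_RPC ---------------+
            |                      |  (error, interrupt or
            v                      |   no bulk transfer)
       RQ_PHASE_BULK               |
            |                      |
            v                      v
       RQ_PHASE_INTERPRET <--------+
            |
            v
       RQ_PHASE_COMPLETE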

4.18. after_reply()

Callback function invoked after an RPC reply is received.

static int after_reply(struct ptlrpc_request *req);

Argument req is the request that has received a reply.

If the reply has been truncated (rq_reply_truncate is true) and resending is allowed (rq_no_resend is false), rq_resend is set true, the request’s reply buffer is freed by sptlrpc_cli_free_repbuf(), rq_nob_received is assigned to rq_replen and 0 is returned.

A call to sptlrpc_cli_unwrap_reply() unwraps the reply (the message header and buffer lengths are byte-swapped as required and the reply is verified by the security layer).

If rq_resend is now true, the function returns 0.

Next, unpack_reply() is called and, if required, this will byte-swap the ptlrpc_body which all messages contain.

The next couple of lines update a procfs counter (PTLRPC_REQWAIT_CNTR).

If the type of the reply is not PTL_RPC_MSG_REPLY or PTL_RPC_MSG_ERR, -EPROTO is returned.

FIXME - couple of calls to adaptive timeout functions.

The reply’s error status (transmitted in pb_status) is obtained with a call to ptlrpc_check_status(). (As of 07/04/2010, that function appears to contain duplicated (dead) code.)

If the status indicates an error has occurred and that error is recoverable (-ENOTCONN or -ENODEV), the status value is returned; if possible, a reconnect will be initiated and the request marked for resending.

If the status was good (0), the request is passed to ldlm_cli_update_pool() to update the client’s obd_pool_slv and obd_pool_limit fields.

If the message is not being replayed, the transaction number is retrieved from the pb_transno field of the reply’s ptlrpc_body and stored into the request’s rq_transno field and also stored into the pb_transno field of the request’s ptlrpc_body.

If the request’s import is replayable (imp_replayable is true):

If the request has a transaction number (rq_transno is not 0) and that transaction number is greater than the transaction number of the last committed transaction, or the request is marked for replay (rq_replay is true), the request is added to the import’s imp_replay_list through a call to ptlrpc_retain_replayable_request(). Also, a call to ptlrpc_save_versions() transfers the version numbers from the reply’s pb_pre_versions field to the same field in the request if the request is not being replayed.

Otherwise, the request’s commit callback (rq_commit_cb) is invoked (no NULL check).

If the reply’s pb_last_committed value is non-zero it is assigned to the import’s imp_peer_committed_transno field. ptlrpc_free_committed() is called to scan the import’s imp_replay_list and prune those entries that have a transaction number less than or equal to imp_peer_committed_transno and do not have rq_replay set true. When an entry is pruned its rq_commit_cb is invoked (if non-NULL) and the request’s reference count is decremented.
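
A hedged sketch of that pruning loop follows; it assumes the replay list is kept sorted by transaction number and that rq_replay_list is the list linkage (an assumption made for illustration).

        struct ptlrpc_request *req, *tmp;

        list_for_each_entry_safe(req, tmp, &imp->imp_replay_list,
                                 rq_replay_list) {
                if (req->rq_transno > imp->imp_peer_committed_transno)
                        break;          /* the rest are uncommitted */
                if (req->rq_replay)
                        continue;       /* explicitly retained */
                if (req->rq_commit_cb != NULL)
                        req->rq_commit_cb(req);
                list_del_init(&req->rq_replay_list);
                ptlrpc_req_finished(req);  /* drop retained reference */
        }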

4.19. ptlrpc_expired_set()

Wait event callback function to time out uncompleted requests.

int ptlrpc_expired_set(void *data);

The argument, data, is a pointer to the request set.

The function loops over the set’s requests (set_requests) and expires each request that is in flight, is not waiting for a context, has not already timed out, and whose rq_deadline is not later than the current time. Each such request is expired with a call to ptlrpc_expire_one_request().

The function always returns 1.

4.20. ptlrpc_expire_one_request()

Expire (time out) a single request.

int ptlrpc_expire_one_request(struct ptlrpc_request *req, int async_unlink);

Arguments are:

req

the request to expire

async_unlink

if true, unlink the request’s reply and bulk buffers asynchronously.

The request’s rq_timedout field is set true.

ptlrpc_unregister_reply() and ptlrpc_unregister_bulk() are passed req and async_unlink to unlink the request’s reply and bulk buffers.

If the import is NULL, the function returns 1.

If the request’s rq_fake field is true, the function returns 1.

The number of timeouts on this import (imp_timeouts) is incremented.

If the import’s imp_dlm_fake field is true, the function returns 1.

If the request should fail due to the timeout, its rq_status field is set to -ETIMEDOUT and its rq_err field set true and the function returns 1.

If the request’s rq_no_resend field is true, the function will return 1, otherwise 0.

Before returning, ptlrpc_fail_import() is called. It is passed the import that is to be disconnected (due to the timeout) and the pb_conn_cnt field from the expired message (that is, the value of imp_conn_cnt that the import had at the time the message was sent). Those arguments are passed down to ptlrpc_set_import_discon() which does the disconnection as long as the passed-in connection count matches the current value of imp→imp_conn_cnt, i.e. the import hasn’t already been reconnected since the expired message was sent.

4.21. ptlrpc_interrupted_set()

Wait event callback function that marks all requests in a set as interrupted.

int ptlrpc_interrupted_set(void *data);

The argument, data, is a pointer to the request set.

Simply loops through the requests in the set (set_requests) and for each request whose rq_phase field is equal to RQ_PHASE_RPC or RQ_PHASE_UNREGISTERING, it calls ptlrpc_mark_interrupted() which sets rq_intr true.
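
A hedged sketch of that loop (rq_set_chain as the linkage of requests within a set is an assumption made for illustration):

        struct ptlrpc_request_set *set = data;
        struct ptlrpc_request *req;

        list_for_each_entry(req, &set->set_requests, rq_set_chain) {
                if (req->rq_phase != RQ_PHASE_RPC &&
                    req->rq_phase != RQ_PHASE_UNREGISTERING)
                        continue;
                ptlrpc_mark_interrupted(req);   /* sets rq_intr */
        }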

4.22. ptlrpc_send_new_req()

Sends an RPC request for the first time.

static int ptlrpc_send_new_req(struct ptlrpc_request *req)

The argument, req, is the request to send.

If the request has a scheduled time to send (rq_sent not zero) and that time is in the future, the function simply returns 0.

The request’s phase is moved to RQ_PHASE_RPC.

The request’s import is stored in local imp and the import’s generation (imp_generation) is stored in req→rq_import_generation.

ptlrpc_import_delay_req() is called to determine if the request should be delayed until the import becomes ready and, if so, the request’s rq_waiting field is set, the request is appended to the import’s imp_delayed_list, the import’s imp_inflight field is incremented and the function returns 0.

If ptlrpc_import_delay_req() returned an error code, it is assigned to rq_status, the request’s phase is moved to RQ_PHASE_INTERPRET and the function returns the error code.

The request is appended to the import’s imp_sending_list and the import’s imp_inflight field is incremented.

The message’s pb_status field is set to the current PID.

Now sptlrpc_req_refresh_ctx() is called to refresh the request’s client context. If that function returns an error code and req→rq_err is true, the error code is stored in req→rq_status and this function returns 1. If the function returned an error code and req→rq_err is false, req→rq_wait_ctx is set true to indicate that the request is waiting for a context and this function returns 0.

The request is sent by calling ptl_send_rpc() and if that function returns an error, the request’s rq_net_err field is set true and the error code returned.

The function returns 0.

4.23. ptl_send_rpc()

Sends an RPC request.

int ptl_send_rpc(struct ptlrpc_request *request, int noreply);

The arguments are:

request

the request to be sent

noreply

if true, don’t set up any reply buffers

If the import’s OBD has failed, the request’s rq_err field is set and rq_status is set to -ENODEV which is also now returned.

The import’s current connection is assigned to local connection.

The message’s pb_handle field is assigned the import’s imp_remote_handle (the export handle).

The message’s pb_type field is set to PTL_RPC_MSG_REQUEST.

The message’s pb_conn_cnt field is set to the import’s imp_conn_cnt.

The message header’s lm_flags field is set to imp_msghdr_flags.

If the message is being resent (rq_resend is true), MSG_RESENT is added to the message’s pb_flags field.

The request is passed to sptlrpc_cli_wrap_request() to be signed or sealed by the security layer.

If the request involves a bulk transfer (rq_bulk is not NULL), ptlrpc_register_bulk() is called to setup the required ME and MD.

If a reply is expected (noreply is false):

If the reply buffer has not yet been allocated (rq_repbuf is NULL), it is allocated with a call to sptlrpc_cli_alloc_repbuf(). If that call returns an error code, it is assigned to rq_status, rq_err is set and control jumps to cleanup code at the end of the function.

If the reply buffer was allocated, rq_repdata and rq_repmsg are set to NULL.

Next, LNetMEAttach() is called to create a new ME for the reply. The ME’s NID/PID is connection→c_peer and the match bits are request→rq_xid. If that fails, control jumps to the cleanup code but rq_status and rq_err are not assigned; should they be?

The request’s rq_receiving_reply field is set to !noreply.

Similarly, the request’s rq_must_unlink field is set to !noreply.

A bunch of flags in the request are cleared.

If a reply is expected (noreply is false):

Local reply_md is initialised for the reply (PUT to rq_repbuf) and passed to LNetMDAttach() to create the MD for the reply. If that fails, rq_receiving_reply is cleared and control jumps to the cleanup code.

A reference to the request is obtained on behalf of request_out_callback().

The procfs requests active counter is updated.

The current time of day is assigned to rq_arrival_time and rq_sent is set to the current time (in seconds).

The deadline for the reply’s arrival is calculated and assigned to rq_deadline.

The request’s import (rq_import) is informed that it is being used for a send.

The request is transmitted with a call to ptl_send_buf() and if that function returns success, this function returns success.

The request is passed to ptlrpc_req_finished() which just drops a reference to the request. If noreply is true, this function then returns, as no more cleanup actions are required.

The ME is cleaned up with a call to LNetMEUnlink() and then ptlrpc_unregister_bulk() is called.

The function returns the error code.
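
The reply-side registration performed by this function can be pictured with the following hedged sketch; the exact MD options and field names vary between Lustre versions, ptlrpc_eq_h is assumed to be the subsystem's event queue handle, and error handling is elided (failures jump to the cleanup code described above).

        lnet_md_t        reply_md;
        lnet_handle_me_t reply_me_h;

        /* the ME matches a PUT from the peer carrying this XID */
        rc = LNetMEAttach(request->rq_reply_portal, connection->c_peer,
                          request->rq_xid, 0 /* ignore bits */,
                          LNET_UNLINK, LNET_INS_AFTER, &reply_me_h);

        /* the MD says where the reply lands and which callback runs */
        reply_md.start     = request->rq_repbuf;
        reply_md.length    = request->rq_repbuf_len;
        reply_md.threshold = 1;
        reply_md.options   = LNET_MD_OP_PUT;
        reply_md.user_ptr  = &request->rq_reply_cbid;
        reply_md.eq_handle = ptlrpc_eq_h;

        rc = LNetMDAttach(reply_me_h, reply_md, LNET_UNLINK,
                          &request->rq_reply_md_h);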

4.24. ptl_send_buf()

static int ptl_send_buf (lnet_handle_md_t *mdh, void *base, int len,
                         lnet_ack_req_t ack, struct ptlrpc_cb_id *cbid,
                         struct ptlrpc_connection *conn, int portal, __u64 xid,
                         unsigned int offset);

FIXME

4.25. ptlrpc_register_bulk()

int ptlrpc_register_bulk(struct ptlrpc_request *req);

FIXME

4.26. ptlrpc_unregister_bulk()

int ptlrpc_unregister_bulk(struct ptlrpc_request *req, int async);

FIXME

5. Server side functions

FIXME

6. Portal RPC daemon functions

FIXME