1. Introduction
The Portal RPC subsystem is a reliable messaging
service layered on top of LNET. It caters for small messages and
also for bulk data transfers. A large proportion of Lustre’s error
recovery functionality is embodied in the Portal RPC subsystem.
Communication is initiated by the client in the form of a request
which is transmitted to the server to process - the server then
returns a reply to the client. If bulk data is to be transferred
to/from the server, that is handled separately from the request/reply
pair.
The communication end-points are represented by an "export" object on
the server side and an "import" object on the client side.
Requests and replies are communicated using an "on the wire"
format. At each end, the request or reply is packed/unpacked from/to
the host format. The formats are all fixed at compile time.
Portal RPC messages can be sent either singly or in
sets. Communication can be either synchronous (wait for the reply
before proceeding) or asynchronous (proceed without waiting for a
reply - the reply is processed by the Portal RPC daemon).
2. Principal data structures
When thinking about how messages are assembled and disassembled, the
most relevant types are:
- ptlrpc_request: top-level object that, essentially, contains everything
  known about the message
- req_capsule: a field (rq_pill) in the ptlrpc_request object that defines
  the message’s format (what fields it contains) and the lengths of any
  variable-length fields in the message
- req_format: describes the format of a message (what fields are contained
  in the request and the associated reply)
- req_msg_field: characterises an individual field in a message (how big
  the field is, what function to use to convert from host to on-the-wire
  format, etc.)
Each RPC request (and the associated reply) has a fixed number of
fields. Some fields are fixed size but others can be variable length
(strings or arrays of data). The functions that pack and unpack the
requests/replies to/from the on-the-wire format need to know for each
RPC request/reply type what fields it contains and how those fields
should be handled.
A req_format object is defined for each known RPC type and it
contains two arrays of req_msg_field objects (one array describes
the fields in the request sent by the client and the other array
describes the fields in the reply sent by the server).
Each req_msg_field object specifies, for a single RPC message field,
whether it is fixed or variable size, whether it is an array or string,
and what special functions are required to byte swap it to/from the
on-the-wire format or dump it out.
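The table-driven arrangement described above can be sketched as follows. This is a minimal model, not the Lustre definitions: every toy_* name is hypothetical, the field sizes are arbitrary, and only the idea of "one format object holding a client array and a server array of field descriptors" is taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for req_msg_field and req_format. */
enum toy_location { TOY_CLIENT, TOY_SERVER, TOY_NR };

struct toy_msg_field {
        const char *name;
        int         size;             /* fixed size in bytes, -1 if variable */
        void      (*swabber)(void *); /* byte-swap helper, may be NULL */
};

struct toy_format {
        const char *name;
        struct {
                int                          nr;
                const struct toy_msg_field **d;
        } fields[TOY_NR];
};

/* Look up a field by position; NULL when the index is out of range. */
static const struct toy_msg_field *
toy_format_field(const struct toy_format *fmt, enum toy_location loc, int idx)
{
        assert(fmt != NULL);
        if (idx < 0 || idx >= fmt->fields[loc].nr)
                return NULL;
        return fmt->fields[loc].d[idx];
}

/* Sum of the fixed field sizes for one direction; variable fields add 0. */
static int
toy_format_fixed_size(const struct toy_format *fmt, enum toy_location loc)
{
        int i, total = 0;

        assert(fmt != NULL);
        for (i = 0; i < fmt->fields[loc].nr; i++)
                if (fmt->fields[loc].d[i]->size > 0)
                        total += fmt->fields[loc].d[i]->size;
        return total;
}

/* Example format: three request fields (one variable), one reply field.
 * The sizes 88 and 104 are arbitrary illustration values. */
static const struct toy_msg_field toy_body = { "body", 88,  NULL };
static const struct toy_msg_field toy_dlm  = { "dlm",  104, NULL };
static const struct toy_msg_field toy_name = { "name", -1,  NULL };

static const struct toy_msg_field *toy_cancel_client[] = {
        &toy_body, &toy_dlm, &toy_name
};
static const struct toy_msg_field *toy_cancel_server[] = { &toy_body };

static const struct toy_format toy_cancel = {
        "TOY_CANCEL",
        { { 3, toy_cancel_client }, { 1, toy_cancel_server } }
};
```

Because the descriptors are shared by pointer, the same field object (here toy_body) can appear in both the request and the reply arrays of many formats.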
Messages are packed into an on-the-wire format
(lustre_msg_v2) that consists of a variable length
header that includes a buffer count (lm_bufcount) and an array of
buffer lengths (lm_buflens). Following the header are lm_bufcount
number of buffers that contain the actual message content and whatever
other data is required on the wire (e.g. crypto wrapper).
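To make the "header, then lm_bufcount buffers" layout concrete, here is a small sketch of how offsets could be computed for such a message. The struct toy_msg is a hypothetical, stripped-down header (the real lustre_msg_v2 carries more fields), and the 8-byte rounding of the header and of each buffer is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified header: a buffer count followed by the lengths. */
struct toy_msg {
        uint32_t bufcount;
        uint32_t buflens[];     /* bufcount entries; buffers follow */
};

/* ASSUMPTION: header and buffers are padded to a multiple of 8 bytes. */
static size_t toy_round8(size_t n)
{
        return (n + 7) & ~(size_t)7;
}

/* Size of the variable-length header for a given number of buffers. */
static size_t toy_hdr_size(uint32_t bufcount)
{
        return toy_round8(offsetof(struct toy_msg, buflens) +
                          bufcount * sizeof(uint32_t));
}

/* Total packed size: header plus every padded buffer. */
static size_t toy_msg_size(const struct toy_msg *m)
{
        size_t   size;
        uint32_t i;

        assert(m != NULL);
        size = toy_hdr_size(m->bufcount);
        for (i = 0; i < m->bufcount; i++)
                size += toy_round8(m->buflens[i]);
        return size;
}

/* Address of buffer n inside the message, or NULL if n is out of range. */
static void *toy_msg_buf(struct toy_msg *m, uint32_t n)
{
        size_t   off;
        uint32_t i;

        assert(m != NULL);
        if (n >= m->bufcount)
                return NULL;
        off = toy_hdr_size(m->bufcount);
        for (i = 0; i < n; i++)
                off += toy_round8(m->buflens[i]);
        return (char *)m + off;
}
```

Since the buffer lengths live in the header, a receiver can locate any buffer without knowing the message type; the type-specific format is only needed to interpret what each buffer contains.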
The ptlrpc_request field rq_reqmsg points at the packed request
data within the message buffer and the message buffer itself is
pointed to by rq_reqbuf.
This diagram shows these relationships for a typical request. In this
example (an "LDLM_CANCEL" request), the request contains two fields:
"ptlrpc_body" and "dlm_req".
       (ptlrpc_request)
req -->|--------------|
       |              |
       |     ...      |
       |              |            on-the-wire format
       |              |
       |--------------|             (lustre_msg_v2)
       | rq_reqbuf    |----------->|---------------|              ^
       |--------------|            | lm_bufcount   |              |
       | rq_reqbuf_len|            |---------------|              |
       |--------------|            |               |              |
       | rq_reqmsg    |-------*    |---------------|              |
       |--------------|       |    |     ...       |              |
       | rq_reqlen    |       |    |---------------|              |
       |--------------|       |    | lm_buflens[0] |              | rq_reqbuf_len
       |     ...      |       |    |---------------|              |
       |--------------|       |    |     ...       |              |
       | rq_pill      |       |    |---------------|              |
       |              |       |    | lm_buflens[n] |              |
       | |----------| |       |    |---------------|              |
       | | rc_fmt   |---*     |    | other buffers |              |
       | |----------| | |     |    | (crypto, etc.)|              |
       | | rc_area[]| | |     |    |---------------|              |
       | |----------| | |     *--->| msg field[0]  |  ^           |
       |--------------| |          |---------------|  |           |
                        |          |     ...       |  | rq_reqlen |
 *----------------------*          |---------------|  |           |
 |                                 | msg field[n]  |  v           v
 |                                 |---------------|
 |
 |
 |      (req_format)
 *-->|-----------------------|
     | rf_name               |---> e.g. "LDLM_CANCEL"
     |-----------------------|
     | rf_idx                | = value set by req_layout_init()
     |-----------------------|
     | rf_fields[RCL_CLIENT] |------>|--------------|  (array of req_msg_field)
     |-----------------------|       | rmf_flags    | = 0
     | rf_fields[RCL_SERVER] |--*    |--------------|
     |-----------------------|  |    | rmf_name     | = "ptlrpc_body"
                                |    |--------------|
                                |  0 | rmf_size     | = sizeof(struct ptlrpc_body)
 [reply req_msg_field array]<---*    |--------------|
                                     | rmf_swabber  | = lustre_swab_ptlrpc_body
                                     |--------------|
                                     | rmf_dumper   | = NULL
                                     |--------------|
                                     | rmf_offset   | = value set by req_layout_init()
                                  ---|--------------|
                                     | rmf_flags    | = RMF_F_NO_SIZE_CHECK
                                     |--------------|
                                     | rmf_name     | = "dlm_req"
                                     |--------------|
                                   1 | rmf_size     | = sizeof(struct ldlm_request)
                                     |--------------|
                                     | rmf_swabber  | = lustre_swab_ldlm_request
                                     |--------------|
                                     | rmf_dumper   | = NULL
                                     |--------------|
                                     | rmf_offset   | = value set by req_layout_init()
                                  ---|--------------|
2.1. ptlrpc_connection
Represents a connection to a Portal RPC peer.
/**
* Structure to define a single portal connection.
*/
struct ptlrpc_connection {
/** linkage for connections hash table */
struct hlist_node c_hash;
/** Our own lnet nid for this connection */
lnet_nid_t c_self;
/** Remote side nid for this connection */
lnet_process_id_t c_peer;
/** UUID of the other side */
struct obd_uuid c_remote_uuid;
/** reference counter for this connection */
atomic_t c_refcount;
};
2.2. ptlrpc_client
Represents a Portal RPC client - holds the portal number the client
will use to send requests and receive replies.
/** Client definition for PortalRPC */
struct ptlrpc_client {
/** What lnet portal does this client send messages to by default */
__u32 cli_request_portal;
/** What portal do we expect replies on */
__u32 cli_reply_portal;
/** Name of the client */
char *cli_name;
};
2.3. ptlrpc_request_set
Represents a group of Portal RPC requests that are to be sent
concurrently - the set completes when replies have been received for
all of the requests sent.
/**
* Definition of request set structure.
* A request set is a list of requests (not necessarily to the same target)
* that, once populated with RPCs, can be sent in parallel.
* There are two kinds of request sets: general purpose, and with a dedicated
* serving thread. An example of the latter is the ptlrpcd set.
* For general purpose sets, once the set has started sending it is impossible
* to add new requests to it.
* Provides a way to call "completion callbacks" when all requests in the set
* have returned.
*/
struct ptlrpc_request_set {
/** number of uncompleted requests */
int set_remaining;
/** wait queue to wait on for request events */
cfs_waitq_t set_waitq;
cfs_waitq_t *set_wakeup_ptr;
/** List of requests in the set */
struct list_head set_requests;
/**
* List of completion callbacks to be called when the set is completed
* This is only used if \a set_interpret is NULL.
* Links struct ptlrpc_set_cbdata.
*/
struct list_head set_cblist;
/** Completion callback, if only one. */
set_interpreter_func set_interpret;
/** Opaque argument passed to the \a set_interpret completion callback. */
void *set_arg;
/**
* Lock for \a set_new_requests manipulations
* locked so that any old caller can communicate requests to
* the set holder who can then fold them into the lock-free set
*/
spinlock_t set_new_req_lock;
/** List of new yet unsent requests. Only used with ptlrpcd now. */
struct list_head set_new_requests;
};
/**
* Description of a single ptlrpc_set callback
*/
struct ptlrpc_set_cbdata {
/** List linkage item */
struct list_head psc_item;
/** Pointer to interpreting function */
set_interpreter_func psc_interpret;
/** Opaque argument to pass to the callback */
void *psc_data;
};
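The "set completes when all replies have arrived" behaviour can be modelled with a counter and an optional callback, loosely mirroring set_remaining, set_interpret and set_arg. The sketch below is illustrative only: the toy_* names, the callback signature and the "first error wins" policy are assumptions of this example, and the locking and wait queue used by the real structure are omitted.

```c
#include <assert.h>
#include <stddef.h>

struct toy_set;

/* Hypothetical completion callback: receives the set, an opaque argument
 * and the status accumulated so far, and returns the final status. */
typedef int (*toy_interpreter)(struct toy_set *set, void *arg, int rc);

struct toy_set {
        int             remaining;   /* requests still awaiting a reply */
        int             rc;          /* first non-zero status seen */
        toy_interpreter interpret;   /* optional completion callback */
        void           *arg;         /* opaque argument for the callback */
};

static void toy_set_init(struct toy_set *set, toy_interpreter fn, void *arg)
{
        set->remaining = 0;
        set->rc = 0;
        set->interpret = fn;
        set->arg = arg;
}

/* Account for one more request added to the set. */
static void toy_set_add_req(struct toy_set *set)
{
        set->remaining++;
}

/* Record one reply; returns 1 if this reply completed the set, else 0.
 * The callback runs exactly once, when the last reply arrives. */
static int toy_set_reply(struct toy_set *set, int status)
{
        assert(set->remaining > 0);
        if (status != 0 && set->rc == 0)
                set->rc = status;
        if (--set->remaining > 0)
                return 0;
        if (set->interpret != NULL)
                set->rc = set->interpret(set, set->arg, set->rc);
        return 1;
}

/* Example callback: counts how many times it was invoked. */
static int toy_count_completion(struct toy_set *set, void *arg, int rc)
{
        (void)set;
        ++*(int *)arg;
        return rc;
}
```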
2.4. ptlrpc_reply_state
Represents a reply on the server.
/**
* Structure to define reply state on the server
*/
struct ptlrpc_reply_state {
/** Callback description */
struct ptlrpc_cb_id rs_cb_id;
/** Linkage for list of all reply states in a system */
struct list_head rs_list;
/** Linkage for list of all reply states on same export */
struct list_head rs_exp_list;
/** Linkage for list of all reply states for same obd */
struct list_head rs_obd_list;
#if RS_DEBUG
struct list_head rs_debug_list;
#endif
/** A spinlock to protect the reply state flags */
spinlock_t rs_lock;
/** Reply state flags */
unsigned long rs_difficult:1; /* ACK/commit stuff */
unsigned long rs_no_ack:1; /* no ACK, even for
difficult requests */
unsigned long rs_scheduled:1; /* being handled? */
unsigned long rs_scheduled_ever:1;/* any schedule attempts? */
unsigned long rs_handled:1; /* been handled yet? */
unsigned long rs_on_net:1; /* reply_out_callback pending? */
unsigned long rs_prealloc:1; /* rs from prealloc list */
unsigned long rs_committed:1;/* the transaction was committed
and the rs was dispatched
by ptlrpc_commit_replies */
/** Size of the state */
int rs_size;
/** opcode */
__u32 rs_opc;
/** Transaction number */
__u64 rs_transno;
/** xid */
__u64 rs_xid;
struct obd_export *rs_export;
struct ptlrpc_service *rs_service;
/** Lnet metadata handle for the reply */
lnet_handle_md_t rs_md_h;
atomic_t rs_refcount;
/** Context for the service thread */
struct ptlrpc_svc_ctx *rs_svc_ctx;
/** Reply buffer (actually sent to the client), encoded if needed */
struct lustre_msg *rs_repbuf; /* wrapper */
/** Size of the reply buffer */
int rs_repbuf_len; /* wrapper buf length */
/** Size of the reply message */
int rs_repdata_len; /* wrapper msg length */
/**
* Actual reply message. Its content is encrypted (if needed) to
* produce the reply buffer for actual sending. In the simple case
* of no network encryption we just set \a rs_repbuf to \a rs_msg.
*/
struct lustre_msg *rs_msg; /* reply message */
/** Number of locks awaiting client ACK */
int rs_nlocks;
/** Handles of locks awaiting client reply ACK */
struct lustre_handle rs_locks[RS_MAX_LOCKS];
/** Lock modes of locks in \a rs_locks */
ldlm_mode_t rs_modes[RS_MAX_LOCKS];
};
2.5. ptlrpc_request_pool
Holds a number of pre-allocated requests.
/**
* Definition of request pool structure.
* The pool is used to store empty preallocated requests for the case
* when we would actually need to send something without performing
* any allocations (to avoid e.g. OOM).
*/
struct ptlrpc_request_pool {
/** Locks the list */
spinlock_t prp_lock;
/** list of ptlrpc_request structs */
struct list_head prp_req_list;
/** Maximum message size that would fit into a request from this pool */
int prp_rq_size;
/** Function to allocate more requests for this pool */
void (*prp_populate)(struct ptlrpc_request_pool *, int);
};
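The point of such a pool is that taking a request from it never allocates memory. A minimal sketch of that idea follows; the toy_* names and the fixed-capacity array are inventions of this example, and the spinlock that protects the real list is left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POOL_MAX 4          /* hypothetical pool capacity */

struct toy_req {
        int in_use;
};

struct toy_pool {
        struct toy_req  slots[TOY_POOL_MAX];      /* preallocated storage */
        struct toy_req *free_list[TOY_POOL_MAX];  /* stack of free slots */
        int             nfree;
};

/* Populate the pool with 'count' preallocated requests. */
static void toy_pool_init(struct toy_pool *p, int count)
{
        int i;

        assert(count >= 0 && count <= TOY_POOL_MAX);
        p->nfree = 0;
        for (i = 0; i < count; i++) {
                p->slots[i].in_use = 0;
                p->free_list[p->nfree++] = &p->slots[i];
        }
}

/* Take a request without allocating; NULL when the pool is exhausted. */
static struct toy_req *toy_pool_get(struct toy_pool *p)
{
        struct toy_req *req;

        if (p->nfree == 0)
                return NULL;
        req = p->free_list[--p->nfree];
        req->in_use = 1;
        return req;
}

/* Return a request to the pool so that it can be reused. */
static void toy_pool_put(struct toy_pool *p, struct toy_req *req)
{
        assert(req != NULL && req->in_use);
        assert(p->nfree < TOY_POOL_MAX);
        req->in_use = 0;
        p->free_list[p->nfree++] = req;
}
```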
2.6. req_capsule
Describes the format of an RPC request/reply pair including the sizes
of variable-length fields.
enum req_location {
RCL_CLIENT,
RCL_SERVER,
RCL_NR
};
/* Maximal number of fields (buffers) in a request message. */
#define REQ_MAX_FIELD_NR 9
struct req_capsule {
struct ptlrpc_request *rc_req;
const struct req_format *rc_fmt;
enum req_location rc_loc;
__u32 rc_area[RCL_NR][REQ_MAX_FIELD_NR];
};
2.7. req_format
Describes the format of a single type of RPC request/reply pair.
struct req_format {
const char *rf_name;
int rf_idx;
struct {
int nr;
const struct req_msg_field **d;
} rf_fields[RCL_NR];
};
2.8. req_msg_field
Describes a single field in an RPC request/reply - fields can be
structured and contain multiple values.
struct req_msg_field {
__u32 rmf_flags;
const char *rmf_name;
/**
* Field length. (-1) means "variable length". If the
* \a RMF_F_STRUCT_ARRAY flag is set the field is also variable-length,
* but the actual size must be a whole multiple of \a rmf_size.
*/
int rmf_size;
void (*rmf_swabber)(void *);
void (*rmf_dumper)(void *);
int rmf_offset[ARRAY_SIZE(req_formats)][RCL_NR];
};
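The size rule stated in the rmf_size comment can be written as a small check. This is one plausible reading of that comment, not the validation Lustre performs: the flag value, the toy_* names and the strict equality test for fixed-size fields are all assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_F_STRUCT_ARRAY 0x1u  /* hypothetical flag value */

struct toy_field {
        unsigned flags;
        int      size;   /* element size, fixed size, or -1 for variable */
};

/* Returns 1 if 'len' bytes is an acceptable length for the field, else 0. */
static int toy_field_len_ok(const struct toy_field *f, int len)
{
        assert(f != NULL);
        if (len < 0)
                return 0;
        if (f->flags & TOY_F_STRUCT_ARRAY)
                /* array of structs: any whole number of elements */
                return f->size > 0 && len % f->size == 0;
        if (f->size == -1)
                /* plain variable-length field: any length is accepted */
                return 1;
        /* fixed-size field: ASSUMED here to require an exact match */
        return len == f->size;
}
```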
2.9. ptlrpc_body
All RPC requests contain an initial field, "ptlrpc_body", that
contains generic request data (opcode, version, flags, etc.)
/* without gss, ptlrpc_body is put at the first buffer. */
#define PTLRPC_NUM_VERSIONS 4
struct ptlrpc_body {
struct lustre_handle pb_handle;
__u32 pb_type;
__u32 pb_version;
__u32 pb_opc;
__u32 pb_status;
__u64 pb_last_xid;
__u64 pb_last_seen;
__u64 pb_last_committed;
__u64 pb_transno;
__u32 pb_flags;
__u32 pb_op_flags;
__u32 pb_conn_cnt;
__u32 pb_timeout; /* for req, the deadline, for rep, the service est */
__u32 pb_service_time; /* for rep, actual service time */
__u32 pb_limit;
__u64 pb_slv;
/* VBR: pre-versions */
__u64 pb_pre_versions[PTLRPC_NUM_VERSIONS];
/* padding for future needs */
__u64 pb_padding[4];
};
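Each field descriptor carries a swabber with the signature void (*)(void *), which converts a field in place when the peer has the opposite byte order. The sketch below shows what such a function might look like for a cut-down body; struct toy_body and the toy_* helpers are hypothetical and cover only three of the fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of a 32-bit value. */
static uint32_t toy_swab32(uint32_t v)
{
        return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
               ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Reverse the byte order of a 64-bit value using two 32-bit swaps. */
static uint64_t toy_swab64(uint64_t v)
{
        return ((uint64_t)toy_swab32((uint32_t)v) << 32) |
               toy_swab32((uint32_t)(v >> 32));
}

/* Hypothetical cut-down message body with mixed field widths. */
struct toy_body {
        uint32_t opc;
        uint32_t status;
        uint64_t transno;
};

/* Swabber in the same shape as rmf_swabber: swaps every field in place. */
static void toy_swab_body(void *p)
{
        struct toy_body *b = p;

        assert(b != NULL);
        b->opc     = toy_swab32(b->opc);
        b->status  = toy_swab32(b->status);
        b->transno = toy_swab64(b->transno);
}
```

Swapping is its own inverse, so applying the function twice restores the original values.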
2.10. ptlrpc_request
Represents a single RPC request.
/**
* Represents remote procedure call.
*
* This is a staple structure used by everybody wanting to send a request
* in Lustre.
*/
struct ptlrpc_request {
/* Request type: one of PTL_RPC_MSG_* */
int rq_type;
/**
* Linkage item through which this request is included into
* sending/delayed lists on client and into rqbd list on server
*/
struct list_head rq_list;
/**
* Server side list of incoming unserved requests sorted by arrival
* time. Traversed from time to time to find requests that are about
* to expire and send back "early replies" to clients, to let them
* know the server is alive and well, just too busy to service their
* requests in time
*/
struct list_head rq_timed_list;
/** server-side history, used for debugging purposes. */
struct list_head rq_history_list;
/** server-side per-export list */
struct list_head rq_exp_list;
/** server-side hp handlers */
struct ptlrpc_hpreq_ops *rq_ops;
/** history sequence # */
__u64 rq_history_seq;
/** the index of service's srv_at_array into which request is linked */
time_t rq_at_index;
/** Result of request processing */
int rq_status;
/** Lock to protect request flags and some other important bits, like
* rq_list
*/
spinlock_t rq_lock;
/** client-side flags are serialized by rq_lock */
unsigned long rq_intr:1, rq_replied:1, rq_err:1,
rq_timedout:1, rq_resend:1, rq_restart:1,
/**
* when ->rq_replay is set, request is kept by the client even
* after server commits corresponding transaction. This is
* used for operations that require sequence of multiple
* requests to be replayed. The only example currently is file
* open/close. When last request in such a sequence is
* committed, ->rq_replay is cleared on all requests in the
* sequence.
*/
rq_replay:1,
rq_no_resend:1, rq_waiting:1, rq_receiving_reply:1,
rq_no_delay:1, rq_net_err:1, rq_wait_ctx:1,
rq_early:1, rq_must_unlink:1,
rq_fake:1, /* this fake req */
/* server-side flags */
rq_packed_final:1, /* packed final reply */
rq_hp:1, /* high priority RPC */
rq_at_linked:1, /* link into service's srv_at_array */
rq_reply_truncate:1;
enum rq_phase rq_phase; /* one of RQ_PHASE_* */
enum rq_phase rq_next_phase; /* one of RQ_PHASE_* to be used next */
atomic_t rq_refcount; /* client-side refcount for SENT race,
server-side refcount for multiple replies */
/** initial thread servicing this request */
struct ptlrpc_thread *rq_svc_thread;
/** Portal to which this request would be sent */
int rq_request_portal; /* XXX FIXME bug 249 */
/** Portal where to wait for reply and where reply would be sent */
int rq_reply_portal; /* XXX FIXME bug 249 */
/** client-side # reply bytes actually received */
int rq_nob_received;
/** Request length */
int rq_reqlen;
/** Request message - what client sent */
struct lustre_msg *rq_reqmsg;
/** Reply length */
int rq_replen;
/** Reply message - server response */
struct lustre_msg *rq_repmsg;
/** Transaction number */
__u64 rq_transno;
/** xid */
__u64 rq_xid;
/**
* List item for the replay list. Requests that are not yet committed
* are linked there.
* Also see \a rq_replay comment above.
*/
struct list_head rq_replay_list;
/**
* security and encryption data
* @{ */
struct ptlrpc_cli_ctx *rq_cli_ctx; /* client's half ctx */
struct ptlrpc_svc_ctx *rq_svc_ctx; /* server's half ctx */
struct list_head rq_ctx_chain; /* link to waited ctx */
struct sptlrpc_flavor rq_flvr; /* client & server */
enum lustre_sec_part rq_sp_from;
unsigned long /* client/server security flags */
rq_ctx_init:1, /* context initiation */
rq_ctx_fini:1, /* context destroy */
rq_bulk_read:1, /* request bulk read */
rq_bulk_write:1, /* request bulk write */
/* server authentication flags */
rq_auth_gss:1, /* authenticated by gss */
rq_auth_remote:1, /* authed as remote user */
rq_auth_usr_root:1, /* authed as root */
rq_auth_usr_mdt:1, /* authed as mdt */
/* security tfm flags */
rq_pack_udesc:1,
rq_pack_bulk:1,
/* doesn't expect reply FIXME */
rq_no_reply:1,
rq_pill_init:1; /* pill initialized */
uid_t rq_auth_uid; /* authed uid */
uid_t rq_auth_mapped_uid; /* authed uid mapped to */
/* (server side), pointed directly into req buffer */
struct ptlrpc_user_desc *rq_user_desc;
/* @} */
/** early replies go to offset 0, regular replies go after that */
unsigned int rq_reply_off;
/* various buffer pointers */
struct lustre_msg *rq_reqbuf; /* req wrapper */
int rq_reqbuf_len; /* req wrapper buf len */
int rq_reqdata_len; /* req wrapper msg len */
char *rq_repbuf; /* rep buffer */
int rq_repbuf_len; /* rep buffer len */
struct lustre_msg *rq_repdata; /* rep wrapper msg */
int rq_repdata_len; /* rep wrapper msg len */
struct lustre_msg *rq_clrbuf; /* only in priv mode */
int rq_clrbuf_len; /* only in priv mode */
int rq_clrdata_len; /* only in priv mode */
/** Fields that help to see if request and reply were swabbed or not */
__u32 rq_req_swab_mask;
__u32 rq_rep_swab_mask;
/** What was import generation when this request was sent */
int rq_import_generation;
enum lustre_imp_state rq_send_state;
/** how many early replies (for stats) */
int rq_early_count;
/** client+server request */
lnet_handle_md_t rq_req_md_h;
struct ptlrpc_cb_id rq_req_cbid;
/** server-side fields: */
/** request arrival time */
struct timeval rq_arrival_time;
/** separated reply state */
struct ptlrpc_reply_state *rq_reply_state;
/** incoming request buffer */
struct ptlrpc_request_buffer_desc *rq_rqbd;
#ifdef CRAY_XT3
__u32 rq_uid; /* peer uid, used in MDS only */
#endif
/** client-only incoming reply: */
lnet_handle_md_t rq_reply_md_h;
cfs_waitq_t rq_reply_waitq;
struct ptlrpc_cb_id rq_reply_cbid;
/** our LNet NID */
lnet_nid_t rq_self;
/** Peer description */
lnet_process_id_t rq_peer;
/** Server-side, export on which request was received */
struct obd_export *rq_export;
/** Client side, import where request is being sent */
struct obd_import *rq_import;
/** Replay callback, called after request is replayed at recovery */
void (*rq_replay_cb)(struct ptlrpc_request *);
/**
* Commit callback, called when request is committed and about to be
* freed.
*/
void (*rq_commit_cb)(struct ptlrpc_request *);
/** Opaque data for replay and commit callbacks. */
void *rq_cb_data;
/** For bulk requests on client only: bulk descriptor */
struct ptlrpc_bulk_desc *rq_bulk;
/** client outgoing req: */
/**
* when request/reply sent (secs), or time when request should be sent
*/
time_t rq_sent;
/** when request must finish. volatile
* so that servers' early reply updates to the deadline aren't
* kept in per-cpu cache */
volatile time_t rq_deadline;
/** when req reply unlink must finish. */
time_t rq_reply_deadline;
/** when req bulk unlink must finish. */
time_t rq_bulk_deadline;
/**
* service time estimate (secs)
* If the request is not served by this time, it is marked as timed out.
*/
int rq_timeout;
/** Multi-rpc bits: */
/** Link item for request set lists */
struct list_head rq_set_chain;
/** Link back to the request set */
struct ptlrpc_request_set *rq_set;
/** Async completion handler, called when reply is received */
ptlrpc_interpterer_t rq_interpret_reply;
/** Async completion context */
union ptlrpc_async_args rq_async_args;
/** Pool if request from preallocated list */
struct ptlrpc_request_pool *rq_pool;
struct lu_context rq_session;
struct lu_context rq_recov_session;
/** request format */
struct req_capsule rq_pill;
};
2.11. ptlrpc_bulk_page
Represents a single page of bulk data.
/**
* Structure that defines a single page of a bulk transfer
*/
struct ptlrpc_bulk_page {
/** Linkage to list of pages in a bulk */
struct list_head bp_link;
/**
* Number of bytes in a page to transfer starting from \a bp_pageoffset
*/
int bp_buflen;
/** offset within a page */
int bp_pageoffset;
/** The page itself */
struct page *bp_page;
};
2.12. ptlrpc_bulk_desc
Represents the bulk part of an RPC request.
/**
* Definition of bulk descriptor.
* Bulks are special "two phase" RPCs where the initial request message
* is sent first and is followed by a transfer (or receiving) of a large
* amount of data to be settled into pages referenced from the bulk descriptors.
* Bulk transfers (the actual data following the small requests) are done
* on separate LNet portals.
* In Lustre we use bulk transfers for READ and WRITE transfers from/to OSTs.
* Another user is readpage for MDT.
*/
struct ptlrpc_bulk_desc {
/** completed successfully */
unsigned long bd_success:1;
/** accessible to the network (network io potentially in progress) */
unsigned long bd_network_rw:1;
/** {put,get}{source,sink} */
unsigned long bd_type:2;
/** client side */
unsigned long bd_registered:1;
/** For serialization with callback */
spinlock_t bd_lock;
/** Import generation when request for this bulk was sent */
int bd_import_generation;
/** Server side - export this bulk created for */
struct obd_export *bd_export;
/** Client side - import this bulk was sent on */
struct obd_import *bd_import;
/** LNet portal for this bulk */
__u32 bd_portal;
/** Back pointer to the request */
struct ptlrpc_request *bd_req;
cfs_waitq_t bd_waitq; /* server side only WQ */
int bd_iov_count; /* # entries in bd_iov */
int bd_max_iov; /* allocated size of bd_iov */
int bd_nob; /* # bytes covered */
int bd_nob_transferred; /* # bytes GOT/PUT */
__u64 bd_last_xid;
struct ptlrpc_cb_id bd_cbid; /* network callback info */
lnet_handle_md_t bd_md_h; /* associated MD */
lnet_nid_t bd_sender; /* stash event::sender */
#if defined(__KERNEL__)
/*
* encrypt iov, size is either 0 or bd_iov_count.
*/
lnet_kiov_t *bd_enc_iov;
lnet_kiov_t bd_iov[0];
#else
lnet_md_iovec_t bd_iov[0];
#endif
};
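The relationship between the per-page entries and the descriptor's counters (bd_iov_count, bd_max_iov, bd_nob) can be illustrated with a toy descriptor that accumulates the byte total as pages are added. Everything named toy_* is hypothetical, the 4096-byte page size and the fixed array capacity are assumptions, and the network, locking and encryption aspects are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096      /* assumed page size */
#define TOY_MAX_IOV   8         /* hypothetical descriptor capacity */

struct toy_bulk_page {
        int pageoffset;         /* offset within the page */
        int buflen;             /* bytes to transfer from that offset */
};

struct toy_bulk_desc {
        int                  iov_count;  /* entries used in iov */
        int                  max_iov;    /* capacity of iov */
        int                  nob;        /* total bytes covered */
        struct toy_bulk_page iov[TOY_MAX_IOV];
};

static void toy_bulk_init(struct toy_bulk_desc *d)
{
        d->iov_count = 0;
        d->max_iov = TOY_MAX_IOV;
        d->nob = 0;
}

/* Append one page fragment; returns 0 on success, -1 if the fragment does
 * not fit inside a page or the descriptor is already full. */
static int toy_bulk_add_page(struct toy_bulk_desc *d, int pageoffset, int len)
{
        assert(d != NULL);
        if (pageoffset < 0 || len <= 0 || pageoffset + len > TOY_PAGE_SIZE)
                return -1;
        if (d->iov_count >= d->max_iov)
                return -1;
        d->iov[d->iov_count].pageoffset = pageoffset;
        d->iov[d->iov_count].buflen = len;
        d->iov_count++;
        d->nob += len;
        return 0;
}
```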
2.13. ptlrpc_thread
Represents a thread that’s serving Portal RPC requests.
/**
* Definition of server service thread structure
*/
struct ptlrpc_thread {
/**
* List of active threads in svc->srv_threads
*/
struct list_head t_link;
/**
* thread-private data (preallocated memory)
*/
void *t_data;
__u32 t_flags;
/**
* service thread index, from ptlrpc_start_threads
*/
unsigned int t_id;
/**
* service thread pid
*/
pid_t t_pid;
/**
* put watchdog in the structure per thread b=14840
*/
struct lc_watchdog *t_watchdog;
/**
* the svc this thread belonged to b=18582
*/
struct ptlrpc_service *t_svc;
cfs_waitq_t t_ctl_waitq;
struct lu_env *t_env;
};
2.14. ptlrpc_request_buffer_desc
Server-side object that represents a single request buffer which will
hold an incoming request.
/**
* Request buffer descriptor structure.
* This is a structure that contains one posted request buffer for a service.
* Once data lands in a buffer, the event callback creates the actual request
* and wakes one of the service threads to process the new incoming request.
* More than one request can fit into the buffer.
*/
struct ptlrpc_request_buffer_desc {
/** Link item for rqbds on a service */
struct list_head rqbd_list;
/** History of requests for this buffer */
struct list_head rqbd_reqs;
/** Back pointer to service for which this buffer is registered */
struct ptlrpc_service *rqbd_service;
/** LNet descriptor */
lnet_handle_md_t rqbd_md_h;
int rqbd_refcount;
/** The buffer itself */
char *rqbd_buffer;
struct ptlrpc_cb_id rqbd_cbid;
/**
* This "embedded" request structure is only used for the
* last request to fit into the buffer
*/
struct ptlrpc_request rqbd_req;
};
2.15. ptlrpc_service
Represents the "service" provided by a particular RPC portal.
/**
* Definition of PortalRPC service.
* The service listens on a particular portal (like a TCP port)
* and performs actions for a specific server, like the IO service for OST
* or the general metadata service for MDS.
*/
struct ptlrpc_service {
struct list_head srv_list; /* chain thru all services */
int srv_max_req_size; /* biggest request to receive */
int srv_max_reply_size; /* biggest reply to send */
int srv_buf_size; /* size of individual buffers */
int srv_nbuf_per_group; /* # buffers to allocate in 1 group */
int srv_nbufs; /* total # req buffer descs allocated */
int srv_threads_min; /* threads to start at SOW */
int srv_threads_max; /* thread upper limit */
int srv_threads_started; /* index of last started thread */
int srv_threads_running; /* # running threads */
atomic_t srv_n_difficult_replies; /* # 'difficult' replies */
int srv_n_active_reqs; /* # reqs being served */
int srv_n_hpreq; /* # HPreqs being served */
cfs_duration_t srv_rqbd_timeout; /* timeout before re-posting reqs, in tick */
int srv_watchdog_factor; /* soft watchdog timeout multiplier */
unsigned srv_cpu_affinity:1; /* bind threads to CPUs */
unsigned srv_at_check:1; /* check early replies */
unsigned srv_is_stopping:1; /* under unregister_service */
cfs_time_t srv_at_checktime; /* debug */
/** Local portal on which to receive requests */
__u32 srv_req_portal;
/** Portal on the client to send replies to */
__u32 srv_rep_portal;
/** AT stuff */
/** @{ */
struct adaptive_timeout srv_at_estimate;/* estimated rpc service time */
spinlock_t srv_at_lock;
struct ptlrpc_at_array srv_at_array; /* reqs waiting for replies */
cfs_timer_t srv_at_timer; /* early reply timer */
/** @} */
int srv_n_queued_reqs; /* # reqs in either of the queues below */
int srv_hpreq_count; /* # hp requests handled */
int srv_hpreq_ratio; /* # hp per lp reqs to handle */
struct list_head srv_req_in_queue; /* incoming reqs */
struct list_head srv_request_queue; /* reqs waiting for service */
struct list_head srv_request_hpq; /* high priority queue */
struct list_head srv_request_history; /* request history */
__u64 srv_request_seq; /* next request sequence # */
__u64 srv_request_max_cull_seq; /* highest seq culled from history */
svcreq_printfn_t srv_request_history_print_fn; /* service-specific print fn */
struct list_head srv_idle_rqbds; /* request buffers to be reposted */
struct list_head srv_active_rqbds; /* req buffers receiving */
struct list_head srv_history_rqbds; /* request buffer history */
int srv_nrqbd_receiving; /* # posted request buffers */
int srv_n_history_rqbds; /* # request buffers in history */
int srv_max_history_rqbds;/* max # request buffers in history */
atomic_t srv_outstanding_replies;
struct list_head srv_active_replies; /* all the active replies */
#ifndef __KERNEL__
struct list_head srv_reply_queue; /* replies waiting for service */
#endif
cfs_waitq_t srv_waitq; /* all threads sleep on this. This
* wait-queue is signalled when new
* incoming request arrives and when
* difficult reply has to be handled. */
struct list_head srv_threads; /* service thread list */
/** Handler function for incoming requests for this service */
svc_handler_t srv_handler;
svc_hpreq_handler_t srv_hpreq_handler; /* hp request handler */
char *srv_name; /* only statically allocated strings here; we don't clean them */
char *srv_thread_name; /* only statically allocated strings here; we don't clean them */
spinlock_t srv_lock;
/** Root of /proc dir tree for this service */
cfs_proc_dir_entry_t *srv_procroot;
/** Pointer to statistic data for this service */
struct lprocfs_stats *srv_stats;
/** List of free reply_states */
struct list_head srv_free_rs_list;
/** waitq to run, when adding stuff to srv_free_rs_list */
cfs_waitq_t srv_free_rs_waitq;
/**
* Tags for lu_context associated with this thread, see struct
* lu_context.
*/
__u32 srv_ctx_tags;
/**
* if non-NULL called during thread creation (ptlrpc_start_thread())
* to initialize service specific per-thread state.
*/
int (*srv_init)(struct ptlrpc_thread *thread);
/**
* if non-NULL called during thread shutdown (ptlrpc_main()) to
* destruct state created by ->srv_init().
*/
void (*srv_done)(struct ptlrpc_thread *thread);
//struct ptlrpc_srv_ni srv_interfaces[0];
};
2.16. ptlrpcd_ctl
Control data for ptlrpcd.
/**
* Declaration of ptlrpcd control structure
*/
struct ptlrpcd_ctl {
/**
* Ptlrpc thread control flags (LIOD_START, LIOD_STOP, LIOD_FORCE)
*/
unsigned long pc_flags;
/**
* Thread lock protecting structure fields.
*/
spinlock_t pc_lock;
/**
* Start completion.
*/
struct completion pc_starting;
/**
* Stop completion.
*/
struct completion pc_finishing;
/**
* Thread requests set.
*/
struct ptlrpc_request_set *pc_set;
/**
* Thread name used in cfs_daemonize()
*/
char pc_name[16];
/**
* Environment for request interpreters to run in.
*/
struct lu_env pc_env;
#ifndef __KERNEL__
/**
* Async rpcs flag to make sure that ptlrpcd_check() is called only
* once.
*/
int pc_recurred;
/**
* Currently not used.
*/
void *pc_callback;
/**
* User-space async rpcs callback.
*/
void *pc_wait_callback;
/**
* User-space check idle rpcs callback.
*/
void *pc_idle_callback;
#endif
};
2.17. obd_export
Represents an RPC server (target) - used on both the client and the server.
/**
* Export structure. Represents target-side of connection in portals.
* Also used in Lustre to connect between layers on the same node when
* there is no network-connection in-between.
* For every connected client there is an export structure on the server
* attached to the same obd device.
*/
struct obd_export {
/**
* Export handle; its id is provided to the client on connect.
* Subsequent client RPCs contain this handle id to identify
* what export they are talking to.
*/
struct portals_handle exp_handle;
atomic_t exp_refcount;
/**
* Set of counters below is to track where export references are
* kept. The exp_rpc_count is used for reconnect handling also,
* the cb_count and locks_count are for debug purposes only for now.
* The sum of them should be less than exp_refcount by 3
*/
atomic_t exp_rpc_count; /** RPC references */
atomic_t exp_cb_count; /** Commit callback references */
atomic_t exp_locks_count; /** Lock references */
#if LUSTRE_TRACKS_LOCK_EXP_REFS
struct list_head exp_locks_list;
spinlock_t exp_locks_list_guard;
#endif
/** Number of queued replay requests to be processed */
atomic_t exp_replay_count;
/** UUID of client connected to this export */
struct obd_uuid exp_client_uuid;
/** To link all exports on an obd device */
struct list_head exp_obd_chain;
struct hlist_node exp_uuid_hash; /** uuid-export hash*/
struct hlist_node exp_nid_hash; /** nid-export hash */
/**
* All exports eligible for ping evictor are linked into a list
* through this field in "most time since last request on this export"
* order
* protected by obd_dev_lock
*/
struct list_head exp_obd_chain_timed;
/** Obd device of this export */
struct obd_device *exp_obd;
/** "reverse" import to send requests (e.g. from ldlm) back to client */
struct obd_import *exp_imp_reverse;
struct nid_stat *exp_nid_stats;
struct lprocfs_stats *exp_md_stats;
/** Active connection */
struct ptlrpc_connection *exp_connection;
/** Connection count value from last successful reconnect rpc */
__u32 exp_conn_cnt;
/** Hash list of all ldlm locks granted on this export */
lustre_hash_t *exp_lock_hash;
/** lock to protect exp_lock_hash accesses */
spinlock_t exp_lock_hash_lock;
struct list_head exp_outstanding_replies;
struct list_head exp_uncommitted_replies;
spinlock_t exp_uncommitted_replies_lock;
/** Last committed transno for this export */
__u64 exp_last_committed;
/** When was last request received */
cfs_time_t exp_last_request_time;
/** On replay all requests waiting for replay are linked here */
struct list_head exp_req_replay_queue;
/** protects exp_flags and exp_outstanding_replies */
spinlock_t exp_lock;
/** Compatibility flags for this export */
__u64 exp_connect_flags;
enum obd_option exp_flags;
unsigned long exp_failed:1,
exp_in_recovery:1,
exp_disconnected:1,
exp_connecting:1,
/** VBR: export missed recovery */
exp_delayed:1,
/** VBR: failed version checking */
exp_vbr_failed:1,
exp_req_replay_needed:1,
exp_lock_replay_needed:1,
exp_need_sync:1,
exp_flvr_changed:1,
exp_flvr_adapt:1,
exp_libclient:1, /* liblustre client? */
/* client timed out and tried to reconnect,
* but couldn't because of active rpcs */
exp_abort_active_req:1;
struct list_head exp_queued_rpc; /* RPC to be handled */
/* also protected by exp_lock */
enum lustre_sec_part exp_sp_peer;
struct sptlrpc_flavor exp_flvr; /* current */
struct sptlrpc_flavor exp_flvr_old[2]; /* about-to-expire */
cfs_time_t exp_flvr_expire[2]; /* seconds */
/** Target specific data */
union {
struct lu_export_data eu_target_data;
struct mdt_export_data eu_mdt_data;
struct filter_export_data eu_filter_data;
struct ec_export_data eu_ec_data;
} u;
};
#define exp_target_data u.eu_target_data
#define exp_mdt_data u.eu_mdt_data
#define exp_filter_data u.eu_filter_data
#define exp_ec_data u.eu_ec_data
2.18. obd_import
Represents an RPC client - used on both the client and server.
/**
* Definition of PortalRPC import structure.
* Imports represent the client-side view of a remote target.
*/
struct obd_import {
/** Local handle (== id) for this import. */
struct portals_handle imp_handle;
/** Reference handle */
atomic_t imp_refcount;
struct lustre_handle imp_dlm_handle; /* client's ldlm export */
/** Currently active connection */
struct ptlrpc_connection *imp_connection;
/** PortalRPC client structure for this import */
struct ptlrpc_client *imp_client;
/** List element for linking into pinger chain */
struct list_head imp_pinger_chain;
/** List element for linking into chain for destruction */
struct list_head imp_zombie_chain;
/**
* Lists of requests that are retained for replay, waiting for a reply,
* or waiting for recovery to complete, respectively.
* @{
*/
struct list_head imp_replay_list;
struct list_head imp_sending_list;
struct list_head imp_delayed_list;
/** @} */
/** obd device for this import */
struct obd_device *imp_obd;
/**
* some security-related fields
* @{
*/
struct ptlrpc_sec *imp_sec;
struct semaphore imp_sec_mutex;
cfs_time_t imp_sec_expire;
/** @} */
/** Wait queue for those who need to wait for recovery completion */
cfs_waitq_t imp_recovery_waitq;
/** Number of requests currently in-flight */
atomic_t imp_inflight;
/** Number of requests currently unregistering */
atomic_t imp_unregistering;
/** Number of replay requests inflight */
atomic_t imp_replay_inflight;
/** Number of currently happening import invalidations */
atomic_t imp_inval_count;
/** Number of request timeouts */
atomic_t imp_timeouts;
/** Current import state */
enum lustre_imp_state imp_state;
/** History of import states */
struct import_state_hist imp_state_hist[IMP_STATE_HIST_LEN];
int imp_state_hist_idx;
/** Current import generation. Incremented on every reconnect */
int imp_generation;
/** Incremented every time we send reconnection request */
__u32 imp_conn_cnt;
/**
* \see ptlrpc_free_committed remembers imp_generation value here
* after a check to save on unnecessary replay list iterations
*/
int imp_last_generation_checked;
/** Last transno we replayed */
__u64 imp_last_replay_transno;
/** Last transno committed on remote side */
__u64 imp_peer_committed_transno;
/**
* \see ptlrpc_free_committed remembers last_transno since its last
* check here and if last_transno did not change since last run of
* ptlrpc_free_committed and import generation is the same, we can
* skip looking for requests to remove from replay list as optimisation
*/
__u64 imp_last_transno_checked;
/**
* Remote export handle. This is how remote side knows what export
* we are talking to. Filled from response to connect request
*/
struct lustre_handle imp_remote_handle;
/** When to perform next ping. time in jiffies. */
cfs_time_t imp_next_ping;
/** When we last successfully connected. time in 64bit jiffies */
__u64 imp_last_success_conn;
/** List of all possible connection for import. */
struct list_head imp_conn_list;
/**
* Current connection. \a imp_connection is imp_conn_current->oic_conn
*/
struct obd_import_conn *imp_conn_current;
/** Protects flags, level, generation, conn_cnt, *_list */
spinlock_t imp_lock;
/* flags */
unsigned long imp_no_timeout:1, /* timeouts are disabled */
imp_invalid:1, /* evicted */
imp_deactive:1, /* administratively disabled */
imp_replayable:1, /* try to recover the import */
imp_dlm_fake:1, /* don't run recovery (timeout instead) */
imp_server_timeout:1, /* use 1/2 timeout on MDS' OSCs */
imp_initial_recov:1, /* retry the initial connection */
imp_initial_recov_bk:1, /* turn off init_recov after trying all failover nids */
imp_delayed_recovery:1, /* VBR: imp in delayed recovery */
imp_no_lock_replay:1, /* VBR: if gap was found then no lock replays */
imp_vbr_failed:1, /* recovery by versions was failed */
imp_force_verify:1, /* force an immediate ping */
imp_pingable:1, /* pingable */
imp_resend_replay:1, /* resend for replay */
imp_recon_bk:1, /* turn off reconnect if all failovers fail */
imp_last_recon:1, /* internally used by above */
imp_force_reconnect:1; /* import must be reconnected instead of choosing a new connection */
__u32 imp_connect_op;
struct obd_connect_data imp_connect_data;
__u64 imp_connect_flags_orig;
int imp_connect_error;
__u32 imp_msg_magic;
__u32 imp_msghdr_flags; /* adjusted based on server capability */
struct ptlrpc_request_pool *imp_rq_pool; /* emergency request pool */
struct imp_at imp_at; /* adaptive timeout data */
time_t imp_last_reply_time; /* for health check */
};
3. Initialization and cleanup functions
3.1. ptlrpc_init()
When the Portal RPC kernel module is loaded, this function is called
to initialize the subsystem. It is also called from
lllib_init() in liblustre.
__init int ptlrpc_init(void);
Most of the initialization is carried out by sub-functions.
The function starts by calling
lustre_assert_wire_constants() to check that the
numerous constants and data types used in the on-the-wire protocol
have the expected values/size/offsets.
After initializing a few spin locks and mutexes,
ptlrpc_init_xid() is called to initialize the node’s
XID (ptlrpc_last_xid) to a value based on the current time.
This is done so that the XID sequence generated after a reboot cannot
contain a previously used value.
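One way to obtain that property is to place the boot-time clock in the high bits of a 64-bit counter, so that any XID issued after a reboot is larger than any issued before it. The sketch below only illustrates the idea: the names sketch_init_xid() and sketch_next_xid(), the 20-bit shift, and the assumption that fewer than 2^20 XIDs are issued per second are all hypothetical and are not taken from the PTLRPC source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shift: leaves room for 2^20 XIDs per second of uptime. */
#define SKETCH_XID_SHIFT 20

/* Seed the XID counter from the wall-clock time in seconds. */
uint64_t sketch_init_xid(uint64_t now_sec)
{
        return now_sec << SKETCH_XID_SHIFT;
}

/* Hand out the next XID; the counter only ever increases. */
uint64_t sketch_next_xid(uint64_t *last_xid)
{
        assert(last_xid != NULL);
        return ++*last_xid;
}
```

As long as the node issues XIDs more slowly than the assumed rate, a seed taken one second later is always ahead of every XID handed out so far.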
Next, req_layout_init() is called to set up the array
of request formats, req_formats. Each element in the array (of type
req_format) defines the on-the-wire format used by the client and
the server for a given RPC operation. Each request or reply can
contain a number of fields (field type/order is fixed at compile
time). The request format for a particular message type specifies what
fields are present and, for each field, whether the field is fixed or
variable length and what swabber and dumper functions are to be used
to byte-swap and dump that field’s value.
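The relationship between a format and its fields can be pictured as a table of descriptors, each carrying a size (or a marker for "variable length") and an optional swabber. The following is a minimal sketch under that assumption; struct sketch_msg_field, sketch_format and the field names are hypothetical and much simpler than the real req_format and req_msg_field types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-field descriptor; the real req_msg_field differs. */
struct sketch_msg_field {
        const char *name;
        int         size;               /* fixed size in bytes, -1 if variable */
        void      (*swabber)(void *);   /* byte-swaps in place, may be NULL */
};

/* Reverse the byte order of a 32-bit value in place. */
void sketch_swab32(void *p)
{
        uint32_t v = *(uint32_t *)p;

        *(uint32_t *)p = (v >> 24) | ((v >> 8) & 0x0000ff00u) |
                         ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Hypothetical format: one fixed-size field, then a variable-length name. */
const struct sketch_msg_field sketch_format[] = {
        { "opdata", sizeof(uint32_t), sketch_swab32 },
        { "name",   -1,               NULL },
};

/* Apply a field's swabber, if it has one, to a received buffer. */
void sketch_unpack_field(const struct sketch_msg_field *f, void *buf)
{
        assert(f != NULL && buf != NULL);
        if (f->swabber != NULL)
                f->swabber(buf);
}
```

Keeping the swabber in the descriptor means the generic unpacking code never needs to know the type of the field it is handling.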
Next, ptlrpc_hr_init() is called to start the reply
handling threads. One thread per online CPU is started. Each thread
executes the ptlrpc_hr_main() function. Until the
reply handler is terminated, that function waits for replies to be
added to its queue and then dequeues them and passes each reply to
ptlrpc_handle_rs().
Next, ptlrpc_init_portals() is called:
-
-
This function starts by calling ptlrpc_ni_init() which:
-
calls LNetNIInit() to initialise
the LNET interface
-
calls LNetEQAlloc() to allocate
the global PTLRPC event queue, ptlrpc_eq_h (for kernel PTLRPC, the
event queue’s callback function is set to
ptlrpc_master_callback())
-
-
For non-kernel PTLRPC, liblustre_check_services() is registered
as a liblustre wait callback function and then
ptlrpcd_addref() is called to obtain a reference to
the ptlrpc daemon (ptlrpcd). The first call to that function will
start the daemon.
Back in ptlrpc_init(), ptlrpc_connection_init() is
called to allocate and initialize the connection hash table
(conn_hash).
A call to ptlrpc_start_pinger() starts the pinger
thread and it executes ptlrpc_pinger_main().
Next, ldlm_init() is called to carry out low-level
LDLM initialization (storage allocation).
Next, sptlrpc_init() is called to initialize the
PTLRPC security system.
Finally, llog_recov_init() is called to allocate
llcd_cache.
If any of the above functions indicate failure, control jumps to
cleanup code that undoes the work of those initialization functions
that have already successfully completed.
The function returns 0 on success.
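The unwinding described above is the usual C idiom of cleanup labels arranged in reverse order of initialization. The sketch below shows the pattern with three hypothetical steps; sketch_step_init(), sketch_step_fini() and the failure-injection variable are stand-ins for illustration and do not correspond to the real sub-initializers.

```c
#include <assert.h>

/* Hypothetical stand-ins for the sub-initializers. */
int sketch_state;               /* bit n set => step n is initialized */
int sketch_fail_step = -1;      /* step that should fail, or -1 for none */

int sketch_step_init(int n)
{
        assert(n >= 0 && n < 3);
        if (n == sketch_fail_step)
                return -1;
        sketch_state |= 1 << n;
        return 0;
}

void sketch_step_fini(int n)
{
        sketch_state &= ~(1 << n);
}

/* Run three steps in order; on failure undo the completed ones in reverse. */
int sketch_init(void)
{
        int rc;

        rc = sketch_step_init(0);
        if (rc != 0)
                return rc;
        rc = sketch_step_init(1);
        if (rc != 0)
                goto cleanup_step0;
        rc = sketch_step_init(2);
        if (rc != 0)
                goto cleanup_step1;
        return 0;

cleanup_step1:
        sketch_step_fini(1);
cleanup_step0:
        sketch_step_fini(0);
        return rc;
}
```

Because the labels fall through, a failure at any step undoes exactly the steps that completed before it.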
3.2. ptlrpc_exit()
When the Portal RPC kernel module is unloaded, this function is called
to tear down the subsystem.
static void __exit ptlrpc_exit(void);
4. Functions for message manipulation
4.1. ptlrpc_request_alloc()
Allocate a new request structure and set up its buffers appropriately
for the required message format.
struct ptlrpc_request *ptlrpc_request_alloc(struct obd_import *imp,
const struct req_format *format);
-
imp
-
the obd_import through which the message will be sent
-
format
-
the req_format object that defines the message’s format
4.2. ptlrpc_request_alloc_pool()
Allocate a new request structure (from a pool if possible) and set up
its buffers appropriately for the required message format.
struct ptlrpc_request *ptlrpc_request_alloc_pool(struct obd_import *imp,
struct ptlrpc_request_pool * pool,
const struct req_format *format)
-
imp
-
the obd_import through which the message will be sent
-
pool
-
the request pool (ptlrpc_request_pool) from which to take the request
-
format
-
the req_format object that defines the message’s format
4.3. ptlrpc_request_alloc_internal()
Allocate a new request structure (from a pool if possible) and set up
its buffers appropriately for the required message format.
static struct ptlrpc_request *
ptlrpc_request_alloc_internal(struct obd_import *imp,
struct ptlrpc_request_pool * pool,
const struct req_format *format)
-
imp
-
the obd_import through which the message will be sent
-
pool
-
the request pool (ptlrpc_request_pool) from which to take the request
-
format
-
the req_format object that defines the message’s format
The message allocation is delegated to
__ptlrpc_request_alloc() which first tries to allocate
the message from the pool (if pool is non-NULL) and if the pool was
empty it just allocates the memory itself. If the message is
allocated, a reference to the import is obtained and the import is
assigned to the message’s rq_import field.
The message’s request capsule, rq_pill, is initialized by a call
to req_capsule_init() which sets up a couple of fields
and fills the array of field sizes (rc_area) with -1’s.
A call to req_capsule_set() stores format into the
message’s rc_format field.
4.4. ptlrpc_request_free()
Free a request (to a pool if the request is associated with one).
void ptlrpc_request_free(struct ptlrpc_request *request);
Argument is the request to free. If the request’s rq_pool field is
non-NULL, the request is passed to
__ptlrpc_free_req_to_pool() to put it back in the
pool. Otherwise, the request’s memory is freed.
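The pool-or-heap policy used when allocating and freeing requests can be sketched as follows. This is a simplified model: struct sketch_pool, struct sketch_pool_req and the singly linked free list are hypothetical, and locking, buffer sizing and the import reference are left out.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_pool_req;

/* Hypothetical pool: a singly linked list of free requests. */
struct sketch_pool {
        struct sketch_pool_req *free_list;
};

struct sketch_pool_req {
        struct sketch_pool     *pool;   /* owning pool, NULL if from the heap */
        struct sketch_pool_req *next;   /* link used while on the free list */
};

/* Take a request from the pool if one is available, else from the heap. */
struct sketch_pool_req *sketch_req_alloc(struct sketch_pool *pool)
{
        struct sketch_pool_req *req;

        if (pool != NULL && pool->free_list != NULL) {
                req = pool->free_list;
                pool->free_list = req->next;
                req->next = NULL;
        } else {
                /* No pool given, or the pool is empty. */
                req = calloc(1, sizeof(*req));
        }
        return req;
}

/* Return a request to its pool, or free it if it has none. */
void sketch_req_free(struct sketch_pool_req *req)
{
        assert(req != NULL);
        if (req->pool != NULL) {
                req->next = req->pool->free_list;
                req->pool->free_list = req;
        } else {
                free(req);
        }
}
```

Recording the owning pool in the request itself lets the free path make the right choice without the caller having to remember where the request came from.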
4.5. ptlrpc_request_alloc_pack()
Allocate a new (simple) request and pack it ready for transmission.
struct ptlrpc_request *ptlrpc_request_alloc_pack(struct obd_import *imp,
const struct req_format *format,
__u32 version, int opcode)
-
imp
-
the obd_import through which the message will be sent
-
format
-
the req_format object that defines the message’s format
-
version
-
client part of message version
-
opcode
-
the type of the RPC request
4.6. ptlrpc_request_pack()
Pack a request ready for transmission.
int ptlrpc_request_pack(struct ptlrpc_request *request,
__u32 version, int opcode);
-
request
-
the ptlrpc_request to pack
-
version
-
client part of message version
-
opcode
-
the type of the RPC request
4.7. ptlrpc_request_bufs_pack()
Pack a request ready for transmission.
int ptlrpc_request_bufs_pack(struct ptlrpc_request *request,
__u32 version, int opcode, char **bufs,
struct ptlrpc_cli_ctx *ctx)
-
request
-
the ptlrpc_request to pack
-
version
-
client part of message version
-
opcode
-
the type of the RPC request
-
bufs
-
if non-NULL, it should be an array of pointers to buffers containing
field data - one pointer for each RPC field in the message
-
ctx
-
optional client-side security context
Firstly req_capsule_filled_sizes() is passed the
address of request→rq_pill so that it can fill in those elements
of rc_area (array of field lengths) that have not been assigned
(they are still set to -1). That function returns the number of fields
to be packed and that is now passed to
__ptlrpc_request_bufs_pack() along with this function’s
arguments and the array of field lengths.
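The use of -1 as an "unassigned" marker in the array of field lengths can be sketched as below. The sketch assumes that an unassigned length falls back to the fixed size given by the format; struct sketch_capsule, its members and both function names are hypothetical, and the handling of variable-length fields that were never sized is omitted.

```c
#include <assert.h>

#define SKETCH_MAX_FIELDS 8

/* Hypothetical capsule: field count plus per-field lengths (-1 = unset). */
struct sketch_capsule {
        int        nr_fields;
        const int *default_size;                /* fixed sizes from the format */
        int        area[SKETCH_MAX_FIELDS];     /* lengths for this message */
};

/* Mark every length as unassigned. */
void sketch_capsule_init(struct sketch_capsule *c, int nr, const int *defaults)
{
        assert(nr >= 0 && nr <= SKETCH_MAX_FIELDS);
        c->nr_fields = nr;
        c->default_size = defaults;
        for (int i = 0; i < SKETCH_MAX_FIELDS; i++)
                c->area[i] = -1;
}

/* Fill in lengths the caller did not set; return the number of fields. */
int sketch_capsule_filled_sizes(struct sketch_capsule *c)
{
        for (int i = 0; i < c->nr_fields; i++)
                if (c->area[i] == -1)
                        c->area[i] = c->default_size[i];
        return c->nr_fields;
}
```

A caller therefore only has to set the lengths of the fields whose size it wants to override before packing.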
4.8. __ptlrpc_request_bufs_pack()
Pack a request ready for transmission.
static int __ptlrpc_request_bufs_pack(struct ptlrpc_request *request,
__u32 version, int opcode,
int count, __u32 *lengths, char **bufs,
struct ptlrpc_cli_ctx *ctx)
-
request
-
the ptlrpc_request to pack
-
version
-
client part of message version
-
opcode
-
the type of the RPC request
-
count
-
the number of fields in the request
-
lengths
-
array of field lengths
-
bufs
-
optional array of field initialisation data
-
ctx
-
optional client-side security context
Firstly, the request’s client security context (rq_cli_ctx) is
assigned either the passed in ctx (if non-NULL) or the value
returned from a call to sptlrpc_req_get_ctx().
Next, sptlrpc_req_set_flavor() is called to set
some message security flags depending on what the RPC opcode is.
Now, lustre_pack_request() is called to pack the RPC fields into
the request’s message buffer (rq_reqmsg). When that function
returns, opcode and version are also stored into the request message.
The remainder of the function is concerned with initializing myriad
other fields in request, such as:
-
the callback functions that are to be invoked when the request has
been sent or the reply received.
-
the request and reply portals
-
the phase of the RPC operation (RQ_PHASE_NEW)
-
various other list heads, spin locks, wait queues, etc.
4.9. lustre_pack_request()
Low level function to allocate and pack a request’s message.
int lustre_pack_request(struct ptlrpc_request *req, __u32 magic, int count,
__u32 *lens, char **bufs)
-
req
-
the ptlrpc_request to pack
-
magic
-
a magic number that is now ignored (LUSTRE_MSG_MAGIC_V2 is used
instead)
-
count
-
the number of fields in the request
-
lens
-
array of field lengths
-
bufs
-
optional array of field initialisation data
Simply calls lustre_pack_request_v2() to do the real work.
4.10. lustre_pack_request_v2()
Low level function to allocate and pack a request’s message.
int lustre_pack_request_v2(struct ptlrpc_request *req, int count,
__u32 *lens, char **bufs)
-
req
-
the ptlrpc_request to pack
-
count
-
the number of fields in the request
-
lens
-
array of field lengths
-
bufs
-
optional array of field initialisation data
The size of the request message buffer is calculated with a call to
lustre_msg_size_v2(). The resulting size is the sum of
the message header size and the sizes of each of the fields (all
values being rounded up to the nearest 8 bytes before they are
summed). The size is passed to
sptlrpc_cli_alloc_reqbuf() along with req and the
allocation of the message buffer (possibly from a pool of
pre-allocated buffers) is done through the client security context.
Once the message buffer has been allocated, it can be accessed via
req→rq_reqmsg.
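The size calculation described above amounts to the following arithmetic. In this sketch the header size is passed in as a parameter rather than derived from the real message header layout, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round a length up to the next multiple of 8 bytes. */
uint32_t sketch_round8(uint32_t len)
{
        return (len + 7u) & ~7u;
}

/* Header size plus every field size, each rounded up to 8 bytes. */
uint32_t sketch_msg_size(uint32_t hdr_size, int count, const uint32_t *lens)
{
        uint32_t size = sketch_round8(hdr_size);

        assert(count >= 0);
        for (int i = 0; i < count; i++)
                size += sketch_round8(lens[i]);
        return size;
}
```

For example, a 36-byte header with fields of 1, 8 and 13 bytes gives 40 + 8 + 8 + 16 = 72 bytes.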
The message buffer and the supplied count, lens and bufs are
passed to lustre_init_msg_v2() which stores the count
and buffer lengths into the message buffer and if bufs is non-NULL,
it uses that as an array of pointers to initialization data (one for
each buffer) and copies that data into the message buffer.
Finally, it adds PTLRPC_MSG_VERSION into the request’s
pb_version field.
4.11. ptlrpc_prep_req()
Prepare request ready for sending.
struct ptlrpc_request *
ptlrpc_prep_req(struct obd_import *imp, __u32 version, int opcode, int count,
__u32 *lengths, char **bufs)
-
imp
-
the obd_import through which the request is sent
-
version
-
client part of message version
-
opcode
-
the RPC message type
-
count
-
number of buffers supplied
-
lengths
-
array of buffer lengths
-
bufs
-
array of buffer pointers
Simply passes the arguments to ptlrpc_prep_req_pool() with a NULL pool.
4.12. ptlrpc_prep_req_pool()
Prepare request ready for sending.
struct ptlrpc_request *
ptlrpc_prep_req_pool(struct obd_import *imp,
__u32 version, int opcode,
int count, __u32 *lengths, char **bufs,
struct ptlrpc_request_pool *pool)
-
imp
-
the obd_import through which the request is sent
-
version
-
client part of message version
-
opcode
-
the RPC message type
-
count
-
number of buffers supplied
-
lengths
-
array of buffer lengths
-
bufs
-
array of buffer pointers
-
pool
-
request pool containing free requests
The request allocation is delegated to __ptlrpc_request_alloc()
which will either allocate the message from the pool (if non-NULL) or
allocate the memory itself. If the allocation succeeds,
__ptlrpc_request_bufs_pack() is called to pack the request which
is then returned.
4.13. ptlrpc_queue_wait()
Send a request and wait for the reply.
int ptlrpc_queue_wait(struct ptlrpc_request *req);
Argument req is the prepared request.
For debugging purposes, the current PID is stored in the request’s
pb_status field.
A reference to the request is obtained on behalf of the set and
ptlrpc_set_add_req() is called to add the request to
the set and increment the set’s set_remaining field.
The set is passed to ptlrpc_set_wait() which will send the
request and wait for it to complete.
4.14. ptlrpc_prep_set()
Allocates and initializes a new request set.
struct ptlrpc_request_set *ptlrpc_prep_set(void);
If the memory for the new set cannot be allocated, the function
returns NULL. Otherwise, various fields are initialised in the
expected fashion and the pointer to the new set is returned.
4.15. ptlrpc_set_destroy()
Clears up any requests in the set and then frees the set.
void ptlrpc_set_destroy(struct ptlrpc_request_set *set);
The function starts by checking that all requests in the set are
either completed or new.
Each message in the set is removed from the set and passed to
ptlrpc_req_interpret() with a status of -EBADR (equivalent to
the message being lost in-flight). The request is freed by passing it
to ptlrpc_req_finished().
Finally, the set is freed.
4.16. ptlrpc_set_wait()
Send all unsent requests in a set and then wait for them to complete.
int ptlrpc_set_wait(struct ptlrpc_request_set *set);
If the set contains no requests, the function returns 0 immediately.
The function now loops as long as there are uncompleted requests and
no error has been detected:
-
-
ptlrpc_set_next_timeout() is called to determine how
many seconds will elapse before the first message in the set times out.
-
-
The thread now waits for ptlrpc_check_set() to return non-zero
(indicating that either all requests have been sent and no more
replies are expected or that the timeout period should be
restarted). If no timeout is pending (no messages are in-flight) and
one or more signals are pending (note that
cfs_signal_pending() returns zero if a signal is
pending), the thread waits for up to 1 second with interrupts
enabled. Otherwise, the thread waits until the timeout period has
expired. If the wait times out, the set is passed to
ptlrpc_expired_set() and if the wait was interrupted, the set is
passed to ptlrpc_interrupted_set().
Each request’s status (rq_status) is checked and, if non-zero,
assigned to local rc.
If the set has a non-NULL completion callback function
(set_interpret), it is passed the set. Otherwise, if the set has
any completion callback objects (in list set_cblist), each
callback object is removed from the list, the associated callback
function is passed the set and the callback object freed. Any non-zero
callback function return value is propagated into rc.
4.17. ptlrpc_check_set()
Sends any unsent requests in a set and then returns non-zero if no
more replies are expected.
int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)
-
env
-
a lu_env that is passed on to a request’s completion handler when it
completes
-
set
-
the request set
Local force_timer_recalc is initialized to zero. This will be set if
an error is detected and causes the function to return non-zero so
that a thread waiting on this function will be woken.
If all requests in the set have already completed (set_remaining
is 0), the function immediately returns success (1).
The remainder of the function loops over the requests in the set with
local req pointing at each request.
If the request has not yet been sent (req→rq_phase equals
RQ_PHASE_NEW), req is passed to ptlrpc_send_new_req() for
sending and if that function indicates failure, force_timer_recalc
is set.
If the request has not yet been sent because it is scheduled to be
sent in the future, the loop continues on to the next request in the
set.
If the current request’s phase is RQ_PHASE_UNREGISTERING:
-
-
If the request’s reply is currently being received or the reply has
yet to be unlinked or if an associated bulk transfer is in progress,
the loop continues.
-
-
The request’s phase is moved on to rq_next_phase.
If the request’s phase is now RQ_PHASE_COMPLETE, the loop continues
with the next request.
If the request’s phase is now RQ_PHASE_INTERPRET, control jumps to
near the end of the loop where the reply is interpreted.
If the request has suffered a network error (req→rq_net_err is
true) and has also not timed out (req→rq_timedout is false), it
is passed to ptlrpc_expire_one_request() and then if the reply
has yet to be unlinked or if an associated bulk transfer is in
progress, the loop continues.
If the request has suffered an error (req→rq_err is true),
req→rq_replied is cleared (and if req→rq_status is zero,
it is set to -EIO), the request’s phase is changed to
RQ_PHASE_INTERPRET and control jumps to the reply interpretation
code below.
If the request has received an interrupt (req→rq_intr is true)
and has either timed out (req→rq_timedout is true) or is
waiting for the import to come ready (req→rq_waiting is true)
or is waiting for a context (req→rq_wait_ctx is true) then
req→rq_status is set to -EINTR, the request’s phase is
changed to RQ_PHASE_INTERPRET and control jumps to the reply
interpretation code below.
If the request’s phase is RQ_PHASE_RPC:
-
-
If the request has either timed out or is being resent
(req→rq_resend is true) or waiting for the import to become
ready or is waiting for a context:
-
-
ptlrpc_unregister_reply() is called to (asynchronously) unlink the
request’s reply buffer from the network. Until the unlink has
completed, the function will return false and this loop continues on
to look at the next request.
-
-
ptlrpc_import_delay_req() is called to determine if the request
should be delayed until the import becomes ready and if so, it is
appended to the import’s imp_delayed_list and the loop
continues. Note that rq_waiting is not set here as it is in
ptlrpc_send_new_req() - is this a bug?
-
-
If an error was detected in ptlrpc_import_delay_req(), that status
is assigned to req→rq_status, the request’s phase is changed to
RQ_PHASE_INTERPRET and control jumps to the reply interpretation
code below.
-
-
If the request is not to be resent (req→rq_no_resend is true)
and is not waiting for a context then rq_status is set to
-ENOTCONN, the request’s phase is changed to RQ_PHASE_INTERPRET
and control jumps to the reply interpretation code below.
-
-
The request is appended to the import’s imp_sending_list.
-
-
The request is marked as not waiting for the import
(req→rq_waiting is cleared).
-
-
If the request has timed out or is being resent
(req→rq_timedout or req→rq_resend true), req→rq_resend is
set true and if there is a bulk transfer associated with the request
(req→rq_bulk is true), the request’s rq_xid field is assigned a
new XID value to make the previous bulk transfer fail.
-
-
Now sptlrpc_req_refresh_ctx() is called to refresh the request’s
client context. If that function returns an error code and
req→rq_err is true, the error code is stored in
req→rq_status and force_timer_recalc is set true. If the function
returned an error code and req→rq_err is false,
req→rq_wait_ctx is set true to indicate that the request is
waiting for a context.
-
-
If the client context was refreshed, req→rq_wait_ctx is cleared.
-
-
The request is sent by passing it to ptl_send_rpc() and if that
function returns an error code, force_timer_recalc and
req→rq_net_err are set true.
-
-
force_timer_recalc is set true to reset the timeout.
-
-
If the request has received an early reply,
ptlrpc_at_recv_early_reply() is called and the loop continues.
-
-
If the request is currently receiving a reply, the loop continues.
-
-
If the request has not yet received a reply, the loop continues.
-
-
The request is passed to after_reply() and if req→rq_resend
is now true, the loop continues.
-
-
If there is no bulk transfer associated with this request
(req→rq_bulk is NULL) or the bulk transfer has failed
(req→rq_status is non-zero) then the request’s phase is changed to
RQ_PHASE_INTERPRET and control jumps to the reply interpretation
code below.
-
-
The request’s phase is changed to RQ_PHASE_BULK.
If the bulk transfer is still active, the loop continues.
The request’s phase is changed to RQ_PHASE_INTERPRET and control
has now reached the portion of the loop that interprets the reply.
ptlrpc_unregister_reply() is called to (asynchronously) unlink the
request’s reply buffer from the network. Until the unlink has
completed, the function will return false and this loop continues on to
look at the next request.
Similarly, ptlrpc_unregister_bulk() is called to (asynchronously)
unlink the request’s bulk descriptor from the network. Until the
unlink has completed, the function will return false and this loop
continues on to look at the next request.
Now, ptlrpc_req_interpret() is passed env, req and the current
value of req→rq_status and it simply invokes the request’s callback
function rq_interpret_reply (if non-NULL).
The request’s phase is changed to RQ_PHASE_COMPLETE.
If the request is still linked onto a list (rq_list not empty), it
is removed from the list and the import’s number of in-flight RPCs
(imp_inflight) is decremented.
The number of requests in the set that haven’t yet completed
(set→set_remaining) is decremented and the threads waiting on
the import’s recovery wait queue (imp_recovery_waitq) are woken.
After all of the requests have been processed by the loop, the
function returns true if either the number of requests in the set that
haven’t completed has become zero (the set is now empty) or if
force_timer_recalc is true.
4.18. after_reply()
Callback function invoked after an RPC reply is received.
static int after_reply(struct ptlrpc_request *req);
Argument req is the request that has received a reply.
A call to sptlrpc_cli_unwrap_reply() unwraps the reply
(message header and buffer lengths byte-swapped as required and reply
is verified by the security layer).
If rq_resend is now true, the function returns 0.
Next, unpack_reply() is called and, if required, this
will byte-swap the ptlrpc_body which all messages contain.
The next couple of lines update a procfs counter (PTLRPC_REQWAIT_CNTR).
If the type of the reply is not PTL_RPC_MSG_REPLY or
PTL_RPC_MSG_ERR, -EPROTO is returned.
FIXME - couple of calls to adaptive timeout functions.
The reply’s error status (transmitted in pb_status) is obtained
with a call to ptlrpc_check_status(). [As of
07/04/2010 that function appears to contain duplicated (dead) code.]
If the status indicates an error has occurred and that error is
recoverable (-ENOTCONN or -ENODEV), the status value is
returned. If possible a reconnect will be initiated and the request
marked for resending.
If the status was good (0), the request is passed to
ldlm_cli_update_pool() to update the client’s
obd_pool_slv and obd_pool_limit fields.
If the message is not being replayed, the transaction number is
retrieved from the pb_transno field of the reply’s ptlrpc_body
and stored into the request’s rq_transno field and also stored
into the pb_transno field of the request’s ptlrpc_body.
-
-
If the request has a transaction number (rq_transno is not 0) and
that transaction number is greater than the transaction number of the
last committed transaction or the request is marked for replay
(rq_replay is true), the request is added to the import’s
imp_replay_list through a call to
ptlrpc_retain_replayable_request(). Also, a call to
ptlrpc_save_versions() transfers the version numbers
from the reply’s pb_pre_versions field to the same field in the
request if the request is not being replayed.
-
-
Otherwise, the request’s commit callback (rq_commit_cb) is invoked
(no NULL check).
-
-
If the reply’s pb_last_committed value is non-zero it is assigned
to the import’s imp_peer_committed_transno
field. ptlrpc_free_committed() is called to scan the
import’s imp_replay_list and prune those entries that have a
transaction number less than or equal to
imp_peer_committed_transno and do not have rq_replay set
true. When an entry is pruned its rq_commit_cb is invoked (if not
NULL) and the request’s reference count is decremented.
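The pruning rule applied to the replay list can be sketched as follows. The list representation, struct sketch_replay_req and the helper names are hypothetical; only the rule itself (drop entries at or below the peer's committed transno unless they are marked for replay, invoking the commit callback and dropping a reference) comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_replay_req {
        uint64_t                  transno;
        bool                      replay;   /* keep regardless of commit */
        int                       refcount;
        void                    (*commit_cb)(struct sketch_replay_req *);
        struct sketch_replay_req *next;
};

/* Example commit callback that just counts invocations. */
int sketch_commits_seen;

void sketch_count_commit(struct sketch_replay_req *req)
{
        (void)req;
        sketch_commits_seen++;
}

/*
 * Unlink every request whose transno is covered by peer_committed and that
 * is not pinned by 'replay'. Returns the number of requests pruned.
 */
int sketch_free_committed(struct sketch_replay_req **list,
                          uint64_t peer_committed)
{
        int pruned = 0;

        assert(list != NULL);
        while (*list != NULL) {
                struct sketch_replay_req *req = *list;

                if (req->transno <= peer_committed && !req->replay) {
                        *list = req->next;      /* unlink */
                        req->next = NULL;
                        if (req->commit_cb != NULL)
                                req->commit_cb(req);
                        req->refcount--;
                        pruned++;
                } else {
                        list = &req->next;      /* keep, move on */
                }
        }
        return pruned;
}
```

Walking the list through a pointer to the link field allows entries to be unlinked without tracking a separate "previous" pointer.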
4.19. ptlrpc_expired_set()
Wait event callback function to time out uncompleted requests.
int ptlrpc_expired_set(void *data);
The argument, data, is a pointer to the request set.
The function loops over the set’s requests (set_requests) and
expires those requests that are in flight, are not waiting for a
context, have not already timed out and whose rq_deadline is not
later than the current time. Each request is expired
with a call to ptlrpc_expire_one_request().
The function always returns 1.
4.20. ptlrpc_expire_one_request()
Expire (time out) a single request.
int ptlrpc_expire_one_request(struct ptlrpc_request *req, int async_unlink);
-
req
-
the request to expire
-
async_unlink
-
if true, unlink the request’s reply and bulk buffers asynchronously.
ptlrpc_unregister_reply() and ptlrpc_unregister_bulk() are passed
req and async_unlink to unlink the request’s reply and bulk buffers.
If the import is NULL, the function returns 1.
If the request’s rq_fake field is true, the function returns 1.
The number of timeouts on this import (imp_timeouts) is incremented.
If the import’s imp_dlm_fake field is true, the function returns 1.
If the request should fail due to the timeout, its rq_status field
is set to -ETIMEDOUT and its rq_err field set true and the
function returns 1.
If the request’s rq_no_resend field is true, the function will
return 1, otherwise 0.
Before returning, ptlrpc_fail_import() is called. It is passed the
import that is to be disconnected (due to the timeout) and the
pb_conn_cnt field from the expired message (that is the value of
imp_conn_cnt that the import had at the time the message was
sent). Those arguments are passed down to ptlrpc_set_import_discon()
which does the disconnection as long as the passed in connection count
matches the current value of imp→imp_conn_cnt. i.e. the import
hasn’t already been reconnected since the message that has expired was
sent.
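The connection-count guard can be sketched as below. struct sketch_import and its members are hypothetical; in particular the boolean connected flag is an assumption that stands in for the import's state, which the real code tracks in more detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_import {
        uint32_t conn_cnt;      /* bumped on every reconnect attempt */
        bool     connected;     /* assumed stand-in for the import state */
};

/*
 * Disconnect only if the import is still using the connection that the
 * expired request was sent over. Returns true if a disconnect happened.
 */
bool sketch_set_import_discon(struct sketch_import *imp, uint32_t req_conn_cnt)
{
        assert(imp != NULL);
        if (imp->connected && imp->conn_cnt == req_conn_cnt) {
                imp->connected = false;
                return true;
        }
        return false;   /* stale timeout: import already moved on */
}
```

Comparing the counts prevents a late timeout on an old connection from tearing down a connection that has since been re-established.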
4.21. ptlrpc_interrupted_set()
Wait event callback function that marks all requests in a set as
interrupted.
int ptlrpc_interrupted_set(void *data);
The argument, data, is a pointer to the request set.
Simply loops through the requests in the set (set_requests) and
for each request whose rq_phase field is equal to RQ_PHASE_RPC
or RQ_PHASE_UNREGISTERING, it calls ptlrpc_mark_interrupted()
which sets rq_intr true.
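The loop described above can be sketched as follows, using an array in place of the set's request list. The type and function names are hypothetical, and the return value (the number of requests marked) is added here purely to make the sketch easy to check.

```c
#include <assert.h>
#include <stdbool.h>

/* Phase names follow the text; the numeric values are arbitrary. */
enum sketch_intr_phase {
        SKETCH_INTR_PHASE_NEW,
        SKETCH_INTR_PHASE_RPC,
        SKETCH_INTR_PHASE_UNREGISTERING,
        SKETCH_INTR_PHASE_COMPLETE,
};

struct sketch_intr_req {
        enum sketch_intr_phase phase;
        bool                   intr;
};

/* Mark as interrupted every request that is in RPC or UNREGISTERING. */
int sketch_interrupted_set(struct sketch_intr_req *reqs, int nr)
{
        int marked = 0;

        assert(nr >= 0);
        for (int i = 0; i < nr; i++) {
                if (reqs[i].phase != SKETCH_INTR_PHASE_RPC &&
                    reqs[i].phase != SKETCH_INTR_PHASE_UNREGISTERING)
                        continue;
                reqs[i].intr = true;
                marked++;
        }
        return marked;
}
```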
4.22. ptlrpc_send_new_req()
Sends an RPC request for the first time.
static int ptlrpc_send_new_req(struct ptlrpc_request *req)
The argument, req, is the request to send.
If the request has a scheduled time to send (rq_sent not zero) and
that time is in the future, the function simply returns 0.
The request’s phase is moved to RQ_PHASE_RPC.
ptlrpc_import_delay_req() is called to determine if the request
should be delayed until the import becomes ready and if so, the
request’s rq_waiting field is set and the request is appended to
the import’s imp_delayed_list, the import’s imp_inflight field
is incremented and the function returns 0.
If ptlrpc_import_delay_req() returned an error code, it is assigned
to rq_status, the request’s phase is moved to RQ_PHASE_INTERPRET
and the function returns the error code.
The message’s pb_status field is set to the current PID.
Now sptlrpc_req_refresh_ctx() is called to refresh the request’s
client context. If that function returns an error code and
req→rq_err is true, the error code is stored in
req→rq_status and this function returns 1. If the function returned
an error code and req→rq_err is false, req→rq_wait_ctx is
set true to indicate that the request is waiting for a context and
this function returns 0.
The request is sent by calling ptl_send_rpc(). If that
function returns an error, the request’s rq_net_err field is set
true and the error code is returned.
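The control flow described in this section can be condensed into the
following sketch. It is a model under stated assumptions, not the real
function: the model_* types and the knob_* fields are hypothetical and
simply simulate the outcomes of ptlrpc_import_delay_req(),
sptlrpc_req_refresh_ctx() and ptl_send_rpc(). The delayed list is
reduced to a counter, the PID assignment is omitted, and the return of
0 on a successful send is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

enum model_phase { RQ_PHASE_NEW, RQ_PHASE_RPC, RQ_PHASE_INTERPRET };

struct model_import {
    int imp_inflight;
    int imp_delayed;        /* stands in for imp_delayed_list */
};

struct model_request {
    enum model_phase rq_phase;
    time_t rq_sent;         /* scheduled send time, 0 if none */
    bool rq_waiting;
    bool rq_wait_ctx;
    bool rq_err;
    bool rq_net_err;
    int rq_status;
    /* Hypothetical knobs simulating the helper functions' results. */
    bool knob_delay;        /* import not ready: delay the request */
    int knob_delay_rc;      /* error reported by the delay check */
    int knob_refresh_rc;    /* result of the context refresh */
    int knob_send_rc;       /* result of the actual send */
};

static int model_send_new_req(struct model_request *req,
                              struct model_import *imp, time_t now)
{
    int rc;

    assert(req != NULL && imp != NULL);

    if (req->rq_sent != 0 && req->rq_sent > now)
        return 0;                       /* scheduled for later */

    req->rq_phase = RQ_PHASE_RPC;

    if (req->knob_delay) {
        req->rq_waiting = true;
        imp->imp_delayed++;
        imp->imp_inflight++;
        return 0;
    }
    if (req->knob_delay_rc != 0) {
        req->rq_status = req->knob_delay_rc;
        req->rq_phase = RQ_PHASE_INTERPRET;
        return req->knob_delay_rc;
    }

    rc = req->knob_refresh_rc;
    if (rc != 0) {
        if (req->rq_err) {
            req->rq_status = rc;
            return 1;
        }
        req->rq_wait_ctx = true;        /* wait for a context */
        return 0;
    }

    rc = req->knob_send_rc;
    if (rc != 0) {
        req->rq_net_err = true;
        return rc;
    }
    return 0;                           /* assumption: success is 0 */
}
```

Note that a return value of 0 covers several distinct outcomes (not
yet due, delayed, waiting for a context, sent), so a caller has to
inspect the request's flags and phase to tell them apart.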
4.23. ptl_send_rpc()
int ptl_send_rpc(struct ptlrpc_request *request, int noreply);
-
request
-
the request to be sent
-
noreply
-
if true, don’t set up any reply buffers
If the import’s OBD has failed, the request’s rq_err field is set
and rq_status is set to -ENODEV which is also now returned.
The import’s current connection is assigned to the local variable
connection.
The message’s pb_type field is set to PTL_RPC_MSG_REQUEST.
If the message is being resent (rq_resend is true), MSG_RESENT
is added to the message’s pb_flags field.
The request is passed to sptlrpc_cli_wrap_request() to
be signed or sealed by the security layer.
If a reply is expected (noreply is false):
-
-
If the reply buffer has not yet been allocated (rq_repbuf is
NULL), it is allocated with a call to
sptlrpc_cli_alloc_repbuf(). If that call returns an
error code, it is assigned to rq_status, rq_err is set and
control jumps to cleanup code at the end of the function.
-
-
If the reply buffer was already allocated, rq_repdata and rq_repmsg
are set to NULL.
-
-
Next, LNetMEAttach() is called to
create a new ME for the reply. The ME’s NID/PID is
connection→c_peer and the match bits are
request→rq_xid. If that fails, control jumps to the cleanup
code but rq_status and rq_err are not assigned; should they be?
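The reply set-up steps in the list above can be modelled as shown
below. This is only a sketch: model_alloc_repbuf() and
model_me_attach() are hypothetical stand-ins for the security-layer
allocator and for LNetMEAttach(), whose real signatures are not
reproduced here, and the attach_rc parameter exists purely to simulate
an attach failure. The rq_repbuf_len field is also an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct model_request {
    void *rq_repbuf;
    size_t rq_repbuf_len;       /* hypothetical bookkeeping field */
    void *rq_repdata;
    void *rq_repmsg;
    int rq_status;
    bool rq_err;
    unsigned long long rq_xid;
};

/* Hypothetical stand-in for the reply buffer allocator. */
static int model_alloc_repbuf(struct model_request *req, size_t len)
{
    req->rq_repbuf = malloc(len);
    if (req->rq_repbuf == NULL)
        return -ENOMEM;
    req->rq_repbuf_len = len;
    return 0;
}

/* Hypothetical stand-in for attaching a match entry for the reply. */
static int model_me_attach(unsigned long long match_bits, int attach_rc,
                           unsigned long long *me_bits)
{
    if (attach_rc != 0)
        return attach_rc;
    *me_bits = match_bits;      /* the reply is matched on the XID */
    return 0;
}

static int model_prepare_reply(struct model_request *req, bool noreply,
                               size_t len, int attach_rc,
                               unsigned long long *me_bits)
{
    int rc;

    assert(req != NULL && me_bits != NULL);

    if (noreply)
        return 0;               /* no reply buffers are set up */

    if (req->rq_repbuf == NULL) {
        rc = model_alloc_repbuf(req, len);
        if (rc != 0) {
            req->rq_status = rc;
            req->rq_err = true;
            return rc;
        }
    } else {
        /* Buffer is reused: forget any previously unpacked reply. */
        req->rq_repdata = NULL;
        req->rq_repmsg = NULL;
    }

    /* As in the text, a failure here leaves rq_status/rq_err alone. */
    return model_me_attach(req->rq_xid, attach_rc, me_bits);
}
```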
A bunch of flags in the request are cleared.
If a reply is expected (noreply is false):
A reference to the request is obtained on behalf of
request_out_callback().
The procfs requests active counter is updated.
The deadline for the reply’s arrival is calculated and assigned to
rq_deadline.
The request’s import (rq_import) is informed that it is being used
for a send.
The request is transmitted with a call to ptl_send_buf() and if
that function returns success, this function returns success.
Otherwise, the request is passed to ptlrpc_req_finished(), which just
drops a reference to the request. If noreply is true, this function
returns at this point, as no more cleanup actions are required.
The function returns the error code.
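The reference handling at the end of the function can be illustrated
with a small model. The names below are hypothetical, send_rc
simulates the result of ptl_send_buf(), and the content of the final
cleanup step (undoing the reply set-up) is an assumption, since the
text does not describe what the cleanup code does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_request {
    int rq_refcount;
    bool reply_me_attached;     /* hypothetical: reply set-up is live */
};

static int model_send_tail(struct model_request *req, bool noreply,
                           int send_rc)
{
    int rc;

    assert(req != NULL);

    /* Reference held on behalf of the send-completion callback. */
    req->rq_refcount++;

    rc = send_rc;               /* stands in for ptl_send_buf() */
    if (rc == 0)
        return 0;               /* the callback drops the reference */

    /* Send failed: no callback will run, so drop the reference. */
    req->rq_refcount--;
    if (noreply)
        return rc;              /* nothing else to undo */

    /* Assumed cleanup: tear down the reply set-up made earlier. */
    req->reply_me_attached = false;
    return rc;
}
```

The point of the sketch is the balance: the extra reference survives
only when the send succeeds, so that the request stays alive until the
completion callback has run.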
4.24. ptl_send_buf()
static int ptl_send_buf(lnet_handle_md_t *mdh, void *base, int len,
                        lnet_ack_req_t ack, struct ptlrpc_cb_id *cbid,
                        struct ptlrpc_connection *conn, int portal,
                        __u64 xid, unsigned int offset);
4.25. ptlrpc_register_bulk()
int ptlrpc_register_bulk(struct ptlrpc_request *req);
4.26. ptlrpc_unregister_bulk()
int ptlrpc_unregister_bulk(struct ptlrpc_request *req, int async);
5. Server side functions
6. Portal RPC daemon functions