Lustre Distributed Lock Manager
1. Introduction

The Lustre Distributed Lock Manager (LDLM) provides a means to ensure that data is updated in a consistent fashion across multiple nodes. Locks provide two vital capabilities for a distributed file system:
1.1. Resources

A Lustre lock protects a resource. Most commonly, a resource is a file. At this time, resources can be organised in a parent/child hierarchy (comments in the code suggest that this is due to change).

1.2. Namespaces

Resources are scoped within a namespace. As resources are protected by locks, the locks are also scoped within the same namespaces, so we can talk about both a namespace’s resources and a namespace’s locks.

Namespaces may either be completely local to a specific client or global, in the sense that the namespace is "owned" by some server and each of the server’s clients has a shadow namespace that contains only those locks from the server’s namespace that are of interest to that client.

The following diagram shows how the locks in a server namespace can be referenced from client (shadow) namespaces. Client 1’s shadow namespace contains Lock A and Lock B, and Client 2’s shadow namespace contains just Lock C.
Figure 1: Shadow namespaces on clients.
1.3. Lock type and mode

Locks have a type and a mode. The six possible modes are:
The lock compatibility matrix below shows which modes a new lock can have, and still be granted, when the resource is already locked with a given mode (1 = compatible, 0 = conflict).

      NL  CR  CW  PR  PW  EX
    -------------------------
NL     1   1   1   1   1   1
CR     1   1   1   1   1   0
CW     1   1   1   0   0   0
PR     1   1   0   1   0   0
PW     1   1   0   0   0   0
EX     1   0   0   0   0   0
    -------------------------
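The compatibility matrix above can also be expressed as one bitmask per granted mode. The following is a minimal sketch of that idea, not the LDLM implementation: the names ex_lock_mode, ex_compat_mask() and ex_modes_compatible() are hypothetical, and the bit values are chosen only for this example.

```c
#include <assert.h>

/* Hypothetical mode identifiers for this sketch: one bit per mode, in the
 * same order as the columns of the matrix. The real LDLM constants may
 * use different names and values. */
enum ex_lock_mode {
    EX_MODE_NL = 1 << 0,
    EX_MODE_CR = 1 << 1,
    EX_MODE_CW = 1 << 2,
    EX_MODE_PR = 1 << 3,
    EX_MODE_PW = 1 << 4,
    EX_MODE_EX = 1 << 5
};

/* One row of the matrix: the set of modes that may still be granted while
 * a lock of mode 'granted' is held on the resource. */
static int ex_compat_mask(enum ex_lock_mode granted)
{
    switch (granted) {
    case EX_MODE_NL:
        return EX_MODE_NL | EX_MODE_CR | EX_MODE_CW |
               EX_MODE_PR | EX_MODE_PW | EX_MODE_EX;
    case EX_MODE_CR:
        return EX_MODE_NL | EX_MODE_CR | EX_MODE_CW |
               EX_MODE_PR | EX_MODE_PW;
    case EX_MODE_CW:
        return EX_MODE_NL | EX_MODE_CR | EX_MODE_CW;
    case EX_MODE_PR:
        return EX_MODE_NL | EX_MODE_CR | EX_MODE_PR;
    case EX_MODE_PW:
        return EX_MODE_NL | EX_MODE_CR;
    case EX_MODE_EX:
        return EX_MODE_NL;
    }
    assert(0); /* not reached for valid modes */
    return 0;
}

/* Returns 1 if a new lock of mode 'requested' is compatible with an
 * existing lock of mode 'granted', 0 if the two modes conflict. */
static int ex_modes_compatible(enum ex_lock_mode granted,
                               enum ex_lock_mode requested)
{
    return (ex_compat_mask(granted) & requested) != 0;
}
```

With this encoding a compatibility test is a single AND operation, and the symmetry of the matrix means the two arguments can be swapped without changing the result.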
The matrix is symmetrical, so it can be read either way. For example, if the resource has been locked with PW (protected write) mode, you can look down the PW column and see that the only allowed modes for a new lock are NL (null) and CR (concurrent read).

In Lustre, four types of locks are defined and they are used for different purposes. It is up to the client to decide what type of lock it requests. The component that requests a lock from the lock manager is the client, and it can be Lustre Lite, OSC or MDC. The four types of locks are:
1.4. Callbacks

Callback functions are a means whereby information or state related to a given lock can be passed back to the client from the locking subsystem. As part of the lock request process, the client supplies the callback function pointers. Three types of callback function are used:
1.5. Intent

An intent can be specified by the client when a lock is enqueued to request some special processing during the enqueue. The intent contains parameters that are passed to the intent handler associated with the lock’s namespace. The intent feature allows the client to indicate what it intends to accomplish once the lock is granted. The locking subsystem can use this knowledge to reduce the number of RPC calls required to achieve the client’s intention.

Six intent operations are defined: get attributes, set attributes, insert or delete, open, create, and read links.

2. Data Structures

The major data structures used by the LDLM are:

2.1. ldlm_namespace

Instances of struct ldlm_namespace represent a resource namespace. Each server has a local namespace that contains its resources. Each client that is connected to the server has a shadow namespace that can contain client-local versions of the server’s resources.

/** LDLM Namespace.
*
* Namespace serves to contain locks related to a particular service.
* There are two kinds of namespaces:
* - Local namespace has knowledge of all locks and is therefore authoritative
* to make decisions like what locks could be granted and what conflicts
* exist during new lock enqueue.
* - Client namespace only has limited knowledge about locks in the namespace,
* only seeing locks held by the client.
*
* Every Lustre service has one local namespace present on the server serving
* that service. Every client connected to the service has a client namespace
* for it.
* Every lock obtained by a client in that namespace is actually represented
* by two in-memory locks, one on the server and one on the client. The locks
* are linked by a special cookie by which one node can tell the other which
* lock it actually means during communications.
*/
struct ldlm_namespace {
/**
* Namespace name. Used for logging, etc.
*/
char *ns_name;
/**
* Is this a client-side lock tree?
*/
ldlm_side_t ns_client;
/**
* Namespace connect flags supported by server (may be changed via proc,
* LRU resize may be disabled/enabled).
*/
__u64 ns_connect_flags;
/**
* Client side orig connect flags supported by server.
*/
__u64 ns_orig_connect_flags;
/**
* Hash table for namespace.
*/
struct list_head *ns_hash;
spinlock_t ns_hash_lock;
/**
* Count of resources in the hash.
*/
__u32 ns_refcount;
/**
* All root resources in namespace.
*/
struct list_head ns_root_list;
/**
* Position in global namespace list.
*/
struct list_head ns_list_chain;
/**
* List of unused locks (the client-side LRU list).
*/
struct list_head ns_unused_list;
int ns_nr_unused;
spinlock_t ns_unused_lock;
unsigned int ns_max_unused;
unsigned int ns_max_age;
unsigned int ns_timeouts;
/**
* Seconds.
*/
unsigned int ns_ctime_age_limit;
/**
* Next debug dump, jiffies.
*/
cfs_time_t ns_next_dump;
atomic_t ns_locks;
__u64 ns_resources;
/** "policy" function that does actually conflict determination */
ldlm_res_policy ns_policy;
struct ldlm_valblock_ops *ns_lvbo;
void *ns_lvbp;
cfs_waitq_t ns_waitq;
struct ldlm_pool ns_pool;
ldlm_appetite_t ns_appetite;
/**
* If more than \a ns_contended_locks found, the resource is considered
* to be contended.
*/
unsigned ns_contended_locks;
/**
* The resource remembers contended state during \a ns_contention_time,
* in seconds.
*/
unsigned ns_contention_time;
/**
* Limit size of nolock requests, in bytes.
*/
unsigned ns_max_nolock_size;
/**
* Backward link to obd, required for ldlm pool to store new SLV.
*/
struct obd_device *ns_obd;
struct adaptive_timeout ns_at_estimate;/* estimated lock callback time*/
};
The fields are:
2.2. ldlm_resource

Each lockable object (resource) is represented by an instance of struct ldlm_resource. Resources are named by four 64-bit integers, and the mapping between the object identity and the resource name is done by the user of the LDLM API.

/** LDLM resource description.
* Resource is basically a representation for a single object.
* Object has a name which is currently four 64-bit integers.
* LDLM user is responsible for creating a mapping between objects it wants
* protected and resource names.
*
* A resource can only hold locks of a single lock type.
*/
struct ldlm_resource {
/** Back link to namespace */
struct ldlm_namespace *lr_namespace;
/* protected by ns_hash_lock */
struct list_head lr_hash;
/* Remove lr_parent/child and its logic ASAP */
struct ldlm_resource *lr_parent; /* 0 for a root resource */
struct list_head lr_children; /* list head for child resources */
struct list_head lr_childof; /* part of ns_root_list if root res,
* part of lr_children if child */
spinlock_t lr_lock;
/* protected by lr_lock */
/**
* List of locks in granted state
*/
struct list_head lr_granted;
/**
* List of locks waiting to change their granted mode (converting)
*/
struct list_head lr_converting;
/**
* List of locks that could not be granted due to conflicts and
* that are waiting for conflicts to go away */
struct list_head lr_waiting;
/* No longer needed? Remove ASAP */
ldlm_mode_t lr_most_restr;
/**
* Type of locks this resource can hold
*/
ldlm_type_t lr_type; /* LDLM_{PLAIN,EXTENT,FLOCK} */
struct ldlm_res_id lr_name;
atomic_t lr_refcount;
struct ldlm_interval_tree lr_itree[LCK_MODE_NUM]; /* interval trees*/
/* Server-side-only lock value block elements */
struct semaphore lr_lvb_sem;
__u32 lr_lvb_len;
void *lr_lvb_data;
/* when the resource was considered as contended */
cfs_time_t lr_contention_time;
/**
* List of references to this resource. For debugging.
*/
struct lu_ref lr_reference;
};
The fields are:
2.3. ldlm_lock

Locks are represented by instances of struct ldlm_lock.

/** LDLM lock structure
*
* Represents actual lock and its state in memory */
struct ldlm_lock {
/** Local lock handle.
* When remote side wants to tell us about a lock, they address it by
* this handle.
* Must be first in the structure.
*/
struct portals_handle l_handle;
/**
* Lock reference count.
*/
atomic_t l_refc;
/**
* Internal spinlock that protects l_resource. This lock should be taken
* before grabbing res_lock.
*/
spinlock_t l_lock;
/**
* ldlm_lock_change_resource() can change this.
*/
struct ldlm_resource *l_resource;
/**
* Protected by ns_hash_lock. List item for client side lru list.
*/
struct list_head l_lru;
/**
* Protected by lr_lock, linkage to resource's lock queues.
*/
struct list_head l_res_link;
/**
* Tree node for ldlm_extent.
*/
struct ldlm_interval *l_tree_node;
/**
* Protected by per-bucket exp->exp_lock_hash locks. Per export hash
* of locks.
*/
struct hlist_node l_exp_hash;
/**
* Protected by lr_lock. Requested mode.
*/
ldlm_mode_t l_req_mode;
/**
* Granted mode, also protected by lr_lock.
*/
ldlm_mode_t l_granted_mode;
/**
* Lock enqueue completion handler.
*/
ldlm_completion_callback l_completion_ast;
/**
* Lock blocking ast handler.
* Called 2 times, once when somebody conflicts with this lock for
* the first time. Once when the lock is finally cancelled.
*/
ldlm_blocking_callback l_blocking_ast;
/**
* Lock glimpse handler.
* Glimpse handler is used to obtain LVB updates from a client by
* server.
*/
ldlm_glimpse_callback l_glimpse_ast;
ldlm_weigh_callback l_weigh_ast;
/**
* Lock export.
*/
struct obd_export *l_export;
/**
* Lock connection export.
*/
struct obd_export *l_conn_export;
/**
* Remote lock handle.
* If the lock has a sister-lock on another node, this is the handle
* of its remote counterpart (l_handle).
*/
struct lustre_handle l_remote_handle;
/** Representation of private data specific for a lock type.
* Examples are: extent range for extent lock or
* bitmask for ibits locks */
ldlm_policy_data_t l_policy_data;
/*
* Protected by lr_lock. Various counters: readers, writers, etc.
*/
__u64 l_flags;
__u32 l_readers;
__u32 l_writers;
/*
* Set for locks that were removed from class hash table and will be
* destroyed when last reference to them is released. Set by
* ldlm_lock_destroy_internal().
*
* Protected by lock and resource locks.
*/
__u8 l_destroyed;
/**
* If the lock is granted, a process sleeps on this waitq to learn when
* it's no longer in use. If the lock is not granted, a process sleeps
* on this waitq to learn when it becomes granted.
*/
cfs_waitq_t l_waitq;
/**
* Seconds. it will be updated if there is any activity related to
* the lock, e.g. enqueue the lock or send block AST.
*/
cfs_time_t l_last_activity;
/**
* Jiffies. Should be converted to time if needed.
*/
cfs_time_t l_last_used;
/** Originally requested extent on the extent lock */
struct ldlm_extent l_req_extent;
/*
* Client-side-only members.
*/
/**
* Temporary storage for an LVB received during an enqueue operation.
*/
__u32 l_lvb_len;
void *l_lvb_data;
void *l_lvb_swabber;
/** Private storage for lock user. Opaque to LDLM. */
void *l_ast_data;
/*
* Server-side-only members.
*/
/** Connection cookie for the client that originated the operation. */
__u64 l_client_cookie;
/**
* Protected by elt_lock. Callbacks pending.
*/
struct list_head l_pending_chain;
cfs_time_t l_callback_timeout;
/**
* Pid which created this lock.
*/
__u32 l_pid;
/**
* For ldlm_add_ast_work_item().
*/
struct list_head l_bl_ast;
/**
* For ldlm_add_ast_work_item().
*/
struct list_head l_cp_ast;
/**
* For ldlm_add_ast_work_item().
*/
struct list_head l_rk_ast;
struct ldlm_lock *l_blocking_lock;
int l_bl_ast_run;
/**
* Protected by lr_lock, linkages to "skip lists".
*/
struct list_head l_sl_mode;
struct list_head l_sl_policy;
struct lu_ref l_reference;
};
The fields are:
The following fields are used by the client-side only:

The following fields are used by the server-side only:
3. Code Description

3.1. Namespace creation

Client-side and server-side namespaces are created and initialised using ldlm_namespace_new().

3.1.1. ldlm_namespace_new()

struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
                                          ldlm_side_t client, ldlm_appetite_t apt);
Its arguments are:
The function first obtains a reference to the LDLM system by calling ldlm_get_ref(). The first call to that function invokes ldlm_setup() to initialise the LDLM infrastructure (sets up PTLRPC, creates threads, etc.).

The memory for the new namespace and its hash table (ns_hash) is allocated. The obd and apt arguments are assigned to ns_obd and ns_appetite, respectively, and memory is allocated for ns_name to hold a copy of the name argument. The other members of the namespace structure are set to their initial values.

A call to ldlm_proc_namespace() sets up the procfs entries for the namespace. A call to ldlm_pool_init() initialises the namespace’s pool. A call to at_init() initialises the namespace’s ns_at_estimate field (Adaptive Timeout).

Finally, the namespace is registered with a call to ldlm_namespace_register(), which adds it to the client or server namespaces list (as appropriate) and increments the number of client (or server) namespaces registered.

3.2. Resource creation

Resources are identified by a name comprising 4 x 64-bit integers (type ldlm_res_id). A reference to a resource is obtained by calling ldlm_resource_get().

3.2.1. ldlm_resource_get()

This function obtains a reference to the named resource within the given namespace. It will (optionally) create the resource if it doesn’t already exist.

struct ldlm_resource *
ldlm_resource_get(struct ldlm_namespace *ns, struct ldlm_resource *parent,
                  const struct ldlm_res_id *name, ldlm_type_t type, int create);
Its arguments are:
The function starts by calling ldlm_hash_fn() to calculate the resource’s hash value (based on the values of name and parent). It then calls ldlm_resource_find() to look up the resource in the namespace’s hash table.
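The hash-then-lookup step, followed by optional creation, can be sketched in user-space C as follows. This is only an illustration under simplifying assumptions: it is single-threaded (no spinlock), ignores the parent argument, and every ex_ name, the bucket count and the hash function are hypothetical rather than taken from the Lustre source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define EX_HASH_SIZE 64 /* hypothetical bucket count */

/* Resource name: four 64-bit integers, as described in the text. */
struct ex_res_id {
    uint64_t name[4];
};

struct ex_resource {
    struct ex_res_id    id;
    int                 refcount;
    struct ex_resource *next; /* hash chain */
};

struct ex_namespace {
    struct ex_resource *hash[EX_HASH_SIZE];
};

/* Hypothetical hash function; it only needs to be deterministic. */
static unsigned ex_hash_fn(const struct ex_res_id *id)
{
    uint64_t h = 0;

    for (int i = 0; i < 4; i++)
        h = h * 31 + id->name[i];
    return (unsigned)(h % EX_HASH_SIZE);
}

/* Walk one hash chain looking for a resource with a matching name. */
static struct ex_resource *ex_resource_find(struct ex_namespace *ns,
                                            unsigned hash,
                                            const struct ex_res_id *id)
{
    for (struct ex_resource *r = ns->hash[hash]; r != NULL; r = r->next)
        if (memcmp(&r->id, id, sizeof(*id)) == 0)
            return r;
    return NULL;
}

/* Return a referenced resource, creating it only if 'create' is set. */
static struct ex_resource *ex_resource_get(struct ex_namespace *ns,
                                           const struct ex_res_id *id,
                                           int create)
{
    assert(ns != NULL && id != NULL);

    unsigned hash = ex_hash_fn(id);
    struct ex_resource *res = ex_resource_find(ns, hash, id);

    if (res != NULL) {
        res->refcount++; /* existing resource: take another reference */
        return res;
    }
    if (!create)
        return NULL;

    res = calloc(1, sizeof(*res));
    if (res == NULL)
        return NULL;
    res->id = *id;
    res->refcount = 1; /* the caller owns the first reference */
    res->next = ns->hash[hash];
    ns->hash[hash] = res;
    return res;
}
```

A second call with the same name returns the same object with its reference count raised, which is the behaviour a caller relies on when several locks share one resource.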
3.2.2. ldlm_resource_add()

This function allocates and initialises a new resource within the given namespace.

static struct ldlm_resource *
ldlm_resource_add(struct ldlm_namespace *ns, struct ldlm_resource *parent,
                  const struct ldlm_res_id *name, __u32 hash, ldlm_type_t type)
Its arguments are:
The function starts by calling ldlm_resource_new() to allocate and partially initialise a new resource object. That object is then further initialised with name, ns and type.

The namespace’s hash table is then spinlocked and searched to see if it already has an entry that matches hash. If so, the resource has already been added to the namespace, so a reference to the existing resource is obtained and the hash table spinlock is released. The newly allocated resource is then freed and the existing resource is returned.

If the hash table did not already contain an entry that matches hash, the new resource is added to the hash table and a reference to the hash table is obtained. If parent is non-NULL, the new resource is linked to the parent; otherwise, it is added to the namespace’s ns_root_list. The hash table spinlock is unlocked.

If the namespace’s ns_lvbo field is non-NULL and the referenced ldlm_valblock_ops has a non-NULL lvbo_init function pointer, then that function is invoked, passing it the new resource. The resource’s lr_lvb_sem is up()'d. The resource is returned.

3.2.3. ldlm_resource_new()

This function allocates and partially initialises a new resource.

static struct ldlm_resource *ldlm_resource_new(void);

The resource is allocated and various fields are zeroed or initialised. The lr_refcount field is set to 1. The lr_lvb_sem semaphore is initialised in the locked state. The new resource is returned.

3.3. Lock creation

3.3.1. ldlm_lock_create()

This function creates a new lock for the specified resource with the given type, mode and callback functions.

struct ldlm_lock *ldlm_lock_create(struct ldlm_namespace *ns,
                                   const struct ldlm_res_id *res_id,
                                   ldlm_type_t type,
                                   ldlm_mode_t mode,
                                   const struct ldlm_callback_suite *cbs,
                                   void *data, __u32 lvb_len);
Arguments are:
The function starts by obtaining a reference to the resource by calling ldlm_resource_get(). The resource will be created if it doesn’t already exist. A new lock is obtained by calling ldlm_lock_new() (this obtains a new reference to the resource). The first reference to the resource is released as it is no longer required.

Various fields in the lock are initialised from the passed-in values. If the lock is an extent lock, an interval tree node is allocated. If lvb_len is non-zero, a buffer of that size is allocated and assigned to l_lvb_data.

The function returns the new lock. On error, the allocated storage is cleaned up and the function returns NULL.

3.3.2. ldlm_lock_new()

This function allocates a new lock for the specified resource.

static struct ldlm_lock *ldlm_lock_new(struct ldlm_resource *resource);

It takes a single argument, resource, which is the resource to be locked.

Memory is allocated for the new lock and various fields are initialised. A reference to the resource is obtained. Note that the lock’s reference count (l_refc) is initialised to 2. The lock count of the resource’s namespace is incremented. A hash value is computed for the lock’s handle (l_handle). The new lock is returned.

3.4. Lock enqueuing - client-side

On the client, a lock on a remote resource is obtained by calling ldlm_cli_enqueue() to enqueue a lock request. This involves creating a lock object, building a PTLRPC request that is sent to the server, and waiting for and then processing the result from the server. Unless an error occurs, the lock is added to one of the resource’s lists (granted, converting or waiting).

3.4.1. ldlm_cli_enqueue()

int ldlm_cli_enqueue(struct obd_export *exp, struct ptlrpc_request **reqp,
                     struct ldlm_enqueue_info *einfo,
                     const struct ldlm_res_id *res_id,
                     ldlm_policy_data_t *policy, int *flags,
                     void *lvb, __u32 lvb_len, void *lvb_swabber,
                     struct lustre_handle *lockh, int async);
Lots of arguments but the key ones are:
The function starts by retrieving the ldlm_namespace, ns, from the export, exp. If replaying, the previously created lock is located by passing the lockh argument to ldlm_handle2lock_long(). Otherwise, a new lock is created by calling ldlm_lock_create():

    const struct ldlm_callback_suite cbs = {
            .lcs_completion = einfo->ei_cb_cp,
            .lcs_blocking   = einfo->ei_cb_bl,
            .lcs_glimpse    = einfo->ei_cb_gl,
            .lcs_weigh      = einfo->ei_cb_wg
    };
    lock = ldlm_lock_create(ns, res_id, einfo->ei_type,
                            einfo->ei_mode, &cbs, einfo->ei_cbdata,
                            lvb_len);
This takes the namespace, ns, and the resource id, res_id, that identify the resource being locked, together with the lock’s desired type, mode, callbacks, etc.

A reference is added to the lock by calling ldlm_lock_addref_internal(), which has the side effects of removing the lock from the LRU list (if appropriate), incrementing the lock’s count of readers or writers (depending on the lock type) and adding references to the lock’s debug reference list. A call to ldlm_lock2handle() returns the lock’s cookie to the caller (by reference, using *lockh).

If policy is not NULL (policy data supplied), it is copied by value into the lock. If the lock is an extent lock, the extent data is copied into the lock from policy (the code assumes that policy is not NULL here).

If reqp or *reqp is NULL, the request has to be created with a call to ptlrpc_request_alloc_pack():

    req = ptlrpc_request_alloc_pack(class_exp2cliimp(exp),
                                    &RQF_LDLM_ENQUEUE,
                                    LUSTRE_DLM_VERSION,
                                    LDLM_ENQUEUE);
Otherwise, the passed-in request is checked to see that its capsule size is large enough to contain the lock request data.

The next action of note is that the lock data is transferred into the request body:

    /* Dump lock data into the request buffer */
    body = req_capsule_client_get(&req->rq_pill, &RMF_DLM_REQ);
    ldlm_lock2desc(lock, &body->lock_desc);
    body->lock_flags = *flags;
    body->lock_handle[0] = *lockh;
The main work here is done by ldlm_lock2desc(), which converts the lock data into the "on the wire" format stored in the request body.

Next, if the PTLRPC request was not passed in (it was allocated in this function), its reply length is set by a call to ptlrpc_request_set_replen(). Before that is done, if the passed-in lvb_len (LDLM Value Block length) is greater than 0, the request capsule is extended to include the LVB data.

If the async argument is true, the function returns at this point without initiating the communication with the server. It is up to the client to do that using the initialised request. Otherwise, the request is sent to the server and its response is processed by:

    rc = ptlrpc_queue_wait(req);
    err = ldlm_cli_enqueue_fini(exp, req, einfo->ei_type, policy ? 1 : 0,
                                einfo->ei_mode, flags, lvb, lvb_len,
                                lvb_swabber, lockh, rc);
After that, the request is cleaned up (if it was allocated in this function).

3.4.2. ldlm_cli_enqueue_fini()

This function processes the server’s response to the lock enqueue request.

int ldlm_cli_enqueue_fini(struct obd_export *exp, struct ptlrpc_request *req,
                          ldlm_type_t type, __u8 with_policy, ldlm_mode_t mode,
                          int *flags, void *lvb, __u32 lvb_len,
                          void *lvb_swabber, struct lustre_handle *lockh, int rc);
Many of the arguments correspond to the arguments to ldlm_cli_enqueue(), which simply passes them down.

First, the function calls ldlm_handle2lock() to retrieve the lock using the handle *lockh. A side effect of this call is that it obtains a new reference to the lock. If the result is NULL (which should only occur when the lock’s type is LDLM_FLOCK), the function returns -ENOLCK.

Otherwise, the return code from the PTLRPC communication is checked. If it is not ELDLM_OK, the enqueue either aborted or failed. If it aborted, some reply data is byte swapped. In both cases, control transfers to the cleanup code and the function returns.

The reply is retrieved from the PTLRPC response, the lock and resource are locked, and the lock’s l_remote_handle is updated from the reply’s lock_handle. If the export has a hash of locks (exp_lock_hash not NULL), the lock is rehashed using the new l_remote_handle as the key. The lock_flags in the reply are passed back to the caller by reference (using flags), and those flags that are selected by the LDLM_INHERIT_FLAGS mask are ORed into the lock’s flags, l_flags. The LDLM_FL_NO_TIMEOUT flag is also transferred from the reply to the lock. The lock and resource are unlocked.

If the reply flags have LDLM_FL_LOCK_CHANGED set, the lock’s mode and resource name are updated from the reply (if they changed). The lock’s l_policy_data member is also updated (except when the server doesn’t support inodebits).

If the reply flags indicate that a blocking or cancel AST was sent (it is not clear to whom), or the client is a liblustre client and the lock is an extent lock, the lock is marked as being destroyed (the reason for this is not clear).

If the lvb_len argument is non-zero and the lock’s requested mode is not equal to its granted mode, the LVB data in the reply is copied into the region that the lock’s l_lvb_data member points at.

If not replaying, the lock is enqueued by calling ldlm_lock_enqueue().
When this returns, if the lock’s completion AST pointer, l_completion_ast, is not NULL, the function is invoked to signal the client that the lock has been enqueued. If the lvb_len argument is non-zero and the lvb argument is non-NULL, the LVB data is copied out of the lock to *lvb.

If cleanup is required due to errors, failed_lock_cleanup() is invoked. Finally, the reference to the lock obtained when the lock was retrieved by handle is released, as is the reference to the lock that was obtained by the calling function.

3.5. Lock enqueuing - server-side

On the server, when an LDLM enqueue message is received, the PTLRPC layer calls back into the LDLM code. Most code still calls ldlm_handle_enqueue(), which is now a wrapper around a newer function, ldlm_handle_enqueue0().

3.5.1. ldlm_handle_enqueue()

This is the old entry point to the server-side handling of enqueue requests.

int ldlm_handle_enqueue(struct ptlrpc_request *req,
                        ldlm_completion_callback completion_callback,
                        ldlm_blocking_callback blocking_callback,
                        ldlm_glimpse_callback glimpse_callback);
The arguments are:
This function simply aggregates the callback function pointers into a ldlm_callback_suite, obtains the LDLM request data from the PTLRPC request and calls ldlm_handle_enqueue0().

3.5.2. ldlm_handle_enqueue0()

This is the main LDLM server-side entry point for enqueuing lock requests. It is called from PTLRPC receive handling threads.

int ldlm_handle_enqueue0(struct ldlm_namespace *ns,
                         struct ptlrpc_request *req,
                         const struct ldlm_request *dlm_req,
                         const struct ldlm_callback_suite *cbs);
The arguments are:
The function starts by calling ldlm_request_cancel() to cancel any locks that the client wishes to be canceled as part of the enqueuing operation. If appropriate, procfs counters are updated.

Next, various values are sanity checked (lock type, mode, IBITS consistency). If the checks fail, control jumps to the error handling code at the bottom of the function.

If the lock request is being replayed, the existing lock is searched for in the per-export hash table and, if found, control jumps over the code that creates a new lock.

A new lock is created with a call to ldlm_lock_create(). The resource’s name and the lock type and mode are all taken from the request. The new lock’s last activity time is set to the current time and its remote handle is assigned the client lock’s handle from the request. The OBD_FAIL_LDLM_ENQUEUE_BLOCKED timeout is set to twice the value of obd_timeout (the reason for this is not clear).

If the export has been destroyed, the enqueue is cancelled and control jumps to the error handling code. A reference to the lock’s export is obtained and assigned to the lock’s l_export field. If the export has a lock hash, the lock is added to the hash.

The following code is executed for both new and replayed locks. What happens next depends on whether the lock has an intent (LDLM_FL_HAS_INTENT is set) or not:
If the lock type is not LDLM_PLAIN, the lock’s l_policy_data field is set to the corresponding value in the LDLM request. If the lock type is LDLM_EXTENT, the lock’s l_req_extent field is set from the l_extent part of the lock’s l_policy_data field.

The lock is enqueued by calling ldlm_lock_enqueue(), passing the namespace, lock, cookie value and the requested lock flags.

A pointer to the PTLRPC response structure (dlm_rep) is obtained and the lock’s state, handle and flags are stored into it.

The lock and resource are locked. Those lock flags that need to be inherited from the original locking request (selected by the mask LDLM_INHERIT_FLAGS) are added into the lock flags for both the reply to the client and the local lock. If the export has failed (exp_failed non-zero), the return code is set appropriately. Otherwise, some processing related to canceling the lock takes place if LDLM_FL_CANCEL_ON_BLOCK is set. The meaning and usage of LDLM_FL_AST_SENT in this code is not yet understood, so the details are left out for now.

Then follows a check for the erroneous situation where the lock type is LDLM_PLAIN or LDLM_IBITS, the client is a liblustre client and the lock doesn’t have the LDLM_FL_CANCEL_ON_BLOCK flag set. In this case, messages are reported but no error state is set. The lock and resource are unlocked.

The PTLRPC request field rq_status is assigned the result status value (non-zero if an error has occurred). If required (rq_packed_final is zero), the reply is packed.

The remainder of the function only executes if the lock exists (either it was found during replay or a new lock has been created). The lock and resource are locked. If no error has occurred and the resource’s lr_lvb_len field is non-zero, the resource’s LVB data is copied into the PTLRPC capsule. If an error has occurred, the lock is unlinked from the various places that reference it and is destroyed. The lock and resource are unlocked.
If no error occurred while enqueuing the lock or packing the PTLRPC reply and the lock isn’t of type LDLM_FLOCK, ldlm_reprocess_all() is called to process the resource’s waiting and converting queues. The lock is released. The function returns the result code (0 for success).

3.5.3. ldlm_cli_enqueue_local()

This function enqueues a lock in a local namespace. No RPC communication is required. Somewhat confusingly (given the function’s name), it is only used on servers to enqueue server-local locks.

int ldlm_cli_enqueue_local(struct ldlm_namespace *ns,
                           const struct ldlm_res_id *res_id,
                           ldlm_type_t type, ldlm_policy_data_t *policy,
                           ldlm_mode_t mode, int *flags,
                           ldlm_blocking_callback blocking,
                           ldlm_completion_callback completion,
                           ldlm_glimpse_callback glimpse,
                           void *data, __u32 lvb_len, void *lvb_swabber,
                           const __u64 *client_cookie,
                           struct lustre_handle *lockh);
There are many arguments, most of which are used to initialise the lock that is created.

The function starts by checking that the lock is not being replayed and that the supplied namespace is not a client (shadow) namespace. A new lock is created by calling ldlm_lock_create(), passing the required namespace, resource id, lock type and mode, callback functions, opaque AST data and LVB length. A reference to the lock is obtained and its handle is stored into *lockh.

The resource and lock are locked and the lock’s LDLM_FL_LOCAL flag is set. If the LDLM_FL_ATOMIC_CB flag was set in *flags, it is set in the lock. The lock’s swabber is set to lvb_swabber and the lock and resource are unlocked.

If the policy or client_cookie arguments are non-NULL, they are stored into the appropriate lock fields. If the lock’s type is LDLM_EXTENT, the lock’s l_req_extent field is set from the policy’s l_extent field.

The lock is now enqueued by calling ldlm_lock_enqueue(). If policy is non-NULL, *policy is assigned the value of the lock’s l_policy_data field. If the lock’s l_completion_ast function is non-NULL, the function is invoked (it will block until the lock is granted). The lock is released and the function returns.

3.5.4. ldlm_reprocess_all()

This function scans the queues of locks that are waiting on a particular resource and grants any that do not conflict with an existing lock.

void ldlm_reprocess_all(struct ldlm_resource *res);

This function only applies to server-side resources, so if it is invoked on a client-side resource it returns without doing anything.

The resource is locked, ldlm_reprocess_queue() is called to process the resource’s converting and waiting queues, and then the resource is unlocked. For each lock that was granted, an item was added to a work list. The work list is now passed to ldlm_run_ast_work() to actually send the RPC messages to the clients.

3.5.5. ldlm_reprocess_queue()

Scans a queue of locks waiting on a resource and grants any that can be granted.

int ldlm_reprocess_queue(struct ldlm_resource *res, struct list_head *queue,
                         struct list_head *work_list);
The arguments are:
The decision as to whether each lock in the list can be granted is made by the resource’s policy function. This is looked up by resource type using ldlm_processing_policy_table. The policy function is called for each entry in the list and, if it decides that the lock can be granted, it adds the lock to the work list.

3.5.6. ldlm_run_ast_work()

Sends an AST RPC for each item in the supplied work list.

int ldlm_run_ast_work(struct list_head *rpc_list, ldlm_desc_ast_t ast_type);

Arguments are:

First, ptlrpc_prep_set() is called to prepare the set of PTLRPC messages. The type of the callback set and the work function that will create each callback item in the set are determined from the value of ast_type.

The work function is called for each item in the work list. These functions remove the lock from the appropriate queue (l_bl_ast, l_cp_ast or l_rk_ast) and invoke the appropriate callback function (l_blocking_ast, l_completion_ast or, for the third queue, l_blocking_ast again) to build the RPC message and add it to the PTLRPC message set.

If the number of callback messages in the set reaches PARALLEL_AST_LIMIT, the messages are sent and a new set is started. This continues until all the locks have been processed, and then any residual messages in the set are sent.
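The batching behaviour described for ldlm_run_ast_work() (send when the set is full, then send whatever is left) follows a common pattern that can be sketched as below. This is an illustration only: the ex_ names are hypothetical, the limit of 3 is an arbitrary stand-in for PARALLEL_AST_LIMIT, and "sending" is reduced to updating counters.

```c
#include <assert.h>
#include <stddef.h>

#define EX_PARALLEL_LIMIT 3 /* hypothetical stand-in for PARALLEL_AST_LIMIT */

/* Minimal stand-in for a set of pending callback messages. */
struct ex_rpc_set {
    int queued;  /* messages currently waiting in the set */
    int flushes; /* number of times the set has been sent */
    int sent;    /* total number of messages sent so far */
};

/* "Send" everything in the set and start a new, empty set. */
static void ex_set_send(struct ex_rpc_set *set)
{
    set->sent += set->queued;
    set->flushes++;
    set->queued = 0;
}

/* Add one message per work item, sending the set each time it fills up,
 * then send any residual messages once all items have been processed. */
static void ex_run_work(struct ex_rpc_set *set, size_t nr_items)
{
    assert(set != NULL);

    for (size_t i = 0; i < nr_items; i++) {
        set->queued++; /* the work function would build the message here */
        if (set->queued == EX_PARALLEL_LIMIT)
            ex_set_send(set);
    }
    if (set->queued > 0)
        ex_set_send(set);
}
```

Capping the set size bounds how many callbacks are in flight at once, at the cost of an extra round of sends for long work lists.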
3.5.7. Resource policy functions

A resource policy function is called to process a grant request for a single lock instance of a given type.

typedef int (*ldlm_processing_policy)(struct ldlm_lock *lock, int *flags,
int first_enq, ldlm_error_t *err,
struct list_head *work_list);
The arguments are:
A separate resource policy function is defined for each lock type and ldlm_processing_policy_table maps lock types to policy functions:

static ldlm_processing_policy ldlm_processing_policy_table[] = {
[LDLM_PLAIN] ldlm_process_plain_lock,
[LDLM_EXTENT] ldlm_process_extent_lock,
#ifdef __KERNEL__
[LDLM_FLOCK] ldlm_process_flock_lock,
#endif
[LDLM_IBITS] ldlm_process_inodebits_lock,
};
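As a rough illustration of how such a table can be used for dispatch, the following self-contained sketch indexes an array of function pointers by lock type. Everything prefixed demo_ or DEMO_ is hypothetical: the enum values, the single-int policy signature and the tag values returned are assumptions made for the example and do not match the real ldlm_processing_policy typedef.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lock types; the real LDLM_* values are not given here. */
enum demo_lock_type {
    DEMO_PLAIN, DEMO_EXTENT, DEMO_FLOCK, DEMO_IBITS, DEMO_TYPE_COUNT
};

/* Simplified policy signature, assumed for illustration only. */
typedef int (*demo_policy_fn)(int lock_id);

/* Each demo policy returns a distinct tag so the dispatch is observable. */
int demo_process_plain(int lock_id)  { (void)lock_id; return 100; }
int demo_process_extent(int lock_id) { (void)lock_id; return 200; }
int demo_process_ibits(int lock_id)  { (void)lock_id; return 400; }

/* DEMO_FLOCK is left NULL, mimicking a build where it is compiled out. */
static const demo_policy_fn demo_policy_table[DEMO_TYPE_COUNT] = {
    [DEMO_PLAIN]  = demo_process_plain,
    [DEMO_EXTENT] = demo_process_extent,
    [DEMO_IBITS]  = demo_process_ibits,
};

/* Look up the policy for a type and call it; -1 if no policy is defined. */
int demo_dispatch(enum demo_lock_type type, int lock_id)
{
    assert(lock_id >= 0); /* demo ids are non-negative */

    if ((unsigned)type >= (unsigned)DEMO_TYPE_COUNT ||
        demo_policy_table[type] == NULL)
        return -1;
    return demo_policy_table[type](lock_id);
}
```

The sketch checks for a missing entry before calling through the pointer; whether the real code needs such a check depends on which lock types can reach the lookup.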
3.5.7.1. ldlm_process_plain_lock()

The function starts by checking that the resource is locked and its list of converting locks is empty.

If blocking ASTs have already been sent to conflicting locks, ldlm_plain_compat_queue() is called to check that the lock does not conflict with any locks that are already on the resource’s granted and waiting lists. That function ignores any locks that come after the lock being tested. If a conflicting lock is found, the function just returns LDLM_ITER_STOP. If no conflicts are found, the lock is removed from the resource’s skip list and granted by calling ldlm_grant_lock(). If the work_list argument is non-NULL and the lock has a non-NULL l_completion_ast function, the lock’s l_cp_ast field will be linked into the work list. The function then returns LDLM_ITER_CONTINUE.

If blocking ASTs had not already been sent to conflicting locks, the resource’s granted and waiting lists are scanned using ldlm_plain_compat_queue() and, if no conflicting locks are found, the lock is removed from the resource’s skip list and granted by calling ldlm_grant_lock() (the lock is not linked into work_list). The function then returns 0 (ELDLM_OK).

If conflicting locks were found, the lock is added to the resource’s waiting list (if not already there). The resource is unlocked, ldlm_run_ast_work() is called to send a blocking AST to all the conflicting locks and the resource is locked again. If ldlm_run_ast_work() returns -ERESTART, control jumps back to scan the granted and waiting queues again and the blocking ASTs are resent. The LDLM_FL_BLOCK_GRANTED flag is set in *flags. The function then returns 0 (ELDLM_OK).

3.5.7.2. ldlm_plain_compat_queue()

Helper function that scans a queue of plain locks to see if they are compatible (do not conflict) with a given lock that is being tested for eligibility for granting.

static inline int
ldlm_plain_compat_queue(struct list_head *queue, struct ldlm_lock *req,
struct list_head *work_list);
Arguments are:
The function starts by setting the result value, compat, to 1 (true). It then verifies that the requested lock mode (ldlm_mode_t) is sensible.

The remainder of the function is a loop over the locks in the queue. If the queue member is the requested lock (req), the function returns the value of compat. Otherwise, the loop variable (tmp) is modified to address the last lock in the mode group. The requested lock mode is tested for compatibility with the mode of the queue member (and, therefore, with all the locks in the mode group) and, if compatible, the loop continues to look at the first lock in the next mode group.

Seems to me that there is a possible bug here in that it only tests to see whether req is the first lock in a mode group and it doesn’t check to see if it appears later in the mode group?

At this point, an incompatible lock has been found in the queue. If work_list is NULL, nothing more is to be done and the function just returns 0 to indicate that the lock is not compatible. Otherwise, compat is set to 0 and the function loops over the tested lock and all of the other locks in the same mode group, and those that have a non-NULL l_blocking_ast function are added to work_list so that they can be sent a blocking AST. The loop continues to look at the first lock in the next mode group.

3.5.7.3. ldlm_process_extent_lock()

The function starts by checking that the resource is locked, that the list of converting locks is empty, and that either the requested flags do not have LDLM_FL_DENY_ON_CONTENTION set or the LDLM_AST_DISCARD_DATA flag is not set for the grant candidate.

If blocking ASTs have already been sent to conflicting locks, ldlm_extent_compat_queue() is called to check that the lock does not conflict with any locks that are already on the resource’s granted and waiting lists. That function ignores any locks that come after the lock being tested. If a conflicting lock is found, the function just returns LDLM_ITER_STOP.
If no conflicts are found, the lock is removed from the resource’s skip list and ldlm_extent_policy() is called to extend the lock’s extent as much as possible without creating a conflict with any lock in the resource’s granted or waiting lists. The lock is granted by calling ldlm_grant_lock(). If the work_list argument is non-NULL and the lock has a non-NULL l_completion_ast function, the lock’s l_cp_ast field will be linked into the work list. The function then returns LDLM_ITER_CONTINUE.

If blocking ASTs had not already been sent to conflicting locks, the resource’s granted and waiting lists are scanned using ldlm_extent_compat_queue() and, if no conflicting locks are found, the lock’s extent is expanded through a call to ldlm_extent_policy(), the lock is removed from the resource’s skip list and granted by calling ldlm_grant_lock() (the lock is not linked into work_list). The function then returns 0 (ELDLM_OK).

If conflicting locks were found, the lock is added to the resource’s waiting list (if not already there). The resource is unlocked, and ldlm_run_ast_work() is called to send a blocking AST to all the conflicting locks.

Some OBD_FAIL check here that I don’t understand that calls class_fail_export() if it fails.

The resource is locked again. If ldlm_run_ast_work() returns -ERESTART, control jumps back to scan the granted and waiting queues again and the blocking ASTs are resent. Before it does so, checks are made to see if the lock has been destroyed or granted during the time the resource was unlocked and, if so, any AST work list items are discarded and the function returns. Otherwise, the LDLM_FL_BLOCK_GRANTED and LDLM_FL_NO_TIMEOUT flags are set in *flags. The function then returns 0 (ELDLM_OK).

3.5.7.4. ldlm_extent_compat_queue()

Helper function that scans a queue of extent locks to see if they are compatible (do not conflict) with a given lock that is being tested for eligibility for granting.

static int
ldlm_extent_compat_queue(struct list_head *queue, struct ldlm_lock *req,
int *flags, ldlm_error_t *err,
struct list_head *work_list, int *contended_locks)
Arguments are:
The function starts by setting the result value, compat, to 1. It then verifies that the requested lock mode (ldlm_mode_t) is sensible. The bulk of the function is split into two halves: one half is used when the granted queue is being tested, the other half when the waiting queue is being tested.

Granted locks compatibility testing

When testing for compatibility with granted locks, the code loops over the lock modes (EX, PW, PR, etc.). For each lock mode, if the resource does not have an interval tree for that mode, there is no conflict and the loop continues to look at the next lock mode.

If the resource does have an interval tree for that mode, the mode is compatible with the requested mode, the requested mode is LCK_GROUP and the request’s group id (gid) is the same as the interval tree’s group id, the function returns the special value of 2 to indicate that the requested group lock is OK. If the requested mode was compatible but not LCK_GROUP, the loop continues on to look at the next lock mode.

At this point, the requested lock has been determined to be incompatible with the lock mode being tested against. If that mode is LCK_GROUP and *flags has LDLM_FL_BLOCK_NOWAIT set (meaning, don’t wait if blocked), compat (the function’s return value) is assigned -EWOULDBLOCK and control jumps to code that destroys the (candidate) lock. If that flag is not set, LDLM_FL_NO_TIMEOUT is set in *flags to indicate that the candidate lock is blocked by a group lock with no timeout. If work_list is NULL, the function returns 0 to indicate conflict; otherwise the locks in the interval tree that have a non-NULL l_blocking_ast function are added to the work list and the loop continues on to look at the next lock mode.

If the lock mode being tested against is not LCK_GROUP and work_list is NULL, interval_is_overlapped() is called to check if the requested extent overlaps any existing extents in the tree. If so, the function returns 0 to indicate conflict.
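The overlap test at the heart of this step can be sketched as follows. This is a simplified stand-in, not the interval tree code: it assumes extents are inclusive [start, end] ranges and replaces the tree lookup with a linear scan over an array. struct demo_extent, demo_extents_overlap() and demo_any_overlap() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent with inclusive bounds (an assumption of this sketch). */
struct demo_extent {
    uint64_t start;
    uint64_t end;
};

/* Two inclusive ranges overlap when each one starts before the other ends. */
int demo_extents_overlap(const struct demo_extent *a,
                         const struct demo_extent *b)
{
    assert(a->start <= a->end && b->start <= b->end);
    return a->start <= b->end && b->start <= a->end;
}

/*
 * Linear scan standing in for an interval tree query: returns 1 if the
 * requested extent overlaps any of the n granted extents, 0 otherwise.
 */
int demo_any_overlap(const struct demo_extent *granted, size_t n,
                     const struct demo_extent *req)
{
    for (size_t i = 0; i < n; i++) {
        if (demo_extents_overlap(&granted[i], req))
            return 1;
    }
    return 0;
}
```

An interval tree answers the same question without visiting every granted extent, which matters when a resource carries many locks.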
When work_list is non-NULL, interval_search() is called to find those locks in the interval tree whose extents conflict with the requested extent and, if they have a non-NULL l_blocking_ast function, they are added to the work list. If work_list is not empty, compat is set to 0. That’s the end of the loop.

Waiting locks compatibility testing

Loops over the locks in the waiting queue. A flag, check_contention, is initialised to 1. If the waiting lock is the requested lock, the loop terminates.

If the scan flag has been set (by code lower down in the loop on a previous iteration), it means that the candidate lock has a mode of LCK_GROUP and on a previous iteration an incompatible lock was found. In that case, if the current lock being tested is not a group lock (which means no more group locks are in the queue), the candidate lock is inserted into the waiting list before it, compat is set to 0 to indicate conflict and the loop terminates. If the lock being tested is a group lock and it has the same gid value as the candidate lock, the candidate lock is inserted into the waiting list after it, compat is set to 0 to indicate conflict and the loop terminates. Otherwise, the loop continues to look at the next lock in the waiting list.

If scan was not set, the requested lock mode is tested for compatibility with the current lock’s mode. If the modes are compatible, it doesn’t matter if the extents overlap. If the mode is PR (protected read), the current lock’s extent totally encloses (or is the same as) the candidate lock’s extent and the current lock has not yet been sent a blocking AST, the function returns the value of compat. If the candidate lock is not a group lock, the loop continues, as overlaps do not matter when the modes are compatible.
For group locks with compatible modes, if the candidate lock has the same gid value as the current lock and that lock has already been granted, the function returns a value of 2 to indicate that the requested group lock is OK. If the current lock has not been granted and *flags has LDLM_FL_BLOCK_NOWAIT set, compat is set to -EWOULDBLOCK and control jumps to the code that destroys the (candidate) lock. Otherwise, the candidate lock is added to the waiting queue after the current lock and the function returns 0 to indicate conflict.

At this point, the requested lock mode is possibly incompatible with the current lock’s mode. If the requested lock mode is LCK_GROUP and the current lock has not been granted, scan is set to 1 and compat is set to 0. If the current lock’s mode is not LCK_GROUP, that means that no more group locks follow in the queue and so the candidate lock is added into the queue before the current lock and the loop terminates. If the current lock is a group lock and it has the same gid value as the candidate lock, the candidate lock is added to the queue after the current lock and the loop terminates. Otherwise, the loop continues.

If the current lock’s mode is LCK_GROUP, the candidate lock is not compatible and (as before) either the lock is destroyed (if *flags has LDLM_FL_BLOCK_NOWAIT set) or LDLM_FL_NO_TIMEOUT is set in *flags. If the current lock’s mode is not LCK_GROUP and its extent does not overlap the candidate lock’s extent, the loop continues to look at the next lock. If the current lock’s originally requested extent (l_req_extent) does not overlap the candidate lock’s extent, check_contention is zeroed to indicate that the lock doesn’t really contend. If work_list is NULL, the function returns 0 to indicate that a conflicting lock has been found. If the current lock is a glimpse lock, check_contention is set to 0.
The value of check_contention is added to *contended_locks (passed back to the caller) and compat is set to zero to indicate that a conflicting lock has been found. If the current lock has a non-NULL l_blocking_ast function, it is added to work_list. The loop continues on to look at the next lock in the waiting queue.

Common code following queue scanning loops

After the queues have been scanned, ldlm_check_contention() is called to determine if the resource is currently "contended" (i.e. the number of contended locks exceeds ns_contended_locks). If so, the resource’s lr_contention_time field is set to the current time. The function returns non-zero if lr_contention_time plus the namespace’s ns_contention_time is in the future, i.e. the resource is currently contended.

If the resource is contended, the LDLM_FL_DENY_ON_CONTENTION flag is set, the requested mode is not LCK_GROUP and the length of the requested extent is not greater than the limit set in the namespace’s ns_max_nolock_size field, the lock is destroyed and -EUSERS is returned.

At the end of the function is the code to destroy the lock and return the value of compat.

3.5.7.5. ldlm_process_flock_lock()

FIXME

3.5.7.6. ldlm_process_inodebits_lock()

This function is identical to ldlm_process_plain_lock() except that it calls ldlm_inodebits_compat_queue() rather than ldlm_plain_compat_queue().

3.5.7.7. ldlm_inodebits_compat_queue()

Helper function that scans a queue of inodebits locks to see if they are compatible (do not conflict) with a given lock that is being tested for eligibility for granting.

static inline int
ldlm_inodebits_compat_queue(struct list_head *queue, struct ldlm_lock *req,
struct list_head *work_list);
Arguments are:
The function starts by setting the result value, compat, to 1 (true). The requested inodebits are retrieved from the request’s l_policy_data field (l_inodebits), stored in local req_bits and checked to see that some bits are set. It then verifies that the requested lock mode (ldlm_mode_t) is sensible.

The remainder of the function is a loop over the locks in the queue. The queued locks are grouped by mode and, within the mode groups, grouped by policy. If the queue member is the requested lock (req), the function returns the value of compat. Otherwise, if the requested lock mode is compatible with the mode of the queue member (and, therefore, with all the locks in the mode group), the loop variable (tmp) is modified to address the last lock in the mode group and the loop then continues to look at the first lock in the next mode group.

If the requested lock’s mode is not compatible, more checks are done. If the queued lock’s mode is COS, a check is done to see that the requested lock and the queued lock are from the same client (their l_client_cookie fields are identical). If this test fails, the function will return 0 (not compatible) if work_list is NULL or, if work_list is non-NULL, compat is zeroed and the conflicting lock will be added to work_list if it has a non-NULL l_blocking_ast function (and the mode loop continues).

Next, the requested lock is tested for inodebit compatibility with all the locks in the mode group. In each mode group, the locks are grouped by policy (locks with the same policy have the same inodebits), so req_bits is ANDed with the l_inodebits value of one of the locks in the policy group and, if the result is non-zero, there is a conflict. In that case, zero is returned if work_list is NULL; otherwise compat is set to zero and all the locks in the policy group that have a non-NULL l_blocking_ast function are added to work_list. This testing of inodebit compatibility is repeated for each of the policy groups in the mode group.
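The two-stage test (mode first, then inodebits) can be sketched for a single pair of locks. The mode table below is the six-mode compatibility matrix given earlier in this document; everything else is an assumption for illustration: the demo_ names are hypothetical, the COS check and the mode and policy grouping are omitted, and the bit values used in examples are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* The six modes from the compatibility matrix; enum values are hypothetical. */
enum demo_mode {
    DEMO_NL, DEMO_CR, DEMO_CW, DEMO_PR, DEMO_PW, DEMO_EX, DEMO_MODE_COUNT
};

/* Compatibility matrix as given in the lock mode section (symmetrical). */
static const unsigned char demo_compat[DEMO_MODE_COUNT][DEMO_MODE_COUNT] = {
    /*         NL CR CW PR PW EX */
    /* NL */ { 1, 1, 1, 1, 1, 1 },
    /* CR */ { 1, 1, 1, 1, 1, 0 },
    /* CW */ { 1, 1, 1, 0, 0, 0 },
    /* PR */ { 1, 1, 0, 1, 0, 0 },
    /* PW */ { 1, 1, 0, 0, 0, 0 },
    /* EX */ { 1, 0, 0, 0, 0, 0 },
};

/*
 * Returns 1 if the requested lock is compatible with one queued lock.
 * Compatible modes never conflict; incompatible modes conflict only when
 * the two inodebits masks share at least one bit.
 */
int demo_ibits_compatible(enum demo_mode req_mode, uint64_t req_bits,
                          enum demo_mode queued_mode, uint64_t queued_bits)
{
    assert(req_bits != 0); /* some bits must be requested */

    if (demo_compat[queued_mode][req_mode])
        return 1;
    return (req_bits & queued_bits) == 0;
}
```

Under these assumptions, two PW locks on the same inode are compatible as long as they protect disjoint bits, which is the point of having an inodebits lock type rather than a plain lock.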
Once all the policy groups in a mode group have been tested, the loop continues to look at the first lock in the next mode group. Finally, the value of compat is returned.

3.6. Lock enqueuing - client and server-side

3.6.1. ldlm_lock_enqueue()

This function enqueues a lock into either the granted, converting or waiting lists. It may be executed on either the client or server sides and its behaviour differs depending on where it’s being executed.

ldlm_error_t ldlm_lock_enqueue(struct ldlm_namespace *ns,
struct ldlm_lock **lockp,
void *cookie, int *flags);
The arguments (all passed by reference) are:
Firstly, the lock’s l_last_activity value is updated to the current time.

If the lock is not being replayed or the function is being executed on the server, and the namespace has a policy function, the lock processing is delegated to the namespace’s policy function. If that function’s return code indicates that the lock has been replaced (and returned to the caller by *lockp) and it’s different from the original lock, then the original lock is destroyed, LDLM_FL_LOCK_CHANGED is set in the caller’s flags and the function returns. If the policy function returned an error code or the flags had LDLM_FL_INTENT_ONLY set (which means don’t grant the lock, just do the intent processing), the lock is destroyed and the function returns.

Next, if this function is replaying on the server and the resource type is LDLM_EXTENT, the interval node is allocated here for later use. The resource and lock are locked. If the function is executing on the client and the lock’s granted mode is equal to its requested mode, then the lock has already been granted, so nothing else needs to be done apart from clearing the various LDLM_FL_BLOCK_* flags and returning.

A call to ldlm_resource_unlink_lock() removes the lock from various lists depending on the lock’s type. If the lock is an extent lock and its l_tree_node member is NULL, the previously allocated interval node is attached to the lock.

This doesn’t appear to be dependent on !local as the allocation was above so node could be NULL?

Next, some flags from the reply (masked by LDLM_AST_DISCARD_DATA) are ORed into the lock’s flags. What happens next depends on where this function is being executed:
Control only gets to this point when executing on the server and not replaying. The policy function for the resource type is looked up and invoked:

policy = ldlm_processing_policy_table[res->lr_type];
policy(lock, flags, 1, &rc, NULL);
The resource and lock are unlocked. If the interval node allocated earlier wasn’t actually used, it is now freed. The function returns.

3.6.2. ldlm_grant_lock()

This function adds a lock to the resource’s granted list and updates the lock’s mode.

void ldlm_grant_lock(struct ldlm_lock *lock, struct list_head *work_list);

After checking that the lock is spinlocked, it updates the lock’s granted mode (l_granted_mode) to be the same as the requested mode (l_req_mode). What happens next depends on the lock’s type:
If the lock is the most restrictive lock (as determined by its mode) that has been applied to the resource, the resource’s lr_most_restr field is updated. If the work_list argument is non-NULL and the lock’s l_completion_ast field is non-NULL, the lock is added to the AST work list. Finally, the lock is added to the namespace’s pool with a call to ldlm_pool_add().

Oddly, that function doesn’t appear to save the lock anywhere, how does it work?

3.7. Lock cancelling - client-side

Lock cancellation on the client side starts with a call to ldlm_cli_cancel(), which is passed a handle to the lock that is to be cancelled. Clients try to keep locks as long as possible and normally only relinquish them on receipt of a blocking AST.

3.7.1. ldlm_cli_cancel()

Client-side lock cancel. When called, the lock must no longer have any readers or writers.

int ldlm_cli_cancel(struct lustre_handle *lockh);

Takes a single argument which is the handle for the lock to be cancelled. The function starts by calling ldlm_handle2lock_long() to atomically obtain a pointer to the lock and set the LDLM_FL_CANCELING flag. If that flag is already set in the lock, it means that the lock is already being destroyed, so this function just returns 0.

Next, ldlm_cli_cancel_local() is called to cancel the lock locally. If that function returns LDLM_FL_LOCAL_ONLY (when it’s a server-side lock or a client-side lock that has the LDLM_FL_LOCAL_ONLY or LDLM_FL_CANCEL_ON_BLOCK flag set), the lock is released and the function returns as nothing more needs to be done.

The lock is added as the first element in a local list of locks to be cancelled on the server. If the lock’s server is capable of handling a set of cancel requests, the namespace is scanned (by ldlm_cancel_lru_local()) for further candidates to be cancelled and they are added to the list. The list is processed by ldlm_cli_cancel_list(), which will send a cancel request containing the handles of the locks to be cancelled to the server.

3.7.2. ldlm_cli_cancel_local()

Cancels a lock locally.

static int ldlm_cli_cancel_local(struct ldlm_lock *lock)

If lock is a client-side lock (in a shadow namespace), it is locked and its LDLM_FL_CBPENDING flag is set to indicate that the lock is being destroyed. If the lock has either the LDLM_FL_LOCAL_ONLY or LDLM_FL_CANCEL_ON_BLOCK flag set, this function will return LDLM_FL_LOCAL_ONLY to indicate that an RPC does not need to be sent to the server. Next, ldlm_cancel_callback() is called to invoke the cancellation callback, l_blocking_ast (if it is non-NULL), passing the LDLM_CB_CANCELING flag to indicate that the lock is being cancelled. It also sets the LDLM_FL_BL_DONE flag to indicate that the lock cache has been dropped. Back in ldlm_cli_cancel_local(), the return code is set to LDLM_FL_BL_AST if that flag is set in the lock, and to LDLM_FL_CANCELING otherwise. The lock is unlocked and ldlm_lock_cancel() is called.

If the lock is a server-side lock, then ldlm_lock_cancel() is called and, when that returns, ldlm_reprocess_all() is called to process the resource’s waiting and converting queues and the function returns LDLM_FL_LOCAL_ONLY to indicate that an RPC does not need to be sent.

3.7.3. ldlm_lock_cancel()

Cancel a lock in the local namespace that has no readers or writers.

void ldlm_lock_cancel(struct ldlm_lock *lock);

The resource and lock are locked and a check is made to see that the lock has no readers or writers. The lock is removed from the waiting locks list by calling ldlm_del_waiting_lock(). ldlm_cancel_callback() is called to invoke the cancellation callback (if that hasn’t been done already). ldlm_del_waiting_lock() is called for a second time in case the lock had been added to the waiting locks list while ldlm_cancel_callback() was executing. ldlm_resource_unlink_lock() is called to remove the lock from the resource’s skiplists (when the type is LDLM_IBITS or LDLM_PLAIN) or to remove it from the resource’s interval tree if the type is LDLM_EXTENT.
The lock is also removed from the resource’s list of locks (field l_res_link). ldlm_lock_destroy_nolock() is called to destroy the lock (via a call to ldlm_lock_destroy_internal()). That function sets the l_destroyed field to 1 to indicate that the lock is destroyed and that it will be freed when the last reference to it goes away. It also removes the lock from its export lock hash (if appropriate), removes the lock from the namespace’s unused (LRU) list and disassociates the lock from its hash value. If the lock was granted, it is removed from the namespace’s pool by calling ldlm_pool_del(). Finally, the lock’s l_granted_mode is set to LCK_MINMODE and the lock and resource are unlocked.

3.7.4. ldlm_cli_cancel_list()

Client-side function that either packs lock handles into a request buffer (if supplied) or sends a batched lock cancel RPC to the server.

int ldlm_cli_cancel_list(struct list_head *cancels, int count,
struct ptlrpc_request *req, int flags);
Its arguments are:
Note that the comment in the code mentions another argument, off, which doesn’t exist.

The function loops around while the list of locks to be cancelled, cancels, still contains entries. At the top of the loop, the first lock in the list is obtained and, if the lock’s server is capable of handling a set of cancel requests, either ldlm_cancel_pack() is called to pack lock handles into the supplied req or ldlm_cli_cancel_req() is called to prepare and send a batched cancel RPC to the server. The flags argument is passed to that function and, if LDLM_FL_ASYNC is set, the function will queue the RPC for sending and return immediately. Otherwise, the function waits for a response. Up to count handles are batched together. If the lock’s server doesn’t support batched cancel requests, ldlm_cli_cancel_req() is called to process just a single lock handle. The value of count is decremented by the number of locks processed, those locks are removed from cancels and the loop continues.

3.8. Lock cancelling - server-side

As part of the initialisation carried out by ldlm_setup(), a PTLRPC service (type ptlrpc_service) is created to service lock cancellation requests. That service’s receive handler is ldlm_cancel_handler() and it invokes ldlm_handle_cancel() to process the requests.

3.8.1. ldlm_handle_cancel()

Main server-side entry point for processing lock cancellation requests.

int ldlm_handle_cancel(struct ptlrpc_request *req);

The function is passed the PTLRPC request, req, that contains the handles of the locks to be cancelled. The function starts by calling req_capsule_client_get() to obtain the request in the form of an ldlm_request:

If appropriate, procfs counters are updated. req_capsule_server_pack() is called to pack the server’s reply.

??? Don’t understand what it’s packing here because we haven’t processed the cancel requests yet.

ldlm_request_cancel() is now called to cancel the locks and ptlrpc_reply() is called to send the reply to the client.

3.8.2. ldlm_request_cancel()

Cancel all the locks whose handles are in an ldlm_request.

int ldlm_request_cancel(struct ptlrpc_request *req,
const struct ldlm_request *dlm_req, int first);
Arguments are:
The number of lock handles to process, count, is determined and if that is less than first, the function returns. If replaying, nothing needs to be done, so the function returns.

For each of the count locks to process, ldlm_handle2lock() is called to obtain a reference to the lock to process and the lock’s resource is stored in local res. Another local variable, pres, is the resource used in the last iteration of the loop (NULL initially). If res is different from pres (the locks have different resources) and if pres is not NULL, ldlm_reprocess_all() is called to process the resource’s waiting and converting queues. If res is not NULL (it never could be?), ldlm_res_lvbo_update() is called to invoke the namespace’s lvbo_update function (if it has one defined). The lock is cancelled locally by calling ldlm_lock_cancel(). The loop continues to look at the next lock handle.

After the loop, if pres is not NULL, ldlm_reprocess_all() is called to process the resource’s waiting and converting queues. The function returns.
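The effect of tracking pres is that a resource is reprocessed once per run of consecutive locks on that resource, rather than once per lock. A minimal sketch of that pattern follows, assuming each lock is represented only by an integer resource id. demo_count_reprocess() is a hypothetical name and the sketch merely counts the reprocess passes instead of performing them.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk `count` locks, each identified by the id of its resource.
 * A reprocess pass is counted whenever the resource changes from one lock
 * to the next, plus one final pass after the loop for the last resource.
 */
size_t demo_count_reprocess(const int *res_ids, size_t count)
{
    size_t passes = 0;
    int have_prev = 0; /* plays the role of "pres is not NULL" */
    int prev = 0;      /* plays the role of pres */

    assert(res_ids != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (!have_prev || res_ids[i] != prev) {
            if (have_prev)
                passes++;  /* reprocess the previous resource */
            prev = res_ids[i];
            have_prev = 1;
        }
        /* the lock itself would be cancelled locally here */
    }
    if (have_prev)
        passes++;          /* reprocess the last resource seen */
    return passes;
}
```

For the sequence of resource ids 7, 7, 9, 9, 9, 7 this counts three passes: one when the run of 7s ends, one when the run of 9s ends, and one after the loop. Note that a resource appearing in two separate runs is reprocessed twice, so the saving depends on the handles being grouped by resource.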