
1. Introduction

The Lustre Distributed Lock Manager (LDLM) provides a means to ensure that data is updated in a consistent fashion across multiple nodes. Locks provide two vital capabilities for a distributed file system:

synchronises data access

access to the locked data is synchronised to avoid clashes and corruption when multiple clients wish to access the same data concurrently

allows data to be cached

if a client holds a lock for some data, it may cache that data locally to avoid network traffic

1.1. Resources

A Lustre lock protects a resource. Most commonly, a resource is a file.

At this time, resources can be organised in a parent/child hierarchy, although comments in the code suggest this is due to change.

1.2. Namespaces

Resources are scoped within a namespace. As resources are protected by locks, the locks are also scoped within the same namespaces and we can talk about both a namespace’s resources and a namespace’s locks. Namespaces may either be completely local to a specific client or global in the sense that the namespace is "owned" by some server and the server’s clients each have a shadow namespace that only contains those locks from the server’s namespace that are of interest to a particular client.

The following diagram shows how the locks in a server namespace can be referenced from client (shadow) namespaces. Client 1’s shadow namespace contains Lock A and Lock B and Client 2’s shadow namespace contains just Lock C.

img/ldlm/lustre_nsclient.png
Figure 1: Shadow namespaces on clients.

1.3. Lock type and mode

Locks have a type and a mode. The six possible modes are:

Exclusive mode (EX)

Before a new file is created, MDS requests EX lock on the parent directory.

Protective Write (normal write) mode (PW)

When a client requests a write lock from an OST, a lock with PW mode will be issued.

Protective Read (normal read) mode (PR)

When a client requests a read from an OST, a lock with PR mode will be issued. Also, if the client opens a file for execution, it is granted a lock with PR mode.

Concurrent Write mode (CW)

The type of lock that the MDS grants if a client requests a write lock during a file open operation.

Concurrent Read mode (CR)

When a client performs a path lookup, MDS grants an inode bit lock (see below) with the CR mode on the intermediate path component.

Null mode (NL)

Compatible with all other modes (see the matrix below); it conveys no access rights and appears to serve only as a placeholder.

The lock compatibility matrix below shows, for a resource that has already been locked in a given mode, which modes a new lock may have for it to be granted.

                                 NL  CR  CW  PR  PW  EX
                             -----------------------------
                            NL    1   1   1   1   1   1
                            -----------------------------
                            CR    1   1   1   1   1   0
                            -----------------------------
                            CW    1   1   1   0   0   0
                            -----------------------------
                            PR    1   1   0   1   0   0
                            -----------------------------
                            PW    1   1   0   0   0   0
                            -----------------------------
                            EX    1   0   0   0   0   0
                            -----------------------------

You can see that the matrix is symmetrical, so it can be read either way. For example, if the resource has been locked with PW (protective write) mode, looking down the PW column shows that the only allowed modes for a new lock are NL (null) and CR (concurrent read).
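
In the code, a matrix like this is typically encoded as bitmasks over the single-bit lock mode values (the ldlm_mode_t enumeration is shown in section 2.3). The following sketch shows one way to express the matrix above and test compatibility; the names lck_compat_array and lockmode_compat() follow the Lustre source, but the exact definitions here are an illustrative reconstruction:

/* The single-bit mode values (from ldlm_mode_t, section 2.3):
 * LCK_EX=1, LCK_PW=2, LCK_PR=4, LCK_CW=8, LCK_CR=16, LCK_NL=32.
 * Each array entry holds the set of modes compatible with the index mode. */
static const int lck_compat_array[] = {
        [LCK_EX] = LCK_NL,
        [LCK_PW] = LCK_NL | LCK_CR,
        [LCK_PR] = LCK_NL | LCK_CR | LCK_PR,
        [LCK_CW] = LCK_NL | LCK_CR | LCK_CW,
        [LCK_CR] = LCK_NL | LCK_CR | LCK_CW | LCK_PR | LCK_PW,
        [LCK_NL] = LCK_NL | LCK_CR | LCK_CW | LCK_PR | LCK_PW | LCK_EX,
};

/* Non-zero iff a new lock of mode new_mode can coexist with an existing
 * lock of mode exist_mode on the same resource. */
static inline int lockmode_compat(int exist_mode, int new_mode)
{
        return lck_compat_array[exist_mode] & new_mode;
}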

In Lustre, four types of locks are defined, each used for a different purpose, and it is up to the client to decide which type it requests. Here, the "client" is whatever component requests a lock from the lock manager; it may be Lustre Lite, the OSC or the MDC. The four types of locks are:

extent lock

for protecting OST data

flock

for supporting a user space request on flock

inode bit lock

for protecting metadata attributes

plain lock

defined but not in use; it can be ignored.

1.4. Callbacks

Callback functions are a means whereby information/state related to a given lock can be passed back to the client from the locking subsystem. As part of the lock request process, the client supplies the callback function pointers. Three types of callback function are used:

blocking callback

this is a request to the client to drop a lock that it currently holds. There are two conditions where this callback will be invoked for a lock held by a client:

  • if some client requests a lock that conflicts with this one (even if the request comes from the same client)

  • when a lock is revoked (after all references went away and the lock was cancelled)

completion callback

invoked when either the requested lock is granted or if the lock is successfully converted (for example, to a different mode)

glimpse callback

the glimpse callback is used to provide certain information about underlying (file) properties without actually releasing the lock. For example, an OST can provide such a callback to provide information on file size since only the OSTs know exactly the size of a file object. So this callback is supplied for an extent lock on the file object and is invoked by the server upon receiving the client request.

1.5. Intent

An intent can be specified by the client when a lock is enqueued to request some special processing during the enqueue. The intent contains parameters that are passed to the intent handler associated with the lock’s namespace.

The intent feature allows the client to indicate what they are intending to accomplish (when the lock is granted). This knowledge can be used by the locking subsystem to reduce the number of RPC calls required to achieve the client’s intention.

Six intention operations are defined. These are get attributes, set attributes, insert or delete, open, create, and read links.

2. Data Structures

The major data structures used by the LDLM are:

2.1. ldlm_namespace

Instances of struct ldlm_namespace represent a resource namespace. Each server has a local namespace that contains its resources. Each client that is connected to the server will have a shadow namespace that can contain client-local versions of the server’s resources.

/** LDLM Namespace.
 *
 * Namespace serves to contain locks related to a particular service.
 * There are two kinds of namespaces:
 * - Local namespace has knowledge of all locks and is therefore authoritative
 *   to make decisions like what locks could be granted and what conflicts
 *   exist during new lock enqueue.
 * - Client namespace only has limited knowledge about locks in the namespace,
 *   only seeing locks held by the client.
 *
 * Every Lustre service has one local namespace present on the server serving
 * that service. Every client connected to the service has a client namespace
 * for it.
 * Every lock obtained by a client in that namespace is actually represented
 * by two in-memory locks: one on the server and one on the client. The locks
 * are linked by a special cookie, by which one node can tell the other which
 * lock it actually means during communication.
 */
struct ldlm_namespace {
        /**
         * Namespace name. Used for logging, etc.
         */
        char                  *ns_name;

        /**
         * Is this a client-side lock tree?
         */
        ldlm_side_t            ns_client;

        /**
         * Namespace connect flags supported by server (may be changed via proc,
         * lru resize may be disabled/enabled).
         */
        __u64                  ns_connect_flags;

         /**
          * Client side orig connect flags supported by server.
          */
        __u64                  ns_orig_connect_flags;

        /**
         * Hash table for namespace.
         */
        struct list_head      *ns_hash;
        spinlock_t             ns_hash_lock;

         /**
          * Count of resources in the hash.
          */
        __u32                  ns_refcount;

         /**
          * All root resources in namespace.
          */
        struct list_head       ns_root_list;

        /**
         * Position in global namespace list.
         */
        struct list_head       ns_list_chain;

        /**
          * LRU list of unused locks in this namespace.
         */
        struct list_head       ns_unused_list;
        int                    ns_nr_unused;
        spinlock_t             ns_unused_lock;

        unsigned int           ns_max_unused;
        unsigned int           ns_max_age;
        unsigned int           ns_timeouts;
         /**
          * Seconds.
          */
        unsigned int           ns_ctime_age_limit;

        /**
         * Next debug dump, jiffies.
         */
        cfs_time_t             ns_next_dump;

        atomic_t               ns_locks;
        __u64                  ns_resources;

        /** "policy" function that does actually conflict determination */
        ldlm_res_policy        ns_policy;

        struct ldlm_valblock_ops *ns_lvbo;
        void                  *ns_lvbp;
        cfs_waitq_t            ns_waitq;
        struct ldlm_pool       ns_pool;
        ldlm_appetite_t        ns_appetite;

        /**
         * If more than \a ns_contended_locks found, the resource is considered
         * to be contended.
         */
        unsigned               ns_contended_locks;

        /**
         * The resource remembers contended state during \a ns_contention_time,
         * in seconds.
         */
        unsigned               ns_contention_time;

        /**
         * Limit size of nolock requests, in bytes.
         */
        unsigned               ns_max_nolock_size;

        /**
         * Backward link to obd, required for ldlm pool to store new SLV.
         */
        struct obd_device     *ns_obd;

        struct adaptive_timeout ns_at_estimate;/* estimated lock callback time*/
};

The fields are:

ns_name

the namespace’s name - used for debugging

ns_client

a value that indicates whether the namespace is located on the client or the server - one of:

typedef enum {
        LDLM_NAMESPACE_SERVER = 1 << 0,
        LDLM_NAMESPACE_CLIENT = 1 << 1
} ldlm_side_t;
ns_connect_flags

currently, the only flag defined for this is OBD_CONNECT_LRU_RESIZE which can be changed using procfs

ns_orig_connect_flags

a copy of ns_connect_flags that is used to check if a flag that has been changed was originally asserted

ns_hash

a hash table that holds all of the namespace’s resources

ns_hash_lock

a spinlock that protects the hash table

ns_refcount

the number of resources in the namespace's hash table (despite the name, the code comment says it counts resources, not references)

ns_root_list

a list of the namespace’s parent resources

ns_list_chain

chains the namespace into a global list

ns_unused_list

head of an LRU list of unused locks in this namespace

ns_nr_unused

the number of unused locks in this namespace

ns_unused_lock

a spinlock that protects access to the unused locks list

ns_max_unused

maximum number of entries in LRU unused locks list?

ns_max_age

maximum age of unused lock in LRU list?

ns_timeouts

the number of locks in this namespace that have timed out

ns_ctime_age_limit

a time period (in seconds) that modifies the sub-type of the inode lock taken by mdt_getattr_name_lock() depending on whether the file has been changed in that time or not

ns_next_dump

the number of jiffies until the next debug dump

ns_locks

the number of locks in this namespace

ns_resources

the number of resources in this namespace

ns_policy

the policy function for this namespace - its type is ldlm_res_policy:

typedef int (*ldlm_res_policy)(struct ldlm_namespace *, struct ldlm_lock **,
                               void *req_cookie, ldlm_mode_t mode, int flags,
                               void *data);

this is simply set by the server that created the namespace to point at the server-specific policy function

ns_lvbo

a pointer to a ldlm_valblock_ops:

/** LVB operations.
 * LVB is LDLM Value Block. This is a special opaque (to LDLM) value that could
 * be associated with an LDLM lock and transferred from client to server and
 * back.
 *
 * Currently LVBs are used by OSC-OST code to maintain current object size/times
 */
struct ldlm_valblock_ops {
        int (*lvbo_init)(struct ldlm_resource *res);
        int (*lvbo_update)(struct ldlm_resource *res,
                           struct ptlrpc_request *r,
                           int buf_idx, int increase);
};
ns_lvbp

pointer to the opaque LVB data that is passed between client and server (shared between all resources/locks in the namespace?)

ns_waitq

a queue of threads waiting for ns_refcount to become 0

ns_pool

the namespace’s lock pool (of type ldlm_pool)

ns_appetite

whether the namespace’s lock pool is "modest" or greedy - one of ldlm_appetite_t:

typedef enum {
        LDLM_NAMESPACE_GREEDY = 1 << 0,
        LDLM_NAMESPACE_MODEST = 1 << 1
} ldlm_appetite_t;
ns_contended_locks

if a resource (in this namespace) has more than this number of contended locks, it is considered to be "contended" (default value is NS_DEFAULT_CONTENDED_LOCKS)

ns_contention_time

if a resource (in this namespace) becomes contended, that state is remembered for this number of seconds (default value is NS_DEFAULT_CONTENTION_SECONDS)

ns_max_nolock_size

the maximum size (in bytes) of nolock? requests (default value is NS_DEFAULT_MAX_NOLOCK_BYTES)

ns_obd

a backward link to the namespace’s obd_device (used by pool code and mdt_blocking_ast())

ns_at_estimate

an adaptive timeout estimate of the lock callback time

2.2. ldlm_resource

Each lockable object (resource) is represented by an instance of struct ldlm_resource. The resources are named by 4 64-bit integers and the mapping between the object identity and the resource name is done by the user of the LDLM api.

/** LDLM resource description.
 * A resource is basically a representation of a single object.
 * Object has a name which is currently 4 64 bit integers.
 * LDLM user is responsible for creating a mapping between objects it wants
 * protected and resource names.
 *
 * A resource can only hold locks of a single lock type.
 */
struct ldlm_resource {
        /** Back link to namespace */
        struct ldlm_namespace *lr_namespace;

        /* protected by ns_hash_lock */
        struct list_head       lr_hash;
        /* Remove lr_parent/child and its logic ASAP */
        struct ldlm_resource  *lr_parent;   /* 0 for a root resource */
        struct list_head       lr_children; /* list head for child resources */
        struct list_head       lr_childof;  /* part of ns_root_list if root res,
                                             * part of lr_children if child */
        spinlock_t             lr_lock;

        /* protected by lr_lock */
        /**
         * List of locks in granted state
         */
        struct list_head       lr_granted;
        /**
         * List of locks waiting to change their granted mode (converted)
         */
        struct list_head       lr_converting;
        /**
         * List of locks that could not be granted due to conflicts and
         * that are waiting for conflicts to go away */
        struct list_head       lr_waiting;

        /* No longer needed? Remove ASAP */
        ldlm_mode_t            lr_most_restr;

        /**
         * Type of locks this resource can hold
         */
        ldlm_type_t            lr_type; /* LDLM_{PLAIN,EXTENT,FLOCK,IBITS} */

        struct ldlm_res_id     lr_name;
        atomic_t               lr_refcount;

        struct ldlm_interval_tree lr_itree[LCK_MODE_NUM];  /* interval trees*/

        /* Server-side-only lock value block elements */
        struct semaphore       lr_lvb_sem;
        __u32                  lr_lvb_len;
        void                  *lr_lvb_data;

        /* when the resource was considered as contended */
        cfs_time_t             lr_contention_time;
        /**
         * List of references to this resource. For debugging.
         */
        struct lu_ref          lr_reference;
};

The fields are:

lr_namespace

points back to the ldlm_namespace the resource belongs to

lr_hash

links the resource into the namespace’s hash table (protected by spinlock)

lr_parent

NULL (0) if the resource is a "root resource"; otherwise, points to the resource's parent (the parent/child logic is to be removed soon?)

lr_childof

linkage into either the namespace’s list of root resources or the parent resources list of child resources

lr_children

a list containing the resource’s children

lr_lock

a spinlock that protects various fields in the resource and its associated locks

lr_granted

list of locks that have been granted for the resource

lr_converting

list of (granted) locks that are waiting to change their mode

lr_waiting

list of locks that have yet to be granted

lr_most_restr

the most restrictive lock mode that has been granted on the resource; the field is set but not actually used anywhere - to be removed?

lr_type

the type of locks that the resource supports - one of ldlm_type_t:

typedef enum {
        LDLM_PLAIN     = 10,
        LDLM_EXTENT    = 11,
        LDLM_FLOCK     = 12,
        LDLM_IBITS     = 13,
        LDLM_MAX_TYPE
} ldlm_type_t;
lr_name

the resources name - of type ldlm_res_id:

#define RES_NAME_SIZE 4
struct ldlm_res_id {
        __u64 name[RES_NAME_SIZE];
};
lr_refcount

the number of references to the resource

lr_itree

an array of interval trees (one for each lock mode (EX, PW, etc.)) - used for extent locks

lr_lvb_sem

server-side semaphore that synchronises access to the Lock Value Block (LVB) data

lr_lvb_len

length (in bytes) of Lock Value Block data

lr_lvb_data

pointer to memory for Lock Value Block data (allocated by obd code but freed by ldlm code)

lr_contention_time

the time when the resource was last seen to have contended locks

lr_reference

list of references to the resource (for debugging)

2.3. ldlm_lock

Locks are represented by instances of struct ldlm_lock.

/** LDLM lock structure
 *
 * Represents actual lock and its state in memory */
struct ldlm_lock {
        /** Local lock handle.
         * When remote side wants to tell us about a lock, they address it by
         * this handle.
         * Must be first in the structure.
         */
        struct portals_handle    l_handle;
        /**
         * Lock reference count.
         */
        atomic_t                 l_refc;
        /**
         * Internal spinlock protects l_resource.  we should hold this lock
         * first before grabbing res_lock.
         */
        spinlock_t               l_lock;
        /**
         * ldlm_lock_change_resource() can change this.
         */
        struct ldlm_resource    *l_resource;
        /**
         * Protected by ns_hash_lock. List item for client side lru list.
         */
        struct list_head         l_lru;
        /**
         * Protected by lr_lock, linkage to resource's lock queues.
         */
        struct list_head         l_res_link;
        /**
         * Tree node for ldlm_extent.
         */
        struct ldlm_interval    *l_tree_node;
        /**
         * Protected by per-bucket exp->exp_lock_hash locks. Per export hash
         * of locks.
         */
        struct hlist_node        l_exp_hash;
        /**
         * Protected by lr_lock. Requested mode.
         */
        ldlm_mode_t              l_req_mode;
        /**
         * Granted mode, also protected by lr_lock.
         */
        ldlm_mode_t              l_granted_mode;
        /**
         * Lock enqueue completion handler.
         */
        ldlm_completion_callback l_completion_ast;
        /**
         * Lock blocking ast handler.
         * Called twice: once when somebody first conflicts with this lock,
         * and once when the lock is finally cancelled.
         */
        ldlm_blocking_callback   l_blocking_ast;
        /**
         * Lock glimpse handler.
         * The glimpse handler is used by the server to obtain LVB updates
         * from a client.
         */
        ldlm_glimpse_callback    l_glimpse_ast;

        ldlm_weigh_callback      l_weigh_ast;

        /**
         * Lock export.
         */
        struct obd_export       *l_export;
        /**
         * Lock connection export.
         */
        struct obd_export       *l_conn_export;

        /**
         * Remote lock handle.
         * If the lock has a sister-lock on another node, this is the handle
         * of its remote counterpart (l_handle).
         */
        struct lustre_handle     l_remote_handle;

        /** Representation of private data specific for a lock type.
         * Examples are: extent range for extent lock or
         * bitmask for ibits locks */
        ldlm_policy_data_t       l_policy_data;

        /*
         * Protected by lr_lock. Various counters: readers, writers, etc.
         */
        __u64                 l_flags;
        __u32                 l_readers;
        __u32                 l_writers;
        /*
         * Set for locks that were removed from class hash table and will be
         * destroyed when last reference to them is released. Set by
         * ldlm_lock_destroy_internal().
         *
         * Protected by lock and resource locks.
         */
        __u8                  l_destroyed;

        /**
         * If the lock is granted, a process sleeps on this waitq to learn when
         * it's no longer in use.  If the lock is not granted, a process sleeps
         * on this waitq to learn when it becomes granted.
         */
        cfs_waitq_t           l_waitq;

        /**
         * Seconds. It will be updated if there is any activity related to
         * the lock, e.g. enqueue the lock or send block AST.
         */
        cfs_time_t            l_last_activity;

        /**
         * Jiffies. Should be converted to time if needed.
         */
        cfs_time_t            l_last_used;

        /** Originally requested extent on the extent lock */
        struct ldlm_extent    l_req_extent;

        /*
         * Client-side-only members.
         */

        /**
         * Temporary storage for an LVB received during an enqueue operation.
         */
        __u32                 l_lvb_len;
        void                 *l_lvb_data;
        void                 *l_lvb_swabber;

        /** Private storage for lock user. Opaque to LDLM. */
        void                 *l_ast_data;

        /*
         * Server-side-only members.
         */

        /** Connection cookie for the client that originated the operation. */
        __u64                 l_client_cookie;

        /**
         * Protected by elt_lock. Callbacks pending.
         */
        struct list_head      l_pending_chain;

        cfs_time_t            l_callback_timeout;

        /**
         * Pid which created this lock.
         */
        __u32                 l_pid;

        /**
         * For ldlm_add_ast_work_item().
         */
        struct list_head      l_bl_ast;
        /**
         * For ldlm_add_ast_work_item().
         */
        struct list_head      l_cp_ast;
        /**
         * For ldlm_add_ast_work_item().
         */
        struct list_head      l_rk_ast;

        struct ldlm_lock     *l_blocking_lock;
        int                   l_bl_ast_run;

        /**
         * Protected by lr_lock, linkages to "skip lists".
         */
        struct list_head      l_sl_mode;
        struct list_head      l_sl_policy;
        struct lu_ref         l_reference;
};

The fields are:

l_handle

the lock’s local handle that is used by the remote side to address the lock - of type struct portals_handle, it must be the first member of the lock structure

l_refc

number of references to the lock

l_lock

spinlock used to protect l_resource

l_resource

the lock’s resource (can be changed during lifetime of lock)

l_lru

linkage into namespace’s list of unused locks

l_res_link

linkage into one of the resource’s lists of locks

l_tree_node

pointer to a struct ldlm_interval required when lock is an extent lock

l_exp_hash

hash node that links the lock into its export's hash of locks

l_req_mode

requested mode of lock - one of ldlm_mode_t:

typedef enum {
        LCK_MINMODE = 0,
        LCK_EX      = 1,
        LCK_PW      = 2,
        LCK_PR      = 4,
        LCK_CW      = 8,
        LCK_CR      = 16,
        LCK_NL      = 32,
        LCK_GROUP   = 64,
        LCK_COS     = 128,
        LCK_MAXMODE
} ldlm_mode_t;
l_granted_mode

granted mode of lock

l_completion_ast

callback function invoked when enqueue completes - its type is ldlm_completion_callback:

typedef int (*ldlm_completion_callback)(struct ldlm_lock *lock, int flags,
                                        void *data);
l_blocking_ast

blocking ast handler function called when another lock conflicts with this one or when the lock is finally cancelled - its type is ldlm_blocking_callback:

typedef int (*ldlm_blocking_callback)(struct ldlm_lock *lock,
                                      struct ldlm_lock_desc *new, void *data,
                                      int flag);
l_glimpse_ast

glimpse handler function provides LVB updates - its type is ldlm_glimpse_callback:

typedef int (*ldlm_glimpse_callback)(struct ldlm_lock *lock, void *data);
l_weigh_ast

not used?

l_export

on servers, this is the export through which the client is accessed

l_conn_export

on clients, this is the export through which the server is accessed

l_remote_handle

the handle for the lock’s remote sibling - of type lustre_handle:

struct lustre_handle {
        __u64 cookie;
};
l_policy_data

private data specific to the lock type - a union type ldlm_policy_data_t:

typedef union {
        struct ldlm_extent l_extent;
        struct ldlm_flock  l_flock;
        struct ldlm_inodebits l_inodebits;
} ldlm_policy_data_t;
l_flags

lock flags - a combination of:

#define LDLM_FL_LOCK_CHANGED   0x000001 /* extent, mode, or resource changed */

/* If the server returns one of these flags, then the lock was put on that list.
 * If the client sends one of these flags (during recovery ONLY!), it wants the
 * lock added to the specified list, no questions asked. -p */
#define LDLM_FL_BLOCK_GRANTED  0x000002
#define LDLM_FL_BLOCK_CONV     0x000004
#define LDLM_FL_BLOCK_WAIT     0x000008

#define LDLM_FL_CBPENDING      0x000010 /* this lock is being destroyed */
#define LDLM_FL_AST_SENT       0x000020 /* blocking or cancel packet was
                                         * queued for sending. */
#define LDLM_FL_WAIT_NOREPROC  0x000040 /* not a real flag, not saved in lock */
#define LDLM_FL_CANCEL         0x000080 /* cancellation callback already run */

/* Lock is being replayed.  This could probably be implied by the fact that one
 * of BLOCK_{GRANTED,CONV,WAIT} is set, but that is pretty dangerous. */
#define LDLM_FL_REPLAY         0x000100

#define LDLM_FL_INTENT_ONLY    0x000200 /* don't grant lock, just do intent */
#define LDLM_FL_LOCAL_ONLY     0x000400 /* see ldlm_cli_cancel_unused */

/* don't run the cancel callback under ldlm_cli_cancel_unused */
#define LDLM_FL_FAILED         0x000800

#define LDLM_FL_HAS_INTENT     0x001000 /* lock request has intent */
#define LDLM_FL_CANCELING      0x002000 /* lock cancel has already been sent */
#define LDLM_FL_LOCAL          0x004000 /* local lock (ie, no srv/cli split) */
#define LDLM_FL_WARN           0x008000 /* see ldlm_cli_cancel_unused */
#define LDLM_FL_DISCARD_DATA   0x010000 /* discard (no writeback) on cancel */

#define LDLM_FL_NO_TIMEOUT     0x020000 /* Blocked by group lock - wait
                                         * indefinitely */

/* file & record locking */
#define LDLM_FL_BLOCK_NOWAIT   0x040000 // server told not to wait if blocked
#define LDLM_FL_TEST_LOCK      0x080000 // return blocking lock

/* XXX FIXME: This is being added to b_size as a low-risk fix to the fact that
 * the LVB filling happens _after_ the lock has been granted, so another thread
 * can match it before the LVB has been updated.  As a dirty hack, we set
 * LDLM_FL_LVB_READY only after we've done the LVB poop.
 * this is only needed on lov/osc now, where lvb is actually used and callers
 * must set it in input flags.
 *
 * The proper fix is to do the granting inside of the completion AST, which can
 * be replaced with a LVB-aware wrapping function for OSC locks.  That change is
 * pretty high-risk, though, and would need a lot more testing. */

#define LDLM_FL_LVB_READY      0x100000

/* A lock contributes to the kms calculation until it has finished the part
 * of it's cancelation that performs write back on its dirty pages.  It
 * can remain on the granted list during this whole time.  Threads racing
 * to update the kms after performing their writeback need to know to
 * exclude each others locks from the calculation as they walk the granted
 * list. */
#define LDLM_FL_KMS_IGNORE     0x200000

/* Immediately cancel such locks when they block some other locks. Send
 * cancel notification to original lock holder, but expect no reply. This is
 * for clients (like liblustre) that cannot be expected to reliably respond
 * to blocking ASTs. */
#define LDLM_FL_CANCEL_ON_BLOCK 0x800000

/* Flags inherited from parent lock when doing intents. */
#define LDLM_INHERIT_FLAGS     (LDLM_FL_CANCEL_ON_BLOCK)

/* completion ast to be executed */
#define LDLM_FL_CP_REQD        0x1000000

/* cleanup_resource has already handled the lock */
#define LDLM_FL_CLEANED        0x2000000

/* Optimization hint: LDLM can run the blocking callback from the current
 * context without involving a separate thread, to decrease the context
 * switch rate. */
#define LDLM_FL_ATOMIC_CB      0x4000000

/* Cancel lock asynchronously. See ldlm_cli_cancel_unused_resource. */
#define LDLM_FL_ASYNC           0x8000000

/* It may happen that a client initiates 2 operations, e.g. unlink and mkdir,
 * such that the server sends blocking ASTs for conflicting locks to this
 * client for the 1st operation, whereas the 2nd operation has canceled this
 * lock and is waiting for the rpc_lock which is taken by the 1st operation.
 * LDLM_FL_BL_AST is set by ldlm_callback_handler() on the lock to prevent
 * the ELC code from cancelling it.
 * LDLM_FL_BL_DONE is set by ldlm_cancel_callback() when the lock cache is
 * dropped, to let ldlm_callback_handler() return EINVAL to the server. It is
 * used when the ELC RPC is already prepared and is waiting for the rpc_lock;
 * it is too late to send a separate CANCEL RPC. */
#define LDLM_FL_BL_AST          0x10000000
#define LDLM_FL_BL_DONE         0x20000000

/* measure lock contention and return -EUSERS if locking contention is high */
#define LDLM_FL_DENY_ON_CONTENTION 0x40000000

/* These are flags that are mapped into the flags and ASTs of blocking locks */
#define LDLM_AST_DISCARD_DATA  0x80000000 /* Add FL_DISCARD to blocking ASTs */

/* Flags sent in AST lock_flags to be mapped into the receiving lock. */
#define LDLM_AST_FLAGS         (LDLM_FL_DISCARD_DATA)
l_readers

number of read mode (LCK_NL | LCK_CR | LCK_PR) references to the lock

l_writers

number of write mode (LCK_EX | LCK_CW | LCK_PW | LCK_GROUP | LCK_COS) references to the lock

l_destroyed

marks the lock as ready for destruction which occurs when the last reference to it is removed

l_waitq

waitq for a process waiting for the lock granted status to change

l_last_activity

the time (in seconds) of the last activity related to the lock, e.g. enqueuing the lock or sending a blocking AST

l_last_used

time in jiffies when the lock was added to the LRU list

l_req_extent

original requested extent - of type ldlm_extent:

struct ldlm_extent {
        __u64 start;
        __u64 end;
        __u64 gid;
};
l_reference

references to the lock (for debugging)

The following fields are used by the client-side only:

l_lvb_len
l_lvb_data

temporary storage for an LVB received during enqueue

l_lvb_swabber

the swab function to use on the LVB from the completion AST callback handler

l_ast_data

pointer to some opaque data private to the upper layer (lock user)

The following fields are used by the server-side only:

l_client_cookie

identifies the client that originated the operation

l_pending_chain

linkage into the list of locks with pending callbacks

l_callback_timeout

timeout (in jiffies) for callback to occur

l_pid

PID of the process that created the lock

l_bl_ast

linkage into the list of conflicting locks that are to be sent a blocking AST (but have not yet been sent one)

l_cp_ast

linkage into the list of locks that are to be sent a completion AST

l_rk_ast

linkage into the list of locks that are to be sent a revoke AST

l_blocking_lock

the lock that is blocking this one from being granted

l_bl_ast_run

number of times the lock’s blocking AST has been invoked

l_sl_mode

linkage into mode skip list

l_sl_policy

linkage into policy skip list

3. Code Description

3.1. Namespace creation

Client-side and server-side namespaces are created and initialised using ldlm_namespace_new().

3.1.1. ldlm_namespace_new()

struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
                                          ldlm_side_t client, ldlm_appetite_t apt);

Its arguments are:

obd

the obd_device the namespace is associated with; it appears to be used only in the MDT handler and the LDLM pool code

name

the namespace’s name

client

a ldlm_side_t that specifies whether the new namespace is client-side or server-side

apt

a ldlm_appetite_t that specifies the namespace’s appetite (used by the pooling code)

The function firstly obtains a reference to the LDLM system by calling ldlm_get_ref(). The first call to that function will invoke ldlm_setup() to initialise the LDLM infrastructure (set up the PTLRPC services, create threads, etc.).

The memory for the new namespace and its hash table (ns_hash) is allocated. The obd and apt arguments are assigned to ns_obd and ns_appetite, respectively, and memory is allocated to ns_name for a copy of the name argument.

The other members of the namespace structure are set to their initial values.

A call to ldlm_proc_namespace() sets up the procfs entries for the namespace.

A call to ldlm_pool_init() initialises the namespace’s pool.

A call to at_init() initialises the namespace’s ns_at_estimate field (Adaptive Timeout).

Finally, the namespace is registered with a call to ldlm_namespace_register() which adds it to the client or server namespaces list (as appropriate) and increments the number of client (or server) namespaces registered.
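
As a usage sketch, a server-side service could create its local namespace along these lines (the namespace name and appetite value are illustrative, and it is assumed here that the function returns NULL on failure):

struct ldlm_namespace *ns;

ns = ldlm_namespace_new(obd, "example-OST0000", LDLM_NAMESPACE_SERVER,
                        LDLM_NAMESPACE_MODEST);
if (ns == NULL)          /* assumed failure convention */
        return -ENOMEM;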

3.2. Resource creation

Resources are identified by a name comprising 4 x 64-bit integers (type ldlm_res_id). A reference to a resource is obtained by calling ldlm_resource_get().
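
For example, a hypothetical LDLM user could map a 64-bit object identifier onto a resource name by placing it in the first of the four integers (this mapping is purely illustrative; each component chooses its own):

/* Hypothetical mapping: identify an object by a single 64-bit id placed
 * in the first slot of the resource name, with the rest zeroed. */
static void object_to_res_id(__u64 object_id, struct ldlm_res_id *name)
{
        memset(name, 0, sizeof(*name));
        name->name[0] = object_id;
}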

3.2.1. ldlm_resource_get()

This function obtains a reference to the named resource within the given namespace. It will (optionally) create the resource if it doesn’t already exist.

struct ldlm_resource *
ldlm_resource_get(struct ldlm_namespace *ns, struct ldlm_resource *parent,
                  const struct ldlm_res_id *name, ldlm_type_t type, int create);

Its arguments are:

ns

the namespace in which the resource resides

parent

the resource’s parent resource

name

the name of resource (of type ldlm_res_id)

type

the type of the resource (of type ldlm_type_t)

create

if non-zero and the resource doesn’t already exist, create it

The function starts by calling ldlm_hash_fn() to calculate the resource’s hash value (based on the values of name and parent).

It then calls ldlm_resource_find() to look up the resource in the namespace’s hash table.

  • if found, a new reference to the resource is obtained

    If the namespace’s ns_lvbo field is non-NULL and the referenced ldlm_valblock_ops has a non-NULL lvbo_init function pointer, the resource’s lr_lvb_sem semaphore is down()'d and up()'d (this waits for any in-flight LVB initialisation to complete)

    the resource is returned

  • if the resource is not found and create is zero, the function returns NULL

    otherwise, ldlm_resource_add() is called to allocate and reference a new resource which is returned
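
A minimal usage sketch follows; it assumes that ldlm_resource_putref() is the call that releases the reference when the caller is finished with the resource:

struct ldlm_res_id res_id = { .name = { 42 } };  /* illustrative name */
struct ldlm_resource *res;

res = ldlm_resource_get(ns, NULL /* no parent */, &res_id, LDLM_EXTENT,
                        1 /* create if it does not exist */);
if (res == NULL)
        return -ENOMEM;
/* ... take/release locks on the resource ... */
ldlm_resource_putref(res);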

3.2.2. ldlm_resource_add()

This function allocates and initialises a new resource within the given namespace.

static struct ldlm_resource *
ldlm_resource_add(struct ldlm_namespace *ns, struct ldlm_resource *parent,
                  const struct ldlm_res_id *name, __u32 hash, ldlm_type_t type)

Its arguments are:

ns

the namespace in which the resource resides

parent

the resource’s parent resource

name

the name of resource (of type ldlm_res_id)

hash

the hash code for the resource

type

the type of the resource (of type ldlm_type_t)

The function starts by calling ldlm_resource_new() to allocate and partially initialise a new resource object. That object is then further initialised with name, ns and type.

The namespace’s hash table is then spinlocked and searched to see if it already has an entry that matches hash. If so, it means that the resource has already been added to the namespace, so a reference is obtained to that existing resource and the hash table spinlock is released. The newly allocated resource is then freed and the existing resource is returned.

If the hash table did not already contain an entry that matches hash, the new resource is added to the hash table and the namespace’s resource count (ns_refcount) is incremented. If parent is non-NULL, the new resource is linked to the parent; otherwise, it is added to the namespace’s ns_root_list. The hash table spinlock is unlocked.

If the namespace’s ns_lvbo field is non-NULL and the referenced ldlm_valblock_ops has a non-NULL lvbo_init function pointer, then that function is invoked, passing it the new resource. The resource’s lr_lvb_sem is up()'d.

The resource is returned.

3.2.3. ldlm_resource_new()

This function allocates and (semi) initialises a new resource.

static struct ldlm_resource *ldlm_resource_new(void);

The resource is allocated and various fields zeroed/initialised.

The lr_refcount field is set to 1.

The lr_lvb_sem semaphore is initialised in the locked state.

The new resource is returned.

3.3. Lock creation

3.3.1. ldlm_lock_create()

This function creates a new lock for the specified resource with the given type, mode and callback functions.

struct ldlm_lock *ldlm_lock_create(struct ldlm_namespace *ns,
                                   const struct ldlm_res_id *res_id,
                                   ldlm_type_t type,
                                   ldlm_mode_t mode,
                                   const struct ldlm_callback_suite *cbs,
                                   void *data, __u32 lvb_len);

Arguments are:

ns

the namespace that contains the resource to be locked

res_id

this is the id of the resource that is being locked

type

the type of the resource (of type ldlm_type_t)

mode

the mode of the lock (of type ldlm_mode_t)

cbs

the lock’s callback functions (of type ldlm_callback_suite):

struct ldlm_callback_suite {
        ldlm_completion_callback lcs_completion;
        ldlm_blocking_callback   lcs_blocking;
        ldlm_glimpse_callback    lcs_glimpse;
        ldlm_weigh_callback      lcs_weigh;
};
data

opaque data private to the lock’s user that is stored in l_ast_data

lvb_len

the length of the lock’s LVB data

The function starts by obtaining a reference to the resource by calling ldlm_resource_get(). The resource will be created if it doesn’t already exist.

A new lock is obtained by calling ldlm_lock_new() (this obtains a new reference to the resource).

The first reference to the resource is released as it’s not required any more.

Various fields in the lock are initialised from the passed in values.

If the lock is an extent lock, an interval tree node is allocated.

If lvb_len is non-zero, a buffer of that size is allocated and assigned to l_lvb_data.

The function returns the new lock.

On error, the allocated storage is cleaned up and the function returns NULL.
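
A usage sketch (the my_*_ast callback functions are hypothetical, user-supplied handlers):

/* Hypothetical caller-supplied callbacks gathered into a suite. */
const struct ldlm_callback_suite cbs = {
        .lcs_completion = my_completion_ast,
        .lcs_blocking   = my_blocking_ast,
        .lcs_glimpse    = my_glimpse_ast,
};
struct ldlm_lock *lock;

lock = ldlm_lock_create(ns, &res_id, LDLM_EXTENT, LCK_PR, &cbs,
                        NULL /* ast data */, 0 /* no LVB */);
if (lock == NULL)
        return -ENOMEM;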

3.3.2. ldlm_lock_new()

This function allocates a new lock for the specified resource.

static struct ldlm_lock *ldlm_lock_new(struct ldlm_resource *resource);

Just a single argument, resource, which is the resource to be locked.

Memory is allocated for the new lock and various fields are initialised. A reference to the resource is obtained.

Worthy of note is the fact that the lock’s reference count (l_refc) is initialised to 2.

The resource’s namespace’s number of locks is incremented.

A hash value is computed for the lock’s handle (l_handle).

The new lock is returned.

3.4. Lock enqueing - client-side

On the client, a lock on a remote resource is obtained by calling ldlm_cli_enqueue() to enqueue a lock request. This involves creating a lock object, building a PTLRPC request which is sent to the server and waiting for and then processing the result from the server. Unless an error occurs, the lock is added to one of the resource’s lists (granted, converted or waiting).

3.4.1. ldlm_cli_enqueue()

int ldlm_cli_enqueue(struct obd_export *exp, struct ptlrpc_request **reqp,
                     struct ldlm_enqueue_info *einfo,
                     const struct ldlm_res_id *res_id,
                     ldlm_policy_data_t *policy, int *flags,
                     void *lvb, __u32 lvb_len, void *lvb_swabber,
                     struct lustre_handle *lockh, int async);

Lots of arguments but the key ones are:

exp

the export used to communicate with the server

reqp

if the PTLRPC request is to have some special initialisation, that is done before calling this function and passed in using reqp

res_id

this is the id of the resource that is being locked

einfo

specifies detail such as the lock’s type, mode and callback function pointers

policy

the policy data (if any) associated with the lock

flags

the lock’s flags

The function starts by retrieving the ldlm_namespace, ns, from the export, exp. If replaying, the previously created lock is located by passing the lockh argument to ldlm_handle2lock_long(). Otherwise, a new lock is created by calling ldlm_lock_create():

                const struct ldlm_callback_suite cbs = {
                        .lcs_completion = einfo->ei_cb_cp,
                        .lcs_blocking   = einfo->ei_cb_bl,
                        .lcs_glimpse    = einfo->ei_cb_gl,
                        .lcs_weigh      = einfo->ei_cb_wg
                };
                lock = ldlm_lock_create(ns, res_id, einfo->ei_type,
                                        einfo->ei_mode, &cbs, einfo->ei_cbdata,
                                        lvb_len);

This takes the namespace, ns, and the resource id, res_id, that identify the resource being locked and the lock’s desired type, mode, callbacks, etc.

A reference is added to the lock by calling ldlm_lock_addref_internal(), which has the side effects of removing the lock from the LRU list (if appropriate), incrementing the lock’s count of readers or writers (depending on the lock mode) and adding refs to the lock’s debug reference list.

A call to ldlm_lock2handle() returns the lock’s cookie (by reference using *lockh) to the caller.

If policy is not NULL (policy data supplied), it is copied by value into the lock.

If the lock is an extent lock, the extent data is copied into the lock from policy (code assumes policy not NULL here!)

If reqp is NULL or *reqp is NULL, the request has to be created with a call to ptlrpc_request_alloc_pack():

                req = ptlrpc_request_alloc_pack(class_exp2cliimp(exp),
                                                &RQF_LDLM_ENQUEUE,
                                                LUSTRE_DLM_VERSION,
                                                LDLM_ENQUEUE);

Otherwise, the passed in request is checked to see that its capsule size is large enough to contain the lock request data.

The next action of note is that the lock data is transferred into the request body:

        /* Dump lock data into the request buffer */
        body = req_capsule_client_get(&req->rq_pill, &RMF_DLM_REQ);
        ldlm_lock2desc(lock, &body->lock_desc);
        body->lock_flags = *flags;
        body->lock_handle[0] = *lockh;

The main work here is done by ldlm_lock2desc() which converts the lock data into the "on the wire" format stored in the request body.

Next, if the PTLRPC request was not passed in (it was allocated in this function), its reply length is set by a call to ptlrpc_request_set_replen(). Before that is done, if the passed in lvb_len (LDLM Value Block length) is > 0, the request capsule is extended to include the LVB data.

If the async argument is true, the function returns at this point without initiating the communication with the server. It is up to the client to do that using the initialised request.

Otherwise, the request is sent to the server and its response is processed by:

        rc = ptlrpc_queue_wait(req);

        err = ldlm_cli_enqueue_fini(exp, req, einfo->ei_type, policy ? 1 : 0,
                                    einfo->ei_mode, flags, lvb, lvb_len,
                                    lvb_swabber, lockh, rc);

And after that, the request is cleaned up (if it was allocated in this function).
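
Putting the pieces together, a client could enqueue a protective-read extent lock covering a whole object roughly as follows. The einfo field names are taken from the snippet above; the my_*_ast callbacks and the extent bounds are illustrative:

struct ldlm_enqueue_info einfo = {
        .ei_type  = LDLM_EXTENT,
        .ei_mode  = LCK_PR,
        .ei_cb_bl = my_blocking_ast,    /* placeholder callbacks */
        .ei_cb_cp = my_completion_ast,
        .ei_cb_gl = my_glimpse_ast,
};
ldlm_policy_data_t policy = {
        .l_extent = { .start = 0, .end = ~0ULL /* whole object */ },
};
struct lustre_handle lockh;
int flags = 0;
int rc;

rc = ldlm_cli_enqueue(exp, NULL /* allocate the request */, &einfo,
                      &res_id, &policy, &flags, NULL, 0, NULL, &lockh,
                      0 /* synchronous */);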

3.4.2. ldlm_cli_enqueue_fini()

This function processes the server’s response to the lock enqueue request.

int ldlm_cli_enqueue_fini(struct obd_export *exp, struct ptlrpc_request *req,
                          ldlm_type_t type, __u8 with_policy, ldlm_mode_t mode,
                          int *flags, void *lvb, __u32 lvb_len,
                          void *lvb_swabber, struct lustre_handle *lockh,int rc);

A lot of the arguments correspond to the arguments to ldlm_cli_enqueue() which simply passes them down.

Firstly it calls ldlm_handle2lock() to retrieve the lock using the handle *lockh. A side effect of this call is that it obtains a new reference to the lock.

If the result is NULL (which should only occur when the lock’s type is LDLM_FLOCK), the function returns -ENOLCK.

Otherwise, the return code from the PTLRPC communication is checked. If it’s not ELDLM_OK then the enqueue either aborted or failed. If it aborted, some reply data is byte swapped. In both cases, control transfers to the cleanup code and the function returns.

The reply is retrieved from the PTLRPC response, the lock and resource are locked and the lock’s l_remote_handle is updated from the reply’s lock_handle. If the export has a hash of locks (exp_lock_hash not NULL), the lock is rehashed using the new l_remote_handle as the key.

The lock_flags in the reply are passed back to the caller by reference (using flags) and those flags that are masked by LDLM_INHERIT_FLAGS are ORed into the lock’s flags, l_flags. The LDLM_FL_NO_TIMEOUT flag is also transferred from the reply to the lock.

The lock and resource are unlocked.

If the reply flags has LDLM_FL_LOCK_CHANGED set, the lock’s mode and resource name are updated from the reply (if they changed). Also, the lock’s l_policy_data member is updated (except when the server doesn’t support inodebits).

If the reply flags indicate that a blocking or cancel AST was sent (to whom?), or the client is a liblustre client and the lock is an extent lock, the lock is marked as being destroyed (why?).

If the lvb_len argument is non-zero and the lock’s requested mode is not equal to its granted mode, the LVB data in the reply is copied into the region that the lock’s l_lvb_data member points at.

If not replaying, the lock is enqueued by calling ldlm_lock_enqueue(). When this returns, if the lock’s completion AST pointer, l_completion_ast, is not NULL, the function is invoked to signal the client that the lock has been enqueued.

If the lvb_len argument is non-zero and the lvb argument is non-NULL, the LVB data is copied out of the lock to *lvb.

That’s about it; if cleanup is required due to errors, failed_lock_cleanup() is invoked.

The reference to the lock obtained when the lock was retrieved by handle is released and also the reference to the lock that was obtained by the calling function.

3.5. Lock enqueing - server-side

On the server, when an LDLM enqueue message is received, the PTLRPC layer calls back into the LDLM code. Most code still calls ldlm_handle_enqueue() which is now a wrapper around a newer function, ldlm_handle_enqueue0().

3.5.1. ldlm_handle_enqueue()

This is the old entry point to the server-side handling of enqueue requests.

int ldlm_handle_enqueue(struct ptlrpc_request *req,
                        ldlm_completion_callback completion_callback,
                        ldlm_blocking_callback blocking_callback,
                        ldlm_glimpse_callback glimpse_callback);

The arguments are:

req

the PTLRPC request received

completion_callback

a callback function that will be invoked to inform the client that the lock has been granted

blocking_callback

a callback function that will be invoked to inform the client that currently holds the lock that a conflicting lock request has been enqueued

glimpse_callback

a callback function that will be invoked to send a glimpse AST to the client, await the reply and then update the lvbo with the result

This function simply aggregates the callback function pointers into a ldlm_callback_suite, obtains the LDLM request data from the PTLRPC request and calls ldlm_handle_enqueue0().
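
Conceptually, its body reduces to something like the following sketch (the expression used to obtain the namespace from the request's export is an assumption):

struct ldlm_callback_suite cbs = {
        .lcs_completion = completion_callback,
        .lcs_blocking   = blocking_callback,
        .lcs_glimpse    = glimpse_callback,
};
struct ldlm_request *dlm_req;

/* Pull the LDLM request data out of the PTLRPC request capsule. */
dlm_req = req_capsule_client_get(&req->rq_pill, &RMF_DLM_REQ);
if (dlm_req == NULL)
        return -EFAULT;
return ldlm_handle_enqueue0(req->rq_export->exp_obd->obd_namespace,
                            req, dlm_req, &cbs);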

3.5.2. ldlm_handle_enqueue0()

This is the main LDLM server-side entry point for enqueuing lock requests. It is called from PTLRPC receive handling threads.

int ldlm_handle_enqueue0(struct ldlm_namespace *ns,
                         struct ptlrpc_request *req,
                         const struct ldlm_request *dlm_req,
                         const struct ldlm_callback_suite *cbs);

The arguments are:

ns

the namespace containing the resource to be locked

req

the PTLRPC request received

dlm_req

the LDLM request data

cbs

the callback functions (of type ldlm_callback_suite) for the new lock

The function starts by calling ldlm_request_cancel() to cancel any locks that the client wishes to be canceled as part of the enqueuing operation.

If appropriate, procfs counters are updated.

Next, various values are sanity checked (lock type, mode, IBITS consistency) and if the checks fail, control jumps to the error handling code at the bottom of the function.

If the lock request is being replayed, the existing lock is searched for in the per-export hash table and, if found, control jumps over the code to create a new lock.

A new lock is created with a call to ldlm_lock_create(). The resource’s name and the lock type and mode are all taken from the request.

The new lock’s last activity time is set to the current time and its remote handle is assigned the client lock’s handle from the request.

The OBD_FAIL_LDLM_ENQUEUE_BLOCKED timeout is set to twice the value of obd_timeout (why?)

If the export has been destroyed, the enqueue is cancelled and control jumps to the error handling code.

A reference to the lock’s export is obtained and assigned to the lock’s l_export field. If the export has a lock hash, the lock is added to the hash.

The following code is executed for both new and replayed locks.

What happens next is dependent on whether the lock has intent (LDLM_FL_HAS_INTENT is set) or not:

lock has intent

the cookie pointer (used later) is set to the PTLRPC request

lock does not have intent

The lock and resource are locked and if the resource has non-zero lr_lvb_len, that size is stored into the PTLRPC request capsule and the lock and resource are unlocked.

does some check using OBD_FAIL_CHECK(OBD_FAIL_LDLM_ENQUEUE_EXTENT_ERR) that I don’t understand

The PTLRPC request capsule is packed.

If the lock type is not LDLM_PLAIN, the lock’s l_policy_data field is set to the corresponding value in the LDLM request.

If the lock type is LDLM_EXTENT, the lock’s l_req_extent field is set from the l_extent part of the lock’s l_policy_data field.

The lock is enqueued by calling ldlm_lock_enqueue() passing the namespace, lock, cookie value and the requested lock flags.

A pointer to the PTLRPC response structure (dlm_rep) is obtained and the lock’s state, handle and flags are stored into it.

The lock and resource are locked.

Those lock flags that need to be inherited from the original locking request (selected by mask LDLM_INHERIT_FLAGS) are added into the lock flags for both the reply to the client and the local lock.

Check if the export has failed (exp_failed non-zero) and, if so, set the return code appropriately.

Otherwise, do some stuff related to canceling the lock if LDLM_FL_CANCEL_ON_BLOCK is set - can’t work out the meaning and usage of LDLM_FL_AST_SENT so leave this for now

Then follows a check for the erroneous situation where the lock type is LDLM_PLAIN or LDLM_IBITS, the client is a liblustre client and the lock doesn’t have the LDLM_FL_CANCEL_ON_BLOCK flag set. In this case, messages are reported but no error state is set.

The lock and resource are unlocked.

The PTLRPC request field rq_status is assigned the result status value (non-zero if an error has occurred).

If required (rq_packed_final is zero), the reply is packed.

The remainder of the function only executes if the lock exists (either it was found during replaying or a new lock has been created).

The lock and resource are locked.

If no error has occurred and the resource’s lr_lvb_len field is non-zero, the resource’s LVB data is copied into the PTLRPC capsule.

If an error has occurred, the lock is unlinked from the various places that reference it and is destroyed.

The lock and resource are unlocked.

If no error occurred while enqueuing the lock or packing the PTLRPC reply and the lock isn’t of type LDLM_FLOCK, ldlm_reprocess_all() is called to process the resource’s waiting and converting queues.

The lock is released.

The function returns the result code (0 for success).

3.5.3. ldlm_cli_enqueue_local()

This function enqueues a lock in a local namespace. No RPC communication is required. Somewhat confusingly (given the function’s name), it is only used on servers to enqueue server-local locks.

int ldlm_cli_enqueue_local(struct ldlm_namespace *ns,
                           const struct ldlm_res_id *res_id,
                           ldlm_type_t type, ldlm_policy_data_t *policy,
                           ldlm_mode_t mode, int *flags,
                           ldlm_blocking_callback blocking,
                           ldlm_completion_callback completion,
                           ldlm_glimpse_callback glimpse,
                           void *data, __u32 lvb_len, void *lvb_swabber,
                           const __u64 *client_cookie,
                           struct lustre_handle *lockh);

Lots of arguments, most of which are used to initialise the lock that is created.

The function starts by checking that the lock is not being replayed and that the supplied namespace is not a client (shadow) namespace.

A new lock is created by calling ldlm_lock_create(), passing the required namespace, resource id, lock type and mode, callback functions, opaque AST data and LVB length. A reference to the lock is obtained and its handle is stored into *lockh.

The resource and lock are locked and the lock’s LDLM_FL_LOCAL flag is set. If the LDLM_FL_ATOMIC_CB flag was set in *flags, it is set in the lock. The lock’s swabber is set to lvb_swabber and the lock and resource are unlocked.

If the policy or client_cookie arguments are non-NULL, they are stored into the appropriate lock fields. If the lock’s type is LDLM_EXTENT, the lock’s l_req_extent field is set from the policy’s l_extent field.

The lock is now enqueued by calling ldlm_lock_enqueue().

If policy is non-NULL, *policy is assigned the value of the lock’s l_policy_data field.

If the lock’s l_completion_ast function is non-NULL, the function is invoked (it will block until the lock is granted).

The lock is released and the function returns.
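
As a sketch, a server could take a local exclusive inode-bits lock roughly as follows. It is assumed that ldlm_blocking_ast and ldlm_completion_ast are the generic LDLM helper callbacks; the inode bits value is illustrative:

ldlm_policy_data_t policy = {
        .l_inodebits = { .bits = MDS_INODELOCK_UPDATE }, /* illustrative */
};
struct lustre_handle lockh;
int flags = 0;
int rc;

rc = ldlm_cli_enqueue_local(ns, &res_id, LDLM_IBITS, &policy, LCK_EX,
                            &flags, ldlm_blocking_ast,
                            ldlm_completion_ast, NULL /* glimpse */,
                            NULL /* data */, 0 /* lvb_len */,
                            NULL /* lvb_swabber */,
                            NULL /* client_cookie */, &lockh);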

3.5.4. ldlm_reprocess_all()

This function scans the queues of locks that are waiting on a particular resource and grants any that do not conflict with an existing lock.

void ldlm_reprocess_all(struct ldlm_resource *res);

This function only applies to server-side resources, so if it is invoked on a client-side resource it just returns without doing anything.

The resource is locked, ldlm_reprocess_queue() is called to process the resource’s converting and waiting queues, and then the resource is unlocked. For each lock that is granted, an item is added to a work list.

The work list is now passed to ldlm_run_ast_work() to actually send the RPC messages to the clients.

3.5.5. ldlm_reprocess_queue()

Scans a queue of locks waiting on a resource and grants any that can be granted.

int ldlm_reprocess_queue(struct ldlm_resource *res, struct list_head *queue,
                         struct list_head *work_list);

The arguments are:

res

the resource the locks are queued on

queue

the list of waiting locks

work_list

a list that each granted lock is added to

The decision as to whether each lock in the list can be granted or not is made by the resource’s policy function. This is looked up by resource type using ldlm_processing_policy_table.

The policy function is called for each entry in the list and, if it decides that the lock can be granted, it will add the lock to the work list.
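
A sketch of that loop, assuming the policy function signature shown in section 3.5.7 below (the real code also deals with lock references and more return codes):

/* Sketch: offer each queued lock to the resource's policy function; a
 * granted lock contributes an AST work item to work_list, a conflict
 * stops the scan. */
ldlm_processing_policy policy = ldlm_processing_policy_table[res->lr_type];
struct list_head *tmp, *nxt;
int rc = LDLM_ITER_CONTINUE;

list_for_each_safe(tmp, nxt, queue) {
        struct ldlm_lock *pending =
                list_entry(tmp, struct ldlm_lock, l_res_link);
        ldlm_error_t err;
        int flags = 0;

        rc = policy(pending, &flags, 0 /* ASTs already sent */,
                    &err, work_list);
        if (rc != LDLM_ITER_CONTINUE)
                break;
}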

3.5.6. ldlm_run_ast_work()

Sends an AST RPC for each item in the supplied work list.

int ldlm_run_ast_work(struct list_head *rpc_list, ldlm_desc_ast_t ast_type);

Arguments are:

rpc_list

the list of work items, one for each client lock

ast_type

indicates the type of AST (blocking, completion, revoking) to send, of type ldlm_desc_ast_t

typedef enum {
        LDLM_WORK_BL_AST,
        LDLM_WORK_CP_AST,
        LDLM_WORK_REVOKE_AST
} ldlm_desc_ast_t;

Firstly, ptlrpc_prep_set() is called to prepare the set of PTLRPC messages.

The type of the callback set and the work function that will create each callback item in the set are determined from the value of ast_type.

For each item in the work list, the work function is called. These functions basically just remove the lock from the appropriate list (l_bl_ast, l_cp_ast or l_rk_ast) and invoke the appropriate callback function (l_blocking_ast for blocking ASTs, l_completion_ast for completion ASTs or, for revoke ASTs, l_blocking_ast again; yes, that’s right) to build the RPC message and add it to the PTLRPC message set.

If the number of callback messages in the set reaches PARALLEL_AST_LIMIT, the messages are sent and a new set started. This continues until all the locks have been processed and then any residual messages in the set are sent.
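
The batching pattern looks roughly like this; work_ast_lock is a stand-in name for the ast_type-specific work function and the bookkeeping is simplified:

/* Sketch: pack ASTs into a PTLRPC set, flushing the set every
 * PARALLEL_AST_LIMIT messages and once more at the end. */
struct ptlrpc_request_set *set = ptlrpc_prep_set();
struct list_head *tmp, *nxt;
int ast_count = 0;

list_for_each_safe(tmp, nxt, rpc_list) {
        /* the work function unlinks the lock and adds its AST RPC to set */
        ast_count += work_ast_lock(tmp, set);
        if (ast_count >= PARALLEL_AST_LIMIT) {
                ptlrpc_set_wait(set);          /* send this batch */
                ptlrpc_set_destroy(set);
                set = ptlrpc_prep_set();
                ast_count = 0;
        }
}
if (ast_count > 0)
        ptlrpc_set_wait(set);                  /* flush the remainder */
ptlrpc_set_destroy(set);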

3.5.7. Resource policy functions

A resource policy function is called to process a grant request for a single lock instance of a given type.

The resource policy functions have the type ldlm_processing_policy:

typedef int (*ldlm_processing_policy)(struct ldlm_lock *lock, int *flags,
                                      int first_enq, ldlm_error_t *err,
                                      struct list_head *work_list);

The arguments are:

lock

the lock that is to be granted if no conflicts exist

flags

pointer to the requested flags (flags are read/write)

first_enq

non-zero if blocking ASTs have not yet been sent to any conflicting locks (so 0 means blocking ASTs have been sent already)

err

reference to error status value of type ldlm_error_t:

typedef enum {
        ELDLM_OK = 0,

        ELDLM_LOCK_CHANGED = 300,
        ELDLM_LOCK_ABORTED = 301,
        ELDLM_LOCK_REPLACED = 302,
        ELDLM_NO_LOCK_DATA = 303,

        ELDLM_NAMESPACE_EXISTS = 400,
        ELDLM_BAD_NAMESPACE    = 401
} ldlm_error_t;

work_list

if non-NULL, the work list that a lock’s AST work item will be added to if the lock is granted and it has a non-NULL l_completion_ast function

A separate resource policy function is defined for each lock type and ldlm_processing_policy_table maps lock types to policy functions:

static ldlm_processing_policy ldlm_processing_policy_table[] = {
        [LDLM_PLAIN] ldlm_process_plain_lock,
        [LDLM_EXTENT] ldlm_process_extent_lock,
#ifdef __KERNEL__
        [LDLM_FLOCK] ldlm_process_flock_lock,
#endif
        [LDLM_IBITS] ldlm_process_inodebits_lock,
};

3.5.7.1. ldlm_process_plain_lock()

The function starts by checking that the resource is locked and its list of converting locks is empty.

If blocking ASTs have already been sent to conflicting locks, ldlm_plain_compat_queue() is called to check that the lock does not conflict with any locks that are already on the resource’s granted and waiting lists. That function ignores any locks that come after the lock being tested. If a conflicting lock is found, the function just returns LDLM_ITER_STOP. If no conflicts are found, the lock is removed from the resource’s skip list and granted by calling ldlm_grant_lock(). If the work_list argument is non-NULL and the lock has a non-NULL l_completion_ast function, the lock’s l_cp_ast field will be linked into the work list. The function then returns LDLM_ITER_CONTINUE.

If blocking ASTs had not already been sent to conflicting locks, the resource’s granted and waiting lists are scanned using ldlm_plain_compat_queue() and if no conflicting locks are found, the lock is removed from the resource’s skip list and granted by calling ldlm_grant_lock() (the lock is not linked into work_list). The function then returns 0 (ELDLM_OK).

If conflicting locks were found, the lock is added to the resource’s waiting list (if not already there). The resource is unlocked, ldlm_run_ast_work() is called to send a blocking AST to all the conflicting locks and the resource locked again.

If ldlm_run_ast_work() returns -ERESTART, control jumps back to scan the granted and waiting queues again and the blocking ASTs are resent.

The LDLM_FL_BLOCK_GRANTED flag is set in *flags. The function then returns 0 (ELDLM_OK).

3.5.7.2. ldlm_plain_compat_queue()

Helper function that scans a queue of plain locks to see if they are compatible (do not conflict) with a given lock that is being tested for eligibility for granting.

static inline int
ldlm_plain_compat_queue(struct list_head *queue, struct ldlm_lock *req,
                        struct list_head *work_list);

Arguments are:

queue

the queue of locks being tested for compatibility

req

the lock that will be granted if no conflicts are found

work_list

an optional list that conflicting locks will be added to

The function starts by setting the result value, compat, to 1 (true). It then verifies that the requested lock mode (ldlm_mode_t) is sensible.

The remainder of the function is a loop over the locks in the queue.

If the queue member is the requested lock (req), the function returns the value of compat. Otherwise, the loop variable (tmp) is modified to address the last lock in the mode group. The requested lock mode is tested for compatibility with the mode of the queue member (and, therefore, compatibility with all the locks in the mode group) and, if compatible, the loop then continues to look at the first lock in the next mode group.

Seems to me that there is a possible bug here in that it only tests to see whether req is the first lock in a mode group and it doesn’t check to see if it appears later in the mode group?

An incompatible lock has been found in the queue. If work_list is NULL, nothing more is to be done and the function just returns 0 to indicate that the lock is not compatible. Otherwise, compat is set to 0 and the function loops over the tested lock and all of the other locks in the same mode group and those that have a non-NULL l_blocking_ast function are added to work_list so that they can be sent a blocking AST.

The loop continues to look at the first lock in the next mode group.
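
The mode-group skip is the heart of the scan. A sketch of the loop, assuming the l_sl_mode skiplist links (the group head’s l_sl_mode.prev points at the group tail) and the lockmode_compat() test:

/* Sketch of the queue scan: one compatibility test covers a whole mode
 * group because tmp is advanced to the group's last member. */
struct list_head *tmp;

list_for_each(tmp, queue) {
        struct ldlm_lock *lock =
                list_entry(tmp, struct ldlm_lock, l_res_link);

        if (lock == req)
                return compat;

        /* jump to the last lock of this mode group */
        tmp = &list_entry(lock->l_sl_mode.prev, struct ldlm_lock,
                          l_sl_mode)->l_res_link;

        if (lockmode_compat(lock->l_req_mode, req->l_req_mode))
                continue;                  /* whole group is compatible */

        /* ...conflict handling as described above... */
}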

3.5.7.3. ldlm_process_extent_lock()

The function starts by checking that the resource is locked, that the list of converting locks is empty, and that either the requested flags do not have LDLM_FL_DENY_ON_CONTENTION set or the grant candidate does not have the LDLM_AST_DISCARD_DATA flag set.

If blocking ASTs have already been sent to conflicting locks, ldlm_extent_compat_queue() is called to check that the lock does not conflict with any locks that are already on the resource’s granted and waiting lists. That function ignores any locks that come after the lock being tested. If a conflicting lock is found, the function just returns LDLM_ITER_STOP. If no conflicts are found, the lock is removed from the resource’s skip list and ldlm_extent_policy() is called to extend the lock’s extent as much as possible without creating a conflict with any lock in the resource’s granted or waiting lists. The lock is granted by calling ldlm_grant_lock(). If the work_list argument is non-NULL and the lock has a non-NULL l_completion_ast function, the lock’s l_cp_ast field will be linked into the work list. The function then returns LDLM_ITER_CONTINUE.

If blocking ASTs had not already been sent to conflicting locks, the resource’s granted and waiting lists are scanned using ldlm_extent_compat_queue() and if no conflicting locks are found, the lock’s extent is expanded through a call to ldlm_extent_policy(), the lock is removed from the resource’s skip list and granted by calling ldlm_grant_lock() (lock is not linked into work_list). The function then returns 0 (ELDLM_OK).

If conflicting locks were found, the lock is added to the resource’s waiting list (if not already there). The resource is unlocked, and ldlm_run_ast_work() is called to send a blocking AST to all the conflicting locks.

There is an OBD_FAIL check here that I don’t understand; it calls class_fail_export() when it fires.

The resource is then locked again.

If ldlm_run_ast_work() returns -ERESTART, control jumps back to scan the granted and waiting queues again and the blocking ASTs are resent. Before it does so, checks are made to see whether the lock was destroyed or granted while the resource was unlocked; if so, any AST work list items are discarded and the function returns.

Otherwise, the LDLM_FL_BLOCK_GRANTED and LDLM_FL_NO_TIMEOUT flags are set in *flags. The function then returns 0 (ELDLM_OK).

3.5.7.4. ldlm_extent_compat_queue()

Helper function that scans a queue of extent locks to see if they are compatible (do not conflict) with a given lock that is being tested for eligibility for granting.

static int
ldlm_extent_compat_queue(struct list_head *queue, struct ldlm_lock *req,
                         int *flags, ldlm_error_t *err,
                         struct list_head *work_list, int *contended_locks)

Arguments are:

queue

the queue of locks being tested for compatibility

req

the lock that will be granted if no conflicts are found

flags

pointer to the requested flags (flags are read/write)

err

reference to an error status value of type ldlm_error_t (shown above)

work_list

an optional list that conflicting locks will be added to

contended_locks

the number of contended locks found (check this)

The function starts by setting the result value, compat, to 1. It then verifies that the requested lock mode (ldlm_mode_t) is sensible.

The bulk of the function is split into two halves, one half is used when the granted queue is being tested, the other half when the waiting queue is being tested.

Granted locks compatibility testing

When testing for compatibility with granted locks, the code loops over the lock modes (EX, PW, PR, etc.).

For each lock mode, if the resource does not have an interval tree for that mode, there is no conflict and the loop continues to look at the next lock mode.

If the resource does have an interval tree for that mode and the mode is compatible with the requested mode and the requested mode is LCK_GROUP and the request’s group id (gid) is the same as the interval tree’s group id, the function returns the special value of 2 to indicate that the requested group lock is OK.

If the requested mode was compatible but not LCK_GROUP, the loop continues on to look at the next lock mode.

At this point, the requested lock has been determined to be incompatible with the lock mode being tested against. If that mode is LCK_GROUP and *flags has LDLM_FL_BLOCK_NOWAIT set (meaning, don’t wait if blocked), compat (the function’s return value) is assigned -EWOULDBLOCK and control jumps to code that destroys the (candidate) lock. If that flag is not set, LDLM_FL_NO_TIMEOUT is set in *flags to indicate that the candidate lock is blocked by a group lock with no timeout. If work_list is NULL, the function returns 0 to indicate conflict, otherwise the locks in the interval tree that have a non-NULL l_blocking_ast function are added to the work list and the loop continues on to look at the next lock mode.

If the lock mode being tested against is not LCK_GROUP and work_list is NULL, interval_is_overlapped() is called to check if the requested extent overlaps any existing extents in the tree. If so, the function returns 0 to indicate conflict. When work_list is non-NULL, interval_search() is called to find those locks in the interval tree whose extents conflict with the requested extent and if they have a non-NULL l_blocking_ast function, they are added to the work list. If work_list is not empty, compat is set to 0.

That’s the end of the loop.
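
The extent overlap test that the interval tree implements is simple; a minimal sketch of the predicate, assuming the ldlm_extent layout with inclusive start and end offsets:

/* Two extents overlap when neither one ends before the other starts. */
static int extents_overlap(const struct ldlm_extent *a,
                           const struct ldlm_extent *b)
{
        return a->start <= b->end && b->start <= a->end;
}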

Waiting locks compatibility testing

The code loops over the locks in the waiting queue. A flag, check_contention, is initialised to 1.

If the waiting lock is the requested lock, the loop terminates.

If the scan flag has been set (by code lower down in the loop on a previous iteration) it means that the candidate lock has a mode of LCK_GROUP and on a previous iteration an incompatible lock was found. In which case, if the current lock being tested is not a group lock (which means no more group locks are in the queue), the candidate lock is inserted into the waiting list before it, compat is set to 0 to indicate conflict and the loop terminates. If the lock being tested is a group lock and it has the same gid value as the candidate lock, the candidate lock is inserted into the waiting list after it, compat is set to 0 to indicate conflict and the loop terminates. Otherwise, the loop is continued to look at the next lock in the waiting list.

If scan was not set, the requested lock mode is tested for compatibility with the current lock’s mode. If the modes are compatible, it doesn’t matter whether the extents overlap. If the mode is PR (protected read) and the current lock’s extent totally encloses (or is identical to) the candidate lock’s extent and the current lock has not yet been sent a blocking AST, the function returns the value of compat.

If the candidate lock is not a group lock, the loop continues as overlaps do not matter as the modes are compatible.

For group locks with compatible modes, if the candidate lock has the same gid value as the current lock and that lock has already been granted, the function returns a value of 2 to indicate that the requested group lock is OK. If the current lock has not been granted and *flags has LDLM_FL_BLOCK_NOWAIT set, compat is set to -EWOULDBLOCK and control jumps to the code that destroys the (candidate) lock. Otherwise, the candidate lock is added to the waiting queue after the current lock and the function returns 0 to indicate conflict.

At this point, the requested lock mode is possibly incompatible with the current lock’s mode.

If the requested lock mode is LCK_GROUP and the current lock has not been granted, scan is set to 1 and compat is set to 0. If the current lock’s mode is not LCK_GROUP, that means that no more group locks follow in the queue and so the candidate lock is added into the queue before the current lock and the loop terminates. If the current lock is a group lock and it has the same gid value as the candidate lock, the candidate lock is added to the queue after the current lock and the loop terminates. Otherwise, the loop continues.

If the current lock’s mode is LCK_GROUP, the candidate lock is not compatible and (as before) either the lock is destroyed (if *flags has LDLM_FL_BLOCK_NOWAIT set) or *flags has LDLM_FL_NO_TIMEOUT set.

If the current lock’s mode is not LCK_GROUP and its extent does not overlap the candidate lock’s extent, the loop continues to look at the next lock. If the current lock’s originally requested extent (l_req_extent) does not overlap the candidate lock’s extent, check_contention is zeroed to indicate that the lock doesn’t really contend.

If work_list is NULL, the function returns 0 to indicate that a conflicting lock has been found.

If the current lock is a glimpse lock, check_contention is set to 0.

The value of check_contention is added to *contended_locks (passed back to caller) and compat is set to zero to indicate that a conflicting lock has been found.

If the current lock has a non-NULL l_blocking_ast function, it is added to work_list.

The loop continues on to look at the next lock in the waiting queue.

Common code following queue scanning loops

After the queues have been scanned, ldlm_check_contention() is called to determine if the resource is currently "contended" (i.e. the number of contended locks exceeds ns_contended_locks). If so, the resource’s lr_contention_time field is set to the current time. The function returns non-zero if lr_contention_time plus the namespace’s ns_contention_time is in the future, i.e. the resource is currently contended.
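
A sketch of that check, assuming the libcfs time helpers and the field names used above:

/* Sketch of ldlm_check_contention(): refresh the contention timestamp
 * when enough contended locks were seen, then report whether the
 * contention window is still open. */
static int check_contention(struct ldlm_resource *res, int contended_locks)
{
        struct ldlm_namespace *ns = res->lr_namespace;
        cfs_time_t now = cfs_time_current();

        if (contended_locks > ns->ns_contended_locks)
                res->lr_contention_time = now;

        return cfs_time_before(now,
                       cfs_time_add(res->lr_contention_time,
                                    cfs_time_seconds(ns->ns_contention_time)));
}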

If the resource is contended and the LDLM_FL_DENY_ON_CONTENTION flag is set and the requested mode is not LCK_GROUP and the length of the requested extent is not greater than the limit set in the namespace ns_max_nolock_size field, the lock is destroyed and -EUSERS is returned.

At the end of the function is the code to destroy the lock and return the value of compat.

3.5.7.5. ldlm_process_flock_lock()

FIXME

3.5.7.6. ldlm_process_inodebits_lock()

This function is identical to ldlm_process_plain_lock() except that it calls ldlm_inodebits_compat_queue() rather than ldlm_plain_compat_queue().

3.5.7.7. ldlm_inodebits_compat_queue()

Helper function that scans a queue of inodebits locks to see if they are compatible (do not conflict) with a given lock that is being tested for eligibility for granting.

static inline int
ldlm_inodebits_compat_queue(struct list_head *queue, struct ldlm_lock *req,
                            struct list_head *work_list);

Arguments are:

queue

the queue of locks being tested for compatibility

req

the lock that will be granted if no conflicts are found

work_list

an optional list that conflicting locks will be added to

The function starts by setting the result value, compat, to 1 (true). The requested inodebits are retrieved from the request’s l_policy_data field (l_inodebits), stored in local req_bits and checked to see that some bits are set.

It then verifies that the requested lock mode (ldlm_mode_t) is sensible.

The remainder of the function is a loop over the locks in the queue. The queued locks are grouped by mode and within the mode groups, grouped by policy.

If the queue member is the requested lock (req), the function returns the value of compat. Otherwise, if the requested lock mode is compatible with the mode of the queue member (and, therefore, with all the locks in the mode group), the loop variable (tmp) is modified to address the last lock in the mode group and the loop then continues to look at the first lock in the next mode group.

If the requested lock’s mode is not compatible, more checks are done:

If the queued lock’s mode is COS, a check is done to see that the requested lock and the queued lock are from the same client (their l_client_cookie fields are identical). If this test fails and work_list is NULL, the function returns 0 (not compatible); if work_list is non-NULL, compat is zeroed, the conflicting lock is added to work_list if it has a non-NULL l_blocking_ast function, and the mode loop continues.

Next, the requested lock is tested for inodebit compatibility with all the locks in the mode group. In each mode group, the locks are grouped by policy (locks with the same policy have the same inodebits), so req_bits is ANDed with the l_inodebits value of one of the locks in the policy group and, if the result is non-zero, there is a conflict. In that case, zero is returned if work_list is NULL; otherwise compat is set to zero and all the locks in the policy group that have a non-NULL l_blocking_ast function are added to work_list. This testing of inodebit compatibility is repeated for each of the policy groups in the mode group.

The loop continues to look at the first lock in the next mode group.

Finally, the value of compat is returned.
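
The bit test at the heart of the policy-group check is just a mask intersection; a minimal sketch, assuming the l_inodebits member of the lock’s policy data union:

/* Two inodebits locks with conflicting modes actually conflict only if
 * their inodebit masks intersect. */
static int ibits_conflict(__u64 req_bits, const struct ldlm_lock *queued)
{
        return (req_bits & queued->l_policy_data.l_inodebits.bits) != 0;
}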

3.6. Lock enqueuing - client and server-side

3.6.1. ldlm_lock_enqueue()

This function enqueues a lock onto either the granted, converting or waiting lists. It may be executed on either the client or server side and its behaviour differs depending on where it is being executed.

ldlm_error_t ldlm_lock_enqueue(struct ldlm_namespace *ns,
                               struct ldlm_lock **lockp,
                               void *cookie, int *flags);

The arguments (all passed by reference) are:

ns

the namespace containing the resource being locked

lockp

a doubly indirect pointer to the lock

cookie

(optional) intent data that is passed to the namespace’s policy function

flags

flags returned from the server

Firstly, the lock’s l_last_activity value is updated to the current time.

If the lock is not being replayed or the function is being executed on the server, and the namespace has a policy function, the lock processing is delegated to the namespace’s policy function. If that function’s return code indicates that the lock has been replaced (and returned to the caller via *lockp) and it is different from the original lock, the original lock is destroyed, LDLM_FL_LOCK_CHANGED is set in the caller’s flags and the function returns.

If the policy function returned an error code or the flags had LDLM_FL_INTENT_ONLY set (which means don’t grant the lock, just do the intent processing), the lock is destroyed and the function returns.

Next, if this function is replaying on the server and the resource type is LDLM_EXTENT, the interval node is allocated here for later use.

The resource and lock are locked.

If the function is executing on the client and the lock’s granted mode is equal to its requested mode then it means that the lock has already been granted so nothing else needs to be done apart from clearing the various LDLM_FL_BLOCK_* flags and returning.

A call to ldlm_resource_unlink_lock() removes the lock from various lists depending on the lock’s type.

If the lock is an extent lock and its l_tree_node member is NULL, the previously allocated interval node is attached to the lock. This attachment doesn’t appear to be guarded by the same conditions as the allocation above, so node could be NULL?

Next, some flags from the reply (masked by LDLM_AST_DISCARD_DATA) are ORed into the lock’s flags.

What happens next depends on where this function is being executed:

on the client

if LDLM_FL_BLOCK_CONV is set in the reply flags, the lock is added to the resource’s converting list

or if LDLM_FL_BLOCK_WAIT or LDLM_FL_BLOCK_GRANTED are set, the lock is added to the resource’s waiting list

otherwise, the lock is granted by calling ldlm_grant_lock()

in all cases, this function returns

on the server

if not replaying, control falls through to the following code

otherwise, if the flags have LDLM_FL_BLOCK_CONV or LDLM_FL_BLOCK_WAIT set, the lock is added to the resource’s converting or waiting list and this function returns

or if the flags have LDLM_FL_BLOCK_GRANTED set, the lock is granted and this function returns
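
Before moving on to the server-only path, note that the client-side branch above reduces to a simple dispatch on the reply flags; a sketch, assuming the lr_converting and lr_waiting list names:

/* Sketch of the client-side dispatch on the server's reply flags. */
if (*flags & LDLM_FL_BLOCK_CONV)
        ldlm_resource_add_lock(res, &res->lr_converting, lock);
else if (*flags & (LDLM_FL_BLOCK_WAIT | LDLM_FL_BLOCK_GRANTED))
        ldlm_resource_add_lock(res, &res->lr_waiting, lock);
else
        ldlm_grant_lock(lock, NULL);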

Control only gets to this point when executing on the server and not replaying.

The policy function for the resource type is looked up and invoked:

        policy = ldlm_processing_policy_table[res->lr_type];
        policy(lock, flags, 1, &rc, NULL);

The resource and lock are unlocked.

If the interval node allocated earlier wasn’t actually used, it is now freed.

The function returns.

3.6.2. ldlm_grant_lock()

This function adds a lock to the resource’s granted list and updates the lock’s mode.

void ldlm_grant_lock(struct ldlm_lock *lock, struct list_head *work_list);

After checking that the lock is spinlocked, it updates the lock’s granted mode (l_granted_mode) to be the same as the requested mode (l_req_mode). What happens next depends on the lock’s type:

plain
ibits

the supplied lock is added to the resource’s granted list by calling ldlm_grant_lock_with_skiplist() which determines the appropriate position in the list for the lock depending on its type (and inodebits for ibits type)

extent

the lock is added to the resource’s granted list (only for debug purposes) by calling ldlm_extent_add_lock() which also adds the lock into the resource’s interval tree (lr_itree) that orders the extent locks.

flock

the lock is simply added to the resource’s granted list

If the lock is the most restrictive lock (as determined by its mode) that has been applied to the resource, the resource’s lr_most_restr field is updated.

If the work_list argument is non-NULL and the lock’s l_completion_ast field is non-NULL, an AST work item for the lock is added to work_list.

Finally, the lock is added to the namespace’s pool with a call to ldlm_pool_add(). Oddly, that function doesn’t appear to save the lock anywhere; how does it work?
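
In outline, the whole function is a small dispatch on the lock type; a sketch, using the helpers named above (ldlm_add_ast_work_item() is an assumed internal helper):

/* Sketch of the grant step: promote the requested mode to granted and
 * link the lock into the resource's granted bookkeeping by type. */
lock->l_granted_mode = lock->l_req_mode;

if (res->lr_type == LDLM_PLAIN || res->lr_type == LDLM_IBITS)
        ldlm_grant_lock_with_skiplist(lock);
else if (res->lr_type == LDLM_EXTENT)
        ldlm_extent_add_lock(res, lock);    /* also updates lr_itree */
else
        ldlm_resource_add_lock(res, &res->lr_granted, lock);

if (lock->l_granted_mode < res->lr_most_restr)
        res->lr_most_restr = lock->l_granted_mode;

if (work_list != NULL && lock->l_completion_ast != NULL)
        ldlm_add_ast_work_item(lock, NULL, work_list);

ldlm_pool_add(&res->lr_namespace->ns_pool, lock);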

3.7. Lock cancelling - client-side

Lock cancellation on the client side starts with a call to ldlm_cli_cancel() which is passed a handle to the lock that is to be cancelled. Clients try to keep locks as long as possible and normally only relinquish them on receipt of a blocking AST.

3.7.1. ldlm_cli_cancel()

Client-side lock cancel. When called, the lock must no longer have any readers or writers.

int ldlm_cli_cancel(struct lustre_handle *lockh);

Takes a single argument which is the handle for the lock to be cancelled.

The function starts by calling ldlm_handle2lock_long() to atomically obtain a pointer to the lock and set the LDLM_FL_CANCELING flag. If that flag is already set in the lock, it means that the lock is already being destroyed so this function just returns 0.

Next, ldlm_cli_cancel_local() is called to cancel the lock locally. If that function returns LDLM_FL_LOCAL_ONLY (when it’s a server-side lock, or a client-side lock that has the LDLM_FL_LOCAL_ONLY or LDLM_FL_CANCEL_ON_BLOCK flag set), the lock is released and the function returns as nothing more needs to be done.

The lock is added as the first element in a local list of locks to be cancelled on the server. If the lock’s server is capable of handling a set of cancel requests, the namespace is scanned (by ldlm_cancel_lru_local()) for further candidates to be cancelled and they are added to the list. The list is processed by ldlm_cli_cancel_list() which will send a cancel request containing the handles of the locks to be cancelled to the server.

3.7.2. ldlm_cli_cancel_local()

Cancels a lock locally.

static int ldlm_cli_cancel_local(struct ldlm_lock *lock);

If lock is a client-side lock (in a shadow namespace) it is locked and its LDLM_FL_CBPENDING flag is set to indicate that the lock is being destroyed. If the lock has either LDLM_FL_LOCAL_ONLY or LDLM_FL_CANCEL_ON_BLOCK flags set, this function will return LDLM_FL_LOCAL_ONLY to indicate that an RPC does not need to be sent to the server.

Next, ldlm_cancel_callback() is called to invoke the cancellation callback, l_blocking_ast (if it is non-NULL) passing the LDLM_CB_CANCELING flag to indicate that the lock is being cancelled. It also sets the LDLM_FL_BL_DONE flag to indicate that the lock cache has been dropped.

Back in ldlm_cli_cancel_local(), the return code is set to LDLM_FL_BL_AST if that flag is set in the lock, and to LDLM_FL_CANCELING otherwise.

The lock is unlocked and ldlm_lock_cancel() is called.

If the lock is a server-side lock, ldlm_lock_cancel() is called and, when that returns, ldlm_reprocess_all() is called to process the resource’s waiting and converting queues; the function then returns LDLM_FL_LOCAL_ONLY to indicate that an RPC does not need to be sent.

3.7.3. ldlm_lock_cancel()

Cancel a lock in the local namespace that has no readers or writers.

void ldlm_lock_cancel(struct ldlm_lock *lock);

The resource and lock are locked and a check is made to see that the lock has no readers or writers.

The lock is removed from the waiting locks list by calling ldlm_del_waiting_lock().

ldlm_cancel_callback() is called to invoke the cancellation callback (if that hasn’t been done already).

ldlm_del_waiting_lock() is called for a second time in case the lock had been added to the waiting locks list while ldlm_cancel_callback() was executing.

ldlm_resource_unlink_lock() is called to remove the lock from the resource’s skiplists (when the type is LDLM_IBITS or LDLM_PLAIN) or from the resource’s interval tree (when the type is LDLM_EXTENT). The lock is also removed from the resource’s list of locks (via the l_res_link field).

ldlm_lock_destroy_nolock() is called to destroy the lock (via a call to ldlm_lock_destroy_internal()). That function sets the l_destroyed field to 1 to indicate that the lock is destroyed and that it will be freed when the last reference to it goes away. It also removes the lock from its export lock hash (if appropriate), removes the lock from the namespace’s unused (LRU) list and disassociates the lock from its hash value.

If the lock was granted, it is removed from the namespace’s pool by calling ldlm_pool_del().

Finally, the lock’s l_granted_mode is set to LCK_MINMODE and the lock and resource are unlocked.

3.7.4. ldlm_cli_cancel_list()

Client-side function that either packs lock handles into a supplied request buffer or sends a batched lock cancel RPC to the server.

int ldlm_cli_cancel_list(struct list_head *cancels, int count,
                         struct ptlrpc_request *req, int flags);

Its arguments are:

cancels

the list of locks to cancel

count

the number of locks in the list

req

if non-NULL, the ptlrpc_request that the lock handles should be packed into

flags

flags that are passed down to the function that sends the batched RPC

Note that the comment in the code mentions another argument, off, which doesn’t exist.

The function loops around while the list of locks to be cancelled, cancels, still contains entries.

At the top of the loop, the first lock in the list is obtained and if the lock’s server is capable of handling a set of cancel requests, either ldlm_cancel_pack() is called to pack lock handles into the supplied req or ldlm_cli_cancel_req() is called to prepare and send a batched cancel RPC to the server. The flags argument is passed to that function and if LDLM_FL_ASYNC is not set, the function will queue the RPC for sending and return immediately. Otherwise, the function waits for a response. Up to count handles are batched together.

If the lock’s server doesn’t support batched cancel requests, ldlm_cli_cancel_req() is called to process just a single lock handle.

The value of count is decremented by the number of locks processed, those locks are removed from cancels and the loop continues.
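
Schematically, the loop looks like this; the argument orders of ldlm_cancel_pack() and ldlm_cli_cancel_req() are assumed here rather than verified:

/* Schematic of the cancel-list loop: pack into the caller's request if
 * one was supplied, otherwise send batched (or single) cancel RPCs. */
while (count > 0) {
        struct ldlm_lock *lock =
                list_entry(cancels->next, struct ldlm_lock, l_bl_ast);
        int done;

        if (exp_connect_cancelset(lock->l_conn_export)) {
                if (req != NULL)
                        done = ldlm_cancel_pack(req, cancels, count);
                else
                        done = ldlm_cli_cancel_req(lock->l_conn_export,
                                                   cancels, count, flags);
        } else {
                /* the server cannot batch: one handle at a time */
                done = ldlm_cli_cancel_req(lock->l_conn_export,
                                           cancels, 1, flags);
        }

        /* the processed locks are unlinked from cancels here */
        count -= done;
}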

3.8. Lock cancelling - server-side

As part of the initialisation carried out by ldlm_setup(), a PTLRPC service (type ptlrpc_service) is created to service lock cancellation requests. That service’s receive handler is ldlm_cancel_handler() and it invokes ldlm_handle_cancel() to process the requests.

3.8.1. ldlm_handle_cancel()

Main server-side entry point for processing lock cancellation requests.

int ldlm_handle_cancel(struct ptlrpc_request *req);

The function is passed the PTLRPC request, req, that contains the handles of the locks to be cancelled.

The function starts by calling req_capsule_client_get() to obtain the request in the form of an ldlm_request:

#define LDLM_LOCKREQ_HANDLES 2

struct ldlm_request {
        __u32 lock_flags;
        __u32 lock_count;
        struct ldlm_lock_desc lock_desc;
        struct lustre_handle lock_handle[LDLM_LOCKREQ_HANDLES];
};

If appropriate, procfs counters are updated.

req_capsule_server_pack() is called to pack the server’s reply. ??? I don’t understand what it’s packing here because the cancel requests haven’t been processed yet.

ldlm_request_cancel() is now called to cancel the locks and ptlrpc_reply() is called to send the reply to the client.

3.8.2. ldlm_request_cancel()

Cancel all the locks whose handles are in an ldlm_request.

int ldlm_request_cancel(struct ptlrpc_request *req,
                        const struct ldlm_request *dlm_req, int first);

Arguments are:

req

the PTLRPC request

dlm_req

the ldlm_request containing the array of lock handles to be cancelled

first

the index of the first handle to process

The number of lock handles to process, count, is determined and if that is less than first, the function returns.

If replaying, nothing needs to be done, so the function returns.

For each of the count locks to process, ldlm_handle2lock() is called to obtain a reference to the lock to process and the lock’s resource is stored in local res. Another local variable, pres, is the resource used in the last iteration of the loop (NULL initially).

If res is different from pres (the locks have different resources) and if pres is not NULL, ldlm_reprocess_all() is called to process the resource’s waiting and converting queues.

If res is not NULL (it seems it never could be NULL?), ldlm_res_lvbo_update() is called to invoke the namespace’s lvbo_update function (if it has one defined).

The lock is cancelled locally by calling ldlm_lock_cancel().

The loop continues to look at the next lock handle.

After the loop, if pres is not NULL, ldlm_reprocess_all() is called to process the resource’s waiting and converting queues.

The function returns.