#include <cl_object.h>
Data Fields | |
| cfs_atomic_t | cll_ref |
| Reference counter. | |
| cfs_list_t | cll_layers |
| List of slices. | |
| cfs_list_t | cll_linkage |
| Linkage into cl_lock::cll_descr::cld_obj::coh_locks list. | |
| cl_lock_descr | cll_descr |
| Parameters of this lock. | |
| enum cl_lock_state | cll_state |
| Protected by cl_lock::cll_guard. | |
| cfs_waitq_t | cll_wq |
| signals state changes. | |
| cfs_mutex_t | cll_guard |
| Recursive lock, most fields in cl_lock{} are protected by this. | |
| cfs_task_t * | cll_guarder |
| int | cll_depth |
| cfs_task_t * | cll_intransit_owner |
| the owner for INTRANSIT state | |
| int | cll_error |
| int | cll_holds |
| Number of holds on a lock. | |
| int | cll_users |
| Number of lock users. | |
| unsigned long | cll_flags |
| Flag bit-mask. | |
| cfs_list_t | cll_inclosure |
| A linkage into a list of locks in a closure. | |
| cl_lock * | cll_conflict |
| Conflict lock at queuing time. | |
| lu_ref | cll_reference |
| A list of references to this lock, for debugging. | |
| lu_ref | cll_holders |
| A list of holds on this lock, for debugging. | |
| lu_ref_link * | cll_obj_ref |
| A reference for cl_lock::cll_descr::cld_obj. | |
LAYERING
The locking model of the new client code is built around the struct cl_lock data type, which represents an extent lock on a regular file. cl_lock is a layered object (much like cl_object and cl_page): it consists of a header (struct cl_lock) and a list of layers (struct cl_lock_slice), linked into the cl_lock::cll_layers list through cl_lock_slice::cls_linkage.
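The header-plus-slices layout can be sketched in a few lines of C. This is only an illustration under stated assumptions: sk_list is a hypothetical stand-in for cfs_list_t, the sk_* structures keep only the fields named above (cll_layers, cls_linkage), and the owner back-pointer and all sk_* helpers are invented for the example rather than taken from cl_object.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical circular doubly-linked list, standing in for cfs_list_t. */
struct sk_list {
    struct sk_list *next;
    struct sk_list *prev;
};

static void sk_list_init(struct sk_list *head)
{
    head->next = head;
    head->prev = head;
}

static void sk_list_add_tail(struct sk_list *item, struct sk_list *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Recover the enclosing structure from a pointer to its embedded link. */
#define sk_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Sketch of the header: only the list of layers is modelled. */
struct sk_lock {
    struct sk_list cll_layers;
};

/* Sketch of one layer; "owner" is an assumed back-pointer to the header. */
struct sk_slice {
    struct sk_lock *owner;
    struct sk_list  cls_linkage;
};

static void sk_lock_init(struct sk_lock *lock)
{
    sk_list_init(&lock->cll_layers);
}

/* Append a layer below the ones already attached to the header. */
static void sk_slice_add(struct sk_lock *lock, struct sk_slice *slice)
{
    assert(lock != NULL && slice != NULL);
    slice->owner = lock;
    sk_list_add_tail(&slice->cls_linkage, &lock->cll_layers);
}

static int sk_layer_count(const struct sk_lock *lock)
{
    const struct sk_list *pos;
    int n = 0;

    for (pos = lock->cll_layers.next; pos != &lock->cll_layers; pos = pos->next)
        n++;
    return n;
}

/* First (top-most) layer, or NULL if no layer has been attached yet. */
static struct sk_slice *sk_top_slice(struct sk_lock *lock)
{
    if (lock->cll_layers.next == &lock->cll_layers)
        return NULL;
    return sk_container_of(lock->cll_layers.next, struct sk_slice, cls_linkage);
}
```

Embedding the link inside each slice means that attaching a layer needs no allocation, and generic code can walk all layers without knowing their concrete types.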
All locks for a given object are linked into cl_object_header::coh_locks list (protected by cl_object_header::coh_lock_guard spin-lock) through cl_lock::cll_linkage. Currently this list is not sorted in any way. It could be sorted by starting lock offset, or replaced by an altogether different data structure such as a tree.
Typical cl_lock consists of the two layers:
lov_lock contains an array of sub-locks. Each of these sub-locks is a normal cl_lock: it has a header (struct cl_lock) and a list of layers:
Each sub-lock is associated with a cl_object (representing a stripe sub-object, or the file with which the top-level cl_lock is associated), and is linked into that cl_object::coh_locks. In this respect cl_lock is similar to cl_object (which at the lov layer also fans out into multiple sub-objects), and is different from cl_page, which doesn't fan out (there is usually exactly one osc_page for every vvp_page). We shall call the vvp-lov portion of the lock a "top-lock" and its lovsub-osc portion a "sub-lock".
LIFE CYCLE
cl_lock is reference counted. When reference counter drops to 0, lock is placed in the cache, except when lock is in CLS_FREEING state. CLS_FREEING lock is destroyed when last reference is released. Referencing between top-lock and its sub-locks is described in the lov documentation module.
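The rule for the last reference can be written down as a small decision function. The sketch below is an assumption-laden model, not the code in cl_lock.c: it uses a plain int instead of cfs_atomic_t (so it is not thread-safe), collapses every state other than CLS_FREEING into one value, and all sk_* names are hypothetical.

```c
#include <assert.h>

/* Only the distinction that matters for the last put is modelled. */
enum sk_state { SK_NOT_FREEING, SK_FREEING };

/* What happened to the lock as a result of dropping a reference. */
enum sk_fate { SK_STILL_REFERENCED, SK_CACHED, SK_DESTROYED };

struct sk_ref_lock {
    int           ref;   /* stands in for cll_ref; non-atomic in this sketch */
    enum sk_state state;
};

static void sk_get(struct sk_ref_lock *lock)
{
    lock->ref++;
}

/* Drop one reference and report what the life-cycle rule prescribes. */
static enum sk_fate sk_put(struct sk_ref_lock *lock)
{
    assert(lock->ref > 0);
    if (--lock->ref > 0)
        return SK_STILL_REFERENCED;
    /* Last reference: a CLS_FREEING lock is destroyed, any other is cached. */
    return lock->state == SK_FREEING ? SK_DESTROYED : SK_CACHED;
}
```

For example, a lock with two references that is not being freed reports SK_STILL_REFERENCED on the first sk_put() and SK_CACHED on the second.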
STATE MACHINE
Also, cl_lock is a state machine. This requires some clarification. One of the goals of the client IO re-write was to make the IO path non-blocking, or at least to make it easier to make it non-blocking in the future. Here "non-blocking" means that when a system call (read, write, truncate) reaches a situation where it has to wait for a communication with the server, it should, instead of waiting, remember its current state and switch to some other work. E.g., instead of waiting for a lock enqueue, the client should proceed doing IO on the next stripe, etc. Obviously this is a rather radical redesign, and it is not planned to be fully implemented at this time. Instead we are putting some infrastructure in place that would make it easier to do asynchronous non-blocking IO in the future.
Specifically, where the old locking code goes to sleep (waiting for enqueue, for example), the new code returns cl_lock_transition::CLO_WAIT. When the enqueue reply comes, its completion handler signals that the lock state-machine is ready to transit to the next state. There is some generic code in cl_lock.c that sleeps, waiting for these signals. As a result, for users of this cl_lock.c code, it looks like locking is done in the normal blocking fashion, and at the same time it is possible to switch to non-blocking locking (simply by returning cl_lock_transition::CLO_WAIT from cl_lock.c functions).
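A single-threaded model may help to see how a "try" step that returns a wait code combines with a generic waiting loop. Everything here is hypothetical: SK_CLO_WAIT merely mimics cl_lock_transition::CLO_WAIT, the sk_* functions are not the cl_lock.c API, and the "sleep" is simulated by delivering the reply immediately so that the sketch runs without threads.

```c
#include <assert.h>

enum { SK_OK = 0, SK_CLO_WAIT = 1 };

struct sk_sm_lock {
    int enqueued;       /* the enqueue request has been sent */
    int reply_arrived;  /* set by the completion handler */
    int waits;          /* how many times the generic code had to wait */
};

/* Non-blocking step: advance as far as possible, never sleep. */
static int sk_enqueue_try(struct sk_sm_lock *lock)
{
    if (!lock->enqueued) {
        lock->enqueued = 1;   /* "send" the request */
        return SK_CLO_WAIT;   /* caller may wait or do other work */
    }
    if (!lock->reply_arrived)
        return SK_CLO_WAIT;
    return SK_OK;
}

/* Completion handler: signals that the state machine can move on. */
static void sk_enqueue_completion(struct sk_sm_lock *lock)
{
    lock->reply_arrived = 1;
}

/* Stand-in for sleeping on the wait queue; here the reply arrives at once. */
static void sk_wait(struct sk_sm_lock *lock)
{
    assert(lock->enqueued);
    lock->waits++;
    sk_enqueue_completion(lock);
}

/* Generic blocking wrapper: to its caller, locking looks synchronous. */
static int sk_enqueue_blocking(struct sk_sm_lock *lock)
{
    int rc;

    while ((rc = sk_enqueue_try(lock)) == SK_CLO_WAIT)
        sk_wait(lock);
    return rc;
}
```

A caller that wants the non-blocking behaviour uses sk_enqueue_try() directly and switches to other work on SK_CLO_WAIT; a caller that wants the traditional behaviour uses sk_enqueue_blocking().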
For a description of state machine states and transitions see enum cl_lock_state.
There are two ways to restrict a set of states which lock might move to:
A user is used to ensure that the lock is not canceled or destroyed while it is being enqueued, or actively used by some IO.
Currently, a user always comes with a hold (cl_lock_invariant() checks that the number of holds is not less than the number of users).
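The relationship between the two counters can be illustrated with a small model. It is a sketch under assumptions: only cll_holds and cll_users are modelled, the sk_* helpers are hypothetical names, and the cancellation check reflects only the statement that a hold prevents cancellation, not the full logic of cl_lock.c.

```c
#include <assert.h>

struct sk_hu_lock {
    int cll_holds;  /* number of holds */
    int cll_users;  /* number of users */
};

/* The part of the invariant described in the text: holds >= users >= 0. */
static int sk_invariant(const struct sk_hu_lock *lock)
{
    return lock->cll_users >= 0 && lock->cll_holds >= lock->cll_users;
}

static void sk_hold(struct sk_hu_lock *lock)
{
    lock->cll_holds++;
    assert(sk_invariant(lock));
}

static void sk_unhold(struct sk_hu_lock *lock)
{
    lock->cll_holds--;
    assert(sk_invariant(lock));
}

/* A user always comes with a hold, so the hold is taken first. */
static void sk_use(struct sk_hu_lock *lock)
{
    sk_hold(lock);
    lock->cll_users++;
    assert(sk_invariant(lock));
}

/* Release in the opposite order so the invariant holds at every step. */
static void sk_unuse(struct sk_hu_lock *lock)
{
    lock->cll_users--;
    sk_unhold(lock);
}

/* Assumed rule: cancellation may proceed only when no hold remains. */
static int sk_can_cancel(const struct sk_hu_lock *lock)
{
    return lock->cll_holds == 0;
}
```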
CONCURRENCY
This is how lock state-machine operates. struct cl_lock contains a mutex cl_lock::cll_guard that protects struct fields.
Top-lock and sub-lock have separate mutexes, and the latter has to be taken first to avoid dead-lock.
To see an example of the interaction of all these issues, take a look at the lov_cl.c:lov_lock_enqueue() function. It is called as a part of cl_enqueue_try(), and tries to advance the top-lock to the ENQUEUED state by advancing the state-machines of its sub-locks (lov_lock_enqueue_one()). Note also that it uses trylock to grab the sub-lock mutex to avoid dead-lock. It also has to handle CEF_ASYNC enqueue, when sub-lock enqueues have to be done in parallel, rather than one after another (this is used for glimpse locks, which cannot dead-lock).
INTERFACE AND USAGE
struct cl_lock_operations provides a number of call-backs that are invoked when events of interest occur. Layers can intercept and handle glimpse, blocking, cancel ASTs and the reception of the reply from the server.
One important difference from the old client locking model is that the new client has a representation for the top-lock, whereas in the old code only sub-locks existed as real data structures and file-level locks were represented by "request sets" that were created and destroyed on each and every lock creation.
Top-locks are cached, and can be found in the cache by the system calls. It is possible that top-lock is in cache, but some of its sub-locks were canceled and destroyed. In that case top-lock has to be enqueued again before it can be used.
Overall process of the locking during IO operation is as follows:
Striping introduces major additional complexity into locking. The fundamental problem is that it is generally unsafe to actively use (hold) two locks on the different OST servers at the same time, as this introduces inter-server dependency and can lead to cascading evictions.
Basic solution is to sub-divide large read/write IOs into smaller pieces so that no multi-stripe locks are taken (note that this design abandons POSIX read/write semantics). Such pieces ideally can be executed concurrently. At the same time, certain types of IO cannot be sub-divided without sacrificing correctness. This includes:
Also, in the case of read(fd, buf, count) or write(fd, buf, count), where buf is a part of memory mapped Lustre file, a lock or locks protecting buf has to be held together with the usual lock on [offset, offset + count].
As multi-stripe locks have to be allowed, it makes sense to cache them, so that, for example, a sequence of O_APPEND writes can proceed quickly without going down to the individual stripes to do lock matching. On the other hand, multi-stripe locks shouldn't be used by normal read/write calls. To achieve this, every layer can implement the ->clo_fits_into() method, which is called by the lock matching code (cl_lock_lookup()), and which can be used to selectively disable matching of certain locks for certain IOs. For example, the lov layer implements lov_lock_fits_into(), which allows multi-stripe locks to be matched only for truncates and O_APPEND writes.
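The matching rule can be sketched as a predicate consulted by a lookup loop. This is an illustration only: the real ->clo_fits_into() and cl_lock_lookup() signatures are not reproduced here, the sk_* types and the stripe count field are assumptions, and the cache is modelled as a plain array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of the IO that is looking for a lock. */
enum sk_io_kind { SK_IO_READ, SK_IO_WRITE, SK_IO_APPEND, SK_IO_TRUNC };

/* Hypothetical cached lock: only the number of covered stripes is kept. */
struct sk_cached_lock {
    int stripes;
};

typedef int (*sk_fits_into_t)(const struct sk_cached_lock *, enum sk_io_kind);

/*
 * Rule described in the text for the lov layer: a multi-stripe lock may be
 * matched only by truncates and O_APPEND writes.
 */
static int sk_lov_fits_into(const struct sk_cached_lock *lock,
                            enum sk_io_kind io)
{
    if (lock->stripes <= 1)
        return 1;
    return io == SK_IO_TRUNC || io == SK_IO_APPEND;
}

/* Return the index of the first cached lock accepted for this IO, or -1. */
static int sk_lock_lookup(const struct sk_cached_lock *cache, int n,
                          enum sk_io_kind io, sk_fits_into_t fits)
{
    assert(cache != NULL || n == 0);
    for (int i = 0; i < n; i++)
        if (fits(&cache[i], io))
            return i;
    return -1;
}
```

With a cache holding a three-stripe lock followed by a single-stripe lock, a plain read skips the first entry and matches the second, while an O_APPEND write matches the first.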
Interaction with DLM
In the expected setup, cl_lock is ultimately backed up by a collection of DLM locks (struct ldlm_lock). The association between cl_lock and DLM lock is implemented in the osc layer, which also maps DLM events (ASTs, cancellation, etc.) onto cl_lock_operations calls. See struct osc_lock for a more detailed description of the interaction with DLM.
| struct cl_lock_descr cl_lock::cll_descr |
Parameters of this lock.
Protected by cl_lock::cll_descr::cld_obj::coh_lock_guard nested within cl_lock::cll_guard. Modified only on lock creation and in cl_lock_modify().
| unsigned long cl_lock::cll_flags |
Flag bit-mask.
Values from enum cl_lock_flags. Updates are protected by cl_lock::cll_guard.
| cfs_mutex_t cl_lock::cll_guard |
Recursive lock, most fields in cl_lock{} are protected by this.
Locking rules: this mutex is never held across network communication, except when lock is being canceled.
Lock ordering: a mutex of a sub-lock is taken first, then a mutex of a top-lock. The other direction is implemented through a try-lock-repeat loop. Mutexes of unrelated locks can be taken only by try-locking.
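A try-lock-repeat loop for the reverse direction might look like the sketch below. It rests on several assumptions: sk_mutex is a toy lock built on C11 atomic_flag standing in for cfs_mutex_t (it is neither recursive nor sleeping), the loop releases the top-lock mutex before retrying, and the retry bound exists only so that the example terminates. None of the sk_* names come from the Lustre sources.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy mutex standing in for cfs_mutex_t; spins instead of sleeping. */
struct sk_mutex {
    atomic_flag held;
};

static void sk_mutex_init(struct sk_mutex *m)
{
    atomic_flag_clear(&m->held);
}

static bool sk_mutex_trylock(struct sk_mutex *m)
{
    return !atomic_flag_test_and_set(&m->held);
}

static void sk_mutex_lock(struct sk_mutex *m)
{
    while (!sk_mutex_trylock(m))
        ;  /* busy-wait; a real mutex would put the task to sleep */
}

static void sk_mutex_unlock(struct sk_mutex *m)
{
    atomic_flag_clear(&m->held);
}

/* Natural order: sub-lock mutex first, then top-lock mutex. */
static void sk_lock_sub_then_top(struct sk_mutex *sub, struct sk_mutex *top)
{
    sk_mutex_lock(sub);
    sk_mutex_lock(top);
}

/*
 * Reverse order: take the top-lock mutex, then only *try* the sub-lock
 * mutex. On failure drop the top-lock mutex, so that the current owner of
 * the sub-lock mutex can make progress, and repeat. Returns the number of
 * failed attempts before success (both mutexes held), or -1 if max_tries
 * attempts all failed (neither mutex held).
 */
static int sk_lock_top_then_sub(struct sk_mutex *top, struct sk_mutex *sub,
                                int max_tries)
{
    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        sk_mutex_lock(top);
        if (sk_mutex_trylock(sub))
            return i;
        sk_mutex_unlock(top);
    }
    return -1;
}
```

Because the second mutex is never waited for while the first is held, two tasks locking in opposite directions cannot block each other forever; the cost is that the reverse direction may have to retry.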
| int cl_lock::cll_holds |
Number of holds on a lock.
A hold prevents a lock from being canceled and destroyed. Protected by cl_lock::cll_guard.
| cfs_list_t cl_lock::cll_inclosure |
A linkage into a list of locks in a closure.
| cfs_list_t cl_lock::cll_layers |
List of slices.
Immutable after creation.
| cfs_list_t cl_lock::cll_linkage |
Linkage into cl_lock::cll_descr::cld_obj::coh_locks list.
Protected by cl_lock::cll_descr::cld_obj::coh_lock_guard.
| struct lu_ref_link* cl_lock::cll_obj_ref |
A reference for cl_lock::cll_descr::cld_obj.
For debugging.
| int cl_lock::cll_users |
Number of lock users.
Valid in cl_lock_state::CLS_HELD state only. Lock user pins lock in CLS_HELD state. Protected by cl_lock::cll_guard.