1. OST and Obdfilter
/* Sigh - really, this is an OSS, the _server_, not the _target_ */
static int ost_setup(struct obd_device *obd, obd_count len, void *buf)
{ ... }
— from Lustre source tree b16
Lustre source tree lustre/ost and all function names prefixed with
ost_ should probably be regarded as server (OSS) function,
if we understand the above comment correctly.
1.1. OSS as OST
OST is loaded as a kernel module. It works closely with
obdfilter and does most of the server/OST side of the work. Between
these two layers, OSS is the switch layer or the thin layer, and it
interprets requests from Portal RPC, prepares for
requests, and then passes requests to obdfilter for further
processing. In the following discussion, we focus on two aspects of
it: initial setup and switching structure, implemented by
ost_setup() and ost_handle(), respectively.
1.1.1. Initial Setup
-
First, the OST checks if the OSS thread
number is specified. If not, then it computes the minimum number of
threads based upon the CPU and memory and ensures that there is 4x
dynamic range between the minimum and maximum number of threads.
oss_min_threads = num_possible_cpus() * num_physpages >> (27 - CFS_PAGE_SHIFT);
if (oss_min_threads < OSS_THREADS_MIN)
oss_min_threads = OSS_THREADS_MIN;
/* Insure a 4x range for dynamic threads */
if (oss_min_threads > OSS_THREADS_MAX / 4)
oss_min_threads = OSS_THREADS_MAX / 4;
oss_max_threads = min(OSS_THREADS_MAX, oss_min_threads * 4 + 1);
To get the obd device of the OST; the following function call is used.
struct ost_obd *ost = &obd->u.ost;
-
Then, the server side initiates RPC services by:
ost->ost_service = ptlrpc_init_svc( , , , , , , ost_handle, , , , "ll_ost");
The function returns the pointer to struct ptlrpc_service. One
important thing to note is that we have supplied a handler,
ost_handle. Once the service is set up as shown below, Portal
RPC will dispatch the request to this handler for further
processing. That is the subject of the following section.
-
The prtrpc threads are started as:
rc = ptlrpc_start_threads(obd, ost->ost_service);
-
The similar call sequence is repeated for creating ost create
threads and the returned service handle is assigned to
ost→ost_create_service. It is also repeated for ost io threads,
and a service handle is assigned to ost→ost_io_service.
-
And finally, the ping eviction service is started.
1.1.2. Dispatching
The handler function takes one input
parameter, struct ptlrpc_request *req, and it is driven largely by
the type of request. The decoding of type of request is through
passing req→rq_reqmsg (which points to struct lustre_msg) to a
helper function lustre_msg_get_opc() provided by Portal RPC. Thus
the dispatch structure looks like:
switch (lsutre_msg_get_opc(req->rq_reqmsg)) {
case OST_CONNECT:
...
rc = target_handle_connect(req, ost_handle);
break;
case OST_CREATE:
...
rc = ost_create(req->rq_export, req, oti);
break;
case OST_WRITE:
...
rc = ost_brw_write(req, oti);
RETURN (rc);
case OST_READ:
...
rc = ost_brw_read(req, oti);
RETURN(rc);
}
The exception handling includes the possible recovery, which can
happen during any request except the OST_CONNECT. Also, we need to
check for connection coming from an unknown client by checking NULL
of req→rq_export.
1.2. OST Directory Layout
This section describes what you will observe on the disk when logging
onto an OST node. The filesystem on the disk is
most likely ldiskfs for now. It means the backend data is really
stored as a regular file, organized in a certain Lustre specific way:
1.2.1. Group Number
Under the top level directory on an OST is the
subdirectory named for each group. This layout accommodates clustered
MDSs where each group corresponds to one MDS. As of now, only one MDS
is in use, so only group zero is effective1.
1.2.2. Object Id
Under each group, 32 subdirectories are created. For each file object, its last five digits are used to identify which
subdirectory this file should reside with. Here, the filename is the
object id.
1.3. obdfilter
The obdfilter device is created when the OST server is
initialized. For each OST, we have an associated obdfilter device. For
each client connection, the obdfilter creates an export as the conduit
of communication. All the exports are maintained in a global hash
table, and the hash key is also known as UUID, as shown in
both fig-obdfilter-conns and fig-exp. The Portal RPC layer
makes use of UUIDs to quickly identify to which export (and obdfilter
device) the incoming request should go. Also, each obdfilter device
maintains a list of the exports it is serving. This relationship is
illustrated in fig-obdfilter-conns.
Figure 1: Import and export connections between an OSC and obdfilter.
The obdfilter provides the following functions:
-
handles create request, presumably from MDS for file data objects.
-
handles read and write requests, from OSC clients.
-
handles connect and disconnect requests from lower Portal RPC layer
for established exports and imports.
-
handles destroy (which involves both client and MDS) requests.
1.3.1. File Deletion
The destroy protocol is as follows. First, the client decides to
remove a file and this request is sent to MDS. MDS checks the
EA striping and uses llog to make a transaction
log. This log contains the following: <_unlink object 1 from ost1,
unlink object 2 from ost2, etc._>. Then, MDS sends this layout and
transaction log back to the client. The client takes this log and
contacts each OST (actually obdfilter) to unlink each file
object. Each successful removal is accompanied by a confirmation or
acknowledgment to the unlink llog record. Once all unlink llog
records on MDS have been acknowledged, the file removal process is
completed.
1.3.2. File Creation
As discussed earlier in sec-ost-and-obdfilter, all requests are
handled by OST and obdfilter together. Now, we will walk through the
handling of a create request. The first portion of request handling is
done inside ost_create() as follows.
-
Prepare the size of reply message. It is composed of two records and
thus requires two buffers. The first record is for portal rpc body,
and the second, for the ost reply body.
__u32 size[2] = { sizeof(struct ptlrpc_body), sizeof(*repbody)};
-
Get a pointer to the request body from the original request and do
byte swapping when necessary.
struct ost_body *body = lustre_swab_reqbuf(req, REQ_REC_OFF,
sizeof(*body), lustre_swab_ost_body);
The last parameter is the swab handler, which is called only when the
swapping is necessary. Client side uses native byte order for its
request, along with a pre-agreed magic number. Server side reads the
magic number to decide if a swap is needed.
-
Do the actual space allocation, and fill in preliminary header
information.
rc = lustre_pack_replay(req, 2, size, NULL);
repbody = lustre_msg_buf(req->rq_repmsg, REPLY_REC_OFF,sizeof(*repbody));
After the first call, req→rq_repmsg points to the newly
allocated space. The second call assigns repbody of the starting
address for the buffer of the reply body.
-
Finally, it fills in the reply body with exactly the same contents
as a request body and passes on to obdfilter for further processing.
memcpy(&repbody->oa, &body->oa, sizeof(body->oa));
req-rq_status = obd_create(exp, &repbody->oa, NULL, oti);
For the create request, the entry point for obdfilter is through
filter_create().
static int filter_create(struct obd_export *exp, struct obdo *oa ..)
We ignore the processing related to struct lov_stripe_md **ea and
struct obd_trans_info *oti because the former is a legacy code and
is unlikely to be used in the future.
-
First, save the current context and assign this client its own
operation context. This is for specifying necessary information for
the thread if it wants to access the backend filesystem. It is like a
sandbox limiting the reach of server threads when processing requests
from clients; it stores “filesystem root” and “current working
dir” for the server thread (not obtained from the client, of course,
but rather dependent on which OST we are working on).
obd = exp->exp_obd;
put_ctxt(&saved, &obd->obd_lvfs_ctxt, NULL);
-
If the request is for recreating an object, then we cancel all
extent locks on the recreated object by acquiring the lock on the
object and call on filter_recreate() to do the actual
job. Otherwise, we follow the normal flow of precreating objects. The
reason for precreating is that, conceptually, when MDS asks an OST for
creating an object, OST doesn’t just create one object, it creates
multiple objects with object id assigned. These batch created objects
have a disk size of zero. The goal is, when MDS responds to a client
request next time for creating new file, it doesn’t have to send a
request to OST again to present the layout information to client. By
taking a look at the pool of precreated objects from each OST, MDS may
already have all the information needed to reply to the client.
if (oa->o_valid & OBD_MD_FLFLAGS) &&
(oa->o_flags & OBD_FL_RECREATE_OBJS)) {
rc = ldlm_cli_enqueue_local(obd->obd_namespace, &res_id, ... );
rc = filter_recreate(obd, oa);
ldlm_lock_deref(&lockh, LCK_PW);
} else {
rc = filter_handle_precreate(exp, oa, oa->o_gr, oti);
}
Here, rc returned from precreate handler is either a negative,
indicative of an error, or a non-negative number, representing the
number of files created.
-
Now, we take a closer look at the function of precreate:
When a client contacts an OST with a precreated object id, OST knows
that this object id now is activated. However, this presents a problem
such that, if the MDS has failed, it now has stale information on
precreated objects. To resolve this conflict, when MDS is restarted,
it checks its records on unused precreated objects and sends requests
to OSTs to delete these objects (delete orphans). The
obdfilter takes these requests and skips those objects that are
actually in use (but out of synchronization with MDS’s own record) and
removes the rest of it. This is the first part of what
filter_handle_precreate() will do:
if ((oa->o_valid & OBD_MD_FLFLAGS) &&
(oa->o_flags & OBD_FL_DELORPHAN)) {
down(&filter->fo_create_lock);
rc = filter_destroy_precreated(exp,oa,filter);
...
} else {
rc = filter_precreate(obd, oa, group, &diff);
...
}
-
Finally, the create request is passed onto fsfilt and
is completed by a VFS call. The process will later go through more
steps, such as getting hold of parent inode, transaction setup, etc.
rc = ll_vfs_create(dparent->d_inode, dchild,
S_IFREG | S_ISUID | S_ISGID | 0666, NULL);
1 In fact, there is a special group for echo client as
well, so that MDS and echo client do not conflict when run at the same
time.
|