http://wiki.lustre.org/api.php?action=feedcontributions&user=Nrutman&feedformat=atomLustre Wiki - User contributions [en]2024-03-29T15:40:51ZUser contributionsMediaWiki 1.39.3http://wiki.lustre.org/index.php?title=File:LUG2019-Developer_Day-ls_Long_Term_Storage-Rutman.pdf&diff=3634File:LUG2019-Developer Day-ls Long Term Storage-Rutman.pdf2019-09-12T21:40:53Z<p>Nrutman: </p>
<hr />
<div></div>Nrutmanhttp://wiki.lustre.org/index.php?title=Talk:Policy_Engine_Proposal&diff=3394Talk:Policy Engine Proposal2018-07-11T18:17:46Z<p>Nrutman: Fast Find description</p>
<hr />
<div>Hi Jxiong, <br />
there are a large number of distinct items proposed here (MOI, SLC, changelog improvements, coordinator improvements, a new rules engine, internal (kernel-side) data copy, and master/slave job distribution). <br />
It seems to me that these may be implemented separately / incrementally, deriving lots of value without so much implementation effort up front. In particular, we at Cray have been thinking about a Lustre Fast Find which is very similar to your MOI, but generalized. This is very useful without any of the other HSM improvements.<br />
<br />
=Summary=<br />
<br />
Improve Lustre's 'lfs find' capability to run server-side and to use new indices, enabling fast metadata-based searches as an alternative to walking the namespace tree.<br />
<br />
=Background=<br />
One of the major reasons for implementing external MD databases (e.g. Robinhood) or inode scanning tools (e.g. LiPE, ne2scan) is to quickly find a set of files matching particular search criteria, without having to walk the entire FS namespace and stat every file. Searches must efficiently filter the entire set of files down to a relatively small list that is actionable, i.e. that will be used to drive a policy (such as archiving those files). If Lustre had an efficient 'lfs find' capability, it could generally take the place of other scanning/walking/DB-generating utilities, allowing much broader usage.<br />
<br />
=Description=<br />
Modify the lfs find command to return a filtered bulk list from the MDS, rather than a userspace client-side tree walk. A new find RPC is issued to the MDS, the MDS executes the search through the file metadata to generate the filtered list, and the results are returned to the client.<br />
<br />
==Search criteria==<br />
Search criteria should include any metadata stored on the MDS (e.g. user, xattr, OST number). For MD not stored on the MDS, approximations are required: lazy SoM (LUS-1772), lazy atime, etc. Multiple filters can be used in combination (e.g. larger than 1M and mtime > 1 day). For efficient filtering, the ordering of criteria is important. <br />
Common search criteria are likely to be modification time, access time, and rough size. Potentially, users might want the ability to label files and search by label.<br />
<br />
==Index creation==<br />
The MDS maintains key-value indices for each file (e.g. the OI index maps FID to inode) via the dt_index_operations functionality of the OSD API. All backends support this. For Fast Find, we can create extra indices to sort files based on search criteria of interest. Every modifying RPC would include updates to each index. Indices would then allow efficient find-list generation: a search for files with mtime < 1 day could quickly scan through the sorted mtime index to find the range of matching files.<br />
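To make the mechanism concrete, here is a minimal sketch of a sorted mtime index supporting a range query, using a plain in-memory array as a stand-in for the OSD index; the struct and function names are hypothetical, and the real index would be maintained through dt_index_operations.<br />

```c
#include <stdlib.h>

/* Hypothetical stand-in for an MDS secondary index keyed on mtime.
 * In Lustre this would be a persistent key-value index in the OSD. */
struct mtime_entry {
	long long mtime;            /* key: modification time */
	unsigned long long fid;     /* value: file identifier */
};

static int cmp_mtime(const void *a, const void *b)
{
	const struct mtime_entry *x = a, *y = b;
	return (x->mtime > y->mtime) - (x->mtime < y->mtime);
}

/* Copy the FIDs of all entries with mtime >= cutoff into out[] and
 * return how many matched.  Because the index is kept sorted, only the
 * matching tail of the index is scanned, not the whole namespace. */
size_t mtime_index_find_newer(struct mtime_entry *idx, size_t n,
			      long long cutoff, unsigned long long *out)
{
	size_t lo = 0, hi = n, count = 0;

	qsort(idx, n, sizeof(*idx), cmp_mtime);
	while (lo < hi) {           /* binary search: first entry >= cutoff */
		size_t mid = lo + (hi - lo) / 2;
		if (idx[mid].mtime < cutoff)
			lo = mid + 1;
		else
			hi = mid;
	}
	while (lo < n)
		out[count++] = idx[lo++].fid;
	return count;
}
```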
<br />
==List return format==<br />
The resulting file list should possibly include all available MD for each file in addition to the file names, to avoid having to issue subsequent stat calls. Given a large number of items in the return list, transferring the list might be tricky. One possibility would be for the MDS to store the list persistently as a regular Lustre DoM file, attached to the namespace under /.lustre/. The userspace caller would receive a reference to the file, which it could then parse at will.<br />
<br />
==Resources==<br />
Every index update will require additional transaction blocks in the transaction, and a write operation. These should be quick on a flash MDS, but updating many indices will be expensive. It will be best to update only those indices of importance to the user; i.e. they should be user-selectable.<br />
More importantly, every find operation will require a potentially disk-heavy search on the MDS, requiring CPU and IOPS. This of course will be less total load than a namespace walk, but will be executed on a critical FS component.<br />
<br />
==DNE==<br />
Each DNE server would run the same search in parallel on its files. This gives us horizontally scalable searches (which RBH cannot achieve).<br />
<br />
=Services based on find=<br />
Services such as a purger, archiver, or tier manager can be built on find using extremely simple pipeline scripts, triggered by cron or a job scheduler. For example, a daily cron job running lfs find /lustre -size +20M -mtime -1 | xargs lfs hsm_archive would archive all new files greater than 20MB. Slightly more complex services could take into account external information: Lustre space, archive space, system load, etc.<br />
<br />
=Advantages=<br />
* Very simple policy implementations<br />
* General-purpose functionality for uses beyond policy engines<br />
* Save hardware and software related to tracking new files (changelog ingest) and maintaining external DBs (Robinhood)<br />
<br />
=Drawbacks=<br />
* slower iops on mds with transactional index updates<br />
* 'find' load on mds impacts other jobs<br />
* name-based searches (e.g. find *.jpg) can't be indexed, will be slow<br />
* each index as large as oi_index = 10GB?</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Enhanced_Adaptive_Compression_in_Lustre&diff=3090Enhanced Adaptive Compression in Lustre2018-03-06T17:47:35Z<p>Nrutman: </p>
<hr />
<div>===== General Information =====<br />
<br />
Due to the increasing gap between computational speed, network speed and storage capacity, it has become necessary to investigate data reduction techniques.<br />
Storage systems have become a significant part of the total cost of ownership due to the increased amount of storage devices, their associated acquisition cost and energy consumption.<br />
<br />
Ultimately, we are aiming for compression support in Lustre at multiple levels:<br />
<br />
- Client-side compression allows using the available network and storage capacity more efficiently,<br />
- Client hints empower applications to provide information useful for compression and<br />
- Adaptive compression makes it possible to choose appropriate settings depending on performance metrics and projected benefits.<br />
<br />
Compression will be completely transparent to the applications because it will be performed by the client and/or server on their behalf.<br />
However, it will be possible for users to tune Lustre's behavior to obtain the best performance/compression/etc.<br />
When using client-side compression, single-stream performance, which is often a bottleneck, will directly benefit from the compression.<br />
<br />
===== Funding =====<br />
<br />
Intel Parallel Computing Center for Lustre, Universität Hamburg: “Enhanced Adaptive Compression in Lustre”<br />
<br />
===== Project Links =====<br />
<br />
* [https://software.intel.com/articles/intel-parallel-computing-center-at-university-of-hamburg-scientific-computing Intel]<br />
* [https://wr.informatik.uni-hamburg.de/research/projects/ipcc-l/start Scientific Computing Group]<br />
* [https://jira.hpdd.intel.com/browse/LU-10026 LU-10026]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=File_Level_Replication_High_Level_Design&diff=2962File Level Replication High Level Design2017-10-26T21:08:39Z<p>Nrutman: minor formatting fixes</p>
<hr />
<div>== Introduction ==<br />
<br />
The Lustre* file system was initially designed and implemented for HPC use. It has been working well on high-end storage that has internal redundancy and fault-tolerance. However, despite the expense and complexity of these storage systems, storage failures still occur, and Lustre cannot currently be more reliable than the individual storage and server components on which it is based. The Lustre file system has no mechanism to mitigate storage hardware failures, and files become inaccessible if a server is inaccessible or out of service. This deficiency of the Lustre file system in the area of fault tolerance becomes more acute in large systems with many individual components, and also in the era of cloud computing environments built with commodity hardware that lacks redundancy and where failure rates are high.<br />
<br />
In this document, a solution to the problem of data accessibility in the face of failing storage will be designed. The chosen solution is to use configurable File Level Redundancy (FLR) within the file system and is described in the Solution Architecture. With File Level Redundancy, any Lustre file can store the same data on multiple OSTs in order for the system to be robust in the event of storage failures. In addition, IO availability can be improved because given a choice of multiple replicas, the best suited can be chosen to satisfy an individual request. Furthermore, for files that are concurrently read by many clients (e.g. input decks or executables) the aggregate parallel read performance of a single file can be improved by creating multiple replicas of the file's data.<br />
<br />
The considerable functional improvements that redundancy promises also pose significant challenges in design and implementation. In this document we present a design for an initial implementation of redundancy with limited write support – writing to a replicated file will degrade it to a non-replicated file: only one replica will be updated directly during the write, while the other replicas will simply be marked as stale. The file can subsequently return to the replicated state by synchronizing the replicas with command line tools (run by the user or administrator directly, or via automated monitoring tools). There will typically be some delay before the other replicas are re-synced following a write, which is why the first phase implementation is called delayed write.<br />
<br />
While delayed write redundancy will not cover all types of failures, it will still be useful for a wide range of uses. Since the redundancy of file data can be chosen on a per-file basis, it is practical to replicate one of every N checkpoint files (e.g. once a day), so that a long-running application can be restarted even in the face of concurrent compute node and OST failures. This also avoids the need to double or triple the storage capacity of the whole filesystem as is required for device-level redundancy. <br />
<br />
In this document, we will present a solution for delayed write for Lustre. Within this document, the terms RAID-1 or RAID-0+1, replicated, and mirrored are all used to reference the files with multiple copies of the data. More specifically, the term RAID-1 is used in the context of the file layout and replicated is used in the context of data itself.<br />
<br />
== RAID-1 Layout ==<br />
<br />
Based on the layout enhancement project, a redundancy layout format and a set of layout operations will be defined. At present, there are no dedicated interfaces to interact with layouts; using setxattr to modify the whole layout atomically is the current work-around. However, setxattr is insufficient to support the complex layout interactions that RAID-1 requires. A set of layout operations will be designed to support the needs of redundancy. This work builds on the Layout Enhancement project.<br />
<br />
=== Layout format ===<br />
Layout format of redundancy is based on the [[Layout Enhancement Solution Architecture]] where composite layout format is defined. Each replica has a unique ID. The state of each replica is maintained separately. The replica state can be modified via layout operation interfaces.<br />
<br />
=== RAID-0+1 and RAID-1+0 layout ===<br />
<br />
A RAID-0+1 layout is mirroring across a collection of RAID-0 layouts. Conversely, RAID-1+0 is striping over a collection of RAID-1 mirrors. RAID-0+1 is simpler to implement within the current Lustre IO framework and file layout infrastructure, since it is possible to take any existing RAID-0 file and add one or more replicas to create a RAID-0+1 file. RAID-1+0 is more complex to implement and is not part of the initial phase. RAID-1 (a mirror of individual objects) will be supported as a degenerate case of RAID-0+1.<br />
<br />
== Command line tools ==<br />
=== Create RAID-1 file by parameters with lfs setstripe ===<br />
<br />
To create a RAID-1 file, we have to ensure that different replicas are placed on different OSTs, and ideally on different OSSes and racks. An understanding of cluster topology is necessary to achieve this aim. In the initial implementation, the existing OST pools mechanism will allow separating OSTs by any arbitrary criteria, i.e. fault domain. <br />
<br />
In practice, we can take advantage of OST pools by grouping OSTs by topological information. Therefore, when creating a replicated file, the users can indicate which OST pool can be used by replicas.<br />
<br />
lfs setstripe --layout mirror [--component idx] [--prefer] [other setstripe options] <file_name><br />
<br />
The above command will create an empty RAID-1 file. The --component option indicates the replica component number that is being specified, and can be repeated multiple times so that multiple replicas can be created at one time. The --prefer option marks that replica as being preferred for writes (e.g. stored on fast/local OSTs). As mentioned above, it is recommended to use the --pool option (one of the lfs setstripe options) with OST pools configured with independent fault domains to ensure different replicas will be placed on different OSTs and/or servers and/or racks, therefore availability and performance can be improved. If the setstripe options are not specified, it is possible to create replicas with objects on the same OST(s) in the Phase 1 implementation, which would remove most of the benefit of using redundancy in that case.<br />
<br />
=== Extend an existing file to RAID-1 format ===<br />
<br />
lfs setstripe --layout mirror --extend [--read-only|-r] [--prefer] [other setstripe options] <raid1_file><br />
<br />
This command will append a replica, described by the setstripe options, to the existing file raid1_file. The existing file can already be a replicated file, or just a normal file.<br />
<br />
This command will create a new volatile file with any optional setstripe options that are specified, or using the defaults inherited from the parent directory or filesystem. The file contents will be copied from the existing file to the volatile file, and then the layout will be modified to add the volatile file as a new replica. A lease will be held on the existing file to prevent it from being modified during the copy. If the existing file is modified during this operation, -EBUSY will be returned to the resync tool. Administrators can repeat this command later.<br />
<br />
The --read-only|-r option can be specified if the file is known to be accessed by older clients that do not directly support File Level Redundancy, which will prevent them from marking the file stale if the file is opened O_RDWR (in particular by Fortran programs), but will prevent the file from being modified. This is similar to marking the file immutable, but it can still be deleted or have attributes such as timestamps and ownership modified.<br />
<br />
=== Split a specific replica index from a RAID-1 file ===<br />
<br />
lfs layout --split --component idx <raid1_file> [target_file]<br />
<br />
This command will split the specified replica by its component index idx. raid1_file must be a RAID-1 file, otherwise the command will fail with -ENOTSUPP. If target_file is specified, it will be created with the layout of the split replica; otherwise the layout of the specified replica will be destroyed.<br />
<br />
=== Replica resynchronize ===<br />
<br />
lfs layout --resync <raid1_file><br />
<br />
This command is used to resynchronize an out-of-sync RAID-1 file. If there are no stale replicas in the replicated file, this command does nothing. Otherwise, this command will first take the exclusive lease of raid1_file, and then the file will be resynchronized.<br />
<br />
During resynchronization, the file will be read via the normal read path, because reads only use up-to-date replicas; the contents will then be written to the stale replicas with a dedicated replica write interface. Direct IO will be used for both the read and the write in this case. After the replicas are synchronized, this command will change the layout to mark the replicas as up to date.<br />
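The copy step can be pictured with a toy model: data from the up-to-date replica is written over each stale replica in fixed-size chunks. Plain buffers stand in for OST objects here, and the chunked loop marks where the real implementation would issue aligned direct IO reads and replica writes; none of these names are real Lustre interfaces.<br />

```c
#include <string.h>
#include <stddef.h>

#define RESYNC_CHUNK 4096   /* chunk size; real code would use aligned direct IO */

/* Copy the primary replica's contents over every stale replica,
 * chunk by chunk. */
void resync_replicas(const char *primary, char **stale, size_t nr_stale,
		     size_t file_size)
{
	for (size_t r = 0; r < nr_stale; r++) {
		for (size_t off = 0; off < file_size; off += RESYNC_CHUNK) {
			size_t len = file_size - off < RESYNC_CHUNK ?
				     file_size - off : RESYNC_CHUNK;
			memcpy(stale[r] + off, primary + off, len);
		}
	}
}
```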
<br />
=== Verify replicated files ===<br />
<br />
lfs layout --verify <raid1_file><br />
<br />
This command is used to verify that each replica of a replicated file contains exactly the same data. It will pull data from each replica and then compare the contents or calculate checksums (potentially with hardware assistance) to make sure they are identical.<br />
<br />
== Read from replicated file ==<br />
<br />
In the design of redundancy, the goal is for writers to update all replicas, so that readers can get valid data by reading from any replica. In the Phase 1 implementation, only a single replica will be written, and there will be a delay before the other replicas are updated from the up-to-date replica. However, if there is a hidden error on disk, two readers may receive inconsistent results if they happen to use different replicas. The solution to this problem is out of scope for Phase 1, and we make the assumption that all up-to-date replicas contain the same data.<br />
<br />
=== Policy to select read replica ===<br />
<br />
When serving a read, a policy will be run in the IO framework to select a replica. The initial replica would ideally be chosen based on the following information:<br />
<br />
* Status of replica - avoid stale replicas<br />
* Type of replica - replicas on faster disks should be chosen first<br />
* Network topology - the nearer the replica, the faster the IO can be completed<br />
* Load balance - to avoid busy servers<br />
<br />
However, running this kind of policy needs information that is not currently available at the client, so the Phase 1 initial replica selection is only by replica status as seen by the client.<br />
<br />
The pseudo code of replica selection policy is as follows:<br />
<br />
for each replica of the file; do<br />
if (replica is stale)<br />
continue;<br />
if (replica including inactive OST)<br />
continue;<br />
if (replica was previously tried and failed)<br />
continue;<br />
<br />
add this replica to list of available replicas;<br />
break;<br />
done<br />
select replica from available replicas<br />
<br />
Replicas may be prioritized using past usage statistics. If multiple valid replicas exist, clients should deterministically select one of them (e.g. client_nid % replica_count) so that the load is distributed across replicas. This means that multiple clients reading from the same file should utilize the bandwidth of all the servers.<br />
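A minimal sketch of this deterministic selection, assuming illustrative flag bits for the stale/inactive/failed conditions listed in the pseudo code (none of these names are real Lustre identifiers):<br />

```c
#include <stddef.h>

#define REPLICA_STALE    0x1
#define REPLICA_INACTIVE 0x2   /* replica includes an inactive OST */
#define REPLICA_FAILED   0x4   /* previously tried and failed */

/* Collect usable replicas, then pick one with client_nid % count so
 * that different clients spread their reads across the replicas.
 * Returns the chosen replica index, or -1 if none is usable. */
int pick_read_replica(const unsigned *flags, size_t nr,
		      unsigned long long client_nid)
{
	int avail[16];
	size_t n = 0;

	for (size_t i = 0; i < nr && n < 16; i++)
		if (!(flags[i] &
		      (REPLICA_STALE | REPLICA_INACTIVE | REPLICA_FAILED)))
			avail[n++] = (int)i;

	return n ? avail[client_nid % n] : -1;
}
```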
<br />
Replica selection occurs in the IO initialization phase; the lock request relies on the result of the selection. If a read from a replica fails, the replica is marked stale and the replica selection process is run again. On successful reads, the same replica is reused for later reads.<br />
=== IO framework for read ===<br />
<br />
The IO framework is on the client side. Clients must recognize the RAID-1 layout and choose a replica to serve read requests according to policy.<br />
<br />
To improve read availability, the IO framework for read must have a retry mechanism: if a replica becomes unavailable to serve a read, the IO framework must retry that IO with another replica, and so on. Applications must not see any errors as long as the data can eventually be read from one of the replicas.<br />
<br />
When reading from a replica, the IO RPC should be issued with the rq_no_delay bit set in ptlrpc_request, so that if the object becomes unreachable, the RPC won’t be stuck in the resending list at the RPC layer; instead it will return with the error code -ENAVAIL. The IO framework should reissue the IO and invoke the replica selection policy to try the next replica.<br />
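The retry behaviour can be sketched as a driver that walks the replicas until one serves the IO; issue_io() is a caller-supplied stand-in for sending the no-delay read RPC to a given replica, not a real Lustre function, and -ENAVAIL is modelled with a local constant.<br />

```c
#include <stddef.h>

#define MY_ENAVAIL 119   /* stand-in for the -ENAVAIL returned by the RPC layer */

typedef int (*issue_io_t)(int replica_idx);

/* Try replicas in order: -ENAVAIL means "replica unavailable, try the
 * next one"; any other return value (success or hard error) is final.
 * The application only sees -ENAVAIL if every replica fails. */
int read_with_retry(size_t nr_replicas, issue_io_t issue_io)
{
	for (size_t i = 0; i < nr_replicas; i++) {
		int rc = issue_io((int)i);
		if (rc != -MY_ENAVAIL)
			return rc;
	}
	return -MY_ENAVAIL;
}

/* Example callback: replica 0 is unreachable, replica 1 serves the read. */
int demo_io(int idx)
{
	return idx == 0 ? -MY_ENAVAIL : 0;
}
```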
<br />
The diagram above depicts a typical process of sync read. For sync read, the reading process has to wait for the RPC result; therefore, if it sees the error -ENAVAIL, indicating the replica is unavailable, it will set cl_io::ci_restart so that the IO can be restarted.<br />
=== Read ahead ===<br />
<br />
Read ahead RPCs are asynchronous: the process that issues the RPC won’t wait for the result, nor will the pages be needed right away. In this case, it’s not necessary to take immediate action if the replica becomes unavailable. The pages will remain in cache without being marked up to date until they are needed by a process, at which point a sync RPC will be sent with a valid replica selected.<br />
== Write to replicated file ==<br />
<br />
For simplicity of implementation, write is implemented by two phases:<br />
<br />
* Phase 1: Delayed write – the MDS will choose one replica, named primary, to update and mark the others STALE, based on a hint from the client. An external tool will be used to synchronize the other (STALE) replicas after write is done.<br />
* Phase 2: Immediate write – the major problem with delayed write is that even if all replicas are accessible when the file is being written, only one replica will be updated. This is inefficient, since the file will need to be read and updated again at a later time. Immediate write also picks the primary replica for update and marks the others STALE, but it writes to the secondary replica(s) at the same time. When the write completes, after all dirty data has been written, the writing client(s) can set the secondary replicas’ state to SYNC again. Immediate write will use fanout pages; its design and implementation are out of scope for this project.<br />
<br />
The major purpose of picking a primary replica is to simplify recovery and the coordination of concurrent read/write access to a replicated file. The primary replica selection process is similar to that for read, with the main difference that once a primary replica has been selected, all other clients reading or writing the file must also use that same replica as the primary. Once the other replicas are marked stale, the replicated file degenerates to a normal file, so read/write locking will ensure data coherency and recovery.<br />
<br />
It will also be possible to persistently mark a replica in the layout as preferred for selection as the primary replica for writes, if it is stored on faster OSTs compared to others (e.g. flash vs. SATA). Since the client is able to specify a hint for selecting the primary replica, this will allow future extension for the client to be able to select the replica based on LNET locality or observed OST performance.<br />
=== Delayed write ===<br />
<br />
With delayed write, one replica will be chosen as the primary replica to dirty, and the others will be marked as STALE. The MDT can either do this at open(O_WRONLY), or be notified to change the layout before the write actually occurs for open(O_RDWR).<br />
<br />
After the write is finished, an external tool will acquire a read mode lease of the file and use replica write to copy data from the primary replica to the stale ones. Marking a replica stale will be recorded in the Lustre ChangeLog so that the resync tool or an external policy engine can efficiently detect and resync stale replicas.<br />
<br />
The benefit of delayed write is that it’s recovery-free: because only one replica is written, we don’t need to worry about differences among replicas at write time. This also defers the IO overhead of writing to multiple replicas until after the primary replica has finished writing, allowing the full write bandwidth to be used for the primary replica. However, the drawback is that the resynchronization will consume IO bandwidth at a later time. Delayed write is just a temporary solution until immediate write is introduced.<br />
==== Layout Change for Write ====<br />
<br />
Before writing a replicated file, the MDT has to be notified in advance so that it can choose one replica as the primary and mark the others as STALE. Then the layout generation will be increased to notify the other clients to clean up their caches. The layout update must be a synchronous write operation on the MDT, though this is not expected to be a significant overhead, since updating files after initial creation is a relatively rare event, and the MDT normally uses high-IOPS storage and can commit the update relatively quickly.<br />
<br />
If the file is opened with O_WRONLY, the MDT will immediately pick one replica out as primary and mark the others as STALE at file open time.<br />
==== Versioned Write Request and Replica ====<br />
<br />
OST write requests will carry the layout generation of the file at the time it was opened for write. The OST will check whether this version number matches the generation number of the replica (see section 6.1.3); otherwise the write request will be denied, because it is from a client with an incorrect understanding of the current file layout.<br />
<br />
The version of the write request will be filled in cl_req_attr_set(); the source of the version is fetched from struct obd_client_handle, where a new field is added to remember the layout generation when the file is opened for write. Note that the file’s layout generation may be unavailable due to DLM false sharing, so we can’t rely on comparing the current file version to decide whether to send the write request.<br />
<br />
Versioned replicas apply to the OST objects of replicas; each OST object will store a 32-bit version number on disk to identify which layout generation of the file the object is attached to.<br />
<br />
Before the objects can be written, the client has to update the layout generation of the replica. On the OST side, the write can only succeed if it matches the layout generation of the objects.<br />
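The OST-side check reduces to comparing the generation carried by the write request with the generation stored on the object. A sketch, with an illustrative object struct and -ESTALE chosen here as the rejection code (the actual error code is not specified above):<br />

```c
#include <errno.h>

struct ost_object {
	unsigned int layout_gen;   /* 32-bit layout generation stored on disk */
};

/* Allow the write only if the request's generation matches the
 * object's; an evicted or outdated client fails this check. */
int ost_write_check(const struct ost_object *obj, unsigned int req_gen)
{
	if (req_gen != obj->layout_gen)
		return -ESTALE;   /* client holds an outdated layout */
	return 0;
}
```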
<br />
The major purpose of versioned replicas is to prevent evicted clients from modifying the replicated file. It also enforces data protection, because an arbitrary write request, for example one from an older client, will fail to modify the data.<br />
==== Replicated File State ====<br />
<br />
Once write is supported, a replicated file can be in one of the following states:<br />
<br />
* READ_ONLY - the replicated file is in read-only state;<br />
* WRITE_PENDING - the file is going to be written and the layout change has finished. This is an intermediate state between READ_ONLY and WRITABLE;<br />
* WRITABLE - the file is writable. No extra operation is needed to write the file.<br />
* SYNC_PENDING - the file has been picked for resynchronization; the layout generation has been increased so that evicted clients can’t write to the replicas.<br />
<br />
Replicated files transition between states when they are being written and resynchronized. The diagram below shows the state machine for these transitions.<br />
<br />
... need to figure out how to insert diagram ...<br />
<br />
When the first write comes and the replicated file is in READ_ONLY state, the MDT will increase the layout generation and pick the primary replica, then change the file state to WRITE_PENDING and mark the non-primary replicas STALE. The writing client will then update the primary replica’s layout generation, and then notify the MDT to set the replicated file state to WRITABLE. After the replicated file is resynchronized, the file state will go back to READ_ONLY again.<br />
<br />
Whenever a writing client sees a replicated file in WRITE_PENDING state, it will update the replica generation and then notify the MDT to set the state to WRITABLE. If a client sees a replicated file in WRITABLE state, it can write the file without any extra operations. Therefore, the first write is expensive, because it needs two transactions with the MDT; subsequent writes incur no extra overhead at all.<br />
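Since the FSM diagram is missing, here is one plausible encoding of the transitions described in the text (READ_ONLY to WRITE_PENDING on the first write intent, WRITE_PENDING to WRITABLE once the client updates the replica generation, WRITABLE to SYNC_PENDING when the resync tool picks the file, SYNC_PENDING back to READ_ONLY when resync completes); treat it as a sketch, not the authoritative diagram:<br />

```c
enum flr_state { READ_ONLY, WRITE_PENDING, WRITABLE, SYNC_PENDING };

/* Return 0 if the transition is allowed by the state machine sketched
 * above, -1 otherwise. */
int flr_transition(enum flr_state cur, enum flr_state next)
{
	switch (cur) {
	case READ_ONLY:
		return next == WRITE_PENDING ? 0 : -1;
	case WRITE_PENDING:
		return next == WRITABLE ? 0 : -1;
	case WRITABLE:
		return next == SYNC_PENDING ? 0 : -1;
	case SYNC_PENDING:
		return next == READ_ONLY ? 0 : -1;
	}
	return -1;
}
```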
<br />
In addition, each replica can be in one of the following states:<br />
<br />
* SYNC – the replica includes the uptodate data;<br />
* STALE – the replica may contain out-of-date data. The replica can be written (in Phase 2), but it cannot be used to serve reads while in this state.<br />
* OFFLINE - The replica failed to be resynchronized and is not being updated. A full resynchronization is required to bring the file back to sync.<br />
<br />
When the replicated file is in READ_ONLY state, all but one of the replicas must be in SYNC state. If it is in WRITE_PENDING or WRITABLE state, only the primary replica is in SYNC state; the other replicas must be in STALE or OFFLINE state.<br />
==== Wrap it up ====<br />
<br />
The diagram below shows how a writing client interacts with the MDT before writing a replicated file.<br />
<br />
''... need to figure out how to insert diagram ...''<br />
<br />
The client sends an RPC called MDT_INTENT_WRITE to the MDT before it writes a replicated file. When the MDT receives the MDT_INTENT_WRITE RPC request, and it turns out that the layout has to be changed, it will update the layout synchronously. The reason to update the layout synchronously is that if the MDT crashes before the update is committed, and the writing client fails to replay the RPC, the replicated file could be corrupted.<br />
<br />
The format of the MDT_INTENT_WRITE RPC is as follows:<br />
 struct req_msg_field *mds_intent_write_server[] = {<br />
         &RMF_PTLRPC_BODY,<br />
         &RMF_MDT_EPOCH<br />
 };<br />
 <br />
 struct mdt_write_intent {<br />
         struct lustre_handle mwi_handle; /* open handle */<br />
         struct lu_fid        mwi_fid;<br />
         __u64                mwi_start;  /* write start in bytes */<br />
         __u64                mwi_end;    /* write end in bytes */<br />
 };<br />
 <br />
 struct req_msg_field RMF_WRITE_INTENT =<br />
         DEFINE_MSGF("write_intent", 0,<br />
                     sizeof(struct mdt_write_intent),<br />
                     lustre_swab_mdt_write_intent, NULL);<br />
 <br />
 struct req_msg_field *mds_intent_write_client[] = {<br />
         &RMF_PTLRPC_BODY,<br />
         &RMF_WRITE_INTENT<br />
 };<br />
 <br />
 struct req_format RQF_LDLM_INTENT_WRITE =<br />
         DEFINE_REQ_FMT0("LDLM_INTENT_WRITE",<br />
                         mds_intent_write_client,<br />
                         mds_intent_write_server);<br />
<br />
The MDT_INTENT_WRITE RPC will be sent in ll_file_write() before writing any data to the file.<br />
<br />
=== Prevent stale data from being read ===<br />
<br />
To avoid cascading problems, when a layout lock is canceled on the client side, the corresponding file will be marked as having a STALE layout, but cached pages are not destroyed. This creates a problem: a reading process may read stale data while the replicated file is being written.<br />
<br />
Let’s take an example:<br />
<br />
A replicated file has two replicas: replica #0 and #1. Client A reads the file and caches some pages from replica #0. Client B starts to write the file, and replica #1 is chosen, so replica #0 is marked as stale. Client A can continue to read its cached pages, which are already stale.<br />
<br />
We’re going to solve this problem in this section.<br />
==== Versioned cl_page ====<br />
<br />
A field will be added to cl_page to remember the layout generation at page creation time. Pages’ layout generation number can be changed under some circumstances.<br />
<br />
To avoid stale pages being read, clients have to make sure that no pages from a previous layout generation exist, by traversing the page cache before IO starts. Stale pages will either be destroyed, or their version will be upgraded to the latest if the corresponding replica is still valid.<br />
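The pre-IO sweep can be sketched as follows, with a toy page struct standing in for cl_page bookkeeping; whether a stale page is upgraded or destroyed depends on whether its replica is still valid:<br />

```c
#include <stddef.h>

struct vpage {
	unsigned int gen;   /* layout generation at page creation time */
	int valid;          /* page may still be used for IO */
};

/* Before IO starts, ensure no page predates the current layout
 * generation: upgrade the stamp when the page's replica is still
 * valid, otherwise invalidate the page.  Returns the number of
 * pages dropped. */
size_t sweep_stale_pages(struct vpage *pages, size_t n,
			 unsigned int cur_gen, int replica_still_valid)
{
	size_t dropped = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].gen == cur_gen || !pages[i].valid)
			continue;
		if (replica_still_valid) {
			pages[i].gen = cur_gen;
		} else {
			pages[i].valid = 0;
			dropped++;
		}
	}
	return dropped;
}
```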
<br />
In the implementation, a field will be added to ll_inode_info to remember the lowest layout generation of any cl_page in the page cache. If the layout lock was lost due to false sharing, no page cache cleanup will be done at all.<br />
==== Mapped page handling ====<br />
<br />
There is one case where processes can access pages without initiating IO: mapped pages. If pages are already mapped into a process's memory space, they can be read directly via memcpy(). Versioned cl_page cannot help in this case, which poses the risk that stale pages can still be read.<br />
<br />
One solution is to tear down ALL mapped pages when the layout lock is revoked. This is expensive in the presence of false sharing, but processes using mmap() have to pay that price themselves.<br />
=== Replica Resynchronization ===<br />
<br />
After a replicated file is written, it will have only one valid replica and others will be marked as STALE. We need a mechanism to synchronize replicas, i.e., copy data from SYNC replica to STALE replicas, and change the layout to mark all successfully copied replicas as SYNC again.<br />
<br />
Obviously we don't want to synchronize replicated files that are still being written, nor files that may be modified again in the near future. There will be a configurable time, called the quiescent time, and only files that haven't been modified within that time can be picked for synchronization.<br />
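The quiescent-time policy reduces to a simple predicate; the helper below is only an illustrative model, not Lustre code:<br />

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* A file becomes eligible for resynchronization only once at least
 * 'quiescent' seconds have elapsed since its last modification.
 * The names here are illustrative assumptions. */
static bool eligible_for_resync(time_t mtime, time_t now, time_t quiescent)
{
    return now - mtime >= quiescent;
}
```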
==== Replica Read and Write ====<br />
<br />
When reading from a replicated file, data must be read from SYNC replicas. However, to synchronize replicas we need to write data to one specific replica; this is called replica write.<br />
<br />
Sometimes we want to periodically verify that the replicas are synchronized, to guard against software bugs or hardware malfunctions. In that case replica read is useful: we can read blocks from one replica and compare them with the others to make sure that each replica contains exactly the same data.<br />
<br />
An ioctl() interface will be created to select the replica index on an open replicated file. Replica read and write can only be used with direct IO; otherwise the client page cache could be tainted by stale pages. During implementation, independent open handles will be provided for each replica, so that ioctl() does not have to be called to switch among replicas.<br />
==== File lease ====<br />
<br />
While a replicated file is being resynchronized, we want to be notified if the file is about to be modified. Initially, an exclusive lease will be held for the duration of the synchronization. Therefore, when the file is opened by another entity, the synchronizing process will stop and restart after the quiescent time has elapsed again.<br />
<br />
In a later phase of redundancy, a read lease will be held during resynchronization. In this way, reads and resynchronization can coexist, but when the file is opened for write the resynchronization process will stop.<br />
==== Resynchronization ====<br />
<br />
The client that resynchronizes a replicated file is not required to be the client that wrote it. In fact, dedicated clients, named agent clients, could be used to pick files for which at least the quiescent time has elapsed since the last write. A simple ChangeLog monitoring tool will provide a list of files (FIDs) that have been marked stale, and can monitor the open state on the MDT. It would also be possible to integrate this resync operation with a policy engine such as RobinHood to prioritize if and when to resynchronize a file, or to decide whether replicas should be moved to other OSTs if the existing ones are unavailable for a long time. Such integration with RobinHood is out of scope for the Phase 1 design and implementation; only the simple ChangeLog monitoring tool will be provided.<br />
<br />
Of course, a file can also be resynchronized manually with the command lfs layout --resync.<br />
<br />
The flow chart above depicts the process of resynchronization.<br />
<br />
The CLOSE RPC will be extended to update the layout and perform the exclusive close atomically. It should carry the list of replica IDs that were resynchronized successfully, in case some replicas failed to be written.<br />
<br />
The format of MDS_CLOSE will be revised as follows:<br />
struct close_data {<br />
struct lustre_handle cd_handle;<br />
struct lu_fid cd_fid;<br />
__u64 cd_data_version;<br />
__u64 cd_flags;<br />
__u64 cd_reserved[7];<br />
__u32 cd_nr_replicas;<br />
__u32 cd_replica_ids[];<br />
};<br />
struct req_msg_field *mdt_release_close_client[] = {<br />
&RMF_PTLRPC_BODY,<br />
&RMF_MDT_EPOCH,<br />
&RMF_REC_REINT,<br />
&RMF_CAPA1,<br />
&RMF_CLOSE_DATA<br />
};<br />
<br />
The replicas that have been successfully synchronized are recorded in the cd_nr_replicas and cd_replica_ids fields. The MDT will clear the STALE bit on those synchronized replicas.<br />
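The MDT-side effect can be modeled as clearing bits in a stale-replica mask; the bitmask representation and names below are assumptions for illustration, not the actual MDT code:<br />

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model: the MDT clears the STALE flag of every replica whose
 * ID appears in cd_replica_ids.  Replica state is modeled as a bitmask
 * indexed by replica ID; the real on-disk representation differs. */
#define TOY_STALE(id) (1u << (id))

static uint32_t clear_synced_replicas(uint32_t stale_mask,
                                      const uint32_t *replica_ids,
                                      uint32_t nr_replicas)
{
    for (uint32_t i = 0; i < nr_replicas; i++)
        stale_mask &= ~TOY_STALE(replica_ids[i]);
    return stale_mask;
}
```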
=== Client Eviction ===<br />
<br />
Versioned write requests and versioned replicas are the key to solving the eviction problem. Before resynchronization starts, the agent client will flush its cached pages and increase the layout generation number. If a client holding a write open handle is evicted, its requests' version will not have been updated, so any write requests from that client will be denied by the OSTs.<br />
<br />
Write open handles from evicted clients can be recovered under some circumstances. If the file is reopened for writing on the same client, writes through the original file descriptor become valid again. However, if the file state has changed, the open handle cannot be recovered; for example, if the file was opened for execute, or was deleted during the eviction.<br />
<br />
== Other operations ==<br />
=== File Attributes ===<br />
<br />
As a by-product of the redundancy mechanism, Size-on-MDT (SOM) will be fully functional for files with up-to-date replicas. If a replicated file is in the READ_ONLY state, the file's atime, mtime, and size are static and can safely be stored on the MDT, and no OST RPCs are needed to fetch the file's attributes. It is desirable that the blocks count reported for a file include the blocks of all replicas, so that users and administrators can use common tools like "du", "ls -s", or "find" to accurately identify the space and quota used by each file.<br />
<br />
If the file is not in the READ_ONLY state, it is treated as a normal file: no attributes are cached on the MDT, and glimpse RPCs are sent to the OSTs of the primary replica to fetch the attributes from its objects. The blocks count, however, is the sum over every object the file is using, even those in stale replicas, which implies that stat() would need to retrieve attributes from all OST objects in all replicas of the file's layout. Since the blocks count does not have to be exact under these conditions, we will instead glimpse only the primary replica and multiply the returned blocks count by the replica count. This approximates the blocks count the file will have once resynchronization is complete; we call it the eventual block count.<br />
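The eventual block count computation is then a single multiplication; the helper below is an illustrative sketch, not actual Lustre code:<br />

```c
#include <assert.h>
#include <stdint.h>

/* "Eventual" block count for a file that is not READ_ONLY: glimpse only
 * the primary replica and multiply by the replica count, approximating
 * the space usage after resynchronization completes. */
static uint64_t eventual_block_count(uint64_t primary_blocks,
                                     uint32_t replica_count)
{
    return primary_blocks * replica_count;
}
```

For example, if a glimpse of the primary replica reports 1024 blocks and the file has 3 replicas, the eventual block count is 3072.<br />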
=== Truncate ===<br />
<br />
Truncate will be handled by the same mechanisms as write – the primary replica will be truncated and other replicas marked STALE. The replicas can be returned to READ_ONLY state once they have also been truncated.<br />
=== Mmap ===<br />
<br />
A writable mapping is handled the same as a normal write, and the MDT will be notified in ll_file_mmap().<br />
=== Lockless IO ===<br />
<br />
Lockless IO is handled in the same manner as regular writes. Since the secondary replicas are marked STALE at open time or before the first write, the presence or absence of OST object locking is irrelevant.<br />
=== Missing OSTs ===<br />
<br />
We already have a tool ("<tt>lfs find --ost {index, ...}</tt>") to scan for the files affected when OST(s) are removed permanently. The tool will be updated for redundancy to also report the index of any replica on a failed OST, making it easier to find and repair the affected replica.<br />
<br />
=== Quota ===<br />
<br />
Quota accounting for replicated files will account the full space usage of each replica against the file owner's quota limits. This is both the simplest mechanism to implement and understand, and is also a natural consequence of users being able to selectively specify the redundancy level of each file. In sites where quota limits are enforced for users, the users can make the decision which files need one or more replicas, and the additional space usage of the replicas will be counted for their UID accordingly. This also allows users to create multiple replicas when they are needed, but discourages the use of replicas when they are not necessarily required (e.g. for every short-lived application checkpoint or other temporary files). This accounting mechanism is also no different than the user creating two separate copies of a particular file in the filesystem namespace, which would also consume twice as much quota as having a single file.<br />
=== Compatibility ===<br />
<br />
Older clients accessing replicated files must support the layout lock that arrived in Lustre 2.4. If this requirement is not met, the client will see LOV_BAD_MAGIC in the layout magic number, and its attempt to access the layout will fail. For clients that support the layout lock but do not understand the redundancy feature, the MDS will immediately select a primary replica and synchronously mark the other replicas stale when such a client opens the file O_RDWR or O_WRONLY.<br />
At this point the old client can modify the file, and the resync tool will resynchronize it the next time it is run on the file.<br />
== Open Issues ==<br />
<br />
We expect no performance impact on the initial file write when using delayed redundancy, since the write mechanism is largely unchanged from the current implementation. We also expect improved performance for concurrent reads of replicas from multiple clients, since multiple OSTs will hold the file data. However, producing a detailed performance model for this design is beyond the scope of this work, since the overhead of creating and resyncing file replicas is entirely driven by user-space policy and by environment-specific characteristics such as how often existing files are modified and the overall duty cycle of the filesystem.<br />
== Implementation Efforts and Tasks Partition ==<br />
=== Feature List ===<br />
<br />
Each feature must be testable. Size is a value ranging from 1 to 10 that indicates the relative size of the feature.<br />
<br />
{| class=wikitable<br />
! No. || Name || Description || Index || Size<br />
|-<br />
| 1 || L.composite || Composite layout - by layout enhancement project || 3.1 || - <br />
|-<br />
| 2 || L.interpreter || Build up objects layout based on the composite layout at LOV and LOD layer; Show layout information by lfs getstripe; Any attempt to access replicated file will return -ENOTSUPP || 3.1 || 7<br />
|-<br />
| 3 || L.flags || Define layout flags for redundancy use; Define replicated file state flags; Provide interfaces to change, set, and clear those flags || 6.1.3 || 4<br />
|-<br />
| 4 || F.layout.raid1 || Command line tool. Create an empty RAID-1 file || 4.1 || 4<br />
|-<br />
| 5 || F.layout.cmd || Command line tools. || 4.2 || 6<br />
|-<br />
| 5.1 || || Convert a normal file to composite file and add a replica into it; Detach a replica by index, the replica should be stored into a separate file; Delete a replica by index and destroy the replica || 4.3 || <br />
|-<br />
| 6 || F.replica.policy || Define policy to choose replica; Glimpse should return correct file size || 5.1 || 3<br />
|-<br />
| 7.1 || F.read || Read for replicated file || 5.2 || 3<br />
|-<br />
| 7.2 || || Return correct file content even if some replicas are inaccessible || 5.3 || 2<br />
|-<br />
| 8 || F.replica.version || Versioned replica support. OST write requests carry layout generation; layout generation on OST objects; Set and get layout generation from OST objects || 6.1.2 || 5<br />
|-<br />
| 9 || F.write || Write support for replicated file. Notify MDT to pick primary replica and mark other replicas stale; Writing client updates replica version; Write can succeed; Read returns correct data from primary replica || 6.1.6 || 9<br />
|-<br />
| 10 || F.read.stale || Make sure no stale data will be read after writes. || 6.2 || 4<br />
|-<br />
| 11 || F.replica.rw || Replica read and write; Command line tool to resync || 6.3.1, 4.4 || 8 <br />
|-<br />
| 12 || F.layout.verify || Command line tool to verify replicas || 4.5 || 4<br />
|-<br />
| 13 || F.resync || Resynchronization || 6.3.3 || 9<br />
|-<br />
| 14 || F.mmap || Mmap support || 7.3 || 1<br />
|-<br />
| 15 || F.truncate || Truncate support || 7.2 || 1<br />
|-<br />
| 16 || F.read.lease || Read lease support || 6.3.2 || 3<br />
|-<br />
| 17 || F.open.recover || For evicted clients, sometimes the open handlers can be recovered || 6.4 || 6<br />
|-<br />
| 18 || S.compatibility || || 7.5.1 || -<br />
|}<br />
<br />
[[Category:Design]]<br />
[[Category:Layout]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Category:DoM&diff=2709Category:DoM2017-08-28T22:15:37Z<p>Nrutman: Category page</p>
<hr />
<div>Data on MDT feature</div>Nrutmanhttp://wiki.lustre.org/index.php?title=PFL2_High_Level_Design&diff=2708PFL2 High Level Design2017-08-28T21:24:17Z<p>Nrutman: /* lfs find */</p>
<hr />
<div>==Introduction==<br />
<br />
The Progressive File Layout Phase 2 (PFL2) High Level Design describes details of how the PFL feature may be implemented, including the user interfaces for both the command-line as well as Lustre-aware applications, how PFL files will interact with the client-IO (CLIO) layer in the Lustre kernel VFS driver, as well as the RPC formats between the client and servers, and the interface to the underlying storage. This document further builds upon the reference documents below.<br />
<br />
This design is intended to be comprehensive for both the current PFL Phase 2 implementation and a future PFL Phase 3 implementation, so some use cases describe functionality that will not be implemented as part of PFL Phase 2 but are included here so that the overall PFL design is complete, and to ensure that functionality implemented in PFL Phase 2 considers the longer-term implementation goals and does not need to be reworked once PFL Phase 3 implementation is started. Design aspects that are not intended to be implemented in PFL Phase 2 are marked as such in this document or the [[PFL2 Scope Statement]].<br />
<br />
===References===<br />
[[Layout Enhancement High Level Design]]<br />
<br />
[[Progressive File Layouts]]<br />
<br />
[[PFL Prototype Scope Statement]]<br />
<br />
[[PFL Prototype Solution Architecture]]<br />
<br />
[[PFL Prototype High Level Design]]<br />
<br />
[[PFL2 Scope Statement]]<br />
<br />
[[PFL2 Solution Architecture]]<br />
<br />
==Design Overview==<br />
There are three main components to the PFL design:<br />
<br />
* the user-space interfaces for Lustre-specific command-line tools and user library application programming interfaces (APIs)<br />
* changes to the client IO (CLIO) layer IO engine and RPC interfaces to the MDS for accessing and manipulating composite file layouts<br />
* changes to the MDS server to create, modify, and delete composite files<br />
<br />
The design is structured in a top-down manner, starting with the command-line interfaces that users are going to interact with the most, then the user library APIs, the client-side kernel changes for reading, writing, and accessing PFL files, RPCs for creating and modifying composite files, and finally server-side changes. There is also a discussion of protocol and disk format compatibility issues.<br />
<br />
==Client Side Design==<br />
==User Space Interfaces==<br />
===lfs Command-line Interface===<br />
The lfs(1) command-line interface will be extended to understand and manipulate PFL files and their component layouts. lfs is the primary interface for end users to create new files with a specific layout, show the layout of existing files, and set default layout templates on directories that will be inherited by all new files and subdirectories created therein.<br />
<br />
The [[pfl2-lfs-setstripe.1|lfs setstripe(1)]], lfs migrate(1), lfs getstripe(1), and lfs find(1) sub-commands will be extended to set and display the composite layout of a file, and to search for files with specific composite layout parameters or for components that match specific parameters. The added command-line arguments, along with descriptions and examples for each of these commands, are given on a dedicated man page for each sub-command (linked from the command name), so only the synopsis and a brief description of each command is shown here.<br />
<br />
====lfs getstripe====<br />
The lfs getstripe command prints some or all of the parameters of a file's layout. This is intended for regular users and administrators to query a particular file's layout, or the individual components of a composite file to examine the layout used to create the file.<br />
<br />
lfs getstripe [--stripe-count|-c ] [--directory|-d] [--stripe-index|-i]<br />
[--layout|-L] [--mdt-index|-m] [--ost|-O <uuid>] [--pool|-p]<br />
[--recursive|-r] [--raw|-R] [--stripe-size|-S] '''[--component-start [start]]<br />
'''[--component-end|-E [end]] [--component-flag|-F [flag]] [--component-id|-I [id]]<br />
'''[--component-count]''' [--quiet|-q] [--verbose|-v] {dirname|filename} ...<br />
<br />
Without any of the option flags, this will display all the layout components, as shown below. To limit the display to specific values of the layout, the options are largely the same as the current lfs getstripe, with new parameters for extracting attributes of composite files, such as the start and end extent of the last instantiated component, the unique component identifier, and the component attribute flags. By default, when requesting specific values of the layout, this will print the parameters of the last instantiated component of the layout, since this is the one that affects the current IO behaviour, and if a single parameter needs to be selected that best represents the file it should come from the layout that the file needed at its current size. It is also possible to select a specific file component by its offset within the file or attribute flags to print specific values from the specified component of the layout. If multiple component options are specified, such as --component-end=64M and --component-flag=uninit, then lfs getstripe will return the attributes of the component that matches all specified options. If no component matches all specified component options, then nothing will be printed.<br />
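The conjunctive matching rule above (a component is selected only if it satisfies every specified option) can be sketched as follows; this is a simplified user-space model with invented names, not the lfs implementation:<br />

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sentinel values mean "option not given on the command line". */
#define ANY64 UINT64_MAX
#define ANYFL UINT32_MAX

/* Minimal stand-in for a layout component; real fields differ. */
struct toy_comp { uint64_t end; uint32_t flags; };

/* A component matches only if every specified option is satisfied;
 * if any check fails, nothing is printed for this component. */
static bool comp_matches(const struct toy_comp *c,
                         uint64_t want_end, uint32_t want_flags)
{
    if (want_end != ANY64 && c->end != want_end)
        return false;
    if (want_flags != ANYFL && (c->flags & want_flags) != want_flags)
        return false;
    return true;
}
```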
<br />
Since the output format needs to be changed for composite files, the output is YAML formatted for both ease of parsing and still be human readable. This is still reasonably similar to the original output format, with the exception of the OST object ID information, which is now more structured for ease of use.<br />
# An output example of a file with 3 components<br />
$ lfs getstripe -v /mnt/lustre/file<br />
"/mnt/lustre/file":<br />
fid: "[0x200000400:0x2c3:0x0]"<br />
composite_header:<br />
composite_magic: 0x0BDC0BD0<br />
composite_size: 536<br />
composite_gen: 4<br />
composite_flags: 0<br />
component_count: 3<br />
components:<br />
- component_id: 1<br />
component_flags: 0<br />
component_start: 0<br />
component_end: 2097152<br />
component_offset: 152<br />
component_size: 56<br />
sub_layout:<br />
lmm_magic: 0x0BD10BD0<br />
lmm_pattern: 1<br />
lmm_stripe_size: 1048576<br />
lmm_stripe_count: 1<br />
lmm_stripe_index: 7<br />
lmm_pool: flash<br />
lmm_layout_gen: 0<br />
lmm_obj:<br />
- 0: { lmm_ost: 7, lmm_fid: "[0x100070000:0x2:0x0]" }<br />
- component_id: 2<br />
component_flags: 0<br />
component_start: 2097152<br />
component_end: 16777216<br />
component_offset: 208<br />
component_size: 128<br />
sub_layout:<br />
lmm_magic: 0x0BD10BD0<br />
lmm_pattern: 1<br />
lmm_stripe_size: 1048576<br />
lmm_stripe_count: 4<br />
lmm_stripe_index: 0<br />
lmm_layout_gen: 0<br />
lmm_obj:<br />
- 0: { lmm_ost: 0, lmm_fid: "[0x100000000:0x2:0x0]" }<br />
- 1: { lmm_ost: 1, lmm_fid: "[0x100010000:0x3:0x0]" }<br />
- 2: { lmm_ost: 2, lmm_fid: "[0x100020000:0x4:0x0]" }<br />
- 3: { lmm_ost: 3, lmm_fid: "[0x100030000:0x4:0x0]" }<br />
- component_id: 4<br />
component_flags: 0<br />
component_start: 16777216<br />
component_end: 18446744073709551615<br />
component_offset: 336<br />
component_size: 176<br />
sub_layout:<br />
lmm_magic: 0x0BD10BD0<br />
lmm_pattern: 1<br />
lmm_stripe_size: 4194304<br />
lmm_stripe_count: 6<br />
lmm_stripe_index: 5<br />
lmm_pool: archive<br />
lmm_layout_gen: 0<br />
lmm_obj:<br />
- 0: { lmm_ost: 5, lmm_fid: "[0x100050000:0x2:0x0]" }<br />
- 1: { lmm_ost: 6, lmm_fid: "[0x100060000:0x2:0x0]" }<br />
- 2: { lmm_ost: 7, lmm_fid: "[0x100070000:0x3:0x0]" }<br />
- 3: { lmm_ost: 0, lmm_fid: "[0x100000000:0x3:0x0]" }<br />
- 4: { lmm_ost: 1, lmm_fid: "[0x100010000:0x4:0x0]" }<br />
- 5: { lmm_ost: 2, lmm_fid: "[0x100020000:0x5:0x0]" }<br />
<br />
====lfs setstripe====<br />
The lfs setstripe command creates a new file with the specified layout parameters, or sets the specified layout parameters as the default layout template on a parent directory.<br />
<br />
lfs setstripe {--component-end|-E end1} [component1_OPTIONS] [{--component-end|-E end2} [component2_OPTIONS] ...] {directory|filename}<br />
lfs setstripe --component-del [--component-id|-I comp_id] [--component-flags|-F flags] filename<br />
lfs setstripe --component-set [--component-id|-I comp_id] {--component-flags|-F flags} filename<br />
<br />
Since this is the primary command-line interface for users creating new files with Lustre-specific layouts, there are a significant number of existing options that can be used. Adding composite-file-specific options to lfs setstripe allows the same code to create both files with plain layouts and files with composite layouts, without duplicating a large number of options. The command-line arguments of lfs setstripe are described in detail in the lfs-setstripe(1) man page. The significant changes to these arguments are the addition of the --component-end argument for specifying which component is being modified during file creation, and allowing multiple components to be specified on the same command line so that the file does not need to be created piecemeal.<br />
<br />
An example from the man page illustrates the flexibility of the file creation interface:<br />
$ lfs setstripe -E 4M -c 1 --pool flash -E 64M -c 4 -S 4M -E -1 -c -1 -S 16M --pool archive /mnt/lustre/file1<br />
<br />
This creates a file with composite layout in a single operation, rather than building it up one component at a time. The first component has a single stripe that covers [0, 4MiB) and is allocated from an OST in the flash pool. The second component has four stripes that cover [4MiB, 64MiB) and has a stripe size of 4MiB. The last component covers [64MiB, EOF), has a stripe size of 16MiB, and uses all available OSTs in the archive pool.<br />
<br />
Note that the '''setstripe options''' on the command line are inheritable: the options given for an earlier component are used by the following components unless they are changed. For example, if the ''-c'' option appears in the command line as follows:<br />
$ lfs setstripe -E 4M -c 1 -E 8M -E 32M -c 4 -E eof<br />
<br />
This creates the components [0, 4M) and [4M, 8M) with 1 stripe, and [8M, 32M) and [32M, EOF) with 4 stripes. This inheritance applies to all '''setstripe options'''.<br />
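The inheritance rule can be modeled as a forward fill over the per-component options; the function below is only an illustrative sketch, not the lfs code:<br />

```c
#include <assert.h>

/* Model of setstripe option inheritance: a component that does not
 * specify -c inherits the stripe count from the previous component.
 * Here 0 marks "unspecified"; the real lfs code works differently. */
static void inherit_stripe_counts(int *counts, int n, int dflt)
{
    int cur = dflt;
    for (int i = 0; i < n; i++) {
        if (counts[i] == 0)
            counts[i] = cur;   /* inherit from previous component */
        else
            cur = counts[i];   /* explicit value carries forward */
    }
}
```

For the example above, components with stripe counts (1, unspecified, 4, unspecified) resolve to (1, 1, 4, 4).<br />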
<br />
There is work in progress to add an explicit ''--parent'' option that resets the previous '''setstripe options''' so that the system default stripe options are used thereafter.<br />
<br />
====lfs migrate====<br />
The lfs migrate command moves a file's data from one (set of) OST(s) to another (set of) OST(s). This is done by copying the file data from the existing source file layout to a new target file layout as specified by the user. Most of the options of lfs migrate are the same as those of lfs setstripe, since lfs migrate also creates a new file layout for the file data to be copied into.<br />
<br />
lfs migrate [--component-id|-I comp_id] [OPTIONS] filename<br />
<br />
With the addition of composite files in this project, it needs to be possible to migrate a composite file, or a sub-component of that file, to new OST object(s) using the specified parameters. If a component ID is specified, then only that component should be migrated, and the new component should use the same start and end offsets as the source component so that the source component can be replaced without violating the PFL layout rules.<br />
<br />
====lfs find====<br />
The lfs find command is a Lustre-optimized and enhanced version of the find(1) command. It adds several extended options for matching Lustre-specific parameters of the file layout. It also optimizes file access by avoiding fetching OST object attributes for each file checked when the decision can be made based only on the information initially retrieved from the MDT inode.<br />
<br />
lfs find {directory|filename} ... [[!] --atime|-A [-+]days] [[!] --mtime|-M [-+]days]<br />
[[!] --ctime|-C [+-]days] [--maxdepth|-D depth] [[!] --mdt|-m {mdt_uuid|mdt_index,...}]<br />
[--name|-n pattern] [[!] --ost|-O {ost_uuid|ost_index,...}] [--print|-p] [--print0|-P]<br />
[[!] --size|-s [-+]bytes[kMGTPE]] [[!] --stripe-count|-c [+-]stripes]<br />
[[!] --stripe-index|-i {ost_index,...}] [[!] --stripe-size|-S [+-]bytes[kMG]]<br />
[[!] --layout|-L {raid0,released,'''composite'''}] [--type |-t {bcdflps}]<br />
[[!] --gid|-g|--group|-G {group_name|gid}] [[!] --uid|-u|--user|-U {user_name|uid}]<br />
[[!] --pool pool] '''[--component-start start] [--component-end|-E end]<br />
'''[[!] --component-count [+-]count] [--component-flags|-F flags]<br />
<br />
The existing command is enhanced with the --component-start, --component-end, --component-count, and --component-flags options to allow limiting the search criteria to specific extents or components of the file.<br />
<br />
===llapi_layout_comp_* Library API===<br />
<br />
The llapi_layout_* interfaces provide an interface for userspace applications, including lfs, to specify plain and composite file layouts in an abstract manner, and then convert those abstract layouts into actual file layouts depending on the final attributes of the layout. The main data structure for llapi_layout_* functions is struct llapi_layout, which is opaque to userspace, but internally stores all of the attributes of a single plain layout or a single component's sub-layout. For composite file layouts, the API will be extended to handle layouts with multiple components and other composite file specific attributes, for use in Lustre-specific tools such as lfs setstripe, lfs getstripe, and lfs find, as well as by end-user applications or libraries that want to create files with specific composite layouts to optimize file IO patterns, such as HDF5.<br />
<br />
A composite layout can be composed of several layout components, and each component's sub-layout will be described by the opaque data in struct llapi_layout; therefore, a few more fields must be added to the structure:<br />
struct llapi_layout {<br />
uint32_t llot_magic;<br />
uint64_t llot_pattern;<br />
uint64_t llot_stripe_size;<br />
uint64_t llot_stripe_count;<br />
uint64_t llot_stripe_offset;<br />
/** Indicates if llot_objects array has been initialized. */<br />
bool llot_objects_are_valid;<br />
/* Add 1 so user always gets back a null terminated string. */<br />
char llot_pool_name[LOV_MAXPOOLNAME + 1];<br />
/* fields for composite layouts */<br />
+ struct lu_extent llot_extent; /* [start, end) of component */<br />
+ uint32_t llot_id; /* unique ID of component */<br />
+ uint32_t llot_flags; /* LCME_FL_* flags */<br />
+ struct list_head llot_list; /* linked list of llapi_layout components */<br />
struct lov_user_ost_data_v1 llot_objects[0];<br />
};<br />
<br />
* '''llot_extent''': The file extent covered by current component; initially assigned by the caller when defining a layout component.<br />
* '''llot_id''': The numeric ID of current component; this may be assigned internally by the llapi_layout_*() interfaces for identification purposes, but the final component ID assignment is the responsibility of the MDS.<br />
* '''llot_flags''': The flags of current component;<br />
* '''llot_list''': List of all the components of the same composite layout;<br />
<br />
A new pair of interfaces will be introduced to set/get the extent of a layout component. The llapi_layout_comp_extent_get(3) function will fetch the start and end offset of the current layout component, and llapi_layout_comp_extent_set(3) will set the extent of a layout component currently being constructed, within acceptable parameters for that component.<br />
int llapi_layout_comp_extent_set(struct llapi_layout *layout, uint64_t start, uint64_t end);<br />
int llapi_layout_comp_extent_get(const struct llapi_layout *layout, uint64_t *start, uint64_t *end);<br />
<br />
A new set of interfaces will be introduced to get, set, and clear the attribute flags of a layout component. The llapi_layout_comp_flags_get(3) function gets the attribute flags of the current component. The llapi_layout_comp_flags_set(3) command sets the specified flags of the current component leaving other flags as-is, while llapi_layout_comp_flags_clear(3) clears the flags specified in the flags word leaving other flags as-is.<br />
int llapi_layout_comp_flags_get(const struct llapi_layout *layout, uint32_t *flags);<br />
int llapi_layout_comp_flags_set(struct llapi_layout *layout, uint32_t flags);<br />
int llapi_layout_comp_flags_clear(struct llapi_layout *layout, uint32_t flags);<br />
<br />
The new llapi_layout_comp_id_get(3) interface fetches the file-unique component ID of the current layout component. <br />
int llapi_layout_comp_id_get(const struct llapi_layout *layout, uint32_t *id);<br />
<br />
A new pair of interfaces will be introduced to add/delete a component to/from a composite layout. The llapi_layout_comp_add(3) function adds the given layout component comp to the existing composite file layout layout, allowing compound composite layouts to be created at one time. The llapi_layout_comp_del(3) function deletes the specified layout component comp from the composite layout layout.<br />
int llapi_layout_comp_add(struct llapi_layout *layout, struct llapi_layout *comp);<br />
int llapi_layout_comp_del(struct llapi_layout *layout, struct llapi_layout *comp);<br />
<br />
A new interface, llapi_layout_comp_get_by_id(3), will be introduced to fetch a component by ID if the user or application already knows the ID:<br />
struct llapi_layout *llapi_layout_comp_get_by_id(const struct llapi_layout *layout, uint32_t id);<br />
<br />
A new interface, llapi_layout_comp_next(3), will be introduced to iterate over all components of a composite layout, by selecting each component in turn internally and then allowing the various llapi_layout_comp_*() operations on that component's layout:<br />
struct llapi_layout *llapi_layout_comp_next(const struct llapi_layout *layout);<br />
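To show how an application might iterate components with such an interface, here is a toy array-backed model; the real struct llapi_layout is opaque and list-based, so the names and structure below are assumptions for illustration only:<br />

```c
#include <assert.h>

/* Toy model of llapi_layout_comp_next()-style iteration: the components
 * of a composite layout are visited in order.  The real structure is
 * opaque to userspace; this simplified version is only illustrative. */
struct toy_layout {
    int comp_ids[8];    /* IDs of the components, in order */
    int comp_count;     /* number of components present */
    int cur;            /* index of the "current" component, -1 = none */
};

/* Advance to the next component; return its ID, or -1 past the end. */
static int toy_comp_next(struct toy_layout *l)
{
    if (l->cur + 1 >= l->comp_count)
        return -1;
    l->cur++;
    return l->comp_ids[l->cur];
}
```

An application would call this in a loop until it returns the end marker, applying per-component get/set operations to the current component each time.<br />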
<br />
The existing llapi_layout_to_lum() and llapi_layout_from_lum() interfaces should be extended to handle composite layouts. The new user metadata format for composite layouts is defined in [[Layout Enhancement High Level Design]].<br />
/* data structure representing each layout component, defined in "Layout Enhancement HLD" */<br />
struct lov_comp_md_entry_v1 {<br />
__u32 lcme_id; /* unique identifier of component */<br />
__u32 lcme_flags; /* LCME_FL_XXX */<br />
struct lu_extent lcme_extent; /* file extent for component */<br />
__u32 lcme_offset; /* offset of component blob in layout */<br />
__u32 lcme_size; /* size of component blob data */<br />
__u64 lcme_padding[2];<br />
};<br />
<br />
/* On-disk/wire structure of the composite layout, defined in "Layout Enhancement HLD" */<br />
struct lov_comp_md_v1 {<br />
__u32 lcm_magic; /* LOV_MAGIC_COMP_V1 */<br />
__u32 lcm_size; /* overall size of layout including this structure */<br />
__u32 lcm_layout_gen;<br />
__u16 lcm_flags;<br />
__u16 lcm_entry_count;<br />
__u64 lcm_padding1;<br />
__u64 lcm_padding2;<br />
struct lov_comp_md_entry_v1 lcm_entries[0];<br />
};<br />
<br />
#define lov_user_comp_md lov_comp_md_v1<br />
<br />
A new interface, llapi_layout_file_comp_add(3), adds layout components to an existing file. It converts the passed-in layout into a lov_user_comp_md, then issues setxattr() with the special xattr name defined in "Changes on MDS":<br />
int llapi_layout_file_comp_add(const char *path, const struct llapi_layout *layout);<br />
<br />
A new interface, llapi_layout_file_comp_del(3), deletes component(s) with the specified component ID (also accepting LCME_ID_* wildcards) from an existing file:<br />
int llapi_layout_file_comp_del(const char *path, uint32_t id);<br />
<br />
A new interface, llapi_layout_file_comp_set(3), changes flags or other parameters of component(s) of an existing file, selected by component ID. The component to be modified is specified by the comp->lcme_id value, which may be either a specific component ID or an LCME_ID_* wildcard value. The new attributes are passed in comp, and valid specifies which attributes of the component are to be changed. This allows the interface to be extended to set additional attributes in the future.<br />
int llapi_layout_file_comp_set(const char *path, const struct llapi_layout *comp, uint32_t valid);<br />
<br />
===User Space API Use Cases===<br />
<br />
Several uses of the llapi_layout_* interfaces are shown below as examples of how the new API can be used by userspace tools.<br />
<br />
Use case 1: Create a file with full layout components<br />
/* Allocate opaque layout structure for the first component */<br />
layout1 = llapi_layout_alloc();<br />
<br />
/* Set [0, 2M) extent to the first component */<br />
rc = llapi_layout_comp_extent_set(layout1, 0, 2M);<br />
<br />
/* Set stripe size of the first component */<br />
rc = llapi_layout_stripe_size_set(layout1, 1M);<br />
<br />
/* Set stripe count of the first component */<br />
rc = llapi_layout_stripe_count_set(layout1, 1);<br />
<br />
/* Allocate opaque layout structure for the second component */<br />
layout2 = llapi_layout_alloc();<br />
<br />
/* Add layout2 into layout1, and layout2 will inherit the stripe size of layout1 */<br />
rc = llapi_layout_comp_add(layout1, layout2);<br />
<br />
/* Set [2M, 256M) extent to the second component */<br />
rc = llapi_layout_comp_extent_set(layout2, 2M, 256M);<br />
<br />
/* Set stripe count of the second component */<br />
rc = llapi_layout_stripe_count_set(layout2, 4);<br />
<br />
/* Repeat above steps to create a layout3 with [256M, EOF) */<br />
...<br />
<br />
/* Create file with the composite layout */<br />
rc = llapi_layout_file_create(path, open_flags, open_mode, layout1);<br />
<br />
Use case 2: Create a file with initial component, and add components later<br />
/* Allocate opaque layout structure for the first component */<br />
layout1 = llapi_layout_alloc();<br />
<br />
/* Set [0, 2M) extent to the first component */<br />
rc = llapi_layout_comp_extent_set(layout1, 0, 2M);<br />
<br />
/* Set stripe size of the first component */<br />
rc = llapi_layout_stripe_size_set(layout1, 1M);<br />
<br />
/* Set stripe count of the first component */<br />
rc = llapi_layout_stripe_count_set(layout1, 1);<br />
<br />
/* Create file with the specified initial component */<br />
rc = llapi_layout_file_create(path, open_flags, open_mode, layout1);<br />
<br />
/* Allocate opaque layout structure for the second component */<br />
layout2 = llapi_layout_alloc();<br />
<br />
/* Set [2M, 256M) extent to the second component */<br />
rc = llapi_layout_comp_extent_set(layout2, 2M, 256M);<br />
<br />
/* Set stripe count of the second component */<br />
rc = llapi_layout_stripe_count_set(layout2, 4);<br />
<br />
/* Add the component layout2 into the file */ <br />
rc = llapi_layout_file_comp_add(path, layout2);<br />
<br />
Use case 3: Traverse all components of a composite layout. This is useful for tools such as lfs getstripe, which must iterate over all components of a file without knowing in advance how many components it has. During iteration, the caller can decide which components are of interest and print them, or use each component's file-unique ID to print, modify, or delete it in turn.<br />
/* Get composite layout from existing file */<br />
layout = llapi_layout_get_by_path(path, flags);<br />
<br />
/* Traverse the layout */<br />
comp = layout;<br />
do {<br />
/* Get & print stripe count */<br />
rc = llapi_layout_stripe_count_get(comp, &count);<br />
printf("stripe_count: %llu\n", count);<br />
<br />
/* Get & print stripe size */<br />
rc = llapi_layout_stripe_size_get(comp, &size);<br />
printf("stripe_size: %llu\n", size);<br />
<br />
/* Get & print layout pattern */<br />
rc = llapi_layout_pattern_get(comp, &pattern);<br />
printf("stripe_pattern: %llx\n", pattern);<br />
<br />
comp = llapi_layout_comp_next(comp);<br />
} while (comp != layout);<br />
<br />
===Client-IO Interface===<br />
This design is based on the PFL Prototype High Level Design, which demonstrated the feasibility of the PFL concept. Here we add further detail and address a few problems discovered during the prototype phase, so that the feature can be used in production.<br />
<br />
The PFL prototype design addressed object mapping, the setup of in-memory cl_objects from layout components, and a framework to support the fundamental I/O operations; operations like read, write, setattr, and glimpse work properly. Two additional problems have to be solved before PFL can be used well: creating layout components on demand, and a performance optimization. Both are part of the full PFL design, but will not be implemented in the PFL Phase 2 project.<br />
<br />
For the PFL Phase 2 implementation, the Client IO layer will be able to interpret existing composite file layouts, but will return an -ENODATA error to the application until the Phase 3 handling of ll_layout_intent() to dynamically fetch and instantiate layout components from the MDS is implemented.<br />
<br />
===Create Layout Component on Demand===<br />
The prototype does not create layout components on demand. If an application writes to a file extent with no layout component defined, the client simply returns an error. This implies that the user must know the layout components in advance, and must understand the application’s access pattern well enough that each layout component can be created before it is written.<br />
<br />
To support creating layout components on demand, the administrator can associate a PFL file with a layout template, which describes the number of stripes to be created for each range of file extents, along with other parameters such as stripe size, OST pool, etc. If a file extent with no layout component defined is written, the client sends a dedicated RPC, the layout intent RPC ([[PFL2_Solution_Architecture#A13._Application_writes_within_a_uninitialized_file_component_.28Phase_3.29|A13 in PFL Phase 2 Solution Architecture]]), to the MDT. The MDT then uses the information in the layout template to allocate the corresponding OST objects to form a layout component, which is appended to the file’s layout. After this is done, the client can fetch the new layout and proceed with the I/O. No error should be seen by applications under normal circumstances, though errors remain possible at this point due to environmental factors such as -ENOSPC, -EIO, or -ENOMEM when the MDS creates new OST objects and modifies the file layout on disk.<br />
<br />
The following diagram describes the process of creating a layout component on demand:<br />
[[Image:pfl2_write_flow_chart.png||Flow chart for PFL write]]<br />
<br />
In the above diagram, 'File Update' can be any operation that modifies file contents, such as write, mkwrite, or truncate. Read-only operations won't necessarily trigger layout component allocation; reading file extents with no layout component defined simply returns zero-filled buffers.<br />
<br />
As shown in the diagram, the LOV layer, the only module in CLIO that understands layouts, checks whether the intended write region has a layout defined. If not, it aborts and invokes ll_layout_intent(), which sends a layout intent RPC to the MDT. As mentioned in the PFL prototype design, CLIO splits I/O at layout component boundaries, so if only part of the I/O region has no layout defined, the part with a layout defined is completed first, and then the new layout is requested.<br />
<br />
In the cl_io data structure, a new bit marks that the I/O failed due to a missing layout component:<br />
struct cl_io {<br />
...<br />
/* true if this io failed due to missing layout */<br />
unsigned int ci_no_layout:1;<br />
...<br />
}<br />
<br />
If LOV detects that an update I/O can't be completed due to a missing layout, it sets ci_no_layout and aborts the I/O with error code -ENODATA. vvp_io_fini() should check ci_no_layout, then compose a layout intent RPC and send it to the MDT. The layout intent RPC is an LDLM enqueue RPC with an intent operation, whose payload is as follows:<br />
struct layout_intent {<br />
__u32 li_opc; /* intent operation for enqueue, read, write etc */<br />
__u32 li_flags;<br />
__u64 li_start;<br />
__u64 li_end;<br />
};<br />
<br />
To request a new component, li_opc is set to LAYOUT_INTENT_WRITE, and li_start and li_end are set to the full range of the intended I/O. The MDT may decide to create multiple layout components within this range at its discretion. CLIO then fetches the new layout and continues the I/O from where it stopped.<br />
===Flushing Cached Pages Wisely under Layout Change===<br />
<br />
The client-side page cache may need flushing whenever the file layout changes. This works well for HSM and file migration, because no pages remain valid after those operations change the file layout. However, for PFL files the cached pages should remain valid after a layout change, because layout components are only appended to the existing layout and do not affect the existing components or their data. Application performance could suffer significantly if all pages were evicted from the client cache and then read back again for each new component added to the file.<br />
<br />
PFL uses layered generations to identify layout changes. One is the layout generation at the whole-file level, which changes whenever a layout component is appended. The other is the per-component generation, which remains unchanged when new components are appended.<br />
<br />
This will be used to facilitate client page cache management. When a client detects a layout change at the LOV layer, LOV further checks the generations on the layout components and flushes pages only for newly added or modified components, which is an empty operation for PFL because layout components are not altered once created.<br />
<br />
One tricky case worth mentioning: layout component generations are comparable only if the file's layout generation matches. Even if it is known that the file's layout generation increased due to a component addition, the two layouts are still conceptually unrelated. Therefore, clients must compare not only component generations but also OST indices and objects to decide whether components are unchanged, to avoid unnecessary page eviction.<br />
<br />
In order to accomplish this, a pair of range parameters will be added into cl_object_prune() to indicate that only a subset of pages are being evicted. A callback may be provided by LOV later to check if an individual page should be evicted due to layout change.<br />
<br />
===Dynamic Layout Request===<br />
In the PFL prototype, layout components must be defined and instantiated before I/O starts; otherwise, applications trying to access an uninstantiated or undefined component receive an ENODATA error. With dynamic layout requests supported, clients can instantiate layout components on demand. A layout intent RPC is used to request an instantiated layout component from the MDT.<br />
<br />
The RPC format of layout intent:<br />
static const struct req_msg_field *ldlm_intent_layout_client[] = {<br />
&RMF_PTLRPC_BODY,<br />
&RMF_DLM_REQ,<br />
&RMF_LDLM_INTENT,<br />
&RMF_LAYOUT_INTENT,<br />
&RMF_EADATA /* for new layout to be set up */<br />
};<br />
<br />
struct req_msg_field RMF_LAYOUT_INTENT =<br />
DEFINE_MSGF("layout_intent", 0,<br />
sizeof(struct layout_intent), lustre_swab_layout_intent,<br />
NULL);<br />
EXPORT_SYMBOL(RMF_LAYOUT_INTENT);<br />
<br />
/* enqueue layout lock with intent */<br />
struct layout_intent {<br />
__u32 li_opc; /* intent operation for enqueue, read, write etc */<br />
__u32 li_flags;<br />
__u64 li_start;<br />
__u64 li_end;<br />
};<br />
<br />
The layout intent RPC is essentially an LDLM intent RPC. In it, struct layout_intent carries the necessary layout information to the MDT: the range of the required component, and which operation the MDT is expected to execute. With the preset layout template, the MDT can create the corresponding component. The MDT should instantiate all components within the range [li_start, li_end). If the MDT successfully instantiates a component, it increments the file's layout generation by 1.<br />
<br />
===When to send layout intent in CLIO stack===<br />
In the CLIO stack, the LOV layer is the only place where layouts can be understood and interpreted. When an I/O is initiated, the LOV layer splits the file-level I/O range into sub-I/Os using the component information, and ensures that no sub-I/O crosses a component boundary. The lov_io_iter_init() function is where it checks whether a component is instantiated.<br />
<br />
If a component is not instantiated and the I/O is a write operation, LOV sets a flag in the cl_io data structure and returns to the llite layer with error -ENODATA. The llite layer then knows that new components are required to complete this I/O, so it invokes ll_layout_request() to send the layout intent RPC. If everything goes well, ll_layout_request() fetches and applies the new layout to the CLIO stack. Finally, the llite layer restarts the I/O, which can now move forward. Since components are instantiated only once, and are added only at the end of the file, layout changes for PFL files cannot affect in-flight I/O operations.<br />
<br />
So far, the operations that trigger a layout intent request are the VFS write() family of calls, truncate(), and page ->mkwrite(). In the future, other interfaces such as fallocate() would also need to handle uninstantiated components in a similar manner.<br />
<br />
==Server-side RPC Interface and IO code paths==<br />
The MDS should provide an interface that can handle client requests to populate a new layout and to modify an existing layout. A few different options can be considered for passing these requests through the various layers of the metadata stack, from the client down to the underlying storage on the MDT that holds the layout itself:<br />
<br />
#a new DT method or set of methods<br />
## will be used only by the Logical Object Device (LOD) on the MDS<br />
## unit testing (talking directly to LOD) would need additional infrastructure<br />
# use setxattr() with special xattr names<br />
## lustre.lov.add = { components to add to the composite layout }<br />
## lustre.lov.del = { delete the component ID in the value }<br />
## lustre.lov.set = { set component flags }<br />
## lustre.lov.clear = { clear component flags }<br />
## no extra infrastructure needed for unit testing<br />
## no new special methods<br />
# extend ->dbo_punch() method<br />
## a flag to populate range: assign new objects<br />
## a flag to depopulate: release the objects<br />
## probably not enough to change flags (out-of-sync stripe?)<br />
# totally crazy idea - layout as an index<br />
## this is what it is in essence<br />
## range/offset as a key<br />
## object+status as a value<br />
<br />
Several aspects of the xattr interface (option 2) are of interest, and led to selecting the xattr method for the PFL implementation.<br />
<br />
The first and foremost reason for selecting the xattr interface is that it allows modifying existing layout xattrs without significantly increasing the number of RPC types, without adding new dedicated server APIs that would only be used for composite files, and without introducing new userspace APIs that may have portability issues. This keeps the code easy to understand, keeps future complexity growth in check, and does not sacrifice flexibility.<br />
<br />
A secondary reason for selecting the xattr interface for managing layouts on the client is that with I/O forwarders such as IBM's CIOD it is difficult to pass ioctl() commands from the compute nodes where applications run to the I/O nodes where the Lustre client runs. Every ioctl() command needs a special handler, which quickly becomes a maintenance headache. The getxattr() and setxattr() interfaces, however, already exist in such environments and provide generic key=value methods that work with arbitrary key names and values between the llapi_layout(7) library commands and the Lustre client, over the network to the MDS, and from the MDS down to the underlying MDT storage. It is not expected that applications or users will interact directly with file components using getxattr() and setxattr(), but only via the llapi_layout interfaces.<br />
<br />
==Use Cases for getxattr() and setxattr() interfaces==<br />
In order to manipulate the file layout held in the lustre.lov xattr, getxattr() and setxattr() (or fgetxattr() and fsetxattr() for operations on already-open file descriptors) will manipulate virtual xattrs with names such as lustre.lov.add, lustre.lov.del, lustre.lov.set, and lustre.lov.clear. These can interface transparently from userspace on the client with the kernel on the client, or with the MDS as needed. Below we examine the use cases from the PFL 2 Solution Architecture to verify that these xattr interfaces meet the requirements set out in that document.<br />
<br />
===U01. User creates new file with fixed-size layout component===<br />
setxattr("lustre.lov", <binary composite file description>);<br />
The LOD on the MDS parses the layout, which should contain a component that has an extent start of zero, and adds the component as described and populates it with OST objects. The MDS will always allocate the first component's objects, to avoid the immediate overhead of another RPC and layout lock cancellation for the normal pattern of file create followed by file write.<br />
<br />
===U02. User adds component(s) with fixed-size extent(s) to an existing composite file===<br />
setxattr("lustre.lov.add", <binary component description>);<br />
The LOD on the MDS parses the layout, sanity-checks the components against the existing layout (S03, S04) and against each other, and adds the component(s) to the file, essentially the same as U01. The binary component description is largely self-describing, so it may contain one or more components to be added to the existing file layout, if any. The LOD will assign file-unique component IDs as necessary, which may differ from those the client generated while creating the layout in memory. If the client did not set the OBD_CONNECT_PFL_DYNAMIC flag at connection time (a PFL 2 client, see Client-MDS Protocol below), then it is not capable of dynamically requesting that layout components be instantiated, in which case the MDS will allocate objects for all components.<br />
<br />
===U03. User adds final component to existing composite file===<br />
setxattr("lustre.lov.add", <binary component description>);<br />
LOD parses the layout, adds the component as supplied, and populates it with OST objects (for PFL2 clients, only if the OBD_CONNECT_PFL_DYNAMIC flag was not set at connection time; see Client-MDS Protocol below). This is the same as U01 and U02, with the only difference being that the supplied component has an extent that ends at 2<sup>64</sup>-1 bytes.<br />
<br />
===U04. User requests the layout for an existing file===<br />
getxattr("lustre.lov");<br />
If the client is fetching the full layout xattr, then it can use the same getxattr interface as is used today by existing tools with no additional processing of the xattr or layout. This ensures that utilities such as tar(1) and others that already save and restore Lustre file layouts continue to work properly with composite files.<br />
getxattr("lustre.lov.extent.<start-end>");<br />
The LOD on the MDS and/or the LOV on the client parses the layout and returns the component(s) covering range [start,end). Since the components themselves are self-describing, containing the component_id, component_start, and component_end, they can be returned directly to the caller and handled directly.<br />
<br />
===U05. User gets layout parameters to existing component by ID===<br />
getxattr("lustre.lov.id.<component_id>");<br />
The LOD on the MDS and/or the LOV on the client parses the layout, finds component with id=<component id> and returns it to the caller.<br />
<br />
===U06. User accesses or modifies components in an existing file===<br />
setxattr("lustre.lov.set.<valid>", <binary component description>);<br />
LOD iterates over the components in the file, applies changes to the individual components passed in as the binary component value, and sets the fields in the component as specified by valid.<br />
<br />
===U07. User deletes composite file===<br />
The LOD iterates over the components on the MDS, destroying the individual OST objects using the existing RPC and recovery methods before deleting the MDT inode.<br />
<br />
===U08. User creates a new composite file describing multiple components===<br />
setxattr("lustre.lov", <binary composite file description>);<br />
This is essentially the same as U01 but the composite file description contains multiple components.<br />
<br />
===U09. User migrates composite file===<br />
fsetxattr(source_fd, "lustre.lov.swap.<component_id>", <volatile_fid>);<br />
The existing mechanism to swap whole file layouts should be usable without modification, as it simply copies the layout xattrs and does not look inside the layouts. This allows existing tools that use the llapi_[f]swap_layouts() interface, such as HSM copytools, to continue managing whole-file layout changes. If the user is migrating a single component of a composite file, the data copy step is largely the same, except that only data covered by the source extent [start,end) is copied into the target file before the layout is swapped from the source component to the temporary target file. The design of composite file layouts ensures that the logical file offsets of the source and target files are the same, so no special userspace support is needed for the data copy. The volatile_fid is accessible on the MDS while the migration tool keeps the volatile file open, though the MDS needs to verify that the file matching <volatile_fid> has been opened by the client for write, to ensure the client has write permission on the file, since it is not possible to pass two open file descriptors to the fsetxattr() syscall.<br />
<br />
===U10. User searches for composite files===<br />
getxattr("lustre.lov.header");<br />
This will return the composite file layout header, containing the layout type, number of components, etc. It reduces the amount of information going over the network and up to userspace, and lets the caller allocate a sufficiently large buffer to hold the full layout.<br />
<br />
===U11. User specifies default composite file template for directory (Phase 3)===<br />
setxattr("lustre.lov", <binary composite file template>);<br />
The setxattr() operation would be done on the parent directory, in a similar manner that it is done today, only with different xattr contents. The composite layout template is stored on the parent directory as is done today for plain layout templates, storing only struct lov_mds_md_v1 with the required fields set, and not storing any of the file stripes in lmm_objects. <br />
===U12. Administrator specifies filesystem-wide composite file template for root directory (Phase 3)===<br />
Same as U11, except the operation takes place on the root directory and affects all new files.<br />
===U13. User deletes uninstantiated stripe component from file by ID (Phase 3)===<br />
setxattr("lustre.lov.del", <component id>);<br />
LOD parses the layout, finds the component with ID=<component id> and, if it is uninstantiated (has no objects assigned), deletes it.<br />
<br />
===U14. User deletes uninstantiated stripe components from file by flags (Phase 3)===<br />
setxattr("lustre.lov.del", <component id>);<br />
To delete all uninstantiated components from a file, the enum lcme_id wildcard LCME_ID_UNINIT can be passed as the component ID.<br />
For more complex operations, it is more practical to fetch the entire layout to the client, iterate over the components in userspace, and perform the pattern matching there in an arbitrarily complex manner to determine which component IDs to remove or modify. While there is some extra overhead in deleting or modifying components individually, embedding a complex query-and-update interface in the MDS is impractical; otherwise there would be an explosion of matching criteria to support (e.g. components within certain extents, with specific flags set, with specific layout generations or stripe sizes, etc.).<br />
<br />
==Server-side Composite Layout Handling==<br />
===Composite File Layout Handling===<br />
The server needs to interpret and handle the virtual xattr values that are sent from the client. In order to avoid namespace collisions and potential abuse by users, the virtual xattr keywords such as .add, .del, .swap, etc. are only interpreted for specific Lustre xattrs such as lustre.lov, and potentially lustre.lmv in the future. The lustre.lov xattr is already handled specially on the client, since it cannot be set if the file already has a layout, so this will not add significant complexity on the client or server. Because xattrs are read and written as a single unit, any modification to the xattr needs to load the existing layout xattr into memory, modify it as requested by the client, and then store it back to the MDT inode object. This will be handled by the LOD layer of the MDS, since it is the software module that interprets the file layout, and also has access to the MDT OSD to load and store the xattr contents.<br />
<br />
The LOD must verify incoming lustre.lov layout xattrs, whether a whole composite layout is being sent or an incremental update is being made to a layout component. Until other composite layout features such as File-Level Replication and partial HSM restore are implemented, the layout checks done by the MDS will be PFL-specific. These include verifying that components have adjacent extents so that there are no holes in the layout (S03), that the layouts do not overlap (S03, S04) or, if they do, that they describe identical components (S09.1), and that they do not specify attribute flags controlled by the server. Server-managed fields such as lcme_offset and lcme_id will be ignored and overwritten by the server.<br />
<br />
The composite layout header contains a generation value, lcm_layout_gen, that is updated by the MDS whenever the composite layout changes. To ensure that component IDs within a file are unique, the lcme_id assigned to a newly added component will be the lcm_layout_gen of the composite file. An lcme_id of 0 is reserved, indicating that the ID is unassigned or that no specific component is requested; it will never be used by the MDS for any component in a file.<br />
<br />
The checks of the incoming layout and the update of the lustre.lov xattr on disk need to be serialized; this will be done with the layout lock (MDS_INODELOCK_LAYOUT) on the inode.<br />
<br />
===Layout Intent Lock Handling===<br />
For PFL, there are two kinds of request that can cause a layout change: one from the command line, appending or changing components manually; the other from the CLIO stack, once dynamic layout intents are supported. Both end up invoking setxattr() on the MD stack. The MDT must hold the MDS_INODELOCK_LAYOUT lock in LCK_EX mode when it calls setxattr() to make the actual layout changes. Holding the layout lock on the MDT inode object not only serializes updates between multiple threads on the MDS, it also revokes the layout lock from all clients that have been granted it. This revocation invalidates the cached file layout on those clients and causes them to refresh the layout on their next I/O operation.<br />
<br />
==Feature Compatibility and Interoperability==<br />
===File Layout Compatibility===<br />
The PFL composite layout is incompatible with the existing Lustre file layout, though the individual layout components will re-use the existing lov_mds_md_v1 and lov_mds_md_v3 RAID-0 layouts. Non-PFL clients will receive an EIO error when accessing a composite file. Accessing plain (non-composite) files in the same filesystem will continue to work for both PFL and non-PFL clients. It is not possible to translate a PFL file layout into a layout that older clients understand. Older servers will refuse to create a file with a PFL layout, due to the new magic value stored at the start of each layout.<br />
<br />
===MDT On-Disk Format===<br />
The PFL composite layout stored on disk will continue to use the trusted.lov xattr name and will be stored directly in the MDT inode, if space permits, to maximize performance. The existing maximum limits on xattr sizes will not be changed as part of this project. For both ZFS and ldiskfs backing filesystems the on-disk xattr size is not the limiting factor for determining the maximum stripe count of a file, but rather the RPC size limits.<br />
<br />
The MDS itself needs to understand the new struct lov_comp_md_v1 layout format described in [[Layout Enhancement High Level Design#2.1. Composite Layouts|Layout Enhancement HLD Composite Layouts]], in order to unlink the OST objects within that file when it is deleted, or change the ownership of a file's OST objects when the file ownership changes.<br />
<br />
The [https://wiki.hpdd.intel.com/display/opensfs/LFSCK2+High+Level+Design Lustre File System Check (LFSCK)] tool also needs to understand struct lov_comp_md_v1 in order to accurately determine the relationship between an MDT inode and all the OST objects where it stores its data. This can leverage the same composite file layout iteration that the MDS is using for file unlink, setattr, and other operations that affect all of the OST objects on a file.<br />
<br />
====MDT Default File Layout Templates====<br />
The file layout template is an uninstantiated file layout that is initially stored on a parent directory, or on the filesystem root directory, and provides the default layout for new files that do not otherwise have a specific layout assigned at file creation time. When a new file or directory is first created, it inherits the layout template from the parent directory in which it was created, or if the parent directory has no template then it is inherited from the filesystem root directory. Once assigned to the new file, the layout is stored with the MDT inode on disk and is instantiated as needed for that file.<br />
<br />
The layout template itself for a plain file is simply struct lov_mds_md_v1, or struct lov_mds_md_v3 if an OST pool is in use, without any of the OST objects allocated for it (i.e. the lmm_objects[] array is unused). The plan for composite file templates will be similar - a layout template for a 3-component file would consist of the composite header template struct lov_comp_md_v1 along with three separate pairs of component entries and uninstantiated sub-layout templates, namely struct lov_comp_md_entry_v1 and the accompanying struct lov_mds_md_v1 without any OST objects allocated.<br />
struct lu_extent {<br />
__u64 e_start;<br />
__u64 e_end;<br />
};<br />
<br />
struct lov_comp_md_entry_v1 {<br />
__u32 lcme_id; /* unique identifier of component */<br />
__u32 lcme_flags; /* LCME_FL_XXX */<br />
struct lu_extent lcme_extent; /* file extent for component */<br />
__u32 lcme_offset; /* offset of component blob in layout */<br />
__u32 lcme_size; /* size of component blob data */<br />
__u64 lcme_padding[2];<br />
};<br />
<br />
struct lov_comp_md_v1 {<br />
__u32 lcm_magic; /* LOV_MAGIC_COMP_V1 */<br />
__u32 lcm_size; /* overall size of layout including this structure */<br />
__u32 lcm_layout_gen;<br />
__u16 lcm_flags;<br />
__u16 lcm_entry_count;<br />
__u64 lcm_padding1;<br />
__u64 lcm_padding2;<br />
struct lov_comp_md_entry_v1 lcm_entries[0];<br />
};<br />
<br />
struct lov_ost_data_v1 { /* per-stripe data structure (little-endian)*/<br />
struct ost_id l_ost_oi; /* OST object ID */<br />
__u32 l_ost_gen; /* generation of this l_ost_idx */<br />
__u32 l_ost_idx; /* OST index in LOV (lov_tgt_desc->tgts) */<br />
};<br />
<br />
struct lov_mds_md_v1 { /* LOV EA mds/wire data (little-endian) */<br />
__u32 lmm_magic; /* magic number = LOV_MAGIC_V1 */<br />
__u32 lmm_pattern; /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */<br />
struct ost_id lmm_oi; /* LOV object ID */<br />
__u32 lmm_stripe_size; /* size of stripe in bytes */<br />
/* lmm_stripe_count used to be __u32 */<br />
__u16 lmm_stripe_count; /* num stripes in use for this object */<br />
__u16 lmm_layout_gen; /* layout generation number */<br />
struct lov_ost_data_v1 lmm_objects[0]; /* per-stripe data */<br />
};<br />
<br />
Unfortunately, the size of a 3-component layout template, even without any OST objects allocated, is larger than can fit into the currently 512-byte ldiskfs inodes' free space, as can be seen in the diagram below. There are approximately 180 bytes of free space in the 512-byte inode (depending on the length of the filename and whether there are multiple hard links to the file), but a 3-component template layout is 268 bytes in size. Even with aggressive reduction of the size of the lov_comp_md_v1, lov_comp_md_entry_v1, and lov_mds_md to remove all fields that are not strictly necessary, the 3-component template would still be too large to fit into the directory inode, let alone on an actual file using this template with at least one allocated OST object. If the xattr is too large for the in-inode space, for example a plain RAID-0 file with more than 5 stripes, then the layout xattr is stored in a separate data block. Storing the layout xattr outside the inode may incur significant performance penalties, due to an extra seek for every inode access in order to fetch the layout xattr into memory, so this is undesirable for normal usage.<br />
<br />
One option to avoid the overflow of the in-inode xattr space would be to store only a single-component layout on the file, which would fit within the available 180-byte space, and inherit the rest of the components from the parent directory as the file size grows to need these components. This would be desirable from the point of view of minimizing the overhead for small files, which can make up a large fraction of all files in HPC filesystems. However, this also adds complexity to the PFL code and usage, since the inode is not guaranteed to have the same parent directory, and hence may be subject to a different layout template, when the time comes to extend the file beyond the first component. This may lead to inconsistent or sub-optimal layout components if the file is renamed, or the default layout of a directory or the filesystem is modified, and the new directory layout template is incompatible with the existing component(s) on the file due to overlapping layout extents.<br />
<br />
Even if a single-component file layout could fit in the inode xattr space, the composite layout template still couldn't fit into the parent directory's inode. However, since there are normally far fewer directories than files, and directory leaf blocks are themselves likely to be allocated only one block at a time, the external xattr block would not be as high an overhead, and the one xattr read overhead would normally be amortized over the creation of many files within that directory that use the same layout template.<br />
<br />
Another option is to format the MDT with larger 1024-byte inodes by default, to ensure there is enough space for not only the composite layout or template, but also for other xattrs such as SELinux labels, ACLs, etc. This has the drawback that the MDT will need to be reformatted for 1024-byte inodes to maximize PFL performance, and each inode will take twice as much space on disk and in memory, which may also impact metadata performance. This can possibly be mitigated on existing 512-byte inodes that use SSD or NVM storage for the MDT to avoid the overhead of seeking to read the external xattr block for each cache-cold MDT inode access.<br />
<br />
Due to the implementation complexity and risk of inconsistent or sub-optimal file layouts being created by the incremental inheritance of layouts from the parent layout template, the PFL 2 project will implement whole-layout inheritance at file creation time.<br />
<br />
[[Image:pfl2_default_layout_template.png]]<br />
<br />
====MDS Layout Verification====<br />
In addition to PFL layout verification performed by userspace in the llapi_layout_* functions, the MDS should also do verification of the layout components to ensure that they are valid for the PFL feature. This includes the following checks:<br />
<br />
* verify the start of each component matches the end of the previous component (if any), to prevent overlapping or disjoint extents.<br />
* verify the layout stripe_size and the layout extent_end are properly aligned to prevent fractional pages or RPCs that span multiple components. This restriction may be relaxed over time, but for the initial implementation it will avoid complexity to ensure that full-stripe reads and writes are done within a single component.<br />
* verify object_maxbytes * stripe_count >= extent_end for each component except the last one, to ensure that file data can be written over the full range of the component. For ldiskfs OSTs the object_maxbytes is 16 TiB, so for a component with few stripes and a very large extent_end it is possible that the client would get -EFBIG while writing to the middle of the file. For ZFS OSTs the object_maxbytes value is 2<sup>63</sup>-1 bytes, so this is not an issue. This may be difficult to implement 100% consistently, since the MDS will not necessarily know which specific OSTs will be selected when setting an uninstantiated layout template, which would only be a concern if there are different OST types within the same filesystem. In this unlikely case, it would be easiest to select the minimum maxbytes limit at OST connect time.<br />
<br />
As other features are added that use composite layouts, such as File Level Replication, these restrictions can be relaxed.<br />
<br />
===Client-MDS Protocol===<br />
By using extensions to the xattr protocol to instantiate and modify composite layouts there are no RPC protocol changes needed between the client and MDS. The Phase 2 PFL client will send the new OBD_CONNECT_COMPOSITE connection flag to indicate that it understands the composite layout feature, and the MDS replies with the same feature flag set to inform the client that this feature is supported, otherwise the client would get an error when storing the composite layout on the MDS. The existing MDS_SETXATTR and MDS_GETXATTR RPC opcodes can be used to create, modify, and remove individual components of a file, as well as whole composite files. Since the connection flag exchange is done on every client and MDS restart, there should never be a case where the MDS does not recognize the incoming file layout magic or the enhanced RPC opcodes.<br />
<br />
The existing RPC size limit will not be changed as part of this project, allowing a single file to have maximum stripe count of 2000 OSTs for a plain RAID-0 file. Since the PFL composite file and component layout containers themselves take up space, the maximum number of OSTs that a single file can use depends on the exact layout being used. For a single-component file, the maximum stripe count will only be 2-3 stripes below the 2000-OST limit. For a file with many single-stripe components, the maximum number of components will be approximately 500.<br />
<br />
If the PFL Phase 2 Static Layout implementation is deployed separately from the proposed PFL Phase 3 Dynamic Layout, then some additional changes are needed in the RPC protocol between clients and the MDS. For clients using the PFL2 code that understands composite layouts but not dynamic layout initialization, as detected by the lack of OBD_CONNECT_PFL_DYNAMIC flag at connection time, any file creation requests will result in the MDS allocating all of the OST objects for a file with a layout template. The PFL Phase 3 clients can notify the MDS at connect time, by passing a new OBD_CONNECT_PFL_DYNAMIC feature flag, that they handle on-the-fly layout initialization of files, so it is safe to store only the layout template to disk.<br />
<br />
===Client-OSS Protocol===<br />
During normal IO operations between the client and OSS, the client sends information to the OSS about each object that is being accessed, to avoid the overhead of extra communication between the MDS and OSS for every object created and accessed in the filesystem. This information includes sending the MDT inode File Identifier (FID) to the OSS in order to indicate which file each OST object belongs to, as well as the stripe index of that object within the file. This information is stored on the OST object the first time the object is ever modified. The MDT inode FID passed from the client is sanity checked against the one stored on the OST object for later IO operations in order to avoid accidentally accessing or modifying OST objects due to software bugs, as well as by the [https://wiki.hpdd.intel.com/display/opensfs/LFSCK2+High+Level+Design Lustre File System Check] (LFSCK) tool to verify consistency between the file layout on the MDT and the objects on the OST(s) and to rebuild the MDT inode file layout if it becomes corrupted. By storing the component ID with each OST object, along with the stripe index and stripe size, the LFSCK tool can re-assemble the file layout for each MDT inode FID, even if the layout is lost or corrupted on the MDT.<br />
<br />
The RPC from the client to the OSS currently only passes a single integer for the object's stripe index, since this is all that was needed to uniquely identify the object in a RAID-0 file layout. In order to accommodate the presence of multiple component layouts within a single composite file, the RPC from the client needs to be modified slightly to send more information for the above purposes: the stripe size, the stripe count, and, if the target object belongs to a PFL component, the component ID and its extent range. If all of this information were sent from the client to the OSS via the current obdo structure, the spare obdo.o_padding_{4,5,6} fields alone would not be enough, so other fields must be reused to avoid enlarging the obdo structure. Currently the obdo.o_lcookie field is only used by the OSP for recording the async RPC llog cookie, and that cookie is only used locally, so obdo.o_lcookie is available for other on-wire purposes. Fortunately, it is large enough (32 bytes) to hold all of the above information for LFSCK. A new structure, ost_layout, will be defined for this purpose.<br />
struct llog_cookie {<br />
struct llog_logid lgc_lgl;<br />
__u32 lgc_subsys;<br />
__u32 lgc_index;<br />
__u32 lgc_padding;<br />
} __attribute__((packed));<br />
<br />
+struct ost_layout {<br />
+ __u64 ol_pfl_start;<br />
+ __u64 ol_pfl_end;<br />
+ __u32 ol_pfl_id;<br />
+ __u32 ol_stripe_size;<br />
+ __u32 ol_stripe_count;<br />
+ __u32 ol_padding_0;<br />
+};<br />
<br />
struct obdo {<br />
__u64 o_valid; /* hot fields in this obdo */<br />
struct ost_id o_oi;<br />
__u64 o_parent_seq;<br />
__u64 o_size; /* o_size-o_blocks == ost_lvb */<br />
__s64 o_mtime;<br />
__s64 o_atime;<br />
__s64 o_ctime;<br />
__u64 o_blocks; /* brw: cli sent cached bytes */<br />
__u64 o_grant;<br />
/* 32-bit fields start here: keep an even number of them via padding */<br />
__u32 o_blksize; /* optimal IO blocksize */<br />
__u32 o_mode; /* brw: cli sent cache remain */<br />
__u32 o_uid;<br />
__u32 o_gid;<br />
__u32 o_flags;<br />
__u32 o_nlink; /* brw: checksum */<br />
__u32 o_parent_oid;<br />
__u32 o_misc; /* brw: o_dropped */<br />
__u64 o_ioepoch; /* epoch in ost writes */<br />
__u32 o_stripe_idx; /* layout stripe idx */<br />
__u32 o_parent_ver;<br />
struct lustre_handle o_handle; /* brw: lock handle to prolong<br />
* locks */<br />
- struct llog_cookie o_lcookie; /* destroy: unlink cookie from<br />
- * MDS, obsolete in 2.8, reused<br />
- * in OSP */<br />
+ /* Originally this field was llog_cookie, used for destroy with the unlink<br />
+ * cookie from the MDS; it has been obsolete since 2.8. It is now reused by<br />
+ * the client to transfer layout and PFL information in IO and setattr RPCs.<br />
+ * Since llog_cookie is no longer used on the wire, it can be removed from<br />
+ * the obdo and then enlarged freely in the future without affecting the<br />
+ * related RPCs.<br />
+ *<br />
+ * Here, we have verified sizeof(ost_layout) == sizeof(llog_cookie). */<br />
+ union {<br />
+ /* struct llog_cookie o_lcookie; */<br />
+ struct ost_layout o_layout;<br />
+ };<br />
__u32 o_uid_h;<br />
__u32 o_gid_h;<br />
__u64 o_data_version; /* getattr: sum of iversion for<br />
* each stripe.<br />
* brw: grant space consumed on<br />
* the client for the write */<br />
__u64 o_padding_4;<br />
__u64 o_padding_5;<br />
__u64 o_padding_6;<br />
};<br />
<br />
Implementing the support for LFSCK and the OSTs to handle composite files belongs to PFL Phase 3a, which is beyond the scope of the PFL Phase 2 implementation.<br />
<br />
===OST On-Disk Format===<br />
As discussed in the Client-OSS Protocol section, the OST stores a fragment of the MDT layout with each object in order to do sanity checks on incoming client RPCs and recovery in case of MDT corruption. The OST needs to be able to store an additional 28 bytes of data with struct filter_fid to store additional information for the composite layout so that the OST object knows its place within the component and within the composite file:<br />
struct filter_fid {<br />
struct lu_fid ff_parent; /* ff_parent.f_ver == file stripe number */<br />
+ __u32 ff_stripe_size;<br />
+ __u32 ff_stripe_count;<br />
+ __u64 ff_pfl_start;<br />
+ __u64 ff_pfl_end;<br />
+ __u32 ff_pfl_id;<br />
};<br />
<br />
The osd-ldiskfs on-disk inode, together with the Lustre-specific xattrs ("lma" and "fid"), has very nearly exhausted the free space in the OST's 256-byte inode, so there is not enough room to store these 28 additional bytes in the "fid" xattr. As explained above, storing the "fid" xattr in a separate block would cause a serious performance penalty, so another solution must be considered. One possibility is to merge the "fid" data into the "lma" EA body (value), saving the space occupied by the separate "fid" xattr entry (20 bytes). This is somewhat of a hack, but it is hidden entirely inside osd-ldiskfs. From the upper layers' point of view, the "fid" xattr is still independent; they do not know, and should not care, how the "fid" xattr is stored on disk.<br />
<br />
[[Image:pfl2_inode_size.png]]<br />
<br />
For the osd-zfs on-disk dnode (inode), the added information will be stored in the System Attributes (SAs), which currently do not fit into the dnode proper, so a separate spill block must already be allocated to hold the SAs for the dnode. Once the large dnode patch lands in the ZFS-on-Linux repository, it will allow the SAs to always be stored within the dnode for maximum performance.<br />
<br />
===MDS-OSS Protocol===<br />
<br />
The MDS-OSS protocol is largely unaffected by composite layouts, since the OSTs themselves never use the file layout directly. The LFSCK utility does fetch the struct filter_fid xattr from the OST in order to verify its consistency against the locally stored file layout. The actual network protocol remains unchanged, besides the extra fields added to this structure. The LFSCK utility will need to verify the ff_stripe_size and ff_pfl_id fields against their respective values in the file layout to verify that the object is part of the correct component.<br />
<br />
===Known Issues===<br />
<br />
* Append write. An append write has to instantiate all components in order to fulfill the POSIX semantics.<br />
* Group lock. The current group lock semantics would be difficult to preserve with composite layouts; work is in progress on a solution.<br />
<br />
More known issues are tracked at https://jira.hpdd.intel.com/browse/LU-9349<br />
<br />
[[Category:Architecture]]<br />
[[Category:Design]]<br />
[[Category:PFL]]</div>
<hr />
<div>==Introduction==<br />
<br />
The Progressive File Layout Phase 2 (PFL2) High Level Design describes details of how the PFL feature may be implemented, including the user interfaces for both the command line and Lustre-aware applications, how PFL files will interact with the client-IO (CLIO) layer in the Lustre kernel VFS driver, the RPC formats between the client and servers, and the interface to the underlying storage. This document further builds upon the reference documents below.<br />
<br />
This design is intended to be comprehensive for both the current PFL Phase 2 implementation and a future PFL Phase 3 implementation, so some use cases describe functionality that will not be implemented as part of PFL Phase 2, but are included here so that the overall PFL design is complete, and to ensure that functionality implemented in PFL Phase 2 takes the longer-term implementation goals into account and does not need to be reworked once the PFL Phase 3 implementation is started. Design aspects that are not intended to be implemented in PFL Phase 2 are marked as such in this document or the [[PFL2 Scope Statement]].<br />
<br />
===References===<br />
[[Layout Enhancement High Level Design]]<br />
<br />
[[Progressive File Layouts]]<br />
<br />
[[PFL Prototype Scope Statement]]<br />
<br />
[[PFL Prototype Solution Architecture]]<br />
<br />
[[PFL Prototype High Level Design]]<br />
<br />
[[PFL2 Scope Statement]]<br />
<br />
[[PFL2 Solution Architecture]]<br />
<br />
==Design Overview==<br />
There are three main components to the PFL design:<br />
<br />
* the user-space interfaces for Lustre-specific command-line tools and user library application programming interfaces (APIs)<br />
* changes to the client IO (CLIO) layer IO engine and RPC interfaces to the MDS for accessing and manipulating composite file layouts<br />
* changes to the MDS server to create, modify, and delete composite files<br />
<br />
The design is structured in a top-down manner, starting with the command-line interfaces that users are going to interact with the most, then the user library APIs, the client-side kernel changes for reading, writing, and accessing PFL files, RPCs for creating and modifying composite files, and finally server-side changes. There is also a discussion of protocol and disk format compatibility issues.<br />
<br />
==Client Side Design==<br />
==User Space Interfaces==<br />
===lfs Command-line Interface===<br />
The lfs(1) command-line interface will be extended to understand and manipulate PFL files and their component layouts. lfs is the primary interface for end users to create new files with a specific layout, show the layout of existing files, as well as setting default layout templates on directories that will be inherited by all new files and subdirectories created therein.<br />
<br />
The [[pfl2-lfs-setstripe.1|lfs setstripe(1)]], lfs migrate(1), lfs getstripe(1), and lfs find(1) sub-commands will be extended to set and display the composite layout of a file, and to search for files with specific composite layout parameters or for components that match specific parameters. The added command-line arguments, along with descriptions and examples for each of these commands, are given on a dedicated man page for each sub-command (linked from the command name), so only the synopsis and a brief description of each command is shown here.<br />
<br />
====lfs getstripe====<br />
The lfs getstripe command prints some or all of the parameters of a file's layout. This is intended for regular users and administrators to query a particular file's layout, or the individual components of a composite file to examine the layout used to create the file.<br />
<br />
lfs getstripe [--stripe-count|-c ] [--directory|-d] [--stripe-index|-i]<br />
[--layout|-L] [--mdt-index|-m] [--ost|-O <uuid>] [--pool|-p]<br />
[--recursive|-r] [--raw|-R] [--stripe-size|-S] '''[--component-start [start]]'''<br />
'''[--component-end|-E [end]] [--component-flag|-F [flag]] [--component-id|-I [id]]'''<br />
'''[--component-count]''' [--quiet|-q] [--verbose|-v] {dirname|filename} ...<br />
<br />
Without any of the option flags, this will display all the layout components, as shown below. To limit the display to specific values of the layout, the options are largely the same as the current lfs getstripe, with new parameters for extracting attributes of composite files, such as the start and end extent of the last instantiated component, the unique component identifier, and the component attribute flags. By default, when requesting specific values of the layout, this will print the parameters of the last instantiated component of the layout, since this is the one that affects the current IO behaviour, and if a single parameter needs to be selected that best represents the file it should come from the layout that the file needed at its current size. It is also possible to select a specific file component by its offset within the file or attribute flags to print specific values from the specified component of the layout. If multiple component options are specified, such as --component-end=64M and --component-flag=uninit, then lfs getstripe will return the attributes of the component that matches all specified options. If no component matches all specified component options, then nothing will be printed.<br />
<br />
Since the output format needs to be changed for composite files, the output is YAML formatted for ease of parsing while remaining human readable. This is still reasonably similar to the original output format, with the exception of the OST object ID information, which is now more structured for ease of use.<br />
# An output example of a file with 3 components<br />
$ lfs getstripe -v /mnt/lustre/file<br />
"/mnt/lustre/file":<br />
fid: "[0x200000400:0x2c3:0x0]"<br />
composite_header:<br />
composite_magic: 0x0BDC0BD0<br />
composite_size: 536<br />
composite_gen: 4<br />
composite_flags: 0<br />
component_count: 3<br />
components:<br />
- component_id: 1<br />
component_flags: 0<br />
component_start: 0<br />
component_end: 2097152<br />
component_offset: 152<br />
component_size: 56<br />
sub_layout:<br />
lmm_magic: 0x0BD10BD0<br />
lmm_pattern: 1<br />
lmm_stripe_size: 1048576<br />
lmm_stripe_count: 1<br />
lmm_stripe_index: 7<br />
lmm_pool: flash<br />
lmm_layout_gen: 0<br />
lmm_obj:<br />
- 0: { lmm_ost: 7, lmm_fid: "[0x100070000:0x2:0x0]" }<br />
- component_id: 2<br />
component_flags: 0<br />
component_start: 2097152<br />
component_end: 16777216<br />
component_offset: 208<br />
component_size: 128<br />
sub_layout:<br />
lmm_magic: 0x0BD10BD0<br />
lmm_pattern: 1<br />
lmm_stripe_size: 1048576<br />
lmm_stripe_count: 4<br />
lmm_stripe_index: 0<br />
lmm_layout_gen: 0<br />
lmm_obj:<br />
- 0: { lmm_ost: 0, lmm_fid: "[0x100000000:0x2:0x0]" }<br />
- 1: { lmm_ost: 1, lmm_fid: "[0x100010000:0x3:0x0]" }<br />
- 2: { lmm_ost: 2, lmm_fid: "[0x100020000:0x4:0x0]" }<br />
- 3: { lmm_ost: 3, lmm_fid: "[0x100030000:0x4:0x0]" }<br />
- component_id: 4<br />
component_flags: 0<br />
component_start: 16777216<br />
component_end: 18446744073709551615<br />
component_offset: 336<br />
component_size: 176<br />
sub_layout:<br />
lmm_magic: 0x0BD10BD0<br />
lmm_pattern: 1<br />
lmm_stripe_size: 4194304<br />
lmm_stripe_count: 6<br />
lmm_stripe_index: 5<br />
lmm_pool: archive<br />
lmm_layout_gen: 0<br />
lmm_obj:<br />
- 0: { lmm_ost: 5, lmm_fid: "[0x100050000:0x2:0x0]" }<br />
- 1: { lmm_ost: 6, lmm_fid: "[0x100060000:0x2:0x0]" }<br />
- 2: { lmm_ost: 7, lmm_fid: "[0x100070000:0x3:0x0]" }<br />
- 3: { lmm_ost: 0, lmm_fid: "[0x100000000:0x3:0x0]" }<br />
- 4: { lmm_ost: 1, lmm_fid: "[0x100010000:0x4:0x0]" }<br />
- 5: { lmm_ost: 2, lmm_fid: "[0x100020000:0x5:0x0]" }<br />
<br />
====lfs setstripe====<br />
The lfs setstripe command creates a new file with the specified layout parameters, or sets the specified layout parameters as the default layout template on a parent directory.<br />
<br />
lfs setstripe {--component-end|-E end1} [component1_OPTIONS] [{--component-end|-E end2} [component2_OPTIONS] ...] {directory|filename}<br />
lfs setstripe --component-del [--component-id|-I comp_id] [--component-flags|-F flags] filename<br />
lfs setstripe --component-set [--component-id|-I comp_id] {--component-flags|-F flags} filename<br />
<br />
Since this is the primary command-line interface for users creating new files with Lustre-specific layouts, there are a significant number of existing options that can be used. Adding composite-file specific options to lfs setstripe allows the same code to create both files with plain layouts and composite layouts, without duplicating a large number of options. The command-line arguments of lfs setstripe are described in detail in the lfs-setstripe(1) man page. The significant changes to these arguments are the addition of the --component-end argument for specifying which component is being modified during file creation, and allowing multiple components to be specified on the same command line so that the file does not need to be created piecemeal.<br />
<br />
An example from the man page illustrates the flexibility of the file creation interface:<br />
$ lfs setstripe -E 4M -c 1 --pool flash -E 64M -c 4 -S 4M -E -1 -c -1 -S 16M --pool archive /mnt/lustre/file1<br />
<br />
This creates a file with composite layout in a single operation, rather than building it up one component at a time. The first component has a single stripe that covers [0, 4MiB) and is allocated from an OST in the flash pool. The second component has four stripes that cover [4MiB, 64MiB) and has a stripe size of 4MiB. The last component covers [64MiB, EOF), has a stripe size of 16MiB, and uses all available OSTs in the archive pool.<br />
<br />
Note that the '''setstripe options''' on the command line are inheritable: options specified for a previous component are reused by the following components unless they are overridden. For example, if the ''-c'' option appears on the command line as follows:<br />
$ lfs setstripe -E 4M -c 1 -E 8M -E 32M -c 4 -E eof<br />
<br />
This creates the components [0, 4M) and [4M, 8M) with 1 stripe each, and [8M, 32M) and [32M, EOF) with 4 stripes each. This inheritance applies to all '''setstripe options'''.<br />
<br />
Work is in progress on an explicit ''--parent'' option that resets the previous '''setstripe options''', so that the system default stripe options are used thereafter.<br />
<br />
====lfs migrate====<br />
The lfs migrate command moves a file's data from one (set of) OST(s) to another (set of) OST(s). This is done by copying the file data from the existing source file layout to a new target file layout as specified by the user. Most of the options to lfs migrate are the same as those of lfs setstripe, since lfs migrate is also creating a new file layout for the file data to be copied.<br />
<br />
lfs migrate [--component-id|-I comp_id] [OPTIONS] filename<br />
<br />
With the addition of composite files in this project, it needs to be possible to migrate a composite file, or a sub-component of that file, to new OST object(s) using the specified parameters. If a component ID is specified, then only that component should be migrated, and the new component should use the same start and end offsets as the source component so that the source component can be replaced without violating the PFL layout rules.<br />
<br />
====lfs find====<br />
The lfs find command is a Lustre-optimized and enhanced version of the find(1) command. It adds several extended options for matching Lustre-specific parameters of the file layout. It also optimizes file access by avoiding fetching OST object attributes for each file checked, when the decision can be made based only on the information initially retrieved from the MDT inode.<br />
<br />
lfs find {directory|filename} ... [[!] --atime|-A [-+]days] [[!] --mtime|-M [-+]days]<br />
[[!] --ctime|-C [+-]days] [--maxdepth|-D depth] [[!] --mdt|-m {mdt_uuid|mdt_index,...}]<br />
[--name|-n pattern] [[!] --ost|-O {ost_uuid|ost_index,...}] [--print|-p] [--print0|-P]<br />
[[!] --size|-s [-+]bytes[kMGTPE]] [[!] --stripe-count|-c [+-]stripes]<br />
[[!] --stripe-index|-i {ost_index,...}] [[!] --stripe-size|-S [+-]bytes[kMG]]<br />
[[!] --layout|-L {raid0,released,composite}] [--type |-t {bcdflps}]<br />
[[!] --gid|-g|--group|-G {group_name|gid}] [[!] --uid|-u|--user|-U {user_name|uid}]<br />
[[!] --pool pool] [--component-start start] [--component-end|-E end]<br />
[[!]--component-count [+-]count] [--component-flags|-F flags]<br />
<br />
The existing command is enhanced with the --component-start, --component-end, --component-count and --component-flags commands to allow limiting the search criterion to specific extents or components of the file.<br />
<br />
===llapi_layout_comp_* Library API===<br />
<br />
The llapi_layout_* interfaces provide an interface for userspace applications, including lfs, to specify plain and composite file layouts in an abstract manner, and then convert those abstract layouts into actual file layouts depending on the final attributes of the layout. The main data structure for llapi_layout_* functions is struct llapi_layout, which is opaque to userspace, but internally stores all of the attributes of a single plain layout or a single component's sub-layout. For composite file layouts, the API will be extended to handle layouts with multiple components and other composite file specific attributes, for use in Lustre-specific tools such as lfs setstripe, lfs getstripe, and lfs find, as well as by end-user applications or libraries that want to create files with specific composite layouts to optimize file IO patterns, such as HDF5.<br />
<br />
A composite layout can be composed of several layout components, and each component's sub-layout will be described by the opaque data in struct llapi_layout; therefore, a few more fields should be added to the structure:<br />
struct llapi_layout {<br />
uint32_t llot_magic;<br />
uint64_t llot_pattern;<br />
uint64_t llot_stripe_size;<br />
uint64_t llot_stripe_count;<br />
uint64_t llot_stripe_offset;<br />
/** Indicates if llot_objects array has been initialized. */<br />
bool llot_objects_are_valid;<br />
/* Add 1 so user always gets back a null terminated string. */<br />
char llot_pool_name[LOV_MAXPOOLNAME + 1];<br />
/* fields for composite layouts */<br />
+ struct lu_extent llot_extent; /* [start, end) of component */<br />
+ uint32_t llot_id; /* unique ID of component */<br />
+ uint32_t llot_flags; /* LCME_FL_* flags */<br />
+ struct list_head llot_list; /* linked list of llapi_layout components */<br />
struct lov_user_ost_data_v1 llot_objects[0];<br />
};<br />
<br />
* '''llot_extent''': The file extent covered by the current component; initially assigned by the caller when defining a layout component.<br />
* '''llot_id''': The numeric ID of the current component; this may be assigned internally by the llapi_layout_*() interfaces for identification purposes, but the final component ID assignment is the responsibility of the MDS.<br />
* '''llot_flags''': The LCME_FL_* attribute flags of the current component.<br />
* '''llot_list''': Links all the components of the same composite layout.<br />
<br />
A new pair of interfaces will be introduced to set and get the extent of a layout component. The llapi_layout_comp_extent_get(3) function will fetch the start and end offsets of the current layout component, and llapi_layout_comp_extent_set(3) will set the extent of a component currently being constructed, within acceptable parameters for that component.<br />
int llapi_layout_comp_extent_set(struct llapi_layout *layout, uint64_t start, uint64_t end);<br />
int llapi_layout_comp_extent_get(const struct llapi_layout *layout, uint64_t *start, uint64_t *end);<br />
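As an illustration of the "acceptable parameters" check above, here is a minimal sketch of the extent validation llapi_layout_comp_extent_set() might perform; the helper name and the exact rules are assumptions for illustration, not the Lustre implementation:<br />

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical sanity check: a component extent must be non-empty, and
 * UINT64_MAX may stand in for "EOF" as the end offset.  Returns 0 on
 * success or -EINVAL, following the llapi error convention. */
static int comp_extent_check(uint64_t start, uint64_t end)
{
	if (start >= end)
		return -EINVAL;	/* empty or inverted extent */
	return 0;
}
```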
<br />
A new set of interfaces will be introduced to get, set, and clear the attribute flags of a layout component. The llapi_layout_comp_flags_get(3) function gets the attribute flags of the current component. The llapi_layout_comp_flags_set(3) function sets the specified flags of the current component, leaving other flags as-is, while llapi_layout_comp_flags_clear(3) clears the flags specified in the flags word, leaving other flags as-is.<br />
int llapi_layout_comp_flags_get(const struct llapi_layout *layout, uint32_t *flags);<br />
int llapi_layout_comp_flags_set(struct llapi_layout *layout, uint32_t flags);<br />
int llapi_layout_comp_flags_clear(struct llapi_layout *layout, uint32_t flags);<br />
<br />
The new llapi_layout_comp_id_get(3) interface fetches the file-unique component ID of the current layout component. <br />
int llapi_layout_comp_id_get(const struct llapi_layout *layout, uint32_t *id);<br />
<br />
A new pair of interfaces will be introduced to add/delete a component to/from a composite layout. The llapi_layout_comp_add(3) function adds the passed layout component comp to the existing composite file layout layout, to allow building compound composite layouts in one pass. The llapi_layout_comp_del(3) function deletes the specified layout component comp from the composite layout layout.<br />
int llapi_layout_comp_add(struct llapi_layout *layout, struct llapi_layout *comp);<br />
int llapi_layout_comp_del(struct llapi_layout *layout, struct llapi_layout *comp);<br />
<br />
A new interface, llapi_layout_comp_get_by_id(3), will be introduced to fetch a component by ID if the user or application already knows the ID:<br />
struct llapi_layout *llapi_layout_comp_get_by_id(const struct llapi_layout *layout, uint32_t id);<br />
<br />
A new interface, llapi_layout_comp_next(3), will be introduced to iterate over all components of a composite layout, by selecting each component in turn internally and then allowing the various llapi_layout_comp_*() operations on that component's layout:<br />
struct llapi_layout *llapi_layout_comp_next(const struct llapi_layout *layout);<br />
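The circular iteration implied by llapi_layout_comp_next(3) (returning to the head component after the last, as in Use Case 3 below) can be sketched with a plain C ring list; the struct and helper names here are illustrative stand-ins for llot_list, not the real implementation:<br />

```c
/* Toy model of components linked in a ring via llot_list. */
struct sketch_comp {
	unsigned int id;
	struct sketch_comp *next;	/* stands in for llot_list */
};

/* Link an array of components into a ring, as comp_add() conceptually does. */
static void sketch_link_ring(struct sketch_comp *comps, int n)
{
	for (int i = 0; i < n; i++)
		comps[i].next = &comps[(i + 1) % n];
}

/* Count components by walking until we return to the head, mirroring the
 * do/while pattern of the iterator. */
static int sketch_count(struct sketch_comp *head)
{
	struct sketch_comp *c = head;
	int n = 0;

	do {
		n++;
		c = c->next;
	} while (c != head);
	return n;
}

/* Build a three-component ring and count it. */
static int sketch_demo(void)
{
	struct sketch_comp comps[3] = { { .id = 1 }, { .id = 2 }, { .id = 3 } };

	sketch_link_ring(comps, 3);
	return sketch_count(&comps[0]);
}
```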
<br />
The existing llapi_layout_to_lum() and llapi_layout_from_lum() interfaces should be extended to handle composite layouts; the new user metadata format for composite layouts is defined in [[Layout Enhancement High Level Design]].<br />
/* data structure representing each layout component, defined in "Layout Enhancement HLD" */<br />
struct lov_comp_md_entry_v1 {<br />
__u32 lcme_id; /* unique identifier of component */<br />
__u32 lcme_flags; /* LCME_FL_XXX */<br />
struct lu_extent lcme_extent; /* file extent for component */<br />
__u32 lcme_offset; /* offset of component blob in layout */<br />
__u32 lcme_size; /* size of component blob data */<br />
__u64 lcme_padding[2];<br />
};<br />
<br />
/* On-disk/wire structure of the composite layout, defined in "Layout Enhancement HLD" */<br />
struct lov_comp_md_v1 {<br />
__u32 lcm_magic; /* LOV_MAGIC_COMP_V1 */<br />
__u32 lcm_size; /* overall size of layout including this structure */<br />
__u32 lcm_layout_gen;<br />
__u16 lcm_flags;<br />
__u16 lcm_entry_count;<br />
__u64 lcm_padding1;<br />
__u64 lcm_padding2;<br />
struct lov_comp_md_entry_v1 lcm_entries[0];<br />
};<br />
<br />
#define lov_user_comp_md lov_comp_md_v1<br />
<br />
A new interface, llapi_layout_file_comp_add(3), will be introduced to add layout components to an existing file. It converts the passed-in layout into a lov_user_comp_md, then issues setxattr() with the special xattr name defined in "Changes on MDS":<br />
int llapi_layout_file_comp_add(const char *path, const struct llapi_layout *layout);<br />
<br />
A new interface, llapi_layout_file_comp_del(3), will be introduced to delete component(s) by the specified component ID (also accepting LCME_ID_* wildcards) from an existing file:<br />
int llapi_layout_file_comp_del(const char *path, uint32_t id);<br />
<br />
A new interface, llapi_layout_file_comp_set(3), will be introduced to change flags or other parameters of the component(s), selected by component ID, of an existing file. The component to be modified is specified by the comp->lcme_id value, which may be either a specific component number or an LCME_ID_* wildcard value. The new attributes are passed in by comp, and valid is used to specify which attributes in the component are going to be changed. This allows the interface to be extended to set other attributes in the future.<br />
int llapi_layout_file_comp_set(const char *path, const struct llapi_layout *comp, uint32_t valid);<br />
<br />
===User Space API Use Cases===<br />
<br />
Several uses of the llapi_layout_* interfaces are shown below as examples to illustrate how this new API can be used by user tools.<br />
<br />
Use case 1: Create a file with full layout components<br />
/* Allocate opaque layout structure for the first component */<br />
layout1 = llapi_layout_alloc();<br />
<br />
/* Set [0, 2M) extent to the first component */<br />
rc = llapi_layout_comp_extent_set(layout1, 0, 2M);<br />
<br />
/* Set stripe size of the first component */<br />
rc = llapi_layout_stripe_size_set(layout1, 1M);<br />
<br />
/* Set stripe count of the first component */<br />
rc = llapi_layout_stripe_count_set(layout1, 1);<br />
<br />
/* Allocate opaque layout structure for the second component */<br />
layout2 = llapi_layout_alloc();<br />
<br />
/* Add layout2 into layout1, and layout2 will inherit the stripe size of layout1 */<br />
rc = llapi_layout_comp_add(layout1, layout2);<br />
<br />
/* Set [2M, 256M) extent to the second component */<br />
rc = llapi_layout_comp_extent_set(layout2, 2M, 256M);<br />
<br />
/* Set stripe count of the second component */<br />
rc = llapi_layout_stripe_count_set(layout2, 4);<br />
<br />
/* Repeat above steps to create a layout3 with [256M, EOF) */<br />
...<br />
<br />
/* Create file with the composite layout */<br />
rc = llapi_layout_file_create(path, open_flags, open_mode, layout1);<br />
<br />
Use case 2: Create a file with initial component, and add components later<br />
/* Allocate opaque layout structure for the first component */<br />
layout1 = llapi_layout_alloc();<br />
<br />
/* Set [0, 2M) extent to the first component */<br />
rc = llapi_layout_comp_extent_set(layout1, 0, 2M);<br />
<br />
/* Set stripe size of the first component */<br />
rc = llapi_layout_stripe_size_set(layout1, 1M);<br />
<br />
/* Set stripe count of the first component */<br />
rc = llapi_layout_stripe_count_set(layout1, 1);<br />
<br />
/* Create file with the specified initial component */<br />
rc = llapi_layout_file_create(path, open_flags, open_mode, layout1);<br />
<br />
/* Allocate opaque layout structure for the second component */<br />
layout2 = llapi_layout_alloc();<br />
<br />
/* Set [2M, 256M) extent to the second component */<br />
rc = llapi_layout_comp_extent_set(layout2, 2M, 256M);<br />
<br />
/* Set stripe count of the second component */<br />
rc = llapi_layout_stripe_count_set(layout2, 4);<br />
<br />
/* Add the component layout2 into the file */ <br />
rc = llapi_layout_file_comp_add(path, layout2);<br />
<br />
Use case 3: Traverse all components of a composite layout. This is useful for a tool like lfs getstripe to be able to iterate over all components of a file without knowing in advance how many components each file has. During processing, the process using the iterator can decide which components are of interest and print them, or use the component's file-unique component ID to print, modify, or delete each component in turn.<br />
/* Get composite layout from existing file */<br />
layout = llapi_layout_get_by_path(path, flags);<br />
<br />
/* Traverse the layout */<br />
comp = layout;<br />
do {<br />
/* Get & print stripe count */<br />
rc = llapi_layout_stripe_count_get(comp, &count);<br />
printf("stripe_count: %llu\n", count);<br />
<br />
/* Get & print stripe size */<br />
rc = llapi_layout_stripe_size_get(comp, &size);<br />
printf("stripe_size: %llu\n", size);<br />
<br />
/* Get & print layout pattern */<br />
rc = llapi_layout_pattern_get(comp, &pattern);<br />
printf("stripe_pattern: %llx\n", pattern);<br />
<br />
comp = llapi_layout_comp_next(comp);<br />
} while (comp != layout);<br />
<br />
===Client-IO Interface===<br />
This design is based on the PFL Prototype High Level Design, which demonstrated the feasibility of the PFL concept. Here we add further detail and address a few problems discovered during the prototype phase, so that PFL can be better used in production.<br />
<br />
The PFL prototype design addressed the problems of object mapping, in-memory cl_object setup from layout components, and a framework to support fundamental I/O operations. Operations like read, write, setattr, and glimpse work properly. Two additional problems have to be solved for better use of PFL: the first is creating layout components on demand; the second is a performance optimization. This is part of the design of the full PFL implementation, but the work in this part will not be implemented in the PFL Phase 2 project.<br />
<br />
For the PFL Phase 2 implementation, the Client IO layer will be able to interpret existing composite file layouts, but will return an -ENODATA error to the application until the Phase 3 handling of ll_layout_intent() to dynamically fetch and instantiate layout components from the MDS is implemented.<br />
<br />
===Create Layout Component on Demand===<br />
The prototype phase doesn't have the functionality to create layout components on demand. If an application writes to a file extent without a layout component defined, the client simply returns an error to the application. This implies that the user has to know the layout components in advance, and has to understand the application's access pattern well enough that each layout component can be created before it is written.<br />
<br />
To support creating layout components on demand, the administrator can associate a PFL file with a layout template, which describes the number of stripes to be created for each range of file extents, along with other parameters such as stripe size, OST pool, etc. If the corresponding file extent is written without a layout component defined, the client will send a dedicated RPC, named the layout intent RPC ([[PFL2_Solution_Architecture#A13._Application_writes_within_a_uninitialized_file_component_.28Phase_3.29|A13 in PFL Phase 2 Solution Architecture]]), to the MDT. The MDT can then use the information within the layout template to allocate corresponding OST objects to form a layout component, and the layout component will be appended to the file's layout. After this is done, the client should be able to fetch the new layout and proceed with the I/O. No error should be seen by applications under normal circumstances, though it is possible to see errors at this point due to environmental factors such as -ENOSPC, -EIO, -ENOMEM or others that may occur when the MDS creates new OST objects and modifies the file layout on disk.<br />
<br />
The following diagram describes the process of creating layout component on demand:<br />
[[Image:pfl2_write_flow_chart.png||Flow chart for PFL write]]<br />
<br />
In the above diagram, 'File Update' can be any operation that modifies file contents, such as write, mkwrite, and truncate. Read-only operations won't necessarily trigger layout component allocation. Reading file extents with undefined layout components will simply return zero-filled buffers.<br />
<br />
As shown in the diagram, the LOV layer, the only module in CLIO that can interpret layouts, will check whether the intended write region has a layout defined. If not, it will abort and invoke ll_layout_intent(), which sends a layout intent RPC to the MDT. As mentioned in the PFL prototype phase design, CLIO splits I/O at layout component boundaries; therefore, if only part of the I/O region has no layout defined, it will first finish the I/O in the regions with a layout defined, and then request the new layout.<br />
<br />
In cl_io data structure, a new bit is going to be defined to mark that the I/O failure was due to missing layout component:<br />
struct cl_io {<br />
...<br />
/* true if this io failed due to missing layout */<br />
unsigned int ci_no_layout:1;<br />
...<br />
}<br />
<br />
If LOV detects that the update I/O can't be finished due to a missing layout, it will set ci_no_layout and abort the I/O with error code -ENODATA. In vvp_io_fini(), it should check ci_no_layout and then compose a layout intent RPC and send it to the MDT. The layout intent RPC is an LDLM enqueue RPC with an intent operation; the payload of the intent operation is as follows:<br />
struct layout_intent {<br />
__u32 li_opc; /* intent operation for enqueue, read, write etc */<br />
__u32 li_flags;<br />
__u64 li_start;<br />
__u64 li_end;<br />
};<br />
<br />
In order to request a new component, li_opc will be set to LAYOUT_INTENT_WRITE, and li_start and li_end will be set to the full range of the intended I/O. The MDT may decide to create multiple layout components within this range at its discretion. Later, CLIO will fetch the new layout and continue the I/O from where it stopped.<br />
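A minimal sketch of the MDT-side decision (an invented helper, not Lustre code): given a layout template expressed as component end offsets, count how many components intersect the requested [li_start, li_end) range and would therefore be instantiated:<br />

```c
#include <stdint.h>

/* Each component i covers [prev_end, comp_end[i]); count those that
 * overlap the intent range [li_start, li_end). */
static int comps_to_instantiate(const uint64_t *comp_end, int ncomps,
				uint64_t li_start, uint64_t li_end)
{
	uint64_t comp_start = 0;
	int n = 0;

	for (int i = 0; i < ncomps; i++) {
		if (comp_start < li_end && comp_end[i] > li_start)
			n++;	/* half-open intervals overlap */
		comp_start = comp_end[i];
	}
	return n;
}
```

With the [0, 2M), [2M, 256M), [256M, EOF) template from the use cases above, a write intent covering [1M, 4M) would instantiate the first two components.<br />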
===Flushing Cached Pages Wisely under Layout Change===<br />
<br />
The client-side page cache may need flushing whenever the file layout is changed. This works well for HSM and file migration, because no pages would remain valid after those operations change the file layout. However, for PFL files the cached pages should still be valid after a layout change, because layout components are only appended to the existing layout and do not affect the existing components or their data. It could hurt application performance significantly if all pages were evicted from the client cache and then read back again for each new component added to the file.<br />
<br />
PFL uses layered generations to track layout changes for each file and each component. One is the layout generation at the whole-file level, which changes after a layout component is appended. The other is the per-component generation, which remains unchanged when new file components are appended.<br />
<br />
This feature will be used to facilitate client page cache management. When clients detect a layout change at the LOV layer, the LOV will further check the generations on layout components, and it will only flush pages for the newly added or modified layout components, which is an empty operation for PFL because layout components won’t be altered once created.<br />
<br />
One tricky case worth mentioning is that layout component generations are comparable only if the file's layout generation matches. Even if it is known that the file's layout generation increased due to layout component addition, the old and new layouts are still conceptually unrelated. Therefore, clients must compare not only component generations but also OST indices and objects to decide whether components are unchanged, to avoid unnecessary page eviction. <br />
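The per-component comparison described above can be sketched as follows (struct and function names are invented for illustration): pages for a component are kept only when its generation and its backing OST object mapping are both unchanged:<br />

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy snapshot of the component state a client would compare. */
struct sketch_comp_state {
	uint32_t gen;		/* per-component layout generation */
	uint32_t ost_index;	/* OST index backing the component */
	uint64_t object_id;	/* OST object identifier */
};

/* Flush cached pages unless the component is demonstrably unchanged. */
static bool comp_needs_flush(const struct sketch_comp_state *old_c,
			     const struct sketch_comp_state *new_c)
{
	if (new_c == NULL)	/* component gone from the new layout */
		return true;
	return old_c->gen != new_c->gen ||
	       old_c->ost_index != new_c->ost_index ||
	       old_c->object_id != new_c->object_id;
}
```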
<br />
In order to accomplish this, a pair of range parameters will be added into cl_object_prune() to indicate that only a subset of pages are being evicted. A callback may be provided by LOV later to check if an individual page should be evicted due to layout change.<br />
<br />
===Dynamic Layout Request===<br />
In the PFL Prototype, layout components must have been defined and instantiated before I/O starts. Otherwise, applications that are trying to access an uninstantiated or undefined component will receive an ENODATA error. With dynamic layout request supported, clients are able to instantiate layout components on demand. Layout intent RPC will be used to request an instantiated layout component from the MDT.<br />
<br />
The RPC format of layout intent:<br />
static const struct req_msg_field *ldlm_intent_layout_client[] = {<br />
&RMF_PTLRPC_BODY,<br />
&RMF_DLM_REQ,<br />
&RMF_LDLM_INTENT,<br />
&RMF_LAYOUT_INTENT,<br />
&RMF_EADATA /* for new layout to be set up */<br />
};<br />
<br />
struct req_msg_field RMF_LAYOUT_INTENT =<br />
DEFINE_MSGF("layout_intent", 0,<br />
sizeof(struct layout_intent), lustre_swab_layout_intent,<br />
NULL);<br />
EXPORT_SYMBOL(RMF_LAYOUT_INTENT);<br />
<br />
/* enqueue layout lock with intent */<br />
struct layout_intent {<br />
__u32 li_opc; /* intent operation for enqueue, read, write etc */<br />
__u32 li_flags;<br />
__u64 li_start;<br />
__u64 li_end;<br />
};<br />
<br />
The layout intent RPC is essentially an LDLM intent RPC. In it, struct layout_intent carries the necessary layout information to the MDT: for example, the range of the required component and which operation the client expects the MDT to execute. With the information in the preset layout template, the MDT should be able to create the corresponding components, instantiating all components within the range [li_start, li_end). If the MDT successfully instantiates a component, it will increment the file's layout generation number by 1.<br />
<br />
===When to send layout intent in CLIO stack===<br />
In the CLIO stack, the LOV layer is the only place where the layout can be understood and interpreted. When I/O is initiated, the LOV layer will split the file-level I/O range into sub-I/Os based on component information, ensuring that no sub-I/O crosses a component boundary. The lov_io_iter_init() function is where it checks whether a component is instantiated.<br />
<br />
If a component is not instantiated and the I/O is a write operation, LOV will set a flag in the cl_io data and return to the llite layer with error -ENODATA. The llite layer will realize that new components are required to complete this I/O, so it will invoke ll_layout_request() to send the layout intent RPC. If everything goes well, ll_layout_request() will fetch and apply the new layout to the CLIO stack. Finally, the llite layer restarts the I/O, which should now be able to move forward. Since components are instantiated only once, and are added only at the end of the file, layout changes for PFL files cannot affect in-flight I/O operations.<br />
<br />
So far, the operations that would trigger a layout intent request are the VFS write() family of calls, truncate(), and page ->mkwrite(). In the future, other interfaces such as fallocate() would also need to handle uninstantiated components in a similar manner.<br />
<br />
==Server-side RPC Interface and IO code paths==<br />
The MDS should provide an interface that can handle client requests to populate a new layout and modify an existing layout. We can consider a few different options for going through the various layers in the metadata stack, from the client to the underlying storage on the MDT that holds the layout itself:<br />
<br />
#a new DT method or set of methods<br />
## will be used only by the Logical Object Device (LOD) on the MDS<br />
## unit testing (talking directly to LOD) would need additional infrastructure<br />
# use setxattr() with special xattr names<br />
## lustre.lov.add = { components to add to the composite layout }<br />
## lustre.lov.del = { delete the component ID in the value }<br />
## lustre.lov.set = { set component flags }<br />
## lustre.lov.clear = { clear component flags }<br />
## no extra infrastructure needed to implement unit testing<br />
## no new special methods<br />
# extend ->dbo_punch() method<br />
## a flag to populate range: assign new objects<br />
## a flag to depopulate: release the objects<br />
## probably not enough to change flags (out-of-sync stripe?)<br />
# totally crazy idea - layout as an index<br />
## this is what it is in essence<br />
## range/offset as a key<br />
## object+status as a value<br />
<br />
Several aspects of the xattr interface (#2) are of interest, which lead to selecting the xattr method for the PFL implementation.<br />
<br />
The first and foremost reason for selecting the xattr interface is that it allows adding the ability to modify existing layout xattrs without a significant change in the number of RPC types, without adding new dedicated server APIs that will only be used for composite files, and without introducing new userspace APIs that may have portability issues. This simplifies the understanding of the code and keeps the complexity growth in check for the future, and doesn't sacrifice flexibility.<br />
<br />
A secondary reason for selecting the xattr interface for managing the layouts on the client is that with IO forwarders such as IBM's CIOD it is difficult to pass ioctl() commands from the compute nodes where applications run to the IO nodes where the Lustre client runs. This needs a special handler for every ioctl() command and quickly becomes a maintenance headache. However, the getxattr() and setxattr() interfaces already exist in such environments and provide generic key=value methods that can work with arbitrary key names and values between the llapi_layout(7) library commands and the Lustre client, over the network to the MDS, and from the MDS down to the underlying MDT storage. It is not expected that applications or users will interact directly with the file components using getxattr() and setxattr(), but only via the llapi_layout interfaces.<br />
<br />
==Use Cases for getxattr() and setxattr() interfaces==<br />
In order to manipulate the file layout held in the lustre.lov xattr, getxattr() and setxattr() (or fgetxattr() and fsetxattr() for operations on already-open file descriptors) will manipulate virtual xattrs with names such as lustre.lov.add, lustre.lov.del, lustre.lov.set, and lustre.lov.clear. These can interface transparently from userspace on the client with the kernel on the client, or with the MDS as needed. Below we look at the use cases from the PFL 2 Solution Architecture to verify that these xattr interfaces can meet the requirements set out in that document.<br />
<br />
===U01. User creates new file with fixed-size layout component===<br />
setxattr("lustre.lov", <binary composite file description>);<br />
The LOD on the MDS parses the layout, which should contain a component that has an extent start of zero, and adds the component as described and populates it with OST objects. The MDS will always allocate the first component's objects, to avoid the immediate overhead of another RPC and layout lock cancellation for the normal pattern of file create followed by file write.<br />
<br />
===U02. User adds component(s) with fixed-size extent(s) to an existing composite file===<br />
setxattr("lustre.lov.add", <binary component description>);<br />
The LOD on the MDS parses the layout, sanity checks the components against the existing layout (S03, S04) and against each other, and adds the component(s) as described to the file, essentially the same as U01. The binary component description is largely self-describing, so it may contain one or more components that are added to the existing file layout, if any. The LOD will assign file-unique component IDs as necessary, which may be different from those the client generated while creating the layout in memory. If the client does not have the OBD_CONNECT_PFL_DYNAMIC flag set at connection time (a PFL 2 client, see Client-MDS Protocol below), then it is not capable of dynamically requesting that layout components be instantiated, in which case the MDS will allocate objects for all components.<br />
<br />
===U03. User adds final component to existing composite file===<br />
setxattr("lustre.lov.add", <binary component description>);<br />
LOD parses the layout, adds the component as supplied, and populates it with OST objects (for PFL2 clients only, i.e. if no OBD_CONNECT_PFL_DYNAMIC flag was set at connection time; see Client-MDS Protocol below). This is also the same as U01 and U02, with the only difference being that the supplied component has an extent that ends at 2<sup>64</sup>-1 bytes.<br />
<br />
===U04. User requests the layout for an existing file===<br />
getxattr("lustre.lov");<br />
If the client is fetching the full layout xattr, then it can use the same getxattr interface as is used today by existing tools with no additional processing of the xattr or layout. This ensures that utilities such as tar(1) and others that already save and restore Lustre file layouts continue to work properly with composite files.<br />
getxattr("lustre.lov.extent.<start-end>");<br />
The LOD on the MDS and/or the LOV on the client parses the layout and returns the component(s) covering range [start,end). Since the components themselves are self-describing, containing the component_id, component_start, and component_end, they can be returned directly to the caller and handled directly.<br />
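The range encoding in the virtual xattr name is not fully specified here; assuming a simple "&lt;start&gt;-&lt;end&gt;" decimal form, a parser sketch (hypothetical helper, not Lustre code) could look like:<br />

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parse "lustre.lov.extent.<start-end>" into a half-open [start, end)
 * range.  Returns 0 on success, -1 on a malformed name or empty range. */
static int parse_extent_xattr(const char *name, uint64_t *start, uint64_t *end)
{
	static const char prefix[] = "lustre.lov.extent.";
	unsigned long long s, e;

	if (strncmp(name, prefix, sizeof(prefix) - 1) != 0)
		return -1;
	if (sscanf(name + sizeof(prefix) - 1, "%llu-%llu", &s, &e) != 2 ||
	    s >= e)
		return -1;
	*start = s;
	*end = e;
	return 0;
}
```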
<br />
===U05. User gets layout parameters to existing component by ID===<br />
getxattr("lustre.lov.id.<component_id>");<br />
The LOD on the MDS and/or the LOV on the client parses the layout, finds component with id=<component id> and returns it to the caller.<br />
<br />
===U06. User accesses or modifies components in an existing file===<br />
setxattr("lustre.lov.set.<valid>", <binary component description>);<br />
LOD iterates over the components in the file, applies changes to the individual components passed in as the binary component value, and sets the fields in the component as specified by valid.<br />
<br />
===U07. User deletes composite file===<br />
The LOD iterates over the components on the MDS, destroying the individual OST objects using the existing RPC and recovery methods before deleting the MDT inode.<br />
<br />
===U08. User creates a new composite file describing multiple components===<br />
setxattr("lustre.lov", <binary composite file description>);<br />
This is essentially the same as U01 but the composite file description contains multiple components.<br />
<br />
===U09. User migrates composite file===<br />
fsetxattr(source_fd, "lustre.lov.swap.<component_id>", <volatile_fid>);<br />
The existing mechanism to swap whole file layouts should be usable without modification, as it simply copies the layout xattrs and doesn't actually look into the layouts. This allows existing tools that may use the llapi_[f]swap_layouts() interface, such as HSM copytools, to continue to be able to manage whole-file layout changes. If the user is migrating a single component from a composite file, then the data copy step is largely the same, except it will only copy data covered by the source extent [start,end) into the target file before swapping the layout from the source component to the temporary target file. The design of composite file layouts ensures that the logical file offsets of the source file and the target files are the same, so no special support is needed in userspace to do the data copy. The volatile_fid is accessible on the MDS while the migration tool keeps the volatile file open, though the MDS needs to verify that the file matching <volatile_fid> has been opened by the client for write, to ensure the client has write permission on the file, since it is not possible to pass two open file descriptors to the fsetxattr() syscall.<br />
<br />
===U10. User searches for composite files===<br />
getxattr("lustre.lov.header");<br />
This will return the composite file layout header containing the layout type, number of components, etc. This can reduce the amount of information going over the network and up to userspace, and lets the caller allocate a sufficiently large buffer to hold the full layout.<br />
<br />
===U11. User specifies default composite file template for directory (Phase 3)===<br />
setxattr("lustre.lov", <binary composite file template>);<br />
The setxattr() operation would be done on the parent directory, in a similar manner that it is done today, only with different xattr contents. The composite layout template is stored on the parent directory as is done today for plain layout templates, storing only struct lov_mds_md_v1 with the required fields set, and not storing any of the file stripes in lmm_objects. <br />
===U12. Administrator specifies filesystem-wide composite file template for root directory (Phase 3)===<br />
Same as U11, except the operation takes place on the root directory and affects all new files.<br />
===U13. User deletes uninstantiated stripe component from file by ID (Phase 3)===<br />
setxattr("lustre.lov.del", <component id>);<br />
LOD parses the layout, finds the component with ID=<component id>, and, if it is uninstantiated (has no objects assigned), deletes it.<br />
<br />
===U14. User deletes uninstantiated stripe components from file by flags (Phase 3)===<br />
setxattr("lustre.lov.del", <component id>);<br />
To delete all uninstantiated components from the file, it is possible to use the enum lcme_id wildcard LCME_ID_UNINIT.<br />
For more complex operations, it is more practical to fetch the entire layout to the client, iterate over the components in userspace, and then perform the pattern matching in an arbitrarily complex manner to determine which component IDs to remove or modify. While there is some extra overhead in deleting or modifying components individually, it is impractical to have a complex query and update interface embedded in the MDS for this. Otherwise, there may be an explosion of different matching criteria that need to be added (e.g. components within these extents, have specific flags set, that have specific layout generations, stripe sizes, etc). <br />
<br />
==Server-side Composite Layout Handling==<br />
===Composite File Layout Handling===<br />
The server needs to interpret and handle the virtual xattr values that are sent from the client. In order to avoid namespace collisions and potential abuse by users, the virtual xattr keywords such as .add, .del, .swap, etc. are only interpreted for specific Lustre xattrs such as lustre.lov, and potentially lustre.lmv in the future. The lustre.lov xattr is already handled specially on the client, since it cannot be set if the file already has a layout, so this will not add significant complexity on the client or server. Because xattrs are read and written as a single unit, any modification to the xattr needs to load the existing layout xattr into memory, modify it as requested by the client, and then store it back to the MDT inode object. This will be handled by the LOD layer of the MDS, since it is the software module that interprets the file layout, and also has access to the MDT OSD to load and store the xattr contents.<br />
<br />
The LOD must verify incoming lustre.lov layout xattrs, whether a whole composite layout is being sent or an incremental update is being made to a layout component. Until other composite layout features, such as File Level Replication and partial HSM restore, are implemented, the layout checks done by the MDS will be PFL-specific. These include verifying that components have adjacent extents so that there are no holes in the layout (S03), that the layouts do not overlap (S03, S04) or, if they do, that they describe identical components (S09.1), and that they do not specify attribute flags that are controlled by the server. Fields such as lcme_offset and lcme_id that are server-managed will be ignored and overwritten by the server.<br />
<br />
The composite layout header contains a generation value, lcm_layout_gen , that is updated by the MDS when the composite layout is changed. In order to ensure that component IDs within a file are unique, the lcme_id assigned to a new component added to the layout will be the lcm_layout_gen of the composite file. An lcme_id of 0 is reserved, and indicates that the ID is unassigned for this component or no specific component is requested, and will not be used by the MDS for any components in a file.<br />
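The ID-assignment rule above can be sketched as follows; this is an illustrative model, not the actual MDS code, and the wrap-around handling is an assumption of the sketch.<br />

```c
#include <assert.h>
#include <stdint.h>

struct layout_state { uint32_t lcm_layout_gen; };

/* Every layout change bumps the generation; a newly added component
 * takes the bumped generation as its lcme_id, keeping IDs unique within
 * the file.  ID 0 is reserved for "unassigned", so it is skipped on
 * wrap-around (the wrap handling is an assumption of this sketch). */
static uint32_t assign_component_id(struct layout_state *s)
{
	s->lcm_layout_gen++;
	if (s->lcm_layout_gen == 0)
		s->lcm_layout_gen = 1;
	return s->lcm_layout_gen;
}
```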
<br />
As these checks of the incoming layout and the update of the lustre.lov xattr on disk need to be serialized, these operations will be serialized by the layout lock (MDS_INODELOCK_LAYOUT) on the inode.<br />
<br />
===Layout Intent Lock Handling===<br />
For PFL, there are two kinds of request that can cause a layout change. One comes from the command line, appending or changing components manually; the other comes from the CLIO stack once dynamic layout intent is supported. Both kinds of request ultimately invoke setxattr() on the MD stack. The MDT has to hold the MDS_INODELOCK_LAYOUT lock in LCK_EX mode when it calls setxattr() to make actual changes to the layout. Not only does holding the layout lock on the MDT inode object serialize updates between multiple threads on the MDS, it also serves to revoke the layout lock from all clients that have been granted it. This revocation invalidates the cached file layout on the clients and causes them to refresh the layout on their next IO operation.<br />
<br />
==Feature Compatibility and Interoperability==<br />
===File Layout Compatibility===<br />
The PFL composite layout is incompatible with the existing Lustre file layout, though the individual layout components will reuse the existing lov_mds_md_v1 and lov_mds_md_v3 RAID-0 layouts. Non-PFL clients will receive an EIO error when accessing a composite file. Accessing plain (non-composite) files in the same filesystem will continue to work for both PFL and non-PFL clients. It is not possible to translate PFL file layouts into a layout that older clients will understand. Older servers will refuse to create a file with a PFL layout, due to the new magic value stored at the start of each layout.<br />
<br />
===MDT On-Disk Format===<br />
The PFL composite layout stored on disk will continue to use the trusted.lov xattr name and will be stored directly in the MDT inode, if space permits, to maximize performance. The existing maximum limits on xattr sizes will not be changed as part of this project. For both ZFS and ldiskfs backing filesystems the on-disk xattr size is not the limiting factor for determining the maximum stripe count of a file, but rather the RPC size limits.<br />
<br />
The MDS itself needs to understand the new struct lov_comp_md_v1 layout format described in [[Layout Enhancement High Level Design#2.1. Composite Layouts|Layout Enhancement HLD Composite Layouts]], in order to unlink the OST objects within that file when it is deleted, or change the ownership of a file's OST objects when the file ownership changes.<br />
<br />
The [https://wiki.hpdd.intel.com/display/opensfs/LFSCK2+High+Level+Design Lustre File System Check (LFSCK)] tool also needs to understand struct lov_comp_md_v1 in order to accurately determine the relationship between an MDT inode and all the OST objects where it stores its data. This can leverage the same composite file layout iteration that the MDS is using for file unlink, setattr, and other operations that affect all of the OST objects on a file.<br />
<br />
====MDT Default File Layout Templates====<br />
The file layout template is an uninstantiated file layout that is initially stored on a parent directory, or on the filesystem root directory, and provides the default layout for new files that do not otherwise have a specific layout assigned at file creation time. When a new file or directory is first created, it inherits the layout template from the parent directory in which it was created; if the parent directory has no template, the template is inherited from the filesystem root directory. Once assigned to the new file, the layout is stored with the MDT inode on disk and is instantiated as needed for that file.<br />
<br />
The layout template itself for a plain file is simply struct lov_mds_md_v1, or struct lov_mds_md_v3 if an OST pool is in use, without any of the OST objects allocated for it (i.e. the lmm_objects[] array is unused). The plan for composite file templates will be similar - a layout template for a 3-component file would consist of the composite header template struct lov_comp_md_v1 along with three separate pairs of component entries and uninstantiated sub-layout templates, namely struct lov_comp_md_entry_v1 and the accompanying struct lov_mds_md_v1 without any OST objects allocated.<br />
struct lu_extent {<br />
__u64 e_start;<br />
__u64 e_end;<br />
};<br />
<br />
struct lov_comp_md_entry_v1 {<br />
__u32 lcme_id; /* unique identifier of component */<br />
__u32 lcme_flags; /* LCME_FL_XXX */<br />
struct lu_extent lcme_extent; /* file extent for component */<br />
__u32 lcme_offset; /* offset of component blob in layout */<br />
__u32 lcme_size; /* size of component blob data */<br />
__u64 lcme_padding[2];<br />
};<br />
<br />
struct lov_comp_md_v1 {<br />
__u32 lcm_magic; /* LOV_MAGIC_COMP_V1 */<br />
__u32 lcm_size; /* overall size of layout including this structure */<br />
__u32 lcm_layout_gen;<br />
__u16 lcm_flags;<br />
__u16 lcm_entry_count;<br />
__u64 lcm_padding1;<br />
__u64 lcm_padding2;<br />
struct lov_comp_md_entry_v1 lcm_entries[0];<br />
};<br />
<br />
struct lov_ost_data_v1 { /* per-stripe data structure (little-endian)*/<br />
struct ost_id l_ost_oi; /* OST object ID */<br />
__u32 l_ost_gen; /* generation of this l_ost_idx */<br />
__u32 l_ost_idx; /* OST index in LOV (lov_tgt_desc->tgts) */<br />
};<br />
<br />
struct lov_mds_md_v1 { /* LOV EA mds/wire data (little-endian) */<br />
__u32 lmm_magic; /* magic number = LOV_MAGIC_V1 */<br />
__u32 lmm_pattern; /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */<br />
struct ost_id lmm_oi; /* LOV object ID */<br />
__u32 lmm_stripe_size; /* size of stripe in bytes */<br />
/* lmm_stripe_count used to be __u32 */<br />
__u16 lmm_stripe_count; /* num stripes in use for this object */<br />
__u16 lmm_layout_gen; /* layout generation number */<br />
struct lov_ost_data_v1 lmm_objects[0]; /* per-stripe data */<br />
};<br />
<br />
Unfortunately, the size of a 3-component layout template, even without any OST objects allocated, is larger than can fit into the free space of the current 512-byte ldiskfs inodes, as can be seen in the diagram below. There are approximately 180 bytes of free space in the 512-byte inode (depending on the length of the filename and whether there are multiple hard links to the file), but a 3-component template layout is 268 bytes in size. Even with aggressive reduction of the size of lov_comp_md_v1, lov_comp_md_entry_v1, and lov_mds_md to remove all fields that are not strictly necessary, the 3-component template would still be too large to fit into the directory inode, let alone on an actual file using this template with at least one allocated OST object. If the xattr is too large for the in-inode space, for example for a plain RAID-0 file with more than 5 stripes, then the layout xattr is stored in a separate data block. Storing the layout xattr outside the inode may incur a significant performance penalty, due to an extra seek on every inode access in order to fetch the layout xattr into memory, so this is undesirable for normal usage.<br />
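A back-of-envelope check of these sizes, using the field widths from the struct definitions above and assuming a 16-byte struct ost_id, lands within a few bytes of the 268-byte figure (the small difference depends on exact field packing) and confirms that a 3-component template overflows the roughly 180 bytes of in-inode xattr space while a single-component layout fits:<br />

```c
#include <assert.h>

/* Field widths summed from the struct definitions above (bytes);
 * struct ost_id is assumed to be 16 bytes. */
enum {
	LCM_HDR_SIZE = 4 + 4 + 4 + 2 + 2 + 8 + 8,  /* lov_comp_md_v1: 32 */
	LCME_SIZE    = 4 + 4 + 16 + 4 + 4 + 16,    /* lov_comp_md_entry_v1: 48 */
	LMM_V1_SIZE  = 4 + 4 + 16 + 4 + 2 + 2,     /* lov_mds_md_v1, no objects: 32 */
	INODE_FREE   = 180,                        /* approx. in-inode xattr space */
};

/* Size of an uninstantiated template with ncomp components. */
static int template_size(int ncomp)
{
	return LCM_HDR_SIZE + ncomp * (LCME_SIZE + LMM_V1_SIZE);
}
```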
<br />
One option to avoid the overflow of the in-inode xattr space would be to store only a single-component layout on the file, which would fit within the available 180-byte space, and inherit the rest of the components from the parent directory as the file size grows to need these components. This would be desirable from the point of view of minimizing the overhead for small files, which can make up a large fraction of all files in HPC filesystems. However, this also adds complexity to the PFL code and usage, since the inode is not guaranteed to have the same parent directory, and hence a different layout template, when the time comes to extend the file beyond the first component. This may lead to inconsistent or sub-optimal layout components if the file is renamed, or the default layout of a directory or the filesystem is modified, and the new directory layout template is incompatible with the existing component(s) on the file due to overlapping layout extents.<br />
<br />
Even if a single-component file layout could fit in the inode xattr space, the composite layout template still couldn't fit into the parent directory's inode. However, since there are normally far fewer directories than files, and directory leaf blocks are themselves likely to be allocated only one block at a time, the external xattr block would not be as high an overhead, and the one xattr read overhead would normally be amortized over the creation of many files within that directory that use the same layout template.<br />
<br />
Another option is to format the MDT with larger 1024-byte inodes by default, to ensure there is enough space for not only the composite layout or template, but also for other xattrs such as SELinux labels, ACLs, etc. This has the drawback that the MDT will need to be reformatted for 1024-byte inodes to maximize PFL performance, and each inode will take twice as much space on disk and in memory, which may also impact metadata performance. This can possibly be mitigated on existing 512-byte inodes that use SSD or NVM storage for the MDT to avoid the overhead of seeking to read the external xattr block for each cache-cold MDT inode access.<br />
<br />
Due to the implementation complexity and risk of inconsistent or sub-optimal file layouts being created by the incremental inheritance of layouts from the parent layout template, the PFL 2 project will implement whole-layout inheritance at file creation time.<br />
<br />
[[Image:pfl2_default_layout_template.png]]<br />
<br />
====MDS Layout Verification====<br />
In addition to PFL layout verification performed by userspace in the llapi_layout_* functions, the MDS should also do verification of the layout components to ensure that they are valid for the PFL feature. This includes the following checks:<br />
<br />
* verify the start of each component matches the end of the previous component (if any), to prevent overlapping or disjoint extents.<br />
* verify the layout stripe_size and the layout extent_end are properly aligned to prevent fractional pages or RPCs that span multiple components. This restriction may be relaxed over time, but for the initial implementation it will avoid complexity to ensure that full-stripe reads and writes are done within a single component.<br />
* verify object_maxbytes * stripe_count >= extent_end for each component except the last one, to ensure that file data can be written over the full range of the component. For ldiskfs OSTs the object_maxbytes is 16 TiB, so for a component with few stripes and a very large extent_end it is possible that the client would get -EFBIG while writing to the middle of the file. For ZFS OSTs the object_maxbytes value is 2^63-1 bytes, so this is not an issue. This may be difficult to implement with 100% consistency, since the MDS will not necessarily know which specific OSTs will be selected when setting an uninstantiated layout template; this is only a concern if there are different OST types within the same filesystem. In this unlikely case, it would be easiest to select the minimum maxbytes limit at OST connect time.<br />
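The three checks above can be sketched over a simplified component description as follows; the struct, limits, and helper name are illustrative, not the actual LOD code.<br />

```c
#include <assert.h>
#include <stdint.h>

#define LUSTRE_EOF (~0ULL)

/* Simplified component description for the purposes of this sketch. */
struct comp {
	uint64_t start, end;     /* extent [start, end) */
	uint32_t stripe_size;
	uint32_t stripe_count;
};

/* Return 0 if the components pass the three checks above, -1 otherwise. */
static int verify_layout(const struct comp *c, int n, uint64_t object_maxbytes)
{
	uint64_t expected_start = 0;

	for (int i = 0; i < n; i++) {
		/* 1. components must tile the file: no holes, no overlaps */
		if (c[i].start != expected_start)
			return -1;
		/* 2. extent_end must be stripe-aligned (the EOF component
		 *    is exempt) so no RPC spans two components */
		if (c[i].end != LUSTRE_EOF && c[i].end % c[i].stripe_size != 0)
			return -1;
		/* 3. the component must be able to hold its whole extent */
		if (c[i].end != LUSTRE_EOF &&
		    (uint64_t)c[i].stripe_count * object_maxbytes < c[i].end)
			return -1;
		expected_start = c[i].end;
	}
	return 0;
}
```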
<br />
As other features are added that use composite layouts, such as File Level Replication, these restrictions can be relaxed.<br />
<br />
===Client-MDS Protocol===<br />
By using extensions to the xattr protocol to instantiate and modify composite layouts there are no RPC protocol changes needed between the client and MDS. The Phase 2 PFL client will send the new OBD_CONNECT_COMPOSITE connection flag to indicate that it understands the composite layout feature, and the MDS replies with the same feature flag set to inform the client that this feature is supported, otherwise the client would get an error when storing the composite layout on the MDS. The existing MDS_SETXATTR and MDS_GETXATTR RPC opcodes can be used to create, modify, and remove individual components of a file, as well as whole composite files. Since the connection flag exchange is done on every client and MDS restart, there should never be a case where the MDS does not recognize the incoming file layout magic or the enhanced RPC opcodes.<br />
<br />
The existing RPC size limit will not be changed as part of this project, allowing a single file to have maximum stripe count of 2000 OSTs for a plain RAID-0 file. Since the PFL composite file and component layout containers themselves take up space, the maximum number of OSTs that a single file can use depends on the exact layout being used. For a single-component file, the maximum stripe count will only be 2-3 stripes below the 2000-OST limit. For a file with many single-stripe components, the maximum number of components will be approximately 500.<br />
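The arithmetic behind these limits can be sketched as follows, using assumed on-wire sizes (a 24-byte lov_ost_data_v1, 48-byte lov_mds_md_v3 header, 32-byte lov_mds_md_v1 header, 48-byte component entry, and 32-byte composite header); the budget is taken to be the size of a 2000-stripe plain layout, and the result comes out in the neighbourhood of the 500-component figure quoted above.<br />

```c
#include <assert.h>

/* Assumed on-wire sizes in bytes; the real limits depend on the exact
 * RPC buffer accounting, so treat this as an order-of-magnitude check. */
enum {
	OBJ_SIZE     = 24,   /* lov_ost_data_v1 */
	LMM_V3_SIZE  = 48,   /* lov_mds_md_v3 header (with pool name) */
	LMM_V1_SIZE  = 32,   /* lov_mds_md_v1 header */
	LCME_SIZE    = 48,   /* lov_comp_md_entry_v1 */
	LCM_HDR_SIZE = 32,   /* lov_comp_md_v1 */
	/* budget: size of a 2000-stripe plain RAID-0 layout */
	BUDGET       = LMM_V3_SIZE + 2000 * OBJ_SIZE,
};

/* Each single-stripe component costs one entry header, one sub-layout
 * header, and one object descriptor. */
static int max_single_stripe_components(void)
{
	return (BUDGET - LCM_HDR_SIZE) / (LCME_SIZE + LMM_V1_SIZE + OBJ_SIZE);
}
```

Under these assumptions the budget is 48 + 2000 × 24 = 48048 bytes, giving roughly 460 single-stripe components.<br />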
<br />
If the PFL Phase 2 Static Layout implementation is deployed separately from the proposed PFL Phase 3 Dynamic Layout, then some additional changes are needed in the RPC protocol between clients and the MDS. For clients using the PFL2 code that understands composite layouts but not dynamic layout initialization, as detected by the lack of OBD_CONNECT_PFL_DYNAMIC flag at connection time, any file creation requests will result in the MDS allocating all of the OST objects for a file with a layout template. The PFL Phase 3 clients can notify the MDS at connect time, by passing a new OBD_CONNECT_PFL_DYNAMIC feature flag, that they handle on-the-fly layout initialization of files, so it is safe to store only the layout template to disk.<br />
<br />
===Client-OSS Protocol===<br />
During normal IO operations between the client and OSS, the client sends information to the OSS about each object that is being accessed, to avoid the overhead of extra communication between the MDS and OSS for every object created and accessed in the filesystem. This information includes sending the MDT inode File Identifier (FID) to the OSS in order to indicate which file each OST object belongs to, as well as the stripe index of that object within the file. This information is stored on the OST object the first time the object is ever modified. The MDT inode FID passed from the client is sanity checked against the one stored on the OST object for later IO operations in order to avoid accidentally accessing or modifying OST objects due to software bugs, as well as by the [https://wiki.hpdd.intel.com/display/opensfs/LFSCK2+High+Level+Design Lustre File System Check] (LFSCK) tool to verify consistency between the file layout on the MDT and the objects on the OST(s) and to rebuild the MDT inode file layout if it becomes corrupted. By storing the component ID with each OST object, along with the stripe index and stripe size, the LFSCK tool can re-assemble the file layout for each MDT inode FID, even if the layout is lost or corrupted on the MDT.<br />
<br />
The RPC from the client to the OSS currently passes only a single integer for the object's stripe index, since this is all that was needed to uniquely identify the object in a RAID-0 file layout. To accommodate multiple component layouts within a single composite file, the RPC from the client needs to be extended slightly to carry more information for the above purposes: the stripe size and stripe count, and, if the target object belongs to a PFL component, the component ID and its extent range. If all of this information were sent from the client to the OSS via the current obdo structure, the obdo.o_padding_{4/5/6} fields alone would not be enough, so other fields must be reused to avoid enlarging the obdo structure. Currently obdo.o_lcookie is only used by the OSP for recording the async RPC llog cookie, and such cookies are only used locally, so o_lcookie can be repurposed for on-wire use. Fortunately, it is large enough (32 bytes) to hold all of the above information for LFSCK. A new structure, ost_layout, will be defined for this purpose.<br />
struct llog_cookie {<br />
struct llog_logid lgc_lgl;<br />
__u32 lgc_subsys;<br />
__u32 lgc_index;<br />
__u32 lgc_padding;<br />
} __attribute__((packed));<br />
<br />
+struct ost_layout {<br />
+ __u64 ol_pfl_start;<br />
+ __u64 ol_pfl_end;<br />
+ __u32 ol_pfl_id;<br />
+ __u32 ol_stripe_size;<br />
+ __u32 ol_stripe_count;<br />
+ __u32 ol_padding_0;<br />
+};<br />
<br />
struct obdo {<br />
__u64 o_valid; /* hot fields in this obdo */<br />
struct ost_id o_oi;<br />
__u64 o_parent_seq;<br />
__u64 o_size; /* o_size-o_blocks == ost_lvb */<br />
__s64 o_mtime;<br />
__s64 o_atime;<br />
__s64 o_ctime;<br />
__u64 o_blocks; /* brw: cli sent cached bytes */<br />
__u64 o_grant;<br />
/* 32-bit fields start here: keep an even number of them via padding */<br />
__u32 o_blksize; /* optimal IO blocksize */<br />
__u32 o_mode; /* brw: cli sent cache remain */<br />
__u32 o_uid;<br />
__u32 o_gid;<br />
__u32 o_flags;<br />
__u32 o_nlink; /* brw: checksum */<br />
__u32 o_parent_oid;<br />
__u32 o_misc; /* brw: o_dropped */<br />
__u64 o_ioepoch; /* epoch in ost writes */<br />
__u32 o_stripe_idx; /* layout stripe idx */<br />
__u32 o_parent_ver;<br />
struct lustre_handle o_handle; /* brw: lock handle to prolong<br />
* locks */<br />
- struct llog_cookie o_lcookie; /* destroy: unlink cookie from<br />
- * MDS, obsolete in 2.8, reused<br />
- * in OSP */<br />
+ /* Originally this field was a llog_cookie, used for destroy with an<br />
+ * unlink cookie from the MDS; it became obsolete in 2.8 and was reused<br />
+ * by the client to transfer layout and PFL information in IO and<br />
+ * setattr RPCs. Since llog_cookie is no longer used on the wire, it is<br />
+ * removed from the obdo, and can then be enlarged freely in the future<br />
+ * without affecting related RPCs.<br />
+ *<br />
+ * Here, we have verified sizeof(ost_layout) == sizeof(llog_cookie). */<br />
+ union {<br />
+ /* struct llog_cookie o_lcookie; */<br />
+ struct ost_layout o_layout;<br />
+ };<br />
__u32 o_uid_h;<br />
__u32 o_gid_h;<br />
__u64 o_data_version; /* getattr: sum of iversion for<br />
* each stripe.<br />
* brw: grant space consumed on<br />
* the client for the write */<br />
__u64 o_padding_4;<br />
__u64 o_padding_5;<br />
__u64 o_padding_6;<br />
};<br />
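The size equality asserted in the comment above can be checked at compile time, as in the sketch below; struct ost_id is assumed to be a 16-byte union, and the packed attribute on the llog structures matches the Lustre wire format.<br />

```c
#include <assert.h>
#include <stdint.h>

struct ost_id { uint64_t oi_a, oi_b; };   /* assumed 16-byte union */

struct llog_logid {
	struct ost_id lgl_oi;
	uint32_t      lgl_ogen;
} __attribute__((packed));

struct llog_cookie {
	struct llog_logid lgc_lgl;
	uint32_t lgc_subsys;
	uint32_t lgc_index;
	uint32_t lgc_padding;
} __attribute__((packed));

struct ost_layout {
	uint64_t ol_pfl_start;
	uint64_t ol_pfl_end;
	uint32_t ol_pfl_id;
	uint32_t ol_stripe_size;
	uint32_t ol_stripe_count;
	uint32_t ol_padding_0;
};

/* The union replacement in struct obdo relies on this equality. */
_Static_assert(sizeof(struct llog_cookie) == sizeof(struct ost_layout),
	       "ost_layout must exactly fill the space of llog_cookie");
```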
<br />
Implementing the support for LFSCK and the OSTs to handle composite files belongs to PFL Phase 3a, which is beyond the scope of the PFL Phase 2 implementation.<br />
<br />
===OST On-Disk Format===<br />
As discussed in the Client-OSS Protocol section, the OST stores a fragment of the MDT layout with each object in order to do sanity checks on incoming client RPCs, and for recovery in case of MDT corruption. The OST needs to store an additional 28 bytes of data with struct filter_fid so that the OST object knows its place within the component and within the composite file:<br />
struct filter_fid {<br />
struct lu_fid ff_parent; /* ff_parent.f_ver == file stripe number */<br />
+ __u32 ff_stripe_size;<br />
+ __u32 ff_stripe_count;<br />
+ __u64 ff_pfl_start;<br />
+ __u64 ff_pfl_end;<br />
+ __u32 ff_pfl_id;<br />
};<br />
<br />
The osd-ldiskfs on-disk inode, along with the Lustre-specific xattrs ("lma" and "fid"), is very nearly out of free space in the OST's 256-byte inode, so there is not enough room to store these 28 additional bytes in the "fid" xattr. As explained above, storing the "fid" xattr in a separate block would cause a serious performance penalty, so another solution must be considered. One possibility is to merge the "fid" xattr into the "lma" EA body (value), saving the space occupied by the "fid" xattr entry (20 bytes). This is somewhat of a hack, but it is hidden inside osd-ldiskfs: from the upper layers' point of view the "fid" xattr is still independent, and they do not know and should not care how it is stored on disk.<br />
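The 20-byte figure for the "fid" xattr entry follows from the ext4 xattr entry layout: a 16-byte entry header plus the name, rounded up to a 4-byte boundary, mirroring the kernel's EXT4_XATTR_LEN() calculation:<br />

```c
#include <assert.h>
#include <string.h>

#define XATTR_ENTRY_HDR 16   /* sizeof(struct ext4_xattr_entry) */
#define XATTR_ROUND      3   /* entries are padded to 4 bytes */

/* Bytes consumed by one ldiskfs xattr name entry, mirroring the
 * kernel's EXT4_XATTR_LEN() macro (the value itself is stored
 * separately from the entry). */
static unsigned xattr_entry_len(const char *name)
{
	return (XATTR_ENTRY_HDR + (unsigned)strlen(name) + XATTR_ROUND)
	       & ~(unsigned)XATTR_ROUND;
}
```

For the 3-character name "fid" this yields (16 + 3 + 3) & ~3 = 20 bytes, the saving cited above.<br />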
<br />
[[Image:pfl2_inode_size.png]]<br />
<br />
For the osd-zfs on-disk dnode (inode), the added information will be stored in the System Attributes (SAs), which currently do not fit into the dnode proper, so a separate spill block must already be allocated to hold the SAs for the dnode. Once the large dnode patch lands in the ZFS-on-Linux repository, it will allow the SAs to always be stored within the dnode for maximum performance.<br />
<br />
===MDS-OSS Protocol===<br />
<br />
The MDS-OSS protocol is largely unaffected by composite layouts, since the OSTs themselves never use the file layout directly. The LFSCK utility does fetch the struct filter_fid xattr from the OST in order to verify its consistency against the locally stored file layout. The actual network protocol remains unchanged, apart from the extra fields added to this structure. The LFSCK utility will need to verify the ff_stripe_size and ff_pfl_id fields against their respective values in the file layout to verify that the object is part of the correct component.<br />
<br />
===Known Issues===<br />
<br />
* Append writes: an append has to instantiate all components to fulfill POSIX semantics.<br />
* Group locks: the current group lock semantics would be hard to preserve; work is in progress to come up with a solution.<br />
<br />
More known issues are tracked at https://jira.hpdd.intel.com/browse/LU-9349<br />
<br />
[[Category:Architecture]]<br />
[[Category:Design]]<br />
[[Category:PFL]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=PFL2_High_Level_Design&diff=2706PFL2 High Level Design2017-08-28T20:55:18Z<p>Nrutman: /* Design Overview */</p>
<hr />
<div>==Introduction==<br />
<br />
The Progressive File Layout Phase 2 (PFL2) High Level Design describes details of how the PFL feature may be implemented, including the user interfaces for both the command line and Lustre-aware applications, how PFL files will interact with the client IO (CLIO) layer in the Lustre kernel VFS driver, the RPC formats between the client and servers, and the interface to the underlying storage. This document builds upon the reference documents below.<br />
<br />
This design is intended to be comprehensive for both the current PFL Phase 2 implementation and a future PFL Phase 3 implementation. Some use cases therefore describe functionality that will not be implemented as part of PFL Phase 2, but are included so that the overall PFL design is complete, and to ensure that functionality implemented in PFL Phase 2 considers the longer-term implementation goals and will not need to be reworked once PFL Phase 3 implementation starts. Design aspects that are not intended to be implemented in PFL Phase 2 are marked as such in this document or the [[PFL2 Scope Statement]].<br />
<br />
===References===<br />
[[Layout Enhancement High Level Design]]<br />
<br />
[[Progressive File Layouts]]<br />
<br />
[[PFL Prototype Scope Statement]]<br />
<br />
[[PFL Prototype Solution Architecture]]<br />
<br />
[[PFL Prototype High Level Design]]<br />
<br />
[[PFL2 Scope Statement]]<br />
<br />
[[PFL2 Solution Architecture]]<br />
<br />
==Design Overview==<br />
There are three main components to the PFL design:<br />
<br />
* the user-space interfaces for Lustre-specific command-line tools and user library application programming interfaces (APIs)<br />
* changes to the client IO (CLIO) layer IO engine and RPC interfaces to the MDS for accessing and manipulating composite file layouts<br />
* changes to the MDS server to create, modify, and delete composite files<br />
<br />
The design is structured in a top-down manner, starting with the command-line interfaces that users are going to interact with the most, then the user library APIs, the client-side kernel changes for reading, writing, and accessing PFL files, RPCs for creating and modifying composite files, and finally server-side changes. There is also a discussion of protocol and disk format compatibility issues.<br />
<br />
==Client Side Design==<br />
==User Space Interfaces==<br />
===lfs Command-line Interface===<br />
The lfs(1) command-line interface will be extended to understand and manipulate PFL files and their component layouts. lfs is the primary interface for end users to create new files with a specific layout, show the layout of existing files, as well as setting default layout templates on directories that will be inherited by all new files and subdirectories created therein.<br />
<br />
The [[pfl2-lfs-setstripe.1|lfs setstripe(1)]], lfs migrate(1), lfs getstripe(1), and lfs find(1) sub-commands will be extended to set and display the composite layout of a file, and to search for files with specific composite layout parameters or for components that match specific parameters. The added command-line arguments, along with descriptions and examples, are given on a dedicated man page for each sub-command (linked from the command name), so only the synopsis and a brief description of each command is shown here.<br />
<br />
====lfs getstripe====<br />
The lfs getstripe command prints some or all of the parameters of a file's layout. This is intended for regular users and administrators to query a particular file's layout, or the individual components of a composite file to examine the layout used to create the file.<br />
<br />
lfs getstripe [--stripe-count|-c ] [--directory|-d] [--stripe-index|-i]<br />
[--layout|-L] [--mdt-index|-m] [--ost|-O <uuid>] [--pool|-p]<br />
[--recursive|-r] [--raw|-R] [--stripe-size|-S] [--component-start [start]]<br />
[--component-end|-E [end]] [--component-flag|-F [flag]] [--component-id|-I [id]]<br />
[--component-count] [--quiet|-q] [--verbose|-v] {dirname|filename} ...<br />
<br />
Without any of the option flags, this will display all the layout components, as shown below. To limit the display to specific values of the layout, the options are largely the same as for the current lfs getstripe, with new parameters for extracting attributes of composite files, such as the start and end extent of the last instantiated component, the unique component identifier, and the component attribute flags. By default, when requesting specific values of the layout, this will print the parameters of the last instantiated component, since this is the one that affects the current IO behaviour: if a single parameter must be selected to best represent the file, it should come from the component that the file needed at its current size. It is also possible to select a specific component by its offset within the file or by attribute flags, to print specific values from that component of the layout. If multiple component options are specified, such as --component-end=64M and --component-flag=uninit, then lfs getstripe will return the attributes of the component that matches all specified options. If no component matches all specified component options, then nothing will be printed.<br />
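The matching rule for multiple component options can be sketched as a simple AND of the specified criteria; the struct fields, flag values, and filter representation here are simplified for illustration and are not the real lfs data structures.<br />

```c
#include <assert.h>
#include <stdint.h>

/* Simplified component and filter for this sketch; a zero field in the
 * filter means "option not specified". */
struct comp {
	uint32_t id;
	uint32_t flags;
	uint64_t start, end;
};

struct comp_filter {
	uint64_t want_end;     /* --component-end */
	uint32_t want_flags;   /* --component-flag */
};

/* A component is selected only if it matches ALL specified options. */
static int comp_matches(const struct comp *c, const struct comp_filter *f)
{
	if (f->want_end && c->end != f->want_end)
		return 0;
	if (f->want_flags && (c->flags & f->want_flags) != f->want_flags)
		return 0;
	return 1;
}
```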
<br />
Since the output format needs to change for composite files, the output is YAML-formatted, for ease of parsing while remaining human-readable. This is still reasonably similar to the original output format, with the exception of the OST object ID information, which is now more structured for ease of use.<br />
# An output example of a file with 3 components<br />
$ lfs getstripe -v /mnt/lustre/file<br />
"/mnt/lustre/file":<br />
fid: "[0x200000400:0x2c3:0x0]"<br />
composite_header:<br />
composite_magic: 0x0BDC0BD0<br />
composite_size: 536<br />
composite_gen: 4<br />
composite_flags: 0<br />
component_count: 3<br />
components:<br />
- component_id: 1<br />
component_flags: 0<br />
component_start: 0<br />
component_end: 2097152<br />
component_offset: 152<br />
component_size: 56<br />
sub_layout:<br />
lmm_magic: 0x0BD10BD0<br />
lmm_pattern: 1<br />
lmm_stripe_size: 1048576<br />
lmm_stripe_count: 1<br />
lmm_stripe_index: 7<br />
lmm_pool: flash<br />
lmm_layout_gen: 0<br />
lmm_obj:<br />
- 0: { lmm_ost: 7, lmm_fid: "[0x100070000:0x2:0x0]" }<br />
- component_id: 2<br />
component_flags: 0<br />
component_start: 2097152<br />
component_end: 16777216<br />
component_offset: 208<br />
component_size: 128<br />
sub_layout:<br />
lmm_magic: 0x0BD10BD0<br />
lmm_pattern: 1<br />
lmm_stripe_size: 1048576<br />
lmm_stripe_count: 4<br />
lmm_stripe_index: 0<br />
lmm_layout_gen: 0<br />
lmm_obj:<br />
- 0: { lmm_ost: 0, lmm_fid: "[0x100000000:0x2:0x0]" }<br />
- 1: { lmm_ost: 1, lmm_fid: "[0x100010000:0x3:0x0]" }<br />
- 2: { lmm_ost: 2, lmm_fid: "[0x100020000:0x4:0x0]" }<br />
- 3: { lmm_ost: 3, lmm_fid: "[0x100030000:0x4:0x0]" }<br />
- component_id: 4<br />
component_flags: 0<br />
component_start: 16777216<br />
component_end: 18446744073709551615<br />
component_offset: 336<br />
component_size: 176<br />
sub_layout:<br />
lmm_magic: 0x0BD10BD0<br />
lmm_pattern: 1<br />
lmm_stripe_size: 4194304<br />
lmm_stripe_count: 6<br />
lmm_stripe_index: 5<br />
lmm_pool: archive<br />
lmm_layout_gen: 0<br />
lmm_obj:<br />
- 0: { lmm_ost: 5, lmm_fid: "[0x100050000:0x2:0x0]" }<br />
- 1: { lmm_ost: 6, lmm_fid: "[0x100060000:0x2:0x0]" }<br />
- 2: { lmm_ost: 7, lmm_fid: "[0x100070000:0x3:0x0]" }<br />
- 3: { lmm_ost: 0, lmm_fid: "[0x100000000:0x3:0x0]" }<br />
- 4: { lmm_ost: 1, lmm_fid: "[0x100010000:0x4:0x0]" }<br />
- 5: { lmm_ost: 2, lmm_fid: "[0x100020000:0x5:0x0]" }<br />
<br />
====lfs setstripe====<br />
The lfs setstripe command creates a new file with the specified layout parameters, or sets the specified layout parameters as the default layout template on a parent directory.<br />
<br />
lfs setstripe {--component-end|-E end1} [component1_OPTIONS] [{--component-end|-E end2} [component2_OPTIONS] ...] {directory|filename}<br />
lfs setstripe --component-del [--component-id|-I comp_id] [--component-flags|-F flags] filename<br />
lfs setstripe --component-set [--component-id|-I comp_id] {--component-flags|-F flags} filename<br />
<br />
Since this is the primary command-line interface for users creating new files with Lustre-specific layouts, there is a significant number of existing options that can be used. Adding composite-file specific options to lfs setstripe allows the same code to create both files with plain layouts and files with composite layouts, without duplicating a large number of options. The command-line arguments of lfs setstripe are described in detail in the lfs-setstripe(1) man page. The significant changes to these arguments are the addition of the --component-end argument for specifying which component is being modified during file creation, and the ability to specify multiple components on the same command line so that the file does not need to be created piecemeal.<br />
<br />
An example from the man page illustrates the flexibility of the file creation interface:<br />
$ lfs setstripe -E 4M -c 1 --pool flash -E 64M -c 4 -S 4M -E -1 -c -1 -S 16M --pool archive /mnt/lustre/file1<br />
<br />
This creates a file with composite layout in a single operation, rather than building it up one component at a time. The first component has a single stripe that covers [0, 4MiB) and is allocated from an OST in the flash pool. The second component has four stripes that cover [4MiB, 64MiB) and has a stripe size of 4MiB. The last component covers [64MiB, EOF), has a stripe size of 16MiB, and uses all available OSTs in the archive pool.<br />
<br />
Please notice that the '''setstripe options''' in the command line are inheritable, which means the options indicated in a previous component will be used by the following components unless they are changed. For example, if the ''-c'' option appears in the command line as follows:<br />
$ lfs setstripe -E 4M -c 1 -E 8M -E 32M -c 4 -E eof<br />
<br />
It will create the components [0, 4M) and [4M, 8M) with 1 stripe, and [8M, 32M) and [32M, EOF) with 4 stripes. This inheritance applies to all '''setstripe options'''.<br />
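The inheritance rule can be modeled very simply: each new component starts as a copy of the previous component's options, and only explicitly supplied options override the inherited values. A minimal self-contained sketch (hypothetical types and helper, not the actual lfs parser):<br />

```c
#include <stdint.h>
#include <assert.h>

/* Simplified per-component options (a small subset, for illustration). */
struct comp_opts {
    uint64_t end;          /* component end offset */
    int stripe_count;
    uint64_t stripe_size;
};

/*
 * Start a new component by inheriting all options from the previous one;
 * only the end offset is new.  Explicit command-line options (e.g. -c 4)
 * are applied to the copy afterwards.
 */
static struct comp_opts comp_inherit(const struct comp_opts *prev,
                                     uint64_t end)
{
    struct comp_opts c = *prev;

    c.end = end;
    return c;
}
```

For the ''-E 4M -c 1 -E 8M -E 32M -c 4 -E eof'' example above, the second component inherits stripe_count 1, while the third has it explicitly overridden to 4 and the fourth inherits that value.<br />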
<br />
There is work in progress on an explicit ''--parent'' option that resets the previous '''setstripe options''' so that the system default stripe options are used thereafter.<br />
<br />
====lfs migrate====<br />
The lfs migrate command moves a file's data from one (set of) OST(s) to another (set of) OST(s). This is done by copying the file data from the existing source file layout to a new target file layout as specified by the user. Most of the options to lfs migrate are the same as those of lfs setstripe, since lfs migrate also creates a new file layout for the file data to be copied.<br />
<br />
lfs migrate [--component-id|-I comp_id] [OPTIONS] filename<br />
<br />
With the addition of composite files in this project, it needs to be possible to migrate a composite file, or a sub-component of that file, to new OST object(s) using the specified parameters. If a component ID is specified, then only that component should be migrated, and the new component should use the same start and end offsets as the source component so that the source component can be replaced without violating the PFL layout rules.<br />
<br />
====lfs find====<br />
The lfs find command is a Lustre-optimized and enhanced version of the find(1) command. It adds several extended options for matching Lustre-specific parameters of the file layout. It also optimizes file access by avoiding fetching OST object attributes for each file checked if the decision can be made based only on the information initially retrieved from the MDT inode.<br />
<br />
lfs find {directory|filename} ... [[!] --atime|-A [-+]days] [[!] --mtime|-M [-+]days]<br />
[[!] --ctime|-C [+-]days] [--maxdepth|-D depth] [[!] --mdt|-m {mdt_uuid|mdt_index,...}]<br />
[--name|-n pattern] [[!] --ost|-O {ost_uuid|ost_index,...}] [--print|-p] [--print0|-P]<br />
[[!] --size|-s [-+]bytes[kMGTPE]] [[!] --stripe-count|-c [+-]stripes]<br />
[[!] --stripe-index|-i {ost_index,...}] [[!] --stripe-size|-S [+-]bytes[kMG]]<br />
 [[!] --layout|-L {raid0,released,composite}] [--type|-t {bcdflps}]<br />
[[!] --gid|-g|--group|-G {group_name|gid}] [[!] --uid|-u|--user|-U {user_name|uid}]<br />
[[!] --pool pool] [--component-start start] [--component-end|-E end]<br />
 [[!] --component-count [+-]count] [--component-flags|-F flags]<br />
<br />
The existing command is enhanced with the --component-start, --component-end, --component-count and --component-flags options to allow limiting the search criteria to specific extents or components of the file.<br />
<br />
===llapi_layout_comp_* Library API===<br />
<br />
The llapi_layout_* interfaces provide an interface for userspace applications, including lfs, to specify plain and composite file layouts in an abstract manner, and then convert those abstract layouts into actual file layouts depending on the final attributes of the layout. The main data structure for llapi_layout_* functions is struct llapi_layout, which is opaque to userspace, but internally stores all of the attributes of a single plain layout or a single component's sub-layout. For composite file layouts, the API will be extended to handle layouts with multiple components and other composite file specific attributes, for use in Lustre-specific tools such as lfs setstripe, lfs getstripe, and lfs find, as well as by end-user applications or libraries that want to create files with specific composite layouts to optimize file IO patterns, such as HDF5.<br />
<br />
A composite layout can be composed of several layout components, and each component's sub-layout will be described by the opaque data in struct llapi_layout; therefore, a few more fields should be added to the structure:<br />
struct llapi_layout {<br />
uint32_t llot_magic;<br />
uint64_t llot_pattern;<br />
uint64_t llot_stripe_size;<br />
uint64_t llot_stripe_count;<br />
uint64_t llot_stripe_offset;<br />
/** Indicates if llot_objects array has been initialized. */<br />
bool llot_objects_are_valid;<br />
/* Add 1 so user always gets back a null terminated string. */<br />
char llot_pool_name[LOV_MAXPOOLNAME + 1];<br />
/* fields for composite layouts */<br />
+ struct lu_extent llot_extent; /* [start, end) of component */<br />
+ uint32_t llot_id; /* unique ID of component */<br />
+ uint32_t llot_flags; /* LCME_FL_* flags */<br />
+ struct list_head llot_list; /* linked list of llapi_layout components */<br />
struct lov_user_ost_data_v1 llot_objects[0];<br />
};<br />
<br />
* '''llot_extent''': The file extent covered by current component; initially assigned by the caller when defining a layout component.<br />
* '''llot_id''': The numeric ID of current component; this may be assigned internally by the llapi_layout_*() interfaces for identification purposes, but the final component ID assignment is the responsibility of the MDS.<br />
* '''llot_flags''': The flags of current component;<br />
* '''llot_list''': List of all the components of the same composite layout;<br />
<br />
A new pair of interfaces will be introduced to set/get the extent of a layout component. The llapi_layout_comp_extent_get(3) function will fetch the start and end offset of the current layout component, and llapi_layout_comp_extent_set(3) will set the layout extent of a layout currently being constructed, within acceptable parameters for that component.<br />
int llapi_layout_comp_extent_set(struct llapi_layout *layout, uint64_t start, uint64_t end);<br />
int llapi_layout_comp_extent_get(const struct llapi_layout *layout, uint64_t *start, uint64_t *end);<br />
<br />
A new set of interfaces will be introduced to get, set, and clear the attribute flags of a layout component. The llapi_layout_comp_flags_get(3) function gets the attribute flags of the current component. The llapi_layout_comp_flags_set(3) command sets the specified flags of the current component leaving other flags as-is, while llapi_layout_comp_flags_clear(3) clears the flags specified in the flags word leaving other flags as-is.<br />
int llapi_layout_comp_flags_get(const struct llapi_layout *layout, uint32_t *flags);<br />
int llapi_layout_comp_flags_set(struct llapi_layout *layout, uint32_t flags);<br />
 int llapi_layout_comp_flags_clear(struct llapi_layout *layout, uint32_t flags);<br />
<br />
The new llapi_layout_comp_id_get(3) interface fetches the file-unique component ID of the current layout component. <br />
int llapi_layout_comp_id_get(const struct llapi_layout *layout, uint32_t *id);<br />
<br />
A new pair of interfaces will be introduced to add/delete a component to/from the composite layout. The llapi_layout_comp_add(3) command adds the passed layout component comp to the existing composite file layout layout, allowing compound composite layouts to be created at one time. The llapi_layout_comp_del(3) command deletes the specified layout component comp from the composite layout layout.<br />
int llapi_layout_comp_add(struct llapi_layout *layout, struct llapi_layout *comp);<br />
int llapi_layout_comp_del(struct llapi_layout *layout, struct llapi_layout *comp);<br />
<br />
A new interface llapi_layout_comp_get_by_id(3) will be introduced to fetch component(s) by ID if the user or application already knows the ID:<br />
struct llapi_layout *llapi_layout_comp_get_by_id(const struct llapi_layout *layout, uint32_t id);<br />
<br />
A new interface llapi_layout_comp_next(3) will be introduced to iterate over all components of a composite layout, by selecting each component in turn internally, and then allowing different llapi_layout_comp_*() operations on that component layout:<br />
struct llapi_layout *llapi_layout_comp_next(const struct llapi_layout *layout);<br />
<br />
The existing llapi_layout_to_lum() and llapi_layout_from_lum() interfaces should be extended to handle composite layouts; the new user metadata format for composite layouts is defined in [[Layout Enhancement High Level Design]].<br />
/* data structure representing each layout component, defined in "Layout Enhancement HLD" */<br />
struct lov_comp_md_entry_v1 {<br />
__u32 lcme_id; /* unique identifier of component */<br />
__u32 lcme_flags; /* LCME_FL_XXX */<br />
struct lu_extent lcme_extent; /* file extent for component */<br />
__u32 lcme_offset; /* offset of component blob in layout */<br />
__u32 lcme_size; /* size of component blob data */<br />
__u64 lcme_padding[2];<br />
};<br />
<br />
/* On-disk/wire structure of the composite layout, defined in "Layout Enhancement HLD" */<br />
struct lov_comp_md_v1 {<br />
__u32 lcm_magic; /* LOV_MAGIC_COMP_V1 */<br />
__u32 lcm_size; /* overall size of layout including this structure */<br />
__u32 lcm_layout_gen;<br />
__u16 lcm_flags;<br />
__u16 lcm_entry_count;<br />
__u64 lcm_padding1;<br />
__u64 lcm_padding2;<br />
struct lov_comp_md_entry_v1 lcm_entries[0];<br />
};<br />
<br />
 #define lov_user_comp_md lov_comp_md_v1<br />
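To make the lcme_offset and lcme_size fields concrete, the following self-contained sketch computes where each component's sub-layout blob sits inside a lov_comp_md_v1 buffer, assuming blobs are packed back-to-back after the entry array (the helper name and the packing assumption are illustrative, not the actual LOD code; the header/entry sizes used in the comments follow from the struct definitions above, assuming lu_extent is two __u64 fields):<br />

```c
#include <stdint.h>
#include <assert.h>

/* Simplified entry mirroring lov_comp_md_entry_v1 (extent/padding elided). */
struct lcme_entry {
    uint32_t lcme_id;
    uint32_t lcme_offset;  /* offset of component blob in layout */
    uint32_t lcme_size;    /* size of component blob data */
};

/*
 * Hypothetical helper: given each component's sub-layout blob size, fill in
 * lcme_offset for every entry and return the total layout size (lcm_size).
 * In the real on-disk format, hdr_size would be
 * sizeof(struct lov_comp_md_v1) (32 bytes) and entry_size would be
 * sizeof(struct lov_comp_md_entry_v1) (48 bytes).
 */
static uint32_t pack_offsets(struct lcme_entry *ents, int count,
                             uint32_t hdr_size, uint32_t entry_size)
{
    uint32_t off = hdr_size + count * entry_size;

    for (int i = 0; i < count; i++) {
        ents[i].lcme_offset = off;
        off += ents[i].lcme_size;
    }
    return off;
}
```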
<br />
A new interface llapi_layout_file_comp_add(3) will be introduced to add layout components to an existing file. It converts the passed-in layout into a lov_user_comp_md, then issues setxattr() with the special xattr name defined in "Changes on MDS":<br />
int llapi_layout_file_comp_add(const char *path, const struct llapi_layout *layout);<br />
<br />
A new interface llapi_layout_file_comp_del(3) will be introduced to delete component(s) specified by component ID (also accepting LCME_ID_* wildcards) from an existing file:<br />
int llapi_layout_file_comp_del(const char *path, uint32_t id);<br />
<br />
A new interface llapi_layout_file_comp_set(3) will be introduced to change flags or other parameters of the component(s), identified by component ID, of an existing file. The component to be modified is specified by the comp->lcme_id value, which may be either a specific component ID or an LCME_ID_* wildcard value. The new attributes are passed in via comp, and valid specifies which attributes of the component are to be changed. This allows the interface to be extended to set other attributes in the future.<br />
int llapi_layout_file_comp_set(const char *path, const struct llapi_layout *comp, uint32_t valid);<br />
<br />
===User Space API Use Cases===<br />
<br />
Several uses of the llapi_layout_* interfaces are shown as examples, to understand how this new API can be used by user tools.<br />
<br />
Use case 1: Create a file with full layout components<br />
/* Allocate opaque layout structure for the first component */<br />
layout1 = llapi_layout_alloc();<br />
<br />
/* Set [0, 2M) extent to the first component */<br />
rc = llapi_layout_comp_extent_set(layout1, 0, 2M);<br />
<br />
/* Set stripe size of the first component */<br />
rc = llapi_layout_stripe_size_set(layout1, 1M);<br />
<br />
/* Set stripe count of the first component */<br />
rc = llapi_layout_stripe_count_set(layout1, 1);<br />
<br />
/* Allocate opaque layout structure for the second component */<br />
layout2 = llapi_layout_alloc();<br />
<br />
/* Add layout2 into layout1, and layout2 will inherit the stripe size of layout1 */<br />
rc = llapi_layout_comp_add(layout1, layout2);<br />
<br />
/* Set [2M, 256M) extent to the second component */<br />
rc = llapi_layout_comp_extent_set(layout2, 2M, 256M);<br />
<br />
/* Set stripe count of the second component */<br />
rc = llapi_layout_stripe_count_set(layout2, 4);<br />
<br />
/* Repeat above steps to create a layout3 with [256M, EOF) */<br />
...<br />
<br />
/* Create file with the composite layout */<br />
rc = llapi_layout_file_create(path, open_flags, open_mode, layout1);<br />
<br />
Use case 2: Create a file with initial component, and add components later<br />
/* Allocate opaque layout structure for the first component */<br />
layout1 = llapi_layout_alloc();<br />
<br />
/* Set [0, 2M) extent to the first component */<br />
rc = llapi_layout_comp_extent_set(layout1, 0, 2M);<br />
<br />
/* Set stripe size of the first component */<br />
rc = llapi_layout_stripe_size_set(layout1, 1M);<br />
<br />
/* Set stripe count of the first component */<br />
rc = llapi_layout_stripe_count_set(layout1, 1);<br />
<br />
/* Create file with the specified initial component */<br />
rc = llapi_layout_file_create(path, open_flags, open_mode, layout1);<br />
<br />
/* Allocate opaque layout structure for the second component */<br />
layout2 = llapi_layout_alloc();<br />
<br />
/* Set [2M, 256M) extent to the second component */<br />
rc = llapi_layout_comp_extent_set(layout2, 2M, 256M);<br />
<br />
/* Set stripe count of the second component */<br />
rc = llapi_layout_stripe_count_set(layout2, 4);<br />
<br />
/* Add the component layout2 into the file */ <br />
rc = llapi_layout_file_comp_add(path, layout2);<br />
<br />
Use case 3: Traverse all components of a composite layout. This is useful for tools such as lfs getstripe that need to iterate over all components of a file without knowing in advance how many components each file has. During processing, the process using the iterator can decide which components are of interest and print them, or use the component's file-unique component ID to print, modify or delete each component in turn.<br />
/* Get composite layout from existing file */<br />
layout = llapi_layout_get_by_path(path, flags);<br />
<br />
/* Traverse the layout */<br />
comp = layout;<br />
do {<br />
/* Get & print stripe count */<br />
rc = llapi_layout_stripe_count_get(comp, &count);<br />
printf("stripe_count: %llu\n", count);<br />
<br />
/* Get & print stripe size */<br />
rc = llapi_layout_stripe_size_get(comp, &size);<br />
printf("stripe_size: %llu\n", size);<br />
<br />
/* Get & print layout pattern */<br />
rc = llapi_layout_pattern_get(comp, &pattern);<br />
printf("stripe_pattern: %llx\n", pattern);<br />
<br />
comp = llapi_layout_comp_next(comp);<br />
} while (comp != layout);<br />
<br />
===Client-IO Interface===<br />
This design is based on the PFL Prototype High Level Design, which demonstrated the feasibility of the PFL concept. In this design, we add further details and address a few problems discovered during the prototype phase, so that PFL can be used in production.<br />
<br />
The PFL prototype design addressed the problems of object mapping, the setup of in-memory cl_objects from layout components, and a framework to support fundamental I/O operations. Operations like read, write, setattr, and glimpse work properly. Two additional problems have to be solved to make PFL more useful: the first is creating layout components on demand, and the second is a performance optimization. These are part of the full PFL design, but the work in this part will not be implemented in the PFL Phase 2 project.<br />
<br />
For the PFL Phase 2 implementation, the Client IO layer will be able to interpret existing composite file layouts, but will return an -ENODATA error to the application until the Phase 3 handling of ll_layout_intent() to dynamically fetch and instantiate layout components from the MDS is implemented.<br />
<br />
===Create Layout Component on Demand===<br />
The prototype phase doesn't have the functionality to create layout components on demand. If an application writes to a file extent without a layout component defined, clients simply return an error to the application. This implies that the user has to know the layout components in advance, and has to understand the application's access pattern well enough that each layout component can be created before it is written.<br />
<br />
To support creating layout components on demand, the administrator can associate a PFL file with a layout template, which describes the number of stripes to be created for each range of file extents, along with other parameters such as stripe size, OST pool, etc. If a file extent is written without a layout component defined, the client sends a dedicated RPC, named the layout intent RPC ([[PFL2_Solution_Architecture#A13._Application_writes_within_a_uninitialized_file_component_.28Phase_3.29|A13 in PFL Phase 2 Solution Architecture]]), to the MDT. The MDT then uses the information in the layout template to allocate the corresponding OST objects to form a layout component, and that component is appended to the file's layout. After this is done, the client can fetch the new layout and proceed with the I/O. No error should be seen by applications under normal circumstances, though errors are possible at this point due to environmental factors such as -ENOSPC, -EIO, -ENOMEM or others that may occur when the MDS creates new OST objects and modifies the file layout on disk.<br />
<br />
The following diagram describes the process of creating layout component on demand:<br />
[[Image:pfl2_write_flow_chart.png||Flow chart for PFL write]]<br />
<br />
In the above diagram, 'File Update' can be any operation that modifies file contents, such as file write, mkwrite, and truncate. Read-only operations won't trigger layout component allocation. Reading file extents with undefined layout components will simply return zero-filled buffers.<br />
<br />
As shown in the diagram, the LOV layer, which is the only module in CLIO that can understand layouts, checks whether the intended write region has a layout defined. If there is no layout defined, it aborts and invokes ll_layout_intent(), which sends a layout intent RPC to the MDT. As mentioned in the design of the PFL prototype phase, CLIO splits I/O at layout component boundaries; therefore, if only part of the I/O region has no layout defined, it will first finish the I/O in the region with a layout defined, and then request the new layout.<br />
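The component-boundary splitting can be sketched as follows (hypothetical types and function, not the actual lov_io code), assuming each sub-I/O is the intersection of the I/O range with one component extent:<br />

```c
#include <stdint.h>
#include <assert.h>

struct extent { uint64_t start, end; };  /* [start, end) */

/*
 * Split the I/O range [io_start, io_end) at component boundaries: each
 * sub-I/O is the intersection of the I/O range with one component extent,
 * so no sub-I/O crosses a component boundary.  Returns the number of
 * sub-I/Os written into 'sub'.
 */
static int split_io(const struct extent *comps, int ncomps,
                    uint64_t io_start, uint64_t io_end, struct extent *sub)
{
    int n = 0;

    for (int i = 0; i < ncomps; i++) {
        uint64_t s = io_start > comps[i].start ? io_start : comps[i].start;
        uint64_t e = io_end < comps[i].end ? io_end : comps[i].end;

        if (s < e)
            sub[n++] = (struct extent){ s, e };
    }
    return n;
}
```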
<br />
In the cl_io data structure, a new bit is defined to mark that the I/O failed due to a missing layout component:<br />
struct cl_io {<br />
...<br />
/* true if this io failed due to missing layout */<br />
unsigned int ci_no_layout:1;<br />
...<br />
}<br />
<br />
If LOV detects that the update I/O can't be finished due to a missing layout, it sets ci_no_layout and aborts the I/O with error code -ENODATA. vvp_io_fini() should check ci_no_layout and then compose a layout intent RPC and send it to the MDT. The layout intent RPC is an LDLM enqueue RPC with an intent operation; the payload of the intent operation is as follows:<br />
struct layout_intent {<br />
__u32 li_opc; /* intent operation for enqueue, read, write etc */<br />
__u32 li_flags;<br />
__u64 li_start;<br />
__u64 li_end;<br />
};<br />
<br />
In order to request a new component, li_opc is set to LAYOUT_INTENT_WRITE, and li_start and li_end are set to the full range of the intended I/O. The MDT may decide to create multiple layout components within this range at its discretion. Later, CLIO will fetch the new layout and continue the I/O from where it stopped.<br />
===Flushing Cached Pages Wisely under Layout Change===<br />
<br />
The client-side page cache may need flushing whenever the file layout is changed. This works well for HSM and file migration, because no pages remain valid after those operations change the file layout. However, for PFL files the cached pages should still be valid after a layout change, because layout components are only appended to the existing layout, and do not affect the existing components or their data. It could hurt application performance significantly if all pages were evicted from the client cache and then read back again for each new component added to the file.<br />
<br />
PFL uses layered generations to identify the layout generations of each file and each component. One is the layout generation at the whole-file level; this generation changes after a layout component is appended. The other is the per-component generation, which remains unchanged when new file components are appended.<br />
<br />
This feature will be used to facilitate client page cache management. When clients detect a layout change at the LOV layer, the LOV will further check the generations on layout components, and it will only flush pages for the newly added or modified layout components, which is an empty operation for PFL because layout components won’t be altered once created.<br />
<br />
One tricky case worth mentioning is that layout component generations are comparable only if the files' layout generations match. Even if it's known that the file's layout generation increased due to layout component addition, the two layouts are still conceptually unrelated. Therefore, clients must compare not only component generations but also OST indices and objects to decide whether components are unchanged, to avoid unnecessary page eviction.<br />
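A minimal sketch of that comparison (hypothetical structure and helper; the real check lives in the LOV layer): a cached component's pages may be kept only if a component with the same ID still appears in the new layout with the same generation and the same OST assignment:<br />

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Simplified view of a layout component for the comparison. */
struct comp_ver {
    uint32_t id;       /* file-unique component ID */
    uint32_t gen;      /* per-component layout generation */
    uint32_t ost_idx;  /* starting OST index (stands in for the objects) */
};

/*
 * Return true if the cached component 'old' is unchanged in the new
 * layout, i.e. its cached pages need not be evicted.
 */
static bool comp_unchanged(const struct comp_ver *old,
                           const struct comp_ver *cur, int ncur)
{
    for (int i = 0; i < ncur; i++) {
        if (cur[i].id == old->id)
            return cur[i].gen == old->gen &&
                   cur[i].ost_idx == old->ost_idx;
    }
    return false;  /* component no longer exists: flush its pages */
}
```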
<br />
In order to accomplish this, a pair of range parameters will be added into cl_object_prune() to indicate that only a subset of pages are being evicted. A callback may be provided by LOV later to check if an individual page should be evicted due to layout change.<br />
<br />
===Dynamic Layout Request===<br />
In the PFL Prototype, layout components must have been defined and instantiated before I/O starts; otherwise, applications trying to access an uninstantiated or undefined component receive an -ENODATA error. With dynamic layout requests supported, clients are able to instantiate layout components on demand. The layout intent RPC is used to request an instantiated layout component from the MDT.<br />
<br />
The RPC format of layout intent:<br />
static const struct req_msg_field *ldlm_intent_layout_client[] = {<br />
&RMF_PTLRPC_BODY,<br />
&RMF_DLM_REQ,<br />
&RMF_LDLM_INTENT,<br />
&RMF_LAYOUT_INTENT,<br />
&RMF_EADATA /* for new layout to be set up */<br />
};<br />
<br />
struct req_msg_field RMF_LAYOUT_INTENT =<br />
DEFINE_MSGF("layout_intent", 0,<br />
sizeof(struct layout_intent), lustre_swab_layout_intent,<br />
NULL);<br />
EXPORT_SYMBOL(RMF_LAYOUT_INTENT);<br />
<br />
/* enqueue layout lock with intent */<br />
struct layout_intent {<br />
__u32 li_opc; /* intent operation for enqueue, read, write etc */<br />
__u32 li_flags;<br />
__u64 li_start;<br />
__u64 li_end;<br />
};<br />
<br />
The layout intent RPC is essentially an LDLM intent RPC. In it, struct layout_intent carries the necessary layout information to the MDT: the range of the required component, and which operation it expects the MDT to execute. With the information in the preset layout template, the MDT should be able to create the corresponding component. The MDT should instantiate all components within the range [li_start, li_end). If the MDT successfully instantiates a component, it increments the file's layout generation number by 1.<br />
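The MDT-side behavior can be sketched as follows (hypothetical structure and helper; the real logic sits in the LOD layer and also allocates OST objects): every uninstantiated component overlapping [li_start, li_end) is instantiated, and the file layout generation is incremented once per instantiated component:<br />

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

struct mdt_comp {
    uint64_t start, end;   /* component extent [start, end) */
    bool instantiated;
};

/*
 * Instantiate every uninstantiated component overlapping the intent range
 * [li_start, li_end).  Returns the new file layout generation.
 */
static uint32_t instantiate_range(struct mdt_comp *c, int n,
                                  uint64_t li_start, uint64_t li_end,
                                  uint32_t layout_gen)
{
    for (int i = 0; i < n; i++) {
        if (c[i].instantiated)
            continue;
        if (c[i].start < li_end && c[i].end > li_start) {
            c[i].instantiated = true;  /* real code allocates OST objects */
            layout_gen++;
        }
    }
    return layout_gen;
}
```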
<br />
===When to send layout intent in CLIO stack===<br />
In the CLIO stack, the LOV layer is the only place where the layout can be understood and interpreted. When an I/O is initiated, the LOV layer splits the file-level I/O range into sub-I/Os based on component information, and makes sure that no sub-I/O crosses a component boundary. The lov_io_iter_init() function is where it checks whether a component is instantiated.<br />
<br />
If a component is not instantiated and this I/O is a write operation, LOV sets a flag in the cl_io data and returns to the llite layer with error -ENODATA. The llite layer realizes that new components are required to complete this I/O, so it invokes ll_layout_request() to send the layout intent RPC. If everything goes well, ll_layout_request() fetches and applies the new layout to the CLIO stack. Finally, the llite layer restarts the I/O, which should now be able to move forward. Since the components are instantiated only once, and are added only at the end of the file, layout changes for PFL files cannot affect in-flight I/O operations.<br />
<br />
So far, the operations that would trigger a layout intent request are the VFS write() family of calls, truncate(), and page ->mkwrite(). In the future, other interfaces such as fallocate() would also need to handle uninstantiated components in a similar manner.<br />
<br />
==Server-side RPC Interface and IO code paths==<br />
The MDS should provide an interface that can handle a client's requests to populate a new layout and modify an existing layout. We can consider a few different options for going through the various layers in the metadata stack, from the client to the underlying storage on the MDT that holds the layout itself:<br />
<br />
#a new DT method or set of methods<br />
## will be used only by the Logical Object Device (LOD) on the MDS<br />
## unit testing (talking directly to LOD) would need additional infrastructure<br />
# use setxattr() with special xattr names<br />
## lustre.lov.add = { components to add to the composite layout }<br />
## lustre.lov.del = { delete the component ID in the value }<br />
## lustre.lov.set = { set component flags }<br />
## lustre.lov.clear = { clear component flags }<br />
## no need for extra infrastructure to implement unit testing<br />
## no new special methods<br />
# extend ->dbo_punch() method<br />
## a flag to populate range: assign new objects<br />
## a flag to depopulate: release the objects<br />
## probably not enough to change flags (out-of-sync stripe?)<br />
# totally crazy idea - layout as an index<br />
## this is what it is in essence<br />
## range/offset as a key<br />
## object+status as a value<br />
<br />
Several aspects of the xattr interface (#2) are of interest, which led to selecting the xattr method for the PFL implementation.<br />
<br />
The first and foremost reason for selecting the xattr interface is that it allows adding the ability to modify existing layout xattrs without a significant change in the number of RPC types, without adding new dedicated server APIs that will only be used for composite files, and without introducing new userspace APIs that may have portability issues. This simplifies the understanding of the code and keeps the complexity growth in check for the future, and doesn't sacrifice flexibility.<br />
<br />
A secondary reason for selecting the xattr interface for managing the layouts on the client is that with IO forwarders such as IBM's CIOD it is difficult to pass ioctl() commands from the compute nodes where applications run to the IO nodes where the Lustre client runs. This needs a special handler for every ioctl() command and quickly becomes a maintenance headache. However, the getxattr() and setxattr() interfaces already exist in such environments and provide generic key=value methods that can work with arbitrary key names and values between the llapi_layout(7) library commands and the Lustre client, over the network to the MDS, and from the MDS down to the underlying MDT storage. It is not expected that applications or users will interact directly with the file components using getxattr() and setxattr(), but only via the llapi_layout interfaces.<br />
<br />
==Use Cases for getxattr() and setxattr() interfaces==<br />
In order to manipulate the file layout held in the lustre.lov xattr, getxattr() and setxattr() (or fgetxattr() and fsetxattr() for operations on already-open file descriptors) will manipulate virtual xattrs with names such as lustre.lov.add, lustre.lov.del, lustre.lov.set, and lustre.lov.clear. These can interface transparently from userspace on the client with the kernel on the client, or with the MDS as needed. Below we look at the use cases from the PFL 2 Solution Architecture to verify that these xattr interfaces can meet the requirements set out in that document.<br />
<br />
===U01. User creates new file with fixed-size layout component===<br />
setxattr("lustre.lov", <binary composite file description>);<br />
The LOD on the MDS parses the layout, which should contain a component that has an extent start of zero, and adds the component as described and populates it with OST objects. The MDS will always allocate the first component's objects, to avoid the immediate overhead of another RPC and layout lock cancellation for the normal pattern of file create followed by file write.<br />
<br />
===U02. User adds component(s) with fixed-size extent(s) to an existing composite file===<br />
setxattr("lustre.lov.add", <binary component description>);<br />
The LOD on the MDS parses the layout, sanity checks the components against the existing layout (S03, S04) and against each other, and adds the component(s) as described to the file, essentially the same as U01. The binary component description is largely self-describing, so it may contain one or more components that are added to the existing file layout, if any. The LOD will assign file-unique component IDs as necessary, which may be different than those the client generated while creating the layout in memory. If the client did not set the OBD_CONNECT_PFL_DYNAMIC flag at connection time (a PFL 2 client, see Client-MDS Protocol below), then it is not capable of dynamically requesting that layout components be instantiated, in which case the MDS will allocate objects for all components.<br />
<br />
===U03. User adds final component to existing composite file===<br />
setxattr("lustre.lov.add", <binary component description>);<br />
LOD parses the layout, adds the component as supplied, and populates it with OST objects (for PFL 2 clients, only if the OBD_CONNECT_PFL_DYNAMIC flag was not set at connection time; see Client-MDS Protocol below). This is also the same as U01 and U02, with the only difference being that the supplied component has an extent that ends at 2<sup>64</sup>-1 bytes.<br />
<br />
===U04. User requests the layout for an existing file===<br />
getxattr("lustre.lov");<br />
If the client is fetching the full layout xattr, then it can use the same getxattr interface as is used today by existing tools with no additional processing of the xattr or layout. This ensures that utilities such as tar(1) and others that already save and restore Lustre file layouts continue to work properly with composite files.<br />
getxattr("lustre.lov.extent.<start-end>");<br />
The LOD on the MDS and/or the LOV on the client parses the layout and returns the component(s) covering the range [start, end). Since the components are self-describing, containing the component_id, component_start, and component_end, they can be returned directly to the caller and handled there.<br />
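A small sketch of how such a virtual xattr name could be parsed (the name syntax is taken from this document; the helper function and its return convention are illustrative only):<br />

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

/*
 * Parse "lustre.lov.extent.<start-end>" into a [start, end) range.
 * Returns 0 on success, -1 if the name doesn't match or the range is empty.
 */
static int parse_extent_xattr(const char *name, uint64_t *start, uint64_t *end)
{
    static const char prefix[] = "lustre.lov.extent.";
    unsigned long long s, e;

    if (strncmp(name, prefix, sizeof(prefix) - 1) != 0)
        return -1;
    if (sscanf(name + sizeof(prefix) - 1, "%llu-%llu", &s, &e) != 2)
        return -1;
    if (s >= e)
        return -1;
    *start = s;
    *end = e;
    return 0;
}
```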
<br />
===U05. User gets layout parameters to existing component by ID===<br />
getxattr("lustre.lov.id.<component_id>");<br />
The LOD on the MDS and/or the LOV on the client parses the layout, finds component with id=<component id> and returns it to the caller.<br />
<br />
===U06. User accesses or modifies components in an existing file===<br />
setxattr("lustre.lov.set.<valid>", <binary component description>);<br />
LOD iterates over the components in the file, applies changes to the individual components passed in as the binary component value, and sets the fields in the component as specified by valid.<br />
<br />
===U07. User deletes composite file===<br />
The LOD iterates over the components on the MDS, destroying the individual OST objects using the existing RPC and recovery methods before deleting the MDT inode.<br />
<br />
===U08. User creates a new composite file describing multiple components===<br />
setxattr("lustre.lov", <binary composite file description>);<br />
This is essentially the same as U01 but the composite file description contains multiple components.<br />
<br />
===U09. User migrates composite file===<br />
fsetxattr(source_fd, "lustre.lov.swap.<component_id>", <volatile_fid>);<br />
The existing mechanism to swap whole file layouts should be usable without modification, as it is simply copying the layout xattrs and doesn't actually look into the layouts. This allows existing tools that may use the llapi_[f]swap_layouts() interface, such as HSM copytools, to continue to be able to manage whole-file layout changes. If the user is migrating a single component from a composite file, then the data copy step is largely the same, except it will only copy data covered by the source extent [start,end) into the target file before swapping the layout from the source component to the temporary target file. The design of composite file layouts ensures that the logical file offsets of the source file and the target files are the same, so no special support is needed in userspace to do the data copy. The volatile_fid is accessible on the MDS while the migration tool keeps the volatile file open, though the MDS needs to verify that the file matching <volatile_fid> has been opened by the client for write, to ensure the client has file write permission on the file, since it is not possible to pass two open file descriptors to the fsetxattr() syscall.<br />
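For illustration, a userspace migration tool might assemble the virtual xattr name like this. This is a sketch only: the name format follows U09 above, while the FID string encoding and the final fsetxattr() call (which require a Lustre mount) are shown as an assumed usage pattern in the comment:<br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Build the per-component swap xattr name described in U09. */
static int swap_xattr_name(char *buf, size_t len, uint32_t component_id)
{
	return snprintf(buf, len, "lustre.lov.swap.%u", component_id);
}

/* Assumed usage against a Lustre mount (not runnable here):
 *
 *	char name[64], fidstr[64];
 *	swap_xattr_name(name, sizeof(name), 3);
 *	snprintf(fidstr, sizeof(fidstr), "0x%llx:0x%x:0x%x", seq, oid, ver);
 *	fsetxattr(source_fd, name, fidstr, strlen(fidstr), 0);
 */
```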
<br />
===U10. User searches for composite files===<br />
getxattr("lustre.lov.header");<br />
This will return the composite file layout header containing layout type, number of components, etc. This can reduce the amount of information going over the network and up to userspace, so that the caller can allocate a sufficiently large buffer to hold the full layout.<br />
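A client can therefore size its buffer from the header before fetching the full layout. The sketch below, using the struct lov_comp_md_v1 fields defined later in this document (the magic value here is illustrative, not authoritative), extracts the overall layout size from the returned header bytes:<br />

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Leading fields of struct lov_comp_md_v1 as defined later in this doc. */
struct lov_comp_md_v1_hdr {
	uint32_t lcm_magic;
	uint32_t lcm_size;		/* overall size of the full layout */
	uint32_t lcm_layout_gen;
	uint16_t lcm_flags;
	uint16_t lcm_entry_count;
};

#define LOV_MAGIC_COMP_V1 0x0BD60BD0	/* illustrative value */

/* Given bytes returned by getxattr("lustre.lov.header"), return the buffer
 * size needed to hold the full layout, or 0 if not a composite layout. */
static uint32_t layout_buf_size(const void *buf)
{
	struct lov_comp_md_v1_hdr hdr;

	memcpy(&hdr, buf, sizeof(hdr));	/* avoid alignment assumptions */
	return hdr.lcm_magic == LOV_MAGIC_COMP_V1 ? hdr.lcm_size : 0;
}
```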
<br />
===U11. User specifies default composite file template for directory (Phase 3)===<br />
setxattr("lustre.lov", <binary composite file template>);<br />
The setxattr() operation would be done on the parent directory, in a similar manner that it is done today, only with different xattr contents. The composite layout template is stored on the parent directory as is done today for plain layout templates, storing only struct lov_mds_md_v1 with the required fields set, and not storing any of the file stripes in lmm_objects. <br />
===U12. Administrator specifies filesystem-wide composite file template for root directory (Phase 3)===<br />
Same as U11, except the operation takes place on the root directory and affects all new files.<br />
===U13. User deletes uninstantiated stripe component from file by ID (Phase 3)===<br />
setxattr("lustre.lov.del", <component id>);<br />
LOD parses the layout, finds the component with ID=<component id> and, if it is uninstantiated (has no object assigned), deletes it.<br />
<br />
===U14. User deletes uninstantiated stripe components from file by flags (Phase 3)===<br />
setxattr("lustre.lov.del", <component id>);<br />
To delete all uninstantiated components from the file, the enum lcme_id wildcard value LCME_ID_UNINIT can be passed as the component ID.<br />
For more complex operations, it is more practical to fetch the entire layout to the client, iterate over the components in userspace, and then perform the pattern matching in an arbitrarily complex manner to determine which component IDs to remove or modify. While there is some extra overhead in deleting or modifying components individually, it is impractical to have a complex query and update interface embedded in the MDS for this. Otherwise, there may be an explosion of different matching criteria that need to be added (e.g. components within these extents, have specific flags set, that have specific layout generations, stripe sizes, etc). <br />
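A minimal example of such userspace matching, assuming an already-decoded list of component IDs and flags (the LCME_FL_INIT flag name and value are assumptions for illustration), collects the IDs of uninstantiated components for subsequent per-ID deletion:<br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LCME_FL_INIT 0x00000010	/* assumed flag: component has objects */

struct comp_info {
	uint32_t id;
	uint32_t flags;
};

/* Select component IDs to delete: here, all uninstantiated components,
 * i.e. those without LCME_FL_INIT set.  More elaborate predicates (extent,
 * stripe size, layout generation) would slot in the same way. */
static size_t match_uninit(const struct comp_info *c, size_t n, uint32_t *out)
{
	size_t found = 0;

	for (size_t i = 0; i < n; i++)
		if (!(c[i].flags & LCME_FL_INIT))
			out[found++] = c[i].id;
	return found;
}
```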
<br />
==Server-side Composite Layout Handling==<br />
===Composite File Layout Handling===<br />
The server needs to interpret and handle the virtual xattr values that are sent from the client. In order to avoid namespace collisions and potential abuse by users, the virtual xattr keywords such as .add, .del, .swap, etc. are only interpreted for specific Lustre xattrs such as lustre.lov, and potentially lustre.lmv in the future. The lustre.lov xattr is already handled specially on the client, since it cannot be set if the file already has a layout, so this will not add significant complexity on the client or server. Because xattrs are read and written as a single unit, any modification to the xattr needs to load the existing layout xattr into memory, modify it as requested by the client, and then store it back to the MDT inode object. This will be handled by the LOD layer of the MDS, since it is the software module that interprets the file layout, and also has access to the MDT OSD to load and store the xattr contents.<br />
<br />
The LOD must verify incoming lustre.lov layout xattrs, whether a whole composite layout is being sent, or an incremental update is being made to a layout component. Until other composite layout features such as File-Level Replication and partial HSM restore are implemented, the layout checks done by the MDS will be PFL-specific. These include verifying that components have adjacent extents, so that there are no holes in the layout (S03), that the layouts do not overlap (S03, S04) or if they do that they describe identical components (S09.1), and that they do not specify attribute flags that are controlled by the server. Fields such as lcme_offset and lcme_id that are server managed will be ignored and overwritten by the server.<br />
<br />
The composite layout header contains a generation value, lcm_layout_gen, that is updated by the MDS when the composite layout is changed. In order to ensure that component IDs within a file are unique, the lcme_id assigned to a new component added to the layout will be the lcm_layout_gen of the composite file. An lcme_id of 0 is reserved, and indicates that the ID is unassigned for this component or no specific component is requested, and will not be used by the MDS for any components in a file.<br />
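One plausible reading of this ID assignment rule can be sketched as follows (illustrative only; the exact increment-and-assign ordering is an implementation detail of the MDS):<br />

```c
#include <assert.h>
#include <stdint.h>

struct layout_state {
	uint32_t layout_gen;	/* lcm_layout_gen of the composite file */
};

/* Assign a new component ID from the layout generation, bumping the
 * generation and skipping the reserved "unassigned" ID 0. */
static uint32_t assign_component_id(struct layout_state *ls)
{
	ls->layout_gen++;
	if (ls->layout_gen == 0)	/* wrapped: 0 is reserved */
		ls->layout_gen = 1;
	return ls->layout_gen;
}
```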
<br />
As these checks of the incoming layout and the update of the lustre.lov xattr on disk need to be serialized, these operations will be serialized by the layout lock (MDS_INODELOCK_LAYOUT) on the inode.<br />
<br />
===Layout Intent Lock Handling===<br />
For PFL, there are two kinds of request that could cause layout change. One is from the command line that appends or changes components manually; another one is from CLIO stack after dynamic layout intent is supported. Both kinds of request will end up with invocation of setxattr() on the MD stack. The MDT has to hold LCK_EX mode of the MDS_INODELOCK_LAYOUT lock when it calls setxattr() to do actual changes to the layout. Not only does the holding of the layout lock on the MDT inode object serialize updates between multiple threads on the MDS, it also serves to revoke the layout lock from all clients that have been granted this lock. This revocation will invalidate the cached file layout from the clients, and cause them to refresh the layout on their next IO operation.<br />
<br />
==Feature Compatibility and Interoperability==<br />
===File Layout Compatibility===<br />
The PFL composite layout is incompatible with the existing Lustre file layout, though the individual layout components will re-use the existing lov_mds_md_v1 and lov_mds_md_v3 RAID-0 layouts. Non-PFL clients will receive an EIO error when accessing a composite file. Accessing plain (non-composite) files in the same filesystem will continue to work for both PFL and non-PFL clients. It is not possible to translate the PFL file layouts into a layout that the older clients will understand. Older servers will refuse to try and create a file with a PFL layout, due to the new magic value stored at the start of each layout.<br />
<br />
===MDT On-Disk Format===<br />
The PFL composite layout stored on disk will continue to use the trusted.lov xattr name and will be stored directly in the MDT inode, if space permits, to maximize performance. The existing maximum limits on xattr sizes will not be changed as part of this project. For both ZFS and ldiskfs backing filesystems the on-disk xattr size is not the limiting factor for determining the maximum stripe count of a file, but rather the RPC size limits.<br />
<br />
The MDS itself needs to understand the new struct lov_comp_md_v1 layout format described in [[Layout Enhancement High Level Design#2.1. Composite Layouts|Layout Enhancement HLD Composite Layouts]], in order to unlink the OST objects within that file when it is deleted, or change the ownership of a file's OST objects when the file ownership changes.<br />
<br />
The [https://wiki.hpdd.intel.com/display/opensfs/LFSCK2+High+Level+Design Lustre File System Check (LFSCK)] tool also needs to understand struct lov_comp_md_v1 in order to accurately determine the relationship between an MDT inode and all the OST objects where it stores its data. This can leverage the same composite file layout iteration that the MDS is using for file unlink, setattr, and other operations that affect all of the OST objects on a file.<br />
<br />
====MDT Default File Layout Templates====<br />
The file layout template is an uninstantiated file layout that is initially stored on a parent directory, or on the filesystem root directory, and provides the default layout for new files that do not otherwise have a specific layout assigned at file creation time. When a new file or directory is first created, it inherits the layout template from the parent directory in which it was created, or if the parent directory has no template then it is inherited from the filesystem root directory. Once assigned to the new file, the layout is stored with the MDT inode on disk and is instantiated as needed for that file.<br />
<br />
The layout template itself for a plain file is simply struct lov_mds_md_v1, or struct lov_mds_md_v3 if an OST pool is in use, without any of the OST objects allocated for it (i.e. the lmm_objects[] array is unused). The plan for composite file templates will be similar - a layout template for a 3-component file would consist of the composite header template struct lov_comp_md_v1 along with three separate pairs of component entries and uninstantiated sub-layout templates, namely struct lov_comp_md_entry_v1 and the accompanying struct lov_mds_md_v1 without any OST objects allocated.<br />
struct lu_extent {<br />
__u64 e_start;<br />
__u64 e_end;<br />
};<br />
<br />
struct lov_comp_md_entry_v1 {<br />
__u32 lcme_id; /* unique identifier of component */<br />
__u32 lcme_flags; /* LCME_FL_XXX */<br />
struct lu_extent lcme_extent; /* file extent for component */<br />
__u32 lcme_offset; /* offset of component blob in layout */<br />
__u32 lcme_size; /* size of component blob data */<br />
__u64 lcme_padding[2];<br />
};<br />
<br />
struct lov_comp_md_v1 {<br />
__u32 lcm_magic; /* LOV_MAGIC_COMP_V1 */<br />
__u32 lcm_size; /* overall size of layout including this structure */<br />
__u32 lcm_layout_gen;<br />
__u16 lcm_flags;<br />
__u16 lcm_entry_count;<br />
__u64 lcm_padding1;<br />
__u64 lcm_padding2;<br />
struct lov_comp_md_entry_v1 lcm_entries[0];<br />
};<br />
<br />
struct lov_ost_data_v1 { /* per-stripe data structure (little-endian)*/<br />
struct ost_id l_ost_oi; /* OST object ID */<br />
__u32 l_ost_gen; /* generation of this l_ost_idx */<br />
__u32 l_ost_idx; /* OST index in LOV (lov_tgt_desc->tgts) */<br />
};<br />
<br />
struct lov_mds_md_v1 { /* LOV EA mds/wire data (little-endian) */<br />
__u32 lmm_magic; /* magic number = LOV_MAGIC_V1 */<br />
__u32 lmm_pattern; /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */<br />
struct ost_id lmm_oi; /* LOV object ID */<br />
__u32 lmm_stripe_size; /* size of stripe in bytes */<br />
/* lmm_stripe_count used to be __u32 */<br />
__u16 lmm_stripe_count; /* num stripes in use for this object */<br />
__u16 lmm_layout_gen; /* layout generation number */<br />
struct lov_ost_data_v1 lmm_objects[0]; /* per-stripe data */<br />
};<br />
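As a quick sanity check of the sizes used in the space calculations in this section, the structures above compile directly with fixed-width stand-ins for the __uXX types. The two flexible arrays (lcm_entries[0], lmm_objects[0]) are omitted since they do not contribute to sizeof, and the 16-byte ost_id is an assumption:<br />

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins; a 16-byte ost_id is assumed. */
struct ost_id { uint64_t oi_id; uint64_t oi_seq; };
struct lu_extent { uint64_t e_start; uint64_t e_end; };

struct lov_comp_md_entry_v1 {
	uint32_t lcme_id;
	uint32_t lcme_flags;
	struct lu_extent lcme_extent;
	uint32_t lcme_offset;
	uint32_t lcme_size;
	uint64_t lcme_padding[2];
};

struct lov_comp_md_v1 {		/* lcm_entries[0] omitted */
	uint32_t lcm_magic;
	uint32_t lcm_size;
	uint32_t lcm_layout_gen;
	uint16_t lcm_flags;
	uint16_t lcm_entry_count;
	uint64_t lcm_padding1;
	uint64_t lcm_padding2;
};

struct lov_ost_data_v1 {
	struct ost_id l_ost_oi;
	uint32_t l_ost_gen;
	uint32_t l_ost_idx;
};

struct lov_mds_md_v1 {		/* lmm_objects[0] omitted */
	uint32_t lmm_magic;
	uint32_t lmm_pattern;
	struct ost_id lmm_oi;
	uint32_t lmm_stripe_size;
	uint16_t lmm_stripe_count;
	uint16_t lmm_layout_gen;
};
```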
<br />
Unfortunately, the size of a 3-component layout template, even without any OST objects allocated, is larger than can fit into the currently 512-byte ldiskfs inodes' free space, as can be seen in the diagram below. There are approximately 180 bytes of free space in the 512-byte inode (depends on the length of the filename and if there are multiple hard links to a file), but a 3-component template layout is 268 bytes in size. Even with aggressive reduction of the size of the lov_comp_md_v1, lov_comp_md_entry_v1, and lov_mds_md to remove all fields that are not strictly necessary, the 3-component template would still be too large to fit into the directory inode, let alone on an actual file using this template with at least one allocated OST object. If the xattr is too large for the in-inode space, for example a plain RAID-0 file with more than 5 stripes, then the layout xattr is stored in a separate data block. Storing the layout xattr outside the inode may incur significant performance penalties, due to an extra seek for every inode access in order to fetch the layout xattr into memory, so this is undesirable for normal usage.<br />
<br />
One option to avoid the overflow of the in-inode xattr space would be to store only a single-component layout on the file, which would fit within the available 180-byte space, and inherit the rest of the components from the parent directory as the file size grows to need these components. This would be desirable from the point of view of minimizing the overhead for small files, which can make up a large fraction of all files in HPC filesystems. However, this also adds complexity to the PFL code and usage, since the inode is not guaranteed to have the same parent directory, and hence a different layout template, when the time comes to extend the file beyond the first component. This may lead to inconsistent or sub-optimal layout components if the file is renamed, or the default layout of a directory or the filesystem is modified, and the new directory layout template is incompatible with the existing component(s) on the file due to overlapping layout extents.<br />
<br />
Even if a single-component file layout could fit in the inode xattr space, the composite layout template still couldn't fit into the parent directory's inode. However, since there are normally far fewer directories than files, and directory leaf blocks are themselves likely to be allocated only one block at a time, the external xattr block would not be as high an overhead, and the one xattr read overhead would normally be amortized over the creation of many files within that directory that use the same layout template.<br />
<br />
Another option is to format the MDT with larger 1024-byte inodes by default, to ensure there is enough space for not only the composite layout or template, but also for other xattrs such as SELinux labels, ACLs, etc. This has the drawback that the MDT will need to be reformatted for 1024-byte inodes to maximize PFL performance, and each inode will take twice as much space on disk and in memory, which may also impact metadata performance. The external-xattr overhead can be mitigated on existing MDTs formatted with 512-byte inodes by using SSD or NVM storage for the MDT, which avoids the cost of seeking to read the external xattr block on each cache-cold MDT inode access.<br />
<br />
Due to the implementation complexity and risk of inconsistent or sub-optimal file layouts being created by the incremental inheritance of layouts from the parent layout template, the PFL 2 project will implement whole-layout inheritance at file creation time.<br />
<br />
[[Image:pfl2_default_layout_template.png]]<br />
<br />
====MDS Layout Verification====<br />
In addition to PFL layout verification performed by userspace in the llapi_layout_* functions, the MDS should also do verification of the layout components to ensure that they are valid for the PFL feature. This includes the following checks:<br />
<br />
* verify the start of each component matches the end of the previous component (if any), to prevent overlapping or disjoint extents.<br />
* verify the layout stripe_size and the layout extent_end are properly aligned to prevent fractional pages or RPCs that span multiple components. This restriction may be relaxed over time, but for the initial implementation it will avoid complexity to ensure that full-stripe reads and writes are done within a single component.<br />
* verify object_maxbytes * stripe_count >= extent_end for each component except the last one, to ensure that file data can be written over the full range of the component. For ldiskfs OSTs the object_maxbytes is 16 TiB, so for a component with few stripes and a very large extent_end it is possible that the client would get -EFBIG while writing to the middle of the file. For ZFS OSTs the object_maxbytes value is 2^63-1 bytes, so this is not an issue. This may be difficult to implement 100% consistently, since the MDS will not necessarily know which specific OSTs will be selected when setting an uninstantiated layout template, which would only be a concern if there are different OST types within the same filesystem. In this unlikely case, it would be easiest to select the minimum maxbytes limit at OST connect time.<br />
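The checks above can be sketched as a single validation pass (names and the maxbytes argument are illustrative, not the actual MDS code):<br />

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pfl_comp {
	uint64_t ext_start, ext_end;	/* [start, end); end may be 2^64-1 */
	uint32_t stripe_size;
	uint16_t stripe_count;
};

/* Validate the three MDS checks listed above over an ordered component
 * array: adjacency (no holes/overlaps), stripe-aligned extent ends, and
 * object_maxbytes coverage for all but the last component. */
static bool pfl_layout_valid(const struct pfl_comp *c, size_t n,
			     uint64_t object_maxbytes)
{
	uint64_t expected_start = 0;

	for (size_t i = 0; i < n; i++) {
		/* each extent must begin where the previous one ended */
		if (c[i].ext_start != expected_start)
			return false;
		/* interior extent ends must be stripe_size aligned */
		if (i < n - 1 && c[i].ext_end % c[i].stripe_size != 0)
			return false;
		/* interior components must be fully writable */
		if (i < n - 1 &&
		    (uint64_t)c[i].stripe_count * object_maxbytes < c[i].ext_end)
			return false;
		expected_start = c[i].ext_end;
	}
	return true;
}
```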
<br />
As other features are added that use composite layouts, such as File Level Replication, these restrictions can be relaxed.<br />
<br />
===Client-MDS Protocol===<br />
By using extensions to the xattr protocol to instantiate and modify composite layouts there are no RPC protocol changes needed between the client and MDS. The Phase 2 PFL client will send the new OBD_CONNECT_COMPOSITE connection flag to indicate that it understands the composite layout feature, and the MDS replies with the same feature flag set to inform the client that this feature is supported, otherwise the client would get an error when storing the composite layout on the MDS. The existing MDS_SETXATTR and MDS_GETXATTR RPC opcodes can be used to create, modify, and remove individual components of a file, as well as whole composite files. Since the connection flag exchange is done on every client and MDS restart, there should never be a case where the MDS does not recognize the incoming file layout magic or the enhanced RPC opcodes.<br />
<br />
The existing RPC size limit will not be changed as part of this project, allowing a single file to have maximum stripe count of 2000 OSTs for a plain RAID-0 file. Since the PFL composite file and component layout containers themselves take up space, the maximum number of OSTs that a single file can use depends on the exact layout being used. For a single-component file, the maximum stripe count will only be 2-3 stripes below the 2000-OST limit. For a file with many single-stripe components, the maximum number of components will be approximately 500.<br />
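A back-of-envelope calculation using the structure sizes given in this document shows where these limits come from. The ~48 KB budget below is inferred from the existing 2000-stripe plain-layout limit and is an assumption for illustration:<br />

```c
#include <assert.h>

/* Sizes from the structures in this document: composite header 32 bytes,
 * component entry 48, lov_mds_md_v1 32, and 24 bytes per stripe object.
 * BUDGET is the layout size implied by a 2000-stripe plain file. */
enum {
	HDR = 32, ENTRY = 48, LMM = 32, PER_STRIPE = 24,
	BUDGET = LMM + 2000 * PER_STRIPE,	/* 48032 bytes */
};

/* Maximum stripes for a single-component composite file. */
static int max_stripes_one_comp(void)
{
	return (BUDGET - HDR - ENTRY - LMM) / PER_STRIPE;
}

/* Maximum number of single-stripe components in one file. */
static int max_single_stripe_comps(void)
{
	return (BUDGET - HDR) / (ENTRY + LMM + PER_STRIPE);
}
```

This yields 1996 stripes for a single-component file (a few below the 2000-OST limit) and 461 single-stripe components, in line with the "approximately 500" quoted above.<br />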
<br />
If the PFL Phase 2 Static Layout implementation is deployed separately from the proposed PFL Phase 3 Dynamic Layout, then some additional changes are needed in the RPC protocol between clients and the MDS. For clients using the PFL2 code that understands composite layouts but not dynamic layout initialization, as detected by the lack of OBD_CONNECT_PFL_DYNAMIC flag at connection time, any file creation requests will result in the MDS allocating all of the OST objects for a file with a layout template. The PFL Phase 3 clients can notify the MDS at connect time, by passing a new OBD_CONNECT_PFL_DYNAMIC feature flag, that they handle on-the-fly layout initialization of files, so it is safe to store only the layout template to disk.<br />
<br />
===Client-OSS Protocol===<br />
During normal IO operations between the client and OSS, the client sends information to the OSS about each object that is being accessed, to avoid the overhead of extra communication between the MDS and OSS for every object created and accessed in the filesystem. This information includes sending the MDT inode File Identifier (FID) to the OSS in order to indicate which file each OST object belongs to, as well as the stripe index of that object within the file. This information is stored on the OST object the first time the object is ever modified. The MDT inode FID passed from the client is sanity checked against the one stored on the OST object for later IO operations in order to avoid accidentally accessing or modifying OST objects due to software bugs, as well as by the [https://wiki.hpdd.intel.com/display/opensfs/LFSCK2+High+Level+Design Lustre File System Check] (LFSCK) tool to verify consistency between the file layout on the MDT and the objects on the OST(s) and to rebuild the MDT inode file layout if it becomes corrupted. By storing the component ID with each OST object, along with the stripe index and stripe size, the LFSCK tool can re-assemble the file layout for each MDT inode FID, even if the layout is lost or corrupted on the MDT.<br />
<br />
The RPC from the client to the OSS currently only passes a single integer for the object's stripe index, since this is all that was needed to uniquely identify the object in a RAID-0 file layout. In order to accommodate the presence of multiple component layouts within a single composite file, the RPC from the client needs to be modified slightly to carry more information for the above purposes, including the stripe size and stripe count, and, if the target object belongs to a PFL component, the component ID and its extent range as well. If all of this information were sent from the client to the OSS via the current obdo structure, the obdo.o_padding_{4/5/6} fields alone would not be enough, so other fields must be reused to avoid enlarging the obdo structure. Currently the obdo.o_lcookie field is only used by the OSP for recording the async RPC llog cookie, and that cookie is only used locally, so o_lcookie can be repurposed for on-wire use. Fortunately, it is large enough (32 bytes) to hold all of the above information for LFSCK. To be clear, a new structure, ost_layout, will be defined for this purpose.<br />
struct llog_cookie {<br />
struct llog_logid lgc_lgl;<br />
__u32 lgc_subsys;<br />
__u32 lgc_index;<br />
__u32 lgc_padding;<br />
} __attribute__((packed));<br />
<br />
+struct ost_layout {<br />
+ __u64 ol_pfl_start;<br />
+ __u64 ol_pfl_end;<br />
+ __u32 ol_pfl_id;<br />
+ __u32 ol_stripe_size;<br />
+ __u32 ol_stripe_count;<br />
+ __u32 ol_padding_0;<br />
+};<br />
<br />
struct obdo {<br />
__u64 o_valid; /* hot fields in this obdo */<br />
struct ost_id o_oi;<br />
__u64 o_parent_seq;<br />
__u64 o_size; /* o_size-o_blocks == ost_lvb */<br />
__s64 o_mtime;<br />
__s64 o_atime;<br />
__s64 o_ctime;<br />
__u64 o_blocks; /* brw: cli sent cached bytes */<br />
__u64 o_grant;<br />
/* 32-bit fields start here: keep an even number of them via padding */<br />
__u32 o_blksize; /* optimal IO blocksize */<br />
__u32 o_mode; /* brw: cli sent cache remain */<br />
__u32 o_uid;<br />
__u32 o_gid;<br />
__u32 o_flags;<br />
__u32 o_nlink; /* brw: checksum */<br />
__u32 o_parent_oid;<br />
__u32 o_misc; /* brw: o_dropped */<br />
__u64 o_ioepoch; /* epoch in ost writes */<br />
__u32 o_stripe_idx; /* layout stripe idx */<br />
__u32 o_parent_ver;<br />
struct lustre_handle o_handle; /* brw: lock handle to prolong<br />
* locks */<br />
- struct llog_cookie o_lcookie; /* destroy: unlink cookie from<br />
- * MDS, obsolete in 2.8, reused<br />
- * in OSP */<br />
+ /* Originally this field was a llog_cookie used to pass the unlink cookie<br />
+ * from the MDS for destroys; that use became obsolete in 2.8. It is now<br />
+ * reused by the client to transfer layout and PFL information in IO and<br />
+ * setattr RPCs. Since llog_cookie is no longer used on the wire, it is<br />
+ * removed from the obdo, so it can be enlarged freely in the future<br />
+ * without affecting related RPCs.<br />
+ *<br />
+ * Here, we have verified sizeof(ost_layout) == sizeof(llog_cookie). */<br />
+ union {<br />
+ /* struct llog_cookie o_lcookie; */<br />
+ struct ost_layout o_layout;<br />
+ };<br />
__u32 o_uid_h;<br />
__u32 o_gid_h;<br />
__u64 o_data_version; /* getattr: sum of iversion for<br />
* each stripe.<br />
* brw: grant space consumed on<br />
* the client for the write */<br />
__u64 o_padding_4;<br />
__u64 o_padding_5;<br />
__u64 o_padding_6;<br />
};<br />
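The size equality that the union above depends on can be checked directly. The llog_logid definition below is not shown in this document and is an assumption (an ost_id plus a generation, packed); the other structures follow the listings above:<br />

```c
#include <assert.h>
#include <stdint.h>

struct ost_id { uint64_t oi_id; uint64_t oi_seq; };	/* 16 bytes */

/* Assumed definition: not shown in this document. */
struct llog_logid {
	struct ost_id lgl_oi;
	uint32_t lgl_ogen;
} __attribute__((packed));				/* 20 bytes */

struct llog_cookie {
	struct llog_logid lgc_lgl;
	uint32_t lgc_subsys;
	uint32_t lgc_index;
	uint32_t lgc_padding;
} __attribute__((packed));				/* 32 bytes */

struct ost_layout {
	uint64_t ol_pfl_start;
	uint64_t ol_pfl_end;
	uint32_t ol_pfl_id;
	uint32_t ol_stripe_size;
	uint32_t ol_stripe_count;
	uint32_t ol_padding_0;
};							/* 32 bytes */
```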
<br />
Implementing the support for LFSCK and the OSTs to handle composite files belongs to PFL Phase 3a, which is beyond the scope of the PFL Phase 2 implementation.<br />
<br />
===OST On-Disk Format===<br />
As discussed in the Client-OSS Protocol section, the OST stores a fragment of the MDT layout with each object in order to do sanity checks on incoming client RPCs and recovery in case of MDT corruption. The OST needs to be able to store an additional 32 bytes of data with struct filter_fid to store additional information for the composite layout so that the OST object knows its place within the component and within the composite file:<br />
struct filter_fid {<br />
struct lu_fid ff_parent; /* ff_parent.f_ver == file stripe number */<br />
+ __u32 ff_stripe_size;<br />
+ __u32 ff_stripe_count;<br />
+ __u64 ff_pfl_start;<br />
+ __u64 ff_pfl_end;<br />
+ __u32 ff_pfl_id;<br />
};<br />
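To show where the byte counts in this section come from, the sketch below compares the old and new filter_fid shapes (the struct definitions are illustrative reconstructions): the new fields carry 28 bytes of data, while 8-byte alignment grows the structure itself by 32 bytes.<br />

```c
#include <assert.h>
#include <stdint.h>

struct lu_fid { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

/* Pre-PFL filter_fid: just the parent FID. */
struct filter_fid_old {
	struct lu_fid ff_parent;
};

/* PFL filter_fid with the additional component information. */
struct filter_fid_new {
	struct lu_fid ff_parent;
	uint32_t ff_stripe_size;
	uint32_t ff_stripe_count;
	uint64_t ff_pfl_start;
	uint64_t ff_pfl_end;
	uint32_t ff_pfl_id;
	/* 4 + 4 + 8 + 8 + 4 = 28 data bytes; sizeof grows by 32 because
	 * the struct is padded to its 8-byte alignment. */
};
```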
<br />
The osd-ldiskfs on-disk inode, together with the Lustre-specific xattrs ("lma" and "fid"), is very nearly out of free space in the OST's 256-byte inode, so there is not enough room to store these 28 additional bytes in the "fid" xattr. As explained above, storing the "fid" xattr in a separate block would cause a serious performance penalty, so another solution must be considered. One possibility is to merge the "fid" xattr into the "lma" EA body (value), saving the space occupied by the separate "fid" xattr entry (20 bytes). This is something of a hack, but it is hidden entirely inside osd-ldiskfs: from the upper layers' point of view the "fid" xattr is still independent, and they do not know or care how it is stored on disk.<br />
<br />
[[Image:pfl2_inode_size.png]]<br />
<br />
For the osd-zfs on-disk dnode (inode), the added information will be stored in the System Attributes (SAs), which currently do not fit into the dnode proper and already require a separate spill block. Once the large dnode patch lands in the ZFS-on-Linux repository, the SAs can always be stored within the dnode for maximum performance.<br />
<br />
===MDS-OSS Protocol===<br />
<br />
The MDS-OSS protocol is largely unaffected by composite layouts, since the OSTs themselves never use the file layout directly. The LFSCK utility does fetch the struct filter_fid xattr from the OST in order to verify its consistency against its locally stored file layout. The actual network protocol remains unchanged, besides the extra fields added to this structure. The LFSCK utility will need to verify the ff_stripe_size and ff_pfl_id fields against their respective values in the file layout to verify that the object is part of the correct component.<br />
<br />
===Known Issues===<br />
<br />
* Append write. An append has to instantiate all components to fulfill POSIX semantics.<br />
* Group lock. The current group lock semantics would be hard to comply with; work is in progress to come up with a solution.<br />
<br />
Please see more known issues at https://jira.hpdd.intel.com/browse/LU-9349<br />
<br />
[[Category:Architecture]]<br />
[[Category:Design]]<br />
[[Category:PFL]]</div>
Data on MDT Solution Architecture<p>Nrutman, 2017-08-25: /* Solution Requirements */</p>
<hr />
<div>==Introduction==<br />
<br />
Lustre file system read/write performance is currently optimized for large files (where ''large'' is more than a few megabytes in size). In addition to the initial file open RPC to the MDT, there are separate read/write RPCs to the OSTs to fetch the data, as well as disk IO on both MDT and OST for fetching and storing the objects and their attributes. This separation of functionality is acceptable (and even desirable) for large files since one MDT open RPC is typically a small fraction of the total number of read or write RPCs, but this hurts small file performance significantly when there is only a single read or write RPC for the file data. The Data On MDT (DOM) project aims to improve small file performance by allowing the data for small files to be placed only on the MDT, so that these additional RPCs and I/O overhead can be eliminated, and performance correspondingly improved. Since the MDT storage is typically configured as high-IOPS RAID-1+0 and optimized for small IO, Data-on-MDT will also be able to leverage this faster storage. Used in conjunction with the [[Distributed Namespace]] (DNE), this will improve efficiency without sacrificing horizontal scale.<br />
<br />
In order to store file data on the MDT, users or system administrators will need to explicitly specify a layout at file creation time to store the data on the MDT, or set a default layout policy on any directory so that newly created files will store the data on the MDT. This is the same as the current Lustre mechanism to specify the stripe count or stripe size when creating new files with data on the OSTs.<br />
<br />
The administrator will be able to specify a maximum file size for files that store their data on the MDT, to avoid users consuming too much space on the MDT and causing problems for the other users. If a file's layout specifies that data be stored on the MDT, but the file grows beyond the specified maximum file size, the data will be migrated to a layout with data on OST object(s). It will also be possible to use the file migration mechanism introduced in Lustre 2.4 as part of the HSM project to migrate existing small files that have data on OSTs to the MDT.<br />
<br />
==Use Cases==<br />
<br />
New files are created with an explicit layout to store the data on the MDT.<br />
<br />
New files are created with an implicit layout to store the data on the MDT inherited from the default layout stored on the parent directory.<br />
<br />
A small file is stored without the overhead of creating an OST object, and without OST RPCs.<br />
<br />
A small file is retrieved without the overhead of accessing an OST object, and without OST RPCs.<br />
<br />
Users, administrators, or policy engines can migrate existing small files stored on OSTs to an MDT.<br />
<br />
A client accesses a small file and has the file attributes, lock, and data returned with a single RPC.<br />
<br />
An administrator sets a global maximum size limit for small files stored on the MDT(s). Files larger than this value do not store their data on an MDT.<br />
<br />
==Solution Requirements==<br />
===Design a new layout for files with data on MDT===<br />
There needs to be some way for the client to know that the data is stored on the MDT. Currently there is no mechanism to record this information. The new DOM file layout is being designed in conjunction with the [[Layout Enhancement Solution Architecture]]. The DOM layout will only have a single implicit data stripe, which is the MDT inode itself.<br />
<br />
===Client IO stack must allow IO requests to an MDT device===<br />
The current client IO stack is implemented to perform IO through the OSC layer to OSTs. There is no IO mechanism for the MDC layer. Storing data on the MDT requires that data be read and written through the MDC layer. <br />
<br />
===Explicitly allocating files on an MDT by default directory striping===<br />
Since Lustre allocates the file layout when the file is first opened, the DOM layout needs to be chosen before any file data is actually written. This means it is not possible to use the file size or amount of data written by the client to decide after the file is opened whether the file data will reside on the MDT or OST. Applications would be able to explicitly specify a DOM layout for newly-created files using existing llapi interfaces. It would also be possible to inherit the default layout from the parent directory to allocate all new files in that directory on the MDT to avoid any changes to the application. <br />
<br />
It would be possible to allow applications to mknod() a file and then truncate() it to the final size to allow the MDT to make the layout decision before the file is first opened. This was handled in most previous versions of Lustre, but was removed in recent versions since it was never properly documented and never used by applications to our knowledge. This mechanism would be useful for applications that know (or can reasonably estimate) the final file size in advance, such as tar(1) or cp(1) or applications writing a fixed amount of data to a file (e.g. known array size, or fixed data size per task). <br />
<br />
[[:Category:PFL|PFL]] provides a solution to the pre-allocation dilemma; the final size need not be known, and only the starting extent of a file need be stored on the MDT.<br />
<br />
===A mechanism to allow files that exceed the small file size limit to grow from MDT to OST===<br />
If a client writes to a file with a DOM layout and the write immediately exceeds the small-file size limit, under the current locking behavior the client would need to flush its dirty cache to the MDT and cancel its layout lock before changing the file layout. Clients should be able to avoid this overhead. Instead, it would be preferable to change the layout to contain an OST object and flush the data directly from the client cache to the OST, storing nothing on the MDT. One possibility is to create all small files with an OST object to handle the overflow case, but this would add overhead even when it is not needed. Another option is to grant the client a PW layout lock for the object on the MDS, so that the client can modify the layout directly on the MDT without having to drop the layout lock or flush its cache.<br />
<br />
Integrating DoM with [[:Category:PFL|PFL]] will allow files to grow beyond the small MDT limit at the expense of having every file store a "small" part of the file on the MDT. In many cases this does not impose a significant additional burden on the MDT beyond DoM itself, since the majority of files (often 90%) are small, while the few large files (often 5%) consume the majority of the space (often 90%). For cases where the file size is known in advance (e.g. migration, HSM, delayed mirroring) it is better to just create the whole file on the OST(s) and skip the small DoM component.<br />
<br />
==Functional Requirements==<br />
===Administration===<br />
<br />
* system admin sets a filesystem-wide file size limit for the DOM feature via procfs or lctl set_param<br />
* system admin sets a default layout directly on a file, directory, or filesystem using lfs setstripe<br />
<br />
===Small file size limit===<br />
<br />
* The small file size limit applies to files stored on the MDT: a file that grows beyond this limit must be migrated to OST(s).<br />
* Migration involves creating a new layout for the file, allocating object(s) on OST(s), and transferring the data from the MDT to the OST object(s).<br />
<br />
===MDS_GETATTR request===<br />
<br />
* Client to get LDLM locks and file size in addition to other metadata attributes from the MDT in the same RPC.<br />
<br />
[[Category:Architecture]]<br />
[[Category:DoM]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Data_on_MDT_Solution_Architecture&diff=2700Data on MDT Solution Architecture2017-08-25T19:20:05Z<p>Nrutman: /* A mechanism to allow files to grow from MDT to OST that exceed small file size limit */</p>
<hr />
<div>==Introduction==<br />
<br />
Lustre file system read/write performance is currently optimized for large files (where ''large'' is more than a few megabytes in size). In addition to the initial file open RPC to the MDT, there are separate read/write RPCs to the OSTs to fetch the data, as well as disk IO on both MDT and OST for fetching and storing the objects and their attributes. This separation of functionality is acceptable (and even desirable) for large files since one MDT open RPC is typically a small fraction of the total number of read or write RPCs, but this hurts small file performance significantly when there is only a single read or write RPC for the file data. The Data On MDT (DOM) project aims to improve small file performance by allowing the data for small files to be placed only on the MDT, so that these additional RPCs and I/O overhead can be eliminated, and performance correspondingly improved. Since the MDT storage is typically configured as high-IOPS RAID-1+0 and optimized for small IO, Data-on-MDT will also be able to leverage this faster storage. Used in conjunction with the [[Distributed Namespace]] (DNE), this will improve efficiency without sacrificing horizontal scale.<br />
<br />
In order to store file data on the MDT, users or system administrators will need to explicitly specify a layout at file creation time to store the data on the MDT, or set a default layout policy on any directory so that newly created files will store the data on the MDT. This is the same as the current Lustre mechanism to specify the stripe count or stripe size when creating new files with data on the OSTs.<br />
<br />
The administrator will be able to specify a maximum file size for files that store their data on the MDT, to avoid users consuming too much space on the MDT and causing problems for the other users. If a file's layout specifies to store data on the MDT, but it grows beyond the specified maximum file size the data will be migrated to a layout with data on OST object(s). It will also be possible to use the file migration mechanism introduced in Lustre 2.4 as part of the HSM project to migrate existing small files that have data on OST to the MDT.<br />
<br />
==Use Cases==<br />
<br />
New files are created with an explicit layout to store the data on the MDT.<br />
<br />
New files are created with an implicit layout to store the data on the MDT inherited from the default layout stored on the parent directory.<br />
<br />
A small file is stored without the overhead of creating an OST object, and without OST RPCs.<br />
<br />
A small file is retrieved without the overhead of accessing an OST object, and without OST RPCs.<br />
<br />
Users, administrators, or policy engines can migrate existing small files stored on OSTs to an MDT.<br />
<br />
A client accesses a small file and has the file attributes, lock, and data returned with a single RPC.<br />
<br />
An administrator sets a global maximum size limit for small files stored on the MDT(s). Files larger than this value do not store their data on an MDT.<br />
<br />
==Solution Requirements==<br />
===Design a new layout for files with data on MDT===<br />
There needs to be some way for the client to know that the data is stored on the MDT. Currently there is no mechanism to record this information. The new DOM file layout is being designed in conjunction with the [[Layout Enhancement Solution Architecture]]. The DOM layout will only have a single implicit data stripe, which is the MDT inode itself.<br />
<br />
===Client IO stack must allow IO requests to an MDT device===<br />
The current client IO stack is implemented to perform IO through the OSC layer to OSTs. There is no IO mechanism for the MDC layer. Storing data on the MDT requires that data be read and written through the MDC layer. <br />
<br />
===Explicitly allocating files on an MDT by default directory striping===<br />
Since Lustre allocates the file layout when the file is first opened, the DOM layout needs to be chosen before any file data is actually written. This means it is not possible to use the file size or amount of data written by the client to decide after the file is opened whether the file data will reside on the MDT or OST. Applications would be able to explicitly specify a DOM layout for newly-created files using existing llapi interfaces. It would also be possible to inherit the default layout from the parent directory to allocate all new files in that directory on the MDT to avoid any changes to the application. <br />
<br />
It would be possible to allow applications to mknod() a file and then truncate() it to the final size to allow the MDT to make the layout decision before the file is first opened. This was handled in most previous versions of Lustre, but was removed in recent versions since it was never properly documented and never used by applications to our knowledge. This mechanism would be useful for applications that know (or can reasonably estimate) the final file size in advance, such as tar(1) or cp(1) or applications writing a fixed amount of data to a file (e.g. known array size, or fixed data size per task). <br />
<br />
===A mechanism to allow files that exceed the small file size limit to grow from MDT to OST===<br />
If a client writes to a file with a DOM layout and the write immediately exceeds the small-file size limit, under the current locking behavior the client would need to flush its dirty cache to the MDT and cancel its layout lock before changing the file layout. Clients should be able to avoid this overhead. Instead, it would be preferable to change the layout to contain an OST object and flush the data directly from the client cache to the OST, storing nothing on the MDT. One possibility is to create all small files with an OST object to handle the overflow case, but this would add overhead even when it is not needed. Another option is to grant the client a PW layout lock for the object on the MDS, so that the client can modify the layout directly on the MDT without having to drop the layout lock or flush its cache.<br />
<br />
Integrating DoM with [[:Category:PFL|PFL]] will allow files to grow beyond the small MDT limit at the expense of having every file store a "small" part of the file on the MDT. In many cases this does not impose a significant additional burden on the MDT beyond DoM itself, since the majority of files (often 90%) are small, while the few large files (often 5%) consume the majority of the space (often 90%). For cases where the file size is known in advance (e.g. migration, HSM, delayed mirroring) it is better to just create the whole file on the OST(s) and skip the small DoM component.<br />
<br />
==Functional Requirements==<br />
===Administration===<br />
<br />
* system admin sets a filesystem-wide file size limit for the DOM feature via procfs or lctl set_param<br />
* system admin sets a default layout directly on a file, directory, or filesystem using lfs setstripe<br />
<br />
===Small file size limit===<br />
<br />
* The small file size limit applies to files stored on the MDT: a file that grows beyond this limit must be migrated to OST(s).<br />
* Migration involves creating a new layout for the file, allocating object(s) on OST(s), and transferring the data from the MDT to the OST object(s).<br />
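The migration trigger above can be modeled with a short sketch. This is illustrative only: the limit value, the name DOM_MAX_SIZE, and the function names are hypothetical, not actual Lustre parameters or code.

```python
# Hypothetical model of the small-file size limit check (not Lustre
# code): a file with data on the MDT that would grow beyond the
# filesystem-wide limit must be migrated to OST(s).

DOM_MAX_SIZE = 1 << 20  # illustrative 1 MiB limit; the real value is tunable

def must_migrate(data_on_mdt, new_size, limit=DOM_MAX_SIZE):
    """True if a write growing the file to new_size forces migration."""
    return data_on_mdt and new_size > limit

# The migration itself then performs the steps listed above:
MIGRATION_STEPS = (
    "create a new layout for the file",
    "allocate object(s) on OST(s)",
    "transfer data from the MDT to the OST object(s)",
)
```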
<br />
===MDS_GETATTR request===<br />
<br />
* Client to get LDLM locks and file size in addition to other metadata attributes from the MDT in the same RPC.<br />
<br />
[[Category:Architecture]]<br />
[[Category:DoM]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Category:Design&diff=2699Category:Design2017-08-25T19:15:33Z<p>Nrutman: </p>
<hr />
<div>List of all design documents.<br />
<br />
[[Category:Architecture]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Data_on_MDT_Solution_Architecture&diff=2698Data on MDT Solution Architecture2017-08-25T19:10:34Z<p>Nrutman: /* Design a new layout for files with data on MDT */</p>
<hr />
<div></div>Nrutmanhttp://wiki.lustre.org/index.php?title=Progressive_File_Layouts&diff=2697Progressive File Layouts2017-08-25T19:03:05Z<p>Nrutman: </p>
<hr />
<div>The Lustre Progressive File Layout (PFL) feature intends to simplify the use of Lustre so that users can expect reasonable performance for a variety of normal file IO patterns without the need to explicitly understand their IO model or Lustre usage details in advance. In particular, users do not necessarily need to know the size or concurrency of output files in advance of their creation and explicitly specify an optimal layout for each file in order to achieve good performance for both highly concurrent shared-single-large-file IO or parallel IO to many smaller per-process files.<br />
<br />
The PFL feature is implemented in several phases, providing incremental functionality with each phase, including the base functionality of [[Layout_Enhancement_High_Level_Design#2.1._Composite_Layouts|Composite layouts]] which can be used for several other features that affect the file layout.<br />
<br />
== Phase 1: Prototype Implementation ==<br />
The Prototype Implementation is needed to explore options for implementing the PFL feature, and verify that expected performance gains are available for PFL files once a production implementation is available.<br />
* [[PFL Prototype Scope Statement]] describes the overall goals and intended outcomes of the prototype<br />
* [[PFL Prototype Solution Architecture]] describes how the goals of the PFL project may be implemented, and how to measure the completion and outcomes<br />
===Composite Extent-Mapped Layouts===<br />
Progressive file layouts are characterized by increasing the stripe count of the file in a step-wise manner as the file offset increases. This will be achieved by using multiple extent-based composite layouts as described in the [[Layout Enhancement High Level Design]].<br />
<br />
Composite layouts allow specifying different specific layouts for different ranges (extents) of the file. Currently, the only layout supported by Lustre is RAID-0 striping across one or more OST objects. The size of a composite layout will be constrained by existing total size limits for file layouts (approximately 4KB or 48KB, depending on the capabilities of the MDT backing storage) and the size of the RPC request and reply messages, but there will otherwise be no limits added on the number of layout extents by the PFL code itself.<br />
<br />
Progressive layout files will typically have a small fixed-length layout extent covering the start of the file (e.g. tens of megabytes in size). This initial layout extent will typically have only a single OST object to minimize file creation and access overhead. Additional layout extents, each with their own specific layout, follow at the end of the previous layout extent using new OST objects and covering a progressively larger extent of the file (e.g. a few gigabytes). This typically repeats until the file is striped across all available OSTs, at which point the final layout extent will cover the rest of the file, regardless of its size. <br />
<br />
The actual file offsets at which the layout extent changes, the stripe count for each layout extent, and the stripe size will be tunable by the user/administrator. While there is no requirement that the stripe size for each layout extent be the same, the start and end of each layout extent shall be an integer multiple of its stripe size. Overlapping layout extents (file-level data replication) are out of scope of the PFL project.<br />
<br />
===Client LOV support for composite layouts===<br />
The client Logical Object Volume (LOV) is responsible for mapping logical file offsets to offsets in the specific OST object that holds that region of the file, using a unique per-file layout. With progressive file layouts, there will be multiple contiguous but disjoint regions (layout extents) of each file that use distinct RAID-0 layouts for each region, and each layout extent will use different OST objects for storage.<br />
<br />
Mapping a logical file offset to a specific OST object is done by first finding the enclosing layout extent start and end offsets. In the PFL project, layout extents must not overlap, and will be stored in increasing file offset order. Once a layout extent is selected, the file offset is mapped to the OST object offset using the specific RAID-0 layout stored in that extent. The mapping within each extent is computed as if the specific layout spanned the whole file. This avoids the need for data migration in possible future layouts that will support mutable extent boundaries.<br />
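The two-step mapping described above can be sketched as follows. The dictionary-based layout representation and the names are illustrative assumptions, not Lustre's actual data structures.

```python
# Illustrative sketch of composite-layout offset mapping; the dict-based
# layout and names are assumptions, not Lustre's actual data structures.
# Each component covers [start, end) and holds a RAID-0 layout. Per the
# design, the RAID-0 mapping within a component is computed as if that
# layout spanned the whole file, so the absolute file offset is used.

def map_offset(components, file_off):
    """Return (object_id, object_offset) for a logical file offset."""
    for comp in components:
        if comp["start"] <= file_off < comp["end"]:
            ssize = comp["stripe_size"]
            objs = comp["objects"]
            stripe_no = file_off // ssize           # which stripe unit
            obj = objs[stripe_no % len(objs)]       # round-robin RAID-0
            obj_off = (stripe_no // len(objs)) * ssize + file_off % ssize
            return obj, obj_off
    raise ValueError("offset not covered by any layout extent")

# Example: 1 MiB stripe size; extent [0, 4 MiB) on one object "A",
# extent [4 MiB, EOF) striped over two objects "B" and "C".
MB = 1 << 20
layout = [
    {"start": 0, "end": 4 * MB, "stripe_size": MB, "objects": ["A"]},
    {"start": 4 * MB, "end": 1 << 60, "stripe_size": MB, "objects": ["B", "C"]},
]
```

In this example, file offset 4 MiB maps to object "B" at object offset 2 MiB, which matches where a two-object RAID-0 layout spanning the whole file would place that byte; that is the property that allows possible future mutable extent boundaries without data migration.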
<br />
As with current plain RAID-0 layouts, the file size, block count, and timestamps will be determined by aggregating the attributes from all OST objects that are part of the layout.<br />
<br />
== Phase 2: Static Layout Implementation ==<br />
The Static PFL Implementation will provide a functional implementation that allows specifying the full layout using standard user tools and addresses any shortcuts and/or defects in the Prototype implementation. The following functionality was implemented:<br />
* [[PFL2 Scope Statement]] describes the overall goals and intended outcomes of the production implementation<br />
* [[PFL2 Solution Architecture]] describes how the goals of the PFL project may be implemented, and how to measure the completion and outcomes<br />
* [[PFL2 High Level Design]] describes the implementation details for the PFL feature<br />
===Implement improved layout handling APIs===<br />
Currently, Lustre clients and servers understand only a single type of layout, RAID-0 striping across one or more OST objects, with a few small variations of a basic layout structure (lov_mds_md_v3). In order to add the progressive layout handling and other future layout types such as RAID-10 in a sustainable manner, the internal code needs to be restructured in order to isolate the parts of the code that handle the file layouts to a single library. Other parts of the Lustre code should not access the internal details of the file layout, and instead use library accessor functions to query required parameters such as the number of OST objects over which a file is striped, the size of the layout structure, and other common parameters. This will improve maintainability and reduce complexity and potential defects in the Lustre code as new layout types are added.<br />
===Implement RPCs for modifying composite layouts (need Layout APIs)===<br />
OST objects for each layout extent of a PFL file are allocated by the MDS on demand as the file grows. Writing past the end of the last allocated extent sends an RPC to the MDS to grow the file. The MDS then locks the layout and allocates OST objects for the new layout extent(s) and updates the layout. The MDS layout locking invalidates copies of the old layout cached on Lustre clients and forces them to refresh their copy of the layout, including the new layout extent(s), before accessing the file again. Since the previously allocated OST objects will not have changed, no data movement or data cache invalidation is required. The MDS is exclusively locking the layout during modification to avoid races from multiple clients trying to modify the layout concurrently.<br />
===User interfaces for specifying composite layouts===<br />
While PFL will itself provide simplified usability of files in Lustre, there will still be a need for administrators and users to directly specify progressive layouts for the filesystem. The filesystem global default layout (set on the filesystem root directory) will determine the layout for all new files created in the filesystem. The size thresholds between stages in the progressive layout are best tuned to the application environment to maximize performance. Some users and applications may also optimize performance by specifying a default progressive layout on a parent directory that is inherited by all new files and directories created therein.<br />
<br />
In order to specify the progressive layouts from userspace, the llapi_layout_* API will need to be enhanced to understand the new layout type. This new API will be used internally by the lfs setstripe command, and optionally by other applications modified to use this interface.<br />
<br />
The layouts that can be used for the individual components are expected to be the same as those available in Lustre today. This includes the LOV_MDS_MD_V1 and LOV_MDS_MD_V3 layouts that are RAID-0 striping across OSTs with the ability to specify different stripe_count, stripe_size, ost_index, and ost_pool. The design and implementation of PFL composite layouts is intended to work with other layout types in the future, but actual operation with future layout types is beyond the scope of this project.<br />
===Server LOD composite layout support===<br />
On the MDS, the Logical Object Device (LOD) manages the operational aspects of files that have components on multiple MDTs. The LOD component will primarily be concerned with the creation of new files using progressive layouts. In some cases, it will need to decode the layout and interact with objects one at a time for operations such as unlink, setattr, and LFSCK. The LOD code will also handle layout modification RPCs arriving from the clients.<br />
<br />
== Phase 3a: PFL Usability Improvements ==<br />
===LFSCK support for composite layouts===<br />
The Lustre File System Checker (LFSCK) verifies the structure of a Lustre filesystem, ensuring that the file layout on the MDT matches the objects located on the OST(s), and reconstructing the filesystem structure if it should become inconsistent or corrupted. In order to be able to do this, LFSCK needs to be able to understand the file's layout stored on the MDT object inode. Also, the OST objects need to store information about its part of the file layout so that the layout can be rebuilt if needed. With the addition of composite file layouts, LFSCK needs to be enhanced to support the new layout type, and the OST on-disk format needs to be extended so that OST objects can be identified as part of the correct component of the layout.<br />
===Default layout inheritance===<br />
In order to realize the full benefits of PFL, the progressive layout extents should not create OST objects until the size of the file grows sufficiently to need those objects. However, it is also necessary to be able to specify the layout template for the whole file at file creation time, so that the user or administrator can get the performance profile desired as the file is written.<br />
<br />
It should be possible to specify a default layout template on a directory that is inherited by new files and subdirectories created within that directory. If no default layout template is specified on the parent directory, it should also be possible to inherit the filesystem-wide default layout template when a file is created.<br />
<br />
== Phase 3b: Dynamic Layout Implementation ==<br />
===Composite file templates===<br />
In order to realize the full benefits of PFL, the progressive layout extents should not allocate OST objects until the size of the file grows sufficiently to need those objects. However, it is also necessary to be able to specify the layout template for the whole file at file creation time without allocating OST objects for all components, so that the user or administrator can get the performance profile desired as file size grows during writes.<br />
<br />
It should be possible to specify a default layout template on a directory that is inherited by new files and subdirectories created within that directory. If no default layout template is specified on the parent directory, it should inherit the filesystem-wide default layout template when a file is created.<br />
===Dynamic layout instantiation based on file offset===<br />
In order to simplify implementation, this project will focus on implementing composite layouts that are grown by allocating objects in non-overlapping layout extents at the end of the file, and will not implement modification of already allocated layout extents containing data.<br />
<br />
The client IO (CLIO) layer needs to be able to manage the growth of the file layout by reconfiguring its IO stack to add new OST objects into the layout. The client will request that the MDS instantiate OST objects based on the layout template before it begins writing to a file offset beyond the currently instantiated layout components. The layout generation stored in the composite layout and in each layout extent will allow CLIO to detect whether a specific layout extent has been modified when the lock is revoked. Since the existing components of the file layout will not be modified for PFL files, any in-flight IO operations and cached data do not need to be interrupted.<br />
===Improved MDS object allocator===<br />
The current MDS object allocator is designed only to allocate objects for one file at the time the file is first created. For progressive file layouts, at a minimum the allocator will need to be enhanced in order to avoid allocating objects on OSTs that are already part of a file's other components. If files have multiple objects allocated to the same OSTs before objects are allocated from unused OSTs, there may be a significant performance loss due to oversubscribing the bandwidth on that OST compared to the other OSTs. The only exception may be for a fully-striped component at the end of the file (see [[#Example Progressive Layouts]] for more detail), where it would be acceptable to allocate objects across all of the available OSTs to maximize the bandwidth available for the file.<br />
<br />
== Example Progressive Layouts ==<br />
In order to balance space and bandwidth usage against stripe count, one option would be to keep a linear balance between the total number of stripes and the file size: a stripe is added for each fixed unit of size, with layout extents at power-of-two size intervals. The aggregate performance would then be the same as if the file had been striped over the corresponding number of objects from the start.<br />
<br />
For example, adding a stripe for every 128MB of space on a system with 280 OSTs:<br />
{| class="wikitable"<br />
|-<br />
! Extent !! Component Striping !! Total Objects<br />
|-<br />
| [0-128MB) || 1 object || 1 object<br />
|-<br />
| [128MB-512MB) || 3 objects || 4 objects<br />
|-<br />
| [512MB-2GB) || 12 objects || 16 objects<br />
|-<br />
| [2GB-8GB) || 48 objects || 64 objects<br />
|-<br />
| [8GB-35GB) || 216 objects || 280 objects<br />
|-<br />
| [35GB-EOF) || 280 objects || 560 objects<br />
|}<br />
<br />
This results in a total of 280 objects for the first 35GB (= 128MB * 280) of the file, and each object holds a total of 128MB of file data. If this file were accessed in parallel across the first 35GB, the aggregate bandwidth and space usage for each object would be identical to a file that was striped across 280 objects for the entire 35GB, though it would be sub-optimal for parallel access to smaller ranges of the file. File sizes beyond 35GB would be identical to fully-striped files, at the expense of having twice the overhead for stat and locking operations.<br />
<br />
The progressive layout should stop growing at the point where the total number of stripes would equal or exceed the number of OSTs. At that point, it would be advantageous to add a final layout extent to EOF that stripes across all available OSTs in order to maximize bandwidth at the end of the file, in case it continues to grow significantly larger in size. This would result in a layout that was somewhat more than twice as large as a file that was striped across all OSTs right from the start.<br />
<br />
Alternatively, the last stripe could be grown to cover only the remaining OSTs if it was clear there weren’t going to be enough unused OSTs remaining for the next stage:<br />
{| class="wikitable"<br />
|-<br />
! Extent !! Component Striping !! Total Objects<br />
|-<br />
| [0-128MB) || 1 object || 1 object<br />
|-<br />
| [128MB-512MB) || 3 objects || 4 objects<br />
|-<br />
| [512MB-2GB) || 12 objects || 16 objects<br />
|-<br />
| [2GB-8GB) || 48 objects || 64 objects<br />
|-<br />
| [8GB-EOF) || 216 objects || 280 objects<br />
|}<br />
<br />
This means the bandwidth at the end of the file is only 216 / 280 = 77% of the full aggregate, but there is less overhead for accessing the large file because there are fewer objects to create, lock, and destroy.<br />
<br />
These are purely examples and in no way show a constraint on how PFL files are used. The stripe count does not need to increase from one component to the next, nor do the stripe counts or stripe boundaries need to be power-of-two values.<br />
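For illustration only, a layout like the first example table might eventually be requested with a composite `lfs setstripe` command along these lines; the exact option syntax and the mount point are assumptions, not a final interface:<br />

```shell
# hypothetical composite-layout syntax: each -E ends a component at the
# given offset (-1 = EOF), each -c gives that component's stripe count
lfs setstripe -E 128M -c 1 -E 512M -c 3 -E 2G -c 12 \
              -E 8G -c 48 -E 35G -c 216 -E -1 -c -1 /mnt/lustre/pfl_dir
```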
<br />
== Notes and Limitations ==<br />
===Maximum File Size===<br />
For ldiskfs, the direct mapping of file offset to OST object offset imposes a maximum file size of (stripe_count * 16TB) for the stripe_count of the last layout extent in the progressive layout. This is not significantly different from today, except for a small reduction in the stripe_count of the last extent due to OST objects that are allocated in earlier extents. Typically this will not be a limitation due to space constraints on the OSTs, and it can be tuned by selecting the layout progression appropriately. This is not a limitation for ZFS-based OSTs.<br />
<br />
===Client Compatibility===<br />
Clients that are not patched with the new progressive layout code will not be able to access files that use progressive layouts. This incompatibility would only affect files using progressive layouts, and not other files that may already exist in the filesystem, or new files created without using the progressive layout format.<br />
<br />
===OST Oversubscription===<br />
To avoid oversubscribing OST bandwidth, OSTs used at the beginning of the file should not normally be re-used for objects allocated later in the file. The space usage of each OST, and by extension its required bandwidth, can be balanced by selecting the layout progression appropriately. <br />
<br />
In some cases (e.g. ENOSPC) it might be necessary to allow multiple objects to be allocated from the same OSTs. In such cases, it would be desirable to allocate a new layout extent that stripes across a subset of OSTs with available space.<br />
<br />
===Layout Locking===<br />
It is expected that the existing layout lock implementation is sufficient for progressive layouts, and extent-based locking of the layout itself is unnecessary (there will of course still be extent-based locking of the file data itself). This implies that there is a single lock bit that manages the entire layout content, and revokes the whole layout from clients if it needs to be modified in any way. It is expected that the layout for any individual file will only be changed at most a handful of times in its lifetime, so revoking the layout lock a few times is no worse than revoking the object extent locks as happens many times during the lifetime of a file being written concurrently by multiple clients.<br />
<br />
Since progressive layouts only change by adding new layout extents at the end of the file, there is no need to invalidate the (meta-)data that is cached under the OST object locks. Clients in the process of writing to a file when the layout lock is revoked may complete the write without any danger. Clients starting new file writes must block until they have the layout lock, since the OST extent locks will not accurately reflect the range of the file that might be modified under a particular lock.<br />
<br />
[[Category:Architecture]]<br />
[[Category:Design]]<br />
[[Category:Features]]<br />
[[Category:PFL]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Data_on_MDT&diff=2696Data on MDT2017-08-25T19:02:37Z<p>Nrutman: </p>
<hr />
<div>==Feature Goals==<br />
<br />
Lustre file system performance is currently optimized for large files; small files incur additional RPC round-trips to the OSTs, which hurts small file performance. This project aims to correct this deficiency by allowing the data for small files to be placed on the MDT, so that these additional RPCs can be eliminated and performance correspondingly improved. Used in conjunction with DNE, this will preserve efficiency without sacrificing horizontal scale.<br />
<br />
Users or system administrators will set a layout policy that causes small files to be stored on an MDT. Files that grow beyond this size will use [[Progressive File Layouts]] to extend onto OST objects while leaving the start of the file on the MDT.<br />
<br />
==In Scope==<br />
<br />
* Implement a new layout for files with data on MDT.<br />
* Client IO stack must allow IO requests to an MDT device.<br />
* Design of a tunable for specifying the size limit for files on an MDT.<br />
* Integration of DoM with PFL to automatically extend files from the MDT to OSTs.<br />
* Explicitly allocating small files on the MDT by default directory striping.<br />
* A discussion of interoperability (or lack thereof) of old clients and servers with Data on MDT.<br />
<br />
==Out of Scope==<br />
<br />
* Automatically locating small files on the MDT after the file has started to be written.<br />
<br />
==Documentation==<br />
* [[Data on MDT Solution Architecture]]<br />
* [[Data on MDT High Level Design]]<br />
<br />
[[Category:Features]]<br />
[[Category:DoM]]<br />
[[Category:PFL]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Data_on_MDT_High_Level_Design&diff=2695Data on MDT High Level Design2017-08-25T18:58:09Z<p>Nrutman: </p>
<hr />
<div>==Introduction==<br />
<br />
The Data-on-MDT project aims to achieve good performance for small file IO. The motivations for this and current behaviour of small file IO are described in [[Data on MDT Solution Architecture]] document.<br />
<br />
This high-level design document provides more detail about the approach to implementing this feature in the Lustre file system.<br />
==Implementation Requirements==<br />
===Create new layout for Data-on-MDT===<br />
While the [[Layout Enhancement]] project is responsible for the new layout itself, we must add appropriate changes in LOV to properly recognize the new layout and use the LMV/MDC cl_object interface to work with data. LOV should also understand and properly handle the maximum data size for the new layout. Another part of the related work is to change the `lfs` tool so that it is able to set the new layout as a directory default.<br />
<br />
===CLIO should be able to work with MDT===<br />
====Make MDC devices part of CLIO====<br />
MDC devices must become `cl_device`s, become part of the CLIO stack, and be used by LOV for IO when the new layout is present. Some unification is possible between the OSC and MDC methods.<br />
<br />
That code is orthogonal to the current MDC and is placed there for simpler access to the import and request-related code. This code will eventually become the generic client used by both MDC and OSC.<br />
<br />
===MDT support for IO requests===<br />
====IO request handlers====<br />
Unified Target made this possible, but specific handlers do not currently exist for the MDT. They will be added as part of this work.<br />
<br />
====LOD must understand new layout====<br />
LOD must be aware of the new layout and handle it properly, directing IO methods and other operations to local storage instead of an OST.<br />
<br />
===Lustre tools changes===<br />
The `lfs setstripe` command must be extended to recognize the new layout and set its maximum size.<br />
<br />
==Functional Specification==<br />
===New DOM layout===<br />
====Wire/disk changes====<br />
<br />
The `lov_stripe_md` (`lsm`) stores information about the DOM layout: the `lw_pattern` has the `LOV_PATTERN_DOM` flag set, the `lw_stripe_size` contains the maximum data size for DOM, the `lw_stripe_count` is 0, and no `lsm_oinfo` is allocated. This does not change the wire protocol, because `LOV_PATTERN_DOM` replaces the existing `LOV_PATTERN_FIRST`.<br />
====In-memory structures====<br />
<br />
The LOV implements a new interface for the new `LLT_DOM` layout type. This set of methods is closely related to the `LLT_RAID0` interface, and some of them may be reused.<br />
<br />
Upon new object allocation the LOV does the following:<br />
<br />
* `lov_layout_type()` checks `lsm` and returns related type;<br />
* `lov_dispatch()` points to proper interface for this type;<br />
<br />
so the proper stack of objects is built, with sub-objects pointing to the MDC.<br />
<br />
union lov_layout_state {<br />
struct lov_layout_raid0 {<br />
...<br />
} raid0;<br />
struct lov_layout_dom {<br />
struct lovsub_object *lo_dom;<br />
} dom;<br />
};<br />
<br />
The `lov_layout_dom` refers to just a single object below the LOV.<br />
<br />
===New LOV subdevices===<br />
For DOM, IO must pass through the MDC device, so the LOV should have such a target, similar to its other targets pointing to OSCs. With DNE in mind, it must be possible to query the FLDB by FID to determine the proper MDC device. The LOV implements its own `lov_fld_lookup()` for this, and may also need to set up its own FLD client or use the one from the LMV.<br />
<br />
struct lov_device {<br />
...<br />
<br />
/* Data-on-MDT devices*/<br />
__u32 ld_md_tgts_nr;<br />
struct lovsub_device **ld_md_tgts;<br />
struct lu_client_fld *ld_fld;<br />
};<br />
<br />
====Adding MDC====<br />
MDC devices are added to the client LOV config log in the same way they are added to the LMV. When a new MDC is added, the device stack may not yet be ready, so the MDC must be saved and added later in `lov_device_init()`. The LOV device has a new array of MDC targets, `lov_device::ld_md_tgts[]`, with the same type `lovsub_device` as used for OSC but with a different set of methods.<br />
<br />
====Deleting MDC====<br />
Unlike the OSCs, the MDCs might be cleaned up before the LOV because they are still controlled by the MD device stack. The problem can be solved by taking an extra reference on the MDC device in the LOV or, better, by notifying the LOV when an MDC is about to shut down.<br />
<br />
====LOV `lovdom_device`====<br />
<br />
The `lovdom` device replaces the `lovsub` device used for OSC. A new set of methods is introduced to handle LOV-MDC interactions.<br />
<br />
static const struct lu_device_operations lovdom_lu_ops = {<br />
.ldo_object_alloc = lovdom_object_alloc,<br />
.ldo_process_config = NULL,<br />
.ldo_recovery_complete = NULL<br />
};<br />
<br />
static const struct cl_device_operations lovdom_cl_ops = {<br />
.cdo_req_init = lovsub_req_init<br />
};<br />
<br />
static struct lu_device *lovdom_device_alloc(const struct lu_env *env,<br />
struct lu_device_type *t,<br />
struct lustre_cfg *cfg)<br />
{<br />
struct lu_device *d;<br />
struct lovsub_device *lsd;<br />
<br />
OBD_ALLOC_PTR(lsd);<br />
if (lsd != NULL) {<br />
int result;<br />
result = cl_device_init(&lsd->acid_cl, t);<br />
if (result == 0) {<br />
d = lovsub2lu_dev(lsd);<br />
d->ld_ops = &lovdom_lu_ops;<br />
lsd->acid_cl.cd_ops = &lovdom_cl_ops;<br />
} else {<br />
d = ERR_PTR(result);<br />
}<br />
} else {<br />
d = ERR_PTR(-ENOMEM);<br />
}<br />
return d;<br />
}<br />
<br />
static const struct lu_device_type_operations lovdom_device_type_ops = {<br />
.ldto_device_alloc = lovdom_device_alloc,<br />
.ldto_device_free = lovsub_device_free,<br />
.ldto_device_init = lovsub_device_init,<br />
.ldto_device_fini = lovsub_device_fini<br />
};<br />
<br />
The `lovdom` device is much simpler than `lovsub_device` because it is not the top device of a new sub-stack but part of the stack. That is why the device alloc method is different, while at the same time we reuse the other `lovsub` methods.<br />
<br />
===DOM object layering===<br />
A DOM object is referenced by the single FID of the MDS object and has no stripes, so it is represented by the plain stack `ccc->lov->lovdom[mds_number]->mdc`. <br />
<br />
====DOM object====<br />
Since there is always a single stripe, the lov+lovdom layers are trivial and have almost no extra functions/checks. All the real code goes to the MDC layer.<br />
<br />
====Manage max_stripe_size====<br />
The LOV-specific code is the `max_stripe_size` boundary check and the code to start migration when the file grows.<br />
<br />
The `cl_io_operations::cio_io_iter_init()` method for the lovdom layer checks that the operation is not beyond the `max_stripe_size` boundary. In Phase I the action is simply an `-ENOSPC` return code; in Phase II, migration starts at this point.<br />
<br />
===MDC===<br />
====MDC CLIO methods====<br />
MDC fully reuses the OSC CLIO methods with several exceptions; e.g. the device initialization methods will be different, as `mdc_setup()` must be called instead of `osc_setup()`. That means we can separate this code from OSC and make it generic.<br />
<br />
====Locking====<br />
All IO for a DOM object is done under the `PW LAYOUT` lock, which protects all of the data on the MDT exclusively. The llite layer is responsible for this, and the MDC CLIO lock operations are all lockless to avoid real lock enqueues. <br />
<br />
====Glimpse====<br />
With data on MDT, the `GETATTR` request from the MDT returns `OBD_MD_FLSIZE|OBD_MD_FLBLOCKS` along with the other attributes. Therefore glimpse is disabled by setting the inode flag `LLIF_MDS_SIZE_LOCK` for DOM objects. <br />
<br />
===MDT changes===<br />
====IO request handlers====<br />
<br />
A new set of handlers is added to the MDT; they are needed to handle punch/read/write operations.<br />
<br />
static struct tgt_handler mdt_tgt_handlers[] = {<br />
...<br />
TGT_MDT_HDL(HABEO_CORPUS| HABEO_REFERO, OST_BRW_READ, tgt_brw_read),<br />
TGT_MDT_HDL(HABEO_CORPUS| MUTABOR, OST_BRW_WRITE, tgt_brw_write),<br />
TGT_MDT_HDL(HABEO_CORPUS| HABEO_REFERO | MUTABOR, OST_PUNCH, mdt_punch_hdl),<br />
};<br />
<br />
====IO object operations====<br />
<br />
New operations are added to the MDT that send IO requests directly to the OSD, bypassing MDD and LOD. The Unified Target uses the OBD interface for IO methods, so the MDT OBD operations are extended with the following:<br />
<br />
mdt_obd_ops = {<br />
...<br />
.o_preprw = mdt_obd_preprw,<br />
.o_commitrw = mdt_obd_commitrw,<br />
};<br />
<br />
====DOM support in LOD====<br />
<br />
LOD understands the new DOM layout by checking the `lsm` pattern with the `lsm_is_dom(lsm)` helper function. It must not use any OSP subdevices for such a layout, only the local OSD target. All assertions and checks on the pattern should be extended from "only `RAID0` is supported" to "`RAID0` and `DOM` are supported".<br />
<br />
The MDT stack doesn't filter out the `BLOCKS|SIZE` attributes from a DOM object, and returns them to the client.<br />
<br />
====Quota on MDT====<br />
As a first implementation, a user limit on DOM blocks can be provided without any code changes: for instance, if a user wants to set a 1GB limit for DOM and the maximum size of each file on the MDT is 1MB, then the user can simply set the inode limit to 1024. This approach conflicts with a real inode limit, but provides an initial path to quota control.<br />
<br />
A more robust quota for Data on MDT can be executed in two phases:<br />
<br />
=====Phase I: Support block quota on MDTs, and MDTs share same block limit with OSTs=====<br />
In the current quota framework, each MDT has only one MD `qsd_instance` associated with its OSD device:<br />
<br />
struct osd_device {<br />
...<br />
/* quota slave instance */<br />
struct qsd_instance *od_quota_slave;<br />
};<br />
<br />
This will be expanded to a list, and two `qsd_instance`s will be linked in the list for each MDT: one MD instance for metadata operations, and one DT instance for data operations. Writes on the MDT will use the DT `qsd_instance` to enforce block quota, and metadata operations will have to enforce both block and inode quotas with these two `qsd_instance`s. For the ZFS OSD, once block quota is enabled on the MDT, the approach of estimating inode accounting from used blocks will not work; using our own ZAP to track inode usage will be the only option. Ideally, OpenZFS will eventually support inode accounting directly.<br />
<br />
=====Phase II: Support different block limits for DOM and OSTs=====<br />
Because space on an MDT is often limited compared to an OST, an administrator may want to set a more restrictive block limit on MDTs rather than sharing the same limits with OSTs. To support private limits for DOM, the following changes are required:<br />
<br />
* Introduce a pre-defined DT pool for DOM to manage the block limit for all MDTs;<br />
* Add an additional `qsd_instance` associated with each MDT, which acquires limits from the DOM pool (some code probably needs to be revised to support a non-default pool ID);<br />
* Enhance the quota utilities to set and show quota limits for a specified pool (packing the pool ID in the quota request requires client-to-server protocol changes);<br />
* Writes on the MDT must enforce two block quotas: the default DT quota and the DOM quota.<br />
<br />
====Grants on MDT====<br />
Grant support on the MDT requires major MDT stack changes, because all changes must be accounted for, including metadata operations on directories, extended attributes, and llogs. Initially, grant support will be implemented as basic MDT target support for grants: the MDT will report grants and declare grant support at connection time, but return zero values (meaning IO must be synchronous) or rough estimates. Further work will account for all operations on data and report real values; this can be done during the declaration phase of a transaction.<br />
<br />
Operations to take into account:<br />
<br />
* directory create/destroy<br />
* EA add/delete<br />
* llog accounting (changelogs mostly)<br />
* Write/truncate of DOM objects<br />
* DOM object destroy<br />
<br />
===Lustre tools changes===<br />
====`lfs setstripe`====<br />
The `lfs` tool introduces the new option `--layout=mdt | -L mdt`. It selects the DOM layout, setting the layout pattern to `LOV_PATTERN_DOM` and the stripe count to 0. The option `--stripe-size`, used with `--layout`, sets the DOM `maxsize`. The option `--stripe-count` will cause an error if not 0. A pool name also cannot be set for this layout at the current time.<br />
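A usage sketch of the options described above (the mount point and directory are hypothetical):<br />

```shell
# set a default DOM layout on a directory: file data lives on the MDT,
# with a 1MB maximum size per file
lfs setstripe --layout=mdt --stripe-size=1M /mnt/lustre/small_files

# a non-zero stripe count is rejected for this layout
lfs setstripe -L mdt --stripe-count 1 /mnt/lustre/small_files   # error
```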
==Logic Specification==<br />
===`MDS_GETATTR` logic===<br />
<br />
The optimization is to avoid glimpse requests by returning the `SIZE` and `BLOCKS` attributes directly.<br />
<br />
[[media:dom_getattr.png]]<br />
<br />
* `MDS_GETATTR` is enqueued as usual but with `PR LAYOUT` lock<br />
* MDT returns `BLOCKS` and `SIZE` attributes as valid<br />
* llite sets a flag to indicate that glimpse is not needed (this flag already exists for SOM)<br />
<br />
===MDS_OPEN logic and optimizations===<br />
====`OPEN` with `O_RDWR`====<br />
<br />
[[media:dom_open_rw.png]]<br />
<br />
This optimization may help with partial page updates, when the client has to read a page first, add new data to it, and finally flush it back.<br />
<br />
* OPEN is enqueued with PW LAYOUT lock on object<br />
* MDT returns the partial page if it fits into the EA buffer<br />
* client updates the partial page and flushes pages<br />
<br />
====`OPEN` with `O_TRUNC`====<br />
<br />
[[media:dom_open_trunc.png]]<br />
<br />
This optimization truncates the file during open if the `O_TRUNC` flag is set.<br />
<br />
* `OPEN` is invoked with `PW LAYOUT` lock<br />
* MDS checks that `O_TRUNC` is set and truncates the file<br />
* client gets reply with updated attributes<br />
<br />
===Data Readahead logic===<br />
<br />
All readahead optimizations are based on the fact that it is possible to return some data from a small file in the EA reply buffer, which is now quite large.<br />
<br />
[[media:dom_readahead.png]]<br />
<br />
A typical case is `stat()`:<br />
<br />
* `MDS_GETATTR` is enqueued as usual<br />
* MDT returns the basic attributes and checks the amount of free space in the EA reply buffer<br />
* MDT reads data from the file into the EA buffer<br />
* client fills pages with data from the EA buffer<br />
<br />
==Tests Specification==<br />
===`sanity` tests===<br />
The set of tests to make sure the feature works as expected<br />
<br />
====`test_1a` - file with DOM layout====<br />
# create a file with `lfs setstripe -L`; check that its LOV EA attribute has the LOV_PATTERN_DOM pattern and that it has no stripes on OSTs<br />
# write to the file; check that the data is written and valid, that the file attributes are valid, and that the filesystem free blocks decrease according to the file size<br />
# delete the file; check that the free blocks are returned<br />
<br />
====`test_1b` - DOM layout as default directory layout====<br />
# set the new default layout on a directory with `lfs setstripe -L` and create a file in it<br />
# all steps from `test_1a`<br />
<br />
====`test_2a` - check size limit and migration====<br />
# create file with DOM layout<br />
# write more than `max_stripe_size` to the file<br />
## Phase I: check that an `-ENOSPC` error is returned<br />
## Phase II: check that file expansion to OSTs occurred<br />
# (only Phase II below)<br />
# check that the file layout is changed to the filesystem default layout and has stripes on OSTs<br />
# check that the file size is valid and the fs free blocks data is valid as well<br />
<br />
===`sanityn` tests===<br />
====`test_1{a,...}` - parallel access to the small files====<br />
# open a DOM file from client1 in different modes and pause it<br />
# perform various operations on the file from client2<br />
# make sure client2 waits for client1 to finish first<br />
<br />
===Recovery tests===<br />
# fail the MDS while performing operations like create/open/write/destroy on a DOM file<br />
# make sure the operations are replayed and finish as expected<br />
<br />
===Functional tests===<br />
Check all cases we optimize<br />
====Stat====<br />
# create files with the DOM layout vs a 2-stripe layout<br />
# fill them with data<br />
# perform ls -l on them<br />
# output results for both cases<br />
<br />
====Open with `O_TRUNC`====<br />
# create files with the DOM and default layouts<br />
# fill them with data<br />
# perform open with `O_TRUNC`<br />
# output results<br />
<br />
====Write at the end of file with partial page====<br />
# create many small files and fill them with data so that the last page is only partially filled<br />
# perform a write to the end of each file<br />
# output results for DOM files vs normal files<br />
<br />
====Readahead of small files====<br />
# create small files with the DOM and normal layouts<br />
# fill them with data<br />
# perform grep on them<br />
# output results<br />
<br />
===Generic performance tests===<br />
====`mdsrate`====<br />
Regular MD operations should benefit from Data-on-MDT because there are no OST requests; in practice, only stat should benefit noticeably, because open uses precreated objects and destroy is not blocked by OST object destruction. <br />
<br />
Use mdsrate to perform the following operations:<br />
<br />
* file creation with single stripe vs DOM<br />
* `stat` <br />
* `unlink`<br />
* output results for both cases<br />
<br />
'''Files with data'''<br />
The mdsrate tool must be modified to create DoM files containing some amount of data, to test DoM file performance in the common case.<br />
<br />
====`FIO`====<br />
Check generic IO performance of small files<br />
<br />
* run `FIO` with DoM files vs normal files<br />
* output results<br />
<br />
====postmark====<br />
The `postmark` utility is designed to emulate applications such as software development, email, newsgroup servers and Web applications. It is an industry-standard benchmark for small file and metadata-intensive workloads.<br />
<br />
* Run postmark over DoM striped directory and default one.<br />
* Output results<br />
<br />
[[Category:Design]]<br />
[[Category:DoM]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Data_on_MDT_Solution_Architecture&diff=2694Data on MDT Solution Architecture2017-08-25T18:57:44Z<p>Nrutman: </p>
<hr />
<div>==Introduction==<br />
<br />
Lustre file system read/write performance is currently optimized for large files (where ''large'' is more than a few megabytes in size). In addition to the initial file open RPC to the MDT, there are separate read/write RPCs to the OSTs to fetch the data, as well as disk IO on both MDT and OST for fetching and storing the objects and their attributes. This separation of functionality is acceptable (and even desirable) for large files since one MDT open RPC is typically a small fraction of the total number of read or write RPCs, but this hurts small file performance significantly when there is only a single read or write RPC for the file data. The Data On MDT (DOM) project aims to improve small file performance by allowing the data for small files to be placed only on the MDT, so that these additional RPCs and I/O overhead can be eliminated, and performance correspondingly improved. Since the MDT storage is typically configured as high-IOPS RAID-1+0 and optimized for small IO, Data-on-MDT will also be able to leverage this faster storage. Used in conjunction with the [[Distributed Namespace]] (DNE), this will improve efficiency without sacrificing horizontal scale.<br />
<br />
In order to store file data on the MDT, users or system administrators will need to explicitly specify a layout at file creation time to store the data on the MDT, or set a default layout policy on any directory so that newly created files will store the data on the MDT. This is the same as the current Lustre mechanism to specify the stripe count or stripe size when creating new files with data on the OSTs.<br />
<br />
The administrator will be able to specify a maximum file size for files that store their data on the MDT, to avoid users consuming too much space on the MDT and causing problems for other users. If a file's layout specifies storing data on the MDT, but the file grows beyond the specified maximum file size, the data will be migrated to a layout with data on OST object(s). It will also be possible to use the file migration mechanism introduced in Lustre 2.4 as part of the HSM project to migrate existing small files that have data on OSTs to the MDT.<br />
<br />
==Use Cases==<br />
<br />
New files are created with an explicit layout to store the data on the MDT.<br />
<br />
New files are created with an implicit layout to store the data on the MDT inherited from the default layout stored on the parent directory.<br />
<br />
A small file is stored without the overhead of creating an OST object, and without OST RPCs.<br />
<br />
A small file is retrieved without the overhead of accessing an OST object, and without OST RPCs.<br />
<br />
Users, administrators, or policy engines can migrate existing small files stored on OSTs to an MDT.<br />
<br />
A client accesses a small file and has the file attributes, lock, and data returned with a single RPC.<br />
<br />
An administrator sets a global maximum size limit for small files stored on the MDT(s). Files larger than this value do not store their data on an MDT.<br />
<br />
==Solution Requirements==<br />
===Design a new layout for files with data on MDT===<br />
There needs to be some way for the client to know that the data is stored on the MDT. Currently there is no mechanism to record this information. The new DOM file layout is being designed in conjunction with the Layout Enhancement Solution Architecture. The DOM layout will only have a single implicit data stripe, which is the MDT inode itself.<br />
<br />
===Client IO stack must allow IO requests to an MDT device===<br />
The current client IO stack is implemented to perform IO through the OSC layer to OSTs. There is no IO mechanism for the MDC layer. Storing data on the MDT requires that IO be read and written through the MDC layer. <br />
<br />
===Explicitly allocating files on an MDT by default directory striping===<br />
Since Lustre allocates the file layout when the file is first opened, the DOM layout needs to be chosen before any file data is actually written. This means it is not possible to use the file size or amount of data written by the client to decide after the file is opened whether the file data will reside on the MDT or OST. Applications would be able to explicitly specify a DOM layout for newly-created files using existing llapi interfaces. It would also be possible to inherit the default layout from the parent directory to allocate all new files in that directory on the MDT to avoid any changes to the application. <br />
<br />
It would be possible to allow applications to mknod() a file and then truncate() it to the final size to allow the MDT to make the layout decision before the file is first opened. This was handled in most previous versions of Lustre, but was removed in recent versions since it was never properly documented and never used by applications to our knowledge. This mechanism would be useful for applications that know (or can reasonably estimate) the final file size in advance, such as tar(1) or cp(1) or applications writing a fixed amount of data to a file (e.g. known array size, or fixed data size per task). <br />
<br />
===A mechanism to allow files to grow from MDT to OST that exceed small file size limit===<br />
If a client writes to a file with a DOM layout that immediately exceeds the small-file size limit, under the current locking behavior clients would need to flush their dirty cache to the MDT and cancel their layout lock before changing the file layout. Clients should be able to avoid this overhead. Instead, it would be preferable to change the layout to contain an OST object and flush the data directly from the client cache to the OST, without storing anything on the MDT. One possibility is to create all small files with an OST object to handle the overflow case, but this would add overhead when it may not be needed. Another option is to grant the client a PW layout lock for the object on the MDS, so that the client can modify the layout directly on the MDT without having to drop the layout lock or flush its cache.<br />
<br />
Integrating DoM with [[PFL]] will allow files to grow beyond the small MDT limit at the expense of having every file store a "small" part of the file on the MDT. In many cases, since the majority of files (often 90%) are small, while a few large files (5%) consume a majority of space (90%) this does not impose a significant additional burden on the MDT beyond DoM itself. For cases where the file size is known in advance (e.g. migration, HSM, delayed mirroring) it is better to just create the whole file on the OST(s) and skip the small DoM component.<br />
<br />
==Functional Requirements==<br />
===Administration===<br />
<br />
* system admin sets filesystem-wide filesize limit via procfs or lctl set_param for DOM feature<br />
* system admin sets default layout directly on file or dir or fs using lfs setstripe<br />
<br />
===Small file size limit===<br />
<br />
* The small file size limit defines the maximum size of a file whose data is stored on the MDT; a file growing beyond that limit should be migrated to OSTs.<br />
* Migration process includes new layout creation for that file, object(s) allocation on OST(s) and data transfer from MDT to the OST object(s).<br />
<br />
===MDS_GETATTR request===<br />
<br />
* Client to get LDLM locks and file size in addition to other metadata attributes from the MDT in the same RPC.<br />
<br />
[[Category:Architecture]]<br />
[[Category:DoM]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Layout_Enhancement_Solution_Architecture&diff=2693Layout Enhancement Solution Architecture2017-08-25T18:37:32Z<p>Nrutman: /* Layouts for File-Level Replication and Layout Extents */</p>
<hr />
<div>=Introduction=<br />
In the Lustre file system the data of a file is striped over one or more objects, each residing on an OST. The layout of a file is an attribute of the file which describes the mapping of file data ranges to object data ranges. The layout is stored on the MDT as a trusted extended attribute (`trusted.lov`) of the file and is sent to clients as needed. The layout of a file is often simply referred to as its striping, since in the current (2.5) implementation of Lustre only non-redundant striped (RAID0) layouts are permitted. This project will design enhancements to the representation and handling of layouts to support features such as [[Data on MDT]] (DoM), [[File Level Replication High Level Design|File-Level Replication]] (FLR), live data migration (via File-Level Replication), and RAID1/5/6 or erasure coding. The [[Layout Enhancement]] (LE) project is therefore a prerequisite for the Data on MDT and File-Level Replication projects, described in separate documents.<br />
<br />
=Solution Requirements=<br />
There are several demands on the Layout Enhancement project, each with its own set of requirements.<br />
<br />
==Layouts for File-Level Replication==<br />
File-level replication will use mirroring to offer increased availability and resilience of data on a Lustre file system. File-Level Replication is discussed in more detail in the [[File Level Replication High Level Design]]. The implementation of File-Level Replication requires a new composite layout type that comprises several simple (non-composite) layouts designating mirrors of the file data. It generalizes RAID0+1 to allow heterogeneous stripe sizes and counts among mirrors and to allow for the possibility of partial mirroring schemes, potentially with each replica on different OST storage pools with different performance and capacity characteristics. The design will specify the interaction of layouts for file-level replication with existing code that interprets or manipulates layouts (HSM, LOD, LOV, utilities, layout swapping, ...). The figure below depicts a 3-way replicated file F with components C0, C1, and C2 having 1, 3 and 8 stripes respectively.<br />
<br />
[[Image:Layout_with_three_components.png]]<br />
<br />
===Extent Based Layouts===<br />
Extent based layouts permit different layouts to be used in different extents of a file. They may be used to set progressively wider striping as a file grows in size, to prevent inconsistent out of space errors as individual OSTs become full, and to enable incremental migration, replication, and HSM restore. The figure below depicts a file with four different layouts (C0 through C3) each with its own extent to cover both overlapping and non-overlapping parts of the file data.<br />
<br />
[[Image:Layout_with_overlapping_components.png]]<br />
<br />
===RAID-1/5/6/10 and Other Algorithmic Layouts===<br />
These anticipate future developments to enable more space-efficient replication techniques. The design will discuss how these layouts can be expressed with simple extensions to the existing layout code. While RAID-1 and RAID-10 layouts offer degrees of replication they are not to be confused with the layouts for file-level replication described above. They are fixed layouts similar to the existing RAID-0 layouts currently used by Lustre. Compared to composite layouts they are simpler and more compact but they are also less expressive and less flexible. RAID1 and RAID-10 layouts may be converted to composite layouts. Composite layouts whose components all have the same stripe size and stripe count may be expressed as RAID1 and RAID10 layouts. Full read/write support for files with RAID-3/5/6 layouts is beyond the scope of this design. Instead we will briefly outline an "offline parity" access mode for files with these layout types.<br />
<br />
===Layouts Based on the CRUSH Algorithm===<br />
After investigating the CRUSH layout model it has been determined that the complexity of implementing a CRUSH-based solution within the current Lustre code is beyond the scope of this effort. Potential layouts could be proposed but we are not confident if they would be correct or complete by the time of implementation. We are confident however, that the layout proposed in this document is flexible and could allow for CRUSH-style algorithms in future.<br />
<br />
===Compact Layouts for Widely Striped Files===<br />
Existing (RAID-0) layout formats use an explicit array of object identifiers to map each stripe index to a specific OST object. When using FIDs alone to identify objects this approach requires 16 bytes per stripe. The current implementation packs an OST index together with the object identifier and needs 24 bytes per stripe. The allocation of memory buffers to transmit, receive, and handle these layouts for very widely striped files (over 160 stripes) can be costly. We will discuss a compact layout based on a bitmap of OST indices which reduces memory consumption for widely striped files by a factor of 192 for the current maximally-striped layout of 2000 stripes.<br />
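The quoted factor can be checked directly: 24 bytes per stripe in the explicit format, versus one bit per OST index in the compact form. The helper names below are illustrative, not Lustre code:<br />

```c
#include <assert.h>

/* Bytes needed for an explicit per-stripe object array
 * (24 bytes per stripe in the current format). */
static unsigned explicit_bytes(unsigned stripes, unsigned bytes_per_stripe)
{
    return stripes * bytes_per_stripe;
}

/* Bytes needed for a one-bit-per-OST-index bitmap, rounded up. */
static unsigned bitmap_bytes(unsigned stripes)
{
    return (stripes + 7) / 8;
}
```

For the maximally-striped layout of 2000 stripes this gives 48000 bytes versus 250 bytes, the factor of 192 cited above.<br />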
<br />
===Handling of Large Layouts===<br />
Issues with the handling of large layouts will be discussed and various solutions to these issues will be considered. In particular, for very large layouts it is desirable that the retrieval of the layout be separated from the initial open RPC in order to avoid the need for large request/reply buffer allocation. Instead, a bulk RDMA transfer could be used to fetch the layout from the MDT.<br />
<br />
=Use Cases=<br />
Since the Layout Enhancement Design itself is focused on providing an infrastructure to describe a flexible layout for complex files, the use cases are described in terms of potential uses of the enhanced layouts. In some cases, there are projects underway to implement these use cases, but in other cases these are just potential uses that may be implemented in the future.<br />
<br />
==File data availability==<br />
A user wants to change an existing file to be robust against temporary or permanent OST failure. This requires that all of the file data be stored on more than one OST, so that it can be read if an OST is overloaded or permanently offline due to failure. <br />
<br />
The [[File Level Replication High Level Design|File-Level Redundancy]] (FLR) project will use a composite layout with two or more overlapping extents to keep file data available in the face of OST failure. Due to the use of per-file layouts rather than per-OST replication, it is possible to selectively replicate files on an as-needed basis; for example, 1-in-12 or 1-in-24 application checkpoints could have two-way replication while 1-in-72 checkpoints have three-way replication. This allows users/applications to balance the replication and availability needs against space and bandwidth constraints, and is not an all-or-nothing decision.<br />
<br />
In FLR Phase 2 it will be possible to create replicated files from the beginning. In FLR Phase 1 and later it will also be possible to add replication to an existing file.<br />
<br />
User tools as described in the FLR project could use the following layout operations to add redundancy to non-replicated files:<br />
# A file's RAID-0 layout is converted to a composite layout whose sole component is the previous layout.<br />
# A temporary RAID-0 file is created to hold the new replica data (this is not connected to the layout and will be deleted in case of failure).<br />
# The file data is copied to the temporary replica file.<br />
# The new replica is merged into the composite layout as a new component, resulting in a compound layout with overlapping extents.<br />
<br />
==File data read performance==<br />
A user wants to have high-performance read access to a file from a large number of clients. This requires that the same data be stored on a large number of OSTs, so that it can be read in parallel at an aggregate bandwidth larger than what is available from a single OSS.<br />
<br />
Similar to File Data Availability, the File-Level Replication project will use a composite layout with overlapping extents. The number of replicated extents can be determined by the required read bandwidth and available OSS nodes. Having a large number of replicas on a file only makes sense for read-only files.<br />
<br />
==Reducing redundancy for old files==<br />
If a user no longer needs a high degree of replication on a file, either because it has been backed up to a separate archive, because a high read bandwidth is no longer required, or because of quota limitations, it is possible to remove one or more replicas from the file.<br />
<br />
User tools as described in the FLR project could use the following layout operations to remove redundancy from replicated files:<br />
#A replicated file is destroyed, each layout component is destroyed.<br />
#A replicated file has one of the component replicas split from the file and it is destroyed.<br />
<br />
==Improved small file performance==<br />
A user wants to store small files so that they can be read/written efficiently. Using a layout that specifies that the file data is stored on the MDT, as described in [[Data on MDT]] (DOM), allows accessing the file data with fewer disk IO operations and fewer RPCs. The DOM layout allows specifying that the data is stored in the MDT inode.<br />
<br />
==Improved performance for the start of a file==<br />
Some applications need improved performance for the start of the file, for operations such as determining the file type, accessing file metadata, or accessing a file index header. This can also be useful in conjunction with HSM, where the start of the file is resident on the MDT, but the rest of the file is archived on tape. This can be achieved by using a Data On MDT component with an extent covering the first stripe_size of file data, and having an OST-based component with an extent covering [0, ∞) or [stripe_size, ∞), depending on whether the first chunk of data will be replicated to the OST object(s) or not.<br />
<br />
==Increased bandwidth and capacity for larger files==<br />
A user wants to optimize small files with a single OST stripe, and large files with many OST stripes, without having to explicitly manage this on a per-file or per-directory basis. This could be implemented with a compound layout with multiple layouts in non-overlapping extents. As a file grows in size, progressively wider striping is used for file data in order to give the file access to more OST storage capacity and IO bandwidth.<br />
<br />
For example, a new file could be created with a single stripe for the extent [0, 32MB). If a file grows beyond 32MB in size, a new component layout would be created for the extent [32MB, 1GB) with 8 stripes. Should the file grow beyond 1GB in size, a third component layout would be created for the extent [1GB, ∞) that is striped over all of the available OSTs.<br />
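That growth policy amounts to a lookup of the component covering a given file offset. A minimal sketch using the example's extents, with a hypothetical count of 64 standing in for "all of the available OSTs":<br />

```c
#include <assert.h>
#include <stdint.h>

#define MB (1ULL << 20)
#define GB (1ULL << 30)
#define EXT_EOF UINT64_MAX      /* stands in for the open-ended extent */

struct comp {
    uint64_t start, end;        /* component extent [start, end) */
    unsigned stripe_count;
};

/* The three-component example above: 1 stripe, then 8, then wide. */
static const struct comp layout[] = {
    { 0,       32 * MB, 1  },
    { 32 * MB, 1 * GB,  8  },
    { 1 * GB,  EXT_EOF, 64 },
};

/* Stripe count of the component covering file offset off. */
static unsigned stripes_at(uint64_t off)
{
    for (unsigned i = 0; i < sizeof(layout) / sizeof(layout[0]); i++)
        if (off >= layout[i].start && off < layout[i].end)
            return layout[i].stripe_count;
    return 0;
}
```

Since the extents are non-overlapping and half-open, each offset maps to exactly one component, so the file transitions cleanly from 1 to 8 to 64 stripes as it grows.<br />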
<br />
==Handling out-of-space on an OST==<br />
If an application is writing to a file it is possible that one of the OSTs runs out of space, while other OSTs have a larger amount of free space (e.g. if new OSTs were added and had much more free space, or if a very large file was created with a single stripe). This could be implemented by converting the existing RAID-0 layout to have an extent from [0, file_size) and then creating a new RAID-0 layout as a separate extent [file_size, ∞) for the end of the file. <br />
<br />
==Transparent migration of files between OSTs==<br />
A file needs to be migrated between two OSTs. This may be needed in order to evacuate an OST that is failing or scheduled for hardware replacement, with a file that is in use by a long-running application.<br />
<br />
The file can be converted from a non-replicated file to a composite file with an extent from [0, ∞). A second component is added with an extent [0, 0) on the target OST(s). The file data can incrementally be copied to the target replica component. As data is copied to the new replica component the new component's extent end is increased to cover the just-copied range of the file, for example [0, 1GB). The old component's extent start is increased to cover a smaller range at the end of the file, for example [1GB, ∞), and its data is punched by the same amount (or the source objects are simply deleted at the end).<br />
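The incremental bookkeeping can be sketched as one step of that copy loop; the struct mirrors lu_extent, but the function name and scheme are illustrative only:<br />

```c
#include <assert.h>
#include <stdint.h>

struct lu_extent {
    uint64_t e_begin;
    uint64_t e_end;
};

#define LU_EXT_EOF UINT64_MAX   /* stands in for the [., ∞) extent end */

/* After the range up to copied_end has been copied to the new replica
 * component, grow the target extent and shrink the source one; the
 * source objects' data for that range can then be punched. */
static void migrate_step(struct lu_extent *src, struct lu_extent *tgt,
                         uint64_t copied_end)
{
    tgt->e_end = copied_end;    /* target now covers [0, copied_end) */
    src->e_begin = copied_end;  /* source shrinks to [copied_end, ∞) */
}
```

Repeating this step until the source extent is empty completes the migration without ever taking the file offline.<br />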
<br />
==Unaligned components==<br />
For applications that have poorly-structured IO, it is possible that the application writes a short file header, and then reads or writes large-but-unaligned chunks of the file with large requests. For example, if there is a 64kB header, and then a series of 1MB chunks read/written with a 64kB offset from the start of the file. This produces poorly-formed IO to the underlying OST RAID LUNs because it is not aligned with the RAID chunks. It would be possible to specify a compound layout with a Data On MDT component for the start of the file, and then a RAID-0 OST component for the rest of the file. Unlike the components specified in other examples, the unaligned IO component would be flagged to be starting at the end of the first extent, rather than overlapping the first extent.<br />
<br />
==Algorithmic Layouts==<br />
Use cases for RAID-1/10 are similar to those for replication, except that the former formats are simpler and may be expressed more concisely. For example the RAID-1 form of a layout designating 4 mirrored objects (RAID-1) is small enough to fit in the extra space of a 512 byte MDT inode. This is not true of the analogous composite layout with 4 entries.<br />
<br />
RAID-5 and 6 offer increased data availability in the face of OST failures without the space expense of full mirroring. While we do not anticipate supporting networked RAID-5/6 in the short term, there are interesting use cases for read-mostly files. Given an existing (non-RAID-5/6) file, a volatile file is opened and given a RAID-5/6 layout, data is copied from the old file to the new, and parity chunks are written. Then the volatile file is merged into the old file or a layout swap is performed.<br />
<br />
==Compact Layouts==<br />
A widely striped file is to be created in order to achieve maximum IO bandwidth. Transferring a conventional layout for this file requires clients to allocate tens or hundreds of KB of buffers. Using a compact layout format the file's layout may be transferred with negligible RPC overhead.<br />
<br />
=Solution Proposal=<br />
==Layouts for File-Level Replication and Layout Extents==<br />
Layouts for file-level replication and extent based layouts will be offered through the same underlying layout type, which we call a composite layout. This layout consists of a header (described by struct lov_comp_md_v1 below), an array of component descriptors (described by struct lov_comp_md_entry_v1), and the component layouts (a sequence of blobs that are independent RAID-0/1/5/6/10 layouts of type struct lov_mds_md_v3 or other layout types in the future).<br />
<br />
This design does not support nested composite layouts (i.e. components which are themselves composites) to avoid complexity and recursion in the implementation of layout handling. It is thought that non-nested layouts provide sufficient flexibility for current projects and anticipated future uses.<br />
struct lu_extent {<br />
__u64 e_begin;<br />
__u64 e_end;<br />
};<br />
<br />
struct lov_comp_md_entry_v1 {<br />
__u32 lcme_id; /* unique identifier of component within composite layout */<br />
__u32 lcme_flags; /* LCME_FL_STALE, LCME_FL_OFFLINE, LCME_FL_PREFERRED, ... */<br />
struct lu_extent lcme_extent; /* file extent for component */<br />
__u32 lcme_offset; /* offset of component layout from start of composite layout */<br />
__u32 lcme_size; /* size of component layout data in bytes */<br />
union {<br />
__u64 lcme_padding;<br />
} u;<br />
};<br />
<br />
struct lov_comp_md_v1 {<br />
__u32 lcm_magic; /* LOV_MAGIC_COMP_V1 */<br />
__u32 lcm_size; /* overall size including this struct */<br />
__u32 lcm_layout_gen; /* incremented each time layout is modified */<br />
__u16 lcm_flags; /* LCM_FL_RS_READ_ONLY, LCM_FL_RS_SYNC_PENDING, ... */<br />
__u16 lcm_entry_count; /* number of components in lcm_entries[] */<br />
union {<br />
__u64 lcm_padding[2];<br />
} u;<br />
struct lov_comp_md_entry_v1 lcm_entries[0];<br />
};<br />
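To make the packing concrete, a consumer of this format would walk the entry table and resolve each component's layout blob through its lcme_offset. The following is a minimal C sketch with fixed-width stand-ins for the __u* types and the unions flattened; it mirrors, but is not, the actual Lustre definitions:<br />

```c
#include <assert.h>
#include <stdint.h>

struct lu_extent {
    uint64_t e_begin;
    uint64_t e_end;
};

struct lov_comp_md_entry_v1 {
    uint32_t lcme_id;
    uint32_t lcme_flags;
    struct lu_extent lcme_extent;
    uint32_t lcme_offset;       /* from start of composite layout */
    uint32_t lcme_size;
    uint64_t lcme_padding;
};

struct lov_comp_md_v1 {
    uint32_t lcm_magic;
    uint32_t lcm_size;
    uint32_t lcm_layout_gen;
    uint16_t lcm_flags;
    uint16_t lcm_entry_count;
    uint64_t lcm_padding[2];
    struct lov_comp_md_entry_v1 lcm_entries[];
};

/* Locate component i's own layout blob (e.g. a struct lov_mds_md_v3)
 * inside the composite buffer via its lcme_offset. */
static const void *comp_layout(const struct lov_comp_md_v1 *lcm, int i)
{
    return (const char *)lcm + lcm->lcm_entries[i].lcme_offset;
}
```

Because lcme_offset and lcme_size bound each blob independently, components of different layout types and sizes can coexist in one buffer.<br />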
<br />
Each component has an extent which describes the range of the file to which the layout applies. The extent does not necessarily need to cover the full file range. Any extents which are overlapping other extents are replicated, and are expected to contain the same file data at the same logical offset. A replicated (overlapping) extent may be marked UPDATING if it is currently being updated asynchronously by a client (see [[File Level Replication High Level Design]] for more details). A replicated extent may be marked STALE if there was a hard failure updating the data of that extent to match the primary replica.<br />
<br />
The mechanism for maintaining and resynchronizing replicas is beyond the scope of this document. However, it should be mentioned that it is desirable to keep STALE replicas attached to a file rather than removing them immediately upon OST failure. The number of updates needed to resynchronize a STALE replica may be minimal if it is offline for only a short time. This may also allow recovery of an old version of the file from a STALE replica if the primary replica suffers a fatal error. <br />
<br />
In the design we normally assume that the component extents in a composite layout start at byte 0 of the file. The extents may each form a non-overlapping subset of [0, ∞), or they may all start at file offset 0, or there may be some other overlap. However, we should try not to use the component extent start as an offset when accessing the component objects. That is, if a component has a single object O and extent [s, ∞) then the file byte at position p should be found at offset p of O and not at p - s. In this way an extent with a non-zero start can be converted to one which starts at 0. Similarly, assuming that the file data is safely mirrored to another component, a component whose extent starts at zero can be figuratively punched to have some positive start without remapping the objects, followed by a punch of the corresponding object data.<br />
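These two conventions — overlapping extents are replicas, and file byte p lives at object offset p regardless of the extent start — can be sketched as follows; the helpers are illustrative, not Lustre code:<br />

```c
#include <assert.h>
#include <stdint.h>

struct lu_extent {
    uint64_t e_begin;
    uint64_t e_end;
};

/* Two component extents replicate each other iff they intersect. */
static int extents_overlap(struct lu_extent a, struct lu_extent b)
{
    return a.e_begin < b.e_end && b.e_begin < a.e_end;
}

/* File byte p is found at object offset p, NOT p - e_begin, so moving a
 * component's extent start never requires remapping its objects. */
static uint64_t object_offset(struct lu_extent comp, uint64_t p)
{
    (void)comp;                 /* extent start deliberately unused */
    return p;
}
```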
<br />
The ability to pack independent file layouts as components of a larger composite layout provides a great deal of flexibility, while isolating the complexity of the individual layouts. By allowing both overlapping and non-overlapping extents to be specified, it is possible to construct file layouts for almost any use case. The ability to add different component layouts in the future (e.g. CRUSH, RAID-5/6) will allow flexibility without requiring changes in the core layout infrastructure.<br />
<br />
For quota accounting of files with compound layouts, each component is treated in the same way as a separate file with the same contents. Adding a replica to a file will increase the quota usage of a user, and removing a replica will decrease the quota usage. For files with Data-on-MDT, the space usage of the component on the MDT will be accounted separately from the space on the OSTs. With [[Project Quotas]] (a separate project, see [https://jira.hpdd.intel.com/browse/LU-4017 LU-4017]) it would be possible in the future to separately administer the quota space available to users on the MDT. This allows and encourages users to pick the availability and performance characteristics that suit their needs best.<br />
<br />
It is anticipated that replication requirements can also be managed by an external policy engine such as [https://github.com/cea-hpc/robinhood/wiki RobinHood], to add or remove replicas to files, to migrate small files to the MDT, or to create or migrate replicas to different tiers of storage.<br />
<br />
Potentially complex applications are possible in the future with integration into userspace libraries/applications, such as tailoring file IO characteristics differently for disjoint parts of a single large file. For example, an HDF5 file could use high IOPS OST pool for components with extents covering an index or small IOs, replicated (overlapping) extents for important metadata, and large streaming components for extents with well-formed IO.<br />
<br />
==Operations on Composite Layouts==<br />
Several kinds of operations are needed to manipulate simple and composite layouts:<br />
#A file with a simple layout is converted to a composite layout whose sole component is the previous layout.<br />
#A file with simple RAID-0 layout is merged into an existing compound file.<br />
#A replica (composite layout component) of a file is split out to a new file with only this component as its layout.<br />
#A replica (composite layout component) of a file is split out of the compound layout and is destroyed.<br />
#A replicated file is destroyed, each layout component is destroyed.<br />
#From a composite layout, the component with a given id is opened.<br />
#From a composite layout, the component with a given id has its STALE state set or cleared.<br />
#A composite layout is repacked after removal of a component.<br />
<br />
==Algorithmic Layouts==<br />
#A new RAID-1/5/6/10 file is created.<br />
#Two suitably striped RAID-0 files are merged into a RAID-1/10 file.<br />
#A suitably striped RAID-0 file is merged into an existing RAID-1/10 file.<br />
#A specified mirror in a RAID-1/10 file is split off into a new file without an assigned layout.<br />
#A volatile (open unlinked) file with RAID-5/6 layout is created and written with a copy of an existing file's data. The volatile file's layout is merged as a component of the original file. The RAID-5/6 component can be read.<br />
<br />
==Compact Layouts==<br />
#A file is to be striped over a large number of OSTs, say 512 or more. An ordinary RAID0 layout for the file would be tens or hundreds of KB in size. Instead of explicitly specifying the FID of each OST object, a bitmap of OST indices is stored along with enough data that for each OST index set in the bitmap, a corresponding OST object FID may be computed.<br />
#Compact layouts must also include a starting index (or bias) to ensure balanced OST use.<br />
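One hypothetical encoding (not necessarily the one the design will adopt): stripe number n rotates through the set bits of the OST bitmap, offset by the starting bias so that different files begin on different OSTs:<br />

```c
#include <assert.h>
#include <stdint.h>

/* Map stripe number n to an OST index, given the bitmap of
 * participating OSTs and a starting bias; illustrative scheme only. */
static int stripe_to_ost(const uint64_t *bitmap, int nbits, int bias, int n)
{
    int count = 0;
    for (int i = 0; i < nbits; i++)
        if (bitmap[i / 64] & (1ULL << (i % 64)))
            count++;                    /* stripe count == set bits */
    if (count == 0)
        return -1;

    int want = (bias + n) % count;      /* rotate by bias for balance */
    int seen = -1;
    for (int i = 0; i < nbits; i++)
        if ((bitmap[i / 64] & (1ULL << (i % 64))) && ++seen == want)
            return i;
    return -1;
}
```

With this scheme only the bitmap and bias need to be stored; each OST object FID would then be computed from the OST index as the second requirement describes.<br />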
<br />
=Unit/Integration Test Plan=<br />
==Composite Layouts==<br />
#Create, store, load, and destroy empty composite layout on a file with no assigned layout.<br />
#Convert simple file layout to singleton composite layout.<br />
#Convert singleton composite layout to simple layout.<br />
#Merge simple file layout to existing composite layout.<br />
#Split component of composite layout to existing file without layout.<br />
#Move component between the composite layouts of existing files.<br />
#Get component layout with a given id from composite layout and validate.<br />
<br />
[[Category:Architecture]]<br />
[[Category:Design]]<br />
[[Category:Layout]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Layout_Enhancement_Solution_Architecture&diff=2692Layout Enhancement Solution Architecture2017-08-25T18:12:25Z<p>Nrutman: /* Improved small file performance */</p>
<hr />
<div>=Introduction=<br />
In the Lustre file system the data of a file is striped over one or more objects, each residing on an OST. The layout of a file is an attribute of the file which describes the mapping of file data ranges to object data ranges. The layout is stored on the MDT as a trusted extended attribute (`trusted.lov`) of the file and is sent to clients as needed. The layout of a file is often simply referred to as its striping, since in the current (2.5) implementation of Lustre only non-redundant striped (RAID0) layouts are permitted. This project will design enhancements to the representation and handling of layouts to support features such as [[Data on MDT]] (DoM), [[File Level Replication High Level Design|File-Level Replication]] (FLR), live data migration (via File-Level Replication), and RAID1/5/6 or erasure coding. The [[Layout Enhancement]] (LE) project is therefore a prerequisite for the Data on MDT and File-Level Replication projects, described in separate documents.<br />
<br />
=Solution Requirements=<br />
There are several demands on the Layout Enhancement project, each with its own set of requirements.<br />
<br />
==Layouts for File-Level Replication==<br />
File-level replication will use mirroring to offer increased availability and resilience of data on a Lustre file system. File-Level Replication is discussed in more detail in the [[File Level Replication High Level Design]]. The implementation of File-Level Replication requires a new composite layout type that comprises several simple (non-composite) layouts designating mirrors of the file data. It generalizes RAID0+1 to allow heterogeneous stripe sizes and counts among mirrors and to allow for the possibility of partial mirroring schemes, potentially with each replica on different OST storage pools with different performance and capacity characteristics. The design will specify the interaction of layouts for file-level replication with existing code that interprets or manipulates layouts (HSM, LOD, LOV, utilities, layout swapping, ...). The figure below depicts a 3-way replicated file F with components C0, C1, and C2 having 1, 3 and 8 stripes respectively.<br />
<br />
[[Image:Layout_with_three_components.png]]<br />
<br />
===Extent Based Layouts===<br />
Extent based layouts permit different layouts to be used in different extents of a file. They may be used to set progressively wider striping as a file grows in size, to prevent inconsistent out of space errors as individual OSTs become full, and to enable incremental migration, replication, and HSM restore. The figure below depicts a file with four different layouts (C0 through C3) each with its own extent to cover both overlapping and non-overlapping parts of the file data.<br />
<br />
[[Image:Layout_with_overlapping_components.png]]<br />
<br />
===RAID-1/5/6/10 and Other Algorithmic Layouts===<br />
These anticipate future developments to enable more space-efficient replication techniques. The design will discuss how these layouts can be expressed with simple extensions to the existing layout code. While RAID-1 and RAID-10 layouts offer degrees of replication they are not to be confused with the layouts for file-level replication described above. They are fixed layouts similar to the existing RAID-0 layouts currently used by Lustre. Compared to composite layouts they are simpler and more compact but they are also less expressive and less flexible. RAID1 and RAID-10 layouts may be converted to composite layouts. Composite layouts whose components all have the same stripe size and stripe count may be expressed as RAID1 and RAID10 layouts. Full read/write support for files with RAID-3/5/6 layouts is beyond the scope of this design. Instead we will briefly outline an "offline parity" access mode for files with these layout types.<br />
<br />
===Layouts Based on the CRUSH Algorithm===<br />
After investigating the CRUSH layout model it has been determined that the complexity of implementing a CRUSH-based solution within the current Lustre code is beyond the scope of this effort. Potential layouts could be proposed but we are not confident if they would be correct or complete by the time of implementation. We are confident however, that the layout proposed in this document is flexible and could allow for CRUSH-style algorithms in future.<br />
<br />
===Compact Layouts for Widely Striped Files===<br />
Existing (RAID-0) layout formats use an explicit array of object identifiers to map each stripe index to a specific OST object. When using FIDs alone to identify objects this approach requires 16 bytes per stripe. The current implementation packs an OST index together with the object identifier and needs 24 bytes per stripe. The allocation of memory buffers to transmit, receive, and handle these layouts for very widely striped files (over 160 stripes) can be costly. We will discuss a compact layout based on a bitmap of OST indices which reduces memory consumption for widely striped files by a factor of 192 for the current maximally-striped layout of 2000 stripes.<br />
<br />
===Handling of Large Layouts===<br />
Issues with the handling of large layouts will be discussed and various solutions to these issues will be considered. In particular, for very large layouts it is desirable that the retrieval of the layout be separated from the initial open RPC in order to avoid the need for large request/reply buffer allocation. Instead, a bulk RDMA transfer could be used to fetch the layout from the MDT.<br />
<br />
=Use Cases=<br />
Since the Layout Enhancement Design itself is focused on providing an infrastructure to describe a flexible layout for complex files, the use cases are described in terms of potential uses of the enhanced layouts. In some cases, there are projects underway to implement these use cases, but in other cases these are just potential uses that may be implemented in the future.<br />
<br />
==File data availability==<br />
A user wants to change an existing file to be robust against temporary or permanent OST failure. This requires that all of the file data be stored on more than one OST, so that it can be read if an OST is overloaded or permanently offline due to failure. <br />
<br />
The [[File Level Replication High Level Design|File-Level Redundancy]] (FLR) project will use a composite layout with two or more overlapping extents to keep file data available in the face of OST failure. Because replication is per-file rather than per-OST, files can be replicated selectively on an as-needed basis: for example, 1-in-12 or 1-in-24 application checkpoints could have two-way replication while 1-in-72 checkpoints have three-way replication. This allows users/applications to balance replication and availability needs against space and bandwidth constraints, and is not an all-or-nothing decision.<br />
<br />
In FLR Phase 2 it will be possible to create replicated files from the beginning. In FLR Phase 1 and later it will also be possible to add replication to an existing file.<br />
<br />
User tools as described in the FLR project could use the following layout operations to add redundancy to non-replicated files:<br />
# A file's RAID-0 layout is converted to a composite layout whose sole component is the previous layout.<br />
# A temporary RAID-0 file is created to hold the new replica data (this is not connected to the layout and will be deleted in case of failure).<br />
# The file data is copied to the temporary replica file.<br />
# The new replica is merged into the composite layout as a new component, resulting in a compound layout with overlapping extents.<br />
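The sequence above can be sketched in simplified form. This is an in-memory model only; comp_layout, comp_from_plain, and comp_add_replica are hypothetical names, and the real on-disk format is richer than this:<br />

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory model of the replica-add sequence above. */
#define SKETCH_EXT_EOF     UINT64_MAX
#define SKETCH_MAX_ENTRIES 8

struct sketch_extent { uint64_t e_begin, e_end; };

struct comp_layout {
        unsigned int         entry_count;
        struct sketch_extent extents[SKETCH_MAX_ENTRIES];
};

/* Step 1: wrap an existing RAID-0 layout as the sole component. */
static void comp_from_plain(struct comp_layout *c)
{
        c->entry_count = 1;
        c->extents[0] = (struct sketch_extent){ 0, SKETCH_EXT_EOF };
}

/* Step 4: merge the fully copied replica as an overlapping component. */
static int comp_add_replica(struct comp_layout *c)
{
        if (c->entry_count >= SKETCH_MAX_ENTRIES)
                return -1;
        c->extents[c->entry_count++] =
                (struct sketch_extent){ 0, SKETCH_EXT_EOF };
        return 0;
}

/* Run steps 1 and 4 and report the resulting component count. */
static unsigned int demo_add_one_replica(void)
{
        struct comp_layout c;

        comp_from_plain(&c);
        if (comp_add_replica(&c) != 0)
                return 0;
        return c.entry_count;
}
```

Steps 2 and 3 (creating and filling the temporary replica file) involve ordinary object IO and are omitted here.<br />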
<br />
==File data read performance==<br />
A user wants to have high-performance read access to a file from a large number of clients. This requires that the same data be stored on a large number of OSTs, so that it can be read in parallel at an aggregate bandwidth larger than what is available from a single OSS.<br />
<br />
Similar to File Data Availability, the File-Level Replication project will use a composite layout with overlapping extents. The number of replicated extents can be determined by the required read bandwidth and available OSS nodes. Having a large number of replicas on a file only makes sense for read-only files.<br />
<br />
==Reducing redundancy for old files==<br />
If a user no longer needs a high degree of replication on a file, either because it has been backed up to a separate archive, because a high read bandwidth is no longer required, or because of quota limitations, it is possible to remove one or more replicas from the file.<br />
<br />
User tools as described in the FLR project could use the following layout operations to remove redundancy from replicated files:<br />
#A replicated file is destroyed; each layout component is destroyed.<br />
#One of a replicated file's component replicas is split from the file and destroyed.<br />
<br />
==Improved small file performance==<br />
A user wants to store small files so that they can be read and written efficiently. Using a layout that specifies that the file data is stored on the MDT, as described in the [[Data on MDT]] (DOM) project, allows accessing the file data with fewer disk IO operations and fewer RPCs. The DOM layout allows specifying that the data is stored in the MDT inode.<br />
<br />
==Improved performance for the start of a file==<br />
Some applications need improved performance for the start of the file, for operations such as determining the file type, accessing file metadata, or reading a file index header. This can also be useful in conjunction with HSM, where the start of the file is resident on the MDT but the rest of the file is archived on tape. This can be achieved by using a Data On MDT component with an extent covering the first stripe_size of file data, and an OST-based component with an extent covering [0, ∞) or [stripe_size, ∞), depending on whether the first chunk of data will be replicated to the OST object(s) or not.<br />
<br />
==Increased bandwidth and capacity for larger files==<br />
A user wants to optimize small files with a single OST stripe, and large files with many OST stripes, without having to explicitly manage this on a per-file or per-directory basis. This could be implemented with a compound layout with multiple layouts in non-overlapping extents. As a file grows in size, progressively wider striping is used for file data in order to give the file access to more OST storage capacity and IO bandwidth.<br />
<br />
For example, a new file could be created with a single stripe for the extent [0, 32MB). If a file grows beyond 32MB in size, a new component layout would be created for the extent [32MB, 1GB) with 8 stripes. Should the file grow beyond 1GB in size, a third component layout would be created for the extent [1GB, ∞) that is striped over all of the available OSTs.<br />
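The example boundaries can be expressed as a tiny lookup; stripes_for_offset is an invented helper mapping a file offset to the stripe count of the component whose extent covers it:<br />

```c
#include <assert.h>
#include <stdint.h>

/* Hedged sketch of the example above: extents [0, 32MB), [32MB, 1GB),
 * and [1GB, inf) select 1, 8, or all-OST striping respectively. */
#define SKETCH_MB (1024ULL * 1024)
#define SKETCH_GB (1024 * SKETCH_MB)

static unsigned int stripes_for_offset(uint64_t offset,
                                       unsigned int total_osts)
{
        if (offset < 32 * SKETCH_MB)
                return 1;               /* single stripe */
        if (offset < SKETCH_GB)
                return 8;               /* 8 stripes */
        return total_osts;              /* striped over all OSTs */
}
```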
<br />
==Handling out-of-space on an OST==<br />
If an application is writing to a file it is possible that one of the OSTs runs out of space, while other OSTs have a larger amount of free space (e.g. if new OSTs were added and had much more free space, or if a very large file was created with a single stripe). This could be implemented by converting the existing RAID-0 layout to have an extent from [0, file_size) and then creating a new RAID-0 layout as a separate extent [file_size, ∞) for the end of the file. <br />
<br />
==Transparent migration of files between OSTs==<br />
A file needs to be migrated between two OSTs. This may be needed in order to evacuate an OST that is failing or scheduled for hardware replacement, with a file that is in use by a long-running application.<br />
<br />
The file can be converted from a non-replicated file to a composite file with an extent from [0, ∞). A second component is added with an extent [0, 0) on the target OST(s). The file data can incrementally be copied to the target replica component. As data is copied to the new replica component the new component's extent end is increased to cover the just-copied range of the file, for example [0, 1GB). The old component's extent start is increased to cover a smaller range at the end of the file, for example [1GB, ∞), and its data is punched by the same amount (or the source objects are simply deleted at the end).<br />
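The incremental hand-over can be sketched as follows. After each copied chunk the new component's extent end grows and the old component's extent start shrinks by the same amount, so together they always cover [0, ∞); the names here are invented for illustration:<br />

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the incremental migration step above. */
#define MIG_EOF UINT64_MAX

struct mig_extent { uint64_t begin, end; };

static void mig_advance(struct mig_extent *new_c, struct mig_extent *old_c,
                        uint64_t copied_up_to)
{
        new_c->end   = copied_up_to;    /* e.g. [0, 1GB) after one pass  */
        old_c->begin = copied_up_to;    /* old component now [1GB, inf)  */
}

/* Advance the boundary to 1GB and confirm the components still meet. */
static uint64_t demo_migrate_boundary(void)
{
        struct mig_extent new_c = { 0, 0 };
        struct mig_extent old_c = { 0, MIG_EOF };

        mig_advance(&new_c, &old_c, 1ULL << 30);
        return new_c.end == old_c.begin ? new_c.end : 0;
}
```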
<br />
==Unaligned components==<br />
For applications that have poorly-structured IO, it is possible that the application writes a short file header and then reads or writes the rest of the file in large but unaligned chunks. For example, there may be a 64kB header followed by a series of 1MB chunks read/written at a 64kB offset from the start of the file. This produces poorly-formed IO to the underlying OST RAID LUNs because it is not aligned with the RAID chunks. It would be possible to specify a compound layout with a Data On MDT component for the start of the file, and then a RAID-0 OST component for the rest of the file. Unlike the components specified in other examples, the unaligned IO component would be flagged as starting at the end of the first extent, rather than overlapping the first extent.<br />
<br />
==Algorithmic Layouts==<br />
Use cases for RAID-1/10 are similar to those for replication, except that the former formats are simpler and may be expressed more concisely. For example, the RAID-1 form of a layout designating 4 mirrored objects is small enough to fit in the extra space of a 512-byte MDT inode. This is not true of the analogous composite layout with 4 entries.<br />
<br />
RAID-5 and RAID-6 offer increased data availability in the face of OST failures without the space overhead of full mirroring. While we do not anticipate supporting networked RAID-5/6 in the short term, there are interesting use cases for read-mostly files. Given an existing (non-RAID-5/6) file, a volatile file is opened and given a RAID-5/6 layout, data is copied from the old file to the new, and parity chunks are written. Then the volatile file is merged into the old file, or a layout swap is performed.<br />
<br />
==Compact Layouts==<br />
A widely striped file is to be created in order to achieve maximum IO bandwidth. Transferring a conventional layout for this file requires clients to register tens or hundreds of KB in buffers. Using a compact layout format the file's layout may be transferred with negligible RPC overhead.<br />
<br />
=Solution Proposal=<br />
==Layouts for File-Level Replication and Layout Extents==<br />
Layouts for file-level replication and extent based layouts will be offered through the same underlying layout type, which we call a composite layout. This layout consists of a header (described by struct lov_comp_md_v1 below), an array of component descriptors (described by struct lov_comp_md_entry_v1), and the component layouts (a sequence of blobs that are independent RAID-0/1/5/6/10 layouts of type struct lov_mds_md_v3 or other layout types in the future).<br />
<br />
This design does not support nested composite layouts (i.e. components which are themselves composites) to avoid complexity and recursion in the implementation of layout handling. It is thought that non-nested layouts provide sufficient flexibility for current projects and anticipated future uses.<br />
 struct lu_extent {<br />
         __u64 e_begin;<br />
         __u64 e_end;<br />
 };<br />
<br />
 struct lov_comp_md_entry_v1 {<br />
         __u32 lcme_id;                /* unique identifier of component within composite layout */<br />
         __u32 lcme_flags;             /* LCME_FL_STALE, LCME_FL_OFFLINE, LCME_FL_PREFERRED, ... */<br />
         struct lu_extent lcme_extent; /* file extent for component */<br />
         __u32 lcme_offset;            /* offset of component layout from start of composite layout */<br />
         __u32 lcme_size;              /* size of component layout data in bytes */<br />
         union {<br />
                 __u64 lcme_padding;<br />
         } u;<br />
 };<br />
<br />
 struct lov_comp_md_v1 {<br />
         __u32 lcm_magic;              /* LOV_MAGIC_COMP_V1 */<br />
         __u32 lcm_size;               /* overall size including this struct */<br />
         __u32 lcm_layout_gen;         /* incremented each time layout is modified */<br />
         __u16 lcm_flags;              /* LCM_FL_RS_READ_ONLY, LCM_FL_RS_SYNC_PENDING, ... */<br />
         __u16 lcm_entry_count;        /* number of components in lcm_entries[] */<br />
         union {<br />
                 __u64 lcm_padding[2];<br />
         } u;<br />
         struct lov_comp_md_entry_v1 lcm_entries[0];<br />
 };<br />
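A common operation on such a layout is finding the component whose extent covers a given file offset. The sketch below uses simplified stand-in types that mirror lu_extent and lov_comp_md_entry_v1 but are not the on-disk structures, and comp_find_entry is a hypothetical helper:<br />

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of component lookup: scan entries for the first extent
 * covering a file offset. Stand-in types, not the on-disk format. */
struct sk_extent { uint64_t e_begin, e_end; };

struct sk_entry {
        uint32_t         id;
        struct sk_extent extent;
};

static int comp_find_entry(const struct sk_entry *ents, unsigned int count,
                           uint64_t offset)
{
        unsigned int i;

        for (i = 0; i < count; i++)
                if (offset >= ents[i].extent.e_begin &&
                    offset <  ents[i].extent.e_end)
                        return (int)i;  /* first match; replicas overlap */
        return -1;                      /* no component covers offset */
}

/* Two-component example: [0, 32MB) followed by [32MB, inf). */
static int demo_find(uint64_t offset)
{
        static const struct sk_entry ents[] = {
                { 1, {                0, 32u * 1024 * 1024 } },
                { 2, { 32u * 1024 * 1024, UINT64_MAX       } },
        };

        return comp_find_entry(ents, 2, offset);
}
```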
<br />
Each component has an extent which describes the range of the file to which the layout applies. The extent does not necessarily need to cover the full file range. Any extents which are overlapping other extents are replicated, and are expected to contain the same file data at the same logical offset. A replicated (overlapping) extent may be marked UPDATING if it is currently being updated asynchronously by a client (see [[File-Level Replication Solution Architecture]] for more details). A replicated extent may be marked STALE if there was a hard failure updating the data of that extent to match the primary replica.<br />
<br />
The mechanism for maintaining and resynchronizing replicas is beyond the scope of this document. However, it should be mentioned that it is desirable to keep STALE replicas attached to a file rather than removing them immediately upon OST failure. The number of updates needed to resynchronize a STALE replica may be minimal if it is offline for only a short time. This may also allow recovery of an old version of the file from a STALE replica if the primary replica suffers a fatal error. <br />
<br />
In this design we do not assume that the component extents in a composite layout all start at byte 0 of the file. The extents may each form a non-overlapping subset of [0, ∞), or they may all start at file offset 0, or there may be some other overlap. However, we should not use the component extent start as an offset when accessing the component objects. That is, if a component has a single object O and extent [s, ∞), then the file byte at position p should be found at offset p of O, not at p - s. In this way an extent with a non-zero start can be converted to one which starts at 0. Similarly, assuming the file data is safely mirrored to another component, a component whose extent starts at zero can be figuratively punched to have a positive start without remapping the objects, followed by a punch of the corresponding object data.<br />
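The absolute-offset rule above reduces to a deliberately trivial mapping (object_offset is an invented name for illustration):<br />

```c
#include <assert.h>
#include <stdint.h>

/* Tiny illustration of the addressing rule: object offsets are
 * absolute file offsets, so a component with extent [s, inf) stores
 * file byte p at object offset p, never p - s, and changing the
 * extent start requires no data remapping. */
static uint64_t object_offset(uint64_t file_offset, uint64_t extent_start)
{
        (void)extent_start;  /* deliberately ignored: offsets are absolute */
        return file_offset;
}
```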
<br />
The ability to pack independent file layouts as components of a larger composite layout provides a great deal of flexibility, while isolating the complexity of the individual layouts. By allowing both overlapping and non-overlapping extents to be specified, it is possible to construct file layouts for almost any use case. The ability to add different component layouts in the future (e.g. CRUSH, RAID-5/6) will allow flexibility without requiring changes in the core layout infrastructure.<br />
<br />
For quota accounting of files with compound layouts, each component is treated in the same way as a separate file with the same contents. Adding a replica to a file will increase the quota usage of a user, and removing a replica will decrease the quota usage. For files with Data-on-MDT, the space usage of the component on the MDT will be accounted separately from the space on the OSTs. With [[Project Quotas]] (a separate project, see [https://jira.hpdd.intel.com/browse/LU-4017 LU-4017]) it would be possible in the future to separately administer the quota space available to users on the MDT. This allows and encourages users to pick the availability and performance characteristics that suit their needs best.<br />
<br />
It is anticipated that replication requirements can also be managed by an external policy engine such as [https://github.com/cea-hpc/robinhood/wiki RobinHood], to add or remove replicas to files, to migrate small files to the MDT, or to create or migrate replicas to different tiers of storage.<br />
<br />
Potentially complex applications are possible in the future with integration into userspace libraries/applications, such as tailoring file IO characteristics differently for disjoint parts of a single large file. For example, an HDF5 file could use a high-IOPS OST pool for components with extents covering an index or small IOs, replicated (overlapping) extents for important metadata, and large streaming components for extents with well-formed IO.<br />
<br />
==Operations on Composite Layouts==<br />
Several kinds of operations are needed to manipulate simple and composite layouts:<br />
#A file with a simple layout is converted to a composite layout whose sole component is the previous layout.<br />
#A file with simple RAID-0 layout is merged into an existing compound file.<br />
#A replica (composite layout component) of a file is split out to a new file with only this component as its layout.<br />
#A replica (composite layout component) of a file is split out of the compound layout and is destroyed.<br />
#A replicated file is destroyed; each layout component is destroyed.<br />
#From a composite layout, the component with a given id is opened.<br />
#From a composite layout, the component with a given id has its STALE state set or cleared.<br />
#A composite layout is repacked after removal of a component.<br />
<br />
==Algorithmic Layouts==<br />
#A new RAID-1/5/6/10 file is created.<br />
#Two suitably striped RAID-0 files are merged into a RAID-1/10 file.<br />
#A suitably striped RAID-0 file is merged into an existing RAID-1/10 file.<br />
#A specified mirror in a RAID-1/10 file is split off into a new file without an assigned layout.<br />
#A volatile (open unlinked) file with RAID-5/6 layout is created and written with a copy of an existing file's data. The volatile file's layout is merged as a component of the original file. The RAID-5/6 component can be read.<br />
<br />
==Compact Layouts==<br />
#A file is to be striped over a large number of OSTs, say 512 or more. An ordinary RAID-0 layout for the file would be tens or hundreds of KB in size. Instead of explicitly specifying the FID of each OST object, a bitmap of OST indices is stored along with enough data that, for each OST index set in the bitmap, a corresponding OST object FID may be computed.<br />
#Compact layouts must also include a starting index (or bias) to ensure balanced OST use.<br />
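The derivation of object identifiers from a bitmap can be sketched as follows. This is entirely hypothetical: compact_layout, bm_test, and compact_object_id are invented, and the real derivation of OST object FIDs would involve the FID sequence allocation scheme, not a simple base value:<br />

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical compact-layout sketch: a bitmap of OST indices plus a
 * base object id; the id for each set index is derived, not stored. */
struct compact_layout {
        uint64_t base_oid;      /* assumed: derived object ids start here */
        uint32_t start_index;   /* bias to balance OST use */
        uint8_t  bitmap[256];   /* one bit per OST index (up to 2048) */
};

static int bm_test(const struct compact_layout *cl, unsigned int ost_idx)
{
        return (cl->bitmap[ost_idx / 8] >> (ost_idx % 8)) & 1;
}

/* OST index -> object id: count set bits below this index to find the
 * stripe number, then offset from the base object id. */
static uint64_t compact_object_id(const struct compact_layout *cl,
                                  unsigned int ost_idx)
{
        uint64_t stripe = 0;
        unsigned int i;

        for (i = 0; i < ost_idx; i++)
                stripe += bm_test(cl, i);
        return cl->base_oid + stripe;
}

/* Example: OSTs 1 and 3 selected, base object id 100. */
static uint64_t demo_compact(void)
{
        struct compact_layout cl = { .base_oid = 100 };

        cl.bitmap[0] = (1 << 1) | (1 << 3);
        return compact_object_id(&cl, 3);   /* OST 3 holds stripe 1 */
}
```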
<br />
=Unit/Integration Test Plan=<br />
==Composite Layouts==<br />
#Create, store, load, and destroy empty composite layout on a file with no assigned layout.<br />
#Convert simple file layout to singleton composite layout.<br />
#Convert singleton composite layout to simple layout.<br />
#Merge simple file layout to existing composite layout.<br />
#Split component of composite layout to existing file without layout.<br />
#Move component between the composite layouts of existing files.<br />
#Get component layout with a given id from composite layout and validate.<br />
<br />
[[Category:Architecture]]<br />
[[Category:Design]]<br />
[[Category:Layout]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Layout_Enhancement_Solution_Architecture&diff=2691Layout Enhancement Solution Architecture2017-08-25T18:11:58Z<p>Nrutman: /* Improved small file performance */</p>
<hr />
<div>=Introduction=<br />
In the Lustre file system the data of a file is striped over one or more objects each residing on an OST. The layout of a file is an attribute of the file which describes the mapping of file data ranges to object data ranges. The layout is stored on the MDT as a trusted extended attribute (`trusted.lov`) of the file and is sent to clients as needed. The layout of a file is often simply referred to as its striping, since in the current (2.5) implementation of Lustre only non-redundant striped (RAID-0) layouts are permitted. This project will design enhancements to the representation and handling of layouts to support features such as [[Data on MDT]] (DoM), [[File Level Replication High Level Design|File-Level Replication]] (FLR), live data migration (via File-Level Replication), and RAID-1/5/6 or erasure coding. The [[Layout Enhancement]] (LE) project is therefore a prerequisite for the Data on MDT and File-Level Replication projects, described in separate documents.<br />
<br />
=Solution Requirements=<br />
There are several demands on the Layout Enhancement project, each with its own set of requirements.<br />
<br />
==Layouts for File-Level Replication==<br />
File-level replication will use mirroring to offer increased availability and resilience of data on a Lustre file system. File-Level Replication is discussed in more detail in the [[File Level Replication High Level Design]]. The implementation of File-Level Replication requires a new composite layout type that comprises several simple (non-composite) layouts designating mirrors of the file data. It generalizes RAID0+1 to allow heterogeneous stripe sizes and counts among mirrors and to allow for the possibility of partial mirroring schemes, potentially with each replica on different OST storage pools with different performance and capacity characteristics. The design will specify the interaction of layouts for file-level replication with existing code that interprets or manipulates layouts (HSM, LOD, LOV, utilities, layout swapping, ...). The figure below depicts a 3-way replicated file F with components C0, C1, and C2 having 1, 3 and 8 stripes respectively.<br />
<br />
[[Image:Layout_with_three_components.png]]<br />
<br />
===Extent Based Layouts===<br />
Extent based layouts permit different layouts to be used in different extents of a file. They may be used to set progressively wider striping as a file grows in size, to prevent inconsistent out-of-space errors as individual OSTs become full, and to enable incremental migration, replication, and HSM restore. The figure below depicts a file with four different layouts (C0 through C3) each with its own extent to cover both overlapping and non-overlapping parts of the file data.<br />
<br />
[[Image:Layout_with_overlapping_components.png]]<br />
<br />
===RAID-1/5/6/10 and Other Algorithmic Layouts===<br />
These anticipate future developments to enable more space-efficient replication techniques. The design will discuss how these layouts can be expressed with simple extensions to the existing layout code. While RAID-1 and RAID-10 layouts offer degrees of replication, they are not to be confused with the layouts for file-level replication described above. They are fixed layouts similar to the existing RAID-0 layouts currently used by Lustre. Compared to composite layouts they are simpler and more compact, but they are also less expressive and less flexible. RAID-1 and RAID-10 layouts may be converted to composite layouts. Composite layouts whose components all have the same stripe size and stripe count may be expressed as RAID-1 and RAID-10 layouts. Full read/write support for files with RAID-3/5/6 layouts is beyond the scope of this design. Instead we will briefly outline an "offline parity" access mode for files with these layout types.<br />
<br />
===Layouts Based on the CRUSH Algorithm===<br />
After investigating the CRUSH layout model it has been determined that the complexity of implementing a CRUSH-based solution within the current Lustre code is beyond the scope of this effort. Potential layouts could be proposed but we are not confident if they would be correct or complete by the time of implementation. We are confident however, that the layout proposed in this document is flexible and could allow for CRUSH-style algorithms in future.<br />
<br />
===Compact Layouts for Widely Striped Files===<br />
Existing (RAID-0) layout formats use an explicit array of object identifiers to map each stripe index to a specific OST object. When using FIDs alone to identify objects this approach requires 16 bytes per stripe. The current implementation packs an OST index together with the object identifier and needs 24 bytes per stripe. The allocation of memory buffers to transmit, receive, and handle these layouts for very widely striped files (over 160 stripes) can be costly. We will discuss a compact layout based on a bitmap of OST indices which reduces memory consumption for widely stripe files by a factor of 192 for the current maximally-striped layout of 2000 stripes.<br />
<br />
===Handling of Large Layouts===<br />
Issues with the handling of large layouts will be discussed and various solutions to these issues will be considered. In particular, for very large layouts it is desirable that the retrieval of the layout be separated from the initial open RPC in order to avoid the need for large request/reply buffer allocation. Instead, a bulk RDMA transfer could be used to fetch the layout from the MDT.<br />
<br />
=Use Cases=<br />
Since the Layout Enhancement Design itself is focused on providing an infrastructure to describe a flexible layout for complex files, the use cases are described in terms of potential uses of the enhanced layouts. In some cases, there are projects underway to implement these use cases, but in other cases these are just potential uses that may be implemented in the future.<br />
<br />
==File data availability==<br />
A user wants to change an existing file to be robust against temporary or permanent OST failure. This requires that all of the file data be stored on more than one OST, so that it can be read if an OST is overloaded or permanently offline due to failure. <br />
<br />
The [[File Level Replication High Level Design|File-Level Redundancy]] (FLR) project will use a composite layout with two or more overlapping extents to keep file data available in the face of OST failure. Due to the use of per-file layouts rather than per-OST replication, it is possible to selectively replicate files on an as-needed basis, such as 1-in-12 or 1-in-24 application checkpoints would have two-way replication and 1-in-72 checkpoints would have three-way replication. This allows users/applications to balance the replication and availability needs against space and bandwidth constraints, and is not an all-or-nothing decision.<br />
<br />
In the FLR Phase 2 it will be possible to create replicated files from the beginning. In FLR Phase 1 and later it will also be able to add replication to an existing ated file.<br />
<br />
User tools as described in the FLR project could use the following layout operations to add redundancy to non-replicated files:<br />
# A file's RAID-0 layout is converted to a composite layout whose sole component is the previous layout.<br />
# A temporary RAID-0 file is created to hold the new replica data (this is not connected to the layout and will be deleted in case of failure).<br />
# The file data is copied to the temporary replica file.<br />
# The new replica is merged into the composite layout as a new component, resulting in a compound layout with overlapping extents.<br />
<br />
==File data read performance==<br />
A user wants to have high-performance read access to a file from a large number of clients. This requires that the same data be stored on a large number of OSTs, so that it can be read in parallel at an aggregate bandwidth larger than what is available from a single OSS.<br />
<br />
Similar to File Data Availability, the File-Level Replication project will use a composite layout with overlapping extents. The number of replicated extents can be determined by the required read bandwidth and available OSS nodes. Having a large number of replicas on a file only makes sense for read-only files.<br />
<br />
==Reducing redundancy for old files==<br />
If a user no longer needs a high degree of replication on a file, either because has been backed up to a separate archive, because a high read bandwidth is no longer required, or because of quota limitations, it is possible to remove one or more replicas from a file.<br />
<br />
User tools as described in the FLR project could use the following layout operations to remove redundancy from replicated files:<br />
#A replicated file is destroyed, each layout component is destroyed.<br />
#A replicated file has one of the component replicas split from the file and it is destroyed.<br />
<br />
==Improved small file performance==<br />
A user want to store small files so that they can be read/written efficiently. Using a layout that specifies that the file data is stored on the MDT, as described in the [[Data On MDT]] (DOM) allows accessing the file data with fewer disk IO operations and fewer RPCs. The DOM layout allows specifying that the data is stored in the MDT inode.<br />
<br />
==Improved performance for the start of a file==<br />
Some applications need improved performance to the start of the file, for operations such as determining the file type, accessing file metadata, or accessing a file index header, etc. This can also be useful in conjunction with HSM, where the start of the file is resident on the MDT, but the rest of the file is archived on tape. This can be achieved by using a Data On MDT component with an extent covering the first stripe_size of file data, and having an OST-based component with an extent covering the extent [0, ∞) or [stripe_size, ∞), depending whether the first chunk of data will be replicated to the OST object(s) or not.<br />
<br />
==Increased bandwidth and capacity for larger files==<br />
A user wants to optimize small files with a single OST stripe, and large files with many OST stripes, without having to explicitly manage this on a per-file or per-directory basis. This could be implemented with a compound layout with multiple layouts in non-overlapping extents. As a file grows in size, progressively wider striping is used for file data in order to give the file access to more OST storage capacity and IO bandwidth.<br />
<br />
For example, a new file could be created with a single stripe for the extent [0, 32MB). If a file grows beyond 32MB in size, a new component layout would be created for the extent [32MB, 1GB) with 8 stripes. Should the file grow beyond 1GB in size, a third component layout would be created for the extent [1GB, ∞) that is striped over all of the available OSTs.<br />
<br />
==Handling out-of-space on an OST==<br />
If an application is writing to a file it is possible that one of the OSTs runs out of space, while other OSTs have a larger amount of free space (e.g. if new OSTs were added and had much more free space, or if a very large file was created with a single stripe). This could be implemented by converting the existing RAID-0 layout to have an extent from [0, file_size) and then creating a new RAID-0 layout as a separate extent [file_size, ∞) for the end of the file. <br />
<br />
==Transparent migration of files between OSTs==<br />
A file needs to be migrated between two OSTs. This may be needed in order to evacuate an OST that is failing or scheduled for hardware replacement, with a file that is in use by a long-running application.<br />
<br />
The file can be converted from a non-replicated file to a composite file with an extent from [0, ∞). A second component is added with an extent [0, 0) on the target OST(s). The file data can incrementally be copied to the target replica component. As data is copied to the new replica component the new component's extent end is increased to cover the just-copied range of the file, for example [0, 1GB). The old component's extent start is increased to cover a smaller range at the end of the file, for example [1GB, ∞), and its data is punched by the same amount (or the source objects are simply deleted at the end).<br />
<br />
==Unaligned components==<br />
For applications that have poorly-structured IO, it is possible that the application writes a short file header, and then reads or writes large-but-unaligned chunks of the file with large requests. For example, if there is a 64kB header, and then a series of 1MB chunks read/written with a 64kB offset from the start of the file. This produces poorly-formed IO to the underlying OST RAID LUNs because it is not aligned with the RAID chunks. It would be possible to specify a compound layout with a Data On MDT component for the start of the file, and then a RAID-0 OST component for the rest of the file. Unlike the components specified in other examples, the unaligned IO component would be flagged to be starting at the end of the first extent, rather than overlapping the first extent.<br />
<br />
==Algorithmic Layouts==<br />
Use cases for RAID-1/10 are similar to those for replication, except that the former formats are simpler and may be expressed more concisely. For example the RAID-1 form of a layout designating 4 mirrored objects (RAID-1) is small enough to fit in the extra space of a 512 byte MDT inode. This is not true of the analogous composite layout with 4 entries.<br />
<br />
RAID-5 and 6 offer increased data availability in the face of OST failures but not at the expense of full mirroring. While we do not anticipate supporting networked RAID-5/6 in the short term, there are interesting use cases for read mostly files. Given an existing (non RAID-5/6) file, a volatile file is opened and given a RAID-5/6 layout, data is copied from the old file to the new, and parity chunks are written. Then the volatile file is merged in to the old file or layout swap is performed.<br />
<br />
==Compact Layouts==<br />
A widely striped file is to be created in order to achieve maximum IO bandwidth. Transferring a conventional layout for this file requires clients to register tens or hundreds of KB in buffers. Using a compact layout format the file's layout may be transferred with negligible RPC overhead.<br />
<br />
=Solution Proposal=<br />
==Layouts for File-Level Replication and Layout Extents==<br />
Layouts for file-level replication and extent based layouts will be offered through the same underlying layout type, which we call a composite layout. This layout consists of a header (described by struct lov_comp_md_v1 below), an array of component descriptors (described by struct lov_comp_md_entry_v1), and the component layouts (a sequence of blobs that are independent RAID-0/1/5/6/10 layouts of type struct lov_mds_md_v3 or other layout types in the future).<br />
<br />
This design does not support nested composite layouts (i.e. components which are themselves composites) to avoid complexity and recursion in the implementation of layout handling. It is thought that non-nested layouts provide sufficient flexibility for current projects and anticipated future uses.<br />
 struct lu_extent {<br />
     __u64 e_begin;<br />
     __u64 e_end;<br />
 };<br />
<br />
 struct lov_comp_md_entry_v1 {<br />
     __u32 lcme_id;                /* unique identifier of component within composite layout */<br />
     __u32 lcme_flags;             /* LCME_FL_STALE, LCME_FL_OFFLINE, LCME_FL_PREFERRED, ... */<br />
     struct lu_extent lcme_extent; /* file extent for component */<br />
     __u32 lcme_offset;            /* offset of component layout from start of composite layout */<br />
     __u32 lcme_size;              /* size of component layout data in bytes */<br />
     union {<br />
         __u64 lcme_padding;<br />
     } u;<br />
 };<br />
<br />
 struct lov_comp_md_v1 {<br />
     __u32 lcm_magic;              /* LOV_MAGIC_COMP_V1 */<br />
     __u32 lcm_size;               /* overall size including this struct */<br />
     __u32 lcm_layout_gen;         /* incremented each time layout is modified */<br />
     __u16 lcm_flags;              /* LCM_FL_RS_READ_ONLY, LCM_FL_RS_SYNC_PENDING, ... */<br />
     __u16 lcm_entry_count;        /* number of components in lcm_entries[] */<br />
     union {<br />
         __u64 lcm_padding[2];<br />
     } u;<br />
     struct lov_comp_md_entry_v1 lcm_entries[0];<br />
 };<br />
<br />
Each component has an extent which describes the range of the file to which the layout applies. The extent does not necessarily need to cover the full file range. Any extents that overlap other extents are replicated, and are expected to contain the same file data at the same logical offset. A replicated (overlapping) extent may be marked UPDATING if it is currently being updated asynchronously by a client (see [[File-Level Replication Solution Architecture]] for more details). A replicated extent may be marked STALE if there was a hard failure while updating the data of that extent to match the primary replica.<br />
<br />
The mechanism for maintaining and resynchronizing replicas is beyond the scope of this document. However, it should be mentioned that it is desirable to keep STALE replicas attached to a file rather than removing them immediately upon OST failure. The number of updates needed to resynchronize a STALE replica may be minimal if it is offline for only a short time. This may also allow recovery of an old version of the file from a STALE replica if the primary replica suffers a fatal error. <br />
<br />
In this design we normally assume that the component extents in a composite layout start at byte 0 of the file. The extents may each form a non-overlapping subset of [0, &infin;), or they may all start at file offset 0, or there may be some other overlap. However, the component extent start should not be used as an offset when accessing the component objects. That is, if a component has a single object O and extent [s, &infin;) then the file byte at position p should be found at offset p of O and not at p - s. In this way an extent with a non-zero start can be converted to one which starts at 0. Similarly, assuming the file data is safely mirrored to another component, a component whose extent starts at zero can be figuratively punched to have some positive start without remapping the objects, followed by a punch of the corresponding objects' data.<br />
<br />
The ability to pack independent file layouts as components of a larger composite layout provides a great deal of flexibility, while isolating the complexity of the individual layouts. By allowing both overlapping and non-overlapping extents to be specified, it is possible to construct file layouts for almost any use case. The ability to add different component layouts in the future (e.g. CRUSH, RAID-5/6) will allow flexibility without requiring changes in the core layout infrastructure.<br />
<br />
For quota accounting of files with compound layouts, each component is treated in the same way as a separate file with the same contents. Adding a replica to a file will increase the quota usage of a user, and removing a replica will decrease the quota usage. For files with Data-on-MDT, the space usage of the component on the MDT will be accounted separately from the space on the OSTs. With [[Project Quotas]] (a separate project, see [https://jira.hpdd.intel.com/browse/LU-4017 LU-4017]) it would be possible in the future to separately administer the quota space available to users on the MDT. This allows and encourages users to pick the availability and performance characteristics that suit their needs best.<br />
<br />
It is anticipated that replication requirements can also be managed by an external policy engine such as [https://github.com/cea-hpc/robinhood/wiki RobinHood], to add or remove replicas to files, to migrate small files to the MDT, or to create or migrate replicas to different tiers of storage.<br />
<br />
Potentially complex applications are possible in the future with integration into userspace libraries/applications, such as tailoring file IO characteristics differently for disjoint parts of a single large file. For example, an HDF5 file could use a high-IOPS OST pool for components with extents covering an index or small IOs, replicated (overlapping) extents for important metadata, and large streaming components for extents with well-formed IO.<br />
<br />
==Operations on Composite Layouts==<br />
Several kinds of operations are needed to manipulate simple and composite file layouts:<br />
#A file with a simple layout is converted to a composite layout whose sole component is the previous layout.<br />
#A file with simple RAID-0 layout is merged into an existing compound file.<br />
#A replica (composite layout component) of a file is split out to a new file with only this component as its layout.<br />
#A replica (composite layout component) of a file is split out of the compound layout and is destroyed.<br />
#A replicated file is destroyed, each layout component is destroyed.<br />
#From a composite layout, the component with a given id is opened.<br />
#From a composite layout, the STALE state is set or cleared on the component with a given id.<br />
#A composite layout is repacked after removal of a component.<br />
<br />
==Algorithmic Layouts==<br />
#A new RAID-1/5/6/10 file is created.<br />
#Two suitably striped RAID-0 files are merged into a RAID-1/10 file.<br />
#A suitably striped RAID-0 file is merged into an existing RAID-1/10 file.<br />
#A specified mirror in a RAID-1/10 file is split off into a new file without an assigned layout.<br />
#A volatile (open unlinked) file with RAID-5/6 layout is created and written with a copy of an existing file's data. The volatile file's layout is merged as a component of the original file. The RAID-5/6 component can be read.<br />
<br />
==Compact Layouts==<br />
#A file is to be striped over a large number of OSTs, say 512 or more. An ordinary RAID0 layout for the file would be tens or hundreds of KB in size. Instead of explicitly specifying the FID of each OST object, a bitmap of OST indices is stored along with enough data that for each OST index set in the bitmap, a corresponding OST object FID may be computed.<br />
#Compact layouts must also include a starting index (or bias) to ensure balanced OST use.<br />
<br />
=Unit/Integration Test Plan=<br />
==Composite Layouts==<br />
#Create, store, load, and destroy empty composite layout on a file with no assigned layout.<br />
#Convert simple file layout to singleton composite layout.<br />
#Convert singleton composite layout to simple layout.<br />
#Merge simple file layout to existing composite layout.<br />
#Split component of composite layout to existing file without layout.<br />
#Move component between the composite layouts of existing files.<br />
#Get component layout with a given id from composite layout and validate.<br />
<br />
[[Category:Architecture]]<br />
[[Category:Design]]<br />
[[Category:Layout]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Layout_Enhancement_Solution_Architecture&diff=2690Layout Enhancement Solution Architecture2017-08-25T18:02:36Z<p>Nrutman: /* Introduction */</p>
<hr />
<div>=Introduction=<br />
In the Lustre file system the data of a file is striped over one or more objects each residing on an OST. The layout of a file is an attribute of the file which describes the mapping of file data ranges to object data ranges. The layout is stored on the MDT as a trusted extended attribute (`trusted.lov`) of the file and is sent to clients as needed. The layout of a file is often simply referred to as its striping, since in the current (2.5) implementation of Lustre only non-redundant striped (RAID0) layouts are permitted. This project will design enhancements to the representation and handling of layouts to support features such as [[Data on MDT]] (DoM), [[File Level Replication High Level Design|File-Level Replication]] (FLR), live data migration (via File-Level Replication), and RAID1/5/6 or erasure coding. The [[Layout Enhancement]] (LE) project is therefore a prerequisite for the Data on MDT and File-Level Replication projects, described in separate documents.<br />
<br />
=Solution Requirements=<br />
There are several demands on the Layout Enhancement project, each with its own set of requirements.<br />
<br />
==Layouts for File-Level Replication==<br />
File-level replication will use mirroring to offer increased availability and resilience of data on a Lustre file system. File-Level Replication is discussed in more detail in the [[File Level Replication High Level Design]]. The implementation of File-Level Replication requires a new composite layout type that comprises several simple (non-composite) layouts designating mirrors of the file data. It generalizes RAID0+1 to allow heterogeneous stripe sizes and counts among mirrors and to allow for the possibility of partial mirroring schemes, potentially with each replica on different OST storage pools with different performance and capacity characteristics. The design will specify the interaction of layouts for file-level replication with existing code that interprets or manipulates layouts (HSM, LOD, LOV, utilities, layout swapping, ...). The figure below depicts a 3-way replicated file F with components C0, C1, and C2 having 1, 3 and 8 stripes respectively.<br />
<br />
[[Image:Layout_with_three_components.png]]<br />
<br />
===Extent Based Layouts===<br />
Extent based layouts permit different layouts to be used in different extents of a file. They may be used to set progressively wider striping as a file grows in size, to prevent inconsistent out of space errors as individual OSTs become full, and to enable incremental migration, replication, and HSM restore. The figure below depicts a file with four different layouts (C0 through C3) each with its own extent to cover both overlapping and non-overlapping parts of the file data.<br />
<br />
[[Image:Layout_with_overlapping_components.png]]<br />
<br />
===RAID-1/5/6/10 and Other Algorithmic Layouts===<br />
These anticipate future developments to enable more space-efficient replication techniques. The design will discuss how these layouts can be expressed with simple extensions to the existing layout code. While RAID-1 and RAID-10 layouts offer degrees of replication they are not to be confused with the layouts for file-level replication described above. They are fixed layouts similar to the existing RAID-0 layouts currently used by Lustre. Compared to composite layouts they are simpler and more compact but they are also less expressive and less flexible. RAID-1 and RAID-10 layouts may be converted to composite layouts. Composite layouts whose components all have the same stripe size and stripe count may be expressed as RAID-1 and RAID-10 layouts. Full read/write support for files with RAID-3/5/6 layouts is beyond the scope of this design. Instead we will briefly outline an "offline parity" access mode for files with these layout types.<br />
<br />
===Layouts Based on the CRUSH Algorithm===<br />
After investigating the CRUSH layout model, it has been determined that the complexity of implementing a CRUSH-based solution within the current Lustre code is beyond the scope of this effort. Potential layouts could be proposed, but we are not confident that they would be correct or complete by the time of implementation. We are confident, however, that the layout proposed in this document is flexible and could allow for CRUSH-style algorithms in the future.<br />
<br />
===Compact Layouts for Widely Striped Files===<br />
Existing (RAID-0) layout formats use an explicit array of object identifiers to map each stripe index to a specific OST object. When using FIDs alone to identify objects this approach requires 16 bytes per stripe. The current implementation packs an OST index together with the object identifier and needs 24 bytes per stripe. The allocation of memory buffers to transmit, receive, and handle these layouts for very widely striped files (over 160 stripes) can be costly. We will discuss a compact layout based on a bitmap of OST indices which reduces memory consumption for widely striped files by a factor of 192 for the current maximally-striped layout of 2000 stripes.<br />
<br />
===Handling of Large Layouts===<br />
Issues with the handling of large layouts will be discussed and various solutions to these issues will be considered. In particular, for very large layouts it is desirable that the retrieval of the layout be separated from the initial open RPC in order to avoid the need for large request/reply buffer allocation. Instead, a bulk RDMA transfer could be used to fetch the layout from the MDT.<br />
<br />
=Use Cases=<br />
Since the Layout Enhancement Design itself is focused on providing an infrastructure to describe a flexible layout for complex files, the use cases are described in terms of potential uses of the enhanced layouts. In some cases, there are projects underway to implement these use cases, but in other cases these are just potential uses that may be implemented in the future.<br />
<br />
==File data availability==<br />
A user wants to change an existing file to be robust against temporary or permanent OST failure. This requires that all of the file data be stored on more than one OST, so that it can be read if an OST is overloaded or permanently offline due to failure. <br />
<br />
The [[File Level Replication High Level Design|File-Level Redundancy]] (FLR) project will use a composite layout with two or more overlapping extents to keep file data available in the face of OST failure. Due to the use of per-file layouts rather than per-OST replication, it is possible to selectively replicate files on an as-needed basis; for example, 1-in-12 or 1-in-24 application checkpoints could have two-way replication while 1-in-72 checkpoints have three-way replication. This allows users/applications to balance replication and availability needs against space and bandwidth constraints, and is not an all-or-nothing decision.<br />
<br />
In FLR Phase 2 it will be possible to create replicated files from the beginning. In FLR Phase 1 and later it will also be possible to add replication to an existing non-replicated file.<br />
<br />
User tools as described in the FLR project could use the following layout operations to add redundancy to non-replicated files:<br />
# A file's RAID-0 layout is converted to a composite layout whose sole component is the previous layout.<br />
# A temporary RAID-0 file is created to hold the new replica data (this is not connected to the layout and will be deleted in case of failure).<br />
# The file data is copied to the temporary replica file.<br />
# The new replica is merged into the composite layout as a new component, resulting in a compound layout with overlapping extents.<br />
<br />
==File data read performance==<br />
A user wants to have high-performance read access to a file from a large number of clients. This requires that the same data be stored on a large number of OSTs, so that it can be read in parallel at an aggregate bandwidth larger than what is available from a single OSS.<br />
<br />
Similar to File Data Availability, the File-Level Replication project will use a composite layout with overlapping extents. The number of replicated extents can be determined by the required read bandwidth and available OSS nodes. Having a large number of replicas on a file only makes sense for read-only files.<br />
<br />
==Reducing redundancy for old files==<br />
If a user no longer needs a high degree of replication on a file, either because it has been backed up to a separate archive, because high read bandwidth is no longer required, or because of quota limitations, it is possible to remove one or more replicas from the file.<br />
<br />
User tools as described in the FLR project could use the following layout operations to remove redundancy from replicated files:<br />
#A replicated file is destroyed, each layout component is destroyed.<br />
#A replicated file has one of the component replicas split from the file and it is destroyed.<br />
<br />
==Improved small file performance==<br />
A user wants to store small files so that they can be read/written efficiently. Using a layout that specifies that the file data is stored on the MDT, as described in the [[Data On MDT Solution Architecture]] (DOM), allows accessing the file data with fewer disk IO operations and fewer RPCs. The DOM layout allows specifying that the data is stored in the MDT inode.<br />
<br />
==Improved performance for the start of a file==<br />
Some applications need improved performance for the start of the file, for operations such as determining the file type, accessing file metadata, or reading a file index header. This can also be useful in conjunction with HSM, where the start of the file remains resident on the MDT while the rest of the file is archived on tape. This can be achieved by using a Data On MDT component with an extent covering the first stripe_size of file data, and an OST-based component with an extent covering [0, &infin;) or [stripe_size, &infin;), depending on whether the first chunk of data will be replicated to the OST object(s) or not.<br />
<br />
==Increased bandwidth and capacity for larger files==<br />
A user wants to optimize small files with a single OST stripe, and large files with many OST stripes, without having to explicitly manage this on a per-file or per-directory basis. This could be implemented with a compound layout with multiple layouts in non-overlapping extents. As a file grows in size, progressively wider striping is used for file data in order to give the file access to more OST storage capacity and IO bandwidth.<br />
<br />
For example, a new file could be created with a single stripe for the extent [0, 32MB). If a file grows beyond 32MB in size, a new component layout would be created for the extent [32MB, 1GB) with 8 stripes. Should the file grow beyond 1GB in size, a third component layout would be created for the extent [1GB, ∞) that is striped over all of the available OSTs.<br />
<br />
==Handling out-of-space on an OST==<br />
If an application is writing to a file it is possible that one of the OSTs runs out of space, while other OSTs have a larger amount of free space (e.g. if new OSTs were added and had much more free space, or if a very large file was created with a single stripe). This could be implemented by converting the existing RAID-0 layout to have an extent from [0, file_size) and then creating a new RAID-0 layout as a separate extent [file_size, ∞) for the end of the file. <br />
<br />
==Transparent migration of files between OSTs==<br />
A file needs to be migrated between two OSTs. This may be needed in order to evacuate an OST that is failing or scheduled for hardware replacement, with a file that is in use by a long-running application.<br />
<br />
The file can be converted from a non-replicated file to a composite file with an extent from [0, ∞). A second component is added with an extent [0, 0) on the target OST(s). The file data can incrementally be copied to the target replica component. As data is copied to the new replica component the new component's extent end is increased to cover the just-copied range of the file, for example [0, 1GB). The old component's extent start is increased to cover a smaller range at the end of the file, for example [1GB, ∞), and its data is punched by the same amount (or the source objects are simply deleted at the end).<br />
<br />
[[Category:Architecture]]<br />
[[Category:Design]]<br />
[[Category:Layout]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Layout_Enhancement_Solution_Architecture&diff=2689Layout Enhancement Solution Architecture2017-08-25T18:01:26Z<p>Nrutman: /* Layouts for File-Level Replication */</p>
<hr />
<div>=Introduction=<br />
In the Lustre file system the data of a file is striped over one or more objects each residing on an OST. The layout of a file is an attribute of the file which describes the mapping of file data ranges to object data ranges. The layout is stored on the MDT as a trusted extended attribute (`trusted.lov`) of the file and is sent to clients as needed. The layout of a file is often simply referred to as its striping, since in the current (2.5) implementation of Lustre only non-redundant striped (RAID0) layouts are permitted. This project will design enhancements to the representation and handling of layouts to support features such as [[Data on MDT]] (DoM), [[File-Level Replication]] (FLR), live data migration (via File-Level Replication), and RAID1/5/6 or erasure coding. The [[Layout Enhancement]] (LE) project is therefore a prerequisite for the Data on MDT and File-Level Replication projects, described in separate documents.<br />
<br />
=Solution Requirements=<br />
There are several demands on the Layout Enhancement project, each with its own set of requirements.<br />
<br />
==Layouts for File-Level Replication==<br />
File-level replication will use mirroring to offer increased availability and resilience of data on a Lustre file system. File-Level Replication is discussed in more detail in the [[File Level Replication High Level Design]]. The implementation of File-Level Replication requires a new composite layout type that comprises several simple (non-composite) layouts designating mirrors of the file data. It generalizes RAID0+1 to allow heterogeneous stripe sizes and counts among mirrors and to allow for the possibility of partial mirroring schemes, potentially with each replica on different OST storage pools with different performance and capacity characteristics. The design will specify the interaction of layouts for file-level replication with existing code that interprets or manipulates layouts (HSM, LOD, LOV, utilities, layout swapping, ...). The figure below depicts a 3-way replicated file F with components C0, C1, and C2 having 1, 3 and 8 stripes respectively.<br />
<br />
[[Image:Layout_with_three_components.png]]<br />
<br />
===Extent Based Layouts===<br />
Extent based layouts permit different layouts to be used in different extents of a file. They may be used to set progressively wider striping as a file grows in size, to prevent inconsistent out of space errors as individual OSTs become full, and to enable incremental migration, replication, and HSM restore. The figure below depicts a file with four different layouts (C0 through C3) each with its own extent to cover both overlapping and non-overlapping parts of the file data.<br />
<br />
[[Image:Layout_with_overlapping_components.png]]<br />
<br />
===RAID-1/5/6/10 and Other Algorithmic Layouts===<br />
These anticipate future developments to enable more space-efficient replication techniques. The design will discuss how these layouts can be expressed with simple extensions to the existing layout code. While RAID-1 and RAID-10 layouts offer degrees of replication they are not to be confused with the layouts for file-level replication described above. They are fixed layouts similar to the existing RAID-0 layouts currently used by Lustre. Compared to composite layouts they are simpler and more compact but they are also less expressive and less flexible. RAID-1 and RAID-10 layouts may be converted to composite layouts. Composite layouts whose components all have the same stripe size and stripe count may be expressed as RAID-1 and RAID-10 layouts. Full read/write support for files with RAID-3/5/6 layouts is beyond the scope of this design. Instead we will briefly outline an "offline parity" access mode for files with these layout types.<br />
<br />
===Layouts Based on the CRUSH Algorithm===<br />
After investigating the CRUSH layout model it has been determined that the complexity of implementing a CRUSH-based solution within the current Lustre code is beyond the scope of this effort. Potential layouts could be proposed but we are not confident that they would be correct or complete by the time of implementation. We are confident, however, that the layout proposed in this document is flexible and could allow for CRUSH-style algorithms in the future.<br />
<br />
===Compact Layouts for Widely Striped Files===<br />
Existing (RAID-0) layout formats use an explicit array of object identifiers to map each stripe index to a specific OST object. When using FIDs alone to identify objects this approach requires 16 bytes per stripe. The current implementation packs an OST index together with the object identifier and needs 24 bytes per stripe. The allocation of memory buffers to transmit, receive, and handle these layouts for very widely striped files (over 160 stripes) can be costly. We will discuss a compact layout based on a bitmap of OST indices which reduces memory consumption for widely striped files by a factor of 192 for the current maximally-striped layout of 2000 stripes.<br />
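The arithmetic behind the factor of 192 can be sketched as follows; the 24-bytes-per-stripe figure is taken from this document, and the helper names are illustrative, not Lustre API:<br />

```c
/* Sketch: per-stripe cost of a conventional layout (an explicit object
 * array at 24 bytes per stripe, per the text above) versus a compact
 * layout's OST bitmap (one bit per OST index).  Illustrative only. */
#include <stddef.h>

size_t conventional_layout_bytes(unsigned int stripe_count)
{
	return (size_t)stripe_count * 24;	/* OST index + object id */
}

size_t compact_bitmap_bytes(unsigned int ost_count)
{
	return ((size_t)ost_count + 7) / 8;	/* one bit per OST index */
}
```

For 2000 stripes this gives 48000 bytes for the object array versus 250 bytes for the bitmap, the factor of 192 quoted above.<br />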
<br />
===Handling of Large Layouts===<br />
Issues with the handling of large layouts will be discussed and various solutions to these issues will be considered. In particular, for very large layouts it is desirable that the retrieval of the layout be separated from the initial open RPC in order to avoid the need for large request/reply buffer allocation. Instead, a bulk RDMA transfer could be used to fetch the layout from the MDT.<br />
<br />
=Use Cases=<br />
Since the Layout Enhancement Design itself is focused on providing an infrastructure to describe a flexible layout for complex files, the use cases are described in terms of potential uses of the enhanced layouts. In some cases, there are projects underway to implement these use cases, but in other cases these are just potential uses that may be implemented in the future.<br />
<br />
==File data availability==<br />
A user wants to change an existing file to be robust against temporary or permanent OST failure. This requires that all of the file data be stored on more than one OST, so that it can be read if an OST is overloaded or permanently offline due to failure. <br />
<br />
The [[File Level Replication High Level Design|File-Level Redundancy]] (FLR) project will use a composite layout with two or more overlapping extents to keep file data available in the face of OST failure. Due to the use of per-file layouts rather than per-OST replication, it is possible to selectively replicate files on an as-needed basis; for example, 1-in-12 or 1-in-24 application checkpoints could have two-way replication while 1-in-72 checkpoints have three-way replication. This allows users/applications to balance the replication and availability needs against space and bandwidth constraints, and is not an all-or-nothing decision.<br />
<br />
In FLR Phase 2 it will be possible to create replicated files from the beginning. In FLR Phase 1 and later it will also be possible to add replication to an existing non-replicated file.<br />
<br />
User tools as described in the FLR project could use the following layout operations to add redundancy to non-replicated files:<br />
# A file's RAID-0 layout is converted to a composite layout whose sole component is the previous layout.<br />
# A temporary RAID-0 file is created to hold the new replica data (this is not connected to the layout and will be deleted in case of failure).<br />
# The file data is copied to the temporary replica file.<br />
# The new replica is merged into the composite layout as a new component, resulting in a compound layout with overlapping extents.<br />
<br />
==File data read performance==<br />
A user wants to have high-performance read access to a file from a large number of clients. This requires that the same data be stored on a large number of OSTs, so that it can be read in parallel at an aggregate bandwidth larger than what is available from a single OSS.<br />
<br />
Similar to File Data Availability, the File-Level Replication project will use a composite layout with overlapping extents. The number of replicated extents can be determined by the required read bandwidth and available OSS nodes. Having a large number of replicas on a file only makes sense for read-only files.<br />
<br />
==Reducing redundancy for old files==<br />
If a user no longer needs a high degree of replication on a file, either because it has been backed up to a separate archive, because a high read bandwidth is no longer required, or because of quota limitations, it is possible to remove one or more replicas from a file.<br />
<br />
User tools as described in the FLR project could use the following layout operations to remove redundancy from replicated files:<br />
#A replicated file is destroyed, each layout component is destroyed.<br />
#A replicated file has one of its component replicas split from the file, and that replica is destroyed.<br />
<br />
==Improved small file performance==<br />
A user wants to store small files so that they can be read/written efficiently. Using a layout that specifies that the file data is stored on the MDT, as described in the [[Data On MDT Solution Architecture]] (DOM), allows accessing the file data with fewer disk IO operations and fewer RPCs. The DOM layout allows specifying that the data is stored in the MDT inode.<br />
<br />
==Improved performance for the start of a file==<br />
Some applications need improved performance for the start of the file, for operations such as determining the file type, accessing file metadata, or accessing a file index header. This can also be useful in conjunction with HSM, where the start of the file is resident on the MDT, but the rest of the file is archived on tape. This can be achieved by using a Data On MDT component with an extent covering the first stripe_size of file data, and having an OST-based component with an extent covering [0, ∞) or [stripe_size, ∞), depending on whether the first chunk of data will be replicated to the OST object(s) or not.<br />
<br />
==Increased bandwidth and capacity for larger files==<br />
A user wants to optimize small files with a single OST stripe, and large files with many OST stripes, without having to explicitly manage this on a per-file or per-directory basis. This could be implemented with a compound layout with multiple layouts in non-overlapping extents. As a file grows in size, progressively wider striping is used for file data in order to give the file access to more OST storage capacity and IO bandwidth.<br />
<br />
For example, a new file could be created with a single stripe for the extent [0, 32MB). If a file grows beyond 32MB in size, a new component layout would be created for the extent [32MB, 1GB) with 8 stripes. Should the file grow beyond 1GB in size, a third component layout would be created for the extent [1GB, ∞) that is striped over all of the available OSTs.<br />
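The example above can be sketched as a simple lookup table; the structure and helper below are illustrative only and are not part of the proposed on-disk format:<br />

```c
/* Illustrative sketch of the progressive-striping example above: map a
 * file offset to the stripe count of the component that covers it. */
#include <stdint.h>

#define MB (1024ULL * 1024)
#define GB (1024 * MB)

struct pfl_component {
	uint64_t start, end;	/* extent [start, end) in bytes */
	int stripe_count;	/* -1 means "all available OSTs" */
};

static const struct pfl_component example_layout[] = {
	{ 0,       32 * MB,     1 },
	{ 32 * MB, 1 * GB,      8 },
	{ 1 * GB,  UINT64_MAX, -1 },	/* UINT64_MAX stands in for infinity */
};

int stripe_count_for_offset(uint64_t offset)
{
	unsigned int i;

	for (i = 0; i < sizeof(example_layout) / sizeof(example_layout[0]); i++)
		if (offset >= example_layout[i].start &&
		    offset < example_layout[i].end)
			return example_layout[i].stripe_count;
	return 0;	/* no component instantiated for this offset */
}
```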
<br />
==Handling out-of-space on an OST==<br />
If an application is writing to a file it is possible that one of the OSTs runs out of space, while other OSTs have a larger amount of free space (e.g. if new OSTs were added and had much more free space, or if a very large file was created with a single stripe). This could be implemented by converting the existing RAID-0 layout to have an extent from [0, file_size) and then creating a new RAID-0 layout as a separate extent [file_size, ∞) for the end of the file. <br />
<br />
==Transparent migration of files between OSTs==<br />
A file needs to be migrated between two OSTs. This may be needed in order to evacuate an OST that is failing or scheduled for hardware replacement, with a file that is in use by a long-running application.<br />
<br />
The file can be converted from a non-replicated file to a composite file with an extent from [0, ∞). A second component is added with an extent [0, 0) on the target OST(s). The file data can incrementally be copied to the target replica component. As data is copied to the new replica component the new component's extent end is increased to cover the just-copied range of the file, for example [0, 1GB). The old component's extent start is increased to cover a smaller range at the end of the file, for example [1GB, ∞), and its data is punched by the same amount (or the source objects are simply deleted at the end).<br />
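The extent bookkeeping during this incremental migration can be sketched as follows; the function name is illustrative and is not the Lustre API:<br />

```c
/* Sketch of the incremental migration described above: after each chunk
 * of file data is copied to the target replica, the destination
 * component's extent end advances and the source component's extent
 * start advances by the same amount.  Illustrative only. */
#include <stdint.h>

struct lu_extent {
	uint64_t e_begin;
	uint64_t e_end;
};

void migrate_advance(struct lu_extent *dst, struct lu_extent *src,
		     uint64_t copied_end)
{
	dst->e_end = copied_end;	/* e.g. [0, 1GB) once 1GB is copied */
	src->e_begin = copied_end;	/* source shrinks to [1GB, ...)    */
}
```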
<br />
==Unaligned components==<br />
For applications that have poorly-structured IO, it is possible that the application writes a short file header, and then reads or writes large-but-unaligned chunks of the file with large requests. For example, if there is a 64kB header, and then a series of 1MB chunks read/written with a 64kB offset from the start of the file. This produces poorly-formed IO to the underlying OST RAID LUNs because it is not aligned with the RAID chunks. It would be possible to specify a compound layout with a Data On MDT component for the start of the file, and then a RAID-0 OST component for the rest of the file. Unlike the components specified in other examples, the unaligned IO component would be flagged to be starting at the end of the first extent, rather than overlapping the first extent.<br />
<br />
==Algorithmic Layouts==<br />
Use cases for RAID-1/10 are similar to those for replication, except that the former formats are simpler and may be expressed more concisely. For example the RAID-1 form of a layout designating 4 mirrored objects (RAID-1) is small enough to fit in the extra space of a 512 byte MDT inode. This is not true of the analogous composite layout with 4 entries.<br />
<br />
RAID-5 and RAID-6 offer increased data availability in the face of OST failures without the cost of full mirroring. While we do not anticipate supporting networked RAID-5/6 in the short term, there are interesting use cases for read-mostly files. Given an existing (non-RAID-5/6) file, a volatile file is opened and given a RAID-5/6 layout, data is copied from the old file to the new, and parity chunks are written. Then the volatile file is merged into the old file or a layout swap is performed.<br />
<br />
==Compact Layouts==<br />
A widely striped file is to be created in order to achieve maximum IO bandwidth. Transferring a conventional layout for this file requires clients to register tens or hundreds of KB in buffers. Using a compact layout format the file's layout may be transferred with negligible RPC overhead.<br />
<br />
=Solution Proposal=<br />
==Layouts for File-Level Replication and Layout Extents==<br />
Layouts for file-level replication and extent based layouts will be offered through the same underlying layout type, which we call a composite layout. This layout consists of a header (described by struct lov_comp_md_v1 below), an array of component descriptors (described by struct lov_comp_md_entry_v1), and the component layouts (a sequence of blobs that are independent RAID-0/1/5/6/10 layouts of type struct lov_mds_md_v3 or other layout types in the future).<br />
<br />
This design does not support nested composite layouts (i.e. components which are themselves composites) to avoid complexity and recursion in the implementation of layout handling. It is thought that non-nested layouts provide sufficient flexibility for current projects and anticipated future uses.<br />
struct lu_extent {<br />
        __u64 e_begin;<br />
        __u64 e_end;<br />
};<br />
<br />
struct lov_comp_md_entry_v1 {<br />
        __u32 lcme_id;        /* unique identifier of component within composite layout */<br />
        __u32 lcme_flags;     /* LCME_FL_STALE, LCME_FL_OFFLINE, LCME_FL_PREFERRED, ... */<br />
        struct lu_extent lcme_extent;  /* file extent for component */<br />
        __u32 lcme_offset;    /* offset of component layout from start of composite layout */<br />
        __u32 lcme_size;      /* size of component layout data in bytes */<br />
        union {<br />
                __u64 lcme_padding;<br />
        } u;<br />
};<br />
<br />
struct lov_comp_md_v1 {<br />
        __u32 lcm_magic;       /* LOV_MAGIC_COMP_V1 */<br />
        __u32 lcm_size;        /* overall size including this struct */<br />
        __u32 lcm_layout_gen;  /* incremented each time layout is modified */<br />
        __u16 lcm_flags;       /* LCM_FL_RS_READ_ONLY, LCM_FL_RS_SYNC_PENDING, ... */<br />
        __u16 lcm_entry_count; /* number of components in lcm_entries[] */<br />
        union {<br />
                __u64 lcm_padding[2];<br />
        } u;<br />
        struct lov_comp_md_entry_v1 lcm_entries[0];<br />
};<br />
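Reproducing the structures above with fixed-width types lets their sizes be checked: on a typical LP64 ABI the composite header is 32 bytes and each component descriptor is 40 bytes, so the descriptor overhead for N components is 32 + 40N bytes, plus the component layouts themselves at their recorded offsets.<br />

```c
/* The on-disk structures above, restated with fixed-width types so that
 * their sizes can be verified.  Layout matches the wiki text; the
 * flexible array member replaces the GNU-style lcm_entries[0]. */
#include <stdint.h>

struct lu_extent {
	uint64_t e_begin;
	uint64_t e_end;
};

struct lov_comp_md_entry_v1 {
	uint32_t lcme_id;
	uint32_t lcme_flags;
	struct lu_extent lcme_extent;
	uint32_t lcme_offset;
	uint32_t lcme_size;
	union {
		uint64_t lcme_padding;
	} u;
};

struct lov_comp_md_v1 {
	uint32_t lcm_magic;
	uint32_t lcm_size;
	uint32_t lcm_layout_gen;
	uint16_t lcm_flags;
	uint16_t lcm_entry_count;
	union {
		uint64_t lcm_padding[2];
	} u;
	struct lov_comp_md_entry_v1 lcm_entries[];	/* [0] in the wiki text */
};
```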
<br />
Each component has an extent which describes the range of the file to which the layout applies. The extent does not necessarily need to cover the full file range. Any extents which overlap other extents are replicated, and are expected to contain the same file data at the same logical offset. A replicated (overlapping) extent may be marked UPDATING if it is currently being updated asynchronously by a client (see [[File-Level Replication Solution Architecture]] for more details). A replicated extent may be marked STALE if there was a hard failure updating the data of that extent to match the primary replica.<br />
<br />
The mechanism for maintaining and resynchronizing replicas is beyond the scope of this document. However, it should be mentioned that it is desirable to keep STALE replicas attached to a file rather than removing them immediately upon OST failure. The number of updates needed to resynchronize a STALE replica may be minimal if it is offline for only a short time. This may also allow recovery of an old version of the file from a STALE replica if the primary replica suffers a fatal error. <br />
<br />
In the design, we normally interpret the component extents in a composite layout relative to byte 0 of the file. The extents may each form a non-overlapping subset of [0, ∞), or they may all start at file offset 0, or there may be some other overlap. However, we should try not to use the component extent start as an offset when accessing the component objects. That is, if a component has a single object O and extent [s, ∞) then the file byte at position p should be found at offset p of O and not at p - s. In this way an extent with non-zero start can be converted to one which starts at 0. Similarly, assuming the file data is safely mirrored to another component, a component whose extent starts at zero can be figuratively punched to have some positive start without remapping the objects, followed by a punch of the corresponding objects' data.<br />
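The addressing rule above can be sketched as a component lookup; the structure and helper names are illustrative, not the Lustre implementation:<br />

```c
/* Sketch of component lookup honoring the rule above: the file byte at
 * position p is found at offset p within the component's objects, never
 * p - e_begin, so extent starts can change without remapping objects. */
#include <stdint.h>

struct comp {
	uint64_t e_begin, e_end;	/* component extent [begin, end) */
	int stale;			/* analogue of LCME_FL_STALE */
};

/* return the index of the first non-stale component covering offset,
 * or -1; on success *obj_offset is the offset within that component */
int comp_lookup(const struct comp *c, int n, uint64_t offset,
		uint64_t *obj_offset)
{
	int i;

	for (i = 0; i < n; i++) {
		if (c[i].stale)
			continue;
		if (offset >= c[i].e_begin && offset < c[i].e_end) {
			*obj_offset = offset;	/* absolute, not offset - e_begin */
			return i;
		}
	}
	return -1;
}
```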
<br />
The ability to pack independent file layouts as components of a larger composite layout provides a great deal of flexibility, while isolating the complexity of the individual layouts. By allowing both overlapping and non-overlapping extents to be specified, it is possible to construct file layouts for almost any use case. The ability to add different component layouts in the future (e.g. CRUSH, RAID-5/6) will allow flexibility without requiring changes in the core layout infrastructure.<br />
<br />
For quota accounting of files with compound layouts, each component is treated in the same way as a separate file with the same contents. Adding a replica to a file will increase the quota usage of a user, and removing a replica will decrease the quota usage. For files with Data-on-MDT, the space usage of the component on the MDT will be accounted separately from the space on the OSTs. With [[Project Quotas]] (a separate project, see [https://jira.hpdd.intel.com/browse/LU-4017 LU-4017]) it would be possible in the future to separately administer the quota space available to users on the MDT. This allows and encourages users to pick the availability and performance characteristics that suit their needs best.<br />
<br />
It is anticipated that replication requirements can also be managed by an external policy engine such as [https://github.com/cea-hpc/robinhood/wiki RobinHood], to add or remove replicas to files, to migrate small files to the MDT, or to create or migrate replicas to different tiers of storage.<br />
<br />
Potentially complex applications are possible in the future with integration into userspace libraries/applications, such as tailoring file IO characteristics differently for disjoint parts of a single large file. For example, an HDF5 file could use a high-IOPS OST pool for components whose extents cover an index or small IOs, replicated (overlapping) extents for important metadata, and large streaming components for extents with well-formed IO.<br />
<br />
==Operations on Composite Layouts==<br />
Several kinds of operations are needed to manipulate simple and compound file layouts:<br />
#A file with a simple layout is converted to a composite layout whose sole component is the previous layout.<br />
#A file with simple RAID-0 layout is merged into an existing compound file.<br />
#A replica (composite layout component) of a file is split out to a new file with only this component as its layout.<br />
#A replica (composite layout component) of a file is split out of the compound layout and is destroyed.<br />
#A replicated file is destroyed, each layout component is destroyed.<br />
#From a composite layout, the component with a given id is opened.<br />
#From a composite layout, the STALE state is set or cleared on the component with a given id.<br />
#A composite layout is repacked after removal of a component.<br />
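The repacking operation (item 8) can be sketched as follows; the simplified in-memory entry table and helper are illustrative only, not the Lustre implementation:<br />

```c
/* Sketch of repacking a composite layout after removing one component:
 * later descriptors shift down in the table and their recorded offsets
 * are rebased by the size of the removed component's layout blob. */
#include <stdint.h>
#include <string.h>

struct entry {
	uint32_t id;		/* component id */
	uint32_t offset;	/* offset of component layout in the blob */
	uint32_t size;		/* size of component layout in bytes */
};

/* remove entries[idx] from an n-entry table; returns the new count */
int layout_repack(struct entry *entries, int n, int idx)
{
	uint32_t removed = entries[idx].size;
	int i;

	memmove(&entries[idx], &entries[idx + 1],
		(size_t)(n - idx - 1) * sizeof(entries[0]));
	for (i = idx; i < n - 1; i++)
		entries[i].offset -= removed;	/* close the hole in the blob */
	return n - 1;
}
```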
<br />
==Algorithmic Layouts==<br />
#A new RAID-1/5/6/10 file is created.<br />
#Two suitably striped RAID-0 files are merged into a RAID-1/10 file.<br />
#A suitably striped RAID-0 file is merged into an existing RAID-1/10 file.<br />
#A specified mirror in a RAID-1/10 file is split off into a new file without an assigned layout.<br />
#A volatile (open unlinked) file with RAID-5/6 layout is created and written with a copy of an existing file's data. The volatile file's layout is merged as a component of the original file. The RAID-5/6 component can be read.<br />
<br />
==Compact Layouts==<br />
#A file is to be striped over a large number of OSTs, say 512 or more. An ordinary RAID0 layout for the file would be tens or hundreds of KB in size. Instead of explicitly specifying the FID of each OST object, a bitmap of OST indices is stored along with enough data that for each OST index set in the bitmap, a corresponding OST object FID may be computed.<br />
#Compact layouts must also include a starting index (or bias) to ensure balanced OST use.<br />
<br />
=Unit/Integration Test Plan=<br />
==Composite Layouts==<br />
#Create, store, load, and destroy empty composite layout on a file with no assigned layout.<br />
#Convert simple file layout to singleton composite layout.<br />
#Convert singleton composite layout to simple layout.<br />
#Merge simple file layout to existing composite layout.<br />
#Split component of composite layout to existing file without layout.<br />
#Move component between the composite layouts of existing files.<br />
#Get component layout with a given id from composite layout and validate.<br />
<br />
[[Category:Architecture]]<br />
[[Category:Design]]<br />
[[Category:Layout]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=File:LUG-2012-Network_Checksum_Improvement-Nathan-Xyratex.pdf&diff=2045File:LUG-2012-Network Checksum Improvement-Nathan-Xyratex.pdf2016-12-17T00:04:32Z<p>Nrutman: </p>
<hr />
<div></div>Nrutmanhttp://wiki.lustre.org/index.php?title=Projects&diff=1659Projects2016-05-03T18:57:48Z<p>Nrutman: /* Current Projects */</p>
<hr />
<div>__TOC__<br />
<br />
== Current Projects ==<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
! Feature !! Feature Summary !! Point of Contact !! Tracker !! Target Date (YYYY-MM)<br />
|-<br />
| [http://wiki.opensfs.org/index.php?title=Contract_SFS-DEV-002#Summary UID/GID mapping] || Map UID/GID for remote client nodes to local UID/GID on the MDS and OSS. Allows a single Lustre filesystem to be shared across clients in different administrative domains. || Stephen Simms (Indiana University) || [https://jira.hpdd.intel.com/browse/LU-3291 LU-3291] || 2015-11<br />
|-<br />
| [http://wiki.opensfs.org/index.php?title=Contract_SFS-DEV-002#Summary Shared Key Crypto] || Allow node authentication and/or RPC encryption using symmetric shared key crypto with GSSAPI. Avoids complexity in configuring Kerberos across multiple domains. || Stephen Simms (Indiana University) || [https://jira.hpdd.intel.com/browse/LU-3289 LU-3289] || 2015-12<br />
|-<br />
| [[NRS Delay policy]] || Use NRS for fault injection. Intentionally delay request processing to simulate server load. || Chris Horn (Cray) || [https://jira.hpdd.intel.com/browse/LU-6283 LU-6283] || 2015-05<br />
|-<br />
| [[Lock ahead]] || Allow user space to request LDLM extent locks in advance of need. Intended to optimize shared file IO. || Patrick Farrell (Cray) || [https://jira.hpdd.intel.com/browse/LU-6179 LU-6179] || 2015-07<br />
|-<br />
| [[ZFS ZIL Support]] || Add support for the ZFS Intent Log (ZIL) to the Lustre osd-zfs || Alex Zhuravlev (Intel) || [https://jira.hpdd.intel.com/browse/LU-4009 LU-4009]||2016-06<br />
|-<br />
| [[Composite File Layouts]] || Add support for multiple layouts on a single file, for File Level Replication, Data on MDT, PFL, etc. || Jinshan Xiong, Niu Yawei (Intel) || [https://jira.hpdd.intel.com/browse/LU-3480 LU-3480]||2016-09<br />
|-<br />
| [[Progressive File Layouts]] || Allow composite file layouts to be instantiated incrementally during file writes || Jinshan Xiong, Niu Yawei (Intel) || || 2016-12<br />
|-<br />
| [[Subdirectory Mounts]] || Ability for client to mount subdirectories of a Lustre filesystem || Wang Shilong (DDN) || [https://jira.hpdd.intel.com/browse/LU-28 LU-28]||2016-01<br />
|-<br />
| [[Multi-Rail LNet]] || Use multiple LNet network interfaces concurrently to improve reliability and performance || Amir Shehata (Intel), Olaf Weber (SGI) || [https://jira.hpdd.intel.com/browse/LU-7734 LU-7734] || 2016-09<br />
|-<br />
| [[Server side advise and hinting]] || Add new APIs and utilities for server/storage-side advice about file access patterns, for managing the server cache || Li Xi (DDN) ||[https://jira.hpdd.intel.com/browse/LU-4931 LU-4931] || 2016-01<br />
|-<br />
| [[TBF policy enhancement]] || An enhancement of NRS/TBF policy to support complex TBF policy with NID/JOBID expressions || Li Xi (DDN) ||[https://jira.hpdd.intel.com/browse/LU-7470 LU-7470] || 2016-06<br />
|-<br />
| [[Lustre Snapshots|Simplified Userspace Snapshots]] || Allow snapshots of ZFS targets to be mounted as a coherent filesystem || Fan Yong (Intel) || || 2017-01<br />
|-<br />
| [[Large Bulk IO]] || Increase the OST bulk IO maximum size to 16MB or larger for more efficient disk IO submission. || Shuichi Ihara (DDN) || [https://jira.hpdd.intel.com/browse/LU-7990 LU-7990] || 2016-04<br />
|}<br />
<br />
== Future Projects ==<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
! Feature !! Feature Summary !! Point of Contact!! Tracker<br />
|-<br />
| Patchless Server || Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. || Andreas Dilger (Intel) || [https://jira.hpdd.intel.com/browse/LU-20 LU-20]<br />
|-<br />
| [[File Level Replication Phase 1]] || Solution Architecture and HLD, Exclusive Open, RAID 1 Layout, Layout Modification Method, Read Only RAID 1 || Jinshan Xiong (Intel) || [https://jira.hpdd.intel.com/browse/LU-3254 LU-3254]<br />
|-<br />
| [[File Level Replication Phase 2]] || Immediate asynchronous write from client. || Jinshan Xiong (Intel) || [https://jira.hpdd.intel.com/browse/LU-3254 LU-3254]<br />
|-<br />
| [[Data on MDT]] || Allow small files to be stored directly on the MDT for reduced RPC count and improved performance. || Mikhail Pershin (Intel) || .<br />
|-<br />
| [[Quota for Projects]] || Allow specifying a "project" or "subtree" identifier for files for accounting to a project, separate from UID/GID. || Shuichi Ihara (DDN) || [https://jira.hpdd.intel.com/browse/LU-4017 LU-4017]<br />
|-<br />
| [[OSP multiple modify requests to MDT]] || Improve performance of cross-MDT modify operations. || Grégoire Pichon (Atos) || [https://jira.hpdd.intel.com/browse/LU-6864 LU-6864]<br />
|-<br />
| [[Layout Enhancement High Level Design|Layout Enhancement]] || Necessary to enable file-level replication and extent-based layouts. || Andreas Dilger (Intel) || .<br />
|-<br />
| [[OFD Read Cache]] || Server-side read cache using SSD or NVMe devices; a storage cache tier between the OSS read cache in server memory and the backing storage || Li Xi (DDN) || .<br />
|-<br />
| [[Fileheat based Policy engine ]] || Add a new "file heat" attribute on objects to track "hot" files, to drive the OFD read cache and act as input to a policy engine || Li Xi (DDN) || .<br />
|-<br />
| [[IB Multi-rail]] || Use multiple IB interfaces as a single Lustre NID to improve data transferring bandwidth and redundancy against the failure of IB. || Kenichiro Sakai (Fujitsu) || [https://jira.hpdd.intel.com/browse/LU-6495 LU-6495]<br />
|-<br />
| [[Directory Quota]] || Limit the number of inodes and disk blocks of files and subdirectories in the specified directory in a manner similar to user/group Quota. || Kenichiro Sakai (Fujitsu) || .<br />
|-<br />
| [[Directory Level Snapshot]] || Directory level snapshot (DL-SNAP) is designed for user level file backups. DL-SNAP will be implemented by using copy-on-write mechanism on top of ldiskfs without modification of disk format. || Kenichiro Sakai (Fujitsu) || .<br />
|}<br />
<br />
== Potential Projects ==<br />
<br />
These projects are potential areas of development that are looking for an interested party to work on or sponsor another developer to do. Many of them have more detailed descriptions, but it is worthwhile to contact the lustre-devel mailing list before starting work on any of them. <br />
<br />
{| class="wikitable sortable"<br />
|-<br />
! Feature !! Feature Summary !! Complexity !! Tracker<br />
|-<br />
| [[ioctl() number cleanups]]<br />
|| Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.<br />
|| 1 ||[https://bugzilla.lustre.org/show_bug.cgi?id=20731 b=20731]<br />
|-<br />
| [[Updated man pages]]<br />
|| Update the online manual pages for Lustre user tools and the lustreapi library. Split the existing lfs.1 and lctl.8 man pages into separate pages for each sub-command, describing options and providing usage examples.<br />
|| 2 ||[https://jira.hpdd.intel.com/browse/LU-4315 LU-4315]<br />
|-<br />
| [[Improve testing Efficiency]]<br />
||Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
|| 3 ||[https://bugzilla.lustre.org/show_bug.cgi?id=23051 b=23051]<br />
|-<br />
| [[Config save/edit/restore]]<br />
||Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.<br />
|| 3 ||[https://bugzilla.lustre.org/show_bug.cgi?id=17094 b=17094]<br />
|-<br />
| [[Filesystem default OST pool]]<br />
||Allow a filesystem-wide default OST pool to be specified. Currently, it is possible to set the default stripe count, size, and index on a filesystem with "lfs setstripe" on the filesystem root, but the OST pool name is ignored. There is no other mechanism to specify the default OST pool for all new files in the filesystem, if no pool is specified. This would be useful for WAN or other heterogeneous OST configurations.<br />
|| 3 ||[https://jira.hpdd.intel.com/browse/LU-7335 LU-7335]<br />
|-<br />
| [[Error message improvements]]<br />
||Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.<br />
|| 4 ||<br />
|-<br />
| [[Improve QOS Round-Robin object allocator]]<br />
||Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.<br />
|| 5 ||[https://jira.hpdd.intel.com/browse/LU-9 LU-9]<br />
|-<br />
| [[RPC Replay Signatures]]<br />
||Allow the MDS/OSS to determine whether a client can legitimately replay an RPC, by digitally signing the RPC at processing time and verifying the signature at replay time.<br />
|| 6 ||[https://bugzilla.lustre.org/show_bug.cgi?id=18547 b=18547]<br />
|-<br />
| [[Virtual Lustre Block Device]]<br />
||The Lustre lloop driver exports an object as a block device to userspace, bypassing the filesystem. The code partly works and is currently part of Lustre, but it has correctness issues and potential performance problems, and it needs to be ported to newer kernels.<br />
|| 6 ||[https://jira.hpdd.intel.com/browse/LU-6585 LU-6585]<br />
|-<br />
| [[Swap on Lustre]]<br />
||Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.<br />
|| 7 ||[https://bugzilla.lustre.org/show_bug.cgi?id=5498 b=5498]<br />
|-<br />
| [[Directory readdir+]]<br />
||Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.<br />
|| 7 ||[https://bugzilla.lustre.org/show_bug.cgi?id=17845 b=17845]<br />
|-<br />
| [[Small file IO aggregation]]<br />
||Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.<br />
|| 7 ||[https://bugzilla.lustre.org/show_bug.cgi?id=944 b=944]<br />
|-<br />
| [[Lustre Snapshots|Integrated Lustre Snapshots]]<br />
|| Allow Lustre to internally mount and manage ZFS snapshots and other datasets within a single namespace<br />
|| 7 ||<br />
|-<br />
| [[Client-side data encryption]]<br />
||Encrypt files and directories (or possibly just filenames in directories) on the client before sending them to the server. This avoids sending unencrypted data over the network, and avoids ever having the data in plaintext on the server (as would happen with separate network decryption plus on-disk encryption). It seems possible to leverage the existing GSSAPI sptlrpc_cli_wrap_bulk() and friends to do bulk data encryption/decryption on the client using a per-file key (itself stored with the file on the MDS, encrypted by the users' keys) salted with the OST object ID or stripe index and object offset, rather than using a per-session key, so that the data is never decrypted at the server before being written to disk.<br />
|| 7 ||<br />
|-<br />
| [[Local object zero-copy IO]]<br />
||Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.<br />
|| 9 ||<br />
|-<br />
|}<br />
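The weighted round-robin allocation described in the QOS allocator entry above can be sketched with the classic smooth weighted round-robin algorithm. This is an illustration only, not the Lustre LOV implementation; the OST names and weights (standing in for free space) are invented:<br />

```python
# Sketch of a weighted round-robin object allocator, as described in the
# "Improve QOS Round-Robin object allocator" entry above. Illustrative only;
# weights stand in for OST free space.

def weighted_round_robin(weights, n):
    """Yield n picks using smooth weighted round-robin over {name: weight}."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        # Advance every candidate by its weight, then pick the leader.
        for name, w in weights.items():
            current[name] += w
        best = max(current, key=current.get)
        current[best] -= total
        picks.append(best)
    return picks

# With a 3:1 free-space imbalance, OST0 gets three of every four picks,
# interleaved rather than clustered.
print(weighted_round_robin({"OST0": 3, "OST1": 1}, 4))
```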
<br />
== Past Projects ==<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
! Feature !! Feature Summary !! Point of Contact!! Tracker !! Target Date (YYYY-MM) !! Merge Date (YYYY-MM)<br />
|-<br />
| [[Dynamic LNET Config]] || Introduces a user space script which reads routes from a config file and adds those routes to LNET dynamically with the lctl utility. This allows support of very large routing tables. || Amir Shehata (Intel) || [https://jira.hpdd.intel.com/browse/LU-2456 LU-2456] || 2014-11 || 2014-11<br />
|-<br />
| [[LFSCK Phase 3 - DNE consistency]] || Enhance LFSCK to work with DNE filesystems, including remote directory entries, and OST orphan handling for multiple MDTs. || Fan Yong (Intel) || [https://jira.hpdd.intel.com/browse/LU-2307 LU-2307] || 2014-11 || 2014-11<br />
|-<br />
| [[LFSCK Phase 4 - performance tuning]] || Enhance LFSCK performance and efficiency. || Fan Yong (Intel) || [https://jira.hpdd.intel.com/browse/LU-6361 LU-6361] || 2015-05 || 2015-05<br />
|-<br />
| [[Multiple metadata RPCs]] || Support of multiple metadata modifications per client (in last_rcvd file) to improve the multi-threaded metadata performance of a single client || Grégoire Pichon (Bull/Atos) || [https://jira.hpdd.intel.com/browse/LU-5319 LU-5319] || 2015-05 || 2015-05<br />
|-<br />
| [[DNE Phase IIb]] || Asynchronous Commit of cross-MDT updates for improved performance. Remote rename and remote hard link functionality. || Wang Di (Intel) || [https://jira.hpdd.intel.com/browse/LU-3534 LU-3534] || 2015-05 || 2015-05<br />
|-<br />
| [[Kerberos Revival]] || Fix up the existing Kerberos code so that it is tested and working again. || Sébastien Buisson (Bull/Atos) || [https://jira.hpdd.intel.com/browse/LU-6356 LU-6356] || 2015-06 || 2015-06<br />
|}<br />
<br />
[[Category: Development]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Projects&diff=1658Projects2016-05-03T18:56:58Z<p>Nrutman: /* Current Projects */</p>
<hr />
<div>__TOC__<br />
<br />
== Current Projects ==<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
! Feature !! Feature Summary !! Point of Contact !! Tracker !! Target Date (YYYY-MM)<br />
|-<br />
| [http://wiki.opensfs.org/index.php?title=Contract_SFS-DEV-002#Documentation UID/GID mapping] || Map UID/GID for remote client nodes to local UID/GID on the MDS and OSS. Allows a single Lustre filesystem to be shared across clients in different administrative domains. || Stephen Simms (Indiana University) || [https://jira.hpdd.intel.com/browse/LU-3291 LU-3291] || 2015-11<br />
|-<br />
| [http://wiki.opensfs.org/index.php?title=Contract_SFS-DEV-002#Documentation Shared Key Crypto] || Allow node authentication and/or RPC encryption using symmetric shared key crypto with GSSAPI. Avoids complexity in configuring Kerberos across multiple domains. || Stephen Simms (Indiana University) || [https://jira.hpdd.intel.com/browse/LU-3289 LU-3289] || 2015-12<br />
|-<br />
| [[NRS Delay policy]] || Use NRS for fault injection. Intentionally delay request processing to simulate server load. || Chris Horn (Cray) || [https://jira.hpdd.intel.com/browse/LU-6283 LU-6283] || 2015-05<br />
|-<br />
| [[Lock ahead]] || Allow user space to request LDLM extent locks in advance of need. Intended to optimize shared file IO. || Patrick Farrell (Cray) || [https://jira.hpdd.intel.com/browse/LU-6179 LU-6179] || 2015-07<br />
|-<br />
| [[ZFS ZIL Support]] || Add support for the ZFS Intent Log (ZIL) to the Lustre osd-zfs || Alex Zhuravlev (Intel) || [https://jira.hpdd.intel.com/browse/LU-4009 LU-4009]||2016-06<br />
|-<br />
| [[Composite File Layouts]] || Add support for multiple layouts on a single file, for File Level Replication, Data on MDT, PFL, etc. || Jinshan Xiong, Niu Yawei (Intel) || [https://jira.hpdd.intel.com/browse/LU-3480 LU-3480]||2016-09<br />
|-<br />
| [[Progressive File Layouts]] || Allow composite file layouts to be instantiated incrementally during file writes || Jinshan Xiong, Niu Yawei (Intel) || || 2016-12<br />
|-<br />
| [[Subdirectory Mounts]] || Ability for client to mount subdirectories of a Lustre filesystem || Wang Shilong (DDN) || [https://jira.hpdd.intel.com/browse/LU-28 LU-28]||2016-01<br />
|-<br />
| [[Multi-Rail LNet]] || Use multiple LNet network interfaces concurrently to improve reliability and performance || Amir Shehata (Intel), Olaf Weber (SGI) || [https://jira.hpdd.intel.com/browse/LU-7734 LU-7734] || 2016-09<br />
|-<br />
| [[Server side advise and hinting]] || Add new APIs and utilities that advise the server/storage side of expected file access, to support server-side caching || Li Xi (DDN) ||[https://jira.hpdd.intel.com/browse/LU-4931 LU-4931] || 2016-01<br />
|-<br />
| [[TBF policy enhancement]] || Enhance the NRS TBF policy to support complex rules with NID/JOBID expressions || Li Xi (DDN) ||[https://jira.hpdd.intel.com/browse/LU-7470 LU-7470] || 2016-06<br />
|-<br />
| [[Lustre Snapshots|Simplified Userspace Snapshots]] || Allow snapshots of ZFS targets to be mounted as a coherent filesystem || Fan Yong (Intel) || || 2017-01<br />
|-<br />
| [[Large Bulk IO]] || Increase the OST bulk IO maximum size to 16MB or larger for more efficient disk IO submission. || Shuichi Ihara (DDN) || [https://jira.hpdd.intel.com/browse/LU-7990 LU-7990] || 2016-04<br />
|}<br />
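The TBF ("Token Bucket Filter") entries in the table above rate-limit RPC processing. A minimal token-bucket sketch shows the idea; the real NRS policy additionally matches rules against client NID/JOBID, and nothing here is Lustre code:<br />

```python
# Sketch of the token-bucket throttling behind the NRS TBF policy entries
# above. Illustrative only: the real policy runs inside the server's
# request scheduler, keyed by client NID/JobID rules.

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate        # RPC tokens refilled per second
        self.burst = burst      # maximum bucket depth
        self.tokens = burst
        self.last = 0.0

    def allow(self, now):
        """Return True if one RPC may proceed at time `now` (seconds)."""
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10, burst=2)                # ~10 RPCs/s, burst of 2
results = [bucket.allow(t / 100) for t in range(5)]   # one request every 10 ms
print(results)   # the burst is admitted, then requests are throttled
```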
<br />
== Future Projects ==<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
! Feature !! Feature Summary !! Point of Contact!! Tracker<br />
|-<br />
| Patchless Server || Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. || Andreas Dilger (Intel) || [https://jira.hpdd.intel.com/browse/LU-20 LU-20]<br />
|-<br />
| [[File Level Replication Phase 1]] || Solution Architecture and HDL, Exclusive Open, RAID 1 Layout, Layout Modification Method, Read Only RAID 1 || Jinshan Xiong (Intel) || [https://jira.hpdd.intel.com/browse/LU-3254 LU-3254]<br />
|-<br />
| [[File Level Replication Phase 2]] || Immediate asynchronous write from client. || Jinshan Xiong (Intel) || [https://jira.hpdd.intel.com/browse/LU-3254 LU-3254]<br />
|-<br />
| [[Data on MDT]] || Allow small files to be stored directly on the MDT for reduced RPC count and improved performance. || Mikhail Pershin (Intel) || .<br />
|-<br />
| [[Quota for Projects]] || Allow specifying a "project" or "subtree" identifier for files for accounting to a project, separate from UID/GID. || Shuichi Ihara (DDN) || [https://jira.hpdd.intel.com/browse/LU-4017 LU-4017]<br />
|-<br />
| [[OSP multiple modify requests to MDT]] || Improve performance of cross-MDT modify operations. || Grégoire Pichon (Atos) || [https://jira.hpdd.intel.com/browse/LU-6864 LU-6864]<br />
|-<br />
| [[Layout Enhancement High Level Design|Layout Enhancement]] || Necessary to enable file-level replication and extent-based layouts. || Andreas Dilger (Intel) || .<br />
|-<br />
| [[OFD Read Cache]] || Server-side read cache using SSD or NVMe devices; a storage cache tier between the OSS memory read cache and backing storage || Li Xi (DDN) || .<br />
|-<br />
| [[Fileheat based Policy engine ]] || Add a new "file heat" attribute to objects to track "hot" files, driving the OFD read cache via a policy engine || Li Xi (DDN) || .<br />
|-<br />
| [[IB Multi-rail]] || Use multiple IB interfaces as a single Lustre NID to improve data transfer bandwidth and provide redundancy against IB failures. || Kenichiro Sakai (Fujitsu) || [https://jira.hpdd.intel.com/browse/LU-6495 LU-6495]<br />
|-<br />
| [[Directory Quota]] || Limit the number of inodes and disk blocks of files and subdirectories in the specified directory in a manner similar to user/group Quota. || Kenichiro Sakai (Fujitsu) || .<br />
|-<br />
| [[Directory Level Snapshot]] || Directory level snapshot (DL-SNAP) is designed for user-level file backups. DL-SNAP will be implemented using a copy-on-write mechanism on top of ldiskfs, without modifying the disk format. || Kenichiro Sakai (Fujitsu) || .<br />
|}<br />
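The file-heat entry above can be illustrated with a simple decaying counter. This is a sketch under invented names, not the proposed Lustre attribute or API:<br />

```python
# Sketch of a decaying "file heat" metric like the one proposed in the
# Fileheat-based policy engine entry above: each access adds heat, and heat
# halves every `half_life` seconds, so recently busy files rank highest.
# Illustrative only; the class and method names here are invented.

class FileHeat:
    def __init__(self, half_life=60.0):
        self.half_life = half_life   # seconds for heat to halve
        self.state = {}              # path -> (heat, last_update_time)

    def touch(self, path, now, amount=1.0):
        heat, last = self.state.get(path, (0.0, now))
        decayed = heat * 0.5 ** ((now - last) / self.half_life)
        self.state[path] = (decayed + amount, now)

    def hottest(self, now, n=1):
        def current(item):
            heat, last = item[1]
            return heat * 0.5 ** ((now - last) / self.half_life)
        ranked = sorted(self.state.items(), key=current, reverse=True)
        return [path for path, _ in ranked[:n]]

h = FileHeat(half_life=60)
for t in range(10):                       # ten accesses in ten seconds
    h.touch("/mnt/lustre/hot.dat", now=t)
h.touch("/mnt/lustre/cold.dat", now=0)    # a single access, long ago
print(h.hottest(now=10))                  # the busy file ranks first
```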
<br />
== Potential Projects ==<br />
<br />
These projects are potential areas of development that are looking for an interested party to work on them, or to sponsor another developer to do so. Many of them have more detailed descriptions, but it is worthwhile to contact the lustre-devel mailing list before starting work on any of them.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
! Feature !! Feature Summary !! Complexity !! Tracker<br />
|-<br />
| [[ioctl() number cleanups]]<br />
|| Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.<br />
|| 1 ||[https://bugzilla.lustre.org/show_bug.cgi?id=20731 b=20731]<br />
|-<br />
| [[Updated man pages]]<br />
|| Update the online manual pages for Lustre user tools and the lustreapi library. Split the existing lfs.1 and lctl.8 man pages into separate pages for each sub-command, describing options and providing usage examples.<br />
|| 2 ||[https://jira.hpdd.intel.com/browse/LU-4315 LU-4315]<br />
|-<br />
| [[Improve testing Efficiency]]<br />
||Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
|| 3 ||[https://bugzilla.lustre.org/show_bug.cgi?id=23051 b=23051]<br />
|-<br />
| [[Config save/edit/restore]]<br />
||Need to be able to back up, edit, and restore the client/MDS/OSS config llog files after a writeconf. One reason is config recovery if the config llog becomes corrupted. Another is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost when a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.<br />
|| 3 ||[https://bugzilla.lustre.org/show_bug.cgi?id=17094 b=17094]<br />
|-<br />
| [[Filesystem default OST pool]]<br />
||Allow a filesystem-wide default OST pool to be specified. Currently, it is possible to set the default stripe count, size, and index by running "lfs setstripe" on the filesystem root, but the OST pool name is ignored. There is no other mechanism to specify the default OST pool for all new files in the filesystem when no pool is given explicitly. This would be useful for WAN or other heterogeneous OST configurations.<br />
|| 3 ||[https://jira.hpdd.intel.com/browse/LU-7335 LU-7335]<br />
|-<br />
| [[Error message improvements]]<br />
||Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.<br />
|| 4 ||<br />
|-<br />
| [[Improve QOS Round-Robin object allocator]]<br />
||Improve the LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocation once OST free space becomes imbalanced. This evens out allocations continuously, avoids severe OST allocation imbalances when QOS becomes active, and allows adding weighting for factors such as current load, OST RAID rebuild, etc.<br />
|| 5 ||[https://jira.hpdd.intel.com/browse/LU-9 LU-9]<br />
|-<br />
| [[RPC Replay Signatures]]<br />
||Allow the MDS/OSS to determine whether a client can legitimately replay an RPC, by digitally signing the RPC at processing time and verifying the signature at replay time.<br />
|| 6 ||[https://bugzilla.lustre.org/show_bug.cgi?id=18547 b=18547]<br />
|-<br />
| [[Virtual Lustre Block Device]]<br />
||The Lustre lloop driver exports an object as a block device to userspace, bypassing the filesystem. The code partly works and is currently part of Lustre, but it has correctness issues and potential performance problems, and it needs to be ported to newer kernels.<br />
|| 6 ||[https://jira.hpdd.intel.com/browse/LU-6585 LU-6585]<br />
|-<br />
| [[Swap on Lustre]]<br />
||Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.<br />
|| 7 ||[https://bugzilla.lustre.org/show_bug.cgi?id=5498 b=5498]<br />
|-<br />
| [[Directory readdir+]]<br />
||Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.<br />
|| 7 ||[https://bugzilla.lustre.org/show_bug.cgi?id=17845 b=17845]<br />
|-<br />
| [[Small file IO aggregation]]<br />
||Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.<br />
|| 7 ||[https://bugzilla.lustre.org/show_bug.cgi?id=944 b=944]<br />
|-<br />
| [[Lustre Snapshots|Integrated Lustre Snapshots]]<br />
|| Allow Lustre to internally mount and manage ZFS snapshots and other datasets within a single namespace<br />
|| 7 ||<br />
|-<br />
| [[Client-side data encryption]]<br />
||Encrypt files and directories (or possibly just filenames in directories) on the client before sending them to the server. This avoids sending unencrypted data over the network, and avoids ever having the data in plaintext on the server (as would happen with separate network decryption plus on-disk encryption). It seems possible to leverage the existing GSSAPI sptlrpc_cli_wrap_bulk() and friends to do bulk data encryption/decryption on the client using a per-file key (itself stored with the file on the MDS, encrypted by the users' keys) salted with the OST object ID or stripe index and object offset, rather than using a per-session key, so that the data is never decrypted at the server before being written to disk.<br />
|| 7 ||<br />
|-<br />
| [[Local object zero-copy IO]]<br />
||Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.<br />
|| 9 ||<br />
|-<br />
|}<br />
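The key-salting scheme outlined in the client-side encryption entry above can be sketched as follows. This is an illustration of the salting idea only, using invented values; it is not the proposed GSSAPI/sptlrpc implementation and is not a secure cipher construction:<br />

```python
# Sketch of the per-file-key salting idea from the client-side encryption
# entry above: derive a distinct keystream from (file key, OST object ID,
# offset) so identical plaintext blocks encrypt differently, and the server
# never needs the plaintext. Illustrative only -- a real implementation
# would use a vetted AEAD cipher, not a raw HMAC keystream.

import hashlib
import hmac

def keystream(file_key, object_id, offset, length):
    """Expand HMACs of (object_id, offset, counter) into `length` bytes."""
    out = b""
    counter = 0
    while len(out) < length:
        salt = f"{object_id}:{offset}:{counter}".encode()
        out += hmac.new(file_key, salt, hashlib.sha256).digest()
        counter += 1
    return out[:length]

def xcrypt(file_key, object_id, offset, data):
    """XOR data with the derived keystream; applying it twice decrypts."""
    ks = keystream(file_key, object_id, offset, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

key = b"per-file key (stored encrypted on the MDS)"
block = b"same plaintext"
c1 = xcrypt(key, 0x1234, 0, block)
c2 = xcrypt(key, 0x1234, 4096, block)       # same data, different offset
assert c1 != c2                             # salting differentiates the blocks
assert xcrypt(key, 0x1234, 0, c1) == block  # symmetric: XOR again to decrypt
```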
<br />
== Past Projects ==<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
! Feature !! Feature Summary !! Point of Contact!! Tracker !! Target Date (YYYY-MM) !! Merge Date (YYYY-MM)<br />
|-<br />
| [[Dynamic LNET Config]] || Introduces a user space script which reads routes from a config file and adds those routes to LNET dynamically with the lctl utility. This allows support of very large routing tables. || Amir Shehata (Intel) || [https://jira.hpdd.intel.com/browse/LU-2456 LU-2456] || 2014-11 || 2014-11<br />
|-<br />
| [[LFSCK Phase 3 - DNE consistency]] || Enhance LFSCK to work with DNE filesystems, including remote directory entries, and OST orphan handling for multiple MDTs. || Fan Yong (Intel) || [https://jira.hpdd.intel.com/browse/LU-2307 LU-2307] || 2014-11 || 2014-11<br />
|-<br />
| [[LFSCK Phase 4 - performance tuning]] || Enhance LFSCK performance and efficiency. || Fan Yong (Intel) || [https://jira.hpdd.intel.com/browse/LU-6361 LU-6361] || 2015-05 || 2015-05<br />
|-<br />
| [[Multiple metadata RPCs]] || Support of multiple metadata modifications per client (in last_rcvd file) to improve the multi-threaded metadata performance of a single client || Grégoire Pichon (Bull/Atos) || [https://jira.hpdd.intel.com/browse/LU-5319 LU-5319] || 2015-05 || 2015-05<br />
|-<br />
| [[DNE Phase IIb]] || Asynchronous Commit of cross-MDT updates for improved performance. Remote rename and remote hard link functionality. || Wang Di (Intel) || [https://jira.hpdd.intel.com/browse/LU-3534 LU-3534] || 2015-05 || 2015-05<br />
|-<br />
| [[Kerberos Revival]] || Fix up the existing Kerberos code so that it is tested and working again. || Sébastien Buisson (Bull/Atos) || [https://jira.hpdd.intel.com/browse/LU-6356 LU-6356] || 2015-06 || 2015-06<br />
|}<br />
<br />
[[Category: Development]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Projects&diff=1657Projects2016-05-03T18:56:33Z<p>Nrutman: /* Current Projects */</p>
<hr />
<div>__TOC__<br />
<br />
== Current Projects ==<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
! Feature !! Feature Summary !! Point of Contact !! Tracker !! Target Date (YYYY-MM)<br />
|-<br />
| [http://wiki.opensfs.org/index.php?title=Contract_SFS-DEV-002#Documentation UID/GID mapping] || Map UID/GID for remote client nodes to local UID/GID on the MDS and OSS. Allows a single Lustre filesystem to be shared across clients with administrative domains. || Stephen Simms (Indiana University) || [https://jira.hpdd.intel.com/browse/LU-3291 LU-3291] || 2015-11<br />
|-<br />
| [[Shared Key Crypto]] || Allow node authentication and/or RPC encryption using symmetric shared key crypto with GSSAPI. Avoids complexity in configuring Kerberos across multiple domains. || Stephen Simms (Indiana University) || [https://jira.hpdd.intel.com/browse/LU-3289 LU-3289] || 2015-12<br />
|-<br />
| [[NRS Delay policy]] || Use NRS for fault injection. Intentionally delay request processing to simulate server load. || Chris Horn (Cray) || [https://jira.hpdd.intel.com/browse/LU-6283 LU-6283] || 2015-05<br />
|-<br />
| [[Lock ahead]] || Allow user space to request LDLM extent locks in advance of need. Intended to optimize shared file IO. || Patrick Farrell (Cray) || [https://jira.hpdd.intel.com/browse/LU-6179 LU-6179] || 2015-07<br />
|-<br />
| [[ZFS ZIL Support]] || Add support for the ZFS Intent Log (ZIL) to the Lustre osd-zfs || Alex Zhuravlev (Intel) || [https://jira.hpdd.intel.com/browse/LU-4009 LU-4009]||2016-06<br />
|-<br />
| [[Composite File Layouts]] || Add support for multiple layouts on a single file, for File Level Replication, Data on MDT, PFL, etc. || Jinshan Xiong, Niu Yawei (Intel) || [https://jira.hpdd.intel.com/browse/LU-3480 LU-3480]||2016-09<br />
|-<br />
| [[Progressive File Layouts]] || Allow composite file layouts to be instantiated incrementally during file writes || Jinshan Xiong, Niu Yawei (Intel) || || 2016-12<br />
|-<br />
| [[Subdirectory Mounts]] || Ability for client to mount subdirectories of a Lustre filesystem || Wang Shilong (DDN) || [https://jira.hpdd.intel.com/browse/LU-28 LU-28]||2016-01<br />
|-<br />
| [[Multi-Rail LNet]] || Use multiple LNet network interfaces concurrently to improve reliability and performance || Amir Shehata (Intel), Olaf Weber (SGI) || [https://jira.hpdd.intel.com/browse/LU-7734 LU-7734] || 2016-09<br />
|-<br />
| [[Server side advise and hinting]] || Add support new APIs and utilities for server/storage side advise of accessing file for server cache || Li Xi (DDN) ||[https://jira.hpdd.intel.com/browse/LU-4931 LU-4931] || 2016-01<br />
|-<br />
| [[TBF policy enhancement]] || An enhancement of NRS/TBF policy to support complex TBF policy with NID/JOBID expressions || Li Xi (DDN) ||[https://jira.hpdd.intel.com/browse/LU-7470 LU-7470] || 2016-06<br />
|-<br />
| [[Lustre Snapshots|Simplified Userspace Snapshots]] || Allow snapshots of ZFS targets to be mounted as a coherent filesystem || Fan Yong (Intel) || || 2017-01<br />
|-<br />
| [[Large Bulk IO]] || Increase the OST bulk IO maximum size to 16MB or larger for more efficient disk IO submission. || Shuichi Ihara (DDN) || [https://jira.hpdd.intel.com/browse/LU-7990 LU-7990] || 2016-04<br />
|}<br />
<br />
== Future Projects ==<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
! Feature !! Feature Summary !! Point of Contact!! Tracker<br />
|-<br />
| Patchless Server || Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. || Andreas Dilger (Intel) || [https://jira.hpdd.intel.com/browse/LU-20 LU-20]<br />
|-<br />
| [[File Level Replication Phase 1]] || Solution Architecture and HDL, Exclusive Open, RAID 1 Layout, Layout Modification Method, Read Only RAID 1 || Jinshan Xiong (Intel) || [https://jira.hpdd.intel.com/browse/LU-3254 LU-3254]<br />
|-<br />
| [[File Level Replication Phase 2]] || Immediate asynchronous write from client. || Jinshan Xiong (Intel) || [https://jira.hpdd.intel.com/browse/LU-3254 LU-3254]<br />
|-<br />
| [[Data on MDT]] || Allow small files to be stored directly on the MDT for reduced RPC count and improved performance. || Mikhail Pershin (Intel) || .<br />
|-<br />
| [[Quota for Projects]] || Allow specifying a "project" or "subtree" identifier for files for accounting to a project, separate from UID/GID. || Shuichi Ihara (DDN) || [https://jira.hpdd.intel.com/browse/LU-4017 LU-4017]<br />
|-<br />
| [[OSP multiple modify requests to MDT]] || Improve performance of cross-MDT modify operations. || Grégoire Pichon (Atos) || .[https://jira.hpdd.intel.com/browse/LU-6864 LU-6864]<br />
|-<br />
| [[Layout Enhancement High Level Design|Layout Enhancement]] || Necessary to enable file-level replication and extend based layouts. || Andreas Dilger (Intel) || .<br />
|-<br />
| [[OFD Read Cache]] || Server side Read cache with SSD, NVMe. One of storage cache tier between OSS Readcache with server memory and Storage || Li Xi (DDN) || .<br />
|-<br />
| [[Fileheat based Policy engine ]] || Add support new attribute "file heat" of objects to track "hot" files, for OFD read cache as a policy engine || Li Xi (DDN) || .<br />
|-<br />
| [[IB Multi-rail]] || Use multiple IB interfaces as a single Lustre NID to improve data transferring bandwidth and redundancy against the failure of IB. || Kenichiro Sakai (Fujitsu) || [https://jira.hpdd.intel.com/browse/LU-6495 LU-6495]<br />
|-<br />
| [[Directory Quota]] || Limit the number of inodes and disk blocks of files and subdirectories in the specified directory in a manner similar to user/group Quota. || Kenichiro Sakai (Fujitsu) || .<br />
|-<br />
| [[Directory Level Snapshot]] || Directory level snapshot (DL-SNAP) is designed for user level file backups. DL-SNAP will be implemented by using copy-on-write mechanism on top of ldiskfs without modification of disk format. || Kenichiro Sakai (Fujitsu) || .<br />
|}<br />
<br />
== Potential Projects ==<br />
<br />
These projects are potential areas of development that are looking for an interested party to work on or sponsor another developer to do. Many of them have more detailed descriptions, but it is worthwhile to contact the lustre-devel mailing list <br />
<br />
{| class="wikitable sortable"<br />
|-<br />
! Feature !! Feature Summary !! Complexity !! Tracker<br />
|-<br />
| [[ioctl() number cleanups]]<br />
|| Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.<br />
|| 1 ||[https://bugzilla.lustre.org/show_bug.cgi?id=20731 b=20731]<br />
|-<br />
| [[Updated man pages]]<br />
|| Update the online manual pages for Lustre user tools and the lustreapi library. Split the existing lfs.1 and lctl.8 man pages into separate pages for each sub-command, describing options and providing usage examples.<br />
|| 2 ||[https://jira.hpdd.intel.com/browse/LU-4315 LU-4315]<br />
|-<br />
| [[Improve testing Efficiency]]<br />
||Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
|| 3 ||[https://bugzilla.lustre.org/show_bug.cgi?id=23051 b=23051]<br />
|-<br />
| [[Config save/edit/restore]]<br />
||Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.<br />
|| 3 ||[https://bugzilla.lustre.org/show_bug.cgi?id=17094 b=17094]<br />
|-<br />
| [[Filesystem default OST pool]]<br />
||Allow a filesystem-wide default OST pool to be specified. Currently, it is possible to set the default stripe count, size, index on a filesystem with "lfs setstripe" on the filesystem root, but the OST pool name is ignored. There is no other mechanism to specify default the OST pool for all new files in the filesystem, if no pool is specified. This would be useful for WAN or other heterogeneous OST configurations.<br />
|| 3 ||[https://jira.hpdd.intel.com/browse/LU-7335 LU-7335]<br />
|-<br />
| [[Error message improvements]]<br />
||Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.<br />
|| 4 ||<br />
|-<br />
| [[Improve QOS Round-Robin object allocator]]<br />
||Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.<br />
|| 5 ||[https://jira.hpdd.intel.com/browse/LU-9 LU-9]<br />
|-<br />
| [[RPC Replay Signatures]]<br />
||<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.<br />
|| 6 ||[https://bugzilla.lustre.org/show_bug.cgi?id=18547 b=18547]<br />
|-<br />
| [[Virtual Lustre Block Device]]<br />
||Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is currently part of Lustre, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.<br />
|| 6 ||[https://jira.hpdd.intel.com/browse/LU-6585 LU-6585]<br />
|-<br />
| [[Swap on Lustre]]<br />
||Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.<br />
|| 7 ||[https://bugzilla.lustre.org/show_bug.cgi?id=5498 b=5498]<br />
|-<br />
| [[Directory readdir+]]<br />
||Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.<br />
|| 7 ||[https://bugzilla.lustre.org/show_bug.cgi?id=17845 b=17845]<br />
|-<br />
| [[Small file IO aggregation]]<br />
||Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.<br />
|| 7 ||[https://bugzilla.lustre.org/show_bug.cgi?id=944 b=944]<br />
|-<br />
| [[Lustre Snapshots|Integrated Lustre Snapshots]]<br />
|| Allow Lustre to internally mount and manage ZFS snapshots and other datasets within a single namespace<br />
|| 7 ||<br />
|-<br />
| [[Client-side data encryption]]<br />
||Encrypt files and directories (or possibly just filenames in directories) on the client before sending to the server. This avoids sending unencrypted data over the network, or ever having the data in plaintext on the server (in case of separate decryption from network plus encryption on disk). It seems possible to leverage the existing GSSAPI sptlrpc_cli_wrap_bulk() and friends to do bulk data encryption/decryption on the client using a per-file key (itself stored encrypted by the users' key(s) with the file on the MDS), and salted with the OST object ID or stripe index and object offset, rather than a per-session key and then not decrypting the data at the server before writing it to disk.<br />
|| 7 ||[]<br />
|-<br />
| [[Local object zero-copy IO]]<br />
||Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.<br />
|| 9 ||<br />
|-<br />
|}<br />
<br />
== Past Projects ==<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
! Feature !! Feature Summary !! Point of Contact!! Tracker !! Target Date (YYYY-MM) !! Merge Date (YYYY-MM)<br />
|-<br />
| [[Dynamic LNET Config]] || Introduces a user-space script that reads routes from a config file and adds them to LNET dynamically via the lctl utility, allowing support for very large routing tables. || Amir Shehata (Intel) || [https://jira.hpdd.intel.com/browse/LU-2456 LU-2456] || 2014-11 || 2014-11<br />
|-<br />
| [[LFSCK Phase 3 - DNE consistency]] || Enhance LFSCK to work with DNE filesystems, including remote directory entries, and OST orphan handling for multiple MDTs. || Fan Yong (Intel) || [https://jira.hpdd.intel.com/browse/LU-2307 LU-2307] || 2014-11 || 2014-11<br />
|-<br />
| [[LFSCK Phase 4 - performance tuning]] || Enhance LFSCK performance and efficiency. || Fan Yong (Intel) || [https://jira.hpdd.intel.com/browse/LU-6361 LU-6361] || 2015-05 || 2015-05<br />
|-<br />
| [[Multiple metadata RPCs]] || Support multiple concurrent metadata modifications per client (tracked in the last_rcvd file) to improve the multi-threaded metadata performance of a single client. || Grégoire Pichon (Bull/Atos) || [https://jira.hpdd.intel.com/browse/LU-5319 LU-5319] || 2015-05 || 2015-05<br />
|-<br />
| [[DNE Phase IIb]] || Asynchronous Commit of cross-MDT updates for improved performance. Remote rename and remote hard link functionality. || Wang Di (Intel) || [https://jira.hpdd.intel.com/browse/LU-3534 LU-3534] || 2015-05 || 2015-05<br />
|-<br />
| [[Kerberos Revival]] || Fix up the existing Kerberos code so that it is tested and working again. || Sébastien Buisson (Bull/Atos) || [https://jira.hpdd.intel.com/browse/LU-6356 LU-6356] || 2015-06 || 2015-06<br />
|}<br />
<br />
[[Category: Development]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Lustre_Conferences&diff=1654Lustre Conferences2016-04-28T01:16:53Z<p>Nrutman: /* Developer Day meetings */</p>
<hr />
<div>[http://wiki.lustre.org/Past_Events External Links to past events]<br />
<br />
=Lustre User Group (LUG) conferences=<br />
* [[Lustre User Group 2016]]<br />
* [[Lustre User Group 2015]]<br />
* [[Lustre User Group 2014]]<br />
* [[Lustre User Group 2013]]<br />
* [[Lustre User Group 2012]]<br />
* [[Lustre User Group 2011]]<br />
* [[Lustre User Group 2010]]<br />
* [[Lustre User Group 2009]]<br />
* [[Lustre User Group 2008]]<br />
* [[Lustre User Group 2007]]<br />
* [[Lustre User Group 2006]]<br />
<br />
=Lustre Administrator & Developer (LAD) conferences=<br />
* [http://www.eofs.org/?id=lad15 LAD15]<br />
* [http://www.eofs.org/?id=lad14 LAD14]<br />
* [http://www.eofs.org/?id=lad13 LAD13]<br />
* [http://www.eofs.org/?id=lad12 LAD12]<br />
* [http://www.eofs.org/?id=lad11 LAD11]<br />
<br />
=Technical Meetings=<br />
* [[Developer_Day_2016-04-04]]<br />
* [[Developer_Day_2015-04-16]]<br />
* [[Lustre_Developer_Summit_2014]]<br />
* [[Lustre_Fall_Workshop_10/2010]]<br />
* [[SC%2709_and_Lustre_Senior_Technical_Meeting_11/2009]]<br />
* [[Lustre_All-Hands_Meeting_12/2008]]<br />
* [[Lustre_All-Hands_Meeting_3/2008]]<br />
<br />
=Other=<br />
* [[HPC_Software_Workshop_and_Seminars_-_Regensburg_Germany_2009]]<br />
<br />
[[Category: Events]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Multi-rail_lnet&diff=1567Multi-rail lnet2016-04-06T21:05:39Z<p>Nrutman: Created page with "see Multi-Rail_LNet"</p>
<hr />
<div>see [[Multi-Rail_LNet]]</div>Nrutmanhttp://wiki.lustre.org/index.php?title=Talk:Layout_Enhancement_High_Level_Design&diff=1096Talk:Layout Enhancement High Level Design2015-12-03T22:16:33Z<p>Nrutman: Created page with "> The use of dedicated RPCs for managing composite layouts has been superseded in the PFL 2: High Level Design by the use of virtual xattrs that can address individual compone..."</p>
<hr />
<div>> The use of dedicated RPCs for managing composite layouts has been superseded in the PFL 2: High Level Design by the use of virtual xattrs that can address individual components of the file<br />
<br><br />
Can we update the text on the wiki to describe the virtual xattr plan?<br />
(Also, link seems bad)</div>Nrutman