Layout Enhancement High Level Design

= 1. Introduction =

The following design applies to the Layout Enhancement project within the Technical Proposal by High Performance Data Division of Intel for OpenSFS Contract SFS-DEV-003 signed Friday 23rd August, 2013.

In the Lustre file system the data of a file is striped over one or more objects, each residing on an OST. The layout of a file is an attribute of the file which describes the mapping of file data ranges to object data ranges. The layout is stored on the MDT as a trusted extended attribute of the file and is sent to clients as needed. The layout of a file is often simply referred to as its striping, since in the current (2.5) implementation of the Lustre file system only non-redundant striped (RAID0) layouts are permitted. This project will design enhancements to the representation and handling of layouts to support features such as Data on MDT, File-Level Replication, Progressive File Layouts, live data migration (via File-Level Replication), and RAID1/5/6 or erasure coding. The Layout Enhancement (LE) project is therefore a prerequisite for the Data on MDT and File-Level Replication projects, described in separate documents.

In this document we present several new layout types: composite layouts to support replication and layout extents (sections 2.1 and 2.2), RAID layouts (section 3), compact layouts for widely striped files (section 4), and large layouts (section 5).

= 2.1. Composite Layouts =

In order to support file-level replication and extent based layouts we define a new composite layout type which comprises several plain (non-composite) layouts designating (partial) mirrors of the file data. This type generalizes RAID0+1 to allow heterogeneous stripe sizes and counts among mirrors and to allow for the possibility of partial mirroring schemes, potentially with each replica on different OST storage pools with different performance and capacity characteristics.

Composite layouts are described by struct lov_comp_md_v1, which is defined below.

 struct lu_extent {
         __u64 e_start;
         __u64 e_end;
 };

 /* These enum constants are virtual entry ids and are to be used in
  * struct lov_comp_md_op and in composite layout API functions. They
  * specify any or all entries in a composite layout matching a certain
  * condition. Together with a flag that negates the sense of matching
  * they allow us to specify multiple entries in a single operation,
  * for example: delete all entries but the primary from a layout, or
  * get any stale entry. */
 enum lcme_id {
         LCME_ID_NONE    = -1,
         LCME_ID_ALL     = -2,
         LCME_ID_ANY     = -3,
         LCME_ID_PRIMARY = -4,
         LCME_ID_STALE   = -5,
         LCME_ID_UNINIT  = -6,
 };

 enum lov_comp_md_entry_flags {
         LCME_FL_PRIMARY   = 0x00000001,
         LCME_FL_STALE     = 0x00000002,
         LCME_FL_OFFLINE   = 0x00000004,
         LCME_FL_PREFERRED = 0x00000008,
         LCME_FL_INIT      = 0x00000010,
 };

 struct lov_comp_md_entry_v1 {
         __u32 lcme_id;                 /* unique identifier of component */
         __u32 lcme_flags;              /* LCME_FL_XXX */
         struct lu_extent lcme_extent;  /* file extent for component */
         __u32 lcme_offset;             /* offset of component blob in layout */
         __u32 lcme_size;               /* size of component blob data */
         union {
                 __u64 lcme_padding;
         } u;
 };

 enum {
         LOV_MAGIC_COMP_V1 = 0x0BD40BD0,
 };

 enum lov_comp_md_flags {
         /* Replication states, use explained in replication HLD. */
         LCM_FL_RS_READ_ONLY     = 0,
         LCM_FL_RS_WRITE_PENDING = 1,
         LCM_FL_RS_WRITABLE      = 2,
         LCM_FL_RS_SYNC_PENDING  = 3,
         LCM_FL_RS_MASK          = 0xff,

         LCM_FL_PRIMARY_SET      = 1 << 15,
 };

 struct lov_comp_md_v1 {
         __u32 lcm_magic;       /* LOV_MAGIC_COMP_V1 */
         __u32 lcm_size;        /* overall size of layout including this structure */
         __u32 lcm_layout_gen;
         __u16 lcm_flags;
         __u16 lcm_entry_count;
         union {
                 __u64 lcm_padding[2];
         } u;
         struct lov_comp_md_entry_v1 lcm_entries[0];
 };

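A composite layout begins with a struct lov_comp_md_v1 followed by lcm_entry_count entries (instances of struct lov_comp_md_entry_v1), followed by lcm_entry_count plain layouts (lov_mds_md_v1, lov_mds_md_v3, or any other non-composite layouts).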

Each entry has an identifier (lcme_id) which is unique within the lifetime of the composite layout. This is the only way to identify an entry or component within a composite layout. In contrast to existing layout types, composite layouts are intended to be updated as components (file replicas) are added and removed. Therefore the particular index of an entry in lcm_entries is not a suitable identifier. Entry identifiers are selected by the server-side layout handling code as new components are added to a composite layout.

The offset and size members of the entry (lcme_offset and lcme_size) describe the location of the contained plain layout. We do not assume that the component offsets of the entries are in any particular order, nor do we assume any relationship among the component sizes. The offsets are chosen to be multiples of 8 so that components will be suitably aligned in memory. For example, given a pointer to an in-memory instance of struct lov_comp_md_v1 which contains three entries, the plain layout of each entry begins lcme_offset bytes from the start of that structure and occupies lcme_size bytes.
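The lookup described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper name comp_blob_by_id is hypothetical, fixed-width types from stdint.h stand in for the kernel's __u16/__u32/__u64, and the bounds checks are assumptions about what a careful reader of the format would verify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct lu_extent { uint64_t e_start; uint64_t e_end; };

struct lov_comp_md_entry_v1 {
    uint32_t lcme_id;
    uint32_t lcme_flags;
    struct lu_extent lcme_extent;
    uint32_t lcme_offset;   /* offset of component blob in layout */
    uint32_t lcme_size;     /* size of component blob data */
    union { uint64_t lcme_padding; } u;
};

struct lov_comp_md_v1 {
    uint32_t lcm_magic;
    uint32_t lcm_size;      /* overall size including this structure */
    uint32_t lcm_layout_gen;
    uint16_t lcm_flags;
    uint16_t lcm_entry_count;
    union { uint64_t lcm_padding[2]; } u;
    struct lov_comp_md_entry_v1 lcm_entries[];
};

/* Hypothetical helper: find the entry with the given id and return a
 * pointer to its plain layout blob, or NULL if the id is absent or the
 * entry fails the (assumed) sanity checks. Entries are searched by id,
 * never by index, because indices change as components come and go. */
static const void *comp_blob_by_id(const struct lov_comp_md_v1 *lcm,
                                   uint32_t id, uint32_t *size)
{
    assert(lcm != NULL && size != NULL);

    /* First byte after the header and the entry array. */
    size_t hdr_end = sizeof(*lcm) +
        (size_t)lcm->lcm_entry_count * sizeof(lcm->lcm_entries[0]);

    for (uint16_t i = 0; i < lcm->lcm_entry_count; i++) {
        const struct lov_comp_md_entry_v1 *e = &lcm->lcm_entries[i];

        if (e->lcme_id != id)
            continue;
        /* Offsets are multiples of 8; blobs lie inside the layout. */
        if (e->lcme_offset % 8 != 0 || e->lcme_offset < hdr_end ||
            (uint64_t)e->lcme_offset + e->lcme_size > lcm->lcm_size)
            return NULL;
        *size = e->lcme_size;
        return (const char *)lcm + e->lcme_offset;
    }
    return NULL;
}
```

Note that the loop makes no assumption about the order of the component offsets, matching the text above.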

Recall that extent based layouts permit different layouts to be used in different extents of a file. They may be used to set progressively wider striping as a file grows in size, to prevent inconsistent out of space errors as individual OSTs become full, and to enable incremental migration, replication, and HSM restore. The extent member of a composite layout entry describes the range of the file to which the component layout applies. Note that in the simplest case of replication each component would have the same extent of [0, ∞). For a more interesting use of component extents, consider a file F which cannot grow in size because one of its objects belongs to an OST with no free space. To handle this, the layout L0 of F is converted to a composite layout C which contains L0 as a component with extent [0, s) where s is the size of F rounded down to the nearest multiple of . Then a new plain layout L1 is allocated (with objects on other OSTs) and added to C but with extent [s, ∞). Now data appended to F goes to objects belonging to L1.
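The out-of-space example can be sketched in a few lines. The sketch assumes half-open extents [e_start, e_end) and represents ∞ as UINT64_MAX; the names EXT_INFINITY, split_point, and component_for_offset are hypothetical, and the rounding unit is left as a parameter because it is not named in the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: an unbounded extent end is represented by UINT64_MAX. */
#define EXT_INFINITY UINT64_MAX

struct lu_extent { uint64_t e_start; uint64_t e_end; };

/* Round the file size down to a multiple of `unit` to obtain s. */
static uint64_t split_point(uint64_t file_size, uint64_t unit)
{
    assert(unit != 0);
    return file_size - file_size % unit;
}

/* Return the index of the first extent containing `offset`, or -1.
 * Extents are treated as half-open intervals [e_start, e_end). */
static int component_for_offset(const struct lu_extent *exts, size_t count,
                                uint64_t offset)
{
    for (size_t i = 0; i < count; i++)
        if (offset >= exts[i].e_start && offset < exts[i].e_end)
            return (int)i;
    return -1;
}
```

With exts[0] = [0, s) for L0 and exts[1] = [s, EXT_INFINITY) for L1, any offset at or beyond s resolves to L1, which is how appended data reaches the newly allocated objects.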

= 2.2. RPCs for Composite Layouts =

In order to construct and modify composite layouts a set of new operations is defined. These are sent using a single high-level RPC whose opcode and format are based on the existing MDS_SWAP_LAYOUTS RPC. Depending on the operation, one or two FIDs are packed in an MDT body along with capabilities and LDLM data. To this a struct lov_comp_md_op is added which specifies the exact operation to be performed and any operation-specific data.

struct lov_comp_md_op is defined below.

The file(s) to be operated on are identified by  (and  ) of the MDT body. Files must reside on the MDT receiving the RPC; otherwise no operation is performed and  is returned. For quota maintenance purposes they must also have the same ownership (UID and GID) or  is returned. If  is not -1 then it must be equal to the layout generation of the file identified by  ; otherwise no operation is performed and   is returned. Similarly for  and the file identified by   for operations involving two files.

In each operation  designates one or more entries in a composite layout by using a known id or by using a virtual id from enum lcme_id. Setting  negates the sense of matching and may only be used with a specific id, , or  ; otherwise no operation is performed and   is returned. If either file has a non-composite layout then  should be used to specify its single component. If the designated entries do not exist then no operation is performed and  is returned.
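One possible reading of the matching rule is sketched below. The per-id semantics are an interpretation of the comment on enum lcme_id, not something the design states explicitly: in particular, treating LCME_ID_UNINIT as "LCME_FL_INIT not set" and LCME_ID_ANY as matching every entry (with the caller stopping at the first match) are assumptions. The function name lcme_matches is hypothetical, and the restriction on which ids may be combined with negation is not enforced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum lcme_id {
    LCME_ID_NONE    = -1,
    LCME_ID_ALL     = -2,
    LCME_ID_ANY     = -3,
    LCME_ID_PRIMARY = -4,
    LCME_ID_STALE   = -5,
    LCME_ID_UNINIT  = -6,
};

enum lov_comp_md_entry_flags {
    LCME_FL_PRIMARY   = 0x00000001,
    LCME_FL_STALE     = 0x00000002,
    LCME_FL_OFFLINE   = 0x00000004,
    LCME_FL_PREFERRED = 0x00000008,
    LCME_FL_INIT      = 0x00000010,
};

/* Virtual ids are negative, so they cannot collide with small real ids. */
static_assert(LCME_ID_UNINIT < 0, "virtual ids must be negative");

/* Hypothetical predicate: does an entry with the given id and flags
 * match the requested (real or virtual) id, optionally negated? */
static bool lcme_matches(uint32_t entry_id, uint32_t entry_flags,
                         int32_t want, bool negate)
{
    bool match;

    switch (want) {
    case LCME_ID_NONE:
        match = false;
        break;
    case LCME_ID_ALL:
    case LCME_ID_ANY:   /* assumed: caller stops after the first match */
        match = true;
        break;
    case LCME_ID_PRIMARY:
        match = (entry_flags & LCME_FL_PRIMARY) != 0;
        break;
    case LCME_ID_STALE:
        match = (entry_flags & LCME_FL_STALE) != 0;
        break;
    case LCME_ID_UNINIT:   /* assumed: entry lacks LCME_FL_INIT */
        match = (entry_flags & LCME_FL_INIT) == 0;
        break;
    default:               /* a specific, known id */
        match = entry_id == (uint32_t)want;
        break;
    }
    return negate ? !match : match;
}
```

Under this reading, "delete all entries but the primary" corresponds to want = LCME_ID_PRIMARY with negation set, and "get any stale entry" to want = LCME_ID_STALE without it.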

The extent supplied in the operation must be a valid non-empty extent, have e_start and e_end both 0 (empty), or have e_start and e_end both -1 (special, see below); otherwise no operation is performed and   is returned.


 * - move one or more components between the files (Fd and Fs) identified by  and  . Fd is the destination and Fs the source. The newly created entries in Fd will have flags and extents equal to   and .
 * - update the flags and/or extent of the entries in the file (F0) specified by  of the MDT body. If O is the supplied struct lov_comp_md_op then for each such entry E we set  . If   is not {   } then we set   as well.
 * - the designated entry in the file (F0) identified by  is set primary.

Clients can delete a given set of components from a composite layout by opening a volatile file and moving the set to the volatile file.

= 3. RAID Layouts =

RAID layouts are specified using  defined below. This format borrows from the SNIA Common RAID Disk Data Format Specification and allows us to specify common simple (0, 1, 4, 5, 6) and nested RAID levels (10, 40, 50, 60).

Non-nested RAID (levels 0, 1, 4, 5, and 6) layouts have a  value of 1. Nested RAID (levels 10, 40, 50, and 60) layouts have a value greater than 1. In the nested case, the file objects are arranged into  primary groups, each of which is a RAID set for the primary RAID level (0, 1, 4, 5, 6) with   objects. Then, depending on the secondary RAID level, these primary groups are mirrored or striped to create the file. For levels 5 and 6  includes the parity objects (and hence must be at least 3 for RAID-5 and 4 for RAID-6). The number of objects in a RAID layout is always equal to the product of the number of primary groups and the number of objects per group.
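The object-count rule can be sketched as a small validation helper. The function and parameter names (raid_object_count, group_count, objs_per_group) are hypothetical stand-ins for the layout's actual fields, and only the constraints stated above are enforced: a non-nested layout has a group count of 1, and RAID-5 and RAID-6 groups need at least 3 and 4 objects respectively.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: validate a (primary level, group count, objects
 * per group) triple and compute the total number of objects.
 * Returns 0 on success and -1 if the combination is invalid. */
static int raid_object_count(unsigned primary_level, uint32_t group_count,
                             uint32_t objs_per_group, uint64_t *total)
{
    assert(total != NULL);

    if (group_count == 0 || objs_per_group == 0)
        return -1;

    switch (primary_level) {
    case 0:
    case 1:
    case 4:
        break;
    case 5:     /* count includes parity: at least 3 objects */
        if (objs_per_group < 3)
            return -1;
        break;
    case 6:     /* count includes parity: at least 4 objects */
        if (objs_per_group < 4)
            return -1;
        break;
    default:    /* not a primary RAID level named in this design */
        return -1;
    }

    /* Total objects = primary groups x objects per group. */
    *total = (uint64_t)group_count * objs_per_group;
    return 0;
}
```

For instance, a RAID-50 layout with 2 primary groups of 4 objects each has 8 objects in total, while a RAID-6 group of 3 objects is rejected.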

= 4. Compact Layouts for Widely Striped Files =

Existing RAID-0 layout formats lov_mds_md_v1 and lov_mds_md_v3 use an explicit array of object identifiers to map each stripe index to a specific OST object. When using FIDs alone to identify objects this approach requires 16 bytes per stripe. The current implementation packs an OST index together with the object identifier and needs 24 bytes per stripe. The allocation of memory buffers to transmit, receive, and handle these layouts for very widely striped files (over 160 stripes) can be costly. To avoid this cost we define a compact layout based on a bitmask of OST indices, which reduces memory consumption for widely striped files by a factor of 192 for the current maximally-striped layout of 2000 stripes.
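The factor of 192 follows from comparing 24 bytes per stripe with one bit per OST index. The arithmetic below assumes that the bitmask covers exactly 2000 OST indices and ignores the fixed-size headers of both formats:

```latex
\[
\frac{2000 \times 24\ \text{bytes}}{2000\ \text{bits} \,/\, 8\ \text{bits per byte}}
  = \frac{48000\ \text{bytes}}{250\ \text{bytes}}
  = 192
\]
```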

struct lov_wide_md_v1 is defined below.

Every lov_wide_md_v1 may be converted into an equivalent lov_mds_md_v1 as sketched below. There are three ideas at work:


 * The  members describe the set of OSTs which contain objects for the file
 * The  specify the FID of an object in the file given its OST index.
 * is used to rotate the array of used OST indexes to ensure that the placement of initial stripes is not biased toward lower index OSTs.
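The first and third ideas can be sketched as follows. This is an illustration under stated assumptions rather than the actual conversion: the bitmask is taken to be an array of 64-bit words with bit b of word w standing for OST index 64*w + b, the rotation is applied as an offset into the ascending list of used indices, and the function names are hypothetical. Deriving the object FID from the OST index is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand a bitmask of OST indices into an ascending array of the
 * indices whose bits are set. Returns the number of indices written,
 * which is capped at out_cap. */
static size_t bitmap_to_ost_indices(const uint64_t *bitmap, size_t nbits,
                                    uint32_t *out, size_t out_cap)
{
    size_t n = 0;

    for (size_t bit = 0; bit < nbits && n < out_cap; bit++)
        if (bitmap[bit / 64] & (UINT64_C(1) << (bit % 64)))
            out[n++] = (uint32_t)bit;
    return n;
}

/* Map a stripe index to an OST index. The rotation shifts the starting
 * point in the used-OST array so that stripe 0 is not always placed on
 * the lowest-numbered OST (assumed direction of rotation). */
static uint32_t stripe_to_ost(const uint32_t *used, size_t n,
                              uint32_t rotation, size_t stripe)
{
    assert(n > 0);
    return used[(stripe + rotation) % n];
}
```

For a bitmask with OST indices 1, 4, 6, and 9 set and a rotation of 2, stripes 0 through 3 map to OSTs 6, 9, 1, and 4 in that order.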

= 5. Handling of Large Layouts =

There are (at least) three issues with handling large layouts:


 * Storing the layout as an extended attribute on an MDT inode limits its size to 64 KB.
 * Retrieval of the layout in the initial open RPC requires a large request/reply buffer allocation.
 * Layout sizes are constrained by single transfer size limits from lower layers (LNet).

To address these issues we define an indirect layout type which specifies that the file layout is stored as the data of a second layout file.

Clients encountering this layout type may send OST_READ requests to the MDT using the supplied FID to retrieve the true file layout data. The layout file must have a name (derived from the FID of the original file) in a directory ( or similar) on the MDT but this name will not exist in the client visible namespace.