Erasure Coding Read-Only High Level Design
Introduction
The Erasure Coding High Level Design (HLD) describes how the EC feature may be implemented, including the user interfaces, how files with the EC feature interact with the client IO layer, and the RPC interchange between the client and servers.
Design Overview
There are three main components to the EC design:
- The user-space interfaces for Lustre-specific command line tools and user library application programming interfaces (APIs)
- Changes to the client IO (CLIO) layer IO engine and RPC interfaces to the MDS for accessing and writing composite file layouts
- Changes to the MDS server to create, modify, and delete composite files
The design is structured in a top-down manner, starting with the command line interfaces that users interact with, then the user library APIs, the client-side kernel changes, RPCs for creating and modifying composite files, and server-side changes.
User Space Interfaces
The lfs(1) command line interface will be extended to understand and manipulate EC files and their component layouts. Users will use it to create files with an EC layout, show the layout of existing files, and set a default EC layout template on directories, which can be inherited by new files and sub-directories created therein.
lfs setstripe
The lfs setstripe command creates a new file with the specified layout parameters, or sets the default layout template for a directory.
lfs setstripe {--component-end|-E END1} [COMP1_OPTIONS] {-L ec} [-k NUM_DATA_STRIPES ] {-m NUM_PARITY_STRIPES} [SETSTRIPE_OPTIONS] \
{--component-end|-E END2} [COMP2_OPTIONS] [{-L ec} ...] {FILENAME|DIRECTORY}
For ease of use, the EC component can be specified with a shorthand, like "-L ec:D+P" or "--ec D+P", where D is NUM_DATA_STRIPES and P is NUM_PARITY_STRIPES.
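The "D+P" shorthand could be parsed along the following lines. This is only a sketch: ex_parse_ec_shorthand() and struct ex_ec_geometry are hypothetical names, not existing lfs code, and the range checks assume the limits stated later in this document for lcme_dstripe_count and lcme_cstripe_count (k at most 255, m at most 15, m not greater than k).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for a parsed "D+P" geometry (not an lfs type). */
struct ex_ec_geometry {
	unsigned int data_stripes;	/* D, the k value */
	unsigned int parity_stripes;	/* P, the m value */
};

/*
 * Parse a shorthand such as "8+2".
 * Returns 0 on success, -1 if the string is malformed or out of range.
 */
int ex_parse_ec_shorthand(const char *arg, struct ex_ec_geometry *geo)
{
	unsigned int d, p;
	char trailing;

	/* the extra %c only converts if garbage follows, e.g. "8+2x" */
	if (sscanf(arg, "%u+%u%c", &d, &p, &trailing) != 2)
		return -1;
	/* assumed limits: 1 <= P <= 15, P <= D <= 255 */
	if (d == 0 || d > 255 || p == 0 || p > 15 || p > d)
		return -1;
	geo->data_stripes = d;
	geo->parity_stripes = p;
	return 0;
}
```

Calling ex_parse_ec_shorthand("8+2", &geo) would fill in 8 data stripes and 2 parity stripes, while inputs such as "8", "8+2x" or "2+8" are rejected.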
lfs mirror split -d -L ec FILENAME # delete all EC components
lfs setstripe --comp-del -I EC_COMP_ID FILENAME # delete a single EC component (not recommended)
This will create two separate sets of components in the file. The first set will hold the standard RAID0 data stripes for each of the components, and the second set will hold a matching set of EC parity stripes for each of those components. Each of these component sets will have a different MIRROR_ID value, like a regular FLR mirror, except that the EC parity components will additionally store the LOV_PATTERN_PARITY flag in their lmm_pattern.
Since this is the primary command line interface for users to create new files with Lustre-specific layouts, there are a number of existing options that can be used. Adding erasure coding specific options allows the same code to create an erasure coding enabled composite layout without duplicating a large number of options. The existing --component-end arguments specify the data component, and the -L ec arguments following the data component arguments specify the parameters of the code component associated with that data component. -k specifies the number of data stripes from which the erasure code is computed; if this option is not set, all of the stripe devices of the associated data component are used for the code calculation. -m specifies the number of code stripes. Encoding takes k data devices and calculates m code devices. The subsequent -i/-o/-p options can be used to specify which OSTs are used as the code devices.
As an example:
$ lfs setstripe -E 4M -c 4 -L ec -m 2 -E eof -c 32 --ec 8+2 /mnt/lustre/new_file
This creates a file with two data components and two code components.
Every code component uses the same stripe size as its corresponding data component, and the parity computation only involves the data of that component. If the trailing data of the component does not fill a complete group of k stripes, a zero-filled buffer is used as padding for the parity computation.
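The zero-padding rule for unaligned trailing data can be illustrated with a deliberately simplified parity function. The sketch below uses a single XOR parity (m = 1) as a stand-in for the Reed-Solomon code that the design actually proposes, and ex_xor_parity_padded() is a hypothetical name. Bytes beyond the valid length of a stripe are treated as zero, which for XOR means they are simply skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified stand-in: single XOR parity (m = 1), not Reed-Solomon.
 * data[i] points at lens[i] valid bytes of stripe i. Bytes past
 * lens[i] count as zero padding, so a short or absent trailing stripe
 * needs no real buffer.
 */
void ex_xor_parity_padded(const unsigned char *const *data,
			  const size_t *lens, size_t k,
			  unsigned char *parity, size_t stripe_size)
{
	size_t i, j;

	memset(parity, 0, stripe_size);
	for (i = 0; i < k; i++) {
		size_t n = lens[i] < stripe_size ? lens[i] : stripe_size;

		/* XOR with zero padding is a no-op for j >= n */
		for (j = 0; j < n; j++)
			parity[j] ^= data[i][j];
	}
}
```

With k = 3 and a last row where the first stripe is full, the second is half full and the third is empty, the parity is the XOR of the valid bytes only, exactly as if the missing bytes had been zero.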
lfs mirror resync [-y] EC_FILENAME
This command is used to resynchronize an out-of-sync erasure-coded file. If there are no stale code components in the EC file and the -y argument is not used, this command does nothing. Otherwise, this command will first take an exclusive lease on the file and then resynchronize it. During the resynchronization, the file data will be read to generate the parity code, and the parity code will then be written to the OST objects of the code components. After the parity code is synchronized, this command will change the layout to mark the code components as uptodate and release the lease on the file.
It is possible to add an EC component to an existing striped file. If there is only a single data mirror, then the EC mirror can determine the mirror ID itself:
lfs mirror extend --ec 8+2 /mnt/lustre/existing_file
However, if the file has multiple data mirrors, then it is necessary to specify which one the EC mirror will be bound to, to ensure the parity stripes do not share the same OSTs as the data stripes:
lfs mirror extend --ec 8+2 -I DATA_COMP_ID /mnt/lustre/existing_file
On-disk/wire structure changes for Erasure Coding layout component
EC Geometry
The Lustre FLR-EC layout will implement redundancy on a *per-file* basis, with the data and parity being self-contained in the objects allocated to each file. This allows a great deal of flexibility in EC geometries, since they can be selected and tuned on a per-file, per-directory, and per-application basis. A file layout can be mapped conceptually as if each OST object is a "disk" in a traditional RAID device, where each file has a different set of "disks" (objects) that are wholly independent of the layout used by other files. The data stripes ("disks") will be arranged in a RAID-0 format as they are for traditional Lustre striped file layouts. The EC parity will be isolated to their own stripes ("disks"), which results in a RAID-4 style layout (dedicated parity "disks") rather than RAID-5 or RAID-6 where the parity is interleaved with the file data to balance load during parity reconstruction.
The following figure shows a RAID-4 layout with k=3 data stripes (0, 1, 2) and m=1 parity stripe (p). The 3 data stripes themselves are in a component with a RAID-0 layout that make up the data mirror, and the single EC parity stripe is in its own component that makes up the EC mirror. Each stripe is an OST object that is allocated on a different OST. Data is written in round-robin "slices" (A, B, C, D) in stripe_size units across the stripes, with the corresponding EC parity (Ap, Bp, Cp, Dp) in the parity stripe.
A significant advantage of using the RAID-4 geometry (RAID-0 + fixed parity stripes) is that this allows *adding* EC parity to existing Lustre files, and maintaining read compatibility for older clients that do not understand FLR-EC layouts. The RAID-0 data mirror of the file will be written normally in the same manner it was before FLR-EC (potentially already years ago), and then the EC mirror can be created and written afterward to add redundancy to the file. The delay between data writes and EC parity writes can be managed as appropriate for the files. For example, short-lived checkpoint files may not need any EC overhead, but 1/6 or 1/8 or 1/24 of hourly checkpoints for a long-running application (4, 3, or 1 times per day, respectively) may still want to have EC generated to minimize the loss of application progress in case of a significant failure of the storage.
Because the FLR-EC layout is on a per-file basis, the data and EC parity stripes will themselves be distributed across OSTs in the filesystem. There will therefore already be "parity declustering" across OSTs, and data reconstruction during reads will naturally read parity from multiple OSTs. Similarly, if files need to be reconstructed due to missing objects/OSTs, the replacement objects will be allocated from all OSTs with free space, and the writes will be distributed across all OSTs, providing "distributed hot spare" space during rebuild.
It is expected that files will often use the standard "8+2" data+parity configuration to trade off reliability vs. performance and capacity overhead, but other layouts like "16+3" or "32+4" can also be used for large files to reduce the overhead of space used for the largest files in the filesystem. Using wider-striped EC geometries will impose more overhead during parity reconstruction due to the need to read all remaining data stripes and at least one parity stripe in order to rebuild the missing data. For example, with a geometry --ec 32+3, 32MiB of data and parity would need to be read per 1MiB of data to reconstruct an unavailable OST object. That is expensive if the file is accessed randomly by many clients, but doesn't add significant overhead if the application is already reading that 32MiB of data during normal operations, and it reduces parity space overhead to only 3/32 ≈ 9.4% of that file's data. More extreme EC geometries are still possible (e.g. 255+1 through 255+15), but more experience is needed to determine where the cutoff lies before the reconstruction overhead is too high or the reliability is too low to meet operational requirements.
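The trade-off between space overhead and reconstruction cost can be made concrete with two small helper functions. Both names are hypothetical and the functions only restate the arithmetic used in the paragraph above, for the case of a single missing data stripe.

```c
#include <assert.h>

/* Parity space overhead of a k+m geometry, in percent of the data size. */
double ex_ec_space_overhead_pct(unsigned int k, unsigned int m)
{
	assert(k > 0);
	return 100.0 * m / k;
}

/*
 * Number of stripe_size units that must be read to rebuild one unit of
 * a single missing data stripe: the k - 1 surviving data stripes plus
 * one parity stripe.
 */
unsigned int ex_ec_rebuild_reads(unsigned int k)
{
	assert(k > 0);
	return (k - 1) + 1;
}
```

For 8+2 this gives 25% space overhead and 8 units read per rebuilt unit, while 32+3 gives 9.375% overhead and 32 units read per rebuilt unit.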
RAID Sets
A RAID set is a group of related data and parity stripes within the file, composing (at least part of) the "k+m" EC Geometry of that file. Each EC file will have at least one RAID set.
When the stripe count of a file exceeds the number of data stripes in the EC geometry, the data stripes are divided into multiple RAID sets. This is necessary when files are widely striped (e.g. 128 data stripes) to allow the EC Geometry to remain reasonable for reconstruction. In such cases, the data stripes are split into sets of k stripes, each of which has its own set of m parity stripes. For example, in a 16-stripe file with an --ec 8+2 layout there will be 16 data stripes in the data mirror split into two 8+2 RAID sets, with two corresponding sets of 2 EC parity stripes (4 parity stripes in total) in the EC mirror. This allows the EC parity calculations to remain relatively efficient, while still allowing widely-striped files to have parity.
The following image shows an example RAID-4 layout with 6 stripes in the data mirror component split into two k=3 m=1 RAID sets (0, 1, 2 and 3, 4, 5), and the EC mirror component has 2 parity stripes (p, q), one for each RAID set. The data stripes 0, 1, 2 use parity stripe p, and data stripes 3, 4, 5 use parity stripe q. The 6 data stripes themselves are in a single component with a RAID-0 layout that makes up the data mirror, and the two EC parity stripes are in their own component that makes up the EC mirror. Each data and parity stripe is an OST object that is allocated on a different OST. Data is written in round-robin "slices" (A, B, C, D) in stripe_size units across the stripes, with the corresponding EC parities (Ap+Aq, Bp+Bq, Cp+Cq, Dp+Dq) in the parity stripes. It should be noted that this is only an example to illustrate the concept of RAID sets in a constrained space, and in a real system it is much more likely that this file would use a single EC 6+2 geometry to allow recovery from two concurrent failures.
RAID sets also represent isolated failure domains within a file, allowing the reconstruction of data independently and in parallel across each RAID set. In addition, for widely-striped files where the data stripes are allocated on most or all of the OSTs in the filesystem, RAID sets allow the data and parity stripes to be distributed across all of the OSTs in the filesystem, since the only requirement to maintain redundancy is that the OSTs *do not overlap within a single RAID set*.
Unequal RAID Sets
If the stripe_count of a file is not an integer multiple of k, then the RAID sets in that file will not all have identical EC geometry. This can happen for multiple reasons, such as the total OST count not matching the EC geometry, OSTs being offline or full when files are being created, if a single EC geometry is specified for a PFL file that has component stripe counts that do not match, if there is a default EC geometry for the filesystem that doesn't align with the default stripe count in a directory, or if the user explicitly specifies a stripe count and EC geometry that do not match.
In such cases, the EC geometry will be adjusted automatically to best align with the file's (component's) data stripe count, while maintaining at least the requested redundancy. That means that the EC geometry may reduce the EC stripe count k to better align with the file's stripe count, so that all of the RAID sets in a component will have an EC stripe count within one of each other. The RAID sets with the higher stripe count will cover the first stripes of the component, and 0 or more RAID sets with one fewer stripe will cover the last stripes of the component.
For example, a file with stripe_count=20, k=8 could have three RAID sets 8+2, 8+2, 4+2 to cover the 20 data stripes, but the last RAID set would have higher EC overhead compared to the first two. Instead, the EC layout would use RAID sets 7+2, 7+2, 6+2 so that the EC overhead is more uniform across the file.
The following image shows an example EC layout with 7 data stripes with one k=4 m=1 RAID set and one k=3 m=1 RAID set (data stripes 0, 1, 2, 3, and 4, 5, 6 respectively) and two parity stripes (p, q), one for each RAID set. The data stripes 0, 1, 2, 3 use parity stripe p, and data stripes 4, 5, 6 use parity stripe q.
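One way to obtain RAID set sizes that differ by at most one stripe, with the larger sets first, is sketched below. The function name is hypothetical, and the choice of ceil(stripe_count / k) as the number of RAID sets is an assumption that happens to reproduce the examples in this section. The document does not prescribe a specific algorithm.

```c
#include <assert.h>

/*
 * Split stripe_count data stripes into RAID sets of at most k stripes
 * whose sizes differ by at most one, larger sets first.
 * Stores the set sizes in sets[] (capacity max_sets) and returns the
 * number of sets, or -1 on invalid input or insufficient capacity.
 */
int ex_ec_split_raid_sets(unsigned int stripe_count, unsigned int k,
			  unsigned int *sets, unsigned int max_sets)
{
	unsigned int nsets, base, extra, i;

	if (stripe_count == 0 || k == 0)
		return -1;
	nsets = (stripe_count + k - 1) / k;	/* ceil(stripe_count / k) */
	if (nsets > max_sets)
		return -1;
	base = stripe_count / nsets;
	extra = stripe_count % nsets;	/* this many sets get one more stripe */
	for (i = 0; i < nsets; i++)
		sets[i] = base + (i < extra ? 1 : 0);
	return (int)nsets;
}
```

For stripe_count=20 and k=8 this yields three sets of 7, 7 and 6 data stripes, and for stripe_count=7 and k=4 it yields sets of 4 and 3 data stripes. Each set keeps the requested m parity stripes.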
layout header
The struct lov_comp_md_v1 uses one of the lcm_padding3 bytes to store the number of EC components in the file layout.
struct lov_comp_md_v1 {
	__u32 lcm_magic;
	__u32 lcm_size;
	__u32 lcm_layout_gen;
	__u16 lcm_flags;
	__u16 lcm_entry_count;
	/* lcm_mirror_count stores the number of actual mirrors minus 1,
	 * so that non-flr files will have value 0 meaning 1 mirror.
	 */
	__u16 lcm_mirror_count;
	/* code components count, non-EC file contains 0 ec_count */
+	__u8 lcm_ec_count;
	__u8 lcm_padding3[1];
	__u16 lcm_padding1[2];
	__u64 lcm_padding2;
	struct lov_comp_md_entry_v1 lcm_entries[0];
};
code component entry header
The struct lov_comp_md_entry_v1 uses two of the lcme_padding_1 bytes to store the number of data stripes (k) and the number of erasure code stripes (m) used for EC in the component. For each data component in an EC file there will be a separate EC component associated with it. The one or more data components that make up a whole copy of the file data (i.e. in a PFL file) are a "data mirror", and will have a separate matching set of EC components that make up an "EC mirror". The "EC mirror" will be dedicated to a specific "data mirror", even if there are multiple data mirrors in the same file. This is required to ensure the failure domains (OSTs) of the data mirror do not overlap with the failure domains (OSTs) of its corresponding EC mirror. While it 'may' be possible to reconstruct data from a mismatched set of data and EC mirrors, this cannot be relied upon and would only be done in case of emergency.
struct lov_comp_md_entry_v1 {
	__u32 lcme_id;
	__u32 lcme_flags;
	struct lu_extent lcme_extent;
	__u32 lcme_offset;
	__u32 lcme_size;
	__u32 lcme_layout_gen;
+	__u8 lcme_dstripe_count; /* data stripe count used in EC, k value */
+	__u8 lcme_cstripe_count; /* code stripe count, m value */
	__u16 lcme_padding_1;
	__u64 lcme_timestamp;
};
It should be noted that lcme_dstripe_count does not need to be equal to the lmm_stripe_count of the file, as it is desirable to allow e.g. EC 8+2 RAID sets in a file with 128 data stripes. The restrictions are that lcme_dstripe_count cannot be greater than lmm_stripe_count, and cannot be larger than 255. The lcme_cstripe_count cannot be greater than lcme_dstripe_count, and cannot be larger than 15 to avoid too large overhead when (re)computing the parity data.
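These restrictions translate directly into a validation check. The sketch below uses hypothetical names (the EX_ constants and ex_ec_counts_valid() are not existing Lustre symbols), and the rejection of zero counts is an added assumption that the text above does not state.

```c
#include <assert.h>
#include <stdbool.h>

#define EX_EC_MAX_DATA_STRIPES		255	/* limit on k from the text */
#define EX_EC_MAX_CODING_STRIPES	15	/* limit on m from the text */

/* Check the k and m values of an EC component against the data stripe count. */
bool ex_ec_counts_valid(unsigned int dstripe_count,
			unsigned int cstripe_count,
			unsigned int lmm_stripe_count)
{
	/* assumption: an EC component needs at least one data and one code stripe */
	if (dstripe_count == 0 || cstripe_count == 0)
		return false;
	if (dstripe_count > lmm_stripe_count ||
	    dstripe_count > EX_EC_MAX_DATA_STRIPES)
		return false;
	if (cstripe_count > dstripe_count ||
	    cstripe_count > EX_EC_MAX_CODING_STRIPES)
		return false;
	return true;
}
```

An 8+2 geometry on a file with 128 data stripes passes this check, while 8+2 on a 4-stripe file, or any geometry with more than 15 code stripes, does not.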
data component entry blob is the same as a non-EC plain file layout
This describes a normal RAID-0 data component layout, as is traditionally used by Lustre. There may be one or more RAID-0 data components for different extents of a PFL file that makes up a single data mirror.
struct lov_mds_md_v3 {
	__u32 lmm_magic; /* LOV_MAGIC_V3 */
	__u32 lmm_pattern; /* LOV_PATTERN_RAID0 */
	struct ost_id lmm_oi;
	__u32 lmm_stripe_size;
	__u16 lmm_stripe_count; /* data stripe count */
	__u16 lmm_layout_gen;
	char lmm_pool_name[LOV_MAXPOOLNAME + 1];
	struct lov_ost_data_v1 lmm_objects[0];
};
parity code component entry blob
This represents only the erasure-coded parity stripes of an EC mirror. There will be exactly one EC component in the EC mirror for each data component in the corresponding data mirror. The extent start and end and the stripe size of each EC component must exactly match those of the corresponding data component.
struct lov_mds_md_v3 {
	__u32 lmm_magic; /* LOV_MAGIC_V3 */
	__u32 lmm_pattern; /* LOV_PATTERN_PARITY */
	struct ost_id lmm_oi;
	__u32 lmm_stripe_size;
	__u16 lmm_stripe_count; /* ec stripe count */
	__u16 lmm_layout_gen;
	char lmm_pool_name[LOV_MAXPOOLNAME + 1];
	struct lov_ost_data_v1 lmm_objects[0];
};
There will be one OST object for each parity stripe for each RAID Set in the file. For example, in a simple 8-stripe file with an --ec 8+2 layout, there will be 8 data stripes in the data mirror, and 2 parity stripes in the EC mirror. In a 40-stripe file with an --ec 8+2 layout, there will be 40 data stripes in the data mirror that make up five 8+2 RAID sets, resulting in 5x2=10 EC stripes in the EC mirror. The data in the EC stripes will essentially also have a striped "RAID-0" layout, with each EC stripe corresponding to a single "parity disk" in the RAID-4 layout.
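The parity object count from these examples can be computed as follows. Both function names are hypothetical. The mapping from a data stripe to its first parity stripe additionally assumes equal-sized RAID sets whose parity stripes are stored consecutively in RAID set order, which matches the figures in this document but is not stated as a rule.

```c
#include <assert.h>

/* Number of parity OST objects in the EC component: m per RAID set. */
unsigned int ex_ec_parity_object_count(unsigned int stripe_count,
				       unsigned int k, unsigned int m)
{
	assert(stripe_count > 0 && k > 0);
	return ((stripe_count + k - 1) / k) * m;
}

/*
 * Index of the first parity stripe protecting a given data stripe,
 * assuming equal-sized RAID sets laid out in order.
 */
unsigned int ex_ec_first_parity_index(unsigned int data_stripe,
				      unsigned int k, unsigned int m)
{
	assert(k > 0);
	return (data_stripe / k) * m;
}
```

For the 40-stripe 8+2 file this gives 10 parity objects, and data stripe 39 (in the fifth RAID set) maps to parity stripes 8 and 9.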
Memory structure changes for Erasure Coding layout
layout header
struct lov_stripe_md {
	...
	u32 lsm_magic; -> LOV_MAGIC_COMP_EC
	...
};
code component entry
struct lov_stripe_md_entry {
	...
	u32 lsme_magic; -> LOV_MAGIC_EC
+	u8 lsme_dstripe_count;
+	u8 lsme_cstripe_count;
	...
};
The code component has the same extent range as its associated data component, and the code component descriptor is positioned after the data components.
Layout Component Creation
Creating layout components on demand still uses the layout intent RPC to notify the MDS to prepare the file’s components, as is done for PFL/FLR files.
Write to EC File with Delayed Parity
The first phase of this project aims to build the infrastructure to support erasure coding in Lustre. To simplify the implementation, this design only supports delayed parity calculation, i.e., a data write does not generate parity code at the same time. It only marks the code components stale, and the write continues on the striped data components. After the file is closed, an administrator can use an external tool to generate and write the parity code. The marking of a code component as stale will be recorded in the Lustre ChangeLog so that the resync tool or an external policy engine can detect and resync the code components.
With delayed write, the write intent RPC handler in the MDS will instantiate the destination data and code components, and mark the code component STALE. The layout generation will be increased to notify the other clients to clean up their cache. These code components will stay stale, and reads cannot use them to rebuild missing data until they are resynced.
Map file offset to code offset
Each stripe of parity code is calculated by a polynomial taking several (say k) stripes of data as its variables, so that every k bytes of data in the file corresponds to a byte of code in one of the corresponding component code stripes.
Suppose the data at offset off is located in component m, which starts at comp_m.start, has stripe size s_m, and uses k_m stripes for code generation. The code offset code_off in the code object corresponding to the data at off can then be calculated as follows, where / is integer division and % is the remainder:
off_m = off - comp_m.start
code_off = off_m / (k_m * s_m) * s_m + off_m % s_m
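The offset mapping can be written as a small function. ex_ec_code_offset() is a hypothetical name, and the sketch applies the formula exactly as given, which implicitly assumes that the k stripes used for code generation span one full row of the component.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a file offset inside a component to the offset in the code
 * object, following the formula in the text.
 */
uint64_t ex_ec_code_offset(uint64_t off, uint64_t comp_start,
			   uint32_t k, uint32_t stripe_size)
{
	uint64_t off_m, row_size;

	assert(off >= comp_start && k > 0 && stripe_size > 0);
	off_m = off - comp_start;
	row_size = (uint64_t)k * stripe_size;	/* data bytes per parity row */
	return off_m / row_size * stripe_size + off_m % stripe_size;
}
```

With k=4 and a 1MiB stripe size, a file offset of 1MiB+5 in a component starting at 0 lies in the first row and maps to code offset 5, while file offset 4MiB+7 lies in the second row and maps to code offset 1MiB+7.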
Erasure Code Resynchronization
After an EC file is written, its code component is not valid and cannot be used to reconstruct data if a data OST is unavailable. This phase will need an external mechanism to synchronize the code component, i.e., compute the parity code from the data and write it to the code component, then update the layout to indicate that the parity in the code component is valid so that reads can leverage it to recover data from possibly failed OSTs.
Read of Degraded File
When a read from an EC file encounters failed OSTs, the IO framework detects the error. The LOV layer will then manage the pages needed to read the remaining data pages in the EC chunk and the associated parity from the code component, and regenerate the missing data pages for the failed OSTs, so that the client can read without noticing that the OSTs are unavailable. Since the EC parity is computed on a per-block basis across the striped file, the minimum number of pages needed to reconstruct one missing data page is (k-1)+1: one data page from each of the k-1 remaining data stripes plus one parity page (though possibly both parities to allow confirming reconstruction correctness).
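The page count can be illustrated with a simplified reconstruction routine. The sketch below substitutes a single XOR parity (m = 1) for the Reed-Solomon decode that the design proposes, and ex_xor_rebuild_page() is a hypothetical name. It consumes exactly k-1 surviving data pages and one parity page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified stand-in for Reed-Solomon decode: with a single XOR
 * parity, the missing page equals the parity page XORed with the
 * k - 1 surviving data pages at the same offset.
 */
void ex_xor_rebuild_page(const unsigned char *const *surviving,
			 size_t nsurv, const unsigned char *parity,
			 unsigned char *out, size_t page_size)
{
	size_t i, j;

	memcpy(out, parity, page_size);
	for (i = 0; i < nsurv; i++)
		for (j = 0; j < page_size; j++)
			out[j] ^= surviving[i][j];
}
```

With k=3, rebuilding the page of the middle stripe takes the pages of the first and third stripes plus the parity page, that is (3-1)+1 = 3 pages in total.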
For large reads (>= stripe_size), it is most efficient to read the full stripe_size of data and parity and reconstruct a whole stripe at a time. It is likely that the additional data stripes would be needed for the application read() call in any case.
Extent lock expand
Depending on the number of code fragments configured for erasure coding, calculating the data at a given file offset requires reading all of the data stripes in the relevant code fragments. This may involve reading data outside the current IO request range, which means a read lock covering that data must also be taken. In the case of an OST failure, the read will be restarted and will take the extent lock covering the relevant data and code fragments.
IO framework for read
When a read encounters a failed OST, the IO framework retry mechanism is triggered, and the retry creates another EC I/O, which reads the data stripes from the available OSTs and the parity code from the code OSTs, and calculates the data belonging to the failed OSTs.
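For the extent lock taken on a restarted read, one conservative option is to round the IO range out to whole parity rows of k * stripe_size bytes. The sketch below shows that rounding with hypothetical names (struct ex_extent uses inclusive bounds and is not the Lustre lu_extent). It is an over-approximation: strictly, only the matching offsets in each stripe are needed, and a real implementation may lock less.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical byte range with inclusive bounds. */
struct ex_extent {
	uint64_t start;
	uint64_t end;
};

/* Expand an IO range inside a component to whole rows of k * stripe_size. */
struct ex_extent ex_ec_expand_extent(struct ex_extent io,
				     uint64_t comp_start,
				     uint32_t k, uint32_t stripe_size)
{
	uint64_t row = (uint64_t)k * stripe_size;
	struct ex_extent out;

	assert(row > 0 && io.start >= comp_start && io.end >= io.start);
	/* round the start down and the end up to row boundaries */
	out.start = comp_start + (io.start - comp_start) / row * row;
	out.end = comp_start + ((io.end - comp_start) / row + 1) * row - 1;
	return out;
}
```

With k=4 and a 1MiB stripe size, a read of the range [5MiB, 6MiB-1] expands to [4MiB, 8MiB-1], which covers every data stripe that contributes to the parity for that range.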
Page management in LOV
To shield user space tools and the LLITE layer from direct awareness of the parity code, the LOV layer will take charge of the page management of the parity code belonging to the code components. LOV will acquire a CR extent lock on the requested code OST object and allocate the cl_page objects associated with the parity OST object. Since the code OST object is hidden from the LLITE layer, it is the LOV layer’s responsibility to maintain the code pages. These pages can be attached to the code extent lock and/or maintained in an LRU facility. When the lock is canceled, the parity cache can be discarded accordingly.
Requirements
Erasure Code Library
We would choose an existing library that supports erasure code encoding and decoding. The Intel Intelligent Storage Acceleration Library (ISA-L) (https://github.com/01org/isa-l) supports fast block Reed-Solomon type erasure codes for any encode/decode matrix in GF(2^8), and can be leveraged for parity code generation and data restoration.
User Space Tools
A Lustre user space tool is needed to define and set parity components for a file. We can reuse lfs mirror create to serve this purpose.
Another tool is needed to generate the erasure code for changed data blocks and update the parity code. We would reuse lfs mirror resync, which will check internally whether the component to resync is a mirror or an EC component.
Erasure Coded File Write
Erasure coded file writes will mark the corresponding parity component stale. After the file is closed, a resync tool (e.g. "lfs mirror resync") can be used to clear the write extent list and generate/update the erasure code for the corresponding parity component.
Erasure Coded File Read
The Lustre client will do normal reads from the RAID-0 data component. If there is an OST failure or other error reading from a data stripe, a read recovery will be started, which reads erasure code data from the parity components and reconstructs the data for the failed OST.
Future Development
Write to EC File with Immediate Parity
In a later development phase, it may be possible to implement Immediate EC Write for the restricted (but fairly common) use case of a single client writing a new file in linear offset order (which is the only IO method supported for most object stores). The EC components would be initialized as "in progress" or "STALE" during initial file writes. The client could compute the code for the EC component in a similar manner as is done for Read of Degraded File, and write it asynchronously to the EC stripes. If the writes of the file's data and code components have completed and synced without errors (the most common case), then the EC component(s) can be marked uptodate when the file is closed. This is essentially combining the Delayed Write with the Erasure Code Resynchronization step, and could cover a large fraction of the normal use cases.
Supporting Multiple EC Code Types
This section describes how the proposed EC infrastructure could be extended to support alternative erasure code constructions beyond the default Cauchy RS.
Encoding a Code Type in the Layout
The code type must be stored on disk such that
- Old clients reject layouts using a code type they do not understand.
- The correct encoding matrix is generated at resync and degraded-read time.
Two options seem possible without adding new on-disk fields.
Option A: upper bits of lcme_cstripe_count
lcme_cstripe_count is a __u8 on disk. The current maximum parity stripe count is 15 (LOV_EC_MAX_CODING_STRIPES), so only the low 4 bits are used. The upper bits can encode a code type:
Bits 7-4: EC code type (0 = Cauchy RS, current default)
Bits 3-0: parity stripe count (p), range 1-15
The proposed user-space tooling already rejects values >15. The server side (lod) does not currently validate lcme_cstripe_count against LOV_EC_MAX_CODING_STRIPES, so server-side validation would need to be added for this to be backwards-compatible. While compact and requiring no new fields, it overloads the count field with type semantics.
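The bit layout of Option A could be handled with helpers like the ones below. All names are hypothetical (only LOV_EC_MAX_CODING_STRIPES appears in the text above), and the sketch only shows the packing arithmetic, not the server-side validation that would also be required.

```c
#include <assert.h>
#include <stdint.h>

#define EX_EC_COUNT_MASK	0x0f	/* bits 3-0: parity stripe count */
#define EX_EC_TYPE_SHIFT	4	/* bits 7-4: EC code type */
#define EX_EC_TYPE_CAUCHY_RS	0	/* current default */

/* Combine a code type and a parity stripe count into one __u8 value. */
uint8_t ex_ec_pack_cstripe(uint8_t type, uint8_t parity_count)
{
	assert(type <= 0x0f && parity_count >= 1 && parity_count <= 15);
	return (uint8_t)((type << EX_EC_TYPE_SHIFT) | parity_count);
}

/* Extract the EC code type from the packed field. */
uint8_t ex_ec_code_type(uint8_t field)
{
	return field >> EX_EC_TYPE_SHIFT;
}

/* Extract the parity stripe count from the packed field. */
uint8_t ex_ec_parity_count(uint8_t field)
{
	return field & EX_EC_COUNT_MASK;
}
```

Because Cauchy RS is type 0, a packed value for the default code is numerically identical to the plain parity count, so existing layouts would decode unchanged. Any non-zero type yields a value above 15, which is what tooling that rejects counts greater than 15 would refuse.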
Option B: new flag in lmm_pattern
lmm_pattern is a __u32 per component with several free bits. New EC code types could be indicated with additional flags, e.g.:
LOV_PATTERN_EC_XXX = 0x008
An XXX parity component would have LOV_PATTERN_RAID0 | LOV_PATTERN_PARITY | LOV_PATTERN_EC_XXX. Old code already rejects unknown patterns.
lmm_pattern seems to be a good fit semantically. However, EC types are mutually exclusive, so validation must reject nonsensical combinations of multiple EC type flags.
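The mutual exclusion check for Option B is a one-line bit test. In the sketch below, EX_PATTERN_EC_XXX takes the example value 0x008 from the text, while EX_PATTERN_EC_YYY and its value are purely hypothetical and exist only so that there are two type flags to test against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag values for illustration only. */
#define EX_PATTERN_EC_XXX	0x008u	/* example value from the text */
#define EX_PATTERN_EC_YYY	0x010u	/* hypothetical second code type */
#define EX_PATTERN_EC_TYPES	(EX_PATTERN_EC_XXX | EX_PATTERN_EC_YYY)

/* Accept a pattern only if it carries at most one EC code type flag. */
bool ex_ec_type_flags_valid(uint32_t lmm_pattern)
{
	uint32_t types = lmm_pattern & EX_PATTERN_EC_TYPES;

	/* no bit set means the default code, a power of two means one type */
	return (types & (types - 1)) == 0;
}
```

A pattern with no type flag (the default code) or with exactly one type flag passes, while a pattern carrying both type flags is rejected.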
Requirements for a Compatible EC Code
Any candidate EC code must satisfy the following to be supportable within the proposed Lustre EC infrastructure without rewriting the encode/decode path and/or the client I/O stack.
- Systematic: Data blocks must be stored unmodified. A non-systematic code would require decoding on every normal read, which means reworking the full read path. It is therefore a hard requirement and non-systematic codes are incompatible.
- GF(2^8) arithmetic: The ISA-L implementation operates over a Galois field with 2^8 elements (GF(2^8)). The core functions ec_init_tables(), ec_encode_data(), and gf_invert_matrix() work with any encoding matrix over GF(2^8). That is, a different matrix construction (e.g., Vandermonde instead of Cauchy) only requires a new matrix generation function. The encode, decode, and inversion machinery can be reused as-is. A code requiring GF(2^16) or another field would need a replacement encode/decode implementation in both the kernel module and user-space library. Note, ISA-L does not support GF(2^16).
- Stripe-aligned I/O: The client issues RPCs in whole-stripe units. Codes that subdivide a stripe into sub-blocks and require non-contiguous access across sub-blocks at different offsets within a stripe do not map to the existing RPC model. Codes where sub-stripes align to whole stripes (or where sub-block grouping maps onto the existing RAID set structure) can be supported without changes to the I/O stack.
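To make the GF(2^8) requirement concrete, the field arithmetic that any compatible code has to map onto can be sketched in a few lines. The functions below are hypothetical and are not ISA-L code. They assume the reduction polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d), which is believed to be the one ISA-L uses but should be verified against the library headers.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
uint8_t ex_gf_mul(uint8_t a, uint8_t b)
{
	uint8_t r = 0;

	while (b) {
		if (b & 1)
			r ^= a;		/* addition in GF(2^8) is XOR */
		/* multiply a by x, reducing when the x^8 term appears */
		a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
		b >>= 1;
	}
	return r;
}

/* Inverse by exhaustive search: fine for a sketch, too slow for real use. */
uint8_t ex_gf_inv(uint8_t a)
{
	unsigned int x;

	for (x = 1; x < 256; x++)
		if (ex_gf_mul(a, (uint8_t)x) == 1)
			return (uint8_t)x;
	return 0;	/* a == 0 has no inverse */
}
```

Every entry of an encoding matrix, whether Cauchy or Vandermonde, is an element of this field, and encoding a parity byte is a sum (XOR) of such products. That is why a different matrix construction can reuse the same encode and decode machinery.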
The currently proposed implementation uses Cauchy RS (gf_gen_cauchy1_matrix()). Other code constructions, such as "LESS" (Cheng et al., USENIX FAST 2026), which uses Vandermonde matrices over GF(2^8), or "Hitchhiker" (Rashmi et al., ACM SIGCOMM 2014), which uses standard RS with an additional XOR piggybacking step, also appear compatible. Both are systematic, operate over GF(2^8), and their sub-stripe structures seem compatible with the proposed RAID set model. Integration would require a new matrix generation function and changes to the repair read path, but the core ec_encode_data() and gf_invert_matrix() logic can be reused.


