Progressive File Layouts

The Lustre Progressive File Layout (PFL) feature is intended to simplify the use of Lustre so that users can expect reasonable performance for a variety of normal file IO patterns without needing to understand their IO model or Lustre usage details in advance. In particular, users do not necessarily need to know the size or concurrency of output files before creating them, nor explicitly specify an optimal layout for each file, in order to achieve good performance for both highly concurrent shared-single-large-file IO and parallel IO to many smaller per-process files.

The PFL feature is implemented in several phases, providing incremental functionality with each phase, including the base functionality of Composite layouts which can be used for several other features that affect the file layout.

Phase 1: Prototype Implementation
The Prototype Implementation is needed to explore options for implementing the PFL feature, and verify that expected performance gains are available for PFL files once a production implementation is available.
 * PFL Prototype Scope Statement describes the overall goals and intended outcomes of the prototype
 * PFL Prototype Solution Architecture describes how the goals of the PFL project may be implemented, and how to measure the completion and outcomes

Composite Extent-Mapped Layouts
Progressive file layouts are characterized by increasing the stripe count of the file in a step-wise manner as the file offset increases. This will be achieved by using multiple extent-based composite layouts as described in the Layout Enhancement High Level Design.

Composite layouts allow specifying different specific layouts for different ranges (extents) of the file. Currently, the only layout supported by Lustre is RAID-0 striping across one or more OST objects. The size of a composite layout will be constrained by existing total size limits for file layouts (approximately 4KB or 48KB, depending on the capabilities of the MDT backing storage) and the size of the RPC request and reply messages, but there will otherwise be no limits added on the number of layout extents by the PFL code itself.

Progressive layout files will typically have a small fixed-length layout extent covering the start of the file (e.g. tens of megabytes in size). This initial layout extent will typically have only a single OST object to minimize file creation and access overhead. Additional layout extents, each with their own specific layout, follow at the end of the previous layout extent using new OST objects and covering a progressively larger extent of the file (e.g. a few gigabytes). This typically repeats, until the file is striped across all available OSTs at which point the layout extent will cover the rest of the file, regardless of its size.

The actual file offsets at which the layout extent changes, the stripe count for each layout extent, and the stripe size will be tunable by the user/administrator. While there is no requirement that the stripe size for each layout extent be the same, the start and end of each layout extent shall be an integer multiple of its stripe size. Overlapping layout extents (file-level data replication) are out of scope of the PFL project.
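The constraints above can be sketched as a small validation routine. This is only an illustrative model, not Lustre code: the names LayoutExtent, validate_layout, and the EOF marker are hypothetical, and the sketch assumes extents are contiguous from offset 0, as in the typical progression described above.

```python
from dataclasses import dataclass
from typing import List, Optional

MIB = 1 << 20
GIB = 1 << 30
EOF = None  # hypothetical marker: the extent runs to the end of the file


@dataclass(frozen=True)
class LayoutExtent:
    start: int            # first byte covered (inclusive)
    end: Optional[int]    # first byte not covered, or EOF
    stripe_count: int
    stripe_size: int


def validate_layout(extents: List[LayoutExtent]) -> None:
    """Raise ValueError if the extents break the constraints described above."""
    if not extents:
        raise ValueError("a layout needs at least one extent")
    expected_start = 0
    for i, ext in enumerate(extents):
        if ext.stripe_count < 1 or ext.stripe_size < 1:
            raise ValueError(f"extent {i}: stripe count and size must be positive")
        # Assumption: extents are contiguous, so any gap or overlap is rejected.
        if ext.start != expected_start:
            raise ValueError(f"extent {i}: gap or overlap at offset {ext.start}")
        if ext.start % ext.stripe_size:
            raise ValueError(f"extent {i}: start is not a multiple of stripe size")
        if ext.end is EOF:
            if i != len(extents) - 1:
                raise ValueError(f"extent {i}: only the last extent may run to EOF")
        else:
            if ext.end <= ext.start:
                raise ValueError(f"extent {i}: end must be greater than start")
            if ext.end % ext.stripe_size:
                raise ValueError(f"extent {i}: end is not a multiple of stripe size")
        expected_start = ext.end


# Example: one object for the first 32MiB, four up to 4GiB, sixteen beyond that.
example = [
    LayoutExtent(0, 32 * MIB, stripe_count=1, stripe_size=MIB),
    LayoutExtent(32 * MIB, 4 * GIB, stripe_count=4, stripe_size=MIB),
    LayoutExtent(4 * GIB, EOF, stripe_count=16, stripe_size=4 * MIB),
]
validate_layout(example)
```

Note that the stripe size differs between extents in the example, which is permitted as long as each extent's boundaries are multiples of its own stripe size.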

Client LOV support for composite layouts
The client Logical Object Volume (LOV) is responsible for mapping logical file offsets to offsets in the specific OST object that holds that region of the file, using a unique per-file layout. With progressive file layouts, there will be multiple contiguous but disjoint regions (layout extents) of each file that use distinct RAID-0 layouts for each region, and each layout extent will use different OST objects for storage.

Mapping a logical file offset to a specific OST object is done by first finding the enclosing layout extent start and end offsets. In the PFL project, layout extents must not overlap, and will be stored in increasing file offset order. Once a layout extent is selected, the file offset is mapped to the OST object offset using the specific RAID-0 layout stored in that extent. The mapping within each extent is computed as if the specific layout spanned the whole file. This avoids the need for data migration in possible future layouts that will support mutable extent boundaries.
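The two-step mapping described above might look roughly like the following sketch. The Extent type and the function names are hypothetical, and the RAID-0 arithmetic is the standard round-robin striping formula applied to the absolute file offset, which is how "as if the specific layout spanned the whole file" is interpreted here.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional, Tuple


class Extent(NamedTuple):
    start: int            # first byte covered (inclusive)
    end: Optional[int]    # first byte not covered; None means "to EOF"
    stripe_count: int
    stripe_size: int


def find_extent(extents: List[Extent], offset: int) -> int:
    """Return the index of the extent enclosing offset (extents sorted by start)."""
    starts = [e.start for e in extents]
    idx = bisect_right(starts, offset) - 1
    if idx < 0:
        raise ValueError("offset precedes the first layout extent")
    ext = extents[idx]
    if ext.end is not None and offset >= ext.end:
        raise ValueError("offset is not covered by any layout extent")
    return idx


def map_offset(extents: List[Extent], offset: int) -> Tuple[int, int, int]:
    """Map a file offset to (extent index, stripe index, object offset)."""
    idx = find_extent(extents, offset)
    ext = extents[idx]
    # RAID-0 mapping uses the absolute file offset, not the offset
    # relative to the start of the extent.
    stripe_number = offset // ext.stripe_size
    stripe_index = stripe_number % ext.stripe_count
    object_offset = (stripe_number // ext.stripe_count) * ext.stripe_size \
        + offset % ext.stripe_size
    return idx, stripe_index, object_offset
```

One consequence of mapping with the absolute offset is that objects in later extents are sparse at their start: in a layout whose second extent begins at 4MiB with four 1MiB stripes, file offset 5MiB lands at object offset 1MiB of the second stripe, not at object offset 0.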

As with current plain RAID-0 layouts, the file size, block count, and timestamps will be determined by aggregating the attributes from all OST objects that are part of the layout.
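As a rough illustration of that aggregation, the sketch below derives file attributes from per-object attributes. It is a simplified model with hypothetical names (ObjectAttrs, Component, aggregate) and makes three assumptions: the file size is the largest logical end offset implied by any object's size, the block count is a plain sum, and the modification time is the latest among the objects.

```python
from typing import List, NamedTuple, Tuple


class ObjectAttrs(NamedTuple):
    size: int     # bytes in the OST object (including leading holes)
    blocks: int   # blocks allocated to the object
    mtime: int    # modification time, seconds since the epoch


class Component(NamedTuple):
    stripe_count: int
    stripe_size: int
    objects: List[ObjectAttrs]   # one entry per stripe, in stripe order


def logical_end(obj_size: int, stripe_index: int,
                stripe_count: int, stripe_size: int) -> int:
    """File offset just past the last byte stored in this object."""
    if obj_size == 0:
        return 0
    last = obj_size - 1
    # Inverse of the RAID-0 mapping on absolute file offsets.
    return ((last // stripe_size) * stripe_size * stripe_count
            + stripe_index * stripe_size
            + last % stripe_size + 1)


def aggregate(components: List[Component]) -> Tuple[int, int, int]:
    """Return (file size, block count, mtime) across all components."""
    size = blocks = mtime = 0
    for comp in components:
        for idx, obj in enumerate(comp.objects):
            size = max(size, logical_end(obj.size, idx,
                                         comp.stripe_count, comp.stripe_size))
            blocks += obj.blocks
            mtime = max(mtime, obj.mtime)
    return size, blocks, mtime
```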

Phase 2: Static Layout Implementation
The Static PFL Implementation will provide a functional implementation that allows specifying the full layout using standard user tools and addresses any shortcuts and/or defects in the Prototype implementation. The following functionality was implemented:
 * PFL2 Scope Statement describes the overall goals and intended outcomes of the production implementation
 * PFL2 Solution Architecture describes how the goals of the PFL project may be implemented, and how to measure the completion and outcomes
 * PFL2 High Level Design describes the implementation details for the PFL feature

Implement improved layout handling APIs
Currently, Lustre clients and servers understand only a single type of layout, RAID-0 striping across one or more OST objects, with a few small variations of a basic layout structure (lov_mds_md_v3). In order to add the progressive layout handling and other future layout types such as RAID-10 in a sustainable manner, the internal code needs to be restructured in order to isolate the parts of the code that handle the file layouts to a single library. Other parts of the Lustre code should not access the internal details of the file layout, and instead use library accessor functions to query required parameters such as the number of OST objects over which a file is striped, the size of the layout structure, and other common parameters. This will improve maintainability and reduce complexity and potential defects in the Lustre code as new layout types are added.

Implement RPCs for modifying composite layouts (need Layout APIs)
OST objects for each layout extent of a PFL file are allocated by the MDS on demand as the file grows. Writing past the end of the last allocated extent sends an RPC to the MDS to grow the file. The MDS then locks the layout, allocates OST objects for the new layout extent(s), and updates the layout. The MDS layout locking invalidates copies of the old layout cached on Lustre clients and forces them to refresh their copy of the layout, including the new layout extent(s), before accessing the file again. Since the previously allocated OST objects will not have changed, no data movement or data cache invalidation is required. The MDS locks the layout exclusively during modification to avoid races between multiple clients trying to modify the layout concurrently.

User interfaces for specifying composite layouts
While PFL will itself provide simplified usability of files in Lustre, there will still be a need for administrators and users to directly specify progressive layouts for the filesystem. The filesystem global default layout (set on the filesystem root directory) will determine the layout for all new files created in the filesystem. The size thresholds between stages in the progressive layout are best tuned to the application environment to maximize performance. Some users and applications may also optimize performance by specifying a default progressive layout on a parent directory that is inherited by all new files and directories created therein.

In order to specify the progressive layouts from userspace, the llapi_layout_* API will need to be enhanced to understand the new layout type. This new API will be used internally by the lfs setstripe command, and optionally by other applications modified to use this interface.

The layouts that can be used for the individual components are expected to be the same as those available in Lustre today. This includes the LOV_MDS_MD_V1 and LOV_MDS_MD_V3 layouts that are RAID-0 striping across OSTs with the ability to specify different stripe_count, stripe_size, ost_index, and ost_pool. The design and implementation of PFL composite layouts is intended to work with other layout types in the future, but actual operation with future layout types is beyond the scope of this project.

Server LOD composite layout support
On the MDS, the Logical Object Device (LOD) manages the operational aspects of files that have components on multiple MDTs. The LOD component will primarily be concerned with the creation of new files using progressive layouts. In some cases, it will need to decode the layout and interact with objects one at a time for operations such as unlink, setattr, and LFSCK. The LOD code will also handle layout modification RPCs arriving from the clients.

LFSCK support for composite layouts
The Lustre File System Checker (LFSCK) verifies the structure of a Lustre filesystem, ensuring that the file layout on the MDT matches the objects located on the OST(s), and reconstructing the filesystem structure if it should become inconsistent or corrupted. To do this, LFSCK needs to understand the file's layout stored on the MDT object inode. Each OST object also needs to store information about its part of the file layout so that the layout can be rebuilt if needed. With the addition of composite file layouts, LFSCK needs to be enhanced to support the new layout type, and the OST on-disk format needs to be extended so that OST objects can be identified as part of the correct component of the layout.

Default layout inheritance
In order to realize the full benefits of PFL, the progressive layout extents should not create OST objects until the size of the file grows sufficiently to need those objects. However, it is also necessary to be able to specify the layout template for the whole file at file creation time, so that the user or administrator can get the performance profile desired as the file is written.

It should be possible to specify a default layout template on a directory that is inherited by new files and subdirectories created within that directory. If no default layout template is specified on the parent directory, it should also be possible to inherit the filesystem-wide default layout template when a file is created.

Composite file templates
In order to realize the full benefits of PFL, the progressive layout extents should not allocate OST objects until the size of the file grows sufficiently to need those objects. However, it is also necessary to be able to specify the layout template for the whole file at file creation time without allocating OST objects for all components, so that the user or administrator can get the performance profile desired as file size grows during writes.

It should be possible to specify a default layout template on a directory that is inherited by new files and subdirectories created within that directory. If no default layout template is specified on the parent directory, it should inherit the filesystem-wide default layout template when a file is created.

Dynamic layout instantiation based on file offset
In order to simplify implementation, this project will focus on implementing composite layouts that are grown by allocating objects in non-overlapping layout extents at the end of the file, and will not implement modification of already allocated layout extents containing data.

The client IO (CLIO) layer needs to be able to manage the growth of the file layout by reconfiguring its IO stack to add new OST objects into the layout. The client will request that the MDS instantiate OST objects based on the layout template before it begins writing to a file offset beyond the currently instantiated layout components. The layout generation stored in the composite layout and in each layout extent will allow CLIO to detect whether a specific layout extent has been modified when the lock is revoked. Since the existing components of the file layout will not be modified for PFL files, any in-flight IO operations and cached data do not need to be interrupted.
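The instantiate-before-write flow can be modelled with a toy sketch. Nothing here reflects the real RPC or CLIO interfaces: ToyMDS, prepare_write, and the sequential OST numbering are hypothetical, and the sketch assumes that every template component up to and including the one covering the write offset is instantiated in a single step.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    start: int
    end: Optional[int]        # None means "to end of file"
    stripe_count: int
    # OST indices of allocated objects; empty while the component is
    # only a template.
    objects: List[int] = field(default_factory=list)

    def covers(self, offset: int) -> bool:
        return self.start <= offset and (self.end is None or offset < self.end)


@dataclass
class CompositeLayout:
    components: List[Component]
    generation: int = 0


class ToyMDS:
    """Stand-in for the MDS: owns the layout and hands out OST objects."""

    def __init__(self, layout: CompositeLayout) -> None:
        self.layout = layout
        self._next_ost = 0

    def instantiate_through(self, offset: int) -> int:
        """Allocate objects for template components starting at or before offset."""
        changed = False
        for comp in self.layout.components:
            if comp.start > offset:
                break
            if not comp.objects:
                comp.objects = list(range(self._next_ost,
                                          self._next_ost + comp.stripe_count))
                self._next_ost += comp.stripe_count
                changed = True
        if changed:
            # A new generation tells clients their cached layout is stale.
            self.layout.generation += 1
        return self.layout.generation


def prepare_write(mds: ToyMDS, offset: int) -> int:
    """Client side: make sure the target component exists, return the generation."""
    target = next(c for c in mds.layout.components if c.covers(offset))
    if not target.objects:
        return mds.instantiate_through(offset)
    return mds.layout.generation
```

In this model, components that were already instantiated keep their objects when the layout grows, which mirrors the point above that in-flight IO and cached data for existing components need not be interrupted.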

Improved MDS object allocator
The current MDS object allocator is designed only to allocate objects for one file at the time the file is first created. For progressive file layouts, at a minimum the allocator will need to be enhanced in order to avoid allocating objects on OSTs that are already part of a file's other components. If files have multiple objects allocated to the same OSTs before objects are allocated from unused OSTs, there may be a significant performance loss due to oversubscribing the bandwidth on that OST compared to the other OSTs. The only exception may be for a fully-striped component at the end of the file (see for more detail), where it would be acceptable to allocate objects across all of the available OSTs to maximize the bandwidth available for the file.
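A minimal sketch of the selection rule described above follows. The function name and the allow_reuse flag are hypothetical, and the sketch ignores everything a real allocator would also weigh, such as free space, OST pools, and load balancing.

```python
from typing import Iterable, List, Set


def allocate_component(all_osts: Iterable[int], used_osts: Set[int],
                       stripe_count: int, allow_reuse: bool = False) -> List[int]:
    """Pick OSTs for a new component, preferring OSTs the file does not use yet.

    allow_reuse models the exception for a fully-striped final component,
    where OSTs already used by earlier components may be selected again.
    """
    osts = list(all_osts)
    unused = [o for o in osts if o not in used_osts]
    if stripe_count <= len(unused):
        return unused[:stripe_count]
    if not allow_reuse:
        raise ValueError("not enough unused OSTs for this component")
    # Unused OSTs first, then fall back to OSTs from earlier components.
    reused = [o for o in osts if o in used_osts]
    return (unused + reused)[:stripe_count]
```

For example, with eight OSTs of which 0, 1, and 2 already hold objects for the file, a four-stripe component would be placed on OSTs 3 to 6, while an eight-stripe final component is only possible with allow_reuse set.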

Example Progressive Layouts
In order to balance space and bandwidth usage against stripe count, one option would be to keep a linear balance between the total number of stripes and the file size. If a stripe is added for each fixed unit of size, at power-of-two size intervals, the aggregate performance would be the same as if the file was striped over a corresponding number of objects from the start.

For example, consider adding a stripe for every 128MB of space on a system with 280 OSTs.

This results in a total of 280 objects for the first 35GB (= 128MB * 280) of the file, and each object holds a total of 128MB of file data. If this file were accessed in parallel across the first 35GB, the aggregate bandwidth and space usage for each object would be identical to a file that was striped across 280 objects for the entire 35GB, though it would be sub-optimal for parallel access to smaller ranges of the file. File sizes beyond 35GB would be identical to fully-striped files, at the expense of having twice the overhead for stat and locking operations.

The progressive layout should stop growing at the point where the total number of stripes would equal or exceed the number of OSTs. At that point, it would be advantageous to add a final layout extent to EOF that stripes across all available OSTs in order to maximize bandwidth at the end of the file, in case it continues to grow significantly larger in size. This would result in a layout that was somewhat more than twice as large as a file that was striped across all OSTs right from the start.

Alternatively, the last layout extent could be grown to cover only the remaining OSTs if it was clear there weren’t going to be enough unused OSTs remaining for the next stage.

With 216 of the 280 OSTs left for that last extent, the bandwidth at the end of the file is only 216 / 280 = 77% of the aggregate, but there is less overhead for accessing the large file because there are fewer objects to create, lock, and destroy.
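The arithmetic behind these examples can be checked with a short sketch. The doubling schedule below is an assumption chosen for illustration (each extent doubles the covered range and adds one stripe per 128MB unit); it is one schedule consistent with the 35GB figure, not necessarily the exact progression the example had in mind, and the function names are hypothetical.

```python
from typing import List, Tuple

MIB = 1 << 20
GIB = 1 << 30


def doubling_schedule(unit: int, ost_count: int) -> Tuple[List[Tuple[int, int, int]], int]:
    """Return ([(start, end, stripe_count), ...], total stripes used).

    Each extent has one stripe per `unit` bytes it covers, so every object
    ends up holding `unit` bytes. Growth stops once the next stage would
    need more OSTs than remain unused.
    """
    extents = [(0, unit, 1)]
    used = 1
    while True:
        start = extents[-1][1]
        count = start // unit          # next extent covers [start, 2 * start)
        if used + count > ost_count:
            break
        extents.append((start, 2 * start, count))
        used += count
    return extents, used


def tail_bandwidth_fraction(tail_stripes: int, ost_count: int) -> float:
    """Share of aggregate bandwidth available to the final extent."""
    return tail_stripes / ost_count


extents, used = doubling_schedule(128 * MIB, 280)
for start, end, count in extents:
    print(f"{start // MIB:>6} MiB .. {end // MIB:>6} MiB  stripes={count}")
print("stripes used:", used, "unused OSTs:", 280 - used)
print("tail fraction:", round(tail_bandwidth_fraction(216, 280), 2))
```

Under this assumed schedule the doubling stages end at 32GB with 256 objects; one stripe per 128MB on each of the 24 remaining OSTs accounts for a further 3GB, which matches the 280 objects and 35GB quoted above.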

These are purely examples and in no way show a constraint on how PFL files are used. The stripe count does not need to increase from one component to the next, nor do the stripe counts or stripe boundaries need to be power-of-two values.

Maximum File Size
For ldiskfs the direct mapping of file offset to extent offset imposes a maximum file size of (stripe_count * 16TB) for stripe_count of the last layout extent in the progressive layout. This is not significantly different from today, except for a small reduction in the stripe_count of the last extent due to OST objects that are allocated in earlier extents. Typically this will not be a limitation due to space constraints on the OSTs, and can be tuned by selecting the layout progression appropriately. This is not a limitation for ZFS-based OSTs.
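As a quick worked check of that limit, the sketch below applies the (stripe_count * 16TB) formula to two stripe counts taken from the examples above. The constant and function names are hypothetical, and 16TB is treated as 16 * 2^40 bytes.

```python
TIB = 1 << 40
LDISKFS_MAX_OBJECT_SIZE = 16 * TIB   # per-object limit quoted in the text


def max_file_size(last_extent_stripe_count: int) -> int:
    """Maximum file size in bytes for a given stripe count of the last extent."""
    return last_extent_stripe_count * LDISKFS_MAX_OBJECT_SIZE


# A final extent striped over all 280 OSTs versus one over 216 OSTs.
for count in (280, 216):
    print(count, "stripes:", max_file_size(count) // TIB, "TiB")
```

The reduction from using fewer OSTs in the last extent is proportional to the stripe count, so the limit remains in the petabyte range for either example.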

Client Compatibility
Clients that are not patched with the new progressive layout code will not be able to access files that use progressive layouts. This incompatibility would only affect files using progressive layouts, and not other files that may already exist in the filesystem, or new files created without using the progressive layout format.

OST Oversubscription
To avoid oversubscribing OST bandwidth, OSTs used at the beginning of the file should not normally be re-used for objects allocated later in the file. The space usage of each OST, and by extension its required bandwidth, can be balanced by selecting the layout progression appropriately.

In some cases (e.g. ENOSPC) it might be necessary to allow multiple objects to be allocated from the same OSTs. In such cases, it would be desirable to add a new layout extent that allocates stripes across a subset of OSTs with available space.

Layout Locking
It is expected that the existing layout lock implementation is sufficient for progressive layouts, and extent-based locking of the layout itself is unnecessary (there will of course still be extent-based locking of the file data itself). This implies that there is a single lock bit that manages the entire layout content, and revokes the whole layout from clients if it needs to be modified in any way. It is expected that the layout for any individual file will only be changed at most a handful of times in its lifetime, so revoking the layout lock a few times is no worse than revoking the object extent locks as happens many times during the lifetime of a file being written concurrently by multiple clients.

Since progressive layouts only change by adding new layout extents at the end of the file, there is no need to invalidate the (meta-)data that is cached under the OST object locks. Clients in the process of writing to a file when the layout lock is revoked may complete the write without any danger. Clients starting new file writes must block until they have the layout lock, since the OST extent locks will not accurately reflect the range of the file that might be modified under a particular lock.