Progressive File Layouts: Difference between revisions

From Lustre Wiki
* [[PFL2 Solution Architecture]] describes how the goals of the PFL project may be implemented, and how to measure the completion and outcomes
* [[PFL2 High Level Design]] describes the implementation details for the PFL feature
Revision as of 16:16, 11 January 2017

The Lustre Progressive File Layout (PFL) feature is intended to simplify the use of Lustre so that users can expect reasonable performance for a variety of normal file IO patterns without needing to understand their IO model or Lustre usage details in advance. In particular, users do not need to know the size or concurrency of output files before creating them, nor explicitly specify an optimal layout for each file, in order to achieve good performance for both highly concurrent shared-single-large-file IO and parallel IO to many smaller per-process files.

The PFL feature is implemented in several phases, each providing incremental functionality. This includes the base functionality of composite layouts, which can also be used by several other features that affect the file layout.

Phase 1: Prototype Implementation

Phase 2: Static Layout Implementation

The Static PFL Implementation will provide a functional implementation that allows specifying the full layout using standard user tools and addresses any shortcuts and/or defects in the Prototype implementation. The following functionality was implemented:

Implement improved layout handling APIs

Currently, Lustre clients and servers understand only a single type of layout, RAID-0 striping across one or more OST objects, with a few small variations of a basic layout structure (lov_mds_md_v3). To add progressive layout handling and other future layout types such as RAID-10 in a sustainable manner, the internal code needs to be restructured so that the parts that handle file layouts are isolated in a single library. Other parts of the Lustre code should not access the internal details of the file layout, and should instead use library accessor functions to query required parameters such as the number of OST objects over which a file is striped, the size of the layout structure, and other common parameters. This will improve maintainability and reduce complexity and potential defects in the Lustre code as new layout types are added.
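The accessor idea can be illustrated with a small model. This is only a sketch in Python, not Lustre code: the class names Raid0Layout and CompositeLayout and the function layout_stripe_count are hypothetical, and counting the objects of every component for a composite layout is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Raid0Layout:
    """Hypothetical stand-in for a plain RAID-0 layout (not a Lustre structure)."""
    stripe_size: int          # bytes per stripe
    ost_objects: List[int]    # hypothetical OST object identifiers


@dataclass
class CompositeLayout:
    """Hypothetical composite layout built from several RAID-0 components."""
    components: List[Raid0Layout]


def layout_stripe_count(layout) -> int:
    """Accessor: number of OST objects the file is striped over.

    Callers use this function instead of reading layout fields directly,
    so adding a new layout type only requires a change here.
    """
    if isinstance(layout, Raid0Layout):
        return len(layout.ost_objects)
    if isinstance(layout, CompositeLayout):
        # Assumption of this sketch: count the objects of every component.
        return sum(len(c.ost_objects) for c in layout.components)
    raise TypeError(f"unknown layout type: {type(layout).__name__}")
```

Code outside the layout library would call layout_stripe_count(layout) and never inspect ost_objects or components itself.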

Implement RPCs for modifying composite layouts (need Layout APIs)

OST objects for each layout extent of a PFL file are allocated by the MDS on demand as the file grows. A client that writes past the end of the last allocated extent sends an RPC to the MDS to grow the file. The MDS then locks the layout, allocates OST objects for the new layout extent(s), and updates the layout. The MDS layout locking invalidates copies of the old layout cached on Lustre clients and forces them to refresh their copy of the layout, including the new layout extent(s), before accessing the file again. Since the previously allocated OST objects will not have changed, no data movement or data cache invalidation is required. The MDS locks the layout exclusively during modification to avoid races between multiple clients trying to modify the layout concurrently.
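The growth sequence described above can be modelled roughly as follows. This is an illustrative Python sketch, not the MDS implementation: the names Extent, MdsFile and grow_to are hypothetical, a threading.Lock stands in for the exclusive layout lock, and a simple counter stands in for the layout version that clients compare against their cached copy.

```python
import threading
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Extent:
    start: int                            # first byte covered by this extent
    end: int                              # one past the last byte covered
    stripe_count: int
    objects: Optional[List[str]] = None   # None until objects are allocated


@dataclass
class MdsFile:
    extents: List[Extent]
    generation: int = 0                   # bumped on every layout change
    lock: threading.Lock = field(default_factory=threading.Lock)
    next_object: int = 0                  # hypothetical object id counter

    def grow_to(self, offset: int) -> int:
        """Allocate objects for every extent needed to reach `offset`."""
        with self.lock:                   # exclusive: one layout change at a time
            changed = False
            for ext in self.extents:
                if ext.start <= offset and ext.objects is None:
                    ext.objects = [f"obj{self.next_object + i}"
                                   for i in range(ext.stripe_count)]
                    self.next_object += ext.stripe_count
                    changed = True
            if changed:
                # Clients holding an older generation must refresh their copy.
                self.generation += 1
            return self.generation
```

Extents that already have objects are never touched, which mirrors the point that no data movement or cache invalidation is needed for previously written data.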

User interfaces for specifying composite layouts

While PFL will itself provide simplified usability of files in Lustre, there will still be a need for administrators and users to directly specify progressive layouts for the filesystem. The filesystem global default layout (set on the filesystem root directory) will determine the layout for all new files created in the filesystem. The size thresholds between stages in the progressive layout are best tuned to the application environment to maximize performance. Some users and applications may also optimize performance by specifying a default progressive layout on a parent directory that is inherited by all new files and directories created therein.

In order to specify the progressive layouts from userspace, the llapi_layout_* API will need to be enhanced to understand the new layout type. This new API will be used internally by the lfs setstripe command, and optionally by other applications modified to use this interface.

The layouts that can be used for the individual components are expected to be the same as those available in Lustre today. This includes the LOV_MDS_MD_V1 and LOV_MDS_MD_V3 layouts that are RAID-0 striping across OSTs with the ability to specify different stripe_count, stripe_size, ost_index, and ost_pool. The design and implementation of PFL composite layouts is intended to work with other layout types in the future, but actual operation with future layout types is beyond the scope of this project.
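As an illustration of what a composite layout specification carries, the sketch below models each component with the per-component parameters named above (stripe_count, stripe_size, ost_index, ost_pool) plus an extent. It is a Python model only: the Component class, the EOF marker, the use of -1 for "no specific OST", and the validate_components helper are assumptions of this sketch, not part of the llapi_layout_* API.

```python
from dataclasses import dataclass
from typing import List, Optional

EOF = -1  # marker used in this sketch for "extends to the end of the file"


@dataclass
class Component:
    extent_start: int               # first byte covered by the component
    extent_end: int                 # one past the last byte, or EOF
    stripe_count: int
    stripe_size: int
    ost_index: int = -1             # -1: no specific starting OST (sketch convention)
    ost_pool: Optional[str] = None


def validate_components(components: List[Component]) -> None:
    """Raise ValueError unless the extents are contiguous and non-overlapping."""
    if not components:
        raise ValueError("a composite layout needs at least one component")
    expected_start = 0
    for i, comp in enumerate(components):
        is_last = i == len(components) - 1
        if comp.extent_start != expected_start:
            raise ValueError(f"component {i} does not start at offset {expected_start}")
        if comp.extent_end == EOF:
            if not is_last:
                raise ValueError(f"component {i} runs to EOF but is not the last one")
        elif comp.extent_end <= comp.extent_start:
            raise ValueError(f"component {i} has an empty or negative extent")
        expected_start = comp.extent_end
```

A three-stage progressive layout would then be a list such as one component for the first 4 MiB with one stripe, a second up to 64 MiB with four stripes, and a final component running to EOF with a larger stripe count.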

Server LOD composite layout support

On the MDS, the Logical Object Device (LOD) manages the operational aspects of files that have components on multiple MDTs. The LOD component will primarily be concerned with the creation of new files using progressive layouts. In some cases, it will need to decode the layout and interact with objects one at a time for operations such as unlink, setattr, and LFSCK. The LOD code will also handle layout modification RPCs arriving from the clients.

Phase 3a: PFL Usability Improvements

Server LOD support for composite layouts

On the MDS, the Logical Object Device (LOD) manages the operational aspects of files that have components on multiple MDTs. The LOD component will primarily be concerned with the creation of new files using progressive layouts. In some cases, it will need to decode the layout and interact with objects one at a time for operations such as unlink, setattr, and LFSCK. The LOD code will also handle layout modification RPCs arriving from the clients, both when the file is idle and while it is in use by multiple clients.

LFSCK support for composite layouts

The Lustre File System Checker (LFSCK) verifies the structure of a Lustre filesystem, ensuring that the file layout on the MDT matches the objects located on the OST(s), and reconstructing the filesystem structure if it should become inconsistent or corrupted. To do this, LFSCK needs to understand the file's layout stored on the MDT object inode. Each OST object also needs to store information about its part of the file layout so that the layout can be rebuilt if needed. With the addition of composite file layouts, LFSCK needs to be enhanced to support the new layout type, and the OST on-disk format needs to be extended so that OST objects can be identified as part of the correct component of the layout.
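The kind of cross-check this implies can be sketched as follows. This is a simplified Python model, not LFSCK itself: the dictionaries, the (file id, component id) key, and the problem labels are all hypothetical, chosen only to show why each OST object needs to record which component it belongs to.

```python
from typing import Dict, List, Tuple

FileComp = Tuple[str, int]   # (file id, component id), hypothetical identifiers


def find_inconsistencies(mdt_layout: Dict[FileComp, List[str]],
                         ost_backrefs: Dict[str, FileComp]) -> List[Tuple[str, str]]:
    """Compare the layout recorded on the MDT with what the OST objects record.

    mdt_layout maps each (file, component) to the objects it references.
    ost_backrefs maps each OST object to the (file, component) it claims.
    """
    problems = []
    for owner, objects in mdt_layout.items():
        for obj in objects:
            if obj not in ost_backrefs:
                problems.append(("missing_object", obj))
            elif ost_backrefs[obj] != owner:
                problems.append(("wrong_owner", obj))
    referenced = {obj for objects in mdt_layout.values() for obj in objects}
    for obj in ost_backrefs:
        if obj not in referenced:
            # Candidate for rebuilding a lost layout from the OST side.
            problems.append(("orphan_object", obj))
    return problems
```

Without the component id in the back-reference, an object could be matched to the right file but the wrong component, which is the case the extended on-disk format is meant to rule out.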

Default layout inheritance

In order to realize the full benefits of PFL, the progressive layout extents should not create OST objects until the size of the file grows sufficiently to need those objects. However, it is also necessary to be able to specify the layout template for the whole file at file creation time, so that the user or administrator can get the performance profile desired as the file is written.

It should be possible to specify a default layout template on a directory that is inherited by new files and subdirectories created within that directory. If no default layout template is specified on the parent directory, it should also be possible to inherit the filesystem-wide default layout template when a file is created.
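The inheritance rule can be written down as a small model. This is a Python sketch under stated assumptions: the Namespace class and its method names are hypothetical, templates are represented by opaque values, and a new subdirectory is assumed to copy its parent's default at creation time.

```python
import posixpath
from typing import Dict, Optional


class Namespace:
    """Toy directory tree that tracks default layout templates."""

    def __init__(self, fs_default: str):
        self.fs_default = fs_default                        # filesystem-wide default
        self.dir_default: Dict[str, Optional[str]] = {"/": None}

    def set_default(self, directory: str, template: str) -> None:
        self.dir_default[directory] = template

    def mkdir(self, path: str) -> None:
        # A new subdirectory inherits the parent's default template, if any.
        parent = posixpath.dirname(path)
        self.dir_default[path] = self.dir_default[parent]

    def template_for_new_file(self, parent: str) -> str:
        # Use the parent's default if set, else the filesystem-wide default.
        template = self.dir_default[parent]
        return template if template is not None else self.fs_default
```

For example, after set_default("/a", "pfl") and mkdir("/a/b"), a file created in /a/b gets the "pfl" template, while a file in an unrelated directory falls back to the filesystem default.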

Phase 3b: Dynamic Layout Implementation

Composite file templates

In order to realize the full benefits of PFL, the progressive layout extents should not allocate OST objects until the size of the file grows sufficiently to need those objects. However, it is also necessary to be able to specify the layout template for the whole file at file creation time without allocating OST objects for all components, so that the user or administrator can get the performance profile desired as file size grows during writes.

It should be possible to specify a default layout template on a directory that is inherited by new files and subdirectories created within that directory. If no default layout template is specified on the parent directory, a newly created file should inherit the filesystem-wide default layout template.

Dynamic layout instantiation based on file offset

In order to simplify implementation, this project will focus on implementing composite layouts that are grown by allocating objects in non-overlapping layout extents at the end of the file, and will not implement modification of already allocated layout extents containing data.

The client IO (CLIO) layer needs to be able to manage the growth of the file layout by reconfiguring its IO stack to add new OST objects into the layout. The client will request that the MDS instantiate OST objects based on the layout template before it begins writing to a file offset beyond the currently instantiated layout components. The layout generation stored in the composite layout and in each layout extent will allow CLIO to detect whether a specific layout extent has been modified when the lock is revoked. Since the existing components of the file layout will not be modified for PFL files, any in-flight IO operations and cached data do not need to be interrupted.
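The two client-side decisions in this paragraph, whether a write needs new components instantiated and which extents changed after a layout refresh, can be sketched like this. It is a Python model only: ExtentState, needs_instantiation and changed_extents are hypothetical names, and the per-extent generation is a plain integer.

```python
from dataclasses import dataclass
from typing import List

EOF = 2**63 - 1   # stand-in for "end of file" in this sketch


@dataclass(frozen=True)
class ExtentState:
    start: int            # first byte covered
    end: int              # one past the last byte covered
    instantiated: bool    # True once OST objects exist for this extent
    generation: int       # layout generation recorded for this extent


def needs_instantiation(layout: List[ExtentState], offset: int, length: int) -> bool:
    """True if the write touches any extent that has no OST objects yet."""
    write_end = offset + length
    return any(not e.instantiated and e.start < write_end and offset < e.end
               for e in layout)


def changed_extents(cached: List[ExtentState], fresh: List[ExtentState]) -> List[int]:
    """Start offsets of extents that are new or whose generation differs."""
    old = {e.start: e.generation for e in cached}
    return [e.start for e in fresh if old.get(e.start) != e.generation]
```

Extents that do not appear in the result of changed_extents keep their generation, so IO already in flight against them can continue undisturbed.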

Improved MDS object allocator

The current MDS object allocator is designed only to allocate objects for one file at the time the file is first created. For progressive file layouts, at a minimum the allocator will need to be enhanced in order to avoid allocating objects on OSTs that are already part of a file's other components. If files have multiple objects allocated to the same OSTs before objects are allocated from unused OSTs, there may be a significant performance loss due to oversubscribing the bandwidth on that OST compared to the other OSTs. The only exception may be for a fully-striped component at the end of the file (see #Example Progressive Layouts for more detail), where it would be acceptable to allocate objects across all of the available OSTs to maximize the bandwidth available for the file.
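A minimal version of the selection rule described above might look like the sketch below. It is an illustration in Python, not the MDS allocator: pick_osts is a hypothetical function, OSTs are plain integers, and real allocation would also weigh free space and load, which this sketch ignores.

```python
from typing import List, Set


def pick_osts(all_osts: List[int], used_osts: Set[int], stripe_count: int) -> List[int]:
    """Choose OSTs for a new component, preferring OSTs the file does not use yet."""
    if stripe_count >= len(all_osts):
        # Fully striped component: spreading over every OST is acceptable.
        return list(all_osts)
    unused = [ost for ost in all_osts if ost not in used_osts]
    reused = [ost for ost in all_osts if ost in used_osts]
    # Fall back to already-used OSTs only when the unused ones run out.
    return (unused + reused)[:stripe_count]
```

With six OSTs and a first component on OST 0, a four-stripe second component would land on OSTs 1 to 4, leaving OST 0 with a single object for this file.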