PFL Prototype Scope Statement

From Lustre Wiki

Problem Statement

Note: The PFL Phase 2 Scope Statement takes precedence over this document

Today, files in a Lustre filesystem have a static layout that is determined at the time each file is first opened. Each file's layout may have one or many stripes (OST objects) on which it stores data, but the file's layout cannot be modified after it is created, which forces the user to choose (or live with) the layout used at that time. If files are small, and/or many clients/threads are each writing to their own file (file-per-process, 1:1, N:N), then having a single stripe per file is optimal to reduce contention and metadata overhead. If files are very large and/or accessed by many clients concurrently (shared-single-file, N:1, N:M), and/or have bandwidth requirements that exceed what is available from a single OST, it is desirable to have a large number of stripes for the file to maximize aggregate bandwidth and balance space usage. While there are some mechanisms available to administrators, users, and applications to specify the layout per-filesystem, per-directory, or per-file for newly created files, this requires advance knowledge of how each file will be used (total size, number of clients writing to the file, required bandwidth, etc.) that may not be available. It is desirable that administrators, users, and applications be isolated from the need to specify the striping for each file in order to optimize performance for their various uses.

Project Requirements and Goals

This project requires a prototype implementation that demonstrates how Lustre could optimize usage of both small and large files, avoiding the need for users to predetermine the required layout for each file. The goal is that a single file layout specified at file creation time can balance several conflicting needs:

  • minimize the number of stripes for small files to avoid the overhead of creating and accessing multiple OST objects for such files
  • maximize the number of stripes for large files to utilize the available bandwidth as well as distribute space usage to avoid consuming all of the space on a single OST
  • provide the flexibility to specify a layout that is suitable for a majority of uses, while allowing for variability between users, applications, and sites

The administrator shall be able to set a specific progressive layout for any new files created in the filesystem, and users shall be able to specify progressive layouts on their own files for their specific needs. It should be possible to specify a default progressive layout template for a directory that is inherited by new files and directories created in that directory, as well as a filesystem-wide global default layout that is used in the absence of more specific layouts.

Proposed Solution

The proposed solution is to implement Progressive File Layouts (PFL) using Composite Layouts for regular files that progressively increase, under user or application control (in the production implementation), the number of OST objects across which the file is striped as the size of the file increases beyond specific thresholds. This allows small files to have only one (or a few) stripes, while large or concurrently accessed files can have stripes on many/all OSTs in the filesystem. Since the definition of small and large files is very subjective and site, user, and application specific, it will also be possible to specify intermediate steps between a small file and a large file so that the number of stripes in the file increases in a step-wise manner as the size grows. The number and size of sub-layouts within the PFL file, and the number of stripes in each, will be tunable, though subject to other implementation and environmental limitations such as maximum layout size, number of available OSTs, and alignment of regions to page or stripe boundaries.
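The step-wise growth described above can be illustrated with a minimal sketch. This is only a model for discussion, not the Lustre implementation: the `Component` type, the `TEMPLATE` thresholds (4 MiB and 256 MiB), and the stripe counts are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

MiB = 1 << 20


@dataclass(frozen=True)
class Component:
    """Hypothetical model of one sub-layout in a progressive layout."""
    start: int            # first byte covered (inclusive)
    end: Optional[int]    # first byte NOT covered; None means "to end of file"
    stripe_count: int     # number of OST objects used by this sub-layout


# Example thresholds only; real values would be site, user, and application specific.
TEMPLATE: List[Component] = [
    Component(0, 4 * MiB, 1),
    Component(4 * MiB, 256 * MiB, 4),
    Component(256 * MiB, None, 32),
]


def components_in_use(template: List[Component], file_size: int) -> List[Component]:
    """Return the components that a file of file_size bytes actually reaches."""
    last_byte = max(file_size - 1, 0)
    return [c for c in template if c.start <= last_byte]


def objects_needed(template: List[Component], file_size: int) -> int:
    """Total OST objects across all components the file reaches."""
    return sum(c.stripe_count for c in components_in_use(template, file_size))


def widest_stripe_count(template: List[Component], file_size: int) -> int:
    """Stripe count of the last component the file reaches."""
    return components_in_use(template, file_size)[-1].stripe_count
```

Under these assumed thresholds, a 1 MiB file touches a single OST object, while a 1 GiB file reaches the 32-stripe component, which is the behavior the proposal aims for with one layout template.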

For the PFL Prototype implementation, an important goal is demonstrating and determining the feasibility of the design and prototype implementation to meet end-user requirements and suitability for a full production-ready implementation. As such, a set of tests should be documented that are intended to demonstrate the benefits of the PFL approach. These tests should include measurement of existing layouts in the absence of PFL, as well as PFL files. Both read and write bandwidth should be measured, as well as metadata create, stat, and unlink performance. The expected outcome is that small PFL files should have similar IO and metadata performance as single-striped files, while large PFL files should have similar IO and metadata performance as many-striped files, indicating that a production PFL implementation could dynamically tune the performance of both small and large files with the same PFL layout without user interaction and see the optimum performance for both.


For the prototype implementation, the end goal is that a user shall be able to manually extend the layout of a file (the mapping of files to specific OST objects) by appending a new layout with new objects at or beyond the end of the file's data. This modification of the layout will only be possible in the prototype for files that are not currently open or in use by any client. In order to be able to use progressive file layouts, a number of new components must be implemented. The PFL functionality will be implemented to the level of a prototype suitable for demonstrating the technology and measuring performance improvements under specific workloads in a controlled environment. The following components are considered in-scope for this project:

  • High Level Design document for PFL prototype
  • Implement new composite layout type for files that allows specifying different sub-layouts for multiple regions of the file
  • Implement client support for interpreting the composite layout type with sub-layouts for contiguous, non-overlapping regions of the file
  • Implement a basic tool for modifying the composite layout of a file (expected to be only usable directly on the MDS, and not from the client)
  • Test plan and documented results from executing the test plan on a suitable system
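
The in-scope items above (a composite layout with contiguous, non-overlapping regions, client-side interpretation, and manual extension at or beyond the end of the file's data) can be sketched as follows. This is an illustrative model only: `SubLayout`, `validate`, `locate`, and `append_component` are hypothetical names, and the round-robin RAID-0 mapping relative to the component start is an assumption rather than the on-disk format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SubLayout:
    """Hypothetical sub-layout covering the byte range [start, end)."""
    start: int
    end: Optional[int]     # None means the region runs to end of file
    stripe_count: int
    stripe_size: int


def validate(layout: List[SubLayout]) -> None:
    """Raise ValueError unless regions start at 0 and are contiguous and non-overlapping."""
    expected: Optional[int] = 0
    for i, comp in enumerate(layout):
        if expected is None:
            raise ValueError(f"component {i} follows an unbounded component")
        if comp.start != expected:
            raise ValueError(f"component {i} starts at {comp.start}, expected {expected}")
        if comp.end is not None and comp.end <= comp.start:
            raise ValueError(f"component {i} covers no bytes")
        expected = comp.end


def locate(layout: List[SubLayout], offset: int) -> Tuple[int, int, int]:
    """Map a file offset to (component index, stripe index, offset within object)."""
    for i, comp in enumerate(layout):
        if comp.start <= offset and (comp.end is None or offset < comp.end):
            rel = offset - comp.start                  # assumed: striping restarts per component
            chunk = rel // comp.stripe_size
            stripe_idx = chunk % comp.stripe_count     # round-robin across objects
            obj_off = (chunk // comp.stripe_count) * comp.stripe_size + rel % comp.stripe_size
            return i, stripe_idx, obj_off
    raise ValueError(f"offset {offset} is not covered by any component")


def append_component(layout: List[SubLayout], file_size: int, new_end: Optional[int],
                     stripe_count: int, stripe_size: int) -> List[SubLayout]:
    """Return a new layout with one component appended at or beyond the end of data."""
    last = layout[-1]
    if last.end is None:
        raise ValueError("last component is unbounded; nothing can be appended")
    if last.end < file_size:
        raise ValueError("file data extends past the last component")
    extended = layout + [SubLayout(last.end, new_end, stripe_count, stripe_size)]
    validate(extended)
    return extended
```

For example, a file with a single 4 MiB one-stripe component and 3 MiB of data could be extended with an unbounded four-stripe component; offsets from 4 MiB onward would then resolve to the new component.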

Out of Scope for PFL Prototype Implementation

It should be noted that the planned implementation is for a demonstration prototype only. It will be suitable for functional testing and benchmarking, or perhaps use by applications specifically modified to understand the PFL functionality and limitations. In particular, the prototype implementation does not include the following elements that are required in order to make this a full production-ready feature:

  • High Level Design and implementation of internal API for modifying composite layouts
  • High Level Design and implementation of RPCs for creating and modifying composite layouts from the client
  • High Level Design and implementation of Client IO stack remapping for dynamically changing layouts
  • Ability to modify the file layout while the file is open by any client
  • Composite layout specification at file creation time
  • Composite layout inheritance from parent directories or filesystem default
  • Optimized OST object allocation for composite layouts
  • Dynamic layout extension during writes
  • Optimized large layout transfer
  • LFSCK support for the composite layout type
  • Documentation of user tools in man pages or the user manual
  • Stabilization of code under stress testing
  • Landing of feature to a supported release

Out of Scope for Production PFL Implementation

Although this specific project is focused on implementing a PFL prototype for demonstration purposes, the long-term goal is implementing PFL in a form suitable for production use. While the components listed above would be considered in-scope for a production-ready PFL implementation, it is worthwhile to mention in this document that the composite layouts being implemented have potential uses for other important features beyond even the production PFL implementation.

Composite layouts could be a building block for File Level Replication and/or versioned files. Using composite layouts where two or more components' sub-layouts overlap implies that those sub-layouts have data for the same region of a file, or potentially the whole file. If these copies of the file data are kept in sync, the file can be replicated across multiple independent OSTs in order to provide redundancy, fault tolerance, and/or increased read bandwidth. Alternately, if the old versions of a file are not kept in sync, this would allow file versioning, snapshots, data forks, and similar functionality to be implemented. Being able to access, synchronize, and resynchronize files with overlapping regions would require additional implementation on the client and userspace infrastructure.
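
As a rough illustration of how overlapping components imply multiple copies of a region, the sketch below lists which components cover a given offset. The `(name, start, end)` tuple representation, the component names, and the 64 MiB mirror extent are hypothetical and chosen only for the example.

```python
from typing import List, Optional, Tuple

MiB = 1 << 20

# (name, start, end) with end exclusive; None means "to end of file".
Extent = Tuple[str, int, Optional[int]]

# Hypothetical file: a primary component for the whole file, plus a
# second component that overlaps only the first 64 MiB.
COMPONENTS: List[Extent] = [
    ("primary", 0, None),
    ("mirror", 0, 64 * MiB),
]


def covering(components: List[Extent], offset: int) -> List[str]:
    """Names of all components whose extent contains offset."""
    return [name for name, start, end in components
            if start <= offset and (end is None or offset < end)]


def is_replicated(components: List[Extent], offset: int) -> bool:
    """True when more than one component holds data for offset."""
    return len(covering(components, offset)) > 1
```

In this model the first 64 MiB of the file has two copies and the remainder has one; whether the copies are kept in sync (replication) or allowed to diverge (versioning) is a policy decision outside the layout itself.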

In practice, it is expected that typical PFL usage will have files that contain a small number of components (2-4) with an increasing number of RAID-0 stripes. The actual number of components, their start and end offsets, the number of stripes, and even sub-layout type are not actually constrained by the PFL implementation in this way. This would allow novel composite layouts to be implemented by users or in conjunction with PFL-aware applications or middleware libraries (e.g. HDF5) that specify different sub-layouts for different regions of a single file. This includes arbitrarily changing the stripe count and the stripe size for each region to match different access patterns for that region, replicating some regions and not others to match availability requirements (e.g. index or metadata vs. data), storing the data on different classes of OST storage (e.g. SSD vs. HDD, RAID-1 vs. RAID-6) using OST pools or direct OST selection, and changing the layout type (e.g. RAID-0 vs. Data-on-MDT).

Composite layouts would also be useful for HSM and other tiered storage, in order to allow partial archive/restore of a file to/from the archive into a component that only covers part of the file. In a similar manner, overlapping components could be used to allow incremental migration or restriping of large files across OSTs or between different tiers of storage, by creating a replica that only covers a part of the file's data, and dynamically changing the regions covered by the components as data is migrated.

It could be possible to add a new overlapping region for an existing file with RAID-0 layout that adds a single or double parity layout, in essence turning a RAID-0 file into a RAID-4 or RAID-DP file that allows the file to remain accessible even if one or two stripes in the file become unavailable, without having to rewrite or replicate the whole file.
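
The single-parity case can be illustrated with a small worked example using bytewise XOR, the standard parity calculation for RAID-4 style layouts. This is a sketch under simplifying assumptions: stripes are modeled as equal-length byte strings, the function names are hypothetical, and the second, independent parity needed for double-parity (RAID-DP) protection is not shown.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Bytewise XOR of one or more equal-length byte strings."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all of equal length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripes: List[bytes], parity: bytes, lost: int) -> bytes:
    """Reconstruct stripe number `lost` from the surviving stripes and the parity."""
    survivors = [s for i, s in enumerate(stripes) if i != lost]
    return xor_blocks(survivors + [parity])


# Three data stripes of an existing RAID-0 file (toy contents).
STRIPES = [b"AAAA", b"BBBB", b"CCCC"]
# The parity that an added overlapping component would store.
PARITY = xor_blocks(STRIPES)
```

Calling `rebuild(STRIPES, PARITY, 1)` recovers the contents of the second stripe from the other two stripes and the parity, without any of the original data having been rewritten.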

There is no intention to allow nested composite layouts (i.e. sub-layouts that have multiple sub-layouts internally), due to the lack of any suitable use case for this functionality.