PFL Prototype Solution Architecture

From Lustre Wiki

Introduction

Note: The PFL Phase 2 Solution Architecture takes precedence over this document.

The Lustre Progressive File Layouts (PFL) feature intends to simplify the use of Lustre so that users can expect reasonable performance for a variety of normal file IO patterns without the need to explicitly understand their IO model or Lustre usage details in advance. In particular, users do not necessarily need to know the size or concurrency of output files in advance of their creation and explicitly specify an optimal layout for each file in order to achieve good performance for both highly concurrent shared-single-large-file IO and parallel IO to many smaller per-process files. The PFL Prototype Scope Statement describes the overall goals and intended outcomes of this phase in more detail, and will not be repeated here. This PFL Prototype Solution Architecture document describes how the goals of the Progressive File Layouts project may be implemented, and how to measure the completion and outcomes of this phase.

In some places this document may describe functionality and usage in terms of what a production-quality implementation would provide (e.g. automatic layout extension as the file size grows) that will not be implemented in the prototype. This is done to illustrate how the feature may be used in the long term and to guide the prototype implementation in the right direction.

Use Cases

U1. Create a small file with minimum overhead

A user wants to create one or more "small" files (e.g. below 32MB) in a directory with minimum overhead. Creating files with a single OST stripe is optimal in such cases to reduce contention on OSTs (i.e. excess seeking on OSTs), and minimize overhead for fetching the file size from OSTs (e.g. ls -l overhead due to fetching the file size from multiple OSTs). Single-striped files are the default for Lustre, though sites may specify their own filesystem-wide default stripe count. For PFL files, the first component should specify a sub-layout with a single stripe for an extent that covers the maximum size of "small" files. The data for "small" files will only be written to the first stripe of the file, and no OST objects will be allocated for the unused parts of the composite layout.

For the purposes of the prototype implementation, the "small" PFL file will be represented by a normal RAID-0 file layout (LOV_MAGIC_V1) to simulate performance of IO and metadata operations. Automatic PFL extension will not be implemented, but will be handled manually.

U2. Spread medium file space usage across multiple OSTs

A user wants to create one or more "medium sized" files (e.g. 32MB-4GB) that distribute space usage across multiple OSTs and provide moderate performance benefits without causing undue contention on the OSTs. Once the file size exceeds some threshold, it should automatically stripe the file more widely to distribute space usage across multiple OSTs to avoid high space usage or out-of-space conditions on one OST. For PFL layouts, the second (or later) component of the file should specify a sub-layout with a larger number of stripes (e.g. 4 or 8) for files beyond the "small" size. The file data will initially be written to the first (single-stripe) component of the file, and when data is written beyond the end of the first component it will begin writing to additional stripes for the extent specified by the second component. No OST objects will be allocated for any unused components of the composite layout.

U3. Spread concurrent file IO across multiple OSTs

A user wants to spread file IO for large files (e.g. over 4GB) across a large number of OSTs (e.g. 100+) in order to maximize single-file IO performance for a large number of clients. Similar to U2, it is expected that multiple writers to a single file will seek and write to large offsets in the file in order to trigger this behaviour. For PFL layouts, the last component of the file may specify a sub-layout to cover the remainder of the possible file sizes (4GB-) that has a number of stripes to cover the remaining OSTs (so there is no contention with the stripes for the start of the file), or (near) the maximum number of OSTs that are available in the filesystem (so that very large files use space and bandwidth from (almost) all OSTs). OST objects will be allocated for all components of the large PFL composite layout.
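The three use cases above map naturally onto a single composite layout with progressively wider striping. The following sketch models such a layout as plain data; the size thresholds and tuple layout are illustrative assumptions for this document, not actual Lustre on-disk structures:

```python
# Illustrative model of a three-component PFL layout covering the
# "small" (U1), "medium" (U2), and "large" (U3) use cases.
EOF = 2**63  # maximum Lustre file size

pfl_layout = [
    # (extent_start, extent_end, stripe_count)
    (0,        32 << 20, 1),    # U1: small files, single stripe
    (32 << 20, 4 << 30,  4),    # U2: medium files, moderate striping
    (4 << 30,  EOF,      100),  # U3: large files, wide striping
]

def stripes_for_offset(layout, offset):
    """Return the stripe count of the component covering `offset`."""
    for start, end, stripe_count in layout:
        if start <= offset < end:
            return stripe_count
    raise LookupError("no component covers offset %d" % offset)
```

For example, a write at offset 1MB would land in the single-stripe first component, while a write at offset 10GB would use the widely-striped last component.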

Solution Requirements

R1. Composite Layouts

In order to be able to have a striping pattern that can change for different sizes of files, a new file layout type must be added to support this. The layout type should be flexible in allowing complex layouts to be generated for different application types, and should ideally be reusable for a number of different purposes. The new layout type should be understood by both the client and server.

R2. Tool(s) for using composite layout files

In order to test the PFL functionality, it is necessary to be able to create and remove files with a composite layout. It must be possible to create files with different stripe counts at specific offsets within the file. It must be possible to create files with more than two components in order to test proper functionality. It should be possible to display the layout components and allocated objects of a composite file. Ideally, the tool for deleting composite files will be rm so that no special handling is needed by other programs and test scripts.

R3. Client IO to composite files

The client IO stack must be able to interpret composite layouts, map file read/write requests to the appropriate component within the composite layout, and map the file offset to an object offset within that component. The client should support both read and write operations, and truncate (both extending and shrinking). The client should properly lock the OST object extents for the operations it is doing on each component involved in the IO operation.

In order to avoid complexity in the client IO prototype implementation, changing the layout of a file while it is open or in use by any client is not supported. File IO to an offset of a composite file that does not have a defined layout component will return ENODATA to the application.

R4. Understand security implications

While system security is not a requirement of the prototype implementation, some consideration should be given to at minimum understand what impact the changes and tools will have on system security and data integrity.

Solution Proposal

S0. Development Target Version

The PFL prototype will be developed against a private copy of the Lustre master development branch, based on the Lustre 2.7.0 release or later. The patches will not be backported to other maintenance branches as part of this project. While some patches implemented as part of this project may land to the public master development branch in order to fix bugs, improve code quality, or simplify future interoperability, there is no requirement that all or part of the feature patches will be landed to a publicly-available release.

S1. Composite Layouts

The PFL implementation will use the composite extent-mapped layout as described in the Layout Enhancement Solution Architecture to allow different RAID-0 layouts to describe different parts of the file. The composite layout will have one or more components, each describing a different part of the file, and also provides functionality useful for other complex file management features.

A plain file layout is one with the current non-PFL RAID-0 layout. This is in contrast to a PFL composite layout file which will have one or more components to describe the layout for different parts of the file.

S1.1. Stripe size and component boundaries

In the current implementation of Lustre, it is not possible to have a file layout whose stripe boundaries cross a PAGE_SIZE boundary in the client cache; otherwise it would be necessary to track partially-dirty pages, since a split page could be written independently to two separate OSTs. In order to avoid potential incompatibilities with the PAGE_SIZE on different CPU architectures, the stripe size must be a multiple of at least 64KB (LOV_MIN_STRIPE_SIZE). For simplicity of implementation in the prototype, it may be necessary to restrict all components to the same stripe size.

For PFL files, there must also be a clear separation of pages between different components of the layout, so that a page is exclusively in one component or another. One possibility is to require that the boundary between one component's extent end and the next component's extent start is an integer multiple of the stripe size of both components. If the stripe size of all components is identical, this requirement is not overly complex, but if the stripe sizes of adjacent components are not the same then it may force the component boundary into an unwanted region of the file. Another option is to require only that the boundary between components falls on a PAGE_SIZE boundary. This may result in additional complexity in the client IO implementation if the stripes at adjacent sub-layout boundaries overlap. This issue will be explored further and resolved as part of the PFL Prototype High Level Design Document.
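The first (stricter) option above can be expressed as a simple alignment check. The sketch below is illustrative only; the function name is an assumption, and only the stated arithmetic rule is taken from the text:

```python
LOV_MIN_STRIPE_SIZE = 64 << 10  # stripe sizes must be multiples of 64KB

def valid_component_boundary(boundary, prev_stripe_size, next_stripe_size):
    """Check the stricter boundary rule discussed above: the boundary
    between two adjacent components must be an integer multiple of the
    stripe size of BOTH the preceding and following component."""
    if prev_stripe_size % LOV_MIN_STRIPE_SIZE or \
       next_stripe_size % LOV_MIN_STRIPE_SIZE:
        return False  # the stripe sizes themselves must be 64KB multiples
    return boundary % prev_stripe_size == 0 and \
           boundary % next_stripe_size == 0
```

For example, a 32MB boundary is valid between 1MB-stripe and 4MB-stripe components, but a 33MB boundary is not (33MB is not a multiple of 4MB), illustrating how unequal stripe sizes constrain where the boundary may fall.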

For the PFL project, layout components must have adjacent extents, so that there are no holes in the layout, except possibly at the end of the file. While the composite file layout format itself would allow components to specify overlapping extents, for PFL files (both in the prototype and a PFL production implementation) it will not be possible to specify layout components that have overlapping extents. The client IO mechanisms to write a single page of file data to multiple overlapping components are out of scope and will not be implemented as part of this project. Overlapping components would be part of a separate File-Level Replication project.

S1.2. Component iteration and selection

When file data is accessed with a composite layout, the client first needs to determine which component covers the specified IO extent. This will typically be done by looping over the components within the composite layout to find one with an extent that covers the IO extent. Since there will typically only be a small number of components in a PFL file (typically 2-3 components, with a theoretical maximum of around 500 single-striped sub-layouts due to existing maximum xattr and RPC size limits) it should be possible to iterate over a list or array of components in order to find the appropriate layout for a particular file offset, rather than building a hierarchical structure in memory such as a tree (which may impose more overhead than it avoids in the common case). Only once the component sub-layout is selected will the IO be mapped using the specific stripe count, stripe size, and objects of the sub-layout.
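The linear scan and the subsequent RAID-0 mapping can be sketched as follows. This is a minimal model, assuming each component is a (start, end, stripe_size, stripe_count) tuple and standard RAID-0 round-robin striping; it is not the actual LOV code:

```python
import errno

def map_offset(components, file_off):
    """Map a file offset to (component_index, object_index, object_offset).

    Iterate over the (small) list of components to find the one whose
    extent covers the offset, then apply that component's RAID-0
    sub-layout mapping."""
    for idx, (start, end, ssize, scount) in enumerate(components):
        if start <= file_off < end:
            rel = file_off - start      # offset relative to the component
            stripe = rel // ssize       # stripe number within the component
            obj_idx = stripe % scount   # round-robin object selection
            obj_off = (stripe // scount) * ssize + rel % ssize
            return idx, obj_idx, obj_off
    # No component covers the offset: the prototype returns ENODATA
    raise OSError(errno.ENODATA, "no layout component covers offset")
```

With a 1MB stripe size, an offset of 9MB in a second component starting at 4MB maps to relative offset 5MB, i.e. stripe 5, which in a 4-stripe sub-layout is object 1 at object offset 1MB.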

Some file-level operations such as truncate, setattr, unlink, etc. will operate on all objects on the file, regardless of the component they belong to.

If no component covers the IO extent (i.e. the last component's layout extent does not go to the maximum end-of-file of 2^63 bytes) then the prototype implementation will return an ENODATA error to the caller. In the production implementation the client would request a layout extension to cover the required offset(s) in the file and an updated layout from the MDS for this extent of the file, and the MDS would instantiate the required component(s) in some manner, by either extending the extent of the last component, or instantiating objects from a layout template attached to the file at creation time from a parent directory or system-wide default layout.

S1.3. File size calculation

In order to determine the size of a file, Lustre needs to know the size of each of the OST objects; the actual file size is then taken from the maximum size after mapping with the file's layout (stripe_size, stripe_count). For the PFL prototype, the client will fetch the OST object size from all objects that are allocated to the file. Since the object size is fetched from all OSTs in parallel, and the number of objects contained in early components will be relatively small compared to the number of objects in the last component, there appears to be little benefit in trying to optimize away the RPCs to the earlier objects.
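The "maximum size after mapping" computation can be sketched as follows, assuming RAID-0 components described by (extent_start, stripe_size, stripe_count, per-object sizes); the function names are illustrative:

```python
def component_file_size(ext_start, stripe_size, stripe_count, object_sizes):
    """Map each object's last byte back to a file offset and return the
    highest file offset + 1 contributed by this component."""
    size = 0
    for i, osize in enumerate(object_sizes):
        if osize == 0:
            continue
        last = osize - 1  # object offset of the object's last byte
        rel = (last // stripe_size) * stripe_size * stripe_count \
              + i * stripe_size + last % stripe_size
        size = max(size, ext_start + rel + 1)
    return size

def pfl_file_size(components):
    """File size is the maximum over all components' mapped object sizes.
    components: list of (ext_start, stripe_size, stripe_count, object_sizes)."""
    return max(component_file_size(*c) for c in components)
```

For example, a 5MB file striped 1MB-wide over 4 objects leaves object 0 holding 2MB (stripes 0 and 4) and objects 1-3 holding 1MB each; the inverse mapping of object 0's last byte recovers the 5MB file size.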

Truncate operations should be sent to all OST objects, after appropriately mapping the file offset into per-object offsets, to ensure they discard data or extend the object size appropriately even if the new file size does not intersect their component's extent. Otherwise, truncate operations that only are sent to the components that intersect the requested file size may inadvertently leave file data or object sizes in later components.
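The per-object offset mapping for a truncate can be sketched as below. Note that objects in components entirely beyond the new size simply get a size of 0, which is why the truncate must still be sent to them. The function name and argument layout are assumptions for illustration:

```python
def object_truncate_sizes(ext_start, ext_end, stripe_size, stripe_count,
                          new_size):
    """Per-object truncate sizes for one RAID-0 component when the file
    is truncated to `new_size`.  Every object gets a truncate, even if
    the new size does not intersect this component's extent."""
    # Bytes of this component's extent that are kept after the truncate.
    rel = max(0, min(new_size, ext_end) - ext_start)
    full = rel // (stripe_size * stripe_count)  # complete stripe rounds kept
    w = rel % (stripe_size * stripe_count)      # remainder within last round
    j = w // stripe_size                        # object holding the boundary
    sizes = []
    for i in range(stripe_count):
        if i < j:
            sizes.append(full * stripe_size + stripe_size)
        elif i == j:
            sizes.append(full * stripe_size + w % stripe_size)
        else:
            sizes.append(full * stripe_size)
    return sizes
```

For a 4-stripe component starting at 4MB with 1MB stripes, truncating the file to 6MB keeps 2MB of the component, so objects 0 and 1 are truncated to 1MB and objects 2 and 3 to zero; truncating to 2MB zeroes all four objects.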

S1.4. Object allocation

For the PFL prototype implementation, there will be no coordination of OST object allocation between different components on the MDS. This means that it is possible to create composite layouts that may be sub-optimal because they allocate multiple objects from the same OST(s). This would place an extra IO burden on those OST(s) that have more allocated objects, and may reduce the aggregate performance available for the entire file if clients are still writing to the different components concurrently. In some cases, it may be desirable that the last component allocates objects from all OSTs, even if this overlaps with OSTs allocated in earlier components, so that large PFL files that are growing linearly will eventually be spread across all OSTs. The negative performance impact of multiple objects being allocated to the same OST can be avoided with careful layout specification at file creation time (e.g. using lfs setstripe to allocate sub-layouts with a different starting --index, explicit OST selection via the --ost-list option added in Lustre 2.7, or --pool to select objects from a different OST pool).

S2. Tools for using composite files

Creating composite files within the scope of the PFL prototype will be done by incrementally merging a newly-created temporary file with the required layout onto an existing file using Layout Merge for each component added to the end of the file. This allows generating potentially complex composite file layouts as needed, while maintaining a relatively simple user interface.

S2.1. Composite files via Layout Merge

The method for creating composite files is to create an initial target file with a composite layout, and then create a second temporary victim file with the desired layout for the next component, with the layout of the victim file being added as the next component of the target file. For PFL files, one would first create a composite file for the first component in the file with the desired layout properties using lfs setstripe. Then, create a second victim (temporary volatile) file with the appropriate striping parameters for the next component and, via an ioctl() on the first file, add the layout of the victim file to the target file, specifying the extent start and end for the new component. The temporary victim file would be left without any layout, and would be destroyed when closed. Composite file layouts will not support nested composite sub-layouts; only a single level of composite layouts will be supported.

[Figure: PFL Layout Merge]

The Layout Merge approach has the benefit that its implementation would be similar to the existing MDS_SWAP_LAYOUTS code that allows exchanging the layouts between two files, but rather than moving the target file's layout to the victim file it is kept as the first component.

Layout Merge also allows the layout to be specified incrementally during usage instead of being fixed at file creation time. However, this implementation could also be a drawback, since it does not necessarily allow inheriting a composite layout from the parent directory incrementally, as the composite layout could become inconsistent if the parent layout is changed in the potentially lengthy time between file creation and when the later components are added. For the production PFL implementation, a complete composite layout template could be inherited from the parent directory when a new file is created, which would contain stub layouts for each component, but not have initialized objects in order to avoid overhead for small files. The layout template for each component would be instantiated (OST objects allocated) automatically by the client when needed.

One potential issue for the prototype is that Layout Merge needs to be done with an open file handle on the target file in order to perform the ioctl() command. Since the expectation for the PFL prototype is that files are closed on the client and the file layout is revoked from the client cache when it changes, this operation will require closing the file again after the ioctl() is completed before IO is started on the file again. That will allow the client IO stack to reconfigure when the layout is re-fetched on the next IO.

Another long-term issue with using Layout Merge for creating PFL files is that multiple clients trying to extend the file concurrently would incur extra overhead (additional open+create RPCs) by creating victim files and OST objects for the next component, only to have an error returned by the MDS if a component had already been created for that extent. For the purposes of the prototype implementation and testing, there will only be a single process changing the layout at one time, so Layout Merge is a usable option for this project.

For the prototype, it will not be possible to write to the file at an offset beyond the end of the last component's extent. For the purpose of the PFL prototype this is not a concern, as it would not affect IO or metadata performance, and can be avoided by proper tool usage, creating the next component before it is written to.

This functionality should be implemented via the lfs setstripe command:

lfs setstripe --pattern=composite [--component=<component_number>] --stripe-count=<stripe_count> [--start-offset=<start_offset>] --end-offset=<end_offset> [other setstripe options] <target_file>

to attach a newly-created volatile file to <target_file> as a new component starting at <start_offset> (or after the end of the previous component's end_offset) and ending at <end_offset>, defaulting to --end-offset=OBD_OBJECT_EOF if it is not specified (implying this is the last component to be created on the file). Other setstripe options for specifying the stripe size, starting index, OST pool, etc. for the temporary file should be available. For the prototype implementation, it is possible that not all setstripe options will be supported in conjunction with composite layouts.

S2.2. Displaying composite file layouts

For development purposes, functional and correctness testing, and usability, it is desirable to be able to display composite file layouts and the allocated objects. This should be possible using lfs getstripe on the composite file to display the attributes of the component(s) of the file and the sub-layout of each component. Since composite files are hierarchical in nature, and are different from existing file layouts, consideration should be given to implementing an improved output format for lfs getstripe, such as YAML, that allows describing the composite layout in an easy-to-read and easy-to-parse format; however, this is not a requirement of the prototype implementation.

S3. MDS support for composite layouts

The MDS server needs to be able to understand and use composite layouts. The MDS should be able to create composite files with allocated objects. The MDS should be able to delete files with composite layouts and destroy all the components' objects. The OSTs are not aware of file layouts outside of LFSCK (which is not part of the project scope), and do not need to be modified to work with composite file layouts. With Layout Merge, the MDS would need to merge two plain file layouts into a single composite layout with one component for each plain layout, or merge one plain file layout onto an existing composite layout. The <start_offset> of each new component must match the <end_offset> of the previous component, and must be strictly less than its own <end_offset> (i.e. no zero-length components). The handling of the (now layout-less) victim file is not directly controlled by the MDS. If the client tool created the victim file as a volatile file (a temporary file without a filename) then it will be unlinked when the victim file is closed. Otherwise, it will remain as a file without any instantiated layout, which is not unusual for Lustre.
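The MDS-side extent checks for Layout Merge can be sketched as a simple append-with-validation. This is an illustrative model only (list of (start, end, layout) tuples, hypothetical function name), not the actual MDS code:

```python
EOF = 2**63  # stand-in for OBD_OBJECT_EOF

def merge_component(target_components, victim_layout, start=None, end=EOF):
    """Append the victim file's plain layout as the next component of
    the target, enforcing the Layout Merge extent rules: the new start
    must match the previous end, and extents must be non-empty."""
    prev_end = target_components[-1][1] if target_components else 0
    if start is None:
        start = prev_end  # default: continue where the last component ended
    if start != prev_end:
        raise ValueError("component start must match previous component end")
    if end <= start:
        raise ValueError("zero-length or inverted component extent")
    target_components.append((start, end, victim_layout))
    return target_components
```

Merging a 1-stripe layout for [0, 32MB) and then a 4-stripe layout (with the default start and end) yields a two-component file whose last component runs to EOF; a merge whose start does not match the previous end is rejected.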

The MDS LOD subsystem will use functionality from S1.2 to iterate over all layout components for operations such as truncate, setattr, unlink, and others, and then process the sub-layouts. The LOD will need to perform an operation on each object, or determine the total number of objects in a layout in order to determine the transaction size for unlink.

The MDS LOD will need to calculate the size of new layouts, taking into account the overhead of the composite layout header and component entries. There is already support for relatively large file layouts (up to 2000 stripes in a plain file, about 48KB) that should be sufficient for normal usage. While there will be some overhead from the composite layout header and component entry, this is not expected to significantly impact maximum file size or bandwidth limits unless layouts become very complex.

S4. Client support for composite layouts

The client IO stack and LOV need to understand the new composite layout in order to perform most file-based operations, including read, write, truncate, setattr (chmod, chown, chgrp, *time), getattr (*stat), DLM locking, and others. There are two classes of operations that need to be handled to work with composite layout extents. The offset-based operations such as read, write, truncate, and DLM locking will use the S1.2 and S1.3 functionality to map file-level offsets into object-level offsets within the extents of each component, as well as the sub-layout within each component. The non-offset operations such as setattr and getattr only need to be able to iterate over all of the objects in the layout.

In the LOV layer, a new interpreter for composite layouts will be defined. The interpreter's methods must know how to convert the on-disk composite layout structure into an in-memory layout. It will also provide functionality to map a file offset to a layout component and an object offset on the OST.

The IO framework on the client side will be revised to split IOs into sub-IOs, each confined to a single layout component. The client IO stack will reuse most of the RAID-0 code to handle the IO for each individual layout component. Depending on the IO size from the application, several iterations may be needed to complete an IO.
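The sub-IO split can be sketched as follows, assuming sorted, adjacent component extents as required by S1.1 (the function name is illustrative):

```python
import errno

def split_io(components, start, end):
    """Split a file IO extent [start, end) into sub-IOs, each confined
    to a single layout component.  components: sorted, adjacent
    (ext_start, ext_end) extents.  Returns (component_index, sub_start,
    sub_end) tuples."""
    subs, off = [], start
    for idx, (cs, ce) in enumerate(components):
        if off >= end:
            break
        if cs <= off < ce:
            subs.append((idx, off, min(end, ce)))
            off = min(end, ce)  # advance to the next component boundary
    if off < end:
        # Part of the extent has no layout component: prototype gives ENODATA
        raise OSError(errno.ENODATA, "IO extent not covered by layout")
    return subs
```

For example, a write covering [1MB, 9MB) over components [0, 4MB), [4MB, 8MB), and [8MB, EOF) is split into three sub-IOs, one per component, each handled by the existing RAID-0 code.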

Due to its complexity, append write won't be supported in the prototype implementation.

S5. Security and correctness considerations

Since this project is currently targeting a prototype implementation, there may be security issues that are present in the prototype that would not be acceptable in a production release. In particular, the user tool to create a composite file layout may not have strict verification of the supplied parameters since this would be handled in the production implementation by a different API that properly controls the ways in which a layout can be modified. As such, invalid requests by the user tool may potentially allow incorrect behavior under some cases. Permission checking for user tool operations may also be bypassed in order to allow these operations. The user tool only needs to provide correct functionality for a limited number of test cases, and as such may require a priori knowledge of valid combinations of parameters for correct functionality.

For the purposes of the prototype, if a Layout Merge were performed while the previously-last component held file data in the extent covered by the newly-added component, the data in the earlier component would become permanently inaccessible, because the client would begin to map those file offsets into the newly-added component.

S6. Compatibility

Composite layouts will not be compatible with existing MDS or client code. Old clients accessing a composite file on a new MDS will receive an error and refuse to open the file. New clients trying to create a composite file on an old MDS will be refused by the MDS.

Functional Testing

F1. Create a PFL file

Create a PFL file with three separate components, with 1-stripe, 4-stripe, and many-stripes respectively.

F2. Write and read a PFL file

Write pattern data to a 3-component PFL file, then read and verify the data, ensuring that it does not come from the client cache (e.g. flush client cache between write and read, or use IOR read-verify mode to read from a different node from the writer).

F3. Truncate PFL file

Truncate a 3-component file with the size within each component's extent and verify that the size is stored correctly and discards/initializes data stored on disks (e.g. flush client cache, or read attributes from another client).

Verify that performing a truncate that shrinks the size of the file into an earlier component returns the correct file size.

Verify that performing a truncate that expands the size of the file into a later component beyond the original file size returns zeroes for the newly-extended size.

Verify that performing a shrinking truncate followed by an extending truncate returns zeroes for the extent that was previously truncated away.

F4. Unlink a PFL file

Unlink a large PFL file and verify that it has been removed from the namespace and that the allocated space has been removed, after a sufficient interval to ensure the unlink has committed on the MDT and OST(s).

Performance Measurement

P0. Test tools and methodology

IO performance measurement for both read and write throughput will be done with IOR using parameters appropriate to each test (shared single file vs. file per process, thread counts, data verification). The stripe size of files should be 1MB for all performance test cases, and the IOR IO chunk size should also be 1MB. The total file size should be selected to be large enough to avoid cache effects on the server. The cache on the client will not significantly affect IOR testing unless the file sizes are very small (below 64MB per stripe) because Lustre flushes writes aggressively to the server to avoid idle time while the client cache is filled and already limits the amount of dirty data per stripe. The IOR read with data verification tests run on a different client node from the one that wrote the data to avoid the write cache entirely.

For tests running on "a large number of clients", this should be selected to be at least as many clients as OSTs, if possible, up to the maximum number of clients available.

For PFL files with multiple components, the victim files will be allocated to avoid placing stripes on OSTs that are already part of other components in the file to avoid bottlenecks.

Metadata performance measurement will be done with mdtest, creating 1M files in total, with both 1 stripe per file and many stripes (equal to the number of OSTs). No file data is needed for the metadata tests, since the primary goal is to measure the overhead of fetching the file size from the OST(s), which is independent of the actual file size but highly dependent on the number of stripes. Tests should include file create, file stat, and file unlink. The filesystem should be unmounted between each test phase to properly measure the overhead of cold-cache operations, as is typical for large filesystems.

Each test run should be done at least 5 times to ensure that variability in the testing results is visible during analysis, recording the minimum, maximum, and median values. Each test run is expected to take several minutes of runtime to ensure statistically useful performance measurements. All tests will be with LDISKFS.

P1. Single-stripe file-per-process performance

Measure the IO and metadata performance of a single client and a large number of clients accessing many plain single-striped files. This test case represents the current default layout performance, and best case for minimizing metadata overhead for file operations like create, stat, and unlink that depend on the number of stripes. This will also provide a baseline measurement for PFL "small" file metadata and IO performance (i.e. with only the first component instantiated).

P2. Single-stripe shared-single-file performance

Measure the IO performance of a large number of clients accessing a single plain single-striped file. This represents the pessimal case of a user not aware of the requirement to stripe the file across multiple OSTs to achieve space balance and leverage available IO bandwidth. It is expected that IO performance is not better than that available from a single OST, and may be significantly worse due to increased contention on that OST. No metadata benchmarks will be done for the single-file case, since it is difficult to provide meaningful metadata performance metrics for a single file. This is a case that PFL intends to avoid.

P3. Many-stripe file-per-process performance

Measure the IO and metadata performance of a single client and a large number of clients accessing many plain many-striped files. This represents the case where a user or administrator sets a large default stripe count for all files to avoid space and performance imbalances from large files and files accessed by many clients, at the expense of increased metadata overhead. This is expected to be sub-optimal for metadata operations (create, stat, unlink) due to the large number of OST objects per file. The IO performance may be acceptable, depending on the ratio of the number of clients to the number of OSTs. This is a case that PFL intends to avoid.

P4. Many-stripe shared-single-file performance

Measure the IO performance of a large number of clients accessing a single plain many-striped file. This represents the case of a knowledgeable user that sets a large stripe count for a file accessed by many clients concurrently. This is expected to provide better aggregate IO performance since a large number of clients can write to a large number of OSTs concurrently. No metadata benchmarks will be done for the single-file case, since it is difficult to provide meaningful metrics for a single file. This will also provide a baseline measurement for PFL "large" case.

P5. Single-stripe "PFL small" file-per-process performance

Measure the IO and metadata performance of a single client and a large number of clients accessing many PFL single-striped files. A single-stripe "small" PFL file will be nearly identical to a plain single-striped file, with the exception of an additional header in the layout to describe the component start and end. So long as the file size does not exceed the end of the defined layout extent there should not be any significant performance difference between plain and composite files, so the P5 results should be the same as P1 results. Measuring the metadata create performance for a PFL file may not be possible due to the lack of support in the prototype to inherit PFL file layouts, but it should be possible to measure the file stat and file unlink performance for such files after running a separate script to create them individually.

P6. Multi-stripe "PFL medium" file-per-process performance

To measure the performance effects of having non-uniform PFL striping, a two-component PFL file will be created for each process using Layout Merge, where the expected file size for the IO test is in the second component. Measure the IO and metadata (stat and unlink) performance of many "medium" PFL files with a single client and a large number of clients to determine the overhead and benefits of the PFL layout. Measuring the metadata create performance for a multi-component PFL file may not be practical due to the need to create such a file incrementally, which would require modifying the testing tool, but it should be possible to measure the file stat and file unlink performance for such files after running a separate script to create them individually. It is expected that the aggregate "PFL medium" file IO performance will be an improvement over the P1 single-client case, and approximately the same as that of P1 with many clients. It is likely that the "PFL medium" file will provide sufficient disk bandwidth to a single client that the IO performance will be equal to P3, since the limitation will be the network bandwidth. The metadata performance is expected to be somewhat lower than that of P1 due to more objects per file, but not as bad as that of P3 for the same reason.

P7. Multi-stripe "PFL medium" shared single file performance

Since the size of a shared-single-file IO test will exceed the maximum size of the two-component PFL file in order to avoid cache effects, this test would be equivalent to that of P8 (since it would automatically be extended to a three-component file in the production PFL implementation) and will not be run.

P8. Many-stripe "PFL large" shared-single-file performance

In order to simulate the long-term behavior of a PFL file that grows to have a large number of stripes based on file size, the test file will first be created with a three-component PFL layout (1-stripe, 4-stripe, and many-stripes). Measure IO performance testing with a large number of clients accessing a single "large" PFL file. While it is possible to have more than three components in a composite layout, three is chosen as a likely representative of how PFL will be used in real-world deployments. This test case is expected to provide comparable aggregate IO performance to that of the P4 test case of the manually-created many-striped non-PFL file.

Depending on the exact striping parameters chosen for the last component and their relative sizes compared to the initial components, the "PFL large" file may have some performance limitations compared to the ideal many-stripe configuration, due to OSTs used for the initial components being idle if the total file size far exceeds the extents of the first components. In cases where the file size is expected to be very large (>> num_osts * last_extent_start), it may be desirable to specify the stripe count in the last component to include all OSTs in the filesystem so that very large files distribute load across all OSTs, even if the OST assignment overlaps slightly with the initial components. In cases where the file size is expected to be comparable to (num_osts * first_extent_size), it may be desirable to specify the stripe count in the last component as (num_osts - stripes_in_other_components) to minimize contention with IO to the earlier stripes.
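These two sizing heuristics can be sketched in a few lines of Python; the function name, parameters, and the 10x "very large" threshold are illustrative assumptions for this document, not part of the design:

```python
MB = 1 << 20
GB = 1 << 30
TB = 1 << 40

# Hypothetical helper illustrating the two heuristics above: stripe the
# last component over every OST for very large files, or avoid the OSTs
# already used by earlier components for moderately sized files.
def last_component_stripe_count(expected_file_size, num_osts,
                                first_extent_size, stripes_in_other_components):
    # "Very large" threshold is an illustrative assumption (10x the data
    # needed to place first_extent_size worth of data on every OST).
    if expected_file_size > 10 * num_osts * first_extent_size:
        return num_osts  # overlap with the initial components is acceptable
    return num_osts - stripes_in_other_components

# 280 OSTs, 64MB first extent, 5 stripes used by the earlier components:
print(last_component_stripe_count(100 * TB, 280, 64 * MB, 5))  # → 280
print(last_component_stripe_count(20 * GB, 280, 64 * MB, 5))   # → 275
```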

Notes and Limitations on the proposed implementation

Maximum File Size

For ldiskfs the direct mapping of file offset to extent offset imposes a maximum file size of (stripe_count * 16TB), where stripe_count is that of the last layout extent in the progressive layout. This is not significantly different from today, except for a small reduction in the stripe_count of the last extent due to OST objects that are allocated in earlier extents. Typically this will not be a limitation due to space constraints on the OSTs, and can be tuned by selecting the layout progression appropriately. This is not a limitation for ZFS-based OSTs.
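The resulting bound can be computed directly; for example, a final component striped over 275 OSTs (with 5 objects consumed by earlier extents on a 280-OST filesystem) still allows a very large maximum file size:

```python
TB = 1 << 40

# ldiskfs limits each OST object to 16TB, so the maximum file size is
# bounded by the stripe count of the final (EOF) component.
def ldiskfs_max_file_size(last_stripe_count):
    return last_stripe_count * 16 * TB

print(ldiskfs_max_file_size(275) // TB)  # → 4400 (i.e. 4400TB)
```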

Client Compatibility

Clients that are not patched with the new progressive layout code will not be able to access files that use progressive layouts. This incompatibility would only affect files using progressive layouts, and not other files that may already exist in the filesystem, or new files created without using the progressive layout format.

OST Oversubscription

To avoid oversubscribing OST bandwidth, OSTs used at the beginning of the file should not normally be re-used for objects allocated later in the file. The space usage of each OST, and by extension its required bandwidth, can be balanced by selecting the layout progression appropriately.

If required (e.g. in case of ENOSPC), it may be necessary to allow multiple objects to be allocated from the same OST. In such cases, it would be desirable to add a new layout extent that allocates stripes across a subset of OSTs with available space.

Layout Locking

It is expected that the existing layout lock implementation is sufficient for progressive layouts, and extent-based locking of the layout itself is unnecessary (there will of course still be extent-based locking of the file data itself). This implies that there is a single lock bit that manages the entire layout content, and revokes the whole layout from clients if it needs to be modified in any way. It is expected that the layout for any individual file will only be changed at most a handful of times in its lifetime, so revoking the layout lock a few times is no worse than revoking the object extent locks as happens many times during the lifetime of a file being written concurrently by multiple clients.

Since progressive layouts only change by adding new layout extents at the end of the file, there is no need to invalidate the (meta-)data that is cached under the OST object locks. Clients in the process of writing to a file when the layout lock is revoked may complete the write without any danger. Clients starting new file writes must block until they have the layout lock, since the OST extent locks will not accurately reflect the range of the file that might be modified under a particular lock.

Example Progressive Layouts

Simple PFL Layout

The simplest form of PFL file would have only a small number of components, to best handle small, medium, and large classes of files. For example, on a system with 280 OSTs it would be possible to create a layout that handles small files (below 64MB), medium files (between 64MB-1GB), and large files (above 1GB) differently:

 [0-64MB): 1 object
 [64MB-1GB): 4 objects	(total 5 objs)
 [1GB-EOF): 275 objects 	(total 280 objs)

This results in a fairly compact layout for small files (only 3 components) that could fit into the MDS inode, but provides 275/280 = 98% of the aggregate bandwidth for large files (and/or allows for a few OSTs to be offline when such a large file is created).
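The arithmetic in this example can be checked with a small sketch (component boundaries and stripe counts taken from the list above):

```python
MB = 1 << 20
GB = 1 << 30

# (extent_start, extent_end, stripe_count); None marks EOF.
components = [
    (0,       64 * MB, 1),
    (64 * MB, 1 * GB,  4),
    (1 * GB,  None,    275),
]

total_objects = sum(count for _, _, count in components)
print(total_objects)           # → 280
print(round(100 * 275 / 280))  # → 98 (% of aggregate bandwidth for large files)
```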

Oversubscribed Large PFL Layout

In order to balance space and bandwidth usage against stripe count, one option is to keep a linear relationship between the total number of stripes and the file size. If a stripe is added for each fixed unit of data, with component boundaries at power-of-two size intervals, the aggregate performance would be the same as if the file had been striped over a corresponding number of objects from the start.

For example, adding a stripe for every 128MB of space on a system with 280 OSTs:

 [0-128MB): 1 object
 [128MB-512MB): 3 objects	(total 4 objs)
 [512MB-2GB): 12 objects	(total 16 objs)
 [2GB-8GB): 48 objects	(total 64 objs)
 [8GB-35GB): 216 objects	(total 280 objs)
 [35GB-EOF): 280 objects	(total 560 objs)

This results in a total of 280 objects for the first 35GB (= 128MB * 280) of the file, and each object holds a total of 128MB of file data. If this file were accessed in parallel across the first 35GB, the aggregate bandwidth and space usage for each object would be identical to a file that was striped across 280 objects for the entire 35GB, though it would be sub-optimal for parallel access to smaller ranges of the file. File sizes beyond 35GB would be identical to fully-striped files, at the expense of having twice the overhead for stat and locking operations, and twice the layout size.

The progressive layout should stop growing at the point where the total number of stripes would equal or exceed the number of OSTs. At that point, it would be advantageous to add a final layout extent to EOF that stripes across all available OSTs in order to maximize bandwidth at the end of the file, in case it continues to grow significantly larger in size. This would result in a layout that was somewhat more than twice as large as a file that was striped across all OSTs right from the start.
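Under the assumptions above (one stripe per 128MB of data, boundaries quadrupling, 280 OSTs), the progression can be generated programmatically. This is a sketch of the sizing rule only, not prototype code, and the function name is hypothetical:

```python
MB = 1 << 20

def oversubscribed_layout(num_osts=280, unit=128 * MB):
    # Total stripe count tracks file size at one stripe per `unit` of
    # data, with component boundaries quadrupling until the next step
    # would overshoot num_osts, then a final EOF component over all OSTs.
    components, start, total = [], 0, 0
    while total < num_osts:
        end = unit if start == 0 else start * 4
        if end // unit * 4 > num_osts:  # last growing component:
            end = num_osts * unit       # stop exactly at num_osts stripes
        count = end // unit - total
        components.append((start, end, count))
        start, total = end, total + count
    components.append((start, None, num_osts))  # [35GB-EOF): all OSTs again
    return components

for start, end, count in oversubscribed_layout():
    print(start // MB, end // MB if end else "EOF", count)
```

Running this reproduces the component list above: stripe counts of 1, 3, 12, 48, and 216 up to the 35GB boundary (280 objects), plus the final 280-stripe EOF component (560 objects in total).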

Undersubscribed Large PFL Layout

Alternatively, the last component could cover only the remaining OSTs, if it is clear there will not be enough unused OSTs remaining for the next stage:

 [0-128MB): 1 object
 [128MB-512MB): 3 objects	(total 4 objs)
 [512MB-2GB): 12 objects	(total 16 objs)
 [2GB-8GB): 48 objects	(total 64 objs)
 [8GB-EOF): 216 objects	(total 280 objs)

This means the bandwidth of the end of the file is only 216/280 = 77% of the aggregate, but it means that there is less overhead for accessing the large file due to fewer objects to create, lock, and destroy. This may be beneficial to avoid striping the file across all of the OSTs each time, to avoid problems if some of the OSTs are offline, and to leave some OSTs available for other files that are also being written at the same time.
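The trade-off above is simple arithmetic, sketched here for the 280-OST example:

```python
num_osts = 280
used_by_earlier = 1 + 3 + 12 + 48    # stripes in the first four components
last = num_osts - used_by_earlier    # stripe count for the [8GB-EOF) component
print(last)                          # → 216
print(round(100 * last / num_osts))  # → 77 (% of aggregate bandwidth)
```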

These are purely examples and do not constrain how PFL files are used. The stripe count does not need to increase from one component to the next, nor do the stripe counts or extent boundaries need to be power-of-two values.