Data on MDT Solution Architecture

Introduction
Lustre file system read/write performance is currently optimized for large files (where large is more than a few megabytes in size). In addition to the initial file open RPC to the MDT, there are separate read/write RPCs to the OSTs to fetch the data, as well as disk IO on both MDT and OST for fetching and storing the objects and their attributes. This separation of functionality is acceptable (and even desirable) for large files since one MDT open RPC is typically a small fraction of the total number of read or write RPCs, but this hurts small file performance significantly when there is only a single read or write RPC for the file data. The Data On MDT (DOM) project aims to improve small file performance by allowing the data for small files to be placed only on the MDT, so that these additional RPCs and I/O overhead can be eliminated, and performance correspondingly improved. Since the MDT storage is typically configured as high-IOPS RAID-1+0 and optimized for small IO, Data-on-MDT will also be able to leverage this faster storage. Used in conjunction with the Distributed Namespace (DNE), this will improve efficiency without sacrificing horizontal scale.
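
The RPC argument above can be made concrete with a small counting model. This is only a sketch under stated assumptions: the function name, the 1 MiB bulk RPC size, and the idea that DOM data arrives with the open reply are illustrative simplifications, not measured Lustre behavior.

```python
def rpcs_to_read(file_size: int, rpc_size: int, data_on_mdt: bool) -> int:
    """Count RPCs to open and read a whole file once, in a simplified model."""
    open_rpcs = 1  # the open RPC to the MDT is always needed
    if data_on_mdt:
        # Assumption: the data of a small file comes back with the open reply.
        return open_rpcs
    # Ceiling division: number of bulk read RPCs sent to the OSTs.
    data_rpcs = (file_size + rpc_size - 1) // rpc_size
    return open_rpcs + data_rpcs


MIB = 1 << 20
RPC_SIZE = 1 * MIB  # hypothetical bulk RPC size, chosen only for illustration

small = 4096
large = 100 * MIB
print(rpcs_to_read(large, RPC_SIZE, data_on_mdt=False))  # 101
print(rpcs_to_read(small, RPC_SIZE, data_on_mdt=False))  # 2
print(rpcs_to_read(small, RPC_SIZE, data_on_mdt=True))   # 1
```

In this model the open RPC is about 1% of the traffic for the 100 MiB file but half of it for the 4 KiB file, which is the overhead DOM is meant to remove.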

In order to store file data on the MDT, users or system administrators will need to explicitly specify a layout at file creation time to store the data on the MDT, or set a default layout policy on any directory so that newly created files will store the data on the MDT. This is the same as the current Lustre mechanism to specify the stripe count or stripe size when creating new files with data on the OSTs.

The administrator will be able to specify a maximum file size for files that store their data on the MDT, to avoid users consuming too much space on the MDT and causing problems for other users. If a file's layout specifies that data be stored on the MDT, but the file grows beyond the specified maximum file size, the data will be migrated to a layout with data on OST object(s). It will also be possible to use the file migration mechanism introduced in Lustre 2.4 as part of the HSM project to migrate existing small files that have data on OSTs to the MDT.

Use Cases
New files are created with an explicit layout to store the data on the MDT.

New files are created with an implicit layout to store the data on the MDT inherited from the default layout stored on the parent directory.

A small file is stored without the overhead of creating an OST object, and without OST RPCs.

A small file is retrieved without the overhead of accessing an OST object, and without OST RPCs.

Users, administrators, or policy engines can migrate existing small files stored on OSTs to an MDT.

A client accesses a small file and has the file attributes, lock, and data returned with a single RPC.

An administrator sets a global maximum size limit for small files stored on the MDT(s). Files larger than this value do not store their data on an MDT.

Design a new layout for files with data on MDT
There needs to be some way for the client to know that the data is stored on the MDT. Currently there is no mechanism to record this information. The new DOM file layout is being designed in conjunction with the Layout Enhancement Solution Architecture. The DOM layout will only have a single implicit data stripe, which is the MDT inode itself.

Client IO stack must allow IO requests to an MDT device
The current client IO stack is implemented to perform IO through the OSC layer to OSTs. There is no IO mechanism for the MDC layer. Storing data on the MDT requires that IO be read and written through the MDC layer.

Explicitly allocating files on an MDT by default directory striping
Since Lustre allocates the file layout when the file is first opened, the DOM layout needs to be chosen before any file data is actually written. This means it is not possible to use the file size or amount of data written by the client to decide after the file is opened whether the file data will reside on the MDT or OST. Applications would be able to explicitly specify a DOM layout for newly-created files using existing llapi interfaces. It would also be possible to inherit the default layout from the parent directory to allocate all new files in that directory on the MDT to avoid any changes to the application.
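
The layout choice at create time described above can be sketched as a simple lookup. All names here (Layout, layout_at_create, dir_defaults) are hypothetical and exist only for this illustration. The sketch assumes that an explicit layout wins, then the default layout of the immediate parent directory, then a filesystem-wide default on OSTs. Note that file size is deliberately not an input, since it is unknown when the layout is chosen.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Layout:
    target: str          # "mdt" or "ost"
    stripe_count: int = 1


FS_DEFAULT = Layout("ost")  # assumed filesystem-wide default


def layout_at_create(path: str,
                     explicit: Optional[Layout],
                     dir_defaults: Dict[str, Layout],
                     fs_default: Layout = FS_DEFAULT) -> Layout:
    """Pick the layout for a new file before any data is written."""
    if explicit is not None:
        return explicit                       # application asked for a layout
    parent = path.rsplit("/", 1)[0] or "/"    # immediate parent directory
    return dir_defaults.get(parent, fs_default)


# A directory whose default layout places new files on the MDT.
dir_defaults = {"/proj/small": Layout("mdt")}

print(layout_at_create("/proj/small/a.txt", None, dir_defaults).target)
print(layout_at_create("/proj/big/b.dat", None, dir_defaults).target)
```

With the directory default in place, an unmodified application that creates files under /proj/small gets a DOM layout without any code change.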

It would be possible to allow applications to mknod a file and then truncate it to the final size to allow the MDT to make the layout decision before the file is first opened. This was handled in most previous versions of Lustre, but was removed in recent versions since it was never properly documented and never used by applications to our knowledge. This mechanism would be useful for applications that know (or can reasonably estimate) the final file size in advance, such as tar(1) or cp(1) or applications writing a fixed amount of data to a file (e.g. known array size, or fixed data size per task).

A mechanism to allow files that exceed the small file size limit to grow from MDT to OST
If a client writes to a file with a DOM layout and immediately exceeds the small-file size limit, then under current locking behavior the client would need to flush its dirty cache to the MDT and cancel its layout lock before the file layout could be changed. Clients should be able to avoid this overhead. Instead, it would be preferable to change the layout to contain an OST object and flush the data directly from the client cache to the OST, and not store anything to the MDT. One possibility is to create all small files with an OST object to handle the overflow case, but this would add overhead when it may not be needed. Another option is to allow the client to hold a PW layout lock for the object on the MDS, so that the client can modify the layout directly on the MDT without having to drop the layout lock or flush its cache.

Integrating DOM with PFL will allow files to grow beyond the small MDT limit at the expense of having every file store a "small" part of the file on the MDT. In many cases, since the majority of files (often 90%) are small, while a few large files (5%) consume a majority of space (90%), this does not impose a significant additional burden on the MDT beyond DOM itself. For cases where the file size is known in advance (e.g. migration, HSM, delayed mirroring) it is better to just create the whole file on the OST(s) and skip the small DOM component.
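
A composite layout of this kind can be pictured as two byte-range components: the first part of the file on the MDT and everything after it on OSTs. The sketch below is an assumption-laden model, not Lustre code; the function name and the 1 MiB MDT component size are hypothetical. It shows how a write extent would be divided between the two components.

```python
from typing import List, Tuple

Extent = Tuple[str, int, int]  # (target, start offset, end offset, exclusive)


def split_write(offset: int, length: int,
                mdt_component_size: int) -> List[Extent]:
    """Split the byte range [offset, offset + length) between the components."""
    end = offset + length
    pieces: List[Extent] = []
    if length > 0 and offset < mdt_component_size:
        # Part of the range falls inside the MDT component.
        pieces.append(("mdt", offset, min(end, mdt_component_size)))
    if length > 0 and end > mdt_component_size:
        # The remainder falls inside the OST component.
        pieces.append(("ost", max(offset, mdt_component_size), end))
    return pieces


MIB = 1 << 20
MDT_COMPONENT = 1 * MIB  # hypothetical size of the DOM component

print(split_write(0, 4096, MDT_COMPONENT))       # small file: MDT only
print(split_write(0, 3 * MIB, MDT_COMPONENT))    # large file: MDT then OST
```

A small file never touches an OST in this model, while a large file leaves only its first component on the MDT, which is the cost described in the paragraph above.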

Administration

 * The system administrator sets a filesystem-wide file size limit for the DOM feature via procfs or lctl set_param.
 * The system administrator sets a default layout directly on a file, directory, or filesystem using lfs setstripe.

Small file size limit

 * The small file size limit is the maximum size of a file whose data is stored on the MDT. A file that grows beyond this limit should be migrated to OSTs.
 * The migration process includes creating a new layout for the file, allocating object(s) on the OST(s), and transferring the data from the MDT to the OST object(s).
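
The two points above can be sketched as an in-memory model. This is a toy illustration only: ModelFile, migrate_to_ost, and append are hypothetical names, a single OST object is assumed, and no locking or client cache behavior is modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelFile:
    layout: str = "mdt"  # "mdt" or "ost"
    mdt_data: bytearray = field(default_factory=bytearray)
    ost_objects: List[bytearray] = field(default_factory=list)

    def size(self) -> int:
        if self.layout == "mdt":
            return len(self.mdt_data)
        return sum(len(obj) for obj in self.ost_objects)


def migrate_to_ost(f: ModelFile) -> None:
    """Move a file's data from the MDT to a single OST object."""
    if f.layout == "ost":
        return
    new_layout = "ost"            # 1. create the new layout
    ost_object = bytearray()      # 2. allocate an object on an OST
    ost_object += f.mdt_data      # 3. transfer the data from the MDT
    f.ost_objects = [ost_object]
    f.layout = new_layout
    f.mdt_data = bytearray()      # the MDT no longer holds file data


def append(f: ModelFile, data: bytes, size_limit: int) -> None:
    """Append data, migrating first if the write would exceed the limit."""
    if f.layout == "mdt" and f.size() + len(data) > size_limit:
        migrate_to_ost(f)
    if f.layout == "mdt":
        f.mdt_data += data
    else:
        f.ost_objects[0] += data


f = ModelFile()
append(f, b"a" * 10, size_limit=16)
print(f.layout, f.size())
append(f, b"b" * 10, size_limit=16)
print(f.layout, f.size())
```

The second append crosses the 16-byte limit used here, so the model migrates before writing and the new data goes straight to the OST object.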

MDS_GETATTR request

 * The client gets LDLM locks and the file size, in addition to other metadata attributes, from the MDT in the same RPC.