ZFS OSD

Background
The purpose of this page is to document architecture and requirements related to Lustre servers using the DMU. It considers which features should be added to ZFS as more storage management features are added to Lustre, and vice versa, and documents how such features might be added to ZFS, based on discussions with Bill Moore from Sun and internal discussions in the Lustre Group.

The DMU offers benefits for Lustre but it is not a perfect marriage. The approach below is based on:
 * 1) low risk: use methods we know
 * 2) time to market: stick with methods we use under ldiskfs
 * 3) low controversy: start with something that ZFS can deliver without modifications
 * 4) few initial enhancements: there are a few ZFS enhancements that would be highly beneficial

File system formats
The Lustre servers will interface with the DMU in such a way that the disk images that are created can be mounted as ZFS file systems. While this is not strictly necessary, it will retain the transparency of where the data is, and allow debugging Lustre-ZFS filesystems by mounting them locally on the server.

ZFS has an exceptionally rich "fork" feature (similar to extended attributes), and this can be used to build, for example, object indexes in a way consistent with ZFS disk file system images.

OST

 * Object index: we stick with a directory hierarchy under O. Because sequences will be important, the hierarchy will be /O/<seq>/<objid>, where <seq> and <objid> are formatted as variable-width hexadecimal ASCII numbers. The last part of the pathname points to a ZFS regular file.
 * Reference to MDS FID: Each object requires a reference to the MDS FID to which it belongs. For this we need a relatively small extended attribute stored as a ZFS system attribute.
 * Size on MDS, HSM: these also require extended attributes on the objects, which are stored as separate system attributes.
 * Larger blocks: we believe that for HPC applications larger blocks of at least 1MB are desirable for performance reasons.
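
The object index naming above can be illustrated with a minimal sketch. It assumes that the two path components under /O are a sequence number and an object ID, both written in lowercase hexadecimal without zero padding. The function names are hypothetical and are not part of Lustre or ZFS.

```python
from typing import Tuple


def ost_object_path(seq: int, objid: int) -> str:
    """Build the pathname of an OST object under /O.

    Assumption: both components are variable-width (unpadded) hex.
    """
    if seq < 0 or objid < 0:
        raise ValueError("seq and objid must be non-negative")
    return "/O/{:x}/{:x}".format(seq, objid)


def parse_ost_object_path(path: str) -> Tuple[int, int]:
    """Recover (seq, objid) from a pathname built by ost_object_path()."""
    parts = path.split("/")
    # A valid path splits into ["", "O", "<seq>", "<objid>"].
    if len(parts) != 4 or parts[0] != "" or parts[1] != "O":
        raise ValueError("not an OST object path: {!r}".format(path))
    return int(parts[2], 16), int(parts[3], 16)
```

Because the numbers are variable-width, small sequences and object IDs give short names, and a path can be mapped back to its numeric components when inspecting a locally mounted image.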

MDT

 * FID to object mapping: We propose to use ZAP OI files for this purpose, hashed over several ZAP files to reduce locking contention.
 * Readdir: readdir must return the FID to the client. The FID will be put in an EA of the znode in the first implementation, but this leads to every znode being read during readdir. For improved performance the FID will also be stored in the ZAP entry for the name as two 64-bit integers after the dnode number.
 * File layout: this needs to go into a system attribute if small, or an extended attribute if it is large (which may be slower). Using larger dnodes seems the right way to go, but these require further changes to ZFS.
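
The FID handling described above can be sketched as follows. This is an illustration only: the number of OI files, the CRC32 hash, the little-endian byte order, and all function names are assumptions and do not reflect the actual ZAP on-disk format or the Lustre implementation. The FID is treated as two opaque 64-bit integers, as in the text.

```python
import struct
import zlib

NUM_OI_FILES = 16  # hypothetical number of OI ZAP files


def oi_file_index(fid_hi, fid_lo, num_files=NUM_OI_FILES):
    """Pick which OI file holds the mapping for a FID.

    Spreading FIDs over several files reduces contention on any one of them.
    """
    key = struct.pack("<QQ", fid_hi, fid_lo)
    return zlib.crc32(key) % num_files


def pack_dirent(dnode, fid_hi, fid_lo):
    """Directory entry value: dnode number followed by the FID (3 x uint64)."""
    return struct.pack("<QQQ", dnode, fid_hi, fid_lo)


def unpack_dirent(value):
    """Return (dnode, fid); fid is None for an entry that has no FID appended."""
    if len(value) == 8:
        return struct.unpack("<Q", value)[0], None
    dnode, fid_hi, fid_lo = struct.unpack("<QQQ", value)
    return dnode, (fid_hi, fid_lo)
```

With the FID carried in the entry value, a readdir can return name and FID from the directory alone, without opening each znode to read its EA.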

Larger dnodes with embedded EAs
To avoid the extra indirection to other data structures currently seen with ZFS xattrs, larger dnodes in ZFS that can contain small EAs (large enough for most Lustre EAs) are very attractive. The Large dnode pool feature is under development.

ZFS large dnodes and ZFS_TinyZAP

Larger block size
For HPC applications, a 128K block size is probably considerably too small. We will need to implement larger block sizes. The ZFS 1MB Block Size feature was landed for the 0.6.3 release.

Read / Write priorities
ZFS has a simple table to control read/write priorities. Given that writes mostly go to cache and are flushed by background daemons, while reads block applications, reads are often given higher priority, with limitations to prevent starving writes. Henry Newman raised concerns that for the HPCS file system this policy is not necessarily ideal. Bill Moore explained that it is simple to change it through settings in a table.
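
A table-driven policy of this kind can be sketched as below. This is not the ZFS I/O scheduler: the priority values, the skip limit, and the class and method names are all hypothetical, chosen only to show how reads can be preferred while writes are still guaranteed progress, and how the policy changes by editing the table.

```python
from collections import deque

PRIORITY = {"read": 0, "write": 4}  # hypothetical values; lower is served first
MAX_WRITE_SKIPS = 3                 # hypothetical limit to avoid starving writes


class IoQueue:
    def __init__(self, priority=PRIORITY, max_write_skips=MAX_WRITE_SKIPS):
        self.priority = dict(priority)
        self.max_write_skips = max_write_skips
        self.queues = {kind: deque() for kind in self.priority}
        self.write_skips = 0  # times a pending write was passed over

    def submit(self, kind, request):
        self.queues[kind].append(request)

    def next(self):
        """Return the next (kind, request) to issue, or None if idle."""
        # A write that has been passed over too often goes first.
        if self.queues["write"] and self.write_skips >= self.max_write_skips:
            self.write_skips = 0
            return "write", self.queues["write"].popleft()
        # Otherwise serve the pending kind with the best table priority.
        for kind in sorted(self.priority, key=self.priority.get):
            if self.queues[kind]:
                if kind == "write":
                    self.write_skips = 0
                elif self.queues["write"]:
                    self.write_skips += 1
                return kind, self.queues[kind].popleft()
        return None
```

Swapping the two values in the table makes writes the preferred class, which is the kind of adjustment the discussion above refers to.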

Data on Separate Devices
Past parallel file systems and current efforts with MAID arrays have found significant advantages in file systems that store data on separate block devices from metadata. Some users of Lustre already place the journal on a separate device.

In ZFS this is relatively easy to arrange by introducing new classes of vdevs. The block allocator would choose a metadata-class vdev when allocating metadata and a file-data-class vdev when allocating file data. See Jeff Bonwick's blog entry about block allocation. This is currently under development by Intel (2015-09-14).
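
The class-based choice can be sketched as follows. This is a minimal model, not the ZFS allocator: the class names, the Vdev and Pool types, and the "most free space" tie-breaking rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Vdev:
    name: str
    vdev_class: str   # "metadata" or "file-data" (hypothetical class names)
    free_blocks: int


class Pool:
    def __init__(self, vdevs):
        self.vdevs = list(vdevs)

    def allocate(self, is_metadata):
        """Allocate one block and return the name of the vdev it came from."""
        wanted = "metadata" if is_metadata else "file-data"
        candidates = [
            v for v in self.vdevs
            if v.vdev_class == wanted and v.free_blocks > 0
        ]
        if not candidates:
            raise RuntimeError("no free space in class " + wanted)
        # Assumption: prefer the vdev with the most free space in the class.
        target = max(candidates, key=lambda v: v.free_blocks)
        target.free_blocks -= 1
        return target.name
```

For example, a pool with one small fast metadata vdev and several large file-data vdevs would place every metadata block on the fast device and spread file data over the others.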

Migration of Allocation Data
When Lustre's server network striping (SNS) feature is introduced, write calls that overwrite existing data will have to save the old contents before overwriting, for recovery purposes. The SNS architecture proposes to record allocation data of the extent in a log file, from which it will be freed when the network stripe commits globally.

ZFS has a block pointer (BP) list structure that might be very useful to hold such allocation data. It comes with appropriate APIs to free such blocks. The BP list is held in a DMU object.
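
The intended life cycle of such a log can be sketched as below. This models only the bookkeeping: the OverwriteLog class, its methods, and the use of plain strings as block pointers are hypothetical and do not correspond to the ZFS BP list API.

```python
class OverwriteLog:
    """Holds pointers to old blocks until the network stripe commits."""

    def __init__(self, free_block):
        self._free_block = free_block  # callback that releases one block
        self._pending = {}             # stripe transaction id -> old block pointers

    def record(self, txn, block_pointer):
        """Remember the old block replaced by an overwrite in transaction txn."""
        self._pending.setdefault(txn, []).append(block_pointer)

    def commit(self, txn):
        """The stripe committed globally: the old blocks can be freed."""
        for bp in self._pending.pop(txn, []):
            self._free_block(bp)

    def rollback(self, txn):
        """Recovery path: hand back the old blocks so contents can be restored."""
        return self._pending.pop(txn, [])
```

The old blocks stay allocated for as long as the log references them, so the previous contents remain available for recovery until the global commit.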

ZFS system attributes
ZFS has an extended attribute model that is very general and supports large extended attributes.

One issue is that the ZFS xattr model provides no protection for xattrs stored on a file, so a user would be able to corrupt the Lustre EA data with enough effort, even if it is owned by root. The system attributes feature separates internal attributes from user attributes and avoids this issue.

Parity Declustering
Simple parity declustering patterns should be supported. The Parity declustered RAIDz/mirror feature is under development.