ZFS OSD

From Lustre Wiki
{{Warning|The content on this page is out of date}}
== Background ==
The purpose of this page is to document the architecture and requirements related to Lustre servers using the ZFS DMU interfaces: which features are needed from ZFS as more storage management features are added to Lustre, and vice versa.


The DMU offers benefits for Lustre but it is not a perfect marriage.  The approach described below was based on:
# low risk: use methods we know
# time to market: stick with methods we use under ldiskfs
# low controversy: start with something that ZFS can deliver without modifications
# few initial enhancements: there are a few ZFS enhancements that would be highly beneficial

== Lustre Servers using the DMU ==
=== File system formats ===


The Lustre servers will interface with the DMU in such a way that the disk images that are created can be mounted as ZFS file systems. While this is not strictly necessary, it will retain the transparency of where the data is, and allow debugging Lustre-ZFS filesystems by mounting them locally on the server.

ZFS has an exceptionally rich "fork" feature (similar to extended attributes), and this can be used to store the Lustre-specific extended attributes.  It also has native support for key-value index objects (ZAPs) that can be used, for example, as object indexes in a way consistent with ZFS directories.
==== OST ====


* Object index: we stick with a directory hierarchy under O/.  Because sequences will be important, the hierarchy will be /O/&lt;seq&gt;/&lt;objid&gt;, where &lt;seq&gt; and &lt;objid&gt; are formatted as variable-width hexadecimal ASCII numbers.  The last part of the pathname points to a ZFS regular file.
* Reference to MDS FID: Each object requires a reference to the MDS FID to which it belongs.  For this we need a relatively small extended attribute stored as a ZFS system attribute.
* Size on MDS, HSM: these features also require extended attributes on the objects, which are stored as separate system attributes.
* Larger blocks: we believe that for HPC applications larger blocks of at least 1MB are desirable for performance reasons.
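The object index layout above can be sketched as follows.  This is an illustrative sketch only (the function name and example values are assumptions, not the actual Lustre code): it builds the /O/&lt;seq&gt;/&lt;objid&gt; pathname using variable-width hexadecimal ASCII numbers.

<syntaxhighlight lang="python">
# Illustrative sketch only: build the OST object index pathname
# /O/<seq>/<objid> with variable-width hexadecimal ASCII numbers,
# as described above.  Not the actual Lustre implementation.

def ost_object_path(seq: int, objid: int) -> str:
    """Return the object index pathname for a FID sequence/object id."""
    # "%x" yields variable-width hex with no leading zeros or padding
    return "/O/%x/%x" % (seq, objid)

print(ost_object_path(0x200000400, 0x1a2b))  # prints /O/200000400/1a2b
</syntaxhighlight>

Because the width varies with the value, small sequence and object numbers produce short pathnames while still accommodating the full 64-bit range.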


==== MDT ====


* FID to object mapping: We propose to use ZAP OI files for this purpose, hashed over several ZAP files to reduce locking contention.
* Readdir: readdir must return the FID to the client.  The FID was put in an xattr of the dnode in the first implementation, but this leads to every dnode being read during readdir.  For improved performance the FID is also stored in the ZAP directory entry for the name, as two 64-bit integers after the dnode number.  This allows the FID to be returned efficiently via directory traversal alone.  The ZFS-on-Linux code is modified to ignore the extra integers after the dnode number.  See also [[Architecture - ZFS_TinyZAP|ZFS TinyZAP]].
* File layout: this needs to go into a system attribute if small, or an extended attribute if it is large (which may be slower).  Using [[Architecture_-_ZFS_large_dnodes|larger dnodes]] seems the right way to go, but these require changes to the ZFS on-disk format.
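The readdir scheme above can be sketched as a directory entry value carrying the dnode number followed by the FID in two extra 64-bit words.  The field widths and packing below are illustrative assumptions for exposition (Lustre FIDs consist of a sequence, object id, and version), not the on-disk ZAP format.

<syntaxhighlight lang="python">
import struct

# Illustrative sketch: a ZAP directory entry value holding the dnode
# number followed by the FID packed into two extra 64-bit integers.
# Field widths here are assumptions, not the actual on-disk format.

def pack_dirent_value(dnode: int, fid_seq: int, fid_oid: int, fid_ver: int) -> bytes:
    # squeeze the 32-bit object id and version into the second FID word
    return struct.pack("<QQQ", dnode, fid_seq, (fid_oid << 32) | fid_ver)

def unpack_dirent_value(value: bytes):
    dnode, fid_seq, word2 = struct.unpack("<QQQ", value)
    return dnode, fid_seq, word2 >> 32, word2 & 0xFFFFFFFF

# A reader that only wants the dnode number (e.g. unmodified ZFS code)
# can simply ignore the two trailing integers.
entry = pack_dirent_value(42, 0x200000400, 7, 0)
assert unpack_dirent_value(entry) == (42, 0x200000400, 7, 0)
</syntaxhighlight>

This illustrates why the scheme is backward compatible: code that reads only the leading dnode number is unaffected by the trailing FID words.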


== ZFS Features Required by Lustre ==
 
=== ZFS system attributes ===
ZFS has an extended attribute model that is very general and supports large extended attributes.


One issue is that the ZFS xattr model provides no protection for xattrs stored on a file, so a user would be able to corrupt the Lustre EA data with enough effort, even if it is owned by root.  The system attributes (SA) feature separates internal attributes from user attributes and avoids this issue.  The SA feature has been available since the ZFS 0.5 release.


=== Larger dnodes with embedded xattrs ===


To avoid much indirection to other data structures as is currently seen with ZFS xattrs, larger dnodes in ZFS which can contain small xattrs (large enough for most Lustre xattrs) are very attractive.  The [https://github.com/zfsonlinux/zfs/pull/3542 Large dnode pool feature] was landed for the ZFS 0.7 release.  See also [[Architecture - ZFS large dnodes|ZFS large dnodes]].


=== Larger block size ===


For HPC applications, 128K is probably a blocksize that is considerably too small, especially considering Lustre will send at least 1MB of data per RPC if it is available.  ZFS should use larger block sizes by default for more efficiency.  The [https://github.com/zfsonlinux/zfs/issues/354 ZFS 1MB Block Size] feature was landed for the 0.6.3 release.
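A quick back-of-the-envelope comparison (illustrative arithmetic only) shows why the larger recordsize matters for full-sized RPCs:

<syntaxhighlight lang="python">
# Illustrative arithmetic: how many ZFS blocks one full-sized 1MB
# Lustre RPC spans at different recordsize settings.

RPC_BYTES = 1024 * 1024  # Lustre sends at least 1MB per RPC when available

for recordsize in (128 * 1024, 1024 * 1024):
    blocks = RPC_BYTES // recordsize
    print(f"recordsize={recordsize // 1024}K: {blocks} block(s) per 1MB RPC")
</syntaxhighlight>

With 128K records a single RPC fans out into eight block writes (each with its own checksum and block-pointer overhead), whereas a 1MB recordsize maps each RPC onto a single block.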


=== Read / Write priorities ===


ZFS has a simple table to control read/write priorities.  Given that writes mostly go to cache and are flushed by background daemons, while reads block applications, reads are often given higher priority, with limitations to prevent starving writes.  Henry Newman raised concerns that for the HPCS file system this policy is not necessarily ideal.  Bill Moore explained that it is simple to change it through settings in a table.


=== Data and Metadata on Separate Devices ===


Past parallel file systems and current efforts with MAID arrays have found significant advantages in file systems that store data on separate block devices from metadata.  Some users of Lustre already place the ldiskfs journal on a separate device.   


In ZFS this is relatively easy to arrange by introducing new classes of VDEVs.  The block allocator would choose a metadata-class VDEV if it was allocating metadata and a file-data class if it was allocating file data.  See Jeff Bonwick's blog entry about [http://blogs.sun.com/bonwick/entry/zfs_block_allocation block allocation].  This feature was developed by Intel and is currently in final landing (2018-06-19) for the ZFS 0.8 release.
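The VDEV-class idea can be sketched as a toy allocator.  This is entirely illustrative (device names and the round-robin policy are invented for the example; real ZFS allocation classes are far more involved):

<syntaxhighlight lang="python">
import itertools

# Toy sketch of allocation classes: route metadata blocks and file-data
# blocks to different VDEV groups, round-robin within each group.
# Device names and the round-robin policy are illustrative only.

VDEV_CLASSES = {
    "metadata": itertools.cycle(["ssd0", "ssd1"]),
    "file-data": itertools.cycle(["hdd0", "hdd1", "hdd2"]),
}

def choose_vdev(is_metadata: bool) -> str:
    """Pick the next VDEV from the class matching the block type."""
    return next(VDEV_CLASSES["metadata" if is_metadata else "file-data"])
</syntaxhighlight>

The point of the design is that small, latency-sensitive metadata blocks can land on fast devices while bulk file data fills the large, cheap ones.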


=== Parity Declustering ===
 
Simple parity declustering patterns should be supported for large VDEVs in order to reduce rebuild times.  The [https://github.com/zfsonlinux/zfs/issues/3497 Parity declustered RAIDz/mirror] feature was developed by Intel and is currently preparing for landing in the ZFS 0.8 release.
 
 
 
[[Category: Testing]]
[[Category: ZFS]]

Revision as of 00:05, 19 June 2018
