ZFS: Difference between revisions

From Lustre Wiki
m (note about compression)
(reorganizing, integrating compression article and recordsize article)
 
== Overview ==
The Lustre target file system ''ldiskfs'' (based on ext4) offers no guarantee of data integrity. To improve the reliability and resilience of the backing filesystem on the OST and MDT components, Lustre added support for using ZFS as the backing filesystem for both OSTs and MDTs.


== Features ==
Lustre ZFS targets offer a number of advantages, such as improved data integrity with transaction-based, copy-on-write operations, snapshots, transparent inline compression, and persistent data and metadata checksums on every write.


* ''Hybrid storage support'' - ZFS supports the addition of high-speed I/O devices, such as SSDs, to the same storage pool as HDDs. The Read Cache Pool, or L2ARC, acts as a read-only cache layer between memory and the disk. This support can substantially improve the performance of random read operations. SSDs can also be used to improve metadata write performance, by adding them to the pool as "special" VDEV devices that store only metadata and small files. You can add as many SSDs to your storage pool as you need to increase your read cache size and IOPS, your metadata write IOPS, or both.
* ''Scalability'' - ZFS is a 128-bit file system. This means that it can scale to very large file systems for a single MDT or OST, and the limit on the maximum size of a single file is effectively removed.
== Dataset Properties ==
ZFS datasets have various properties that can control ZFS' behavior.
=== recordsize ===
The <code>recordsize</code> property of ZFS datasets is used to specify the maximum block size for files in the file system. Often, this property does not need to be changed, but for workloads that create very large files, increasing the value of <code>recordsize</code> can deliver a performance benefit. The chosen size must be a power of 2, with a minimum allowed size of 512 bytes. For ZFS on Linux, the maximum value is 128KiB unless the <code>large_blocks</code> pool feature is enabled, in which case it can be raised to 1MiB, or up to 16MiB depending on the ZFS version and the <code>zfs_max_recordsize</code> module parameter.
The default <code>recordsize</code> of 128KiB is not sufficient to sustain the performance of large I/O workloads, so it is recommended to increase <code>recordsize</code> to 1MiB (1024K) to better match the Lustre I/O transaction sizes for block I/O. Lustre 2.11 and later set this value by default on ZFS OSTs.
=== compression ===
ZFS supports transparent compression of data at the block level for datasets. Enabling compression can improve storage efficiency by decreasing the amount of data written to disk, and can improve read performance for large files, as less data needs to be read from disk. The tradeoff is that compression consumes additional CPU resources on the Lustre server that imports a ZFS dataset with compression enabled.
Compression is typically most beneficial for OSTs since this is where most of the filesystem data is stored. Enabling compression for MDTs is safe, but there may not be as much benefit since most of the data on an MDT is file metadata, which ZFS often compresses by default. If you are using Data on MDT (DoM), then there may be a benefit to using compression since some file data is stored on the MDT. Each site will need to evaluate whether to enable compression on OSTs based on its specific needs and performance requirements. Since compression can be enabled or disabled without affecting existing data (the setting applies only to newly written blocks), it is easy to test both scenarios without the risk of data loss.
To check if compression is enabled for an OST, run:
<pre>zfs get compression <OST_dataset></pre>
Example output will look something like this:
<pre>[root@oss1][~]# zfs get compression lustre-OST0000
NAME            PROPERTY     VALUE  SOURCE
lustre-OST0000  compression  off    local</pre>
To enable compression, run:
<pre>zfs set compression=on <OST_dataset></pre>
This turns on compression with the default algorithm (lz4 since ZFS version 0.6.5).
Other compression algorithms are available, and a list of supported algorithms can be found by looking at the '''zfsprops''' man page.  To use a different algorithm, replace the '''compression=on''' option with '''compression=<algorithm>'''.  The '''lz4''' algorithm is likely the best choice for now. It has been tested extensively, and provides very good compression balanced with performance. lz4 also implements an early abort feature<ref>https://github.com/openzfs/zfs/blob/e0bf43d64ed01285321bf6c3a308f699c5483efc/module/zfs/lz4_zfs.c#L520</ref>, meaning it stops trying to compress if initial compression results do not meet a threshold, so performance with incompressible data isn't degraded.
To check compression performance, the compression ratio can be queried on the Lustre server:
<pre>zfs get compressratio <OST_dataset></pre>
From a user's perspective, the logical size and on-disk size of a file can be determined by running the following commands from a Lustre client:
<pre># Check logical usage
du --apparent-size -h <file>
# Check on-disk usage
du -h <file></pre>
==Performance Tuning==
ZFS performance tuning is a complex topic, but some pointers are provided in the following articles:
* [[ZFS_Tunables_for_Lustre_Object_Storage_Servers_(OSS)]]
* [[ZFS_Tunables_for_Lustre_Metadata_Servers_(MDS)]]
==Backups==
ZFS' snapshot functionality, combined with its send and receive features, allows for fairly simple backup and restoration. See [[ZFS_Snapshots_for_MDT_backup]] for a general idea of how to use these features to back up both OSTs and MDTs.


== See Also ==
* [[ZFS OSD Storage Basics]]
* [[ZFS Versions in Official Lustre Releases]]
[[Category:ZFS]]

Latest revision as of 12:19, 5 June 2026

Overview

The Lustre target file system ldiskfs (based on ext4) offers no guarantee of data integrity. To improve the reliability and resilience of the backing filesystem on the OST and MDT components, Lustre added support for using ZFS as the backing filesystem for both OSTs and MDTs.

Features

Lustre ZFS targets offer a number of advantages, such as improved data integrity with transaction-based, copy-on-write operations, snapshots, transparent inline compression, and persistent data and metadata checksums on every write.

Copy-on-write means that ZFS never overwrites existing data. Changed information is written to a new block and the block pointer to in-use data is only moved after the write transaction is completed. This mechanism is used all the way up to the file system block structure at the top block.

To avoid data corruption, ZFS computes checksums of all data and metadata in the filesystem. The checksum is not stored with the data block, but rather in the pointer to the block. All checksums are done in server memory, so errors not caught by other file systems are detected in ZFS, such as:

  • Phantom writes, where the write is dropped on the floor.
  • Misdirected reads or writes, where the disk accesses the wrong block.
  • DMA parity errors between the array and server memory or from the driver, since the checksum validates data inside the array.
  • Driver errors, where data winds up in the wrong buffer inside the kernel.
  • Accidental overwrites, such as swapping to a live file system.
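
The idea of keeping the checksum with the pointer rather than with the data can be sketched in a few lines of shell. This is a conceptual illustration only: ZFS does this inside its block pointers, not with files, and the helpers `write_block` and `read_block` are hypothetical names. `sha256sum` is used here simply because it is widely available, not because it is the checksum a given pool uses.

```shell
#!/bin/sh
# Conceptual sketch only: the "pointer" file stands in for a ZFS block pointer.
# write_block and read_block are hypothetical helpers for illustration.

# Store data from stdin in one file and its checksum in a separate pointer file.
write_block() {
    data_file=$1
    pointer_file=$2
    cat > "$data_file"
    sha256sum < "$data_file" | cut -d' ' -f1 > "$pointer_file"
}

# Return the data only if it still matches the checksum held by the pointer.
read_block() {
    data_file=$1
    pointer_file=$2
    expected=$(cat "$pointer_file")
    actual=$(sha256sum < "$data_file" | cut -d' ' -f1)
    if [ "$expected" = "$actual" ]; then
        cat "$data_file"
    else
        echo "checksum mismatch: $data_file" >&2
        return 1
    fi
}

workdir=$(mktemp -d)
printf 'hello lustre\n' | write_block "$workdir/block" "$workdir/pointer"
read_block "$workdir/block" "$workdir/pointer"      # data matches, so it is returned
printf 'garbage\n' > "$workdir/block"               # simulate silent corruption
read_block "$workdir/block" "$workdir/pointer" || echo "corruption detected"
```

Because the checksum lives apart from the data it describes, a block that is overwritten or misdirected cannot also carry a matching checksum with it, which is why the mismatch is caught on read.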

Lustre support of ZFS offers several specific advantages:

  • Self-healing capability - In a mirrored or RAID configuration, ZFS not only detects data corruption, but it automatically corrects the bad data.
  • Improved administration - Because ZFS detects and reports data corruption on all read and write errors at the block level, it is easier for system administrators to quickly identify which hardware components are corrupting data. ZFS also has very easy-to-use command-line administration utilities.
  • Hybrid storage support - ZFS supports the addition of high-speed I/O devices, such as SSDs, to the same storage pool as HDDs. The Read Cache Pool, or L2ARC, acts as a read-only cache layer between memory and the disk. This support can substantially improve the performance of random read operations. SSDs can also be used to improve metadata write performance, by adding them to the pool as "special" VDEV devices that store only metadata and small files. You can add as many SSDs to your storage pool as you need to increase your read cache size and IOPS, your metadata write IOPS, or both.
  • Scalability - ZFS is a 128-bit file system. This means that it can scale to very large file systems for a single MDT or OST, and the limit on the maximum size of a single file is effectively removed.

Dataset Properties

ZFS datasets have various properties that can control ZFS' behavior.

recordsize

The recordsize property of ZFS datasets is used to specify the maximum block size for files in the file system. Often, this property does not need to be changed, but for workloads that create very large files, increasing the value of recordsize can deliver a performance benefit. The chosen size must be a power of 2, with a minimum allowed size of 512 bytes. For ZFS on Linux, the maximum value is 128KiB unless the large_blocks pool feature is enabled, in which case it can be raised to 1MiB, or up to 16MiB depending on the ZFS version and the zfs_max_recordsize module parameter.

The default recordsize of 128KiB is not sufficient to sustain the performance of large I/O workloads, so it is recommended to increase recordsize to 1MiB (1024K) to better match the Lustre I/O transaction sizes for block I/O. Lustre 2.11 and later set this value by default on ZFS OSTs.
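
The constraints above (a power of 2, at least 512 bytes, at most 16MiB) can be checked before a value is applied. The following is a minimal sketch: `is_valid_recordsize` and `recordsize_command` are hypothetical helpers, not part of ZFS, and the dataset name is borrowed from the example output elsewhere on this page. The command is printed rather than executed so the sketch is safe to try on any machine.

```shell
#!/bin/sh
# Sketch: validate a candidate recordsize (in bytes) before applying it.
# is_valid_recordsize is a hypothetical helper, not part of ZFS.
is_valid_recordsize() {
    size=$1
    max=${2:-16777216}   # assumed upper bound: 16MiB
    # Reject anything that is not a plain positive decimal integer.
    case $size in
        ''|0*|*[!0-9]*) return 1 ;;
    esac
    [ "$size" -ge 512 ] || return 1
    [ "$size" -le "$max" ] || return 1
    # A power of two has exactly one bit set, so size & (size - 1) is 0.
    [ $(( size & (size - 1) )) -eq 0 ]
}

# Print (rather than run) the zfs command for a dataset and size.
recordsize_command() {
    dataset=$1
    size=$2
    if is_valid_recordsize "$size"; then
        echo "zfs set recordsize=$size $dataset"
    else
        echo "invalid recordsize: $size" >&2
        return 1
    fi
}

recordsize_command lustre-OST0000 1048576
```

Whether a given pool accepts the value still depends on its enabled features and ZFS version, so the result of `zfs get recordsize <OST_dataset>` after the change remains the authoritative check.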

compression

ZFS supports transparent compression of data at the block level for datasets. Enabling compression can improve storage efficiency by decreasing the amount of data written to disk, and can improve read performance for large files, as less data needs to be read from disk. The tradeoff is that compression consumes additional CPU resources on the Lustre server that imports a ZFS dataset with compression enabled.

Compression is typically most beneficial for OSTs since this is where most of the filesystem data is stored. Enabling compression for MDTs is safe, but there may not be as much benefit since most of the data on an MDT is file metadata, which ZFS often compresses by default. If you are using Data on MDT (DoM), then there may be a benefit to using compression since some file data is stored on the MDT. Each site will need to evaluate whether to enable compression on OSTs based on its specific needs and performance requirements. Since compression can be enabled or disabled without affecting existing data (the setting applies only to newly written blocks), it is easy to test both scenarios without the risk of data loss.

To check if compression is enabled for an OST, run:

zfs get compression <OST_dataset>

Example output will look something like this:

[root@oss1][~]# zfs get compression lustre-OST0000
NAME            PROPERTY     VALUE  SOURCE
lustre-OST0000  compression  off    local
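
For scripting, the same check can be reduced to the VALUE column. The sketch below assumes the value is obtained with the `-H` (no header) and `-o value` options of `zfs get`; `compression_enabled` is a hypothetical helper, and the value is hard-coded from the example output so the snippet runs without a ZFS pool.

```shell
#!/bin/sh
# compression_enabled is a hypothetical helper; it classifies the VALUE column.
compression_enabled() {
    case $1 in
        off|'') return 1 ;;   # off, or no value at all (e.g. the query failed)
        *)      return 0 ;;   # on, lz4, gzip, ...
    esac
}

# On a real server the value would come from:
#   value=$(zfs get -H -o value compression lustre-OST0000)
value=off   # taken from the example output
if compression_enabled "$value"; then
    echo "compression is enabled ($value)"
else
    echo "compression is disabled"
fi
```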

To enable compression, run:

zfs set compression=on <OST_dataset>

This turns on compression with the default algorithm (lz4 since ZFS version 0.6.5).

Other compression algorithms are available, and a list of supported algorithms can be found by looking at the zfsprops man page. To use a different algorithm, replace the compression=on option with compression=<algorithm>. The lz4 algorithm is likely the best choice for now. It has been tested extensively, and provides very good compression balanced with performance. lz4 also implements an early abort feature[1], meaning it stops trying to compress if initial compression results do not meet a threshold, so performance with incompressible data isn't degraded.

To check compression performance, the compression ratio can be queried on the Lustre server:

zfs get compressratio <OST_dataset>

From a user's perspective, the logical size and on-disk size of a file can be determined by running the following commands from a Lustre client:

# Check logical usage
du --apparent-size -h <file>

# Check on-disk usage
du -h <file>
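
The two `du` figures can be combined into a rough per-file ratio. This is a sketch under stated assumptions: it relies on GNU `du` (for `--apparent-size` and `-B1`) and `awk`, and `file_compress_ratio` is a hypothetical helper name. The result is only an estimate and will not necessarily match the dataset-wide `compressratio` reported by ZFS, since on-disk usage also reflects allocation overhead and sparse regions.

```shell
#!/bin/sh
# file_compress_ratio is a hypothetical helper, assuming GNU du and awk.
# Usage: file_compress_ratio <file>
file_compress_ratio() {
    file=$1
    # Logical size in bytes (what the application wrote).
    logical=$(du --apparent-size -B1 "$file" | cut -f1)
    # Allocated size in bytes (what is reported as used on disk).
    on_disk=$(du -B1 "$file" | cut -f1)
    awk -v l="$logical" -v d="$on_disk" 'BEGIN {
        if (d == 0) { print "n/a"; exit }   # e.g. empty or not-yet-flushed file
        printf "%.2f\n", l / d
    }'
}
```

A value above 1.00 suggests the file occupies less space on disk than its logical size; recently written files may report misleading figures until the data has been flushed to the OSTs.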

Performance Tuning

ZFS performance tuning is a complex topic, but some pointers are provided in the following articles:

  • ZFS Tunables for Lustre Object Storage Servers (OSS)
  • ZFS Tunables for Lustre Metadata Servers (MDS)

Backups

ZFS' snapshot functionality, combined with its send and receive features, allows for fairly simple backup and restoration. See ZFS_Snapshots_for_MDT_backup for a general idea of how to use these features to back up both OSTs and MDTs.

See Also

  • ZFS OSD Storage Basics
  • ZFS Versions in Official Lustre Releases