ZFS: Difference between revisions

Revision as of 13:57, 4 May 2020

The Lustre target file system ldiskfs (based on ext4) offers no guarantee of data integrity. To improve the reliability and resilience of the backing file system, Lustre added support for using ZFS as the backing file system for both OSTs and MDTs.

Lustre ZFS targets offer a number of advantages, such as improved data integrity through transaction-based, copy-on-write operations, snapshots, and persistent data and metadata checksums on every write.

Copy-on-write means that ZFS never overwrites existing data in place. Changed information is written to a new block, and the block pointer to the in-use data is moved only after the write transaction is completed. This mechanism is used all the way up the file system block structure to the top block (the uberblock).
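
The copy-on-write behavior described above can be sketched in a few lines of Python. This is a toy model for illustration only, not ZFS code: the `CowStore` class, its in-memory `blocks` dictionary, and the single-level pointer table are all assumptions made for the example. It shows that an update writes a new data block and a new top block, and only then switches the root pointer, so the previous tree stays intact.

```python
class CowStore:
    """Toy copy-on-write block store (hypothetical, not the ZFS on-disk format)."""

    def __init__(self):
        self.blocks = {}                 # address -> block contents, never edited in place
        self.next_addr = 0
        self.root = self._alloc({})      # top block: maps a name to a data block address

    def _alloc(self, contents):
        """Store contents at a fresh address and return that address."""
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = contents
        return addr

    def update(self, name, data):
        """Write new data, then a new top block, then move the root pointer."""
        pointers = dict(self.blocks[self.root])   # copy the old pointer table
        pointers[name] = self._alloc(data)        # new data goes to a new block
        new_root = self._alloc(pointers)          # new top block references it
        self.root = new_root                      # the pointer switch happens last
        return new_root

    def read(self, name, root=None):
        """Read through the current root, or through an older root if given."""
        root = self.root if root is None else root
        return self.blocks[self.blocks[root][name]]


store = CowStore()
old_root = store.update("a", b"version 1")
new_root = store.update("a", b"version 2")
```

Because the old blocks are never overwritten, reading through `old_root` still returns the earlier contents, which is loosely the property that makes snapshots cheap in a copy-on-write design.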

To detect data corruption, ZFS computes checksums of all data and metadata in the file system. The checksum is not stored with the data block, but rather in the pointer to the block. All checksums are computed in server memory, so ZFS detects errors that other file systems do not catch, such as:

  • Phantom writes, where the write is dropped on the floor.
  • Misdirected reads or writes, where the disk accesses the wrong block.
  • DMA parity errors between the array and server memory or from the driver, since the checksum validates data inside the array.
  • Driver errors, where data winds up in the wrong buffer inside the kernel.
  • Accidental overwrites, such as swapping to a live file system.
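
A minimal Python sketch of why keeping the checksum in the block pointer catches these cases. The `FlakyDisk` class, the dictionary-based pointer, and the use of SHA-256 as the checksum function are assumptions chosen for the example and do not reflect the actual ZFS data structures. The disk silently drops one write (a phantom write); because the expected checksum lives in the parent pointer rather than next to the stale data, the read fails verification instead of returning old contents.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when a block does not match the checksum held in its pointer."""


def checksum(data: bytes) -> str:
    # SHA-256 is a stand-in here; the choice of algorithm is an assumption.
    return hashlib.sha256(data).hexdigest()


class FlakyDisk:
    """Hypothetical disk that can acknowledge a write without storing it."""

    def __init__(self):
        self.blocks = {}
        self.drop_next_write = False

    def write(self, addr, data):
        if self.drop_next_write:
            self.drop_next_write = False   # phantom write: data is lost silently
            return
        self.blocks[addr] = data

    def read(self, addr):
        return self.blocks.get(addr, b"")


def write_block(disk, addr, data):
    """Write data and return a pointer that carries the expected checksum."""
    disk.write(addr, data)
    return {"addr": addr, "checksum": checksum(data)}


def read_block(disk, ptr):
    """Read a block and verify it against the checksum in its pointer."""
    data = disk.read(ptr["addr"])
    if checksum(data) != ptr["checksum"]:
        raise ChecksumError(f"block {ptr['addr']} failed verification")
    return data


disk = FlakyDisk()
good_ptr = write_block(disk, 7, b"old contents")
good_data = read_block(disk, good_ptr)

disk.drop_next_write = True
stale_ptr = write_block(disk, 7, b"new contents")   # the disk drops this write
try:
    read_block(disk, stale_ptr)
    phantom_detected = False
except ChecksumError:
    phantom_detected = True
```

The same check fails for a misdirected read, since a block fetched from the wrong address will not match the checksum recorded in the pointer that was followed.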

Lustre support of ZFS offers several specific advantages:

  • Self-healing capability - In a mirrored or RAID configuration, ZFS not only detects data corruption but also automatically corrects the bad data.
  • Improved administration - Because ZFS detects and reports data corruption at the block level on reads and writes, it is easier for system administrators to quickly identify which hardware components are corrupting data. ZFS also has very easy-to-use command-line administration utilities.
  • Hybrid storage support - ZFS supports adding high-speed I/O devices, such as SSDs, to the same storage pool as HDDs. The L2ARC (Level 2 Adaptive Replacement Cache, also called the Read Cache Pool) acts as a read-only cache layer between memory and the disk, which can substantially improve the performance of random read operations. SSDs can also improve metadata write performance when they are added to the pool as "special" VDEV devices that store only metadata and small files. You can add as many SSDs to your storage pool as you need to increase your read cache size and IOPS, your metadata write IOPS, or both.
  • Scalability - ZFS is a 128-bit file system. This means that a single MDT or OST can scale to a very large file system, and earlier limits on the maximum size of a single file are removed.
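
The self-healing behavior in the first item above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the ZFS implementation: the `mirrored_read` function, the list of in-memory copies standing in for mirror sides, and SHA-256 as the checksum are all hypothetical. The read returns the first copy that matches the expected checksum and rewrites any copy that does not.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is a stand-in; the actual checksum algorithm is an assumption.
    return hashlib.sha256(data).hexdigest()


def mirrored_read(copies, expected):
    """Return verified data from a mirror and repair bad copies in place.

    copies   -- list with one bytes object per mirror side (hypothetical layout)
    expected -- checksum taken from the block pointer
    """
    good = None
    for data in copies:
        if checksum(data) == expected:
            good = data
            break
    if good is None:
        raise ValueError("no mirror side holds a valid copy")

    repaired = []
    for index, data in enumerate(copies):
        if checksum(data) != expected:
            copies[index] = good          # overwrite the bad side with good data
            repaired.append(index)
    return good, repaired


mirror = [b"corrupted bits", b"payload"]
data, repaired = mirrored_read(mirror, checksum(b"payload"))
```

After the call, both sides of `mirror` hold the verified payload, and `repaired` lists the index of the side that was rewritten, which is the kind of information an administrator could use to spot a failing device.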

For more general information about ZFS, see ZFS Resources.