ZFS System Design



=== Introduction ===
When you have a disk array and want to format the drives as OSTs, you face a choice of zpool and OST configuration. What are the actual differences between multiple VDEVs in the same zpool and multiple zpools with a single VDEV in each? This article will help you make the right choice based on performance and reliability.


=== ZPool Configuration ===
Let's start with an actual example: if you have a disk array with 120 disks (e.g. two 6x10 HDD JBODs), you could have 10 zpools, each with a single 9+2 RAID-Z2 VDEV, leaving space for 10 hot spares, SSDs for L2ARC, or a separate Metadata Allocation Class.  You could have 4 zpools, each with three 8+2 RAID-Z2 VDEVs.  You could instead have a single zpool with ten 9+2 RAID-Z2 VDEVs and 10 unused slots, or twelve 8+2 RAID-Z2 VDEVs and no unused slots, or a wide variety of other combinations. The first configuration will have 10 OSTs exported to clients, the second would have 4 OSTs, while the latter two will have only one large OST. These configurations have different trade-offs for performance, manageability, and fault tolerance that make them more or less desirable for various applications.
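
As a rough illustration (not taken from any specific deployment), the first and third layouts above could be created along the following lines. The pool names, the <code>ashift=12</code> setting, and the device names <code>d0</code>..<code>d119</code> are placeholders; a real deployment would use persistent <code>/dev/disk/by-id</code> or multipath device names.
<pre>
# Sketch 1: ten zpools, one OST each, each zpool a single 9+2 RAID-Z2 VDEV.
# Device names d0..d119 are placeholders; ashift=12 assumes 4K-sector drives.
zpool create -o ashift=12 ost0pool raidz2 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 d10
zpool create -o ashift=12 ost1pool raidz2 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21
# ... and so on for ost2pool through ost9pool

# Sketch 2: one zpool containing ten 9+2 RAID-Z2 VDEVs (a single large OST);
# only the first three VDEVs are written out, the remaining seven are similar.
zpool create -o ashift=12 ostpool \
    raidz2 d0  d1  d2  d3  d4  d5  d6  d7  d8  d9  d10 \
    raidz2 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 \
    raidz2 d22 d23 d24 d25 d26 d27 d28 d29 d30 d31 d32
</pre>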


Having a single large zpool with one OST has several benefits:
# '''Easier management:''' It is typically easier to manage fewer OSTs, especially when it comes to hot spare disk management. For example, for the above disk array, you can assign any remaining disks in the zpool as hot spares and ZFS will take care of disk failures automatically (see the sketch after this list). In this configuration, ZFS can tolerate the loss of up to 10 disks in the zpool without data loss, provided no single RAID-Z2 VDEV loses more than two disks before a spare has resilvered.
# '''Redundant metadata efficiency:''' Another benefit is that ZFS stores 3 metadata copies (ditto blocks) for filesystem-internal metadata and 2 copies for file metadata (dnodes, indirect blocks). With more VDEVs in a single zpool, ZFS is able to write the different ditto blocks to different VDEVs, so that all copies of a block do not land on the same VDEV, improving overall reliability.  Having at least 3 VDEVs in a zpool is preferred.
# '''Reduced client overhead:''' Every Lustre client reserves space on every OST (grant) for efficient writeback caching, so having fewer OSTs means less space is reserved for grant.  Also, each client maintains state for every OST connection, so having more OSTs increases client overhead to some extent.
# '''Reduced free space/quota fragmentation:'''  Having fewer, larger OSTs will reduce the chance of any OST becoming full, compared to having more, smaller OSTs. ZFS (like any filesystem) performs worse when there is little free space in a filesystem, so having larger OSTs avoids fragmented allocations.  While free space is typically expressed as a percentage, the absolute amount of free space is also important.  If 5% free space is several TB then ZFS can likely still make efficient 1MB allocations, but if 5% free space is only a few GB spread across the whole zpool then it is more likely to make poor allocations.
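
As a small sketch of the hot spare management mentioned in the first point (the pool name and device names are placeholders), the leftover disks can be added as shared spares for the whole zpool; automatic activation of a spare when a disk faults is handled by the ZFS event daemon (<code>zed</code>):
<pre>
# Add the leftover disks as shared hot spares for the whole zpool
# (pool name "ostpool" and devices d110..d119 are placeholders).
zpool add ostpool spare d110 d111 d112 d113 d114 d115 d116 d117 d118 d119

# Spares show up as AVAIL until zed activates one to replace a faulted disk.
zpool status ostpool
</pre>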


The multiple zpool configuration has some benefits too:
# '''Better disk utilization:''' When Lustre is writing data to ZFS in the single zpool configuration, ZFS uses a round-robin policy to allocate blocks from each VDEV, which spreads sequentially written blocks across all VDEVs in the zpool. If a single-stripe read or write accesses many VDEVs, this can add overhead when reading the blocks back, since it will require seeking on all disks to fetch blocks for the same file.
# '''Consistent I/O performance:''' In the single zpool configuration a filesystem transaction group (TXG) will use many or all VDEVs (dozens of disks), so the actual TXG sync time is limited by the slowest disk in the zpool. If any VDEV is in degraded mode due to failed disks, if disks have remapped sectors or are performing internal housekeeping (recalibration, etc.), or there are other hardware issues, the overall I/O performance will be impacted, because each transaction commit has to stall writes for all of the disks and wait for the TXG sync to complete before it can cache more dirty data in memory (essentially disk-level [https://en.wikipedia.org/wiki/Jitter jitter]).  Having separate zpools allows the TXG sync to happen independently on each zpool, so these (often random) individual disk slowdowns do not slow down all disk operations (see the monitoring sketch after this list).
# '''More independent OST IO streams:''' Clients maintain dirty data and space grant per OST. With the single zpool configuration, even if the one OST is very fast, a single client won't necessarily be able to submit RPCs fast enough to keep it busy, but it will still cause contention (seeks) with other clients using the same OST.  Having more independent OSTs allows more clients to do disk IO without causing too much contention.
# '''Avoid lock contention:''' The ZFS and Lustre code has locks to maintain consistency for reads and writes within a zpool.  Having separate zpools allows more independent locks and results in less contention in software.
# '''Improved failure isolation:'''  If there is a degraded RAID-Z2 VDEV, a full OST, or a total zpool failure/corruption, then the MDS will avoid allocating new objects on that OST, or may stop using it completely. Having more OSTs reduces the performance impact of a single OST going offline.  If an OST fails completely, or needs to be removed from service (due to age, faulty hardware, etc.), then the number of files affected is proportional to the OST size, which affects recovery time (restoring files from backup, or migrating files off the failing OST).
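
As a rough way to observe the behaviour described above (the pool name is a placeholder), the following commands show how writes are spread across the VDEVs of a zpool and whether any VDEV is degraded and slowing down TXG sync:
<pre>
# Per-VDEV bandwidth and IOPS, refreshed every 5 seconds; in a single large
# zpool, sequential Lustre writes will show activity on all VDEVs at once.
zpool iostat -v ostpool 5

# Report only pools that are not healthy (degraded VDEVs, faulted disks, ...).
zpool status -x
</pre>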


=== Conclusion ===
The performance benefits of multiple zpools need to be balanced against the extra management effort for more OSTs, more fragmentation of free space, and grant overhead.
In general, it's better to have multiple zpools for performance reasons but a single zpool for reliability. For the example above, in order to balance the requirements of performance and reliability, it would be reasonable to have 3-5 zpools, each with a couple of 8+2 or 9+2 RAID-Z2 VDEVs and a single hot spare disk. You can probably sell the remaining disks for some beer.
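
A hypothetical sketch of one such zpool follows: two 9+2 RAID-Z2 VDEVs plus one hot spare, created with <code>zpool</code> and then formatted as a Lustre OST. The filesystem name, OST index, MGS NID, mount point, and device names are all placeholders, not values from this article.
<pre>
# One of the 3-5 recommended zpools: two 9+2 RAID-Z2 VDEVs and one hot spare
# (device names d0..d22 are placeholders for /dev/disk/by-id paths).
zpool create -o ashift=12 ost0pool \
    raidz2 d0  d1  d2  d3  d4  d5  d6  d7  d8  d9  d10 \
    raidz2 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 \
    spare  d22

# Format a dataset in the existing zpool as a Lustre OST and mount it
# (fsname "testfs", index 0, and MGS NID 10.0.0.1@tcp are placeholders).
mkfs.lustre --ost --backfstype=zfs --fsname=testfs --index=0 \
    --mgsnode=10.0.0.1@tcp ost0pool/ost0
mkdir -p /mnt/lustre-ost0
mount -t lustre ost0pool/ost0 /mnt/lustre-ost0
</pre>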


=== DRAID ===
(to be continued)
