ZFS System Design

From Lustre Wiki

When you have a disk array and want to format its drives as OSTs, you face a choice of zpool and OST configuration. What are the actual differences between multiple VDEVs in the same zpool and multiple zpools with a single VDEV in each? This article will help you make the right choice based on performance and reliability.

ZPool Configuration

Let's start with an actual example: if you have a disk array with 120 disks (e.g. two 60-disk (6x10) HDD JBODs), you could choose to have 10 zpools, each with a single 9+2 RAID-Z2 VDEV, leaving space for 10 hot spares, SSDs for L2ARC, or a separate Metadata Allocation Class. You could have 4 zpools with three 8+2 VDEVs each. You could instead have a single zpool with ten 9+2 RAID-Z2 VDEVs and 10 unused slots, or twelve 8+2 RAID-Z2 VDEVs and no unused slots, or a wide variety of other combinations. The first configuration would have 10 OSTs exported to clients, the second would have 4 OSTs, while the latter two would have only one large OST. These configurations have different trade-offs for performance, manageability, and fault tolerance that make them more or less desirable for various applications.
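
The disk arithmetic behind these four layouts can be checked with a short sketch. The `Layout` class and its field names are illustrative only and are not part of any Lustre or ZFS tool; the numbers come from the example above.

```python
from dataclasses import dataclass

TOTAL_DISKS = 120  # two 60-disk JBODs, as in the example above


@dataclass(frozen=True)
class Layout:
    """Hypothetical helper describing one zpool/VDEV layout."""
    zpools: int            # number of zpools (one OST per zpool)
    vdevs_per_zpool: int   # RAID-Z2 VDEVs in each zpool
    data_disks: int        # data disks per VDEV (the "9" in 9+2)
    parity_disks: int = 2  # RAID-Z2 uses two parity disks per VDEV

    @property
    def osts(self) -> int:
        return self.zpools

    @property
    def disks_used(self) -> int:
        width = self.data_disks + self.parity_disks
        return self.zpools * self.vdevs_per_zpool * width

    @property
    def unused_slots(self) -> int:
        return TOTAL_DISKS - self.disks_used


LAYOUTS = {
    "10 zpools x 1 VDEV (9+2)": Layout(10, 1, 9),
    "4 zpools x 3 VDEVs (8+2)": Layout(4, 3, 8),
    "1 zpool x 10 VDEVs (9+2)": Layout(1, 10, 9),
    "1 zpool x 12 VDEVs (8+2)": Layout(1, 12, 8),
}

for name, layout in LAYOUTS.items():
    print(f"{name}: OSTs={layout.osts} "
          f"disks used={layout.disks_used} unused slots={layout.unused_slots}")
```

Only the 9+2 layouts leave slots free for hot spares or SSDs; the 8+2 layouts consume all 120 slots.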

Having a single large zpool with one OST has several benefits:

  1. Easier management: It is typically easier to manage fewer OSTs, especially when it comes to managing hot spare disks. For the disk array above, you can assign all remaining disks in the zpool as hot spares, and ZFS will handle disk failures automatically. In this configuration, ZFS can tolerate the loss of up to 10 disks in the zpool without data loss, provided that no single RAID-Z2 VDEV loses more than two disks before a spare has finished resilvering.
  2. Redundant metadata efficiency: Another benefit is that ZFS will store 3 metadata copies (ditto blocks) for filesystem internal metadata and 2 copies for file metadata (dnodes, indirect blocks). With more VDEVs in a single zpool, ZFS will be able to choose different VDEVs to write different ditto blocks, therefore improving the overall reliability and avoiding redundant writes to a single VDEV. Having at least 3 VDEVs in a zpool is preferred.
  3. Reduced client overhead: Every Lustre client reserves space on every OST (grant) for efficient writeback caching, so having fewer OSTs means less space is reserved for grant. Also, each client maintains state for every OST connection, so having more OSTs increases client overhead to some extent.
  4. Reduced free space/quota fragmentation: Having fewer, larger OSTs will reduce the chance of any OST becoming full, compared to having more, smaller OSTs. ZFS (like any filesystem) performs worse when there is little free space in a filesystem, so having larger OSTs avoids fragmented allocations. While free space is typically expressed as a percentage, the absolute amount of free space is also important. If 5% of free space is several TB, then ZFS can likely still make efficient 1MB allocations, but if 5% of free space is a few GB spread across the whole zpool, then it is more likely to make poor allocations.
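
As a rough illustration of item 3, the total space reserved for grant grows with the product of clients and OSTs. The sketch below is only arithmetic: the client count and the per-client, per-OST grant size are hypothetical placeholder values, not Lustre defaults.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def total_grant_bytes(clients: int, osts: int, grant_per_client_per_ost: int) -> int:
    """Space reserved across the filesystem if every client holds
    the same grant on every OST (a simplifying assumption)."""
    return clients * osts * grant_per_client_per_ost


CLIENTS = 1000    # hypothetical number of Lustre clients
GRANT = 32 * MIB  # hypothetical grant per client per OST, NOT a Lustre default

for osts in (1, 4, 10):
    reserved = total_grant_bytes(CLIENTS, osts, GRANT)
    print(f"{osts:2d} OSTs: {reserved / GIB:.1f} GiB reserved for grant")
```

Under these assumptions the reserved space scales linearly with the OST count, so the 10-OST layout reserves ten times as much as the single-OST layout.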

The multiple zpool configuration has some benefits too:

  1. Better disk utilization: When Lustre writes data to ZFS in the single zpool configuration, ZFS uses a round-robin policy to allocate blocks from each VDEV, which spreads sequentially written blocks across all VDEVs in the zpool. If a single-stripe read/write accesses many VDEVs, this can add overhead when reading the blocks back, since it requires seeking on all disks to fetch blocks for the same file.
  2. Consistent I/O performance: In the single zpool configuration a filesystem transaction group (TXG) will use many or all VDEVs (using dozens of disks), so the actual TXG sync time will be limited by the slowest disk in the zpool. If any VDEV is in degraded mode due to failed disks, if disks have remapped sectors or perform internal housekeeping (internal recalibration, etc.), or if there are other hardware issues, the overall I/O performance will be impacted (essentially disk-level jitter), because each transaction commit has to stall writing for all of the disks and wait for TXG sync to complete before it can cache more dirty data in memory. Having separate zpools allows the TXG sync to happen independently on each zpool, so these (often random) individual disk slowdowns do not slow down all disk operations.
  3. More independent OST IO streams: Clients maintain dirty data and space grant per OST. With the single zpool configuration, even if it provides a very fast OST, a client won't necessarily be able to submit RPCs fast enough for that OST, but it will cause contention (seeks) with other clients using the same OST. Having more independent OSTs allows more clients to do disk IO without causing too much contention.
  4. Avoid lock contention: The ZFS/Lustre code has locks to maintain consistency for reads and writes in a zpool. Having separate zpools allows more independent locks and results in less contention in software.
  5. Improved failure isolation: If there is a degraded RAID-Z2 VDEV, a full OST, or total zpool failure/corruption, etc. then the MDS will avoid allocating new objects on this OST, or may stop using it completely. Having more OSTs reduces the performance impact of a single OST going offline. If an OST fails completely, or needs to be removed from service (due to age, faulty hardware, etc), then the number of files affected is proportional to the OST size, which affects recovery time (restore files from backup, migrate files off the failing OST).
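
The "slowest disk" effect in item 2 can be put into rough numbers with a simple probability model. This is a sketch under a strong assumption that is not from the article: each disk independently has a small probability of being slow during any given TXG sync. The function name and the 1% figure are illustrative only.

```python
def p_sync_slowed(disks_in_zpool: int, p_disk_slow: float) -> float:
    """Probability that at least one disk in the zpool is slow during a
    TXG sync, assuming independent per-disk slowdowns (an assumption)."""
    return 1.0 - (1.0 - p_disk_slow) ** disks_in_zpool


P_SLOW = 0.01  # hypothetical chance that a given disk is slow during one sync

# One zpool with ten 9+2 VDEVs versus ten zpools with one 9+2 VDEV each.
single_pool = p_sync_slowed(110, P_SLOW)
small_pool = p_sync_slowed(11, P_SLOW)

print(f"single 110-disk zpool: {single_pool:.1%} of syncs hit a slow disk")
print(f"each 11-disk zpool:    {small_pool:.1%} of syncs hit a slow disk")
```

In the single large zpool every slow disk delays the one shared TXG sync, while with separate zpools a slow disk only delays the sync of its own zpool and the other OSTs continue unaffected.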


The benefits of more, smaller OSTs need to be balanced against the extra management effort, greater fragmentation of free space, and higher grant overhead that they bring. In general, multiple zpools are better for performance, while a single zpool is better for reliability. For the example above, a reasonable balance between performance and reliability would be 3-5 zpools, each with a couple of 8+2 or 9+2 RAID-Z2 VDEVs and a single hot spare disk. You can probably sell the remaining disks for some beer.
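
One possible layout within that range is five zpools, each with two 9+2 RAID-Z2 VDEVs and one hot spare. The sketch below only builds generic `zpool create` command lines as strings and does not run them. The pool names (`ostpool0`, ...) and device names (`disk000`, ...) are hypothetical placeholders, and the `plan_zpools` helper is illustrative rather than part of any existing tool.

```python
from itertools import islice
from typing import List, Tuple


def plan_zpools(total_disks: int, zpools: int, vdevs_per_zpool: int,
                vdev_width: int, spares_per_zpool: int = 1) -> Tuple[List[str], int]:
    """Return (command lines, leftover disk count) for a RAID-Z2 layout."""
    per_pool = vdevs_per_zpool * vdev_width + spares_per_zpool
    needed = zpools * per_pool
    if needed > total_disks:
        raise ValueError(f"layout needs {needed} disks, only {total_disks} available")

    # Hypothetical device names; substitute the real device paths.
    disks = iter(f"disk{i:03d}" for i in range(total_disks))
    commands = []
    for pool in range(zpools):
        parts = ["zpool", "create", f"ostpool{pool}"]
        for _ in range(vdevs_per_zpool):
            parts.append("raidz2")
            parts.extend(islice(disks, vdev_width))
        if spares_per_zpool:
            parts.append("spare")
            parts.extend(islice(disks, spares_per_zpool))
        commands.append(" ".join(parts))
    return commands, total_disks - needed


# Five zpools, two 9+2 (11-disk) RAID-Z2 VDEVs each, one hot spare per zpool.
commands, leftover = plan_zpools(120, zpools=5, vdevs_per_zpool=2, vdev_width=11)
for cmd in commands:
    print(cmd)
print(f"disks left over: {leftover}")
```

This particular choice uses 115 of the 120 disks. Picking fewer zpools or 8+2 VDEVs changes the leftover count accordingly.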


(to be continued)