ZFS System Design

From Lustre Wiki
Revision as of 16:32, 19 January 2018 by Adilger (talk | contribs) (Adilger moved page Zfs system design to ZFS System Design: capitalize title properly)


Introduction

When you have a disk array and want to format the drives as OSTs, it is not obvious how to divide the disks among zpools and OSTs. What are the actual differences between multiple VDEVs in the same pool and multiple pools with a single VDEV each? This article will help you make the right choice based on performance and reliability.

Pool Configuration

Let's start with an actual example: given a disk array with 116 disks, you could create 10 zpools, each containing a single 9+2 RAIDZ2 VDEV, or you could create a single zpool containing 10 of those 9+2 RAIDZ2 VDEVs. The former exports 10 OSTs, while the latter exports only one. I will compare these configurations and make a proposal in the conclusion.
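The two layouts might be sketched as follows. The disk names (`/dev/sda` etc.) and pool names (`ost0`, `ostpool`) are placeholders for illustration only, not part of any recommended naming scheme:

```shell
# Layout A: ten pools, one 9+2 RAIDZ2 VDEV each -> ten OSTs.
# The shell expands {a..k} to eleven placeholder disks per VDEV.
zpool create ost0 raidz2 /dev/sd{a..k}
zpool create ost1 raidz2 /dev/sd{l..v}
# ... eight more pools for the remaining VDEVs ...

# Layout B: one pool containing all ten RAIDZ2 VDEVs -> a single OST.
zpool create ostpool \
    raidz2 /dev/sd{a..k} \
    raidz2 /dev/sd{l..v}
    # ... eight more "raidz2 <11 disks>" groups ...
```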

A single zpool is easier to manage, especially when it comes to hot spare disk management. For example, with the disk array above, you can assign all six remaining disks to the zpool as hot spares, and ZFS will handle disk failures automatically. In this configuration, ZFS can tolerate the loss of up to 8 disks in the pool without data loss (2 failures absorbed by a RAIDZ2 VDEV plus the 6 hot spares, assuming each resilver completes before the next failure). The other benefit concerns ditto blocks: ZFS stores 3 copies of file system metadata and 2 copies of file metadata (indirect blocks). With more VDEVs in the pool, ZFS can place the different ditto copies of a block on different VDEVs, improving overall reliability.
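For the single-pool layout, attaching the leftover disks as shared spares takes one command. Pool and disk names below are placeholders; `autoreplace` is a standard pool property, and spare activation on failure is handled by the ZFS event daemon (zed):

```shell
# Attach the six remaining disks as pool-wide hot spares
# ({a..f} expands to six placeholder disk names).
zpool add ostpool spare /dev/sd{a..f}

# Allow failed devices to be replaced automatically.
zpool set autoreplace=on ostpool

# Spares are listed in their own section of the status output.
zpool status ostpool
```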

The multiple-zpool configuration has some benefits too:

  1. Better read performance. When osd-zfs writes a batch of data to the DMU, the single-zpool configuration allocates blocks from each VDEV in round-robin fashion, so sequentially written blocks end up spread all around the pool. This causes performance problems when they are read back;
  2. Consistent I/O performance. Since a TXG may span many VDEVs in the single-zpool configuration, the actual TXG sync time is limited by the slowest VDEV. If any VDEV is in degraded mode, overall I/O performance suffers, because the DMU has to stall writes and wait for the TXG sync to complete before it can cache more dirty data in memory;
  3. Having more OSTs helps clients stream data better. Clients maintain dirty data and space grants per OST. With the single-pool configuration, even a super-fast OST cannot be used efficiently by the client;
  4. There are also lock contention issues at the pool/Lustre levels that make separate pools perform better, as well as improved failure isolation (e.g. degraded RAID, total pool failure/corruption, etc.), but these need to be balanced against the extra management effort for more OSTs, more fragmentation of free space, and grant overhead.
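A slow or degraded VDEV dragging down TXG sync (point 2 above) is visible with standard zpool tooling. The pool name `ostpool` is a placeholder:

```shell
# Per-VDEV bandwidth and operations, refreshed every 5 seconds;
# one VDEV lagging far behind the others limits the whole TXG.
zpool iostat -v ostpool 5

# Report only pools that are not healthy.
zpool status -x
```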

Conclusion

All in all, separate zpools are better for performance, but a single zpool is better for reliability. For the example above, to balance the requirements of performance and reliability, it would be reasonable to create 5 zpools, each containing two 9+2 RAIDZ2 VDEVs and a single hot spare disk. You can probably sell the remaining disk for some beer.
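The proposed compromise could be sketched like this (disk and pool names are again placeholders):

```shell
# Five pools, each with two 9+2 RAIDZ2 VDEVs (22 disks) plus one
# dedicated hot spare -> 5 x 23 = 115 disks, leaving one of the
# 116 disks unused.
zpool create ost0 \
    raidz2 /dev/sd{a..k} \
    raidz2 /dev/sd{l..v} \
    spare  /dev/sdw
# ... repeat for ost1 .. ost4, each with the next 23 disks ...
```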

DRAID

(to be continued)