File:LUG2019-Quantitative Approach All Flash Lustre-Lozinskiy.pdf

New experimental and AI-driven workloads are moving into the realm of extreme-scale HPC systems at the same time that high-performance flash is becoming cost-effective to deploy at scale. This confluence poses a number of new technical and economic challenges and opportunities in designing the next generation of HPC storage and I/O subsystems to achieve the right balance of bandwidth, latency, endurance, and cost. In this presentation, we describe the quantitative approach to requirements definition that resulted in the 30 PB all-flash Lustre file system that will be deployed with NERSC’s upcoming Perlmutter system in 2020. By integrating analysis of current workloads with projections of future performance and throughput, we were able to constrain many critical design space parameters and quantitatively demonstrate that Perlmutter will not only deliver optimal performance, but also effectively balance cost with capacity, endurance, and many modern features of Lustre.

The National Energy Research Scientific Computing Center (NERSC) will deploy the Perlmutter HPC system in 2020 and has designed it from the ground up to address the needs of both traditional modeling and simulation workloads and emerging data-driven workloads. A foundation of Perlmutter’s data processing capabilities is its 30 PB all-flash Lustre file system, which is designed to provide both high peak bandwidth (4 TB/sec) for checkpoint-restart workloads and high peak I/O operation rates for both data and metadata.

All-flash Lustre file systems have been tested at modest scale, and all-flash burst buffers based on custom file systems are being deployed at large scales. However, completely replacing the proven disk-based high-performance tier with an all-flash tier at scale introduces a number of new questions:

– Is it economically possible to deploy enough flash capacity to replace the scratch tier and burst buffer tier?

– How much capacity is enough capacity for a scratch file system? What should the purge policy be to manage this capacity?

– Will the SSDs wear out too quickly? What drive endurance rating is required?
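
The endurance question above is usually expressed in drive writes per day (DWPD): the daily write volume, scaled by write amplification, divided by the deployed capacity. The following is a minimal sketch of that arithmetic, not the analysis used in the presentation. The function name `required_dwpd`, the 3 PB/day write volume, and the write amplification factor of 2.0 are illustrative assumptions; only the 30 PB capacity comes from the text.

```python
def required_dwpd(daily_writes_pb, capacity_pb, write_amplification=1.0):
    """Return the drive-writes-per-day rating needed to absorb a daily write volume.

    daily_writes_pb:     application-level data written per day, in PB
    capacity_pb:         total flash capacity of the file system, in PB
    write_amplification: ratio of bytes written to flash per byte written
                         by applications (assumed, workload dependent)
    """
    if capacity_pb <= 0:
        raise ValueError("capacity_pb must be positive")
    return daily_writes_pb * write_amplification / capacity_pb


# 30 PB is the capacity stated in the text. The 3 PB/day write volume and the
# write amplification factor of 2.0 are hypothetical placeholders.
dwpd = required_dwpd(daily_writes_pb=3.0, capacity_pb=30.0, write_amplification=2.0)
print(f"required endurance: {dwpd:.2f} DWPD")
```

Under these placeholder inputs the required rating is well below one drive write per day. The real answer depends on the measured write volume and write amplification of the production workload.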

In addition, deploying an all-flash Lustre file system at scale poses a unique set of design questions:

– What new Lustre features are required to get the maximum performance from the SSDs?

– How much flash capacity should be provisioned for metadata versus data?

– Using Lustre’s new Data-on-MDT feature, what is the optimal default file layout to balance low latency, high bandwidth, and overall system cost?

In this presentation, we describe the sources of data and analytical methods applied during the design of the Perlmutter file system’s specifications to answer these questions.