File:LUG2019-Lustre 2.12 In Production-Thiell.pdf

Since February 2019, Stanford’s Sherlock cluster has been running Lustre 2.12 in production and taking advantage of recent Lustre features such as Distributed Namespace Environment (DNE), Data-on-MDT (DoM) and Progressive File Layouts (PFL). Managed by the Stanford Research Computing Center, Sherlock is a shared, heterogeneous 1,500-node computing cluster available to the whole Stanford research community. It runs all kinds of research computing applications, from interactive tools to the most taxing HPC and AI workloads. Unlike most large clusters in computing centers, Sherlock is driven by contributions from individual PIs and groups, and as such is constantly evolving. This provides a valuable pool of resources for its 4,000 users, but it also poses a unique set of system administration challenges, especially in data storage.

In my talk, I’ll first describe Sherlock’s new scratch file system design, which focuses on small-file performance and is built around DNE and DoM. I’ll then share feedback with the community about our early experience with Lustre 2.12.