File:LUG2019-Managing Lustre on AWS FSX-Pollock.pdf

In this presentation we will introduce Amazon FSx for Lustre, a new managed service launched at AWS re:Invent in November 2018. While we already offered Amazon Elastic File System (EFS) as a file system on AWS, we heard from customers that their workloads required greater throughput and lower latencies, and that they were willing to sacrifice durability to achieve that. These customers often described Lustre as the “F1 Ferrari” of fast file systems. Working backwards from customer needs, we introduced an unreplicated managed Lustre service to fill this gap, and it is now a key accelerator of our investment in the High Performance Computing space.

The choice to offer managed Lustre was an obvious business decision, but how to integrate it seamlessly into the AWS ecosystem was not technically obvious. We will walk through the Lustre features we use for this integration, why we had a bias toward building on prior art from the open source community, and some of the technical challenges that come from using these features at Amazon scale. In particular, we will dive deep into our usage of the Hierarchical Storage Management (HSM) feature and the performance challenges of bulk importing and exporting millions, tens of millions, or even hundreds of millions of objects in a spin-up, spin-down compute model. We will walk through an example that uses AWS Batch to orchestrate a customer's deep learning-focused HPC workload built on top of Amazon FSx for Lustre. We will close by describing the gaps we see in the Lustre offering today that are impeding adoption by more customers, and how we can help close those gaps.