File:LUG2019-Long Distance Lustre-Gautam.pdf

Lustre is widely used in HPC datacenters with InfiniBand, Omni-Path, Aries and Ethernet fabrics. Lustre networking (LNET) plays a big role in how Lustre devices communicate with each other. An LNET router is a great way to bridge different network fabrics so that clients and servers on different fabrics can communicate with each other. Deploying multiple LNET routers on the path to a file system also adds resiliency.

We installed a brand new HPC datacenter about 30 miles (48 km) away from an existing datacenter. Both datacenters use Lustre file systems on a flat InfiniBand network. This presentation explains how we connected the two datacenters so that Lustre clients in one datacenter can access Lustre file systems in the other datacenter across town. Since long-distance InfiniBand is expensive and complex, we chose a high-speed Ethernet network for the long-distance link and used IB-Ethernet LNET routers on both ends to bridge the two fabrics. We will show the various OS and Lustre tunings on the LNET routers, Lustre clients and servers that need to be performed to maximize throughput, and present some test results. We will also present the challenges we faced along the way and how we resolved or mitigated them. The system is now in production and is exceeding our expectations.
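
As a rough illustration of why tuning matters at this distance, the sketch below estimates the round-trip propagation delay over 48 km of optical fiber and the resulting bandwidth-delay product, that is, the amount of data that must be in flight to keep the link full. It is not taken from the presentation. The refractive index (1.468) and the 100 Gb/s link speed are assumptions chosen for illustration, and a real fiber route is usually longer than the point-to-point distance.

```python
# Back-of-the-envelope estimate of latency and bandwidth-delay product (BDP)
# for a long-distance link. All constants below are illustrative assumptions.

C_VACUUM_KM_S = 299_792.458      # speed of light in vacuum, km/s
FIBER_REFRACTIVE_INDEX = 1.468   # assumed typical value for silica fiber


def round_trip_time_s(distance_km: float) -> float:
    """Round-trip propagation delay in seconds over a fiber of this length."""
    speed_in_fiber_km_s = C_VACUUM_KM_S / FIBER_REFRACTIVE_INDEX
    return 2.0 * distance_km / speed_in_fiber_km_s


def bandwidth_delay_product_bytes(link_gbps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep a link of this speed busy."""
    return link_gbps * 1e9 * rtt_s / 8.0


if __name__ == "__main__":
    distance_km = 48.0        # distance stated in the abstract
    assumed_link_gbps = 100.0  # hypothetical link speed, not from the source

    rtt = round_trip_time_s(distance_km)
    bdp = bandwidth_delay_product_bytes(assumed_link_gbps, rtt)

    print(f"Propagation RTT: {rtt * 1e3:.3f} ms")
    print(f"BDP at {assumed_link_gbps:.0f} Gb/s: {bdp / 1e6:.2f} MB")
```

Under these assumptions the propagation delay alone is roughly half a millisecond per round trip, before any switch, router or software latency is added. Socket buffers and the number of concurrent in-flight requests then have to be sized with that figure in mind.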