Lessons Learned in Deploying the World’s Largest Scale Lustre File System

The Spider system at the Oak Ridge National Laboratory’s Leadership Computing Facility (OLCF) is the world’s largest-scale Lustre parallel file system. Envisioned as a shared parallel file system capable of meeting both the bandwidth and capacity requirements of the OLCF’s diverse computational environment, the project had a number of ambitious goals. To support the workloads of the OLCF’s diverse computational platforms, the aggregate performance and storage capacity of Spider exceed those of our previously deployed systems by factors of 6x (240 GB/sec) and 17x (10 Petabytes), respectively. Furthermore, Spider supports over 26,000 clients concurrently accessing the file system, nearly 4x more than our previously deployed systems. In addition to these scalability challenges, moving to a center-wide shared file system required dramatically improved resiliency and fault-tolerance mechanisms. This paper details our efforts in designing, deploying, and operating Spider. Through a phased approach of research and development, prototyping, deployment, and transition to operations, this work has yielded a number of insights into large-scale parallel file system architectures, from both the design and the operational perspectives. We present our solutions to issues such as network congestion, performance baselining and evaluation, file system journaling overheads, and high availability in a system with tens of thousands of components. We also discuss areas of continued challenge, such as stressed metadata performance and the need for file system quality of service, along with our efforts to address them. Finally, operational aspects of managing a system of this scale are discussed, along with real-world data and observations.
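For a rough sense of what these multipliers imply about the prior-generation systems, the short sketch below simply divides Spider’s stated aggregate figures by the quoted improvement factors. The derived baseline numbers (~40 GB/sec, ~0.6 PB, ~6,500 clients) are back-of-the-envelope inferences from this abstract alone, not figures reported in the paper.

```python
# Illustrative calculation only: baselines implied by the abstract's multipliers,
# not values taken from the paper itself.
spider_bandwidth_gbs = 240    # Spider aggregate bandwidth, GB/sec
spider_capacity_pb = 10       # Spider storage capacity, Petabytes
spider_clients = 26_000       # concurrent Lustre clients

bandwidth_factor = 6          # "a factor of 6x"
capacity_factor = 17          # "and 17x"
client_factor = 4             # "nearly 4x" more clients

print(f"Implied prior bandwidth: ~{spider_bandwidth_gbs / bandwidth_factor:.0f} GB/sec")
print(f"Implied prior capacity:  ~{spider_capacity_pb / capacity_factor:.2f} PB")
print(f"Implied prior clients:   ~{spider_clients / client_factor:,.0f}")
```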
