On the role of burst buffers in leadership-class storage systems

The largest-scale high-performance computing (HPC) systems are stretching parallel file systems to their limits in terms of aggregate bandwidth and number of clients. To sustain the scalability of these file systems, researchers and HPC storage architects are exploring various storage system designs. One proposed design integrates a tier of solid-state burst buffers into the storage system to absorb application I/O requests. In this paper, we simulate and explore this storage system design for use by large-scale HPC systems. First, we examine application I/O patterns on an existing large-scale HPC system to identify common burst patterns. Next, we describe enhancements to the CODES storage system simulator that enable our burst buffer simulations. These enhancements include the integration of a burst buffer model into the I/O forwarding layer of the simulator, the development of an I/O kernel description language and interpreter, the development of a suite of I/O kernels derived from observed I/O patterns, and fidelity improvements to the CODES models. We evaluate I/O performance for a set of multiapplication I/O workloads and burst buffer configurations. We show that burst buffers can accelerate the application-perceived throughput to the external storage system and can reduce the amount of external storage bandwidth required to meet a desired application-perceived throughput goal.
