Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-scale hierarchical storage systems

Non-volatile devices, such as SSDs, will be an integral part of the deepening storage hierarchy on large-scale HPC systems. These devices can reside on the compute nodes as part of a distributed burst buffer service, or they can be external. Wherever they are located in the hierarchy, one critical design issue is SSD endurance under write-heavy workloads, such as checkpoint I/O for scientific applications. For these environments, it is widely assumed that checkpoint operations can occur once every 60 min and that each checkpoint step can write out as much as half of the system memory. Unfortunately, for large-scale HPC applications, the burst buffer SSDs can wear out much more quickly given the extensive amount of data written at every checkpoint step. One possible solution is to control the amount of data written by reducing the checkpoint frequency. However, a direct effect of reduced checkpoint frequency is a longer window of vulnerability to system failures and therefore potentially wasted computation time, especially for large-scale compute jobs. In this paper, we propose a new checkpoint placement optimization model which collaboratively utilizes both the burst buffer and the parallel file system to store the checkpoints, with the design goals of maximizing computation efficiency while guaranteeing the SSD endurance requirements. Moreover, we present an adaptive algorithm which can dynamically adjust the checkpoint placement based on the system's dynamic runtime characteristics and continuously optimize the burst buffer utilization. The evaluation results show that by using our adaptive checkpoint placement algorithm we can guarantee the burst buffer endurance with at most 5% performance degradation per application and less than 3% for the entire system.
A thorough analysis of both failure patterns and runtime characteristics of HPC systems.
A new checkpoint placement model for optimizing large-scale hierarchical storage systems usage.
A novel adaptive algorithm that can dynamically optimize the checkpoint placement.
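
The endurance pressure described in the abstract can be illustrated with back-of-the-envelope arithmetic. The following is a minimal sketch, not the placement model or the adaptive algorithm proposed in the paper. It only compares the daily checkpoint write volume with a daily SSD write budget and derives the share of checkpoint data the burst buffer could absorb, with the remainder going to the parallel file system. All function names and numbers (600 TB of system memory, 1200 TB of burst buffer capacity, 3 drive writes per day) are hypothetical, and write amplification is ignored.

```python
def daily_checkpoint_writes_tb(memory_tb, fraction_written, interval_min):
    """Data written per day (TB) if every checkpoint goes to the burst buffer."""
    checkpoints_per_day = 24 * 60 / interval_min
    return memory_tb * fraction_written * checkpoints_per_day


def daily_endurance_budget_tb(capacity_tb, dwpd):
    """Daily writes (TB) the SSDs can absorb and still reach their rated lifetime.

    dwpd is the rated number of full drive writes per day (assumed value).
    """
    return capacity_tb * dwpd


def burst_buffer_share(demand_tb, budget_tb):
    """Largest fraction of checkpoint data that fits within the write budget."""
    if demand_tb <= 0:
        return 1.0
    return min(1.0, budget_tb / demand_tb)


# Hypothetical system: checkpoint every 60 min, half of memory per checkpoint.
demand = daily_checkpoint_writes_tb(memory_tb=600, fraction_written=0.5, interval_min=60)
budget = daily_endurance_budget_tb(capacity_tb=1200, dwpd=3)
share = burst_buffer_share(demand, budget)

print(f"checkpoint writes per day: {demand:.0f} TB")
print(f"endurance budget per day:  {budget:.0f} TB")
print(f"share of checkpoint data the burst buffer can take: {share:.0%}")
```

With these assumed numbers the checkpoint demand is twice the write budget, so only half of the checkpoint data could be placed on the burst buffer without exceeding the endurance target. A static split like this ignores failure patterns and runtime behavior, which is the gap the adaptive placement in the paper is meant to address.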
