Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

High-performance computing (HPC) systems are growing more powerful by using more hardware components. As the system mean-time-before-failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. Multi-level checkpointing potentially solves this problem by combining multiple types of checkpoints, with different costs and different levels of resiliency, in a single run. This solution employs lightweight checkpoints to handle the most common failure modes and relies on more expensive checkpoints for less common but more severe failures. This theoretically promising approach has not been fully evaluated in a large-scale, production system context. We have designed the Scalable Checkpoint/Restart (SCR) library, a multi-level checkpoint system that writes checkpoints to RAM, Flash, or disk on the compute nodes in addition to the parallel file system. We present the performance and reliability properties of SCR as well as a probabilistic Markov model that predicts its performance on current and future systems. We show that multi-level checkpointing improves efficiency on existing large-scale systems and that this benefit increases as the system size grows. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. This leads to a gain in machine efficiency of up to 35%, and it reduces the load on the parallel file system by a factor of two on current and future systems.
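The intuition behind the multi-level approach can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not the Markov model described above; it assumes Young's first-order approximation for the optimum checkpoint interval, treats the two failure classes as independent, and ignores restart costs. The function names and the numbers (a 6-hour system MTBF, a 600-second parallel file system checkpoint, a cheap checkpoint 100x faster) are hypothetical; only the 85% coverage figure comes from the text.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order optimum compute time between checkpoints (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(checkpoint_cost, mtbf):
    """First-order overhead per unit of compute time at Young's interval.

    Sums the time spent writing checkpoints and the expected rework after
    a failure. Only meaningful when checkpoint_cost is much smaller than mtbf.
    """
    tau = young_interval(checkpoint_cost, mtbf)
    return checkpoint_cost / tau + tau / (2.0 * mtbf)


def two_level_overhead(cheap_cost, pfs_cost, mtbf, cheap_fraction):
    """Crude two-level estimate (an assumption, not the paper's model).

    Failures recoverable from the cheap checkpoint arrive with rate
    cheap_fraction / mtbf; the remaining failures need the parallel
    file system checkpoint. Each level is scheduled independently.
    """
    if not 0.0 < cheap_fraction < 1.0:
        raise ValueError("cheap_fraction must be strictly between 0 and 1")
    cheap_mtbf = mtbf / cheap_fraction
    severe_mtbf = mtbf / (1.0 - cheap_fraction)
    return (overhead_fraction(cheap_cost, cheap_mtbf)
            + overhead_fraction(pfs_cost, severe_mtbf))


# Hypothetical system parameters, in seconds.
MTBF = 6 * 3600.0
PFS_COST = 600.0
CHEAP_COST = PFS_COST / 100.0

single = overhead_fraction(PFS_COST, MTBF)
multi = two_level_overhead(CHEAP_COST, PFS_COST, MTBF, cheap_fraction=0.85)
print(f"single-level overhead: {single:.3f}")
print(f"two-level overhead:    {multi:.3f}")
```

Under these assumed numbers the two-level estimate is lower, because the expensive checkpoint only has to guard against the less common failures and can therefore be taken less often. A full model would also account for restart times and for failures that strike during checkpointing or recovery.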
