An optimal checkpoint/restart model for a large scale high performance computing system

The growing physical size of high performance computing (HPC) platforms makes system reliability more challenging. To minimize the performance loss (rollback and checkpoint overheads) due to unexpected failures or to the unnecessary overhead of fault-tolerance mechanisms, we present a reliability-aware method for an optimal checkpoint/restart strategy. Our scheme addresses the fault-tolerance challenge, especially in large-scale HPC systems, by providing optimal checkpoint placement techniques derived from the actual system reliability. Unlike existing checkpoint models, which can handle only Poisson failures and a constant checkpoint interval, our model can deal with a varying checkpoint interval and with different failure distributions. In addition, the approach considers optimality with respect to both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques.
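For context, the kind of existing model the abstract contrasts itself with (Poisson failures, constant interval) can be illustrated by Young's classical first-order approximation, in which the checkpoint interval is sqrt(2 * C * M) for checkpoint cost C and mean time between failures M. The sketch below shows only that baseline, not the model proposed in this paper; the function names and the sample values (a 5-minute checkpoint, a 24-hour MTBF) are assumptions chosen for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Constant checkpoint interval from Young's first-order approximation.

    Assumes exponentially distributed failures (a Poisson failure process)
    and a checkpoint cost that is small compared with the MTBF.
    Both arguments must use the same time unit.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of time lost, to first order.

    checkpoint_cost / interval : overhead of writing checkpoints
    interval / (2 * mtbf)      : expected rework (rollback) after a failure
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost = 300.0     # hypothetical checkpoint cost: 5 minutes, in seconds
    mtbf = 86400.0   # hypothetical MTBF: 24 hours, in seconds
    t_opt = young_interval(cost, mtbf)
    print(f"interval: {t_opt:.0f} s, waste: {first_order_waste(t_opt, cost, mtbf):.3f}")
```

Because this baseline fixes a single interval and assumes a constant failure rate, it cannot adapt when the failure rate changes over time, which is the limitation the abstract says the proposed model removes.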
