Selection of a Checkpoint Interval in Coordinated Checkpointing Protocol for Fault Tolerant Open MPI

This paper addresses the selection of an efficient checkpoint interval, one that reduces the total overhead incurred by checkpointing and restarting applications in a distributed system environment. A coordinated checkpointing rollback-recovery protocol is used to make application programs fault tolerant on a stand-alone system under no-load conditions, using BLCR and Open MPI at the system level.
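To make the trade-off behind interval selection concrete, the sketch below uses Young's classic first-order approximation, in which the interval that minimizes expected overhead is sqrt(2 * C * M) for checkpoint cost C and mean time between failures M. This is offered only as an illustration of the general problem and is not necessarily the model used in this paper; the function names and the sample values for C and M are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the optimum checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint (C), in seconds.
    mtbf_s: mean time between failures (M), in seconds.
    Both inputs are assumed values supplied by the caller.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order expected overhead per unit of useful work.

    C / T accounts for time spent writing checkpoints;
    T / (2 * M) accounts for rework, assuming a failure loses
    half an interval on average. Restart cost is ignored here
    because, to first order, it does not depend on T.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


# Hypothetical inputs: 50 s per checkpoint, 10,000 s between failures.
best = young_interval(50.0, 10_000.0)
print(best, overhead_fraction(best, 50.0, 10_000.0))
```

With these assumed inputs the estimate is a 1000 s interval. Checkpointing more often (for example every 500 s) or less often (every 2000 s) both increase the first-order overhead, which is the balance an interval-selection method has to strike.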
