Evaluating distributed checkpointing protocols

This paper presents an objective measure, called the overhead ratio, for evaluating distributed checkpointing protocols. This measure extends previous evaluation schemes by incorporating several additional parameters that are inherent in distributed environments. In particular, we take into account the rollback propagation of the protocol, which affects the length of the recovery process and therefore the expected program run time in executions that involve failures and recoveries. The paper also analyzes several known protocols and compares their overhead ratios.
