The performance of coordinated and independent checkpointing

Checkpointing is an effective technique for tolerating failures in distributed and parallel applications. The algorithms in the literature fall into two main classes: coordinated and independent checkpointing. This paper presents an experimental study that compares the performance of these two classes. Our main conclusion is that coordinated checkpointing is more efficient than independent checkpointing, and that none of the arguments raised against the performance of coordinated algorithms was borne out in practice.
