Checkpointing with mutable checkpoints

There are two approaches to reduce the overhead associated with coordinated checkpointing: first is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process non-blocking. In our previous work (IEEE Parallel Distributed Systems 9 (12) (1998) 1213), we proved that there does not exist a nonblocking algorithm which forces only a minimum number of processes to take their checkpoints. In this paper, we present a min-process algorithm which relaxes the non-blocking condition while tries to minimize the blocking time, and a non-blocking algorithm which relaxes the min-process condition while minimizing the number of checkpoints saved on the stable storage. The proposed non-blocking algorithm is based on the concept of "mutable checkpoint", which is neither a tentative checkpoint nor a permanent checkpoint. Based on mutable checkpoints, our nonblocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on the stable storage.

[1]  Mukesh Singhal,et al.  Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems , 1996, IEEE Trans. Parallel Distributed Syst..

[2]  Leslie Lamport,et al.  Distributed snapshots: determining global states of distributed systems , 1985, TOCS.

[3]  RICHARD KOO,et al.  Checkpointing and Rollback-Recovery for Distributed Systems , 1986, IEEE Transactions on Software Engineering.

[4]  Luís Moura Silva,et al.  Global checkpointing for distributed programs , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.

[5]  Leslie Lamport,et al.  Time, clocks, and the ordering of events in a distributed system , 1978, CACM.

[6]  Jian Xu,et al.  Necessary and Sufficient Conditions for Consistent Global Snapshots , 1995, IEEE Trans. Parallel Distributed Syst..

[7]  Robert E. Strom,et al.  Optimistic recovery in distributed systems , 1985, TOCS.

[8]  Mukesh Singhal,et al.  On Coordinated Checkpointing in Distributed Systems , 1998, IEEE Trans. Parallel Distributed Syst..

[9]  J. van Leeuwen,et al.  Theoretical Computer Science , 2003, Lecture Notes in Computer Science.

[10]  Junguk L. Kim,et al.  An Efficient Protocol for Checkpointing Recovery in Distributed Systems , 1993, IEEE Trans. Parallel Distributed Syst..

[11]  Nitin H. Vaidya,et al.  Staggered Consistent Checkpointing , 1999, IEEE Trans. Parallel Distributed Syst..

[12]  Shing-Tsaan Huang,et al.  Detecting termination of distributed computations by external agents , 1989, [1989] Proceedings. The 9th International Conference on Distributed Computing Systems.

[13]  Mukesh Singhal,et al.  On the impossibility of min-process non-blocking checkpointing and an efficient checkpointing algorithm for mobile computing systems , 1998, Proceedings. 1998 International Conference on Parallel Processing (Cat. No.98EX205).

[14]  Ten-Hwang Lai,et al.  On Distributed Snapshots , 1987, Inf. Process. Lett..

[15]  Madalene Spezialetti,et al.  Efficient Distributed Snapshots , 1986, ICDCS.

[16]  Yong Deng,et al.  Checkpointing and rollback-recovery algorithms in distributed systems , 1994, J. Syst. Softw..

[17]  Willy Zwaenepoel,et al.  The performance of consistent checkpointing , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.

[18]  Mukesh Singhal,et al.  Maximal global snapshot with concurrent initiators , 1994, Proceedings of 1994 6th IEEE Symposium on Parallel and Distributed Processing.

[19]  Parameswaran Ramanathan,et al.  Use of Common Time Base for Checkpointing and Rollback Recovery in a Distributed System , 1993, IEEE Trans. Software Eng..