Paper information - Checkpointing with mutable checkpoints

There are two approaches to reducing the overhead associated with coordinated checkpointing: the first is to minimize the number of synchronization messages and the number of checkpoints; the second is to make the checkpointing process non-blocking. In our previous work (IEEE Parallel Distributed Systems 9 (12) (1998) 1213), we proved that there does not exist a non-blocking algorithm that forces only a minimum number of processes to take their checkpoints. In this paper, we present a min-process algorithm that relaxes the non-blocking condition while trying to minimize the blocking time, and a non-blocking algorithm that relaxes the min-process condition while minimizing the number of checkpoints saved on stable storage. The proposed non-blocking algorithm is based on the concept of a "mutable checkpoint", which is neither a tentative checkpoint nor a permanent checkpoint. Based on mutable checkpoints, our non-blocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on stable storage.
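
The abstract does not spell out the life cycle of a mutable checkpoint. The following Python sketch shows one possible reading, under assumptions that are not stated in the abstract: a mutable checkpoint is kept off stable storage, it is promoted to a tentative checkpoint only if a checkpoint request later reaches the process, and it is discarded otherwise. The names `Kind`, `Process`, `take_mutable`, `on_request`, and `on_commit` are hypothetical and serve only as illustration; this is not the authors' algorithm.

```python
from enum import Enum, auto


class Kind(Enum):
    MUTABLE = auto()    # assumed: cheap local copy, not on stable storage
    TENTATIVE = auto()  # written to stable storage, not yet committed
    PERMANENT = auto()  # committed part of a consistent global checkpoint


class Process:
    """Hypothetical process tracking the state of its pending checkpoint."""

    def __init__(self, name):
        self.name = name
        self.checkpoint = None   # kind of the pending checkpoint, if any
        self.stable_writes = 0   # number of writes to stable storage

    def take_mutable(self):
        # Assumed behavior: save state locally without touching stable storage.
        if self.checkpoint is None:
            self.checkpoint = Kind.MUTABLE

    def on_request(self):
        # A checkpoint request arrived, so this process is needed for the
        # consistent global state: its checkpoint goes to stable storage.
        if self.checkpoint is None or self.checkpoint is Kind.MUTABLE:
            self.stable_writes += 1
            self.checkpoint = Kind.TENTATIVE

    def on_commit(self):
        # At commit time, tentative checkpoints become permanent and
        # unneeded mutable checkpoints are simply discarded.
        if self.checkpoint is Kind.TENTATIVE:
            self.checkpoint = Kind.PERMANENT
        elif self.checkpoint is Kind.MUTABLE:
            self.checkpoint = None


def demo():
    needed, not_needed = Process("p1"), Process("p2")
    for p in (needed, not_needed):
        p.take_mutable()
    needed.on_request()          # only p1 receives a checkpoint request
    for p in (needed, not_needed):
        p.on_commit()
    return needed, not_needed
```

In this sketch only the process that receives a request writes to stable storage, which is the sense in which mutable checkpoints can keep the number of stable-storage checkpoints small.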

Mukesh Singhal | Guohong Cao

[1] Mukesh Singhal, et al. Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems, 1996, IEEE Trans. Parallel Distributed Syst.

[2] Leslie Lamport, et al. Distributed snapshots: determining global states of distributed systems, 1985, TOCS.

[3] Richard Koo, et al. Checkpointing and Rollback-Recovery for Distributed Systems, 1986, IEEE Transactions on Software Engineering.

[4] Luís Moura Silva, et al. Global checkpointing for distributed programs, 1992, Proceedings 11th Symposium on Reliable Distributed Systems.

[5] Leslie Lamport, et al. Time, clocks, and the ordering of events in a distributed system, 1978, CACM.

[6] Jian Xu, et al. Necessary and Sufficient Conditions for Consistent Global Snapshots, 1995, IEEE Trans. Parallel Distributed Syst.

[7] Robert E. Strom, et al. Optimistic recovery in distributed systems, 1985, TOCS.

[8] Mukesh Singhal, et al. On Coordinated Checkpointing in Distributed Systems, 1998, IEEE Trans. Parallel Distributed Syst.

[9] J. van Leeuwen, et al. Theoretical Computer Science, 2003, Lecture Notes in Computer Science.

[10] Junguk L. Kim, et al. An Efficient Protocol for Checkpointing Recovery in Distributed Systems, 1993, IEEE Trans. Parallel Distributed Syst.

[11] Nitin H. Vaidya, et al. Staggered Consistent Checkpointing, 1999, IEEE Trans. Parallel Distributed Syst.

[12] Shing-Tsaan Huang, et al. Detecting termination of distributed computations by external agents, 1989, Proceedings. The 9th International Conference on Distributed Computing Systems.

[13] Mukesh Singhal, et al. On the impossibility of min-process non-blocking checkpointing and an efficient checkpointing algorithm for mobile computing systems, 1998, Proceedings. 1998 International Conference on Parallel Processing.

[14] Ten-Hwang Lai, et al. On Distributed Snapshots, 1987, Inf. Process. Lett.

[15] Madalene Spezialetti, et al. Efficient Distributed Snapshots, 1986, ICDCS.

[16] Yong Deng, et al. Checkpointing and rollback-recovery algorithms in distributed systems, 1994, J. Syst. Softw.

[17] Willy Zwaenepoel, et al. The performance of consistent checkpointing, 1992, Proceedings 11th Symposium on Reliable Distributed Systems.

[18] Mukesh Singhal, et al. Maximal global snapshot with concurrent initiators, 1994, Proceedings of 1994 6th IEEE Symposium on Parallel and Distributed Processing.

[19] Parameswaran Ramanathan, et al. Use of Common Time Base for Checkpointing and Rollback Recovery in a Distributed System, 1993, IEEE Trans. Software Eng.