An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems

Checkpointing and rollback recovery are widely used techniques for achieving fault tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm with the following desirable features. A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes learn of the initiation through information piggybacked on application messages or, if necessary, through limited control messages. When a process learns of a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing it, as in existing communication-induced checkpointing algorithms). After taking a tentative checkpoint, a process starts logging in memory the messages it sends and receives. When a process learns that every other process has taken a tentative checkpoint corresponding to the current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to stable storage. The tentative checkpoints, together with the message logs in stable storage, form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by each taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes become part of the same consistent global checkpoint. The sequence numbers that a process assigns to its checkpoints increase monotonically, and checkpoints with the same sequence number form a consistent global checkpoint. We also present a performance evaluation of our algorithm.
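The steps described in the abstract can be illustrated with a small single-threaded simulation. This is only a rough sketch under stated assumptions, not the paper's algorithm: the names `Message`, `Process`, `initiate`, `send`, `receive`, `csn` and `taken` are hypothetical, process state is reduced to an integer, "stable storage" is a Python list, control messages are not modeled, and a newer initiation simply overwrites an unflushed tentative checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: int
    csn: int            # sender's checkpoint sequence number (piggybacked)
    taken: frozenset    # processes the sender knows have checkpointed at csn


@dataclass
class Process:
    pid: int
    all_pids: frozenset
    state: int = 0
    csn: int = 0
    tentative: object = None                     # (csn, saved state) or None
    log: list = field(default_factory=list)      # in-memory message log
    taken: set = field(default_factory=set)      # known checkpointed processes
    stable: list = field(default_factory=list)   # stand-in for stable storage

    def _take_tentative(self, csn):
        self.csn = csn
        self.tentative = (csn, self.state)
        self.log = []
        self.taken = {self.pid}

    def initiate(self):
        # Any process may start a new global checkpoint on its own.
        self._take_tentative(self.csn + 1)

    def send(self, payload):
        if self.tentative is not None:
            self.log.append(("sent", payload))
        return Message(payload, self.csn, frozenset(self.taken))

    def receive(self, msg):
        self.state += msg.payload                # process the message first
        if msg.csn > self.csn:
            self._take_tentative(msg.csn)        # checkpoint AFTER processing
        elif self.tentative is not None:
            self.log.append(("recv", msg.payload))
        if msg.csn == self.csn:
            self.taken |= msg.taken              # merge piggybacked knowledge
        # Flush once every process is known to have checkpointed.
        if self.tentative is not None and self.taken == self.all_pids:
            self.stable.append((self.tentative, list(self.log)))
            self.tentative, self.log = None, []


pids = frozenset({0, 1, 2})
p0, p1, p2 = (Process(i, pids) for i in range(3))

p0.initiate()                  # p0 starts checkpoint number 1
p1.receive(p0.send(5))         # p1 learns of it and checkpoints
p2.receive(p1.send(7))         # p2 checkpoints and already knows all three
p0.receive(p2.send(1))         # p0 learns everyone has checkpointed
p1.receive(p0.send(2))         # so does p1
```

In this run all three processes end up with one flushed entry carrying sequence number 1, which mirrors the abstract's statement that checkpoints with the same sequence number belong to the same global checkpoint. Messages exchanged between taking and flushing a tentative checkpoint appear in the flushed log.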
