An Asynchronous Recovery Algorithm Based on a Staggered Quasi-Synchronous Checkpointing Algorithm

Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes checkpoint almost simultaneously. This causes contention for network stable storage and hence degrades performance. To overcome this problem, checkpoint staggering under which checkpoints by various processes are taken in a staggered manner, has been proposed. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm which reduces contention for network stable storage without any synchronization overhead. We also present an asynchronous recovery algorithm based on the checkpointing algorithm.

[1]  Nancy A. Lynch,et al.  Distributed Algorithms , 1992, Lecture Notes in Computer Science.

[2]  D. Manivannan,et al.  Asynchronous recovery without using vector timestamps , 2002, J. Parallel Distributed Comput..

[3]  D. Manivannan,et al.  Quasi-Synchronous Checkpointing: Models, Characterization, and Classification , 1999, IEEE Trans. Parallel Distributed Syst..

[4]  Leslie Lamport,et al.  Distributed snapshots: determining global states of distributed systems , 1985, TOCS.

[5]  James S. Plank Efficient checkpointing on MIMD architectures , 1993 .

[6]  Robert E. Strom,et al.  Optimistic recovery in distributed systems , 1985, TOCS.

[7]  Jian Xu,et al.  Necessary and Sufficient Conditions for Consistent Global Snapshots , 1995, IEEE Trans. Parallel Distributed Syst..

[8]  D. Manivannan,et al.  A low-overhead recovery technique using quasi-synchronous checkpointing , 1996, Proceedings of 16th International Conference on Distributed Computing Systems.

[9]  RICHARD KOO,et al.  Checkpointing and Rollback-Recovery for Distributed Systems , 1986, IEEE Transactions on Software Engineering.

[10]  Achour Mostéfaoui,et al.  A communication-induced checkpointing protocol that ensures rollback-dependency trackability , 1997, Proceedings of IEEE 27th International Symposium on Fault Tolerant Computing.

[11]  Luís Moura Silva,et al.  Global checkpointing for distributed programs , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.

[12]  Jean-Michel Hélary Observing Global States of Asynchronous Distributed Applications , 1989, WDAG.

[13]  Leslie Lamport,et al.  Time, clocks, and the ordering of events in a distributed system , 1978, CACM.

[14]  Nitin H. Vaidya,et al.  On Checkpoint Latency , 1995 .

[15]  Willy Zwaenepoel,et al.  Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit , 1992, IEEE Trans. Computers.

[16]  Nitin H. Vaidya,et al.  Staggered Consistent Checkpointing , 1999, IEEE Trans. Parallel Distributed Syst..

[17]  David B. Johnson,et al.  Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing , 1988, J. Algorithms.