Checkpointing and rollback-recovery algorithms in distributed systems

Abstract: To tolerate arbitrary failures, a distributed system may need to take checkpoints from time to time. After a failure, the system rolls back to checkpoints at which global consistency is preserved. Based on the concept of global consistency defined in this article, which eliminates both received-not-sent and sent-not-received inconsistencies, we developed a synchronous checkpointing algorithm C* and an asynchronous rollback-recovery algorithm R* with O(VE) message complexity and a relatively small constant factor, where V is the number of processors and E is the number of communication links in the system. Neither incoming-message logging nor message-number calculation is needed. Our algorithms work well in complicated situations such as communication loops involving a minimal number of processors. Because they offer instant crash recovery and low overall time, space, and message complexity, these algorithms are particularly well suited to real-time applications where rapid rollback recovery is crucial.
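The two kinds of inconsistency named in the abstract can be illustrated with a small check. This is a minimal sketch and not the article's C* or R* algorithm: it assumes each process numbers its local events, that a checkpoint records every event up to and including a given index, and that each message carries its send and receive event indices. The names `Message`, `classify`, and `is_globally_consistent` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    send_event: int  # local event index of the send at the sender
    recv_event: int  # local event index of the receive at the receiver


def classify(msg, checkpoint):
    """Classify one message against a set of checkpoints.

    `checkpoint` maps a process name to the index of the last local
    event recorded in that process's checkpoint (an assumption of
    this sketch).
    """
    sent = msg.send_event <= checkpoint[msg.sender]
    received = msg.recv_event <= checkpoint[msg.receiver]
    if received and not sent:
        return "received-not-sent"
    if sent and not received:
        return "sent-not-received"
    return "consistent"


def is_globally_consistent(messages, checkpoint):
    """True when no message is received-not-sent or sent-not-received."""
    return all(classify(m, checkpoint) == "consistent" for m in messages)


# One message from p (its event 2) to q (its event 3).
messages = [Message("p", "q", send_event=2, recv_event=3)]

print(classify(messages[0], {"p": 1, "q": 4}))          # receive recorded, send not
print(classify(messages[0], {"p": 2, "q": 2}))          # send recorded, receive not
print(is_globally_consistent(messages, {"p": 2, "q": 3}))
```

In this sketch a set of checkpoints counts as globally consistent only if every message is recorded either at both endpoints or at neither, which is the stricter notion the abstract describes.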
