Efficient algorithms for optimistic crash recovery

SummaryRecovery from transient processor failures can be achieved by using optimistic message logging and checkpointing. The faulty processorsroll back, and some/all of the non-faulty processors also may have to roll back. This paper formulates the rollback problem as a closure problem. A centralized closure algorithm is presented together with two efficient distributed implementations. Several related problems are also considered and distributed algorithms are presented for solving them.

[1]  S. Venkatesan,et al.  Crash recovery with little overhead , 1991, [1991] Proceedings. 11th International Conference on Distributed Computing Systems.

[2]  Willy Zwaenepoel,et al.  Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit , 1992, IEEE Trans. Computers.

[3]  Fred B. Schneider,et al.  Byzantine generals in action: implementing fail-stop processors , 1984, TOCS.

[4]  RICHARD KOO,et al.  Checkpointing and Rollback-Recovery for Distributed Systems , 1986, IEEE Transactions on Software Engineering.

[5]  Leslie Lamport,et al.  Distributed snapshots: determining global states of distributed systems , 1985, TOCS.

[6]  David F. Bacon,et al.  Volatile logging in n-fault-tolerant distributed systems , 1988, [1988] The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[7]  S. Venkatesan,et al.  On Finding and Updating Shortest Paths Distributively , 1992, J. Algorithms.

[8]  Pierre A. Humblet,et al.  A Distributed Algorithm for Minimum-Weight Spanning Trees , 1983, TOPL.

[9]  David B. Johnson,et al.  Distributed system fault tolerance using message logging and checkpointing , 1990 .

[10]  L. Alvisi,et al.  Nonblocking and Orphan-Free Message Logging Protocols , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, ' Highlights from Twenty-Five Years'..

[11]  Leslie Lamport,et al.  Time, clocks, and the ordering of events in a distributed system , 1978, CACM.

[12]  Robert E. Strom,et al.  Optimistic recovery in distributed systems , 1985, TOCS.

[13]  David B. Johnson,et al.  Sender-Based Message Logging , 1987 .

[14]  Anita Borg,et al.  A message system supporting fault tolerance , 1983, SOSP '83.

[15]  Brian Randell,et al.  System structure for software fault tolerance , 1975, IEEE Transactions on Software Engineering.

[16]  K. H. Kim,et al.  Programmer-Transparent Coordination of Recovering Concurrent Processes: Philosophy and Rules for Efficient Implementation , 1988, IEEE Trans. Software Eng..

[17]  David B. Johnson,et al.  Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing , 1988, J. Algorithms.

[18]  David L. Presotto,et al.  Publishing: a reliable broadcast communication mechanism , 1983, SOSP '83.

[19]  Butler W. Lampson,et al.  Crash Recovery in a Distributed Data Storage System , 1981 .

[20]  A. Prasad Sistla,et al.  Efficient distributed recovery using message logging , 1989, PODC '89.