Paper information - Efficient algorithms for optimistic crash recovery

Efficient algorithms for optimistic crash recovery

Summary: Recovery from transient processor failures can be achieved by using optimistic message logging and checkpointing. The faulty processors roll back, and some or all of the non-faulty processors may also have to roll back. This paper formulates the rollback problem as a closure problem. A centralized closure algorithm is presented, together with two efficient distributed implementations. Several related problems are also considered, and distributed algorithms are presented for solving them.
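To illustrate what treating rollback as a closure problem can look like, here is a minimal centralized sketch. It is not the paper's algorithm; it only shows the general fixed-point idea under stated assumptions. Each processor is assumed to log a sequence of states carrying cumulative per-peer send and receive counts, state 0 is assumed to be the initial state with no messages, and a processor is assumed to hold an orphan message when it has received more messages from a peer than that peer, in its restored state, has sent. The names `rollback_closure`, `logs`, and `start` are hypothetical.

```python
from typing import Dict, List, Tuple

# One logged state: (sent, recv), each mapping a peer name to the cumulative
# number of messages sent to / received from that peer. Hypothetical layout.
State = Tuple[Dict[str, int], Dict[str, int]]


def rollback_closure(logs: Dict[str, List[State]],
                     start: Dict[str, int]) -> Dict[str, int]:
    """Return, per processor, the index of the state it ends up restoring."""
    cur = dict(start)
    changed = True
    while changed:  # repeat until no processor has to roll back any further
        changed = False
        for p in logs:
            for q in logs:
                if p == q:
                    continue
                sent_by_q = logs[q][cur[q]][0].get(p, 0)
                # p holds an orphan message from q if it received more than q
                # (in q's restored state) has sent; step p back until it does not.
                while cur[p] > 0 and logs[p][cur[p]][1].get(q, 0) > sent_by_q:
                    cur[p] -= 1
                    changed = True
    return cur


# Assumed scenario: A crashes and can only restore state 1 (one message sent
# to B), while B had already received two messages from A.
logs = {
    "A": [({}, {}), ({"B": 1}, {}), ({"B": 2}, {})],
    "B": [({}, {}), ({}, {"A": 1}), ({}, {"A": 2})],
}
print(rollback_closure(logs, {"A": 1, "B": 2}))  # {'A': 1, 'B': 1}
```

In this sketch the non-faulty processor B is forced back one state because its second received message no longer has a matching send at A, which is the kind of cascading rollback the summary describes. State indices only ever decrease and are bounded below by 0, so the loop terminates.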

Tong-Ying Tony Juang | Subbarayan Venkatesan
