Optimistic protocols for fault tolerance in distributed systems

This dissertation focuses on the use of message logging for recovering from process failures in distributed systems. Optimistic message logging protocols assume that failures are rare and, based on this assumption, try to reduce the failure-free overhead. We have proved several fundamental results about optimistic logging protocols. We have designed a protocol that allows the user of a system to tune the degree of optimism, providing a trade-off between failure-free overhead and recovery efficiency. Special cases of this protocol include an existing optimistic protocol and an existing pessimistic protocol. We have also studied extensions of optimistic protocols to multi-threaded environments. The natural extensions force a trade-off between false causality and failure-free overhead. We avoid this trade-off by treating threads as the unit of recovery and processes as the unit of failure. The protocols mentioned so far do not depend on the characteristics of any particular application. The fault-tolerance overhead can sometimes be reduced by exploiting the specific characteristics of an application. We have demonstrated this reduction in the context of optimistic computations. Specifically, we have developed a new fault-tolerant optimistic simulation protocol.
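
The trade-off between failure-free overhead and recovery efficiency can be pictured with a toy model. The sketch below is not the protocol developed in the dissertation; it is a simplified illustration under stated assumptions. The class `TunableLogger`, the knob `max_unlogged`, and the helper `run` are hypothetical names introduced here. The knob bounds how many delivered messages may reside only in volatile memory: a value of 0 behaves pessimistically (every message is forced to stable storage on delivery), while a very large value behaves optimistically (nothing is forced, so a crash loses more).

```python
from dataclasses import dataclass, field


@dataclass
class TunableLogger:
    """Toy receiver-side message log with a tunable degree of optimism.

    max_unlogged is a hypothetical knob (an assumption of this sketch):
    the number of delivered messages allowed to exist only in volatile
    memory before they are forced to stable storage.
    """
    max_unlogged: int
    volatile: list = field(default_factory=list)  # lost on a crash
    stable: list = field(default_factory=list)    # survives a crash
    flushes: int = 0                              # proxy for failure-free overhead

    def deliver(self, msg):
        # Record the message in volatile memory, then flush if the
        # number of unlogged messages exceeds the configured bound.
        self.volatile.append(msg)
        if len(self.volatile) > self.max_unlogged:
            self.flush()

    def flush(self):
        # Move all volatile messages to stable storage in one write.
        if self.volatile:
            self.stable.extend(self.volatile)
            self.volatile.clear()
            self.flushes += 1

    def crash(self):
        # Simulate a process failure: volatile state is lost.
        # Returns the number of messages that cannot be replayed.
        lost = len(self.volatile)
        self.volatile.clear()
        return lost


def run(max_unlogged, n_messages=10):
    """Deliver n_messages, then crash; return (flushes, lost messages)."""
    log = TunableLogger(max_unlogged)
    for i in range(n_messages):
        log.deliver(f"m{i}")
    flushes = log.flushes
    lost = log.crash()
    return flushes, lost


if __name__ == "__main__":
    for k in (0, 3, 10**9):
        print(k, run(k))
```

In this model, lowering `max_unlogged` increases the number of stable-storage writes during failure-free operation and reduces the work lost at a crash; raising it does the opposite. Real optimistic protocols must also track dependencies between processes so that states depending on lost messages can be rolled back, which this sketch deliberately omits.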
