Fundamentals of fault-tolerant distributed computing in asynchronous environments

Fault tolerance in distributed computing is a broad area whose substantial literature varies widely in methodology and terminology. This paper aims to structure the area and thereby guide readers into the field. We use a formal approach to define important terms such as fault, fault tolerance, and redundancy. These definitions lead to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this approach helps reveal fundamental structures that contribute to understanding and unifying methods and terminology. Along the way, we survey many existing methodologies and discuss how they relate. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.
