Distributed Fault Tolerance - Lessons Learned from Delta-4

Software-implemented fault tolerance is resilient to technological change, because advances in hardware technology do not force an extensive redesign of specialized hardware. This paper argues the case for implementing fault tolerance in a distributed fashion and reports on the approach adopted in the European Delta-4 project. Fault tolerance is achieved by replicating capsules (the run-time representation of application objects) on distributed nodes interconnected by a local area network. Capsule groups can be configured to tolerate either stopping failures or arbitrary failures. Multipoint protocols are used to coordinate capsule groups and to support error processing and fault treatment. The paper concludes with a critical analysis of the project's results.
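The difference between the two failure assumptions can be illustrated with a small sketch. This is a generic illustration of output selection over a replica group, not the Delta-4 protocols themselves. The function names `replicas_needed` and `select_output` are hypothetical, as is the convention that `None` stands for a replica that stopped without producing an output. The sketch also assumes that every correct replica has processed the same inputs in the same order, which is the kind of guarantee the multipoint protocols are meant to provide.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def replicas_needed(f: int, arbitrary: bool) -> int:
    """Group size needed to mask f faulty replicas in this simplified model.

    Stopping failures: one surviving replica is enough, so f + 1.
    Arbitrary failures: correct replicas must outvote faulty ones, so 2f + 1.
    """
    return 2 * f + 1 if arbitrary else f + 1


def select_output(outputs: Sequence[Optional[Hashable]], arbitrary: bool) -> Hashable:
    """Pick the group's output from the per-replica outputs.

    `None` marks a replica that stopped and produced nothing (assumption).
    """
    received = [o for o in outputs if o is not None]
    if not received:
        raise RuntimeError("every replica in the group has failed")

    if not arbitrary:
        # Stopping failures only: any output that arrives is trusted.
        return received[0]

    # Arbitrary failures: accept a value only if a strict majority
    # of the whole group agrees on it.
    value, count = Counter(received).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among replica outputs")
    return value


if __name__ == "__main__":
    # One replica stopped; the group still delivers a result.
    print(select_output([None, "commit", "commit"], arbitrary=False))
    # One replica returned a wrong value; the majority masks it.
    print(select_output(["commit", "abort", "commit"], arbitrary=True))
```

Under these assumptions, tolerating one stopping failure takes two replicas, while tolerating one arbitrary failure takes three, which is why the choice of failure mode directly affects how a group is configured.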
