Chameleon: a software infrastructure for adaptive fault tolerance

This paper presents Chameleon, an adaptive software infrastructure for concurrently supporting different reliability levels in the same networked environment. Traditionally, fault tolerance has been provided through dedicated hardware, dedicated software, or a combination of both. Hardware solutions from manufacturers such as Tandem provide dedicated fault-tolerant machines with extensive hardware redundancy. Unfortunately, such solutions offer a static level of fault tolerance that remains fixed throughout the lifetime of the system. Software solutions, employed in distributed environments, replicate services in software to provide the requisite reliability level. However, to benefit from such solutions, applications must be written specifically to run in such an environment, so the benefits of this middleware are unavailable to off-the-shelf applications. In contemporary networked computing systems, a broad range of commercial and scientific applications, with potentially varying reliability requirements, need to coexist. It is neither cost-effective nor feasible to provide a dedicated hardware-based fault-tolerant platform for each application, or to rewrite each application to take advantage of specialized software middleware. We propose Chameleon as an infrastructure that provides adaptive levels of dependability to off-the-shelf applications running on off-the-shelf, unreliable hardware.
