A flexible software architecture for high availability computing

This paper presents an overview of the Chameleon architecture, which supports a wide range of criticality requirements in a heterogeneous network environment. Chameleon employs ARMORs (Adaptive, Reconfigurable and Mobile Objects for Reliability) to synthesize different fault-tolerant configurations and to adapt at run-time to changes in the fault tolerance requirements of an application. ARMORs have a flexible architecture that allows their composition to be reconfigured at run-time, i.e., an ARMOR can dynamically adapt to changing application requirements. We focus on a detailed description of the ARMOR architecture, including the ARMOR class hierarchy, the basic building blocks, ARMOR composition, and the use of ARMOR factories. We describe how ARMORs can be reconfigured and reengineered, and show how the architecture serves our objective of providing an adaptive software infrastructure. Our experience with an early Chameleon implementation indicates that the proposed ARMOR architecture provides a highly flexible and reconfigurable software infrastructure.
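The idea of composing an ARMOR from basic building blocks, creating it through a factory, and changing its composition at run-time can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class names `Element`, `Heartbeat`, `Checkpoint`, `Armor`, and `ArmorFactory`, the event names, and the dispatch scheme are hypothetical and are not taken from the Chameleon implementation.

```python
from typing import Callable, Dict, List, Tuple


class Element:
    """Hypothetical building block: reacts only to the event types it lists."""

    handles: Tuple[str, ...] = ()

    def process(self, event: str, payload: dict) -> str:
        raise NotImplementedError


class Heartbeat(Element):
    """Illustrative block that answers 'tick' events."""

    handles = ("tick",)

    def process(self, event: str, payload: dict) -> str:
        return f"heartbeat sent to {payload['peer']}"


class Checkpoint(Element):
    """Illustrative block that answers 'save' events."""

    handles = ("save",)

    def process(self, event: str, payload: dict) -> str:
        return f"checkpoint {payload['id']} stored"


class Armor:
    """Hypothetical container whose set of building blocks can change at run-time."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.elements: Dict[str, Element] = {}

    def add(self, key: str, element: Element) -> None:
        self.elements[key] = element

    def remove(self, key: str) -> None:
        self.elements.pop(key, None)

    def dispatch(self, event: str, payload: dict) -> List[str]:
        # Deliver the event to every installed block that declares it.
        return [
            element.process(event, payload)
            for element in self.elements.values()
            if event in element.handles
        ]


class ArmorFactory:
    """Hypothetical factory: maps block names to constructors and assembles ARMORs."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[[], Element]] = {}

    def register(self, key: str, constructor: Callable[[], Element]) -> None:
        self._registry[key] = constructor

    def create_element(self, key: str) -> Element:
        return self._registry[key]()

    def build(self, name: str, keys: List[str]) -> Armor:
        armor = Armor(name)
        for key in keys:
            armor.add(key, self.create_element(key))
        return armor


factory = ArmorFactory()
factory.register("heartbeat", Heartbeat)
factory.register("checkpoint", Checkpoint)

# Start with a heartbeat-only configuration, then reconfigure while running.
armor = factory.build("daemon", ["heartbeat"])
armor.add("checkpoint", factory.create_element("checkpoint"))
armor.remove("heartbeat")
```

In this sketch, reconfiguration amounts to adding or removing entries in the ARMOR's block table, so the same object can move from a heartbeat-only configuration to a checkpointing one without being restarted.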