Adaptive Fault-Resistant Systems

Abstract: The combined effects of faults and resource failures, wide swings in service demand, and situation-dependent user requirements stress a computer's ability to satisfy its service expectations. This is an especially significant problem in distributed systems that employ unreliable communications and whose components may operate in different and perhaps harsh physical, data, and usage environments. One of the goals of adaptive design is to allow flexible use of available resources to cover a much wider range of environmental variables than could be covered by a fixed, worst-case design. The research focused on three tasks: (1) developing a theory of adaptive fault-resistant systems and general principles of architectural design; (2) developing specific architectural design techniques; and (3) demonstrating adaptive designs. Three mechanisms were investigated: (1) Adaptive Distributed Recovery Blocks (ADRBs), a multiple-mode scheme for error detection and recovery, useful for both hardware and software faults; (2) adaptive fault tolerance for hybrid faults, an economical technique for tolerating both simple and complex fault types; and (3) adaptive distributed thread integrity, a technique for detecting and repairing thread breaks in a wide range of operating environments using the Alpha programming model.

Keywords: Adaptive distributed systems, Recovery blocks, Fault tolerance, Anomaly management.
