Software Fault-Tolerance and Design Diversity: Past Experience and Future Evolution

Abstract: Interest in software fault-tolerance (following an interest in fault-tolerance in general, and prompted by the so-called “software crisis”) is now shown by many institutions. Though many software fault-tolerance techniques are known, their use is limited by the lack of consistent, flexible methodologies, of support mechanisms in operating systems, and of design tools. In this paper, we discuss the known software fault-tolerance techniques, with particular reference to the two coherent methodologies proposed so far, Multiple Version Software and Recovery Blocks. We argue that a more general methodology is needed for use in complex software systems, and outline how the necessary mechanisms could be included in low-level system software.
