Software Fault Tolerance: An Overview

This paper presents an overview of the techniques that developers can use to produce software that tolerates design faults and faults in the surrounding environment. After reviewing the basic terms and concepts of fault tolerance, it presents the best-known fault-tolerance techniques that exploit software, information, and time redundancy, classified according to the kind of concurrency they support.
