Approaches to Software Based Fault Tolerance - A Review

This paper reviews approaches to software-based fault tolerance. It covers past and present approaches to software-implemented fault tolerance, both those that rely on software design diversity and those that rely on a single but enhanced design.
