Practical issues in the use of ABFT and a new failure model

We study the behavior of algorithm-based fault tolerance (ABFT) techniques under faults injected according to a quite general fault model. Besides the problem of roundoff error in floating-point arithmetic, we identify two further weak points: lack of protection of data during input and output, and incorrect execution of the correctness checks. We propose the robust ABFT technique to handle these weak points. We then generalize it to programs that use assertions, where similar problems arise, leading to the technique of robust assertions, whose effectiveness is shown by fault injection experiments on a realistic control application. A system using this technique follows a new failure model, which we call fail-bounded: with high probability, every result produced is either correct or, if wrong, within a certain bound of the correct value, where the bound depends on the output assertions used. We claim that this failure model is very useful for describing the behavior of many low-redundancy systems.
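As an illustration only, the following minimal Python sketch shows the classic checksum idea behind ABFT for a matrix-vector product, together with the roundoff tolerance that makes the check approximate. It is not the robust ABFT technique proposed here: it does not protect input or output data, and it does not guard the check itself against faulty execution. The function names and the tolerance value are hypothetical choices made for this example.

```python
def matvec(a, x):
    """Plain matrix-vector product y = A x (lists of floats)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def column_checksum(a):
    """Checksum row of A: the sum of each column."""
    return [sum(col) for col in zip(*a)]


def checksum_ok(a, x, y, tol=1e-9):
    """Check that sum(y) matches (checksum row of A) . x within a tolerance.

    The tolerance absorbs floating-point roundoff. As a side effect, any
    error smaller than the tolerance passes undetected, so an accepted
    result is either correct or wrong by a bounded amount.
    """
    expected = sum(c * x_j for c, x_j in zip(column_checksum(a), x))
    actual = sum(y)
    scale = max(1.0, abs(expected), abs(actual))  # relative tolerance (assumed)
    return abs(expected - actual) <= tol * scale


a = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]

y = matvec(a, x)
fault_free = checksum_ok(a, x, y)

# Simulate a fault that corrupts one output element by a large amount.
y_large_error = [y[0] + 0.5, y[1]]
large_error_detected = not checksum_ok(a, x, y_large_error)

# Simulate a fault whose effect is below the tolerance: it is accepted.
y_small_error = [y[0] + 1e-12, y[1]]
small_error_accepted = checksum_ok(a, x, y_small_error)
```

In this sketch the large corruption is rejected, while the corruption below the tolerance is accepted. That second case is the kind of bounded error the fail-bounded model is meant to describe: the result is wrong, but only by an amount limited by the check.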
