Robust assertions and fail-bounded behavior

This paper characterizes the behavior of assertion-based error detection mechanisms under faults injected according to a quite general fault model. Assertions based on knowledge of the application can be very effective at detecting corruption of critical data caused by hardware faults. We identify two main drawbacks of this approach: data outside the section covered by assertions is unprotected, notably during input and output, and the assertions themselves may execute incorrectly. To handle these weak points we propose the Robust Assertions technique, whose effectiveness is shown by extensive fault injection experiments. With this technique a system follows a new failure model, called Fail-Bounded, in which, with high probability, every result produced is either correct or, if wrong, within a certain bound of the correct value; the exact bound depends on the output assertions used. Any kind of assertion can be considered, from simple likelihood tests to high-coverage assertions such as those used in the Algorithm Based Fault Tolerance paradigm. We claim that this failure model is very useful for describing the behavior of many low-cost fault-tolerant systems with low hardware and software redundancy, such as embedded systems, where cost is a severe restriction yet full availability is expected.
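To make the Fail-Bounded idea concrete, the following is a minimal sketch of an output assertion in the style of Algorithm Based Fault Tolerance, not the implementation evaluated in the paper. It checks a matrix-vector product against a column checksum and releases the result only if the check passes. All names (`matvec`, `checksum_assertion`, `fail_bounded_matvec`, the `fault` hook, and the tolerance `tol`) are hypothetical and chosen for illustration. The sketch does not cover protection of input and output data or of the assertion code itself, which is what the Robust Assertions technique adds.

```python
class AssertionFailed(Exception):
    """Raised when the output assertion rejects a result."""


def matvec(a, x):
    # Plain matrix-vector product y = A x.
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def checksum_assertion(a, x, y, tol):
    # ABFT-style check: (column sums of A) . x must equal sum(y),
    # up to a tolerance that absorbs floating-point rounding.
    col_sums = [sum(col) for col in zip(*a)]
    expected = sum(c * xj for c, xj in zip(col_sums, x))
    return abs(sum(y) - expected) <= tol


def fail_bounded_matvec(a, x, tol=1e-9, fault=None):
    y = matvec(a, x)
    if fault is not None:
        # Hypothetical hook that corrupts the result, standing in
        # for an injected hardware fault.
        y = fault(y)
    if not checksum_assertion(a, x, y, tol):
        raise AssertionFailed("checksum mismatch exceeds tolerance")
    return y


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    x = [1.0, 1.0]
    print(fail_bounded_matvec(a, x))
    try:
        fail_bounded_matvec(a, x, fault=lambda y: [y[0] + 0.5, y[1]])
    except AssertionFailed as exc:
        print("rejected:", exc)
```

Under these assumptions, a result that passes the check has a net error in the sum of its elements of at most `tol`, so a single corrupted element is either detected or lies within `tol` of its correct value. Errors in several elements that cancel in the sum would escape this particular check, which illustrates why the bound depends on the output assertion chosen.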
