Experimental evaluation of the fail-silent behaviour in programs with consistency checks

An important research topic deals with the investigation of whether a non-duplicated computer can be made fail-silent, since that behaviour is a-priori assumed in many algorithms. However, previous research has shown that in systems using a simple behaviour based error detection mechanism invisible to the programmer (e.g. memory protection), the percentage of fail-silent violations could be higher than 10%. Since the study of these errors has shown that they were mostly caused by pure data errors, we evaluate the effectiveness of software techniques capable of checking the semantics of the data, such as assertions, to detect these remaining errors. The results of injecting physical pin-level faults show that these tests can prevent about 40% of the fail-silent model violations that escape the simple hardware-based error detection techniques. In order to decouple the intrinsic limitations of the tests used from other factors that might affect its error detection capabilities, we evaluated a special class of software checks known for its high theoretical coverage: algorithm based fault tolerance (ABFT). The analysis of the remaining errors showed that most of them remained undetected due to short range control flow errors. When very simple software-based control flow checking was associated to the semantic tests, the target system, without any dedicated error detection hardware, behaved according to the fail-silent model for about 98% of all the faults injected.

[1]  Edward J. McCluskey,et al.  Linear Complexity Assertions for Sorting , 1994, IEEE Trans. Software Eng..

[2]  Jean Arlat,et al.  Fault injection for dependability validation of fault-tolerant computing systems , 1989, [1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[3]  Johan Karlsson,et al.  TWO FAULT INJECTION TECHNIQUES FOR TEST OF FAULT HANDLING MECHANISMS , 1991, 1991, Proceedings. International Test Conference.

[4]  Prithviraj Banerjee,et al.  Algorithms-Based Fault Detection for Signal Processing Applications , 1990, IEEE Trans. Computers.

[5]  Jacob A. Abraham,et al.  Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.

[6]  Hubert D. Kirrmann Fault Tolerance in Process Control: An Overview And Examples of European Products , 1987, IEEE Micro.

[7]  Henrique Madeira,et al.  Experimental evaluation of the fail-silent behavior in computers without error masking , 1994, Proceedings of IEEE 24th International Symposium on Fault- Tolerant Computing.

[8]  David J. Lu Watchdog Processors and Structural Integrity Checking , 1982, IEEE Transactions on Computers.

[9]  Butler W. Lampson,et al.  Atomic Transactions , 1980, Advanced Course: Distributed Systems.

[10]  Algirdas Avižienis,et al.  Building dependable systems: how to keep up with complexity , 1995 .

[11]  Jean Arlat,et al.  Fault Injection for Dependability Validation: A Methodology and Some Applications , 1990, IEEE Trans. Software Eng..

[12]  Ravishankar K. Iyer,et al.  Error Propagation in a Digital Avionic Processor: A Simulation-Based Study , 1986, RTSS.

[13]  Miroslaw Malek,et al.  A Fault-Tolerant FFT Processor , 1988, IEEE Trans. Computers.

[14]  Edward J. McCluskey,et al.  Concurrent Fault Detection Using a Watchdog Processor and Assertions , 1983, ITC.

[15]  Butler W. Lampson,et al.  Distributed Systems - Architecture and Implementation, An Advanced Course , 1981, Advanced Course: Distributed Systems.

[16]  Brian Randell System Structure for Software Fault Tolerance , 1975, IEEE Trans. Software Eng..

[17]  Henrique Madeira,et al.  RIFLE: A General Purpose Pin-level Fault Injector , 1994, EDCC.

[18]  D. Powell,et al.  The Delta-4 Approach to Dependability in Open Distributed Computing Systems , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, ' Highlights from Twenty-Five Years'..

[19]  M. Tsunoyama,et al.  A fault-tolerant FFT processor , 1991, [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium.

[20]  Ravishankar K. Iyer,et al.  A Measurement-Based Model for Workload Dependence of CPU Errors , 1986, IEEE Transactions on Computers.

[21]  Johan Karlsson,et al.  Evaluation of error detection schemes using fault injection by heavy-ion radiation , 1989, [1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[22]  Johan Karlsson,et al.  Two software techniques for on-line error detection , 1992, [1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing.

[23]  Edward J. McCluskey,et al.  Concurrent Error Detection Using Watchdog Processors - A Survey , 1988, IEEE Trans. Computers.