Experimental evaluation of the fail-silent behavior in computers without error masking

Traditionally, fail-silent computers are implemented using massive redundancy (hardware or software). In this research we investigate whether a high degree of fail-silent behavior can be obtained from a computer without hardware or software replication, using only simple behavior-based error detection techniques. It is assumed that if the errors caused by a fault are detected in time, the erroneous computer behavior can be stopped, thus preventing a violation of the fail-silent model. The evaluation technique used is physical fault injection at the pin level. Results obtained by injecting about 20,000 different faults into two different target systems show that: in a system without error detection, up to 46% of the faults caused a violation of the fail-silent model; in a computer with behavior-based error detection, the percentage of faults that caused a violation of the fail-silent model was reduced to between 0.4% and 2.3%; and the results depend strongly on the target system, on the program executing during fault injection, and on the type of faults.
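As an illustration only, the percentages above can be read as the share of injected faults whose outcome was a fail-silent violation. The sketch below assumes a hypothetical record format (one pair of flags per injected fault) and a hypothetical three-way outcome classification; neither is taken from the paper, and the demo records are made up, not experimental data.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcome classes for one injected fault (an assumption)."""
    NO_EFFECT = "no effect"   # no wrong output and nothing detected
    DETECTED = "detected"     # error detected before any wrong output left the system
    VIOLATION = "violation"   # wrong output was delivered: fail-silent model violated


def classify(wrong_output_emitted: bool, error_detected: bool) -> Outcome:
    """Classify one fault-injection run.

    A wrong output counts as a violation even if the error was also
    detected, because in that case detection came too late.
    """
    if wrong_output_emitted:
        return Outcome.VIOLATION
    if error_detected:
        return Outcome.DETECTED
    return Outcome.NO_EFFECT


def violation_percentage(records) -> float:
    """Return the percentage of runs that violated the fail-silent model.

    `records` is an iterable of (wrong_output_emitted, error_detected) pairs.
    """
    counts = Counter(classify(wrong, detected) for wrong, detected in records)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * counts[Outcome.VIOLATION] / total


if __name__ == "__main__":
    # Made-up records for demonstration only.
    demo = [(False, False), (False, True), (True, False), (True, True)]
    print(f"{violation_percentage(demo):.1f}% of runs violated the fail-silent model")
```

Treating a late detection as a violation reflects the assumption stated in the abstract: detection helps only if it happens in time to stop the erroneous behavior.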
