ED4I: Error Detection by Diverse Data and Duplicated Instructions

Errors in computing systems can cause abnormal behavior and degrade data integrity and system availability. They must be avoided especially in embedded systems for critical applications. However, as VLSI technologies trend toward smaller feature sizes, lower supply voltages, and higher frequencies, there is growing concern about temporary as well as permanent errors in embedded systems, so detecting those errors is essential. Software-implemented hardware fault tolerance (SIHFT) is a low-cost alternative to hardware fault-tolerance techniques for embedded processors: it requires no hardware modification of commercial off-the-shelf (COTS) processors. ED4I (error detection by data diversity and duplicated instructions) is a SIHFT technique that detects both permanent and temporary errors by executing two "different" programs (with the same functionality) and comparing their outputs. ED4I maps each number $x$ in the original program to a new number $x'$, and then transforms the program to operate on the new numbers, so that its results can be mapped back and compared with the results of the original program. For integer numbers, the mapping is $x' = k \cdot x$, where $k$ determines the fault detection probability and data integrity of the system. For floating-point numbers, we find a value $k_f$ for the fraction and a value $k_e$ for the exponent separately, and use $k = k_f \times 2^{k_e}$. We demonstrate how to choose an optimal value of $k$ for the transformation. This paper shows that, for integer programs, the transformation with $k = -2$ was the most desirable choice in six out of seven benchmark programs we simulated: it maximizes the fault detection probability under the condition that the data integrity is highest.
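
As an illustration of the integer mapping x' = k·x described above, the following minimal Python sketch runs a toy summation twice, once on the original data and once on data multiplied by k, and checks that the second result equals k times the first. The function names (`original_sum`, `transformed_sum`, `outputs_agree`) and the fault model (a result bit stuck at 1, hitting both runs identically) are assumptions made for this example only; they are not taken from the paper, and the sketch ignores overflow, which a real transformation would have to account for.

```python
K = -2  # diversity factor; k = -2 is the value the abstract reports as preferred


def original_sum(values):
    """Original program: sums the inputs as given."""
    total = 0
    for v in values:
        total += v
    return total


def transformed_sum(values, k=K):
    """Transformed program: same algorithm, but every input x is replaced by k*x."""
    total = 0
    for v in values:
        total += k * v
    return total


def stuck_at_1_bit0(result):
    """Hypothetical permanent fault: the least significant result bit is stuck at 1."""
    return result | 1


def outputs_agree(values, k=K, fault=None):
    """Return True if the two runs agree, i.e. no error is detected."""
    r = original_sum(values)
    r_t = transformed_sum(values, k)
    if fault is not None:
        # The same faulty hardware corrupts both executions.
        r, r_t = fault(r), fault(r_t)
    # The transformed result, mapped back, must equal the original result.
    return r_t == k * r


if __name__ == "__main__":
    data = [2, 5, 7]
    print(outputs_agree(data))                               # fault-free run
    print(outputs_agree(data, k=1, fault=stuck_at_1_bit0))   # plain duplication
    print(outputs_agree(data, k=-2, fault=stuck_at_1_bit0))  # diverse data
```

With k = 1 (plain duplication) both runs are corrupted in the same way and still agree, so the fault goes unnoticed; with k = -2 the two runs operate on different bit patterns, and the same fault makes the results disagree.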
