Error detection by duplicated instructions in super-scalar processors

This paper proposes a pure software technique, "error detection by duplicated instructions" (EDDI), for detecting errors during normal system operation. Unlike error-detection techniques that rely on hardware redundancy, EDDI requires no hardware modifications to add error-detection capability to the original system. EDDI duplicates instructions during compilation and uses different registers and variables for the duplicated instructions. For faults in the code segment of memory in particular, formulas are derived to estimate the error-detection coverage of EDDI using probabilistic methods. These formulas use program statistics that are collected during compilation. EDDI was applied to eight benchmark programs and its error-detection coverage was estimated. The estimates were then verified by simulation, in which a fault injector forced a bit-flip in the code segment of the executable machine code. The simulation results validated the estimated fault coverage and show that approximately 1.5% of injected faults produced incorrect results in the eight benchmark programs with EDDI, while on average 20% of injected faults produced undetected incorrect results in the programs without EDDI. Based on the theoretical estimates and the fault-injection experiments, EDDI can provide over 98% fault coverage without any extra hardware for error detection. This pure software technique is especially useful when designers cannot change the hardware but still need dependability in the computer system. To reduce the performance overhead, EDDI schedules the instructions added for error detection so that "instruction-level parallelism" (ILP) is maximized. Performance overhead can be reduced by increasing ILP within a single super-scalar processor. The execution-time overhead in a 4-way super-scalar processor is less than the execution-time overhead in processors that can issue two instructions per cycle.
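The duplication idea and the bit-flip fault injection described above can be illustrated with a minimal sketch. It is not the paper's compiler pass. The toy instruction set, the `SHADOW_OFFSET` register mapping, the function names, and the choice to compare master and shadow values immediately before each store are all assumptions made for illustration; the abstract states only that instructions are duplicated with different registers and variables.

```python
# Toy model of instruction duplication with shadow registers.
# Everything here (ISA, names, check placement) is hypothetical.

SHADOW_OFFSET = 100  # assumption: shadow of register/address r is r + 100


class ErrorDetected(Exception):
    """Raised when a master value and its shadow copy disagree."""


# Instructions are tuples (op, dst, a, b):
#   ("li",    dst,  imm,  None)  load immediate into register dst
#   ("add",   dst,  src1, src2)  regs[dst] = regs[src1] + regs[src2]
#   ("mul",   dst,  src1, src2)  regs[dst] = regs[src1] * regs[src2]
#   ("store", addr, src,  None)  mem[addr] = regs[src]
#   ("check", None, r1,   r2)    raise ErrorDetected if regs differ
PROGRAM = [
    ("li", 1, 6, None),
    ("li", 2, 7, None),
    ("mul", 3, 1, 2),
    ("store", 0, 3, None),
]


def duplicate(program):
    """Return a copy of program with every instruction duplicated on shadow state."""
    out = []
    for op, dst, a, b in program:
        if op == "li":
            out.append((op, dst, a, b))
            out.append((op, dst + SHADOW_OFFSET, a, b))
        elif op == "store":
            # Compare master and shadow before the value leaves the registers.
            out.append(("check", None, a, a + SHADOW_OFFSET))
            out.append((op, dst, a, b))
            out.append((op, dst + SHADOW_OFFSET, a + SHADOW_OFFSET, b))
        else:  # two-operand arithmetic
            out.append((op, dst, a, b))
            out.append((op, dst + SHADOW_OFFSET,
                        a + SHADOW_OFFSET, b + SHADOW_OFFSET))
    return out


def run(program):
    """Interpret the toy program and return the final memory contents."""
    regs, mem = {}, {}
    for op, dst, a, b in program:
        if op == "li":
            regs[dst] = a
        elif op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "mul":
            regs[dst] = regs[a] * regs[b]
        elif op == "store":
            mem[dst] = regs[a]
        elif op == "check":
            if regs[a] != regs[b]:
                raise ErrorDetected(f"r{a} != r{b}")
    return mem


def flip_immediate_bit(program, index, bit):
    """Model a code-segment fault: flip one bit of a load-immediate operand."""
    op, dst, imm, b = program[index]
    if op != "li":
        raise ValueError("this sketch only corrupts 'li' immediates")
    faulty = list(program)
    faulty[index] = (op, dst, imm ^ (1 << bit), b)
    return faulty


if __name__ == "__main__":
    # Unprotected: the corrupted program silently stores a wrong value.
    print(run(flip_immediate_bit(PROGRAM, 0, 0)))
    # Protected: the same fault trips the check before the store.
    try:
        run(flip_immediate_bit(duplicate(PROGRAM), 0, 0))
    except ErrorDetected as exc:
        print("detected:", exc)
```

In this sketch a single bit-flip corrupts only one of the two copies, so the comparison before the store exposes the mismatch. The master and shadow instructions are independent of each other until the check, which is what leaves room for a super-scalar processor to issue them in parallel.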
