REESE: a method of soft error detection in microprocessors

The future reliability of general-purpose processors (GPPs) is threatened by a combination of shrinking transistor sizes, higher clock rates, reduced supply voltages, and other factors. The rate of arbitrary transient faults, or soft errors, is predicted to increase dramatically as these trends continue. We develop and evaluate a fault-tolerant microprocessor architecture that detects soft errors in its own data pipeline. The architecture detects soft errors through time redundancy while adding little execution-time overhead. Our approach, called REESE (REdundant Execution using Spare Elements), first minimizes this overhead and then reduces it even further by strategically adding a small number of functional units to the pipeline. This differs from earlier, similar approaches, which did not address ways of reducing the overhead needed to implement time redundancy in GPPs.
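The idea of detecting a soft error through time redundancy can be illustrated with a minimal software sketch: the same operation is executed twice and the two results are compared, so a transient fault that corrupts only one execution shows up as a mismatch. This is not the REESE pipeline itself, which is a hardware design; the function names (`flip_bit`, `execute`, `checked_execute`) and the single-bit fault model are assumptions made purely for illustration.

```python
import operator


def flip_bit(value, bit):
    """Model a transient fault by inverting one bit of a result."""
    return value ^ (1 << bit)


def execute(op, a, b, fault_bit=None):
    """Compute op(a, b); optionally inject a single-bit transient fault."""
    result = op(a, b)
    return result if fault_bit is None else flip_bit(result, fault_bit)


def checked_execute(op, a, b, fault_bit=None):
    """Run the operation twice and compare the two results.

    The first (primary) run may be hit by an injected fault. The second
    (redundant) run is fault-free here, modelling a transient that has
    already passed. Returns (primary_result, error_detected).
    """
    primary = execute(op, a, b, fault_bit)
    redundant = execute(op, a, b)
    return primary, primary != redundant


if __name__ == "__main__":
    print(checked_execute(operator.add, 2, 3))               # fault-free run
    print(checked_execute(operator.add, 2, 3, fault_bit=1))  # faulty primary run
```

In this sketch every operation is executed twice, so the redundant work roughly doubles the cost. The abstract's point is that hardware can hide most of that cost by carrying out the redundant executions on functional units that would otherwise sit idle, plus a small number of added ones.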
