Dual use of superscalar datapath for transient-fault detection and recovery

Diminutive devices and high clock frequency of future microprocessor generations are causing increased concerns for transient soft failures in hardware, necessitating fault detection and recovery mechanisms even in commodity processors. In this paper, we propose a fault-tolerant extension for modern superscalar out-of-order datapath that can be supported by only modest additional hardware. In the proposed extensions, error-detection is achieved by verifying the redundant results of dynamically replicated threads of executions, while the error-recovery scheme employs the instruction-rewind mechanism to restart at a failed instruction. We study the performance impact of augmenting superscalar microarchitectures with this fault tolerance mechanism. An analytical performance model is used in conjunction with a performance simulator. The simulation results of 11 SPEC95 and SPEC2000 benchmarks show that in the absence of faults, error detection causes a 2% to 45% reduction in throughput, which is in line with other proposed detection schemes. In the presence of transient faults, the fast error recovery scheme contributes very little additional slowdown.

[1]  T. May,et al.  A New Physical Mechanism for Soft Errors in Dynamic Memories , 1978, 16th International Reliability Physics Symposium.

[2]  J. Ziegler,et al.  Effect of Cosmic Rays on Computer Memories , 1979, Science.

[3]  Janak H. Patel,et al.  Concurrent Error Detection in ALU's by Recomputing with Shifted Operands , 1982, IEEE Transactions on Computers.

[4]  Gurindar S. Sohi,et al.  Instruction Issue Logic for High-Performance Interruptible, Multiple Functional Unit, Pipelines Computers , 1990, IEEE Trans. Computers.

[5]  Z. Hasnain,et al.  Building-in reliability: soft errors-a case study , 1992, 30th Annual Proceedings Reliability Physics 1992.

[6]  Manoj Franklin A study of time redundant fault tolerance techniques for superscalar processors , 1995, Proceedings of International Workshop on Defect and Fault Tolerance in VLSI.

[7]  E. Normand Single event upset at ground level , 1996 .

[8]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[9]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[10]  Deep submicron microprocessor design issues , 1999, IEEE Micro.

[11]  F. Faccio,et al.  Single Event Effects in Static and Dynamic Registers in a 0 . 25 μ m CMOS Technology , 1999 .

[12]  Shekhar Y. Borkar,et al.  Design challenges of technology scaling , 1999, IEEE Micro.

[13]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[14]  Timothy J. Slegel,et al.  IBM's S/390 G5 microprocessor design , 1999, IEEE Micro.

[15]  Neeraj Suri,et al.  Designing high-performance and reliable superscalar architectures-the out of order reliable superscalar (O3RS) approach , 2000, Proceeding International Conference on Dependable Systems and Networks. DSN 2000.

[16]  K. Sundaramoorthy,et al.  Slipstream processors: improving both performance and fault tolerance , 2000, SIGP.

[17]  Lorena Anghel,et al.  Self-checking circuits versus realistic faults in very deep submicron , 2000, Proceedings 18th IEEE VLSI Test Symposium.

[18]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).