Opportunistic Transient-Fault Detection

CMOS scaling increases susceptibility of microprocessors to transient faults. Most current proposals for transient-fault detection use full redundancy to achieve perfect coverage while incurring significant performance degradation. However, most commodity systems do not need or provide perfect coverage. A recent paper explores this leniency to reduce the soft-error rate of the issue queue during L2 misses while incurring minimal performance degradation. Whereas the previous paper reduces soft-error rate without using any redundancy, we target better coverage while incurring similarly-minimal performance degradation by opportunistically using redundancy. We propose two semi-complementary techniques, called partial explicit redundancy (PER) and implicit redundancy through reuse (IRTR), to explore the trade-off between soft-error rate and performance. PER opportunistically exploits low-ILP phases and L2 misses to introduce explicit redundancy with minimal performance degradation. Because PER covers the entire pipeline and exploits not only L2 misses but all low-ILP phases, PER achieves better coverage than the previous work. To achieve coverage in high-ILP phases as well, we propose implicit redundancy through reuse (IRTR). Previous work exploits the phenomenon of instruction reuse to avoid redundant execution while falling back on redundant execution when there is no reuse. IRTR takes reuse to the extreme of performance-coverage trade-off and completely avoids explicit redundancy by exploiting reuseýs implicit redundancy within the main thread for fault detection with virtually no performance degradation. Using simulations with SPEC2000, we show that PER and IRTR achieve better trade-off between soft-error rate and performance degradation than the previous schemes.

[1]  Eric Rotenberg,et al.  Slipstream processors: improving both performance and fault tolerance , 2000, SIGP.

[2]  Irith Pomeranz,et al.  Transient-fault recovery for chip multiprocessors , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[3]  Kevin Skadron,et al.  Temperature-aware microarchitecture , 2003, ISCA '03.

[4]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[5]  Joel Emer,et al.  A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[6]  Timothy J. Slegel,et al.  IBM's S/390 G5 microprocessor design , 1999, IEEE Micro.

[7]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[8]  Kewal K. Saluja,et al.  A study of time-redundant fault tolerance techniques for high-performance pipelined computers , 1989, [1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[9]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[10]  Lorenzo Alvisi,et al.  Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.

[11]  Edward J. McCluskey,et al.  Concurrent Error Detection Using Watchdog Processors - A Survey , 1988, IEEE Trans. Computers.

[12]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[13]  Gernot Metze,et al.  Fault Detection Capabilities of Alternating Logic , 1978, IEEE Transactions on Computers.

[14]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[15]  G.S. Sohi,et al.  Dynamic instruction reuse , 1997, ISCA '97.

[16]  Jack L. Lo,et al.  Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[17]  Anand Sivasubramaniam,et al.  A complexity-effective approach to ALU bandwidth enhancement for instruction-level temporal redundancy , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[18]  James C. Hoe,et al.  Dual use of superscalar datapath for transient-fault detection and recovery , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[19]  Prithviraj Banerjee,et al.  Low Cost Concurrent Error Detection in a VLIW Architecture Using Replicated Instructions , 1992, ICPP.

[20]  Janak H. Patel,et al.  Concurrent Error Detection in ALU's by Recomputing with Shifted Operands , 1982, IEEE Transactions on Computers.

[21]  Dean M. Tullsen,et al.  Handling long-latency loads in a simultaneous multithreading processor , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[22]  Joel S. Emer,et al.  Techniques to reduce the soft error rate of a high-performance microprocessor , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[23]  Irith Pomeranz,et al.  Transient-fault recovery using simultaneous multithreading , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[24]  Doug Burger,et al.  Evaluating Future Microprocessors: the SimpleScalar Tool Set , 1996 .