ReStore: symptom based soft error detection in microprocessors

Device scaling and large-scale integration have led to growing concerns about soft errors in microprocessors. To date, in all but the most demanding applications, implementing parity and ECC for caches and other large, regular SRAM structures have been sufficient to stem the growing soft error tide. This will not be the case for long and questions remain as to the best way to detect and recover from soft errors in the remainder of the processor - in particular, the less structured execution core. In this work, we propose the ReStore architecture, which leverages existing performance enhancing checkpointing hardware to recover from soft error events in a low cost fashion. Error detection in the ReStore architecture is novel: symptoms that hint at the presence of soft errors trigger restoration of a previous checkpoint. Example symptoms include exceptions, control flow misspeculations, and cache or translation look-aside buffer misses. Compared to conventional soft error detection via full replication, the ReStore framework incurs little overhead, but sacrifices some amount of error coverage. These attributes make it an ideal means to provide very cost effective error coverage for processor applications that can tolerate a nonzero, but small, soft error failure rate. Our evaluation of an example ReStore implementation exhibits a 2times increase in MTBF (mean time between failures) over a standard pipeline with minimal hardware and performance overheads. The MTBF increases by 20times if ReStore is coupled with protection for certain particularly vulnerable pipeline structures

[1]  ZhangMing,et al.  Robust System Design with Built-In Soft-Error Resilience , 2005 .

[2]  Haitham Akkary,et al.  Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors , 2003, MICRO.

[3]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[4]  Todd M. Austin,et al.  A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor , 2003, MICRO.

[5]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[6]  Lisa Spainhower,et al.  IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective , 1999, IBM J. Res. Dev..

[7]  Dirk Grunwald,et al.  Confidence estimation for speculation control , 1998, ISCA.

[8]  Sanjay J. Patel,et al.  ReStore: Symptom-Based Soft Error Detection in Microprocessors , 2006, IEEE Trans. Dependable Secur. Comput..

[9]  Sanjay J. Patel,et al.  Characterization of essential dynamic instructions , 2003, SIGMETRICS '03.

[10]  D.A. Jimenez Fast path-based neural branch prediction , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[11]  Jiri Gaisler A portable and fault-tolerant microprocessor based on the SPARC v8 architecture , 2002, Proceedings International Conference on Dependable Systems and Networks.

[12]  T.H. Lee,et al.  A 600 MHz superscalar RISC microprocessor with out-of-order execution , 1997, 1997 IEEE International Solids-State Circuits Conference. Digest of Technical Papers.

[13]  Ravishankar K. Iyer,et al.  A Processor-Level Framework for High-Performance and High-Dependability , 2001 .

[14]  Manoj Franklin Incorporating fault tolerance in superscalar processors , 1996, Proceedings of 3rd International Conference on High Performance Computing (HiPC).

[15]  Ravishankar K. Iyer,et al.  Characterization of linux kernel behavior under errors , 2003, 2003 International Conference on Dependable Systems and Networks, 2003. Proceedings..

[16]  Haitham Akkary,et al.  Perceptron-Based Branch Confidence Estimation , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[17]  Onur Mutlu,et al.  Microarchitecture-based introspection: a technique for transient-fault tolerance in microprocessors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[18]  Todd M. Austin,et al.  A fault tolerant approach to microprocessor design , 2001, 2001 International Conference on Dependable Systems and Networks.

[19]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[20]  Todd C. Mowry,et al.  The potential for using thread-level data speculation to facilitate automatic parallelization , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.

[21]  Joel S. Emer,et al.  Techniques to reduce the soft error rate of a high-performance microprocessor , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[22]  Onur Mutlu,et al.  Runahead Execution: An Effective Alternative to Large Instruction Windows , 2003, IEEE Micro.

[23]  Trevor Mudge,et al.  Improving data cache performance by pre-executing instructions under a cache miss , 1997 .

[24]  Eric Rotenberg,et al.  Assigning confidence to conditional branch predictions , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[25]  Hiroyuki Sugiyama,et al.  A 1.3 GHz fifth generation SPARC64 microprocessor , 2003, 2003 IEEE International Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC..

[26]  P. Hazucha,et al.  Impact of CMOS technology scaling on the atmospheric neutron soft error rate , 2000 .

[27]  Peter Hazucha,et al.  Characterization of soft errors caused by single event upsets in CMOS processes , 2004, IEEE Transactions on Dependable and Secure Computing.

[28]  Babak Falsafi,et al.  Fingerprinting: Bounding Soft-Error-Detection Latency and Bandwidth , 2004, IEEE Micro.

[29]  S. McFarling Combining Branch Predictors , 1993 .

[30]  Ravishankar K. Iyer,et al.  Error sensitivity of the Linux kernel executing on PowerPC G4 and Pentium 4 processors , 2004, International Conference on Dependable Systems and Networks, 2004.

[31]  Edward J. McCluskey,et al.  Concurrent Error Detection Using Watchdog Processors - A Survey , 1988, IEEE Trans. Computers.

[32]  David J. Sager,et al.  The microarchitecture of the Pentium 4 processor , 2001 .

[33]  Sanjay J. Patel,et al.  Characterizing the effects of transient faults on a high-performance processor pipeline , 2004, International Conference on Dependable Systems and Networks, 2004.

[34]  Daniel A. Jiménez,et al.  Fast Path-Based Neural Branch Prediction , 2003, MICRO.

[35]  Babak Falsafi,et al.  Fingerprinting: bounding soft-error-detection latency and bandwidth , 2004, IEEE Micro.