Intermittent Hardware Errors Recovery: Modeling and Evaluation

The frequency of hardware errors is increasing due to shrinking feature sizes, higher levels of integration, and increasing design complexity. Intermittent errors are those that occur non-deterministically at the same location. It has been shown that intermittent hardware errors contribute to about 39% of total hardware failures. Intermittent errors have characteristics that differ from those of transient and permanent errors, which makes it challenging to devise efficient recovery techniques for them. In this paper, we evaluate the impact of different intermittent error recovery scenarios on processor performance. To achieve this, we model a system that consists of a fault-tolerant multicore processor subject to intermittent faults. Our fault models are based on insights from related work at the physical level. We find that the frequency of the intermittent error and the relative importance of the error location play an important role in choosing the recovery action that maximizes the processor's performance.
