Towards understanding the effects of intermittent hardware faults on programs

Intermittent hardware faults are bursts of errors that last from a few CPU cycles to a few seconds. They are caused by process variations, circuit wear-out, and temperature, clock or voltage fluctuations. Recent studies show that intermittent fault rates are increasing due to technology scaling and are likely to be a significant concern in future systems. We study the propagation of intermittent faults to programs; in particular, we are interested in the crash behaviour of programs. We use a model of a program that represents the data dependencies in a fault-free trace of the program and we analyze this model to glean some information about the length of intermittent faults and their effect on the program under specific fault and crash models. The results of our study can aid fault detection, diagnosis and recovery techniques.

[1]  Babak Falsafi,et al.  Detecting Emerging Wearout Faults , 2007 .

[2]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[3]  Pedro J. Gil,et al.  Analysis of the influence of intermittent faults in a microcontroller , 2008, 2008 11th IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems.

[4]  Ravishankar K. Iyer,et al.  Characterization of linux kernel behavior under errors , 2003, 2003 International Conference on Dependable Systems and Networks, 2003. Proceedings..

[5]  A. A. Ismaeel,et al.  Test for detection and location of intermittent faults in combinational circuits , 1997 .

[6]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[7]  James Tschanz,et al.  Parameter variations and impact on circuits and microarchitecture , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[8]  Karthik Pattabiraman,et al.  Formal Diagnosis of Hardware Transient Errors in Programs , 2010 .

[9]  Koushik Chakraborty,et al.  Adapting to intermittent faults in multicore systems , 2008, ASPLOS.

[10]  C. Constantinescu,et al.  Intermittent faults and effects on reliability of integrated circuits , 2008, 2008 Annual Reliability and Maintainability Symposium.

[11]  Trevor Mudge,et al.  Razor: a low-power pipeline based on circuit-level timing speculation , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[12]  Ravishankar K. Iyer,et al.  Application-based metrics for strategic placement of detectors , 2005, 11th Pacific Rim International Symposium on Dependable Computing (PRDC'05).

[13]  Kevin Skadron,et al.  Temperature-aware microarchitecture: Modeling and implementation , 2004, TACO.

[14]  Ravishankar K. Iyer,et al.  FINE: A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior under Faults , 1993, IEEE Trans. Software Eng..

[15]  J. W. McPherson,et al.  Reliability challenges for 45nm and beyond , 2006, 2006 43rd ACM/IEEE Design Automation Conference.

[16]  Babak Falsafi,et al.  Fingerprinting: bounding soft-error-detection latency and bandwidth , 2004, IEEE Micro.

[17]  cristian. constantinescu Impact of Intermittent Faults on Nanocomputing Devices , 2007 .

[18]  Joseph Robert Horgan,et al.  Dynamic program slicing , 1990, PLDI '90.

[19]  Douglas M. Blough,et al.  Intermittent Fault Diagnosis in Multiprocessor Systems , 1992, IEEE Trans. Computers.

[20]  David Blaauw,et al.  Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation , 2003, MICRO.