Towards analyzing and improving robustness of software applications to intermittent and permanent faults in hardware

Although a significant fraction of emerging failure and wearout mechanisms result in intermittent or permanent hardware faults, the impact of such faults (as distinct from transient faults) on software applications has not been well studied. In this paper, we derive a distinguishing application characteristic, referred to as similarity, from a fundamental circuit-level understanding of the failure mechanisms. We present a mathematical definition of similarity and a procedure for computing it for practical software applications, and we experimentally verify the relationship between similarity and fault rate. Leveraging the dependence of application robustness on the similarity metric, we present example architecture-independent code transformations that reduce similarity, and thereby the worst-case fault rate, with minimal performance degradation. Our experimental results with arithmetic unit faults show as much as 74% improvement in the worst-case fault rate on benchmark kernels, with less than 10% runtime penalty.
