Towards Formal Approaches to System Resilience

Technology scaling and techniques such as dynamic voltage/frequency scaling are predicted to increase the number of transient faults in future processors. Error detectors implemented in hardware are often energy inefficient, as they are "always on." While software-level error detection can augment hardware-level detectors, creating highly effective detectors in software remains a challenge. In this paper, we first present a new LLVM-level fault injector called KULFI that helps simulate faults occurring within CPU state elements in a versatile manner. Second, using KULFI, we study the behavior of a family of well-known, simple algorithms under error injection; we choose a family of sorting algorithms for this study. We then propose a promising way to interpret our empirical results using a formal model that builds on the idea of predicate state transition diagrams. After introducing the basic abstraction underlying our predicate transition diagrams, we draw connections to the level of resilience empirically observed during fault injection studies. Building on the observed connections, we develop a simple yet effective predicate-abstraction-based fault detector. Although this work is in its initial stages, we believe it is the first study that offers a formal way to interpret and compare fault injection results obtained from algorithms within one family. Given the unpredictable nature of what a fault can do to a computation in general, our approach may help designers choose, from a class of algorithms, the one that behaves most resiliently.
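
To make the workflow concrete, the following is a minimal sketch of the idea, not KULFI itself and not the detector developed in the paper. It assumes a single-bit flip in one array element as the fault model, uses bubble sort as a stand-in for the sorting family, and uses a plain sortedness predicate as a stand-in for a predicate-abstraction-based detector. All function names here are hypothetical.

```python
def flip_bit(value, bit):
    """Return value with one bit flipped (toy model of a transient fault)."""
    return value ^ (1 << bit)


def bubble_sort_with_fault(data, fault_step=None, bit=0):
    """Bubble sort that optionally corrupts a[j] just before comparison number fault_step."""
    a = list(data)
    n = len(a)
    step = 0
    for i in range(n):
        for j in range(n - 1 - i):
            if step == fault_step:
                a[j] = flip_bit(a[j], bit)  # inject the fault
            step += 1
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a


def is_sorted(a):
    """Detector predicate: every adjacent pair is in non-decreasing order."""
    return all(a[k] <= a[k + 1] for k in range(len(a) - 1))


def classify(data, fault_step, bit):
    """Classify one injection run as benign, detected, or silent corruption."""
    out = bubble_sort_with_fault(data, fault_step, bit)
    if out == sorted(data):
        return "benign"    # the fault did not change the final result
    if not is_sorted(out):
        return "detected"  # the sortedness predicate flags the output
    return "silent"        # output is sorted but holds a corrupted value


if __name__ == "__main__":
    data = [5, 3, 8, 1]
    for fault_step in range(6):  # 6 comparisons for 4 elements
        print(fault_step, classify(data, fault_step, bit=3))
```

In this sketch, a fault injected early tends to be absorbed by the remaining passes and yields a sorted but wrong output, which the sortedness predicate alone cannot catch. A fault injected late can leave the output unsorted and is flagged. Counting these outcomes per algorithm is one simple way to compare resilience within a family.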
