Efficient soft error protection for commodity embedded microprocessors using profile information

Successive generations of processors use smaller transistors in the quest to make more powerful computing systems. It has been previously studied that smaller transistors make processors more susceptible to soft errors (transient faults caused by high energy particle strikes). Such errors can result in unexpected behavior and incorrect results. With smaller and cheaper transistors becoming pervasive in mainstream computing, it is necessary to protect these devices against soft errors; an increasing rate of faults necessitates the protection of applications running on commodity processors against soft errors. The existing methods of protecting against such faults generally have high area or performance overheads and thus are not directly applicable in the embedded design space. In order to protect against soft errors, the detection of these errors is a necessary first step so that a recovery can be triggered. To solve the problem of detecting soft errors cheaply, we propose a profiling-based software-only application analysis and transformation solution. The goal is to develop a low cost solution which can be deployed for off-the-shelf embedded processors. The solution works by intelligently duplicating instructions that are likely to affect the program output, and comparing results between original and duplicated instructions. The intelligence of our solution is garnered through the use of control flow, memory dependence, and value profiling to understand and exploit the common-case behavior of applications. Our solution is able to achieve 92% fault coverage with a 20% instruction overhead. This represents a 41% lower performance overhead than the best prior approaches with approximately the same fault coverage.

[1]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[2]  Régis Leveugle,et al.  Statistical fault injection: Quantified error and confidence , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[3]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[4]  Sarita V. Adve,et al.  mSWAT: Low-cost hardware fault detection and diagnosis for multicore systems , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[5]  E. Normand Single event upset at ground level , 1996 .

[6]  Sule Ozev,et al.  A mechanism for online diagnosis of hard faults in microprocessors , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[7]  Babak Falsafi,et al.  Efficient Resource Sharing in Concurrent Error Detecting Superscalar Microarchitectures , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[8]  Wei Liu,et al.  Using Register Lifetime Predictions to Protect Register Files Against Soft Errors , 2008 .

[9]  Priti Shankar,et al.  The Compiler Design Handbook: Optimizations and Machine Code Generation , 2002, The Compiler Design Handbook.

[10]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[11]  Todd M. Austin,et al.  A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor , 2003, MICRO.

[12]  Frank Mueller,et al.  Soft error protection via fault-resilient data representations , 2007 .

[13]  Sanjay J. Patel,et al.  ReStore: symptom based soft error detection in microprocessors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[14]  Albert Meixner,et al.  Argus: Low-Cost, Comprehensive Error Detection in Simple Cores , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[15]  Irith Pomeranz,et al.  Transient-fault recovery for chip multiprocessors , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[16]  Todd M. Austin,et al.  A fault tolerant approach to microprocessor design , 2001, 2001 International Conference on Dependable Systems and Networks.

[17]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[18]  Nathan L. Binkert,et al.  Network-Oriented Full-System Simulation using M5 , 2003 .

[19]  Shuguang Feng,et al.  Cost-efficient soft error protection for embedded microprocessors , 2006, CASES '06.

[20]  Mikko H. Lipasti,et al.  Characterization of silent stores , 2000, Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00622).

[21]  David I. August,et al.  Software-controlled fault tolerance , 2005, TACO.

[22]  T. May,et al.  Alpha-particle-induced soft errors in dynamic memories , 1979, IEEE Transactions on Electron Devices.

[23]  Amin Ansari,et al.  Shoestring: probabilistic soft error reliability on the cheap , 2010, ASPLOS XV.

[24]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[25]  Lisa Spainhower,et al.  IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective , 1999, IBM J. Res. Dev..

[26]  Thomas Rorie Mason LAMPVIEW: A LOOP-AWARE TOOLSET FOR FACILITATING PARALLELIZATION , 2009 .

[27]  Rajiv Gupta,et al.  Profile-Guided Compiler Optimizations , 2002, The Compiler Design Handbook.

[28]  Sanjay J. Patel,et al.  Characterizing the effects of transient faults on a high-performance processor pipeline , 2004, International Conference on Dependable Systems and Networks, 2004.

[29]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[30]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[31]  Lisa Spainhower,et al.  Commercial fault tolerance: a tale of two systems , 2004, IEEE Transactions on Dependable and Secure Computing.

[32]  Lorenzo Alvisi,et al.  Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.

[33]  Sanjay J. Patel,et al.  Y-branches: when you come to a fork in the road, take it , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[34]  Alan Wood,et al.  The impact of new technology on soft error rates , 2011, 2011 International Reliability Physics Symposium.