Understanding soft error propagation using efficient vulnerability-driven fault injection

Extreme CMOS scaling is expected to significantly impact the reliability of future microprocessors, prompting recent research effort on low-cost hardware-software cross-layer reliability solutions. To evaluate such solutions, statistical fault injection (SFI) is often used to estimate the error coverage of the underlying method. Unfortunately, because a significant number of the errors injected by SFI are derated, the evaluation becomes less rigorous and less efficient. This paper makes the observation that many derated errors can be gracefully avoided, allowing the fault injection campaign to focus on likely non-derated faults that stress the method-under-test. We propose a biased injection framework called CriticalFault that employs vulnerability analysis to map out relevant faults for stress testing. With CriticalFault, our results show that the injection space is reduced by 29%, and that 59% of the biased injections cause either software aborts or silent data corruptions; both are improvements over SFI. Moreover, we characterize different propagation behaviors of these non-derated faults and discuss the implications for designing future cross-layer solutions. Overall, not only is CriticalFault highly effective in identifying relevant test cases for current systems, but it also allows reliability researchers and engineers to conduct more in-depth and meaningful analysis when developing future reliability solutions.
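The idea of biasing injections away from derated faults can be illustrated with a small sketch. It is not the CriticalFault implementation: it assumes a hypothetical toy trace of register writes and reads, and it treats a fault as relevant only if the corrupted register value is read before it is overwritten. The names `vulnerable_intervals`, `biased_sites`, and `sample_sites` are invented for this illustration.

```python
import random


def vulnerable_intervals(trace):
    """Map each register to the cycle ranges where a bit flip would later be read.

    trace: list of (cycle, op, reg) tuples sorted by cycle, op is "W" or "R".
    Each range is half-open: [write_cycle, last_read_cycle).
    """
    last_write, last_read, intervals = {}, {}, {}
    for cycle, op, reg in trace:
        if op == "W":
            # A new write ends the previous live range; keep it only if it was read.
            if reg in last_read:
                intervals.setdefault(reg, []).append((last_write[reg], last_read.pop(reg)))
            last_write[reg] = cycle
        elif op == "R" and reg in last_write:
            last_read[reg] = cycle
    # Close ranges that are still live at the end of the trace.
    for reg, end in last_read.items():
        intervals.setdefault(reg, []).append((last_write[reg], end))
    return intervals


def is_vulnerable(intervals, reg, cycle):
    """True if a fault in `reg` at `cycle` falls inside a read-before-overwrite range."""
    return any(start <= cycle < end for start, end in intervals.get(reg, []))


def biased_sites(intervals):
    """Enumerate every (reg, cycle) injection site that is not trivially derated."""
    return [(reg, c)
            for reg, spans in sorted(intervals.items())
            for start, end in spans
            for c in range(start, end)]


def sample_sites(sites, n, seed=0):
    """Draw up to n distinct injection sites from the biased space."""
    return random.Random(seed).sample(sites, min(n, len(sites)))


# Hypothetical trace: r2 is written but never read, so all its faults are derated.
trace = [(0, "W", "r1"), (2, "W", "r2"), (3, "R", "r1"),
         (5, "W", "r1"), (8, "W", "r1"), (9, "R", "r1")]
intervals = vulnerable_intervals(trace)
sites = biased_sites(intervals)
full_space = 2 * 10  # 2 registers x 10 cycles under uniform sampling
reduction = 1 - len(sites) / full_space
```

In this toy trace, uniform sampling would spend most injections on sites where the flipped value is never consumed, while the biased list keeps only the sites that can propagate. Real vulnerability analysis covers far more state and masking effects than register liveness, so the numbers here say nothing about the results reported above.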
