Relyzer: Application Resiliency Analyzer for Transient Faults

Future microprocessors need low-cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware faults by deploying low-cost software-level symptom monitors. However, there remains a nonnegligible risk that several faults might escape these detectors to produce silent data corruptions (SDCs). Evaluating and bounding SDCs is, therefore, crucial for low-cost resiliency solutions. The authors present Relyzer, an approach that can systematically analyze all application fault sites and identify virtually all SDC-causing program locations. Instead of performing fault injections on all possible application-level fault sites, which is impractical, Relyzer carefully picks a small subset. It employs novel fault-pruning techniques that reduce the number of fault sites by either predicting their outcomes or showing them equivalent to others. Results show that 99.78 percent of faults are pruned across 12 studied workloads, reducing the complete application resiliency evaluation time by 2 to 6 orders of magnitude. Relyzer, for the first time, achieves the capability to list virtually all SDC-vulnerable program locations, which is critical in designing low-cost application-centric resiliency solutions. Relyzer also opens new avenues of research in designing error-resilient programming models as well as even faster (and simpler) evaluation methodologies.

[1]  Christian Bienia,et al.  PARSEC 2.0: A New Benchmark Suite for Chip-Multiprocessors , 2009 .

[2]  Sarita V. Adve,et al.  Accurate microarchitecture-level fault modeling for studying hardware faults , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[3]  Albert Meixner,et al.  Argus: Low-Cost, Comprehensive Error Detection in Simple Cores , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[4]  Ravishankar K. Iyer,et al.  SymPLFIED: Symbolic program-level fault injection and error detection framework , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[5]  Ravishankar K. Iyer,et al.  Application-based metrics for strategic placement of detectors , 2005, 11th Pacific Rim International Symposium on Dependable Computing (PRDC'05).

[6]  Sarita V. Adve,et al.  Low-cost program-level detectors for reducing silent data corruptions , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[7]  Sanjay J. Patel,et al.  ReStore: symptom based soft error detection in microprocessors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[8]  Shubhendu S. Mukherjee,et al.  Perturbation-based Fault Screening , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[9]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[10]  Sarita V. Adve,et al.  Trace-based microarchitecture-level diagnosis of permanent hardware faults , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[11]  Massimo Violante,et al.  Soft-error detection using control flow assertions , 2003, Proceedings 18th IEEE Symposium on Defect and Fault Tolerance in VLSI Systems.

[12]  Amin Ansari,et al.  Shoestring: probabilistic soft error reliability on the cheap , 2010, ASPLOS XV.

[13]  Sarita V. Adve,et al.  Using likely program invariants to detect hardware errors , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[14]  Ravishankar K. Iyer,et al.  An end-to-end approach for the automatic derivation of application-aware error detectors , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.

[15]  David R. Kaeli,et al.  Eliminating microarchitectural dependency from Architectural Vulnerability , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[16]  Ravishankar K. Iyer,et al.  Dynamic Derivation of Application-Specific Error Detectors and their Implementation in Hardware , 2006, 2006 Sixth European Dependable Computing Conference.

[17]  David I. August,et al.  Software-controlled fault tolerance , 2005, TACO.

[18]  Alfredo Benso,et al.  Data criticality estimation in software applications , 2003, International Test Conference, 2003. Proceedings. ITC 2003..

[19]  Sarita V. Adve,et al.  mSWAT: Low-cost hardware fault detection and diagnosis for multicore systems , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[20]  Shekhar Y. Borkar,et al.  Designing reliable systems from unreliable components: the challenges of transistor variability and degradation , 2005, IEEE Micro.

[21]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[22]  Huiyang Zhou,et al.  Unified Architectural Support for Soft-Error Protection or Software Bug Detection , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[23]  Sanjay J. Patel,et al.  ReStore: Symptom-Based Soft Error Detection in Microprocessors , 2006, IEEE Trans. Dependable Secur. Comput..

[24]  Sarita V. Adve,et al.  Detecting and recovering from in-core hardware faults through software anomaly treatment , 2011 .