Improving the Accuracy of IR-Level Fault Injection

Fault injection (FI) is a commonly used experimental technique for evaluating the resilience of software techniques that tolerate hardware faults. Software-implemented FI can be performed at different levels of abstraction in the system stack. FI performed at the compiler's intermediate representation (IR) level has the advantage of being closer to the program under evaluation, which makes it easier to derive insights for the design of software fault-tolerance mechanisms. Unfortunately, it is not clear how accurate IR-level FI is relative to FI performed at the assembly code level, and prior work has presented contradictory findings. In this paper, we perform a comprehensive evaluation of the accuracy of IR-level FI across a range of benchmark programs and compiler optimization levels. Our results show that IR-level FI is as accurate as assembly-level FI for estimating silent data corruption (SDC) probability across different benchmarks and optimization levels. Further, we present a machine-learning-based technique for improving the accuracy of crash probability measurements made by IR-level FI, which takes advantage of an observed correlation between program crash probabilities and instructions that operate on memory address values. We find that, with this technique, IR-level FI achieves accuracy comparable to that of assembly-level FI for program crashes.
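As a rough illustration of the outcome categories mentioned above, the following toy sketch injects single bit flips into a small summation loop and classifies each run as benign, SDC, or crash. It is not the tooling or the machine-learning technique evaluated in the paper: the workload, the single-bit-flip fault model, the 32-bit value width, and all names (`flip_bit`, `run`, `classify`, `campaign`) are assumptions made for this example, and an out-of-range index stands in for a memory access violation.

```python
import random

WIDTH = 32  # assumed width of the corrupted value, in bits


def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted (single-bit-flip fault model)."""
    return value ^ (1 << bit)


def run(data, fault=None):
    """Toy workload: sum data[i] over all indices i.

    fault is (step, target, bit). At iteration `step`, one bit is flipped in
    the index ("addr") or in the loaded element ("data").
    """
    total = 0
    for step in range(len(data)):
        idx = step
        if fault and fault[0] == step and fault[1] == "addr":
            idx = flip_bit(idx, fault[2])
        val = data[idx]  # an out-of-range index raises IndexError
        if fault and fault[0] == step and fault[1] == "data":
            val = flip_bit(val, fault[2])
        total += val
    return total


def classify(data, fault):
    """Compare a faulty run against the fault-free (golden) run."""
    golden = run(data)
    try:
        out = run(data, fault)
    except IndexError:
        return "crash"  # stand-in for a segmentation fault
    return "benign" if out == golden else "sdc"


def campaign(data, target, trials=200, seed=0):
    """Inject one random fault per trial and count the outcomes."""
    rng = random.Random(seed)
    counts = {"benign": 0, "sdc": 0, "crash": 0}
    for _ in range(trials):
        fault = (rng.randrange(len(data)), target, rng.randrange(WIDTH))
        counts[classify(data, fault)] += 1
    return counts


if __name__ == "__main__":
    data = list(range(1, 9))
    for target in ("data", "addr"):
        print(target, campaign(data, target))
```

In this toy setting, flips in the loaded element never raise an exception and always alter the sum, while flips in the index mostly leave the valid range and abort the run. That loosely mirrors the correlation between crashes and address-operating instructions that the abstract describes, although real programs and real injectors behave far less cleanly.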
