gem5-Approxilyzer: An Open-Source Tool for Application-Level Soft Error Analysis

Modern systems are increasingly susceptible to soft errors in the field, and traditional redundancy-based mitigation techniques are too expensive to protect against all errors. Recent techniques, such as approximate computing and various low-cost resilience mechanisms, intelligently trade some inaccuracy in program output for lower energy, performance, and resiliency overheads. Realizing the full potential of these techniques requires a thorough understanding of how applications react to errors. Approxilyzer is a state-of-the-art tool that enables an accurate, efficient, and comprehensive analysis of how errors in almost all dynamic instructions in a program's execution affect the quality of the final program output. Although useful, its adoption is limited because it is implemented on the proprietary Simics infrastructure and targets only the SPARC ISA. We present gem5-Approxilyzer, a re-implementation of Approxilyzer using the open-source gem5 simulator. gem5-Approxilyzer can be extended to different ISAs, starting with x86 in this work. We show that gem5-Approxilyzer is both efficient (up to two orders of magnitude fewer error injections than a naive campaign) and accurate (92% on average in our experiments) in predicting the program's output quality in the presence of errors. We also compare the error profiles of five workloads under x86 and SPARC to further motivate the need for a tool like gem5-Approxilyzer.
