Hamartia: A Fast and Accurate Error Injection Framework

The single bit-flip has been the most popular error model in fault-injection-based resilience studies. We use RTL gate-level fault injection to show that this model fails to cover many realistic hardware faults. Specifically, single-event transients in combinational logic and single-event upsets in pipeline latches can lead to complex multi-bit errors at the architecture level. Although accurate, RTL simulation is too slow to evaluate application-level resilience. To strike a balance between model accuracy and injection speed, we refine the concept of hierarchical injection to prune faults with known outcomes, saving 62% of program runs at a 2% margin of error on average across 9 benchmark programs. Our hierarchical error injector is both accurate and fast because it sources realistic error patterns through on-demand RTL gate-level fault injection. Our tool outperforms state-of-the-art assembly-level and compiler-based error injectors by up to 6X, while providing higher fidelity.
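To make the distinction between the two error models concrete, the following is a minimal Python sketch, not the framework's implementation. It contrasts a single bit-flip with an arbitrary multi-bit error pattern applied to a 64-bit architectural value. The function names, the 64-bit word width, and the example pattern are all assumptions chosen for illustration; in the approach described above, realistic patterns would come from RTL gate-level fault injection rather than being hard-coded.

```python
import random

WORD_BITS = 64                  # assumed register width for this illustration
MASK = (1 << WORD_BITS) - 1


def inject_single_bit_flip(value, rng):
    """Flip one uniformly chosen bit of the value (single bit-flip model)."""
    bit = rng.randrange(WORD_BITS)
    return (value ^ (1 << bit)) & MASK


def inject_error_pattern(value, pattern):
    """XOR an arbitrary error pattern into the value; several bits may change."""
    return (value ^ pattern) & MASK


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


rng = random.Random(0)
golden = 0x00000000000000FF

# Single bit-flip: exactly one bit differs from the golden value.
single = inject_single_bit_flip(golden, rng)

# Hypothetical multi-bit pattern (made up for this sketch, not measured data).
hypothetical_pattern = 0x0000000000000F01
multi = inject_error_pattern(golden, hypothetical_pattern)

print(hamming_distance(golden, single))
print(hamming_distance(golden, multi))
```

A single bit-flip injector can only produce corrupted values at Hamming distance one from the correct value, whereas a pattern-based injector can reproduce any corruption expressible as an XOR mask, which is why the two models can cover different sets of faults.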
