Evaluating and Accelerating High-Fidelity Error Injection for HPC

We address two important concerns for the analysis of the behavior of applications in the presence of hardware errors: (1) when is it important to model how hardware faults lead to erroneous values (instruction-level errors) with high fidelity, as opposed to using simple bit-flipping models, and (2) how to enable fast high-fidelity error injection campaigns, in particular when error detectors are employed. We present and verify a new nested Monte Carlo methodology for evaluating high-fidelity gate-level fault models and error-detector coverage, which is orders of magnitude faster than current approaches. We use that methodology to demonstrate that, without detectors, simple error models suffice for evaluating errors in 9 HPC benchmarks.

[1]  R. Allmon,et al.  Soft Error Susceptibilities of 22 nm Tri-Gate Devices , 2012, IEEE Transactions on Nuclear Science.

[2]  L. W. Massengill,et al.  Impact of technology scaling on the combinational logic soft error rate , 2014, 2014 IEEE International Reliability Physics Symposium.

[3]  Sarita V. Adve,et al.  Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[4]  Israel Koren,et al.  CAROL-FI: an Efficient Fault-Injection Tool for Vulnerability Evaluation of Modern HPC Parallel Accelerators , 2017, Conf. Computing Frontiers.

[5]  Sandia Report,et al.  Improving Performance via Mini-applications , 2009 .

[6]  Zainalabedin Navabi,et al.  Hierarchical fault simulation using behavioral and gate level hardware models , 2002, Proceedings of the 11th Asian Test Symposium, 2002. (ATS '02)..

[7]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[8]  Johan Karlsson,et al.  One Bit is (Not) Enough: An Empirical Study of the Impact of Single and Multiple Bit-Flip Errors , 2017, 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[9]  Sarita V. Adve,et al.  GangES: Gang error simulation for hardware resiliency evaluation , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[10]  Ravishankar K. Iyer,et al.  Hierarchical Simulation Approach to Accurate Fault Modeling for System Dependability Evaluation , 1999, IEEE Trans. Software Eng..

[11]  Franck Cappello,et al.  Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..

[12]  Sarita V. Adve,et al.  Accurate microarchitecture-level fault modeling for studying hardware faults , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[13]  Albert Meixner,et al.  Argus: Low-Cost, Comprehensive Error Detection in Simple Cores , 2008, IEEE Micro.

[14]  Régis Leveugle,et al.  Statistical fault injection: Quantified error and confidence , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[15]  Karthik Pattabiraman,et al.  LLFI : An Intermediate Code Level Fault Injector For Soft Computing Applications , 2013 .

[16]  Mattan Erez,et al.  Hamartia: A Fast and Accurate Error Injection Framework , 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W).

[17]  Elizabeth M. Rudnick,et al.  A Gate-Level Simulation Environment for Alpha-Particle-Induced Transient Faults , 1996, IEEE Trans. Computers.

[18]  John T. Daly,et al.  Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters , 2010, HPDC '10.

[19]  Andrew Siegel,et al.  XSBENCH - THE DEVELOPMENT AND VERIFICATION OF A PERFORMANCE ABSTRACTION FOR MONTE CARLO REACTOR ANALYSIS , 2014 .

[20]  Dimitris Gizopoulos,et al.  Anatomy of microarchitecture-level reliability assessment: Throughput and accuracy , 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[21]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[22]  Jaume Abella,et al.  Implementing End-to-End Register Data-Flow Continuous Self-Test , 2009, IEEE Transactions on Computers.

[23]  Ravishankar K. Iyer,et al.  An experimental study of soft errors in microprocessors , 2005, IEEE Micro.

[24]  Albert Meixner,et al.  Argus: Low-Cost, Comprehensive Error Detection in Simple Cores , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[25]  Karthikeyan Sankaralingam,et al.  Understanding the impact of gate-level physical reliability effects on whole program execution , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[26]  Gokcen Kestor,et al.  Understanding the propagation of transient errors in HPC applications , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[27]  Jinsuk Chung,et al.  Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[28]  Jacob A. Abraham,et al.  Quantitative evaluation of soft error injection techniques for robust system design , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[29]  Martin Schulz,et al.  REFINE: Realistic Fault Injection via Compiler-based Instrumentation for Accuracy, Portability and Speed , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[30]  Dimitris Gizopoulos,et al.  MeRLiN: Exploiting dynamic instruction behavior for fast and accurate microarchitecture level reliability assessment , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[31]  Scott A. Mahlke,et al.  Harnessing Soft Computations for Low-Budget Fault Tolerance , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[32]  Ian Karlin,et al.  LULESH 2.0 Updates and Changes , 2013 .

[33]  David Z. Pan,et al.  High-level synthesis of error detecting cores through low-cost modulo-3 shadow datapaths , 2015, 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[34]  Sarita V. Adve,et al.  Low-cost program-level detectors for reducing silent data corruptions , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[35]  Laura Monroe,et al.  Design, Use and Evaluation of P-FSEFI: A Parallel Soft Error Fault Injection Framework for Emulating Soft Errors in Parallel Applications , 2016, SimuTools.

[36]  Meeta Sharma Gupta,et al.  Understanding Soft Error Resiliency of Blue Gene/Q Compute Chip through Hardware Proton Irradiation and Software Fault Injection , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[37]  John T. Daly,et al.  A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..

[38]  Eric Cheng,et al.  CLEAR: Cross-layer exploration for architecting resilience: Combining hardware and software techniques to tolerate soft errors in processor cores , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[39]  Shinya Takamaeda-Yamazaki,et al.  Pyverilog: A Python-Based Hardware Design Processing Toolkit for Verilog HDL , 2015, ARC.

[40]  David H. Bailey,et al.  The Nas Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..

[41]  Dong Li,et al.  Classifying soft error vulnerabilities in extreme-Scale scientific applications using a binary instrumentation tool , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[42]  N. Seifert,et al.  Comparison of alpha-particle and neutron-induced combinational and sequential logic error rates at the 32nm technology node , 2009, 2009 IEEE International Reliability Physics Symposium.

[43]  Sarita V. Adve,et al.  Relyzer: exploiting application-level fault equivalence to analyze application resiliency to transient faults , 2012, ASPLOS XVII.

[44]  Karthik Pattabiraman,et al.  Quantifying the Accuracy of High-Level Fault Injection Techniques for Hardware Faults , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[45]  Stephen W. Keckler,et al.  SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation , 2017, 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[46]  James Tschanz,et al.  A Low Cost Scheme for Reducing Silent Data Corruption in Large Arithmetic Circuits , 2008, 2008 IEEE International Symposium on Defect and Fault Tolerance of VLSI Systems.

[47]  Sriram Krishnamoorthy,et al.  Towards Resiliency Evaluation of Vector Programs , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[48]  Ganesh Gopalakrishnan,et al.  Towards Formal Approaches to System Resilience , 2013, 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing.

[49]  Pradip Bose,et al.  BRAVO: Balanced Reliability-Aware Voltage Optimization , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[50]  Song Fu,et al.  F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[51]  Norbert Wehn,et al.  A Cross-Layer Technology-Based Study of How Memory Errors Impact System Resilience , 2013, IEEE Micro.

[52]  Prabhakar Kudva,et al.  Soft-error resilience of the IBM POWER6 processor , 2008, IBM J. Res. Dev..

[53]  Florent de Dinechin,et al.  Designing Custom Arithmetic Data Paths with FloPoCo , 2011, IEEE Design & Test of Computers.

[54]  Laura Monroe,et al.  SDC is in the Eye of the Beholder: A Survey and Preliminary Study , 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W).