Design, Use and Evaluation of P-FSEFI: A Parallel Soft Error Fault Injection Framework for Emulating Soft Errors in Parallel Applications

Future exascale application programmers and users need to quantify an application's resilience and vulnerability to soft errors before running their codes on production supercomputers, because of the cost of failures and the hazards of silent data corruption. Absent a deep understanding of a particular application's resiliency, vulnerability is commonly evaluated with fault injection tools at either the software or the hardware level. Hardware fault injection, while the most realistic, is limited to customized vendor chips, and applications usually cannot be evaluated at scale. Software fault injection is more practical and efficient, and it is the approach that many researchers use as a reasonable approximation. With a sufficiently sophisticated software fault injection framework, an application can be studied to see how it would handle many of the errors that manifest at the application level. Using such a tool, a developer can progressively improve resilience at the targeted locations they believe are important for their target hardware.
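As a rough illustration of what software-level fault injection means in practice, the sketch below flips a single bit in the IEEE-754 encoding of one operand of a summation and reports how far the result drifts from the fault-free run. This is not P-FSEFI's mechanism, which the abstract does not describe; the function names `flip_bit` and `sum_with_fault` are hypothetical and chosen only for this example.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a float's 64-bit IEEE-754 encoding."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as an unsigned 64-bit integer, XOR the chosen bit,
    # then reinterpret the result as a double again.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def sum_with_fault(values, index: int, bit: int) -> float:
    """Sum the values after corrupting one bit of values[index] (hypothetical helper)."""
    faulty = list(values)
    faulty[index] = flip_bit(faulty[index], bit)
    return sum(faulty)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    golden = sum(data)  # fault-free reference result
    for bit in (0, 52, 63):  # low mantissa bit, low exponent bit, sign bit
        observed = sum_with_fault(data, index=0, bit=bit)
        print(f"bit {bit:2d}: result={observed!r}, deviation={abs(observed - golden)!r}")
```

Comparing each faulty result against the fault-free one is the simplest way to separate flips that barely change the output from flips that silently corrupt it; a real framework would also have to choose injection sites and times, and handle crashes and hangs.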
