Characterization of the Impact of Soft Errors on Iterative Methods

Soft errors caused by transient bit flips can significantly alter an application's behavior. This has motivated an array of techniques that detect, isolate, and correct soft errors at the microarchitectural, architectural, compiler, or application level to minimize their impact on the executing application. The first step toward designing good error detection and correction techniques is understanding an application's vulnerability to soft errors. In this paper, we present the first comprehensive characterization of the impact of soft errors on the convergence characteristics of six iterative methods using application-level fault injection. In particular, we consider the use of iterative methods to incrementally solve a linear system of equations, which constitutes the core kernel in many scientific applications. We analyze the impact of soft errors in terms of the type of error (single- vs. multi-bit), the distribution and location of the bits affected, the data structure and statement impacted, and the variation with time. In addition to clarifying the vulnerability of iterative solvers to soft errors, this characterization can aid the design of fault injection campaigns that ensure systematic coverage.
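As a rough illustration of what application-level fault injection into an iterative solver can look like, the sketch below flips a single bit of one entry of the solution vector during a Jacobi iteration and compares the iteration count against a fault-free run. Everything here is an assumption made for illustration: the choice of Jacobi, the helper names `flip_bit` and `jacobi`, the `(iteration, index, bit)` fault descriptor, and the 3x3 test system are hypothetical and are not taken from the paper's methodology.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 are the
    exponent, and bit 63 is the sign.
    """
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def jacobi(A, b, tol=1e-10, max_iter=10_000, fault=None):
    """Solve A x = b with Jacobi iteration and return (x, iterations).

    `fault` is an optional (iteration, index, bit) tuple (a hypothetical
    descriptor): at the start of that iteration, the given bit of
    x[index] is flipped to emulate a transient soft error.
    """
    n = len(b)
    x = [0.0] * n
    for k in range(1, max_iter + 1):
        if fault is not None and fault[0] == k:
            _, idx, bit = fault
            x[idx] = flip_bit(x[idx], bit)
        # One Jacobi sweep: each entry is updated from the previous iterate.
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        # Infinity norm of the residual A x - b as the convergence test.
        residual = max(
            abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n)
        )
        if residual < tol:
            return x, k
    return x, max_iter


# Small diagonally dominant system whose exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]

x_clean, iters_clean = jacobi(A, b)
# Flip the sign bit (bit 63) of x[1] at the start of iteration 5.
x_faulty, iters_faulty = jacobi(A, b, fault=(5, 1, 63))
print(iters_clean, iters_faulty)
```

Sweeping the fault descriptor over iterations, vector indices, and bit positions gives a simple campaign along the dimensions the abstract lists (bit location, data structure, and time). Note that flipping a high exponent bit can produce an infinity or NaN, in which case this sketch simply runs until `max_iter`.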
