Understanding Ineffectiveness of the Application-Level Fault Injection

Extreme-scale applications are at significant risk of being hit by soft errors on supercomputers, as the scale of these systems and their component density continue to increase. To better understand soft error vulnerabilities in those applications, application-level fault injection is widely employed for evaluation. This poster reveals that application-level fault injection has some inherent uncertainties due to the random nature of fault injection. First, the fault injection result correlates strongly with the number of fault injection tests, and it is uncertain how many tests are enough. Second, for a given application, the fault injection result can vary as the input problem varies, so it is uncertain how to interpret the result. These uncertainties can make fault injection ineffective for accurately modeling application vulnerability.
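
Both uncertainties can be illustrated with a small simulation. The sketch below is not the poster's methodology: `toy_kernel`, `run_campaign`, and `margin_of_error` are hypothetical names introduced here, the "application" is just a maximum over a list of 32-bit words, and the error bar is the textbook normal-approximation interval for a proportion, chosen only for illustration. Under those assumptions, it shows how the observed corruption rate might depend on the number of injected faults and on the input data.

```python
import math
import random

WORD_BITS = 32  # assumption: each data element is treated as a 32-bit word


def toy_kernel(data):
    """Hypothetical stand-in for an application: report the largest element."""
    return max(data)


def run_campaign(data, num_tests, rng):
    """Run num_tests injections, each flipping one random bit of one random
    element, and return the fraction of runs whose output differs from the
    fault-free (golden) output."""
    golden = toy_kernel(data)
    corrupted = 0
    for _ in range(num_tests):
        faulty = list(data)                 # fresh copy for every test
        idx = rng.randrange(len(faulty))    # which element is hit
        bit = rng.randrange(WORD_BITS)      # which bit is flipped
        faulty[idx] ^= 1 << bit
        if toy_kernel(faulty) != golden:
            corrupted += 1
    return corrupted / num_tests


def margin_of_error(rate, num_tests, z=1.96):
    """Half-width of the normal-approximation interval for an observed
    proportion (z = 1.96 corresponds to roughly 95% confidence)."""
    return z * math.sqrt(rate * (1.0 - rate) / num_tests)


def main():
    rng = random.Random(0)  # fixed seed so repeated runs are comparable
    inputs = {
        "small values": [rng.randrange(1 << 8) for _ in range(64)],
        "large values": [rng.randrange(1 << 24) for _ in range(64)],
    }
    for name, data in inputs.items():
        for num_tests in (50, 500, 5000):
            rate = run_campaign(data, num_tests, rng)
            moe = margin_of_error(rate, num_tests)
            print(f"{name:12s} tests={num_tests:5d} "
                  f"rate={rate:.3f} +/- {moe:.3f}")


if __name__ == "__main__":
    main()
```

In this toy setting the error bar shrinks roughly with the square root of the number of tests, and the two inputs can yield different rates for the same kernel, which mirrors the two sources of uncertainty described above. Real applications and real injectors may behave quite differently.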
