Fault Injection Experiments with the CLAMR Hydrodynamics Mini-App

In this paper, we present a resilience analysis of the impact of soft errors on CLAMR, a hydrodynamics mini-app for high-performance computing (HPC). We use F-SEFI, a fine-grained fault injection tool, to inject faults into the kernel routines of CLAMR. We demonstrate visually the impact of these faults, which are either benign (no impact on the results), cause silent data corruption (SDC), or cause the application to crash due to instabilities. Using F-SEFI, we quantify the probability that an injected fault leads CLAMR to each of these three outcomes. Finally, we explore how the application's fault characteristics depend on the point in simulation time at which the fault is injected. Overall, we find that 17% of the faults propagate into SDC and 24% into crashes.
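The three-way classification described above can be illustrated with a minimal sketch. It does not use F-SEFI or CLAMR: the kernel `run` is a hypothetical periodic smoothing stencil standing in for a hydrodynamics kernel, the fault model is a single bit flip in one IEEE-754 double, and a non-finite value in the final state is treated as a proxy for a crash. The names `flip_bit`, `run`, `classify`, `campaign`, and the tolerance `tol` are all assumptions made for illustration.

```python
import math
import random
import struct
from collections import Counter


def flip_bit(value, bit):
    """Flip one bit (0 = least significant) of a float's IEEE-754 double encoding."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def run(steps, fault=None, n=8):
    """Toy periodic smoothing kernel (a stand-in, not CLAMR).

    fault is None for a fault-free run, or a tuple (step, index, bit)
    naming when, where, and which bit to corrupt.
    """
    u = [0.0] * n
    u[n // 2] = 1.0
    for step in range(steps):
        if fault is not None and fault[0] == step:
            _, index, bit = fault
            u[index] = flip_bit(u[index], bit)
        # u[i - 1] wraps around at i == 0, giving periodic boundaries.
        u = [0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[(i + 1) % n] for i in range(n)]
    return u


def classify(golden, observed, tol=1e-9):
    """Compare a faulty run with the fault-free (golden) run."""
    if any(not math.isfinite(x) for x in observed):
        return "crash"  # assumed proxy: non-finite state counts as a crash
    if all(abs(g - o) <= tol for g, o in zip(golden, observed)):
        return "benign"
    return "sdc"


def campaign(trials, steps=4, n=8, seed=0):
    """Inject one random bit flip per trial and count the outcomes."""
    rng = random.Random(seed)
    golden = run(steps, n=n)
    counts = Counter()
    for _ in range(trials):
        fault = (rng.randrange(steps), rng.randrange(n), rng.randrange(64))
        counts[classify(golden, run(steps, fault, n))] += 1
    return counts
```

Dividing each count returned by `campaign` by the number of trials gives an empirical estimate of the outcome probabilities for this toy kernel. Those numbers are not comparable to the CLAMR results reported above. Varying the injection step in `fault` is one way to examine how the outcome depends on simulation time.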
