An Efficient Silent Data Corruption Detection Method with Error-Feedback Control and Even Sampling for HPC Applications

Silent data corruption (SDC) is attracting increasing attention because it is expected to have a significant impact on exascale HPC applications. SDC faults are hazardous because they pass unnoticed by hardware and can lead to wrong computation results. In this work, we formulate SDC detection as a runtime one-step-ahead prediction problem and leverage multiple linear prediction methods to improve the detection results. The contributions are twofold: (1) we propose an error-feedback control model that reduces the prediction errors of different linear prediction methods, and (2) we propose a spatial-data-based even-sampling method that minimizes the detection overhead, in both memory and computation cost. We implement our algorithms in the Fault Tolerance Interface (FTI), a fault tolerance library with multiple checkpoint levels, so that users can conveniently protect their HPC applications against both SDC errors and fail-stop errors. We evaluate our approach by using large-scale traces from well-known HPC applications, as well as by running those applications in a real cluster environment. Experiments show that our error-feedback control model improves detection sensitivity by 34-189% for bit-flip memory errors injected at bit positions in the range [20,30], without any degradation in detection accuracy. Furthermore, our spatial-data even-sampling method reduces memory size by 33%, with only a slight and graceful degradation in detection sensitivity.
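
The detection idea summarized above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation described in the paper: the predictor is plain linear extrapolation, the feedback term simply adds the previous step's prediction error, sampling checks every `stride`-th data point, and the names `flip_bit`, `predict`, `detect`, `radius`, and `stride` are all hypothetical.

```python
import struct


def flip_bit(value, bit):
    """Flip one bit of a value stored as a 32-bit float (bit 0 = least significant)."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def predict(prev2, prev1, prev_error=0.0):
    """One-step-ahead linear extrapolation, corrected by the previous prediction error.

    Assumption: the feedback is a simple additive term; the paper's control
    model may be more elaborate.
    """
    return 2.0 * prev1 - prev2 + prev_error


def detect(step_minus_2, step_minus_1, current, prev_errors, radius, stride=1):
    """Return indices of sampled points whose value lies outside prediction +/- radius.

    Only every `stride`-th point is checked, so only those points need stored
    history (an assumed, simplified form of even sampling).
    """
    suspects = []
    for i in range(0, len(current), stride):
        expected = predict(step_minus_2[i], step_minus_1[i], prev_errors[i])
        if abs(current[i] - expected) > radius:
            suspects.append(i)
    return suspects
```

In this sketch, a bit flip in the exponent of a 32-bit float changes the value by a large factor, so the corrupted point falls far outside the predicted range. A smaller `radius` catches smaller corruptions but risks false alarms when the prediction error is large, which is why reducing the prediction error matters. A larger `stride` lowers memory and computation cost, but corruptions at unsampled points go unchecked.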
