Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation

High-performance computing is a powerful tool that allows scientists to study complex natural phenomena. Extreme-scale supercomputers promise performance orders of magnitude higher than that of current systems. However, power constraints in future exascale systems might limit the level of resilience of those machines. In particular, data could be corrupted silently, that is, without the hardware detecting the corruption. This situation is clearly unacceptable: simulation results must stay within the error margin specified by the user. In this paper, we exploit multivariate interpolation to detect and correct data corruption in stencil applications. We evaluate this technique with a turbulent fluid application, and we demonstrate that the prediction error using multivariate interpolation is on the order of 0.01. Our results show that this mechanism can detect and correct the most important corruptions and keep the error deviation under 1% during the entire execution while injecting one corruption per minute. In addition, we stress-test the detector by injecting more than ten corruptions per minute and observe that our strategy allows the application to produce results with an error deviation under 10% even in such a stressful scenario.
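As a rough illustration of the idea of interpolation-based detection and correction, the sketch below predicts each interior point of a 2D grid from its four axis-aligned neighbours, flags points whose value deviates from the prediction by more than a threshold, and replaces flagged values with the prediction. The function names `predict` and `detect_and_correct`, the four-neighbour average, and the fixed `threshold` are assumptions made for this example only; the abstract does not specify the interpolation scheme or detection rule used in the paper.

```python
def predict(grid, i, j):
    """Estimate grid[i][j] as the mean of its four axis-aligned neighbours.

    Hypothetical predictor: a simple two-dimensional interpolation chosen
    for illustration, not necessarily the scheme used in the paper.
    """
    return (grid[i - 1][j] + grid[i + 1][j]
            + grid[i][j - 1] + grid[i][j + 1]) / 4.0


def detect_and_correct(grid, threshold):
    """Return (corrected copy of grid, list of flagged (i, j) positions).

    An interior point is flagged when it differs from its interpolated
    prediction by more than `threshold` (an assumed, user-chosen bound).
    Boundary points are left untouched in this sketch.
    """
    rows, cols = len(grid), len(grid[0])

    # Detection pass: compare every interior point with its prediction.
    flagged = []
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            if abs(grid[i][j] - predict(grid, i, j)) > threshold:
                flagged.append((i, j))

    # Correction pass: overwrite flagged points with their prediction,
    # computed from the original (unmodified) grid.
    corrected = [row[:] for row in grid]
    for i, j in flagged:
        corrected[i][j] = predict(grid, i, j)
    return corrected, flagged
```

For example, on a smooth 5x5 field with a single perturbed interior value, calling `detect_and_correct(field, threshold)` with a threshold between one quarter of the perturbation and the full perturbation should flag only that point. A corruption also shifts the predictions of its neighbours by a quarter of its magnitude, so a very large corruption can cause neighbouring points to be flagged as well; a practical detector would need to handle that case.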
