Silent error detection in numerical time-stepping schemes

Errors due to hardware or low-level software problems, if detected, can be fixed by various schemes, such as recomputation from a checkpoint. Silent errors are errors in application state that have escaped low-level error detection. At extreme scale, where machines can perform astronomically many operations per second, silent errors threaten the validity of computed results. We propose a new paradigm for detecting silent errors at the application level. Our central idea is to frequently compare computed values to those provided by a cheap checking computation, and to build error detectors based on the difference between the two output sequences. Numerical analysis provides us with usable checking computations for the solution of initial-value problems in ODEs and PDEs, arguably the most common problems in computational science. Here, we provide, optimize, and test methods based on Runge–Kutta and linear multistep methods for ODEs, and on implicit and explicit finite difference schemes for PDEs. We take the heat equation and Navier–Stokes equations as examples. In tests with artificially injected errors, this approach effectively detects almost all meaningful errors, without significant slowdown.
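The central idea, comparing each computed step against a cheap checking computation and flagging large differences, can be illustrated with a minimal sketch. It is not the detector developed in the paper: it assumes a scalar ODE, uses Heun's method as the main integrator and a forward Euler step as the checker, and applies a fixed threshold. The function names, the threshold, and the recompute-on-detection policy are all hypothetical choices made for illustration.

```python
import math


def heun_step(f, t, y, h):
    """One step of Heun's method (a 2nd-order Runge-Kutta scheme) for y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1)
    return y + 0.5 * h * (k1 + k2)


def euler_step(f, t, y, h):
    """One forward Euler step, used here as the cheap checking computation."""
    return y + h * f(t, y)


def integrate_with_detector(f, y0, t0, h, n_steps, threshold,
                            corrupt_at=None, corruption=0.0):
    """Integrate y' = f(t, y) and flag steps where integrator and checker disagree.

    corrupt_at and corruption are illustrative knobs that inject an artificial
    error into one step. Returns (final y, list of flagged step indices).
    """
    t, y = t0, y0
    flagged = []
    for n in range(n_steps):
        y_new = heun_step(f, t, y, h)
        if n == corrupt_at:
            y_new += corruption  # artificially injected silent error
        y_check = euler_step(f, t, y, h)
        # Without errors the two values differ by O(h^2); a much larger gap
        # is treated as a suspected silent error.
        if abs(y_new - y_check) > threshold:
            flagged.append(n)
            y_new = heun_step(f, t, y, h)  # assumed recovery policy: recompute the step
        t += h
        y = y_new
    return y, flagged


def decay(t, y):
    """Right-hand side of the test problem y' = -y."""
    return -y


if __name__ == "__main__":
    y_end, flagged = integrate_with_detector(
        decay, 1.0, 0.0, 0.01, 100, threshold=1e-3,
        corrupt_at=50, corruption=0.1)
    print("flagged steps:", flagged)
    print("error vs exp(-1):", abs(y_end - math.exp(-1.0)))
```

A fixed threshold is the simplest possible choice and only works when the expected gap between integrator and checker is known in advance; a practical detector would have to adapt the threshold to the observed sequence of differences.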
