In-Situ Mitigation of Silent Data Corruption in PDE Solvers

We present algorithmic techniques for parallel PDE solvers that exploit the numerical smoothness of physics simulations to detect and correct silent data corruption within local computations. As an initial model of such silent hardware errors, which are a concern at extreme scale, we inject DRAM bit flips. Our mitigation approach generalizes previously developed "robust stencils" and uses modified linear algebra operations that replace large outlier values by spatial interpolation. Prototype implementations for 1D hyperbolic and 3D elliptic solvers, tested on up to 2048 cores, show that this mitigation allows the solvers to tolerate bit-flip rates that are orders of magnitude higher. The runtime overhead of the approach generally decreases with greater solver scale and complexity, falling to no more than a few percent in some cases. A key advantage is that silent data corruption can be handled transparently while the data is in cache, which reduces the cost of false-positive detections compared to rollback approaches.
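The idea of replacing large outliers by spatial interpolation can be illustrated with a small sketch. This is not the paper's implementation: the function names `flip_bit` and `repair_outliers`, the periodic 1D grid, the median-of-neighbours test, and the tolerance `tol` are all assumptions chosen for illustration, and the sketch assumes corrupted values are isolated (at most one per neighbourhood).

```python
import math
import statistics
import struct


def flip_bit(x, bit):
    """Flip one bit (0 = least significant) of the IEEE 754 double encoding of x."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def repair_outliers(u, tol):
    """Return (repaired copy of u, replaced indices) on a periodic 1D grid.

    Hypothetical helper: assumes a smooth field and isolated corruption.
    """
    n = len(u)
    repaired = list(u)
    replaced = []
    for i in range(n):
        # Median of the four nearest neighbours: a single corrupted neighbour
        # ends up at one end of the sorted values and does not shift the median.
        neighbours = [u[(i + k) % n] for k in (-2, -1, 1, 2)]
        finite = [v for v in neighbours if math.isfinite(v)]
        reference = statistics.median(finite)
        if not math.isfinite(u[i]) or abs(u[i] - reference) > tol:
            # Replace with linear interpolation of the two adjacent points,
            # read from the original array and assumed to be uncorrupted.
            repaired[i] = 0.5 * (u[(i - 1) % n] + u[(i + 1) % n])
            replaced.append(i)
    return repaired, replaced


# Smooth test field: one period of a sine wave on 64 points.
n = 64
clean = [math.sin(2.0 * math.pi * i / n) for i in range(n)]

# Inject a single bit flip into the top exponent bit of one value.
corrupted = list(clean)
corrupted[10] = flip_bit(corrupted[10], 62)

repaired, replaced = repair_outliers(corrupted, tol=0.1)
print("replaced indices:", replaced)
print("error after repair:", abs(repaired[10] - clean[10]))
```

The median is used for detection so that a corrupted value does not cause its clean neighbours to be flagged as well. A plain neighbour average would be pulled toward the outlier. The tolerance has to exceed the largest deviation a smooth, uncorrupted field can produce on the given grid, so in practice it would depend on resolution and on the solution itself.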
