Physics-Based Checksums for Silent-Error Detection in PDE Solvers

We discuss techniques for efficient local detection of silent data corruption in parallel scientific computations, leveraging physical quantities such as momentum and energy that may be conserved by discretized PDEs. The conserved quantities are analogous to “algorithm-based fault tolerance” checksums for linear algebra but, due to their physical foundation, are applicable to both linear and nonlinear equations and have efficient local updates based on fluxes between subdomains. These physics-based checksums enable precise intermittent detection of errors and recovery by rollback to a checkpoint, with very low overhead when errors are rare. We present applications to both explicit hyperbolic and iterative elliptic (unstructured finite-element) solvers with injected memory bit flips.
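The flux-based checksum update described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a 1D periodic linear advection problem discretized with a first-order upwind finite-volume scheme, and the names `step`, `flip_bit`, and `run`, the tolerance, and the injected bit position are all hypothetical choices made for illustration. Each subdomain keeps a running "mass" checksum that is updated only from the fluxes through its two boundary faces, then compared against the directly summed mass.

```python
import numpy as np

def step(u, c):
    """One upwind finite-volume step for u_t + a u_x = 0 with periodic
    boundaries; c = a*dt/dx is assumed to lie in (0, 1]."""
    flux = c * u                          # amount leaving each cell through its right face
    return u - flux + np.roll(flux, 1), flux

def flip_bit(u, i, bit):
    """Flip one bit of the float64 entry u[i] in place (simulated memory error)."""
    u.view(np.uint64)[i] ^= np.uint64(1) << np.uint64(bit)

def run(n=64, nsub=4, steps=50, c=0.5, inject_at=None, tol=1e-9):
    """Advance the solution, checking a per-subdomain mass checksum each step.
    Returns (step, [corrupted subdomains]) on detection, or None if clean."""
    x = (np.arange(n) + 0.5) / n
    u = 1.0 + 0.5 * np.sin(2.0 * np.pi * x)
    m = n // nsub                                   # cells per subdomain
    checksum = u.reshape(nsub, m).sum(axis=1)       # initial mass of each subdomain
    for k in range(steps):
        u, flux = step(u, c)
        out = flux[m - 1::m]                        # flux through each subdomain's right boundary
        inflow = np.roll(out, 1)                    # left neighbor's outflow is this subdomain's inflow
        checksum = checksum + inflow - out          # local update from boundary fluxes only
        if inject_at is not None and k == inject_at:
            flip_bit(u, 5, 52)                      # corrupt cell 5 (lowest exponent bit)
        err = np.abs(u.reshape(nsub, m).sum(axis=1) - checksum)
        bad = np.nonzero(err > tol)[0]
        if bad.size:
            return k, bad.tolist()                  # caller would roll back to a checkpoint here
    return None

print(run())               # None
print(run(inject_at=10))   # (10, [0])
```

Because the check is local, the corrupted subdomain is identified directly (here subdomain 0, which owns cell 5). This sketch checks on every step for simplicity; checking only intermittently, as the abstract describes, would trade detection latency for lower overhead.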
