A Practical Approach for Handling Soft Errors in Iterative Applications

As feature sizes shrink, there is a growing need to handle soft errors at the software level. This paper focuses on iterative scientific applications, particularly PDE solvers. After empirically studying the impact of bit flips on the convergence and correctness of these applications and analyzing the underlying numerical algorithms, we propose a method for improving their accuracy in the presence of silent data corruptions. We show that changes in the value of the residue can serve as a signature that detects the soft errors with the most negative impact on the applications. Our analysis also shows that, for iterative solvers, bit flips in the later part of the computation are much more likely to affect the final results. For such cases, we propose partial replication to improve accuracy without very large overheads. Applying our approach to five scientific applications, we find that our signature-based method eliminates all infinite loops caused by bit flips, reduces the error in the final results by up to 99%, and incurs less than 6% overhead (with an additional 24% overhead for checkpointing and restart). For two of the applications, combining partial replication with our signature analysis reduces the error by as much as 99.9%.
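The idea of using the residue as a signature can be illustrated with a minimal sketch. It is not the paper's implementation: the Jacobi solver, the 1-norm residue, the threshold `jump_factor`, the per-iteration in-memory checkpoint, and all function names are assumptions chosen for illustration. A single bit flip is injected into the iterate, and an iteration is rolled back whenever the residue jumps sharply (or becomes non-finite) relative to the last accepted iterate.

```python
import struct

def flip_bit(x, bit):
    """Flip one bit (0 = least significant) of a 64-bit float."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def residue_norm(A, x, b):
    """1-norm of b - A x; a sum (unlike max) propagates NaN and inf."""
    return sum(abs(bi - sum(aij * xj for aij, xj in zip(row, x)))
               for row, bi in zip(A, b))

def jacobi_step(A, x, b):
    """One Jacobi sweep for A x = b."""
    n = len(x)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def solve(A, b, tol=1e-10, max_iters=500, jump_factor=10.0, fault=None):
    """Jacobi iteration with a residue-based check (illustrative only).

    fault = (iteration, index, bit) injects one transient bit flip;
    jump_factor is a hypothetical threshold, not a value from the paper.
    Returns the final iterate and the number of rollbacks performed.
    """
    x = [0.0] * len(b)
    # Assumption: checkpoint the last accepted iterate on every iteration.
    checkpoint = (list(x), residue_norm(A, x, b))
    detections = 0
    for it in range(max_iters):
        x = jacobi_step(A, x, b)
        if fault is not None and it == fault[0]:
            x[fault[1]] = flip_bit(x[fault[1]], fault[2])
            fault = None  # the error is transient: it occurs once
        r = residue_norm(A, x, b)
        # Signature: the residue jumped (or is NaN/inf), so roll back.
        if not (r <= jump_factor * checkpoint[1]):
            x = list(checkpoint[0])
            detections += 1
            continue
        checkpoint = (list(x), r)
        if r < tol:
            break
    return x, detections

# Small diagonally dominant tridiagonal system whose exact solution is all ones.
A = [[4.0, -1.0, 0.0, 0.0, 0.0],
     [-1.0, 4.0, -1.0, 0.0, 0.0],
     [0.0, -1.0, 4.0, -1.0, 0.0],
     [0.0, 0.0, -1.0, 4.0, -1.0],
     [0.0, 0.0, 0.0, -1.0, 4.0]]
b = [3.0, 2.0, 2.0, 2.0, 3.0]

for label, fault in [("fault-free", None),
                     ("exponent bit flip", (5, 2, 62)),
                     ("lowest mantissa bit flip", (5, 2, 0))]:
    x, detections = solve(A, b, fault=fault)
    print(label, "rollbacks:", detections, "max error:", max(abs(v - 1.0) for v in x))
```

In this toy setting, a flip in a high exponent bit makes the residue jump and triggers a rollback, while a flip in the lowest mantissa bit stays below the threshold and is simply absorbed by later iterations. A real deployment would checkpoint far less often and would need a threshold tuned to the solver's normal residue behavior.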
