Soft Error Detection for Iterative Applications Using Offline Training

Silent data corruption (SDC) caused by soft errors is a growing challenge for exascale systems as core counts increase and feature sizes shrink. In recent years, a variety of soft error handling methods have been proposed, including replicated computation, feature-based runtime detection, and algorithm-level protection. However, these methods are relatively expensive, insufficiently accurate, or not applicable to certain applications. For convergent iterative applications, we observe that the progression of residual values carries a signature of SDC that is specific to the application but independent of the input dataset size. Based on this observation, we explore a different approach to soft error detection: a machine learning technique is used to train a model of the application offline on representative inputs, and the model is then used for online detection, even on a different dataset. Our experimental evaluation shows that the method is low-cost and effective, and that it outperforms purely online detection.
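
The offline-training and online-detection split described above can be illustrated with a minimal sketch. It is not the paper's method: a simple worst-case threshold on the ratio of consecutive residual norms stands in for the machine learning model, the application is a Jacobi solver on a hypothetical tridiagonal system, and the fault is a large additive perturbation rather than a real bit flip. All function names, the safety margin, and the problem sizes are assumptions chosen for illustration.

```python
import math

def residual_norm(x, b):
    """2-norm of b - A x for the assumed matrix A = tridiag(-1, 4, -1)."""
    n = len(x)
    total = 0.0
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        r = b[i] - (4.0 * x[i] - left - right)
        total += r * r
    return math.sqrt(total)

def jacobi_residuals(n, inject_at=None, tol=1e-8, max_iters=200):
    """Run Jacobi on an n-by-n system and return the residual norm history.

    If inject_at is given, one entry of x is perturbed after that
    iteration to mimic a silent data corruption (hypothetical fault model).
    """
    b = [1.0] * n
    x = [0.0] * n
    norms = [residual_norm(x, b)]
    for k in range(max_iters):
        x = [(b[i]
              + (x[i - 1] if i > 0 else 0.0)
              + (x[i + 1] if i < n - 1 else 0.0)) / 4.0
             for i in range(n)]
        if k == inject_at:
            x[n // 2] += 1.0e3  # injected corruption
        norms.append(residual_norm(x, b))
        if norms[-1] < tol:
            break
    return norms

def ratios(norms):
    """Ratios of consecutive residual norms: the feature used here."""
    return [cur / prev for prev, cur in zip(norms, norms[1:])]

def train_threshold(sizes, margin=1.5):
    """Offline step: learn the worst clean ratio over small training inputs."""
    worst = max(max(ratios(jacobi_residuals(n))) for n in sizes)
    return margin * worst

def detect(norms, threshold):
    """Online step: first iteration whose ratio exceeds the threshold, or None."""
    for k, q in enumerate(ratios(norms)):
        if q > threshold:
            return k
    return None

# Train on small inputs, then monitor a larger, unseen input.
threshold = train_threshold([8, 16, 32])
clean_result = detect(jacobi_residuals(64), threshold)
faulty_result = detect(jacobi_residuals(64, inject_at=10), threshold)
```

In this toy setting the clean residual shrinks by a nearly constant factor per iteration regardless of problem size, which is why a threshold learned on small inputs transfers to the larger one. A real detector would need richer features and a trained classifier to catch corruptions that perturb the residual only slightly.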
