Characterizing a Detection Strategy for Transient Faults in HPC

Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger detection intervals and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day, and that they will propagate to generate errors that will range from process crashes to corrupted results, with undetected errors in applications that are still running. In this article, we analyze a methodology for transient fault detection (called SMCV) for MPI applications. The methodology is based on software replication, and it assumes that data corruption is made apparent producing different messages between replicas. SMCV allows obtaining reliable executions with correct results, or, at least, leading the system to a safe stop. This work presents a complete characterization, formally defining the behavior in the presence of faults and experimentally validating it in order to show its efficacy and viability to detect transient faults in HPC systems.

[1]  Kunle Olukotun,et al.  The Future of Microprocessors , 2005, ACM Queue.

[2]  Franck Cappello,et al.  Toward Exascale Resilience: 2014 update , 2014, Supercomput. Front. Innov..

[3]  James H. Laros,et al.  Evaluating the viability of process replication reliability for exascale systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[4]  Christian Engelmann,et al.  Redundant Execution of HPC Applications with MR-MPI , 2011 .

[5]  Andrew A. Chien,et al.  The future of microprocessors , 2011, Commun. ACM.

[6]  Laxmikant V. Kalé,et al.  ACR: Automatic checkpoint/restart for soft and hard error protection , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[7]  Franck Cappello,et al.  Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..

[8]  Emilio Luque,et al.  SMCV: a Methodology for Detecting Transient Faults in Multicore Clusters , 2012, CLEI Electron. J..

[9]  Christian Engelmann,et al.  Combining Partial Redundancy and Checkpointing for HPC , 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.

[10]  Narayan Desai,et al.  Co-analysis of RAS Log and Job Log on Blue Gene/P , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[11]  Christian Engelmann,et al.  The Case for Modular Redundancy in Large-Scale High Performance Computing Systems , 2009 .

[12]  Bronis R. de Supinski,et al.  Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[13]  James H. Laros,et al.  rMPI : increasing fault resiliency in a message-passing environment. , 2011 .

[14]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[15]  Sarita V. Adve,et al.  Low-cost program-level detectors for reducing silent data corruptions , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[16]  David Fiala Detection and correction of silent data corruption for large-scale high-performance computing , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[17]  Tipp Moseley,et al.  PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures , 2009, IEEE Transactions on Dependable and Secure Computing.

[18]  Armando Eduardo De Giusti,et al.  A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters , 2013 .

[19]  Zizhong Chen Algorithm-based recovery for iterative methods without checkpointing , 2011, HPDC '11.

[20]  Osman S. Unsal,et al.  Fault tolerance for multi-threaded applications by leveraging hardware transactional memory , 2013, CF '13.

[21]  Andrew A. Chien,et al.  When is multi-version checkpointing needed? , 2013, FTXS '13.

[22]  Frank Mueller,et al.  Evaluating the Impact of SDC on the GMRES Iterative Solver , 2013, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[23]  Dong Li,et al.  Classifying soft error vulnerabilities in extreme-Scale scientific applications using a binary instrumentation tool , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.