Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale

This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors. Although other detection techniques exist for HPC applications, based on algorithms (ABFT), invariant preservation, or data analytics, replication remains the most transparent and least intrusive technique. We explore the right level of replication (duplication, triplication, or more) needed to efficiently detect and correct silent errors. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. Otherwise, one or more silent errors have been detected, and the application rolls back to the last checkpoint. We provide a detailed analytical study of both scenarios, with formulas to determine, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report an extensive set of simulation results that corroborate the analytical model.
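
The compare-then-checkpoint rule described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `vote`, the use of hashable digests to stand in for replica results, and the strict-majority rule for more than three replicas are all assumptions made for illustration. The abstract itself only states the rule for duplication (both results agree) and triplication (two out of three agree).

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def vote(results: Sequence[Hashable]) -> Tuple[str, Optional[Hashable]]:
    """Decide whether to checkpoint or roll back, given one result per replica.

    `results` holds hypothetical digests (e.g. hashes) of each replica's state.
    Returns ("checkpoint", agreed_value) or ("rollback", None).
    """
    n = len(results)
    if n < 2:
        raise ValueError("at least two replicas are needed to detect a silent error")

    # Most frequent result and how many replicas produced it.
    value, count = Counter(results).most_common(1)[0]

    # Strict majority: 2 of 2 for duplication, 2 of 3 for triplication.
    # Applying the same rule to n > 3 is an assumption, not taken from the source.
    if count > n // 2:
        return "checkpoint", value
    return "rollback", None


if __name__ == "__main__":
    print(vote(["a", "a"]))       # duplication, replicas agree
    print(vote(["a", "b"]))       # duplication, mismatch detected
    print(vote(["a", "b", "a"]))  # triplication, one replica is outvoted
    print(vote(["a", "b", "c"]))  # triplication, no two replicas agree
```

With two replicas a mismatch can only be detected, so the only option is to roll back. With three replicas a single corrupted result is outvoted and the checkpoint can still be taken, which is the trade-off between replication levels that the study quantifies.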
