Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors

Large-scale platforms currently experience errors from two different sources, namely fail-stop errors (which interrupt the execution) and silent errors (which strike unnoticed and corrupt data). This work combines checkpointing and replication for the reliable execution of linear workflows on platforms subject to these two error types. While checkpointing and replication have been studied separately, their combination has not yet been investigated, despite its potential to reduce the execution time of linear workflows in error-prone environments. Moreover, combined checkpointing and replication has not yet been studied in the presence of both fail-stop and silent errors. The combination raises new problems: for each task, we have to decide whether to checkpoint it, replicate it, or both, to ensure its reliable execution. We provide an optimal dynamic programming algorithm of quadratic complexity that makes both decisions. The algorithm has been validated through extensive simulations that reveal the conditions under which checkpointing only, replication only, or the combination of both techniques leads to improved performance.
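To make the shape of such a decision procedure concrete, here is a minimal sketch of a quadratic-time dynamic program over a task chain. It is not the algorithm of this work: the cost model is a toy one chosen for illustration, and every name in it (`plan_chain`, `lam`, `ckpt_cost`, `rec_cost`, `rep_overhead`) is hypothetical. The sketch assumes fail-stop errors only, with an error detected at the end of the task it strikes; a task of weight w fails with probability 1 - exp(-lam * w); a replicated task takes rep_overhead * w and fails only if both replicas fail; a failure rolls back to the last checkpoint at cost rec_cost; checkpoints and recoveries are error-free; and replication is decided per segment between two checkpoints rather than per task, which is coarser than the per-task choice described above.

```python
import math


def plan_chain(weights, lam, ckpt_cost, rec_cost, rep_overhead):
    """Toy DP: choose checkpoint positions and, per segment, whether to replicate.

    Returns (expected_time, segments), where each segment is a tuple
    (first_task, last_task, replicated) with 1-based task indices.
    A checkpoint is always taken after the last task.
    """
    n = len(weights)
    best = [math.inf] * (n + 1)   # best[i]: min expected time with a checkpoint after task i
    best[0] = 0.0
    choice = [None] * (n + 1)     # choice[i]: (previous checkpoint index, replicated?)

    for j in range(n):            # j: index of the previous checkpoint
        attempt = [0.0, 0.0]      # expected duration of one attempt (plain, replicated)
        success = [1.0, 1.0]      # probability that one attempt succeeds
        for i in range(j + 1, n + 1):
            w = weights[i - 1]
            p_fail = 1.0 - math.exp(-lam * w)
            options = ((w, p_fail), (rep_overhead * w, p_fail ** 2))
            for r, (duration, p) in enumerate(options):
                # Task i runs in an attempt only if all earlier tasks succeeded.
                attempt[r] += success[r] * duration
                success[r] *= 1.0 - p
                if success[r] == 0.0:
                    continue      # segment can never complete under this toy model
                retries = 1.0 / success[r] - 1.0
                cost = attempt[r] / success[r] + retries * rec_cost + ckpt_cost
                if best[j] + cost < best[i]:
                    best[i] = best[j] + cost
                    choice[i] = (j, bool(r))

    if math.isinf(best[n]):
        raise ValueError("no feasible plan under the toy model")

    # Walk back through the recorded choices to list the segments in order.
    segments, i = [], n
    while i > 0:
        j, replicated = choice[i]
        segments.append((j + 1, i, replicated))
        i = j
    segments.reverse()
    return best[n], segments
```

The two nested loops over checkpoint positions give the quadratic cost, because the per-segment quantities are updated incrementally instead of being recomputed. For example, `plan_chain([100, 100, 100, 100], lam=0.01, ckpt_cost=1.0, rec_cost=1.0, rep_overhead=1.3)` returns the expected time under this toy model together with the chosen segments and their replication flags.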
