TeaMPI—Replication-Based Resilience Without the (Performance) Pain

In an era where we cannot afford to checkpoint frequently, replication offers a generic way to construct numerical simulations that continue to run even if hardware components fail. Yet replication is rarely employed at larger scales, as naïvely mirroring a computation once effectively halves the machine size, and keeping replicated simulations consistent with each other is non-trivial. We demonstrate for the ExaHyPE engine—a task-based solver for hyperbolic equation systems—that resilience can be realised without major code changes on the user side, and we introduce a novel algorithmic idea in which replication reduces the time-to-solution: the redundant CPU cycles are not burned "for nothing". Our work employs a weakly consistent data model in which replicas run independently yet inform each other through heartbeat messages that they are still up and running. Our key performance idea is to let the tasks of the replicated simulations share some of their outcomes, while we shuffle the actual task execution order per replica. This way, replicated ranks can skip some local computations and automatically start to synchronise with each other. Our experiments with a production-level seismic wave-equation solver provide evidence that this novel concept has the potential to make replication affordable for large-scale simulations in high-performance computing.
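
The core performance idea—replicas traversing the same task set in shuffled orders while publishing outcomes to one another—can be illustrated with a toy sequential model. The sketch below is purely illustrative and makes simplifying assumptions not taken from the paper: the function name, the round-robin interleaving, and the in-process shared table standing in for MPI task-outcome messages are all hypothetical; the real TeaMPI implementation communicates between distributed ranks.

```python
import random

def simulate_task_sharing(num_tasks=8, num_replicas=2, seed=42):
    """Toy sequential model of replication with task-outcome sharing.

    Every replica owns the same set of tasks but executes them in its
    own shuffled order. Outcomes are published to a shared table; a
    replica that finds a task's result already published skips the
    redundant local computation.
    """
    rng = random.Random(seed)

    # Each replica receives its own shuffled execution order.
    orders = []
    for _ in range(num_replicas):
        order = list(range(num_tasks))
        rng.shuffle(order)
        orders.append(order)

    shared = {}                      # task id -> published outcome
    computed = [0] * num_replicas    # tasks each replica ran itself
    skipped = [0] * num_replicas     # tasks each replica reused

    # Interleave replicas round-robin to mimic concurrent progress.
    for step in range(num_tasks):
        for r in range(num_replicas):
            task = orders[r][step]
            if task in shared:
                skipped[r] += 1              # reuse replicated outcome
            else:
                shared[task] = task * task   # stand-in for real work
                computed[r] += 1
    return computed, skipped
```

In this idealised model each task is computed exactly once across all replicas and reused everywhere else, which is the best case; in practice the achievable saving depends on message latency and on how far the shuffled orders keep the replicas from colliding on the same task at the same time.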
