Resilience for Massively Parallel Multigrid Solvers

Fault-tolerant, massively parallel multigrid methods for elliptic partial differential equations are a step towards resilient solvers. Here, we combine domain partitioning with geometric multigrid methods to obtain fast and fault-robust solvers for three-dimensional problems. The recovery strategy is based on the redundant storage of ghost values, which are commonly used in distributed-memory parallel programs. After a fault, the redundant interface values can be recovered easily, while the lost inner unknowns are recomputed approximately by recovery algorithms that use multigrid cycles to solve a local Dirichlet problem. Different strategies are compared and evaluated with respect to performance, computational cost, and speedup. Asynchronous strategies that combine global solves with accelerated local recovery are especially effective. In this way, multiple faults can be fully compensated with respect to both the number of iterations and run-time. For illustration, we use a state-of-the-art petascal...
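The local recovery idea can be illustrated with a minimal sketch, under strong simplifying assumptions that are not taken from the paper: a one-dimensional Poisson problem instead of a three-dimensional one, a single process instead of a distributed run, and plain Gauss-Seidel sweeps swapped in for the multigrid cycles. The function names (`gauss_seidel`, `recover_local`) and all parameters are hypothetical. The values just outside the lost index range play the role of the redundantly stored interface values and serve as Dirichlet data for the local solve.

```python
# Sketch only: 1D Poisson problem -u'' = f on (0, 1) with u(0) = u(1) = 0.
# Gauss-Seidel stands in for the multigrid cycles described in the abstract.

def gauss_seidel(u, f, h, lo, hi, sweeps):
    """Run Gauss-Seidel sweeps on indices lo..hi (inclusive).

    The values u[lo - 1] and u[hi + 1] are treated as Dirichlet data
    and are never modified.
    """
    for _ in range(sweeps):
        for i in range(lo, hi + 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u


def recover_local(u, f, h, lo, hi, sweeps=500):
    """Recompute the lost unknowns lo..hi from the surviving interface values."""
    for i in range(lo, hi + 1):
        u[i] = 0.0  # the lost data is gone, so restart from zero
    return gauss_seidel(u, f, h, lo, hi, sweeps)


n = 33                      # grid points including both boundary points
h = 1.0 / (n - 1)
f = [1.0] * n
u = [0.0] * n

# Global solve before the fault (iterated until effectively converged).
gauss_seidel(u, f, h, 1, n - 2, 5000)
reference = list(u)

# Simulated fault: the inner unknowns of one subdomain are lost.
lo, hi = 12, 20
for i in range(lo, hi + 1):
    u[i] = float("nan")

# Recovery: u[lo - 1] and u[hi + 1] survive (as ghost copies would on
# neighbouring processes) and act as boundary data for the local problem.
recover_local(u, f, h, lo, hi)

max_err = max(abs(a - b) for a, b in zip(u, reference))
print("max deviation from pre-fault solution:", max_err)
```

In this toy setting the global solve had already converged, so the local Dirichlet solve reproduces the lost values essentially exactly. In a solve that is still iterating, the surviving interface values are themselves only approximate, so the recovered values are approximate as well and the global iteration continues afterwards.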
