Resilience for Massively Parallel Multigrid Solvers
Barbara I. Wohlmuth | Ulrich Rüde | Markus Huber | Björn Gmeiner
[1] Edmond Chow,et al. A Survey of Parallelization Techniques for Multigrid Solvers , 2006, Parallel Processing for Scientific Computing.
[2] Thomas Hérault,et al. Unified model for assessing checkpointing protocols at extreme‐scale , 2014, Concurr. Comput. Pract. Exp..
[3] Mahmut T. Kandemir,et al. Analyzing the soft error resilience of linear solvers on multicore multiprocessors , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[4] Franck Cappello,et al. Toward Exascale Resilience , 2009, Int. J. High Perform. Comput. Appl..
[5] Zizhong Chen,et al. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods , 2013, PPoPP '13.
[6] Robert D. Falgout,et al. hypre: A Library of High Performance Preconditioners , 2002, International Conference on Computational Science.
[7] Thomas Hérault,et al. Extending the scope of the Checkpoint‐on‐Failure protocol for forward recovery in standard MPI , 2013, Concurr. Comput. Pract. Exp..
[8] Dirk Ribbrock,et al. Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing , 2015, Parallel Comput..
[9] J. Bey,et al. Tetrahedral grid refinement , 1995, Computing.
[10] Kurt B. Ferreira,et al. Fault-tolerant linear solvers via selective reliability , 2012, ArXiv.
[11] Franklin T. Luk,et al. A Linear Algebraic Model of Algorithm-Based Fault Tolerance , 1988, IEEE Trans. Computers.
[12] Benjamin Karl Bergen,et al. Hierarchical hybrid grids: data structures and core algorithms for multigrid , 2004, Numer. Linear Algebra Appl..
[13] Thomas Hérault,et al. An Evaluation of User-Level Failure Mitigation Support in MPI , 2012, EuroMPI.
[14] Rakesh Kumar,et al. Algorithmic approaches to low overhead fault detection for sparse linear algebra , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[15] Anton Schüller,et al. Multigrid methods on parallel computers - A survey of recent developments , 1991, IMPACT Comput. Sci. Eng..
[16] Jim Euchner. Design , 2014, Catalysis from A to Z.
[17] Andrew Lumsdaine,et al. Coordinated checkpoint/restart process fault tolerance for mpi applications on hpc systems , 2010 .
[18] Peter D. Düben,et al. On the use of inexact, pruned hardware in atmospheric modelling , 2014, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.
[19] Emmanuel Agullo,et al. Towards resilient parallel linear Krylov solvers: recover-restart strategies , 2013 .
[20] Mark F. Adams,et al. Segmental Refinement: A Multigrid Technique for Data Locality , 2016, SIAM J. Sci. Comput..
[21] R. Bank,et al. A class of iterative methods for solving saddle point problems , 1989 .
[22] Jack Dongarra,et al. Chapter 1 Fault tolerance techniques for high-performance computing , 2015 .
[23] Michail Maniatakos,et al. Low-Cost Concurrent Error Detection for Floating-Point Unit (FPU) Controllers , 2013, IEEE Transactions on Computers.
[24] Zizhong Chen,et al. Algorithm-Based Fault Tolerance for Fail-Stop Failures , 2008, IEEE Transactions on Parallel and Distributed Systems.
[25] Ulrich Rüde,et al. Towards Textbook Efficiency for Parallel Multigrid , 2015 .
[26] Franck Cappello,et al. Toward Exascale Resilience: 2014 update , 2014, Supercomput. Front. Innov..
[27] Franck Cappello,et al. Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[28] Ulrich Rüde,et al. Optimization of the multigrid-convergence rate on semi-structured meshes by local Fourier analysis , 2013, Comput. Math. Appl..
[29] Irad Yavneh,et al. On Red-Black SOR Smoothing in Multigrid , 1996, SIAM J. Sci. Comput..
[30] Gerhard Wellein,et al. An Evaluation of Different I/O Techniques for Checkpoint/Restart , 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.
[31] Barbara I. Wohlmuth,et al. Performance and Scalability of Hierarchical Hybrid Multigrid Solvers for Stokes Systems , 2015, SIAM J. Sci. Comput..
[32] Barbara I. Wohlmuth,et al. A quantitative performance study for Stokes solvers at the extreme scale , 2016, J. Comput. Sci..
[33] Franck Cappello,et al. Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities , 2009, Int. J. High Perform. Comput. Appl..
[34] Jack J. Dongarra,et al. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.
[35] Ulrike Meier Yang,et al. On the use of relaxation parameters in hybrid smoothers , 2004, Numer. Linear Algebra Appl..
[36] Ulrich Rüde,et al. Parallel multigrid on hierarchical hybrid grids: a performance study on current high performance computing clusters , 2014, Concurr. Comput. Pract. Exp..
[37] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[38] Yves Robert,et al. Fault-Tolerance Techniques for High-Performance Computing , 2015 .
[39] Amber Roy-Chowdhury,et al. A Fault-Tolerant Parallel Algorithm for Iterative Solution of the Laplace Equation , 1993, 1993 International Conference on Parallel Processing - ICPP'93.
[40] Laxmikant V. Kalé,et al. Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++ , 2006, OPSR.
[41] Thomas Hérault,et al. Post-failure recovery of MPI communication capability , 2013, Int. J. High Perform. Comput. Appl..
[42] George Bosilca,et al. Recovery Patterns for Iterative Methods in a Parallel Unstable Environment , 2007, SIAM J. Sci. Comput..
[43] Barbara I. Wohlmuth,et al. A quantitative performance analysis for Stokes solvers at the extreme scale , 2015, ArXiv.
[44] Clayton G. Webster,et al. Numerical Analysis of Fixed Point Algorithms in the Presence of Hardware Faults , 2015, SIAM J. Sci. Comput..
[45] Walter Zulehner,et al. Analysis of iterative methods for saddle point problems: a unified approach , 2002, Math. Comput..
[46] V. E. Henson,et al. BoomerAMG: a parallel algebraic multigrid solver and preconditioner , 2002 .
[47] Patrick Amestoy,et al. Hybrid scheduling for the parallel solution of linear systems , 2006, Parallel Comput..
[48] Franklin T. Luk,et al. Algorithmic Fault Tolerance Using the Lanczos Method , 1992, SIAM J. Matrix Anal. Appl..
[49] Achi Brandt,et al. Multigrid Techniques: 1984 Guide with Applications to Fluid Dynamics, Revised Edition , 2011 .
[50] Ulrike Meier Yang,et al. Scalability of Classical Algebraic Multigrid for Elasticity to Half a Million Parallel Tasks , 2016, Software for Exascale Computing.
[51] Hari Sundar,et al. Parallel geometric-algebraic multigrid on unstructured forests of octrees , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[52] Joel S. Emer,et al. The soft error problem: an architectural perspective , 2005, 11th International Symposium on High-Performance Computer Architecture.
[53] Wolfgang Hackbusch,et al. Multi-grid methods and applications , 1985, Springer series in computational mathematics.
[54] Markus Hegland,et al. Fault Tolerant Computation with the Sparse Grid Combination Technique , 2015, SIAM J. Sci. Comput..
[55] Jutta Docter,et al. JUQUEEN: IBM Blue Gene/Q® Supercomputer System at the Jülich Supercomputing Centre , 2015 .
[56] John Daly. A Model for Predicting the Optimum Checkpoint Interval for Restart Dumps , 2003, International Conference on Computational Science.
[57] Andreas Dedner,et al. A generic grid interface for parallel and adaptive scientific computing. Part II: implementation and tests in DUNE , 2008, Computing.
[58] Zizhong Chen,et al. Correcting soft errors online in LU factorization , 2013, HPDC '13.
[59] Jinchao Xu,et al. An error-resilient redundant subspace correction method , 2017, Comput. Vis. Sci..
[60] J. Douglas,et al. Stabilized mixed methods for the Stokes problem , 1988 .
[61] Franklin T. Luk,et al. An Analysis of Algorithm-Based Fault Tolerance Techniques , 1988, J. Parallel Distributed Comput..
[62] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[63] Martin Schulz,et al. Fault resilience of the algebraic multi-grid solver , 2012, ICS '12.
[64] Artem Napov,et al. A massively parallel solver for discrete Poisson-like problems , 2015, J. Comput. Phys..