Fault Resilient Domain Decomposition Preconditioner for PDEs

The move toward extreme-scale computing platforms challenges scientific simulations in many ways. Given recent trends in computer architecture, legacy codes must be reformulated to cope with large amounts of communication, system faults, and the requirement of low memory usage per core. In this work, we develop a novel framework for solving PDEs via domain decomposition that reformulates the solution as a state of knowledge with a probabilistic interpretation. This reformulation provides resilience to potential faults without requiring fault detection, avoids unnecessary communication, and is well suited to rigorous uncertainty quantification studies that aim to improve the predictive fidelity of scientific models. We demonstrate the algorithm on one-dimensional PDE examples in which artificial faults are injected as bit flips in the binary representation of subdomain solutions.
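To make the fault model concrete, the following is a minimal sketch of injecting a single bit flip into one value of a subdomain solution. It assumes the values are stored as IEEE 754 double-precision numbers; the helper name `flip_bit` and the sample data are hypothetical and are not taken from the authors' implementation.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding inverted.

    Bit 63 is the sign, bits 62..52 the exponent, bits 51..0 the mantissa.
    (Hypothetical helper for illustration only.)
    """
    if not 0 <= bit < 64:
        raise ValueError("bit index must be in [0, 63]")
    # Reinterpret the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the chosen bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


# Illustrative subdomain solution (made-up values) with one corrupted entry.
subdomain_solution = [0.0, 0.25, 0.5, 0.75, 1.0]
corrupted = list(subdomain_solution)
corrupted[4] = flip_bit(corrupted[4], 62)  # flip the top exponent bit
```

The effect of a flip depends strongly on its position: a flip in the low mantissa bits perturbs the value only slightly, while a flip in the exponent or sign bits can change it by many orders of magnitude or produce an infinity.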
