Partial Differential Equations Preconditioner Resilient to Soft and Hard Faults

We present a domain-decomposition-based preconditioner for the solution of partial differential equations (PDEs) that is resilient to both soft and hard faults. The algorithm proceeds in four steps: first, the computational domain is split into overlapping subdomains; second, the target PDE is solved on each subdomain for sampled values of the current local boundary conditions; third, the subdomain solution samples are collected and fed into a regression step to build maps between the subdomains' boundary conditions; finally, the intersection of these maps yields the updated state at the subdomain boundaries. This reformulation allows us to recast the problem as a set of independent tasks. The implementation relies on an asynchronous server-client framework in which one or more reliable servers hold the data, while the clients request tasks and execute them. This framework provides resiliency to hard faults: if a client crashes, it stops asking for work, and the servers simply distribute the work among the clients that remain alive. Erroneous subdomain solves (e.g., due to soft faults) appear as corrupted data, which is either rejected if it causes a task to fail or seamlessly filtered out during the regression stage through a suitable noise model. Three types of faults are modeled: hard faults, in which nodes (or clients) crash; soft faults occurring during the communication of tasks between servers and clients; and soft faults occurring during task execution. We demonstrate the resiliency of the approach for a 2D elliptic PDE and explore the effect of the faults at various failure rates.
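The four steps can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a 1D model problem (u'' = 0 on [0, 1] with u(0) = 0, u(1) = 1) instead of the 2D elliptic PDE, two overlapping subdomains [0, 0.6] and [0.4, 1], linear boundary-to-boundary maps, and an iteratively reweighted least-squares L1 fit standing in for the unspecified noise model. All function names, the fault probabilities, and the corruption magnitudes are hypothetical choices made for illustration.

```python
import numpy as np

def solve_subdomain(left_bc, right_bc, n=7):
    """Finite-difference solve of u'' = 0 on n equispaced points with Dirichlet data."""
    A = np.eye(n)
    rhs = np.zeros(n)
    rhs[0], rhs[-1] = left_bc, right_bc
    for i in range(1, n - 1):
        A[i, i - 1:i + 2] = [1.0, -2.0, 1.0]
    return np.linalg.solve(A, rhs)

def task_left(b1):
    # Subdomain [0, 0.6], grid spacing 0.1: input u(0.6), output u(0.4).
    return solve_subdomain(0.0, b1)[4]

def task_right(b2):
    # Subdomain [0.4, 1.0], grid spacing 0.1: input u(0.4), output u(0.6).
    return solve_subdomain(b2, 1.0)[2]

def run_tasks(task, inputs, rng, p_crash=0.1, n_corrupt=8):
    """Toy stand-in for the server loop (assumed fault model, not from the source)."""
    pending = list(range(len(inputs)))
    outputs = np.empty(len(inputs))
    while pending:
        i = pending.pop()
        if rng.random() < p_crash:
            # Hard fault: the client crashed, so the task is handed out again.
            pending.append(i)
            continue
        outputs[i] = task(inputs[i])
    # Soft faults: silently corrupt a fixed number of returned values.
    bad = rng.choice(len(inputs), size=n_corrupt, replace=False)
    outputs[bad] += rng.normal(0.0, 5.0, size=n_corrupt)
    return outputs

def robust_linear_fit(x, y, iters=100, eps=1e-8):
    """Fit y = a + c*x in the L1 sense via iteratively reweighted least squares."""
    X = np.column_stack([np.ones_like(x), x])
    w = np.ones_like(y)
    for _ in range(iters):
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        w = 1.0 / np.maximum(np.abs(y - X @ coef), eps)
    return coef

def run_iteration(guess=(0.0, 0.0), n_samples=40, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: sample boundary values around the current guess and solve locally.
    s1 = rng.uniform(guess[0] - 0.5, guess[0] + 0.5, n_samples)
    s2 = rng.uniform(guess[1] - 0.5, guess[1] + 0.5, n_samples)
    # Step 3: regress the maps u(0.4) = a1 + c1*u(0.6) and u(0.6) = a2 + c2*u(0.4).
    a1, c1 = robust_linear_fit(s1, run_tasks(task_left, s1, rng))
    a2, c2 = robust_linear_fit(s2, run_tasks(task_right, s2, rng))
    # Step 4: intersect the two maps to get the updated interface values.
    M = np.array([[-c1, 1.0], [1.0, -c2]])
    b1, b2 = np.linalg.solve(M, np.array([a1, a2]))
    return b1, b2  # estimates of u(0.6) and u(0.4)

if __name__ == "__main__":
    print(run_iteration())
```

Because this model problem is linear and its exact solution is u(x) = x, a single pass should recover interface values close to u(0.6) = 0.6 and u(0.4) = 0.4 despite the crashed tasks and corrupted samples. A nonlinear PDE would in general require repeating the sample, regress, and intersect cycle.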

[40]  Cosmin Safta,et al.  Partial Differential Equations Preconditioner Resilient to Soft and Hard Faults , 2015, CLUSTER.
