Domain Overlap for Iterative Sparse Triangular Solves on GPUs

Iterative methods for solving sparse triangular systems are an attractive alternative to exact forward and backward substitution when an approximate solution is acceptable. On modern hardware they can also be faster, because iterative methods expose more parallelism than substitution does. In this paper, we investigate how block-iterative triangular solves can benefit from overlap between subdomains. Because the matrices are triangular, we use “directed” overlap, whose direction depends on whether the matrix is upper or lower triangular. We enhance a GPU implementation of the block-asynchronous Jacobi method with directed overlap. On GPUs, and in other cases where the problem must be overdecomposed (i.e., there are more subdomains and threads than cores), it is preferable to process, or schedule, the subdomains in a specific order that follows the dependencies specified by the sparse triangular matrix. For sparse triangular factors from incomplete factorizations, we demonstrate that moderate directed overlap combined with subdomain scheduling can improve both convergence and time-to-solution.
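
To make the idea concrete, the following is a minimal sketch of a block-iterative solve of a lower triangular system with directed overlap. It is not the paper's GPU implementation: it is a synchronous, sequential simplification that uses dense Python lists and solves each subdomain exactly by local forward substitution. The function names (`block_sweep`, `sweeps_to_converge`) and the parameters `block_size` and `overlap` are hypothetical, and the sketch assumes that, for a lower triangular matrix, each subdomain is extended toward lower-numbered rows, i.e., toward the unknowns it depends on.

```python
def block_sweep(L, b, x, block_size, overlap):
    """One synchronous block sweep for L x = b, with L lower triangular.

    Each block owns rows [start, end). With directed overlap, the block also
    recomputes the `overlap` rows just before `start` (assumed direction for
    a lower triangular matrix), but writes back only the rows it owns.
    Values outside the extended block come from the previous iterate x.
    """
    n = len(b)
    x_new = list(x)
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        lo = max(0, start - overlap)  # extended (overlapped) start row
        local = {}
        for i in range(lo, end):  # local forward substitution
            s = b[i]
            for j in range(i):
                if L[i][j] != 0.0:
                    # fresh value inside the extended block, old value outside
                    s -= L[i][j] * (local[j] if j >= lo else x[j])
            local[i] = s / L[i][i]
        for i in range(start, end):  # restricted update: owned rows only
            x_new[i] = local[i]
    return x_new


def residual_norm(L, b, x):
    """Max-norm of b - L x."""
    return max(
        abs(b[i] - sum(L[i][j] * x[j] for j in range(i + 1)))
        for i in range(len(b))
    )


def sweeps_to_converge(L, b, block_size, overlap, tol=1e-12, max_sweeps=100):
    """Count the sweeps needed, from a zero initial guess, to reach tol."""
    x = [0.0] * len(b)
    for sweep in range(1, max_sweeps + 1):
        x = block_sweep(L, b, x, block_size, overlap)
        if residual_norm(L, b, x) < tol:
            return sweep
    return max_sweeps


# Hypothetical test problem: lower bidiagonal matrix, exact solution all ones.
n = 8
L = [[0.0] * n for _ in range(n)]
for i in range(n):
    L[i][i] = 2.0
    if i > 0:
        L[i][i - 1] = -1.0
b = [2.0] + [1.0] * (n - 1)

print(sweeps_to_converge(L, b, block_size=2, overlap=0))
print(sweeps_to_converge(L, b, block_size=2, overlap=2))
```

On this small bidiagonal example, the overlapped variant needs fewer sweeps because each subdomain recomputes some of the values it depends on instead of waiting for them to arrive from the previous sweep. The price is redundant work in the overlapped rows, which is why the abstract speaks of moderate overlap.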
