
Iterative methods for solving sparse triangular systems are an attractive alternative to exact forward and backward substitution when an approximate solution is acceptable. On modern hardware, they can offer performance benefits because they expose more parallelism. In this paper, we investigate how block-iterative triangular solves can benefit from overlap. Because the matrices are triangular, we use “directed” overlap, whose direction depends on whether the matrix is upper or lower triangular. We enhance a GPU implementation of the block-asynchronous Jacobi method with directed overlap. For GPUs and other cases where the problem must be overdecomposed, i.e., where there are more subdomains and threads than cores, it is preferable to process or schedule the subdomains in a specific order that follows the dependencies specified by the sparse triangular matrix. For sparse triangular factors from incomplete factorizations, we demonstrate that moderate directed overlap with subdomain scheduling can improve convergence and time-to-solution.
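
To make the idea of directed overlap concrete, the following is a minimal sketch and not the paper's GPU implementation. It assumes a dense lower triangular matrix in NumPy, synchronous sweeps rather than asynchronous ones, and a restricted update in which each block solves on its overlapped row range but writes back only the rows it owns. The function name `block_jacobi_lower` and its parameters are hypothetical.

```python
import numpy as np

def block_jacobi_lower(L, b, block_size, overlap=0, sweeps=5):
    """Approximate L x = b (L lower triangular) with block-Jacobi sweeps.

    Illustrative assumption: each block owns rows [start, stop). With
    directed overlap the local solve covers rows [start - overlap, stop),
    i.e. it is extended only toward earlier rows, which is the direction
    a lower triangular matrix depends on. Only owned rows are written back.
    """
    n = L.shape[0]
    x = np.zeros(n)
    for _ in range(sweeps):
        x_old = x.copy()  # synchronous sweep: every block reads the same iterate
        # Blocks are visited top to bottom, following the dependency order.
        for start in range(0, n, block_size):
            stop = min(start + block_size, n)
            lo = max(start - overlap, 0)  # overlapped range begins here
            # Couple to rows above the overlapped range through the old iterate.
            rhs = b[lo:stop] - L[lo:stop, :lo] @ x_old[:lo]
            local = np.linalg.solve(L[lo:stop, lo:stop], rhs)  # exact local solve
            x[start:stop] = local[start - lo:]  # restricted update: owned rows only
    return x

if __name__ == "__main__":
    n = 8
    L = np.tril(np.ones((n, n))) + 4.0 * np.eye(n)
    b = np.arange(1.0, n + 1.0)
    x_exact = np.linalg.solve(L, b)
    for ov in (0, 2):
        x = block_jacobi_lower(L, b, block_size=2, overlap=ov, sweeps=2)
        print(f"overlap={ov}: max error {np.max(np.abs(x - x_exact)):.2e}")
```

In this sketch, the error in a block after one sweep depends only on the previous iterate in rows above the overlapped range, so extending a block toward earlier rows shortens the chain of sweeps needed for information to propagate. The price is a larger local solve per block, which is one reason moderate overlap can be the better trade-off.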
