On Parallel Solution of Sparse Triangular Linear Systems in CUDA

The acceleration of sparse matrix computations on modern many-core processors, such as graphics processing units (GPUs), has been recognized and studied for over a decade. Significant performance gains have been achieved for many sparse matrix kernels, such as sparse matrix-vector and sparse matrix-matrix products. Solving linear systems with sparse triangular matrices is another important sparse kernel, required by a variety of scientific and engineering applications such as sparse linear solvers. However, developing efficient parallel algorithms in CUDA for solving sparse triangular linear systems remains challenging because the computation is inherently sequential. In this paper, we revisit this problem by reviewing the existing level-scheduling methods and proposing algorithms with self-scheduling techniques. Numerical results indicate that the CUDA implementations of the proposed algorithms can outperform the state-of-the-art solvers in cuSPARSE by a factor of up to $2.6$ for structured model problems and general sparse matrices.
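
To make the level-scheduling idea concrete, the following is a minimal serial sketch in Python, not the CUDA implementation evaluated in the paper. It assumes a lower triangular matrix stored in CSR format with a nonzero diagonal entry in every row. The function names `level_sets` and `solve_by_levels` and the argument names `indptr`, `indices`, and `data` are illustrative choices, not taken from the paper or from cuSPARSE. Rows in the same level do not depend on one another, so a GPU implementation could process each level in parallel, while the levels themselves must be processed in order.

```python
def level_sets(indptr, indices):
    """Group the rows of a lower triangular CSR matrix into dependency levels.

    Row i depends on row j whenever the matrix has a nonzero at (i, j), j < i.
    A row's level is one more than the highest level among its dependencies.
    """
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lev in enumerate(level):
        levels[lev].append(i)
    return levels


def solve_by_levels(indptr, indices, data, b):
    """Solve L x = b by forward substitution, one level at a time.

    Assumes every row stores a nonzero diagonal entry.
    """
    x = [0.0] * (len(indptr) - 1)
    for rows in level_sets(indptr, indices):
        # Rows within a level are mutually independent; this inner loop is
        # the part that a parallel implementation would distribute.
        for i in rows:
            s = b[i]
            diag = None
            for k in range(indptr[i], indptr[i + 1]):
                j = indices[k]
                if j == i:
                    diag = data[k]
                else:
                    s -= data[k] * x[j]
            x[i] = s / diag
    return x
```

For a $4 \times 4$ example with rows $(2)$, $(0, 1)$, $(1, 0, 4)$, and $(0, 3, 1, 1)$, rows 0 and 1 form the first level, row 2 the second, and row 3 the third. The number of levels bounds the available parallelism: a matrix whose rows form one long dependency chain yields one row per level and no parallelism across rows, which illustrates the inherently sequential nature noted above.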
