Paper information - On Parallel Solution of Sparse Triangular Linear Systems in CUDA

The acceleration of sparse matrix computations on modern many-core processors, such as graphics processing units (GPUs), has been recognized and studied for over a decade. Significant performance gains have been achieved for many sparse matrix kernels, such as sparse matrix-vector products and sparse matrix-matrix products. Solving linear systems with sparse triangular matrices is another important sparse kernel, required by a variety of scientific and engineering applications such as sparse linear solvers. However, developing efficient parallel algorithms in CUDA for solving sparse triangular linear systems remains challenging because the computation is inherently sequential. In this paper, we revisit this problem by reviewing the existing level-scheduling methods and proposing algorithms with self-scheduling techniques. Numerical results indicate that the CUDA implementations of the proposed algorithms can outperform the state-of-the-art solvers in cuSPARSE by a factor of up to $2.6$ for structured model problems and general sparse matrices.
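As a rough illustration of the level-scheduling idea mentioned in the abstract, the following is a minimal sequential Python sketch, not the paper's CUDA implementation. It assumes a lower triangular matrix in CSR form with a nonzero diagonal; the function names `build_levels` and `solve_lower` are hypothetical and chosen only for this example.

```python
def build_levels(indptr, indices):
    """Group rows of a lower triangular CSR matrix into dependency levels.

    Row i depends on row j whenever it has an off-diagonal entry in column j < i.
    Rows in the same level do not depend on each other.
    """
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                # Row i can only be solved after row j.
                level[i] = max(level[i], level[j] + 1)
    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        levels[lv].append(i)
    return levels


def solve_lower(indptr, indices, data, b):
    """Solve L x = b by forward substitution, one level at a time.

    Assumption: L is lower triangular with a stored nonzero diagonal.
    """
    x = [0.0] * len(b)
    for rows in build_levels(indptr, indices):
        # The rows of one level are independent; a GPU version could
        # assign them to different threads and synchronize between levels.
        for i in rows:
            acc = b[i]
            diag = None
            for k in range(indptr[i], indptr[i + 1]):
                j = indices[k]
                if j == i:
                    diag = data[k]
                else:
                    acc -= data[k] * x[j]
            x[i] = acc / diag
    return x
```

The number of levels bounds the number of synchronization points, which is why matrices with long dependency chains leave little parallelism for this approach.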

Ruipeng Li
