GPU-accelerated preconditioned iterative linear solvers

This work is an overview of our preliminary experience in developing a high-performance iterative linear solver accelerated by GPU coprocessors. Our goal is to illustrate the advantages and difficulties encountered when deploying GPU technology for sparse linear algebra computations. We discuss techniques for speeding up sparse matrix-vector product (SpMV) kernels and for finding suitable preconditioning methods. Our experiments with an NVIDIA Tesla M2070 show that, for unstructured matrices, SpMV kernels can be up to 8 times faster on the GPU than Intel MKL on the host Intel Xeon X5675 processor. The overall gain for complete solvers is smaller: the GPU-accelerated incomplete Cholesky (IC) factorization preconditioned CG method outperforms its CPU counterpart by a factor of up to 3, and the GPU-accelerated incomplete LU (ILU) factorization preconditioned GMRES method achieves a speed-up approaching 4. This performance can be improved further with preconditioning techniques better suited to GPUs.
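For readers unfamiliar with the SpMV kernel mentioned above, the following is a minimal reference sketch of a sparse matrix-vector product in compressed sparse row (CSR) format. It is not the authors' GPU kernel: the function name `csr_spmv` and the small test matrix are assumptions made for illustration, and the code runs serially in plain Python. The point it shows is that each row's dot product is independent of the others, which is the property GPU kernels typically exploit, for example by assigning one thread or one group of threads per row.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR format.

    row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i inside
    col_idx (column indices) and values (nonzero entries).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    # Each iteration of the outer loop is independent of the others,
    # so rows could be processed in parallel.
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical 3x3 example matrix:
#   [[4, 0, 1],
#    [0, 3, 0],
#    [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

y = csr_spmv(row_ptr, col_idx, values, x)
print(y)  # [7.0, 6.0, 17.0]
```

The indirect access `x[col_idx[k]]` is irregular for unstructured matrices, which is one reason SpMV performance depends strongly on the storage format and memory access pattern.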
