Analysis and performance estimation of the Conjugate Gradient method on multiple GPUs

The Conjugate Gradient (CG) method is a widely used iterative method for solving linear systems described by a (sparse) matrix. The method requires a large number of sparse matrix-vector (SpMV) multiplications, vector reductions, and other vector operations. We present several mappings of the SpMV operation onto modern programmable GPUs using the Block Compressed Sparse Row (BCSR) format. Further, we show that reordering matrix blocks substantially improves SpMV performance, especially when small blocks are used, so that our method outperforms existing state-of-the-art approaches in most cases. Finally, we analyze the performance of both the SpMV operation and the CG method in detail, which allows us to model and estimate the expected maximum performance for a given (unseen) problem.
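To illustrate the operations named above, the following is a minimal CPU-only sketch in plain Python, not the GPU mappings presented in this work. It assumes square b-by-b blocks stored row-major and a symmetric positive-definite matrix; the names `bcsr_spmv`, `conjugate_gradient`, `row_ptr`, `col_idx`, and `blocks` are hypothetical and chosen only for this example. Each CG iteration performs one SpMV, two vector reductions (dot products), and three vector updates.

```python
# Illustrative sketch only: textbook CG with a BCSR SpMV on the CPU.
# All names below are assumptions made for this example.

def bcsr_spmv(n_block_rows, b, row_ptr, col_idx, blocks, x):
    """Return y = A x for A stored in BCSR with square b-by-b blocks."""
    y = [0.0] * (n_block_rows * b)
    for bi in range(n_block_rows):
        # Blocks of block row bi are blocks[row_ptr[bi]:row_ptr[bi + 1]].
        for k in range(row_ptr[bi], row_ptr[bi + 1]):
            bj = col_idx[k]   # block column index
            blk = blocks[k]   # row-major list of b*b values
            for r in range(b):
                acc = 0.0
                for c in range(b):
                    acc += blk[r * b + c] * x[bj * b + c]
                y[bi * b + r] += acc
    return y


def dot(u, v):
    """Vector reduction: inner product of u and v."""
    return sum(a * c for a, c in zip(u, v))


def conjugate_gradient(spmv, rhs, tol=1e-10, max_iter=1000):
    """Solve A x = rhs for symmetric positive-definite A, given x -> A x."""
    x = [0.0] * len(rhs)
    r = list(rhs)             # residual for the zero initial guess
    p = list(r)               # initial search direction
    rs_old = dot(r, r)
    for it in range(max_iter):
        if rs_old ** 0.5 < tol:
            return x, it
        Ap = spmv(p)                          # one SpMV per iteration
        alpha = rs_old / dot(p, Ap)           # reduction
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rs_new = dot(r, r)                    # reduction
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter


# Small symmetric, diagonally dominant 4x4 example with 2x2 blocks:
# [[4, 1, 1, 0],
#  [1, 3, 0, 1],
#  [1, 0, 2, 0],
#  [0, 1, 0, 5]]
row_ptr = [0, 2, 4]
col_idx = [0, 1, 0, 1]
blocks = [[4.0, 1.0, 1.0, 3.0], [1.0, 0.0, 0.0, 1.0],
          [1.0, 0.0, 0.0, 1.0], [2.0, 0.0, 0.0, 5.0]]

rhs = bcsr_spmv(2, 2, row_ptr, col_idx, blocks, [1.0, 2.0, 3.0, 4.0])
solution, iterations = conjugate_gradient(
    lambda v: bcsr_spmv(2, 2, row_ptr, col_idx, blocks, v), rhs)
```

Passing the SpMV as a callable keeps the solver independent of the storage format, so a different block size or a reordered block layout can be swapped in without touching the CG loop.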
