A Parallel Preconditioned Conjugate Gradient Solver for the Poisson Problem on a Multi-GPU Platform

We present a parallel conjugate gradient solver for the Poisson problem, optimized for multi-GPU platforms. Our approach includes a novel heuristic Poisson preconditioner that is well suited to massively parallel SIMD processing. We also address the limited transfer rates of typical data channels, such as the PCI Express bus, relative to the bandwidth requirements of powerful GPUs: naive communication schemes can severely reduce the achievable speedup of such communication-intensive algorithms. For this reason, we overlap memory transfers with computation to establish a high level of concurrency and to improve scalability. We have implemented our model on a high-performance workstation with multiple hardware accelerators. We discuss the mathematical principles, give implementation details, and present the performance and scalability of the system.
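For orientation, the basic preconditioned conjugate gradient iteration for a discrete Poisson problem can be sketched as follows. This is a minimal single-threaded illustration under stated assumptions, not the solver described above: it assumes a 2D five-point stencil with zero Dirichlet boundaries and unit grid spacing, and it substitutes a plain Jacobi (diagonal) preconditioner for the heuristic preconditioner of the paper, which the abstract does not specify. The names `apply_laplacian`, `dot`, and `pcg_poisson` are hypothetical.

```python
def apply_laplacian(u, n):
    """Matrix-free 5-point stencil (negative Laplacian) on an n x n grid.

    Assumes zero Dirichlet boundary values and unit grid spacing.
    """
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            k = i * n + j
            s = 4.0 * u[k]
            if i > 0:
                s -= u[k - n]
            if i < n - 1:
                s -= u[k + n]
            if j > 0:
                s -= u[k - 1]
            if j < n - 1:
                s -= u[k + 1]
            out[k] = s
    return out


def dot(a, b):
    """Inner product of two equally sized lists."""
    return sum(x * y for x, y in zip(a, b))


def pcg_poisson(b, n, tol=1e-10, max_iter=1000):
    """Preconditioned CG for A x = b, with A the stencil above.

    The preconditioner is Jacobi (division by the diagonal entry 4),
    used here only as a placeholder assumption.
    Returns the solution and the number of iterations performed.
    """
    x = [0.0] * (n * n)
    r = list(b)                       # residual b - A x for x = 0
    z = [ri / 4.0 for ri in r]        # apply the preconditioner
    p = list(z)                       # first search direction
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5 or 1.0  # avoid dividing the tolerance by zero
    for it in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        Ap = apply_laplacian(p, n)
        alpha = rz / dot(p, Ap)       # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [ri / 4.0 for ri in r]
        rz_new = dot(r, z)
        beta = rz_new / rz            # keeps directions A-conjugate
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

Each iteration needs one stencil application, two inner products, and three vector updates. In a multi-GPU setting the stencil application requires exchanging boundary values between devices and the inner products require a global reduction, which is where the communication cost discussed above arises.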
