Conjugate gradients on multiple GPUs

A GPU-accelerated Conjugate Gradient solver is tested on eight matrices with different structural and numerical characteristics. The first four matrices are obtained by discretizing the 3D Poisson equation, which arises in many fields, such as computational fluid dynamics and heat transfer. Their relatively low bandwidth and low condition numbers make them ideal targets for GPU acceleration. We chose another four matrices from the other end of the spectrum, all ill-conditioned and with very large bandwidth. This paper concentrates on the computational aspects of running the solver on multiple GPUs. We develop a fast distributed sparse matrix-vector multiplication routine that uses optimized data formats, overlaps communication with computation and, at the same time, shares some of the work with the CPU. Through a thorough analysis of the time spent in communication and computation, we show that the proposed overlapped implementation outperforms the non-overlapped one by a large margin and provides almost perfect strong scalability for large Poisson-type matrices. We then benchmark the performance of the entire solver, using both double precision and single precision combined with iterative refinement, and report up to 22× acceleration when using three GPUs as compared with one of the most powerful Intel Nehalem CPUs available today. Finally, we show that using GPUs as accelerators brings not only an order of magnitude speedup but also up to a 5× increase in power efficiency and over a 10× increase in cost effectiveness. Copyright © 2010 John Wiley & Sons, Ltd.
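
For orientation, the unpreconditioned Conjugate Gradient iteration on a Poisson-type system can be sketched in a few lines. This is a minimal single-process illustration in plain Python, not the multi-GPU solver described in the paper. The function names (`poisson3d_matvec`, `conjugate_gradient`), the 7-point finite-difference stencil with Dirichlet boundaries, the grid size and the tolerance are all assumptions chosen for the example. The matrix is applied without being stored, so `poisson3d_matvec` stands in for the sparse matrix-vector multiplication that dominates the cost of each iteration.

```python
import math


def poisson3d_matvec(x, n):
    """Apply a 7-point Laplacian stencil (assumed Dirichlet boundaries) on an n*n*n grid."""
    y = [0.0] * (n * n * n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                idx = (i * n + j) * n + k
                s = 6.0 * x[idx]
                # Subtract each neighbour that lies inside the grid.
                if i > 0:
                    s -= x[idx - n * n]
                if i < n - 1:
                    s -= x[idx + n * n]
                if j > 0:
                    s -= x[idx - n]
                if j < n - 1:
                    s -= x[idx + n]
                if k > 0:
                    s -= x[idx - 1]
                if k < n - 1:
                    s -= x[idx + 1]
                y[idx] = s
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the zero initial guess
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    b_norm = math.sqrt(dot(b, b)) or 1.0
    for it in range(max_iter):
        # Stop once the relative residual norm drops below tol.
        if math.sqrt(rs_old) <= tol * b_norm:
            return x, it
        Ap = matvec(p)   # the one matrix-vector product per iteration
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter


if __name__ == "__main__":
    n = 6
    x_true = [1.0] * (n ** 3)
    b = poisson3d_matvec(x_true, n)
    x, iters = conjugate_gradient(lambda v: poisson3d_matvec(v, n), b)
    print("iterations:", iters)
    print("max error:", max(abs(a - c) for a, c in zip(x, x_true)))
```

Each iteration needs one matrix-vector product, two dot products and three vector updates. In a distributed setting the matrix-vector product is the step that requires exchanging vector entries between devices, which is why the abstract singles it out for overlapping communication with computation.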

[1]  Michael Garland,et al.  Efficient Sparse Matrix-Vector Multiplication on CUDA , 2008 .

[2]  Richard Vuduc,et al.  Automatic performance tuning of sparse matrix kernels , 2003 .

[3]  Robert Strzodka,et al.  Scientific computation for simulations on programmable graphics hardware , 2005, Simul. Model. Pract. Theory.

[4]  Robert Strzodka,et al.  Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations , 2007, Int. J. Parallel Emergent Distributed Syst..

[5]  Eitan Grinspun,et al.  Sparse matrix solvers on the GPU: conjugate gradients and multigrid , 2003, SIGGRAPH Courses.

[6]  M. Hestenes,et al.  Methods of conjugate gradients for solving linear systems , 1952 .

[7]  Guillaume Caumon,et al.  Concurrent Number Cruncher: An Efficient Sparse Linear Solver on the GPU , 2007, HPCC.

[8]  Robert Strzodka,et al.  Exploring weak scalability for FEM calculations on a GPU-enhanced cluster , 2007, Parallel Comput..

[9]  Jens H. Krüger,et al.  A Survey of General‐Purpose Computation on Graphics Hardware , 2007, Eurographics.

[10]  S. Menon,et al.  Implementation of an Efficient Conjugate Gradient Algorithm for Poisson Solutions on Graphics Processors , 2007 .

[11]  Rainald Löhner,et al.  Deflated preconditioned conjugate gradient solvers for the Pressure-Poisson equation , 2008, J. Comput. Phys..

[12]  J. Dongarra,et al.  Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy (Revisiting Iterative Refinement for Linear Systems) , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[13]  Stefan Turek,et al.  GPU acceleration of an unmodified parallel finite element Navier-Stokes solver , 2009, 2009 International Conference on High Performance Computing & Simulation.

[14]  GrinspunEitan,et al.  Sparse matrix solvers on the GPU , 2003 .

[15]  Yousef Saad,et al.  Iterative methods for sparse linear systems , 2003 .

[16]  Jack Dongarra,et al.  Numerical Linear Algebra for High-Performance Computers , 1998 .