Reducing Communication Overhead in Multi-GPU Hybrid Solver for 2D Laplace’s Equation

The possibility of porting algorithms to graphics processing units (GPUs) has attracted significant interest among researchers. The natural next step is to employ multiple GPUs, but communication overhead may limit further performance improvement. In this paper, we investigate techniques for reducing this overhead on hybrid CPU–GPU platforms, including careful data layout, appropriate use of GPU memory spaces, and non-blocking communication. In addition, we propose an accurate automatic load balancing technique for heterogeneous environments. We validate our approach on a hybrid Jacobi solver for 2D Laplace’s equation. Experiments carried out on various graphics hardware and connectivity types confirm that the proposed data layout allows our fastest CUDA kernels to reach the analytical limit for memory bandwidth (up to 106 GB/s on an NVIDIA GTX 480), and that non-blocking communication significantly reduces overhead, allowing for almost linear speed-up even when communication is carried out over relatively slow networks.
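For readers unfamiliar with the numerical method, the Jacobi iteration for 2D Laplace’s equation can be sketched as follows. This is a minimal single-process illustration in plain Python, not the CUDA kernels evaluated in the paper. The function names `jacobi_step` and `solve`, the tolerance, and the fixed-value (Dirichlet) boundary treatment are assumptions made for the example.

```python
def jacobi_step(u):
    """Return a new grid after one Jacobi sweep.

    Each interior point becomes the average of its four neighbours;
    the outermost rows and columns are treated as fixed boundary values
    (an assumption of this sketch).
    """
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]  # copy, so the sweep reads only old values
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1])
    return new


def solve(u, tol=1e-6, max_iters=10000):
    """Iterate until the largest pointwise update falls below tol."""
    for iteration in range(1, max_iters + 1):
        new = jacobi_step(u)
        diff = max(abs(new[i][j] - u[i][j])
                   for i in range(len(u)) for j in range(len(u[0])))
        u = new
        if diff < tol:
            return u, iteration
    return u, max_iters
```

In a multi-GPU setting, each device would own a band of the grid and would need its neighbours' edge rows before every sweep. That exchange is the communication overhead the abstract refers to.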
