Effective multi-GPU communication using multiple CUDA streams and threads

For multiple GPUs that share the same PCIe bus, we propose a new communication scheme that achieves a more effective overlap of communication and computation. Multiple CUDA streams are combined with OpenMP threads so that data can be sent and received simultaneously. A representative 3D stencil example demonstrates the effectiveness of our scheme. We compare its performance against a state-of-the-art MPI-based scheme; the results show that our approach is up to 1.85× faster. However, our measurements also indicate that the current PCIe bus architecture needs improvements to handle the future scenario of many GPUs per node.
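To make the scheme concrete, the following is a minimal sketch (not the paper's actual code) of the overlap pattern the abstract describes: one OpenMP thread per GPU, with a dedicated CUDA stream for computation and one stream per transfer direction, so that the interior stencil update runs while halo data is simultaneously sent to and received from neighbouring devices over the PCIe bus. The kernel `stencil_interior`, the halo size `HALO`, and the buffer layout are illustrative assumptions.

```cpp
#include <cuda_runtime.h>
#include <omp.h>
#include <cstdio>

#define MAX_GPUS 8
#define HALO (512 * 512)  // assumed number of elements in one halo slice

// Stand-in for the real 3D stencil update of the interior points.
__global__ void stencil_interior(float *u, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] = 0.25f * u[i];
}

// Shared across threads: each GPU's receive buffers, so neighbours can
// target them with peer-to-peer copies.
float *recv_lo[MAX_GPUS], *recv_hi[MAX_GPUS];

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus > MAX_GPUS) ngpus = MAX_GPUS;

    #pragma omp parallel num_threads(ngpus)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);

        // Enable direct peer access to the neighbouring GPUs on the bus.
        if (dev > 0)         cudaDeviceEnablePeerAccess(dev - 1, 0);
        if (dev < ngpus - 1) cudaDeviceEnablePeerAccess(dev + 1, 0);

        int n = 1 << 22;  // assumed interior size per GPU
        float *u, *send_lo_buf, *send_hi_buf;
        cudaMalloc(&u, n * sizeof(float));
        cudaMalloc(&send_lo_buf, HALO * sizeof(float));
        cudaMalloc(&send_hi_buf, HALO * sizeof(float));
        cudaMalloc(&recv_lo[dev], HALO * sizeof(float));
        cudaMalloc(&recv_hi[dev], HALO * sizeof(float));

        // Three streams: compute plus one per transfer direction, so a
        // send and a receive can be in flight at the same time.
        cudaStream_t compute, xfer_lo, xfer_hi;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&xfer_lo);
        cudaStreamCreate(&xfer_hi);

        #pragma omp barrier  // all receive buffers are now allocated

        // Interior update proceeds while both halo transfers are active.
        stencil_interior<<<(n + 255) / 256, 256, 0, compute>>>(u, n);

        if (dev > 0)          // low halo -> lower neighbour's high buffer
            cudaMemcpyPeerAsync(recv_hi[dev - 1], dev - 1, send_lo_buf, dev,
                                HALO * sizeof(float), xfer_lo);
        if (dev < ngpus - 1)  // high halo -> upper neighbour's low buffer
            cudaMemcpyPeerAsync(recv_lo[dev + 1], dev + 1, send_hi_buf, dev,
                                HALO * sizeof(float), xfer_hi);

        cudaDeviceSynchronize();
        #pragma omp barrier   // halos received; boundary update would follow
    }
    printf("done on %d GPU(s)\n", ngpus);
    return 0;
}
```

In a real iteration loop, the boundary points would be updated in a second kernel launch after the barrier, once the neighbours' halos have arrived; the key design choice is that each direction gets its own stream, which is what allows the hardware to keep transfers in both directions and the compute kernel overlapped.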
