Effective multi-GPU communication using multiple CUDA streams and threads

For multiple GPUs that share the same PCIe bus, we propose a new communication scheme that achieves a more effective overlap of communication and computation. Multiple CUDA streams are combined with OpenMP threads so that data can be sent and received simultaneously. A representative 3D stencil example demonstrates the effectiveness of our scheme. We compare its performance against a state-of-the-art MPI-based scheme; the results show that our approach is up to 1.85× faster. However, our measurements also indicate that the current PCIe bus architecture needs improvements to handle the future scenario of many GPUs per node.
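To make the scheme concrete, the following is a minimal sketch (not the paper's actual code) of the overlap pattern the abstract describes: one OpenMP thread per GPU, with a dedicated CUDA stream for computation and one stream per transfer direction, so that the interior stencil update runs while halo data is simultaneously sent to and received from neighbouring devices over the PCIe bus. The kernel `stencil_interior`, the halo size `HALO`, and the buffer layout are illustrative assumptions.

```cpp
#include <cuda_runtime.h>
#include <omp.h>
#include <cstdio>

#define MAX_GPUS 8
#define HALO (512 * 512)  // assumed number of elements in one halo slice

// Stand-in for the real 3D stencil update of the interior points.
__global__ void stencil_interior(float *u, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] = 0.25f * u[i];
}

// Shared across threads: each GPU's receive buffers, so neighbours can
// target them with peer-to-peer copies.
float *recv_lo[MAX_GPUS], *recv_hi[MAX_GPUS];

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus > MAX_GPUS) ngpus = MAX_GPUS;

    #pragma omp parallel num_threads(ngpus)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);

        // Enable direct peer access to the neighbouring GPUs on the bus.
        if (dev > 0)         cudaDeviceEnablePeerAccess(dev - 1, 0);
        if (dev < ngpus - 1) cudaDeviceEnablePeerAccess(dev + 1, 0);

        int n = 1 << 22;  // assumed interior size per GPU
        float *u, *send_lo_buf, *send_hi_buf;
        cudaMalloc(&u, n * sizeof(float));
        cudaMalloc(&send_lo_buf, HALO * sizeof(float));
        cudaMalloc(&send_hi_buf, HALO * sizeof(float));
        cudaMalloc(&recv_lo[dev], HALO * sizeof(float));
        cudaMalloc(&recv_hi[dev], HALO * sizeof(float));

        // Three streams: compute plus one per transfer direction, so a
        // send and a receive can be in flight at the same time.
        cudaStream_t compute, xfer_lo, xfer_hi;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&xfer_lo);
        cudaStreamCreate(&xfer_hi);

        #pragma omp barrier  // all receive buffers are now allocated

        // Interior update proceeds while both halo transfers are active.
        stencil_interior<<<(n + 255) / 256, 256, 0, compute>>>(u, n);

        if (dev > 0)          // low halo -> lower neighbour's high buffer
            cudaMemcpyPeerAsync(recv_hi[dev - 1], dev - 1, send_lo_buf, dev,
                                HALO * sizeof(float), xfer_lo);
        if (dev < ngpus - 1)  // high halo -> upper neighbour's low buffer
            cudaMemcpyPeerAsync(recv_lo[dev + 1], dev + 1, send_hi_buf, dev,
                                HALO * sizeof(float), xfer_hi);

        cudaDeviceSynchronize();
        #pragma omp barrier   // halos received; boundary update would follow
    }
    printf("done on %d GPU(s)\n", ngpus);
    return 0;
}
```

In a real iteration loop, the boundary points would be updated in a second kernel launch after the barrier, once the neighbours' halos have arrived; the key design choice is that each direction gets its own stream, which is what allows the hardware to keep transfers in both directions and the compute kernel overlapped.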
