Efficient Breadth First Search on Multi-GPU Systems Using GPU-Centric OpenSHMEM

NVSHMEM is an implementation of OpenSHMEM for NVIDIA GPUs that allows communication to be issued from inside CUDA kernels. In this work, we present an implementation of breadth-first search (BFS) for multi-GPU systems using NVSHMEM. We analyze the benefits and bottlenecks of moving fine-grained communication into CUDA kernels. Our BFS implementation achieves up to a 75% performance improvement over a CUDA-aware MPI-based implementation.
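To make the idea of fine-grained, in-kernel communication concrete, the following is a minimal CPU-side sketch of a level-synchronous BFS over a partitioned graph. It is an illustration under stated assumptions, not the implementation described in this work: the cyclic vertex-to-partition mapping, the function name `partitioned_bfs`, and the `remote_puts` counter are all hypothetical. Each discovery of a vertex owned by another partition is written directly into that owner's next frontier, which stands in for a one-sided put issued from inside a kernel rather than a batched exchange at the end of the level.

```python
def partitioned_bfs(adj, num_parts, source):
    """Sequential simulation of a level-synchronous BFS on a partitioned graph.

    adj        -- adjacency list; adj[u] lists the neighbours of vertex u
    num_parts  -- number of partitions (stand-ins for GPUs / PEs)
    source     -- start vertex

    Returns (dist, remote_puts): BFS depth per vertex (-1 if unreachable) and
    the number of discoveries that crossed a partition boundary.
    """
    n = len(adj)

    def owner(v):
        # Assumption: simple cyclic 1D partitioning of vertices.
        return v % num_parts

    dist = [-1] * n
    dist[source] = 0

    # One frontier queue per partition.
    frontiers = [[] for _ in range(num_parts)]
    frontiers[owner(source)].append(source)

    level = 0
    remote_puts = 0
    while any(frontiers):
        next_frontiers = [[] for _ in range(num_parts)]
        for pe in range(num_parts):
            for u in frontiers[pe]:
                for v in adj[u]:
                    # On real hardware this check-and-set would be an atomic
                    # operation at the owner; here it runs sequentially.
                    if dist[v] == -1:
                        dist[v] = level + 1
                        # Fine-grained "put": append straight into the
                        # owner's next frontier as soon as v is discovered.
                        next_frontiers[owner(v)].append(v)
                        if owner(v) != pe:
                            remote_puts += 1
        # End of level: corresponds to a global barrier between levels.
        frontiers = next_frontiers
        level += 1
    return dist, remote_puts
```

Calling `partitioned_bfs(adj, 2, 0)` on a small adjacency list returns the per-vertex depths together with a count of cross-partition discoveries. That count is a rough proxy for how many small messages a fine-grained scheme would issue at the moment of discovery, where a bulk-synchronous scheme would pack and exchange them once per level.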
