Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs

GPUs and accelerators have become ubiquitous in modern supercomputing systems. Scientific applications from a wide range of fields are being modified to take advantage of their compute power. However, data movement remains a critical bottleneck in harnessing the full potential of a GPU: data in GPU memory has to be moved into host memory before it can be sent over the network. MPI libraries like MVAPICH2 have provided solutions to alleviate this bottleneck using techniques like pipelining. GPUDirect RDMA is a feature introduced in CUDA 5.0 that allows third-party devices such as network adapters to directly access data in GPU device memory over the PCIe bus. NVIDIA has partnered with Mellanox to make this solution available for InfiniBand clusters. In this paper, we evaluate the first version of GPUDirect RDMA for InfiniBand and propose designs in the MVAPICH2 MPI library to efficiently take advantage of this feature. We highlight the limitations posed by current-generation architectures in effectively using GPUDirect RDMA and address these issues through novel designs in MVAPICH2. To the best of our knowledge, this is the first work to demonstrate a solution for inter-node GPU-to-GPU MPI communication using GPUDirect RDMA. Results show that the proposed designs improve the latency of inter-node GPU-to-GPU communication using MPI_Send/MPI_Recv by 69% and 32% for 4-byte and 128-KByte messages, respectively. The designs boost the uni-directional bandwidth achieved with 4-KByte and 64-KByte messages by 2x and 35%, respectively. We demonstrate the impact of the proposed designs using two end applications, LBMGPU and AWP-ODC, improving their communication times by up to 35% and 40%, respectively.
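As a concrete illustration of the programming model this work targets, the sketch below passes GPU device pointers directly to MPI_Send/MPI_Recv, as supported by CUDA-aware MPI libraries such as MVAPICH2. This is a minimal sketch, not code from the paper: the message size, the one-GPU-per-rank assumption, and the omitted error handling are illustrative choices, and whether the transfer actually goes over GPUDirect RDMA or falls back to a host-staged pipeline depends on the library build and runtime configuration (for MVAPICH2, GPU buffer support is typically enabled with MV2_USE_CUDA=1; the GPUDirect-RDMA-specific knobs vary by release).

    /* Minimal sketch: inter-node GPU-to-GPU point-to-point exchange with a
     * CUDA-aware MPI library such as MVAPICH2. Device buffers are handed
     * straight to MPI_Send/MPI_Recv; the library decides whether to use
     * GPUDirect RDMA or a host-staged pipeline. Error checks omitted. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int nbytes = 4;              /* e.g. a 4-byte message */
        int rank;
        char *d_buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaSetDevice(0);                  /* assumes one GPU per rank */
        cudaMalloc((void **)&d_buf, nbytes);

        if (rank == 0) {
            cudaMemset(d_buf, 1, nbytes);
            /* device pointer passed directly to MPI */
            MPI_Send(d_buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d bytes into GPU memory\n", nbytes);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

Launched with two processes on two nodes (e.g. mpirun -np 2, one rank per node), the same two MPI calls exercise either the host-staged path or the GPUDirect RDMA path, which is what allows the latency and bandwidth comparisons described above without changing application code.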
