GGAS: Global GPU address spaces for efficient communication in heterogeneous clusters

Modern GPUs are powerful high-core-count processors that are no longer used solely for graphics, but also accelerate computationally intensive general-purpose workloads. For maximum performance, GPUs are distributed across the nodes of a cluster to process parallel programs; indeed, many recent high-performance systems in the TOP500 list are heterogeneous architectures. Yet despite being highly capable processing units, GPUs on different hosts cannot communicate without CPU assistance. As a result, communication between distributed GPUs incurs unnecessary overhead from switching control flow between GPUs and CPUs, and most communication libraries additionally require intermediate copies from GPU memory to host memory. This overhead particularly penalizes small data movements and synchronization operations, reducing efficiency and limiting scalability. In this work we introduce Global GPU Address Spaces (GGAS) to enable direct communication between distributed GPUs without CPU involvement. Avoiding context switches and unnecessary copies dramatically reduces communication overhead. We evaluate our approach with a variety of workloads, including low-level latency and bandwidth benchmarks, basic synchronization primitives such as barriers, and a stencil computation as an example application, and observe performance benefits of up to 2× for the basic benchmarks and up to 1.67× for the stencil computation.
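To make the communication model concrete, the sketch below illustrates, in plain CUDA rather than the actual GGAS API, what GPU-controlled communication over a global address space looks like: a sending kernel stores data directly into a buffer assumed to be mapped from a remote GPU's memory and then raises a flag, while a receiving kernel spins on that flag, with no CPU in the data path. The names remote_buf and remote_flag are hypothetical; in a real GGAS system the remote mapping would be established by interconnect hardware and driver support, which this self-contained example stands in for by allocating both buffers on a single local GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sender side: store the payload directly into the (assumed) remote-mapped
// buffer, make the stores visible system-wide, then raise the flag.
__global__ void put_and_notify(const float *local, float *remote_buf,
                               volatile int *remote_flag, int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        remote_buf[i] = local[i];     // direct store, no host staging copy

    __threadfence_system();           // order each thread's stores system-wide
    __syncthreads();                  // all threads of this block are done
    if (threadIdx.x == 0)
        *remote_flag = 1;            // one-sided completion notification
}

// Receiver side: spin on the flag entirely on the GPU; no CPU is
// involved in detecting the incoming transfer.
__global__ void wait_for_data(volatile int *flag)
{
    if (threadIdx.x == 0)
        while (*flag == 0) { /* spin until the sender notifies */ }
}

int main()
{
    const int n = 256;
    float *local, *remote_buf;  // remote_buf stands in for memory that
    int *flag;                  // GGAS hardware would map from another node

    cudaMalloc(&local, n * sizeof(float));
    cudaMalloc(&remote_buf, n * sizeof(float));
    cudaMalloc(&flag, sizeof(int));
    cudaMemset(flag, 0, sizeof(int));

    cudaStream_t s_recv, s_send;
    cudaStreamCreate(&s_recv);
    cudaStreamCreate(&s_send);

    // Receiver starts first and blocks on the GPU; the sender then
    // transfers and notifies, all within device code. This assumes a
    // device with concurrent kernel execution (compute capability 2.0+).
    wait_for_data<<<1, 1, 0, s_recv>>>(flag);
    put_and_notify<<<1, 128, 0, s_send>>>(local, remote_buf, flag, n);

    cudaDeviceSynchronize();
    printf("GPU-side put/notify handshake completed\n");
    return 0;
}
```

On a real GGAS cluster the stores would traverse the interconnect, but the kernel code would stay this simple: control remains on the GPU, so no host thread, context switch, or staging copy sits between sender and receiver.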
