Energy-Efficient Stencil Computations on Distributed GPUs Using Dynamic Parallelism and GPU-Controlled Communication

GPUs are widely used in high-performance computing due to their high computational power and performance per Watt. Still, one of the main bottlenecks of GPU-accelerated cluster computing is data transfer between distributed GPUs, which affects not only performance but also power consumption. The most common way to utilize a GPU cluster is a hybrid model, in which the GPU accelerates the computation while the CPU is responsible for the communication. This approach always requires a dedicated CPU thread, which consumes additional CPU cycles and therefore increases the power consumption of the entire application. In recent work we have shown that the GPU is able to control the communication independently of the CPU. Still, GPU-controlled communication poses several problems. The main one is intra-GPU synchronization: because GPU thread blocks are non-preemptive, issuing communication requests from within a GPU kernel can easily result in a deadlock. In this work we show how Dynamic Parallelism solves this problem. GPU-controlled communication in combination with Dynamic Parallelism keeps the control flow of multi-GPU applications on the GPU and bypasses the CPU completely. Although applications using GPU-controlled communication still perform slightly worse than their hybrid counterparts, we show that performance per Watt increases by up to 10% while still using commodity hardware.
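
To illustrate the idea, the following minimal CUDA sketch (not the paper's implementation) shows how Dynamic Parallelism can keep the time-step loop of a stencil computation on the GPU: a single-thread parent kernel launches child kernels for the computation and for a communication step, so no dedicated CPU thread is needed for control flow. The kernels stencil_step and exchange_halos and all parameters are hypothetical placeholders; a real GPU-controlled communication step would issue put/get operations to remote GPUs, for example through a global address space.

```cuda
// Minimal sketch only: illustrates GPU-resident control flow via CUDA
// Dynamic Parallelism. All names and parameters are hypothetical, not
// taken from the paper.

// Hypothetical 5-point Jacobi stencil on an nx-by-ny grid.
__global__ void stencil_step(const double *in, double *out, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x > 0 && x < nx - 1 && y > 0 && y < ny - 1)
        out[y * nx + x] = 0.25 * (in[y * nx + x - 1] + in[y * nx + x + 1] +
                                  in[(y - 1) * nx + x] + in[(y + 1) * nx + x]);
}

// Hypothetical communication kernel: in a real system it would issue
// put/get operations on the boundary rows to neighboring GPUs; left
// empty here as a placeholder.
__global__ void exchange_halos(double *grid, int nx, int ny)
{
}

// Parent kernel, launched with a single thread from the host. It drives
// the time-step loop entirely on the GPU: child kernels alternate between
// computation and communication, and the device-side
// cudaDeviceSynchronize() orders the two phases. Because the blocking
// wait happens between child launches rather than inside one monolithic
// kernel, no non-preemptive block ever spins on a request that another
// resident block must complete, which avoids the deadlock described above.
__global__ void timeloop(double *a, double *b, int nx, int ny, int steps)
{
    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    for (int t = 0; t < steps; ++t) {
        stencil_step<<<grid, block>>>(a, b, nx, ny);
        cudaDeviceSynchronize();            // wait for the compute step
        exchange_halos<<<1, 32>>>(b, nx, ny);
        cudaDeviceSynchronize();            // wait for the halo exchange
        double *tmp = a; a = b; b = tmp;    // swap input/output buffers
    }
}

// Host side: a single launch replaces the usual CPU-driven loop, e.g.
//   timeloop<<<1, 1>>>(d_a, d_b, nx, ny, steps);
```

The sketch assumes a device of compute capability 3.5 or higher and compilation with relocatable device code (nvcc -rdc=true -lcudadevrt). Device-side cudaDeviceSynchronize() was available in the CUDA toolkits of the paper's era; it has since been deprecated and was removed in CUDA 12.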
