Scaling scientific applications on clusters of hybrid multicore/GPU nodes

Rapid advances in the performance and programmability of graphics accelerators have made GPU computing a compelling solution for a wide variety of application domains. However, the increased complexity introduced by architectural heterogeneity and imbalanced hardware resources poses significant programming challenges in harnessing the performance advantages of GPU-accelerated parallel systems. Moreover, the speedup delivered by GPUs is often offset by longer communication latencies and inefficient task scheduling. A suitable parallel programming model is therefore essential for achieving the best possible performance. In this paper, we explore a new hybrid parallel programming model that integrates GPU acceleration into the Partitioned Global Address Space (PGAS) programming paradigm. Using a combination of Unified Parallel C (UPC) and CUDA as a case study, we demonstrate that this hybrid model offers programmers both enhanced programmability and powerful heterogeneous execution. Two application benchmarks, NAS Parallel Benchmarks (NPB) FT and MG, show the effectiveness of the proposed hybrid approach. Experimental results indicate that both implementations achieve significantly better performance thanks to optimization opportunities offered by the hybrid model, such as the funneled execution mode and fine-grained overlap of communication and computation.
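
To make the overlap pattern in that last sentence concrete, below is a minimal sketch (hypothetical, not taken from the paper) of a funneled UPC+CUDA pipeline: one UPC thread per node owns the node's GPU, and the one-sided push of each finished chunk overlaps with the CUDA work on the next chunk. The names launch_scale, CHUNK, and STAGES are illustrative; the non-blocking upc_memput_nb/upc_sync calls are assumed from the UPC 1.3 non-blocking transfer library (<upc_nb.h>), which standardizes the Berkeley extensions this line of work builds on.

    /* sketch.upc -- a minimal, hypothetical sketch of the funneled
     * UPC+CUDA pipeline described above; it is NOT the paper's code.
     * Assumptions: one UPC thread per node owns that node's GPU (the
     * funneled arrangement); the UPC 1.3 non-blocking transfer library
     * (<upc_nb.h>: upc_memput_nb/upc_sync) is available; launch_scale()
     * is an illustrative host wrapper around a trivial CUDA kernel,
     * compiled separately with nvcc and linked in.
     * Build (illustrative): upcc -T=4 sketch.upc scale.o -lcudart    */
    #include <upc.h>
    #include <upc_nb.h>
    #include <cuda_runtime.h>

    #define CHUNK  (1 << 16)     /* elements per pipeline stage       */
    #define STAGES 4             /* pipeline depth                    */

    extern void launch_scale(float *d_buf, int n, cudaStream_t s);

    /* Each thread owns STAGES contiguous chunks; neighbors push here. */
    shared [STAGES * CHUNK] float inbox[THREADS * STAGES * CHUNK];

    int main(void)
    {
        int peer = (MYTHREAD + 1) % THREADS;  /* ring exchange        */
        float *host, *dev;
        cudaStream_t stream;
        upc_handle_t put;
        int have_put = 0;

        cudaMallocHost((void **)&host, STAGES * CHUNK * sizeof(float));
        cudaMalloc((void **)&dev, STAGES * CHUNK * sizeof(float));
        cudaStreamCreate(&stream);
        for (int i = 0; i < STAGES * CHUNK; i++)
            host[i] = (float)MYTHREAD;        /* dummy input          */

        for (int s = 0; s < STAGES; s++) {
            float *h = host + s * CHUNK;
            float *d = dev  + s * CHUNK;

            /* Stage s: H2D copy, kernel, D2H copy, all asynchronous. */
            cudaMemcpyAsync(d, h, CHUNK * sizeof(float),
                            cudaMemcpyHostToDevice, stream);
            launch_scale(d, CHUNK, stream);
            cudaMemcpyAsync(h, d, CHUNK * sizeof(float),
                            cudaMemcpyDeviceToHost, stream);

            /* Stage s-1's one-sided put has been overlapping with the
             * GPU work above; retire it before reusing the handle.   */
            if (have_put) upc_sync(put);
            cudaStreamSynchronize(stream);    /* stage s is on host   */

            /* Non-blocking, one-sided push of stage s to the
             * neighbor; it flows while the GPU runs stage s+1.       */
            put = upc_memput_nb(
                      &inbox[peer * STAGES * CHUNK + s * CHUNK],
                      h, CHUNK * sizeof(float));
            have_put = 1;
        }
        if (have_put) upc_sync(put);
        upc_barrier;                          /* all puts delivered   */

        cudaStreamDestroy(stream);
        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }

The essential design point is the pipeline: the one-sided put for chunk s travels through the network while the CUDA stream processes chunk s+1, which is the fine-grained communication/computation overlap the abstract credits for the observed speedups.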
