Exploring Parallel Programming Models for Heterogeneous Computing Systems

Parallel systems that pair CPUs with GPUs as heterogeneous computational units have become immensely popular because they can maximize performance under restrictive thermal budgets. However, programming heterogeneous systems with traditional programming models such as OpenCL or CUDA involves rewriting large portions of application code. These models also lead to code that is not performance-portable across different architectures, or even across different generations of the same architecture. In this paper, we evaluate the current state of two emerging parallel programming models: C++ AMP and OpenACC. These models require minimal code changes and rely on the compiler to target the low-level hardware language, thereby producing code that is performance-portable from an application standpoint. We analyze the performance and productivity of the emerging programming models and compare them with OpenCL using a diverse set of applications on two different architectures: a CPU coupled with a discrete GPU, and an Accelerated Processing Unit (APU). Our experiments demonstrate that while the emerging programming models improve programmer productivity, they do not yet expose enough flexibility to extract maximum performance compared with traditional programming models.
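To make the "minimal code changes" claim concrete, the following is a minimal sketch of a directive-based port in C with OpenACC. It is not taken from the paper's benchmarks: the `saxpy` function, its parameters, and the data clauses are assumptions chosen for illustration. The loop body is the unchanged serial code, and the single `#pragma acc` line is the only addition.

```c
#include <stddef.h>

/* Illustrative SAXPY kernel (hypothetical example, not from the paper):
 * computes y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy(int n, float a, const float *x, float *y)
{
    /* OpenACC directive: asks the compiler to offload the loop to an
     * accelerator. copyin moves x to the device; copy moves y to the
     * device and back. A compiler without OpenACC support ignores the
     * pragma and runs the loop serially on the CPU. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

An equivalent OpenCL version would typically need a separate kernel source plus host code for platform setup, buffer management, and kernel launch. That gap is the productivity difference the abstract refers to, while the explicit control OpenCL offers is what leaves room for additional hand tuning.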
