Using High Level GPU Tasks to Explore Memory and Communications Options on Heterogeneous Platforms

Heterogeneous computing platforms that use GPUs for acceleration are becoming prevalent, which makes it important to develop parallel applications for such platforms and to optimize them for good performance. In this work, we develop a set of applications based on a high-level task design, which enforces a well-defined structure and improves portability. Together with the GPU task implementation, we provide a uniform interface for allocating and managing memory blocks that are used by both host and device, so that the appropriate type of memory for host/device communication can be chosen easily and flexibly within GPU tasks. Through asynchronous task execution and CUDA streams, we can exploit concurrent GPU kernels to improve performance when running multiple tasks. We developed a benchmark set containing nine kernel applications. The tests show that pinned memory improves host/device data transfer on GPU platforms, whereas the performance of unified memory varies considerably across GPU architectures, making it a poor choice when performance is the main concern. The multi-task tests show that applications built on our GPU tasks can effectively exploit the concurrent-kernel capability of modern GPUs for better resource utilization.
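The uniform memory interface described above can be pictured with a short CUDA sketch. This is a minimal illustration under stated assumptions: the names MemBlock, MemKind, memAlloc, and toDevice are hypothetical, not the framework's actual API. It only shows how a single allocation call could back a task's buffer with pageable, pinned, or unified memory; error checking is omitted for brevity.

#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical memory kinds a task-level allocator could expose.
enum class MemKind { Pageable, Pinned, Unified };

// Illustrative host/device buffer; not the paper's actual interface.
struct MemBlock {
    void*   host   = nullptr;  // host-visible pointer
    void*   device = nullptr;  // device-visible pointer
    size_t  bytes  = 0;
    MemKind kind   = MemKind::Pageable;
};

MemBlock memAlloc(size_t bytes, MemKind kind) {
    MemBlock b;
    b.bytes = bytes;
    b.kind  = kind;
    switch (kind) {
    case MemKind::Pageable:
        b.host = std::malloc(bytes);       // ordinary pageable host memory
        cudaMalloc(&b.device, bytes);
        break;
    case MemKind::Pinned:
        cudaMallocHost(&b.host, bytes);    // page-locked host memory
        cudaMalloc(&b.device, bytes);
        break;
    case MemKind::Unified:
        cudaMallocManaged(&b.host, bytes); // one pointer visible to host and device
        b.device = b.host;
        break;
    }
    return b;
}

// Host-to-device transfer: only needed when host and device memory are split.
// With pinned memory the copy can be truly asynchronous; with pageable memory
// the runtime stages the data through an internal pinned buffer.
void toDevice(const MemBlock& b, cudaStream_t s) {
    if (b.kind != MemKind::Unified)
        cudaMemcpyAsync(b.device, b.host, b.bytes, cudaMemcpyHostToDevice, s);
}

Because the memory kind is a single argument, a task can switch between pageable, pinned, and unified transfers without changing any kernel code, which is the kind of flexibility the abstract claims for comparing communication options.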
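On the multi-task side, concurrent kernels on modern GPUs are obtained by launching independent work into separate CUDA streams. The sketch below uses illustrative names (scaleKernel, runConcurrentTasks) that are not from the paper; it shows the standard stream pattern that asynchronous task execution maps onto.

__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Two independent tasks issued into their own streams; on GPUs with
// concurrent-kernel support the two launches can overlap on the device.
void runConcurrentTasks(float* d_a, float* d_b, int n) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads, 0, s0>>>(d_a, 2.0f, n);  // task 1
    scaleKernel<<<blocks, threads, 0, s1>>>(d_b, 0.5f, n);  // task 2

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}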
