Overlapping Data Transfers with Computation on GPU with Tiles

GPUs are widely employed to accelerate scientific applications, but they demand considerable programming effort, largely because of the disjoint address spaces of the host and the device. OpenACC and OpenMP 4.0 offer directive-based solutions that alleviate this burden; however, synchronous data movement can become a performance bottleneck that prevents applications from fully exploiting the GPU. We propose a tiling-based programming model and an accompanying library that simplify the development of GPU programs and overlap data movement with computation. The model decomposes data and computation into tiles and treats them as the basic units of transfer and execution, which allows the transfers to be pipelined so that their latency is hidden. Moreover, partitioning application data into tiles lets the programmer exploit the GPU even when the application data cannot fit into device memory. The library leverages C++ lambda functions, OpenACC directives, CUDA streams, and the tiling API from TiDA to support both productivity and performance. We evaluate the library on a data-transfer-intensive kernel and a compute-intensive kernel and compare its speedup against OpenACC and CUDA. The results indicate that the library hides transfer latency, handles cases where device memory is insufficient, and achieves reasonable performance.
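The pipelining idea described above can be sketched in plain C++: the array is split into fixed-size tiles, and while tile *i* is being computed, tile *i*+1 is being fetched on a second thread. This is a minimal illustration of the overlap pattern, not the library's actual API; the names (`pipelined_scale`, `transfer`) and the use of `std::async` in place of `cudaMemcpyAsync` on a separate CUDA stream are assumptions made for a self-contained, host-only sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Sketch of tile pipelining: prefetch tile 0, then in each step launch the
// "transfer" of the next tile asynchronously before computing on the tile
// that has already arrived. On a real GPU the transfer would be a
// cudaMemcpyAsync on its own stream and the loop body a kernel launch.
std::vector<double> pipelined_scale(const std::vector<double>& host,
                                    std::size_t tile_size, double factor) {
    std::vector<double> out(host.size());
    const std::size_t ntiles = (host.size() + tile_size - 1) / tile_size;

    // "Transfer" of one tile: stands in for a host-to-device copy.
    auto transfer = [&host, tile_size](std::size_t t) {
        std::size_t lo = t * tile_size;
        std::size_t hi = std::min(lo + tile_size, host.size());
        return std::vector<double>(host.begin() + lo, host.begin() + hi);
    };

    std::future<std::vector<double>> next =
        std::async(std::launch::async, transfer, std::size_t{0});
    for (std::size_t t = 0; t < ntiles; ++t) {
        std::vector<double> tile = next.get();          // wait for tile t
        if (t + 1 < ntiles)                             // overlap: start t+1
            next = std::async(std::launch::async, transfer, t + 1);
        std::size_t lo = t * tile_size;                 // "compute" on tile t
        for (std::size_t i = 0; i < tile.size(); ++i)
            out[lo + i] = tile[i] * factor;
    }
    return out;
}
```

Because each tile is an independent transfer/execution unit, the same loop also covers the out-of-core case the abstract mentions: only one or two tiles need to reside in device memory at a time, regardless of the total problem size.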
