Extending OpenACC for Efficient Stencil Code Generation and Execution by Skeleton Frameworks

The OpenACC programming model simplifies programming for accelerator devices such as GPUs. Its abstract accelerator model defines a least common denominator for accelerator devices, so it cannot represent the architectural specifics of these devices without losing portability. This general-purpose approach therefore delivers good performance on average but misses optimization opportunities in code generation and execution for specific classes of applications. In this paper, we propose OpenACC extensions that enable efficient code generation and execution of stencil applications by parallel skeleton frameworks such as PSkel. Our results show that the stencil extensions can improve OpenACC performance by up to 28% on GPU and 45% on CPU. Moreover, we show that the work-partitioning mechanism offered by the skeleton framework, which splits the computation across CPU and GPU, can further improve application performance by up to 18%.
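
For readers unfamiliar with the kind of computation targeted here, the following is a minimal sketch of one sweep of a 4-neighbor stencil annotated with standard OpenACC directives only. It does not use the extensions proposed in the paper, whose syntax is not given in this abstract. The function name stencil_step and the averaging kernel are illustrative assumptions.

```c
/* Hypothetical example: one Jacobi-style sweep of a 4-neighbor stencil
   over the interior of a rows x cols grid stored in row-major order.
   Border cells of `out` are not written and keep their previous values.
   Only standard OpenACC directives are used; a compiler without OpenACC
   support ignores the pragma and runs the loops sequentially. */
void stencil_step(const float *restrict in, float *restrict out,
                  int rows, int cols)
{
    #pragma acc parallel loop collapse(2) \
        copyin(in[0:rows * cols]) copy(out[0:rows * cols])
    for (int i = 1; i < rows - 1; ++i) {
        for (int j = 1; j < cols - 1; ++j) {
            /* Each interior cell becomes the average of its north,
               south, west and east neighbors in the input grid. */
            out[i * cols + j] = 0.25f * (in[(i - 1) * cols + j] +
                                         in[(i + 1) * cols + j] +
                                         in[i * cols + (j - 1)] +
                                         in[i * cols + (j + 1)]);
        }
    }
}
```

An iterative stencil application would call such a function repeatedly, swapping the in and out buffers between sweeps. The generic directive above gives the compiler no information that the loop nest is a stencil, which is the kind of gap the abstract says the proposed extensions address.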
