Domain-Specific Optimization and Generation of High-Performance GPU Code for Stencil Computations

Stencil computations arise in a number of computational domains. They exhibit significant data parallelism and are thus well suited for execution on graphical processing units (GPUs), but can be memory-bandwidth limited unless temporal locality is utilized via tiling. This paper describes how effective tiled code can be generated for GPUs from a domain-specific language (DSL) for stencils. Experimental results demonstrate the benefits of such a domain-specific optimization approach over state-of-the-art general-purpose compiler optimizations.

[1]  Bradley C. Kuszmaul,et al.  The pochoir stencil compiler , 2011, SPAA '11.

[2]  P. Sadayappan,et al.  High-performance code generation for stencil computations on GPU architectures , 2012, ICS '12.

[3]  Mohamed Wahib,et al.  Scalable Kernel Fusion for Memory-Bound GPU Applications , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[4]  Alejandro Duran,et al.  YASK—Yet Another Stencil Kernel: A Framework for HPC Stencil Code-Generation and Tuning , 2016, 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC).

[5]  Samuel Williams,et al.  Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[6]  Albert Cohen,et al.  Split tiling for GPUs: automatic parallelization using trapezoidal tiles , 2013, GPGPU@ASPLOS.

[7]  A Thesis,et al.  Tiling Stencil Computations to Maximize Parallelism , 2013 .

[8]  Paulius Micikevicius,et al.  Fusing convolution kernels through tiling , 2015, ARRAY@PLDI.

[9]  Kevin Skadron,et al.  Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs , 2009, ICS.

[10]  J. Ramanujam,et al.  A framework for enhancing data reuse via associative reordering , 2014, PLDI.

[11]  Uday Bondhugula,et al.  Loop transformations: convexity, pruning and optimization , 2011, POPL '11.

[12]  Sven Verdoolaege,et al.  isl: An Integer Set Library for the Polyhedral Model , 2010, ICMS.

[13]  Xing Zhou,et al.  Hierarchical overlapped tiling , 2012, CGO '12.

[14]  Naoya Maruyama,et al.  Optimizing Stencil Computations for NVIDIA Kepler GPUs , 2014 .

[15]  Mario A. R. Dantas,et al.  Extending OpenACC for Efficient Stencil Code Generation and Execution by Skeleton Frameworks , 2017, 2017 International Conference on High Performance Computing & Simulation (HPCS).

[16]  Mark F. Adams,et al.  Chombo Software Package for AMR Applications Design Document , 2014 .

[17]  Uday Bondhugula,et al.  PolyMage: Automatic Optimization for Image Processing Pipelines , 2015, ASPLOS.

[18]  P. Sadayappan,et al.  Resource conscious reuse-driven tiling for GPUs , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).

[19]  Jacqueline Chame,et al.  A script-based autotuning compiler system to generate high-performance CUDA code , 2013, TACO.

[20]  James Demmel,et al.  Benchmarking GPUs to tune dense linear algebra , 2008, HiPC 2008.

[21]  Tobias Gysi,et al.  STELLA: a domain-specific tool for structured grid methods in weather and climate models , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[22]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[23]  Francky Catthoor,et al.  Polyhedral parallel code generation for CUDA , 2013, TACO.

[24]  Albert Cohen,et al.  Hybrid Hexagonal/Classical Tiling for GPUs , 2014, CGO '14.

[25]  Richard Veras,et al.  A stencil compiler for short-vector SIMD architectures , 2013, ICS '13.

[26]  Frédo Durand,et al.  Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.

[27]  Samuel Williams,et al.  Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers , 2017, Parallel Comput..

[28]  Vinod Grover,et al.  Forma: a DSL for image processing applications to target GPUs and multi-core CPUs , 2015, GPGPU@PPoPP.

[29]  Catherine Mills Olschanowsky,et al.  A Study on Balancing Parallelism, Data Locality, and Recomputation in Existing PDE Solvers , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[30]  Catherine Mills Olschanowsky,et al.  Transforming loop chains via macro dataflow graphs , 2018, CGO.

[31]  P. Sadayappan,et al.  Effective resource management for enhancing performance of 2D and 3D stencils on GPUs , 2016, GPGPU@PPoPP.

[32]  Cosmin Nita,et al.  Optimized three-dimensional stencil computation on Fermi and Kepler GPUs , 2014, 2014 IEEE High Performance Extreme Computing Conference (HPEC).

[33]  Helmar Burkhart,et al.  PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[34]  Sanjay V. Rajopadhye,et al.  Simple, Accurate, Analytical Time Modeling and Optimal Tile Size Selection for GPGPU Stencils , 2017, PPoPP.

[35]  Scott B. Baden,et al.  Mint: realizing CUDA performance in 3D stencil methods with annotated C , 2011, ICS '11.

[36]  Satoshi Matsuoka,et al.  Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[37]  D. Qainlant,et al.  ROSE: Compiler Support for Object-Oriented Frameworks , 1999 .

[38]  Peter Messmer,et al.  Parallel data-locality aware stencil computations on modern micro-architectures , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[39]  Jonathan Ragan-Kelley,et al.  Automatically scheduling halide image processing pipelines , 2016, ACM Trans. Graph..

[40]  Frank Mueller,et al.  Auto-generation and auto-tuning of 3D stencil codes on GPU clusters , 2012, CGO '12.

[41]  Paulius Micikevicius,et al.  3D finite difference computation on GPUs using CUDA , 2009, GPGPU-2.

[42]  Shoaib Kamil,et al.  OpenTuner: An extensible framework for program autotuning , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[43]  Torsten Hoefler,et al.  MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures , 2015, ICS.

[44]  Pradeep Dubey,et al.  3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.