论文信息 - Domain-Specific Optimization and Generation of High-Performance GPU Code for Stencil Computations

Domain-Specific Optimization and Generation of High-Performance GPU Code for Stencil Computations

Stencil computations arise in a number of computational domains. They exhibit significant data parallelism and are thus well suited for execution on graphical processing units (GPUs), but can be memory-bandwidth limited unless temporal locality is utilized via tiling. This paper describes how effective tiled code can be generated for GPUs from a domain-specific language (DSL) for stencils. Experimental results demonstrate the benefits of such a domain-specific optimization approach over state-of-the-art general-purpose compiler optimizations.

[1] Bradley C. Kuszmaul,et al. The pochoir stencil compiler , 2011, SPAA '11.

[2] P. Sadayappan,et al. High-performance code generation for stencil computations on GPU architectures , 2012, ICS '12.

[3] Mohamed Wahib,et al. Scalable Kernel Fusion for Memory-Bound GPU Applications , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[4] Alejandro Duran,et al. YASK—Yet Another Stencil Kernel: A Framework for HPC Stencil Code-Generation and Tuning , 2016, 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC).

[5] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[6] Albert Cohen,et al. Split tiling for GPUs: automatic parallelization using trapezoidal tiles , 2013, GPGPU@ASPLOS.

[7] A Thesis,et al. Tiling Stencil Computations to Maximize Parallelism , 2013 .

[8] Paulius Micikevicius,et al. Fusing convolution kernels through tiling , 2015, ARRAY@PLDI.

[9] Kevin Skadron,et al. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs , 2009, ICS.

[10] J. Ramanujam,et al. A framework for enhancing data reuse via associative reordering , 2014, PLDI.

[11] Uday Bondhugula,et al. Loop transformations: convexity, pruning and optimization , 2011, POPL '11.

[12] Sven Verdoolaege,et al. isl: An Integer Set Library for the Polyhedral Model , 2010, ICMS.

[13] Xing Zhou,et al. Hierarchical overlapped tiling , 2012, CGO '12.

[14] Naoya Maruyama,et al. Optimizing Stencil Computations for NVIDIA Kepler GPUs , 2014 .

[15] Mario A. R. Dantas,et al. Extending OpenACC for Efficient Stencil Code Generation and Execution by Skeleton Frameworks , 2017, 2017 International Conference on High Performance Computing & Simulation (HPCS).

[16] Mark F. Adams,et al. Chombo Software Package for AMR Applications Design Document , 2014 .

[17] Uday Bondhugula,et al. PolyMage: Automatic Optimization for Image Processing Pipelines , 2015, ASPLOS.

[18] P. Sadayappan,et al. Resource conscious reuse-driven tiling for GPUs , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).

[19] Jacqueline Chame,et al. A script-based autotuning compiler system to generate high-performance CUDA code , 2013, TACO.

[20] James Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, HiPC 2008.

[21] Tobias Gysi,et al. STELLA: a domain-specific tool for structured grid methods in weather and climate models , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[22] Vikram S. Adve,et al. LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[23] Francky Catthoor,et al. Polyhedral parallel code generation for CUDA , 2013, TACO.

[24] Albert Cohen,et al. Hybrid Hexagonal/Classical Tiling for GPUs , 2014, CGO '14.

[25] Richard Veras,et al. A stencil compiler for short-vector SIMD architectures , 2013, ICS '13.

[26] Frédo Durand,et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.

[27] Samuel Williams,et al. Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers , 2017, Parallel Comput..

[28] Vinod Grover,et al. Forma: a DSL for image processing applications to target GPUs and multi-core CPUs , 2015, GPGPU@PPoPP.

[29] Catherine Mills Olschanowsky,et al. A Study on Balancing Parallelism, Data Locality, and Recomputation in Existing PDE Solvers , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[30] Catherine Mills Olschanowsky,et al. Transforming loop chains via macro dataflow graphs , 2018, CGO.

[31] P. Sadayappan,et al. Effective resource management for enhancing performance of 2D and 3D stencils on GPUs , 2016, GPGPU@PPoPP.

[32] Cosmin Nita,et al. Optimized three-dimensional stencil computation on Fermi and Kepler GPUs , 2014, 2014 IEEE High Performance Extreme Computing Conference (HPEC).

[33] Helmar Burkhart,et al. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[34] Sanjay V. Rajopadhye,et al. Simple, Accurate, Analytical Time Modeling and Optimal Tile Size Selection for GPGPU Stencils , 2017, PPoPP.

[35] Scott B. Baden,et al. Mint: realizing CUDA performance in 3D stencil methods with annotated C , 2011, ICS '11.

[36] Satoshi Matsuoka,et al. Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[37] D. Qainlant,et al. ROSE: Compiler Support for Object-Oriented Frameworks , 1999 .

[38] Peter Messmer,et al. Parallel data-locality aware stencil computations on modern micro-architectures , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[39] Jonathan Ragan-Kelley,et al. Automatically scheduling halide image processing pipelines , 2016, ACM Trans. Graph..

[40] Frank Mueller,et al. Auto-generation and auto-tuning of 3D stencil codes on GPU clusters , 2012, CGO '12.

[41] Paulius Micikevicius,et al. 3D finite difference computation on GPUs using CUDA , 2009, GPGPU-2.

[42] Shoaib Kamil,et al. OpenTuner: An extensible framework for program autotuning , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[43] Torsten Hoefler,et al. MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures , 2015, ICS.

[44] Pradeep Dubey,et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.