AN5D: automated stencil framework for high-degree temporal blocking on GPUs

Stencil computation is one of the most widely used compute patterns in high-performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by shifting memory pressure from external memory to on-chip memory on GPUs. However, implementing these optimizations correctly and with high performance is difficult, given the complexity of the GPU architecture and memory hierarchy. We propose AN5D, an automated stencil framework that transforms and optimizes stencil patterns in a given C source code and generates the corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure compared to existing implementations, allowing performance to scale up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.
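
To make the idea of temporal blocking concrete, the following is a minimal CPU-side sketch in C of overlapped temporal blocking for a 1D 3-point Jacobi stencil. It is not the code AN5D generates, and it does not model shared memory or registers; the function names (`jacobi_naive`, `jacobi_blocked`), the averaging stencil, and the parameters `bt` (temporal blocking degree) and `tile` (spatial tile width) are all assumptions chosen for illustration. Each tile is loaded once with a halo of `bt` cells per side, advanced `bt` time steps in a small local buffer (standing in for on-chip memory), and only the still-valid center is written back.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Reference version: every time step sweeps the whole array through
 * "external" memory. Cells a[0] and a[n-1] are fixed boundaries. */
void jacobi_naive(float *a, int n, int T)
{
    float *tmp = malloc(n * sizeof *tmp);
    memcpy(tmp, a, n * sizeof *a);
    for (int t = 0; t < T; ++t) {
        for (int i = 1; i < n - 1; ++i)
            tmp[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0f;
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}

/* Overlapped temporal blocking (illustrative sketch): each tile is read
 * once, advanced up to bt steps locally, then written back once. */
void jacobi_blocked(float *a, int n, int T, int bt, int tile)
{
    assert(n >= 3 && bt >= 1 && tile >= 1);
    float *out = malloc(n * sizeof *out);
    float *b0 = malloc((tile + 2 * bt) * sizeof *b0);
    float *b1 = malloc((tile + 2 * bt) * sizeof *b1);

    for (int t0 = 0; t0 < T; t0 += bt) {
        int steps = (T - t0 < bt) ? T - t0 : bt;
        memcpy(out, a, n * sizeof *a);           /* carries the fixed boundaries */
        for (int s = 1; s < n - 1; s += tile) {
            int e  = (s + tile < n - 1) ? s + tile : n - 1;  /* outputs [s, e) */
            int lo = (s - steps > 0) ? s - steps : 0;        /* loaded [lo, hi) */
            int hi = (e + steps < n) ? e + steps : n;
            int len = hi - lo;
            /* Fill both buffers so fixed boundary cells stay correct after swaps. */
            memcpy(b0, a + lo, len * sizeof *a);
            memcpy(b1, a + lo, len * sizeof *a);
            float *cur = b0, *nxt = b1;
            for (int k = 1; k <= steps; ++k) {
                /* The valid region shrinks by one cell per step on each side,
                 * except where the tile touches the domain boundary. */
                int vlo = (lo == 0) ? 1 : lo + k;
                int vhi = (hi == n) ? n - 1 : hi - k;
                for (int g = vlo; g < vhi; ++g) {
                    int i = g - lo;
                    nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0f;
                }
                float *sw = cur; cur = nxt; nxt = sw;
            }
            memcpy(out + s, cur + (s - lo), (e - s) * sizeof *a);
        }
        memcpy(a, out, n * sizeof *a);
    }
    free(out);
    free(b0);
    free(b1);
}
```

In this sketch, each tile reads `tile + 2*bt` cells and writes `tile` cells per `bt` time steps, instead of reading and writing the full array at every step; the price is redundant computation in the halo, which grows with `bt`. Trading external memory traffic against on-chip resource use and redundant work is the kind of balance that a performance model for parameter tuning would need to capture.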
