Towards Accelerating High-Order Stencils on Modern GPUs and Emerging Architectures with a Portable Framework

PDE discretization schemes that yield stencil-like computing patterns are commonly used in seismic modeling, weather forecasting, and other scientific applications. Achieving HPC-level performance for stencil computations on a single architecture is challenging, and porting to other architectures without sacrificing performance requires significant effort, especially in this golden age of many distinctive architectures. To help developers achieve performance, portability, and productivity with stencil computations, we developed StencilPy. With StencilPy, developers write stencil computations in a high-level domain-specific language, which promotes productivity, while its backends generate efficient code for existing and emerging architectures, including NVIDIA, AMD, and Intel GPUs, A64FX, and STX. StencilPy delivers promising performance on par with hand-written code, maintains performance portability across architectures, and enhances productivity. Its modular design enables easy configuration, customization, and extension. A 25-point star-shaped stencil written in StencilPy is one-quarter the length of a hand-crafted CUDA implementation and achieves similar performance on an NVIDIA H100 GPU.
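To make the "25-point star-shaped stencil" concrete, the following is a minimal reference sketch in plain NumPy. It is not StencilPy code and does not reflect StencilPy's DSL, which is not shown in this abstract. It assumes the common reading of a 25-point star: a 3D stencil of radius 4 (1 center point plus 4 neighbors in each of 6 directions), here weighted with the standard 8th-order central-difference coefficients for a second derivative. The names `RADIUS`, `COEFFS`, and `star_stencil_25pt` are hypothetical.

```python
import numpy as np

# Assumption: radius-4 star in 3D, i.e. 1 + 6 * 4 = 25 points.
RADIUS = 4

# Standard 8th-order central-difference weights for a second derivative
# (center weight first, then offsets 1..4).
COEFFS = np.array([-205 / 72, 8 / 5, -1 / 5, 8 / 315, -1 / 560])


def star_stencil_25pt(u, h=1.0):
    """Apply a 25-point star stencil (discrete Laplacian) to a 3D array.

    Only interior points (at least RADIUS cells from every face) are
    updated; the halo region of the output is left at zero.
    """
    r = RADIUS
    nz, ny, nx = u.shape
    out = np.zeros_like(u)

    # Center point contributes once per axis.
    acc = 3.0 * COEFFS[0] * u[r:nz - r, r:ny - r, r:nx - r]

    # Add the symmetric neighbor pairs along each axis.
    for k in range(1, r + 1):
        acc = acc + COEFFS[k] * (
            u[r + k:nz - r + k, r:ny - r, r:nx - r]
            + u[r - k:nz - r - k, r:ny - r, r:nx - r]
            + u[r:nz - r, r + k:ny - r + k, r:nx - r]
            + u[r:nz - r, r - k:ny - r - k, r:nx - r]
            + u[r:nz - r, r:ny - r, r + k:nx - r + k]
            + u[r:nz - r, r:ny - r, r - k:nx - r - k]
        )

    out[r:nz - r, r:ny - r, r:nx - r] = acc / h**2
    return out
```

Each output point reads 25 inputs but performs few arithmetic operations per byte loaded, which is why hand-written GPU versions of this kernel typically rely on techniques such as tiling and register reuse, and why generating such code automatically for several architectures is the hard part.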
