Domain-Specific Multi-Level IR Rewriting for GPU

Most compilers have a single core intermediate representation (IR), such as LLVM IR, sometimes complemented with vaguely defined IR-like data structures. This IR is commonly low-level and close to machine instructions. As a result, optimizations that rely on domain-specific information are either impossible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs level by level, and performs each code transformation at the most suitable level. We demonstrate the effectiveness of this approach for the weather and climate domain. In particular, we develop a prototype compiler and design stencil- and GPU-specific dialects based on a set of newly introduced design principles. We find that two domain-specific optimizations (500 lines of code), realized on top of LLVM’s extensible MLIR compiler infrastructure, suffice to outperform state-of-the-art solutions. In essence, multi-level rewriting promises to herald the age of specialized compilers composed from domain- and target-specific dialects implemented on top of a shared infrastructure.
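
The idea of rewriting at the most suitable level can be illustrated with a toy two-level IR. The sketch below is not the paper's stencil or GPU dialect and does not use the MLIR API; `StencilOp`, `Loop`, `fuse`, `lower`, and `run` are hypothetical names introduced only for illustration, and the stencils are restricted to one dimension with constant coefficients. A producer/consumer fusion is a few lines at the stencil level, where access offsets are explicit, whereas the same rewrite on the lowered loops would first have to recover those offsets from index arithmetic.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class StencilOp:
    """High-level op (hypothetical): out[i] = sum(coef * inp[i + off])."""
    out: str
    inp: str
    taps: Tuple[Tuple[int, float], ...]  # (offset, coefficient) pairs


@dataclass(frozen=True)
class Loop:
    """Low-level op (hypothetical): an explicit loop over [lo, hi)."""
    out: str
    lo: int
    hi: int
    body: Tuple[Tuple[str, int, float], ...]  # (array, offset, coefficient)


def fuse(producer: StencilOp, consumer: StencilOp) -> StencilOp:
    """Stencil-level rewrite: inline the producer into the consumer."""
    if consumer.inp != producer.out:
        raise ValueError("consumer does not read the producer's output")
    acc: Dict[int, float] = {}
    for c_off, c_coef in consumer.taps:
        for p_off, p_coef in producer.taps:
            key = c_off + p_off
            acc[key] = acc.get(key, 0.0) + c_coef * p_coef
    return StencilOp(consumer.out, producer.inp, tuple(sorted(acc.items())))


def lower(op: StencilOp, n: int) -> Loop:
    """Lower a stencil op to a loop that skips the halo at both ends."""
    halo = max(abs(off) for off, _ in op.taps)
    body = tuple((op.inp, off, coef) for off, coef in op.taps)
    return Loop(op.out, halo, n - halo, body)


def run(loops: List[Loop], inputs: Dict[str, List[float]], n: int) -> Dict[str, List[float]]:
    """Interpret the loop-level program; points outside [lo, hi) stay 0.0."""
    env = dict(inputs)
    for loop in loops:
        out = [0.0] * n
        for i in range(loop.lo, loop.hi):
            out[i] = sum(coef * env[arr][i + off] for arr, off, coef in loop.body)
        env[loop.out] = out
    return env


# Two chained second-difference stencils, fused into one five-point stencil.
second_diff = ((-1, 1.0), (0, -2.0), (1, 1.0))
lap = StencilOp("lap", "u", second_diff)
bilap = StencilOp("out", "lap", second_diff)
fused = fuse(lap, bilap)

n = 8
u = [float(i ** 4) for i in range(n)]
two_pass = run([lower(lap, n), lower(bilap, n)], {"u": u}, n)["out"]
one_pass = run([lower(fused, n)], {"u": u}, n)["out"]
```

On the interior points `2 .. n-3` both programs compute the same values, but the fused version makes a single pass over `u` and never materializes the intermediate array `lap`. The two versions differ near the boundary, because the unfused program reads zero-filled halo points of `lap`; a real compiler would have to handle the domain bounds explicitly.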
