Paper Information - Domain-Specific Multi-Level IR Rewriting for GPU

Domain-Specific Multi-Level IR Rewriting for GPU

Most compilers have a single core intermediate representation (IR) (e.g., LLVM) sometimes complemented with vaguely defined IR-like data structures. This IR is commonly low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs level-by-level, and performs code transformations at the most suitable level. We demonstrate the effectiveness of this approach for the weather and climate domain. In particular, we develop a prototype compiler and design stencil- and GPU-specific dialects based on a set of newly introduced design principles. We find that two domain-specific optimizations (500 lines of code) realized on top of LLVM’s extensible MLIR compiler infrastructure suffice to outperform state-of-the-art solutions. In essence, multi-level rewriting promises to herald the age of specialized compilers composed from domain- and target-specific dialects implemented on top of a shared infrastructure.
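The level-by-level idea in the abstract can be illustrated with a small toy model. The sketch below is not the paper's compiler and does not use MLIR. It assumes a hypothetical one-dimensional "stencil" level (`StencilOp`) and a hypothetical "loop" level (`LoopOp`), with all names invented for illustration. A fusion rewrite is applied at the stencil level, where access offsets are explicit, and only afterwards is the program lowered to loops with concrete bounds. The fusion shown is a generic producer-consumer inlining, chosen for brevity; it is not claimed to be one of the two optimizations the abstract mentions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Taps = Tuple[Tuple[int, float], ...]  # (offset, coefficient) pairs


@dataclass(frozen=True)
class StencilOp:
    """High-level op (hypothetical): dst[i] = sum(coeff * src[i + offset])."""
    src: str
    dst: str
    taps: Taps


@dataclass(frozen=True)
class LoopOp:
    """Low-level op (hypothetical): same computation with explicit bounds."""
    src: str
    dst: str
    lo: int
    hi: int
    taps: Taps


def fuse_stencils(ops: List[StencilOp]) -> List[StencilOp]:
    """Stencil-level rewrite: inline a producer into the op that follows it.

    Assumes the intermediate buffer is a temporary read only by the next op.
    """
    out: List[StencilOp] = []
    for op in ops:
        if out and out[-1].dst == op.src:
            prod = out.pop()
            combined: Dict[int, float] = {}
            for o2, c2 in op.taps:
                for o1, c1 in prod.taps:
                    combined[o1 + o2] = combined.get(o1 + o2, 0.0) + c1 * c2
            out.append(StencilOp(prod.src, op.dst, tuple(sorted(combined.items()))))
        else:
            out.append(op)
    return out


def lower_to_loops(ops: List[StencilOp], n: int) -> List[LoopOp]:
    """Lowering: derive loop bounds so that every access stays in range."""
    loops: List[LoopOp] = []
    valid: Dict[str, Tuple[int, int]] = {}  # range of defined values per buffer
    for op in ops:
        src_lo, src_hi = valid.get(op.src, (0, n))
        offsets = [o for o, _ in op.taps]
        lo = max(0, src_lo - min(offsets))
        hi = min(n, src_hi - max(offsets))
        valid[op.dst] = (lo, hi)
        loops.append(LoopOp(op.src, op.dst, lo, hi, op.taps))
    return loops


def run(loops: List[LoopOp], inputs: Dict[str, List[float]], n: int) -> Dict[str, List[float]]:
    """Interpret the loop-level program; untouched elements stay 0.0."""
    env = {name: list(vals) for name, vals in inputs.items()}
    for loop in loops:
        dst = env.setdefault(loop.dst, [0.0] * n)
        src = env[loop.src]
        for i in range(loop.lo, loop.hi):
            dst[i] = sum(c * src[i + o] for o, c in loop.taps)
    return env


# Two chained second-difference stencils: a -> tmp -> b.
second_diff: Taps = ((-1, 1.0), (0, -2.0), (1, 1.0))
program = [StencilOp("a", "tmp", second_diff), StencilOp("tmp", "b", second_diff)]

N = 8
a = [float(i ** 4) for i in range(N)]
unfused = run(lower_to_loops(program, N), {"a": a}, N)
fused = run(lower_to_loops(fuse_stencils(program), N), {"a": a}, N)
assert unfused["b"] == fused["b"]
```

In this toy, fusion is a few lines because the stencil level still records offsets and coefficients directly. Recovering the same information from the loop level would require analysing index expressions and bounds, which is the kind of recovery effort the abstract attributes to low-level IRs. The bounds computed here match between the fused and unfused programs only for centered stencils such as the one used; that restriction is an assumption of the sketch.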
