Declarative Loop Tactics for Domain-specific Optimization

Increasingly complex hardware makes the design of effective compilers difficult. To mitigate this problem, we introduce Declarative Loop Tactics, a novel framework of composable program transformations based on the internal tree-shaped program representation of a polyhedral compiler. The framework offers a declarative C++ API built around easy-to-program matchers and builders, which provide the foundation for developing loop optimization strategies. Using matchers and builders, we express computational patterns and core building blocks, such as loop tiling, fusion, and data-layout transformations, and compose them into algorithm-specific optimizations. Declarative Loop Tactics (Loop Tactics for short) applies to many domains. For two of them, stencils and linear algebra, we show how developers can express sophisticated domain-specific optimizations as sets of composable transformations or as calls to optimized libraries. By allowing developers to add highly customized optimizations for a given computational pattern, we expect our approach to reduce the need for DSLs and to extend the range of optimizations a general-purpose compiler can perform.
