Tiramisu: A Polyhedral Compiler with a Scheduling Language for Targeting High Performance Systems

This paper introduces Tiramisu, a polyhedral framework designed to generate high-performance code for multiple platforms, including multicore CPUs, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel extensions that explicitly manage the complexities of targeting these systems, including explicit communication, synchronization, and the mapping of buffers to different memory hierarchies. Tiramisu relies on a flexible representation based on the polyhedral model and uses a well-defined four-level IR that fully separates the algorithm from loop transformations, data layouts, and communication. This separation simplifies targeting multiple hardware architectures with the same algorithm. We evaluate Tiramisu on a set of image processing and stencil benchmarks and compare it with state-of-the-art compilers. We show that Tiramisu matches or outperforms existing compilers on different hardware architectures, including multicore CPUs, GPUs, and distributed machines.
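To make the four-level separation concrete, the sketch below mirrors the two-stage blur pipeline used as a running example in the Tiramisu paper: Layer I states the pure algorithm, while loop transformations and buffer mappings are layered on afterwards as independent scheduling commands. Names such as computation, after, parallelize, vectorize, and store_in follow the paper's listings, but the exact constructor signatures, types (p_uint32, a_temporary, a_output), and the input declaration are assumptions for illustration, not a verbatim transcription of any particular Tiramisu release.

    // Hedged sketch of a Tiramisu-style two-stage blur (assumed API,
    // modeled on the paper's listings; signatures may differ in the library).
    constexpr int N = 1024, M = 1024;          // image dimensions (assumed)

    // Layer I -- pure algorithm: what is computed, with no execution order.
    var i("i", 0, N - 2), j("j", 0, M - 2);
    input in("in", {N, M}, p_uint32);          // source image (assumed form)
    computation bx("bx", {i, j}, (in(i, j) + in(i, j + 1) + in(i, j + 2)) / 3);
    computation by("by", {i, j}, (bx(i, j) + bx(i + 1, j) + bx(i + 2, j)) / 3);

    // Layer II -- loop transformations and mapping to the target; the same
    // Layer I algorithm could instead be tiled, fused, or mapped to GPU/MPI.
    by.after(bx, j);      // interleave producer and consumer at loop level j
    bx.parallelize(i);    // distribute rows of each stage across CPU threads
    by.parallelize(i);
    bx.vectorize(j, 8);   // 8-wide vectorization of the innermost loop
    by.vectorize(j, 8);

    // Layer III -- data layout: map each computation into a concrete buffer.
    buffer b_bx("b_bx", {N - 2, M}, p_uint32, a_temporary);
    buffer b_by("b_by", {N - 2, M - 2}, p_uint32, a_output);
    bx.store_in(&b_bx);
    by.store_in(&b_by);

Because the layers are independent, retargeting this pipeline to a GPU or a distributed machine would leave Layer I untouched and change only the scheduling and buffer-mapping commands in Layers II and III.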
