Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data

Optimizing compilers exploit the memory hierarchy using loop tiling and fusion, but the two transformations often interfere with each other because transformations on the data held in memory are overlooked. This paper presents a novel composition of loop tiling and fusion. Unlike existing tiling-after-fusion algorithms, which transform only computation spaces, our approach first applies rectangular/parallelogram tiling to live-out computation spaces to fit the memory hierarchy and then computes the memory footprints required by each tile. The upwards exposed data extracted from these footprints determine the tile shapes of intermediate computation spaces, allowing the construction of arbitrary tile shapes. Finally, our technique implements a post-tiling fusion strategy that maximizes data locality without losing the tilability or parallelism of live-out computation spaces, thereby enabling storage reduction and reuse and optimizing the memory hierarchy. Experiments on 11 benchmarks drawn from domains including neural networks, image processing, sparse matrix computation, and linear algebra show that our approach outperforms the state of the art on both CPU and GPU architectures. In addition, results for the ResNet-50 model on an AI accelerator show that our approach obtains a 16% performance improvement.
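The order of steps described in the abstract (tile the live-out space, compute each tile's memory footprint, derive the intermediate tile from that footprint, then fuse) can be illustrated with a small hand-written sketch. The two-stage 1D pipeline, the function names `reference`, `footprint`, and `tiled_fused`, and the tile size are all assumptions made for illustration; this is not the paper's algorithm or implementation, which operates on polyhedral representations rather than Python lists.

```python
def reference(a):
    """Unfused, untiled pipeline: an intermediate stage, then a live-out stage."""
    n = len(a)
    tmp = [2 * x for x in a]  # intermediate computation space
    # Live-out computation space: 3-point stencil over the interior points.
    return [tmp[i - 1] + tmp[i] + tmp[i + 1] for i in range(1, n - 1)]


def footprint(lo, hi):
    """Half-open range of tmp indices read by live-out iterations [lo, hi)."""
    return lo - 1, hi + 1


def tiled_fused(a, tile=4):
    """Tile the live-out stage first, then shape the producer tile from its footprint."""
    n = len(a)
    out = []
    # Step 1: rectangular tiling of the live-out iteration space [1, n - 1).
    for lo in range(1, n - 1, tile):
        hi = min(lo + tile, n - 1)
        # Step 2: memory footprint of this tile in the intermediate array.
        f_lo, f_hi = footprint(lo, hi)
        # Step 3: the footprint fixes the intermediate tile (it overlaps its
        # neighbours), computed into a small tile-local buffer.
        buf = [2 * a[j] for j in range(f_lo, f_hi)]
        # Step 4: post-tiling fusion: consume the buffer inside the same tile.
        for i in range(lo, hi):
            k = i - f_lo  # position of tmp[i] inside buf
            out.append(buf[k - 1] + buf[k] + buf[k + 1])
    return out


if __name__ == "__main__":
    data = list(range(10))
    print(tiled_fused(data, tile=4) == reference(data))
```

In this sketch the full-size `tmp` array is replaced by a buffer of at most `tile + 2` elements, which is a toy instance of the storage reduction the abstract mentions; the price is that boundary elements of the intermediate stage are recomputed by adjacent tiles.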
