An Effective Fusion and Tile Size Model for PolyMage

Effective models for fusing loop nests remain a challenge in both general-purpose and domain-specific language (DSL) compilers. The difficulty often arises from the combinatorial explosion of grouping choices and their interaction with parallelism and locality. This article presents a new fusion algorithm for high-performance domain-specific compilers for image processing pipelines. The algorithm uses dynamic programming to explore spaces of fusion possibilities not covered by previous approaches, guided by a cost function that captures optimization criteria more concretely and precisely than prior work. The fusion model is tailored to the transformation and optimization sequence applied by PolyMage and Halide, two recent DSLs for image processing pipelines. When implemented in PolyMage, our model-driven technique provides significant improvements over PolyMage’s existing approach, which uses auto-tuning to aid its model (by up to 4.32×), and over Halide’s automatic approach (by up to 2.46×) on two state-of-the-art shared-memory multicore architectures.
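To make the dynamic-programming idea concrete, the following is a minimal illustrative sketch, not PolyMage’s actual algorithm: an interval DP that partitions a linear sequence of pipeline stages into contiguous fused groups of minimum total cost. The `group_cost` function and its constants are hypothetical stand-ins for a model that would weigh locality (fewer intermediates written to memory) against redundant computation from overlapped tiling; the stage names are likewise invented.

```python
# Hedged sketch of dynamic-programming fusion grouping for a *linear* pipeline.
# group_cost() is a made-up cost model for illustration only; a real model
# (as in PolyMage) would account for locality, parallelism, and recomputation.
from functools import lru_cache

def group_cost(stages):
    # Hypothetical trade-off: fusing more stages cuts intermediate-storage
    # traffic (storage term) but grows redundant overlapped-tile work
    # (recompute term). The constants here are arbitrary.
    n = len(stages)
    storage = 1.0 / n
    recompute = 0.1 * n * n
    return storage + recompute

def best_partition(stages):
    """Minimum-cost partition of a stage sequence into contiguous fused groups."""
    n = len(stages)

    @lru_cache(maxsize=None)
    def dp(i):
        # Best (cost, grouping) for the suffix stages[i:].
        if i == n:
            return (0.0, ())
        best = None
        for j in range(i + 1, n + 1):  # try fusing stages[i:j] as one group
            tail_cost, tail_groups = dp(j)
            cost = group_cost(stages[i:j]) + tail_cost
            if best is None or cost < best[0]:
                best = (cost, (tuple(stages[i:j]),) + tail_groups)
        return best

    return dp(0)

# Hypothetical four-stage pipeline.
cost, groups = best_partition(("blur_x", "blur_y", "sharpen", "mask"))
```

Under this toy cost model, the DP fuses the four stages into two pairs rather than fusing everything or nothing, which mirrors the qualitative behavior the abstract describes: the best grouping is neither maximal nor minimal fusion but is chosen by the cost function. Real pipelines form DAGs rather than chains, so the actual search space of grouping choices is considerably larger.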
