Tiling arbitrarily nested loops by means of the transitive

Abstract: A novel approach to generating tiled code for arbitrarily nested loops is presented. It is derived by combining the polyhedral and iteration space slicing frameworks. Instead of program transformations represented by a set of affine functions, one for each statement, the approach uses the transitive closure of a loop nest dependence graph to correct the original rectangular tiles so that all dependences of the original loop nest are preserved under the lexicographic order of the target tiles. Parallel tiled code can be generated from valid serial tiled code by applying affine transformations or transitive closure to an inter-tile dependence graph, whose vertices represent target tiles and whose edges connect dependent target tiles. We demonstrate how a relation describing such a graph can be formed. The main merit of the presented approach over well-known ones is that it does not require full permutability of loops to generate both serial and parallel tiled code, which widens the scope of loop nests that can be tiled.
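The idea of correcting rectangular tiles with the transitive closure can be illustrated by a brute-force sketch over a small, explicitly enumerated iteration space. This is a simplified illustration under stated assumptions, not the symbolic, relation-based formulation of the paper: the loop size `N`, the tile size `B`, the single dependence distance vector `(1, -1)`, and all function names are hypothetical. Each iteration is moved to the lexicographically greatest rectangular tile among those holding the iteration itself or any of its transitive predecessors, so that executing target tiles in lexicographic order respects every dependence. The last function collects the edges of an inter-tile dependence graph over the target tiles.

```python
from itertools import product

N, B = 4, 2          # assumed loop bound and tile size (illustrative only)
DISTS = [(1, -1)]    # assumed dependence distance vectors, all lexicographically positive


def points(n):
    """All iterations (i, j) of an n x n loop nest, in lexicographic order."""
    return list(product(range(n), repeat=2))


def tile_of(p, b):
    """Identifier of the original rectangular tile containing iteration p."""
    return (p[0] // b, p[1] // b)


def successors(p, n, dists):
    """Iterations that directly depend on p and lie inside the iteration space."""
    for d in dists:
        q = (p[0] + d[0], p[1] + d[1])
        if 0 <= q[0] < n and 0 <= q[1] < n:
            yield q


def transitive_closure(n, dists):
    """Map each iteration to the set of iterations transitively dependent on it.

    Distance vectors are lexicographically positive, so every successor comes
    later in lexicographic order; visiting iterations in reverse order means a
    successor's closure is complete before it is merged into its predecessor's.
    """
    pts = points(n)
    closure = {p: set() for p in pts}
    for p in reversed(pts):
        for q in successors(p, n, dists):
            closure[p].add(q)
            closure[p] |= closure[q]
    return closure


def corrected_tiles(n, b, dists):
    """Assign each iteration to a target tile that preserves all dependences."""
    closure = transitive_closure(n, dists)
    target = {p: tile_of(p, b) for p in points(n)}
    for p, reached in closure.items():
        for q in reached:
            # q must not run in a tile that precedes the tile of its predecessor p
            target[q] = max(target[q], tile_of(p, b))
    return target


def is_valid(assignment, n, dists):
    """True if no dependence goes from a later tile to an earlier tile."""
    return all(assignment[p] <= assignment[q]
               for p in points(n) for q in successors(p, n, dists))


def inter_tile_edges(assignment, n, dists):
    """Edges (source tile, destination tile) between distinct dependent tiles."""
    return {(assignment[p], assignment[q])
            for p in points(n) for q in successors(p, n, dists)
            if assignment[p] != assignment[q]}


if __name__ == "__main__":
    original = {p: tile_of(p, B) for p in points(N)}
    target = corrected_tiles(N, B, DISTS)
    print("rectangular tiles valid:", is_valid(original, N, DISTS))
    print("corrected tiles valid:  ", is_valid(target, N, DISTS))
    print("inter-tile edges:", sorted(inter_tile_edges(target, N, DISTS)))
```

In this example the original rectangular tiling is invalid because iteration `(0, 2)` in tile `(0, 1)` must run before iteration `(1, 1)` in tile `(0, 0)`; the correction moves `(1, 1)` into target tile `(0, 1)`. Within a target tile, iterations are assumed to keep their original lexicographic order. Explicit enumeration only works for fixed, small bounds; handling parameterized loop nests requires symbolic relations, which this sketch does not attempt.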
