Parallel Tiled Code Generation with Loop Permutation within Tiles

An approach to generating tiled code with an arbitrary order of loops within tiles is presented. It is based on the transitive closure of the program dependence graph and is derived by combining the Polyhedral and Iteration Space Slicing frameworks. The approach is explained by means of a working example, and details of its implementation in the TRACO compiler are outlined. The performance gain that loop permutation within tiles brings to tiled programs is illustrated on real-life programs from the NAS Parallel Benchmark suite. An analysis of the speed-up and scalability of parallel tiled code with loop permutation is presented.
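
To make "loop permutation within tiles" concrete, the following is a minimal hand-written sketch, not TRACO output and not the transitive-closure algorithm itself. It tiles a matrix multiplication and lets the caller choose the order of the three intra-tile loops. The function names (`matmul_reference`, `matmul_tiled`) and the `intra_order` parameter are hypothetical, introduced only for this illustration. Matrix multiplication is assumed here because every intra-tile order is legal for it: for a fixed `(i, j)` the `k` iterations are still visited in increasing order. For a general loop nest, legality has to be established by dependence analysis.

```python
from itertools import permutations


def matmul_reference(a, b, n):
    """Untiled i-j-k matrix multiplication used as the correctness baseline."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile, intra_order=("i", "j", "k")):
    """Tiled matrix multiplication with a selectable intra-tile loop order.

    The inter-tile loops (ii, jj, kk) enumerate tiles in a fixed order;
    intra_order is a permutation of "i", "j", "k" that decides which
    index is outermost, middle, and innermost inside each tile.
    """
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # Iteration ranges of the current tile, clipped at the border.
                bounds = {
                    "i": range(ii, min(ii + tile, n)),
                    "j": range(jj, min(jj + tile, n)),
                    "k": range(kk, min(kk + tile, n)),
                }
                outer, middle, inner = intra_order
                idx = {}
                for x in bounds[outer]:
                    idx[outer] = x
                    for y in bounds[middle]:
                        idx[middle] = y
                        for z in bounds[inner]:
                            idx[inner] = z
                            i, j, k = idx["i"], idx["j"], idx["k"]
                            c[i][j] += a[i][k] * b[k][j]
    return c


if __name__ == "__main__":
    n = 6
    a = [[i + j for j in range(n)] for i in range(n)]
    b = [[i * j + 1 for j in range(n)] for i in range(n)]
    ref = matmul_reference(a, b, n)
    for order in permutations("ijk"):
        assert matmul_tiled(a, b, n, 4, order) == ref
```

All six intra-tile orders compute the same result and differ only in their memory access pattern. In compiled code, that access pattern is what a permutation within tiles changes.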
