Generation of parallel synchronization-free tiled code

A novel approach to generating parallel synchronization-free tiled code for loop nests is presented. It combines the Polyhedral and Iteration Space Slicing frameworks. The approach uses the transitive closure of the loop nest dependence graph to correct original rectangular tiles so that all dependences of the original loop nest are preserved under the lexicographic order of the target (corrected) tiles. Parallel synchronization-free tiled code is then generated from the valid (corrected) tiles, again by applying the transitive closure of the dependence graph. The main contribution of the paper is to demonstrate that the presented technique generates parallel synchronization-free tiled code whenever the exact transitive closure of the dependence graph can be calculated and synchronization-free slices exist at the statement instance level in the loop nest. We show that the approach extracts such parallelism in cases where well-known techniques fail to do so. The scope of loop nests for which synchronization-free tiled code can be generated is enlarged by intersecting the extracted slices with the generated valid tiles, in contrast to forming slices of valid tiles, as suggested in previously published techniques based on the transitive closure of a dependence graph. The approach is implemented in the publicly available TC optimizing compiler. Experimental results demonstrating the effectiveness of the approach and the efficiency of the parallel programs it generates are discussed.
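The idea of intersecting synchronization-free slices with tiles can be illustrated on a toy loop nest. The sketch below is not the paper's algorithm: it enumerates a small, fixed iteration space by brute force instead of working symbolically with dependence relations and their transitive closure, and it assumes a loop nest `a[i][j] = a[i][j-1]` whose rectangular tiles are already valid, so no tile correction is shown. All names (`sync_free_slices`, `tile_of`, `slice_tile_schedule`) and the tile sizes are hypothetical.

```python
from collections import defaultdict


def sync_free_slices(points, deps):
    """Group iterations into weakly connected components of the dependence graph.

    Each component is a slice that shares no dependence with any other slice,
    so slices can run in parallel without synchronization.
    """
    parent = {p: p for p in points}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for src, dst in deps:
        ra, rb = find(src), find(dst)
        if ra != rb:
            # Keep the lexicographically smallest iteration as representative.
            parent[max(ra, rb)] = min(ra, rb)

    slices = defaultdict(list)
    for p in sorted(points):
        slices[find(p)].append(p)
    return dict(slices)


def tile_of(point, tile_sizes):
    """Identifier of the rectangular tile that contains an iteration."""
    return tuple(x // t for x, t in zip(point, tile_sizes))


def slice_tile_schedule(points, deps, tile_sizes):
    """Intersect every slice with every tile.

    Returns {slice representative: [(tile id, iterations), ...]} with tiles and
    iterations in lexicographic order inside each slice.
    """
    schedule = {}
    for rep, members in sync_free_slices(points, deps).items():
        pieces = defaultdict(list)
        for p in members:
            pieces[tile_of(p, tile_sizes)].append(p)
        schedule[rep] = [(t, pieces[t]) for t in sorted(pieces)]
    return schedule


# Toy loop nest (assumed for illustration):
#   for i in range(N): for j in range(1, M): a[i][j] = a[i][j-1]
N, M = 4, 6
points = [(i, j) for i in range(N) for j in range(M)]
deps = [((i, j - 1), (i, j)) for i in range(N) for j in range(1, M)]
schedule = slice_tile_schedule(points, deps, tile_sizes=(2, 3))
```

Here every row `i` forms its own slice, and each slice is cut into two tile pieces. A slice can be assigned to one thread that executes its pieces in order, while different slices need no synchronization with each other.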
