Multi-level tiling: M for the price of one

Tiling is a widely used loop transformation for exposing and exploiting parallelism and data locality. High-performance implementations use multiple levels of tiling to exploit the hierarchy of parallelism and cache/register locality. Efficient generation of multi-level tiled code is essential for effective use of multi-level tiling. Parameterized tiled code, where tile sizes are not fixed but left as symbolic parameters, can enable several dynamic and run-time optimizations. Previous solutions to multi-level tiled loop generation are limited to the case where tile sizes are fixed at compile time. We present an algorithm that can generate multi-level parameterized tiled loops at the same cost as generating single-level tiled loops. The efficiency of our method is demonstrated on several benchmarks. We also present a method, useful in register tiling, for separating partial and full tiles at an arbitrary level of tiling. The code generator we have implemented is available as an open-source tool.
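To make the terminology concrete, the following is a minimal hand-written sketch of what two-level parameterized tiled code can look like for a simple rectangular iteration space. It is not the algorithm of the paper or the output of its code generator; the function name `two_level_tiled`, the bounds `N` and `M`, and the tile-size parameters `T1` (outer level, for example cache) and `T2` (inner level, for example registers) are all assumptions chosen for illustration. The tile sizes are ordinary run-time arguments rather than compile-time constants, and the inner level separates full tiles from partial ones.

```python
def two_level_tiled(N, M, T1, T2):
    """Visit every point (i, j) of the rectangle [0, N) x [0, M)
    using two levels of tiling with run-time tile sizes T1 and T2.

    Returns the visited points plus counts of full and partial inner tiles.
    Illustrative only: rectangular space, square tiles at each level.
    """
    visited = []
    full_tiles = 0
    partial_tiles = 0

    # Level 1: loops over the origins of the outer tiles.
    for io in range(0, N, T1):
        for jo in range(0, M, T1):
            # Clip the outer tile to the iteration space.
            i_end = min(io + T1, N)
            j_end = min(jo + T1, M)

            # Level 2: loops over the origins of inner tiles inside this outer tile.
            for ii in range(io, i_end, T2):
                for jj in range(jo, j_end, T2):
                    if ii + T2 <= i_end and jj + T2 <= j_end:
                        # Full inner tile: trip counts are exactly T2 x T2,
                        # so the point loops need no min() bounds.
                        full_tiles += 1
                        for i in range(ii, ii + T2):
                            for j in range(jj, jj + T2):
                                visited.append((i, j))
                    else:
                        # Partial inner tile: clip the point loops to the
                        # enclosing outer tile.
                        partial_tiles += 1
                        for i in range(ii, min(ii + T2, i_end)):
                            for j in range(jj, min(jj + T2, j_end)):
                                visited.append((i, j))

    return visited, full_tiles, partial_tiles


if __name__ == "__main__":
    points, full, partial = two_level_tiled(5, 7, 4, 2)
    print(len(points), full, partial)
```

Because `T1` and `T2` are plain arguments, the same code can be re-run with different tile sizes without regeneration, which is the property that run-time tuning relies on. The full-tile branch is where a register-tiled implementation would typically place a fully unrolled body, since its trip counts do not depend on the boundary.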
