Non-affine Extensions to Polyhedral Code Generation

This paper describes a loop transformation framework that extends a polyhedral representation of loop nests to represent and transform computations with non-affine index arrays in loop bounds and subscripts via a new interface between compile-time and run-time abstractions. Polyhedra scanning code generation, which historically applies an affine mapping to the subscript expressions of the statements in a loop nest, is modified to apply non-affine mappings involving index arrays that are represented at compile time by uninterpreted functions; non-affine loop bounds involving index arrays are also represented. When appropriate, an inspector is utilized to capture the non-affine subscript mappings, and a generalized loop coalescing transformation is introduced as a non-affine transformation to support non-affine loop bounds. With this support, complex sequences of new and existing transformations can then be composed. We demonstrate the effectiveness of this framework by optimizing sparse matrix vector multiplication operations targeting GPUs for different matrix structures and parallelization strategies. This approach achieves performance that is comparable to or greater than the hand-tuned CUSP library; for two of the implementations it achieves an average 1.14× improvement over CUSP across a collection of sparse matrices, while the third performs on average within 8% of CUSP.

[1]  Albert Cohen,et al.  Polyhedral Code Generation in the Real World , 2006, CC.

[2]  Geri Georg,et al.  Set and Relation Manipulation for the Sparse Polyhedral Framework , 2012, LCPC.

[3]  Jacqueline Chame,et al.  A script-based autotuning compiler system to generate high-performance CUDA code , 2013, TACO.

[4]  Ken Kennedy,et al.  Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings , 2001, International Journal of Parallel Programming.

[5]  Ken Kennedy,et al.  Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .

[6]  Larry Carter,et al.  An approach for code generation in the Sparse Polyhedral Framework , 2016, Parallel Comput..

[7]  Harry A. G. Wijshoff,et al.  Sublimation: Expanding Data Structures to Enable Data Instance Specific Optimizations , 2010, LCPC.

[8]  Joel H. Saltz,et al.  Programming Irregular Applications: Runtime Support, Compilation and Tools , 1997, Adv. Comput..

[9]  Rudolf Eigenmann,et al.  The range test: a dependence test for symbolic, non-linear expressions , 1994, Proceedings of Supercomputing '94.

[10]  Paul Feautrier,et al.  Fuzzy Array Dataflow Analysis , 1997, J. Parallel Distributed Comput..

[11]  Rudolf Eigenmann,et al.  Optimizing irregular shared-memory applications for distributed-memory systems , 2006, PPoPP '06.

[12]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.

[13]  Corinne Ancourt,et al.  Scanning polyhedra with DO loops , 1991, PPOPP '91.

[14]  Monica S. Lam,et al.  Interprocedural parallelization analysis in SUIF , 2005, TOPL.

[15]  Samuel Williams,et al.  Optimization of sparse matrix-vector multiplication on emerging multicore platforms , 2009, Parallel Comput..

[16]  Chun Chen,et al.  Polyhedra scanning revisited , 2012, PLDI.

[17]  Albert Cohen,et al.  The Polyhedral Model Is More Widely Applicable Than You Think , 2010, CC.

[18]  Paul Feautrier,et al.  Automatic Parallelization in the Polytope Model , 1996, The Data Parallel Programming Model.

[19]  Francky Catthoor,et al.  Polyhedral parallel code generation for CUDA , 2013, TACO.

[20]  Joel H. Saltz,et al.  Principles of runtime support for parallel processors , 1988, ICS '88.

[21]  Ken Kennedy,et al.  Improving cache performance in dynamic applications through data and computation reorganization at run time , 1999, PLDI '99.

[22]  William Pugh,et al.  Constraint-based array dependence analysis , 1998, TOPL.

[23]  William Pugh,et al.  Optimization within a unified transformation framework , 1996 .

[24]  Chau-Wen Tseng,et al.  Exploiting locality for irregular scientific codes , 2006, IEEE Transactions on Parallel and Distributed Systems.

[25]  Larry Carter,et al.  Compile-time composition of run-time data and iteration reorderings , 2003, PLDI '03.

[26]  Sanjay V. Rajopadhye,et al.  Generation of Efficient Nested Loops from Polyhedra , 2000, International Journal of Parallel Programming.

[27]  David A. Padua,et al.  Compiler analysis of irregular memory accesses , 2000, PLDI '00.

[28]  William Pugh,et al.  Nonlinear array dependence analysis , 1994 .

[29]  Larry Carter,et al.  Localizing non-affine array references , 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425).

[30]  Bo Wu,et al.  Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU , 2013, PPoPP '13.

[31]  Michael Garland,et al.  Implementing sparse matrix-vector multiplication on throughput-oriented processors , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[32]  L. Rauchwerger,et al.  The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization , 1999, IEEE Trans. Parallel Distributed Syst..

[33]  J. Ramanujam,et al.  Code generation for parallel execution of a class of irregular loops on distributed memory systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[34]  Lawrence Rauchwerger,et al.  Hybrid Analysis: Static & Dynamic Memory Reference Analysis , 2004, International Journal of Parallel Programming.

[35]  Rudolf Eigenmann,et al.  Idiom recognition in the Polaris parallelizing compiler , 1995, ICS '95.

[36]  Katherine Yelick,et al.  OSKI: A library of automatically tuned sparse matrix kernels , 2005 .

[37]  Sven Verdoolaege,et al.  isl: An Integer Set Library for the Polyhedral Model , 2010, ICMS.