Compiler Transformation to Generate Hybrid Sparse Computations

Applications over sparse matrices and graphs often rely on efficient representations that exploit the matrix's nonzero structure. In some cases, this structure varies within the matrix, e.g., some portions are denser while others are very sparse. For such matrices, hybrid algorithms are commonly used in sparse linear algebra and graph libraries, employing multiple representations and computations within a single matrix. Automating this approach in a compiler is difficult because it depends on analysis of the input matrix, which is only available at runtime. This paper describes compiler and runtime support for generating hybrid implementations. It automatically partitions the input matrix or graph into multiple disjoint subsets that correspond to significant differences in nonzero structure, so that each subset can be optimized separately. For this purpose, the paper introduces a non-affine split transformation, which automatically generates an inspector and multiple executors. The inspector analyzes and partitions the input matrix according to the split criteria; the resulting executors are further optimized with customized transformations to derive specialized representations. We demonstrate the performance gains of hybrid implementations on an Nvidia K20c (Kepler) GPU for examples from sparse linear algebra and graph analytics: sparse matrix-vector multiplication and stochastic gradient descent.
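The inspector/executor pattern described above can be sketched in plain Python. This is an illustrative approximation, not the paper's actual generated code: a runtime inspector splits a CSR matrix's rows by nonzero count, one executor runs the denser rows in an ELL-style padded layout, and a second executor handles the remaining rows in CSR. The function names and the nnz-per-row threshold are assumptions made for this sketch.

```python
def inspector_split(rowptr, threshold):
    """Inspector: partition row indices into 'dense' and 'sparse' subsets
    based on the number of nonzeros per row."""
    dense, sparse = [], []
    for i in range(len(rowptr) - 1):
        (dense if rowptr[i + 1] - rowptr[i] >= threshold else sparse).append(i)
    return dense, sparse

def build_ell(rowptr, col, val, rows):
    """Pad the selected rows to a common width (ELL layout); padded slots
    use column 0 with value 0.0, which contributes nothing to the product."""
    width = max((rowptr[i + 1] - rowptr[i] for i in rows), default=0)
    ell_col = [[0] * width for _ in rows]
    ell_val = [[0.0] * width for _ in rows]
    for r, i in enumerate(rows):
        for k, j in enumerate(range(rowptr[i], rowptr[i + 1])):
            ell_col[r][k] = col[j]
            ell_val[r][k] = val[j]
    return width, ell_col, ell_val

def hybrid_spmv(rowptr, col, val, x, threshold=3):
    """Compute y = A @ x using two executors over disjoint row subsets."""
    n = len(rowptr) - 1
    y = [0.0] * n
    dense, sparse = inspector_split(rowptr, threshold)      # inspector
    width, ell_col, ell_val = build_ell(rowptr, col, val, dense)
    for r, i in enumerate(dense):                           # executor 1: ELL
        y[i] = sum(ell_val[r][k] * x[ell_col[r][k]] for k in range(width))
    for i in sparse:                                        # executor 2: CSR
        y[i] = sum(val[j] * x[col[j]] for j in range(rowptr[i], rowptr[i + 1]))
    return y
```

On a GPU, the point of the split is that the ELL executor's uniform row width gives coalesced, branch-free loops for the dense portion, while the CSR executor avoids wasting padded work on the long tail of short rows; the inspector cost is amortized when the same structure is reused across many executor invocations, as in iterative solvers.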
