Unified Sparse Formats for Tensor Algebra Compilers

This paper shows how to build a sparse tensor algebra compiler that is agnostic to tensor formats (data layouts). We develop an interface that describes formats in terms of their capabilities and properties, and show how to build a modular code generator where new formats can be added as plugins. We then describe six implementations of the interface that compose to form the dense, CSR/CSF, COO, DIA, ELL, and HASH tensor formats and countless variants thereof. With these implementations at hand, our code generator can generate code for any tensor algebra expression on any combination of the aforementioned formats. To demonstrate our modular code generator design, we have implemented it in the open-source taco tensor algebra compiler. Our evaluation shows that supporting more formats specialized to different tensor structures yields better performance, and that our plugins make it easy to add new formats. For example, when data is provided in the COO format, computing a single matrix-vector multiplication with COO is up to 3.6× faster than with CSR. Furthermore, DIA is specialized to tensor convolutions and stencil operations and therefore performs up to 22% faster than CSR for such operations. To further demonstrate the importance of supporting many formats, we show that the best vector format for matrix-vector multiplication varies with input sparsity, from hash maps to sparse vectors to dense vectors. Finally, we show that the performance of generated code for these formats is competitive with hand-optimized implementations.
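To make the COO-versus-CSR comparison concrete, the following is a minimal illustrative sketch (not the paper's generated code; the function and variable names are our own) of sparse matrix-vector multiplication over the two formats. Computing directly on COO avoids a conversion to CSR, which is the scenario where COO can be faster for a single multiplication:

```python
def spmv_csr(pos, crd, vals, x, nrows):
    """y = A*x for A in CSR: pos[i]..pos[i+1] indexes row i's entries."""
    y = [0.0] * nrows
    for i in range(nrows):
        for p in range(pos[i], pos[i + 1]):
            y[i] += vals[p] * x[crd[p]]
    return y

def spmv_coo(rows, cols, vals, x, nrows):
    """y = A*x for A in COO: parallel arrays of (row, col, val) triples."""
    y = [0.0] * nrows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y

# A = [[2, 0, 1],
#      [0, 3, 0]]
x = [1.0, 2.0, 3.0]
print(spmv_csr([0, 2, 3], [0, 2, 1], [2.0, 1.0, 3.0], x, 2))  # [5.0, 6.0]
print(spmv_coo([0, 0, 1], [0, 2, 1], [2.0, 1.0, 3.0], x, 2))  # [5.0, 6.0]
```

Both loops visit the same nonzeros; the difference is that CSR's `pos` array gives random access to rows (useful when A is reused), while COO stores explicit row coordinates and can be consumed as-is when the data arrives in that form.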
