The tensor algebra compiler

Tensor algebra is a powerful tool with applications in machine learning, data analytics, engineering, and the physical sciences. Tensors are often sparse, and compound operations must frequently be computed in a single kernel, both for performance and to save memory. Programmers are left to write a kernel for every operation of interest, and for every mix of dense and sparse tensors in different formats. The number of combinations is unbounded, which makes it impossible to implement and optimize them all by hand. This paper introduces the first compiler technique to automatically generate kernels for any compound tensor algebra operation on dense and sparse tensors. The technique is implemented in a C++ library called taco. Its performance is competitive with best-in-class hand-optimized kernels in popular libraries, while supporting far more tensor operations.
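To make the single-kernel point concrete, here is a minimal hand-written sketch of one compound operation, a sampled dense-dense matrix product A(i,j) = B(i,j) * sum_k C(i,k) * D(k,j), where B is sparse and C and D are dense. It is not taco's API and not code that taco generates. The `CsrMatrix` struct, the `sddmm` function, and the row-major layout of C and D are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR (compressed sparse row) container, for illustration only.
struct CsrMatrix {
    std::size_t rows = 0, cols = 0;
    std::vector<std::size_t> row_ptr;  // size rows + 1; row i spans [row_ptr[i], row_ptr[i+1])
    std::vector<std::size_t> col_idx;  // column index of each stored value
    std::vector<double> vals;          // stored values
};

// Fused kernel for A(i,j) = B(i,j) * sum_k C(i,k) * D(k,j).
// C is rows x inner and D is inner x cols, both dense and row-major.
// The result has the same sparsity pattern as B.
CsrMatrix sddmm(const CsrMatrix& B, const std::vector<double>& C,
                const std::vector<double>& D, std::size_t inner) {
    assert(C.size() == B.rows * inner);
    assert(D.size() == inner * B.cols);

    CsrMatrix A = B;  // copy the pattern; values are overwritten below
    for (std::size_t i = 0; i < B.rows; ++i) {
        // Visit only the stored entries of row i of B.
        for (std::size_t p = B.row_ptr[i]; p < B.row_ptr[i + 1]; ++p) {
            const std::size_t j = B.col_idx[p];
            double dot = 0.0;
            for (std::size_t k = 0; k < inner; ++k) {
                dot += C[i * inner + k] * D[k * B.cols + j];
            }
            A.vals[p] = B.vals[p] * dot;
        }
    }
    return A;
}
```

Evaluating the same expression as two separate library calls would first materialize the dense product of C and D, which costs rows x cols memory and work even where B has no entry. The fused loop computes a dot product only for the stored entries of B. Every other expression and format combination needs its own loop nest of this kind, which is the manual effort the abstract describes.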
