Sparse Tensor Algebra Optimizations with Workspaces

This paper shows how to optimize sparse tensor algebra expressions by introducing temporary tensors, called workspaces, into the resulting loop nests. We develop a new intermediate language for tensor operations called concrete index notation that extends tensor index notation. Concrete index notation expresses when and where sub-computations occur and which tensors they are stored into. We then describe the workspace optimization in this language and show how to compile it to sparse code by building on prior work in the literature. We demonstrate the importance of the optimization on several important sparse tensor kernels, including sparse matrix-matrix multiplication (SpMM), sparse tensor addition (SpAdd), and the matricized tensor times Khatri-Rao product (MTTKRP) used to factorize tensors. Our results show improvements over prior work on tensor algebra compilation and bring the performance of these kernels on par with state-of-the-art hand-optimized implementations. For example, SpMM was not supported by prior tensor algebra compilers, the performance of MTTKRP on the nell-2 data set improves by 35%, and MTTKRP can for the first time produce sparse results.
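To make the workspace idea concrete, the sketch below shows SpMM over CSR matrices in the style of Gustavson's algorithm, where a dense row-sized workspace accumulates scattered partial products instead of repeatedly merging sparse row structures. This is a minimal illustration assuming a CSR layout; the function and variable names are hypothetical and do not reflect the paper's generated code.

```python
# Sketch of the workspace optimization for sparse matrix-matrix
# multiplication (SpMM), Gustavson-style. A dense workspace of length
# n_cols accumulates each output row; a "seen" bitmap records which
# columns were touched so the row can be emitted and the workspace
# reset in O(nnz) time. All names here are illustrative.

def spmm_csr(n_cols, A_pos, A_crd, A_val, B_pos, B_crd, B_val):
    """Compute C = A * B for CSR matrices A and B; return CSR arrays for C."""
    C_pos, C_crd, C_val = [0], [], []
    workspace = [0.0] * n_cols   # dense accumulator (the workspace)
    seen = [False] * n_cols      # which workspace entries hold row i's values
    touched = []                 # columns touched in the current row
    for i in range(len(A_pos) - 1):
        for pA in range(A_pos[i], A_pos[i + 1]):
            k, a = A_crd[pA], A_val[pA]
            for pB in range(B_pos[k], B_pos[k + 1]):
                j = B_crd[pB]
                if not seen[j]:
                    seen[j] = True
                    touched.append(j)
                workspace[j] += a * B_val[pB]   # scatter into workspace
        touched.sort()           # emit row i with columns in order
        for j in touched:
            C_crd.append(j)
            C_val.append(workspace[j])
            workspace[j] = 0.0   # reset workspace for the next row
            seen[j] = False
        touched.clear()
        C_pos.append(len(C_crd))
    return C_pos, C_crd, C_val
```

The dense workspace trades extra memory for random-access inserts, which is what lets the compiled kernel match hand-written SpMM implementations.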
