An input-adaptive and in-place approach to dense tensor-times-matrix multiply

This paper describes a novel framework, called InTensLi ("intensely"), for producing fast single-node implementations of dense tensor-times-matrix multiply (TTM) of arbitrary dimension. Conventional implementations of TTM explicitly convert the input tensor operand into a matrix so that they can use any available fast general matrix-matrix multiply (GEMM) implementation. Our framework's strategy is instead to carry out the TTM in-place, avoiding this copy. As the resulting implementations expose tuning parameters, this paper also describes a heuristic empirical model for selecting an optimal configuration based on the TTM's inputs. When compared to widely used single-node TTM implementations that are available in the Tensor Toolbox and Cyclops Tensor Framework (CTF), InTensLi's in-place and input-adaptive TTM implementations achieve 4× and 13× speedups, showing GEMM-like performance on a variety of input sizes.
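The contrast between the two strategies can be illustrated with a minimal NumPy sketch. It is not InTensLi's implementation; the function names `ttm_unfold` and `ttm_inplace`, the 0-based `mode` argument, and the assumption of a row-major (C-contiguous) tensor are all choices made here for illustration. The first function copies the tensor into a matrix before a single large GEMM; the second views the tensor as a stack of matrices without copying it and issues one smaller GEMM per slice.

```python
import numpy as np

def ttm_unfold(X, U, mode):
    """Conventional TTM: unfold X along `mode` (a copy in general), one GEMM, fold back."""
    rest = X.shape[:mode] + X.shape[mode + 1:]
    # Bring the contracted mode to the front and flatten the remaining modes.
    Xn = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
    Yn = U @ Xn  # single large matrix-matrix multiply
    return np.moveaxis(Yn.reshape((U.shape[0],) + rest), 0, mode)

def ttm_inplace(X, U, mode):
    """TTM without the unfolding copy (illustrative; assumes X is C-contiguous)."""
    assert X.flags.c_contiguous, "this sketch assumes row-major storage"
    outer = int(np.prod(X.shape[:mode]))      # modes before `mode`, merged
    inner = int(np.prod(X.shape[mode + 1:]))  # modes after `mode`, merged
    # Reshaping a contiguous array yields a view: no tensor data is moved.
    Xv = X.reshape(outer, X.shape[mode], inner)
    Y = np.empty((outer, U.shape[0], inner), dtype=np.result_type(X, U))
    for i in range(outer):
        # Each slice Xv[i] is an (I_mode x inner) matrix already laid out in memory.
        np.matmul(U, Xv[i], out=Y[i])
    return Y.reshape(X.shape[:mode] + (U.shape[0],) + X.shape[mode + 1:])

# Usage: multiply a 4 x 5 x 6 tensor by a 3 x 5 matrix along mode 1.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))
U = rng.standard_normal((3, 5))
Y = ttm_inplace(X, U, mode=1)
```

In this sketch the number and shape of the per-slice GEMM calls depend on the mode and the tensor dimensions, which gives a rough sense of why an in-place scheme exposes tuning choices that depend on the inputs.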
