On Optimizing Distributed Tucker Decomposition for Sparse Tensors

The Tucker decomposition generalizes the notion of Singular Value Decomposition (SVD) to tensors, the higher dimensional analogues of matrices. We study the problem of constructing the Tucker decomposition of sparse tensors on distributed memory systems via the HOOI procedure, a popular iterative method. The scheme used for distributing the input tensor among the processors (MPI ranks) critically influences the HOOI execution time. Prior work has proposed different distribution schemes: an offline scheme based on sophisticated hypergraph partitioning method and simple, lightweight alternatives that can be used real-time. While the hypergraph based scheme typically results in faster HOOI execution time, being complex, the time taken for determining the distribution is an order of magnitude higher than the execution time of a single HOOI iteration. Our main contribution is a lightweight distribution scheme, which achieves the best of both worlds. We show that the scheme is near-optimal on certain fundamental metrics associated with the HOOI procedure and as a result, near-optimal on the computational load (FLOPs). Though the scheme may incur higher communication volume, the computation time is the dominant factor and as the result, the scheme achieves better performance on the overall HOOI execution time. Our experimental evaluation on large real-life tensors (having up to 4 billion elements) shows that the scheme outperforms the prior schemes on the HOOI execution time by a factor of up to 3x. On the other hand, its distribution time is comparable to the prior lightweight schemes and is typically lesser than the execution time of a single HOOI iteration.

[1]  George Karypis,et al.  Accelerating the Tucker Decomposition with Compressed Sparse Tensors , 2017, Euro-Par.

[2]  Vicente Hernández,et al.  A robust and efficient parallel SVD solver based on restarted Lanczos bidiagonalization. , 2007 .

[3]  Tamara G. Kolda,et al.  Efficient MATLAB Computations with Sparse and Factored Tensors , 2007, SIAM J. Sci. Comput..

[4]  Bora Uçar,et al.  High Performance Parallel Algorithms for the Tucker Decomposition of Sparse Tensors , 2016, 2016 45th International Conference on Parallel Processing (ICPP).

[5]  Bora Uçar,et al.  Scalable sparse tensor decompositions in distributed memory systems , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[6]  Tamara G. Kolda,et al.  Parallel Tensor Compression for Large-Scale Scientific Data , 2015, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[7]  Vijay V. Vazirani,et al.  Approximation Algorithms , 2001, Springer Berlin Heidelberg.

[8]  L. Tucker,et al.  Some mathematical notes on three-mode factor analysis , 1966, Psychometrika.

[9]  Tamara G. Kolda,et al.  Tensor Decompositions and Applications , 2009, SIAM Rev..

[10]  John H. Reif,et al.  Implementations of randomized sorting on large parallel machines , 1992, SPAA '92.

[11]  J. Chang,et al.  Analysis of individual differences in multidimensional scaling via an n-way generalization of “Eckart-Young” decomposition , 1970 .

[12]  Christos Faloutsos,et al.  GigaTensor: scaling tensor analysis up by 100 times - algorithms and discoveries , 2012, KDD.

[13]  Yogish Sabharwal,et al.  On Optimizing Distributed Tucker Decomposition for Dense Tensors , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[14]  Tamara G. Kolda,et al.  Scalable Tensor Decompositions for Multi-aspect Data Mining , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[15]  Joos Vandewalle,et al.  A Multilinear Singular Value Decomposition , 2000, SIAM J. Matrix Anal. Appl..

[16]  George Karypis,et al.  A Medium-Grained Algorithm for Sparse Tensor Factorization , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[17]  Christos Faloutsos,et al.  HaTen2: Billion-scale tensor decompositions , 2015, 2015 IEEE 31st International Conference on Data Engineering.

[18]  J. E. Román,et al.  Lanczos Bidiagonalization for the SVD in SLEPc , 2015 .

[19]  J. H. Choi,et al.  DFacTo: Distributed Factorization of Tensors , 2014, NIPS.

[20]  Kijung Shin,et al.  Distributed Methods for High-Dimensional and Large-Scale Tensor Factorization , 2014, 2014 IEEE International Conference on Data Mining.

[21]  Benoît Meister,et al.  Efficient and scalable computations with sparse tensors , 2012, 2012 IEEE Conference on High Performance Extreme Computing.

[22]  Zheng Chen,et al.  Text representation: from vector to tensor , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[23]  Lars Karlsson,et al.  Parallel algorithms for tensor completion in the CP format , 2016, Parallel Comput..

[24]  Joos Vandewalle,et al.  On the Best Rank-1 and Rank-(R1 , R2, ... , RN) Approximation of Higher-Order Tensors , 2000, SIAM J. Matrix Anal. Appl..

[25]  Salah Bourennane,et al.  Multidimensional filtering based on a tensor approach , 2005, Signal Process..

[26]  Andrzej Cichocki,et al.  Decomposition of Big Tensors With Low Multilinear Rank , 2014, ArXiv.

[27]  George Karypis,et al.  Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[28]  Richard A. Harshman,et al.  Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-model factor analysis , 1970 .