A Medium-Grained Algorithm for Distributed Sparse Tensor Factorization

Multi-way data can be modeled using tensors, which are data structures indexed along three or more dimensions. Tensors are increasingly used to analyze extremely large and sparse multi-way datasets in life sciences, engineering, and business. The canonical polyadic decomposition (CPD) is a popular tensor factorization for discovering latent features and is most commonly computed via the method of alternating least squares (CPD-ALS). The computational time and memory required to compute the CPD limit the size and dimensionality of the tensors that can be factored on a typical workstation, making distributed solution approaches the only viable option. Most methods for distributed-memory systems have focused on distributing the tensor in a coarse-grained, one-dimensional fashion, which requires the dense matrix factors to be fully replicated on each node at prohibitive cost. Recent work overcomes this limitation by using a fine-grained decomposition of the tensor nonzeros, at the cost of computationally expensive hypergraph partitioning. To address both limitations, we present a medium-grained decomposition that avoids complete factor replication and communication, while eliminating the need for expensive pre-processing steps. We use a hybrid MPI+OpenMP implementation that exploits multi-core architectures with a low memory footprint. We theoretically analyze the scalability of the coarse-, medium-, and fine-grained decompositions and experimentally compare them across a variety of datasets. Experiments show that the medium-grained decomposition reduces communication volume by 36-90% compared to the coarse-grained decomposition, is 41-76x faster than a state-of-the-art MPI code, and is 1.5-5.0x faster than the fine-grained decomposition with 1024 cores.

Keywords: sparse tensor, distributed, PARAFAC, CPD, parallel, medium-grained
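To make the CPD-ALS procedure named above concrete, the following is a minimal single-process sketch in Python with NumPy. It is not the distributed MPI+OpenMP implementation described here; the function names (`mttkrp`, `cpd_als`, `reconstruct_at`) and the coordinate-list tensor layout are illustrative assumptions. Each ALS step fixes all but one factor matrix and solves a least-squares problem whose right-hand side is a sparse matricized-tensor-times-Khatri-Rao product (MTTKRP).

```python
import numpy as np

def mttkrp(coords, vals, factors, mode):
    """Sparse MTTKRP over a coordinate-format tensor (illustrative helper).

    coords: (nnz, N) integer array of nonzero indices, vals: (nnz,) values.
    For each nonzero, multiply its value by the matching rows of every
    factor except `mode`, then accumulate into the row indexed by `mode`.
    """
    rank = factors[0].shape[1]
    prod = np.repeat(vals[:, None], rank, axis=1)
    for m, A in enumerate(factors):
        if m != mode:
            prod = prod * A[coords[:, m]]
    out = np.zeros((factors[mode].shape[0], rank))
    np.add.at(out, coords[:, mode], prod)  # handles repeated row indices
    return out

def cpd_als(coords, vals, shape, rank, iters=50, seed=0):
    """Plain (unregularized, unnormalized) CPD-ALS; a sketch, not tuned code."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((n, rank)) for n in shape]
    for _ in range(iters):
        for mode in range(len(shape)):
            # Hadamard product of the Gram matrices of the other factors.
            gram = np.ones((rank, rank))
            for m, A in enumerate(factors):
                if m != mode:
                    gram *= A.T @ A
            # Normal-equations update for the current mode.
            factors[mode] = mttkrp(coords, vals, factors, mode) @ np.linalg.pinv(gram)
    return factors

def reconstruct_at(coords, factors):
    """Evaluate the rank-R model at the given coordinates."""
    prod = np.ones((coords.shape[0], factors[0].shape[1]))
    for m, A in enumerate(factors):
        prod *= A[coords[:, m]]
    return prod.sum(axis=1)
```

In this sketch the whole tensor and all factor matrices live in one address space. The coarse-, medium-, and fine-grained decompositions compared in this work differ in how the nonzeros and the factor rows touched by the MTTKRP are split across processes.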
