Paper Information - A Medium-Grained Algorithm for Distributed Sparse Tensor Factorization

Modeling multi-way data can be accomplished using tensors, which are data structures indexed along three or more dimensions. Tensors are increasingly used to analyze extremely large and sparse multi-way datasets in life sciences, engineering, and business. The canonical polyadic decomposition (CPD) is a popular tensor factorization for discovering latent features and is most commonly found via the method of alternating least squares (CPD-ALS). The computational time and memory required to compute CPD limit the size and dimensionality of the tensors that can be solved on a typical workstation, making distributed solution approaches the only viable option. Most methods for distributed-memory systems have focused on distributing the tensor in a coarse-grained, one-dimensional fashion that prohibitively requires the dense matrix factors to be fully replicated on each node. Recent work overcomes this limitation by using a fine-grained decomposition of the tensor nonzeros, at the cost of computationally expensive hypergraph partitioning. To that effect, we present a medium-grained decomposition that avoids complete factor replication and communication, while eliminating the need for expensive pre-processing steps. We use a hybrid MPI+OpenMP implementation that exploits multi-core architectures with a low memory footprint. We theoretically analyze the scalability of the coarse-, medium-, and fine-grained decompositions and experimentally compare them across a variety of datasets. Experiments show that the medium-grained decomposition reduces communication volume by 36-90% compared to the coarse-grained decomposition, is 41-76x faster than a state-of-the-art MPI code, and is 1.5-5.0x faster than the fine-grained decomposition with 1024 cores.

Keywords: Sparse tensor, distributed, PARAFAC, CPD, parallel, medium-grained
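The abstract does not spell out the computation behind CPD-ALS. As a rough illustration only, the sketch below shows the sparse kernel that commonly dominates each ALS update for a three-way tensor, the matricized tensor times Khatri-Rao product (MTTKRP), written in plain Python. The function name `mttkrp_mode1`, the coordinate-list storage, and the toy data are assumptions made for this example; none of it is taken from the authors' MPI+OpenMP implementation.

```python
from collections import defaultdict


def mttkrp_mode1(nonzeros, B, C, rank):
    """Mode-1 MTTKRP for a 3-way sparse tensor in coordinate format.

    nonzeros: iterable of (i, j, k, value) tuples (hypothetical layout).
    B, C:     dense factor matrices for modes 2 and 3, as lists of rows.
    Returns a dict mapping each mode-1 index i to its output row.
    """
    M = defaultdict(lambda: [0.0] * rank)
    for i, j, k, value in nonzeros:
        row = M[i]
        # Each nonzero reads one row of B and one row of C.
        for f in range(rank):
            row[f] += value * B[j][f] * C[k][f]
    return dict(M)


# Toy 2 x 2 x 2 tensor with three nonzeros and rank-2 factors.
nonzeros = [(0, 0, 0, 1.0), (0, 1, 1, 2.0), (1, 0, 1, 3.0)]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[5.0, 6.0], [7.0, 8.0]]
result = mttkrp_mode1(nonzeros, B, C, rank=2)
print(result)
```

In this sketch a process holding only a subset of the nonzeros would need only the factor rows that those nonzeros index, which is one way to read the abstract's point about avoiding full factor replication.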

G. Karypis | Shaden Smith
