HPC formulations of optimization algorithms for tensor completion

Abstract

Tensor completion is a powerful tool used to estimate or recover missing values in multi-way data. It has seen great success in domains such as product recommendation and healthcare. Tensor completion is most often accomplished via low-rank sparse tensor factorization, a computationally expensive non-convex optimization problem which has only recently been studied in the context of parallel computing. In this work, we study three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We explore opportunities for parallelism on shared- and distributed-memory systems and address challenges such as memory- and operation-efficiency, load balance, cache locality, and communication. Among our advancements are a communication-efficient CCD++ algorithm, an ALS algorithm rich in level-3 BLAS routines, and an SGD algorithm which combines stratification with asynchronous communication. Furthermore, we show that introducing randomization during ALS and CCD++ can accelerate convergence. We evaluate our parallel formulations on a variety of real datasets on a modern supercomputer and demonstrate speedups through 16,384 cores. These improvements reduce time-to-solution from hours to seconds on real-world datasets. We show that after our optimizations, ALS is advantageous on parallel systems of small-to-moderate scale, while both ALS and CCD++ provide the lowest time-to-solution on large-scale distributed systems.
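To make the underlying optimization problem concrete, below is a minimal serial NumPy sketch of ALS for 3-way CP tensor completion with L2 regularization, fitting the model only to observed entries. The function name, hyperparameters, and data layout are illustrative assumptions, not the paper's implementation; the parallel, BLAS-3-rich formulation summarized in the abstract is omitted.

```python
import numpy as np

def cp_als_completion(coords, vals, shape, rank=10, reg=0.05, iters=10, seed=0):
    """Serial ALS sketch for 3-way CP tensor completion (illustrative only).

    coords : (nnz, 3) integer array of observed indices (i, j, k)
    vals   : (nnz,) observed values
    shape  : (I, J, K) tensor dimensions
    """
    rng = np.random.default_rng(seed)
    factors = [0.1 * rng.standard_normal((n, rank)) for n in shape]
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        for mode in range(3):
            # Per-row normal equations built from observed entries only.
            grams = np.zeros((shape[mode], rank, rank))
            rhs = np.zeros((shape[mode], rank))
            for ind, v in zip(coords, vals):
                # Hadamard product of the other two modes' factor rows.
                h = np.ones(rank)
                for m in range(3):
                    if m != mode:
                        h = h * factors[m][ind[m]]
                grams[ind[mode]] += np.outer(h, h)
                rhs[ind[mode]] += v * h
            # Regularized least-squares solve for each row of this factor.
            for r in range(shape[mode]):
                factors[mode][r] = np.linalg.solve(grams[r] + ridge, rhs[r])
    return factors
```

The per-row Gram-matrix solves in the inner loop are the operations that a high-performance ALS formulation can batch into level-3 BLAS routines; SGD and CCD++ optimize the same regularized objective via per-entry gradient updates and per-coordinate closed-form updates, respectively, consistent with the algorithms named in the abstract.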
