Enabling Distributed-Memory Tensor Completion in Python using New Sparse Tensor Kernels

Tensor computations are increasingly prevalent in data science, but pose unique challenges for high-performance implementation. We provide novel algorithms and systems infrastructure which together enable the first high-level parallel implementations of three algorithms for the tensor completion problem: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We develop these methods using a new Python interface to the Cyclops tensor algebra library, which fully automates the management of distributed-memory parallelism and sparsity for NumPy-style operations on multidimensional arrays. To make tensor completion feasible for very sparse tensors, we introduce a new multi-tensor routine, TTTP, that is asymptotically more efficient than pairwise tensor contraction for key components of the tensor completion methods. In particular, we show how TTTP enables a novel tensor completion algorithm: ALS via conjugate gradient with implicit matrix-vector products. Further, we provide the first distributed tensor library with hypersparse matrix representations, by integrating new sequential and parallel routines into the Cyclops library. We present microbenchmarking results on the Stampede2 supercomputer to demonstrate the efficiency of this functionality. Finally, we study the performance of the tensor completion methods on a synthetic tensor with 10 billion nonzeros and on the Netflix dataset.
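To give a flavor of the NumPy-style distributed programming model the abstract refers to, the minimal sketch below builds a sparse order-3 tensor and expresses an MTTKRP as an einsum. It assumes the `ctf` Python module that wraps Cyclops; the specific calls (`fill_sp_random`, `ctf.TTTP`) reflect that interface as best understood here and may differ across library versions.

```python
import ctf  # Python interface to the Cyclops tensor algebra library

I, J, K, R = 1000, 1000, 1000, 16

# Sparse order-3 tensor; Cyclops manages the distributed layout and
# sparsity behind this NumPy-style handle.
T = ctf.tensor((I, J, K), sp=True)
T.fill_sp_random(-1., 1., 1.e-5)  # random values at a random 1e-5 fraction of entries

U = ctf.random.random((I, R))
V = ctf.random.random((J, R))
W = ctf.random.random((K, R))

# MTTKRP expressed as an einsum; the library selects a parallel sparse
# contraction schedule automatically.
M = ctf.einsum("ijk,jr,kr->ir", T, V, W)

# TTTP: sample the rank-R model at the nonzeros of T and scale by T's
# values, in O(nnz(T) * R) work (assumed call, introduced by this paper).
S = ctf.TTTP(T, [U, V, W])
```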
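Concretely, TTTP samples the rank-R CP model at the nonzero locations of the sparse tensor and scales by the tensor's values; the implicit ALS matrix-vector product then follows from a TTTP over the observation pattern plus an MTTKRP. The pure-NumPy sketch below pins down this arithmetic on a COO representation; the function names (`tttp_coo`, `als_cg_matvec`) and the COO layout are illustrative assumptions, not the paper's distributed implementation.

```python
import numpy as np

def tttp_coo(inds, vals, factors):
    """Reference semantics of TTTP on a COO sparse tensor.

    For a nonzero at multi-index (i, j, ...) with value v, returns
    v * sum_r U[i, r] * V[j, r] * ...: the Hadamard product of the sparse
    tensor with the rank-R CP reconstruction, evaluated only at the
    nonzeros, so the work is O(nnz * R) rather than dense in the tensor.
    """
    nnz, order = inds.shape
    rank = factors[0].shape[1]
    prod = np.ones((nnz, rank))
    for m in range(order):
        prod *= factors[m][inds[:, m], :]  # gather the factor row per nonzero
    return vals * prod.sum(axis=1)

def als_cg_matvec(inds, X, V, W, reg=1e-3):
    """Implicit normal-equations matvec for the mode-0 ALS subproblem.

    Equivalent to applying, row by row, the Gram matrices of the observed
    entries to X, but computed as a TTTP over the observation pattern
    followed by a scatter-add (an MTTKRP), again in O(nnz * R) work.
    """
    ones = np.ones(inds.shape[0])
    y = tttp_coo(inds, ones, [X, V, W])            # model inner products at nonzeros
    rows = V[inds[:, 1], :] * W[inds[:, 2], :]     # Khatri-Rao row per nonzero
    out = reg * X
    np.add.at(out, inds[:, 0], y[:, None] * rows)  # scatter-add into rows of out
    return out

# Tiny smoke test on a random 4x5x6 pattern with 20 nonzeros.
rng = np.random.default_rng(0)
dims, R, nnz = (4, 5, 6), 3, 20
inds = np.column_stack([rng.integers(0, d, nnz) for d in dims])
U, V, W = (rng.standard_normal((d, R)) for d in dims)
print(tttp_coo(inds, np.ones(nnz), [U, V, W]).shape)  # (20,)
print(als_cg_matvec(inds, U, V, W).shape)             # (4, 3)
```

In the distributed setting these dense gathers and scatter-adds are replaced by Cyclops' parallel sparse kernels; the sketch only fixes the semantics and the O(nnz * R) cost argument.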
