GigaTensor: scaling tensor analysis up by 100 times - algorithms and discoveries

Many real-world data are naturally modeled as tensors, i.e., multi-dimensional arrays. Examples include the (subject, verb, object) predicates in knowledge bases, hyperlinks and anchor texts in Web graphs, sensor streams (time, location, and type), social networks over time, and DBLP conference-author-keyword relations. Tensor decomposition is an important data mining tool with applications including clustering, trend detection, and anomaly detection. However, current tensor decomposition algorithms do not scale to tensors whose mode sizes reach the billions and whose nonzeros number in the hundreds of millions: the largest tensors analyzed in the literature have mode sizes in the thousands and only hundreds of thousands of nonzeros. Consider a knowledge base tensor with about 26 million noun-phrases. The intermediate data explosion problem, associated with naive implementations of tensor decomposition algorithms, would require materializing and storing a matrix whose largest dimension is ≈ 7 × 10^14; this amounts to roughly 10 Petabytes, or a few data centers' worth of storage, rendering the naive tensor analysis of this knowledge base practically impossible. In this paper, we propose GigaTensor, a scalable distributed algorithm for large-scale tensor decomposition. GigaTensor exploits the sparseness of real-world tensors and avoids the intermediate data explosion problem by carefully redesigning the tensor decomposition algorithm. Extensive experiments show that GigaTensor solves problems 100 times larger than existing methods can handle. Furthermore, we employ GigaTensor to analyze a very large real-world knowledge base tensor and present our astounding findings, which include the discovery of potential synonyms among millions of noun-phrases (e.g., the noun 'pollutant' and the noun-phrase 'greenhouse gases').
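
The "intermediate data explosion" figure quoted above can be sanity-checked with a short sketch. This is not the paper's code: it assumes the blow-up refers to the Khatri-Rao product materialized by a naive ALS update for CP/PARAFAC, and the rank R, the 8-byte-double storage, and the toy tensor sizes are all illustrative assumptions.

```python
# A back-of-the-envelope sketch, not the paper's implementation:
# (a) reproduces the order of magnitude of the intermediate data explosion,
#     assuming the culprit is the dense Khatri-Rao product of a naive ALS update;
# (b) illustrates a sparsity-aware alternative on a toy tensor.
import numpy as np

# (a) Naive intermediate: a dense (J*K) x R Khatri-Rao product.
J = K = 26_000_000            # ~26 million noun-phrases per mode (from the abstract)
rows = J * K                  # ~6.8e14 -- the "~7 x 10^14" dimension quoted above
col_pb = rows * 8 / 1e15      # petabytes for a single double-precision column (assumed)
print(f"Khatri-Rao rows: {rows:.1e}; ~{col_pb:.1f} PB per dense column")
# Already petabytes per column, i.e. the ~10 PB order of magnitude quoted above.

# (b) Sparsity-aware alternative: compute M = X_(1) (B kr C) straight from the
# nonzeros, never forming the (J*K) x R intermediate. Toy tensor, assumed sizes.
rng = np.random.default_rng(0)
I_t, J_t, K_t, nnz, R = 50, 40, 30, 200, 10
coords = np.column_stack([rng.integers(0, d, nnz) for d in (I_t, J_t, K_t)])
vals = rng.random(nnz)
B, C = rng.random((J_t, R)), rng.random((K_t, R))

M = np.zeros((I_t, R))
for (i, j, k), v in zip(coords, vals):
    M[i] += v * B[j] * C[k]   # memory stays O(nnz + (I+J+K)*R), not O(J*K*R)

# Sanity check against the dense formulation (feasible only on a toy tensor).
X = np.zeros((I_t, J_t, K_t))
np.add.at(X, (coords[:, 0], coords[:, 1], coords[:, 2]), vals)
khatri_rao = np.repeat(B, K_t, axis=0) * np.tile(C, (J_t, 1))   # (J_t*K_t) x R
assert np.allclose(M, X.reshape(I_t, -1) @ khatri_rao)
```

GigaTensor reorganizes this kind of nonzero-driven computation into distributed matrix operations (the paper builds on MapReduce); the loop above only conveys the memory argument, not the parallel formulation.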
