Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce

The Web abounds with dyadic data, and the volume grows every second. Previous work has repeatedly shown the usefulness of extracting the interaction structure inside dyadic data [21, 9, 8]. A commonly used tool for extracting this underlying structure is matrix factorization, whose fame was further boosted by the Netflix challenge [26]. When we tried to replicate that success on real-world Web dyadic data, we were seriously challenged by the scalability of the available tools. In this paper we therefore report our efforts to scale up the nonnegative matrix factorization (NMF) technique. We show that, by carefully partitioning the data and arranging the computations to maximize data locality and parallelism, a matrix of size tens of millions by hundreds of millions, with billions of nonzero cells, can be factorized within tens of hours. This result effectively assures practitioners of the scalability of NMF on Web-scale dyadic data.
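
To make the idea of partitioning the data for locality and parallelism concrete, the following is a minimal single-process sketch, not the implementation reported in the paper. It assumes the standard multiplicative update H <- H * (W^T A) / (W^T W H) of [29], which the abstract does not name, and it assumes that the sparse matrix A is partitioned by rows so that each row a_i is processed together with the matching row w_i of W. The function names (`map_row`, `reduce_sum`, `gram`, `update_H`) and the dictionary-based data layout are hypothetical, and the map and reduce steps are simulated in memory rather than run on a MapReduce cluster.

```python
from collections import defaultdict


def map_row(w_i, a_i):
    """Map step for one row: for each nonzero a_ij emit (j, a_ij * w_i)."""
    for j, a_ij in a_i.items():
        yield j, [a_ij * w for w in w_i]


def reduce_sum(pairs, k):
    """Reduce step: sum the k-vectors emitted under each key."""
    acc = defaultdict(lambda: [0.0] * k)
    for key, vec in pairs:
        row = acc[key]
        for t in range(k):
            row[t] += vec[t]
    return acc


def gram(W, k):
    """Accumulate the small k x k matrix W^T W one row of W at a time."""
    G = [[0.0] * k for _ in range(k)]
    for w_i in W:
        for s in range(k):
            for t in range(k):
                G[s][t] += w_i[s] * w_i[t]
    return G


def update_H(A_rows, W, H, k, eps=1e-9):
    """One multiplicative update of H, stored as {column j: k-vector}."""
    pairs = (kv for i, a_i in A_rows.items() for kv in map_row(W[i], a_i))
    WtA = reduce_sum(pairs, k)  # column j -> (W^T A)[:, j]
    G = gram(W, k)              # W^T W
    new_H = {}
    for j, h_j in H.items():
        num = WtA.get(j, [0.0] * k)
        den = [sum(G[s][t] * h_j[t] for t in range(k)) for s in range(k)]
        # eps (an assumption of this sketch) guards against division by zero.
        new_H[j] = [h_j[s] * num[s] / (den[s] + eps) for s in range(k)]
    return new_H


# Tiny illustrative input: a 2 x 2 sparse matrix stored as {row: {col: value}}.
A_rows = {0: {0: 1.0}, 1: {1: 2.0}}
W = [[1.0, 0.0], [0.0, 1.0]]
H = {0: [1.0, 1.0], 1: [1.0, 1.0]}
new_H = update_H(A_rows, W, H, k=2)
```

In this layout only the nonzero cells of A are ever touched, and the only quantity shared across all rows is the k x k matrix W^T W, which is small when the number of factors k is small. The update of W would follow the same pattern with the roles of rows and columns exchanged.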

[1] H. Sebastian Seung, et al. Learning the parts of objects by non-negative matrix factorization, 1999, Nature.

[2] Xiaolong Wang, et al. Sequence analysis Application of latent semantic analysis to protein remote homology detection, 2006.

[3] Patrik O. Hoyer, et al. Non-negative Matrix Factorization with Sparseness Constraints, 2004, J. Mach. Learn. Res.

[4] Inderjit S. Dhillon, et al. Co-clustering documents and words using bipartite spectral graph partitioning, 2001, KDD '01.

[5] Danny Ziyi Chen, et al. Efficient Geometric Algorithms on the EREW PRAM, 1995, IEEE Trans. Parallel Distributed Syst.

[6] D. A. Kenny, et al. Dyadic Data Analysis, 2006.

[7] Ricardo A. Baeza-Yates, et al. Query Recommendation Using Query Logs in Search Engines, 2004, EDBT Workshops.

[8] Susan T. Dumais, et al. Improving Web Search Ranking by Incorporating User Behavior Information, 2019, SIGIR Forum.

[9] T. Landauer, et al. Indexing by Latent Semantic Analysis, 1990.

[10] Inderjit S. Dhillon, et al. Fast Newton-type Methods for the Least Squares Nonnegative Matrix Approximation Problem, 2007, SDM.

[11] Chris H. Q. Ding, et al. Orthogonal nonnegative matrix t-factorizations for clustering, 2006, KDD '06.

[12] Stefan A. Robila, et al. A parallel unmixing algorithm for hyperspectral images, 2006, SPIE Optics East.

[13] Ji-Rong Wen, et al. Query clustering using user logs, 2002, TOIS.

[14] Yihong Gong, et al. Fast nonparametric matrix factorization for large-scale collaborative filtering, 2009, SIGIR.

[15] Hao Wang, et al. PSVM : Parallelizing Support Vector Machines on Distributed Computers, 2007.

[16] Edward Y. Chang, et al. Parallelizing Support Vector Machines on Distributed Computers, 2007, NIPS.

[17] Michael W. Berry, et al. Algorithms and applications for approximate nonnegative matrix factorization, 2007, Comput. Stat. Data Anal.

[18] Inderjit S. Dhillon, et al. Fast Projection‐Based Methods for the Least Squares Nonnegative Matrix Approximation Problem, 2008, Stat. Anal. Data Min.

[19] Nikos A. Vlassis, et al. Newscast EM, 2004, NIPS.

[20] Michele Colajanni, et al. PSBLAS: a library for parallel linear algebra computation on sparse matrices, 2000, TOMS.

[21] Thomas Hofmann, et al. Map-Reduce for Machine Learning on Multicore, 2007.

[22] Marcel Worring, et al. Learning tag relevance by neighbor voting for social image retrieval, 2008, MIR '08.

[23] Yehuda Koren, et al. Matrix Factorization Techniques for Recommender Systems, 2009, Computer.

[24] Nathan Srebro, et al. Fast maximum margin matrix factorization for collaborative prediction, 2005, ICML.

[25] Daniel Hanisch, et al. Co-clustering of biological networks and gene expression data, 2002, ISMB.

[26] Stan Z. Li, et al. Learning spatially localized, parts-based representation, 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[27] S. Sra. Nonnegative Matrix Approximation: Algorithms and Applications, 2006.

[28] Tie-Yan Liu, et al. BrowseRank: letting web users vote for page importance, 2008, SIGIR '08.

[29] H. Sebastian Seung, et al. Algorithms for Non-negative Matrix Factorization, 2000, NIPS.

[30] Edward Y. Chang, et al. PLDA: Parallel Latent Dirichlet Allocation for Large-Scale Applications, 2009, AAIM.

[31] Xin Liu, et al. Document clustering based on non-negative matrix factorization, 2003, SIGIR.

[32] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[33] Dan Klein, et al. Fully distributed EM for very large datasets, 2008, ICML '08.

[34] Peter S. Pacheco. Parallel programming with MPI, 1996.

[35] Khushboo Kanjani. Parallel Non Negative Matrix Factorization for Document Clustering, 2007.

[36] Geoffrey J. Gordon, et al. A Unified View of Matrix Factorization Models, 2008, ECML/PKDD.

[37] John F. Canny, et al. Large-scale behavioral targeting, 2009, KDD.

[38] Deepak Agarwal, et al. Predictive discrete latent factor models for large scale dyadic data, 2007, KDD '07.

[39] Ole Winther, et al. Bayesian Non-negative Matrix Factorization, 2009, ICA.

[40] Michael W. Berry, et al. Document clustering using nonnegative matrix factorization, 2006, Inf. Process. Manag.

[41] Thomas Hofmann, et al. Learning from Dyadic Data, 1998, NIPS.

[42] Ramesh Nallapati, et al. Parallelized Variational EM for Latent Dirichlet Allocation: An Experimental Evaluation of Speed and Scalability, 2007, Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007).

[43] Edward Y. Chang, et al. Collaborative filtering for orkut communities: discovery of user latent behavior, 2009, WWW '09.

[44] Abhinandan Das, et al. Google news personalization: scalable online collaborative filtering, 2007, WWW '07.