PLDA: Parallel Latent Dirichlet Allocation for Large-Scale Applications

This paper presents PLDA, our parallel implementation of Latent Dirichlet Allocation on MPI and MapReduce. PLDA smooths out storage and computation bottlenecks and provides fault recovery for lengthy distributed computations. We show that PLDA can be applied to large, real-world applications and achieves good scalability. We have released MPI-PLDA to open source at http://code.google.com/p/plda under the Apache License.

[1]  Peter W. Foltz,et al.  An introduction to latent semantic analysis , 1998 .

[2]  David Cohn,et al.  Learning to Probabilistically Identify Authoritative Documents , 2000, ICML.

[3]  David M. Pennock,et al.  Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments , 2001, UAI.

[4]  Tom Minka,et al.  Expectation-Propogation for the Generative Aspect Model , 2002, UAI.

[5]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[6]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[7]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[8]  Rajeev Thakur,et al.  Optimization of Collective Communication Operations in MPICH , 2005, Int. J. High Perform. Comput. Appl..

[9]  Kunle Olukotun,et al.  Map-Reduce for Machine Learning on Multicore , 2006, NIPS.

[10]  Hao Wang,et al.  PSVM : Parallelizing Support Vector Machines on Distributed Computers , 2007 .

[11]  Edward Y. Chang,et al.  Parallelizing Support Vector Machines on Distributed Computers , 2007, NIPS.

[12]  Ricardo da Silva Torres,et al.  Diagnosing Similarity of Oscillation Trends in Time Series , 2007 .

[13]  William W. Cohen,et al.  Parallelized Variational EM for Latent Dirichlet Allocation: An Experimental Evaluation of Speed and Scalability , 2007, Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007).

[14]  Fei-Fei Li,et al.  OPTIMOL: Automatic Online Picture Collection via Incremental Model Learning , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Max Welling,et al.  Distributed Inference for Latent Dirichlet Allocation , 2007, NIPS.

[16]  Max Welling,et al.  Asynchronous Distributed Learning of Topic Models , 2008, NIPS.

[17]  Pietro Perona,et al.  Memory bounded inference in topic models , 2008, ICML '08.

[18]  James Demmel,et al.  Avoiding communication in sparse matrix computations , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[19]  Max Welling,et al.  Fast collapsed gibbs sampling for latent dirichlet allocation , 2008, KDD.

[20]  Patrick F. Reidy An Introduction to Latent Semantic Analysis , 2009 .

[21]  Edward Y. Chang,et al.  Collaborative filtering for orkut communities: discovery of user latent behavior , 2009, WWW '09.