Paper information - Distributed Inference for Latent Dirichlet Allocation

Distributed Inference for Latent Dirichlet Allocation

We investigate the problem of learning a widely used latent-variable model, Latent Dirichlet Allocation (LDA) or the "topic" model, using distributed computation, where each of P processors sees only 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates; it is simple to implement and can be viewed as an approximation to a single-processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across P processors; it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora, we show that distributed learning works very well for LDA models: perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors, and speedup experiments on learning topics in a 100-million-word corpus using 16 processors.
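The first scheme (local Gibbs sampling with periodic updates) can be illustrated with a small sketch. The abstract does not spell out the update rule, so the code below assumes one plausible choice: each processor sweeps its own documents against a private copy of the word-topic counts, and the periodic update adds every processor's count changes back into the global table. The function names (`init_counts`, `local_sweep`, `distributed_lda`), the hyperparameter values, and the sequential loop standing in for P real processors are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def init_counts(docs, K, V, rng):
    """Randomly assign a topic to every token and build the count tables."""
    z = [rng.integers(K, size=len(d)) for d in docs]
    n_dk = np.zeros((len(docs), K), dtype=np.int64)  # document-topic counts
    n_wk = np.zeros((V, K), dtype=np.int64)          # word-topic counts
    for d, (words, topics) in enumerate(zip(docs, z)):
        for w, k in zip(words, topics):
            n_dk[d, k] += 1
            n_wk[w, k] += 1
    return z, n_dk, n_wk

def local_sweep(docs, z, n_dk, n_wk, doc_ids, alpha, beta, rng):
    """One collapsed Gibbs sweep over the documents owned by one processor.

    n_wk is that processor's private copy and is modified in place.
    """
    V, K = n_wk.shape
    n_k = n_wk.sum(axis=0)
    for d in doc_ids:
        for i, w in enumerate(docs[d]):
            k = z[d][i]
            # Remove the token's current assignment from the counts.
            n_dk[d, k] -= 1
            n_wk[w, k] -= 1
            n_k[k] -= 1
            # Collapsed Gibbs conditional for this token's topic.
            p = (n_dk[d] + alpha) * (n_wk[w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            n_dk[d, k] += 1
            n_wk[w, k] += 1
            n_k[k] += 1

def distributed_lda(docs, K, V, P, iters=20, alpha=0.1, beta=0.01, seed=0):
    """Hypothetical driver: P simulated processors, one merge per sweep."""
    rng = np.random.default_rng(seed)
    z, n_dk, n_wk = init_counts(docs, K, V, rng)
    shards = np.array_split(np.arange(len(docs)), P)  # 1/P of the docs each
    for _ in range(iters):
        local_copies = []
        for doc_ids in shards:  # would run in parallel on real processors
            local = n_wk.copy()
            local_sweep(docs, z, n_dk, local, doc_ids, alpha, beta, rng)
            local_copies.append(local)
        # Assumed periodic update: add each processor's count changes
        # to the global word-topic table.
        n_wk = n_wk + sum(local - n_wk for local in local_copies)
    return z, n_dk, n_wk

# Toy corpus: each document is a list of word ids in a vocabulary of size 5.
docs = [[0, 1, 2, 1], [2, 3, 3], [0, 0, 4, 1], [4, 3, 2], [1, 1, 0]]
z, n_dk, n_wk = distributed_lda(docs, K=3, V=5, P=2, iters=5)
```

Because each processor changes only the counts of its own tokens, the merged table stays consistent with the topic assignments after every update. The approximation lies in the sweep itself: during a sweep, each processor samples against counts from the other processors that are one update out of date.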
