Bounding Sample Errors in Approximate Distributed Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) is a popular algorithm for discovering structure in large collections of text or other data. Although its complexity is linear in the data size, its use on increasingly massive collections has created considerable interest in parallel implementations. “Approximate distributed” LDA, or AD-LDA, approximates the popular collapsed Gibbs sampling algorithm for LDA models while running on a distributed architecture. Although this algorithm often appears to perform well in practice, its quality is not well understood or easily assessed. In this work, we provide some theoretical justification of the algorithm, and modify AD-LDA to track an error bound on its performance. Specifically, we upper-bound the probability of making a sampling error at each step of the algorithm (compared to an exact, sequential Gibbs sampler), given the samples drawn thus far. We show empirically that our bound is sufficiently tight to give a meaningful and intuitive measure of approximation error in AD-LDA, allowing the user to understand the trade-off between accuracy and efficiency.

Citation: UCI ICS Technical Report #09-06. Original date: October 2009; last updated: October 27, 2009.
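To make the setting concrete, the sketch below illustrates (it is not the authors' code) a collapsed Gibbs sweep for LDA and an AD-LDA-style iteration in which documents are partitioned across processors, each processor samples against a stale copy of the global topic-word counts, and the copies are merged afterwards. The function names, data layout, and hyperparameters (alpha, beta, number of topics K) are illustrative assumptions, and the "processors" are simulated serially.

```python
# Minimal sketch of collapsed Gibbs sampling for LDA and an AD-LDA-style
# iteration (illustrative only; names and hyperparameters are assumptions).
import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta):
    """One collapsed Gibbs sweep over `docs`, a list of (doc_id, word_ids)."""
    K, V = n_kw.shape
    for d, doc in docs:
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Remove this token's current assignment from the counts.
            n_dk[d, k_old] -= 1
            n_kw[k_old, w] -= 1
            n_k[k_old] -= 1
            # Collapsed conditional: p(k) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ).
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k_new = np.random.choice(K, p=p / p.sum())
            # Re-add the token under its newly sampled topic.
            z[d][i] = k_new
            n_dk[d, k_new] += 1
            n_kw[k_new, w] += 1
            n_k[k_new] += 1

def adlda_iteration(partitions, z, n_dk, n_kw, n_k, alpha, beta):
    """One AD-LDA-style iteration: independent local sweeps, then a count merge."""
    deltas = []
    for part in partitions:  # each partition plays the role of one processor
        # Each "processor" sweeps against a stale copy of the global counts;
        # per-document counts n_dk are safe to update in place because the
        # documents are disjoint across partitions.
        local_kw, local_k = n_kw.copy(), n_k.copy()
        gibbs_sweep(part, z, n_dk, local_kw, local_k, alpha, beta)
        deltas.append((local_kw - n_kw, local_k - n_k))
    # Synchronization step: fold every processor's count changes back in.
    for d_kw, d_k in deltas:
        n_kw += d_kw
        n_k += d_k
```

Because each processor samples from stale global counts, the merged state can differ from what an exact sequential Gibbs sampler would have produced; the per-step probability of such a discrepancy is the quantity the paper's bound tracks.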
