Topic models provide a powerful tool for analyzing large text collections by representing high-dimensional data in a low-dimensional subspace. Fitting a topic model to a set of training documents requires approximate inference techniques that are computationally expensive. With today's large-scale, constantly expanding document collections, it is useful to be able to infer topic distributions for new documents without retraining the model. In this paper, we empirically evaluate the performance of several methods for topic inference in previously unseen documents, including methods based on Gibbs sampling, variational inference, and a new method inspired by text classification. The classification-based inference method produces results similar to those of the iterative inference methods, but requires only a single matrix multiplication. In addition to these inference methods, we present SparseLDA, an algorithm and data structure for efficiently evaluating Gibbs sampling distributions. Empirical results indicate that SparseLDA can be approximately 20 times faster than traditional LDA and can provide twice the speedup of previously published fast sampling methods, while also using substantially less memory.
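To make the single-multiplication claim concrete, here is a minimal sketch of classification-style topic inference for an unseen document. It assumes a learned topic-word weight matrix and a per-document term-count vector; the names (`classify_topics`, `topic_word_weights`) and the softmax normalization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def classify_topics(doc_counts, topic_word_weights):
    """Infer a topic distribution for one document with a single
    matrix-vector product over learned per-topic word weights.

    doc_counts:         (V,) vector of term counts for the document.
    topic_word_weights: (T, V) matrix of learned word weights per topic
                        (hypothetical; obtained from a trained model).
    """
    scores = topic_word_weights @ doc_counts   # the single multiplication
    scores = np.exp(scores - scores.max())     # numerically stable softmax
    return scores / scores.sum()

# Toy usage: 3 topics over a 5-word vocabulary.
W = np.random.randn(3, 5)
x = np.array([2.0, 0.0, 1.0, 0.0, 3.0])
print(classify_topics(x, W))                   # sums to 1.0
```

Because inference reduces to one matrix-vector product per document, this method avoids the per-document iteration that Gibbs sampling and variational inference require.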
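SparseLDA's speedup comes from decomposing the collapsed Gibbs sampling mass p(z = t) ∝ (α + n_{t|d})(β + n_{w|t}) / (βV + n_t) into three buckets, of which only the sparse document-topic and topic-word buckets are usually traversed per token. The sketch below illustrates that bucketed-sampling idea under a symmetric prior; it is a simplification for exposition, and the incremental caching and packed data structures that deliver the full speedup in SparseLDA are omitted.

```python
import random

def sample_topic(word, doc_topics, word_topics, topic_totals,
                 alpha, beta, vocab_size):
    """Bucketed Gibbs sampling sketch (SparseLDA-style decomposition).

    doc_topics:   sparse dict {topic: count} for the current document.
    word_topics:  dict {word: {topic: count}} of topic-word counts.
    topic_totals: list of total token counts per topic (length T).
    """
    betaV = beta * vocab_size
    T = len(topic_totals)
    # s: smoothing-only bucket (dense; in SparseLDA it is cached and
    # updated incrementally rather than recomputed per token).
    s = sum(alpha * beta / (betaV + topic_totals[t]) for t in range(T))
    # r: document-topic bucket, nonzero only for topics in this document.
    r = sum(c * beta / (betaV + topic_totals[t])
            for t, c in doc_topics.items())
    # q: topic-word bucket, nonzero only for topics containing this word.
    q_terms = [(t, (alpha + doc_topics.get(t, 0)) * c
                / (betaV + topic_totals[t]))
               for t, c in word_topics.get(word, {}).items()]
    q = sum(m for _, m in q_terms)

    u = random.uniform(0.0, s + r + q)
    if u < q:                      # most mass is usually here: walk only
        for t, m in q_terms:       # the word's nonzero topics
            u -= m
            if u <= 0.0:
                return t
    elif u < q + r:                # walk only the document's topics
        u -= q
        for t, c in doc_topics.items():
            u -= c * beta / (betaV + topic_totals[t])
            if u <= 0.0:
                return t
    else:                          # rare: fall through to the dense bucket
        u -= q + r
        for t in range(T):
            u -= alpha * beta / (betaV + topic_totals[t])
            if u <= 0.0:
                return t
    return T - 1                   # guard against floating-point slack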