Online Latent Dirichlet Allocation with Infinite Vocabulary

Topic models based on latent Dirichlet allocation (LDA) assume a predefined vocabulary. This assumption is reasonable in batch settings but not in streaming and online settings, where previously unseen words can appear. To address this gap, we extend LDA by drawing topics from a Dirichlet process whose base distribution is a distribution over all strings, rather than from a finite Dirichlet. We develop inference using online variational inference and, so that only a finite number of words is considered for each topic, propose heuristics to dynamically order, expand, and contract the set of words in our vocabulary. We show that our model can successfully incorporate new words and that it performs better than topic models with finite vocabularies in evaluations of topic quality and classification performance.
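
As a rough illustration of what drawing a topic from a Dirichlet process over strings can look like, the following minimal sketch uses a truncated stick-breaking construction. It is not the inference procedure of this paper. The toy base distribution (uniform lowercase characters with a geometric length), the function names `sample_string` and `truncated_dp_topic`, and the parameters `stop_prob` and `truncation` are all assumptions made for the example.

```python
import random
import string


def sample_string(rng, stop_prob=0.3, alphabet=string.ascii_lowercase):
    """Draw one string from a toy base distribution over all strings.

    Assumption for illustration only: characters are uniform over the
    alphabet and the length is geometric with parameter stop_prob.
    """
    chars = [rng.choice(alphabet)]
    while rng.random() > stop_prob:
        chars.append(rng.choice(alphabet))
    return "".join(chars)


def truncated_dp_topic(alpha, truncation, seed=0):
    """Approximate one draw from DP(alpha, base) by stick breaking.

    Returns (topic, remaining): topic maps each sampled word to its
    probability mass, and remaining is the mass left on the infinitely
    many words that the truncation never reached.
    """
    rng = random.Random(seed)
    topic = {}
    remaining = 1.0
    for _ in range(truncation):
        # Break off a Beta(1, alpha) fraction of the remaining stick.
        fraction = rng.betavariate(1.0, alpha)
        word = sample_string(rng)
        # Repeated atoms (the same string drawn twice) pool their mass.
        topic[word] = topic.get(word, 0.0) + remaining * fraction
        remaining *= 1.0 - fraction
    return topic, remaining


if __name__ == "__main__":
    topic, remaining = truncated_dp_topic(alpha=5.0, truncation=50)
    top_words = sorted(topic.items(), key=lambda kv: -kv[1])[:5]
    for word, mass in top_words:
        print(f"{word}\t{mass:.4f}")
    print(f"mass outside the truncation: {remaining:.6f}")
```

The `remaining` value shows why a finite working vocabulary is only an approximation: some probability mass always stays on words outside the truncated set, which is the situation the ordering, expansion, and contraction heuristics are meant to manage.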
