Probabilistic topic models

In this article, we review probabilistic topic models: graphical models that can be used to summarize a large collection of documents with a smaller number of distributions over words. Those distributions are called "topics" because, when fit to data, they capture the salient themes that run through the collection. We describe both finite-dimensional parametric topic models and their Bayesian nonparametric counterparts, which are based on the hierarchical Dirichlet process (HDP). We discuss two extensions of topic models to time-series data: one that lets the topics slowly change over time and one that lets the assumed prevalence of the topics change. Finally, we illustrate the application of topic models to nontext data, summarizing some recent research results in image analysis.
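To make "distributions over words" concrete, the following is a minimal sketch of the generative process behind a finite-dimensional topic model in the style of latent Dirichlet allocation. It is an illustration under stated assumptions, not the article's own code: the function names, the toy vocabulary, and the hyperparameter values `alpha` and `eta` are all hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_len, vocab, num_topics,
                    alpha=0.5, eta=0.5, seed=0):
    """Simulate documents from an LDA-style model (hypothetical helper).

    alpha and eta are assumed symmetric Dirichlet hyperparameters for the
    per-document topic proportions and the per-topic word distributions.
    """
    rng = random.Random(seed)
    # Each topic is a probability distribution over the vocabulary.
    topics = [sample_dirichlet([eta] * len(vocab), rng)
              for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Each document mixes the shared topics in its own proportions.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            # Pick a topic for this word, then a word from that topic.
            z = rng.choices(range(num_topics), weights=theta)[0]
            words.append(rng.choices(vocab, weights=topics[z])[0])
        docs.append(words)
    return topics, docs


if __name__ == "__main__":
    toy_vocab = ["gene", "dna", "cell", "pixel", "image", "scene"]
    topics, docs = generate_corpus(num_docs=3, doc_len=8,
                                   vocab=toy_vocab, num_topics=2)
    for doc in docs:
        print(" ".join(doc))
```

Fitting a topic model runs this process in reverse: given only the observed words, posterior inference recovers plausible topics and per-document proportions. The nonparametric (HDP) variants mentioned above would, roughly speaking, replace the fixed `num_topics` with a number of topics inferred from the data.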
