Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis

Topic models such as latent Dirichlet allocation (LDA) have become a staple of the machine learning toolbox. They have been applied to a wide variety of data sets, contexts, and tasks with varying degrees of success. However, to date there is almost no formal theory explicating LDA's behavior, and despite its familiarity there is little systematic analysis of, or guidance on, the properties of the data that affect the model's inferential performance. This paper addresses that gap by providing a systematic analysis of the factors that characterize LDA's performance. We present theorems elucidating the posterior contraction rates of the topics as the amount of data increases, along with a thorough supporting empirical study on synthetic and real data sets, including news articles, web-based articles, and tweets. Based on these results, we provide practical guidance on how to identify data sets suitable for topic models and how to specify particular model parameters.
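The contraction phenomenon the abstract describes can be illustrated with a small simulation (this is a sketch, not the paper's actual experiments): sample documents from a known topic matrix, fit scikit-learn's `LatentDirichletAllocation`, and watch the estimation error as the corpus grows. Since topic labels are arbitrary, estimated topics are aligned to the truth over all permutations before scoring. The vocabulary size, topic count, and Dirichlet concentration below are illustrative choices, not values from the paper.

```python
# Sketch: posterior contraction of LDA topics as the corpus grows.
from itertools import permutations

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
K, V = 3, 30  # number of topics, vocabulary size

# Three well-separated "true" topics, each concentrated on 10 words.
true_topics = np.full((K, V), 1e-3)
for k in range(K):
    true_topics[k, k * 10:(k + 1) * 10] = 1.0
true_topics /= true_topics.sum(axis=1, keepdims=True)

def make_corpus(n_docs, doc_len=200):
    """Sample a document-term count matrix from the LDA generative process."""
    X = np.zeros((n_docs, V), dtype=int)
    for d in range(n_docs):
        theta = rng.dirichlet(np.full(K, 0.1))    # per-document topic weights
        z = rng.choice(K, size=doc_len, p=theta)  # topic assignment per token
        for k in range(K):
            n_k = int((z == k).sum())
            if n_k:
                X[d] += rng.multinomial(n_k, true_topics[k])
    return X

def topic_error(n_docs):
    """Total L1 error between true and estimated topics, best label matching."""
    lda = LatentDirichletAllocation(n_components=K, random_state=0)
    lda.fit(make_corpus(n_docs))
    est = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    return min(
        sum(np.abs(est[p[k]] - true_topics[k]).sum() for k in range(K))
        for p in permutations(range(K))
    )

for n in (20, 200):
    print(f"docs={n:4d}  L1 topic error={topic_error(n):.3f}")
```

With well-separated topics, the error for the larger corpus is typically much smaller, mirroring the contraction-rate behavior the theorems formalize; with overlapping topics or very short documents, contraction is slower, which is exactly the kind of data property the paper's guidance concerns.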
