Topic Model Diagnostics: Assessing Domain Relevance via Topical Alignment

The use of topic models to analyze domain-specific texts often requires manual validation of the latent topics to ensure that they are meaningful. We introduce a framework to support such a large-scale assessment of topical relevance. We measure the correspondence between a set of latent topics and a set of reference concepts to quantify four types of topical misalignment: junk, fused, missing, and repeated topics. Our analysis compares 10,000 topic model variants to 200 expert-provided domain concepts, and demonstrates how our framework can inform choices of model parameters, inference algorithms, and intrinsic measures of topical quality.

[1]  Jeffrey Heer,et al.  Termite: visualization techniques for assessing textual topic models , 2012, AVI.

[2]  David Buttler,et al.  Exploring Topic Coherence over Many Models and Many Topics , 2012, EMNLP.

[3]  Thomas L. Griffiths,et al.  Probabilistic author-topic models for information discovery , 2004, KDD.

[4]  Ruslan Salakhutdinov,et al.  Evaluation methods for topic models , 2009, ICML '09.

[5]  Andrew McCallum,et al.  Rethinking LDA: Why Priors Matter , 2009, NIPS.

[6]  Timothy Baldwin,et al.  Automatic Evaluation of Topic Coherence , 2010, NAACL.

[7]  W. Bruce Croft,et al.  LDA-based document models for ad-hoc retrieval , 2006, SIGIR.

[8]  Daniel Barbará,et al.  Topic Significance Ranking of LDA Generative Models , 2009, ECML/PKDD.

[9]  Thomas L. Griffiths,et al.  Hierarchical Topic Models and the Nested Chinese Restaurant Process , 2003, NIPS.

[10]  Padhraic Smyth,et al.  Analyzing Entities and Topics in News Articles Using Statistical Topic Models , 2006, ISI.

[11]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[12]  Peter Pirolli,et al.  Modeling Information Scent: A Comparison of LSA, PMI and GLSA Similarity Measures on Common Tests and Corpora , 2007, RIAO.

[13]  Andrew McCallum,et al.  Database of NIH grants using machine-learned categories and graphical clustering , 2011, Nature Methods.

[14]  Timothy Baldwin,et al.  Evaluating topic models for digital libraries , 2010, JCDL '10.

[15]  John D. Lafferty,et al.  Dynamic topic models , 2006, ICML.

[16]  David M. Blei,et al.  Visualizing Topic Models , 2012, ICWSM.

[17]  Andrew McCallum,et al.  Topics over time: a non-Markov continuous-time model of topical trends , 2006, KDD '06.

[18]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[19]  Quentin Pleple,et al.  Interactive Topic Modeling , 2013 .

[20]  Ivan Titov,et al.  A Joint Model of Text and Aspect Ratings for Sentiment Summarization , 2008, ACL.

[21]  Wayne D. Gray,et al.  Basic objects in natural categories , 1976, Cognitive Psychology.

[22]  Stefan Trausan-Matu,et al.  Improving Topic Evaluation Using Conceptual Knowledge , 2011, IJCAI.

[23]  Daniel Jurafsky,et al.  Studying the History of Ideas Using Topic Models , 2008, EMNLP.

[24]  Chong Wang,et al.  Reading Tea Leaves: How Humans Interpret Topic Models , 2009, NIPS.

[25]  Jeffrey Heer,et al.  Interpretation and trust: designing model-driven visualizations for text analysis , 2012, CHI.

[26]  Susan T. Dumais,et al.  Characterizing Microblogs with Topic Models , 2010, ICWSM.

[27]  Padhraic Smyth,et al.  TopicNets: Visual Analysis of Large Text Corpora with Topic Modeling , 2012, TIST.

[28]  Kathy E. Johnson,et al.  Effects of varying levels of expertise on the basic level of categorization. , 1997, Journal of experimental psychology. General.

[29]  Andrew McCallum,et al.  Optimizing Semantic Coherence in Topic Models , 2011, EMNLP.

[30]  Thomas L. Griffiths,et al.  The Author-Topic Model for Authors and Documents , 2004, UAI.

[31]  Ramesh Nallapati,et al.  Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora , 2009, EMNLP.

[32]  Susan T. Dumais,et al.  Partially labeled topic models for interpretable text mining , 2011, KDD.