Selecting Priors for Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) has gained much attention from researchers and is increasingly applied to uncover latent semantic structures in a variety of corpora. However, nearly all researchers use symmetric Dirichlet priors, often unaware of their practical implications. This research is the first to examine how symmetric and asymmetric Dirichlet priors affect topic coherence and human topic ranking when uncovering latent semantic structures in scientific research articles. More specifically, we examine the practical effects of several classes of Dirichlet priors on 2000 LDA models created from abstract and full-text research articles. Our results show that for full-text data, symmetric or asymmetric priors on the document-topic distribution or the topic-word distribution have little effect on topic coherence scores and human topic ranking. In contrast, for abstract data, an asymmetric prior on the document-topic distribution yields a significant increase in topic coherence scores and improved human topic ranking compared to a symmetric prior. Symmetric or asymmetric priors on the topic-word distribution show no real benefit for either abstract or full-text data.
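To make the distinction concrete, the sketch below draws document-topic distributions from a symmetric and an asymmetric Dirichlet prior. The concentration values and the asymmetric weighting scheme (mass decreasing with topic index, as popularized in topic-modeling toolkits) are illustrative assumptions, not the exact settings used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of topics (hypothetical)

# Symmetric prior: every topic gets the same concentration parameter,
# so no topic is favored a priori.
alpha_sym = np.full(K, 0.1)

# Asymmetric prior: concentration decreases with topic index, so a few
# topics are expected to absorb most of each document's mass.
# (Illustrative scheme: alpha_k = 1 / (k + sqrt(K)).)
alpha_asym = 1.0 / (np.arange(K) + np.sqrt(K))

# Each draw is one document's topic distribution theta.
theta_sym = rng.dirichlet(alpha_sym)
theta_asym = rng.dirichlet(alpha_asym)

# Both draws are valid probability distributions over the K topics.
print(round(theta_sym.sum(), 6))
print(round(theta_asym.sum(), 6))
```

Under the asymmetric prior, the expected topic proportions are proportional to the alpha vector itself, so early topics dominate on average; under the symmetric prior, all topics have the same expected proportion 1/K.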
