Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation

This paper assesses the topic coherence and human topic ranking of latent topics uncovered from scientific publications when the topic model latent Dirichlet allocation (LDA) is applied to abstract and full-text data. The coherence of a topic, used as a proxy for topic quality, is based on the distributional hypothesis, which states that words with similar meanings tend to co-occur within similar contexts. Although LDA has gained much attention from machine-learning researchers, most notably through its adaptations and extensions, little is known about the effects of different types of textual data on the generated topics. Our research is the first to explore these practical effects and shows that document frequency, document word length, and vocabulary size have mixed practical effects on the topic coherence and human topic ranking of LDA topics. We furthermore show that large document collections are less affected by incorrect or noise terms appearing in the topic-word distributions, causing their topics to be more coherent and ranked higher. Differences between abstract and full-text data are most apparent within small document collections, with as many as 90% high-quality topics for full-text data compared to 50% for abstract data.
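To make the coherence notion concrete, the sketch below computes a document-level NPMI (normalized pointwise mutual information) coherence for a topic's top words, one of the standard coherence variants built on the distributional hypothesis. The function name, the toy corpus, and the choice of document-level co-occurrence are illustrative assumptions, not taken from the paper.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    A pair of words "co-occurs" here if both appear in the same
    document; probabilities are estimated as document frequencies.
    NPMI lies in [-1, 1]: 1 means the words always appear together,
    -1 means they never do.
    """
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]

    def p(*words):
        # Fraction of documents containing all given words.
        return sum(all(w in ds for w in words) for ds in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = p(w1), p(w2), p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # never co-occur: minimal NPMI
            continue
        pmi = math.log(p12 / (p1 * p2))
        scores.append(pmi / -math.log(p12))  # normalize by -log p(w1, w2)
    return sum(scores) / len(scores)
```

A full evaluation would average such per-topic scores across all topics of a model; libraries such as gensim offer comparable measures out of the box, though sliding-window rather than document-level co-occurrence is also common.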
