Document-based topic coherence measures for news media text

Abstract There is a rising need for automated analysis of news text, and topic models have proven to be useful tools for this task. However, because the quality of the topics induced by topic models varies greatly, much research effort has been devoted to their automated evaluation. Recent research has focused on topic coherence as a measure of a topic’s quality. Existing topic coherence measures rely on the semantic similarity of topic words, which makes them ill-suited to detecting the coherence of transient topics whose topic words are semantically unrelated. Such topics abound in news media texts. In this paper, we introduce the notion of document-based topic coherence and propose novel topic coherence measures that estimate coherence from topic documents rather than topic words. We evaluate the proposed measures on two datasets containing topics manually labeled for document-based coherence, on which they outperform a strong baseline as well as word-based coherence measures. We also demonstrate the usefulness of document-based coherence measures for automated topic discovery from news media texts.
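
To make the idea of scoring a topic by its documents rather than its words concrete, the following is a minimal sketch under loudly stated assumptions. It is not one of the measures proposed in the paper: the function `document_based_coherence`, the whitespace tokenizer, and the choice of mean pairwise cosine similarity over bag-of-words vectors are all hypothetical stand-ins chosen for illustration only.

```python
import math
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Lowercased, whitespace-tokenized term counts (a deliberately crude vectorizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def document_based_coherence(topic_documents):
    """Hypothetical stand-in measure: mean pairwise cosine similarity
    among the documents most strongly associated with a topic."""
    vectors = [bag_of_words(doc) for doc in topic_documents]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        # Fewer than two documents: no pairwise evidence of coherence.
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy example (invented data): documents about one event vs. unrelated documents.
coherent = [
    "flood hits coastal town",
    "coastal town flood damage",
    "flood damage coastal rescue",
]
mixed = [
    "flood hits coastal town",
    "stock markets rally today",
    "football cup final tonight",
]
print(document_based_coherence(coherent) > document_based_coherence(mixed))
```

Under this sketch, a topic whose top documents report the same event scores higher than one whose top documents are unrelated, regardless of how semantically related the topic's top words are. A realistic implementation would likely swap the crude count vectors for a better document representation.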
