Keyword Assisted Topic Models

For a long time, many social scientists have conducted content analysis by using their substantive knowledge and manually coding documents. In recent years, however, fully automated content analysis based on probabilistic topic models has become increasingly popular because of its scalability. Unfortunately, applied researchers find that these models often fail to yield topics of substantive interest: they inadvertently create multiple topics with similar content and combine different themes into a single topic. In this paper, we empirically demonstrate that providing topic models with a small number of keywords can substantially improve their performance. The proposed keyword assisted topic model (keyATM) offers an important advantage: specifying keywords requires researchers to label topics before fitting a model to the data. This contrasts with the widespread practice of post-hoc topic interpretation and adjustment, which compromises the objectivity of empirical findings. In our applications, we find that keyATM provides more interpretable results, has better document classification performance, and is less sensitive to the number of topics than standard topic models. Finally, we show that keyATM can also incorporate covariates and model time trends. An open-source software package is available for implementing the proposed methodology.
