Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts

Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have hindered their use in political science research. Here lies the promise of automated text analysis: it substantially reduces the costs of analyzing large collections of text. We provide a guide to this exciting new area of research and show how, in many instances, the methods have already fulfilled part of their promise. But there are pitfalls to using automated methods: they are no substitute for careful thought and close reading, and they require extensive, problem-specific validation. We survey a wide range of new methods, provide guidance on how to validate the output of the models, and clarify misconceptions and errors in the literature. To conclude, we argue that for automated text methods to become a standard tool for political scientists, methodologists must contribute both new methods and new ways of validating them.
