How to make causal inferences using texts

New text-as-data techniques offer great promise: the ability to inductively discover, from large collections of text, measures that are useful for testing social science theories. We introduce a conceptual framework for making causal inferences with discovered measures as a treatment or outcome. Our framework enables researchers to discover high-dimensional textual interventions and to estimate the ways that observed treatments affect text-based outcomes. We argue that nearly all text-based causal inferences depend on a latent representation of the text, and we provide a framework for learning that representation. But estimating this latent representation, we show, creates new risks: we may introduce an identification problem or overfit. To address these risks, we describe a split-sample framework and apply it to estimate causal effects from an experiment on immigration attitudes and a study of bureaucratic response. Our work provides a rigorous foundation for text-based causal inferences.
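The split-sample idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a binary treatment and a text outcome, and it replaces the latent-representation model with a deliberately simple toy: the "discovered measure" is the single word whose document frequency differs most between treatment arms. All function names here (`stratified_split`, `discover_measure`, `apply_measure`, `difference_in_means`, `split_sample_effect`) are hypothetical. The point it shows is the separation of roles: the measure is chosen on one half of the data only, then held fixed while the effect is estimated on the other half.

```python
import random
from statistics import mean


def stratified_split(treated, seed=0):
    """Randomly halve the sample within each treatment arm.

    Returns (discovery_idx, estimation_idx). Splitting within arms keeps
    both arms present in both halves (needs at least 2 units per arm).
    """
    rng = random.Random(seed)
    discovery, estimation = [], []
    for arm in (True, False):
        idx = [i for i, t in enumerate(treated) if bool(t) == arm]
        rng.shuffle(idx)
        cut = len(idx) // 2
        discovery.extend(idx[:cut])
        estimation.extend(idx[cut:])
    return sorted(discovery), sorted(estimation)


def tokens(doc):
    """Lowercased whitespace tokens of a document, as a set."""
    return set(doc.lower().split())


def discover_measure(docs, treated):
    """Toy discovery step (an assumption, standing in for a topic model):
    pick the word whose document frequency differs most between arms."""
    arm1 = [tokens(d) for d, t in zip(docs, treated) if t]
    arm0 = [tokens(d) for d, t in zip(docs, treated) if not t]
    vocab = sorted(set().union(*arm1, *arm0))

    def gap(word):
        share1 = mean([float(word in d) for d in arm1])
        share0 = mean([float(word in d) for d in arm0])
        return abs(share1 - share0)

    # Ties are broken alphabetically because vocab is sorted.
    return max(vocab, key=gap)


def apply_measure(word, doc):
    """Map a document to the discovered measure: 1.0 if the word appears."""
    return 1.0 if word in tokens(doc) else 0.0


def difference_in_means(y, treated):
    """Treated-minus-control difference in mean outcomes."""
    y1 = [v for v, t in zip(y, treated) if t]
    y0 = [v for v, t in zip(y, treated) if not t]
    return mean(y1) - mean(y0)


def split_sample_effect(docs, treated, seed=0):
    """Discover the measure on one half, estimate the effect on the other."""
    disc, est = stratified_split(treated, seed)
    word = discover_measure([docs[i] for i in disc],
                            [treated[i] for i in disc])
    y = [apply_measure(word, docs[i]) for i in est]
    return word, difference_in_means(y, [treated[i] for i in est])
```

Calling `split_sample_effect(docs, treated)` on a list of response texts and matching treatment indicators returns the chosen word and the estimated effect on its usage. Because the word is selected without looking at the estimation half, a word that separates the arms only by chance in the discovery half has no built-in advantage when the effect is estimated.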
