Quantifying causality in data science with quasi-experiments

Estimating causal effects from observational data is essential to many data science questions but remains challenging. Here we review approaches to causality, called quasi-experiments, that are popular in econometrics and exploit (quasi-)random variation in existing data, and we show how they can be combined with machine learning to answer causal questions in typical data science settings. We also highlight how data scientists can help advance these methods to bring causal estimation to high-dimensional data from medicine, industry and society. Quasi-experiments provide causal inference methods that rest on plausible assumptions and are practical for a range of real-world problems.
