How and Why to Use Experimental Data to Evaluate Methods for Observational Causal Inference

Methods that infer causal dependence from observational data are central to many areas of science, including medicine, economics, and the social sciences. A variety of theoretical properties of these methods have been proven, but empirical evaluation remains a challenge, largely due to the lack of observational data sets for which the true treatment effect is known. We describe and analyze observational sampling from randomized controlled trials (OSRCT), a method for evaluating causal inference methods using data from randomized controlled trials (RCTs). This method can be used to create constructed observational data sets with corresponding unbiased estimates of treatment effect, substantially increasing the number of data sets available for empirical evaluation of causal inference methods. We show that, in expectation, OSRCT creates data sets that are equivalent to those produced by randomly sampling from empirical data sets in which all potential outcomes are available. We then perform a large-scale evaluation of seven causal inference methods over 37 data sets drawn from RCTs as well as from simulators, real-world computational systems, and observational data sets augmented with a synthetic response variable. We find notable performance differences across data from different sources, demonstrating the importance of evaluating any causal inference method on data from a variety of sources.
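
The core idea behind OSRCT is to subsample an RCT so that treatment assignment becomes correlated with a chosen covariate, producing a constructed observational data set with induced confounding while the full RCT still supplies an unbiased benchmark estimate of the treatment effect. The following is a minimal sketch of that idea, not the paper's exact specification: the column names (`T`, `Y`), the median-threshold biasing function, and the `bias_strength` parameter are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def osrct_sample(rct_df, covariate, treatment_col="T", outcome_col="Y",
                 bias_strength=0.7, seed=0):
    """Subsample RCT data so treatment becomes correlated with `covariate`.

    Returns a constructed observational data set plus a benchmark ATE
    estimate (difference in means) computed from the full RCT.
    """
    rng = np.random.default_rng(seed)
    c = rct_df[covariate].to_numpy(dtype=float)

    # Map the biasing covariate to a probability of "preferring" treatment:
    # units above the covariate's median prefer treatment with probability
    # `bias_strength`, the rest with probability 1 - bias_strength.
    prefer_treated = np.where(c > np.median(c), bias_strength, 1.0 - bias_strength)
    preferred = rng.binomial(1, prefer_treated)

    # Keep a unit only when its randomized treatment matches the sampled
    # preference; the retained rows exhibit confounding by `covariate`.
    keep = rct_df[treatment_col].to_numpy() == preferred
    constructed_obs = rct_df[keep].reset_index(drop=True)

    # The full RCT still yields an unbiased treatment-effect benchmark.
    treated_mean = rct_df.loc[rct_df[treatment_col] == 1, outcome_col].mean()
    control_mean = rct_df.loc[rct_df[treatment_col] == 0, outcome_col].mean()
    return constructed_obs, treated_mean - control_mean
```

In an evaluation, a causal inference method would be run on `constructed_obs` and its estimate compared against the RCT-based benchmark returned alongside it.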
