Limits of Estimating Heterogeneous Treatment Effects: Guidelines for Practical Algorithm Design

Estimating heterogeneous treatment effects from observational data is a central problem in many domains. Because counterfactual data is inaccessible, the problem differs fundamentally from supervised learning, and entails a more complex set of modeling choices. Despite a variety of recently proposed algorithmic solutions, a principled guideline for building estimators of treatment effects using machine learning algorithms is still lacking. In this paper, we provide such a guideline by characterizing the fundamental limits of estimating heterogeneous treatment effects, and establishing conditions under which these limits can be achieved. Our analysis reveals that the relative importance of the different aspects of observational data varies with the sample size. For instance, we show that selection bias matters only in small-sample regimes, whereas with a large sample size, the way an algorithm models the control and treated outcomes is what bottlenecks its performance. Guided by our analysis, we build a practical algorithm for estimating treatment effects using a non-stationary Gaussian process with doubly-robust hyperparameters. Using a standard semi-synthetic simulation setup, we show that our algorithm outperforms the state of the art, and that the behavior of existing algorithms conforms with our analysis.
