Estimating individual treatment effect: generalization bounds and algorithms

There is intense interest in applying machine learning to problems of causal inference in fields such as healthcare, economics and education. In particular, individual-level causal inference has important applications such as precision medicine. We give a new theoretical analysis and family of algorithms for predicting individual treatment effect (ITE) from observational data, under the assumption known as strong ignorability. The algorithms learn a "balanced" representation such that the induced treated and control distributions look similar. We give a novel, simple and intuitive generalization-error bound showing that the expected ITE estimation error of a representation is bounded by a sum of the standard generalization-error of that representation and the distance between the treated and control distributions induced by the representation. We use Integral Probability Metrics to measure distances between distributions, deriving explicit bounds for the Wasserstein and Maximum Mean Discrepancy (MMD) distances. Experiments on real and simulated data show the new algorithms match or outperform the state-of-the-art.
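The form of the bound suggests a training objective: a factual prediction loss plus a penalty on the distance between the treated and control representation distributions. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the representations are already computed and given as lists of vectors, uses a Gaussian RBF kernel with a hypothetical bandwidth `sigma`, and uses the biased estimate of the squared MMD as a simple stand-in for the distance term. The names `mmd_squared`, `balanced_objective` and the weight `alpha` are illustrative assumptions.

```python
import math
import random


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(treated, control, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(treated, treated)
            + mean_kernel(control, control)
            - 2.0 * mean_kernel(treated, control))


def balanced_objective(factual_loss, treated_repr, control_repr,
                       alpha=1.0, sigma=1.0):
    """Factual loss plus a weighted imbalance penalty (alpha is an assumed knob)."""
    return factual_loss + alpha * mmd_squared(treated_repr, control_repr, sigma)


def sample_gaussian(rng, n, dim, shift=0.0):
    """Draw n points from an isotropic unit Gaussian centred at `shift`."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    treated = sample_gaussian(rng, 50, 2)
    control_balanced = sample_gaussian(rng, 50, 2)
    control_shifted = sample_gaussian(rng, 50, 2, shift=2.0)

    # A shifted control group should incur a larger penalty than a balanced one.
    print(balanced_objective(0.5, treated, control_balanced))
    print(balanced_objective(0.5, treated, control_shifted))
```

In this sketch a larger `alpha` trades factual fit for balance. With `alpha=0` the objective reduces to the ordinary factual loss.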
