Counterfactual Mean Embedding: A Kernel Method for Nonparametric Causal Inference

This paper introduces a novel Hilbert space representation of a counterfactual distribution, called the counterfactual mean embedding (CME), with applications in nonparametric causal inference. Counterfactual prediction has become a ubiquitous tool in machine learning applications, such as online advertising, recommendation systems, and medical diagnosis, whose performance relies on certain interventions. To infer the outcomes of such interventions, we propose to embed the associated counterfactual distribution into a reproducing kernel Hilbert space (RKHS) endowed with a positive definite kernel. Under appropriate assumptions, the CME allows us to perform causal inference over the entire landscape of the counterfactual distribution. The CME can be estimated consistently from observational data without requiring any parametric assumption about the underlying distributions. We also derive a rate of convergence which depends on the smoothness of the conditional mean and the Radon-Nikodym derivative of the underlying marginal distributions. Our framework can handle not only real-valued outcomes, but potentially also more complex and structured outcomes such as images, sequences, and graphs. Lastly, our experimental results on off-policy evaluation tasks demonstrate the advantages of the proposed estimator.
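To make the idea of estimating an embedded counterfactual distribution from samples more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's reference implementation: it assumes a Gaussian kernel on the covariates and a kernel-ridge-style regularized estimator, in which each observed outcome receives a weight computed from the covariates of the source (observed) and target (counterfactual) populations. The function names, the bandwidth `sigma`, and the regularization constant `reg` are hypothetical choices made for this example.

```python
import numpy as np


def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def cme_weights(X_source, X_target, reg=1e-3, sigma=1.0):
    """Weights on the observed outcomes (assumed kernel-ridge-style form).

    X_source: (n, d) covariates for which outcomes were observed.
    X_target: (m, d) covariates of the population of interest.
    Returns a length-n vector solving (K + n * reg * I) w = mean column of K_tilde.
    """
    n = X_source.shape[0]
    K = gaussian_kernel(X_source, X_source, sigma)        # (n, n)
    K_tilde = gaussian_kernel(X_source, X_target, sigma)  # (n, m)
    return np.linalg.solve(K + n * reg * np.eye(n), K_tilde.mean(axis=1))


def counterfactual_expectation(f_values, weights):
    """Weighted sum of f(y_i): a plug-in estimate of the counterfactual mean of f(Y)."""
    return float(weights @ f_values)


# Toy usage with synthetic data (values are illustrative only).
rng = np.random.default_rng(0)
X_obs = rng.normal(0.0, 1.0, size=(50, 1))
y_obs = np.sin(X_obs[:, 0]) + 0.1 * rng.normal(size=50)
X_new = rng.normal(0.5, 1.0, size=(40, 1))
w = cme_weights(X_obs, X_new, reg=1e-2, sigma=1.0)
print(counterfactual_expectation(y_obs, w))
```

The weights depend only on the covariates, so the same vector can be reused to estimate the counterfactual expectation of any function of the outcome, which is one way to read the abstract's claim about inference over the whole distribution rather than a single summary such as the mean.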
