Counterfactual Mean Embeddings

Counterfactual inference has become a ubiquitous tool in online advertising, recommendation systems, medical diagnosis, and finance. Accurate modeling of the outcome distributions associated with different interventions, known as counterfactual distributions, is crucial for the success of these applications. In this work, we propose to model counterfactual distributions using a novel Hilbert space representation called the counterfactual mean embedding (CME). The CME embeds the associated counterfactual distribution into a reproducing kernel Hilbert space (RKHS) endowed with a positive definite kernel, which allows us to perform causal inference over the entire landscape of the counterfactual distribution. Based on this representation, we propose a distributional treatment effect (DTE) that quantifies the causal effect over entire outcome distributions. Our approach is nonparametric: the CME can be estimated consistently from observational data without any parametric assumption about the underlying distributions. We also establish a rate of convergence for the proposed estimator, which depends on the smoothness of the conditional mean and of the Radon-Nikodym derivative of the underlying marginal distributions. Furthermore, our framework accommodates more complex outcomes such as images, sequences, and graphs. Lastly, our experimental results on synthetic data and off-policy evaluation tasks demonstrate the advantages of the proposed estimator.
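
To illustrate why comparing entire outcome distributions in an RKHS can reveal effects that a difference in means would miss, the following is a minimal sketch of the standard empirical kernel mean embedding and the squared RKHS distance between two embeddings (the maximum mean discrepancy). It is not the CME estimator proposed in this work: it ignores covariates and any adjustment for confounding, and the Gaussian kernel, the bandwidth, the sample sizes, and all function and variable names are assumptions chosen for illustration only.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) Gram matrix between two samples (hypothetical helper)."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(y0, y1, bandwidth=1.0):
    """Biased estimate of the squared RKHS distance between the empirical
    mean embeddings of two outcome samples."""
    k00 = gaussian_kernel(y0, y0, bandwidth)
    k11 = gaussian_kernel(y1, y1, bandwidth)
    k01 = gaussian_kernel(y0, y1, bandwidth)
    return k00.mean() + k11.mean() - 2.0 * k01.mean()


# Synthetic outcomes (assumed for illustration): both groups have mean zero,
# so the difference in means is close to zero, but the variances differ.
rng = np.random.default_rng(0)
y_control = rng.normal(0.0, 1.0, size=500)
y_treated = rng.normal(0.0, 3.0, size=500)
y_same = rng.normal(0.0, 1.0, size=500)

mmd_shift = mmd_squared(y_control, y_treated)  # distributions differ
mmd_null = mmd_squared(y_control, y_same)      # same distribution
print(f"MMD^2 (different variance): {mmd_shift:.4f}")
print(f"MMD^2 (same distribution):  {mmd_null:.4f}")
```

In this toy setting the two groups share the same mean, yet the distance between their embeddings is clearly larger than in the same-distribution comparison, which is the kind of distribution-level difference a purely mean-based effect would not register.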
