Counterfactual Mean Embeddings

Counterfactual inference has become a ubiquitous tool in online advertising, recommendation systems, medical diagnosis, and finance. Accurate modeling of the outcome distributions associated with different interventions, known as counterfactual distributions, is crucial for the success of these applications. In this work, we propose to model counterfactual distributions using a novel Hilbert space representation called the counterfactual mean embedding (CME). The CME embeds the associated counterfactual distribution into a reproducing kernel Hilbert space (RKHS) endowed with a positive definite kernel, which allows us to perform causal inference over the entire landscape of the counterfactual distribution. Based on this representation, we propose a distributional treatment effect (DTE) that quantifies the causal effect over entire outcome distributions. Our approach is nonparametric: the CME can be estimated consistently from observational data without any parametric assumption about the underlying distributions. We also establish a rate of convergence for the proposed estimator, which depends on the smoothness of the conditional mean and of the Radon-Nikodym derivative of the underlying marginal distributions. Furthermore, our framework accommodates more complex outcomes such as images, sequences, and graphs. Lastly, our experimental results on synthetic data and off-policy evaluation tasks demonstrate the advantages of the proposed estimator.
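To make the idea of embedding a counterfactual distribution concrete, the following is a minimal sketch and not necessarily the estimator analyzed in this work. It assumes a kernel ridge regression style construction with Gaussian kernels: outcomes observed under one treatment arm are reweighted so that their kernel mean approximates the outcome distribution that arm would produce under the covariates of the other arm. The function names (gaussian_gram, cme_weights, embedding_distance_sq), the bandwidth, the regularization constant reg, and the synthetic data are all illustrative assumptions. The squared RKHS distance between the weighted embedding and the empirical embedding of the other arm's outcomes then serves as a DTE-like quantity.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def cme_weights(x_source, x_target, bandwidth=1.0, reg=1e-3):
    """Weights beta so that sum_i beta_i * phi(y_source_i) approximates the
    embedding of the source-arm outcome under the target-arm covariates.

    Assumed form: beta = (K + n * reg * I)^{-1} K_st 1_m / m.
    """
    n = x_source.shape[0]
    k_ss = gaussian_gram(x_source, x_source, bandwidth)
    k_st = gaussian_gram(x_source, x_target, bandwidth)
    return np.linalg.solve(k_ss + n * reg * np.eye(n), k_st.mean(axis=1))


def embedding_distance_sq(y_a, w_a, y_b, w_b, bandwidth=1.0):
    """Squared RKHS distance between two weighted outcome embeddings."""
    k_aa = gaussian_gram(y_a, y_a, bandwidth)
    k_bb = gaussian_gram(y_b, y_b, bandwidth)
    k_ab = gaussian_gram(y_a, y_b, bandwidth)
    return float(w_a @ k_aa @ w_a + w_b @ k_bb @ w_b - 2.0 * w_a @ k_ab @ w_b)


# Synthetic illustration (hypothetical data): two arms with shifted covariates
# and a constant additive treatment effect of 1.0 on the outcome.
rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, size=(50, 1))
x1 = rng.normal(0.5, 1.0, size=(40, 1))
y0 = x0 + 0.1 * rng.normal(size=(50, 1))
y1 = x1 + 1.0 + 0.1 * rng.normal(size=(40, 1))

beta = cme_weights(x0, x1)              # weights on the control outcomes y0
uniform = np.full(y1.shape[0], 1.0 / y1.shape[0])
dte_sq = embedding_distance_sq(y1, uniform, y0, beta)
print("squared embedding distance:", dte_sq)
```

Because the outcome enters only through a kernel on outcomes, the same sketch would in principle apply to structured outcomes by replacing gaussian_gram on the outcome side with a kernel defined on images, sequences, or graphs.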
