Informative Subspace Learning for Counterfactual Inference

Inferring causal relations from observational data is widely used for knowledge discovery in healthcare and economics. To investigate whether a treatment can affect an outcome of interest, we focus on answering counterfactual questions of this type: what would a patient’s blood pressure be had he or she received a different treatment? Nearest neighbor matching (NNM) sets the counterfactual outcome of any treatment (control) sample equal to the factual outcome of its nearest neighbor in the control (treatment) group. Although simple, flexible, and interpretable, most NNM approaches are easily misled by variables that do not affect the outcome. In this paper, we address this challenge by learning subspaces that are predictive of the outcome variable for both the treatment group and the control group. Applying NNM in the learned subspaces leads to more accurate estimation of the counterfactual outcomes and therefore of the treatment effects. We introduce an informative subspace learning algorithm that maximizes the nonlinear dependence between the candidate subspace and the outcome variable, as measured by the Hilbert-Schmidt Independence Criterion (HSIC). We propose a scalable estimator of HSIC, called HSIC-RFF, which reduces the computational and storage complexities of the naive HSIC implementation from quadratic to linear in the sample size by constructing random Fourier features. We also prove an upper bound on the approximation error of the HSIC-RFF estimator. Experimental results on simulated and real-world datasets demonstrate that our proposed approach outperforms existing NNM approaches and other commonly used regression-based methods for counterfactual inference.
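As a rough illustration of the two ingredients described above, the following NumPy sketch shows (a) one-nearest-neighbor matching for counterfactual outcomes and (b) an HSIC estimate computed from random Fourier features in time linear in the sample size. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the Gaussian kernel, the bandwidth, the number of features, and the biased 1/n² normalization are all illustrative choices, and the subspace optimization step itself is not shown.

```python
import numpy as np

def nnm_counterfactuals(X, t, y):
    """1-NN matching: each unit's counterfactual outcome is the factual
    outcome of its nearest neighbor (Euclidean) in the opposite group.
    Brute-force distances are used here for clarity only."""
    y_cf = np.empty_like(y, dtype=float)
    for g in (0, 1):
        own = np.where(t == g)[0]
        other = np.where(t != g)[0]
        diff = X[own][:, None, :] - X[other][None, :, :]
        d2 = (diff ** 2).sum(axis=-1)          # (n_own, n_other) squared distances
        y_cf[own] = y[other[d2.argmin(axis=1)]]
    return y_cf

def rff(X, n_features, bandwidth, rng):
    """Random Fourier features approximating a Gaussian kernel
    (assumed kernel; the bandwidth is a hypothetical tuning parameter)."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def hsic_rff(X, y, n_features=100, bandwidth=1.0, seed=0):
    """Approximate the biased HSIC estimate tr(KHLH) / n^2 without forming
    the n-by-n kernel matrices: with K ~ Zx Zx^T and L ~ Zy Zy^T, the trace
    equals the squared Frobenius norm of the centered cross-product."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Zx = rff(X, n_features, bandwidth, rng)
    Zy = rff(y.reshape(n, -1), n_features, bandwidth, rng)
    Zx -= Zx.mean(axis=0)                      # centering plays the role of H
    Zy -= Zy.mean(axis=0)
    C = Zx.T @ Zy / n                          # (D, D) cross-covariance, O(n D^2)
    return float(np.sum(C ** 2))

# Toy data: the outcome depends on the first covariate only.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 2))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=500)
hsic_relevant = hsic_rff(X_demo[:, [0]], y_demo)
hsic_irrelevant = hsic_rff(X_demo[:, [1]], y_demo)
```

On this toy data, the estimate for the relevant covariate should be noticeably larger than for the irrelevant one, which is the signal a subspace learner could maximize before running `nnm_counterfactuals` on the projected covariates.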
