Semi-Parametric Efficient Policy Learning with Continuous Actions

We consider off-policy evaluation and optimization with continuous action spaces. We focus on observational data, where the data collection policy is unknown and must be estimated from the data. We take a semi-parametric approach in which the value function takes a known parametric form in the treatment, but we remain agnostic about how it depends on the observed contexts. We propose a doubly robust off-policy estimate for this setting and show that off-policy optimization based on this estimate is robust to estimation errors in either the policy function or the regression model. We also show that the variance of our off-policy estimate achieves the semi-parametric efficiency bound. Our results also apply when the model does not satisfy our semi-parametric form; in that case, regret is measured against the best projection of the true value function onto this functional space. Our work extends prior approaches to policy optimization from observational data, which considered only discrete actions. We provide an experimental evaluation of our method on a synthetic data example motivated by optimal personalized pricing.
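To make the doubly robust idea concrete, the following is a minimal sketch and not necessarily the exact estimator of the paper. It assumes a value function that is linear in a known feature map of the action, y = <theta(x), phi(a)> + noise, with the hypothetical choice phi(a) = (1, a). The names `phi`, `dr_policy_value`, `sigma_oracle`, and `theta_wrong`, and the synthetic data-generating process, are all illustrative assumptions. The conditional covariance Sigma(x) = E[phi(a) phi(a)^T | x] is taken as known here; with an unknown logging policy it would have to be estimated.

```python
import numpy as np

def phi(a):
    # Hypothetical feature map: the value is assumed linear in (1, a).
    return np.stack([np.ones_like(a), a], axis=-1)

def dr_policy_value(x, a, y, theta_hat, sigma_hat, policy):
    """Doubly robust value estimate of a deterministic policy (sketch).

    theta_hat(x) -> (n, 2) regression estimate of theta(x)
    sigma_hat(x) -> (n, 2, 2) estimate of E[phi(a) phi(a)^T | x]
    """
    f = phi(a)                                   # (n, 2)
    th = theta_hat(x)                            # (n, 2)
    resid = y - np.sum(th * f, axis=1)           # regression residuals
    # Correction term: Sigma(x)^{-1} phi(a) * residual, solved per sample.
    corr = np.linalg.solve(sigma_hat(x), f[..., None])[..., 0] * resid[:, None]
    theta_dr = th + corr
    return float(np.mean(np.sum(theta_dr * phi(policy(x)), axis=1)))

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
n = 50_000
x = rng.uniform(0.0, 1.0, size=n)
a = x + rng.normal(size=n)                       # logging policy: a | x ~ N(x, 1)
theta_true = np.stack([1.0 + x, 2.0 * x], axis=-1)
y = np.sum(theta_true * phi(a), axis=1) + rng.normal(size=n)

def sigma_oracle(x):
    # E[phi phi^T | x] for a | x ~ N(x, 1): [[1, x], [x, x^2 + 1]].
    one = np.ones_like(x)
    return np.stack([np.stack([one, x], -1), np.stack([x, x**2 + 1], -1)], -2)

def theta_wrong(x):
    # Deliberately misspecified regression model: predicts zero everywhere.
    return np.zeros((x.shape[0], 2))

def policy(x):
    # Target policy to evaluate: always play action 1.
    return np.ones_like(x)

# True value under this setup: E[(1 + x) + 2x * 1] = 2.5 for x ~ U(0, 1).
plug_in_value = float(np.mean(np.sum(theta_wrong(x) * phi(policy(x)), axis=1)))
dr_value = dr_policy_value(x, a, y, theta_wrong, sigma_oracle, policy)
print("plug-in:", plug_in_value, "doubly robust:", dr_value)
```

In this sketch the regression model is badly wrong, so the plug-in estimate is far from the true value of 2.5, while the corrected estimate recovers it because the covariance model is right. The reverse case (good regression, poor covariance estimate) is the other half of the double robustness property described in the abstract.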
