More Robust Doubly Robust Off-policy Evaluation

We study the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of a policy from data generated by one or more other policies. In particular, we focus on doubly robust (DR) estimators, which combine an importance sampling (IS) component with a performance model and thereby exploit both the low (or zero) bias of IS and the low variance of the model. Although the accuracy of the model has a large impact on the overall performance of DR, most work on DR estimators in OPE has focused on improving the IS part rather than on how to learn the model. In this paper, we propose alternative DR estimators, called more robust doubly robust (MRDR), that learn the model parameters by minimizing the variance of the DR estimator. We first present a formulation for learning the DR model in RL. We then derive formulas for the variance of the DR estimator in both contextual bandits and RL, such that their gradients with respect to the model parameters can be estimated from samples, and propose methods to efficiently minimize this variance. We prove that the MRDR estimators are strongly consistent and asymptotically optimal. Finally, we evaluate MRDR on bandit and RL benchmark problems and compare its performance with that of existing methods.
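To make the MRDR objective concrete, below is a minimal sketch, in our own notation rather than the paper's, of the standard DR estimator in contextual bandits and of the variance-minimization criterion MRDR uses to fit the model. Here $\mu$ denotes the behavior (logging) policy, $\pi$ the evaluation policy, and $\hat{r}_\theta$ a parametric reward model; these symbol names are illustrative assumptions.

$$\hat{V}_{\mathrm{DR}}(\theta) \;=\; \frac{1}{n}\sum_{i=1}^{n}\Big[\,\mathbb{E}_{a\sim\pi(\cdot\mid x_i)}\big[\hat{r}_\theta(x_i,a)\big] \;+\; \frac{\pi(a_i\mid x_i)}{\mu(a_i\mid x_i)}\big(r_i-\hat{r}_\theta(x_i,a_i)\big)\Big].$$

A conventional DR estimator fits $\hat{r}_\theta$ by regressing the observed rewards, whereas MRDR instead selects

$$\theta_{\mathrm{MRDR}} \;\in\; \arg\min_{\theta}\; \widehat{\mathrm{Var}}\big[\hat{V}_{\mathrm{DR}}(\theta)\big],$$

that is, the model parameters minimizing an empirical estimate of the variance of the DR estimator itself, which is the criterion described in the abstract.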
