Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where deploying a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have orders of magnitude lower mean squared error than existing methods; that is, it makes far more efficient use of the available data. Our new estimator is based on two advances: an extension of the doubly robust estimator (Jiang and Li, 2015), and a new way to mix between model-based estimates and importance-sampling-based estimates.
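To make the building block concrete, the sketch below (Python) implements the per-trajectory doubly robust recursion of Jiang and Li (2015) that our estimator extends; it is a minimal illustration, not the new estimator itself. The callables pi_e, pi_b, q_hat, and v_hat (evaluation- and behavior-policy action probabilities and an approximate model's action-value and state-value functions), as well as the (state, action, reward) trajectory format, are illustrative assumptions.

    import numpy as np

    def dr_estimate(trajectories, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
        # Doubly robust off-policy value estimate (Jiang and Li, 2015),
        # computed by the backward recursion
        #   V_DR <- v_hat(s_t) + rho_t * (r_t + gamma * V_DR - q_hat(s_t, a_t)),
        # where rho_t = pi_e(a_t | s_t) / pi_b(a_t | s_t) is the per-step
        # importance weight.
        estimates = []
        for traj in trajectories:
            v_dr = 0.0
            for (s, a, r) in reversed(traj):   # iterate backwards in time
                rho = pi_e(a, s) / pi_b(a, s)  # per-step importance weight
                v_dr = v_hat(s) + rho * (r + gamma * v_dr - q_hat(s, a))
            estimates.append(v_dr)
        return float(np.mean(estimates))

    # Hypothetical usage with two actions; an all-zeros model reduces DR to
    # per-decision importance sampling.
    pi_e = lambda a, s: 0.9 if a == 0 else 0.1
    pi_b = lambda a, s: 0.5
    q_hat = lambda s, a: 0.0
    v_hat = lambda s: 0.0
    data = [[(0, 0, 1.0), (1, 1, 0.0)], [(0, 1, 0.0), (2, 0, 1.0)]]
    print(dr_estimate(data, pi_e, pi_b, q_hat, v_hat, gamma=0.95))

The two advances described in the abstract, the extension of this estimator and the mixing between model-based and importance-sampling-based estimates, are not shown in this sketch.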

[1] M. J. D. Powell, et al. Weighted Uniform Sampling — a Monte Carlo Technique for Reducing Variance, 1966.

[2] Pranab Kumar Sen, et al. Large Sample Methods in Statistics: An Introduction with Applications, 1993.

[3] J. Robins, et al. Semiparametric regression estimation in the presence of dependent censoring, 1995.

[4] R. Bartle. The elements of integration and Lebesgue measure, 1995.

[5] R. Mittelhammer. Mathematical Statistics for Economics and Business, 1996.

[6] Andrew G. Barto, et al. Reinforcement learning, 1998.

[7] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.

[8] Debashis Kushary, et al. Bootstrap Methods and Their Application, 2000, Technometrics.

[9] Doina Precup, et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.

[10] Steven J. Bradtke, et al. Linear Least-Squares algorithms for temporal difference learning, 2004, Machine Learning.

[11] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[12] J. Robins, et al. Doubly Robust Estimation in Missing Data and Causal Inference Models, 2005, Biometrics.

[13] Richard S. Sutton, et al. Learning to predict by the methods of temporal differences, 1988, Machine Learning.

[14] Devinder Thapa, et al. Agent Based Decision Support System Using Reinforcement Learning Under Emergency Circumstances, 2005, ICNC.

[15] Michael H. Bowling, et al. Optimal Unbiased Estimators for Evaluating Agent Performance, 2006, AAAI.

[16] John N. Tsitsiklis, et al. Bias and Variance Approximation in Value Function Estimates, 2007, Manag. Sci.

[17] M. Kenward, et al. An Introduction to the Bootstrap, 2007.

[18] Martha White, et al. Learning a Value Analysis Tool for Agent Evaluation, 2009, IJCAI.

[19] Shalabh Bhatnagar, et al. Natural actor-critic algorithms, 2009, Autom.

[20] Scott Sanner, et al. Temporal Difference Bayesian Model Averaging: A Bayesian Perspective on Adapting Lambda, 2010, ICML.

[21] Martha White, et al. A general framework for reducing variance in agent evaluation, 2010.

[22] John Langford, et al. Doubly Robust Policy Evaluation and Learning, 2011, ICML.

[23] P. Thomas. TDγ: Re-evaluating Complex Backups in Temporal Difference Learning, 2011.

[24] Joel Veness, et al. Variance Reduction in Monte-Carlo Tree Search, 2011, NIPS.

[25] Scott Niekum, et al. TDγ: Re-evaluating Complex Backups in Temporal Difference Learning, 2011, NIPS.

[26] Sergey Levine, et al. Guided Policy Search, 2013, ICML.

[27] Richard S. Sutton, et al. Weighted importance sampling for off-policy learning with linear function approximation, 2014, NIPS.

[28] Sergey Levine, et al. Offline policy evaluation across representations with applications to educational games, 2014, AAMAS.

[29] Richard S. Sutton, et al. Off-policy TD(λ) with a true online equivalence, 2014, UAI.

[30] Philip S. Thomas, et al. Personalized Ad Recommendation Systems for Life-Time Value Optimization with Guarantees, 2015, IJCAI.

[31] Scott Niekum, et al. Policy Evaluation Using the Ω-Return, 2015, NIPS.

[32] Philip S. Thomas, et al. A Notation for Markov Decision Processes, 2015, ArXiv.

[33] Philip S. Thomas, et al. High-Confidence Off-Policy Evaluation, 2015, AAAI.

[34] Lihong Li, et al. Doubly Robust Off-policy Evaluation for Reinforcement Learning, 2015, ArXiv.

[35] Philip S. Thomas, et al. Safe Reinforcement Learning, 2015.

[36] Martha White, et al. Emphatic Temporal-Difference Learning, 2015, ArXiv.

[37] Zoran Popovic, et al. Offline Evaluation of Online Reinforcement Learning Algorithms, 2016, AAAI.

[38] Nan Jiang, et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, 2015, ICML.
