Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation

For an autonomous agent, executing a poor policy may be costly or even dangerous. For such agents, it is desirable to determine confidence-interval lower bounds on the performance of any given policy without executing the policy itself. Current methods for exact high-confidence off-policy evaluation, which rely on importance sampling, require a substantial amount of data to achieve a tight lower bound. Existing model-based methods address the problem only in discrete state spaces. Since exact bounds are intractable for many domains, we trade strict guarantees of safety for more data-efficient approximate bounds. In this context, we propose two bootstrapping off-policy evaluation methods that use learned MDP transition models to estimate lower confidence bounds on policy performance from limited data in both continuous and discrete state spaces. Since direct use of a model may introduce bias, we derive a theoretical upper bound on model bias for the case where the model's transition function is estimated from i.i.d. trajectories. This bound broadens our understanding of the conditions under which model-based methods have high bias. Finally, we empirically evaluate our proposed methods and analyze the settings in which different bootstrapping off-policy confidence interval methods succeed and fail.
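
To make the general idea concrete, the following is a minimal sketch (not the paper's exact algorithm) of a percentile-bootstrap lower bound computed over learned models: resample the observed trajectories with replacement, fit a simple maximum-likelihood tabular MDP model to each resample, evaluate the target policy in each fitted model, and report a lower percentile of the resulting estimates. It assumes a discrete, tabular setting and illustrative names (fit_tabular_model, bootstrap_lower_bound, gamma, delta); the paper's methods also cover continuous state spaces, which this toy example does not.

    import numpy as np

    def fit_tabular_model(trajectories, n_states, n_actions):
        """Maximum-likelihood transition/reward estimates from (s, a, r, s') tuples
        with integer state/action indices; add-one smoothing keeps rows well-defined."""
        counts = np.ones((n_states, n_actions, n_states))
        reward_sum = np.zeros((n_states, n_actions))
        visits = np.zeros((n_states, n_actions))
        for traj in trajectories:
            for s, a, r, s_next in traj:
                counts[s, a, s_next] += 1
                reward_sum[s, a] += r
                visits[s, a] += 1
        P = counts / counts.sum(axis=2, keepdims=True)
        R = reward_sum / np.maximum(visits, 1)   # unvisited (s, a) default to reward 0
        return P, R

    def policy_value(P, R, policy, gamma, start_dist):
        """Exact discounted evaluation of a stochastic policy (S x A matrix) in the model."""
        n_states = P.shape[0]
        P_pi = np.einsum('sa,sat->st', policy, P)   # state-to-state transitions under pi
        R_pi = np.einsum('sa,sa->s', policy, R)     # expected one-step reward under pi
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        return float(start_dist @ v)

    def bootstrap_lower_bound(trajectories, policy, n_states, n_actions,
                              gamma=0.95, start_dist=None, n_boot=200, delta=0.05,
                              rng=None):
        """Approximate (1 - delta) lower confidence bound via the percentile bootstrap."""
        rng = np.random.default_rng(rng)
        if start_dist is None:
            start_dist = np.full(n_states, 1.0 / n_states)
        estimates = []
        for _ in range(n_boot):
            # Resample trajectories (not individual transitions) to respect their i.i.d. structure.
            idx = rng.integers(len(trajectories), size=len(trajectories))
            resample = [trajectories[i] for i in idx]
            P, R = fit_tabular_model(resample, n_states, n_actions)
            estimates.append(policy_value(P, R, policy, gamma, start_dist))
        return float(np.percentile(estimates, 100 * delta))

Resampling at the trajectory level, rather than at the transition level, is what makes the bootstrap distribution reflect sampling variability in the model estimate; the resulting bound is approximate rather than exact, which is the trade-off described above.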
