Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains

The reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent's confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because the data generated by the agent-environment interaction rarely satisfy traditional statistical assumptions: samples of value estimates are dependent, likely non-normally distributed, and often limited, particularly in early learning when confidence estimates are most pivotal. In this work, we investigate how to compute robust confidence intervals for value estimates in continuous-state Markov decision processes. We illustrate how bootstrapping can be used to compute confidence intervals online under a changing policy (previously not possible) and prove validity under a few reasonable assumptions. We demonstrate the applicability of our confidence estimation algorithms with experiments on exploration, parameter estimation, and tracking.
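The paper's online bootstrap procedure is not spelled out in this abstract, but the core idea of bootstrapping a confidence interval from temporally dependent samples can be sketched with a moving-block bootstrap, the standard way to respect short-range dependence in Markov-process data. The sketch below is a minimal, hypothetical illustration, not the authors' algorithm: the function `block_bootstrap_ci`, its parameters, and the AR(1) example data are all assumptions made for illustration.

```python
# Minimal sketch: moving-block bootstrap percentile confidence interval for the
# mean of a dependent sample (e.g. a stream of value-estimate samples).
# Names and parameters are illustrative, not taken from the paper.
import numpy as np

def block_bootstrap_ci(samples, block_len=10, n_boot=1000, alpha=0.05, rng=None):
    """Return a (1 - alpha) percentile CI for the mean of `samples`,
    resampling overlapping blocks to preserve temporal dependence."""
    rng = np.random.default_rng() if rng is None else rng
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    starts = np.arange(n - block_len + 1)          # start index of every overlapping block
    n_blocks = int(np.ceil(n / block_len))         # blocks needed to rebuild a series of length n
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        chosen = rng.choice(starts, size=n_blocks, replace=True)
        resampled = np.concatenate([samples[s:s + block_len] for s in chosen])[:n]
        boot_means[b] = resampled.mean()
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Usage example on synthetic AR(1)-style dependent data.
rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = 0.8 * x[t - 1] + rng.normal()
print(block_bootstrap_ci(x, block_len=20, rng=rng))
```

Resampling whole blocks rather than individual points keeps within-block dependence intact, which is why block-based bootstrap schemes (rather than the i.i.d. bootstrap) are used when the samples come from an agent-environment interaction.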
