An empirical evaluation of interval estimation for Markov decision processes

This work takes an empirical approach to evaluating three model-based reinforcement-learning methods. All methods intend to speed the learning process by mixing exploitation of learned knowledge with exploration of possibly promising alternatives. We consider ε-greedy exploration, which is computationally cheap and popular, but unfocused in its exploration effort; R-Max exploration, a simplification of an exploration scheme that comes with a theoretical guarantee of efficiency; and a well-grounded approach, model-based interval estimation, that better integrates exploration and exploitation. Our experiments indicate that effective exploration can result in dramatic improvements in the observed rate of learning.
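
To make the contrast concrete, the following minimal Python sketch (not taken from the paper) compares undirected ε-greedy action selection with an interval-estimation-style rule that acts greedily with respect to an upper confidence bound on each action's value. The tabular estimates Q[s][a], visit counts N[s][a], and the bonus form confidence / sqrt(n) are illustrative assumptions, not the exact MBIE construction.

```python
import math
import random

# Illustrative sketch only: contrasts the two action-selection rules
# discussed above for a hypothetical tabular value estimate Q[s][a]
# with visit counts N[s][a].

def epsilon_greedy_action(Q, s, actions, epsilon=0.1):
    """With probability epsilon take a uniformly random action (unfocused
    exploration); otherwise exploit the current value estimate."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[s][a])

def interval_estimation_action(Q, N, s, actions, confidence=1.0):
    """Choose the action with the highest upper confidence bound on its
    value, directing exploration toward poorly estimated actions.
    The bonus confidence / sqrt(n) is an assumed stand-in for the
    interval width, not the paper's exact formula."""
    def upper_bound(a):
        n = max(N[s][a], 1)
        return Q[s][a] + confidence / math.sqrt(n)
    return max(actions, key=upper_bound)
```

The point of the contrast is that ε-greedy spreads its exploration uniformly over actions regardless of what is already known, while the optimistic interval-based rule concentrates exploration on actions whose value estimates remain uncertain.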
