Adaptive ε-greedy Exploration in Reinforcement Learning Based on Value Differences

This paper presents “Value-Difference Based Exploration” (VDBE), a method for balancing the exploration/exploitation dilemma inherent in reinforcement learning. The method adapts the exploration parameter of ε-greedy based on the temporal-difference error observed in value-function backups, which is taken as a measure of the agent’s uncertainty about the environment. VDBE is evaluated on a multi-armed bandit task, which gives direct insight into the method’s behavior. Preliminary results indicate that VDBE is less sensitive to its parameter settings than commonly used ad hoc approaches such as plain ε-greedy or softmax action selection.
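The core idea lends itself to a short sketch: large value-function updates signal uncertainty and should raise ε, while small updates let ε decay toward exploitation. The Python snippet below is a minimal illustration under those assumptions, not the paper’s reference implementation; the Boltzmann-style mapping, the parameter names (`sigma`, `delta`), and the toy two-armed bandit are illustrative choices.

```python
import math
import random

def vdbe_update(epsilon, value_diff, sigma=0.33, delta=0.5):
    """Adapt the exploration rate from the magnitude of a value-function update.

    A large |value_diff| (high uncertainty) pushes epsilon toward 1;
    a small one lets epsilon decay toward exploitation. sigma scales
    how strongly the update magnitude influences epsilon; delta is the
    mixing weight between the new target and the old epsilon.
    """
    x = math.exp(-abs(value_diff) / sigma)
    f = (1.0 - x) / (1.0 + x)            # Boltzmann-like value in [0, 1)
    return delta * f + (1.0 - delta) * epsilon

# Illustrative use on a two-armed bandit with incremental value estimates.
random.seed(0)
means = [0.3, 0.7]                        # hypothetical true arm rewards
q = [0.0, 0.0]                            # action-value estimates
alpha, epsilon = 0.1, 1.0                 # learning rate, initial exploration

for t in range(1000):
    # epsilon-greedy selection with the adaptive epsilon
    a = random.randrange(2) if random.random() < epsilon else q.index(max(q))
    r = random.gauss(means[a], 0.1)
    td_error = r - q[a]                   # bandit TD error (no bootstrapping)
    q[a] += alpha * td_error
    epsilon = vdbe_update(epsilon, alpha * td_error)

print(f"estimates={q}, final epsilon={epsilon:.3f}")
```

In a run like this, ε starts high while the estimates are poor and shrinks as the value differences vanish, which is the qualitative behavior the abstract describes; the exact update rule and parameterization in the paper may differ.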
