Gradient Algorithms for Exploration/Exploitation Trade-Offs: Global and Local Variants

Gradient-following algorithms are applied to efficiently adapt exploration parameters in temporal-difference learning with discrete action spaces. Global and local variants are evaluated in both discrete and continuous state spaces. The global variant is memory-efficient, requiring exploratory data only for the starting states; the local variant requires exploratory data for every state of the state space, but in return produces exploratory behavior only in states with improvement potential. The results suggest that gradient-based exploration can be used efficiently in combination with off-policy and on-policy algorithms such as Q-learning and Sarsa.
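
As a concrete illustration, the following minimal sketch shows one plausible way such a scheme could look: tabular Sarsa with a softmax (Boltzmann) policy whose global inverse temperature `beta` is adapted online by a REINFORCE-style gradient step driven by the TD error. The toy `ChainEnv`, the meta step size `mu`, and the exact form of the `beta` update are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

class ChainEnv:
    """Toy deterministic chain MDP (assumed for illustration): walk right to a goal."""
    n_states, n_actions = 6, 2                      # actions: 0 = left, 1 = right

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = min(self.s + 1, self.n_states - 1) if a == 1 else max(self.s - 1, 0)
        done = (self.s == self.n_states - 1)
        return self.s, (1.0 if done else -0.01), done

def softmax_probs(q_values, beta):
    """Boltzmann action probabilities with inverse temperature beta."""
    prefs = beta * (q_values - q_values.max())      # shift for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

def dlogpi_dbeta(q_values, probs, action):
    """d/d(beta) of log pi(action|state) for a softmax over beta * Q."""
    return q_values[action] - probs @ q_values

def sarsa_gradient_exploration(env, episodes=500, alpha=0.1, gamma=0.99,
                               mu=0.01, beta=1.0):
    """Tabular Sarsa; the global exploration parameter beta is adapted online."""
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        s = env.reset()
        probs = softmax_probs(Q[s], beta)
        a = np.random.choice(env.n_actions, p=probs)
        done = False
        while not done:
            s2, r, done = env.step(a)
            probs2 = softmax_probs(Q[s2], beta)
            a2 = np.random.choice(env.n_actions, p=probs2)
            td_error = r + (0.0 if done else gamma * Q[s2, a2]) - Q[s, a]
            # Gradient-following step on the exploration parameter:
            # a REINFORCE-style update that uses the TD error as the
            # reinforcement signal for the log-policy gradient w.r.t. beta.
            grad = dlogpi_dbeta(Q[s], probs, a)     # computed before the Q update
            Q[s, a] += alpha * td_error
            beta += mu * td_error * grad
            beta = max(beta, 1e-3)                  # keep the policy stochastic
            s, a, probs = s2, a2, probs2
    return Q, beta

Q, beta = sarsa_gradient_exploration(ChainEnv())
print("learned beta:", beta)
```

A local variant in the sense of the abstract would replace the scalar `beta` with a per-state array, applying the same update only to the visited state's entry, so that exploratory behavior persists only in states where the TD errors still indicate improvement potential.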
