Self-Regulating Action Exploration in Reinforcement Learning

Abstract The basic tenet of a learning process is for an agent to learn only as much, and for only as long, as is necessary. In reinforcement learning, the learning process is divided between exploration and exploitation. Given the complexity of the problem domain and the randomness of the learning process, the exact duration of reinforcement learning can never be known with certainty. Using an inaccurate number of training iterations leads either to non-convergence or to over-training of the learning agent. This work addresses these issues by proposing a technique to self-regulate the exploration rate and the training duration, leading to convergence efficiently. The idea originates from the intuition that exploration is only necessary when the success rate is low. This means exploration should be conducted in inverse proportion to the rate of success. In addition, the change in exploration-exploitation rates alters the duration of the learning process. Under this approach, the duration of the learning process adapts to the current status of learning. Experimental results from the K-Armed Bandit and Air Combat Maneuver scenarios show that optimal action policies can be discovered using the right number of training iterations. In essence, the proposed method eliminates the guesswork about the amount of exploration needed during reinforcement learning.
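The core idea — exploring in inverse proportion to the running success rate — can be illustrated on a k-armed bandit. The sketch below is an assumption-laden reading of the abstract, not the paper's exact algorithm: the update rule ε = 1 − (success rate) and the Bernoulli reward model are illustrative choices.

```python
import random

def self_regulating_bandit(true_means, episodes=5000, seed=0):
    """Epsilon-greedy k-armed bandit where the exploration rate is set
    inversely to the running success rate. This is a hedged sketch of
    the self-regulating idea; the exact rule epsilon = 1 - success_rate
    is an assumption for illustration."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k       # pulls per arm
    values = [0.0] * k     # incremental estimate of each arm's mean reward
    successes = 0
    for t in range(1, episodes + 1):
        success_rate = successes / t          # fraction of rewarded pulls so far
        epsilon = 1.0 - success_rate          # explore more when success is low
        if rng.random() < epsilon:
            arm = rng.randrange(k)                            # explore
        else:
            arm = max(range(k), key=lambda a: values[a])      # exploit
        reward = 1 if rng.random() < true_means[arm] else 0   # Bernoulli arm
        successes += reward
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # running mean
    return values, counts
```

As the agent succeeds more often, ε shrinks and exploitation dominates, so the training horizon effectively adapts to progress rather than being fixed in advance.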
