Heuristic Search Based Exploration in Reinforcement Learning

In this paper, we consider reinforcement learning in systems with an unknown environment, where the agent must trade off efficiently between exploration (long-term optimization) and exploitation (short-term optimization). The ε-greedy algorithm is a near-greedy action-selection method: it behaves greedily (exploitation) most of the time, but with small probability ε it instead selects an action at random (exploration). Prior work has shown that such random exploration can drive the agent toward poorly modeled states. This study therefore evaluates the role of heuristic-based exploration in reinforcement learning. We propose three methods: neighborhood search based exploration, simulated annealing based exploration, and tabu search based exploration. All three follow the same rule: "explore the least-visited state." In simulation, these techniques are evaluated and compared on a discrete reinforcement learning task (robot navigation); a sketch of the baseline and the shared exploration rule follows.
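To make the baseline and the shared exploration rule concrete, the sketch below contrasts standard ε-greedy selection with a count-based variant that, on an exploration step, moves toward the least-visited neighboring state. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the class names, the grid_successor transition model, and all hyperparameter values are invented for the example, and the neighborhood-search flavor shown here omits the tabu-list and annealing-schedule details of the other two proposed methods.

```python
import random
from collections import defaultdict

# Hypothetical discrete navigation task: states are grid cells, actions move the agent.
ACTIONS = ["up", "down", "left", "right"]

def grid_successor(state, action, size=5):
    """Toy deterministic transition on a size x size grid (illustrative only)."""
    x, y = state
    dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
    return (min(max(x + dx, 0), size - 1), min(max(y + dy, 0), size - 1))

class EpsilonGreedyAgent:
    """Baseline: epsilon-greedy action selection with Q-learning updates."""
    def __init__(self, epsilon=0.1, alpha=0.5, gamma=0.9):
        self.epsilon, self.alpha, self.gamma = epsilon, alpha, gamma
        self.q = defaultdict(float)          # Q[(state, action)]

    def select_action(self, state):
        if random.random() < self.epsilon:   # explore: uniformly random action
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])  # exploit

    def update(self, s, a, r, s_next):
        best_next = max(self.q[(s_next, a2)] for a2 in ACTIONS)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

class HeuristicExplorationAgent(EpsilonGreedyAgent):
    """Variant: on an exploration step, prefer the action leading to the
    least-visited successor state ("explore the least-visited state"),
    instead of picking an action uniformly at random."""
    def __init__(self, successor_fn, **kw):
        super().__init__(**kw)
        self.visits = defaultdict(int)       # visit counts per state
        self.successor_fn = successor_fn     # assumed known transition model

    def select_action(self, state):
        self.visits[state] += 1
        if random.random() < self.epsilon:
            # Neighborhood-search flavor: choose among actions whose
            # successor has the fewest visits; break ties at random.
            least = min(self.visits[self.successor_fn(state, a)] for a in ACTIONS)
            candidates = [a for a in ACTIONS
                          if self.visits[self.successor_fn(state, a)] == least]
            return random.choice(candidates)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

if __name__ == "__main__":
    # Toy run: navigate a 5x5 grid toward a goal cell with a sparse reward.
    agent = HeuristicExplorationAgent(successor_fn=grid_successor, epsilon=0.2)
    state = (0, 0)
    for _ in range(100):
        action = agent.select_action(state)
        next_state = grid_successor(state, action)
        reward = 1.0 if next_state == (4, 4) else 0.0
        agent.update(state, action, reward, next_state)
        state = next_state
```

The only change from the baseline is the exploration branch: the simulated-annealing and tabu-search variants would differ there too, by accepting a worse (more-visited) neighbor with a temperature-dependent probability or by forbidding recently visited states, respectively.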
