Coordination of exploration and exploitation in a dynamic environment

One much-researched issue in reinforcement learning is the trade-off between exploration and exploitation. Balancing exploration and exploitation effectively becomes even more crucial in a dynamic environment. An algorithm is proposed herein that provides one solution to the exploration-versus-exploitation dilemma. The algorithm is presented in the context of a path-finding agent in a dynamic grid-world problem. The state-value function used is penalty-based, steering the agent toward paths that incur minimal penalties. A forgetting mechanism is implemented that allows the agent to re-explore paths previously determined to be suboptimal. Simulation results are used to analyze the behavior of the proposed algorithm in a dynamic environment.
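As a rough illustration of the mechanics described above, the sketch below pairs a TD-style penalty update with a uniform forgetting step in a small grid world. This is a minimal sketch under assumed details: the epsilon-greedy action rule, the constants (ALPHA, GAMMA, EPSILON, FORGET_RATE), and the specific decay-toward-zero forgetting rule are illustrative choices, not the paper's published algorithm.

```python
import random

GRID_W, GRID_H = 8, 8
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
ALPHA = 0.1         # learning rate (assumed value)
GAMMA = 0.95        # discount factor (assumed value)
EPSILON = 0.1       # exploration probability (assumed value)
FORGET_RATE = 0.01  # per-update decay of stored penalties (assumed rule)

# Penalty-based state-value table: lower accumulated penalty = better state.
V = {(x, y): 0.0 for x in range(GRID_W) for y in range(GRID_H)}

def valid_actions(state):
    """Moves that keep the agent inside the grid."""
    x, y = state
    return [(dx, dy) for dx, dy in ACTIONS
            if 0 <= x + dx < GRID_W and 0 <= y + dy < GRID_H]

def choose_action(state):
    """Epsilon-greedy: usually step to the lowest-penalty neighbor."""
    moves = valid_actions(state)
    if random.random() < EPSILON:
        return random.choice(moves)
    return min(moves, key=lambda a: V[(state[0] + a[0], state[1] + a[1])])

def update(state, next_state, penalty):
    """TD-style penalty update followed by uniform forgetting.

    The forgetting term decays every stored value toward zero, so paths
    once judged suboptimal regain attractiveness over time and can be
    re-explored after the environment changes (e.g., an obstacle moves).
    """
    V[state] += ALPHA * (penalty + GAMMA * V[next_state] - V[state])
    for s in V:
        V[s] *= 1.0 - FORGET_RATE
```

The interplay between EPSILON and FORGET_RATE governs the exploration/exploitation balance: exploration noise handles short-term uncertainty, while forgetting restores long-abandoned regions of the path space to candidacy as the environment drifts.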
