Active exploratory Q-learning for large problems

Although reinforcement learning (RL) emerged more than a decade ago, it is still under extensive investigation for application to large problems, where states and actions are multi-dimensional and continuous and give rise to the so-called curse of dimensionality. Conventional RL methods are not efficient enough in such huge state-action spaces, while approaches based on value-function generalization require a very large number of good training examples. This paper presents an active exploratory approach to the challenge of RL in large problems. Its core principle is that the agent does not rush to the next state; instead, it first attempts a number of actions at the current state and then selects the action that returns the greatest immediate reward. The state resulting from performing that action becomes the next state. Four active exploration algorithms for finding good actions are proposed: random-based search, opposition-based random search, search by cyclical adjustment, and opposition-based cyclical adjustment of each action dimension. The efficiency of these algorithms is evaluated in a visual-servoing experiment with a 6-axis robot.
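
The following is a minimal sketch of the active action-selection idea summarized above: sample several candidate actions at the current state, optionally add their opposition-based mirrors within the action bounds, probe each one's immediate reward, and commit to the best. The environment interface (`env.try_action`), function names, and parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def random_candidates(action_low, action_high, n_candidates, rng):
    """Random-based search: sample candidate actions uniformly within the bounds."""
    return rng.uniform(action_low, action_high,
                       size=(n_candidates, len(action_low)))


def with_opposites(candidates, action_low, action_high):
    """Opposition-based search: add each candidate's mirror image in the bounds."""
    opposites = action_low + action_high - candidates
    return np.vstack([candidates, opposites])


def select_action_actively(env, state, action_low, action_high,
                           n_candidates=10, rng=None):
    """Probe several actions at the current state and keep the one that
    returns the greatest immediate reward; its outcome becomes the next state.

    Assumes `env.try_action(state, action)` returns (next_state, reward)
    without committing the agent to the move (hypothetical interface).
    """
    rng = rng or np.random.default_rng()
    action_low = np.asarray(action_low, dtype=float)
    action_high = np.asarray(action_high, dtype=float)

    candidates = random_candidates(action_low, action_high, n_candidates, rng)
    candidates = with_opposites(candidates, action_low, action_high)

    best = None
    for action in candidates:
        next_state, reward = env.try_action(state, action)  # probe only
        if best is None or reward > best[2]:
            best = (action, next_state, reward)
    return best  # (chosen action, resulting next state, immediate reward)
```

The chosen action and resulting state can then feed a standard Q-learning update; the sketch only covers the exploration step that distinguishes the approach.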
