Q-learning for Robots

Robot learning is a challenging, and somewhat unique, research domain. If a robot behavior is defined as a mapping between situations that occur in the real world and actions to be performed, then supervised learning of a robot behavior requires a set of representative examples (situation, desired action). To gather such a learning base, the human operator must have a deep understanding of the robot-world interaction (i.e., a model). However, there are many application domains where such a model cannot be obtained, either because detailed knowledge of the robot's world is unavailable (e.g., spatial or underwater exploration, nuclear or toxic waste management) or because it would be too costly. In this context, the automatic synthesis of a representative learning base is an important issue. It can be sought using reinforcement learning techniques, in particular Q-learning, which does not require a model of the robot-world interaction. In contrast to supervised learning examples, Q-learning examples are triplets (situation, action, Q value), where the Q value is the utility of executing the action in the situation. The supervised learning base is then obtained by recruiting the triplets with the highest utility.
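
The following is a minimal sketch of this idea in Python. The one-dimensional corridor task, the constants, and every name in it are assumptions made for illustration, not the paper's robot setup: tabular Q-learning accumulates (situation, action, Q value) triplets without any prior model of the robot-world interaction, and the highest-utility triplet per situation is then kept to form the supervised learning base.

# A minimal tabular Q-learning sketch (illustration only; the corridor task and all
# names/constants are assumptions for this example, not the paper's robot setup).
import random

NUM_STATES = 10          # positions along a 1-D corridor; state 9 is the goal
NUM_ACTIONS = 2          # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]

def step(state, action):
    """Toy robot-world interaction: reward 1 when the goal is reached, 0 otherwise."""
    next_state = min(state + 1, NUM_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == NUM_STATES - 1 else 0.0
    return next_state, reward, next_state == NUM_STATES - 1

def choose_action(state):
    """Epsilon-greedy selection over the Q table, breaking ties at random."""
    if random.random() < EPSILON:
        return random.randrange(NUM_ACTIONS)
    best = max(Q[state])
    return random.choice([a for a, q in enumerate(Q[state]) if q == best])

for _ in range(500):                      # learning episodes
    s, done, steps = 0, False, 0
    while not done and steps < 200:       # cap episode length
        a = choose_action(s)
        s2, r, done = step(s, a)
        # Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s, steps = s2, steps + 1

# Recruit, for each situation, the triplet with the highest utility: the resulting
# (situation, action) pairs form the supervised learning base.
supervised_base = [(s, Q[s].index(max(Q[s]))) for s in range(NUM_STATES)]
print(supervised_base)

On a real robot, the discrete situation and action sets would be replaced by the robot's sensor and actuator encodings, and the Q table by a function approximator such as a neural network.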
