Distributed Lazy Q-Learning for Cooperative Mobile Robots

Compared to single-robot learning, cooperative learning adds the challenges of a much larger search space (the combination of the individual search spaces), awareness of other team members, and the synthesis of the individual behaviors with respect to the task given to the group. Over the years, reinforcement learning has emerged as the main learning approach in autonomous robotics, and lazy learning has become the leading bias, reducing the time required by an experiment to the time needed to test the performance of the learned behavior. These two approaches have been combined in what is now called lazy Q-learning, a very efficient single-robot learning paradigm. We propose an extension of this paradigm to teams of robots: the "pessimistic" algorithm, which computes for each team member a lower bound on the utility of executing an action in a given situation. We use the cooperative multi-robot observation of multiple moving targets (CMOMMT) application as an illustrative example and study the efficiency of the pessimistic algorithm at inducing the learning of cooperation.
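To make the idea of a pessimistic lower bound more concrete, the following minimal Python sketch shows one way a per-robot lazy memory of (situation, action, quality) samples could be queried at decision time. The class name PessimisticLazyQ, the neighbour count k, and the Euclidean distance metric are illustrative assumptions; this is a sketch of the general technique, not a reproduction of the paper's exact algorithm.

```python
import numpy as np

class PessimisticLazyQ:
    """Illustrative lazy Q-learning memory with a pessimistic (lower-bound)
    utility estimate. Each robot keeps its own memory of
    (situation, action, quality) samples gathered during exploration;
    nothing is fitted until an action has to be chosen (the 'lazy' part)."""

    def __init__(self, n_actions, k=5):
        self.n_actions = n_actions
        self.k = k                # neighbours used per estimate (assumed value)
        self.situations = []      # stored situation vectors
        self.actions = []         # action index taken in each stored situation
        self.qualities = []       # observed quality / reinforcement signal

    def store(self, situation, action, quality):
        """Add one experience to the memory; no update pass is performed."""
        self.situations.append(np.asarray(situation, dtype=float))
        self.actions.append(action)
        self.qualities.append(quality)

    def lower_bound(self, situation, action):
        """Pessimistic utility of `action` in `situation`: the worst quality
        observed among the k closest stored situations where `action` was taken."""
        idx = [i for i, a in enumerate(self.actions) if a == action]
        if not idx:
            return -np.inf        # never tried: maximally pessimistic
        dists = [np.linalg.norm(self.situations[i] - situation) for i in idx]
        nearest = [idx[j] for j in np.argsort(dists)[:self.k]]
        return min(self.qualities[i] for i in nearest)

    def select_action(self, situation):
        """Choose the action whose pessimistic lower bound is highest."""
        situation = np.asarray(situation, dtype=float)
        bounds = [self.lower_bound(situation, a) for a in range(self.n_actions)]
        return int(np.argmax(bounds))
```

In a CMOMMT-style setting, each robot would maintain its own such memory built from its local observations of targets and teammates, and act on the bound computed from that memory; how the situation vector and quality signal are defined depends on the application and is left open here.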
