Timesharing-tracking framework for decentralized reinforcement learning in fully cooperative multi-agent system

Dimension-reduced, decentralized learning is widely regarded as an efficient way to tackle multi-agent cooperative learning in high-dimensional spaces. However, the non-stationary environment created by concurrent learning makes decentralized learning hard to converge and degrades its performance. To tackle this problem, this paper proposes a timesharing-tracking framework (TTF), built on the idea that alternating learning at the microscopic level amounts to concurrent learning at the macroscopic level, in which joint-state best-response Q-learning (BRQ-learning) serves as the primary algorithm for adapting to the other agents' policies. With a properly defined switching principle, TTF lets each agent learn its best response to the others at different joint states; viewed over the whole joint-state space, the agents therefore learn the optimal cooperative policy simultaneously. Simulation results show that the proposed algorithm learns the optimal joint behavior with less computation and faster convergence than two other classical learning algorithms.
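
The abstract only sketches the framework, so the following Python sketch is an illustrative reading of timesharing best-response learning: exactly one agent updates its joint-state Q-table per episode while the others act greedily on their current (fixed) policies. The environment interface, the round-robin switching rule, and all hyperparameters are assumptions for illustration, not the paper's actual TTF switching principle.

```python
import random
from collections import defaultdict

class BRQLearner:
    """Illustrative best-response Q-learner over joint states and its own actions."""
    def __init__(self, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)  # (joint_state, own_action) -> value
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, joint_state, explore):
        # Epsilon-greedy only for the currently learning agent; others act greedily.
        if explore and random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: self.q[(joint_state, a)])

    def update(self, s, a, r, s_next):
        # Standard Q-learning target, treating the other agents' policies as fixed.
        best_next = max(self.q[(s_next, b)] for b in range(self.n_actions))
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])


def timesharing_episode(env, learners, learner_idx):
    """Run one episode in which only learners[learner_idx] updates its Q-table.

    `env` is a hypothetical fully cooperative environment returning a shared
    team reward: reset() -> joint_state, step(actions) -> (joint_state, reward, done).
    """
    s = env.reset()
    done = False
    while not done:
        actions = [ag.act(s, explore=(i == learner_idx))
                   for i, ag in enumerate(learners)]
        s_next, reward, done = env.step(actions)
        learners[learner_idx].update(s, actions[learner_idx], reward, s_next)
        s = s_next


# Hypothetical usage: rotate the learning agent round-robin across episodes,
# so that, seen over the whole run, all agents learn concurrently.
# learners = [BRQLearner(n_actions=4) for _ in range(2)]
# for ep in range(5000):
#     timesharing_episode(env, learners, learner_idx=ep % len(learners))
```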
