Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning

Many real-world problems, such as network packet routing and urban traffic control, are naturally modeled as multi-agent reinforcement learning (RL) problems. However, existing multi-agent RL methods typically scale poorly with the problem size. Therefore, a key challenge is to translate the success of deep learning on single-agent RL to the multi-agent setting. A major stumbling block is that independent Q-learning, the most popular multi-agent RL method, introduces nonstationarity that makes it incompatible with the experience replay memory on which deep Q-learning relies. This paper proposes two methods that address this problem: 1) using a multi-agent variant of importance sampling to naturally decay obsolete data and 2) conditioning each agent's value function on a fingerprint that disambiguates the age of the data sampled from the replay memory. Results on a challenging decentralised variant of StarCraft unit micro-management confirm that these methods enable the successful combination of experience replay with multi-agent RL.
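To make the second idea concrete, the sketch below shows one plausible way a fingerprint could be attached to replay data: each stored transition is tagged with the training iteration and the exploration rate ε at collection time, and the fingerprint is appended to the observation so the Q-network can disambiguate when the data was generated. This is a minimal illustrative sketch, not the authors' code; the abstract only states that the fingerprint "disambiguates the age of the data", so using (iteration, ε) as that fingerprint, as well as the class name FingerprintReplayBuffer and its interface, are assumptions made here for illustration.

```python
import random
from collections import deque

import numpy as np


class FingerprintReplayBuffer:
    """Hypothetical replay buffer whose transitions carry a fingerprint.

    The fingerprint (training iteration and exploration rate epsilon) is
    appended to each agent's observation so that a Q-network trained on
    sampled batches can condition on the "age" of the data.
    """

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done, iteration, epsilon):
        # Tag both the current and next observation with the fingerprint
        # recorded at the time the transition was collected.
        fingerprint = np.array([iteration, epsilon], dtype=np.float32)
        obs_fp = np.concatenate([obs, fingerprint])
        next_obs_fp = np.concatenate([next_obs, fingerprint])
        self.buffer.append((obs_fp, action, reward, next_obs_fp, done))

    def sample(self, batch_size):
        # Uniform sampling; an importance-weighting scheme that decays
        # obsolete data (the paper's first method) could be layered on top.
        batch = random.sample(self.buffer, batch_size)
        obs, actions, rewards, next_obs, dones = map(np.array, zip(*batch))
        return obs, actions, rewards, next_obs, dones
```

The intuition behind this choice of fingerprint is that the training iteration and exploration rate correlate with how the other agents' policies have drifted since the data was recorded, giving the value function a low-dimensional signal about the nonstationarity it must compensate for.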
