A Deep Bayesian Policy Reuse Approach Against Non-Stationary Agents

In multiagent domains, coping with non-stationary agents that change their behaviors over time is a challenging problem: an agent must quickly detect the other agent's policy during online interaction and then adapt its own policy accordingly. This paper studies efficient policy detection and reuse techniques for playing against non-stationary agents in Markov games. We propose a new algorithm, deep BPR+, which extends the recent BPR+ algorithm with a neural network as the value-function approximator. To detect policies accurately, we propose the \textit{rectified belief model}, which takes advantage of an \textit{opponent model} to infer the other agent's policy from both reward signals and its observed behaviors. Instead of directly storing individual policies as BPR+ does, we introduce a \textit{distilled policy network} that serves as the policy library in BPR+, using policy distillation to achieve efficient online policy learning and reuse. Deep BPR+ inherits all the advantages of BPR+ and empirically shows better performance in terms of detection accuracy, cumulative reward, and convergence speed compared with existing algorithms in complex Markov games with raw visual inputs.
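To make the detection-and-reuse loop concrete, below is a minimal Python sketch of one Bayesian policy-reuse step against a switching opponent. The helper names (reward_model, opponent_model, policy_library) and the simple multiplicative way the two likelihood terms are combined are illustrative assumptions, not the paper's exact formulation of the rectified belief model.

import numpy as np

def bayes_update(belief, likelihood, eps=1e-8):
    # One Bayes step: posterior over candidate opponent policies tau.
    posterior = belief * (likelihood + eps)
    return posterior / posterior.sum()

def detect_and_reuse(belief, reward_model, opponent_model, policy_library,
                     episode_return, opponent_trajectory):
    # reward_model[tau]   : callable giving P(episode_return | opponent policy tau)
    # opponent_model[tau] : callable giving P(observed opponent actions | tau)
    # policy_library[tau] : response policy stored for tau
    n = len(belief)
    # Likelihood of the received return under each candidate opponent policy.
    r_lik = np.array([reward_model[tau](episode_return) for tau in range(n)])
    # Likelihood of the opponent's observed behavior under each candidate policy.
    a_lik = np.array([opponent_model[tau](opponent_trajectory) for tau in range(n)])
    # "Rectified" combination: behavior evidence corrects the reward-only belief.
    belief = bayes_update(belief, r_lik * a_lik)
    # Reuse: act with the response policy for the most probable opponent policy.
    best_tau = int(np.argmax(belief))
    return belief, policy_library[best_tau]

In this sketch the policy library is an explicit per-opponent list; in deep BPR+ the library is instead consolidated by policy distillation into a single network, so reusing a response amounts to querying that network for the detected opponent rather than switching between separately stored policies.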
