Coordination in Adversarial Sequential Team Games via Multi-Agent Deep Reinforcement Learning

Many real-world applications involve teams of agents that have to coordinate their actions to reach a common goal against potential adversaries. This paper focuses on zero-sum games in which a team of players faces an opponent, as is the case, for example, in Bridge, collusion in poker, and collusion in bidding. The possibility for the team members to communicate before gameplay, that is, to coordinate their strategies ex ante, makes the use of behavioral strategies unsatisfactory. We introduce Soft Team Actor-Critic (STAC) as a solution to the team's coordination problem that does not require any prior domain knowledge. STAC allows team members to effectively exploit ex ante communication via exogenous signals that are shared among the team. STAC reaches near-optimal coordinated strategies in both perfectly observable and partially observable games, where previous deep RL algorithms fail to reach optimal coordinated behaviors.
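The coordination mechanism described above rests on a shared exogenous signal: before each episode, the team draws a signal from a fixed distribution, and every team member conditions its policy on its own observation together with that common signal. The sketch below is a minimal illustration of this signal-conditioning idea, not the authors' implementation; the network architecture, the uniform signal distribution, and the discrete-action softmax head are illustrative assumptions (in the full method such policies would be trained with a soft actor-critic style, entropy-regularized objective).

```python
# Minimal sketch (assumptions noted above) of signal-conditioned team policies.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SignalConditionedPolicy(nn.Module):
    """Per-agent policy pi_i(a | o_i, z) over discrete actions, conditioned on a shared signal z."""

    def __init__(self, obs_dim: int, num_signals: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.num_signals = num_signals
        self.net = nn.Sequential(
            nn.Linear(obs_dim + num_signals, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs: torch.Tensor, signal: torch.Tensor) -> torch.Tensor:
        # One-hot encode the shared signal and concatenate it with the agent's observation.
        z = F.one_hot(signal, self.num_signals).float()
        return F.softmax(self.net(torch.cat([obs, z], dim=-1)), dim=-1)


def sample_team_signal(num_signals: int) -> torch.Tensor:
    """Exogenous signal drawn once per episode and shared by all team members (uniform here by assumption)."""
    return torch.randint(num_signals, (1,))


if __name__ == "__main__":
    # Two team members, each with its own signal-conditioned policy.
    policies = [SignalConditionedPolicy(obs_dim=8, num_signals=4, num_actions=3) for _ in range(2)]
    z = sample_team_signal(num_signals=4)   # the same z is given to the whole team
    obs = torch.randn(1, 8)                 # placeholder observation
    for i, pi in enumerate(policies):
        probs = pi(obs, z)
        action = torch.multinomial(probs, 1)
        print(f"agent {i}: signal={z.item()} action={action.item()}")
```

Conditioning every member's policy on the same sampled signal is what lets the team correlate its play ex ante: different signal realizations can select different joint behaviors even though, during gameplay, each agent acts only on its local observation.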
