Robust Opponent Modeling via Adversarial Ensemble Reinforcement Learning in Asymmetric Imperfect-Information Games

This paper presents an algorithmic framework for learning robust policies in asymmetric imperfect-information games, where the joint reward may depend on an uncertain opponent type: private information known only to the opponent and its ally. To maximize its reward, the protagonist agent must infer the opponent type through agent modeling. We use multiagent reinforcement learning (MARL) to learn opponent models through self-play, which captures the full strategic interaction and reasoning between agents. However, policies learned through self-play can overfit to one another. Ensemble training improves the robustness of the protagonist policy against diverse opponents, but it also significantly increases computational overhead. To achieve a good trade-off between policy robustness and computational cost, we propose to train a separate opponent policy against the protagonist agent purely for evaluation. The reward achieved by this opponent is a noisy measure of the robustness of the protagonist policy, owing to the intrinsic stochasticity of reinforcement learning. To handle this stochasticity, we apply a stochastic optimization scheme that dynamically updates the opponent ensemble so as to optimize an objective balancing robustness against computational complexity. We show empirically that, under the same limited computational budget, the proposed method yields more robust policy learning than standard ensemble training.
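As a rough structural illustration of the training loop described above, the sketch below shows how a held-out adversarial opponent can supply a noisy robustness signal that drives stochastic updates to the opponent ensemble. This is a minimal sketch, not the paper's implementation: all function names (train_protagonist, train_adversarial_opponent, evaluate, robustness_objective) and the simple accept-if-improved update rule are hypothetical placeholders standing in for the actual MARL training and stochastic optimization scheme.

```python
import random

# Hypothetical placeholders for the paper's components; bodies are stubs
# so the loop structure runs end to end.

def train_protagonist(ensemble, steps=1000):
    """Train the protagonist policy against opponents sampled from the ensemble."""
    return {"trained_against": len(ensemble), "steps": steps}  # stub policy

def train_adversarial_opponent(protagonist, steps=1000):
    """Train a fresh opponent specifically to exploit the current protagonist."""
    return {"target": protagonist, "steps": steps}  # stub opponent policy

def evaluate(protagonist, opponent, episodes=50):
    """Return the opponent's average reward; higher means the protagonist
    is more exploitable. Simulated here as noise, since in practice this
    estimate is itself stochastic."""
    return random.gauss(0.0, 1.0)

def robustness_objective(exploitability, ensemble_size, cost_weight=0.1):
    """Trade off robustness (low exploitability) against the computational
    cost of maintaining a larger opponent ensemble."""
    return -exploitability - cost_weight * ensemble_size

ensemble = [train_adversarial_opponent(None)]  # initial opponent pool
best_score = float("-inf")

for iteration in range(10):
    protagonist = train_protagonist(ensemble)

    # Separate evaluation opponent: its (noisy) reward estimates how
    # exploitable the current protagonist policy is.
    evaluator = train_adversarial_opponent(protagonist)
    exploitability = evaluate(protagonist, evaluator)

    # Stochastic ensemble update: accept the candidate change only when the
    # noisy objective improves. A real scheme could also remove or replace
    # ensemble members rather than only adding them.
    candidate = ensemble + [evaluator]
    score = robustness_objective(exploitability, len(candidate))
    if score > best_score:
        ensemble, best_score = candidate, score
```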
