Actor-Critic Policy Optimization in Partially Observable Multiagent Environments

Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent on a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially observable multiagent environments. We present several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect-information games), using RL-style function approximation. We evaluate on commonly used benchmark poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play, with rates similar to or better than a baseline model-free algorithm for zero-sum games and without any domain-specific state-space reductions.
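
To make the notion of gradient ascent on a score function concrete, the following is a minimal, illustrative sketch of a generic one-step tabular advantage actor-critic update (softmax policy, TD(0) critic). It is not the paper's candidate update rules; the state/action counts, step sizes, and discount factor (n_states, n_actions, alpha_pi, alpha_v, gamma) are assumed values chosen for illustration.

```python
import numpy as np

# Illustrative sketch only: a generic one-step tabular advantage actor-critic
# update (softmax policy, TD(0) critic). This is not the paper's proposed
# update rules; sizes and hyperparameters below are assumed for illustration.

n_states, n_actions = 10, 4
gamma = 0.99                    # discount factor
alpha_pi, alpha_v = 0.01, 0.1   # actor / critic step sizes

theta = np.zeros((n_states, n_actions))  # policy logits, one row per state
v = np.zeros(n_states)                   # state-value critic

def policy(s):
    """Softmax policy over the logits of state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def actor_critic_step(s, a, r, s_next, done):
    """Update critic and actor from a single transition (s, a, r, s_next)."""
    target = r + (0.0 if done else gamma * v[s_next])
    advantage = target - v[s]          # one-step TD error as advantage estimate
    v[s] += alpha_v * advantage        # critic: TD(0) update

    pi = policy(s)
    grad_log_pi = -pi                  # gradient of log softmax w.r.t. logits
    grad_log_pi[a] += 1.0
    theta[s] += alpha_pi * advantage * grad_log_pi  # actor: policy-gradient ascent step
```

In the partially observable multiagent setting studied here, analogous updates would be applied at information states rather than fully observed Markov states, which is where the connection to regret minimization is drawn.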
