Competitive Policy Optimization

A core challenge in policy optimization for competitive Markov decision processes is the design of efficient optimization methods with desirable convergence and stability properties. To tackle this, we propose competitive policy optimization (CoPO), a novel policy gradient approach that exploits the game-theoretic nature of competitive games to derive policy updates. Motivated by the competitive gradient optimization method, we derive a bilinear approximation of the game objective. In contrast, off-the-shelf policy gradient methods use only linear approximations and hence do not capture the interactions among the players. We instantiate CoPO in two ways: (i) competitive policy gradient, and (ii) trust-region competitive policy optimization. We study these methods theoretically and empirically investigate their behavior on a set of comprehensive, yet challenging, competitive games. We observe that they provide stable optimization, convergence to sophisticated strategies, and higher scores when played against baseline policy gradient methods.
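To make the contrast between linear and bilinear approximations concrete, the following is a minimal sketch of the local game underlying competitive-gradient-style updates, written for a zero-sum objective f(\theta^1, \theta^2) with step size \eta; the symbols \theta^1, \theta^2, f, and \eta are illustrative notation introduced here, not the paper's, and this is the generic competitive gradient descent update rather than CoPO's exact policy-gradient instantiation. Each player responds to a linear model of its own effect on f plus a bilinear interaction term with the opponent's update:

\[
\Delta\theta^1 \in \arg\min_{\Delta\theta^1}\; (\Delta\theta^1)^{\top} \nabla_{\theta^1} f
  + (\Delta\theta^1)^{\top} \nabla^2_{\theta^1\theta^2} f \,\Delta\theta^2
  + \tfrac{1}{2\eta}\lVert \Delta\theta^1 \rVert^2,
\]
\[
\Delta\theta^2 \in \arg\max_{\Delta\theta^2}\; (\Delta\theta^2)^{\top} \nabla_{\theta^2} f
  + (\Delta\theta^1)^{\top} \nabla^2_{\theta^1\theta^2} f \,\Delta\theta^2
  - \tfrac{1}{2\eta}\lVert \Delta\theta^2 \rVert^2 .
\]

Solving the two first-order conditions simultaneously, the Nash equilibrium of this local bilinear game gives the coupled updates

\[
\Delta\theta^1 = -\eta\,\bigl(I + \eta^2 \nabla^2_{\theta^1\theta^2} f \,\nabla^2_{\theta^2\theta^1} f\bigr)^{-1}
  \bigl(\nabla_{\theta^1} f + \eta\, \nabla^2_{\theta^1\theta^2} f \,\nabla_{\theta^2} f\bigr),
\]
\[
\Delta\theta^2 = +\eta\,\bigl(I + \eta^2 \nabla^2_{\theta^2\theta^1} f \,\nabla^2_{\theta^1\theta^2} f\bigr)^{-1}
  \bigl(\nabla_{\theta^2} f - \eta\, \nabla^2_{\theta^2\theta^1} f \,\nabla_{\theta^1} f\bigr).
\]

Dropping the bilinear interaction terms (the mixed derivatives \nabla^2_{\theta^1\theta^2} f) recovers plain simultaneous gradient descent/ascent, i.e., the linear approximation used by off-the-shelf policy gradient methods that the abstract contrasts against.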
