Provably Efficient Policy Gradient Methods for Two-Player Zero-Sum Markov Games

Policy gradient methods are widely used to solve two-player zero-sum games and have achieved superhuman performance in practice. However, it remains unclear when they provably find a near-optimal solution, and how many samples and iterations they require. This paper studies natural extensions of the Natural Policy Gradient (NPG) algorithm for solving two-player zero-sum Markov games in which function approximation is used for generalization across states. We thoroughly characterize the algorithms' performance in terms of the number of samples, the number of iterations, concentrability coefficients, and the approximation error. To our knowledge, this is the first quantitative analysis of policy gradient methods with function approximation for two-player zero-sum Markov games.
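To make the setting concrete, here is a minimal sketch (not the paper's exact method or analysis): with a tabular softmax parameterization, an NPG step reduces to a multiplicative-weights update, and in the sketch both players of a zero-sum matrix game run such updates independently. The payoff matrix, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))       # illustrative payoff matrix: row player maximizes x^T A y

x = np.ones(A.shape[0]) / A.shape[0]  # row player's mixed strategy
y = np.ones(A.shape[1]) / A.shape[1]  # column player's mixed strategy
x_avg, y_avg = np.zeros_like(x), np.zeros_like(y)
eta, T = 0.1, 2000                    # step size and horizon (assumed values)

for _ in range(T):
    # per-action payoffs against the opponent's current strategy
    qx = A @ y          # row player's payoff for each pure action
    qy = A.T @ x        # column player's loss for each pure action
    # NPG / multiplicative-weights steps: ascent for x, descent for y
    x = x * np.exp(eta * qx);  x /= x.sum()
    y = y * np.exp(-eta * qy); y /= y.sum()
    x_avg += x; y_avg += y

x_avg /= T; y_avg /= T
# duality gap of the averaged iterates; a small gap indicates an approximate Nash equilibrium
gap = (A @ y_avg).max() - (A.T @ x_avg).min()
print(f"approximate duality gap: {gap:.4f}")
```

In this toy case the averaged iterates of the two no-regret updates converge to an approximate Nash equilibrium; the paper's contribution concerns the Markov-game setting with function approximation, where sample complexity, concentrability coefficients, and approximation error enter the bounds.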
