A New Policy Iteration Algorithm For Reinforcement Learning in Zero-Sum Markov Games

Many model-based reinforcement learning (RL) algorithms can be viewed as iterating between two phases: a learning phase, in which the model is approximately learned, and a planning phase, in which the learned model is used to derive a policy. For standard MDPs, the planning problem can be solved using either value iteration or policy iteration. For zero-sum Markov games, however, no efficient policy iteration algorithm is known; for example, Hansen et al. (2013) showed that one must solve Omega(1/(1-alpha)) MDPs, where alpha is the discount factor, to implement the only known convergent version of policy iteration. Another algorithm for zero-sum Markov games, called naive policy iteration, is easy to implement but is provably convergent only under very restrictive assumptions, and prior attempts to fix it have several limitations. Here, we show that a simple variant of naive policy iteration converges, and converges exponentially fast. The only addition we propose is the use of lookahead in the policy improvement phase, which is appealing because lookahead is already widely used in RL for games. We further show that lookahead can be implemented efficiently in linear Markov games, the game-theoretic counterpart of linear MDPs, which have recently received much attention. Finally, we consider a multi-agent reinforcement learning algorithm that uses our policy iteration scheme in its planning phase, and we provide sample and time complexity bounds for it.
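
To make the algorithmic idea concrete, here is a minimal illustrative sketch (our own reading, not the paper's pseudocode) of naive policy iteration with lookahead on a small tabular zero-sum Markov game: the current pair of policies is evaluated exactly, the resulting value is pushed forward H-1 times with the Shapley (minimax Bellman) operator, and the greedy improvement step then solves the one-shot matrix game at every state. All function and variable names below are hypothetical, and the per-state matrix games are solved with SciPy's linprog.

```python
# Minimal sketch (assumed formulation, not the paper's exact algorithm) of naive
# policy iteration with H-step lookahead for a tabular zero-sum Markov game.
# Assumes a known reward tensor r[s, a, b], transition tensor P[s, a, b, s'],
# and discount factor gamma.
import numpy as np
from scipy.optimize import linprog


def solve_matrix_game(A):
    """Return (value, row player's optimal mixed strategy) of the zero-sum matrix game A."""
    n_a, n_b = A.shape
    # Variables: x (row player's strategy, length n_a) and the game value v.
    # Maximize v subject to (A^T x)_b >= v for every column b, sum(x) = 1, x >= 0.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0                                   # linprog minimizes, so minimize -v
    A_ub = np.hstack([-A.T, np.ones((n_b, 1))])    # v - (A^T x)_b <= 0
    b_ub = np.zeros(n_b)
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    b_eq = np.ones(1)
    bounds = [(0, None)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n_a]


def shapley_operator(V, r, P, gamma):
    """One application of the minimax Bellman (Shapley) operator to the value V."""
    S = r.shape[0]
    TV = np.empty(S)
    for s in range(S):
        Q = r[s] + gamma * P[s] @ V                # one-shot game matrix Q[a, b] at state s
        TV[s], _ = solve_matrix_game(Q)
    return TV


def lookahead_policy_iteration(r, P, gamma, H=3, iters=50):
    """Naive policy iteration whose greedy step uses an H-step lookahead."""
    S, n_a, n_b = r.shape
    pi = np.full((S, n_a), 1.0 / n_a)              # maximizer's policy
    mu = np.full((S, n_b), 1.0 / n_b)              # minimizer's policy
    for _ in range(iters):
        # Policy evaluation: exact value of the fixed joint policy (pi, mu).
        r_pm = np.einsum('sa,sab,sb->s', pi, r, mu)
        P_pm = np.einsum('sa,sabt,sb->st', pi, P, mu)
        V = np.linalg.solve(np.eye(S) - gamma * P_pm, r_pm)
        # Lookahead: push V forward H-1 times with the Shapley operator.
        for _ in range(H - 1):
            V = shapley_operator(V, r, P, gamma)
        # Policy improvement: solve the matrix game at each state under the lookahead value.
        for s in range(S):
            Q = r[s] + gamma * P[s] @ V
            _, pi[s] = solve_matrix_game(Q)
            _, mu[s] = solve_matrix_game(-Q.T)     # minimizer plays the transposed game
    return pi, mu, V
```

With H = 1 the loop reduces to plain naive policy iteration, which the abstract notes is convergent only under very restrictive assumptions; H > 1 adds the lookahead in the improvement phase that the paper credits with restoring convergence at an exponential rate.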

[1] R. Srikant, et al. On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation, 2023, AISTATS.

[2] Sarnaduti Brahma, et al. Convergence Rates of Asynchronous Policy Iteration for Zero-Sum Markov Games under Stochastic and Optimistic Settings, 2022, IEEE 61st Conference on Decision and Control (CDC).

[3] R. Srikant, et al. Reinforcement Learning with Unbiased Policy Evaluation and Linear Function Approximation, 2022, IEEE 61st Conference on Decision and Control (CDC).

[4] D. Schuurmans, et al. Making Linear MDPs Practical via Contrastive Representation Learning, 2022, ICML.

[5] Asuman Ozdaglar, et al. Independent Learning in Stochastic Games, 2021, ArXiv.

[6] Wen Sun, et al. Representation Learning for Online and Offline RL in Low-rank MDPs, 2021, ICLR.

[7] D. Bertsekas. Distributed Asynchronous Policy Iteration for Sequential Zero-Sum Games and Minimax Control, 2021, ArXiv.

[8] Tiancheng Yu, et al. The Power of Exploiter: Provable Multi-Agent RL in Large State Spaces, 2021, ICML.

[9] Jason D. Lee, et al. Provably Efficient Policy Optimization for Two-Player Zero-Sum Markov Games, 2021, AISTATS.

[10] Noah Golowich, et al. Independent Policy Gradient Methods for Competitive Reinforcement Learning, 2021, NeurIPS.

[11] Yaodong Yang, et al. An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective, 2020, ArXiv.

[12] Qinghua Liu, et al. A Sharp Analysis of Model-based Reinforcement Learning with Self-Play, 2020, ICML.

[13] Lin F. Yang, et al. Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity, 2020, NeurIPS.

[14] Yang Yang, et al. Multi-robot path planning based on a deep reinforcement learning DQN algorithm, 2020, CAAI Trans. Intell. Technol.

[15] S. Kakade, et al. FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs, 2020, NeurIPS.

[16] Yuxin Chen, et al. Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model, 2020, NeurIPS.

[17] Zhuoran Yang, et al. Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium, 2020, COLT.

[18] Chi Jin, et al. Provable Self-Play Algorithms for Competitive Reinforcement Learning, 2020, ICML.

[19] Akshay Krishnamurthy, et al. Reward-Free Exploration for Reinforcement Learning, 2020, ICML.

[20] A. Wierman, et al. Scalable Reinforcement Learning for Multiagent Networked Systems, 2019, Oper. Res.

[21] T. Başar, et al. Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms, 2019, Handbook of Reinforcement Learning and Control.

[22] Tamer Basar, et al. Non-Cooperative Inverse Reinforcement Learning, 2019, NeurIPS.

[23] M. Ghavamzadeh, et al. Multi-step Greedy Reinforcement Learning Algorithms, 2019, ICML.

[24] Lin F. Yang, et al. Solving Discounted Stochastic Two-Player Games with Near-Optimal Time and Sample Complexity, 2019, AISTATS.

[25] Lin F. Yang, et al. Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal, 2019, COLT.

[26] Mengdi Wang, et al. Feature-Based Q-Learning for Two-Player Stochastic Games, 2019, ArXiv.

[27] D. Shah, et al. Non-Asymptotic Analysis of Monte Carlo Tree Search, 2019, Proceedings of the ACM on Measurement and Analysis of Computing Systems.

[28] Shie Mannor, et al. How to Combine Tree-Search Methods in Reinforcement Learning, 2018, AAAI.

[29] Shie Mannor, et al. Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning, 2018, NeurIPS.

[30] Shie Mannor, et al. Beyond the One Step Greedy Approach in Reinforcement Learning, 2018, ICML.

[31] Demis Hassabis, et al. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, 2017, ArXiv.

[32] Demis Hassabis, et al. Mastering the game of Go without human knowledge, 2017, Nature.

[33] Amnon Shashua, et al. Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving, 2016, ArXiv.

[34] Sergey Levine, et al. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates, 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[35] Matthieu Geist, et al. Softened Approximate Policy Iteration for Markov Games, 2016, ICML.

[36] Alex Graves, et al. Asynchronous Methods for Deep Reinforcement Learning, 2016, ICML.

[37] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[38] Bruno Scherrer, et al. Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games, 2015, ICML.

[39] Hilbert J. Kappen, et al. On the Sample Complexity of Reinforcement Learning with a Generative Model, 2012, ICML.

[40] Peter Bro Miltersen, et al. Strategy Iteration Is Strongly Polynomial for 2-Player Turn-Based Stochastic Games with a Constant Discount Factor, 2010, JACM.

[41] Michael P. Wellman, et al. Nash Q-Learning for General-Sum Stochastic Games, 2003, J. Mach. Learn. Res.

[42] Michail G. Lagoudakis, et al. Value Function Approximation in Zero-Sum Markov Games, 2002, UAI.

[43] Ariel Rubinstein. Experience from a Course in Game Theory: Pre- and Post-class Problem Sets as a Didactic Device, 1999.

[44] Gerald Tesauro, et al. On-line Policy Improvement using Monte-Carlo Search, 1996, NIPS.

[45] Michael L. Littman. Markov Games as a Framework for Multi-Agent Reinforcement Learning, 1994, ICML.

[46] J. Filar, et al. On the Computation of Equilibria in Discounted Stochastic Dynamic Games, 1986.

[47] M. Puterman, et al. Modified Policy Iteration Algorithms for Discounted Markov Decision Problems, 1978.

[48] J. van der Wal. Discounted Markov games: Generalized policy iteration method, 1978.

[49] M. Pollatschek, et al. Algorithms for Stochastic Games with Geometrical Interpretation, 1969.

[50] L. Shapley. Stochastic Games, 1953, Proceedings of the National Academy of Sciences.

[51] R. Srikant, et al. The Role of Lookahead and Approximate Policy Evaluation in Policy Iteration with Linear Value Function Approximation, 2021, ArXiv.

[52] Handbook of Reinforcement Learning and Control, 2021, Studies in Systems, Decision and Control.

[53] S. Kakade, et al. Reinforcement Learning: Theory and Algorithms, 2019.

[54] Dimitri P. Bertsekas, et al. Neuro-Dynamic Programming, 2009, Encyclopedia of Optimization.

[55] Stephen D. Patek. Stochastic and shortest path games: theory and algorithms, 1997.

[56] J. Filar, et al. On the Algorithm of Pollatschek and Avi-Itzhak, 1991.

[57] Anne Condon. On Algorithms for Simple Stochastic Games, 1990, Advances In Computational Complexity Theory.

[58] R. Karp, et al. On Nonterminating Stochastic Games, 1966.