Finite-Time Analysis of Minimax Q-Learning for Two-Player Zero-Sum Markov Games: Switching System Approach

This paper investigates the finite-time behavior of Q-learning in two-player zero-sum Markov games. Specifically, we establish a finite-time analysis of both the minimax Q-learning algorithm and the corresponding value iteration method. To carry out the analysis, we model minimax Q-learning and the associated value iteration as switching systems. This viewpoint yields additional insight into minimax Q-learning and supports a simpler and more transparent convergence analysis. We expect these insights to reveal new connections between control theory and reinforcement learning and to encourage further exchange between the two communities.
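For context, the sketch below illustrates the standard minimax Q-learning update analyzed in this line of work: the Bellman target at the next state is the value of a zero-sum matrix game over the current Q-estimates, computed here with a linear program. This is a minimal illustration under our own conventions (function names, the use of numpy/scipy, and the step size and discount values are ours), not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value of the zero-sum matrix game max_x min_j (x^T M)_j, where x is a mixed
    strategy over the rows (maximizer's actions) and columns are the minimizer's actions."""
    n_rows, n_cols = M.shape
    # Decision variables z = [x_1, ..., x_n, v]; maximize v  <=>  minimize -v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0
    # Constraints v <= (M^T x)_j for every column j, written as -M^T x + v <= 0.
    A_ub = np.hstack([-M.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    # The mixed strategy must sum to one.
    A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * n_rows + [(None, None)]  # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def minimax_q_update(Q, s, a, b, r, s_next, alpha=0.1, gamma=0.99):
    """One asynchronous minimax Q-learning update on the visited (s, a, b) entry.
    Q has shape (num_states, num_agent_actions, num_opponent_actions)."""
    target = r + gamma * matrix_game_value(Q[s_next])
    Q[s, a, b] += alpha * (target - Q[s, a, b])
    return Q
```

Replacing the matrix-game value with a plain max over the agent's actions would recover standard single-agent Q-learning; the linear program is what accounts for the adversarial opponent in the zero-sum setting.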
