Minimax Sample Complexity for Turn-based Stochastic Game

The empirical success of multi-agent reinforcement learning is encouraging, but few theoretical guarantees have been established. In this work, we prove that the plug-in solver approach, arguably the most natural reinforcement learning algorithm, achieves minimax sample complexity for turn-based stochastic games (TBSGs). Specifically, we plan in an empirical TBSG constructed with a 'simulator' that allows sampling from any state-action pair. We show that the Nash equilibrium strategy of the empirical TBSG is an approximate Nash equilibrium strategy in the true TBSG, and we give both problem-dependent and problem-independent bounds. To tackle the complex statistical dependence, we develop absorbing-TBSG and reward-perturbation techniques. The key idea is to artificially introduce a suboptimality gap into the TBSG so that the Nash equilibrium strategy lies in a finite set.
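
The abstract describes the plug-in solver pipeline: draw a fixed number of samples from the simulator for every state-action pair, build the empirical TBSG from those samples, and compute its Nash equilibrium. The sketch below illustrates that pipeline in Python under simplifying assumptions; the names (`plug_in_tbsg_solver`, `sample_transition`) and the use of plain value iteration as the planner are illustrative choices for exposition, not the paper's actual implementation.

```python
import numpy as np

def plug_in_tbsg_solver(sample_transition, rewards, max_player_states,
                        n_states, n_actions, gamma=0.95,
                        samples_per_pair=1000, n_iters=500, seed=0):
    """Plug-in (model-based) solver sketch for a turn-based stochastic game.

    sample_transition(s, a, rng) -> next state, drawn from the unknown true
    model (the generative model / simulator assumed in the abstract).
    rewards has shape (n_states, n_actions). States listed in
    max_player_states are controlled by the max player; all remaining
    states are controlled by the min player.
    """
    rng = np.random.default_rng(seed)

    # 1. Build the empirical transition model from N samples per (s, a).
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(samples_per_pair):
                P_hat[s, a, sample_transition(s, a, rng)] += 1.0
    P_hat /= samples_per_pair

    # 2. Solve the empirical TBSG by value iteration: the max player
    #    maximizes over actions, the min player minimizes.
    is_max_state = np.isin(np.arange(n_states), max_player_states)
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = rewards + gamma * (P_hat @ V)          # shape (n_states, n_actions)
        V = np.where(is_max_state, Q.max(axis=1), Q.min(axis=1))

    # 3. Read off the empirical Nash equilibrium strategy (greedy in Q).
    Q = rewards + gamma * (P_hat @ V)
    policy = np.where(is_max_state, Q.argmax(axis=1), Q.argmin(axis=1))
    return policy, V
```

As a usage sketch, on a small synthetic game one could pass `sample_transition=lambda s, a, rng: rng.choice(n_states, p=P_true[s, a])`, where `P_true` is a hypothetical ground-truth transition tensor; the returned `policy` is then the empirical Nash equilibrium strategy whose quality in the true TBSG is what the bounds above quantify.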
