Exploration Analysis in Finite-Horizon Turn-based Stochastic Games

The exploration-exploitation trade-off is one of the key concerns in reinforcement learning. Prior work on single-player Markov Decision Processes has achieved near-optimal results for both PAC and high-probability regret guarantees. However, such an analysis is lacking for the more complex multi-player stochastic games, in which all players aim to find an approximate Nash Equilibrium. In this work, we address the exploration issue for N-player finite-horizon turn-based stochastic games (FTSGs). We propose a framework, Upper Bounding the Values for Players (UBVP), to guide exploration in FTSGs. UBVP leverages the key insight that all players simultaneously choose their optimal policies conditioned on the policies of the others; players can therefore explore in the face of uncertainty and get close to the Nash Equilibrium. Based on UBVP, we present two provable algorithms: one is Uniform-PAC with a sample complexity of Õ(1/ε²) to obtain an ε-Nash Equilibrium for arbitrary ε > 0, and the other attains a cumulative exploitability of Õ(√T) with high probability.
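The abstract describes UBVP only at a high level. The sketch below is our own illustration of what an optimistic, turn-based planning step of this flavor might look like in a tabular setting; the function name, the count-based bonus, and all variable shapes are assumptions for exposition, not the paper's actual algorithm.

```python
import numpy as np

def ubvp_style_planning(P_hat, R_hat, counts, owner, H, n_players):
    """Illustrative optimistic backward induction for a turn-based game.

    P_hat[h][s][a]  : estimated next-state distribution (length S)
    R_hat[h][s][a]  : estimated reward vector, one entry per player
    counts[h][s][a] : visit counts used for the exploration bonus (assumed form)
    owner[s]        : index of the player who acts in state s
    H               : horizon length
    """
    S, A = len(owner), len(P_hat[0][0])
    # One optimistic value table per player, plus a terminal layer of zeros.
    V = np.zeros((H + 1, n_players, S))
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        for s in range(S):
            p = owner[s]  # the player moving at state s
            q = np.zeros((A, n_players))
            for a in range(A):
                bonus = 1.0 / np.sqrt(max(counts[h][s][a], 1))  # assumed bonus
                # Backed-up value for every player under action a.
                q[a] = R_hat[h][s][a] + P_hat[h][s][a] @ V[h + 1].T
                q[a, p] += bonus  # optimism only for the acting player's value
            a_star = int(np.argmax(q[:, p]))  # acting player best-responds greedily
            policy[h, s] = a_star
            V[h, :, s] = np.minimum(q[a_star], H)  # clip values to the horizon
    return policy, V
```

The point of the sketch is the structural feature the abstract emphasizes: because the game is turn-based, each state has a single acting player, so optimism can be applied to that player's own value while the remaining players' values are simply backed up under the chosen action.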
