Solving for Best Responses and Equilibria in Extensive-Form Games with Reinforcement Learning Methods

We present a framework to solve for best responses and equilibria in extensive-form games (EFGs) of imperfect information by transforming a game, together with an other-agent policy, into a set of Markov decision processes (MDPs), one per player, and then applying simulation-based reinforcement learning (RL) to the ensuing MDPs. More specifically, we first transform a turn-taking partially observable Markov game (TT-POMG) into a set of partially observable Markov decision processes (POMDPs), and we then transform that set of POMDPs into a corresponding set of MDPs. Next, we observe that EFGs are a special case of TT-POMGs, and hence can be transformed as described. Furthermore, because each transformation preserves the strategically relevant information of the model to which it is applied, an optimal policy in one of the ensuing MDPs corresponds to a best response to the given other-agent policy in the original EFG. We then prove that our RL algorithm finds a near-optimal policy (and therefore a near-best response in the original EFG) in finite time, although the sample complexity is lower-bounded by a function with an exponential dependence on the horizon. Nonetheless, we apply this algorithm iteratively to search for equilibria in an EFG. When the iterative procedure converges, the resulting MDP policies comprise an approximate weak perfect Bayesian equilibrium. Although this procedure is not guaranteed to converge, it frequently did in our numerical experiments with sequential auctions.
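To make the iterative equilibrium-search procedure concrete, the sketch below shows a generic iterated best-response loop in Python. The helpers induce_mdp (which would encapsulate the TT-POMG-to-POMDP-to-MDP transformation for one player, holding the other agents' policies fixed) and solve_mdp_with_rl (which would return a near-optimal policy via simulation-based RL) are hypothetical placeholders and not the paper's implementation; only the surrounding loop structure follows the description above.

```python
from typing import Callable, Dict, Hashable

# Illustrative policy representation: a map from information states to actions.
Policy = Dict[Hashable, Hashable]


def iterated_best_response(
    players: list,
    initial_policies: Dict[str, Policy],
    induce_mdp: Callable[[str, Dict[str, Policy]], object],
    solve_mdp_with_rl: Callable[[object], Policy],
    max_iters: int = 100,
) -> Dict[str, Policy]:
    """Repeatedly replace each player's policy with an RL-computed best response.

    If the loop reaches a profile at which no player's policy changes, that profile
    is (per the claim above) an approximate weak perfect Bayesian equilibrium.
    """
    policies = dict(initial_policies)
    for _ in range(max_iters):
        changed = False
        for i in players:
            # Fix the other agents' policies and induce player i's single-agent MDP
            # (the TT-POMG -> POMDP -> MDP transformation described in the abstract).
            others = {j: p for j, p in policies.items() if j != i}
            mdp = induce_mdp(i, others)
            # Solve the induced MDP with simulation-based RL to get a near-best response.
            best_response = solve_mdp_with_rl(mdp)
            if best_response != policies[i]:
                policies[i] = best_response
                changed = True
        if not changed:
            # No player can improve: return the (approximate) equilibrium profile.
            return policies
    # Convergence is not guaranteed; return the last profile after max_iters sweeps.
    return policies
```

Exact policy equality is used as the stopping test here only for brevity; with approximate best responses, a tolerance on each player's expected-payoff improvement would be the more natural convergence criterion.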
