Deep Reinforcement Learning from Self-Play in Imperfect-Information Games

Many real-world applications can be described as large-scale games of imperfect information. To deal with these challenging domains, prior work has focused on computing Nash equilibria in a handcrafted abstraction of the domain. In this paper we introduce the first scalable end-to-end approach to learning approximate Nash equilibria without prior domain knowledge. Our method combines fictitious self-play with deep reinforcement learning. When applied to Leduc poker, Neural Fictitious Self-Play (NFSP) approached a Nash equilibrium, whereas common reinforcement learning methods diverged. In Limit Texas Hold'em, a poker game of real-world scale, NFSP learnt a strategy that approached the performance of state-of-the-art, superhuman algorithms based on significant domain expertise.
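To make the combination of fictitious self-play and deep reinforcement learning concrete, the sketch below outlines an NFSP-style agent: it learns an approximate best response with Q-learning from a transition memory, learns an average policy by supervised imitation of its own past best-response actions held in a reservoir-sampled memory, and mixes the two behaviours with an anticipatory parameter. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes PyTorch, and the network sizes, buffer sizes, learning rates, eta, and helper names (mlp, NFSPAgent, store_transition, learn) are illustrative placeholders. A full implementation would typically also use a separate target network for the Q-learning update; that is omitted here for brevity.

```python
# Minimal NFSP-style agent sketch (assumes PyTorch; hyperparameters are illustrative).
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(in_dim, out_dim, hidden=64):
    """Small fully connected network, used for both the Q-network and the average-policy network."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))


class NFSPAgent:
    def __init__(self, obs_dim, n_actions, eta=0.1, eps=0.1):
        self.n_actions = n_actions
        self.eta = eta    # probability of acting with the (epsilon-greedy) best response
        self.eps = eps    # exploration rate of the best-response policy
        self.q_net = mlp(obs_dim, n_actions)     # approximate best response (reinforcement learning)
        self.avg_net = mlp(obs_dim, n_actions)   # average policy (supervised learning)
        self.q_opt = torch.optim.Adam(self.q_net.parameters(), lr=1e-3)
        self.avg_opt = torch.optim.Adam(self.avg_net.parameters(), lr=1e-3)
        self.rl_memory = deque(maxlen=10_000)    # circular buffer of transitions for Q-learning
        self.sl_memory = []                      # reservoir of the agent's own best-response actions
        self.sl_capacity = 10_000
        self.seen = 0                            # items offered to the reservoir so far

    def act(self, obs):
        obs_t = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
        if random.random() < self.eta:
            # Best-response mode: epsilon-greedy over Q-values; record (obs, action)
            # so the average policy can later imitate this behaviour.
            if random.random() < self.eps:
                action = random.randrange(self.n_actions)
            else:
                with torch.no_grad():
                    action = int(self.q_net(obs_t).argmax(dim=1).item())
            self._reservoir_add((obs, action))
            return action
        # Average-policy mode: sample from the softmax of the policy network's logits.
        with torch.no_grad():
            probs = F.softmax(self.avg_net(obs_t), dim=1).squeeze(0)
        return int(torch.multinomial(probs, 1).item())

    def _reservoir_add(self, item):
        """Reservoir sampling keeps an (approximately) uniform sample of past best-response behaviour."""
        self.seen += 1
        if len(self.sl_memory) < self.sl_capacity:
            self.sl_memory.append(item)
        else:
            j = random.randrange(self.seen)
            if j < self.sl_capacity:
                self.sl_memory[j] = item

    def store_transition(self, obs, action, reward, next_obs, done):
        self.rl_memory.append((obs, action, reward, next_obs, done))

    def learn(self, batch_size=32, gamma=1.0):
        # Q-learning update towards a one-step bootstrapped target (no target network here).
        if len(self.rl_memory) >= batch_size:
            batch = random.sample(self.rl_memory, batch_size)
            obs, act, rew, nxt, done = map(list, zip(*batch))
            obs = torch.tensor(obs, dtype=torch.float32)
            nxt = torch.tensor(nxt, dtype=torch.float32)
            act = torch.tensor(act, dtype=torch.int64).unsqueeze(1)
            rew = torch.tensor(rew, dtype=torch.float32)
            done = torch.tensor(done, dtype=torch.float32)
            with torch.no_grad():
                target = rew + gamma * (1 - done) * self.q_net(nxt).max(dim=1).values
            pred = self.q_net(obs).gather(1, act).squeeze(1)
            q_loss = F.mse_loss(pred, target)
            self.q_opt.zero_grad(); q_loss.backward(); self.q_opt.step()
        # Supervised update of the average policy towards past best-response actions.
        if len(self.sl_memory) >= batch_size:
            batch = random.sample(self.sl_memory, batch_size)
            states, actions = zip(*batch)
            states = torch.tensor(states, dtype=torch.float32)
            actions = torch.tensor(actions, dtype=torch.int64)
            sl_loss = F.cross_entropy(self.avg_net(states), actions)
            self.avg_opt.zero_grad(); sl_loss.backward(); self.avg_opt.step()
```

In use, each player in the self-play loop would call act on its current information-state encoding, call store_transition after observing the outcome, and call learn periodically; as the average policy is trained on an ever-growing sample of best-response behaviour, it is the quantity expected to approximate a Nash equilibrium strategy.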
