Deep Reinforcement Learning from Self-Play in Imperfect-Information Games

Many real-world applications can be described as large-scale games of imperfect information. To deal with these challenging domains, prior work has focused on computing Nash equilibria in a handcrafted abstraction of the domain. In this paper we introduce the first scalable end-to-end approach to learning approximate Nash equilibria without prior domain knowledge. Our method combines fictitious self-play with deep reinforcement learning. When applied to Leduc poker, Neural Fictitious Self-Play (NFSP) approached a Nash equilibrium, whereas common reinforcement learning methods diverged. In Limit Texas Hold'em, a poker game of real-world scale, NFSP learnt a strategy that approached the performance of state-of-the-art, superhuman algorithms based on significant domain expertise.
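To make the combination of fictitious self-play and deep reinforcement learning concrete, the sketch below outlines an NFSP-style agent: it learns an approximate best response with Q-learning from a transition memory, learns an average policy by supervised imitation of its own past best-response actions held in a reservoir-sampled memory, and mixes the two behaviours with an anticipatory parameter. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes PyTorch, and the network sizes, buffer sizes, learning rates, eta, and helper names (mlp, NFSPAgent, store_transition, learn) are illustrative placeholders. A full implementation would typically also use a separate target network for the Q-learning update; that is omitted here for brevity.

```python
# Minimal NFSP-style agent sketch (assumes PyTorch; hyperparameters are illustrative).
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(in_dim, out_dim, hidden=64):
    """Small fully connected network, used for both the Q-network and the average-policy network."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))


class NFSPAgent:
    def __init__(self, obs_dim, n_actions, eta=0.1, eps=0.1):
        self.n_actions = n_actions
        self.eta = eta    # probability of acting with the (epsilon-greedy) best response
        self.eps = eps    # exploration rate of the best-response policy
        self.q_net = mlp(obs_dim, n_actions)     # approximate best response (reinforcement learning)
        self.avg_net = mlp(obs_dim, n_actions)   # average policy (supervised learning)
        self.q_opt = torch.optim.Adam(self.q_net.parameters(), lr=1e-3)
        self.avg_opt = torch.optim.Adam(self.avg_net.parameters(), lr=1e-3)
        self.rl_memory = deque(maxlen=10_000)    # circular buffer of transitions for Q-learning
        self.sl_memory = []                      # reservoir of the agent's own best-response actions
        self.sl_capacity = 10_000
        self.seen = 0                            # items offered to the reservoir so far

    def act(self, obs):
        obs_t = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
        if random.random() < self.eta:
            # Best-response mode: epsilon-greedy over Q-values; record (obs, action)
            # so the average policy can later imitate this behaviour.
            if random.random() < self.eps:
                action = random.randrange(self.n_actions)
            else:
                with torch.no_grad():
                    action = int(self.q_net(obs_t).argmax(dim=1).item())
            self._reservoir_add((obs, action))
            return action
        # Average-policy mode: sample from the softmax of the policy network's logits.
        with torch.no_grad():
            probs = F.softmax(self.avg_net(obs_t), dim=1).squeeze(0)
        return int(torch.multinomial(probs, 1).item())

    def _reservoir_add(self, item):
        """Reservoir sampling keeps an (approximately) uniform sample of past best-response behaviour."""
        self.seen += 1
        if len(self.sl_memory) < self.sl_capacity:
            self.sl_memory.append(item)
        else:
            j = random.randrange(self.seen)
            if j < self.sl_capacity:
                self.sl_memory[j] = item

    def store_transition(self, obs, action, reward, next_obs, done):
        self.rl_memory.append((obs, action, reward, next_obs, done))

    def learn(self, batch_size=32, gamma=1.0):
        # Q-learning update towards a one-step bootstrapped target (no target network here).
        if len(self.rl_memory) >= batch_size:
            batch = random.sample(self.rl_memory, batch_size)
            obs, act, rew, nxt, done = map(list, zip(*batch))
            obs = torch.tensor(obs, dtype=torch.float32)
            nxt = torch.tensor(nxt, dtype=torch.float32)
            act = torch.tensor(act, dtype=torch.int64).unsqueeze(1)
            rew = torch.tensor(rew, dtype=torch.float32)
            done = torch.tensor(done, dtype=torch.float32)
            with torch.no_grad():
                target = rew + gamma * (1 - done) * self.q_net(nxt).max(dim=1).values
            pred = self.q_net(obs).gather(1, act).squeeze(1)
            q_loss = F.mse_loss(pred, target)
            self.q_opt.zero_grad(); q_loss.backward(); self.q_opt.step()
        # Supervised update of the average policy towards past best-response actions.
        if len(self.sl_memory) >= batch_size:
            batch = random.sample(self.sl_memory, batch_size)
            states, actions = zip(*batch)
            states = torch.tensor(states, dtype=torch.float32)
            actions = torch.tensor(actions, dtype=torch.int64)
            sl_loss = F.cross_entropy(self.avg_net(states), actions)
            self.avg_opt.zero_grad(); sl_loss.backward(); self.avg_opt.step()
```

In use, each player in the self-play loop would call act on its current information-state encoding, call store_transition after observing the outcome, and call learn periodically; as the average policy is trained on an ever-growing sample of best-response behaviour, it is the quantity expected to approximate a Nash equilibrium strategy.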
