Applying machine learning techniques to an imperfect information game

The game of poker presents a challenge to Artificial Intelligence researchers because it is a complex game of asymmetric information. In such games, a player can improve his performance by inferring the private information held by the other players from their prior actions. A novel connectionist structure was designed to play a version of poker (multi-player limit Hold'em). This allows simple reinforcement learning techniques to be used that had not previously been considered for multi-player Hold'em. A related hidden Markov model was designed to be fitted to records of poker play without using any private information. Belief vectors generated by this model provide a more convenient and flexible representation of an opponent's action history than alternative approaches. The structure was tested in two settings. Firstly, self-play simulation was used to generate an approximation to a Nash equilibrium strategy. A related, but slower, rollout strategy that uses Monte-Carlo samples was used to evaluate the performance. Secondly, the structure was used to model, and hence exploit, a population of opponents within a relatively small number of games. When and how to adapt quickly to new opponents are open questions in poker AI research. An opponent model with a small number of discrete types was used to identify the largest differences in strategy between members of the population. A commercial software package (Poker Academy) was used to provide a population of sophisticated opponents to test against. A series of experiments was conducted to compare adaptive and static systems. All systems showed positive results but, surprisingly, the adaptive systems did not show a significant improvement over similar static systems. The possible reasons for this result are discussed. This work formed the basis of a series of entries to the computer poker competition hosted at the annual conferences of the Association for the Advancement of Artificial Intelligence (AAAI). Its best rankings were 3rd in the 2006 6-player limit Hold'em competition and 2nd in the 2008 3-player limit Hold'em competition.
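
To make the belief-vector idea concrete, the sketch below shows a hidden Markov model over a small set of discrete opponent types, with the belief updated by the standard forward recursion after each observed action. It is a minimal illustration only: the type names, transition matrix and action probabilities are invented for the example and are not the parameters fitted in this work.

```python
import numpy as np

# Hypothetical opponent types and the observable betting actions.
TYPES = ["tight-passive", "loose-passive", "aggressive"]
ACTIONS = ["fold", "call", "raise"]

# P(next type | current type): types are assumed to be fairly persistent.
TRANSITION = np.array([
    [0.98, 0.01, 0.01],
    [0.01, 0.98, 0.01],
    [0.01, 0.01, 0.98],
])

# P(action | type): how likely each type is to fold / call / raise.
EMISSION = np.array([
    [0.60, 0.30, 0.10],   # tight-passive
    [0.20, 0.60, 0.20],   # loose-passive
    [0.10, 0.30, 0.60],   # aggressive
])


def update_belief(belief, action_index):
    """One step of the HMM forward recursion.

    Propagates the belief through the type-transition model, weights it by
    the likelihood of the observed action, and renormalises.
    """
    predicted = TRANSITION.T @ belief                   # time update
    posterior = predicted * EMISSION[:, action_index]   # measurement update
    return posterior / posterior.sum()


if __name__ == "__main__":
    belief = np.full(len(TYPES), 1.0 / len(TYPES))  # uniform prior over types
    for observed in ["raise", "raise", "call", "raise"]:
        belief = update_belief(belief, ACTIONS.index(observed))
    for name, prob in zip(TYPES, belief):
        print(f"P({name}) = {prob:.3f}")
```

Running this on a short, raise-heavy action sequence shifts most of the probability mass onto the aggressive type; a belief vector of this kind, rather than the raw action history, is the sort of compact opponent summary the described structure conditions on.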
