Feedback-Based Tree Search for Reinforcement Learning

Inspired by recent successes of Monte-Carlo tree search (MCTS) in a number of artificial intelligence (AI) application domains, we propose a model-based reinforcement learning (RL) technique that iteratively applies MCTS to batches of small, finite-horizon versions of the original infinite-horizon Markov decision process. The terminal condition of the finite-horizon problems, or, equivalently, the leaf-node evaluator of the decision tree generated by MCTS, is specified using a combination of an estimated value function and an estimated policy function. The recommendations generated by the MCTS procedure are then provided as feedback in order to refine, through classification and regression, the leaf-node evaluator for the next iteration. We provide the first sample complexity bounds for a tree-search-based RL algorithm. In addition, we show that a deep neural network implementation of the technique can create a competitive AI agent for the popular multi-player online battle arena (MOBA) game King of Glory.
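
To make the feedback loop concrete, here is a minimal, self-contained Python sketch under simplifying assumptions; it is not the paper's implementation. A toy chain MDP and tabular estimators stand in for the game environment and the deep networks, and a fixed-horizon rollout evaluation stands in for the full MCTS procedure. All names (ToyEnv, mcts_recommend, feedback_tree_search) are illustrative placeholders.

```python
# Sketch of the feedback-based tree search loop: run search on small
# finite-horizon subproblems, then refine the leaf-node evaluator
# (policy via "classification", value via "regression") from the
# search recommendations. Tabular stand-ins, not the paper's method.

import random
from collections import defaultdict

class ToyEnv:
    """Tiny stochastic chain MDP used only to make the sketch runnable."""
    n_states, n_actions, gamma = 10, 2, 0.95

    def sample_state(self):
        return random.randrange(self.n_states)

    def step(self, s, a):
        if random.random() < 0.1:            # small action noise
            a = 1 - a
        s2 = min(s + 1, self.n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s2 == self.n_states - 1 else 0.0
        return s2, r

def mcts_recommend(env, s, value, policy, horizon, n_sims=50):
    """Crude stand-in for MCTS on a finite-horizon subproblem: roll out
    each root action with the current policy and evaluate leaves with
    the current value estimate."""
    best_a, best_q = 0, float("-inf")
    for a in range(env.n_actions):
        total = 0.0
        for _ in range(n_sims):
            s2, r = env.step(s, a)
            ret, disc = r, env.gamma
            for _ in range(horizon - 1):
                s2, r = env.step(s2, policy[s2])
                ret += disc * r
                disc *= env.gamma
            ret += disc * value[s2]           # leaf-node evaluation
            total += ret
        q = total / n_sims
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q

def feedback_tree_search(env, iterations=5, batch_size=20, horizon=5):
    value = defaultdict(float)                # estimated value function
    policy = defaultdict(int)                 # estimated policy function
    for _ in range(iterations):
        for _ in range(batch_size):
            s = env.sample_state()
            a, v = mcts_recommend(env, s, value, policy, horizon)
            # Feedback step: move the leaf evaluator toward the search
            # recommendations (tabular analogue of classification/regression).
            policy[s] = a
            value[s] = v
    return value, policy

if __name__ == "__main__":
    v, pi = feedback_tree_search(ToyEnv())
    print({s: pi[s] for s in range(ToyEnv.n_states)})
```

In the paper's setting, the tabular updates above would be replaced by fitting a policy network (classification toward the recommended actions) and a value network (regression toward the search value estimates), which then serve as the leaf-node evaluator in the next iteration.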
