Monte-Carlo tree search for Bayesian reinforcement learning

Bayesian model-based reinforcement learning can be formulated as a partially observable Markov decision process (POMDP), which provides a principled framework for optimally balancing exploration and exploitation; a POMDP solver can then be applied to this formulation. If the prior distribution over the environment's dynamics is a product of Dirichlet distributions, the POMDP's optimal value function can be represented by a set of multivariate polynomials. Unfortunately, the size of these polynomials grows exponentially with the problem horizon. In this paper, we examine the use of an online Monte-Carlo tree search (MCTS) algorithm for large POMDPs to solve the Bayesian reinforcement learning problem online, and show that such an algorithm successfully finds a near-optimal policy. In addition, we examine a parameter-tying method to keep the model search space small, and propose nested mixtures of tied models to increase the robustness of the method when our prior information does not allow the structure of the tied models to be specified exactly. Experiments show that the proposed methods substantially improve the scalability of current Bayesian reinforcement learning methods.
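To make the root-sampling idea concrete, the following minimal Python sketch (not the authors' implementation) combines the two ingredients named in the abstract: a product-of-Dirichlets posterior over transition probabilities and an online Monte-Carlo tree search in the spirit of POMCP applied to the Bayes-adaptive problem. A full MDP is sampled from the posterior at the root of each simulation and UCT is run on the sampled model. The class name `BayesAdaptiveMCTS`, the assumption of a known reward matrix `R[s, a]`, and all hyperparameters are illustrative choices; the paper's parameter-tying and nested-mixture extensions are omitted.

```python
import numpy as np

class BayesAdaptiveMCTS:
    """Minimal root-sampling MCTS planner for a small discrete MDP.

    Maintains a product-of-Dirichlets posterior over transition
    probabilities via per-(s, a) transition counts, samples a full MDP
    from the posterior at the root of each simulation, and runs UCT on
    the sampled model.
    """

    def __init__(self, n_states, n_actions, rewards, gamma=0.95,
                 ucb_c=1.4, n_sims=500, depth=15, prior_count=1.0, seed=0):
        self.nS, self.nA = n_states, n_actions
        self.R = rewards                      # R[s, a]: known reward function
        self.gamma, self.ucb_c = gamma, ucb_c
        self.n_sims, self.depth = n_sims, depth
        self.rng = np.random.default_rng(seed)
        # Dirichlet pseudo-counts for each (s, a): prior + observed transitions
        self.counts = np.full((n_states, n_actions, n_states), prior_count)

    def update(self, s, a, s_next):
        """Incorporate one observed transition into the posterior."""
        self.counts[s, a, s_next] += 1.0

    def _sample_mdp(self):
        """Draw transition matrices from the Dirichlet posterior (root sampling)."""
        return np.array([[self.rng.dirichlet(self.counts[s, a])
                          for a in range(self.nA)] for s in range(self.nS)])

    def plan(self, s0):
        """Return the action maximizing the UCT value estimate at state s0."""
        N = {}   # visit counts per (state, depth)
        Na = {}  # visit counts per (state, depth, action)
        Q = {}   # value estimates per (state, depth, action)

        def simulate(P, s, d):
            if d == self.depth:
                return 0.0
            key = (s, d)
            if key not in N:                   # expand: initialize statistics
                N[key] = 0
                for a in range(self.nA):
                    Na[(s, d, a)] = 0
                    Q[(s, d, a)] = 0.0
                return self._rollout(P, s, d)  # evaluate new node by rollout
            # UCB1 action selection on the sampled model
            logN = np.log(N[key] + 1)
            scores = [Q[(s, d, a)] + self.ucb_c *
                      np.sqrt(logN / (Na[(s, d, a)] + 1e-9))
                      for a in range(self.nA)]
            a = int(np.argmax(scores))
            s_next = self.rng.choice(self.nS, p=P[s, a])
            ret = self.R[s, a] + self.gamma * simulate(P, s_next, d + 1)
            # back up statistics along the simulated path
            N[key] += 1
            Na[(s, d, a)] += 1
            Q[(s, d, a)] += (ret - Q[(s, d, a)]) / Na[(s, d, a)]
            return ret

        for _ in range(self.n_sims):
            simulate(self._sample_mdp(), s0, 0)
        return int(np.argmax([Q[(s0, 0, a)] for a in range(self.nA)]))

    def _rollout(self, P, s, d):
        """Estimate the value of a new leaf with a uniformly random policy."""
        ret, disc = 0.0, 1.0
        for _ in range(d, self.depth):
            a = self.rng.integers(self.nA)
            ret += disc * self.R[s, a]
            disc *= self.gamma
            s = self.rng.choice(self.nS, p=P[s, a])
        return ret
```

In an agent loop one would call `a = planner.plan(s)`, execute `a`, observe `s_next`, and then call `planner.update(s, a, s_next)` so that the Dirichlet counts, and hence the models sampled at the root, reflect all data seen so far.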
