Scalable and Efficient Bayes-Adaptive Reinforcement Learning Based on Monte-Carlo Tree Search

Bayesian planning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, planning optimally in the face of uncertainty is notoriously taxing, since the search space is enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search. Our approach avoids expensive applications of Bayes' rule within the search tree by sampling models from the current beliefs, and furthermore performs this sampling in a lazy manner. This enables it to outperform previous Bayesian model-based reinforcement learning algorithms by a significant margin on several well-known benchmark problems. As we show, our approach can even work in problems with an infinite state space that lie qualitatively out of reach of almost all previous work in Bayesian exploration.
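The core mechanism described above is root sampling: rather than maintaining and updating a posterior at every node of the search tree, each simulation draws one complete model from the current belief and then runs an ordinary UCT simulation under that fixed model. Below is a minimal Python sketch of this idea, not the paper's implementation; the `belief.sample_mdp()` and `mdp.step()` interfaces are hypothetical stand-ins for a posterior sampler and a sampled transition model, states are assumed hashable, and the full algorithm additionally generates each sampled model lazily, only for the state-action pairs a simulation actually visits.

```python
from collections import defaultdict
from math import sqrt, log

def bamcp_search(belief, root_state, actions, n_sims=1000, depth=15,
                 gamma=0.95, c=1.0):
    """Sketch of approximate Bayes-optimal action selection via root sampling.

    Instead of applying Bayes' rule inside the search tree, each simulation
    draws one MDP from the posterior and is run as plain UCT under it.
    """
    N = defaultdict(int)    # visit count per (state, action)
    Ns = defaultdict(int)   # visit count per state
    Q = defaultdict(float)  # incremental action-value estimates

    def simulate(mdp, s, d):
        if d == 0:
            return 0.0
        # UCB1 rule (UCT): try each action once, then add an optimism bonus.
        def ucb(a):
            if N[(s, a)] == 0:
                return float('inf')
            return Q[(s, a)] + c * sqrt(log(Ns[s] + 1) / N[(s, a)])
        a = max(actions, key=ucb)
        s2, r = mdp.step(s, a)  # transition under the *sampled* model
        ret = r + gamma * simulate(mdp, s2, d - 1)
        Ns[s] += 1
        N[(s, a)] += 1
        Q[(s, a)] += (ret - Q[(s, a)]) / N[(s, a)]  # running mean of returns
        return ret

    for _ in range(n_sims):
        mdp = belief.sample_mdp()  # one posterior draw per simulation
        simulate(mdp, root_state, depth)
    return max(actions, key=lambda a: Q[(root_state, a)])
```

Because every simulation holds the sampled model fixed, the per-simulation cost is that of standard Monte-Carlo tree search; the posterior is only sampled at the root, never updated inside the tree.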
