Cover tree Bayesian reinforcement learning

This paper proposes an online tree-based Bayesian approach for reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high dimensional spaces. We combine the model with Thompson sampling and approximate dynamic programming to obtain effective exploration policies in unknown environments. The flexibility and computational simplicity of the model render it suitable for many reinforcement learning problems in continuous state spaces. We demonstrate this in an experimental comparison with a Gaussian process model, a linear model and simple least squares policy iteration.
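To make the approach described in the abstract concrete, the following is a minimal, purely illustrative Python sketch of model-based Thompson sampling with a closed-form Bayesian posterior. It is not the paper's algorithm: the generalised context tree / cover tree posterior is replaced by a simple Bayesian linear-Gaussian dynamics model, and a greedy one-step lookahead stands in for the approximate dynamic programming step. All names here (`BayesianLinearDynamics`, `env_step`, `reward`) are hypothetical and chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D environment: s' = 0.9 s + a + noise; the goal is to drive s to 0.
def env_step(s, a):
    return 0.9 * s + a + rng.normal(scale=0.05, size=s.shape)

def reward(s):
    return -float(s @ s)

class BayesianLinearDynamics:
    """Stand-in for the tree posterior: Bayesian linear-Gaussian model of s' given (s, a)."""
    def __init__(self, state_dim, action_dim, noise_var=0.05**2, prior_var=1.0):
        d = state_dim + action_dim
        self.noise_var = noise_var
        self.gram = np.eye(d) * (noise_var / prior_var)  # regularised X^T X
        self.xty = np.zeros((d, state_dim))              # X^T Y

    def update(self, s, a, s_next):
        # Closed-form conjugate update of the sufficient statistics.
        x = np.concatenate([s, a])
        self.gram += np.outer(x, x)
        self.xty += np.outer(x, s_next)

    def sample(self):
        """Draw one transition function from the weight posterior (Thompson sample)."""
        cov = self.noise_var * np.linalg.inv(self.gram)
        mean = np.linalg.solve(self.gram, self.xty)
        w = np.column_stack([rng.multivariate_normal(mean[:, j], cov)
                             for j in range(mean.shape[1])])
        return lambda s, a: np.concatenate([s, a]) @ w

model = BayesianLinearDynamics(state_dim=1, action_dim=1)
actions = [np.array([-1.0]), np.array([0.0]), np.array([1.0])]
s = np.array([2.0])

for episode in range(20):
    dynamics = model.sample()  # one posterior sample, held fixed for the episode
    for t in range(10):
        # Greedy one-step lookahead under the sampled model; the actual method
        # would instead run approximate dynamic programming on the sampled model.
        a = max(actions, key=lambda a: reward(dynamics(s, a)))
        s_next = env_step(s, a)
        model.update(s, a, s_next)  # closed-form posterior update
        s = s_next
    print(f"episode {episode:2d}  |s| = {abs(s[0]):.3f}")
```

The point of the sketch is the structure of the exploration policy: a single model is drawn from the posterior at the start of each episode and acted upon as if it were true, which is what distinguishes Thompson sampling from acting on the posterior mean.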
