Efficient Bayesian Clustering for Reinforcement Learning

A fundamental artificial intelligence challenge is how to design agents that intelligently trade off exploration and exploitation while quickly learning about an unknown environment. To learn quickly, however, an agent must somehow generalize experience across states. One promising approach is to use Bayesian methods to simultaneously cluster dynamics and control exploration; unfortunately, these methods tend to require computationally intensive MCMC approximation techniques that lack guarantees. We propose Thompson Clustering for Reinforcement Learning (TCRL), a family of Bayesian clustering algorithms for reinforcement learning that leverage structure in the state space to remain computationally efficient while controlling both exploration and generalization. TCRL-Theoretic achieves near-optimal Bayesian regret bounds while consistently improving over a standard Bayesian exploration approach. TCRL-Relaxed is guaranteed to converge to acting optimally, and empirically outperforms state-of-the-art Bayesian clustering algorithms across a variety of simulated domains, even in cases where no states are similar.
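To make the interplay between exploration and generalization concrete, the sketch below illustrates the general posterior-sampling (Thompson sampling) idea for tabular reinforcement learning with a fixed, hypothetical clustering of states that shares Dirichlet transition counts within each cluster. The environment interface, the cluster assignment `cluster_of`, and the priors are illustrative assumptions only; this is not the authors' TCRL algorithm, which learns the clustering itself.

```python
# Minimal sketch: posterior-sampling RL with a FIXED, hypothetical state clustering.
# States in the same cluster share one Dirichlet posterior over next-state dynamics,
# which is how clustering generalizes experience across states. Not the TCRL algorithm.
import numpy as np

def psrl_with_clusters(env, cluster_of, n_states, n_actions, n_clusters,
                       episodes=100, horizon=20, gamma=0.95):
    # Dirichlet pseudo-counts over next states, shared within each (cluster, action).
    counts = np.ones((n_clusters, n_actions, n_states))
    # Running mean reward per (cluster, action), with one pseudo-observation.
    r_sum = np.zeros((n_clusters, n_actions))
    r_n = np.ones((n_clusters, n_actions))

    for _ in range(episodes):
        # 1. Sample one plausible MDP from the posterior (Thompson sampling step).
        P = np.zeros((n_states, n_actions, n_states))
        R = np.zeros((n_states, n_actions))
        for s in range(n_states):
            c = cluster_of[s]
            for a in range(n_actions):
                P[s, a] = np.random.dirichlet(counts[c, a])
                R[s, a] = r_sum[c, a] / r_n[c, a]

        # 2. Solve the sampled MDP by value iteration.
        Q = np.zeros((n_states, n_actions))
        for _ in range(200):
            V = Q.max(axis=1)
            Q = R + gamma * P @ V

        # 3. Act greedily in the real environment; update the shared posterior counts.
        #    Assumes a toy env with reset() -> state and step(a) -> (state, reward, done).
        s = env.reset()
        for _ in range(horizon):
            a = int(Q[s].argmax())
            s_next, reward, done = env.step(a)
            c = cluster_of[s]
            counts[c, a, s_next] += 1
            r_sum[c, a] += reward
            r_n[c, a] += 1
            s = s_next
            if done:
                break
    return Q
```

Because all states in a cluster update the same Dirichlet counts, a single transition informs the dynamics estimate for every state in that cluster; exploration is driven purely by the randomness of the posterior sample rather than by explicit bonuses.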
