An analytic solution to discrete Bayesian reinforcement learning

Reinforcement learning (RL) was originally proposed as a framework to allow agents to learn in an online fashion as they interact with their environment. Existing RL algorithms fall short of achieving this goal because the amount of exploration required is often too costly and/or too time-consuming for online learning. As a result, RL is mostly used for offline learning in simulated environments. We propose a new algorithm, called BEETLE, for effective online learning that is computationally efficient while minimizing the amount of exploration. We take a Bayesian model-based approach, framing RL as a partially observable Markov decision process. Our two main contributions are the analytical derivation that the optimal value function is the upper envelope of a set of multivariate polynomials, and an efficient point-based value iteration algorithm that exploits this simple parameterization.
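To make the polynomial claim concrete, here is a minimal worked sketch in standard Bayes-adaptive MDP notation; the symbols and normalization details below are our own reconstruction of the argument, not quoted from the paper. Let \(\theta\) collect the unknown transition probabilities \(\theta^{s,a}_{s'}\), let the belief \(b\) be a product of Dirichlets over \(\theta\), and let \(b^{s,a,s'}\) denote \(b\) after incrementing the count of the observed transition \((s,a,s')\). The Bellman backup over the joint state--belief space is

\[
V_{k+1}(s,b) \;=\; \max_a \Big[\, R(s,a) \;+\; \gamma \sum_{s'} \Pr(s' \mid s,a,b)\; V_k\big(s',\, b^{s,a,s'}\big) \Big],
\qquad
\Pr(s' \mid s,a,b) \;=\; \int_\theta b(\theta)\,\theta^{s,a}_{s'}\, d\theta .
\]

Suppose inductively that \(V_k(s',b) = \max_i \int_\theta b(\theta)\,\alpha_i(\theta)\,d\theta\), where each \(\alpha_i\) is a multivariate polynomial in \(\theta\). Since \(b^{s,a,s'}(\theta) \propto b(\theta)\,\theta^{s,a}_{s'}\) with normalizer \(\Pr(s' \mid s,a,b)\), the normalizer cancels inside the backup, leaving candidate functions of the form

\[
\alpha'(\theta) \;=\; R(s,a) \;+\; \gamma \sum_{s'} \theta^{s,a}_{s'}\, \alpha_{i(s')}(\theta),
\]

which are again polynomials, with degree growing by one per backup. Hence the optimal value function remains an upper envelope of multivariate polynomials at every horizon, which is the structure the point-based value iteration algorithm exploits.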
