Model-based Bayesian Reinforcement Learning in Partially Observable Domains

Bayesian reinforcement learning in partially observable domains is notoriously difficult, in part because neither the beliefs nor the optimal value function has a known parametric form. We show that, for factored domains, beliefs represented by mixtures of products of Dirichlet distributions are closed under belief updates, and we propose belief monitoring algorithms that exploit this mixture representation. We also show that the optimal value function for factored domains is a linear combination of products of Dirichlets. Finally, we extend BEETLE, a point-based value iteration algorithm for Bayesian RL in fully observable domains, to partially observable domains.
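The closure claim can be made concrete with a small sketch. The Python snippet below is an illustration under simplifying assumptions, not the paper's algorithm: the belief over (physical state, unknown transition parameters) is a mixture of components, each holding a state, a weight, and Dirichlet counts over successor states, while the observation model O is assumed known. After an action-observation pair, multiplying each Dirichlet by the sampled transition parameter just increments one count and rescales the weight by the Dirichlet mean, so the posterior is again a mixture of products of Dirichlets with one new component per (component, predecessor state) pair; the weight-based pruning used here to keep the mixture small is one simple stand-in for the belief monitoring schemes the abstract refers to.

```python
import numpy as np

# Minimal sketch (assumed names and data layout, not the paper's implementation):
# belief monitoring for a Bayes-adaptive POMDP with unknown transition
# probabilities, a known observation model O[s_next, a, o] = Pr(o | s_next, a),
# and a belief kept as a mixture of products of Dirichlets.
# Each mixture component is a tuple (state, weight, counts), where
# counts[s][a] is a numpy array of Dirichlet counts over successor states.

def belief_update(components, a, o, O, n_states, max_components=50):
    """Exact mixture update after taking action a and observing o, then pruning."""
    new_components = []
    for s, w, counts in components:
        n_sa = counts[s][a]
        total = n_sa.sum()
        for s_next in range(n_states):
            # Weight factor: E[theta^{s,a}_{s'}] under the Dirichlet, times the
            # observation likelihood for the candidate successor state.
            w_next = w * (n_sa[s_next] / total) * O[s_next, a, o]
            if w_next <= 0.0:
                continue
            # Multiplying the Dirichlet by theta^{s,a}_{s'} increments that count,
            # so every new component is again a product of Dirichlets (closure).
            new_counts = {st: {ac: v.copy() for ac, v in acts.items()}
                          for st, acts in counts.items()}
            new_counts[s][a][s_next] += 1.0
            new_components.append((s_next, w_next, new_counts))

    # Normalize, then keep only the heaviest components (approximate monitoring).
    z = sum(w for _, w, _ in new_components)
    new_components = [(s, w / z, c) for s, w, c in new_components]
    new_components.sort(key=lambda comp: -comp[1])
    kept = new_components[:max_components]
    z = sum(w for _, w, _ in kept)
    return [(s, w / z, c) for s, w, c in kept]
```

In this sketch the counts could be initialized from uniform Dirichlet priors (all ones), and the same update would be applied once per time step; without the pruning step the number of mixture components grows by a factor of the number of reachable states at every update, which is exactly why approximate belief monitoring is needed.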
