A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

Bayesian learning methods have recently been shown to provide an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However, most investigations of Bayesian reinforcement learning to date have focused on the standard, fully observable Markov Decision Process (MDP) setting. The primary focus of this paper is to extend these ideas to partially observable domains by introducing the Bayes-Adaptive Partially Observable Markov Decision Process (BAPOMDP). This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is a set of theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, together with empirical results illustrating how the model estimate and the agent's return improve as a function of experience.
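To make the joint learning-and-tracking idea concrete, below is a minimal Python sketch of a BAPOMDP-style belief update: the belief is represented as a weighted set of hypotheses, each pairing a hidden state with Dirichlet counts over the unknown transition and observation distributions, and each action-observation step re-weights these hypotheses and increments the corresponding counts. The function name, array shapes, and particle representation are illustrative assumptions rather than the paper's actual implementation; note that this exact update multiplies the number of hypotheses by the number of states at every step, which is why approximate belief tracking (e.g., retaining only the most probable hypotheses) is required in practice.

import numpy as np

def bapomdp_belief_update(particles, action, obs):
    """One exact belief-update step over (state, Dirichlet-count) hypotheses.

    Each particle is a tuple (s, phi, psi, w) where
      s   : hidden state index,
      phi : array of shape (S, A, S) -- Dirichlet counts for transitions,
      psi : array of shape (S, A, Z) -- Dirichlet counts for observations,
      w   : particle weight.
    The update enumerates successor states, weights them by the posterior-mean
    transition and observation probabilities implied by the counts, and carries
    forward incremented counts, so the posterior over models is tracked jointly
    with the posterior over states.
    """
    new_particles = []
    for (s, phi, psi, w) in particles:
        num_states = phi.shape[0]
        for s_next in range(num_states):
            # Posterior-mean probabilities under the Dirichlet counts.
            p_trans = phi[s, action, s_next] / phi[s, action, :].sum()
            p_obs = psi[s_next, action, obs] / psi[s_next, action, :].sum()
            w_next = w * p_trans * p_obs
            if w_next <= 0.0:
                continue
            phi_next = phi.copy()
            psi_next = psi.copy()
            phi_next[s, action, s_next] += 1    # record the inferred transition
            psi_next[s_next, action, obs] += 1  # record the received observation
            new_particles.append((s_next, phi_next, psi_next, w_next))
    # Renormalise so the belief remains a probability distribution.
    total = sum(w for (_, _, _, w) in new_particles)
    return [(s, phi, psi, w / total) for (s, phi, psi, w) in new_particles]

# Hypothetical usage: 2 states, 2 actions, 2 observations, uniform Dirichlet
# prior (all counts = 1), uniform initial belief over states.
S, A, Z = 2, 2, 2
prior_phi = np.ones((S, A, S))
prior_psi = np.ones((S, A, Z))
belief = [(s, prior_phi.copy(), prior_psi.copy(), 1.0 / S) for s in range(S)]
belief = bapomdp_belief_update(belief, action=0, obs=1)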
