Perseus: Randomized Point-based Value Iteration for POMDPs

Partially observable Markov decision processes (POMDPs) form an attractive and principled framework for agent planning under uncertainty. Point-based approximate techniques for POMDPs compute a policy based on a finite set of points collected in advance from the agent's belief space. We present a randomized point-based value iteration algorithm called PERSEUS. The algorithm performs approximate value backup stages, ensuring that in each backup stage the value of every point in the belief set is improved; the key observation is that a single backup may improve the value of many belief points. In contrast to other point-based methods, PERSEUS backs up only a (randomly selected) subset of points in the belief set, sufficient for improving the value of every belief point in the set. We show how the same idea extends to continuous action spaces. Experimental results demonstrate the potential of PERSEUS on large-scale POMDP problems.
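
The core of PERSEUS is the randomized backup stage: sample a belief point that has not yet been improved, back it up, and keep the resulting vector only if it improves that point's value, otherwise fall back to the best existing vector. The sketch below is a minimal NumPy rendering of this stage under assumed data structures: transition, observation, and reward arrays T[a, s, s'], O[a, s', o], and R[a, s], a belief set B of probability vectors, and a value function V stored as a list of alpha-vectors. These names and shapes are illustrative choices, not notation from the paper.

```python
import numpy as np

def backup(b, V, T, O, R, gamma):
    # Standard point-based backup operator: return the alpha-vector that
    # maximizes the one-step lookahead value at belief b, given the
    # current set of alpha-vectors V.
    best_alpha, best_val = None, -np.inf
    for a in range(R.shape[0]):
        g_a = R[a].astype(float)
        for o in range(O.shape[2]):
            # Project every alpha-vector in V back through action a and
            # observation o: g(s) = sum_s' T[a,s,s'] O[a,s',o] alpha(s').
            projections = [T[a] @ (O[a, :, o] * alpha) for alpha in V]
            g_a = g_a + gamma * max(projections, key=lambda g: b @ g)
        if b @ g_a > best_val:
            best_alpha, best_val = g_a, b @ g_a
    return best_alpha

def perseus_backup_stage(B, V, T, O, R, gamma, rng=None):
    # One randomized backup stage: improve the value of every belief in B
    # while backing up only a randomly selected subset of points.
    rng = rng or np.random.default_rng()

    def value(b, vecs):
        return max(b @ alpha for alpha in vecs)

    V_new, todo = [], list(range(len(B)))
    while todo:
        i = todo[rng.integers(len(todo))]        # random non-improved point
        alpha = backup(B[i], V, T, O, R, gamma)
        if B[i] @ alpha >= value(B[i], V):
            V_new.append(alpha)                  # backup improved V at B[i]
        else:
            # Keep the best existing vector so B[i] never loses value.
            V_new.append(max(V, key=lambda v: B[i] @ v))
        # Retain only beliefs whose value the new set has not yet matched.
        todo = [j for j in todo if value(B[j], V_new) < value(B[j], V)]
    return V_new

# Tiny usage example on a random 2-state, 2-action, 2-observation POMDP.
S, A, Z = 2, 2, 2
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(S), size=(A, S))       # T[a, s, :] sums to 1
O = rng.dirichlet(np.ones(Z), size=(A, S))       # O[a, s', :] sums to 1
R = rng.standard_normal((A, S))
B = [rng.dirichlet(np.ones(S)) for _ in range(20)]
gamma = 0.95
V = [np.full(S, R.min() / (1 - gamma))]          # pessimistic initial vector
for _ in range(30):                              # run several backup stages
    V = perseus_backup_stage(B, V, T, O, R, gamma, rng)
```

Because each sampled backup either improves the point's value or falls back to the best existing vector, every pass through the loop shrinks the set of unimproved beliefs, so a stage terminates after at most |B| backups while typically performing far fewer; this is exactly why a single backup improving many belief points pays off.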
