A Survey of Point-Based POMDP Solvers

The past decade has seen a significant breakthrough in research on solving partially observable Markov decision processes (POMDPs). Where past solvers could not scale beyond perhaps a dozen states, modern solvers can handle complex domains with many thousands of states. This breakthrough was mainly due to the idea of restricting value function computations to a finite subset of the belief space, permitting only local value updates for this subset. This approach, known as point-based value iteration, avoids the exponential growth of the value function, and is thus applicable to domains with longer horizons, even with relatively large state spaces. Many extensions to this basic idea have been suggested, focusing on various aspects of the algorithm, mainly the selection of the belief space subset and the order of value function updates. In this survey, we walk the reader through the fundamentals of point-based value iteration, explaining the main concepts and ideas. Then, we survey the major extensions to the basic algorithm, discussing their merits. Finally, we include an extensive empirical analysis using well-known benchmarks, in order to shed light on the strengths and limitations of the various approaches.
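Since the central operation in all of these algorithms is the point-based backup at a single belief, the following minimal sketch may help fix ideas. It is not code from the survey itself: the NumPy array layout (T for transitions, Z for observation probabilities, R for rewards) and all names are illustrative assumptions, and a practical solver would add alpha-vector pruning and a belief-set expansion strategy.

    import numpy as np

    def point_based_backup(b, V, T, Z, R, gamma):
        """One point-based backup at a single belief point b (illustrative sketch).

        b     : (S,) belief vector over states
        V     : non-empty list of (S,) alpha-vectors encoding the current value function
        T     : (A, S, S) array, T[a, s, s2] = Pr(s2 | s, a)
        Z     : (A, S, O) array, Z[a, s2, o] = Pr(o | s2, a)
        R     : (A, S) array of immediate rewards
        gamma : discount factor in [0, 1)

        Returns the single alpha-vector that maximizes the one-step
        lookahead value at b.
        """
        num_actions, num_obs = T.shape[0], Z.shape[2]
        best_alpha, best_value = None, -np.inf
        for a in range(num_actions):
            alpha_a = R[a].astype(float)
            for o in range(num_obs):
                # Back-project every alpha-vector through action a and observation o:
                # g(s) = sum_{s2} Pr(o | s2, a) * Pr(s2 | s, a) * alpha(s2)
                g = [T[a] @ (Z[a, :, o] * alpha) for alpha in V]
                # Keep only the back-projection that is best at this particular
                # belief point; this local choice is what makes the update
                # "point-based" rather than an exact (exponential) backup.
                alpha_a = alpha_a + gamma * max(g, key=lambda v: b @ v)
            value = b @ alpha_a
            if value > best_value:
                best_alpha, best_value = alpha_a, value
        return best_alpha

A full iteration of point-based value iteration then applies this backup over a sampled belief set B, e.g. V_new = [point_based_backup(b, V, T, Z, R, gamma) for b in B]; the algorithms surveyed here differ mainly in how B is collected and in the order in which the backups are performed.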
