Partially Observable Markov Decision Processes

For reinforcement learning in environments where an agent has access to a reliable state signal, methods based on the Markov decision process (MDP) have had many successes. In many problem domains, however, an agent suffers from limited sensing capabilities that preclude it from recovering a Markovian state signal from its perceptions. Extending the MDP framework, partially observable Markov decision processes (POMDPs) allow for principled decision making under conditions of uncertain sensing. In this chapter we present the POMDP model by focusing on its differences from fully observable MDPs, and we show how optimal policies for POMDPs can be represented. Next, we review model-based techniques for policy computation, followed by an overview of the available model-free methods for POMDPs. We conclude by highlighting recent trends in POMDP reinforcement learning.
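
Since the agent cannot observe the state directly, POMDP methods typically maintain a belief state: a probability distribution over states updated by Bayes' rule after each action and observation. The sketch below is a minimal illustration of that belief update for a small discrete POMDP; it is not code from the chapter, and the model arrays `T`, `O`, and the two-state example are assumptions chosen only for demonstration.

```python
import numpy as np

# Illustrative sketch (hypothetical example, not from the chapter):
# T[a][s, s'] is the transition model P(s' | s, a),
# O[a][s', o] is the observation model P(o | s', a),
# b is the current belief, a probability vector over states.

def belief_update(b, a, o, T, O):
    """Bayesian belief update after taking action a and observing o."""
    # Predict: push the belief through the transition model -> P(s' | b, a).
    predicted = b @ T[a]
    # Correct: weight each successor state by the observation likelihood.
    unnormalized = O[a][:, o] * predicted
    norm = unnormalized.sum()
    if norm == 0.0:
        raise ValueError("Observation has zero probability under the model")
    return unnormalized / norm

# Tiny two-state example with one action and a noisy binary sensor.
T = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}      # transition model for action 0
O = {0: np.array([[0.8, 0.2],
                  [0.3, 0.7]])}      # observation model for action 0
b0 = np.array([0.5, 0.5])            # uniform initial belief
print(belief_update(b0, a=0, o=1, T=T, O=O))  # posterior belief over the two states
```

Because the belief is a sufficient statistic for the observation history, a POMDP can be viewed as a belief-state MDP, which is the starting point for the value-function representations and the model-based solution techniques surveyed in the chapter.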
