Algorithms for Partially Observable Markov Decision Processes

A partially observable Markov decision process (POMDP) is a general model for sequential decision making in which the effects of actions are nondeterministic and only partial information about the world state is available. However, finding near-optimal solutions for POMDPs is computationally difficult. Value iteration is a standard algorithm for solving POMDPs. It performs a sequence of dynamic-programming (DP) updates to improve value functions. Value iteration is inefficient for two reasons. First, a DP update is expensive because it must account for all belief states in a continuous belief space. Second, value iteration needs a large number of DP updates before it converges.

This thesis investigates two ways to accelerate value iteration. The work centers on the idea of conducting DP updates, and therefore value iteration, over a belief subspace, i.e., a subset of the belief space.

The first use of belief subspace is to reduce the number of DP updates value iteration needs to converge. We design a computationally cheap procedure that operates over a belief subspace consisting of a finite number of belief states and use it as an additional step for improving value functions. Because of the extra improvements this step provides, value iteration conducts fewer DP updates and is therefore more efficient.

The second use of belief subspace is to reduce the complexity of DP updates. We establish a framework for carrying out value iteration over a belief subspace determined by a POMDP model. Whether this subspace is smaller than the belief space is model dependent; when it is, value iteration over the belief subspace is expected to be more efficient. Based on this framework, we study three POMDP classes with special problem characteristics and propose different value iteration algorithms for them. (1) An informative POMDP assumes that the agent always has a good idea of the world state. The subspace determined by the model is much smaller than the belief space, so value iteration over the belief subspace is more efficient for this class. (2) A near-discernible POMDP assumes that the agent can get a good idea of the state once in a while by executing certain actions. For such a POMDP, the belief subspace determined by the model can be as large as the belief space, so we propose an anytime value iteration algorithm that focuses computation on a small belief subspace and gradually expands it. (3) A class more general than near-discernible POMDPs assumes that the agent can, with high likelihood, get a good idea of the state once in a while by executing certain actions. For such POMDPs, we adapt the anytime algorithm to conduct value iteration over a growing belief subspace.
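For concreteness, the operation the thesis accelerates can be sketched with the standard POMDP belief update and Bellman (DP) backup; the notation below (transition function T, observation function O, reward R, discount factor gamma) is conventional and not taken from the abstract itself.

\[
\tau(b,a,o)(s') \;=\; \frac{O(o \mid s', a)\,\sum_{s} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)},
\]
\[
V_{n+1}(b) \;=\; \max_{a}\Big[\, \sum_{s} b(s)\, R(s,a) \;+\; \gamma \sum_{o} \Pr(o \mid b, a)\; V_{n}\big(\tau(b,a,o)\big) \Big].
\]

A single DP update applies the second equation to every belief state b in the continuous belief simplex, which is what makes it expensive; value iteration over a belief subspace restricts b in this backup to a subset of the simplex, either a finite set of belief states (the first use above) or the subspace reachable under the model (the second use).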
